Abbreviations
1 Introduction
1.1 Background
1.2 Problem Statement
1.3 Aim and Objectives
1.3.1 General Objectives
1.3.2 Specific Objectives
1.4 Research Questions
1.5 Justification
1.6 Assumptions
1.6.1 Scope and Limitations
2 Literature Review
2.1 Introduction
2.2 Named Entity Recognition and Extraction
2.2.1 Evaluation metrics for NER
2.2.2 Approaches to NER
2.2.2.1 Rule-based methods
2.2.2.2 Machine Learning-Based Methods
2.2.2.3 Hybrid Methods
2.3 Deep Learning in NER
2.3.1 Neural Network Architectures for NER
2.3.1.1 Recurrent Neural Networks
2.3.1.2 Convolutional Neural Network
2.3.1.3 Long Short-Term Memory
2.3.1.4 Bidirectional Long Short-Term Memory Networks
2.3.1.5 Conditional Random Fields
2.3.1.6 Bidirectional Long Short-Term Memory With a Conditional Random Field Layer
2.3.1.7 Transformer-based Models
3 Methodology
3.1 Introduction
3.2 Research Design
3.3 Business Understanding
3.4 Data Understanding
3.5 Data Preparation
3.5.1 Data Collection
3.5.1.1 Data Sources
3.5.1.2 Data Collection Process
3.5.1.3 Data Preprocessing
3.5.2 Data Annotation
3.5.2.1 Data Preparation for Training
3.5.2.2 Data Export for Training
3.5.2.3 Annotated Data Summary
3.6 Model Design and Rationale for Selection of Models
5 Conclusions
5.1 Introduction
5.2 Conclusion and Limitations
5.3 Final Recommendations
References
A Code Snippets
A.1 Data Acquisition
A.1.1 Download News Text Using GoogleNews
A.1.2 Download Scientific Text Using SemanticScholar
A.2 Data Preprocessing
A.2.1 Pre-annotation Text Processing
A.2.2 Pre-annotation Creation of Gazetteers
A.2.3 Annotated Data Splitting and Conversion
A.3 Model Training and Evaluation
A.3.1 Training a baseline BiLSTM-CRF model
A.3.2 Fine Tuning, Evaluating and Testing LLMs using Hugging Face
List of Figures
3.1 Project workflow diagram illustrating the system architecture and process flow for the RT&B crop diseases NER model
3.2 Diagram demonstrating the interconnections among the different stages of CRISP-DM (Jensen, 2012)
3.3 Prodigy Annotation Tool
3.4 Prodigy Train curve diagnostics
List of Tables
List of Abbreviations
BERT Bidirectional Encoder Representations from Transformers
IE Information Extraction
NER Named Entity Recognition
NLP Natural Language Processing
PLLM Pretrained Large Language Model
RT&B Roots, Tubers and Bananas
SciDeBERTa SciDeBERTa model trained on scientific text
Chapter 1
Introduction
1.1 Background
Roots, Tubers and Bananas (RT&B) are important, vegetatively propagated food crops (Thiele et al., 2017) that include cassava (Manihot esculenta), potatoes (Solanum spp.), sweet potatoes (Ipomoea batatas), yams (Dioscorea spp.), and bananas (Musa spp.). In developing countries, these crops play a vital role in ensuring that people have enough food, promoting good nutrition, and creating income opportunities. According to Thiele et al. (2022, 2017), about 300 million people worldwide depend on the value chains of RT&B crops. Because of their high yields and carbohydrate content, RT&B crops are a crucial source of dietary energy and nutrition. They provide more energy per hectare grown than cereals (RTB, 2016), and in sub-Saharan Africa they contribute up to 50% of the total daily calorie intake (Petsakos et al., 2019).
To understand the significance of RT&B crops for food security in sub-Saharan Africa, it is essential to consider the impact of climate change on agriculture in the region, where farming is particularly vulnerable to its effects (Girvetz et al., 2019). RT&B crops, however, have characteristics that increase their ability to withstand the consequences of climate change (Prain and Naziri, 2020). Although farmers in Africa grow RT&B crops primarily for subsistence, these crops are also of global importance, as they are used as animal feed or for industrial purposes such as ethanol production (Petsakos et al., 2019).
RT&B crops are all vegetatively propagated, which means that the planting mate-
rials, for example, stem cuttings or vines, are genetically identical to the parent plant
(Andrade-Piedra et al., 2016). Therefore, these crops share similar breeding, seed
systems, and post-harvest challenges (RTB, 2016). However, a significant challenge is
the widespread occurrence of pests and diseases. These outbreaks of crop diseases and
pests cost farmers and consumers significant losses yearly due to yield loss and poor
harvest quality. The yield loss is estimated to be between 20% and 40% (Kreuze et al.,
2022; Savary et al., 2019). Vegetatively propagated crops are particularly vulnerable
to pest infestation and pathogen infections because pests and pathogens tend to build
up over time with each planting cycle (Thomas-Sharma et al., 2016). Researchers
predict that this problem will worsen as climate change expands the geographic range of pests and diseases or increases the severity of some diseases that affect these
crops (Thiele et al., 2017). The global food trade network and the evolution of new
pathogens will also contribute to the spread of plant diseases (Ristaino et al., 2021).
The relevant authorities can generally manage endemic plant diseases to avoid
adverse agricultural effects. However, emerging plant diseases cause large-scale plant
epidemics that devastate food security (Kreuze et al., 2022; Savary et al., 2019). These
plant epidemics are transboundary and can affect yield in multiple countries at the
same time. RT&B crops have several ongoing large-scale outbreaks in Africa and
Asia, for example, Fusarium Wilt in bananas and Cassava Mosaic Disease (CMD) in
cassava (Kreuze et al., 2022).
The responsibility for monitoring crops to protect them from pandemics lies with the National Plant Protection Organisations (NPPOs) of different countries. For example, Kenya's NPPO, the Kenya Plant Health Inspectorate Service, is acknowledged as a centre of excellence in eastern and southern Africa (Miller et al., 2009). In practice, plant disease diagnostic networks carry out disease monitoring and surveillance. These networks are made up of NPPOs, research universities, international research organisations, development agencies, the private sector, and farmers (Miller et al., 2009; Ristaino et al., 2021). For example, such networks are in place to monitor cassava viruses in Africa and Asia. Networks traditionally perform diagnostics through field surveys using classical techniques such as diagnosis, grafting, and mechanical inoculation. However, with DNA sequencing becoming cheaper, whole genome sequencing has been added to these traditional testing programmes (Legg et al., 2015).
The ability to detect outbreaks accurately and quickly is critical to implementing
effective intervention measures. Early detection can minimise the impact and threat
posed by disease, and a delayed response can have significant economic, social, and
ecological impacts. The year 2020 was designated as the International Year of Plant Health by the General Assembly of the United Nations. This declaration was intended to encourage the creation of global surveillance networks and to increase awareness among the public and policy makers (Seed World, 2018).
Monitoring and controlling crop disease outbreaks is still an ongoing challenge
worldwide. Plant disease surveillance is severely underfunded (Carvajal-Yepes et al.,
2019). For example, global-scale surveillance is only conducted for wheat rust and
late blight in potatoes (Ristaino et al., 2021). Despite the lack of funding, many mod-
ern and digital technologies are being applied to disease surveillance. These include
geospatial and remote sensing systems, field sensors, data mining, and big data an-
alytics, including NLP (Ristaino et al., 2021). Disease detection based on images from smartphones and drones has also been used (Kreuze et al., 2022). Disease surveil-
lance networks are using all these methods to continuously monitor the spatial spread
and incidence of pests and pathogens.
crop disease outbreaks by domain experts in a local context. This can adversely affect
the deployment of effective containment measures for global crop disease pandemics.
For scientists and RT&B crop protection experts, the challenge becomes how to ex-
tract disease information from these growing digital text sources in an automated and
efficient manner.
Relevant information that exists in these texts and that can be used for creating
a disease surveillance system includes:
• Crop name
• Pathogen name
• Disease name
• Symptom
• Geographic location
• Event Date
• Organisation
2. Find the most appropriate Pretrained Large Language Model (PLLM) that
uses transfer learning to correctly recognise the named entities of RT&B crop
diseases.
3. Assess how well the fine-tuned model performs in Named Entity Recognition of
RT&B crop diseases in the scientific literature and online text.
2. What Pretrained Large Language Model emerges as the most suitable choice
for transfer learning to produce a NER model aimed at RT&B crop diseases
detection? How does the choice of this PLLM influence the effectiveness of the
resultant model?
3. How does the fine-tuned model perform in terms of Named Entity Recognition
for RT&B crop diseases when applied to scientific literature and online texts?
What are the key factors that significantly impact this performance?
1.5 Justification
RT&B crops are crucial contributors to food security and income generation, with
a particularly profound impact in Africa (RTB, 2016). Despite their importance,
managing and controlling crop disease outbreaks pose formidable global challenges
(Carvajal-Yepes et al., 2019). As technological advances in crop disease monitoring
networks lead to a proliferation of relevant textual data for crop protection (Miller
et al., 2009; Ristaino et al., 2021), the insights derived from this study are intended to
significantly boost large-scale crop disease monitoring efforts. The focus is mainly on
combating transboundary crop epidemics. Through the study, our objective was to
facilitate the robust and continuous extraction of information from extensive online
textual data sources. This approach is anticipated to enhance targeted crop pro-
tection initiatives and provide valuable support for resource-constrained crop disease
networks, thus contributing to a more sustainable and secure agricultural future.
1.6 Assumptions
1.6.1 Scope and Limitations
1. The research study limited the evaluation of NER to diseases affecting only five
RT&B crops: Cassava, Banana, Plantain, Potato, and Sweet Potato.
2. The study collected scientific data from abstracts available in the Semantic-
Scholar open access literature database. However, the methodology can be
applied to other databases such as PubMed or Google Scholar. The news items
were collected from free news media indexed on Google News and refined using
the crop name as the keyword search.
3. The project extracted data from online text and documents created digitally
by the authors. This study did not use scanned PDF documents that required
optical character recognition for text extraction.
Chapter 2
Literature Review
2.1 Introduction
The ability to rapidly detect the spread of crop epidemics is an integral part of crop
disease surveillance networks. The growing availability of digital data in online sources
provides an avenue for data-driven cross-border surveillance. Advancements in Natural Language Processing (NLP) techniques have made it possible to analyse data from web sources, such as social networks, search queries, blogs, scientific literature, and online news articles, for disease outbreak-related incidents (O'Shea, 2017; Thomas et al., 2011). This literature review examines how deep transfer learning can be used to extract entities relevant to the monitoring and surveillance of crop diseases. The study focusses on developments in natural language processing, the emergence of Pretrained Large Language Models, and how these have been used in transfer learning to enhance deep learning models in areas with little training data. The review also focusses on NER in the agricultural sector.
2.2 Named Entity Recognition and Extraction
uct names can be considered entities. The goal of Named Entity Recognition is to
recognise the mentions of these identifiers that belong to a predetermined text class
(Nadeau and Sekine, 2007). The performance of NER tasks can be affected by factors
such as the language, the entity type, and the domain, for example biomedical or agricultural text. Nested entities, ambiguity in the text, and the amount of annotated training data are
challenges for NER (Goyal et al., 2018).
The goal of NER is to identify specific terms that are part of a predetermined
category in a given text.
Rule-based methods for NER depend on manually created rules, patterns, and dictionaries to classify entities in the text (Grishman, 1995). These methods often involve regular expressions or pattern-matching techniques to capture specific syntactic or morphological structures associated with named entities (Chinchor and Robinson, 1997). Although rule-based methods can achieve high precision, they often suffer from limited recall because of the difficulty of creating comprehensive rules and dictionaries that cover all potential variations of named entities (Nadeau and Sekine, 2007).
Machine learning approaches for Named Entity Recognition use supervised learning strategies to automatically learn patterns and features from annotated datasets (Lafferty et al., 2001). Commonly used machine learning algorithms for NER include Support Vector Machines (SVMs), Conditional Random Fields (CRFs), and Hidden Markov Models (HMMs) (McCallum and Li, 2003). More recently, deep learning-based methods, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformer-based models, have attained top-tier results on different NER benchmarks (Devlin et al., 2019; Lample et al., 2016).
Hybrid methods combine rule-based and machine learning-based approaches to take advantage of the strengths of both techniques and mitigate their weaknesses (Finkel et al., 2005). These methods typically use rules to generate features or initial annotations, which are then refined or combined with machine learning-based methods to produce the final NER output (Nadeau and Sekine, 2007). Hybrid methods can achieve improved performance by combining the high precision of rule-based approaches with the ability of machine learning-based approaches to adapt and generalise effectively.
2.3 Deep Learning in NER
RNNs are a class of neural networks designed for handling sequential data, maintaining hidden states that retain information from previous steps (Elman, 1990). RNNs have been applied to NER tasks to capture contextual information and model long-range dependencies in the input text (Chiu and Nichols, 2016).
CNNs are a type of neural network that uses convolutional layers to capture local features in the input data by applying filters (Lecun et al., 1998). Although originally designed for image recognition, CNNs have been adapted for NER tasks by treating the text as a sequence of characters or words and using convolutions to capture local context and features (Collobert et al., 2011).
LSTMs are a type of RNN that addresses the vanishing gradient problem, which occurs when the network cannot learn long-term dependencies because gradients shrink during training (Hochreiter and Schmidhuber, 1997). LSTMs have been used in NER tasks to capture long-range relationships and context within input sequences more effectively (Lample et al., 2016).
BiLSTMs, which extend LSTMs, process input sequences in both the forward and backward directions, allowing the model to draw on context from both past and future tokens (Graves et al., 2005). BiLSTMs have been successfully applied to NER tasks, demonstrating improved performance over unidirectional LSTMs (Huang et al., 2015).
The BiLSTM-CRF model combines the strengths of BiLSTMs and CRFs: the BiLSTM captures contextual information from the input sequence, while the CRF layer models the dependencies among the output labels assigned to named entities (Huang et al., 2015). This combination has outperformed individual BiLSTM or CRF models in NER tasks by effectively modelling both the input sequence context and the relationships between output labels (Ma and Hovy, 2016).
2.4 Transfer Learning in NER
results in improved performance with less training data (Pan and Yang, 2010). In the context of NER, transfer learning allows models to leverage representations or structures pre-trained on large text corpora, which reduces the amount of annotated data needed for the target NER task (Ruder et al., 2019a).
2.5 NER in Low-Resource Domains
Fine-tuning for NER usually involves training the model for several epochs with a reduced learning rate to prevent the loss of pre-trained knowledge (Ruder et al., 2019a). Various strategies, such as layer-wise learning rate schedules, differential learning rates, and freezing specific layers during training, have been proposed to improve the fine-tuning procedure and to ensure the efficient transfer of pre-trained knowledge to the specific NER objective (Howard and Ruder, 2018; Ruder et al., 2019a).
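As an illustration of two of these strategies, the sketch below freezes the lower encoder layers of a BERT-style token-classification model and assigns the remaining layers a smaller learning rate than the newly initialised classifier head. It assumes the Hugging Face transformers library; attribute names such as model.bert.encoder.layer hold for BERT but differ between architectures.

import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

# Freeze the embeddings and the first four encoder layers so their pre-trained
# weights are not updated during fine-tuning.
for module in [model.bert.embeddings, *model.bert.encoder.layer[:4]]:
    for param in module.parameters():
        param.requires_grad = False

# Differential learning rates: a small rate for the remaining pre-trained layers,
# a larger one for the randomly initialised classification head.
optimizer = torch.optim.AdamW([
    {"params": model.bert.encoder.layer[4:].parameters(), "lr": 2e-5},
    {"params": model.classifier.parameters(), "lr": 1e-4},
])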
Data augmentation techniques create new training instances by applying varied modifications to the original data, such as substituting synonyms, randomly adding or deleting words, or swapping word positions (Wei and Zou, 2019). In the context of NER, model performance can be improved by using data augmentation to increase the quantity and diversity of training data. For instance, Li et al. (2020) suggested an iterative data augmentation method that merges a rule-driven system with a neural network architecture to automatically generate labelled data for low-resource NER tasks.
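A toy example of one NER-oriented augmentation, replacing an entity mention with another mention of the same label drawn from the training data so that entity boundaries stay consistent, is sketched below; it illustrates the general idea rather than the specific method of Li et al. (2020).

import random

def mention_replacement(tokens, tags, mention_bank, rng=random.Random(3)):
    """tokens/tags are parallel lists in IOB format; mention_bank maps each label
    to a list of token lists previously observed with that label."""
    out_tokens, out_tags, i = [], [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            label = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{label}":
                j += 1
            # Swap the original mention for a random one with the same label.
            new_mention = rng.choice(mention_bank.get(label, [tokens[i:j]]))
            out_tokens.extend(new_mention)
            out_tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(new_mention) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags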
2.6 Deep Transfer Learning for NER
Multi-Task Learning (MTL) is an approach in which several tasks are trained concurrently to improve generalisation (Caruana, 1997). In low-resource NER, MTL can take advantage of the commonalities between related tasks, for example chunking, part-of-speech tagging, and NER, to enhance model performance (Plank et al., 2016). For instance, Bingel and Søgaard (2017) demonstrated that MTL could improve NER performance in low-resource languages by jointly learning related tasks.
Cross-lingual learning involves transferring knowledge learnt from one language to another. In the context of low-resource NER, cross-lingual learning can leverage knowledge obtained from high-resource languages to enhance NER accuracy in low-resource languages (Ruder et al., 2019b). As an example, Conneau et al. (2018) proposed XNLI, which uses a pre-trained sentence encoder to transfer knowledge from high-resource to low-resource languages for various NLP tasks, including NER.
2.7 NER Applications in Agriculture and Plant Pathology
entities, making it difficult for general-purpose NER systems to perform well in this
domain.
Despite these challenges, there are promising opportunities for NER in agriculture
and plant pathology. Developing domain-specific NER systems can lead to better
decision-making, early warning, and research methodologies. For instance, research carried out by Jiang et al. (2021) on fine-tuning BERT-based models to classify plant health bulletins demonstrated the efficacy of employing pre-trained language models in
classifying agricultural texts. Furthermore, integrating NER with other data sources,
such as remote sensing, geospatial data, and expert knowledge, can result in more
comprehensive information systems for agriculture and plant pathology.
2.9 Conclusion
2.9.1 Summary
This review of the literature covered the essential aspects of Named Entity Recogni-
tion (NER), focussing on low-resource domains, deep learning, transfer learning, and
applications in agriculture and plant pathology. It discussed the definitions and tasks
related to NER, the main approaches to NER, including rule-based, machine learning-based, and hybrid methods, and their evaluation metrics. The review also examined
the impact of deep learning on NER, presenting different neural network architectures
and their benefits and challenges. Furthermore, it addressed the concept of transfer
learning, its types, and its application to NER. The review then delved into the chal-
lenges and techniques in low-resource NER and examined the importance, existing
systems, and challenges of NER in agriculture and plant pathology.
Chapter 3
Methodology
3.1 Introduction
This chapter outlines the concepts, principles, procedures, and techniques employed
in this research. It describes the process of collecting and analysing data and the model design. Primary data was obtained from open online databases and news sites and
annotated by a researcher working in crop protection. Furthermore, it provides details
of the experimental structure utilised to both train and evaluate different models. The
study selected several BERT-based models, which have been shown to work well for
NLP (Devlin et al., 2019), and used the transfer learning approach to create a deep
learning architecture that detects named entities of Roots, Tubers and Bananas crop
diseases from scientific and online texts. The chapter also details the method used, after fine-tuning, to select the most appropriate Pretrained Large Language Model for NER and how the resulting model was validated.
3.2 Research Design
Figure 3.1: Project workflow diagram illustrating the system architecture and process
flow for the RT&B crop diseases NER model.
3.3 Business Understanding
Figure 3.2: Diagram demonstrating the interconnections among the different stages
of CRISP-DM (Jensen, 2012)
3.5 Data Preparation
extensively analysed to identify articles that contained specific entities targeted for
extraction as positive examples and those that did not have disease-related informa-
tion as negative examples. A similar process was used to obtain news articles, utilising
Google News keyword searches.
In both cases, Python was used to perform the search queries and to download the retrieved texts. See Appendix A for more details. Only English-language texts were considered in this study.
SemanticScholar (The Allen Institute for Artificial Intelligence, 2022) served as the
main data source for this study. It is an AI-powered research tool that assists re-
searchers in discovering pertinent publications and extracting information from an
extensive corpus of scientific literature. Data was collected from SemanticScholar us-
ing their API, which grants access to a plethora of metadata and abstracts related to
the research topic.
Google News, an online news aggregation service that compiles and presents news
articles from various sources, was also used as a data source. Google News API was
used to collect links to the data. Python was used to download the information
contained in the news articles, their headlines and snippets relevant to the research
topic from the source news website.
Python was used as the primary programming language for data collection. Requests were made to the SemanticScholar and Google News APIs, which return relevant publications and news articles as JSON objects. A set of search queries was designed to retrieve the maximum number of pertinent documents. Keywords such as "cassava diseases", "potato diseases", and "banana crop diseases" were used to ensure comprehensive coverage. The data collection scripts are available in Appendix A.
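The sketch below gives a flavour of this step for the SemanticScholar source, assuming the publicly documented Semantic Scholar Graph API search endpoint and the requests library; the endpoint, field names, and keyword list are illustrative assumptions rather than the exact scripts used (the actual scripts are in Appendix A.1).

import requests

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
KEYWORDS = ["cassava diseases", "potato diseases", "banana crop diseases"]

def fetch_abstracts(query: str, limit: int = 100) -> list[dict]:
    # Request only the metadata needed later for annotation: title, abstract and DOI.
    params = {"query": query, "fields": "title,abstract,externalIds", "limit": limit}
    response = requests.get(SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("data", [])

records = []
for keyword in KEYWORDS:
    for paper in fetch_abstracts(keyword):
        if paper.get("abstract"):  # keep only records that actually contain an abstract
            doi = (paper.get("externalIds") or {}).get("DOI", "")
            records.append({"text": paper["abstract"], "doi": doi})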
After obtaining the raw data, it was necessary to preprocess the text to ensure that
it was in a suitable format for annotation. The study developed a Python script to
clean and preprocess the data. The script performed the following tasks:
• Storage Format and Annotation Tool: After preprocessing, the data were
stored in the JSON Lines (JSONL) format, which is convenient for handling
large volumes of text data. Each line in the JSONL file represents a single doc-
ument or article and includes the required fields for annotation. Each JSONL entry contained two parts: a text field holding the content and a unique identifier field. Scientific articles had a DOI identifier,
whereas news articles had a URL identifier. The JSONL format is compatible
with Prodigy (ExplosionAI, 2023), a popular annotation tool used in this study
to annotate named entities related to RT&B diseases.
The preprocessing script was designed to keep the data as close to the original
format as possible, ensuring that the context and meaning of the text were preserved.
The data obtained served as the basis for the subsequent stages of the study. The
details of the Python script are described in Appendix A.
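A minimal sketch of this preprocessing step is shown below, assuming light whitespace normalisation and the two-field JSONL structure described above; the cleaning rules actually used in the study are given in Appendix A.2.1.

import json
import re

def clean_text(text: str) -> str:
    # Collapse runs of whitespace while otherwise keeping the text close to the original.
    return re.sub(r"\s+", " ", text).strip()

def to_prodigy_jsonl(records: list[dict], path: str) -> None:
    # One line per document: the text plus a unique identifier (DOI for abstracts, URL for news).
    with open(path, "w", encoding="utf-8") as out:
        for rec in records:
            text = clean_text(rec.get("text", ""))
            if not text:
                continue
            entry = {"text": text, "meta": {"id": rec.get("doi") or rec.get("url")}}
            out.write(json.dumps(entry) + "\n")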
Initially, gazetteers containing plant names, diseases, and other key information were
used to partially preannotate a small set of 50 abstracts using Prodigy. These ab-
stracts were then manually reviewed, further annotated, and corrected as necessary.
These manually annotated data were exported from Prodigy and used to train a
named entity recognition model using the spaCy library (Honnibal and Montani,
2021). The study used this intermediate model to improve the data annotation pro-
cess through active learning. The goal was to suggest entities to the annotators who
could verify the predictions’ accuracy and make changes or additions. This saved
time compared to annotating all the text without any suggestions. To assess whether
increasing the volume of data would increase the model’s efficacy, the initial spaCy
model was trained incrementally with data amounts of 25%, 50%, 75% and 100%.
This was accomplished using Prodigy's train-curve recipe, which trains the spaCy model on increasingly larger portions of the data. The aim was to test the viability of creating a model that could effectively
learn the new entities.
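The sketch below illustrates how such gazetteers can be turned into token-level match patterns for partial pre-annotation; the label names mirror the tag set used in the study, but the gazetteer contents and file names are illustrative assumptions (the actual script is in Appendix A.2.2).

import json

GAZETTEERS = {
    "CROP": ["cassava", "banana", "plantain", "potato", "sweet potato"],
    "DISEASE": ["cassava mosaic disease", "fusarium wilt", "late blight"],
}

with open("patterns.jsonl", "w", encoding="utf-8") as out:
    for label, terms in GAZETTEERS.items():
        for term in terms:
            # One token-level pattern per term, matched case-insensitively.
            pattern = [{"lower": token} for token in term.split()]
            out.write(json.dumps({"label": label, "pattern": pattern}) + "\n")

A patterns file of this form can then be supplied to Prodigy's manual annotation recipes (for example via the --patterns option of ner.manual) so that gazetteer matches are pre-highlighted for the annotator.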
Following this process, a total of 300 abstracts and 256 news items were annotated
for the study. These annotated data served as the basis for the subsequent stages of
the investigation, offering a firm foundation for developing and assessing the NER
model. The study used a fine-grained set of tags adapted from tag sets previously used in agricultural NER research (Liu et al., 2020; Malarkodi et al., 2016):
• Crop name
• Pathogen name
• Symptoms
• Disease name
• Geographic location
• Event Date
• Organization
A set of data transformation steps was added to the workflow to prepare the data for
the subsequent stages of this study. An essential transformation involved converting
the data into the Inside, Outside, Beginning (IOB) format. This format is widely used in Named Entity Recognition tasks because it explicitly marks each token's position within a named entity. The generation of the IOB format was facilitated by a
Python script, the details of which are described in Appendix A.
This script was designed with flexibility and adaptability in mind and was able to
export the IOB format with different separators to meet the requirements of an array
of machine-learning models and libraries. In addition, the script was able to divide the annotated data into training, validation, and test splits. This step is crucial when creating machine learning models: the model is fitted to one dataset (the training set), tuned during fine-tuning using another set (the validation or development set), and then assessed on a completely distinct dataset (the test set) that the model has not encountered before. Following this process allows the model's performance and its ability to generalise to previously unseen data to be reliably assessed.
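A condensed sketch of this conversion and splitting step is given below, assuming Prodigy-style annotation records with tokens and spans fields carrying token_start/token_end offsets; the split ratios are placeholders, and the full script is in Appendix A.2.3.

import random

def to_iob(example: dict) -> list[tuple[str, str]]:
    # Convert one annotated document into (token, IOB tag) pairs.
    tokens = [t["text"] for t in example["tokens"]]
    tags = ["O"] * len(tokens)
    for span in example.get("spans", []):
        start, end, label = span["token_start"], span["token_end"], span["label"]
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return list(zip(tokens, tags))

def split(examples: list, train: float = 0.7, dev: float = 0.15, seed: int = 3):
    # Shuffle deterministically, then cut into training / validation / test portions.
    random.Random(seed).shuffle(examples)
    a, b = int(len(examples) * train), int(len(examples) * (train + dev))
    return examples[:a], examples[a:b], examples[b:]

def write_conll(examples: list[dict], path: str, sep: str = " ") -> None:
    # Export IOB data with a configurable separator, one blank line between documents.
    with open(path, "w", encoding="utf-8") as out:
        for ex in examples:
            for token, tag in to_iob(ex):
                out.write(f"{token}{sep}{tag}\n")
            out.write("\n")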
After going through the pre-processing and transformation steps, the data used in this
study were represented in terms of the various data labels and their corresponding
counts. These details are summarised in Table 3.1. The data was also classified
according to specific labels, providing a more granular view of the data distribution.
This categorisation is presented in Table 3.4. These tables give a summary of the
data and offer key insights into the characteristics and composition of the dataset
employed in this research.
Data/Label Count
Total number of annotation documents 556
Total number of tokens 289387
CROP 5996
LOC 1498
PLANT PART 1576
GPE 3615
DATE 2496
DISEASE 1801
SYMPTOM 620
PATHOGEN 2409
ORG 2171
Below is an overview of the data set partitioned into three distinct subsets, training, validation, and evaluation, with proportions of 75%, 15%, and 15%, in that order. This partitioning strategy made it possible to train, fine-tune, and assess how well the model performs on new and previously unseen data. It also provided insight into the composition and balance of the data used during every phase of
Label Count
B-CROP 5996
I-CROP 1240
B-LOC 1498
I-LOC 1622
B-PLANT PART 1576
B-GPE 3615
I-GPE 909
B-DATE 2496
I-DATE 2511
B-DISEASE 1801
I-DISEASE 2177
B-SYMPTOM 620
I-SYMPTOM 2479
B-PATHOGEN 2409
I-PATHOGEN 2760
B-ORG 2171
I-ORG 3803
I-PLANT PART 29
O 249675
Table 3.3: Summary of the overall count of documents and tokens used in the experiments.
3.6 Model Design and Rationale for Selection of Models
The implementation of transfer learning in our study allowed for a rapid and efficient methodology for handling NER tasks in a low-resource data domain. The methodology involved rapidly evaluating multiple Pretrained Large Language Models (PLLMs) to determine the most effective option. As Pretrained Large Language Models continue to evolve, our workflow for training models can be readily modified to incorporate future PLLMs that surpass current models and achieve superior performance. This approach should greatly ease the development of more accurate and efficient NER models, especially in low-resource data domains.
The configuration file contained key parameters for model training, including
model name or path, labels, data directory, output directory, maximum sequence
length, number of training epochs, batch size, save steps, logging steps, and seed. It
also specified whether to report to the Weights & Biases (WandB) Machine Learn-
ing tracking platform (Biewald, 2020). Finally, it also configured whether to perform
training only or include evaluation and prediction. The overwriting of the output
directory and cache was also configurable.
An example configuration for the SciDeBERTa model is as follows:
{
"model_name_or_path": "KISTI-AI/Scideberta-full",
"labels": "./data_30/labels.txt",
"data_dir": "./data_30/sciberta_full/128",
"output_dir": "./output/sciberta_full/128",
"max_seq_length": 256,
"num_train_epochs": 14,
"per_device_train_batch_size": 32,
"save_steps": 500,
"logging_steps": 500,
"seed": 3,
"report_to": "wandb",
"do_train": true,
"do_eval": true,
"do_predict": true,
"overwrite_output_dir": true,
"overwrite_cache": true
}
The input data for each model were pre-processed with that model's own tokeniser from the Hugging Face library, ensuring that they were appropriately formatted for each specific
model architecture. This approach provided a flexible and efficient framework for
conducting various experiments with different BERT models and configurations.
There are limitations on the quantity of tokens that can be handled by BERT
and its related models. The original BERT model can only handle up to 512 tokens
(Devlin et al., 2019). To ensure consistency in our preprocessing, we have limited
the number of tokens to 128 and 256 for all models except the Longformer model.
Longformer is specifically designed to handle larger documents and can process up
to 4096 tokens (Beltagy et al., 2020). Documents larger than the specified token size
were chunked using a preprocessing script as detailed in the appendix A.
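The sketch below shows one way such chunking can be implemented with the Hugging Face tokenisers, greedily splitting a word/tag sequence so that each chunk stays within the model's subword budget; this greedy word-level strategy is a simplification of the preprocessing script in Appendix A.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KISTI-AI/Scideberta-full")  # model path taken from the configuration above

def chunk_example(words, tags, tokenizer, max_length=128):
    # Reserve two positions for the special tokens added around each sequence.
    budget = max_length - 2
    chunks, cur_words, cur_tags, used = [], [], [], 0
    for word, tag in zip(words, tags):
        n_subwords = len(tokenizer.tokenize(word)) or 1
        if used + n_subwords > budget and cur_words:
            chunks.append((cur_words, cur_tags))
            cur_words, cur_tags, used = [], [], 0
        cur_words.append(word)
        cur_tags.append(tag)
        used += n_subwords
    if cur_words:
        chunks.append((cur_words, cur_tags))
    return chunks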
reference for the performance of the models developed using transfer learning. Sub-
sequently, we trained the different BERT models, each with different configurations,
and compared their performance metrics, including the F1 score, accuracy, non-O accuracy, and precision, with those of the baseline.
WandB played a pivotal role in this process. It allowed us to log various informa-
tion about our experiments, including hyperparameters, metrics, and artifacts such
as model weights and predictions. This comprehensive logging facilitated real-time
tracking of our experiments, enabling us to monitor the progress and contrast the
outcomes of various models efficiently.
In addition, the visualisation tools offered by WandB provided different ways of displaying the results, such as graphs, tables, and interactive dashboards. By comparing these metrics, the study could determine the most efficient
setup of the model and ultimately gauge the efficacy of the fine-tuned models when
compared to the baseline. The knowledge gained from WandB simplified the process
of training and evaluating the models.
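A condensed sketch of the fine-tuning and tracking setup is shown below, assuming the Hugging Face transformers Trainer with its built-in Weights & Biases integration; dataset construction and the full label list are omitted, and the hyperparameters mirror the example configuration shown earlier.

from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "KISTI-AI/Scideberta-full"
labels = ["O", "B-CROP", "I-CROP", "B-DISEASE", "I-DISEASE"]  # truncated label list, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

args = TrainingArguments(
    output_dir="./output/sciberta_full/128",
    num_train_epochs=14,
    per_device_train_batch_size=32,
    save_steps=500,
    logging_steps=500,
    seed=3,
    report_to="wandb",  # log hyperparameters, metrics and artifacts to Weights & Biases
)

# train_dataset and eval_dataset would be the tokenised, IOB-labelled splits prepared earlier:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train(); trainer.evaluate()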
3.7 Model Evaluation
• True positives (TP): If an entity was predicted to belong to a class and it indeed
matched that class.
• False positives (FP): If an entity was predicted to belong to a class and it did
not match that class.
• True negatives (TN): If an entity was predicted not to belong to a class and it indeed did not belong to that class.
• False negatives (FN): If an entity was predicted not to belong to a class, while
it actually did so.
We used the confusion matrix to gauge the effectiveness and obtain a critical
assessment of the model’s correct and incorrect classifications. The matrix offers
insight into the errors made by the classifier and the types of errors that occur. This
is critical as certain entities, such as crop disease, may require correct prediction for
effective monitoring, while others, such as the plant part, may not be as essential.
Our study specifically employed a normalised multiclass confusion matrix to analyse
how the models performed. The confusion matrix visually displayed the performance
of a classification model across multiple classes, the Y-axis displayed the actual labels
of the entities, while the X-axis showed the predicted labels.
Assuming that a study has n entity classes, the confusion matrix would be an
n × n table. Every cell within the matrix denotes the proportion of model predictions,
categorised by the actual and predicted classes.
In the normalised confusion matrix, each cell value is a number between 0 and
1, representing the proportion of predictions for each class. The study used the
normalised confusion matrix instead of a standard confusion matrix. The standard
confusion matrix, which presents absolute counts, could have led to misleading inter-
pretations due to the high prevalence of nonentities (O entities) and the imbalance
among other entities. The normalised confusion matrix, on the other hand, provided a
more accurate and fair evaluation of our model’s performance across all entity classes.
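For illustration, a normalised confusion matrix of this kind can be produced from flattened per-token gold and predicted labels with scikit-learn, as in the sketch below; the library choice and plotting details are assumptions rather than the study's exact code.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_normalised_confusion(y_true, y_pred, labels):
    # normalize="true" divides each row by its total, so every cell is the proportion
    # of a gold class that was predicted as each label (a value between 0 and 1).
    cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
    disp = ConfusionMatrixDisplay(cm, display_labels=labels)
    disp.plot(xticks_rotation=90, values_format=".2f")
    plt.tight_layout()
    plt.show()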
The study also used the following metrics to validate the robustness of the model:
Accuracy: The percentage of correct predictions in the test data set, calculated as
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]
Precision: The percentage of true positives among all predicted positive instances,
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall: The percentage of true positives among all actually positive instances,
\[ \text{Recall} = \frac{TP}{TP + FN} \]
F1-Score: The harmonic mean of precision and recall; the higher the F1 score, the better, and a perfect model would have an F1 score of 1,
\[ F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \]
The study evaluated all models in the experiment, including the baseline model
and the BERT models that used transfer learning, using these metrics.
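The sketch below shows how these metrics can be computed from IOB-tagged predictions, assuming the seqeval library that is commonly paired with Hugging Face token-classification examples; non-O accuracy is computed by hand since it is not a standard seqeval metric.

from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    # y_true / y_pred are lists of IOB tag sequences, one sequence per document.
    flat = [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]
    non_o = [(t, p) for t, p in flat if t != "O"]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "non_o_accuracy": sum(t == p for t, p in non_o) / max(len(non_o), 1),
    }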
Chapter 4
4.3 Discussion of Results
the models, including non-O accuracy, accuracy, precision, F1 score, and recall. It
evaluated the models' overall performance by analysing their ability to accurately identify entities, to classify non-entities, and to balance precision and recall. Note that the Hugging Face model prefix has
been omitted from the table due to space constraints.
performance of PLLMs that have been trained using transfer learning. The baseline
model achieved a non-O accuracy of 75.52%, an F1 score of 80.6%, an accuracy of
96.02%, a precision of 84.16%, and a recall of 77.44%. Although these results are
respectable, the objective of the experiments was to explore whether the transformer-
based models could outperform this baseline.
As demonstrated in the results shown in Table 4.1, two families of models trained
with transfer learning showed very promising results for Named Entity Recognition
of RT&B crop diseases. The first is the family of DeBERTa-based models, especially SciDeBERTa (Jeong and Kim, 2022) and DeBERTa version 3 (He et al., 2023). The second is PubMedBERT (Gu et al., 2022). These models performed well on all observed
metrics, especially on the F1 score and the non-O accuracy.
The SciDeBERTa model ’Scideberta-full-128’ trained on data with a maximum
length of 128 tokens displayed superior performance across all metrics. It achieved
the highest non-O accuracy of 91.39% and accuracy of 97.80%, indicating a significant
improvement in correctly identifying both entities and non-entities. Furthermore, the
F1 score of this model was among the best, indicating an effective balance between identifying true positives and minimising both false negatives and false positives.
Figure 4.1: F1 Score for the 10 best models compared to the baseline
It is important to mention that certain models used in the study did not show
disease (a false negative) can be much higher than the cost of incorrectly identifying
a non-entity as a disease (a false positive).
Chapter 5
Conclusions
5.1 Introduction
This chapter presents an assessment of the research conducted in this study, focussing
on the application of Named Entity Recognition (NER) in fields with low machine
learning resources. The study specifically looked at the recognition of crop disease
entities for Roots, Tubers and Bananas crops. The study’s conclusions, drawn from
empirical evidence gathered through experimentation, offer valuable insights into the
effectiveness of transfer learning in creating new models from Pretrained Large Language Models for NER tasks. The chapter recognises the study's limitations but also
emphasises the potential for further research. It also recognises that our knowledge of
NER in low-resource domains is constantly developing. The chapter concludes with
future work recommendations, with the aim of guiding subsequent research towards
enhancing our capabilities in NER tasks, particularly in high-stakes, low-resource
domains such as crop disease recognition.
ically for a NER task in the specialised and underexplored domain of RT&B crop
diseases. The generated models were evaluated using several NER metrics, including
non-O accuracy and F1 score, providing a quantitative assessment of their perfor-
mance. These results demonstrate the effectiveness and great potential of leveraging
transfer learning techniques to create novel models, significantly improving NER per-
formance in low-resource domains.
It is worth mentioning that there are certain limitations to consider when inter-
preting the results of the study. A primary concern in the use of the models is their
ability to generalise accurately to related or different data and domains. Given this
challenge, it is important to exercise caution when applying models to broader ap-
plications. Furthermore, the study was unable to perform extensive hyperparameter tuning, a missed opportunity to further improve the models' performance.
Finally, the ever-evolving landscape of PLLM development and innovation presents
an inherent challenge in maintaining the relevance of the study.
Future work could explore alternative transfer learning techniques or other PLLM
architectures that may be more effective in low-resource settings. Research could
also investigate methods for generating or augmenting data in specialised domains to
alleviate data scarcity. This research has applications beyond RT&B crop diseases
and can be extended to other crops and other low-resource domains, presenting more
possibilities for future research. This study serves as a stepping stone toward more
advanced and effective NER solutions for low-resource domains.
5.3 Final Recommendations
erate additional annotated data for use with transfer learning and to expand the avail-
able training data. The study also suggests that further research should go deeper into hyperparameter tuning than was possible in the current study. This process could uncover
more potential for transfer learning models to make use of the limited data resources
that are present in these domains.
Following these recommendations, the research community can push the bound-
aries of Named Entity Recognition in low-resource domains, providing valuable tools
and insights for various stakeholders in agriculture and beyond.
References
Akbik, A., Bergmann, T., and Vollgraf, R. (2019). Pooled Contextualized Embeddings for Named Entity Recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, Minneapolis, Minnesota. Association for Computational Linguistics.

Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Beltagy, I., Peters, M. E., and Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150 [cs].

Bingel, J. and Søgaard, A. (2017). Identifying beneficial task relations for multi-task learning in deep neural networks. arXiv:1702.08303 [cs].

Bíró, A., Cuesta-Vargas, A. I., Martín-Martín, J., Szilágyi, L., and Szilágyi, S. M. (2023). Synthetized Multilanguage OCR Using CRNN and SVTR Models for Realtime Collaborative Tools. Applied Sciences, 13(7):4419.
Carvajal-Yepes, M., Cardwell, K., Nelson, A., Garrett, K. A., Giovani, B., Saunders, D. G. O., Kamoun, S., Legg, J. P., Verdier, V., Lessel, J., Neher, R. A., Day, R., Pardey, P., Gullino, M. L., Records, A. R., Bextine, B., Leach, J. E., Staiger, S., and Tohme, J. (2019). A global surveillance system for crop diseases. Science, 364(6447):1237–1239.

Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., and Androutsopoulos, I. (2020). LEGAL-BERT: The Muppets straight out of Law School. arXiv:2010.02559 [cs].

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs].

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537.

Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. (2018). Word Translation Without Parallel Data. arXiv:1710.04087 [cs].

Derczynski, L., Nichols, E., van Erp, M., and Limsopatham, N. (2017). Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, Copenhagen, Denmark. Association for Computational Linguistics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs].

Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 363–370, Ann Arbor, Michigan. Association for Computational Linguistics.

Girvetz, E., Ramirez-Villegas, J., Claessens, L., Lamanna, C., Navarro-Racines, C., Nowak, A., Thornton, P., and Rosenstock, T. S. (2019). Future Climate Projections in Africa: Where Are We Headed? In Rosenstock, T. S., Nowak, A., and Girvetz, E., editors, The Climate-Smart Agriculture Papers: Investigating the Business of a Productive, Resilient and Low Emission Future, pages 15–27. Springer International Publishing, Cham.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT Press.

Goyal, A., Gupta, V., and Kumar, M. (2018). Recent Named Entity Recognition and Classification techniques: A systematic review. Computer Science Review, 29:21–43.

Graves, A., Fernández, S., and Schmidhuber, J. (2005). Bidirectional LSTM networks for improved phoneme classification and recognition. In Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005: 15th International Conference, Warsaw, Poland, September 11-15, 2005. Proceedings, Part II, pages 799–804. Springer.

Grishman, R. (1995). The NYU System for MUC-6 or Where's the Syntax? Technical report, New York University, Department of Computer Science.
Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. (2022). Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Transactions on Computing for Healthcare, 3(1):1–23. arXiv:2007.15779 [cs].

Guo, X., Hao, X., Tang, Z., Diao, L., Bai, Z., Lu, S., and Li, L. (2021). ACE-ADP: Adversarial Contextual Embeddings Based Named Entity Recognition for Agricultural Diseases and Pests. Agriculture, 11(10):912.

Guo, X., Zhou, H., Su, J., Hao, X., Tang, Z., Diao, L., and Li, L. (2020). Chinese agricultural diseases and pests named entity recognition with multi-scale local context features and self-attention mechanism. Computers and Electronics in Agriculture, 179:105830.

He, P., Gao, J., and Chen, W. (2023). DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv:2111.09543 [cs].

He, P., Liu, X., Gao, J., and Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv:2006.03654 [cs].

Huang, Z., Xu, W., and Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv:1508.01991 [cs].
Jensen, K. (2012). Process diagram showing the relationship between the different phases of CRISP-DM. CC BY-SA 3.0.

Jeong, Y. and Kim, E. (2022). SciDeBERTa: Learning DeBERTa for Science Technology Documents and Fine-Tuning Information Extraction Tasks. IEEE Access, 10:60805–60813.

Jiang, S., Angarita, R., Cormier, S., and Rousseaux, F. (2021). Fine-tuning BERT-based models for Plant Health Bulletin Classification. arXiv:2102.00838 [cs].

Kreuze, J., Adewopo, J., Selvaraj, M., Mwanzia, L., Kumar, P. L., Cuellar, W. J., Legg, J. P., Hughes, D. P., and Blomme, G. (2022). Innovative Digital Technologies to Monitor and Control Pest and Disease Threats in Root, Tuber, and Banana (RT&B) Cropping Systems: Progress and Prospects. In Thiele, G., Friedmann, M., Campos, H., Polar, V., and Bentley, J. W., editors, Root, Tuber and Banana Food System Innovations: Value Creation for Inclusive Outcomes, pages 261–288. Springer International Publishing, Cham.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural Architectures for Named Entity Recognition. arXiv:1603.01360 [cs].

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
Legg, J. P., Lava Kumar, P., Makeshkumar, T., Tripathi, L., Ferguson, M., Kanju, E., Ntawuruhunga, P., and Cuellar, W. (2015). Chapter Four - Cassava Virus Diseases: Biology, Epidemiology, and Management. In Loebenstein, G. and Katis, N. I., editors, Advances in Virus Research, volume 91 of Control of Plant Virus Diseases, pages 85–142. Academic Press.

Li, J., Sun, A., Han, J., and Li, C. (2020). A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions on Knowledge and Data Engineering. arXiv:1812.09449 [cs].

Lin, Y., Shen, S., Liu, Z., Luan, H., and Sun, M. (2016). Neural Relation Extraction with Selective Attention over Instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133, Berlin, Germany. Association for Computational Linguistics.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs].

Liu, Z., Luo, M., Yang, H., and Liu, X. (2020). Named Entity Recognition for the Horticultural Domain. Journal of Physics: Conference Series, 1631(1):012016.

Malarkodi, C., Lex, E., and Sobha, L. D. (2016). Named Entity Recognition for the Agricultural Domain. In 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2016); Research in Computing Science.

Matthew, P., Mark, N., Mohit, I., Matt, G., Christopher, C., Kenton, L., and Luke, Z. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics.
McCallum, A. and Li, W. (2003). Early results for Named Entity Recognition with
Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In
Proceedings of the Seventh Conference on Natural Language Learning at HLT-
NAACL 2003, pages 188–191. 9
Miller, S. A., Beed, F. D., and Harmon, C. L. (2009). Plant Disease Diagnostic
Capabilities and Networks. Annual Review of Phytopathology, 47(1):15–38. 2, 5, 21
Mwanzia, Leroy (2023). lmwanzia/deberta v2 xlarge ner rtb diseases Hugging Face.
34
Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classi-
fication. Lingvistic Investigationes, 30(1):3–26. ISBN: 0378-4169 Publisher: John
Benjamins Type: https://doi.org/10.1075/li.30.1.03nad. 8, 9
Pan, X., Zhang, B., May, J., Nothman, J., Knight, K., and Ji, H. (2017). Cross-lingual
Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
13
Patil, S., Pawar, S., and Palshikar, G. (2013). Named Entity Extraction using In-
formation Distance. In Proceedings of the Sixth International Joint Conference on
Natural Language Processing, pages 1264–1270, Nagoya, Japan. Asian Federation
of Natural Language Processing. 13, 21, 22
Petsakos, A., Prager, S. D., Gonzalez, C. E., Gama, A. C., Sulser, T. B., Gbegbelegbe,
S., Kikulwe, E. M., and Hareau, G. (2019). Understanding the consequences of
changes in the production frontiers for roots, tubers and bananas. Global Food
Security, 20:180–188. 1
Plank, B., Søgaard, A., and Goldberg, Y. (2016). Multilingual Part-of-Speech Tag-
ging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss.
arXiv:1604.05529 [cs]. 14
Prain, G. and Naziri, D. (2020). The role of root and tuber crops in strengthening
agrifood system resilience in Asia. A literature review and selective stakeholder
assessment. Report, International Potato Center. ISBN: 9789290605393. 1
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving
Language Understanding by Generative Pre-Training. Technical report, OpenAI.
11, 12
Ristaino, J. B., Anderson, P. K., Bebber, D. P., Brauman, K. A., Cunniffe, N. J.,
Fedoroff, N. V., Finegold, C., Garrett, K. A., Gilligan, C. A., Jones, C. M., Martin,
M. D., MacDonald, G. K., Neenan, P., Records, A., Schmale, D. G., Tateosian,
L., and Wei, Q. (2021). The persistent threat of emerging plant disease pan-
demics to global food security. Proceedings of the National Academy of Sciences,
118(23):e2022239118. 2, 3, 5, 21
RTB (2016). Roots, Tubers and Bananas (RTB) Full Proposal 2017-2022. Technical
report, CGIAR Research Program on Roots, Tubers and Bananas (RTB). 1, 5
Ruder, S., Peters, M. E., Swayamdipta, S., and Wolf, T. (2019a). Transfer Learn-
ing in Natural Language Processing. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Tutori-
als, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguis-
tics. 12, 13, 28
Ruder, S., Vulić, I., and Søgaard, A. (2019b). A Survey of Cross-lingual Word Embed-
ding Models. Journal of Artificial Intelligence Research, 65:569–631. 13, 14
Savary, S., Willocquet, L., Pethybridge, S. J., Esker, P., McRoberts, N., and Nelson,
A. (2019). The global burden of pathogens and pests on major food crops. Na-
ture Ecology & Evolution, 3(3):430–439. Number: 3 Publisher: Nature Publishing
Group. 2
Scherm, H., Thomas, C., Garrett, K., and Olsen, J. (2014). Meta-Analysis
and Other Approaches for Synthesizing Structured and Unstructured Data in
Plant Pathology. Annual Review of Phytopathology, 52(1):453–476. eprint:
https://doi.org/10.1146/annurev-phyto-102313-050214. 3
Seed World (2018). Global Initiative Announced to Protect World's Plants from Pests.
2
The Allen Institute for Artificial Intelligence (2022). Semantic Scholar. 21, 22
Thiele, G., Friedmann, M., Campos, H., Polar, V., and Bentley, J. W., editors (2022).
Root, Tuber and Banana Food System Innovations: Value Creation for Inclusive
Outcomes. Springer International Publishing, Cham. 1
Thiele, G., Khan, A., Heider, B., Kroschel, J., Harahagazwe, D., Andrade, M., Bonier-
bale, M., Friedmann, M., Gemenet, D., Cherinet, M., Quiroz, R., Faye, E., and
Dangles, O. (2017). Roots, Tubers and Bananas: Planning and research for climate
resilience. Open Agriculture, 2(1):350–361. Publisher: De Gruyter Open Access. 1,
2
Thomas, C. S., Nelson, N. P., Jahn, G. C., Niu, T., and Hartley, D. M. (2011). Use
of media and public-domain Internet sources for detection and assessment of plant
health threats. Emerging Health Threats Journal, 4(1):7157. Publisher: Taylor &
Francis eprint: https://doi.org/10.3402/ehtj.v4i0.7157. 3, 7
Thomas-Sharma, S., Abdurahman, A., Ali, S., Andrade-Piedra, J. L., Bao, S.,
Charkowski, A. O., Crook, D., Kadian, M., Kromann, P., Struik, P. C.,
Torrance, L., Garrett, K. A., and Forbes, G. A. (2016). Seed degenera-
tion in potato: the need for an integrated seed health strategy to mitigate
the problem in developing countries. Plant Pathology, 65(1):3–16. eprint:
https://onlinelibrary.wiley.com/doi/pdf/10.1111/ppa.12439. 2
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser,
Ł., and Polosukhin, I. (2017). Attention is All you Need. In Guyon, I., Luxburg,
U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R.,
editors, Advances in Neural Information Processing Systems, volume 30. Curran
Associates, Inc. 11, 19, 28, 31
Wei, J. and Zou, K. (2019). EDA: Easy Data Augmentation Techniques for Boosting
Performance on Text Classification Tasks. arXiv:1901.11196 [cs]. 13
Wirth, R. and Hipp, J. (2000). CRISP-DM: Towards a Standard Process Model for
Data Mining. In Proceedings of the 4th international conference on the practical
applications of knowledge discovery and data mining, volume 1, page 11. 19
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P.,
Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma,
C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q.,
and Rush, A. M. (2020). HuggingFace’s Transformers: State-of-the-art Natural
Language Processing. arXiv:1910.03771 [cs]. 29, 31
Yang, Z., Salakhutdinov, R., and Cohen, W. (2016). Multi-Task Cross-Lingual Se-
quence Tagging from Scratch. arXiv:1603.06270 [cs]. 14
Zhang, J., Guo, M., Geng, Y., Li, M., Zhang, Y., and Geng, N. (2021). Chinese
named entity recognition for apple diseases and pests based on character augmen-
tation. Computers and Electronics in Agriculture, 190:106464. 15
Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level Convolutional Net-
works for Text Classification. In Cortes, C., Lawrence, N., Lee, D., Sugiyama,
M., and Garnett, R., editors, Advances in Neural Information Processing Systems,
volume 28. Curran Associates, Inc. 11
Zhou, H., Ning, S., Yang, Y., Liu, Z., Lang, C., and Lin, Y. (2018). Chemical-induced
disease relation extraction with dependency information and prior knowledge. Jour-
nal of Biomedical Informatics, 84:171–178. 14
Appendix A
Code Snippets
We used the Python programming language for data collection and preprocessing, as
well as for training and evaluating the models. This appendix provides snippets that
highlight key aspects of those processes and that were instrumental in implementing
the proposed methodology and conducting the experiments. Links to the complete
collection of Python code used in the project, released as open source on GitHub,
are provided in this appendix, and the snippets offer insight into the technical
implementation for readers interested in replicating or extending the methodology
presented in this thesis. All Python files with main functionality can be invoked
from the console, which allows them to be run unattended or from a notebook.
A.1 Data Acquisition
optional arguments:
  -h, --help            show this help message and exit
  -c CROP, --crop CROP  Crop name
  -s SEARCH, --search SEARCH
                        Search string
  -sd STARTDATE, --startdate STARTDATE
                        Start date
  -ed ENDDATE, --enddate ENDDATE
                        End date
  -p PAGESIZE, --pagesize PAGESIZE
                        Page size
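The listing above shows the command-line help of the news-acquisition script. Below is a
minimal sketch of how such a script can be assembled, assuming the GoogleNews package from
PyPI; the function fetch_news, the output file name and the interpretation of the page-size
option are illustrative assumptions rather than the project's exact implementation, and a
comparable script can target the Semantic Scholar API for scientific text.

# Hypothetical sketch of the news-acquisition CLI; the option names mirror the
# help text above, while the GoogleNews calls follow the public PyPI package and
# may differ slightly between package versions.
import argparse
import json

from GoogleNews import GoogleNews  # pip install GoogleNews


def fetch_news(crop, search, startdate, enddate, pagesize):
    """Query Google News for `search` and return a list of result dictionaries."""
    googlenews = GoogleNews(lang="en", start=startdate, end=enddate)
    googlenews.search(search)                # fetches the first page of results
    for page in range(2, pagesize + 1):      # results accumulate across pages
        googlenews.get_page(page)
    return [{"crop": crop, **article} for article in googlenews.result()]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download news text for a crop.")
    parser.add_argument("-c", "--crop", required=True, help="Crop name")
    parser.add_argument("-s", "--search", required=True, help="Search string")
    parser.add_argument("-sd", "--startdate", default="", help="Start date, e.g. 01/01/2015")
    parser.add_argument("-ed", "--enddate", default="", help="End date, e.g. 12/31/2022")
    parser.add_argument("-p", "--pagesize", type=int, default=1,
                        help="Page size (treated here as the number of result pages)")
    args = parser.parse_args()

    articles = fetch_news(args.crop, args.search, args.startdate, args.enddate, args.pagesize)
    with open(f"{args.crop}_news.json", "w", encoding="utf-8") as outfile:
        json.dump(articles, outfile, indent=2, default=str)  # default=str handles datetimes

Such a script could be run, for example, as python download_news.py -c cassava -s "cassava
mosaic disease" -sd 01/01/2015 -ed 12/31/2022 -p 3; the script name and output layout are,
again, only illustrative.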
A.2 Data Preprocessing
    return text

normalize_text(scitext)
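The snippet above shows the end of the normalisation routine and the call that applies it to
the downloaded scientific text (scitext). A minimal sketch of a complete normalize_text
function is given below; it assumes the cleaning is limited to Unicode normalisation, URL
removal, whitespace collapsing, optional lowercasing and digit-to-zero replacement, and the
project's actual implementation may include additional steps.

# Illustrative normalisation sketch; the exact cleaning steps are assumptions.
import re
import unicodedata


def normalize_text(text, lowercase=True, replace_digits=True):
    """Normalise raw text before annotation and training."""
    text = unicodedata.normalize("NFKC", text)    # unify Unicode representations
    text = re.sub(r"https?://\S+", " ", text)     # drop URLs
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    if lowercase:
        text = text.lower()
    if replace_digits:
        text = re.sub(r"\d", "0", text)           # map every digit to zero
    return text


print(normalize_text("Cassava Brown Streak Disease was reported in 2004 at https://example.org"))
# -> cassava brown streak disease was reported in 0000 at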
A.3 Model Training and Evaluation
optional arguments:
  -h, --help            show this help message and exit
  -f FILE_PATHS [FILE_PATHS ...], --file_paths FILE_PATHS [FILE_PATHS ...]
                        Input file names; can be multiple space-separated files.
  -min MIN_LENGTH, --min_length MIN_LENGTH
                        Minimum sentence length; default is the document length
  -max MAX_LENGTH, --max_length MAX_LENGTH
                        Approximate maximum sentence length; default is the document length
  -s, --split           Split the data into train, test and validation sets.
  -sm SPACY_MODEL, --spacy_model SPACY_MODEL
                        spaCy language model.
  -sep SEPARATOR, --separator SEPARATOR
                        Separator for the output file. Allowed values are ","
                        (comma), \t (tab) and \s (space). Default is comma.
• Text processing parameters, such as lowercasing words and replacing digits with
zeros (a sketch of the subsequent training and evaluation step is given below).
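The data-preparation script above exports the annotated sentences, with train, test and
validation splits, into a token-per-line file with the chosen separator. The following is a
minimal, illustrative sketch of how a transformer model can then be fine-tuned and evaluated
for NER with the Hugging Face Transformers library (Wolf et al., 2020); the base model name,
the label scheme, the data file names and the hyperparameters are placeholders rather than
the values used in the experiments, and the sketch assumes the exported splits have been
converted to JSON lines with tokens and ner_tags fields.

# Illustrative fine-tuning sketch; "bert-base-cased", the label list, the file
# names and the hyperparameters are placeholders, not the project's exact setup.
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-CROP", "I-CROP", "B-DISEASE", "I-DISEASE"]   # assumed label scheme
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)          # fast tokenizer needed for word_ids()
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Assumes JSON-lines files with "tokens" (list of words) and "ner_tags" (label indices).
dataset = load_dataset("json", data_files={"train": "train.json", "validation": "val.json"})

def tokenize_and_align(batch):
    """Tokenise pre-split words and align word-level labels with sub-word tokens."""
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        prev, ids = None, []
        for wid in word_ids:
            if wid is None:
                ids.append(-100)               # special tokens are ignored by the loss
            elif wid != prev:
                ids.append(word_labels[wid])   # label only the first sub-word of each word
            else:
                ids.append(-100)
            prev = wid
        all_labels.append(ids)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True)
seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=2)
    true_preds = [[labels[pr] for pr, l in zip(ps, ls) if l != -100]
                  for ps, ls in zip(preds, p.label_ids)]
    true_labels = [[labels[l] for pr, l in zip(ps, ls) if l != -100]
                   for ps, ls in zip(preds, p.label_ids)]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {"precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"]}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-model", num_train_epochs=3,
                           per_device_train_batch_size=16, evaluation_strategy="epoch"),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())

Here seqeval reports entity-level precision, recall and F1 on the validation split at the
end of every epoch, with a final evaluation printed after training.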