AgriBot: Agriculture-Specific Question Answer System
Naman Jain, Pranjali Jain, Pratik Kayal, P Jayakrishna Sahit, Soham Pachpande, Jayesh Choudhari, Mayank Singh
Discipline of Computer Science, Indian Institute of Technology Gandhinagar, Gandhinagar, India
Abstract — India is an agro-based economy, and proper information about agricultural practices is the key to optimal agricultural growth and output. In order to answer the queries of farmers, we have built an agricultural chatbot based on the dataset from the Kisan Call Center. The system is robust enough to answer queries related to weather, market rates, plant protection and government schemes. It is available 24x7, can be accessed through any electronic device, and delivers information in an easily understandable form. The system is based on a sentence embedding model which gives an accuracy of 56%. After eliminating synonyms and incorporating entity extraction, the accuracy jumps to 86%. With such a system, farmers can get easier access to information about farming-related practices and hence achieve better agricultural output. The job of the Call Center workforce would be made easier, and the effort of these workers could be redirected towards better goals.

Keywords — Question-Answering, Agriculture, Sentence Embedding, Text Similarity

I. INTRODUCTION
In India, agriculture plays an important role in economic development, contributing about 16% to the overall GDP and accounting for the employment of approximately 52% of the Indian population [12]. According to the Farmers' Portal [12], rapid growth in agriculture is essential not only for self-reliance but also to earn valuable foreign exchange. However, most farmers do not have access to authentic information about the latest farming practices and trends. One of the reasons is that people involved in the occupation of farming are comparatively slow adopters of the latest technology. Traditionally, field officers visit the farmlands and provide training, advice, and support to the farmers. Many rural villages lack ease of accessibility, which results in wasted time and money spent on obtaining information or contacting officials. Hence, farmers are often unable to obtain agricultural information which could help them take better decisions related to the crops that they cultivate. This leads to reduced crop yield, increased wastage of valuable labor, and market inefficiency. These factors add up to severely impact a farmer's earnings, time and opportunities to increase the crop yield.

In recent years, the use of Information Technology (IT) in agriculture extension has grown. According to TRAI [13], there were 647 million urban mobile subscribers and 519 million rural mobile subscribers as of August 2018, and The Economic Times predicted that by 2020, 50% of internet users would be from the rural sector. This data shows that mobile connectivity is growing exponentially, aiding the promotion of agricultural information through IT services. The Government faces difficulties in spreading vital information related to farming, and the problem worsens due to the spread of misinformation. These problems persist because of the vast language diversity and the rural population's lack of confidence in modern technologies. In such a scenario, the use of mobile devices to spread agriculture-related information appears to be a promising solution.
II. RELATED WORK

To address the problem defined above, the Government has initiated various agriculture-related IT services which provide access to a central knowledge bank. The most prominent services are mentioned below.

Farmer's Portal

Farmer's Portal makes use of the Internet as a tool to make knowledge accessible. Farming-related information is available on this website, but it is mainly presented in English and Hindi. However, one of the significant challenges faced by this service is that most farmers are not literate enough to operate computers properly. According to a survey by the Times of India [14], only 6% of rural households owned a computer, and only 18% of rural youth knew how to operate one. This survey explains why a website is not a feasible option for spreading agriculture-related information to farmers, given the lack of computer training.

AgriApp

AgriApp is one of the most popular apps among farmers, with a rating of 4.3 out of 5 on the Google Play store. The portal brings information about farming resources and government services to farmers through an online mobile application. It also provides a chat option which enables farmers to chat with an agricultural expert. However, AgriApp is a knowledge bank wherein the user has to search for a particular piece of information manually, and if the user opts to chat with the application operator instead, the user has to wait a significant amount of time for a response from the operator.
Kisan Call Center (KCC)

KCC is a helpline service that allows farmers to clarify their queries over the phone. Since the service facilitates a telephonic conversation, it is able to cater to the needs of farmers on an individual basis: the information is provided in their native language and is relevant to their location. The farmers also get valuable information related to new farming practices. This service reduces the difficulty a farmer faces in asking for help related to the latest agricultural practices, which also helps build the trust of the rural class in the Government. However, KCC services are only available from 6 AM to 10 PM, and skilled labor with good knowledge of agricultural practices is required to operate the Call Center. It has also been observed that, over time, queries to KCC have increased exponentially due to growing awareness among farmers as well as technology adoption. This can create the need to set up new call centers, which would require massive costs along with the training of human resources.
According to our analysis of Kisan Call Center data, about 1.36 million calls were made to Kisan Call Centers in 2017, which increased to about 1.72 million calls in 2018, a 21% increase. In Maharashtra alone, 92% of the calls were redundant in 2017, and across all states only 5% of the queries made in 2018 were new compared to 2017. The number of questions is increasing gradually, and soon these call centers may not be able to answer all of them on time, even though most of the queries are redundant. Hence, a scalable solution is needed to accommodate the increase in the number of queries in a better way. We use the power of Artificial Intelligence (AI) to build a solution to this problem.

There exist a good number of Q&A models which deal with a similar problem. The authors of [15] use a knowledge-graph-based method, where a knowledge graph is built upon the data and questions are answered using the knowledge graph. Another work [16] is a comprehension-based question answering system; in such systems, for every question the system generates an answer based on the knowledge gathered by understanding the comprehensions. However, these methods cannot be used to solve our problem, because our data is neither properly formatted as a comprehension nor can facts be extracted from it to form a knowledge graph. Another way to approach the problem is Question-Answer pair hashing. However, it must be modified to fit our needs, because many semantically similar questions have different answers. Hence, we come up with a practical methodology to implement an agriculture-specific AI system.

In a nutshell, the right information is crucial for the social and economic activities which fuel the development process of a nation. To achieve this, we require a decision support solution as simple as a messaging app, one which makes use of the Internet to ease accessibility and automates the conversation with an operator to avoid redundancy. The system should also integrate features like real-time outputs, a farmer-friendly interface, information delivery in multiple languages, and cost-effectiveness for both the farmer and the operating authority. Such a solution can potentially bridge the information gap for farmers and facilitate building a productive market.

III. DATA COLLECTION

We collected our data from https://data.gov.in [9][10]. The collection of each file requires entry of the user's name and email ID. Since we are collecting data for all states of India for the past five years, we automated the whole process through a JavaScript program which downloads and stores the files as Comma Separated Values (CSV).

For each state, we retrieve district-wise CSV files. Each file contains the query ID, the query, the query type, the query creation time, the state name, the district name, the season, and the answer to the given query. Table I shows the total number of queries, namely question-answer pairs, in the data; it also explicitly shows the number of queries in the years 2017 and 2018.

The data we obtained is not properly formatted or machine readable, because each record is a summary of a telephonic conversation between a KCC employee and a farmer, noted down by the KCC personnel to maintain records. One of the most critical aspects of the data is that it is multilingual: some words are written in the native language of a particular state, and in some cases the entire data entry has been written in the native language. In addition, the data entries do not have proper grammar, spelling or punctuation. Another important aspect of the data is the ambiguity in the responses to the queries. Most of the answers do not completely describe the information asked for in the question, and a large number of answers related to fertilizer names, or to quantities of fertilizer, pesticide or water to be used on the plants, are just numerical figures. We also noticed that the answer to the same question can be different in different states.
Some of the answers to a query vary even within the same district, and the answer to a question also depends on the season in which it has been asked.
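The paper automates the download with a JavaScript program; as an illustration only, the short Python sketch below shows how the downloaded district-wise CSV files could be consolidated with pandas [5] into a single table holding the fields listed above. The folder name and column headers are assumptions, not taken from the dataset.

```python
# Illustrative sketch, not the authors' code: gather district-wise KCC CSV
# files (already downloaded from https://data.gov.in) into one DataFrame.
import glob
import pandas as pd

# Assumed header names for the fields described in Section III.
EXPECTED = ["QueryID", "QueryText", "QueryType", "CreatedOn",
            "StateName", "DistrictName", "Season", "KccAns"]

def load_kcc_records(csv_dir: str) -> pd.DataFrame:
    frames = [pd.read_csv(path) for path in glob.glob(f"{csv_dir}/*.csv")]
    data = pd.concat(frames, ignore_index=True)
    return data[[c for c in EXPECTED if c in data.columns]]

if __name__ == "__main__":
    records = load_kcc_records("kcc_csv")   # hypothetical local folder
    print(len(records), "question-answer pairs loaded")
```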
IV. DATA ANALYSIS
To understand our dataset, we explored the features we are already provided with, namely the state from which each query was asked, the season, and the query type. Based on this information, we derived the statistics presented in the following tables and figures: the number of queries per state (Fig. 2), the number of queries per crop (Table III) as well as per crop type (Table II), the number of queries per season (Table VI), and the distribution of queries across sectors (Fig. 1) as well as query types (Table V).

The data analysis gives a good picture of the agricultural landscape of India regarding which crops are popular in which state, what kind of queries are most commonly asked, and the different sectors the queries relate to. For instance, the maximum number of queries asked were related to cereals, especially paddy, and the maximum number of queries came from the state of Uttar Pradesh. All of these statistics turned out to be factually accurate. We also noticed that queries related to weather account for about 64.4% of the total number of queries asked. Our model deals with such queries differently, and quite easily, by integrating a weather API.

The number of queries asked during each season is shown in Table VI. This distribution indicates that more farmers grow crops during the Kharif season than during other seasons; hence, about 54% of the queries belong to this season.

Another thing we noticed from the dataset is the number of times each question is repeated. From 2017 to 2018, there was only a 5% increase in the number of unique questions. This analysis shows that only a limited number of unique queries are encountered, while most of the queries are redundant. It also shows that the number of queries varies drastically from state to state, and that the answer to a query differs based on the state and district from which it has been asked.
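As an aside, the kind of per-state, per-season and per-type counts discussed in this section can be reproduced from the consolidated DataFrame of the earlier sketch with a few pandas aggregations; this is only an illustration, and the column names remain assumptions.

```python
# Illustrative sketch, not the authors' code: aggregate statistics of the
# kind reported in Section IV (queries per state, season share, query types).
import pandas as pd

def summarize(records: pd.DataFrame) -> None:
    print(records["StateName"].value_counts().head())        # queries per state
    print(records["Season"].value_counts(normalize=True))    # share of queries per season
    print(records["QueryType"].value_counts().head())        # most common query types
```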
V. CHALLENGES

Apart from collecting the data, there were three significant challenges. First, we saw a lack of consistency in the format of the questions and answers. Most of the data is poorly written, with many redundant words, spelling errors, and incorrect grammar and punctuation. These features make the process of information extraction from the data a difficult task. The questions are better written than the answers in terms of ease of understanding; hence, we chose to process the questions to find the critical words. The answers are very vague and are not framed as sentences; various answers are just numbers. Processing answers to understand their meaning and relevance to a particular input question is therefore a challenging task.

Second, some of the queries are registered in a regional language, which poses a problem in pre-processing the data because the translation resources for specific languages are limited. Various questions and answers use a few or all words from a native language, and most of them are not proper sentences. Such data quality makes the entries difficult to process even after translation to English.

Third, in order to check the accuracy of our system, we need a dataset with ground truth corresponding to each query, which does not exist. The lack of ground-truth values made it difficult to determine whether the answer output by the system for a given input query is correct or not. Such a metric is necessary to measure the reliability of our model. Hence, the determination of a suitable metric for the model was a significant task.
VI. METHODOLOGY
One of the crucial aspects of our model is sentence embedding, which is done using the Sen2Vec model given by Arora et al. [2]. We give a brief overview of the Sen2Vec model here; the complete architecture of our system is described in Fig. 3.
Sen2Vec Model
The Sen2Vec model can be described as a method of converting a sentence into a vector, where the weight allotted to each dimension of the vector represents the sentence's inclination towards a particular context. The primary purpose of this model is to cluster similar sentences without taking the ordering of the words into consideration.
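For reference, a brief sketch of the weighting scheme from Arora et al. [2] (the notation a, p(w), v_w is theirs, not this paper's): a sentence s is embedded as the smooth-inverse-frequency weighted average of its word vectors,

\[ v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w , \]

where p(w) is the corpus frequency of word w and a is a small smoothing constant; the projection of each v_s onto the first principal component of the set of sentence vectors is then removed.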
Considering the improper format of the queries, we attempt to match an input question to the questions already present in our dataset rather than processing the answers, the idea being that, given the size of the dataset and its redundancy, the question is highly likely to be present already. We divided the collected data into two parts, train and test. Using the training data, we train our model based on Sen2Vec [2], and then for each query in the test data we find the most similar question indexed in the training data.

A. Pre-processing
For now, we consider English as the primary language and remove the queries registered in other languages. We then clean our data, which includes lower-casing words, removing stop-words, and stemming words to their roots. Apart from using an existing spell corrector [11] for the English dictionary, we develop our own spell corrector for local-language words which may appear in a query, mainly the crop names (Fig. 3). We then remove weather-related queries from our training data, since these are handled by a real-time weather API. We also remove redundancy using synonyms (Fig. 3): we looked at the frequency distribution of the words used in the queries and clubbed queries together if they use words with similar semantics. We then group all answers for a particular question into a list to remove redundancy. Finally, the data frame (containing the query, query type, state, district, time of query and the list of answers corresponding to that query) is given as input to our model.
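A minimal sketch of these cleaning steps, assuming NLTK for stop-words and stemming; the crop-name spelling table is a made-up stand-in for the local-language spell corrector described above, not the authors' actual resource.

```python
# Illustrative sketch, not the authors' pipeline: lower-case, map local
# crop-name spellings, drop stop-words, and stem tokens to their roots.
# Requires: nltk.download('stopwords')
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
CROP_SPELLINGS = {"gehu": "wheat", "dhaan": "paddy"}   # hypothetical entries

def preprocess(query: str) -> list:
    tokens = re.findall(r"[a-z]+", query.lower())
    tokens = [CROP_SPELLINGS.get(t, t) for t in tokens]   # local-language crop names
    tokens = [t for t in tokens if t not in STOPWORDS]    # stop-word removal
    return [STEMMER.stem(t) for t in tokens]              # stemming to roots

print(preprocess("Market rate of gehu in Gandhinagar?"))
```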
B. Training the model
Given the pre-processed data frame, we separated the data into train (80%) and test (20%) sets. We train the word2vec model [1], choosing to train on 75 dimensions. The model outputs trained word embeddings along with the required weight matrix for our model. These word embeddings are converted into sentence embeddings using the method described by Arora et al. [2].
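A minimal sketch of this training step, assuming gensim for word2vec and the weighting of Arora et al. [2] for the sentence vectors; the smoothing constant a = 1e-3 follows [2] and is not stated in this paper.

```python
# Illustrative sketch, not the authors' code: 75-dimensional word2vec
# training followed by SIF-weighted sentence embeddings [2].
from collections import Counter
import numpy as np
from gensim.models import Word2Vec

def train_sentence_embeddings(token_lists, dim=75, a=1e-3):
    w2v = Word2Vec(sentences=token_lists, vector_size=dim, min_count=1)
    counts = Counter(t for toks in token_lists for t in toks)
    total = float(sum(counts.values()))

    def embed(tokens):
        # Weighted average of word vectors; weight a/(a + p(w)) as in [2].
        vecs = [w2v.wv[t] * (a / (a + counts[t] / total))
                for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    sent_vecs = np.vstack([embed(toks) for toks in token_lists])
    # Remove the projection on the first principal component, as in [2].
    u, _, _ = np.linalg.svd(sent_vecs.T @ sent_vecs)
    pc = u[:, :1]
    sent_vecs = sent_vecs - sent_vecs @ pc @ pc.T
    return w2v, embed, sent_vecs
```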
C. Embedding Optimization

There is a great chance of finding questions which are highly similar but have different crop names, for example, 'market rate of wheat?' and 'market rate of paddy?'. Since we know that the crop name is an essential determiner of the answer, we gave it a higher weight compared to other words. This was done by building an entity extractor which can be used to tag nouns [4] and then filter the crop names from them. Using the entity extractor, we can identify crop names and give them higher weights (Fig. 3).
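A minimal sketch of the re-weighting, under the simplifying assumption that crop names are matched against a fixed list rather than by the noun tagger of [4]; the boost factor of 3.0 is an arbitrary illustrative value.

```python
# Illustrative sketch, not the authors' code: boost the contribution of crop
# names inside the weighted average that forms the sentence vector.
import numpy as np

CROP_NAMES = {"wheat", "paddy", "cotton"}   # hypothetical subset
BOOST = 3.0                                  # illustrative value

def embed_with_crop_boost(tokens, w2v, sif_weight):
    """sif_weight(t) is the word weight a/(a + p(t)) used during training."""
    vecs, weights = [], []
    for t in tokens:
        if t in w2v.wv:
            vecs.append(w2v.wv[t])
            weights.append(sif_weight(t) * (BOOST if t in CROP_NAMES else 1.0))
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.average(np.vstack(vecs), axis=0, weights=weights)
```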
D. Prediction

We pre-process an input query from the test dataset in the same way and convert it into a vector using the embeddings of the trained model. The model outputs the most similar query from the training data by comparing the embedding vectors using cosine similarity (Fig. 3). Through this similarity matching, we obtain the list of answers associated with the matched question. We then apply an answer ranking method to output the best answer: the ranking takes the input query, calculates the Lesk score with each answer in the list, and outputs the one with the highest score.
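A minimal sketch of the prediction step, assuming the sentence vectors and grouped answer lists built earlier; the overlap-based ranking below is a simplified stand-in for the Lesk scoring.

```python
# Illustrative sketch, not the authors' code: cosine-similarity retrieval of
# the closest training question, then rank its stored answers by word overlap
# with the query (a simplified stand-in for the Lesk score).
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def answer_query(query_vec, query_tokens, train_vecs, train_answer_lists):
    best = int(np.argmax([cosine(query_vec, v) for v in train_vecs]))
    candidates = train_answer_lists[best]        # answers grouped per question
    def score(ans):
        return len(set(query_tokens) & set(str(ans).lower().split()))
    return max(candidates, key=score)
```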
VII. METRIC

For the metric, we wanted to capture the similarity between the input sentence given to the model and the predicted sentence output by the model, and to use this to determine whether the two sentences are the same. We found none of the standard metrics suitable for evaluating our model: because of the improper and inconsistent structure of the question-answer pairs in terms of language usage, we had to design a metric from scratch. Taking inspiration from the Jaccard and Lesk similarity [3] metrics, we devised two metrics, a modified Jaccard score and a modified Lesk score, to evaluate our model. Our metric can be thought of as the amount of similarity between two sentences, the input and the prediction; being able to compute this value gives us a direct understanding of how our model is performing.

A. Modified Jaccard Score

We define our modified Jaccard score based on the number of words in the intersection of the given question (known sentence) and the predicted question (predicted sentence). In this method, we simply use the words in the sentences as our parameters.

B. Modified Lesk Score

We first use the words from the meanings of the various senses of a word to create a gloss bag of words. We define our metric as the number of common words in the gloss bags of the input question (known sentence) and the predicted question (predicted sentence), divided by the number of words in the gloss bag of the input question (known sentence).

In both metrics, we add 1 to the denominator to avoid division by zero.

Evaluating the metric

In order to evaluate our metric, we manually labeled 100 test-data queries and calculated the modified Jaccard and modified Lesk scores for the predictions on these test questions. Using these predictions and the ground truth, we then define a threshold for both scores; the threshold tells the model which predictions are to be considered good results. We accordingly use the metrics for ranking our answers, where the final predicted answer is given by

\[ \text{output answer} = \arg\max_i \, \mathrm{score}(\text{question}, \text{answer}_i). \]
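A minimal sketch of the two scores, assuming WordNet glosses via NLTK for the gloss bags. The paper does not spell out the denominator of the modified Jaccard score (the original formula is not reproduced here); by analogy with the modified Lesk score we assume it is the word count of the known sentence, and both denominators carry the +1 mentioned above.

```python
# Illustrative sketch, not the authors' code: modified Jaccard and modified
# Lesk scores from Section VII. Requires: nltk.download('wordnet')
from nltk.corpus import wordnet

def gloss_bag(tokens):
    bag = set()
    for tok in tokens:
        for synset in wordnet.synsets(tok):
            bag.update(synset.definition().lower().split())
    return bag

def modified_jaccard(known_tokens, predicted_tokens):
    inter = set(known_tokens) & set(predicted_tokens)
    return len(inter) / (len(set(known_tokens)) + 1)      # assumed denominator

def modified_lesk(known_tokens, predicted_tokens):
    known, pred = gloss_bag(known_tokens), gloss_bag(predicted_tokens)
    return len(known & pred) / (len(known) + 1)
```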
VIII. RESULTS

Using the modified Lesk score metric, our model was able to obtain an accuracy of about 56% without synonym elimination and entity extraction.

One key observation was that crop names are an important determiner when comparing the most similar queries. We therefore performed entity extraction for the crop names and observed that the accuracy jumped from 56% to 86%.

We then varied the embedding dimension to improve accuracy. As demonstrated by Fig. 4, the best performance of the model was observed at 75 dimensions for the embedding.

Fig. 4. Variation of the average metric score for the test data over the number of dimensions of the sentence embedding.

Table VII shows the metric scores for predictions when we trained our model with 75 dimensions, with synonym elimination and entity extraction for crop names.

We also note that our chatbot can only answer questions that already exist in the database; absolutely new questions cannot be answered by our system. We plan to re-route such queries to human employees for answers. The newly created question-answer pair can then be added to our dataset for future reference, making the system adaptive to new queries.

IX. CONCLUSION

Our chatbot can positively impact underserved communities by solving queries related to agriculture, horticulture and animal husbandry using natural language technology. A farmer will be able to receive agricultural information, as well as localized information such as the current market prices of various crops in his/her district and the weather forecast, through a messaging app. A farmer can directly message our AI-enabled system in his/her language and get an answer. Our system enables farmers to ask any number of questions at any time, which will in turn help spread modern farming technology faster and to a larger number of farmers.

Moreover, we found that most of the queries related to localized information, such as weather and market prices, were redundant. Our Question-Answer system can answer most queries on its own, without any human intervention, with high accuracy. This will lead to better utilization of human resources and avoid unnecessary costs in setting up new call centers. Our system is capable of handling all the redundant queries and getting updated with new queries on the go. The system also provides an option that enables the farmer to ask questions directly to the KCC employees if and when necessary.

Above all, we believe that the system helps in analyzing the farmers' mindset and the structure of the agricultural sector in India. While the system provides a secure communication channel to the farmer, it also helps policy makers understand the needs and concerns of the farmers. The data analysis also provides an understanding of which sector or season requires attention. Thus, our decision support system uses all the available resources judiciously to tackle the problem of lack of awareness and information in the agricultural sector in India.
X. FUTURE WORK
For the future, we plan to improve the answer ranking
mechanism, implement multilingual support for the chatbot
with voice-over support and entity extraction from answers
for generating knowledge graphs.
REFERENCES
[1] Rong, Xin. "word2vec parameter learning explained." (2014).
[2] Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. "A simple but tough-to-beat baseline for sentence embeddings." (2016).
[3] Banerjee, S., and Pedersen, T. "An adapted Lesk algorithm for word sense disambiguation using WordNet." (2002).
[4] Bikel, Daniel M., Richard Schwartz, and Ralph M. Weischedel. "An algorithm that learns what's in a name." (1999).
[5] Wes McKinney. "Data Structures for Statistical Computing in Python." Proceedings of the 9th Python in Science Conference, 51-56 (2010).
[6] John D. Hunter. "Matplotlib: A 2D Graphics Environment." Computing in Science & Engineering, 9, 90-95 (2007).
[7] Travis E. Oliphant. A Guide to NumPy. USA: Trelgol Publishing (2006).
[8] Radim Řehůřek and Petr Sojka. "Software Framework for Topic Modelling with Large Corpora." Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (2010).
[9] Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande. "Kisan Call Center," retrieved from https://dackkms.gov.in/account/login.aspx (2018).
[10] Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande. "Open Government Data," retrieved from https://data.gov.in/ (2018).
[11] Peter Norvig. "Spelling Corrector" (2007).
[12] Farmers' Portal, https://www.farmer.gov.in/ (2013).
[13] TRAI Report (2018).
[14] Mahendra K. Singh. "Just 18% of rural, 49% of city youths can use computers." Times of India (2016).
[15] G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum. "NAGA: Searching and Ranking Knowledge." 2008 IEEE 24th International Conference on Data Engineering, Cancun, 2008, pp. 953-962.
[16] Rajpurkar, Pranav, Robin Jia, and Percy Liang. "Know What You Don't Know: Unanswerable Questions for SQuAD." arXiv preprint arXiv:1806.03822 (2018).