Analysis and Prediction in Agricultural Data Using Data Mining Techniques
Abstract: Agriculture contributes nearly sixteen percent of India's total GDP and ten
percent of its total exports, which helps to increase foreign exchange earnings. The population of
India is continuously increasing, and to meet the food requirements of this growing
population, agricultural yield should be boosted. Knowledge discovered from raw data is
useful for many purposes, and data mining techniques are well suited to extracting it. This
paper aims to analyze the agricultural data of India using data mining algorithms and to
find useful information in the results of these techniques that would help to improve
agricultural yield. Various mining algorithms applied to agricultural data were
studied. The data mining techniques applied in this paper are the clustering algorithms
K-means, DBSCAN and EM; the results of these algorithms are analyzed.
Keywords: Data Mining, DBSCAN, K-means, EM, WEKA
I. INTRODUCTION
Agriculture is the backbone of the Indian economy. Though a number of emerging sectors such
as IT and BPO are contributing significantly to the GDP of India, agriculture is still the most
important sector. Agriculture contributes substantially to the exports of India, directly improving
foreign exchange earnings. In India, the majority of farmers do not get the expected yield for
several reasons. Agricultural yield depends primarily on environmental factors such as
rainfall, temperature and the geographical topology of the particular region. These factors, along
with some others, influence crop cultivation.
In this context, farmers require timely advice on expected crop productivity, and producing
such predictions accurately requires intensive analysis. Yield
is an important agricultural issue. A large amount of data can be gathered from the Indian
agriculture sector. Knowledge acquired from data is highly useful for many purposes. Data
mining is a field of Information Technology that deals with finding unknown and hidden
patterns in the available data. Applying data mining techniques in the agricultural field to
predict useful information related to crop productivity is a noble work [1].
This paper aims to analyze such agricultural data using data mining techniques and to
consolidate the knowledge acquired from their results. The
results of different data mining algorithms are compared, which helps in
finding the most suitable algorithm for agricultural data.
II. BACKGROUND
Data mining in the field of agriculture is a recent research topic. It consists of the
application of data mining techniques to agriculture. Recent technologies are
now able to provide abundant information on agriculture-related activities, which can then be
analyzed in order to find important information. India is an agriculture-based country. Crop yield
depends on multiple factors such as climate change, soil type etc. Farmers are
interested in knowing the crop yield beforehand. Traditionally, this prediction depended on
the experience of farmers and was limited to a particular region.
Data mining techniques can be helpful in predicting crop yield. Techniques such
as data classification and data clustering can be used for data analysis. Data classification is
supervised learning, where a training data set is used to classify further data. Data clustering
is unsupervised learning, where a training set is unavailable [2]. Multiple data mining
algorithms have been used to analyze agricultural data. Various algorithms, including
K-Means, K-Nearest Neighbor (KNN), Artificial Neural Networks (ANN) and Support Vector
Machines (SVM), are applicable to agricultural data. Suitable data models can be found
that achieve high accuracy in yield prediction [3].
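To make the supervised case concrete, the following is a minimal sketch of K-Nearest Neighbor classification using only the Python standard library. The feature values, the "high"/"low" yield labels and the function name knn_predict are hypothetical and chosen for illustration; they are not taken from the dataset or tools used in this paper.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Majority vote over the labels of the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, pre-scaled (rainfall, temperature) pairs with made-up yield labels.
training = [
    ((0.90, 0.80), "high"), ((0.80, 0.90), "high"), ((0.85, 0.70), "high"),
    ((0.20, 0.30), "low"), ((0.10, 0.20), "low"), ((0.30, 0.10), "low"),
]
prediction = knn_predict(training, (0.75, 0.80), k=3)
```

The labelled training list is what makes this supervised; a clustering algorithm would receive only the feature tuples.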
The agriculture sector in India faces difficulty in increasing crop yield to meet the demands
of a growing population. More than 50% of crops still depend on the monsoon.
Researchers have applied the K-Means algorithm to forecast atmospheric pollution,
K-Nearest Neighbor to simulate daily rainfall and other weather variables, and
Support Vector Machines to analyze changes in weather conditions [3].
Artificial Neural Networks can be used to analyze patterns in soil data sets [4].
Frequent pattern mining is also a data mining technique. A frequent pattern is a pattern that
occurs frequently in a dataset and provides crucial information that was unknown before [5].
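As an illustration of what "occurs frequently" means, the sketch below counts how many transactions contain each itemset and keeps those that meet a minimum support. It is a brute-force enumeration for small examples, not an optimized algorithm such as Apriori, and the item names and season records are invented for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets found in at least `min_support` transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Enumerate every itemset of size 1..max_size contained in this transaction.
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical records: conditions observed together in one season.
seasons = [
    {"high_rainfall", "moderate_temp", "high_yield"},
    {"high_rainfall", "high_yield"},
    {"low_rainfall", "moderate_temp", "low_yield"},
    {"high_rainfall", "moderate_temp", "high_yield"},
]
patterns = frequent_itemsets(seasons, min_support=3)
```

With these made-up records, the pair ("high_rainfall", "high_yield") meets the support threshold, while the items that appear only once are discarded.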
A support vector machine is a binary classifier that separates two disjoint classes. The basic idea
behind it is to classify the sample data into linearly separable classes. Support vector machines are a set of related
supervised learning methods used for classification and regression. They have been used to assess
spatiotemporal characteristics of soil moisture products [6].
The decision tree is one of the popular classification algorithms currently used in data
mining and machine learning. Decision tree learning involves the algorithmic acquisition of structured
knowledge in forms such as concepts, decision trees, discrimination nets or
production rules [7].
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem with strong independence assumptions. Depending on the precise kind of probability
model, a Naïve Bayes classifier can be trained very efficiently in a supervised learning
setting. J48 is an open source Java implementation of the C4.5 algorithm in the WEKA data
mining tool. C4.5 is a program that builds a decision tree from a set of labelled input
data. This decision tree can be tested against unseen labelled test data to tell how well it
generalizes [8].
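C4.5 chooses the attribute to split on using the gain ratio criterion. The sketch below computes that criterion for categorical attributes; it illustrates the measure only and is not J48 or a full decision tree builder. The rows and attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Gain ratio of splitting `rows` (a list of dicts) on a categorical attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    # Expected entropy of the target after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy([row[target] for row in rows]) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0

# Made-up rows: rainfall perfectly predicts yield, season does not.
rows = [
    {"rainfall": "high", "season": "kharif", "yield": "high"},
    {"rainfall": "high", "season": "rabi", "yield": "high"},
    {"rainfall": "low", "season": "kharif", "yield": "low"},
    {"rainfall": "low", "season": "rabi", "yield": "low"},
]
best = max(["rainfall", "season"], key=lambda a: gain_ratio(rows, a, "yield"))
```

On these rows the split on rainfall has the higher gain ratio, so a tree builder using this criterion would test rainfall first.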
IJRISE| www.ijrise.org|[email protected] [386-393]
International Journal of Research In Science & Engineering e-ISSN: 2394-8299
Special Issue 7-ICEMTE March 2017 p-ISSN: 2394-8280
Partitioning algorithms are based on specifying an initial number of groups and
iteratively reallocating objects among groups until convergence. In contrast, hierarchical algorithms
combine and divide existing groups, creating a hierarchical structure that reflects the order in
which groups are combined or divided [9]. Data clustering is an efficient unsupervised
learning technique that deals with grouping unlabeled data into clusters. Clustering
algorithms include k-Means Clustering, Hierarchical Clustering, DBSCAN (Density Based
Spatial Clustering of Applications with Noise), OPTICS (Ordering Points to
Identify the Clustering Structure) and STING (Statistical Information Grid) [10]. The WEKA
(Waikato Environment for Knowledge Analysis) system provides a broad suite of facilities
for applying data mining techniques to large data sets [11]. An overview of the data used for analysis
is given in the next section.
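The partitioning idea described above (fix the number of groups, then repeatedly move objects between groups until nothing changes) can be sketched with a small K-means implementation. This is an illustrative version of Lloyd's algorithm, not WEKA's implementation; taking the first k points as initial centroids is a simplifying assumption, and the sample points are invented.

```python
import math

def kmeans(points, k, max_iter=100):
    """Cluster `points` (numeric tuples) into k groups with Lloyd's algorithm.

    Assumption for simplicity: the first k points serve as initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical, pre-scaled two-dimensional observations forming two groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, assignment = kmeans(points, k=2)
```

In practice the attributes would need to be scaled first, since a distance computed on raw rainfall and temperature values would be dominated by the attribute with the larger range.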
The data used in this paper cover the years 2005 to 2009 and were obtained from the website of the
Planning Commission of India [12]. They contain information about plantation crops, fruits and
vegetables for 35 states and union territories of India: Andhra Pradesh, Andaman and Nicobar, Arunachal
Pradesh, Assam, Bihar, Chandigarh, Chhattisgarh, Dadra and Nagar Haveli, Daman and Diu,
Delhi, Goa, Gujarat, Haryana, Himachal Pradesh, Jammu and Kashmir, Jharkhand, Karnataka,
Kerala, Lakshadweep, Madhya Pradesh, Maharashtra, Manipur, Meghalaya, Mizoram,
Nagaland, Orissa, Pondicherry, Punjab, Rajasthan, Sikkim, Tamil Nadu, Tripura, Uttar
Pradesh, Uttarakhand and West Bengal. The dataset contains a total of 4180 instances with eight
attributes: Year, State, Crop type, Crop name, Area, Production, Rainfall and
Temperature. The data were gathered from the website of the Planning Commission of India [13].
The following figure shows the database schema.
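In code, the eight attributes listed above could be represented as a simple record type. The sketch below is an assumption about how such data might be loaded, not the schema actually used: the CSV layout, the column spellings, the numeric types and the sample rows are all hypothetical, and the units of Area and Production are left unspecified because the text does not state them.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class CropRecord:
    """One instance of the dataset: the eight attributes named in the text."""
    year: int
    state: str
    crop_type: str
    crop_name: str
    area: float
    production: float
    rainfall: float
    temperature: float

def load_records(csv_text):
    """Parse CSV text with a header row into CropRecord objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        CropRecord(
            year=int(row["Year"]),
            state=row["State"],
            crop_type=row["Crop type"],
            crop_name=row["Crop name"],
            area=float(row["Area"]),
            production=float(row["Production"]),
            rainfall=float(row["Rainfall"]),
            temperature=float(row["Temperature"]),
        )
        for row in reader
    ]

# Made-up rows for illustration only; not values from the actual dataset.
sample = """Year,State,Crop type,Crop name,Area,Production,Rainfall,Temperature
2005,Kerala,Plantation,Coconut,10.0,50.0,1500.0,25.0
2006,Goa,Fruits,Mango,2.0,8.0,1450.0,26.0
"""
records = load_records(sample)
```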
The following diagram depicts the process model used to analyze agricultural data and to predict
useful information from it.
V. RESULT ANALYSIS
1. K-means
2. DBSCAN
3. EM
Production
Mean        98.7995    62.3811   45.6236   [?]        [?]         31.2018
Std. Dev.   115.4027   83.3918   59.0153   718.7723   1481.1565   37.5763
[?]: only the fragments "1504.266", "0" and "8" survive for these two means.
Result analysis shows that production tends to increase when rainfall ranges from 1405.904 mm
to 1562.3756 mm and temperature ranges from 23.5156 °C to 26.0942 °C.
The DBSCAN algorithm gives results similar to those of the base algorithm K-means, whereas EM gives more
specific production values for a given rainfall and temperature range than K-means
and DBSCAN.
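One reason EM can give more specific values is that it describes each cluster by distribution parameters rather than by membership alone. The following is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture. The initialization (extreme values as starting means, overall variance as starting variance), the fixed iteration count and the sample values are assumptions made for illustration; this is not the EM implementation used to produce the results above.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(values, iterations=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture by expectation-maximization."""
    overall_mean = sum(values) / len(values)
    overall_var = sum((x - overall_mean) ** 2 for x in values) / len(values)
    means = [min(values), max(values)]
    variances = [max(overall_var, min_var)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsing variance
            weights[j] = nj / len(values)
    return means, variances, weights

# Hypothetical scaled production values forming two groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, weights = em_two_gaussians(values)
std_devs = [math.sqrt(v) for v in variances]
```

The fitted mean and standard deviation of each component play the role of a per-cluster summary of an attribute. The sketch assumes small, pre-scaled inputs and has no guard against numerical underflow on widely spread raw values.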
VI. CONCLUSION
In this paper, certain data mining algorithms were adopted to cluster the data that
show relevance to the desired attributes. The K-means clustering algorithm was adopted as the base
algorithm. The DBSCAN and EM algorithms were also applied to the data. DBSCAN showed behavior
similar to that of the K-means algorithm. Many data mining techniques have not yet been applied to
agricultural data.
Future work aims to apply advanced mining techniques, such as the big data technique
MapReduce, to larger datasets.
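For orientation, the MapReduce pattern can be sketched in a single process as three phases: map, shuffle and reduce. The example below totals production per state on invented records; the function names are hypothetical, and a real deployment would run these phases on a distributed framework rather than in one Python program.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (state, production) pair for every record."""
    for state, production in records:
        yield state, production

def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}

# Made-up (state, production) records for illustration.
records = [("Kerala", 10.0), ("Goa", 5.0), ("Kerala", 20.0)]
totals = reduce_phase(shuffle_phase(map_phase(records)))
```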
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to Finolex Academy of Management
and Technology for providing an encouraging environment and all the required resources for
the project work. We are thankful to all our professors and friends for their unrelenting support.
We are thankful to the Planning Commission of India for making the data available.
REFERENCES
[1] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining Concepts and Techniques”,
Morgan Kaufmann, ASIN B0058NBJ2M
[2] D. Ramesh, Vishnu Vardhan, “Analysis of Crop Yield Prediction using Data Mining
Techniques” IJRET: International Journal of Research in Engineering and Technology
eISSN: 2319-1163 | pISSN : 2321-7308
[4] Dr. D. Ashok Kumar, N. Kannathasan, “A Survey on Data Mining and Pattern
Recognition Techniques for Soil Data Mining” IJCSI International Journal of Computer
Science Issues, Volume 8, Issue 3, No. 1, May 2011 ISSN :1694-0814 www.ijcsr.org
[5] Dr. Jean-Claude Franchitti, “Data Mining Session 6 – Mining Frequent Patterns,
Association, and Correlations” Adapted from course textbook resources, Data Mining
Concepts and Techniques (2nd Edition), Jiawei Han and Micheline Kamber
[6] Andrew Smith, Neil Alldrin, Doug Turnbull, “Clustering with EM and K-Means”
International Journal of Advance Research in Computer and Communication Engineering
[7] Dr. D. Ashok Kumar, N. Kannathasan, “A Survey on Data Mining and Pattern
Recognition Techniques for Soil Data Mining” IJCSI International Journal of Computer
Science Issues, Volume 8, Issue 3, No. 1, May 2011 ISSN :1694-0814
[8] “The Institute connecting the dots with Big Data” September 2014,
www.theinstitute.ieee.org.in
[9] Mr. Osama Abu Abbas, “Comparison between Data Clustering Algorithms” The
International Arab Journal of Information Technology Volume 5, No. 3, July 2008
[10] Aastha Joshi, Ranjeet Kaur, “A Review: Comparative Study of Various Clustering
Techniques in Data Mining”, International Journal of Advanced Research in Computer
Science and Software Engineering Volume 3, ISSN: 2277 128X, Issue 3, March 2013
[12] www.planningcommission.nic.in/data/datatables/index.php?data=datatab