Data Mining Application to Identify Crop Disease and Recommend a Solution
Abstract
Rapid advancements in technology have brought agricultural data into the era of
big data. Traditional tools and techniques are unable to store and analyze this
massive amount of data; doing so requires a parallel computing and analysis
paradigm, and big data analytics is used as a solution. In this paper, a big data
analytics framework for agriculture is developed that identifies a crop disease
based on symptom similarity and recommends the solution with the highest
similarity. To achieve this objective, Hadoop and Hive are used, tools that are
now common for this type of problem. The data is collected, cleansed, and
normalized. Data is collected from laboratory reports, websites, and other
sources. Cleansing then extracts the important information from the unstructured,
redundant data. In the next step, normalization extracts features from the
cleaned data.
INTRODUCTION
With technological advancements and the rapid growth of data, agricultural data
has entered the era of big data. Big data is a term used to describe this rapid
growth of data. The data may reside in a file system or in a database, and it
cannot be processed by traditional software techniques and databases. The main
aim of this paper is to develop a recommendation system that identifies
agricultural crop diseases and provides solutions for them. With the help of big
data analytics in agriculture, researchers can easily make decisions from
historical data. The use of big data analytics in agriculture would be a great
innovation and pioneering work [4]. Agricultural data is increasing day by day at
an astonishing rate.
Apache Hive: To extract data from the Hadoop system, Hive provides an interface
similar to SQL, termed HiveQL (Hive Query Language). Hive is used for querying
data in a distributed environment.
Apache HBase: A distributed, non-relational DBMS, which means it does not support
SQL (Structured Query Language). HBase applications are written in Java.
Apache Pig
RELATED WORK
Big data analytics is used in various sectors. Data in the agriculture sector is
growing at a rapid rate and has also entered the era of big data. In 2015, IBM
introduced big data analytics for agriculture. Various software platforms have
been developed to give farmers information about new tools and techniques.
Data sources such as web pages, databases, and flat files are used. Such data is
used to predict the healthcare benefits of different drugs and of a patient's
lifestyle choices. The risk factor for heart disease is identified from the LDL
and HDL cholesterol levels. At ideal diastolic and systolic levels, a patient's
blood pressure is under control and the patient has less risk of moving to the
next stage of hypertension.
FRAMEWORK METHODOLOGY
The primary motive for generating results from the collected data is to serve
researchers by giving solutions for various crop diseases. Developing a new
framework that identifies a disease and recommends a solution based on symptom
similarity was not an easy task. The framework provides the solution based on
historical data, which is collected from various sources. The model works as a
recommendation system. Recommendation systems use historical data or knowledge of
the product; many e-commerce companies (e.g. Amazon.in) use recommendation
systems for sales. In the proposed model, a recommendation system is applied to
the agriculture domain. First, data is collected from various sources, e.g. lab
reports and agriculture websites. The collected data is known as raw data because
it contains irregularities and unwanted information; it is unformatted and needs
formatting or confirmation. This data is stored on HDFS. The NameNode of HDFS
keeps track of how files are broken down into blocks and which nodes store those
blocks. Clients communicate directly with the DataNodes to process the local
files corresponding to the blocks. The data sources are:
Laboratory test reports: These are a crucial source of data for researchers. The
tests conducted include soil, water, manure, and plant analysis.
Agriculture information websites: These websites act as a mentor for farmers.
They give information related to agricultural economic entities, commonly used
pesticides, and so on. They tell farmers which crop to plant, where, and when,
and suggest solutions to various crop-related problems. Through these sites
farmers learn about new techniques and tools.
Agriculture department reports: These reports make decision making easier for the
crops of a particular area. They are important for providing information about a
particular field in a geographical area.
Data collected from the above sources is stored on the Hadoop Distributed File
System (HDFS) as text files. The collected data is unstructured and contains
irrelevant data.
First, unimportant data is removed and the relevant data is extracted from the
collected data. Features are then selected and extracted from the relevant data
and saved into a text file in the Hive data warehouse. Hive is an open source
software tool for data warehousing and is used to query data in a distributed
environment. To extract data from the Hadoop system, Hive provides an interface
similar to SQL, termed HiveQL (Hive Query Language).
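The cleansing and feature extraction steps described above can be sketched as
follows. This is a minimal illustration in plain Python, not the paper's
implementation: the raw record layout, the field names (crop, state, disease,
symptoms), and the helpers clean_record and cleanse are assumptions made only for
this example.

```python
import re

# Hypothetical raw lines as they might arrive from reports or websites.
RAW_LINES = [
    "  Crop: Paddy | State: Sindh | Disease: Leaf Blast | Symptoms: Spindle-shaped spots on LEAVES!! ",
    "<p>advertisement - buy now</p>",
    "Crop: Paddy | State: Sindh | Disease: Leaf Blast | Symptoms: Spindle-shaped spots on LEAVES!! ",
]

# Assumed feature names; the paper does not list its exact schema.
FIELDS = ("crop", "state", "disease", "symptoms")


def clean_record(line):
    """Return a dict of normalized fields, or None if the line is irrelevant."""
    parts = {}
    for chunk in line.split("|"):
        key, sep, value = chunk.partition(":")
        if sep:
            parts[key.strip().lower()] = value
    if not all(field in parts for field in FIELDS):
        return None  # cleansing: drop lines that lack the expected fields
    record = {}
    for field in FIELDS:
        text = parts[field].lower()
        text = re.sub(r"[^a-z0-9\s-]", " ", text)  # strip punctuation
        record[field] = re.sub(r"\s+", " ", text).strip()  # collapse spaces
    return record


def cleanse(lines):
    """Drop irrelevant and duplicate lines; return comma-delimited rows."""
    seen, rows = set(), []
    for line in lines:
        record = clean_record(line)
        if record is None:
            continue
        row = ",".join(record[field] for field in FIELDS)
        if row not in seen:  # remove redundant records
            seen.add(row)
            rows.append(row)
    return rows


if __name__ == "__main__":
    for row in cleanse(RAW_LINES):
        print(row)
```

With the sample input, the advertisement line is discarded and the two duplicate
records collapse into a single comma-delimited row, which is the kind of text
file that could then be loaded into a Hive table.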
The required data is selected from the Hive data warehouse, and the query results
are saved into a text file that is stored on HDFS. This text file is then
submitted to the distributed environment to identify the crop disease name based
on the similarity of crop disease symptoms. In this process the text file is
split and each split is submitted to a mapper, which calculates pair-based
symptom similarity. Pair-based similarity is tolerant of spelling mistakes and
word ordering, which increases the efficiency of the recommendation system. After
calculating the similarity, the mapper creates a <key, value> pair, saves it to a
file, and submits it to the reducer. In this system the disease name is the key,
and the similarity, solution, and location are saved as the values. The reducer
calculates the average similarity over records with the same disease name (key)
and selects the disease with the highest similarity. Finally, the solution ID
with the highest similarity is selected from the file saved by the mapper.
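The mapper step can be sketched as below. The paper does not define "pair based"
similarity precisely; this sketch assumes it means a Dice coefficient over
adjacent-letter pairs within each word, a common measure that is insensitive to
word order and degrades gently with small spelling errors. The record layout
"disease,location,solution_id,symptoms" and the function names are hypothetical,
and the code is plain Python rather than a Hadoop job.

```python
import re


def letter_pairs(text):
    """Return the adjacent-letter pairs of every word in the text."""
    pairs = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def pair_similarity(a, b):
    """Dice coefficient: 2 * shared pairs / (pairs in a + pairs in b)."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0
    remaining = list(pairs_b)
    shared = 0
    for pair in pairs_a:
        if pair in remaining:
            remaining.remove(pair)  # each pair may be matched only once
            shared += 1
    return 2.0 * shared / total


def mapper(query_symptoms, line):
    """Map one record to (disease, (similarity, solution_id, location)).

    Assumes a comma-delimited record whose last field holds the symptoms.
    """
    disease, location, solution_id, symptoms = line.split(",", 3)
    score = pair_similarity(query_symptoms, symptoms)
    return disease, (score, solution_id, location)


if __name__ == "__main__":
    print(pair_similarity("brown spots on leaves", "leaves on spots brown"))
    print(pair_similarity("leaf blast", "leaf blst"))
    print(mapper("brown spots", "leaf blast,sindh,S1,brown spots"))
```

Because pairs are taken per word, reordering the words leaves the score
unchanged, and a single misspelled word only removes the few pairs it touches.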
For demonstration purposes, the developed model is used to identify leaf blast
disease of the paddy crop based on symptom similarity and to recommend a solution
for the specific region. The query to select crop- and region-specific data from
the Hive data warehouse is:
    INSERT OVERWRITE LOCAL DIRECTORY '/home/raghu/Documents/'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT Dname, location, solutionID
    FROM crop-details
    WHERE crop name="qurban" AND state="sindh"
    ORDER BY location;
The above query is executed through the Java API, and the results are saved on
HDFS in text file format. This file is then submitted to MapReduce to calculate
the similarity and identify the disease name. The input file is divided into
splits, and each split is given to a mapper on a different host for processing.
Each mapper calculates pair-based symptom similarity, which is tolerant of
spelling mistakes and word ordering and thereby increases the efficiency of the
recommendation system. After the similarity is calculated, the reducer calculates
the average similarity for each disease, as shown in the graph.
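The reducer step can be sketched as follows: group the mapper output by disease
name, average the similarity per disease, select the disease with the highest
average, and then pick the solution ID with the highest individual similarity for
that disease. This is a single-process Python illustration under stated
assumptions, not a Hadoop job; the function name reduce_by_disease, the tuple
layout, and the sample scores are hypothetical.

```python
from collections import defaultdict


def reduce_by_disease(mapped):
    """Select the best disease and solution from mapper output.

    mapped: iterable of (disease, (similarity, solution_id, location)).
    Returns (disease, average_similarity, solution_id, location).
    """
    groups = defaultdict(list)
    for disease, value in mapped:  # shuffle phase: group values by key
        groups[disease].append(value)

    # Reduce phase: average similarity of all records sharing a disease name.
    averages = {
        disease: sum(value[0] for value in values) / len(values)
        for disease, values in groups.items()
    }
    best_disease = max(averages, key=averages.get)

    # Recommend the solution whose record had the highest similarity.
    best_value = max(groups[best_disease], key=lambda value: value[0])
    return best_disease, averages[best_disease], best_value[1], best_value[2]


if __name__ == "__main__":
    # Made-up similarity scores, for illustration only.
    sample = [
        ("leaf blast", (0.75, "S1", "sindh")),
        ("leaf blast", (0.5, "S2", "sindh")),
        ("brown spot", (0.25, "S3", "sindh")),
    ]
    print(reduce_by_disease(sample))
```

Averaging per key keeps one unusually close record from deciding the disease on
its own, while the final solution is still taken from the single closest record.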
CONCLUSION
With applications of this type, present-day agricultural problems can be solved
on the spot, which helps a country's economy to grow.