Agriculture Data Analysis Using Parallel K-Nearest Neighbour Classification Algorithm
Corresponding Author:
Vimala Muninarayanappa
School of Computer Science and Applications, REVA University
Rukmini Knowledge Park, Yelahanka, Kattigenahalli, Bengaluru, Sathanur, Karnataka 560064, India
Email: [email protected]
1. INTRODUCTION
To improve agricultural productivity, the management system must be updated from time to time with data
such as yield, crop type, and crop growth conditions, together with rainfall patterns and weather-related
information (such as pressure, humidity, and temperature). The agro data captured by field sensors is
usually unstructured and is moved to a cloud environment through a gateway or the internet. Smart agro
farming therefore needs an effective system for storing and analysing such unstructured data on
cloud platforms.
This research seeks to address these issues and proposes an effective categorization model (ECM)
methodology. To categorise unstructured, multi-dimensional (high-dimensional) data into structured
form, a priority-based k-nearest neighbour (KNN) algorithm is first developed. In addition, a concurrent
categorization approach using the Hadoop MapReduce (HMR) architecture is provided. Figure 1 illustrates
the design of a quick and effective agro data classification algorithm for an agricultural management system.
The contributions of the proposed crop classification technique are as follows. First, a priority-based
classification system for multi-dimensional, high-dimensional, unstructured agro data is developed. Next, a
parallel classification approach using HMR is described. The proposed classification model can
analyse real-time agro sensory data with good accuracy, reduced time, higher memory
efficiency, and speedup.
Big data techniques (machine learning and deep learning) are used in precision agriculture because they
can analyse enormous volumes of data and extract crucial information. One study uses internet of things
(IoT) technology for intelligent agriculture to monitor environmental factors on a farm, with three-dimensional
cluster analysis (3D CA) used to study the environmental factors affecting the farm. Hyperspectral
series of images or videos increase both the rate and the volume of data generation, which poses
big data challenges, especially in agricultural remote sensing applications. Another work gives an overview
of the IoT, big data, and artificial intelligence (AI), and of how these technologies will affect the
agri-food sector in the future [1]–[4]. Recent research on intelligent data processing technologies in
agriculture, particularly in rice production, has been reviewed, and a unified vision for IoT technology,
data processing, and practical analytics in digital agriculture has been put forward. Since coronavirus
disease-2019 (COVID-19), more people are concerned about food safety, which benefits the market share of
smart agriculture. In contrast to earlier solutions, a framework for integrating and analysing agricultural
data from various sources uses cloud computing (CC), which improves the solution's scalability,
flexibility, affordability, and maintainability [5]–[8].
Agriculture mobile crowd sensing (AMCS) has been assessed thoroughly, with recommendations for
approaches to agricultural data collection. Gaussian kernel regression has been used to estimate rice yield
from optical and synthetic aperture radar (SAR) imagery using a small quantity of ground truth data.
A joint federated learning (FL) model based on partial least squares (PLS) regression and
neural networks (NN), called FL-NNPLS, has been proposed. A high-resolution spatiotemporal image fusion
approach (HISTIF), made up of filtering for cross-scale spatial matching (FCSM) and multiplicative
modulation of temporal change (MMTC), has also been suggested. The state of industrial agriculture and the
lessons from industrialized agricultural production patterns have been evaluated [9]–[12]. An image
compression method for data gathering has been proposed, and an analysis has been made of how close a
drone using a long range (LoRa) radio must fly to the sensors in order to gather
the data at a given level of data quality [13]–[16].
A new mechanism for automatically defining zones for variable rate application has been proposed.
An embedded system enhanced with AI has been demonstrated that enables continuous
analysis and on-site prediction of plant leaf growth dynamics. Data visualization analysis together with
cluster analysis has been used to identify the key technologies for advancing intelligent agriculture that
can enhance production efficiency and ensure the quality of agricultural yields [17], [18].
Figure 1. Accurate classification model's architectural design for a multi-level cloud storage concept
Agriculture data analysis using parallel k-nearest neighbour classification … (Vimala Muninarayanappa)
334 ISSN: 2089-4864
The paper is organized as follows. The second section presents the efficient classification
methodology for analysing raw unstructured data. The penultimate section presents the experiments
conducted to evaluate the accuracy of the classification model. The last section gives the conclusion of
the research and future work.
For analysis and categorization in this work, crop-monitoring datasets gathered from [19] are used.
The data consist of sensory readings acquired from various temperature, humidity, and gas sensors, and
are used to determine the conditions under which wine and banana fruits mature. The data comprise
11 attributes or dimensions, including id, time, R1, R2, R3, R4, R5, R6, R7, and R8, as well as temperature
and humidity, and are made up of 919,438 data points distributed over various locations and
periods. The dataset used in this investigation is described in full in [19]. We categorised these data using
priority clustering with K set to 3 (i.e. we consider three groups: not affected,
averagely affected, and totally affected). K can be modified to meet the user's categorization criteria.
The data are thus separated into three groups and stored in the cloud.
2.1. Clustering model for classifying unstructured raw data into structured data
The proposed priority-based KNN classification model is constructed by using k-means clustering
to divide the data points at each stage into L distinct regions. After clustering, the same procedure is
applied iteratively to the data points in each region. The iterative computation ends when a region
contains fewer than L data points. Algorithm 1 presents the proposed priority-based KNN model.
Int J Reconfigurable & Embedded Syst, Vol. 13, No. 2, July 2024: 332-340
    iterations ← iterations + 1
  end while
  for each cluster 𝐷𝑗 ∈ 𝐷 do
    build non-terminal node with centre 𝑄𝑗
    recursively apply the clustering method to the feature points in 𝐷𝑗
  end for
end if
The algorithm's branching factor, L, is the number of clusters
considered when splitting the data at each node, and choosing L is important for
achieving a successful classification result. J_, which represents the maximum number of clustering
iterations, is another parameter of the priority-based KNN clustering method. Fewer iterations speed up
clustering at the expense of accuracy. Last but not least, the parameter Dstr governs the selection of
the initial centres in the clustering algorithm. The proposed priority-based KNN clustering nevertheless
achieves good convergence in minimal time. The raw input data used for classification is displayed in Figure 3.
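The recursive clustering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the function names (`kmeans`, `build_tree`, `count_points`) are hypothetical, the initial centres are drawn at random in place of the Dstr parameter, and `max_iter` stands in for the maximum number of clustering iterations.

```python
import math
import random

def nearest(point, centres):
    """Index of the centre closest to `point` (Euclidean distance)."""
    return min(range(len(centres)), key=lambda j: math.dist(point, centres[j]))

def kmeans(points, L, max_iter, rng):
    """Plain Lloyd's k-means with L clusters; returns (centres, labels)."""
    # Assumption: initial centres are L randomly sampled data points.
    centres = [list(p) for p in rng.sample(points, L)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centres) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(L):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres, labels

def build_tree(points, L=3, max_iter=10, rng=None):
    """Recursively split `points` into L regions until a region has fewer than L points."""
    rng = rng if rng is not None else random.Random(0)
    if len(points) < L:  # stopping rule from the text
        return {"leaf": True, "points": points}
    centres, labels = kmeans(points, L, max_iter, rng)
    groups = [[p for p, lab in zip(points, labels) if lab == j] for j in range(L)]
    if sum(1 for g in groups if g) < 2:  # guard: no real split (e.g. duplicate points)
        return {"leaf": True, "points": points}
    children = [build_tree(g, L, max_iter, rng) for g in groups]
    return {"leaf": False, "centres": centres, "children": children}

def count_points(node):
    """Total number of points stored in the leaves below `node`."""
    if node["leaf"]:
        return len(node["points"])
    return sum(count_points(child) for child in node["children"])

# Hypothetical 4-dimensional sample data.
rng = random.Random(1)
data = [tuple(rng.gauss(0.0, 1.0) for _ in range(4)) for _ in range(200)]
tree = build_tree(data, L=3, max_iter=10)
print(count_points(tree))  # every input point ends up in exactly one leaf
```

Lowering `max_iter` shortens each clustering step but may leave the centres less settled, which mirrors the speed versus accuracy trade-off noted above.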
Figure 3 shows that the raw data is composed of 20-dimensional points, generated in a manner
similar to [19], [20]. The computational complexity depends mainly on the number of dimensions rather than
on the size of the data (rows). Classification is carried out to identify the least affected (i.e. class a),
averagely affected (i.e. class b), and most affected (i.e. class c) data under the assumption described in
Figure 4. The outcome of the classification model is shown in Figure 5.
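As a point of reference for the three-class labelling, the following is a minimal sketch of plain (non-prioritised) KNN voting over classes a, b, and c. The training points are hypothetical two-dimensional values chosen for illustration, and `knn_classify` is an illustrative name rather than part of the proposed model.

```python
import math
from collections import Counter

def knn_classify(query, samples, k=3):
    """Return the majority label among the k samples nearest to `query`.

    samples: list of (feature_vector, label) pairs.
    """
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, Counter.most_common keeps the label encountered first.
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: a = least, b = averagely, c = most affected.
training = [
    ((0.1, 0.2), "a"), ((0.2, 0.1), "a"), ((0.0, 0.3), "a"),
    ((1.0, 1.1), "b"), ((1.2, 0.9), "b"), ((0.9, 1.0), "b"),
    ((2.0, 2.1), "c"), ((2.2, 1.9), "c"), ((1.9, 2.0), "c"),
]

print(knn_classify((1.05, 1.0), training, k=3))
```

Each distance evaluation costs time proportional to the number of dimensions, which is one reason the dimension size weighs heavily on the overall computation.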
publicly available software model (i.e. it is open source in nature), and is broadly utilized for MR calculations.
The work of Owen et al. [22] was built to run on top of HMR and the Hadoop distributed file system (HDFS) [23],
[24]. HDFS is an implementation of the Google file system (GFS) in which a
very large dataset is fragmented into small blocks of equal length and duplicate copies of each block are
maintained (this process is known as data replication). While handling the data, the framework pushes
computations to the virtual computing nodes that host these blocks, which increases data locality
during computation and shortens the algorithm's makespan. When HMR is used together
with HDFS, HMR can exploit data locality and push computations to the data they
operate on, eliminating the network overhead that would otherwise be incurred when
collecting data from HDFS. This may give an HMR-based implementation an edge in computing overhead when
compared with other distributed and parallel processing architectures.
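The block splitting, replication, and locality-aware scheduling described above can be illustrated with a small in-memory sketch. It is not HDFS code: the function names are hypothetical, the round-robin replica placement is a simplifying assumption (real HDFS placement is rack-aware), and the last block may be shorter than the others.

```python
def split_into_blocks(records, block_size):
    """Cut a dataset into consecutive blocks of at most `block_size` records."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: replication <= len(nodes).
    """
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

def pick_node(block_id, placement, free_nodes):
    """Prefer a free node that already stores the block (data locality).

    Returns (node, is_local). Assumption: `free_nodes` is not empty.
    """
    for node in placement[block_id]:
        if node in free_nodes:
            return node, True
    return sorted(free_nodes)[0], False  # remote read: block must cross the network

records = list(range(10))
blocks = split_into_blocks(records, block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
print(placement)
print(pick_node(0, placement, {"n2", "n4"}))
```

When a replica holder is free, the task runs where the data already lives; only the fallback branch incurs the network transfer that locality awareness is meant to avoid.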
In Hadoop, a MapReduce job is executed across multiple machines in a distributed system, in which
one machine is the master node and the others are worker nodes (also known as slave nodes). The master
node distributes tasks among the worker nodes. Each worker node has a fixed number of mapper and reducer
functions, which can be called map and reduce slots. Worker nodes periodically report their free and
occupied map and reduce slots to the master node, and the master node schedules tasks based on the
availability of map and reduce slots in the cluster.
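A toy version of this slot-based scheduling is sketched below. It assumes a single snapshot of the free map slots reported by each worker and a simple first-come assignment policy; the function name `schedule` and the worker and task names are hypothetical, and the real Hadoop scheduler is considerably more elaborate.

```python
from collections import deque

def schedule(tasks, free_slots):
    """Assign pending tasks to workers that reported free slots.

    tasks: list of task names, in submission order.
    free_slots: dict mapping worker name -> number of free slots.
    Returns (assignment, still_pending).
    """
    pending = deque(tasks)
    slots = dict(free_slots)  # copy so the caller's report is not modified
    assignment = {}
    for worker in sorted(slots):
        while slots[worker] > 0 and pending:
            assignment[pending.popleft()] = worker
            slots[worker] -= 1
    return assignment, list(pending)

assignment, waiting = schedule(["m1", "m2", "m3", "m4"], {"w1": 2, "w2": 1})
print(assignment, waiting)
```

Tasks left in the waiting list would be assigned on a later report, once a worker announces that a slot has become free.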
The MapReduce function combines the tasks of mapping and reducing. The input dataset is divided
into uniformly sized blocks of data, which are then distributed among the nodes of the Hadoop cluster.
Applying a user-defined map function to the input of a map task produces intermediate output that
serves as the input to the reduce task. The reduce stage combines the two-phase shuffle and the reduce phase.
The output of the map tasks is the input to the shuffle phase, where the output of the completed map
tasks is shuffled and then sorted. The sorted data is then passed to the user-defined reduce function, and the
output is written back into HDFS. A map stage involves several distinct map tasks, each of which is listed.
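The data flow through the map, shuffle/sort, and reduce steps can be imitated in memory as follows. This is a single-process sketch for illustration only, not Hadoop code; the `mapper` and `reducer` shown (counting records per class label) and the sample records are hypothetical.

```python
from collections import defaultdict

def mapper(record):
    """User-defined map function: emit one (key, value) pair per record."""
    _sensor_id, label = record
    yield label, 1

def reducer(key, values):
    """User-defined reduce function: combine all values that share a key."""
    return key, sum(values)

def run_mapreduce(blocks, mapper, reducer):
    # Map stage: one map task per input block.
    intermediate = []
    for block in blocks:
        for record in block:
            intermediate.extend(mapper(record))
    # Shuffle/sort phase: group intermediate values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply the reduce function to each sorted key group.
    return [reducer(key, groups[key]) for key in sorted(groups)]

blocks = [
    [("R1", "a"), ("R2", "b")],
    [("R3", "a"), ("R4", "c"), ("R5", "a")],
]
print(run_mapreduce(blocks, mapper, reducer))
```

In a real cluster the blocks live on different nodes and the final list would be written to HDFS rather than returned.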
The reduce stage is a combination of the shuffle/sort and reduce phases. In the reduce stage, the shuffle/sort
phase starts only after the first map task has completed, and the shuffle phase finishes only after all map
tasks have completed. Once the shuffle/sort work is over, the reduce tasks start. The shuffle phase result
obtained in the first cycle may differ from the result obtained in the second cycle, because the shuffle
phase depends on the map cycle. The reduce shuffle phase is therefore measured over two reduce cycles,
called the initial shuffle and the typical shuffle. The reduce phase begins once the shuffle phase has finished.
Further details of HMR operation are given in [20]. The distributed key building
technique for the Hadoop HDInsight cluster is displayed in Algorithm 2. This work uses a distributed
architecture to classify agricultural data, and our model achieves good accuracy, reduces computing time,
and satisfies the real-time requirement, as empirically demonstrated in the next section.
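One common way to distribute a nearest-neighbour query over map and reduce steps is sketched below: each map task finds its k best candidates within one data partition, and the reduce step merges the candidates and takes the majority vote. This decomposition is an assumption made for illustration and is not taken from Algorithm 2; the function names and sample points are hypothetical.

```python
import heapq
import math
from collections import Counter

def map_local_knn(partition, query, k):
    """Map task: the k nearest (distance, label) candidates within one partition."""
    return heapq.nsmallest(
        k, ((math.dist(query, x), label) for x, label in partition)
    )

def reduce_global_knn(candidate_lists, k):
    """Reduce task: merge local candidates, keep the global k nearest, and vote."""
    merged = heapq.nsmallest(k, (c for lst in candidate_lists for c in lst))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points spread over three partitions (one per worker).
partitions = [
    [((0.1, 0.2), "a"), ((1.0, 1.1), "b"), ((2.0, 2.1), "c")],
    [((0.2, 0.1), "a"), ((1.2, 0.9), "b"), ((2.2, 1.9), "c")],
    [((0.0, 0.3), "a"), ((0.9, 1.0), "b"), ((1.9, 2.0), "c")],
]

query, k = (2.05, 2.0), 3
candidates = [map_local_knn(part, query, k) for part in partitions]
print(reduce_global_knn(candidates, k))
```

Because every one of the global k nearest points is also among the k nearest of its own partition, merging the local candidates gives the same answer as a search over the full dataset, while each worker only scans its own block.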
Table 1. Comparison with several state-of-the-art approaches for developing the classification tree
Metric                        Random [7]   ANN [7]   ECM-Local   ECM-Hadoop
Total CPU time (s)            129.69       52.5      35.25       2.37
Average accuracy              0.977        0.971     0.989       0.989
Memory overhead (kilobytes)   0.71         0.69      0.31        0.11
Speedup                       14           14        -           16
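The relative savings implied by Table 1 can be checked with a short calculation. The sketch below assumes that the ANN column is the baseline for the comparison, which is an assumption rather than something the table states; `relative_reduction` is an illustrative helper name.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

# Values copied from Table 1; using ANN as the baseline is an assumption.
ann = {"cpu_s": 52.5, "mem_kb": 0.69}
ecm_local = {"cpu_s": 35.25, "mem_kb": 0.31}

cpu_cut = relative_reduction(ann["cpu_s"], ecm_local["cpu_s"])
mem_cut = relative_reduction(ann["mem_kb"], ecm_local["mem_kb"])
print(f"CPU time reduced by {cpu_cut:.2f}%, memory overhead by {mem_cut:.2f}%")
```

Under that assumption, ECM-Local comes out at roughly 32.9% less CPU time and 55.1% less memory overhead than ANN.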
4. CONCLUSION
This research establishes an efficient classification technique for the performance analysis of agro
data in unstructured form. A priority-based KNN classification model is presented that performs the
analysis on multi-dimensional (high-dimensional) data. A distributed computing framework is adopted for
the analysis: a parallel clustering approach based on the Hadoop framework is developed to provide
scalable performance when analysing high-dimensional data. All experiments are carried out on real-time
data collected from agro sensors. The results show that ECM-Local reduces the total CPU time by 32.85%
and the memory overhead by 55.07%, while the accuracy improves by 1.82%. Likewise, the
ECM-Hadoop classification model decreases the total CPU time by 95.86% and the memory overhead
by 84.05%, improves the accuracy by 1.82%, and raises the speedup to 16. The
overall results show the scalable performance of the developed ECM model when compared with
several state-of-the-art approaches on parameters such as total CPU time, accuracy, memory
efficiency, and speedup. Future research will evaluate the model on different
datasets and seek to minimize the storage and processing cost.
REFERENCES
[1] S. A. Bhat and N.-F. Huang, “Big data and AI revolution in precision agriculture: survey and challenges,” IEEE Access, vol. 9,
pp. 110209–110222, 2021, doi: 10.1109/ACCESS.2021.3102227.
[2] F.-H. Tseng, H.-H. Cho, and H.-T. Wu, “Applying big data for intelligent agriculture-based crop selection analysis,” IEEE
Access, vol. 7, pp. 116965–116974, 2019, doi: 10.1109/ACCESS.2019.2935564.
[3] K. L.-M. Ang and J. K. P. Seng, “Big data and machine learning with hyperspectral information in agriculture,” IEEE Access, vol.
9, pp. 36699–36718, 2021, doi: 10.1109/ACCESS.2021.3051196.
[4] N. N. Misra, Y. Dixit, A. Al-Mallahi, M. S. Bhullar, R. Upadhyay, and A. Martynenko, “IoT, big data, and artificial intelligence
in agriculture and food industry,” IEEE Internet of Things Journal, vol. 9, no. 9, pp. 6305–6324, May 2022, doi:
10.1109/JIOT.2020.2998584.
[5] R. Alfred, J. H. Obit, C. P.-Y. Chin, H. Haviluddin, and Y. Lim, “Towards paddy rice smart farming: a review on big data,
machine learning, and rice production tasks,” IEEE Access, vol. 9, pp. 50358–50380, 2021, doi: 10.1109/ACCESS.2021.3069449.
[6] S. Chaterji et al., “Lattice: a vision for machine learning, data engineering, and policy considerations for digital agriculture at
scale,” IEEE Open Journal of the Computer Society, vol. 2, pp. 227–240, 2021, doi: 10.1109/OJCS.2021.3085846.
[7] J. Song, Q. Zhong, W. Wang, C. Su, Z. Tan, and Y. Liu, “FPDP: flexible privacy-preserving data publishing scheme for smart
agriculture,” IEEE Sensors Journal, vol. 21, no. 16, pp. 17430–17438, Aug. 2021, doi: 10.1109/JSEN.2020.3017695.
[8] A. Goldstein, L. Fink, and G. Ravid, “A cloud-based framework for agricultural data integration: a top-down-bottom-up
approach,” IEEE Access, vol. 10, pp. 88527–88537, 2022, doi: 10.1109/ACCESS.2022.3198099.
[9] S. H. Sreedhara, V. Kumar, and S. Salma, “Efficient big data clustering using adhoc fuzzy C means and auto-encoder CNN,” in
Inventive Computation and Information Technologies, S. Smys, K. A. Kamel, and R. Palanisamy, Eds., Lecture Notes
in Networks and Systems, vol. 563. Singapore: Springer Nature Singapore, 2023, pp. 353–368, doi: 10.1007/978-981-19-7402-1_25.
[10] Y. Alebele et al., “Estimation of crop yield from combined optical and SAR imagery using gaussian kernel regression,” IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 10520–10534, 2021, doi:
10.1109/JSTARS.2021.3118707.
[11] D. Vimalajeewa, C. Kulatunga, D. Berry, and S. Balasubramaniam, “A service-based joint model used for distributed learning:
application for smart agriculture,” IEEE Transactions on Emerging Topics in Computing, pp. 1–1, 2022, doi:
10.1109/TETC.2020.3048671.
[12] J. Jiang et al., “HISTIF: a new spatiotemporal image fusion method for high-resolution monitoring of crops at the subfield level,”
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 4607–4626, 2020, doi:
10.1109/JSTARS.2020.3016135.
[13] Y. Liu, X. Ma, L. Shu, G. P. Hancke, and A. M. Abu-Mahfouz, “From Industry 4.0 to agriculture 4.0: current status, enabling
technologies, and research challenges,” IEEE Transactions on Industrial Informatics, vol. 17, no. 6, pp. 4322–4334, Jun. 2021,
doi: 10.1109/TII.2020.3003910.
[14] S. Nesteruk et al., “Image compression and plants classification using machine learning in controlled-environment agriculture:
antarctic station use case,” IEEE Sensors Journal, vol. 21, no. 16, pp. 17564–17572, Aug. 2021, doi:
10.1109/JSEN.2021.3050084.
[15] A. Caruso, S. Chessa, S. Escolar, J. Barba, and J. C. Lopez, “Collection of data with drones in precision agriculture: analytical
model and LoRa case study,” IEEE Internet of Things Journal, vol. 8, no. 22, pp. 16692–16704, Nov. 2021, doi:
10.1109/JIOT.2021.3075561.
[16] J. Xu, N. V. Bermeo, M. Zheng, D. Langton, M. O’Grady, and G. M. P. O’Hare, “Automated zone identification for variable-rate
services in precision agriculture,” IEEE Access, vol. 9, pp. 163242–163252, 2021, doi: 10.1109/ACCESS.2021.3134488.
[17] D. Shadrin, A. Menshchikov, A. Somov, G. Bornemann, J. Hauslage, and M. Fedorov, “Enabling precision agriculture through
embedded sensing with artificial intelligence,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 7, pp. 4103–
4113, Jul. 2020, doi: 10.1109/TIM.2019.2947125.
[18] J. Chen and A. Yang, “Intelligent agriculture and its key technologies based on internet of things architecture,” IEEE Access, vol.
7, pp. 77134–77141, 2019, doi: 10.1109/ACCESS.2019.2921391.
[19] F. Huerta and R. Huerta, “Gas sensors for home activity monitoring data set,” 2016, Accessed: Jul. 26, 2018. [Online]. Available:
http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring.
[20] “Apache Hadoop,” The Apache Software Foundation, 2006. Accessed: Oct. 21, 2017. [Online]. Available:
http://hadoop.apache.org.
[21] T. White, Hadoop: The definitive guide, 1st ed. O'Reilly Media, Inc., 2009.
[22] S. Owen, B. E. Friedman, R. Anil, and T. Dunning, Mahout in Action, Manning Publications, 2011.
[23] D. Borthakur, “The hadoop distributed file system: architecture and design,” The Apache Software Foundation, 2007, Accessed:
Jul. 26, 2018, [Online]. Available: https://svn.apache.org/repos/asf/hadoop/common/tags/release-0.16.3/docs/hdfs_design.pdf.
[24] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no.
1, pp. 107–113, Jan. 2008, doi: 10.1145/1327452.1327492.
[25] L. Verdoliva, D. Cozzolino, and G. Poggi, “A reliable order-statistics-based approximate nearest neighbor search algorithm,”
IEEE Transactions on Image Processing, vol. 26, no. 1, pp. 237–250, Jan. 2017, doi: 10.1109/TIP.2016.2624141.
[26] L. Wan, Q. Cao, F. Wang, and S. Oral, “Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-
scale hierarchical storage systems,” Journal of Parallel and Distributed Computing, vol. 100, pp. 16–29, 2017, doi:
10.1016/j.jpdc.2016.10.002.
BIOGRAPHIES OF AUTHORS
Dr. Rajeev Ranjan completed a Ph.D. in wireless sensor networks at the Indian
Institute of Information Technology, Allahabad (IIIT-A), and is an associate professor in the
School of Computer Science and Applications at REVA University, Bangalore. His areas of
work include wireless sensor networks (coverage and connectivity), sensor deployment and
localization, IoT, and wireless sensor statistical routing. He can be contacted at email:
[email protected].