Prediction of Fertilizers For The Efficient Yield Through Machine Learning
BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
Submitted by
P. PRUDHVIRAJU (317126512047) N. SUMITRANJALI(317126512041)
S. S MANOJ (317126512050) G. DAVID ELIEZER(317126512020)
Under the guidance
Of
Mr. N. RAM KUMAR
Assistant Professor
ACKNOWLEDGEMENT
We would like to express our deep gratitude to our project guide Mr. N. Ram Kumar, Assistant
Professor, Department of Electronics and Communication Engineering, ANITS, for his
guidance with unsurpassed knowledge and immense encouragement. We are grateful to
Dr. V. Rajyalakshmi, Head of the Department, Electronics and Communication
Engineering, for providing us with the required facilities for the completion of the project
work.
We are very much thankful to the Principal and Management, ANITS, Sangivalasa, for
their encouragement and cooperation to carry out this work.
We express our thanks to all teaching faculty of Department of ECE, whose suggestions
during reviews helped us in accomplishment of our project. We would like to thank all
non-teaching staff of the Department of ECE, ANITS for providing great assistance in
accomplishment of our project.
We would like to thank our parents, friends, and classmates for their encouragement
throughout our project period. Last but not least, we thank everyone who supported
us directly or indirectly in completing this project successfully.
PROJECT STUDENTS
Prudhviraju Polamarasetty (317126512047),
Sumitranjali Nagothi (317126512041),
Satyanarayana Manoj Samudrala (317126512050),
David Eliezer Gorremuchu(317126512020)
ABSTRACT
Soil fertility plays a vital role in obtaining a good yield from the crop. Modern
advancements in technology are proving to be a boon in improving crop yield. A large
portion of farmers believe in myths prevalent in society and cultivate without prior
knowledge or proper analysis of the fertilizers best suited for the given soil and crop type.
Soil testing and determining the most suitable fertilizer will increase agricultural
production by improving the nutrient content available in the soil. Use of the wrong
fertilizers adversely affects the yield and health of the crop. Machine learning (ML) is an
up-and-coming field of informatics that can be applied effectively to the agricultural
sector, so we propose an ML model that analyses the given dataset with different ML
algorithms, namely Decision Tree, Random Forest, Gradient Boost, Ada Boost and
Gaussian NB, and predicts the most suitable fertilizer by choosing the best-performing
model. ML methods help in fertilizer prediction and thus assist farmers in improving the
crop yield.
CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Project Outline
1.2 Project Objective
CHAPTER 2 METHODOLOGY
2.2.7 Seaborn
2.2.8 Jupyter Notebook IDE
2.2.9 Jupyter Kernels
2.3 Algorithms Used
2.3.1 Decision Tree (DT)
2.3.2 Random Forest (RF)
2.3.3 Gaussian Naïve Bayes (NB)
2.3.3.1 Bayes' Theorem
2.3.3.2 Naïve Bayes classifier
2.3.3.3 Gaussian Naïve Bayes
2.3.4 Boosting
2.3.4.1 Ada Boost (AB)
2.3.4.2 Gradient Boost (GB)
4.6 Ada Boost
4.7 Comparison Table
CHAPTER 5 CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX
LIST OF SYMBOLS
LIST OF FIGURES
Fig. 3.10 Histogram of Potassium
Fig. 3.11 Correlation Matrix
Fig. 3.12 Temperature vs Fertilizer
Fig. 3.13 Moisture vs Fertilizer
Fig. 3.14 Nitrogen vs Fertilizer
Fig. 3.15 Potassium vs Fertilizer
Fig. 4.1 Confusion Matrix of Gaussian NB
Fig. 4.2 Confusion Matrix of Decision Tree
Fig. 4.3 Confusion Matrix of Random Forest
Fig. 4.4 Confusion Matrix of Gradient Boost
Fig. 4.5 Confusion Matrix of Ada Boost
Fig. 4.6 Gradient Boost Classifier
Fig. 4.7 Decision Tree Classifier
Fig. 4.8 Random Forest Classifier
Fig. 4.9 Gaussian NB Classifier
Fig. 4.10 Ada Boost Classifier
LIST OF TABLES
LIST OF ABBREVIATIONS
AB Ada Boost
GB Gradient Boost
DT Decision Tree
RF Random Forest
ML Machine Learning
IoT Internet of Things
N Nitrogen
P Phosphorus
K Potassium
CHAPTER 1
INTRODUCTION
1.2 PROJECT OBJECTIVE
Fertilizers are chemical substances supplied to crops to increase their productivity.
They are used by farmers regularly to increase the crop yield. Fertilizers contain the
essential nutrients required by plants, including nitrogen, potassium, and phosphorus.
They enhance the water retention capacity of the soil and increase its fertility. The fertility
of the soil plays a crucial role in getting a good yield from the crop. A fertile soil contains
all the major nutrients for basic plant nutrition, namely Nitrogen (N), Potassium (K) and
Phosphorus (P). This project helps the farmer predict the suitable fertilizer for the given
nutrient levels of the soil obtained from a soil test. Every combination of soil and plant
is unique and requires a different form of nutrients, so the type of fertilizer required
also varies. Farmers may not know the exact requirement of the soil or plant until
they get the result. Hence, a farmer in one region may end up with a good yield due to the
right selection of fertilizer, while a farmer in a different region with the same type of soil and
plant obtains a poor result. This shows that the fertilizer used on the farm is often based
on unclear predictions. The key is to get this balance right and to maintain a level of
nutrients in soils that will support our crops. So, there is a need to establish a platform to
suggest the right fertilizer for a given crop and soil type.
CHAPTER 2
METHODOLOGY
2.1 INTRODUCTION TO MACHINE LEARNING
Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning
generally is to understand the structure of data and fit that data into models that can be
understood and utilized by people. Although machine learning is a field within computer
science, it differs from traditional computational approaches. In traditional computing,
algorithms are sets of explicitly programmed instructions used by computers to calculate
or problem solve. Machine learning algorithms instead allow computers to train on data
inputs and use statistical analysis in order to output values that fall within a specific range.
Because of this, machine learning facilitates computers in building models from sample
data in order to automate decision-making processes based on data inputs.
In machine learning, tasks are generally classified into broad categories. These categories
are based on how learning is received or how feedback on the learning is given to the
system developed. Machine learning implementations are classified into the following three
major categories:
2.1.1 Supervised Learning
In supervised learning, the computer is provided with example inputs that are labelled with
their desired outputs. The purpose of this method is for the algorithm to be able to “learn”
by comparing its actual output with the “taught” outputs to find errors and modify the
model accordingly. Supervised learning therefore uses patterns to predict label values on
additional unlabelled data.
In unsupervised learning, the algorithm learns from plain examples without any associated
response, leaving it to the algorithm to determine the data patterns on its own. This type of
algorithm tends to restructure the data into something else, such as new features that may
represent a class or a new series of uncorrelated values. These algorithms are quite useful
in providing humans with insights into the meaning of data and new useful inputs for
supervised machine learning algorithms.
Reinforcement learning occurs when you present the algorithm with examples that lack
labels, as in unsupervised learning, but accompany each example with positive or negative
feedback according to the solution the algorithm proposes. It is connected to applications
for which the algorithm must make decisions (so the product is prescriptive, not just
descriptive, as in unsupervised learning), and the decisions bear consequences. In the
human world, it is just like learning by trial and error. Errors help you learn because they
have a penalty attached (cost, loss of time, regret, pain, and so on), teaching you that a
certain course of action is less likely to succeed than others. An interesting example of
reinforcement learning occurs when computers learn to play video games by themselves.
In this project we have used different supervised algorithms. Fig 2.3 shows a trained
supervised model.
Fig 2.3: A trained model of Supervised Learning
2.2.1 Python:
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC
programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was
released in 2000 and introduced new features, such as list comprehensions and a garbage
collection system using reference counting. Python 3.0 was released in 2008 and was a
significant revision of the language that is not entirely backwards compatible, and much
Python 2 code does not run unmodified on Python 3. Python 2 was discontinued with
version 2.7.18 in 2020.
2.2.2 Libraries:
Python's sizeable standard library, commonly cited as one of its greatest strengths, provides
tools suited to many tasks. For Internet-facing applications, many standard formats and
protocols such as MIME and HTTP are supported. It includes modules for creating
graphical user interfaces, connecting to relational databases, generating pseudorandom
numbers, arithmetic with arbitrary-precision decimals, manipulating regular expressions,
and unit testing.
Specifications cover some parts of the standard library (for example, the Web Server
Gateway Interface (WSGI) implementation follows its PEP), but most modules are not
specified; they are defined by their code, internal documentation, and test suites. However,
because most of the standard library is cross-platform Python code, only a few modules need
altering or rewriting for variant implementations.
2.2.3 Pandas
Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for
manipulating numerical tables and time series. It is free software released under the
three-clause BSD license. The name is derived from the term "panel data", an econometrics term
for data sets that include observations over multiple periods for the same individuals. Its
name is a play on the phrase "Python data analysis" itself. Wes McKinney started building
what would become pandas at AQR Capital while he was a researcher from 2007 to 2010.
Features:
• Tools for reading and writing data between in-memory data structures and different file
formats.
• Label-based slicing, fancy indexing, and subsetting of large data sets (a brief usage sketch follows).
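A minimal usage sketch, assuming the fertilizer dataset is stored as a CSV file (the file name is only an assumed example; the column names follow those used later in this report):

import pandas as pd

# Read the dataset into an in-memory DataFrame
df = pd.read_csv("Fertilizer Prediction.csv")   # assumed file name
print(df.head())       # first few rows
print(df.dtypes)       # column types inferred by pandas

# Label-based slicing and subsetting: nutrient columns for one (assumed) soil type
subset = df.loc[df["Soil Type"] == "Loamy", ["Nitrogen", "Potassium", "Phosphorous"]]
print(subset.describe())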
2.2.4 Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots
into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
A procedural "pylab" interface is also based on a state machine (like OpenGL), designed
to resemble MATLAB, though its use is discouraged closely. SciPy makes use of
Matplotlib.
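A small self-contained illustration of the object-oriented API described above (not one of the plots used in this project):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(figsize=(6, 3))   # explicit Figure and Axes objects
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
plt.show()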
John D. Hunter originally wrote Matplotlib. Since then, it has an active development
community and is distributed under a BSD-style license. Michael Droettboom was
nominated as matplotlib's lead developer shortly before John Hunter's death in August 2012
and was further joined by Thomas Caswell. Matplotlib 2.0.x supports Python versions 2.7
through 3.6. Python 3 support started with Matplotlib 1.2. Matplotlib 1.4 is the last version
to support Python 2.6. Matplotlib has pledged not to support Python 2 past 2020 by signing
the Python 3 Statement.
Tools required:
• Basemap: map plotting with various map projections, coastlines, and political boundaries
• Excel tools: utilities for exchanging data with Microsoft Excel
• GTK tools: interface to the GTK library
• Qt interface
• Natgrid: interface to the natgrid library for gridding irregularly spaced data
• Seaborn: provides an API on top of Matplotlib that offers sane choices for plot style and
colour defaults, defines simple high-level functions for common statistical plot types, and
integrates with the functionality provided by Pandas
2.2.5 NumPy
NumPy is a library for the Python programming language, adding support for large,
multidimensional arrays and matrices, along with an extensive collection of high-level
mathematical functions to operate on these arrays. The predecessor of NumPy, Numeric, was
created by Jim Hugunin with contributions from several other developers. In 2005, Travis
Oliphant created NumPy by incorporating features of the competing Numarray into Numeric,
with extensive modifications. NumPy is open-source software and has many contributors.
Features:
• NumPy addresses the slowness of interpreted Python loops partly by providing
multi-dimensional arrays and functions and operators that operate efficiently on arrays;
this requires rewriting some code, primarily inner loops, using NumPy (see the sketch after this list).
• Using NumPy in Python gives functionality comparable to MATLAB since they are both
interpreted, and they both allow the user to write fast programs as long as most operations
work on arrays or matrices instead of scalars. In comparison, MATLAB boasts many
additional toolboxes, notably Simulink, whereas NumPy is intrinsically integrated with
Python, a more modern and complete programming language.
• Moreover, complementary Python packages are available; SciPy is a library that adds
MATLAB-like functionality, and Matplotlib is a plotting package that provides MATLAB
like plotting functionality. Internally, both MATLAB and NumPy rely on BLAS and
LAPACK for efficient linear algebra computations.
• Python bindings of the widely used computer vision library OpenCV utilize NumPy
arrays to store and operate on data. Since images with multiple channels are simply
represented as three-dimensional arrays, indexing, slicing or masking with other arrays are
very efficient ways to access specific pixels of an image.
• The NumPy array as a universal data structure in OpenCV for images, extracted feature
points, filter kernels and many more vastly simplifies the programming workflow and
debugging.
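A small sketch of the array-oriented style described in the first point above (the numeric values are illustrative only):

import numpy as np

n = np.array([12.0, 18.0, 7.0, 35.0])   # e.g. nitrogen readings (illustrative values)
p = np.array([10.0, 22.0, 5.0, 14.0])   # e.g. phosphorus readings (illustrative values)

ratio = n / p                             # element-wise division, no explicit Python loop
standardized = (n - n.mean()) / n.std()   # whole-array expression
print(ratio)
print(standardized)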
2.2.6 Scikit-learn
Scikit-learn (formerly scikits.learn and sklearn) is a free software machine learning library
for the Python programming language.
Scikit-learn is written mainly in Python and uses NumPy extensively for high-performance
linear algebra and array operations. Furthermore, some core algorithms are written in
Cython to improve performance.
Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic
regression and linear support vector machines by a similar wrapper around LIBLINEAR.
In such cases, extending these methods with Python may not be possible.
Scikit-learn integrates well with many other Python libraries, such as Matplotlib and Plotly
for plotting, NumPy for array vectorization, Pandas data frames, SciPy, and many more.
2.2.7 Seaborn
Seaborn helps to explore and understand the data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots.
Its dataset-oriented, declarative API lets you focus on what the different elements of the
plots mean rather than on the details of how to draw them.
There is no universally best way to visualize data. Different plots best answer different
questions. Seaborn makes it easy to switch between different visual representations by
using a consistent dataset-oriented API.
When statistical values are estimated, seaborn uses bootstrapping to compute confidence
intervals and draw error bars representing the estimate's uncertainty.
Statistical analyses require knowledge about the distribution of variables in your dataset.
The seaborn function displot() supports several approaches to visualizing distributions.
These include classic techniques like histograms and computationally-intensive
approaches like kernel density estimation.
Some seaborn functions combine multiple kinds of plots to give informative summaries of
a dataset quickly. One, jointplot(), focuses on a single relationship: it plots the joint
distribution between two variables along with each variable's marginal distribution.
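A minimal sketch of these two functions, assuming the fertilizer dataset is already loaded in a dataframe df (column spellings follow the dataset as used in the appendix):

import seaborn as sns
import matplotlib.pyplot as plt

sns.displot(data=df, x="Nitrogen")                      # distribution of a single variable
sns.jointplot(data=df, x="Temparature", y="Moisture")   # joint and marginal distributions
plt.show()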
2.2.8 Jupyter Notebook IDE
Jupyter Notebook provides a browser-based REPL built upon many popular open-source
libraries:
• IPython
• ØMQ (ZeroMQ)
• jQuery
• MathJax
The Notebook interface was added to IPython in the 0.12 release (December 2011),
renamed to Jupyter notebook in 2015 (IPython 4.0 – Jupyter 1.0). Jupyter Notebook is
similar to the notebook interface of other programs such as Maple, Mathematica, and
SageMath, a computational interface style that originated with Mathematica in the 1980s.
According to The Atlantic, Jupyter interest overtook the popularity of the Mathematica
notebook interface in early 2018.
2.2.9 Jupyter Kernels
A Jupyter kernel is responsible for handling various requests (code execution, code
completions, inspection) and replying to them. Kernels talk to the other components of Jupyter using
ZeroMQ and thus can be on the same or remote machines. Unlike many other Notebook-
like interfaces, in Jupyter, kernels are not aware that they are attached to a specific
document and can be connected to many clients at once. Usually, kernels allow only a
single language, but there are a couple of exceptions.
The Jupyter Notebook has become a popular user interface for cloud computing, and major
cloud providers have adopted the Jupyter Notebook or derivative tools as a front-end
interface for cloud users. Examples include Amazon's SageMaker Notebooks, Google's
Colaboratory and Microsoft's Azure Notebook.
2.3 ALGORITHMS USED
We have used five different Machine Learning algorithms in this study to compare and
contrast their performances. The algorithms are as follows.
2.3.1 Decision Tree (DT)
Decision Tree makes use of a model wherein a structure that resembles a tree is used to map
decisions and their likely outcomes, as well as chance event outcomes, resource costs, and
utility. Each node in the tree is a representation of a conditional statement ('if'), and on the
whole, the decision tree can be viewed as a representation of a nested conditional.
A mandatory node that is present in every tree is the root node. The root node leads to leaf
nodes, which return the decision based on the attribute or parameter at each node. For each
level of the tree, two parameters are determined: Information Gain and Entropy.
Information Gain: the reduction in entropy that occurs after the dataset is split on a particular
attribute with respect to the independent variables.
These parameters are calculated again after a decision is returned by a node, and the
attribute with the highest IG is removed from the list. This process keeps repeating,
resulting in the depletion of attributes and finally the classification of the dataset.
2. The class variable split and the validation split are done to obtain the test and train
datasets.
3. The decision tree optimization is done by specifying the criterion for the attribute
selection.
4. The evaluation parameters are obtained to compare with the other classifiers.
5. Then the model is compiled and fit to give the predictions on the dataset, as sketched below.
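A minimal scikit-learn sketch of these steps, assuming a feature matrix X and encoded labels y have been prepared as in the appendix (the max_depth value is illustrative):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Step 2: split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Step 3: specify the criterion used for attribute selection
dt = DecisionTreeClassifier(criterion="entropy", max_depth=5)
# Step 5: fit the model and obtain predictions
dt.fit(X_train, y_train)
# Step 4: evaluation parameters for comparison with the other classifiers
print(classification_report(y_test, dt.predict(X_test)))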
Feature Randomness: Splitting in DTs occurs such that there is maximum separation
between the left and right nodes after considering every feature present in the dataset.
In the case of RF, however, as only a limited set of features is available to each tree, the
splitting occurs with very different separations in individual trees, and it is very unlikely
that two trees will be exactly the same.
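A minimal sketch of this idea with scikit-learn, assuming the same train/test split used elsewhere in this report; max_features is what limits the features considered at each split, and the other values mirror the settings used in the appendix:

from sklearn.ensemble import RandomForestClassifier

# Each tree sees a bootstrap sample and only a random subset of features per split
rf = RandomForestClassifier(n_estimators=50, max_depth=10, max_features="sqrt", random_state=1)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))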
Gaussian Naive Bayes is a variant of Naive Bayes that follows Gaussian normal
distribution and supports continuous data. We have explored the idea behind Gaussian
Naive Bayes along with an example. Before going into it, we shall go through a brief
overview of Naive Bayes.
Naive Bayes is a group of supervised machine learning classification algorithms based on
Bayes' theorem. It is a simple classification technique, but has high functionality. These classifiers
find use when the dimensionality of the inputs is high. Complex classification problems
can also be implemented by using Naive Bayes Classifier.
2.3.3.1 Bayes' Theorem
Bayes Theorem can be used to calculate conditional probability. Being a powerful tool in
the study of probability, it is also applied in Machine Learning.
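In its standard form, for two events A and B with P(B) > 0, Bayes' theorem states:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}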
2.3.3.3 Gaussian Naive Bayes
When working with continuous data, an assumption often made is that the continuous
values associated with each class are distributed according to a normal (or Gaussian)
distribution. The likelihood of the features is then assumed to take the following form.
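For a continuous feature x_i, with mean \mu_y and variance \sigma_y^2 estimated from the training samples of class y, the standard Gaussian form of this likelihood is:

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right)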
Fig 2.9: Gaussian NB classifier
The above illustration indicates how a Gaussian Naive Bayes (GNB) classifier works. At
every data point, the z-score distance between that point and each class-mean is calculated,
namely the distance from the class mean divided by the standard deviation of that class.
Thus, we see that the Gaussian Naive Bayes has a slightly different approach and can be
used efficiently.
2.3.4 Boosting
Boosting is an ensemble meta-algorithm that is mainly used to reduce bias (the group of
presumptions made so that the target function is relatively easier to learn) and variance
(the relative change in the estimated target function when the training data is changed).
2.3.4.1 Ada Boost (AB)
2.3.4.2 Gradient Boost (GB)
Training of models is done in a sequential and additive manner. The intuition behind the
gradient boosting algorithm is to repetitively leverage the patterns in the residuals to
strengthen a model with weak predictions and make it better. The main difference
between AB and GB is the way both these algorithms identify the faults of the weak
learners. The three causes of discrepancy between original and predicted values are
noise, variance, and bias. As noise cannot be reduced, only the other two are reduced by
the use of ensemble methods. While AB uses weighted data points, GB uses gradients
in the loss function (LF). The LF is used as a measure of how well our model fits the
training data that we are given. A main advantage of GB is that it allows the user to choose
the cost function and optimizes that cost function, as sketched below.
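A minimal sketch with scikit-learn, assuming the same train/test split used for the other models (the learning rate and estimator count are illustrative, not the project's tuned values):

from sklearn.ensemble import GradientBoostingClassifier

# Trees are added sequentially, each fitted to the gradients of the loss function
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=1)
gb.fit(X_train, y_train)
g_pred = gb.predict(X_test)
print(gb.score(X_test, y_test))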
CHAPTER 3
DATA VISUALIZATION
3.1 DATA CLEANING
Data cleaning is an essential step to perform before creating a visualization. Clean, consistent
data will be much easier to visualize. Clean data is data that is free of errors or anomalies
which may make it hard to use or analyze. Starting from a clean dataset allows you to
focus on creating an effective visualization rather than trying to diagnose and fix issues
while creating visualizations. Data cleaning tasks depend heavily on the dataset that
you're working with. In most cases, data cleaning involves the following tasks (a minimal
pandas sketch follows the list):
1. Removing unnecessary variables
2. Deleting duplicate rows/observations
3. Addressing outliers or invalid data
4. Dealing with missing values
5. Standardizing or categorizing values
6. Correcting typographical errors
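A minimal pandas sketch of these checks, assuming the raw data has already been read into a dataframe df:

# Missing values and unique values per column (used for the observations below)
print(df.isnull().sum())
print(df.nunique())

# Delete duplicate observations
df = df.drop_duplicates()

# Standardize column names (e.g. stray spaces introduced while exporting the file)
df.columns = df.columns.str.strip()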
The following observations have been made during the process of data cleaning before
visualizing the data.
Figure 3.2 shows the number of null values and the number of unique values of each
attribute in the dataset.
Fig 3.3: Types of crops that are considered
Figure 3.3 shows the different types of crops that have been considered for the project
and the count of each crop in the dataset.
Figure 3.4 shows the different types of fertilizers that have been considered for the project
and the count of each fertilizer in the dataset.
3.2 WHAT IS DATA VISUALIZATION?
With the help of data visualization, we can see what the data looks like and what kind of
correlation is held by the attributes of the data. It is the fastest way to see whether the
features correspond to the output. With the help of the following Python recipes, we can
understand ML data with statistics.
Data can be visualized easily only when it is in 2-D or 3-D, but almost all of the data that
you obtain in the real world won't be this way. As a machine learning engineer, working
with more than 1000-dimensional data is very common. So, what can we do in cases where
the data has more than three dimensions? There are Dimensionality Reduction (DR)
techniques like PCA, t-SNE and LDA which help you convert data from a higher dimension
to 2-D or 3-D data in order to visualize it. There may be some loss of information with each
DR technique, but only they can help us visualize very high dimensional data on a 2-D plot.
t-SNE is one of the state-of-the-art DR techniques employed for visualization of high
dimensional data.
From the perspective of building models, by visualizing the data we can find hidden
patterns, explore whether there are any clusters within the data, and find whether the classes
are linearly separable or heavily overlapped. From this initial analysis we can easily rule out
the models that won't be suitable for such data and implement only the models that are
suitable, without wasting valuable time and computational resources. This part of data
visualization is a predominant one in initial Exploratory Data Analysis (EDA) in the field
of Data Science/ML.
The following techniques have been used to visualize the data from the collected data set
that consists of N, P, K values, temperature, humidity, moisture, crop type, soil type and
fertilizer name corresponding to the respective data.
a. Histograms
b. Correlation Matrix
c. Bar Plots
a. Histogram
Histograms group the data into bins and are the fastest way to get an idea about the
distribution of each attribute in the dataset. The following are some of the characteristics of histograms:
• They provide a count of the number of observations in each bin created for
visualization.
• From the shape of the bins, we can easily observe the distribution, i.e. whether it is
Gaussian, skewed or exponential.
The following observations have been made while plotting the histograms of the different
input parameters like temperature, moisture and humidity, and the macro-nutrient values
present in the soil, namely nitrogen (N), phosphorus (P) and potassium (K) (these nutrient
values can be obtained by testing the soil), excluding the crop type, soil type and fertilizer
names, as they have not been labelled with values. So, label encoding has been done for
those parameters to proceed further. Different histograms have been plotted so as to give
a clear picture of the data, such as the number of values of an input parameter that lie in a
particular range.
The following are some of the histograms that have been plotted, i.e., humidity, moisture,
temperature, nitrogen, potassium and phosphorus.
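A minimal sketch of how such histograms can be produced, assuming df is the label-encoded dataframe (the exact column spellings are taken from the appendix and may differ in other copies of the dataset):

import seaborn as sns
import matplotlib.pyplot as plt

for col in ["Temparature", "Humidity", "Moisture", "Nitrogen", "Potassium", "Phosphorous"]:
    sns.histplot(data=df, x=col)   # one histogram (binned counts) per input parameter
    plt.show()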
Figure 3.6 shows the values of humidity: the bin 15-55 occurred with a highest frequency
of 55, 55-60 occurred with a highest frequency of 68, 60-65 occurred with a highest
frequency of 60, 65-70 occurred with a highest frequency of 29, and 70-74 occurred with a
highest frequency of 5.
Figure 3.7 shows the values of moisture: the bin 0-30 occurred with a highest frequency
of 24, 30-40 occurred with a highest frequency of 58, 40-50 occurred with a highest
frequency of 52, 50-60 occurred with a highest frequency of 25, and 60-65 occurred with a
highest frequency of 35.
Fig 3.8: Histogram of nitrogen
Figure 3.8 shows the values of nitrogen: the bin 0-5 occurred with a highest frequency of
15, 5-10 occurred with a highest frequency of 65, 10-15 occurred with a highest frequency
of 115, 15-20 occurred with a highest frequency of 20, 20-25 occurred with a highest
frequency of 43, 25-30 occurred with a highest frequency of 5, 30-35 occurred with a
highest frequency of 40, and 35-43 occurred with a highest frequency of 42.
Figure 3.9 shows the values of phosphorus: the bin 0-10 occurred with a highest frequency
of 85, 10-20 occurred with a highest frequency of 43, 20-30 occurred with a highest
frequency of 38, 30-40 occurred with a highest frequency of 35, and 40-45 occurred with a
highest frequency of 35.
Figure 3.10 shows the values of potassium: the bin 0.0-2.5 occurred with a highest
frequency of 255, 2.5-5.0 occurred with a frequency of 0, 5.0-7.5 occurred with a highest
frequency of 10, 7.5-10.0 occurred with a highest frequency of 17, 10.0-12.5 occurred with
a highest frequency of 15, 12.5-15.0 occurred with a highest frequency of 7, 15.0-17.5
occurred with a highest frequency of 10, and 17.5-20.0 occurred with a highest frequency
of 11.
b. Correlation Matrix
Correlation coefficients are indicators of the strength of the linear relationship between
two different variables, x and y. A linear correlation coefficient that is greater than zero
indicates a positive relationship. A value that is less than zero signifies a negative
relationship. The correlation coefficient (ρ) is a measure that determines the degree to
which the movement of two different variables is associated. The most common
correlation coefficient, generated by the Pearson product-moment correlation, is used to
measure the linear relationship between two variables. However, in a non-linear
relationship, this correlation coefficient may not always be a suitable measure of
dependence.
The possible range of values for the correlation coefficient is -1.0 to 1.0. In other words,
the values cannot exceed 1.0 or be less than -1.0. A correlation of -1.0 indicates a
perfect negative correlation, and a correlation of 1.0 indicates a perfect positive
correlation. If the correlation coefficient is greater than zero, it is a positive relationship.
Conversely, if the value is less than zero, it is a negative relationship. A value of zero
indicates that there is no relationship between the two variables.
Note: When interpreting correlation, it's important to remember that just because two
variables are correlated, it does not mean that one causes the other
The following correlation matrix has been obtained for the given input parameters, showing
how they are related to each other.
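A minimal sketch of how such a matrix can be computed and drawn, assuming the categorical columns have already been label-encoded so that df is fully numeric:

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()                              # pairwise Pearson correlation coefficients
sns.heatmap(corr, annot=True, cmap="summer")  # same colour map as used for the confusion matrices
plt.show()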
Fig 3.11: Correlation Matrix
The results that are inferred from the above Correlation Matrix are tabulated below
Correlation coefficients are used to measure the strength of the linear relationship
between two variables.
A correlation coefficient greater than zero indicates a positive relationship while a
value less than zero signifies a negative relationship
A value of zero indicates no relationship between the two variables being
compared.
A negative correlation, or inverse correlation, is a key concept in the creation of
diversified portfolios that can better withstand portfolio volatility.
Calculating the correlation coefficient is time-consuming, so data are often plugged
into a calculator, computer, or statistics program to find the coefficient.
c. Bar Plots
A bar plot shows categorical data as rectangular bars with the height of the bars
proportional to the values they represent. It is often used to compare values across
different categories in the data.
Fig 3.13: Moisture vs Fertilizer
CHAPTER 4
RESULTS AND DISCUSSION
4.1.1 Confusion Matrix: In every algorithm, a set of test data is kept aside by the
cross-validation method, and the comparison of the classifier's performance on this test
dataset with the true labelled values is visualized as a confusion matrix.
Confusion Matrix      Predicted: Yes    Predicted: No
Actual: Yes           TP                FN
Actual: No            FP                TN
where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true
negatives, respectively.
4.1.2 Accuracy: Accuracy is defined as the fraction of the predictions that
the classifier got right
4.1.3 Precision: Precision is defined as the fraction of the samples predicted as positive
that are actually positive.
4.1.4 Recall: Recall is defined as the fraction of the actual positives that our model labels
as positive.
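In terms of the confusion-matrix entries defined above, these evaluation parameters take the standard forms:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}

\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}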
Confusion matrices of all the proposed Machine Learning algorithms have been obtained, i.e., for
Decision Tree, Random Forest, Gaussian NB, Ada Boost and Gradient Boost, from which we can
infer the evaluation parameters like accuracy, recall, precision and F-score.
Fig 4.1: Confusion Matrix for Gaussian NB
Fig 4.3: Confusion Matrix for Random Forest
Fig 4.5: Confusion Matrix for Ada Boost
4.2 GRADIENT BOOST CLASSIFIER
4.3 DECISION TREE
4.4 RANDOM FOREST
4.5 GAUSSIAN NB
4.6 ADA BOOST
4.7 COMPARISON TABLE
Table 4.7: Comparison Table for all proposed algorithms
Precision 93 93 90 90 88
Recall 93 94 93 93 88
F-Score 93 93 91 91 87
CHAPTER 5
CONCLUSION AND FUTURE WORK
Utilizing ML, we could predict the most suitable fertilizer for the given crop and the soil
N, P, K values, assisting farmers in increasing the yield. This framework performs the
prediction using five distinct ML models. Decision Tree and Gaussian NB have shown the
same F-Score of 93%, but have shown 93.33% and 93.68% respectively in terms of
Accuracy. Based on the Accuracy and F-Score results, the conclusion has been made that
Gaussian NB makes the most accurate prediction. The suitable fertilizer is then predicted
through the Gaussian NB model when the parameters are provided by the end user. In
addition to this work, the dataset can be extended by getting more values from the analysis
of the soil, and the work can be extended to more crops. A user interface can be developed
so that farmers can directly get their results from the given inputs. Using advanced Machine
Learning algorithms can also increase the level of accuracy.
REFERENCES
[1] Bondre, Devdatta A., and Santosh Mahagaonkar. "Prediction of crop yield and fertilizer
recommendation using machine learning algorithms." International Journal of Engineering
Applied Sciences and Technology 4, no. 5 (2019): 371-376.
[2] Archana, K., and K. G. Saranya. "Crop Yield Prediction, Forecasting and Fertilizer
Recommendation using Voting Based Ensemble Classifier." SSRG Int. J. Comput. Sci.
Eng 7 (2020).
[3] Kim, Yun Hwan, Seong Joon Yoo, Yeong Hyeon Gu, Jin Hee Lim, Dongil Han, and
Sung Wook Baik. "Crop pests prediction method using regression and machine learning
technology: Survey." IERI Procedia 6 (2014): 52-56.
[4] Kalimuthu, M., P. Vaishnavi, and M. Kishore. "Crop Prediction using Machine
Learning." In 2020 Third International Conference on Smart Systems and Inventive
Technology (ICSSIT), pp. 926-932. IEEE, 2020.
[5] Kumar, Y. Jeevan Nagendra, V. Spandana, V. S. Vaishnavi, K. Neha, and V. G. R. R.
Devi. "Supervised Machine learning Approach for Crop Yield Prediction in Agriculture
Sector." In 2020 5thInternational Conference on Communication and Electronics Systems
(ICCES), pp. 736-741. IEEE, 2020.
[6] Medar, Ramesh, Vijay S. Rajpurohit, and Shweta Shweta. "Crop yield prediction using
machine learning techniques." In 2019 IEEE 5th International Conference for
Convergence in Technology (I2CT), pp. 1-5. IEEE, 2019
[7] Nigam, Aruvansh, Saksham Garg, Archit Agrawal, and Parul Agrawal. "Crop yield
prediction using machine learning algorithms."In 2019 Fifth International Conference on
Image Information Processing (ICIIP), pp. 125-130. IEEE, 2019.
[8] Jahan, Nusrat, and Rezvi Shahariar. "Predicting fertilizer treatment of maize using
decision tree algorithm." Indonesian Journal of Electrical Engineering and Computer
Science 20, no. 3 (2020): 1427-1434.
[9] Kanuru, L., Tyagi, A.K., Aswathy, S.U., Fernandez, T.F., Sreenath, N. and Mishra, S.,
2021, January. Prediction of Pesticides and Fertilizers using Machine Learning and Internet
of Things. In 2021 International Conference on Computer Communication and Informatics
(ICCCI) (pp. 1-6). IEEE.
[10] Motia, Sanjay, and S. R. N. Reddy. "Method for dataset preparation for soil data
analysis in decision support applications." In IOP Conference Series: Materials Science and
Engineering, vol. 1022, no. 1, p. 012104. IOP Publishing, 2021.
[11] Pandiarajaa, P. "A survey on machine learning and text processing for pesticides and
fertilizer prediction." Turkish Journal of Computer and Mathematics Education
(TURCOMAT) 12, no. 2 (2021): 2295-2302.
[12] Maeda, Yuichiro, Taichi Goyodani, Shunsaku Nishiuchi,and Eisuke Kita. "Yield
prediction of paddy rice with machine learning."In Proceedings of the International
Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA),
pp.361-365. The Steering and Committee of The World Congress in Computer
Science,Computer Engineering and Applied Computing (WorldComp),2018
APPENDIX
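The code below assumes the dataset has already been loaded and the plotting libraries imported; a minimal preamble consistent with the rest of the appendix (the CSV file name is an assumed example) is:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed setup: df holds the full dataset, data holds the feature columns
# used for model training further below (the file name is illustrative).
df = pd.read_csv("Fertilizer Prediction.csv")
data = df.drop(columns=["Fertilizer Name"])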
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['Fertilizer Name']= label_encoder.fit_transform(df['Fertilizer Name'])
df['Fertilizer Name'].unique()
df['Fertilizer Name'].value_counts().to_dict()
fig_dims = (20, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Moisture',data=df,y='Fertilizer Name')
sns.barplot(x='Temparature',data=df,y='Fertilizer Name')
fig_dims = (20, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Phosphorous',data=df,y='Fertilizer Name')
fig_dims = (20, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Nitrogen',data=df,y='Fertilizer Name')
fig_dims = (20, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Phosphorous',data=df,y='Fertilizer Name')
fig_dims = (20, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Potassium',data=df,y='Fertilizer Name')
data['Crop Type'].value_counts().to_dict()
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
data['Crop Type']= label_encoder.fit_transform(data['Crop Type'])
data['Crop Type'].unique()
data['Crop Type'].value_counts().to_dict()
data['Soil Type'].value_counts().to_dict()
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
data['Soil Type']= label_encoder.fit_transform(data['Soil Type'])
data['Soil Type'].unique()
data['Soil Type'].value_counts().to_dict()
from sklearn.model_selection import train_test_split
ip=data
op=df['Fertilizer Name']
X_train,X_test,y_train,y_test=train_test_split(ip,op,test_size=0.7,random_state=1)
X_train
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix
nb = GaussianNB()
nb.fit(X_train, y_train)
n_pred=nb.predict(X_test)
print('Accuracy of GaussianNB classifier on training set: {:.3f}'.format(nb.score(X_train,
y_train)))
print('Accuracy of GaussianNB classifier on test set: {:.3f}'.format(nb.score(X_test,
y_test)))
print('Classification Report:')
print(classification_report(y_test,n_pred))
print('\n')
print('Confusion Matrix:')
data=confusion_matrix(y_test, n_pred)
sns.heatmap(data,annot=True,cmap="summer")
plt.show()
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier().fit(X_train, y_train)
dtc.fit(X_train, y_train)
d_pred = dtc.predict(X_test)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
.format(dtc.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
.format(dtc.score(X_test, y_test)))
from sklearn.metrics import classification_report
k_range = range(1, 10)
scores=[]
for k in k_range:
    clf2 = DecisionTreeClassifier(max_depth=k).fit(X_train, y_train)
    scores.append(clf2.score(X_train, y_train))
print(max(scores))
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
.format(clf2.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
.format(clf2.score(X_test, y_test)))
print("Classification Report")
print(classification_report(y_test,d_pred))
print('\n')
data=confusion_matrix(y_test, d_pred)
sns.heatmap(data,annot=True,cmap="summer")
plt.show()
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50,max_depth=10, random_state=1)
classifier.fit(X_train, y_train)
from sklearn.metrics import classification_report, accuracy_score
print('Accuracy of Random Forest classifier on training set: {:.2f}'
.format(classifier.score(X_train, y_train)))
print('Accuracy of Random Forest classifier on test set: {:.2f}'
.format(classifier.score(X_test, y_test)))
r_pred=classifier.predict(X_test)
print("Classification Report")
print(classification_report(y_test,r_pred))
print('\n')
data=confusion_matrix(y_test, r_pred)
sns.heatmap(data,annot=True,cmap="summer")
plt.show()
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics
# Gradient Boost model (assumed to follow the same pattern as the other
# classifiers; gb and g_pred are referenced later in this appendix)
gb = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)
g_pred = gb.predict(X_test)
print('Accuracy of Gradient Boost classifier on training set: {:.2f}'.format(gb.score(X_train, y_train)))
print('Accuracy of Gradient Boost classifier on test set: {:.2f}'.format(gb.score(X_test, y_test)))
clff = AdaBoostClassifier(n_estimators=100, learning_rate=1, random_state=1).fit(X_train, y_train)
print('Accuracy of adaboost classifier on training set: {:.2f}'
.format(clff.score(X_train, y_train)))
print('Accuracy of adaboost classifier on test set: {:.2f}'
.format(clff.score(X_test, y_test)))
a_pred = clff.predict(X_test)   # predictions from the AdaBoost model (clff)
print("Classification Report ")
print(classification_report(y_test,a_pred))
print('\n')
data=confusion_matrix(y_test, a_pred)
sns.heatmap(data,annot=True,cmap="summer")
plt.show()
print("Accuracy scores of each model")
print("Decision tree : ",accuracy_score(y_test,d_pred))
print("Random forest : ",accuracy_score(y_test,r_pred))
print("GaussianNB : ",nb.score(X_test, y_test))
print("Adaboost : ",clff.score(X_test, y_test))
print("Gradient Boost : ",accuracy_score(y_test,g_pred))
def one():
    temperature = int(input('Enter temperature'))
    humidity = int(input('Enter humidity'))
    moisture = int(input('Enter moisture'))
    soiltype = int(input('Enter soil Type 0-black 1-clayey 2-loamy 3-red 4-sandy'))
    croptype = int(input('Enter Crop Type 0 :Tobacco 1 :Cotton 2 :Ground Nuts 3 :Maize 4 :millets 5 :oil seeds 6 :pulses 7 :paddy 8 :sugarcane 9 :Barley 10:wheat'))
    nitrogen = int(input('Enter nitrogen value'))
    potassium = int(input('Enter potassium value'))
    phosphorous = int(input('Enter phosphorous value'))
    # Gaussian NB prediction
    temp = nb.predict([[temperature, humidity, moisture, soiltype, croptype, nitrogen, potassium, phosphorous]])
    return temp
def two():
    temperature = int(input('Enter temperature'))
    humidity = int(input('Enter humidity'))
    moisture = int(input('Enter moisture'))
    soiltype = int(input('Enter soil Type 0-black 1-clayey 2-loamy 3-red 4-sandy'))
    croptype = int(input('Enter Crop Type 0 :Tobacco 1 :Cotton 2 :Ground Nuts 3 :Maize 4 :millets 5 :oil seeds 6 :pulses 7 :paddy 8 :sugarcane 9 :Barley 10:wheat'))
    nitrogen = int(input('Enter nitrogen value'))
    potassium = int(input('Enter potassium value'))
    phosphorous = int(input('Enter phosphorous value'))
    # Decision Tree prediction
    temp = dtc.predict([[temperature, humidity, moisture, soiltype, croptype, nitrogen, potassium, phosphorous]])
    return temp
def three():
    temperature = int(input('Enter temperature'))
    humidity = int(input('Enter humidity'))
    moisture = int(input('Enter moisture'))
    soiltype = int(input('Enter soil Type 0-black 1-clayey 2-loamy 3-red 4-sandy'))
    croptype = int(input('Enter Crop Type 0 :Tobacco 1 :Cotton 2 :Ground Nuts 3 :Maize 4 :millets 5 :oil seeds 6 :pulses 7 :paddy 8 :sugarcane 9 :Barley 10:wheat'))
    nitrogen = int(input('Enter nitrogen value'))
    potassium = int(input('Enter potassium value'))
    phosphorous = int(input('Enter phosphorous value'))
    # Random Forest prediction
    temp = classifier.predict([[temperature, humidity, moisture, soiltype, croptype, nitrogen, potassium, phosphorous]])
    return temp
def four():
    temperature = int(input('Enter temperature'))
    humidity = int(input('Enter humidity'))
    moisture = int(input('Enter moisture'))
    soiltype = int(input('Enter soil Type 0-black 1-clayey 2-loamy 3-red 4-sandy'))
    croptype = int(input('Enter Crop Type 0 :Tobacco 1 :Cotton 2 :Ground Nuts 3 :Maize 4 :millets 5 :oil seeds 6 :pulses 7 :paddy 8 :sugarcane 9 :Barley 10:wheat'))
    nitrogen = int(input('Enter nitrogen value'))
    potassium = int(input('Enter potassium value'))
    phosphorous = int(input('Enter phosphorous value'))
    # AdaBoost prediction
    temp = clff.predict([[temperature, humidity, moisture, soiltype, croptype, nitrogen, potassium, phosphorous]])
    return temp
def five():
    temperature = int(input('Enter temperature'))
    humidity = int(input('Enter humidity'))
    moisture = int(input('Enter moisture'))
    soiltype = int(input('Enter soil Type 0-black 1-clayey 2-loamy 3-red 4-sandy'))
    croptype = int(input('Enter Crop Type 0 :Tobacco 1 :Cotton 2 :Ground Nuts 3 :Maize 4 :millets 5 :oil seeds 6 :pulses 7 :paddy 8 :sugarcane 9 :Barley 10:wheat'))
    nitrogen = int(input('Enter nitrogen value'))
    potassium = int(input('Enter potassium value'))
    phosphorous = int(input('Enter phosphorous value'))
    # Gradient Boost prediction
    temp = gb.predict([[temperature, humidity, moisture, soiltype, croptype, nitrogen, potassium, phosphorous]])
    return temp
switcher = {
    1: one,
    2: two,
    3: three,
    4: four,
    5: five
}
print("1:Gaussian Naive Bayes 2:Decision Tree 3:Random Forest")
print("4:AdaBoost 5:Gradient Boost")
n = int(input('Enter model'))
func = switcher.get(n)
print(func())