Project Report
Crop Production Prediction
Degree in
DATA SCIENCE
By
Simran Kishan Kanojia
Department of Mathematics
The Institute of Science
Mumbai – 400032
May, 2023
Declaration
I hereby declare that the project work entitled “Crop Production Prediction”
carried out at the Department of Mathematics, The Institute of Science, Mumbai,
is a record of an original work done by me under the guidance of Dr. Selby Jose,
The Institute of Science, and this project work is submitted in the partial
fulfilment of the requirements for the award of the degree of Master of Science
in Data Science, Dr. Homi Bhabha State University. The results embodied in this
report have not been submitted to any other University or Institute for the award
of any degree or diploma.
SIMRAN KANOJIA
DS2216
Acknowledgment
This project as a whole has been a journey of growth and learning for me,
and I am grateful for the experience. Finally, I would like to thank my mentor for
their guidance and support, which helped me navigate the challenges and
opportunities that arose during the project.
1. Introduction
2. Preliminaries
5. Conclusion
Chapter 1
Introduction
Agriculture has, since its invention and inception, been the prime and pre-eminent activity of every culture and civilization throughout the history of mankind. It is not only an enormous part of the growing economy; it is essential for our survival. It is a crucial sector for the Indian economy and for the human future, and it contributes an outsized portion of employment. As time passes, the requirement for production has increased exponentially. In order to produce in mass quantity, people are using technology in the wrong way. New hybrid varieties are produced day by day. However, these varieties do not provide the essential nutrients of naturally produced crops. These unnatural techniques spoil the soil and end up causing further environmental harm. Most of these unnatural techniques are used to avoid losses.
When the producers of crops have accurate information on crop yield, losses are minimized. Machine learning is a fast-growing approach that is spreading into every sector and helping it make viable decisions and get the most out of its applications. Most systems nowadays rely on models that are analysed before deployment. The main concept is to increase the throughput of the agriculture sector with machine learning models. Another factor that affects the prediction is the amount of data provided during the training period, since the number of parameters is comparatively high. The core emphasis is on precision agriculture, where quality is ensured despite undesirable environmental factors.
In order to perform accurate prediction and handle the inconsistent trends in the season in which a crop is chosen for cultivation and in the area under cultivation, various machine learning models such as Random Forest Regression, Linear Regression, Ridge, Lasso, Elastic Net, and Decision Tree Regression are applied to find a pattern. By applying these machine learning models, I came to the conclusion that the Random Forest algorithm provides the most accurate values. The system predicts crop production from a collection of past data. Using past information on Year, Area, and State/UT for the two major crops cultivated, Rice and Wheat, the model was trained to give a meaningful prediction of production in tonnes.
Chapter 2
Preliminaries
2) Data Gathering
Identifying and selecting the appropriate sources from which to obtain
the necessary data. This step aims to collect the data required for training
the model.
3) Preparing Data
Manipulating and organizing the data in a suitable format for model
training. This step involves data cleaning tasks such as removing
duplicates, correcting errors, handling missing values, normalizing data,
and converting data types.
4) Data Exploration
Visualizing the data to identify relevant relationships between variables,
detect class imbalances, or perform other exploratory analyses.
Additionally, splitting the data into training and evaluation sets is done
in this step.
5) Building a Model
Developing a predictive model that performs better than a baseline or
reference model. This step involves selecting an appropriate algorithm or
model architecture and training it using the prepared data.
7) Predictions
Applying the trained model to new, unseen data (the test set) that was previously withheld from the model. This data, for which the target values are known, is used to evaluate the model's performance and obtain a better estimate of how the model will perform in real-world scenarios. A compact sketch of these steps is given below this list.
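Below is a minimal, hedged sketch of steps 2–7 in Python. It assumes a hypothetical CSV file, crop_production.csv, with illustrative column names (State, District, Season, Crop, Year, Area, Production); the actual file and column names used in the project may differ.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Step 2: gather the data (hypothetical file name).
df = pd.read_csv("crop_production.csv")

# Step 3: prepare the data -- remove duplicates, fix types, handle missing values.
df = df.drop_duplicates()
df["Production"] = pd.to_numeric(df["Production"], errors="coerce")
df = df.dropna(subset=["Production", "Area"])

# Step 4: explore, encode categorical variables, and split into training and test sets.
X = pd.get_dummies(df[["State", "Season", "Crop", "Year", "Area"]])
y = df["Production"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: build a model that should beat a simple baseline.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 7: predict on the withheld test set to estimate real-world performance.
print("R^2 on test set:", r2_score(y_test, model.predict(X_test)))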
2) Boosting –
It involves training multiple models sequentially, giving more weight to the data misclassified by earlier models. The goal is to improve the accuracy of the model and reduce bias. XGBoost is a well-known example of a boosting model.
3) Stacking -
It involves training multiple models and using their predictions as input to a meta-model. The goal is to improve the accuracy and reduce the variance of the model by combining the strengths of multiple models. The base prediction models can be built with methods such as KNN (K Nearest Neighbours) or SVM (Support Vector Machines). A sketch of boosting and stacking is given below this list.
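The following hedged sketch illustrates both ideas with scikit-learn's built-in implementations: GradientBoostingRegressor stands in for XGBoost, and a StackingRegressor combines KNN and SVM base learners under a Ridge meta-model. It assumes the train/test split from the earlier sketch.

from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Boosting: models are trained sequentially, each correcting the errors
# of the ensemble so far (XGBoost implements the same idea).
booster = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1)

# Stacking: base models' predictions become the inputs of a meta-model.
stacker = StackingRegressor(
    estimators=[("knn", KNeighborsRegressor()), ("svm", SVR())],
    final_estimator=Ridge(),
)

for name, ensemble in [("boosting", booster), ("stacking", stacker)]:
    ensemble.fit(X_train, y_train)
    print(name, "R^2:", ensemble.score(X_test, y_test))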
2.3.1 Bagging
Bagging is a machine learning technique that involves training multiple models on different subsets of the data and combining their predictions. The goal of bagging is to reduce the variance of the model and prevent overfitting: it lowers variance while leaving bias essentially unchanged.
The name "bagging" comes from the fact that the technique involves
creating random subsets of the training data, also known as "bags". Each
bag contains a random sample of the training data, and a model is trained
on each bag.
The models can be of different types, but decision trees are often used
because they tend to overfit the training data. By training multiple trees on
different subsets of the data, bagging can help to reduce the variance of the
model and improve its accuracy. Once the models are trained, their
predictions are combined to make a final prediction. This can be done by
taking the average of the predictions for regression tasks, or by using a
voting scheme for classification tasks.
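A minimal, hedged sketch of bagging with scikit-learn's BaggingRegressor, again assuming the earlier train/test split; each of the 50 "bags" is a bootstrap sample of the training data, and the trees' predictions are averaged.

from sklearn.ensemble import BaggingRegressor

# Each estimator is trained on a bootstrap sample ("bag") of the training
# data; the default base estimator is a decision tree, which suits bagging
# because single trees tend to overfit. For regression, the final
# prediction is the average of the individual predictions.
bagger = BaggingRegressor(n_estimators=50, bootstrap=True, random_state=42)
bagger.fit(X_train, y_train)
print("Bagging R^2:", bagger.score(X_test, y_test))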
Machine learning algorithms that use the bagging method are as follows –
1) Random Forest -
It is an ensemble learning method that combines multiple decision trees
to make predictions. It uses bagging to reduce the variance of the model
and improve its accuracy.
3) Bagged SVM –
It involves training multiple support vector machines (SVMs) on
different subsets of the data and combining their predictions. It is a type
of bagging method that can be used for classification tasks.
2.4.1 Random Forest Regression
Advantages -
1) Random forest regression is a powerful and flexible modelling technique
that can handle complex data structures and high dimensional data.
2) It is less prone to overfitting than other modelling techniques, such as
decision trees, due to the use of bagging and random feature selection.
3) It can handle missing data and categorical variables without the need for
data pre-processing.
4) It provides feature importance measures that can be used to identify the
most important predictors in the model.
Disadvantages -
1) Random forest regression can be computationally expensive and require
a lot of memory, especially for large datasets.
2) It is not as interpretable as simpler models, such as linear regression.
3) It may not perform well on small datasets or datasets with a small number
of predictors.
4) It may not work well if there are strong correlations between predictors.
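A minimal sketch of random forest regression under the same assumed split, including the feature-importance readout mentioned in advantage 4; the hyperparameter values are illustrative.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,     # number of bagged trees
    max_features="sqrt",  # random feature selection at each split
    random_state=42,
)
rf.fit(X_train, y_train)
print("Random Forest R^2:", rf.score(X_test, y_test))

# Advantage 4: feature importances identify the strongest predictors.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())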
2.4.2 Decision Tree Regression
A Decision Tree is a predictive model that uses a set of binary rules in order
to calculate the dependent variable. Each tree consists of branches, nodes,
and leaves.
Terminologies used in this regression are as follows:
1) The root node represents the entire population and is divided into two
or more homogeneous sets.
2) A decision node is when a sub-node splits into further sub-nodes.
3) A leaf is when a node does not split. These are also referred to as
"Terminal Nodes".
Limitations -
1) Decision Tree Regression is prone to overfitting, especially when the
tree is deep and the training data is small.
2) It may not perform well when the input variables have complex
interdependencies or when there are many irrelevant variables in the data.
3) It may not generalize well to new data if the training data is not
representative of the entire population.
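A minimal sketch of decision tree regression; capping max_depth is one common guard against the overfitting noted in limitation 1. The printout shows the learned binary rules (root, decision nodes, and leaves). The split from the earlier sketch is assumed.

from sklearn.tree import DecisionTreeRegressor, export_text

# A shallow tree: each internal node holds a binary rule, each leaf a prediction.
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Print the tree's rules in text form.
print(export_text(tree, feature_names=list(X_train.columns)))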
2.4.3 Ridge Regression
Assumptions -
The assumptions of ridge regression are the same as those of linear regression: linearity, constant variance, and independence.
To avoid overfitting, the learning should constrain the solution in order to fit a global pattern. Adding such a penalty forces the coefficients to be small, i.e., shrinks them toward zero.
Ridge regression imposes an $\ell_2$ penalty on the coefficients, i.e., it penalizes the Euclidean norm of the coefficients while minimizing the SSE. The objective function becomes:
$$\mathrm{Ridge}(\theta) = \|y - X\theta\|_2^2 + \lambda\|\theta\|_2^2$$
Advantages -
1) Ridge regression can help to prevent overfitting in linear regression
models by adding a regularization term to the loss function.
2) It can handle multicollinearity in the data by shrinking the coefficients
of correlated predictors towards each other.
3) It is computationally efficient and easy to implement.
4) It can improve the stability and generalizability of the model.
Disadvantages -
1) Ridge regression assumes that all predictors are equally important,
which may not be the case in some situations.
2) It may not perform as well as other regularization techniques, such as
Lasso regression, when there are a large number of predictors.
3) It may not work well if the data is not well-suited to linear modelling.
4) It may require some tuning of the regularization parameter to get
optimal results.
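A minimal sketch of ridge regression; alpha here plays the role of $\lambda$ in the objective above, and RidgeCV addresses disadvantage 4 by tuning it with cross-validation. The earlier split is assumed.

from sklearn.linear_model import RidgeCV

# Try several penalty strengths and keep the one with the best CV score.
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0])
ridge.fit(X_train, y_train)
print("chosen alpha:", ridge.alpha_)
print("Ridge R^2:", ridge.score(X_test, y_test))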
2.4.4 Lasso Regression
If a regression model uses the L1 regularization technique, it is called Lasso regression. Adding such a penalty forces the coefficients to be small, i.e., shrinks them toward zero, and can set some of them exactly to zero. The objective function to minimize becomes:
$$\mathrm{Lasso}(\theta) = \|y - X\theta\|_2^2 + \lambda\|\theta\|_1$$
Advantages -
1) Lasso regression can help to prevent overfitting in linear regression
models by adding a regularization term to the loss function.
2) It can perform feature selection by setting the coefficients of irrelevant
predictors to zero.
3) It can handle multicollinearity in the data by shrinking the coefficients of correlated predictors towards each other.
4) It is computationally efficient and easy to implement.
Disadvantages -
1) Lasso regression may not perform as well as other regularization
techniques, such as Ridge regression, when there are a large number of
predictors.
2) It may not work well if the data is not well-suited to linear modelling.
3) It may require some tuning of the regularization parameter to get
optimal results.
4) It may not be able to handle situations where there are more predictors
than observations.
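A minimal sketch of lasso regression showing the feature-selection effect of the $\ell_1$ penalty (advantage 2): coefficients of irrelevant predictors are driven exactly to zero. In practice the features are often standardized first; the earlier split is assumed.

import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0)  # alpha plays the role of lambda in the objective
lasso.fit(X_train, y_train)

# Count how many predictors survive the L1 penalty.
kept = int(np.sum(lasso.coef_ != 0))
print(f"{kept} of {len(lasso.coef_)} coefficients are non-zero")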
2.4.5 Elastic Net Regression
Elastic net (also called ELNET) regression is a statistical hybrid method that combines two of the most often used regularized linear regression techniques, lasso and ridge, to deal with multicollinearity issues when they arise between predictor variables. Regularization aids in solving the overfitting issues with the models.
It is also used for regularizing and choosing the essential predictor variables
that significantly impact the response variable. Ridge employs an L2
penalty, while lasso employs an L1. Since the elastic net utilizes both the L2
and the L1 models, the question of choosing between either one does not
arise.
An elastic net is a combination of two regressions, lasso and ridge, and hence the resulting objective function to minimize is:
$$\mathrm{ELNET}(\theta) = \|y - X\theta\|_2^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2$$
The elastic net penalty combines the two varieties, $\ell_1$ and $\ell_2$. The lasso and ridge regression models are two types of regularization models that apply the $\ell_1$ and $\ell_2$ penalties, respectively. The absolute value of each coefficient's magnitude is added as a penalty term in the lasso regression model, while ridge regression adds the coefficients' squared magnitudes as a penalty to the loss function.
Thus, it deals with both multicollinearity problems and the selection of regression coefficients. ELNET shrinks regression coefficients towards zero, or sets them exactly to zero, to reduce the number of predictor variables. The tuning parameter $\lambda_1$, multiplied by the $\ell_1$ norm of the coefficients (the sum of their absolute values), is used for this purpose. The tuning parameter $\lambda_2$ is multiplied by the squared $\ell_2$ norm of the coefficients to handle high correlation between the predictor variables. The ELNET regression approach helps produce a well-fitting, interpretable model by eliminating unnecessary variables from the final model, improving prediction accuracy. ELNET manages multicollinearity by retaining or excluding highly correlated predictor variables in the fitted model.
While building an ElasticNet regression model, both the $\ell_1$ and $\ell_2$ hyperparameters need to be set. ElasticNet takes the following two parameters:
• alpha – Constant that multiplies the penalty terms; the default value is 1.0 (conceptually, alpha = $\lambda_1 + \lambda_2$).
• l1_ratio – The ElasticNet mixing parameter, with 0 ≤ l1_ratio ≤ 1.
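A minimal sketch showing both parameters: l1_ratio = 1 gives pure lasso, l1_ratio = 0 gives pure ridge, and values in between mix the two penalties. The earlier split is assumed.

from sklearn.linear_model import ElasticNet

# alpha scales the total penalty; l1_ratio mixes the L1 and L2 parts.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X_train, y_train)
print("ElasticNet R^2:", enet.score(X_test, y_test))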
Over the past few decades, India has seen significant growth in both
rice and wheat production. According to the United States Department
of Agriculture, the total rice production in India has increased from
around 80 million tonnes in 1997 to around 120 million tonnes in 2020-
21. Similarly, the total wheat production in India has increased from
around 60 million tonnes in 1997 to around 109 million tonnes in 2020-
21. So, both rice and wheat production have increased significantly
over the past few decades.
Also, rice is generally grown in greater quantities than wheat in India. According to the Ministry of Agriculture and Farmers' Welfare, in 2020-21 the total area under rice cultivation in India was around 43 million hectares, while the total area under wheat cultivation was around 30 million hectares. So, rice is grown more than wheat in India.
[Code and graph: year-wise growth of total rice and wheat production in India, 1997–2020]
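The original listing did not survive extraction. Below is a hedged sketch of the kind of year-wise comparison described above, assuming the illustrative DataFrame df with Crop, Year, and Production columns from the Chapter 2 sketch; the project's actual plotting code may differ.

import matplotlib.pyplot as plt

# Total production per year for the two major crops.
trend = (
    df[df["Crop"].isin(["Rice", "Wheat"])]
    .groupby(["Year", "Crop"])["Production"]
    .sum()
    .unstack("Crop")
)
trend.plot(marker="o")
plt.ylabel("Production (tonnes)")
plt.title("Year-wise rice and wheat production in India")
plt.show()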
3.3.1 Rice
As per the data from the Ministry of Agriculture and Farmers' Welfare,
the top rice-producing states in India from 1997 to 2020 were West
Bengal, Uttar Pradesh, and Andhra Pradesh. West Bengal consistently
maintained its position as the largest producer of rice in India during
this period, with a production of around 14-15 million metric tons per
year. Uttar Pradesh and Andhra Pradesh were also among the top
producers during this period, with production levels ranging from 10-
14 million metric tons per year.
[Code, output, and graph: state-wise rice production, 1997–2020]
[Code and graph: year-wise rice production]
3.3.2 Wheat
3.3.2.1 State, District-wise production of wheat
The top wheat-producing states in India from 1997 to 2020 were Uttar
Pradesh, Punjab, and Haryana. Punjab is said to be the leading state.
[Code, output, and graph: state and district-wise wheat production]
3.3.2.2 Year-wise production of wheat
[Code and graph: year-wise wheat production]
Rice and wheat are two of the most important food crops grown in
India. Rice is mainly grown during the kharif season, while wheat is
mainly grown during the rabi season. The production of these crops
varies depending on the season, with different factors affecting the
production during each season. Despite challenges such as climate and pests, India has managed to maintain a steady increase in the production of rice and wheat, which has helped to meet the growing demand for food in the country.
3.5.1 Season based production of rice
Although rice is primarily a kharif crop, it is grown in other seasons as well. This is possible due to the favourable climatic and soil conditions in certain regions.
3.5.2 Season based production of wheat
[Code, output, and graph: season-wise wheat production]
Chapter 5
Conclusion
Crop yield prediction remains a challenging issue for farmers. The aim of this project is to propose and implement a system that predicts crop production from a collection of past data. This has been achieved by applying machine learning models to agriculture data from 1997 to 2020. Farmers require modern technologies to help them raise their crops, and agriculturists can be notified of accurate crop predictions on a timely basis. The agricultural factors were analysed using a variety of machine learning approaches. The features chosen were determined by the dataset's availability and the research goal. According to studies, models with more characteristics may not always deliver the best yield prediction performance, so models with more and with fewer features should be evaluated to discover the best performing one. The findings reveal that while no definitive conclusion can be drawn about which model is the best, they do show that some machine learning models are used more frequently than others. With the help of the Random Forest Regression model, I obtained an accuracy of about 97%.
Scope of the project
I have taken into consideration only two major factors, i.e., the area and the season of the districts. There are other factors, such as the fertility and type of soil present in the area, that would affect the crop yield; climatic conditions are another major factor.
With proper analysis, the soil type, fertilizers, and weather conditions can be added, which would be helpful in maximizing and predicting the crop yield. PCA and many deep learning techniques can be applied to images of the field and crop to detect whether they are infected by any disease or affected by the presence of weeds, which also degrade the quality of the crop, so that they can be separated from the healthy crops as early as possible.