PDS Report 2024-25

The document presents a crop recommendation system utilizing machine learning to help farmers select optimal crops based on soil and environmental conditions. Various models, including Random Forest, Decision Tree, and SVM, were evaluated, with Random Forest achieving the highest accuracy of 99.15%. The project emphasizes the importance of data-driven decision-making in agriculture, while also acknowledging limitations such as dataset size and class imbalance.

Uploaded by

Shav Aggrawal
Copyright
© All Rights Reserved
Bhartiya Vidya Bhavan’s

Sardar Patel Institute of Technology


(Autonomous Institute Affiliated to University of Mumbai)
Department of Computer Engineering

Crop Recommendation System


By
Aaryan Mantri (2023300001)
Agarwal Vedant Rakesh (2023300002)
Balla Mahadev Shrikrishna (2023300010)

Guided by
Sunil Ghane

Course Project
Python for Data Science (S.Y.)
Abstract

This project presents a crop recommendation system using machine learning to
assist farmers and agronomists in selecting optimal crops based on soil and
environmental conditions. By analyzing a comprehensive dataset of soil properties
and climate variables, such as nitrogen, phosphorus, potassium content,
temperature, humidity, pH, and rainfall, the project leverages machine learning to
classify the suitability of crops under varying conditions. Key objectives include
developing and comparing multiple machine learning models to identify the best-
performing algorithms. The methodology involves data preprocessing, feature
engineering, model training and evaluation, and accuracy comparison. We trained
several models, including Decision Tree, Naive Bayes, SVM, Logistic Regression,
and Random Forest. Experimental results highlight that the Random Forest model
achieved the highest accuracy, making it ideal for crop recommendation in real-
world applications. This system, backed by scientific data, is poised to support
sustainable agricultural practices by promoting informed crop selection.
Introduction

● Problem Statement: Agriculture plays a crucial role in supporting the
global food supply. However, deciding on the right crop to plant under
varying environmental conditions can be challenging and often depends on
expertise or limited local data. This project addresses the task of
recommending suitable crops based on specific soil and environmental data,
empowering farmers to make data-driven decisions that maximize yield and
resource efficiency.
● Objective: The primary goal of this project is to develop a machine learning
model that can predict the most suitable crop for a given combination of soil
and environmental factors. The system aims to classify and recommend
crops by analyzing multiple features, including soil nutrients and weather
conditions.
● Motivation: Accurate crop selection can lead to significant improvements in
agricultural productivity, sustainability, and profitability. A data-driven
approach not only saves resources but also helps to meet increasing food
demands. This project combines data science with agriculture, making a
practical impact in a critical field.
● Outline: The report begins with a description of the dataset, followed by
preprocessing steps and exploratory analysis. Next, we describe the
methodology, including model selection and implementation. Finally, we
present a comparison of results, discuss findings, and conclude with future
directions.
Dataset
● Description of Dataset:

○ Source: Kaggle’s Crop Recommendation Dataset.
○ Size: The dataset contains over 2200 entries with 8 attributes (7 input
features and 1 target label).
○ Type of Data: The seven input features are numeric; the target label is
categorical.
○ Features:
■ N, P, K: Levels of nitrogen, phosphorus, and potassium in the soil.
■ Temperature: Ambient temperature in degrees Celsius.
■ Humidity: Relative humidity percentage.
■ pH: Soil pH level indicating acidity or alkalinity.
■ Rainfall: Annual rainfall in mm.
■ Label: Target variable representing the recommended crop type for
the given conditions.

This dataset provides a comprehensive basis for training machine learning models
to predict suitable crops based on environmental and soil parameters.
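
As a rough illustration of the table layout described above, the sketch below reads a crop table with pandas and separates the seven features from the label. The column names and the helper `load_crop_data` are assumptions made for this example, and the four rows are made-up values, not records from the real dataset.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are NOT records from the real dataset.
# The column names are assumed to match the feature list described above.
SAMPLE_CSV = """N,P,K,temperature,humidity,ph,rainfall,label
90,42,43,20.9,82.0,6.5,202.9,rice
85,58,41,21.8,80.3,7.0,226.7,rice
71,54,16,22.6,63.7,5.7,87.8,maize
61,44,17,26.1,71.6,6.9,102.3,maize
"""


def load_crop_data(source):
    """Read the crop table and split it into features X and target y."""
    df = pd.read_csv(source)
    X = df.drop(columns=["label"])
    y = df["label"]
    return df, X, y


# In a real run, `source` would be the path to the downloaded CSV file.
df, X, y = load_crop_data(io.StringIO(SAMPLE_CSV))
print(df.shape)         # (rows, columns)
print(list(X.columns))  # the seven input features
print(y.nunique())      # number of distinct crops in this sample
```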

● Preprocessing:

○ Handling Missing Data: Ensured the dataset had no missing values, for
cleaner and more effective model training.
○ Normalization: For SVM, all feature values were scaled to a 0-1 range
using MinMaxScaler to improve model stability and convergence.
○ Label Encoding: Categorical crop labels were encoded into numeric values
to ensure compatibility with machine learning models.
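
The three preprocessing steps above could look roughly like the following with pandas and scikit-learn. This is a minimal sketch on a tiny made-up table (three of the seven features, invented values), not the project's actual code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Made-up values for illustration; not taken from the real dataset.
df = pd.DataFrame({
    "N": [90, 85, 71, 61],
    "temperature": [20.9, 21.8, 22.6, 26.1],
    "rainfall": [202.9, 226.7, 87.8, 102.3],
    "label": ["rice", "rice", "maize", "maize"],
})
feature_cols = ["N", "temperature", "rainfall"]

# 1. Missing data: count NaNs per column before training.
missing_per_column = df.isnull().sum()

# 2. Normalization: rescale every feature column to the 0-1 range.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(df[feature_cols])

# 3. Label encoding: map crop names to integers (classes are sorted alphabetically).
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(df["label"])

print(missing_per_column.sum())
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
print(list(encoder.classes_), list(y_encoded))
```

When a train/test split is used, the scaler is usually fitted on the training portion only and then applied to the test portion, so that test data does not influence the scaling.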

● Data Exploration:

○ Summary Statistics: Calculated mean, median, standard deviation, etc., for
each feature.
○ Visualization: Histograms and pair plots to analyze feature distributions and
relationships.
○ Correlation Analysis: A correlation heat-map was used to identify
relationships between features, highlighting significant correlations.
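
The exploration steps above can be sketched with pandas alone. The data here is synthetic: humidity is deliberately constructed to correlate with temperature so the correlation matrix has something to show, and none of the numbers reflect the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; the distributions are arbitrary choices.
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(25, 5, n)
df = pd.DataFrame({
    "temperature": temperature,
    # Built to depend on temperature, so the two columns correlate negatively.
    "humidity": 100 - 1.5 * temperature + rng.normal(0, 3, n),
    "ph": rng.normal(6.5, 0.7, n),
    "rainfall": rng.gamma(2.0, 50.0, n),
})

summary = df.describe()  # count, mean, std, min, quartiles, max per feature
medians = df.median()    # median per feature
corr = df.corr()         # pairwise Pearson correlation matrix

print(summary.loc[["mean", "std"]])
print(medians)
print(corr.round(2))
```

Passing `corr` to `seaborn.heatmap` would draw the heat-map, and `DataFrame.hist` or `seaborn.pairplot` would cover the histogram and pair-plot steps.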
Methodology
1. Machine Learning Models

● Models Used: The models tested include Decision Tree, Naive Bayes, SVM,
Logistic Regression, and Random Forest.
● Justification: Models were selected to capture a range of learning methods,
from simple decision boundaries (Decision Tree) to ensemble learning
(Random Forest) for complex patterns. SVM was chosen for its robustness
with normalized data, while Naive Bayes and Logistic Regression provided
baseline comparisons.
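
A minimal sketch of how the five model families listed above might be set up and compared in scikit-learn. It runs on synthetic data from `make_classification`, so the printed accuracies say nothing about the report's results. The choice of `GaussianNB` for Naive Bayes and the specific settings are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 7-feature, 4-class problem standing in for the crop data.
X, y = make_classification(
    n_samples=600, n_features=7, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Naive Bayes": GaussianNB(),  # assumed variant for numeric features
    # The SVM gets 0-1 scaled inputs, mirroring the normalization step.
    "SVM": make_pipeline(MinMaxScaler(), SVC(kernel="poly")),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```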

2. Model Implementation

● Training and Testing: Data was split into 80% for training and 20% for
testing to assess generalizability.
● Hyperparameter Tuning: Optimized parameters such as the maximum
depth for Decision Tree and kernel type for SVM. Random Forest was
evaluated with different tree counts.
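
One possible way to implement the 80/20 split and the tuning described above is a grid search with cross-validation on the training portion. The sketch below uses synthetic data. The candidate values, the stratified split, and the 3-fold cross-validation are assumptions chosen for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the crop data.
X, y = make_classification(
    n_samples=500, n_features=7, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# 80% training / 20% testing; stratify keeps class proportions equal (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search spaces for the three tuned parameters named in the text.
search_spaces = {
    "Decision Tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, 10, None]},
    ),
    "SVM": (
        Pipeline([("scale", MinMaxScaler()), ("svc", SVC())]),
        {"svc__kernel": ["linear", "poly", "rbf"]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100, 200]},
    ),
}

best_params = {}
test_accuracy = {}
for name, (estimator, grid) in search_spaces.items():
    # Cross-validate every candidate on the training split only.
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    best_params[name] = search.best_params_
    # The refitted best model is scored once on the held-out test split.
    test_accuracy[name] = search.score(X_test, y_test)

for name in search_spaces:
    print(name, best_params[name], round(test_accuracy[name], 4))
```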

3. Feature Selection and Extraction

● All features (nitrogen, phosphorus, potassium, temperature, humidity, pH,
and rainfall) were retained because each uniquely contributes to predicting
crop suitability. These attributes represent essential agricultural factors, like
soil nutrients and environmental conditions, crucial for crop health.

● Techniques like PCA or LDA were not applied, as reducing dimensions
could obscure the influence of individual features. Keeping all features
ensured comprehensive analysis, maximizing model interpretability and
accuracy, especially in models like Random Forest that leverage feature
importance effectively.
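
To illustrate why keeping the original columns helps interpretation, the sketch below reads per-feature importance scores from a fitted Random Forest. The data is synthetic and the label is deliberately built from rainfall and temperature only, so the resulting ranking is an artifact of that construction and not a reproduction of the project's findings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]

# Synthetic features in [0, 1]; values are arbitrary.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1, size=(400, 7)), columns=feature_names)

# Artificial 4-class label that depends only on rainfall and temperature.
y = 2 * (X["rainfall"] > 0.5).astype(int) + (X["temperature"] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one score per original feature (they sum to 1).
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Had PCA been applied first, these scores would refer to principal components rather than to named soil and weather measurements.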
Experimental Setup
● Tools Used: Python 3.8, Scikit-learn, Seaborn, Matplotlib, and Pandas.
● Evaluation Metrics:
○ Accuracy: Accuracy is the most straightforward metric, calculated as
the proportion of correct predictions out of the total predictions made.

○ Precision: Precision measures the proportion of true positive predictions
out of all positive predictions made by the model.

○ Recall: Recall, also known as sensitivity or true positive rate, measures
the proportion of actual positives that the model correctly identifies.

○ F1-Score: The F1-score is the harmonic mean of precision and recall,
providing a balanced metric when both false positives and false
negatives need to be minimized.
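
The four definitions above can be written out directly. The sketch below computes them in plain Python for one crop treated as the positive class. The labels are a made-up toy example, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels, invented for this example.
y_true = ["rice", "rice", "rice", "maize", "maize", "jute"]
y_pred = ["rice", "rice", "jute", "maize", "rice", "jute"]

print(accuracy(y_true, y_pred))                   # 4 of 6 correct
print(per_class_metrics(y_true, y_pred, "jute"))  # precision 0.5, recall 1.0
```

For "jute" there is one true positive, one false positive, and no false negative, which gives precision 1/2, recall 1, and F1 = 2(0.5)(1)/(1.5) = 2/3.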
Results and Discussion
● Performance Comparison:
Each model was trained and tested on the crop recommendation dataset, and
their performance was measured using accuracy and classification metrics:
○ Decision Tree: This model achieved an accuracy of 97.18%. It
constructed clear and interpretable decision boundaries, making it
effective for classification. However, even with its depth limited (maximum
depth = 5), the model occasionally exhibited overfitting tendencies. It
performed well for common crops but struggled to generalize for rarer
ones, especially when feature distributions overlapped. Despite these
challenges, its simplicity and explainability remain advantageous in
real-world applications.
○ Naive Bayes: As a probabilistic model, Naive Bayes achieved an
accuracy of 98.59%. It worked well for crops with distinct feature
distributions due to its assumption of feature independence. However,
this same assumption limited its ability to handle interdependencies
among features, causing lower precision and recall for certain crops.
The model performed better for crops with clearly separated clusters
in the feature space but struggled in regions with complex interactions.
○ Support Vector Machine (SVM): With normalized data and a
polynomial kernel, SVM achieved an accuracy of 98.02%. This model
effectively captured non-linear relationships, demonstrating its
capability in handling overlapping classes. However, the kernelized
approach required significant computational resources, making it less
efficient for large datasets. Its ability to define precise decision
boundaries made it suitable for predicting crops where subtle
variations in features mattered.
○ Logistic Regression: Logistic Regression achieved an accuracy of
95.76%. It was particularly effective for crops with linearly separable
data but struggled with those requiring non-linear decision boundaries.
As a baseline model, it provided a reference point for evaluating more
complex algorithms. While it lacked the sophistication to capture
intricate patterns, its simplicity and computational efficiency made it a
viable option for straightforward problems.
○ Random Forest: This ensemble model delivered the highest accuracy
of 99.15%. By combining predictions from multiple decision trees, it
captured complex patterns while minimizing overfitting. Its bagging
approach ensured robustness, and feature importance analysis
highlighted attributes like temperature, pH, and rainfall as critical for
predictions. Random Forest provided consistent performance across
all classes, including crops with overlapping features, making it ideal
for practical deployment.

● Confusion Matrix/Classification Report:


○ Random Forest Confusion Matrix: The Random Forest model
displayed strong performance with minimal misclassifications across
all crop categories. The classification report highlighted high
precision, recall, and F1-scores for most crops, though a few crops
with overlapping feature distributions exhibited slightly lower recall.
○ Support Vector Machine (SVM) Confusion Matrix: The SVM
model demonstrated commendable accuracy, particularly for crops
with non-linear boundaries. Its confusion matrix reflected balanced
classification, although its performance slightly dropped for crops
with subtle variations in feature space.
○ Naive Bayes Confusion Matrix: Naive Bayes, while effective for
crops with distinct feature distributions, showed increased
misclassifications for crops requiring a nuanced understanding of
feature interactions. Its classification report reflected lower F1-scores
for certain classes due to its assumption of feature independence.
○ Classification Report Insights: The Random Forest model emerged
as the most balanced performer, achieving high F1-scores even for
crops with limited samples in the test set. SVM followed closely,
performing well for non-linearly separable crops. In contrast, Naive
Bayes struggled with interdependent features, leading to reduced
precision and recall for certain crops.
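
A confusion matrix and classification report like those discussed above can be produced with scikit-learn as sketched below. The model is trained on synthetic data, so the numbers printed are unrelated to the project's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the crop data.
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes;
# off-diagonal cells count misclassifications.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, F1, and support as a nested dict.
report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
for label in ["0", "1", "2"]:
    print(label, round(report[label]["recall"], 3), round(report[label]["f1-score"], 3))
```

The diagonal of the matrix divided by its total equals the overall accuracy, while each row shows where a given class loses recall.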
● Model Interpretation:
○ Feature Importance:
■ For tree-based models like Random Forest, an analysis of
feature importance revealed the most influential attributes for
crop predictions. Features such as temperature, pH, and rainfall
had the highest importance scores, underscoring the critical role
of environmental factors in crop suitability. This insight
reinforces the need for comprehensive environmental data in
agricultural decision-making.
○ Error Analysis:
■ Underfitting: Naive Bayes and Logistic Regression struggled
with crops requiring non-linear decision boundaries due to their
inherent model assumptions. Their simplifying assumptions
limited their ability to capture intricate patterns, leading to
reduced accuracy for such cases.

■ Overfitting: The Decision Tree model, despite its constrained
depth, occasionally memorized training patterns, resulting in
reduced generalization during testing. This issue was mitigated
in Random Forest due to its ensemble learning approach.

■ Class Imbalance: All models exhibited slightly lower recall for
less-represented crops, a common issue in datasets with uneven
class distributions. Techniques like oversampling or synthetic
data generation could help mitigate this problem in future
iterations.
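
As a sketch of the oversampling idea mentioned above, the following randomly duplicates rows of under-represented crops until every class matches the largest one. It uses `sklearn.utils.resample` on a made-up table. The helper name `oversample_minority_classes` is hypothetical, and this is plain random oversampling, not synthetic data generation.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority_classes(df, label_col, random_state=0):
    """Randomly duplicate rows so every class reaches the largest class size."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < target:
            # Sample with replacement until the class has `target` rows.
            group = resample(
                group, replace=True, n_samples=target, random_state=random_state
            )
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)


# Made-up imbalanced table: 6 rice rows, 2 maize rows, 1 jute row.
df = pd.DataFrame({
    "rainfall": [200, 210, 220, 230, 240, 250, 90, 100, 170],
    "label": ["rice"] * 6 + ["maize"] * 2 + ["jute"],
})

balanced = oversample_minority_classes(df, "label")
print(balanced["label"].value_counts())
```

Oversampling is normally applied to the training split only; duplicating rows before the split would let copies of the same record land in both training and test sets.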
Conclusion
● Summary: This project explored the application of machine learning to
recommend crops based on soil and environmental factors. Using a variety
of models, we demonstrated the effectiveness of data-driven decision-
making in agriculture.
● Findings: The Random Forest model emerged as the top performer,
achieving the highest accuracy of 99.15%. Its ensemble learning approach
provided robust predictions, making it particularly suitable for complex
agricultural datasets. Other models like Decision Tree and SVM also showed
promise, albeit with limitations in generalization and computational
efficiency.
● Limitations: While the models performed well on the dataset, their real-
world applicability is constrained by the limited size and scope of the dataset.
Factors like extreme environmental conditions or additional variables (e.g.,
micronutrient levels) were not accounted for. Additionally, class imbalance
in the dataset affected recall for less-represented crops, highlighting the need
for balanced data or advanced techniques like oversampling.
● Future Work:
■ Future efforts can focus on the following enhancements:
● Incorporating advanced models like XGBoost, which
leverages gradient boosting to capture subtle patterns and
relationships in complex datasets. XGBoost has the
potential to outperform Random Forest by providing
higher accuracy and better generalization.
● Expanding the dataset to include more crop types, soil
compositions, and environmental conditions to improve
model robustness and applicability across diverse
agricultural contexts.
● Experimenting with deep learning or transfer learning
approaches to explore their suitability for crop
recommendation tasks. Neural networks, especially
convolutional or recurrent architectures, might provide
further improvements in accuracy.
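
A gradient-boosting comparison along the lines proposed above could start from something like the sketch below. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (a different implementation of the same general technique) and runs on synthetic data, so it does not indicate which model would win on the crop dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class stand-in for the crop data.
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for XGBoost; settings are arbitrary illustrative choices.
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 3-fold cross-validated accuracy for each candidate.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, acc in cv_accuracy.items():
    print(f"{name}: {acc:.4f}")
```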
Timesheet
● Aaryan Mantri: Initial model setup, Hyperparameter tuning

● Agarwal Vedant Rakesh: Data preprocessing, EDA, Analysis

● Balla Mahadev Shrikrishna: Model training, Report Preparation,
Documentation
References
● Introduction to Machine Learning for Everyone

○ Machine Learning for Everybody by Kylie Ying: FreeCodeCamp
Tutorial, FreeCodeCamp, 2022. Link: https://youtu.be/i_LwzRVP7bg

● Pandas Documentation

○ Pandas: Python Data Analysis Library, 2024. Link:
https://pandas.pydata.org

● NumPy Documentation

○ NumPy: The fundamental package for scientific computing with
Python, 2024. Link: https://numpy.org

● Scikit-learn: Machine Learning in Python

○ Link: https://scikit-learn.org/stable/

● Seaborn Documentation

○ M. Waskom, Seaborn: Statistical Data Visualization, 2024.
Link: https://seaborn.pydata.org

● Matplotlib Documentation

○ Link: https://matplotlib.org
Appendices (optional)
● Additional Figures or Tables: Include any figures or tables that do not fit
into the main body.
● Code Snippets: Provide any relevant code sections, especially if you want
to highlight a specific method or function.
