Stock Prediction Model CEP
Group 9
M. Saeed Shaikh, 2022421, CE, Affan Khan, 2022047, CE, and Farhan
Moaviz, 2022166, CE
Abstract—This report investigates the application of machine learning for crop yield prediction, focusing on three different models: linear regression, polynomial regression, and XGBoost regression. Conducted in Python, the project utilized a dataset containing country-wide data on rainfall, average temperature, pesticide use, and crop yields across various countries. This dataset allowed for the assessment of the models' ability to predict crop yields based on these environmental and agricultural factors. Each model's performance was evaluated through metrics like Mean Squared Error (MSE) and R-squared, revealing that the XGBoost model achieved the highest prediction accuracy and best generalized to unseen data. This suggests that XGBoost is a promising tool for predicting crop yields based on readily available country-wide data. This research highlights the potential of machine learning as a valuable tool for farmers, agricultural stakeholders, and policymakers to optimize resource allocation, enhance crop yields, and improve food security.

Machine learning offers a powerful tool for tackling the complex task of crop yield prediction. By analyzing historical data on weather, soil conditions, agricultural practices, and past yields, machine learning algorithms can discover patterns and relationships that contribute to crop yields. This report investigates the application of three different machine learning models for crop yield prediction:

1. Linear Regression: A simple yet effective model that establishes a linear relationship between input variables and crop yields.
2. Polynomial Regression: An extension of linear regression that allows for non-linear relationships between variables.
3. XGBoost Regression: A powerful ensemble learning technique that combines multiple decision trees to achieve high accuracy.

By comparing the performance of these models, this project aims to identify the most effective approach for crop yield prediction in a specific context. This report will delve into the details of each model, implementation specifics, and analysis of the outcomes. The findings will provide valuable insights into the potential of machine learning for improving crop yields and contribute to advancements in agricultural practices.
1) Improved Flexibility and Fit:

Polynomial regression addresses the limitations of linear regression by capturing non-linear relationships between variables. As Iqbal (2021) noted, "Non-linear data cannot be fit by linear regression technique (under-fitting). So, we increase the model complexity and use the polynomial regression model, which fits such big non-linear data in a better way." The model adds polynomial terms (powers and products of the inputs) to the equation, allowing it to represent more complex relationships between input and output variables.

2) Modeling Complex Interactions:

Polynomial regression can capture curvature and interactions among input variables. For example, the effect of rainfall on crop yield might not be linear but rather quadratic, with both low and high rainfall levels potentially impacting yield negatively. Polynomial modeling can account for such effects, leading to a more accurate representation of the underlying system.

3) Adaptability to Diverse Datasets:

Unlike linear regression, which assumes a constant linear relationship across the entire data range, polynomial regression can adapt to diverse datasets with varying patterns. This is particularly beneficial when dealing with real-world agricultural data, which can exhibit significant variations due to factors like different climate zones, soil types, and agricultural practices.

XGBoost Regression stands out as a powerful and versatile tool for crop yield prediction. This ensemble learning technique combines multiple decision tree models to achieve high accuracy and robustness. Compared to linear and polynomial regression, XGBoost offers several distinct advantages:

1) Enhanced Accuracy:

By leveraging the combined power of multiple decision trees, XGBoost can capture complex non-linear relationships and intricate interactions between various factors influencing crop yield. This often leads to improved prediction accuracy compared to simpler models like linear regression.

2) Robustness to Outliers and Noise:

XGBoost's ensemble learning nature makes it less sensitive to outliers and noise in the training data compared to single-model approaches like linear regression. This robustness leads to more reliable predictions in real-world scenarios where data quality might be imperfect.

3) Automatic Feature Selection:

XGBoost implicitly favors the most relevant features when choosing tree splits, which reduces the need for manual feature selection. This can reduce effective model complexity and improve generalizability.

4) Regularization for Avoiding Overfitting:
XGBoost can efficiently handle large datasets with millions of data points, making it suitable for real-world applications with extensive agricultural data. This scalability allows for comprehensive analysis and potentially more accurate predictions across diverse landscapes.

6) Interpretability:

The first crucial step in any machine learning project is selecting and preparing the data that will be used to train and evaluate the models. For this project, we utilized the Crop Yield Prediction Dataset available on Kaggle (https://www.kaggle.com/datasets/patelris/crop-yield-prediction-dataset). This dataset contains data on various factors influencing crop yield, including:

• Pesticides: Amount of pesticides used per country per year.
• Rainfall: Average rainfall per country per year.
• Temperature: Average temperature per country per year.
• Yield: Crop yield for various crops per country per year.

This data provides a valuable foundation for building and evaluating models for crop yield prediction.

Before utilizing the dataset for model training, we performed several data preprocessing steps to ensure its quality and suitability for machine learning algorithms:

2) Outlier Detection and Removal:

Outliers can significantly impact the performance of machine learning models. We used statistical methods and visualizations to identify outliers and remove them from the dataset, ensuring the model learned from representative data points.

By performing these data preprocessing steps, we ensure that the models are trained on high-quality, clean data, paving the way for accurate and reliable predictions of crop yield.

B. Building the Models

Once our data is prepared and preprocessed, we can proceed to building the three models for crop yield prediction: Linear Regression, Polynomial Regression, and XGBoost Regression. We use the scikit-learn library in Python for linear and polynomial regression and the xgboost package for XGBoost regression.

1) Linear Regression:

• We import the LinearRegression class from sklearn.linear_model.
• We create a LinearRegression object and fit it to the training data.
• We obtain the model's coefficients and can interpret their meaning in terms of the impact of each feature on crop yield.

2) Polynomial Regression:

• We import the PolynomialFeatures class from sklearn.preprocessing.
• We create a PolynomialFeatures object with the desired degree of polynomial terms.
• We transform the training data using the PolynomialFeatures object to create new features based on existing ones.
• We then follow the same steps as for linear regression, but with the transformed data.

3) XGBoost Regression:

• We import the XGBRegressor class from xgboost.
• We create an XGBRegressor object and set its hyperparameters.
• We fit the XGBoost model to the training data.
• We can utilize XGBoost's built-in feature importance analysis to understand the relative contribution of each feature to the model's predictions.

By building these three models and comparing their performance, we can gain valuable insights into their effectiveness for crop yield prediction. Analyzing the strengths and limitations of each approach allows us to identify the most suitable model for this specific context and potentially leverage its capabilities for further research and development.

Furthermore, the XGBoost model's R² value of 0.922 indicates that it can explain 92.2% of the variance in the actual crop yields, which is a remarkable performance. This suggests that the XGBoost model can be used to generate reliable and accurate predictions of crop yields, which can be valuable for farmers, agricultural stakeholders, and policymakers.

IV. SUMMARY OF RESULTS

This project investigated the application of three different machine learning models for crop yield prediction: linear regression, polynomial regression, and XGBoost regression. The analysis utilized a dataset containing country-wide data on rainfall, average temperature, pesticide use, and crop yields across various countries. Each model's performance was evaluated based on its ability to accurately predict crop yields using these environmental and agricultural factors.

A. Evaluation Metrics:

• Mean Squared Error (MSE): Measures the average squared difference between predicted and actual crop yields.
• R-squared (R²): Measures the proportion of variance in the actual crop yields explained by the model.

B. Model Performance:
Model                    MSE         R²
Linear Regression        9.69e+08    0.755
Polynomial Regression    2.15e+08    0.945
XGBoost Regression       3.08e+08    0.922
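
The two evaluation metrics can be computed directly from a model's predictions. The sketch below is a minimal, dependency-free illustration and not the project's actual evaluation code. The function names `mse` and `r_squared` are hypothetical, and the sample values are made up rather than taken from the dataset. In scikit-learn, the same quantities are provided by `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score`.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Made-up yields and predictions, for illustration only (not from the dataset).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print("MSE:", mse(y_true, y_pred))
print("R^2:", r_squared(y_true, y_pred))

# MSE values as reported in the results table. Taking the square root (RMSE)
# expresses the typical error in the same units as the yield itself.
reported_mse = {
    "Linear Regression": 9.69e08,
    "Polynomial Regression": 2.15e08,
    "XGBoost Regression": 3.08e08,
}
for model, value in reported_mse.items():
    print(f"{model}: RMSE ~ {math.sqrt(value):.0f}")
```

Because MSE is expressed in squared yield units, its magnitude depends on the scale of the target, which is why the table values are large. R² is scale-free: a value of 0 corresponds to always predicting the mean yield, and a value of 1 to a perfect fit.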
According to the table of results, polynomial regression recorded the lowest MSE (2.15e+08) and the highest R² (0.945), with XGBoost regression close behind (MSE 3.08e+08, R² 0.922). Both are well ahead of linear regression (MSE 9.69e+08, R² 0.755). XGBoost's R² value of 0.922 means it explains 92.2% of the variance in actual crop yields.

E. Future Directions:

• Further research could investigate the use of different data sources, including satellite imagery and soil data, to enhance prediction accuracy.
• Exploring ensemble models combining XGBoost with other machine learning algorithms might lead to further improvements in prediction performance.
• Adapting and refining the presented methodology for specific crops and regions can offer tailored solutions for diverse agricultural contexts.

V. CONCLUSION

This research demonstrates the potential of machine learning, particularly XGBoost regression, for crop yield prediction. By leveraging readily available, country-wide data, XGBoost clearly outperformed the linear regression baseline, demonstrating its capability to forecast crop yields accurately. This result has implications across the agricultural landscape, affecting farmers, policymakers, and the broader community.

Armed with precise yield predictions, farmers gain greater control over resource allocation. Planting and harvesting strategies, fertilization and irrigation practices, and land utilization can be optimized based on anticipated yields, minimizing waste, maximizing output, and reducing costs.

D. Looking Ahead, a Future of Abundance

The path forward in crop yield prediction is paved with continuous research and development. Expanding data sources, optimizing models, ensuring accessibility, and addressing ethical considerations are crucial aspects for realizing the full potential of this technology. By harnessing the power of machine learning, we can usher in a new era for agriculture, one where sustainability, resilience, and abundance are the cornerstones of a food-secure world.

REFERENCES

[1] T. Van Klompenburg, A. Kassahun, and Ç. Çatal, “Crop yield prediction using machine learning: A systematic literature review,” Computers and Electronics in Agriculture, vol. 177, p. 105709, Oct. 2020, doi: 10.1016/j.compag.2020.105709.
[2] M. A. Iqbal, “Application of Regression Techniques with their Advantages and Disadvantages,” ResearchGate, Sep. 2021, [Online]. Available: https://www.researchgate.net/publication/354921553_Application_of_Regression_Techniques_with_their_Advantages_and_Disadvantages
[3] “Crop Yield Prediction Dataset,” Kaggle, Dec. 01, 2021. https://www.kaggle.com/datasets/patelris/crop-yield-prediction-dataset/