
Methods and models

Introduction:
In this chapter we present the approach used in our work, from the way we treated our
data to the final prediction. We also present the technologies used to carry out our
project successfully.
General overview of our pipeline:
Our NPK demand prediction pipeline consists of 3 stages:
- Data loading and processing
- Feature selection using the Random Forest algorithm
- Training and model generation

Data Preparation
This section presents the process we followed to prepare the dataset used by our machine
learning models. The process consists of:
 Data Acquisition
 Data Preprocessing
Data acquisition:
Data collection is a crucial step in setting up a machine learning model, since learning is
done with the help of the available data. Nowadays we face a great challenge regarding
data collection and reliability, especially in agriculture, where data are hardly available
on the African continent.
That is why the first step in our study was the identification of data sources. This led
us to DAPSA, an entity of ANSD, the national statistics agency of Senegal, focused on
agriculture. They conduct annual surveys to collect information from farmers, and for our
work we used their dataset from the 2017-2018 agricultural campaign, which contains
11,700 farmer records with 90 features, to build our model.
Data Preprocessing:
Data pre-processing is a key step in the process of generating a machine learning model. It
is the step that focuses on getting the data ready for the training of the models so as to
obtain the best possible result.
We explain below how we managed this process and tried to obtain a dataset as ready as
possible for treatment by the machine learning models.
Data Cleaning:
In our dataset we noticed the presence of many null values, so the first thing we did was to
keep only the rows where the value of our target variable was not null. After this step we
obtained only 2,500 records, so from the DAPSA study we can say that only 25% of the
farmers used NPK fertilizer. This is an indication of the low consumption of this
fertilizer in Senegal.
We also deleted the following attributes because we judged them inappropriate for our
study (a pandas sketch of these cleaning steps is given after the list below):
• Id_parcelle: refers to the identifier of a field of the farmer
• Id_menage: refers to the identifier of the household of the farmer
• Poids_par: refers to the weight of the field in the survey
• Id_responsable: refers to the identifier of the person responsible for the field
• Unite_production_maximale/minimale: refers to the unit of the minimum/maximum
production of the farmer. We uniformized it to the universal unit, which is the kg.
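As an illustration, a minimal pandas sketch of these cleaning steps could look as follows; the file name and the target column name are assumptions, since the report does not give them.

import pandas as pd

# Load the 2017-2018 DAPSA survey data (the file name here is an assumption).
df = pd.read_csv("dapsa_2017_2018.csv")

# Keep only the rows where the target variable (hypothetical column name
# "npk_quantity") is not null.
df = df[df["npk_quantity"].notna()]

# Drop the identifier and survey-weight attributes judged irrelevant for the study.
df = df.drop(columns=["Id_parcelle", "Id_menage", "Poids_par", "Id_responsable"])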

Features Engineering
We applied standard methods such as filling the missing values, reformatting, and removing
unnecessary features in order to build the base for the creation of the model.
We also encoded our categorical variables, because some machine learning algorithms can
only deal with numeric values, so we had to transform the categorical variables into
numeric ones.
Finally we applied a scaling of our features to bring all the values into the same range,
using the MinMaxScaler specifically.
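A sketch of this step with Scikit-learn could look as follows; the categorical column names and the target column name are placeholders, not the actual survey fields.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# One-hot encode the categorical variables (placeholder column names).
df = pd.get_dummies(df, columns=["region", "crop_type"])

# Separate the target (hypothetical name) from the features before scaling.
X = df.drop(columns=["npk_quantity"])
y = df["npk_quantity"]

# Scale every feature into the same [0, 1] range with MinMaxScaler.
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)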
Features Selection
Before applying a machine learning model for our prediction, we had to keep only the most
relevant features in our dataset, since we have a lot of features. To do so we applied an
algorithm that gives an indication of the importance of each feature; in our project we
chose the random forest algorithm.

We retained the features that had an importance of at least 2.5%.
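A possible Scikit-learn sketch of this selection, assuming the prepared features X and target y from the previous steps, is shown below.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on all candidate features.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Keep only the features whose importance is at least 2.5%.
importances = pd.Series(rf.feature_importances_, index=X.columns)
selected = importances[importances >= 0.025].index.tolist()
X_selected = X[selected]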


ML Models:
Regression :
Regression is a method that helps to better understand the relationship between independent
variables, or features, and a dependent variable, or target. Regression models are used in
predictive analytics to forecast or predict outcomes. To do so, the models have to be
trained on labelled data in order to learn the relationship between input and output data [0].

SVR:
Support Vector Regression allows us to predict continuous values based on the principles of
the SVM. To do so, the model tries to fit the best line within a threshold value, which is the
distance between the hyperplane and the boundary line [1].
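A minimal Scikit-learn sketch of an SVR model, with illustrative rather than tuned hyperparameters, could be:

from sklearn.svm import SVR

# epsilon sets the width of the tube around the regression hyperplane;
# the kernel and hyperparameter values here are illustrative only.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)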

Random Forest Regression:


Random Forest Regression is a supervised learning algorithm that uses ensemble learning,
a technique that combines many regressors to solve a complex problem.
To achieve this it goes through several steps (a sketch follows the list below):
- Randomly select k data points from the training set.
- Build a decision tree on these k data points.
- Choose the number N of trees to build and repeat steps 1 and 2.
- For each new point, the N trees each predict a value of y, and the final
prediction is obtained by taking the average of these individual
predictions [2].
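As referenced above, a minimal Scikit-learn sketch of such a regressor could be:

from sklearn.ensemble import RandomForestRegressor

# N trees are grown on random subsets of the training data; the prediction
# for a new point is the average of the individual tree predictions.
rf_reg = RandomForestRegressor(n_estimators=200, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)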

Elastic Net Regression :


Elastic Net is a supervised machine learning algorithm that combines the Lasso and Ridge
approaches in order to find coefficients that minimize the sum of squared errors. The key
idea of the algorithm is to find a trade-off between the Lasso (L1) and Ridge (L2) penalties
in order to obtain the best possible regularization [3].
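A minimal Scikit-learn sketch, with illustrative hyperparameter values, could be:

from sklearn.linear_model import ElasticNet

# l1_ratio balances the Lasso (L1) and Ridge (L2) penalties, while alpha sets
# the overall regularisation strength; the values shown are illustrative.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)
y_pred = enet.predict(X_test)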
Gradient Boosting :
Gradient Boosting is an ensemble technique whose aim is to obtain a strong model by
sequentially combining multiple weak models. The intuition behind it is to minimize the
error by having each new model fit the residual errors left by the previous models [4].
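A minimal Scikit-learn sketch of this model, with illustrative hyperparameters, could be:

from sklearn.ensemble import GradientBoostingRegressor

# Each new shallow tree is fitted to the residual errors of the current ensemble.
gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)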

Stacking :
Stacking, or stacked generalisation, was introduced by Wolpert. In essence, stacking
makes predictions by using a meta-model trained on top of a pool of base models: the base
models are first trained on the training data and asked to give their predictions, then a
separate meta-model is trained to use the outputs of the base models to produce the final
prediction of the whole process [5].
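A minimal Scikit-learn sketch of such a stack, using the models presented above as base models and an Elastic Net as an illustrative (not necessarily the actual) meta-model, could be:

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR

# The base models give out-of-fold predictions; the meta-model then learns
# to combine these predictions into the final one.
base_models = [
    ("svr", SVR()),
    ("rf", RandomForestRegressor(random_state=42)),
    ("gbr", GradientBoostingRegressor(random_state=42)),
]
stack = StackingRegressor(estimators=base_models, final_estimator=ElasticNet())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)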

Technologies and Tools :


In the process of setting up our model we used several technologies and tools to
make our work easier. Below we outline each of them and the role they played in our
project.
Python Libraries
Scikit-learn :
Scikit-learn is a popular Python library generally used for machine learning tasks. It
allowed us to carry out the preprocessing tasks, to use the machine learning regression
models, and to compute the metrics used to evaluate them.

Pandas :
Pandas is intended to facilitate the tasks related to data manipulation and analysis. The
library helped us in the cleaning of our dataset as well as in the exploration and
preprocessing of our data.
Numpy
NumPy is a library aimed at manipulating matrices and multidimensional arrays, and it
provides mathematical functions operating on these arrays.

Feature Engine:
Feature-engine is a library of great utility in the data preprocessing task, as it provides
several transformers engineered to make things easier for us.

GitHub :
GitHub is a hosting platform for Git, the version control software created in 2005 by Linus
Torvalds, the creator of Linux, and it has become essential in the world of software
development. In our project it helped us to manage the different versions of our code.
Azure Machine Learning :
Azure Machine Learning is a cloud service developed by Microsoft to help manage the
lifecycle of machine learning projects. It allows us to train machine learning models and
facilitates their deployment as well as the implementation of pipelines.
In our project we used it mainly to train our model in the cloud and thus facilitate
access to our code for some collaborators.
FastAPI :
FastAPI is a Python web framework that allows the generation of APIs and the creation of
web applications with the Jinja2 template engine, among other features. Unlike other Python
frameworks such as Flask, it is built on the ASGI (Asynchronous Server Gateway Interface)
standard and runs on an asynchronous server. In our project we used it to consume our
model and generate a web application with Jinja2.
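A minimal FastAPI sketch of such an endpoint, with a hypothetical model file name and route, could look like this:

import joblib
from fastapi import FastAPI

app = FastAPI()

# The model file name and input format are assumptions; the trained pipeline
# would have been serialised beforehand with joblib.
model = joblib.load("npk_model.joblib")

@app.post("/predict")
def predict(features: dict):
    # Validation and feature ordering are simplified for illustration.
    values = [list(features.values())]
    return {"npk_demand": float(model.predict(values)[0])}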
[0] https://www.seldon.io/machine-learning-regression-explained
[1] https://towardsdatascience.com/unlocking-the-true-power-of-support-vector-regression-847fd123a4a0
[2] https://levelup.gitconnected.com/random-forest-regression-209c0f354c84
[3] https://medium.com/mlearning-ai/elasticnet-regression-fundamentals-and-modeling-in-python-8668f3c2e39e
[4] https://medium.com/analytics-vidhya/introduction-to-the-gradient-boosting-algorithm-c25c653f826b
