
Contents

List of Acronyms

List of Figures

List of Tables

List of Algorithms

Abstract

1 Introduction
1.1 Context and motivation
1.2 Objectives
1.3 Related Works
1.4 Report Structure

2 Theoretical Background
2.1 Hyperspectral Image
2.2 Chlorophyll
2.3 Nitrogen, Phosphorus, Potassium concentration
2.4 Regression Analysis
2.5 Principal Component Analysis
2.6 Machine Learning
2.6.1 Ridge
2.6.2 Lasso
2.6.3 Decision Tree Regression
2.6.4 Random Forest Regression
2.6.5 Support Vector Regression (SVR)
2.6.6 Boosting Algorithm
2.6.6.1 Adaptive Boosting (AdaBoost)
2.6.6.2 Extreme Gradient Boosting (XGBoost)
2.6.6.3 Categorical Boosting (CatBoost)
2.6.6.4 Light Gradient-Boost Machine (LGBM)
2.7 Deep Learning
2.7.1 Convolutional Neural Network
2.7.2 VGG16
2.7.3 Resnet50
2.7.4 DenseNet121
2.7.5 MobileNetV2
2.7.6 EfficientNetB0

3 Material and Methodology
3.1 Material
3.1.1 Experimental Site and Experimental Design
3.1.2 Data Acquisition
3.2 Methodology
3.2.1 Overall Framework
3.2.2 Preprocessing Dataset
3.2.2.1 Determine ROI position
3.2.2.2 Extracting the ROIs in hyperspectral image
3.2.3 Model Configuration and Training
3.2.3.1 Machine Learning
3.2.3.2 Deep Learning
3.2.4 Model Evaluation
3.2.4.1 RMSE
3.2.4.2 MAPE
3.2.4.3 Coefficient of determination
3.3 Tools and Library

4 Results and Discussion
4.1 Chlorophyll Model Prediction and Comparison
4.2 N Concentration Model Prediction and Comparison
4.3 P Concentration Model Prediction and Comparison
4.4 K Concentration Model Prediction and Comparison

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

References

A Hyperparameters for ML learning models


List of Acronyms

AdaBoost Adaptive Boosting.

CatBoost Categorical Boosting.


CV Coefficient of Variation.

DL Deep Learning.

HPO Hyper-parameter Optimization.


HSI Hyperspectral Image.

K Potassium.

LGBM Light Gradient-Boost Machine.

MAPE Mean Absolute Percentage Error.


ML Machine Learning.
MSE Mean Square Error.

N Nitrogen.

P Phosphorus.
PCA Principal Component Analysis.

RMSE Root Mean Square Error.


ROI Region Of Interest.
RSS Residual Sum of Squares.

SVR Support Vector Regression.


UAV Unmanned Aerial Vehicle.

XGBoost Extreme Gradient Boosting.


List of Figures

2.1 Images record a reflectance spectrum for each pixel in the image [15]
2.2 The scatter plot shows the relationship between the dependent variable and independent variable [10]
2.3 Illustration of decision tree regression [11]
2.4 The schematic of random forest [14]
2.5 Working of Boosting Algorithms
2.6 The schematic of XGBoost [14]
2.7 Symmetric Tree Architecture of CatBoost
2.8 Leaf Wise Tree Grow Architecture of LGBM
2.9 The typical architecture of CNN
2.10 The architecture of VGG16
2.11 Residual Learning
2.12 Resnet-50 Architecture
2.13 DenseNet121 architecture [12]
2.14 MobilenetV2 architecture
2.15 EfficientNetB0 architecture [2]

3.1 The spatial distribution of the plot design
3.2 The proposed workflow for developing the model
3.3 The ROI's spatial distribution of 3 cm per pixel image after extracting RGB channels
3.4 The number of components needed to explain variance
3.5 Illustration of our machine learning workflow
3.6 Illustration of our deep learning workflow

List of Tables

3.1 Descriptive nutrients' statistics

4.1 Comparison of Learning Models Performance in Chlorophyll Prediction
4.2 Comparison of Learning Models Performance in N Concentration Prediction
4.3 Comparison of Learning Models Performance in P Concentration Prediction
4.4 Comparison of Machine Learning Models in K Concentration Prediction

A.1 Table of some good chlorophyll machine learning models with their hyper-parameters
A.2 Table of some good N concentration machine learning models with their hyper-parameters
A.3 Table of some good P concentration machine learning models with their hyper-parameters
A.4 Table of some good K concentration machine learning models with their hyper-parameters

List of Algorithms

2.1 AdaBoost pseudocode
2.2 Simplification of XGBoost Regression pseudocode

3.1 The proposed pseudocode for getting all ROI position of an HSI

Abstract

Chlorophyll content is one of the most essential elements in the photosynthesis process, and it can be affected by sowing, weather, watering, and other factors. Nitrogen (N) is essential for rice leaf growth and helps produce chlorophyll, Phosphorus (P) is important for root growth and seed development, and Potassium (K) helps the plant resist diseases and stresses. Estimating these nutrients therefore plays an important role in assessing the nutritional quality of rice leaves, which helps farmers adjust the way they take care of their rice plants. Hyperspectral Images (HSI) captured from an Unmanned Aerial Vehicle (UAV) provide useful information on nutrient concentration to optimize agricultural practices.

This work focuses on Machine Learning (ML) and Deep Learning (DL) models to estimate chlorophyll and N, P, K concentrations from hyperspectral images captured by a UAV. Different ML and DL models were applied, their performance compared, and the best models selected using several evaluation metrics, including R2, Mean Absolute Percentage Error (MAPE), and Root Mean Square Error (RMSE). These models can help farmers solve the problem of estimating nutrient concentration.

Keywords: Chlorophyll, N concentration, P concentration, K concentration, machine learning, deep learning, regression, hyperspectral image

Chapter 1

Introduction

1.1 Context and motivation

Precision Agriculture is a scientific field that uses technologies such as drones, sensors, weather stations, and satellite imagery. From the data collected by these devices, farmers receive information about the environment, weather changes, and the nutrient concentration in the leaves, so that they can optimize fertilization, irrigation, and sowing operations to increase profitability and efficiency.

There are several ways to measure the nutrient content of rice leaves: it is commonly measured by extracting chlorophyll in a solvent followed by in vitro measurement in a spectrophotometer, or by non-destructive, in situ optical techniques. These procedures are often time-consuming, laborious, economically inefficient, and not scalable. By observing the rice field with hyperspectral cameras, where each pixel has an associated continuous spectrum, we obtain information related to nutrient concentration, which makes it possible to detect early symptoms of diseases and to assess water, soil quality, and crop health. However, most existing studies have relied on weather data, water quality, optical flow analysis, or multispectral imaging to predict nutrients. There is a lack of research on using hyperspectral images to predict rice leaf nutrients, especially in regions with a climate similar to Vietnam; most previous works have focused on other crops, such as oil palm, grapes, broccoli, and potatoes.

To address these challenges, I applied state-of-the-art supervised learning techniques to estimate the concentration of nutrients in Vietnamese rice leaves using Hyperspectral Images (HSI). Unfortunately, there are no public datasets available for this task. However, Dr. Tran Giang Son and his team built a dataset by fertilizing a rice field in Phu Tho: they divided the field into several replicates fertilized with different nutrients, then captured hyperspectral images with an Unmanned Aerial Vehicle. These images were used as the dataset for the regression models, so that the nutrient content at each location can be estimated.

1.2 Objectives

The objective of this report is to create different models to predict nutrient concentrations from Hyperspectral Images (HSI). This work studied and implemented different machine learning models (such as XGBoost, LGBM, and CatBoost) and deep learning models, then used different evaluation methods for regression problems to analyze which model performs best.

1.3 Related Works

A similar work was conducted by Songtao Ban et al.: they acquired images from two regions, Ningxia and Shanghai, using multispectral and hyperspectral cameras, and analyzed vegetation indices that were highly correlated with rice leaf chlorophyll content (LCC). They achieved R2 = 0.9, RMSE = 1.63, and MAPE = 4.13% with SVR models on the Shanghai calibration dataset using eight vegetation indices [4]. In another study, Xiaokai Chen et al. analyzed the canopy using spectral transformations and machine learning methods [5]. Bogdan Ruszczak performed unbiased estimation of chlorophyll from hyperspectral images: for the SPAD parameter, the best model reached R2 = 0.818, MAPE = 7.2%, and MSE = 9.583; for FvFm, Ridge reached R2 = 0.727, MAPE = 3.6%, and MSE = 0.001, and Ridge was also the best model for the PI and RWC parameters [13].

Sulaymon Eshkabilov et al. estimated nutrient concentrations in lettuce from HSI data, achieving a mean R2 of 0.911 for hydroponics and 0.877 for NFT-grown cultivars [6]. There are not many related works on estimating phosphorus concentration, though. Megan Io Ariadne Abenina et al. predicted potassium in peach leaves using HSI with several pretreatment methods and PLS regression; original-PLS reached R2 = 0.3479, while SNV-PLS achieved the highest R2 of 0.8446 with RMSE = 0.2917 [1].

1.4 Report Structure

• Chapter 1: Introduces the general context and objectives of the thesis.

• Chapter 2: Explains the theoretical background of the methods used.

• Chapter 3: Gives a detailed explanation of the material and methodology.

• Chapter 4: Presents and discusses the results.

• Chapter 5: Draws conclusions from the results and proposes future work.
Chapter 2

Theoretical Background

2.1 Hyperspectral Image

Hyperspectral remote sensing is the activity of collecting images in many narrow spectral bands. A Hyperspectral Image (HSI) contains the information obtained by collecting and processing the electromagnetic spectrum so as to obtain the spectrum of each pixel in the image (usually capturing light from 400 nm to 2500 nm, including the near infrared (NIR) and short-wave infrared (SWIR)). Hyperspectral imaging uses a device called an imaging spectrometer (a hyperspectral camera or hyperspectral sensor) to collect spectral information. A hyperspectral camera captures the light of an area and splits it into individual wavelengths or spectral bands. It offers a two-dimensional representation of the area while concurrently storing the spectral data of each pixel. The result is a hyperspectral image in which each pixel represents a unique spectrum. Since the materials and compounds at a pixel interact with light in distinct ways, they have unique spectral signatures that can be used to identify them.

Hyperspectral imaging analyzes the spectral response to detect and classify features or objects in images based on their unique spectra. It provides both spatial and spectral information about the object's physical and chemical properties. The spectral information allows the identification and classification of the distribution of materials, such as P concentration, K concentration, etc., or areal separation. Hyperspectral imaging thus helps answer the questions "what" (based on the spectrum), "where" (based on location), and "when".

A hyperspectral image contains multiple spectra, forming a massive hyperspectral data cube comprising position, wavelength, and time-related information. Compared with multispectral imaging, which acquires only a relatively small number of bands (fewer than 10) with broad spectral bands (about 100 nm bandwidth), hyperspectral imaging provides more information for more accurate analysis, offering hundreds of bands with narrow wavelength intervals between them [16].


Figure 2.1 – Images record a reflectance spectrum for each pixel in the image[15]

Hyperspectral imaging can be used for various applications, such as environmental monitoring (monitoring changes in land use, vegetation health, and water quality) and precision agriculture (judging crop health and monitoring soil moisture and nutrient concentration to optimize crop management and crop yields in practice) [7].

2.2 Chlorophyll

Chlorophyll is a key pigment in the photosynthetic process and is used to analyze vegetation stress, nutrient cycling, productivity, growth stages, and diseases. It is a green pigment found in photosynthetic bacteria, algae, and plants. In plants, chlorophyll drives photosynthesis by absorbing light energy and converting it into chemical energy. Chlorophyll is called the central pigment of the photosynthesis reaction because it can accept the light energy absorbed by other pigments. Chlorophyll content is one indicator of photosynthetic activity and also plays a role in plant organogenesis [18].

2.3 Nitrogen, Phosphorus, Potassium concentration

Nitrogen (N), Phosphorus (P), and Potassium (K) are the three primary nutrients in commercial fertilizers. All three of these fundamental nutrients play an important role in plant growth and development [8].

Nitrogen is necessary for keeping plants healthy. It is essential in the constitution of proteins, which appear in the tissue of most living things. Nitrogen affects organic structure and physiology; therefore, a lack of nitrogen may affect the structure and function of photosynthesis [8].

Phosphorus is related to the plant's ability to use and store energy, including in the process of photosynthesis, and helps the plant grow normally. When P is insufficient, leaf growth decreases and photosynthesis and carbon metabolism are altered [8].

Potassium is the most abundant cellular cation and has a key role in cellular activities such as charge balance and membrane protein transport [8]. It therefore helps plants resist disease, and it increases crop yields and plant quality. K also protects the plant from cold or dry weather, strengthens the roots, and prevents wilting.

2.4 Regression Analysis

Regression analysis is one of the typical tasks in machine learning and deep learning. The task is to predict a numeric value (dependent variable or outcome), such as the price of a car, given a set of features (independent variables) such as mileage, age, brand, etc. In the context of this thesis, the goal is to predict the nutrient concentration (dependent variable) based on the pixel values of each band (independent variables).

Figure 2.2 – The scatter plot shows the relationship between the dependent variable and independent variable [10]

To quantify the error when the system makes predictions in regression, typical performance measures are the mean square error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and R squared (R2).

2.5 Principal Component Analysis

Principal Component Analysis (PCA) is a technique used to highlight the importance of new features. When there are hundreds of features, the redundancy among them becomes suboptimal. With PCA, we can reshape the data into a new set of variables (principal components) that are uncorrelated and ordered so that the first few retain most of the variation present in the original variables.

First, we standardize the dataset by computing the mean and standard deviation of each feature, then apply the following formula to each value of each feature:

x_new = (x − µ) / σ        (2.1)

Next, we calculate the covariance matrix. Covariance and variance measure how "spread" a set of points is around its center of mass (mean). Covariance is measured between two dimensions to see whether there is a relationship between them:

Cov(X, Y) = (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)        (2.2)

Where:

• Cov(X, Y) is the covariance between the variables X and Y

• x_i and y_i are the members of the variables X and Y

• x̄ and ȳ are the means of the variables X and Y

• n is the number of members

Using the above formula, we construct the covariance matrix. For example, for three variables X, Y, and Z:

        | Cov(X, X)  Cov(X, Y)  Cov(X, Z) |
    A = | Cov(Y, X)  Cov(Y, Y)  Cov(Y, Z) |        (2.3)
        | Cov(Z, X)  Cov(Z, Y)  Cov(Z, Z) |

The diagonal is the variances of X, Y and Z. This matrix is symmetrical about the diagonal. Next,
we calculate eigenvector and eigenvalue. Eigenvectors are principal components and Eigenvalues
are the percentage of information (variance) explained. The formula for finding Eigenvectors and
Eigenvalues is:

Av = λv (2.4)

• A is the matrix

• v is Eigenvector

• λ is Eigenvalue

Rearrange the equation:

Av − λv = 0

(A − λI)v = 0

Since v is a non-zero vector, the equation can hold only when det(A − λI) = 0. Solving this gives the values of λ; substituting each λ back gives the corresponding eigenvector v.

After finding the eigenvalues and their corresponding eigenvectors, we sort them from highest to lowest eigenvalue, pick the top k eigenvalues, and form a matrix of their eigenvectors. Finally, we transform the original data:

Transformed Data = Feature Matrix × top k Eigenvectors (2.5)
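
As a minimal sketch of the steps above (standardize, covariance, eigen-decomposition, projection), the procedure could be implemented in Python with numpy as follows; the array shapes are hypothetical and this is not the exact implementation used later in this work, which relies on scikit-learn:

import numpy as np

def pca_sketch(X, k):
    # 1. Standardize each feature (equation 2.1)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features (equations 2.2-2.3)
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition (equation 2.4); eigh is suited to symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort the components by eigenvalue, highest first
    order = np.argsort(eigenvalues)[::-1]
    top_k = eigenvectors[:, order[:k]]
    # 5. Project the data onto the top k components (equation 2.5)
    return X_std @ top_k

X = np.random.rand(342, 122)        # hypothetical spectra: 342 samples, 122 bands
X_reduced = pca_sketch(X, k=60)     # shape (342, 60)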

2.6 Machine Learning

2.6.1 Ridge

Ridge regression is similar to linear regression but adds a penalty term, also called the L2 penalty, to provide regularization. This term shrinks the weights of the model towards zero to keep the model from over-fitting the data. The cost function of Ridge regression can be written as:

cost(w) = (1/2n) Σ_{i=1}^{n} (y_i − ŷ_i)² + λ Σ_{j=1}^{D} w_j²        (2.6)

However, Ridge regression does not reduce the number of variables, because it never drives a coefficient exactly to zero; it only shrinks it. Therefore, this model is not suited for feature reduction, but it is suitable when all features are important.

2.6.2 Lasso

Lasso regression is similar to Ridge regression and is also called L1 regularization. It likewise shrinks the coefficients towards zero, but the penalty uses the magnitude (absolute value) of the coefficients:

cost(w) = (1/2n) Σ_{i=1}^{n} (y_i − ŷ_i)² + λ Σ_{j=1}^{D} |w_j|        (2.7)

If two or more variables are highly collinear, Lasso selects one of them at random, which makes the model harder to interpret. It is, however, suitable when some features are irrelevant or redundant.
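
A short scikit-learn sketch of Ridge and Lasso on the spectral features is given below; the data and the alpha values (the λ in equations 2.6 and 2.7) are placeholders, not the tuned hyperparameters used in this work:

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

X = np.random.rand(342, 122)    # hypothetical band values
y = np.random.rand(342)         # hypothetical chlorophyll content

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)    # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.01).fit(X_train, y_train)   # L1 penalty: can zero out coefficients

print("Ridge R2 on the test set:", ridge.score(X_test, y_test))
print("Lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))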

2.6.3 Decision Tree Regression

A regression tree is a kind of decision tree used for regression problems, i.e. to predict continuous values. There are two steps in building a regression tree:

— Partition the predictor space (the set of possible feature values) into separate, non-overlapping regions.

— For each observation that falls into a region, predict the mean of the response values of the training samples in that region.

For a decision tree regressor, the Residual Sum of Squares (RSS) measures how much the predictions deviate from the original target values, and the goal is to divide the space in a way that minimizes the RSS:

RSS = Σ_{i=1}^{n} (y_i − ŷ_i)²        (2.8)

There are several ways to avoid overfitting in a regression tree; the simplest is to split a node only when it contains more than some minimum number of observations.

Figure 2.3 – Illustration of decision tree regression [11]

2.6.4 Random Forest Regression

Random Forest is an ensemble technique used in both regression and classification tasks; it combines multiple decision trees with a technique called Bootstrap Aggregation (bagging).

Implementation of bagging:

Step 1: Multiple subsets are created from the original dataset with an equal number of tuples, selecting observations with replacement.

Step 2: A base model is created on each of these subsets.

Step 3: Each model is learned independently on its own training subset, in parallel with the others.

Step 4: The final prediction is determined by combining the predictions from all the models.

Basically, Random Forest Regression uses multiple decision trees as base learners. It randomly performs row sampling and feature sampling from the dataset to form a training set for every tree, and combines the trees' outputs to determine the final prediction rather than relying on an individual decision tree.

Figure 2.4 – The schematic of random forest [14]

Random Forest has several advantages: it is less sensitive to the training data than a single decision tree, it is more accurate, and it handles large datasets, missing data, outliers, and noisy features effectively.
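
A brief illustrative sketch of Random Forest regression with scikit-learn follows; the number of trees and the data are examples, not the values selected by hyperparameter tuning:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X = np.random.rand(342, 122)    # hypothetical band values
y = np.random.rand(342)         # hypothetical N concentration

forest = RandomForestRegressor(
    n_estimators=200,      # number of bagged trees
    max_features="sqrt",   # feature sampling at each split
    random_state=42,
)
scores = cross_val_score(forest, X, y, cv=5, scoring="r2")
print("Mean 5-fold R2:", scores.mean())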

2.6.5 Support Vector Regression (SVR)

Support Vector Machines (SVM) have been used in a wide range of fields. In machine learning, SVR is the regression variant of the SVM. The strategy is to minimize the error while finding a hyperplane that maximizes the margin.

Given training points (x_1, y_1), ..., (x_N, y_N), where x_i ∈ Rⁿ is the feature vector and y_i ∈ R is the output, and parameters C > 0 and ε > 0, SVR solves:

min (1/2) ||w||² + C Σ_{i=1}^{N} (ξ_i + ξ_i*)        (2.9)

such that

• y_i − w·x_i − b ≤ ε + ξ_i

• w·x_i + b − y_i ≤ ε + ξ_i*

• ξ_i, ξ_i* ≥ 0

The prediction functions obtained from solving this problem are as follows.

For linear SVR:

y = Σ_{i=1}^{N} (α_i − α_i*) ⟨x_i, x⟩ + b        (2.10)

For non-linear SVR, the data are mapped into a higher-dimensional space:

y = Σ_{i=1}^{N} (α_i − α_i*) K(x_i, x) + b        (2.11)

where ⟨x_i, x⟩ and K(x_i, x) ≡ ⟨ϕ(x_i), ϕ(x)⟩ are kernel functions. The kernel transforms the data into a higher-dimensional feature space so that linear separation becomes possible.
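
The following is a small illustration of non-linear SVR with an RBF kernel in scikit-learn; C, epsilon, and the data are example values, not the hyperparameters tuned in this work:

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.rand(342, 122)    # hypothetical band values
y = np.random.rand(342)         # hypothetical K concentration

model = make_pipeline(
    StandardScaler(),                        # z-score the bands first
    SVR(kernel="rbf", C=10.0, epsilon=0.1),  # non-linear SVR (equation 2.11)
)
model.fit(X, y)
print(model.predict(X[:3]))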

2.6.6 Boosting Algorithm

Boosting (hypothesis boosting) is an ensemble modeling technique that combines a set of weak learners (base models or weak regressors) into a strong learner. A weak learner has poor predictive performance on its own, whereas a strong learner is a regressor with high performance. Boosting regressors work by iteratively training weak learners and adjusting sample weights to emphasize the examples that are difficult to predict. At each iteration, a new weak learner is trained to correct the mistakes made by the previous learners, and the final prediction is a weighted combination of the predictions of all the weak learners.

During training, a boosting regressor assigns weights to the training examples based on their importance: previously mis-predicted examples, or those with higher error, receive higher weights, so that subsequent weak learners focus more on them. This adaptive weighting scheme allows the boosting regressor to give more attention to challenging examples and potentially improve overall performance.


Figure 2.5 – Working of Boosting Algorithms


Gradient descent optimization is used to minimize the loss function in boosting algorithms. The optimization process aims to find the best weights and parameters for each weak learner so as to minimize the overall loss.

2.6.6.1 Adaptive Boosting (AdaBoost)

AdaBoost (short for Adaptive Boosting) is a boosting algorithm that works by first fitting a weak regressor (typically a decision tree) on the training dataset; the weak learner is trained to minimize the weighted error, where the weights reflect the importance of each training example. After training the weak learner, its performance is evaluated by calculating the weighted error, which measures how well it predicts the target values. The algorithm then "re-weights" the training samples based on this performance: samples that were predicted incorrectly are assigned higher weights so that the next weak learner focuses on them in subsequent iterations, while correctly predicted examples are assigned lower weights. This process is repeated for a predefined number of iterations. In each iteration, the algorithm also assigns a weight to the weak learner itself based on its performance; learners with better performance receive a higher weight in the final prediction.

To make a prediction on new samples, the AdaBoost regressor combines the predictions of all weak learners, weighted by their individual weights; the final prediction is the weighted sum of the predictions from each weak learner.

The pseudocode below represents how AdaBoost works:

Input : X - training features, y - target values, num_estimators - number of weak learners (iterations)
Output : Ensemble Model

Initialize the sample weights w_i = 1/N, where N is the number of training examples;
for t = 1 to num_estimators do
    Train a weak learner h_t on the training data using the weights w_i;
    Compute the weighted error e_t:
        e_t = ( Σ_{i=1}^{N} w_i |y_i − h_t(X_i)| ) / ( Σ_{i=1}^{N} w_i ), for each training example (X_i, y_i);
    Compute the weak learner weight α_t:
        α_t = (1/2) ln((1 − e_t) / e_t)
    Update the sample weights, for each training example (X_i, y_i):
        w_i = w_i · exp(−α_t (y_i − h_t(X_i)))
    Normalize the weights:
        w_i = w_i / Σ_{i=1}^{N} w_i
end
return Ensemble Model = Σ_{t=1}^{num_estimators} α_t × h_t(X) over all weak learners.

Algorithm 2.1 – AdaBoost pseudocode

AdaBoost has many advantages: it is fast, simple, and easy to use. However, it can be vulnerable to noise and may overfit the data.
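
A hedged sketch of AdaBoost regression with scikit-learn, using shallow decision trees as the weak learners, is shown below; the hyperparameters are illustrative only:

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(342, 122)    # hypothetical band values
y = np.random.rand(342)         # hypothetical P concentration

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),  # weak learner h_t (named base_estimator in older scikit-learn)
    n_estimators=100,                              # number of boosting rounds
    learning_rate=0.1,
    random_state=42,
)
ada.fit(X, y)
print(ada.predict(X[:3]))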

2.6.6.2 Extreme Gradient Boosting (XGBoost)

XGBoost is a popular library implementing the gradient boosted trees algorithm, which tries to accurately predict a target by combining the estimates of a set of simpler, weaker models.

Input : X - training features, y - target values, num_estimators - number of weak learners (iterations), learning_rate - learning rate to control the contribution of each weak learner, max_depth - maximum depth of each weak learner (decision tree)
Output : Ensemble Model

Initialize F_0(x) = initial_prediction = mean(y)
for t = 1 to num_estimators do
    Compute the first derivative (gradient) and second derivative (hessian) of the loss function for each training example (x_i, y_i):
        g_i = ∂L(y_i, F(x_i)) / ∂F(x_i)
        h_i = ∂²L(y_i, F(x_i)) / ∂F(x_i)²
    where L(y_i, F(x_i)) is the MSE loss function.
    Train a weak learner m_t on the first derivatives g_i and second derivatives h_i:
    - g_i indicates the direction and magnitude of the change in the loss function needed to minimize it when the ensemble predictions change.
    - h_i measures the curvature of the loss function and provides additional information about the rate of change of the gradients; it helps adjust the contribution of the weak learner based on that curvature.
    - The parameter max_depth controls the depth of the weak learner. Fit the weak learner to the training data and obtain the prediction m_t(x).
    Update the ensemble model:
        F_t(x) = F_{t−1}(x) + learning_rate × m_t(x)
end
Output: Ensemble Model = F_t(x), for t = num_estimators.

Algorithm 2.2 – Simplification of XGBoost Regression pseudocode

How XGBoost works can be illustrated in the figure below:

Figure 2.6 – The schematic of XGBoost [14]
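
A minimal sketch of fitting an XGBoost regressor is given below; the objective corresponds to the MSE loss in the pseudocode, and the hyperparameters are examples rather than the values tuned with Optuna later in this work:

import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(342, 122)    # hypothetical band values
y = np.random.rand(342)         # hypothetical chlorophyll content

xgb = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    objective="reg:squarederror",   # squared-error (MSE) loss
)
xgb.fit(X, y)
print(xgb.predict(X[:3]))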

2.6.6.3 Categorical Boosting (CatBoost)

CatBoost is a relatively new open-source machine learning algorithm developed in 2017 by Yandex. One of CatBoost's main advantages is its ability to integrate a wide variety of data types (such as images, audio, etc.). Its strengths include its handling of the data, requiring minimal categorical feature transformation (unlike other machine learning algorithms, which cannot handle non-numeric values), and its gradient-based optimization and regularization techniques. Combining these techniques can build a strong regressor.

CatBoost is built on decision trees and gradient boosting. Specifically, CatBoost grows symmetric trees, which means that the trees are grown by imposing the rule that all nodes at the same level test the same predictor with the same condition; hence the index of a leaf can be calculated with bit-wise operations. The symmetric tree makes for a simple fitting scheme and efficient use of CPUs, while the tree structure acts as regularization that helps find a good solution and avoid overfitting.

Figure 2.7 – Symmetric Tree Architecture of CatBoost

Here is the overview of how CatBoost Regressor works:

• Data preparation: CatBoost can handle both numerical and categorical features directly without encoding. For categorical features, it uses an "Ordered Target Encoding" technique, which utilizes the target variable's statistical information.

• Model initialization and prediction: Define the model hyperparameters and an empty ensemble of decision trees, then use the mean of the target variable as the initial prediction.

• Gradient calculation: Compute the negative gradients (residuals) between the true target values and the current predictions. These gradients represent the direction and magnitude of the errors.

• Decision tree construction: Trees are built level by level using the symmetric tree growth strategy, similarly to other gradient boosting algorithms.

• Gradient boosting: Update the predictions by adding the predictions from the newly created decision tree, multiplied by a learning rate that controls the contribution of each tree.

• Regularization: CatBoost uses various regularization techniques, such as L1 and L2 regularization on leaf weights or gradient-based random feature selection.

• Repeat the gradient, tree-construction, and boosting steps for a specified number of iterations. In each iteration, a new decision tree is built to correct the errors made by the ensemble so far.

• The final prediction is obtained as a weighted sum of the predictions from all decision trees in the ensemble.
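
An illustrative CatBoost regression fit could look as follows; the depth of the symmetric trees and the number of iterations are example values, not those selected by hyperparameter tuning in this thesis:

import numpy as np
from catboost import CatBoostRegressor

X = np.random.rand(342, 122)    # hypothetical band values
y = np.random.rand(342)         # hypothetical N concentration

cat = CatBoostRegressor(
    iterations=500,
    depth=6,                 # depth of the symmetric (oblivious) trees
    learning_rate=0.05,
    loss_function="RMSE",
    verbose=False,
)
cat.fit(X, y)
print(cat.predict(X[:3]))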

2.6.6.4 Light Gradient-Boost Machine (LGBM)

LGBM, originally developed by Microsoft, is another gradient-boosting machine learning algorithm based on decision trees; it uses two techniques: Gradient-based One-Side Sampling and Exclusive Feature Bundling. LGBM is similar to CatBoost in that both are gradient boosting algorithms, but there are some differences in implementation and features:

• LGBM requires categorical features to be preprocessed and encoded as numerical values before training, whereas CatBoost has built-in support for handling categorical features using its Ordered Target Encoding algorithm.

• LGBM provides options to handle missing values, whereas CatBoost can automatically handle missing values in both categorical and numerical features.

• LGBM uses a leaf-wise tree growth strategy, growing the tree by expanding the leaf with the highest loss reduction at each step, and employs techniques like histogram-based binning; CatBoost uses a symmetric tree growth strategy and builds trees level by level.

Figure 2.8 – Leaf Wise Tree Grow Architecture of LGBM
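
A short LightGBM sketch is shown below; num_leaves caps the leaf-wise growth described above, and the values are illustrative only:

import numpy as np
from lightgbm import LGBMRegressor

X = np.random.rand(342, 122)    # hypothetical band values
y = np.random.rand(342)         # hypothetical P concentration

lgbm = LGBMRegressor(
    n_estimators=300,
    num_leaves=31,       # limit on leaves for the leaf-wise growth strategy
    learning_rate=0.05,
)
lgbm.fit(X, y)
print(lgbm.predict(X[:3]))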

2.7 Deep Learning

2.7.1 Convolutional Neural Network

A Convolutional Neural Network (CNN/ConvNet) is a type of deep learning architecture. The main purpose of a ConvNet is to reduce the images into a form that is easier to process without losing the important features required for good prediction. CNNs use fewer parameters (weights) than a fully connected network, they are designed to be invariant to object position and distortion of the scene, and they can automatically learn and generalize features from the input domain.

There are three types of layers in a convolutional neural network:

• Convolutional layers: the element involved in the convolution is called the kernel (filter). The kernel shifts across the image with a given stride length until it has parsed the complete width. At each position, it performs an element-wise multiplication between the kernel and the corresponding portion of the image. The objective of the convolution operation is to extract features: the first convolutional layers are responsible for extracting low-level features, and adding more layers lets the architecture learn high-level features as well.

• Pooling layers: used to reduce the spatial size of the convolved features, i.e. to reduce the dimensionality. Common types of pooling are max pooling and average pooling.

• Fully connected layers: the output is flattened and fed into a neural network.

Figure 2.9 – The typical architecture of CNN
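
As a toy illustration of the three layer types, a small Keras CNN for regression might look as follows; the input shape matches the 32×32×122 patches used later in this work, but the layer sizes are arbitrary:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 122)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),   # convolutional layer
    layers.MaxPooling2D(pool_size=2),                      # pooling layer
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                   # fully connected layer
    layers.Dense(1, activation="linear"),                  # regression output
])
model.compile(optimizer="adam", loss="mse")
model.summary()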

2.7.2 VGG16

VGGNet was developed by the Visual Geometry Group at the University of Oxford. VGG16 is simple and classical but can be considered one of the best computer vision models. It has 2 or 3 convolutional layers followed by a pooling layer, then again 2 or 3 convolutional layers and a pooling layer, and so on (up to 16 or 19 weight layers depending on the variant), and finally a dense network with hidden layers and an output layer.

Figure 2.10 – The architecture of VGG16

The limitations of VGG16 are that it can be very slow to train and has a large number of parameters.

2.7.3 Resnet50

ResNet50, introduced by Microsoft Research in 2015, is another kind of CNN; it has 50 layers, and ResNet is short for residual networks. This model showed that computer vision models can get deeper while using fewer parameters. The key to training ResNet is the skip connection, which skips over some layers of the model. When a neural network is meant to model a target function h(x), adding the input x to the output of the network (the skip connection) forces it to model f(x) = h(x) − x instead of h(x). This is residual learning.

Figure 2.11 – Residual Learning

The ResNet-50 architecture can be divided into 6 parts: input pre-processing, Cfg[0] blocks, Cfg[1] blocks, Cfg[2] blocks, Cfg[3] blocks, and the fully connected layer, as the figure below shows:

Figure 2.12 – Resnet-50 Architecture
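
A minimal Keras sketch of a residual (skip) connection is given below: the block learns f(x) and its output is f(x) + x; the shapes and filter counts are arbitrary examples:

from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                                      # skip connection
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)                  # f(x)
    y = layers.Add()([y, shortcut])                                   # f(x) + x
    return layers.Activation("relu")(y)

inputs = keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, filters=64)
model = keras.Model(inputs, outputs)
model.summary()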

2.7.4 DenseNet121

When the number of CNN layers gets deeper, the vanishing gradient problem arises: certain information gets lost or vanishes, which prevents the network from training effectively. DenseNet addresses this with the following components:

• Connectivity: the feature maps of all previous layers are not summed but concatenated and used as inputs. Hence, the l-th layer receives the feature maps of all preceding layers x_0, ..., x_{l−1} as input: x_l = H_l([x_0, x_1, ..., x_{l−1}]), where [x_0, x_1, ..., x_{l−1}] is the concatenation of the feature maps. The multiple inputs of H_l are concatenated into a single tensor to ease implementation.

• DenseBlock: to enable reducing the size of the feature maps through dimensionality reduction, DenseNet is divided into DenseBlocks, within which the feature-map dimensions remain constant so that feature concatenation is possible. A DenseBlock H_l is typically composed of convolution layers, an activation function (ReLU), and batch normalization. The layers between two adjacent blocks are transition layers, which change the size of the feature map through convolution and pooling operations.

• Growth rate: the size of the feature map grows after each dense layer, with each layer adding k features to the existing ones. This parameter k is the growth rate of the network. If each function H_l produces k feature maps, then the l-th layer has k_l = k_0 + k(l − 1) feature maps, where l is the layer index and k_0 is the number of channels in the input layer.

Figure 2.13 – DenseNet121 architecture [12]

2.7.5 MobileNetV2

MobileNet uses depthwise separable convolutions and pointwise convolutions to reduce computational complexity and the number of parameters. MobileNetV2 additionally introduces linear bottlenecks and inverted residual blocks (shortcut connections between the bottlenecks).

Figure 2.14 – MobilenetV2 architecture

2.7.6 EfficientNetB0

EfficientNet uses the idea of compound scaling: instead of scaling only one model attribute among depth, width, and resolution, the strategy is to scale all three of them together to obtain a better result. The scaled attributes are given by:

• depth: d = α^φ

• width: w = β^φ

• resolution: r = γ^φ

such that α · β² · γ² ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1.

The EfficientNetB0 network includes a stem convolutional layer, multiple MBConv blocks, global average pooling, and a fully connected layer for performing regression or classification. EfficientNetB0 is the base model of the EfficientNet family.

Figure 2.15 – EfficientNetB0 architecture [2]

EfficientNetB0 is designed to be both accurate and computationally efficient through this scaling approach.
Chapter 3

Material and Methodology

3.1 Material

3.1.1 Experimental Site and Experimental Design

The field experiment was conducted by Dr. Tran Giang Son and his ICTlab team in Phu Tho province. The land in this area is used mainly by smallholder farmers, and the common crop cultivated is rice. In this thesis, the dataset was collected on 6th May 2022. There are two main rice cultivars: the left part of the field uses cultivar TBR225 and the right part uses J05, with each plot numbered from right to left and from top to bottom. In total, there are 27 plots for cultivars TBR225 and J05. In addition, cultivar bc15 was planted in 5 different locations at the edge of the rice field. Each plot is a square with a 10 m side. The plot design has the spatial distribution illustrated in Figure 3.1.

Figure 3.1 – The spatial distribution of the plot design

3.1.2 Data Acquisition

The Hyperspectral Images (HSI) were collected with a Hyperspectral Camera OCI-F attached to a DJI Matrice 600 Pro UAV, which can capture images with 120 channels in the visible and near-infrared wavelengths (400 nm-1000 nm). The camera captures images using the push-broom method, and its scanning range width is 800 pixels. Two hyperspectral images were collected on 6th May 2022 and stored in the ENVI standard format. The details of the Hyperspectral Image files are as follows:

• The header ENVI file contains metadata:

– description, samples, lines, bands, header offset, file type, data type, interleave, sensor
type, byte order, system type, main file format, gps file included
– map info: show geographic information of Hyperspectral Image

* Projection name: Name of projection type, our dataset is UTM


* Reference (tie point) pixel x location (in file coordinates): 1.0
* Reference (tie point) pixel y location (in file coordinates): 1.0
* Pixel easting: The easting coordinates of HSI
* Pixel northing: The northing coordinates of HSI
* x pixel size: The size of a pixel in x direction
* y pixel size: The size of a pixel in y direction
* Projection zone (UTM only): 48
* North or South (UTM only): North
* Datum: WGS-84
* Units: Meters
– band names
– wavelength

• The .img files contain the pixel value information of the image.

From the metadata, we know that one HSI has a resolution of 2 cm per pixel and dimensions of 10254 × 11687 × 122, while the other has a resolution of 3 cm per pixel and dimensions of 7655 × 7347 × 122. These two images contain 122 continuous spectral bands with wavelengths ranging from 410 nm to 958 nm, with a step of about 4 nm between bands.

A CSV file was also provided, containing the following information: Code, Northing, Easting, Height, Latitude, Longitude, Elevation, Date, Time, Chlorophyll, Rice height, N Concentration, P Concentration, K Concentration, etc. Within the scope of this thesis, what we need are the geographical positions of the samples (code, longitude, latitude, northing, easting) and their nutrient concentrations: N concentration, P concentration, K concentration, and chlorophyll content.

According to the .csv file, there are three positions of interest in each plot. Each position received different irrigation and fertilizer, so the rice leaves have varying nutrition. In total, 171 rice leaves from different regions were collected and analyzed.

The table below shows some statistics of the nutrients.

Table 3.1 – Descriptive nutrients’ statistics

Nutrients Sample Mean St. Deviation Min Max CV(%)

Chlorophyll 171 40.00 2.44 30.50 48.30 6.08


N concentration 171 3711.46 2445.44 1005.50 10271.76 65.79
P concentration 171 2663.47 1220.19 1029.10 16122.00 45.74
K concentration 171 15999.45 3542.95 4487.00 23831.00 22.11

3.2 Methodology

3.2.1 Overall Framework

First, the HSI captures the entire region of the field, but the nutrient concentration was measured only in some regions. We therefore need to extract the Regions Of Interest (ROI) and merge them with the measured concentrations, used as the ground-truth values, based on the field code. Next, we apply some preprocessing techniques to remove redundant data and make the dataset more suitable for precise prediction. The next step is to find and study proper learning algorithms, apply them to the dataset, and find the optimal hyperparameters that satisfy the statistical metrics. All of these models are then compared and assessed, and the process is iterated until the best model according to these metrics is found. All of these steps are illustrated in Figure 3.2.

Figure 3.2 – The proposed workflow for developing the model

3.2.2 Preprocessing Dataset

3.2.2.1 Determine ROI position

The HSI file includes metadata, from which we can obtain geographic information via the "map info" attribute. In this attribute, we are interested in the key-value pairs of pixel easting, pixel northing, and the x and y pixel sizes.

Easting and northing are the terms used for the geographic Cartesian coordinates (projected coordinate system) of a point. They are used instead of latitude and longitude, which are spherical coordinates and hard to use when determining the position of an ROI in a planar HSI. Pixel easting refers to the eastward-measured distance (the x-coordinate) of the HSI and pixel northing refers to the northward-measured distance (the y-coordinate). In short, they represent the position of the HSI's geographic coordinates on the planar earth map. This position is used as the origin when determining the position of an ROI.

The other piece of information we need is the geographic Cartesian coordinates of the ROIs. Thanks to the Department of Space and Applications at USTH for converting the unprojected positions in longitude and latitude to planar coordinates, we can get the exact planar geographic position of each field code.

The position of an ROI in the HSI is then the distance between the ROI's position on the planar earth map and the origin (the HSI's position on the planar earth map). After getting this distance, we scale it to fit the HSI size by dividing it by the x and y pixel sizes.

All the steps can be represented as equation 3.1:

The east (north) of the ROI in the HSI = (The east (north) of the ROI − The origin east (north)) / (x (y) pixel size)        (3.1)

Based on this idea, the following pseudocode is proposed for getting the list of all ROI positions in a hyperspectral image.

Input : csv_file_path - the .csv file containing the related information of the ROIs, including their geographic positions
        header_file_path - the .hdr header file of the HSI containing the map info
Output : The list of all ROI positions in the HSI

field_pd = READ_CSV(csv_file_path)
header = READ_ENVI_HEADER(header_file_path)
field_east = (field_pd["East"] − header["map info"]["Pixel easting"]) / header["map info"]["x pixel size"]
field_north = −(field_pd["North"] − header["map info"]["Pixel northing"]) / header["map info"]["y pixel size"]
coordinate = {"east": field_east, "north": field_north}
coordinate_df = CONVERT_TO_DATAFRAME(coordinate)
coordinate = CONVERT_TO_NUMPY_ARRAY(coordinate_df)
return coordinate

Algorithm 3.1 – The proposed pseudocode for getting all ROI positions of an HSI

This pseudocode can easily be implemented using the spectral, pandas, and numpy libraries in Python: we need a CSV file with the planar geographic information of the ROIs and the .hdr header file of the hyperspectral image. We then calculate the coordinates in the HSI and convert the result into an array of (x, y) coordinates in the hyperspectral image.

Knowing these coordinates, we can map all the positions onto the image by pinpointing them, so we can check whether the process works correctly. The points can be drawn with the OpenCV functions cv2.circle(), cv2.putText(), and cv2.imwrite().
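
A hedged Python sketch of this step is shown below; the file names are placeholders, and the indices into the "map info" list are assumed to follow the ENVI header fields listed in Section 3.1.2:

import cv2
import numpy as np
import pandas as pd
import spectral.io.envi as envi

field_pd = pd.read_csv("roi_positions.csv")            # planar East/North of each ROI (placeholder name)
header = envi.read_envi_header("hyperspectral.hdr")    # ENVI metadata
map_info = header["map info"]                          # [projection, tie x, tie y, easting, northing, x size, y size, ...]
easting, northing = float(map_info[3]), float(map_info[4])
x_size, y_size = float(map_info[5]), float(map_info[6])

# Equation 3.1: offset from the image origin, scaled by the pixel size
field_east = (field_pd["East"] - easting) / x_size
field_north = -(field_pd["North"] - northing) / y_size
coords = np.column_stack([field_east, field_north]).astype(int)

# Draw each ROI on an RGB rendering of the HSI to verify the positions
rgb = cv2.imread("hsi_rgb.png")
for i, (x, y) in enumerate(coords):
    cv2.circle(rgb, (x, y), 10, (0, 0, 255), 2)
    cv2.putText(rgb, str(i), (x + 12, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
cv2.imwrite("roi_check.png", rgb)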

Figure 3.3 shows all the ROIs of interest overlaid on the hyperspectral image, so that we can check whether they are in the correct positions.

Figure 3.3 – The ROI’s spatial distribution of 3 cm per pixel image after extracting RGB
channels

3.2.2.2 Extracting the ROIs in hyperspectral image

After verifying that these ROIs are correctly mapped onto the hyperspectral image, the next step is getting all the pixel values of the ROIs as the input of the learning models.

The spectral library provides a convenient interface for reading pixel values from the ENVI format: we can use methods such as .read_pixel() to get the pixel values of all bands at a designated coordinate, or slice the pixel values of an ROI in the same way as a numpy array.

For the machine learning models, we extracted the exact pixel values at these coordinates to create 1×1 patches. The dataset therefore has the bands from channel 1 to channel 122 as features, each feature being the pixel value at the sample's position. For the deep learning models, we extracted small regions around those coordinates, creating 32×32 patches with 122 channels. We had 2 hyperspectral images, each with 171 points found, so in total we have 342 samples.

Before feeding this dataset into the learning models, some normalization techniques are performed to improve the performance and training stability of the models. We also remove samples with null nutrient data. After that, we normalized the dataset using the z-score technique for machine learning and converted the value range of 0-255 into 0-1 for deep learning.
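
An illustrative sketch of extracting the 1×1 and 32×32 patches with the spectral library follows; the file names are placeholders and coords is the array of ROI positions computed with Algorithm 3.1:

import numpy as np
import spectral.io.envi as envi

img = envi.open("hyperspectral.hdr", "hyperspectral.img")

half = 16   # half-window for the 32x32 patches used by the deep learning models
ml_samples, dl_samples = [], []
for x, y in coords:
    ml_samples.append(img.read_pixel(y, x))                    # 1x1 spectrum, 122 bands
    patch = img[y - half:y + half, x - half:x + half, :]       # 32x32x122 patch
    dl_samples.append(np.asarray(patch))

X_ml = np.array(ml_samples)    # shape (n_samples, 122)
X_dl = np.array(dl_samples)    # shape (n_samples, 32, 32, 122)

# z-score normalization for machine learning, 0-1 scaling for deep learning
X_ml = (X_ml - X_ml.mean(axis=0)) / X_ml.std(axis=0)
X_dl = X_dl / 255.0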

3.2.3 Model Configuration and Training

3.2.3.1 Machine Learning

PCA

Because of the high dimensionality of HSI, PCA is used for band reduction to decrease redundancy and increase the model's efficiency [9]. The principal components explain parts of the variance. For the machine learning models, we would like to see whether reducing the bands gives better performance. To choose how many principal components to keep, we examine the explained variance and plot its cumulative curve. The figure below shows the number of components needed to explain the variance of the 1×1 dataset used for the machine learning algorithms.

Figure 3.4 – The number of component needed to explain variance

From the figure, we see that 60 components are needed to explain 90% of the variance. We therefore define 90% as the cut-off threshold, so when PCA is used to reduce the dimensionality of the dataset for the machine learning models, the number of principal components is set to 60.
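
A sketch of how the 90% cut-off could be found with scikit-learn is shown below; X_ml stands for the z-scored 1×1 dataset built in the previous section:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X_ml)                                    # fit on all 122 bands
cumulative = np.cumsum(pca.explained_variance_ratio_)

n_components = int(np.argmax(cumulative >= 0.90)) + 1    # first component count reaching 90%
print("Components needed for 90% of the variance:", n_components)

plt.plot(cumulative)
plt.axhline(0.90, linestyle="--")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.savefig("explained_variance.png")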

Hyperparameter Optimization

In machine learning, hyperparameters are parameters that must be set before training to configure an ML model and reduce the loss function. Hyperparameter tuning is the process of finding optimal hyperparameters. Manual tuning is not a good choice because complex models have many hyperparameters and model evaluation is time-consuming; therefore, many HPO techniques have been researched to automate hyperparameter tuning and make it effective for practical problems [17]. In this work, grid search is used for Ridge, Lasso, Decision Tree Regression, and Random Forest Regression. Grid search first searches a large space with a large step, then narrows around the previous result until an optimum is found. Grid search is not the right choice for high-dimensional hyperparameter spaces because of its O(n^k) complexity [17]. For XGBoost, CatBoost, and LGBM, whose parameter search spaces are large, we use Optuna, an optimization framework based on the "define-by-run" principle [3].
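
Two brief tuning sketches are given below, one grid search for Ridge and one Optuna study for XGBoost; the parameter grids and search ranges are examples only, and X_ml and y stand for the preprocessed features and a target nutrient:

import optuna
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBRegressor

# Grid search with 5-fold cross-validation
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="r2")
grid.fit(X_ml, y)
print("Best Ridge parameters:", grid.best_params_)

# Optuna "define-by-run": the search space is declared inside the objective
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    model = XGBRegressor(**params)
    return cross_val_score(model, X_ml, y, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best XGBoost parameters:", study.best_params)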

Architecture of Machine Learning Model

After the dataset is processed, it is used as the input of the machine learning models. There are 122 features corresponding to the 122 bands in the image, or 60 features after applying PCA. The dataset was split into 80% for training (about 273 samples) and 20% for testing (about 69 samples). During training, 20% of the training set (about 55 samples) was used as the validation set. 5-fold cross-validation was applied during hyperparameter tuning to find the optimal hyperparameters, and the performance of the models was evaluated on the test set.

Figure 3.5 – Illustration of our machine learning workflow

3.2.3.2 Deep Learning

For the deep learning models, after preprocessing the images we used the same split ratio as for the machine learning models: 80% for the training set and 20% for the test set, with 20% of the training set used as the validation set during training before evaluation on the test set. We use patches of size 32×32×122 because that is the required input size of some of the CNN architectures. We then use five architectures: VGG16, ResNet50, DenseNet121, MobileNetV2, and EfficientNetB0. The backbone output is fed into a global average pooling layer and then into fully connected layers, with an output layer using a linear activation function. Each model is trained for about 200 epochs with the Adam optimizer and a learning rate of 0.001.
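
A hedged Keras sketch of this setup is shown below, using MobileNetV2 as the backbone; the weights are left random because the 122-channel input does not match the 3-channel ImageNet weights, and X_dl_train and y_train stand for the 32×32×122 patches and their nutrient values:

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import MobileNetV2

backbone = MobileNetV2(include_top=False, weights=None, input_shape=(32, 32, 122))

model = keras.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),        # global average pooling after the backbone
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="linear"),   # regression output
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
model.fit(X_dl_train, y_train, validation_split=0.2, epochs=200, batch_size=16)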

Figure 3.6 – Illustration of our deep learning workflow

3.2.4 Model Evaluation

Performance metrics are important for regression models in order to evaluate and monitor the performance and error of their predictions. The quality of the statistical metrics depends on many factors, such as the nature of the variables used in the model, their units of measure, and the data transformations applied. In this work, the performance of the learning algorithms was evaluated and compared using different statistical metrics.

3.2.4.1 RMSE

RMSE is usually used as a standard metric for measuring regression errors. It corresponds to the standard deviation of the residuals (the differences between the model predictions and the true values) and therefore indicates how widely the residuals are spread. RMSE is more sensitive to outliers than MAE, since large residuals are amplified by the squaring. It is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \qquad (3.2)$$

where:

• $y_i$ is the actual value

• $\hat{y}_i$ is the predicted value

• $N$ is the number of samples

The smaller the RMSE, the smaller the spread of the errors.

3.2.4.2 MAPE

Mean Absolute Percentage Error (MAPE) is the mean of all absolute percentage errors between the predicted and actual values. It is similar to MAE, but it expresses the error as a percentage. The formula of MAPE can be represented as:

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|y_i - \hat{y}_i\right|}{y_i} \qquad (3.3)$$

where:

• $y_i$ is the actual value

• $\hat{y}_i$ is the predicted value

• $N$ is the number of samples

3.2.4.3 Coefficient of determination

R² (coefficient of determination) is a statistical metric that shows how much of the variation of the dependent variable (the output value) is explained by the independent variables in a regression model. Its value usually lies between 0 and 1 (it can become negative when a model performs worse than simply predicting the mean); the higher the R² score, the better the predictions. However, R² cannot determine whether the predictions are biased. The formula of R² can be represented as:
$$R^2 = 1 - \frac{SS_{\mathrm{RES}}}{SS_{\mathrm{TOT}}} = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \qquad (3.4)$$

where:

• $y_i$ is the actual value

• $\hat{y}_i$ is the predicted value

• $\bar{y}$ is the mean of the measured target

• $N$ is the number of samples
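
As a minimal sketch (not the exact evaluation code of this work, and with made-up numbers), the three metrics above can be computed with scikit-learn and NumPy as follows:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score

# Placeholder ground truth and predictions, for illustration only
y_true = np.array([40.2, 38.5, 42.1, 39.9])
y_pred = np.array([39.0, 38.9, 41.0, 41.2])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))           # Eq. (3.2)
mape = mean_absolute_percentage_error(y_true, y_pred) * 100  # Eq. (3.3), as a percentage
r2 = r2_score(y_true, y_pred)                                # Eq. (3.4)

print(f"RMSE={rmse:.2f}  MAPE={mape:.2f}%  R2={r2:.2f}")
```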

3.3 Tools and Library

This thesis was implemented on a personal computer for image preprocessing, on Google Colab for the machine learning models and on Kaggle kernels for the deep learning models. The following libraries were used for this work:

• TensorFlow and Keras: TensorFlow is an open-source library developed by Google for deep learning applications; Keras is a higher-level library built on top of TensorFlow.

• Spectral (SPy): a Python library for processing hyperspectral image data, including reading, displaying, manipulating, and classifying hyperspectral imagery (a short loading example is given after this list).

• Scikit-learn: used for developing the machine learning models, the preprocessing techniques and the hyperparameter optimization.

• Optuna: a hyperparameter optimization framework that uses state-of-the-art algorithms to search large spaces and to parallelize hyperparameter searches.

• XGBoost, CatBoost, LGBM: gradient boosting frameworks.

• Numpy, Matplotlib and OpenCV.
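
As an example of the Spectral (SPy) usage, here is a hedged sketch of loading a scene, assuming an ENVI-format image whose header file is named scene.hdr (a hypothetical path):

```python
import spectral

# Open an ENVI-format hyperspectral image through its header file (hypothetical path)
img = spectral.open_image("scene.hdr")

# Load the full data cube into memory as a (rows, cols, bands) array
cube = img.load()
print(cube.shape)  # e.g. (height, width, 122) for a 122-band scene
```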


Chapter 4

Results and Discussion

4.1 Chlorophyll Model Prediction and Comparison

Table 4.1 – Comparison of Learning Models Performance in Chlorophyll Prediction

Models 122 channels PCA (60 components)


RMSE R2 MAPE RMSE R2 MAPE
Ridge 1.99 0.23 4.06% 1.93 0.28 3.96%
Lasso 2.31 -0.02 4.73% 2.15 0.11 4.06%
Decision Tree Regression 2.30 -0.02 4.74% 2.24 0.04 4.73%
Random Forest Regression 2.11 0.14 4.35% 2.12 0.14 4.34%
SVR 2.05 0.20 4.06% 1.99 0.24 4.11%
AdaBoost 2.14 0.12 4.35% 2.09 0.16 4.34%
XGBoost 2.19 0.08 4.68% 2.12 0.14 4.31%
CatBoost 2.09 0.16 4.33% 2.15 0.11 4.45%
LGBM 2.16 0.10 4.41% 2.15 0.11 4.47%
VGG16 1.94 0.17 4.24% - - -
Resnet50 3.37 -1.22 6.92% - - -
DenseNet121 2.05 0.22 4.40% - - -
MobileNetV2 2.14 0.23 4.49% - - -
EfficientNetB0 2.17 -0.06 4.51% - - -

Overall, the chlorophyll models work well on this dataset: the MAPE is very low (<5%), and the R² values are good compared with those of the N, P and K concentration models. The best model is Ridge after applying PCA, with a very small RMSE of 1.93, a MAPE of 3.96% and the highest R² score. In general, the machine learning algorithms perform very well; SVR after PCA also gives a very good result. Among the boosting algorithms there is not much difference. CatBoost performs best with 122 channels, with a MAPE of 4.33%, an RMSE of 2.09 and a relatively good R² of 0.16; after applying PCA, however, its score becomes slightly worse, as the MAPE and RMSE increase a little and the R² drops to 0.11. The hyperparameters of some of the good machine learning models are listed in Appendix A. In contrast, decision tree regression
is the worst of the machine learning models, with a very low R² and high RMSE and MAPE. Even after applying PCA it improves only slightly and still cannot compete with the other learning models.

Among the deep learning models, DenseNet121 appears to be the best, with a low RMSE of 2.05, a MAPE of 4.40% and the second-highest R² of 0.22. Next is MobileNetV2, which has the highest R² score (0.23) but a worse RMSE and MAPE than DenseNet121. The third is VGG16, which also has a relatively good R² and even the best RMSE and MAPE. Resnet50 and EfficientNetB0, however, did not meet our expectations, performing poorly on all three evaluation metrics. In summary, Ridge regression is the best machine learning model and DenseNet121 the best deep learning model for chlorophyll prediction.

4.2 N concentration Model Prediction and Comparison

Table 4.2 – Comparison of Learning Models Performance in N Concentration Prediction

Models 122 channels PCA (60 components)

RMSE R2 MAPE RMSE R2 MAPE


Ridge 2599.54 -0.132 77.50% 2582.09 -0.117 74.66%
Lasso 2582.10 -0.136 81.93% 2555.63 -0.094 77.46%
Decision Tree Regression 2786.22 -0.301 91.94% 2665.91 0.191 81.15%
Random Forest Regression 2427.67 0.012 81.68% 2457.01 -0.012 85.62%
SVR 2445.31 -0.002 62.45% 2444.18 -0.002 62.23%
AdaBoost 2429.62 0.010 73.50% 2441.97 0.000 70.85%
XGBoost 2413.18 0.024 68.45% 2486.76 -0.037 67.63%
CatBoost 2426.92 0.013 81.56% 2455.12 -0.011 84.37%
LGBM 2357.00 0.069 77.77% 2456.63 -0.012 85.70%

VGG16 2403.57 0.026 68.88% - - -


Resnet50 2451.95 -0.287 90.89% - - -
DenseNet121 2490.42 0.138 49.90% - - -
MobileNetV2 3887.10 -1.389 75.85% - - -
EfficientNetB0 2737.57 -0.188 51.61% - - -

The models do not perform well for N concentration: overall, the MAPE and RMSE are very high, and most models cannot reach a positive R² score, so they seem to learn almost nothing from the dataset. For N concentration, the boosting algorithms (AdaBoost, XGBoost, CatBoost and LGBM) appear to perform better than the others; XGBoost (68.45% MAPE and an R² of 0.024) and LGBM (an R² of 0.069) can be considered reasonable choices.

Among the deep learning models, DenseNet121 is again the best, with the highest R² score and the lowest MAPE (49.90%); it beats every other model, machine learning or deep learning. The next best choice is VGG16, which, although not good in absolute terms, still outperforms almost all of the machine learning models.

For N concentration, the deep learning models therefore seem to perform better than the machine learning models, with DenseNet121 the best; the machine learning models used here are not an optimal choice for N prediction.

4.3 P Concentration Model Prediction and Comparison

Table 4.3 – Comparison of Learning Models Performance in P Concentration Prediction

Models 122 channels PCA (60 components)

RMSE R2 MAPE RMSE R2 MAPE


Ridge 875.95 -0.756 30.54% 870.96 -0.736 30.06%
Lasso 1430.97 -3.688 49.87% 743.66 -0.266 24.99%
Decision Tree Regression 760.27 -0.323 22.70% 682.58 -0.067 20.09%
Random Forest Regression 713.75 -0.166 22.78% 687.15 -0.081 22.53%
SVR 679.75 -0.058 20.32% 677.24 -0.050 20.19%
AdaBoost 687.18 -0.081 19.66% 674.30 -0.041 20.54%
XGBoost 806.50 -0.489 25.43% 785.07 -0.411 25.87%
CatBoost 664.50 -0.011 20.35% 671.30 -0.032 20.52%
LGBM 683.55 -0.070 21.38% 680.64 -0.060 21.75%

VGG16 666.90 -0.615 20.92% - - -


Resnet50 641.10 -0.790 20.43% - - -
DenseNet121 1164.00 -5.250 35.51% - - -
MobileNetV2 763.72 -2.053 20.63% - - -
EfficientNetB0 699.67 -1.703 19.50% - - -

For the prediction of P, the MAPE is generally good (<25%), but the models show relatively high RMSE and very low R² scores; none of them reaches a positive R². A possible reason lies in our dataset, which does not cover a good range of measured P concentration values. Among the models for P prediction, SVR, AdaBoost and CatBoost show similar performance, with CatBoost giving the best result. In contrast to its performance for chlorophyll and N concentration, DenseNet121 now becomes the worst model, unlike VGG16, which still performs relatively well in RMSE and MAPE compared with the other models; Resnet50 is another good choice as well.

In general, the prediction of P concentration is still not as good as that of chlorophyll, but the MAPE and RMSE results are better overall than for N concentration. Among the models, CatBoost seems to give slightly better results than the other machine learning and deep learning models, although the margin is small.

4.4 K Concentration Model Prediction and Comparison

Table 4.4 – Comparison of Learning Models Performance in K Concentration Prediction

Models 122 channels PCA (60 components)

RMSE R2 MAPE RMSE R2 MAPE


Ridge 3543.78 -0.16 19.74% 3491.76 -0.127 19.59%
Lasso 4934.13 -1.251 27.73% 3496.53 -0.130 19.71%
Decision Tree Regression 3878.16 -0.391 21.43% 3555.54 -0.169 20.71%
Random Forest Regression 3349.62 -0.038 19.55% 3307.24 -0.011 19.20%
SVR 3286.25 0.0013 19.24% 3288.64 0.000 19.26%
AdaBoost 3365.15 -0.047 19.73% 3269.29 0.012 19.23%
XGBoost 3434.37 -0.090 19.68% 3435.34 -0.091 20.35%
CatBoost 3377.80 -0.055 19.80% 3295.45 -0.004 19.18%
LGBM 3280.15 0.005 19.02% 3289.46 0.000 19.30%

VGG16 3347.12 -0.045 19.33% - - -


Resnet50 3427.05 -0.599 18.94% - - -
DenseNet121 4027.24 -0.789 23.13% - - -
MobileNetV2 4034.78 -0.526 21.89% - - -
EfficientNetB0 3385.90 0.053 19.87% - - -

Moving on to K prediction, the MAPE of all models is generally very good (<20%), although the R² and RMSE do not meet our expectations. SVR and LGBM achieve a higher R² than most of the other models (LGBM reaches an R² of 0.005 and a MAPE of 19.02% with all 122 channels). After applying PCA, however, AdaBoost improves considerably, with an R² of 0.012, better than that of LGBM while keeping approximately the same RMSE and MAPE. We can therefore conclude that AdaBoost with PCA gives the best machine learning result for predicting K concentration. Among the deep learning models, EfficientNetB0 surprisingly gives the best result, with the highest R² and good MAPE and RMSE compared with the other deep learning models.
Chapter 5

Conclusion and Future Work

5.1 Conclusion

In this thesis, we performed nutrient regression, specifically for chlorophyll content and for the N, P and K concentrations. As discussed in the results, the best performance was obtained for chlorophyll content prediction. This is due to a good dataset: the HSI was captured at suitable wavelengths, the chlorophyll content is stable, and the measured chlorophyll values are accurate. However, we did not obtain good results for the N, P and K concentrations. N concentration had the worst results, with poor performance on all three evaluation metrics; the models learned almost nothing from the dataset. The P and K concentration predictions achieve a good MAPE, but a low R² and a relatively poor RMSE. A possible explanation is the imbalance in the measured N, P and K concentrations, or that the training data from a single season is not enough for the learning models. Another possible reason is that the captured wavelength range of the hyperspectral images does not cover enough of the spectrum (N, P and K may require wavelengths in the range of 900 nm to 2100 nm [19]), leading to a low correlation between the N, P, K concentrations and the spectral bands. Many models obtain better results after applying PCA, although some do not improve; we can therefore conclude that PCA can be a good preprocessing choice for obtaining better results. The boosting algorithms and SVR may not give the best results, but they provide stable performance, with little difference across the evaluation metrics. For the deep learning models, VGG16 did not give any impressive results, but its overall results are better than those of the other models. DenseNet121 performs best for chlorophyll and N concentration, but becomes the worst for P and K concentration prediction.

5.2 Future Work

In future work, we would like to improve the quality of the dataset and increase the number of samples, since data from a single season is not enough to train good models. Besides that, we plan to study and apply more image preprocessing techniques, such as Gaussian smoothing, median filtering and wavelet denoising, to reduce the noise of the hyperspectral images; to use the normalized difference vegetation index (NDVI) to find the wavelength features most correlated with the N, P and K concentrations; to address the imbalanced regression dataset with techniques such as label distribution smoothing (LDS) and feature distribution smoothing (FDS); and to use more state-of-the-art learning models, such as vision transformers, together with additional metrics to obtain a more balanced view in model evaluation.
References

[1] Megan Io Ariadne Abenina, Joe Mari Maja, Matthew Cutulle, Juan Carlos Melgar, and
Haibo Liu. “Prediction of Potassium in Peach Leaves Using Hyperspectral Imaging and
Multivariate Analysis.” In: AgriEngineering 4.2 (2022), pp. 400–413. ISSN: 2624-7402.
DOI : 10.3390/agriengineering4020027. URL : https://www.mdpi.com/2624-7402
/4/2/27.
[2] Tashin Ahmed and Noor Sabab. “Classification and understanding of cloud structures via
satellite images with EfficientUNet.” In: (Sept. 2020). DOI: 10.1002/essoar.10507423
.1.
[3] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama.
Optuna: A Next-generation Hyperparameter Optimization Framework. 2019. arXiv: 1907.1
0902 [cs.LG].
[4] Songtao Ban et al. “Rice Leaf Chlorophyll Content Estimation Using UAV-Based Spectral
Images in Different Regions.” In: Agronomy 12.11 (2022). ISSN: 2073-4395. URL: https:
//www.mdpi.com/2073-4395/12/11/2832.
[5] Xiaokai Chen et al. “Estimation of Winter Wheat Canopy Chlorophyll Content Based
on Canopy Spectral Transformation and Machine Learning Method.” In: Agronomy 13.3
(2023). ISSN: 2073-4395. DOI: 10.3390/agronomy13030783. URL: https://www.mdpi
.com/2073-4395/13/3/783.
[6] Sulaymon Eshkabilov et al. “Hyperspectral Image Data and Waveband Indexing Methods
to Estimate Nutrient Concentration on Lettuce (Lactuca sativa L.) Cultivars.” In: Sensors
22.21 (2022). ISSN: 1424-8220. DOI: 10.3390/s22218158. URL: https://www.mdpi.c
om/1424-8220/22/21/8158.
[7] Dehua Gao et al. “In-field chlorophyll estimation based on hyperspectral images segmenta-
tion and pixel-wise spectra clustering of wheat canopy.” In: Biosystems Engineering 217
(2022), pp. 41–55. ISSN: 1537-5110. DOI: https://doi.org/10.1016/j.biosystem
seng.2022.03.003. URL : https://www.sciencedirect.com/science/article/pii
/S1537511022000551.
[8] Ma Jiaying et al. “Functions of Nitrogen, Phosphorus and Potassium in Energy Status and
Their Influences on Rice Growth and Development.” In: Rice Science 29.2 (2022), pp. 166–178.
ISSN: 1672-6308. DOI: https://doi.org/10.1016/j.rsci.2022.01.005. URL:
https://www.sciencedirect.com/science/article/pii/S1672630822000075.
[9] Atiya Khan, Amol D. Vibhute, Shankar Mali, and C.H. Patil. “A systematic review on
hyperspectral imaging technology with a machine and deep learning methodology for
agricultural applications.” In: Ecological Informatics 69 (2022), p. 101678. ISSN: 1574-
9541. DOI: https://doi.org/10.1016/j.ecoinf.2022.101678. URL: https://www
.sciencedirect.com/science/article/pii/S1574954122001285.
[10] Donna Kirk. Contemporary Mathematics. OpenStax, 2023. ISBN: 978-1-951693-68-8.
[11] Michael Mayo, Lynne Chepulis, and Ryan Paul. “Glycemic-aware metrics and oversampling
techniques for predicting blood glucose levels using machine learning.” In: PLOS ONE 14
(Dec. 2019), e0225613. DOI: 10.1371/journal.pone.0225613.
[12] Noha Radwan. “Leveraging Sparse and Dense Features for Reliable State Estimation in
Urban Environments.” PhD thesis. June 2019. DOI: 10.6094/UNIFR/149856.
[13] Bogdan Ruszczak, Agata M. Wijata, and Jakub Nalepa. “Unbiasing the Estimation of
Chlorophyll from Hyperspectral Images: A Benchmark Dataset, Validation Procedure and
Baseline Results.” In: Remote Sensing 14.21 (2022). ISSN: 2072-4292. DOI: 10.3390/rs1
4215526. URL : https://www.mdpi.com/2072-4292/14/21/5526.
[14] Yuna Shin et al. “Prediction of Chlorophyll-a Concentrations in the Nakdong River Using
Machine Learning Methods.” In: Water 12.6 (2020). ISSN: 2073-4441. DOI: 10.3390/w12
061822. URL : https://www.mdpi.com/2073-4441/12/6/1822.
[15] Peg Shippert. “Introduction to Hyperspectral Image Analysis.” URL: https://docplayer
.net/14887579-Introduction-to-hyperspectral-image-analysis.html.
[16] Mustafa Teke, Hüsne Seda Deveci, Onur Haliloğlu, Sevgi Zübeyde Gürbüz, and Ufuk
Sakarya. “A short survey of hyperspectral remote sensing applications in agriculture.” In:
2013 6th International Conference on Recent Advances in Space Technologies (RAST). 2013,
pp. 171–176. DOI: 10.1109/RAST.2013.6581194.
[17] Li Yang and Abdallah Shami. “On hyperparameter optimization of machine learning
algorithms: Theory and practice.” In: Neurocomputing 415 (2020), pp. 295–316. ISSN:
0925-2312. DOI: https://doi.org/10.1016/j.neucom.2020.07.061. URL: https:
//www.sciencedirect.com/science/article/pii/S0925231220311693.
[18] Xi-Guang Yang, Wenyi Fan, and Ying Yu. “Chlorophyll content retrieval from hyperspectral
remote sensing imagery.” In: Environmental monitoring and assessment 187 (June 2015),
pp. 1–13. DOI: 10.1007/s10661-015-4682-4.
[19] Xin feng YAO et al. “A New Method to Determine Central Wavelength and Optimal
Bandwidth for Predicting Plant Nitrogen Uptake in Winter Wheat.” In: Journal of Integrative
Agriculture 12.5 (2013), pp. 788–802. ISSN: 2095-3119. DOI: https://doi.org/10.101
6/S2095-3119(13)60300-7. URL : https://www.sciencedirect.com/science/arti
cle/pii/S2095311913603007.
Appendix A

Hyperparameters for the Machine Learning Models

Table A.1 – Some good chlorophyll machine learning models with their hyperparameters

Models Hyper-parameters
Ridge (PCA) alpha: 100
SVR (PCA) C: 10, gamma: 0.0003
AdaBoost (PCA) learning_rate=1, n_estimators=1000
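
For illustration, a hedged sketch of how the tuned chlorophyll models in Table A.1 would be instantiated in scikit-learn (data loading and the PCA step are omitted):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

# Tuned chlorophyll models from Table A.1 (all fitted on PCA-reduced features)
ridge = Ridge(alpha=100)
svr = SVR(C=10, gamma=0.0003)
adaboost = AdaBoostRegressor(learning_rate=1, n_estimators=1000)
```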

Table A.2 – Some good N concentration machine learning models with their hyperparameters

Models Hyper-parameters
XGBoost max_depth: 6, learning_rate: 0.01443, n_estimators: 146,
min_child_weight: 4, gamma: 0.0000232, subsample:
0.096, colsample_bytree: 0.267, reg_alpha: 0.00000057,
reg_lambda: 0.00000097766
SVR (PCA) C: 1, gamma: 1, kernel: linear
AdaBoost (PCA) learning_rate: 0.01, n_estimators: 250

Table A.3 – Some good P concentration machine learning models with their hyperparameters

Models Hyper-parameters
AdaBoost learning_rate: 0.1, n_estimators: 1000
CatBoost learning_rate: 0.017, depth: 13, l2_leaf_reg: 1.5, min_child_samples: 4
SVR (PCA) C: 1, gamma: 1, kernel: linear


Table A.4 – Some good K concentration machine learning models with their hyperparameters

Models Hyper-parameters
AdaBoost (PCA) learning_rate: 0.01, n_estimators: 250
LGBM reg_alpha: 6.71, reg_lambda: 0.00155, colsample_bytree: 0.7, subsample: 0.8,
learning_rate: 0.006, max_depth: 20, num_leaves: 2435, min_child_samples: 104,
min_data_per_groups: 8
SVR C: 0.1, gamma: 1, kernel: linear
