341-Forest Cover Type Prediction
Abstract: Traditional techniques for classifying forest cover types, based on field surveys and manual mapping, can be time-consuming and labor-intensive. The purpose of this research was to help the Forest Department predict forest cover types with machine learning, reducing the time taken for prediction, and to improve the results with minor changes to the algorithm. The chosen approach was to use cartographic variables and to find the algorithm best suited to a modern, digital setting in which data must be acquired quickly and accurately. The accuracy of K-Nearest Neighbors (KNN), Linear Support Vector Classifier (Linear SVC), Gaussian Naive Bayes (Gaussian NB), Logistic Regression, Random Forest Classifier (RFC), Gradient Boosting Classifier, and Decision Tree Classifier was compared. Random Forest and KNN achieved the highest accuracy and were taken into consideration for further analysis.

Keywords: Machine learning, Forest Cover Type, Classification, K-Nearest Neighbors, Dimensionality Reduction, KNN, Forest, Forest Prediction.

I. INTRODUCTION

The development of ecosystem management strategies is the responsibility of natural resource managers, who need basic descriptive data, such as forest land inventory data, to guide their decision-making. For holdings or neighboring lands that are outside of their immediate jurisdiction, managers typically do not have access to this kind of information. The application of predictive models is one way to get this data. Based on the feature values supplied to the trained model, these models can accurately predict the type of cover of any specific area.

The study covers four wilderness areas in the Roosevelt National Forest, situated in the northern region of Colorado. The dependent variables were the seven main types of forest cover, and the independent variables were the twelve cartographic measures. Several subsets of these variables were examined in order to decide which predictive model was the best overall.

This work aims at effective prediction of forest cover. Various classification algorithms, such as random forests, KNN (K-Nearest Neighbors), and other machine learning models, were tested to develop robust and accurate predictive models for classifying the type of forest cover. Using common ecological and environmental data, the forest department could easily identify the forest cover type of a given area before taking any action. This was made possible by the improved prediction accuracy of the KNN model and a user-friendly interface for accessing the model.

II. LITERATURE SURVEY

In [1], machine learning techniques are used to predict forest cover of specific or various types from remote sensing and cartographic variables. The machine learning techniques used are regression, decision tree, GBM, and random forest classifier. The objective is to compare these techniques and identify the one that provides the best prediction accuracy. The authors want to develop an automated system for tree species classification that is applicable in real-life scenarios [1]. Four different machine learning algorithms were tested, and the highest classification performance was achieved using the random forest algorithm, with an overall accuracy of 92.3% [3]. These papers also explored the use of remote sensing tools to map forest type and conditions. The data used in [4] are aerial photographic data as well as Landsat Enhanced Thematic Mapper (ETM+) data. The ECSU campus would then be mapped in terms of its land cover using remote sensing datasets that had been verified in the field [4].

In another study, researchers evaluated the efficacy of discriminant analysis and artificial neural networks for predicting various types of forest cover. They found that neural networks were generally more accurate than discriminant analysis for predicting forest cover [5]. Other papers compare how quickly algorithms can predict forest cover while maintaining accuracy [6]. One of the best methods for prediction involves using an immune and genetic approach. This biological approach gave by far the most accurate predictions, although it is not always possible
to obtain biological data [8]. There are many possible prediction algorithms. By using cartographic variables, the model can keep learning and predicting as the terrain and other geographical features change [1][2].

One study efficiently classified unbalanced forest cover data sets using two sampling approaches and distributed SVM architectures. The results show successful classification with enhanced accuracy, particularly for large datasets, and a significant increase in the true-positive rate of the minority class. This approach reduces optimization complexity and accelerates training by running in parallel. Overall, the study highlights the advantages of using distributed SVM architectures for solving binary classification problems involving different types of forest cover [9].

III. METHODOLOGY

A standard method exists for building a model, but various modifications may be necessary to meet the specific requirements of the desired model. The model-building process involves multiple steps, including collecting resources, data cleaning, pre-processing, training, testing, and deployment. During the resource collection stage, relevant data is collected and collated. Unnecessary data is then cleaned out. Pre-processing involves transforming data into a suitable format. The model is then developed using the training data, and its accuracy is evaluated on the testing data. The detailed process is explained below. The dataset description is given in TABLE I [12].

TABLE I. DATASET DESCRIPTION

| Sr. No. | Data_field | Description |
|---|---|---|
| 1 | Elevation | Elevation measured in meters. |
| 2 | Aspect | Aspect measured in degrees azimuth. |
| 3 | Slope | Slope measured in degrees. |
| 4 | Horizontal_Distance_To_Hydrology | Horizontal distance to the closest water body. |
| 5 | Vertical_Distance_To_Hydrology | Vertical distance to the closest water body. |
| 6 | Horizontal_Distance_To_Roadway | Horizontal distance to the nearest roadway. |
| 7 | Hillshade_9am | Hillshade index at 9 am. |
| 8 | Hillshade_Noon | Hillshade index at 12 pm. |
| 9 | Hillshade_3pm | Hillshade index at 3 pm. |
| 10 | Horizontal_Distance_To_Fire_Points | Horizontal distance to the closest wildfire ignition points. |
| 11 | Wilderness_Area | Designation of wilderness areas. |
| 12 | Soil_Type | Designation of soil types. |

A. Dataset Exploration

The dataset chosen for the model building was 'Forest Cover Type' [9][12]. This dataset is a collection of cartographic information gathered for the purpose of predicting forest cover types from various geographical and environmental factors. It includes information about 7 different cover types in the Roosevelt National Forest in northern Colorado. The dataset consists of 581,012 observations with 54 different attributes. It is often used for classification and data mining tasks and has been widely studied in machine learning research.

B. Data Pre-Processing

Data preprocessing, which is a component of data preparation, refers to any type of processing carried out on raw data to prepare it for another data processing procedure. It has traditionally been a significant opening phase in the process of data mining. For example, dropping an unnecessary feature, or combining several features into a single new one, helps achieve a better time complexity.

One of the pre-processing methods, dimensionality reduction, has been used in the model. The same information was spread over many individual binary columns, together with other features that can be reduced, adding up to 55 columns (54 attributes plus the cover type label) and making the model slower. The features that have been converted using dimensionality reduction are described below; refer to Fig. 1.

1) Distance to Hydrology: The distance to hydrology refers to the distance between a forest cover and the nearest water body. To measure this distance, the Euclidean distance (DE) between the forest cover and the nearest water body is calculated. Euclidean distance is defined as the straight-line distance between two points; in this case it combines the horizontal and vertical distances to hydrology, both measured in meters. This distance is used as a predictor variable for forest cover type, as it is indicative of the water intake of a particular forest. The formula for calculating the distance to hydrology is given by equation (1).

$$\text{Distance} = d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2} \tag{1}$$

where p and q are the two points and d is the distance between p and q.

2) Hill Shade: Hill shading is a technique used to enhance topographical maps by adding a lighting effect based on elevation variations within the landscape. This technique simulates the effects of sunlight on hills and canyons, providing a clearer picture of the terrain. The resulting quantitative value ranges from 0 to 255. In the present study, the mean of the three hillshade indexes, recorded at 9 am, 12 pm, and 3 pm, was computed. The formula used to calculate the mean value is given by equation (2).
$$\text{Mean} = \bar{x} = \frac{\sum x_i}{n} \tag{2}$$

where x_i is the ith observation and n is the count of observations.

3) Wilderness Area: The study makes use of a dataset made up of four wilderness areas in the Roosevelt National Forest in northern Colorado. These areas serve as excellent examples of forests that have experienced few disturbances from human activity. The individual binary attributes for each wilderness area were converted into a single numeric attribute, resulting in a more convenient and manageable dataset. The conversion allowed this feature to be used effectively in the analysis of forest cover types. An illustration is provided in TABLE II.

4) Soil Type: Soil type is a major feature used in predicting forest cover type, and the dataset used in this study contains a total of 40 soil types. These 40 soil types were stored as individual binary attributes, taking up 40 columns in the dataset. To reduce workload and improve efficiency, this binary data was condensed into a single column, represented by integers ranging from 1 to 40. An illustration of this process is provided in TABLE III. By converting the soil type data into a more manageable format, it was possible to use this feature effectively in the analysis of forest cover types.

TABLE II. WILDERNESS AREA DIMENSIONALITY REDUCTION

| Wilderness Area 1 | Wilderness Area 2 | Wilderness Area 3 | Wilderness Area 4 | Updated Wilderness Area |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 |
| 1 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 | 4 |
| 1 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 2 |

TABLE III. SOIL TYPE DIMENSIONALITY REDUCTION

| Type of Soil_1 | Type of Soil_2 | Type of Soil_3 | Type of Soil_4 | Type of Soil_5-39 | Type of Soil_40 | Updated Types of Soil |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | ..0.. | 0 | 3 |
| 1 | 0 | 0 | 0 | ..0.. | 0 | 1 |
| 0 | 0 | 0 | 0 | ..0.. | 1 | 40 |
| 0 | 0 | 0 | 1 | ..0.. | 0 | 4 |
| 0 | 1 | 0 | 0 | ..0.. | 0 | 2 |
| 0 | 0 | 0 | 0 | ..1.. | 0 | 23 |

C. Model Selection and Evaluation

Seven algorithms were compared using both raw and processed data to evaluate their accuracy and efficiency. The results showed that the K-Nearest Neighbors (KNN) algorithm performed the best with both raw and processed data, and it was selected for further evaluation and prediction. The technical details of the KNN algorithm are explained in the following section.

1) K-Nearest Neighbors (KNN): KNN (K-Nearest Neighbors) is a machine learning algorithm that can be used for both regression and classification tasks. In the case of classification, KNN assigns a label to an input data point based on the labels of its k nearest data points in the training dataset. The user usually chooses the value of k: when k is larger, the decision boundary is smoother; when k is smaller, the decision boundary is more complex. KNN is a non-parametric algorithm, which means it does not assume a specific data distribution. For regression tasks, KNN predicts a continuous value based on the values of the k nearest data points. However, for bigger datasets KNN can be computationally expensive, and the distance metrics used in KNN are sensitive to feature scaling.

2) System Diagram: A system diagram is a graphical representation of a system that demonstrates the interactions of its components. It helps to understand the architecture and flow of the system in a concise and easy-to-understand manner. The system diagram is shown in Fig. 1.
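As a minimal sketch of the pre-processing in Section III-B (equations (1) and (2), TABLE II and TABLE III), the plain-Python function below condenses one record into the reduced features. The one-hot column names (`Wilderness_Area1` ... `Wilderness_Area4`, `Soil_Type1` ... `Soil_Type40`), the output names, and the helper `reduce_row` are assumptions made for illustration; the paper does not list its code.

```python
import math
from statistics import mean


def reduce_row(row, wilderness_cols, soil_cols):
    """Return the condensed features for one record given as a dict."""
    return {
        # Equation (1): Euclidean distance from the two hydrology offsets.
        "Distance_To_Hydrology": math.hypot(
            row["Horizontal_Distance_To_Hydrology"],
            row["Vertical_Distance_To_Hydrology"],
        ),
        # Equation (2): mean of the three hillshade indexes.
        "Hillshade_Mean": mean(
            row[c] for c in ("Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm")
        ),
        # TABLE II and TABLE III: 1-based position of the single 1 in each group.
        "Wilderness_Area": [row[c] for c in wilderness_cols].index(1) + 1,
        "Soil_Type": [row[c] for c in soil_cols].index(1) + 1,
    }


# Hypothetical column names for the binary indicator groups.
wilderness_cols = [f"Wilderness_Area{i}" for i in range(1, 5)]
soil_cols = [f"Soil_Type{i}" for i in range(1, 41)]

# One made-up record: wilderness area 3, soil type 23.
example = {
    "Horizontal_Distance_To_Hydrology": 3,
    "Vertical_Distance_To_Hydrology": -4,
    "Hillshade_9am": 200,
    "Hillshade_Noon": 220,
    "Hillshade_3pm": 180,
    **{c: 0 for c in wilderness_cols + soil_cols},
    "Wilderness_Area3": 1,
    "Soil_Type23": 1,
}
reduced = reduce_row(example, wilderness_cols, soil_cols)
print(reduced)
```

The sketch assumes exactly one indicator is set in each group, as in the rows of TABLE II and TABLE III; a record with no indicator set would raise a `ValueError`.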
a) Splitting Dataset: The testing dataset was created by randomly selecting 30% of the total dataset. This testing dataset was used for the final evaluation of the model.

b) Training Model: The remaining 70% of the original dataset was selected for training purposes. This training portion was split again in a 70:30 ratio, where 70% of it was used for training the model and the remaining 30% was used for testing it (49% and 21% of the full dataset, respectively).

c) Cross-validation: Cross-validation is used to evaluate a model's performance in the field of machine learning. It involves dividing the data into a number of smaller subsets (or folds) and then training and testing the model on various combinations of the folds. This allows the model to be tested on multiple different samples of the data, which can help detect overfitting and give a more reliable estimate of the final model's accuracy. K-fold cross-validation is the most popular type of cross-validation. In this method, the data is divided into k evenly sized folds, and the model is trained k times, each time on k-1 folds and assessed on the remaining fold. The model employed a 5-fold cross-validation procedure.

d) Model Evaluation and Fine Tuning: After the training was completed, the held-out 30% testing dataset was used for the final evaluation of the model. The model was fine-tuned through the iterative process of changing the value of k, or n_neighbors, and evaluating the accuracy of the results. Through experimentation, it was observed that n_neighbors equal to 3 produced the most accurate results.

This was the process for successfully building, training, and testing the model for forest cover type prediction.

TABLE IV. ALGORITHM PERFORMANCE WITH RAW DATA

| Sr. No. | Algorithm | Accuracy |
|---|---|---|
| 1. | Linear SVC | 0.497229 |
| 2. | Decision Tree Classifier | 0.935882 |
| 3. | Logistic Regression | 0.619160 |
| 4. | Gaussian NB | 0.458177 |
| 5. | Random Forest Classifier | 0.952198 |
| 6. | Gradient Boosting Classifier | 0.772076 |
| 7. | K-Nearest Neighbors | 0.966105 |
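The hold-out split, 5-fold cross-validation, and n_neighbors tuning of steps a) to d) could be sketched with scikit-learn roughly as follows. This is an illustrative sketch, not the authors' code: the function name `tune_knn`, the candidate values of k, the random seed, and the synthetic `make_blobs` data are all assumptions, and the inner 70:30 split of step b) is left out because cross-validation plays the same role here.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def tune_knn(X, y, candidate_ks=(1, 3, 5, 7), seed=0):
    """Pick n_neighbors by 5-fold CV, then score once on a 30% hold-out."""
    # a) Randomly hold out 30% of the data for the final evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    # c) and d) Mean 5-fold cross-validation accuracy for each candidate k.
    cv_scores = {}
    for k in candidate_ks:
        model = KNeighborsClassifier(n_neighbors=k)
        cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()
    best_k = max(cv_scores, key=cv_scores.get)
    # Refit on the whole training portion, then evaluate on the hold-out set.
    final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return best_k, cv_scores, final_model.score(X_test, y_test)


if __name__ == "__main__":
    # Synthetic stand-in for the forest cover data (three well-separated classes).
    X, y = make_blobs(
        n_samples=300,
        centers=[[0, 0], [10, 10], [-10, 10]],
        cluster_std=1.0,
        random_state=0,
    )
    best_k, cv_scores, test_accuracy = tune_knn(X, y)
    print(best_k, cv_scores, test_accuracy)
```

Keeping the 30% hold-out untouched until the last line means the reported test accuracy is not influenced by the choice of n_neighbors.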
IV. RESULTS AND DISCUSSIONS
The initial focus of the project was to boost the performance of the model by reducing the number of columns while retaining the information they carry, thus decreasing the time complexity of the model. The time complexity of a KNN prediction by exhaustive search is O(DN), where,
- D is the number of features
- N is the number of samples
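To make the O(DN) cost concrete, a brute-force KNN prediction can be sketched in plain Python. The function name `knn_predict` and the toy data are hypothetical, and library implementations may use faster index structures; the point is only that each prediction computes one distance per training sample (N of them), and each distance sums over D features.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # One distance per training sample (N of them), each summing over D
    # features: this loop is the O(DN) part of the prediction.
    distances = [
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    ]
    # Sorting adds O(N log N) in this sketch; a partial selection would avoid it.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two small clusters labelled 1 and 2.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [1, 1, 1, 2, 2, 2]
print(knn_predict(train_X, train_y, (0.2, 0.1)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```

Under this cost model, reducing D while keeping N fixed shrinks the distance work per prediction in the same proportion.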
The primary objective of data pre-processing was to optimize the time taken for data processing. Before pre-processing, the cost of the model was proportional to 54n, since each of the n samples carries 54 features. After pre-processing, this dropped to 9n, resulting in a significant decrease in processing time. The final performances of the algorithms are given in TABLE IV.

From TABLE IV and Fig. 2, it was observed that Linear SVC gave one of the lowest accuracies, around 48% to 50%, before and after pre-processing (on the raw data only Gaussian NB was lower, at about 46%), while KNN and Random Forest both reached around 95% to 97%. The KNN algorithm stood out from the other algorithms because its closest competitor, Random Forest, took more time and had a higher time complexity, which lowered its overall performance.

Fig. 2. Algorithm Performance With Processed Data

The confusion matrix in Fig. 3 is a table that displays how well a classification model performed, where every cell represents the frequency of occurrence of a certain classification outcome. In this study, the confusion matrix is a 7x7 table where the rows and columns represent the 7 different forest cover types.
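A confusion matrix of this kind, and the per-class accuracy read off its diagonal, can be sketched as follows. The helper names and the tiny label lists are hypothetical, and the convention that rows are true classes and columns are predicted classes is an assumption; the paper does not state which axis is which.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count outcomes; rows are true classes, columns are predicted classes.

    Labels are assumed to be integers from 1 to n_classes.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label - 1][predicted_label - 1] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal count divided by the row total, for each true class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


# Made-up labels for three classes, purely for illustration.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
print(per_class_recall(cm))
```

With this convention, the off-diagonal entries in a row are the false negatives for that row's class, and the off-diagonal entries in a column are its false positives.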
The principal diagonal shows the forest cover types correctly predicted by our model. Every other cell holds wrong predictions and shows which type the model predicted instead of the type it should have predicted.

For instance, in the cell (1,1), for the first forest cover type, the model predicted 60483 instances correctly out of 63399, resulting in a per-class accuracy of 0.954. However, the model also misclassified 2916 instances, of which 2646 were classified as the second forest cover type instead of the correct first forest cover type. With respect to the first cover type, this kind of misclassification is known as a Type II error or false negative: the model failed to recognize an instance that truly belongs to the class. The misclassified predictions were due to the attribute values being similar; where the attribute values did vary, the differences were very minor.

V. CONCLUSION

To make the model more efficient while keeping its predictions accurate, we successfully decreased the time taken, that is, the time complexity of the model, by applying various dimensionality reduction methods. The accuracies and performances of the employed algorithms were compared, and it was found that KNN provides better prediction, with an accuracy rate of 95.5%, and better performance than the other algorithms.

The confusion matrix made it easier to understand and evaluate the model by identifying the correct and false predictions. On further analysis, it was observed that most misclassified predictions were due to similar attribute values, and where attributes did differ, the differences were very small.

Using all these modifications and features, the forest cover predictions could be used by the Forest Department and similar bodies to determine the type of forest in an affected area or an area of interest. The predictions can be fetched through a user interface by providing the required attributes, which are easily available.

ACKNOWLEDGEMENTS

We are thankful to the UC Irvine Machine Learning Repository for making the data from the US Forest Service (USFS) and the US Geological Survey (USGS) publicly available.

We want to sincerely thank Ms. Manya Gidwani, who guided us throughout our research and provided invaluable support and assistance. Her expertise and guidance were instrumental in the successful completion of this project, and we are very grateful for her contribution.

REFERENCES

[1] T. Anant, R. Bhargavi, T. Anant, and R. M., "Forest cover type prediction using cartographic variables," International Journal of Computer Applications, vol. 182, no. 30, pp. 14-18, 2018, DOI: 10.5120/ijca2018918191.

[3] V. Mosin, R. Aguilar, A. Platonov, A. Vasiliev, A. Kedrov, and A. Ivanov, "Remote sensing and machine learning for tree detection and classification in forestry applications," in Image and Signal Processing for Remote Sensing XXV, vol. 11155, pp. 130-141, SPIE, 2019, DOI: 10.1117/12.2531820.

[4] A. Smith and B. N. Rock, "Mapping forest types using multi-sensor remote sensing methods," in IGARSS 2008 - 2008 IEEE International Geoscience and Remote Sensing Symposium, vol. 3, pp. III-708, IEEE, 2008, DOI: 10.1109/IGARSS.2008.4779446.

[5] J. A. Blackard and D. J. Dean, "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables," Computers and Electronics in Agriculture, vol. 24, no. 3, pp. 131-151, 1999, DOI: 10.1016/S0168-1699(99)00046-0.

[6] H. Sjöqvist, M. Längkvist, and F. Javed, "An analysis of fast learning methods for classifying forest cover types," Applied Artificial Intelligence, pp. 1-19, 2020, DOI: 10.1080/08839514.2020.1771523.

[7] R. R. Kishore, S. S. Narayan, S. Lal, and M. A. Rashid, "Comparative accuracy of different classification algorithms for forest cover type prediction," in 2016 3rd Asia-Pacific World Congress on Computer Science and Engineering (APWC on CSE), pp. 116-123, IEEE, 2016, DOI: 10.1109/APWC-on-CSE.2016.029.

[8] S. Feng, "Predicting forest cover types with immune and genetic," in 2012 IEEE 14th International Conference on Communication Technology, pp. 966-970, IEEE, 2012, DOI: 10.1109/ICCT.2012.6511338.

[9] M. Trebar and N. Steele, "Application of distributed SVM architectures in classifying forest data cover types," Computers and Electronics in Agriculture, vol. 63, no. 2, pp. 119-130, 2008, DOI: 10.1016/j.compag.2008.02.001.

[10] K. Taunk, S. De, S. Verma, and A. Swetapadma, "A brief review of nearest neighbor algorithm for learning and classification," in 2019 International Conference on Intelligent Computing and Control Systems (ICCS), pp. 1255-1260, IEEE, 2019, DOI: 10.1109/ICCS45141.2019.9065747.

[11] UCI Machine Learning, "Forest cover type dataset," Kaggle, 03-Nov-2016. [Online]. Available: https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset.

[12] "Sci-kit Learn," sci-kit. [Online]. Available: https://scikit-learn.org/stable/index.html.

[13] "Weka 3: Machine Learning Software," Weka 3 - Data Mining with Open Source Machine Learning Software in Java. [Online]. Available: https://www.cs.waikato.ac.nz/ml/weka/.

[14] "UCI Forest Cover type dataset," UCI Machine Learning Repository: Covertype data set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/covertype.