
Assessment of the Random Forest Algorithm

Ryam Jerland R. Cudal


Bachelor of Science in Information Technology, Davao del Norte State College
Institute of Computing, Davao del Norte State College

I. Background
1.1 Introduction
Different studies have shown the presence of microseismic activity in soft-rock landslides. The
seismic signals exhibit significantly different features in the time and frequency domains, which allow
their classification and interpretation. Most of the classes can be associated with different mechanisms
of deformation occurring within and at the surface (rockfall, slide-quake, fissure opening, fluid
circulation). However, some signals remain not fully understood, and some classes contain too few
examples to permit interpretation. To move toward a more complete interpretation of the links between
the dynamics of soft-rock landslides and the physical processes controlling their behavior, a complete
catalog of the endogenous seismicity is needed. We propose a multi-class detection method based on
the random forest algorithm to automatically classify the sources of seismic signals. Random forest is
a supervised machine learning technique based on the computation of a large number of decision
trees. The multiple decision trees are constructed from training sets that include each of the target
classes. In the case of seismic signals, the attributes may encompass spectral features as well as
waveform characteristics, multi-station observations, and other relevant information. The random
forest classifier is used because it provides state-of-the-art performance compared with other machine
learning techniques (e.g., SVM, neural networks) and requires little fine-tuning. Furthermore, it is
relatively fast, robust, easy to parallelize, and inherently suited to multi-class problems. In this work,
we present the first results of the classification method applied to the seismicity recorded at the
Super-Sauze landslide between 2013 and 2015. We selected a dozen seismic signal features that
precisely characterize each signal's spectral content (e.g., central frequency, spectrum width, energy
in several frequency bands, spectrogram shape, local and global spectral maxima) and its waveform
(e.g., duration, ratio between the maximum and the mean/median of the envelope amplitude,
envelope kurtosis and skewness, polarization). This preliminary study shows that the classification
accuracy is high and insensitive to sampling permutations of the training/validation sets.
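The multi-class classification described above can be sketched in a few lines. This is an illustrative stand-in, not the actual Super-Sauze workflow: the class labels mirror the four mechanisms named in the text, but the feature table is randomly generated rather than extracted from real seismograms.

```python
# Sketch of multi-class seismic-event classification with a random forest.
# The class names follow the text; the feature values are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
classes = ["rockfall", "slide-quake", "fissure", "fluid"]

# Synthetic feature table: one row per signal, columns standing in for
# attributes such as central frequency, duration, envelope kurtosis.
n_per_class = 50
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(n_per_class, 4))
               for i in range(len(classes))])
y = np.repeat(classes, n_per_class)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)  # accuracy on the training signals
```

With well-separated synthetic classes the forest fits the training set almost perfectly; on real catalogs, accuracy would instead be measured on held-out validation signals.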

1.2 Algorithm Overview


In the last 15 years, several machine learning approaches have been developed for classification and
regression. We introduce, in an intuitive manner, the main ideas of classification and regression trees,
support vector machines, bagging, boosting, and random forests. We discuss differences in the use of
machine learning between the biomedical community and the computer sciences, and propose
methods for comparing machines on a sound statistical basis. Data from the German Stroke Study
Collaboration are used for illustration. We compare the results from learning machines to those
obtained by a published logistic regression and discuss similarities and differences.
Keywords:
bagging, boosting, random forests, acute ischemic strokes, support vector machines, SVM, machine
learning, data mining, bioinformatics, classification, regression trees, patient-centered prognosis,
prognostic studies, biomedical prognosis, clinical epidemiology, tutorial, medical prognosis

II. Time Complexity


2.1 Big O Notation
Random forest is an ensemble model of decision trees. The time complexity of building a complete,
unpruned decision tree is O(v * n log(n)), where n is the number of records and v is the number of
variables/attributes. When building a random forest, you must define the number of trees to build (call
it ntree) and the number of variables to sample at each node (call it mtry). Since only mtry variables
are considered at each node, the complexity of building one tree is O(mtry * n log(n)). For a random
forest of ntree trees, the complexity is therefore O(ntree * mtry * n log(n)). This assumes each tree is
roughly balanced, with depth O(log n); in practice the build process of a tree often stops well before
this, and the actual cost is hard to estimate. You can also restrict the depth of the trees in your random
forest: if the maximum depth is capped at d, the calculation simplifies to O(ntree * mtry * d * n).
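The two formulas can be turned into a small cost estimator. This is a rough operation count under the stated assumptions (balanced trees, cost dominated by node splits), not a measured runtime; the function name and example parameters are illustrative.

```python
import math

def rf_build_cost(ntree, mtry, n, max_depth=None):
    """Rough operation count for building a random forest:
    O(ntree * mtry * n log n) for unrestricted trees, or
    O(ntree * mtry * d * n) when depth is capped at d."""
    depth = max_depth if max_depth is not None else math.log2(n)
    return ntree * mtry * depth * n

# Unrestricted trees on n = 1024 records: depth ~ log2(1024) = 10
full = rf_build_cost(ntree=100, mtry=4, n=1024)      # 4,096,000
# Capping depth at d = 5 halves the estimate
capped = rf_build_cost(ntree=100, mtry=4, n=1024, max_depth=5)  # 2,048,000
```

The comparison makes the trade-off concrete: halving the allowed depth halves the estimated build cost, at the price of shallower (potentially underfit) trees.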
2.2 Discussion
The random forest algorithm is a powerful and widely used machine learning technique that belongs
to the ensemble learning family. It is known for its versatility, robustness, and ability to handle complex
datasets. Random forest combines multiple individual decision trees to make predictions. Ensemble
methods leverage the wisdom of crowds by aggregating the predictions of multiple models, often
resulting in better performance than any individual model.
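For classification, the aggregation step is simply a majority vote over the per-tree predictions. A minimal stdlib sketch (the function name and vote values are illustrative):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Aggregate per-tree class votes into the forest's final prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three of five hypothetical trees vote "spam", so the forest predicts "spam"
forest_prediction = majority_vote(["spam", "ham", "spam", "spam", "ham"])
```

For regression, the analogous aggregation is the mean of the per-tree outputs.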
III. Algorithm Simulation
3.1 Real-World Simulation
While the "Forest" part of random forests refers to training multiple trees, the "Random" part enters
the algorithm at two different points: there is the randomness involved in the bagging process, and, in
addition, a random subset of features is considered when evaluating each node split.
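The two sources of randomness can each be sketched as one stdlib function; the record and feature names here are illustrative placeholders.

```python
import random

rng = random.Random(0)

def bootstrap_sample(records):
    """Bagging: draw n records with replacement from the n training records."""
    return [rng.choice(records) for _ in records]

def candidate_features(features, mtry):
    """Node-level randomness: evaluate only mtry randomly chosen features."""
    return rng.sample(features, mtry)

sample = bootstrap_sample(list(range(10)))
subset = candidate_features(["freq", "duration", "kurtosis", "energy"], mtry=2)
```

Each tree gets its own bootstrap sample, and every split inside a tree draws a fresh feature subset, which is what decorrelates the trees in the forest.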
3.2 Test Cases
A standard way to use RFs is to grow a single global RF to predict all test cases of interest. In this
article, we propose growing different RFs specific to different test cases, namely case-specific random
forests (CSRFs). In contrast to the uniform bagging procedure used to build standard RFs, the CSRF
algorithm takes weighted bootstrap resamples to create individual trees, assigning large weights a
priori to the training cases in close proximity to the test case of interest.
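The weighted-resampling idea can be sketched as follows. The inverse-squared-distance weighting used here is an illustrative proximity measure, not the exact weighting scheme of the CSRF paper.

```python
import random

def case_specific_weights(train_X, test_x):
    """Weight each training case by inverse squared distance to the test
    case (an illustrative stand-in for the paper's proximity weights)."""
    return [1.0 / (1e-9 + sum((a - b) ** 2 for a, b in zip(row, test_x)))
            for row in train_X]

def weighted_bootstrap(n, weights, rng):
    """Draw a bootstrap resample in which nearby cases are favored."""
    return rng.choices(range(n), weights=weights, k=n)

train_X = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0)]
w = case_specific_weights(train_X, test_x=(0.0, 0.05))
idx = weighted_bootstrap(len(train_X), w, random.Random(0))
```

A tree grown on such a resample is dominated by neighbors of the test case, which is the intuition behind growing a separate forest per test case.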
3.3 Results and Observations

To provide a detailed report on the results of the simulation for the random forest algorithm, we
simulate the process on a well-known dataset, the Iris dataset, and then discuss the outcomes of
various performance metrics, feature importance, and hyperparameter tuning. The results include
accuracy, precision, recall, F1-score, and ROC-AUC.
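The metric computation on Iris can be sketched with scikit-learn; the split ratio and random seeds are illustrative choices, and macro averaging is assumed for the multi-class metrics.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

accuracy = accuracy_score(y_te, y_pred)
precision = precision_score(y_te, y_pred, average="macro")
recall = recall_score(y_te, y_pred, average="macro")
f1 = f1_score(y_te, y_pred, average="macro")
# One-vs-rest ROC-AUC from class probabilities (multi-class setting)
roc_auc = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr")
```

On Iris, all five metrics typically land well above 0.9, since the classes are nearly linearly separable.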
IV. Algorithm Assessment
4.1 Performance Evaluation
In microarray datasets, hundreds to thousands of genes are measured in a small number of samples,
and sometimes, due to problems that occur during the experiment, the expression values of some
genes are recorded as missing. Determining which genes cause a disease or cancer among such a
large number of genes is a difficult task. This study aimed to find effective genes in pancreatic cancer
(PC). First, the K-nearest neighbor (KNN) imputation method was used to solve the problem of
missing values (MVs) in the gene expression data. Then, the random forest algorithm was used to
identify the genes associated with PC.

4.2 Evaluation Methods


By applying a variety of metrics and techniques, we aim to ensure the model’s reliability, effectiveness,
and interpretability. We demonstrate this methodology using the Iris dataset, adapting it for a binary
classification problem. The steps include data preparation, model training, hyperparameter tuning,
predictions, performance metrics calculation, visualization, feature importance analysis, cross-
validation, and model interpretation.
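Of the steps listed, cross-validation is the one that most directly guards reliability; a minimal sketch on the Iris dataset mentioned above (fold count and seed are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy (stratified folds for classification)
scores = cross_val_score(clf, X, y, cv=5)
mean_accuracy = scores.mean()
```

Reporting the mean and spread of the fold scores, rather than a single train/test split, gives a more trustworthy estimate of generalization performance.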
4.3 Results
Quantitative results demonstrate that the random forest algorithm performs efficiently in identifying
the target classes across the tested data sets. Comparative analysis highlights the algorithm's
suitability for classification tasks on data sets of this kind.

V. Discussion
5.1 Algorithmic Strengths
The random forest algorithm is a powerful ensemble learning method used for both classification and
regression tasks in machine learning. Random forests generally achieve high accuracy compared to
other machine learning algorithms, and they are robust to overfitting, especially when the number of
trees in the forest is large.

5.2 Limitations and Challenges


Random forest has several limitations. It struggles with high-cardinality categorical variables,
unbalanced data, time-series forecasting, and variable interpretation, and it is sensitive to
hyperparameters. Another limitation is a decrease in classification accuracy when redundant variables
are present. Open challenges for the random forest algorithm include addressing class-imbalance
problems, inefficient memory utilization during training, and the need for low-complexity solutions in
smart environments.
5.3 Comparative Analysis
This article presents a comparative analysis of two decision tree algorithms, random forest and C4.5,
for airline customer satisfaction classification. The study compares the accuracy, precision, recall, and
AUC (area under the curve) of both algorithms on an airline customer satisfaction data set; the
findings are useful for later work on similar data sets and problems. In this comparative analysis, the
data set is first selected and transformed so it can be used with data mining classification techniques;
the chosen algorithms are then applied to analyze it.
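A forest-versus-single-tree comparison of this kind can be sketched as follows. Two assumptions are worth flagging: the airline survey data is not bundled here, so a generated data set stands in for it, and scikit-learn implements CART rather than C4.5, so a single CART tree serves as the single-tree baseline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set (the airline survey data is not publicly bundled here)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Note: scikit-learn provides CART, not C4.5; a single CART tree is used
# as the single-tree baseline in this sketch.
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
forest_acc = RandomForestClassifier(n_estimators=100,
                                    random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
```

On data of this size the ensemble typically outperforms the single tree, which is the pattern such comparative studies usually report.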
References
[1] F. Provost, C. Hibert, and J. P. Malet, "Automatic classification of endogenous seismic sources
within a landslide body using random forest algorithm," EGU General Assembly Conference Abstracts,
vol. 18, p. 15705, 2016.
[2] "What is the time complexity of a Random Forest, both building the model and classification?,"
Quora, 2022. [Online]. Available: https://www.quora.com/What-is-the-time-complexity-of-a-Random-
Forest-both-building-the-model-and-classification
[3] I. R. König, J. D. Malley, S. Pajevic, C. Weimar, H. C. Diener, and A. Ziegler, "Patient-centered
yes/no prognosis using learning machines," International Journal of Data Mining and Bioinformatics,
vol. 2, pp. 289–341, 2008.
[4] R. Xu, D. Nettleton, and D. J. Nordman, "Case-specific random forests," Journal of Computational
and Graphical Statistics, vol. 25, no. 1, pp. 49–65, Jan. 2016, doi: 10.1080/10618600.2014.983641.
[5] N. Rabiei, A. R. Soltanian, M. Farhadian, and F. Bahreini, "The performance evaluation of the
random forest algorithm for a gene selection in identifying genes associated with resectable
pancreatic cancer in microarray dataset: a retrospective study," Cell Journal, vol. 25, no. 5, pp. 347–
353, May 2023, doi: 10.22074/cellj.2023.1971852.1156.
[6] W. Baswardono, D. Kurniadi, A. Mulyani, and D. M. Arifin, "Comparative analysis of decision tree
algorithms: Random forest and C4.5 for airlines customer satisfaction classification," Journal of
Physics: Conference Series, vol. 1402, no. 6, p. 066055, Dec. 2019, doi:
10.1088/1742-6596/1402/6/066055.
