Random Forest
Data mining involves several common classes of tasks: anomaly detection, association rule
learning, clustering, classification, regression, summarization, and sequential pattern mining.
Classification organizes data into classes using predetermined class labels. Classification
algorithms normally use a training set in which every object is already associated with a known
class label. The classification algorithm learns from the training set and builds a model, also
called a classifier, as shown in Figure 1. The model is then applied to predict the class labels
of the unclassified objects in the testing data, as shown in Figure 3.2.
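The train-then-predict workflow described above can be sketched in a few lines. The sketch below uses a 1-nearest-neighbour classifier as the model purely for illustration; the text does not prescribe any particular algorithm, and the data set is made up.

```python
import math

def train(training_data):
    """'Training' for 1-NN is simply storing the labeled instances."""
    return list(training_data)  # the model is the training set itself

def predict(model, instance):
    """Assign the class label of the closest training instance."""
    features, label = min(model, key=lambda fl: math.dist(fl[0], instance))
    return label

# Training set: every object already has a known class label.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
model = train(training)

# The model predicts class labels for unclassified testing instances.
print(predict(model, (1.1, 0.9)))  # → A
print(predict(model, (5.1, 4.9)))  # → B
```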
[Figure: the training data set is fed to the classification algorithm, which builds a classifier model; the classifier model is then applied to the testing data set to predict the correct class for each instance.]
There is no outright winner between them; it depends on the data, the simulation and the circumstances.
Bagging and Boosting both decrease the variance of a single estimate, as they combine several
estimates from different models, so the result may be a model with higher stability.
If the problem is that the single model performs poorly, Bagging will rarely improve the bias.
Boosting, however, can generate a combined model with lower error, as it builds on the
advantages and reduces the pitfalls of the single model.
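A quick way to see why combining several models can beat a single one: if n base classifiers make independent errors, the majority vote is correct whenever more than half of them are. The calculation below makes that independence assumption explicit (real ensemble members are correlated, so this is an idealized upper bound, not a property of any specific method in the text).

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right label.
    Assumes n is odd, so a majority needs n // 2 + 1 votes."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(0.7, 1))             # single model → 0.7
print(round(majority_vote_accuracy(0.7, 11), 3))  # ensemble of 11 → 0.922
```

Under the independence assumption, an ensemble of eleven 70%-accurate classifiers reaches about 92% accuracy, which is the intuition behind the stability gain mentioned above.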
What is Random Forest (RF)?
RF is an ensemble learning method used for classification and regression. Developed
by Breiman, it combines Breiman's bagging sampling approach with the random
selection of features, which was introduced independently, in order to construct a collection
of decision trees with controlled variation. Using bagging, each decision tree in the ensemble
is built from a sample drawn with replacement from the training data. Statistically, about
63.2% of the distinct training instances are expected to appear at least once in such a sample.
These are referred to as in-bag instances, and the remaining instances (about 36.8%) are
referred to as out-of-bag instances. Each tree in the ensemble acts as a base classifier to
determine the class label of an unlabeled instance. This is done via majority voting, where
each classifier casts one vote for its predicted class label, and then the class label with the most
votes is used to classify the instance.
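The in-bag fraction quoted above can be checked empirically: when n items are drawn with replacement from n instances, each instance appears with probability 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 as n grows. A minimal simulation:

```python
import random

random.seed(42)
n = 10_000
# Bootstrap: draw n indices with replacement from n instances.
sample = [random.randrange(n) for _ in range(n)]
in_bag_fraction = len(set(sample)) / n   # distinct instances that appear

print(round(in_bag_fraction, 3))  # close to 0.632
```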
Random forest is a supervised learning algorithm that can be used for both classification
and regression, although it is mainly used for classification problems. A forest is
made up of trees, and more trees generally mean a more robust forest. Similarly, the random
forest algorithm builds decision trees on data samples, gets a prediction from each of them,
and finally selects the best solution by means of voting. As an ensemble method it is better
than a single decision tree, because it reduces over-fitting by averaging the results.
Some important characteristics of random forest are:
The same random forest algorithm, or random forest classifier, can be used for both
classification and regression tasks.
The random forest classifier can handle missing values.
With more trees in the forest, the random forest classifier is less likely to overfit the
model.
The random forest classifier can also model categorical values.
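The whole pipeline described above — bootstrap sampling, random feature selection, and majority voting — can be sketched compactly. To keep the code short, each "tree" here is a one-feature decision stump rather than a full decision tree, and the toy data set is made up; this is an illustration of the idea, not Breiman's full algorithm.

```python
import random
from collections import Counter

def train_stump(data, feature):
    """Pick the threshold on one feature that best separates the labels."""
    best = None
    for x, _ in data:
        t = x[feature]
        for side in (True, False):
            pred = lambda v, t=t, s=side: s == (v[feature] >= t)
            acc = sum(pred(x) == y for x, y in data)
            if best is None or acc > best[0]:
                best = (acc, t, side)
    _, t, side = best
    return lambda v: side == (v[feature] >= t)

def train_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_feats = len(data[0][0])
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]   # bagging: sample with replacement
        feat = rng.randrange(n_feats)             # random feature selection
        forest.append(train_stump(boot, feat))
    return forest

def predict(forest, x):
    votes = Counter(tree(x) for tree in forest)   # each tree casts one vote
    return votes.most_common(1)[0][0]             # majority vote wins

# Toy data: label is True when both feature values are high.
data = [((0, 0), False), ((0, 1), False), ((1, 0), False),
        ((2, 2), True), ((3, 2), True), ((2, 3), True)]
forest = train_forest(data)
print(predict(forest, (3, 3)), predict(forest, (0, 0)))  # → True False
```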
[Figure: an input data set is passed to Tree1, Tree2 and Tree3; each tree produces its own prediction, and the random forest prediction is obtained by majority voting over the trees' predictions.]
[Figure: example decision trees grown on different bootstrap samples, splitting on the features Chest Pain, Good Blood Circulation and Blocked Arteries; the accompanying tables list each sampled instance's Yes/No outcome under the corresponding tree.]
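The final voting step in the example above reduces to a simple tally: each tree casts one Yes/No vote and the label with the most votes wins. The votes below are made-up values for illustration.

```python
from collections import Counter

tree_votes = ["Yes", "No", "Yes"]                      # one vote per tree
prediction = Counter(tree_votes).most_common(1)[0][0]  # majority label
print(prediction)  # → Yes
```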