Random Forest

Classification organizes data into predefined classes using classification algorithms and a training dataset with known class labels. The algorithm builds a classification model from the training data to predict the class labels of unlabeled data. Ensemble classification uses multiple classifiers that vote to determine class labels, which can improve accuracy over individual classifiers. Bagging and boosting are common ensemble methods that generate additional training data and build classifiers sequentially or in parallel to combine their predictions.


Introduction to Classification

Data mining involves several common classes of tasks: anomaly detection, association rule
learning, clustering, classification, regression, summarization, and sequential pattern mining.
Classification organizes data into classes by using predetermined class labels. Classification
algorithms normally use a training set in which every object is already associated with a known
class label. The classification algorithm learns from the training set and builds a model, also
called a classifier, as shown in Figure 1. The model is then applied to predict the class labels
of the unclassified objects in the testing data, as shown in Figure 2.
Figure 1 Classification process – classifier construction (the training data set, with attributes and known class labels, is fed to the classification algorithm, which outputs a classifier model)

Figure 2 Classification process – prediction (the classifier model is applied to the testing data set to predict the correct class for each object)
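
As a concrete illustration of Figures 1 and 2, the following is a minimal sketch in Python, assuming scikit-learn is available; the four attributes and the toy training and testing data are invented for illustration only.

# Classifier construction (Figure 1): learn a model from labeled training data,
# then prediction (Figure 2): apply the model to unclassified objects.
from sklearn.tree import DecisionTreeClassifier

# Each row is [Attribute1, Attribute2, Attribute3, Attribute4]; labels are known.
X_train = [[1, 0, 1, 0],
           [0, 1, 1, 1],
           [1, 1, 0, 0],
           [0, 0, 0, 1]]
y_train = ["Yes", "Yes", "No", "No"]

model = DecisionTreeClassifier().fit(X_train, y_train)   # the classifier model

X_test = [[1, 0, 0, 1]]                                   # unclassified object
print(model.predict(X_test))                              # predicted class label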


What is Ensemble classification?
Ensemble classification is an application of ensemble learning used to boost the accuracy of
classification. Ensemble learning is a machine learning paradigm in which multiple models are
used to solve the same problem. In ensemble classification, multiple classifiers are combined,
and the ensemble is often more accurate than any of its individual classifiers. A voting scheme
is then used to determine the class label for unlabeled instances. A simple yet effective voting
scheme is majority voting: each classifier in the ensemble is asked to predict the class label of
the instance being considered, and once all the classifiers have been queried, the class that
receives the greatest number of votes is returned as the final decision of the ensemble. Veto
voting is an alternative scheme in which a single classifier can veto the decision of the other
classifiers.
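
A minimal sketch of majority voting in Python, assuming the classifiers follow the scikit-learn predict() convention; the ensemble itself is hypothetical:

from collections import Counter

def majority_vote(classifiers, instance):
    # Ask every classifier in the ensemble for a class label (one vote each),
    # then return the label that receives the greatest number of votes.
    votes = [clf.predict([instance])[0] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]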
Widely used ensemble approaches include the following:
1. Boosting: an incremental process of building a sequence of classifiers, where each
classifier works on the incorrectly classified instances of the previous one in the sequence.
2. Bagging: building each classifier in the ensemble on a randomly drawn sample of the data,
with each classifier given an equal vote when labeling unlabeled instances. Bagging is known
to be more robust than boosting against model overfitting.
How do Bagging and Boosting get N learners?
Bagging and Boosting get N learners by generating additional data in the training stage: N new
training data sets are produced by random sampling with replacement from the original set.
Because the sampling is done with replacement, some observations may be repeated in each new
training data set. In the case of Bagging, every element has the same probability of appearing
in a new data set, whereas in Boosting the observations are weighted, so some of them will take
part in the new sets more often.
Here we come to the main difference between the two methods. While the training stage is
parallel for Bagging (each model is built independently), Boosting builds the new learners
sequentially: each classifier is trained on data that takes the previous classifiers' success
into account. After each training step the weights are redistributed; misclassified instances
have their weights increased to emphasise the most difficult cases, so that subsequent learners
focus on them during their training.
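
The following sketch (using NumPy) illustrates how the N training sets could be generated; the weight update shown for Boosting is a simplified illustration rather than a specific algorithm such as AdaBoost:

import numpy as np

rng = np.random.default_rng(0)
n = 8                                     # number of original training instances

# Bagging: sample with replacement, every instance equally likely
bag_indices = rng.choice(n, size=n, replace=True)

# Boosting: instances carry weights; misclassified ones (hypothetical here)
# get larger weights, so they appear more often in the next training set
weights = np.ones(n) / n
misclassified = np.array([1, 0, 0, 1, 0, 0, 0, 1])
weights[misclassified == 1] *= 2.0        # emphasise the difficult cases
weights /= weights.sum()
boost_indices = rng.choice(n, size=n, replace=True, p=weights)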
How does the classification stage work?
To predict the class of new data we only need to apply the N learners to the new observations.
In Bagging, the result is obtained by averaging the responses of the N learners (or by majority
vote). Boosting, however, assigns a second set of weights, this time to the N classifiers, in
order to take a weighted average of their estimates.
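
A small sketch of the two combination rules, with hypothetical per-learner estimates and classifier weights:

import numpy as np

predictions = np.array([0.9, 0.2, 0.7])   # one estimate per learner (hypothetical)
clf_weights = np.array([0.5, 0.2, 0.3])   # Boosting's second set of weights

bagging_result = predictions.mean()                             # plain average
boosting_result = np.average(predictions, weights=clf_weights)  # weighted average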

There is no outright winner between the two; which works better depends on the data, the
simulation and the circumstances. Both Bagging and Boosting decrease the variance of a single
estimate because they combine several estimates from different models, so the result may be a
model with higher stability. If the problem is that the single model has very poor performance,
Bagging will rarely obtain a better bias; Boosting, however, can generate a combined model with
lower error, since it optimises the advantages and reduces the pitfalls of the single model.
What is Random Forest (RF)?
RF is an ensemble learning method used for classification and regression. Developed by Breiman,
the method combines Breiman's bagging sampling approach with the random selection of features
(introduced independently) in order to construct a collection of decision trees with controlled
variation. Using bagging, each decision tree in the ensemble is constructed from a sample drawn
with replacement from the training data. Statistically, about 63% of the instances are likely to
appear at least once in such a sample; these are referred to as in-bag instances, and the
remaining instances (about 37%) are referred to as out-of-bag instances. Each tree in the
ensemble acts as a base classifier when determining the class label of an unlabeled instance.
This is done via majority voting: each classifier casts one vote for its predicted class label,
and the class label with the most votes is used to classify the instance.
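
The in-bag/out-of-bag proportions follow from sampling with replacement: the chance that a given instance is never drawn in n draws is (1 - 1/n)^n ≈ 1/e ≈ 37%. A quick NumPy check:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.choice(n, size=n, replace=True)      # one bootstrap sample
in_bag = np.unique(sample).size / n
print(f"in-bag ≈ {in_bag:.2f}, out-of-bag ≈ {1 - in_bag:.2f}")   # ~0.63 / ~0.37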
Random forest is a supervised learning algorithm that can be used for both classification and
regression, although it is mainly used for classification problems. A forest is made up of
trees, and more trees mean a more robust forest. Similarly, the random forest algorithm creates
decision trees on data samples, gets a prediction from each of them, and finally selects the
best solution by means of voting. It is an ensemble method that performs better than a single
decision tree because it reduces over-fitting by averaging the results.
Some important characteristics of random forest are:
 The same random forest algorithm, or random forest classifier, can be used for both
classification and regression tasks.
 The random forest classifier can handle missing values.
 Adding more trees to the forest does not cause the random forest classifier to overfit the
model.
 The random forest classifier can also model categorical values.
Figure 3 Three different trees (Tree1, Tree2, Tree3) built from the given data

Figure 4 Tree predictions and the random forest prediction by majority voting (the input data set is passed to Tree1, Tree2 and Tree3; each tree outputs its own prediction, and the forest's prediction is obtained by majority voting over the trees' predictions)


Random Forest (RF) Algorithm
1. Randomly select “k” features from total “m” features.
Where k <= m
2. Among the “k” features, calculate the node “d” using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until “l” number of nodes has been reached.
5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
The random forest algorithm begins by randomly selecting “k” features out of the total “m”
features; as the figures illustrate, both features and observations are taken at random. In the
next stage, the randomly selected “k” features are used to find the root node by the best-split
approach. The daughter nodes are then calculated using the same best-split approach, and the
first three stages are repeated until a tree is formed with a root node and the target as the
leaf nodes. Finally, stages 1 to 4 are repeated to create “n” randomly built trees, and these
randomly created trees form the random forest.
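
A compact sketch of these steps in Python, assuming scikit-learn and NumPy are available; the random selection of k features at each split is delegated to the decision tree's max_features parameter, and the bootstrap sampling supplies the bagging part:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_random_forest(X, y, n_trees=10, k_features="sqrt", random_state=0):
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    forest = []
    for _ in range(n_trees):
        # draw a bootstrap sample (with replacement) of the training data
        idx = rng.choice(len(X), size=len(X), replace=True)
        # steps 1-4: grow one tree, choosing among k random features per split
        tree = DecisionTreeClassifier(max_features=k_features,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest                                 # step 5: n trees form the forest

def forest_predict(forest, x):
    # majority vote over the individual trees' predictions
    votes = [tree.predict([x])[0] for tree in forest]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]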
Outline of Proposed Approach
The working of the Random Forest algorithm can be understood with the help of the following
steps:
 Step 1 − First, start with the selection of random samples from a given dataset.
 Step 2 − Next, the algorithm constructs a decision tree for every sample and obtains a
prediction result from every decision tree.
 Step 3 − In this step, voting is performed over every predicted result.
 Step 4 − Finally, the most voted prediction result is selected as the final prediction result.
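
These four steps are also what scikit-learn's RandomForestClassifier performs internally; a minimal usage sketch (the encoded toy data is invented for illustration):

from sklearn.ensemble import RandomForestClassifier

X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 0, 0]]   # encoded attribute values
y = ["Yes", "Yes", "Yes", "No"]                    # class labels

rf = RandomForestClassifier(n_estimators=3, bootstrap=True, random_state=0)
rf.fit(X, y)                      # steps 1-2: random samples and one tree per sample
print(rf.predict([[0, 0, 1]]))    # steps 3-4: the trees vote and the majority wins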

Figure 5 Outline and working of random forest (the training data set is split into random samples Sample1, Sample2 and Sample3; a tree is built on each sample, and the final prediction is obtained by majority voting over Tree1, Tree2 and Tree3)


Illustration with an example
Table 1 Simple cardiovascular disease data set 1

S. No  Chest Pain  Good Blood Circulation  Blocked Arteries  Cardiovascular disease
1      Yes         Yes                     Yes               Yes
2      Yes         Yes                     No                Yes
3      Yes         No                      Yes               Yes
4      Yes         No                      No                No
5      No          Yes                     Yes               Yes
6      No          Yes                     No                No
7      No          No                      Yes               Yes
8      No          No                      No                No

Figure 6 Tree1 (Learner 1) built from data set sample 1: the root splits on Blocked Arteries (Yes → class Yes); for Blocked Arteries = No the tree splits on Chest Pain (No → class No) and then on Good Blood Circulation (Yes → class Yes, No → class No)


Table 2 Simple cardiovascular disease data set 2

S. No  Chest Pain  Good Blood Circulation  Blocked Arteries  Cardiovascular disease
1      Yes         Yes                     No                Yes
2      Yes         No                      Yes               Yes
3      Yes         No                      Yes               Yes
4      No          Yes                     Yes               No
5      No          Yes                     No                No
6      Yes         No                      Yes               Yes
7      No          No                      Yes               No
8      No          Yes                     No                No

Figure 7 Tree2 (Learner 2) built from data set sample 2: the root splits on Chest Pain (No → class No), with further splits on Good Blood Circulation and Blocked Arteries under Chest Pain = Yes


Table 3 Simple cardiovascular disease data set 3

S. No  Chest Pain  Good Blood Circulation  Blocked Arteries  Cardiovascular disease
1      Yes         No                      Yes               Yes
2      No          Yes                     No                No
3      No          Yes                     No                No
4      Yes         No                      Yes               Yes
5      Yes         No                      Yes               Yes
6      Yes         Yes                     Yes               Yes
7      No          Yes                     No                No
8      Yes         No                      Yes               Yes

Figure 8 Tree3 (Learner 3) built from data set sample 3: the root splits on Good Blood Circulation (No → class Yes); for Good Blood Circulation = Yes the tree splits on Blocked Arteries (No → class No) and then on Chest Pain (Yes → class Yes, No → class No)


Consider a tuple with the following conditions:

Chest Pain  Good Blood Circulation  Blocked Arteries  Cardiovascular disease
No          No                      Yes               ?

Tree1 and Tree3 classify this tuple into class Yes, and only Tree2 classifies it into class No.
According to majority voting, the given tuple is therefore classified into class Yes.
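
The vote can be sketched in Python; the three functions below are hand-coded readings of the trees in Figures 6–8 (slightly simplified, but consistent with the three data samples), not learned models:

from collections import Counter

def tree1(chest_pain, good_circulation, blocked_arteries):
    # Figure 6: root split on Blocked Arteries
    if blocked_arteries == "Yes":
        return "Yes"
    if chest_pain == "Yes":
        return "Yes" if good_circulation == "Yes" else "No"
    return "No"

def tree2(chest_pain, good_circulation, blocked_arteries):
    # Figure 7: root split on Chest Pain (sufficient for data set 2)
    return "Yes" if chest_pain == "Yes" else "No"

def tree3(chest_pain, good_circulation, blocked_arteries):
    # Figure 8: root split on Good Blood Circulation
    if good_circulation == "No":
        return "Yes"
    if blocked_arteries == "No":
        return "No"
    return "Yes" if chest_pain == "Yes" else "No"

votes = [t("No", "No", "Yes") for t in (tree1, tree2, tree3)]
print(votes)                                  # ['Yes', 'No', 'Yes']
print(Counter(votes).most_common(1)[0][0])    # majority vote -> 'Yes'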
