Random Forest: The Algorithm in A Nutshell
Random Forest
● Random forest is a supervised machine learning algorithm used for both regression and classification
● This notebook demonstrates how to implement the classification algorithm from scratch
● Random forest is built on top of weak learners - decision trees
○ An analogy of many trees forming a forest
○ The term "random" indicates that each decision tree is built with a random subset of data
● The random forest algorithm is based on the bagging method - a combination of learning models with the aim of increasing accuracy (the majority-voting idea is sketched right after this list)
● In a nutshell:
○ If you understand how a single decision tree works, you'll understand random forest
○ The math for the entire forest is identical to that for a single tree, so we don't have to go over it again
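To make the bagging idea concrete, here is a minimal sketch of the majority-voting step, assuming three hypothetical trees have already classified the same instance:

from collections import Counter

# Hypothetical class predictions for one instance from three trees
tree_predictions = [0, 1, 1]

# Majority voting - the most common prediction wins
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)  # 1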
Implementation
In [63]:
class Node:
    '''
    Helper class which implements a single tree node.
    '''
    def __init__(self, feature=None, threshold=None, data_left=None, data_right=None, gain=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.data_left = data_left
        self.data_right = data_right
        self.gain = gain
        self.value = value
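By convention, value is set only for leaf nodes, while internal nodes carry the split information instead. A quick illustration, with made-up numbers:

# Internal (decision) node: splits on feature 2 at a threshold of 4.75
decision = Node(feature=2, threshold=4.75, gain=0.38)

# Leaf node: stores only the predicted class
leaf = Node(value=1)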
The DecisionTree class has the following methods:
● The constructor holds values for min_samples_split and max_depth. These are hyperparameters. The first one specifies the minimum number of samples required to split a node, and the second one specifies the maximum depth of a tree. Both are used in the recursive functions as exit conditions
● The _entropy(s) function calculates the impurity of an input vector s (the formulas are shown after this list)
● The _information_gain(parent, left_child, right_child) function calculates the information gain value of a split between a parent and two children
● The _best_split(X, y) function calculates the best splitting parameters for input features X and a target variable y
○ It does so by iterating over every column in X and every threshold value in every column to find the optimal split using information gain
● The _build(X, y, depth) function recursively builds a decision tree until the stopping criteria are met (the hyperparameters in the constructor)
● The fit(X, y) function calls the _build() function and stores the built tree as the root attribute
● The _predict(x) function traverses the tree to classify a single instance
● The predict(X) function applies the _predict() function to every instance in matrix X
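For reference, _entropy() and _information_gain() implement the standard formulas:

E(s) = -Σᵢ pᵢ · log₂(pᵢ), where pᵢ is the share of samples in s that belong to class i

IG = E(parent) - (n_left / n) · E(left) - (n_right / n) · E(right), where n_left, n_right and n are the sizes of the left child, the right child and the parent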
In [62]:
import numpy as np

class DecisionTree:
    '''
    Class which implements a decision tree classifier algorithm.
    '''
    def __init__(self, min_samples_split=2, max_depth=5):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.root = None

    @staticmethod
    def _entropy(s):
        '''
        Helper function, calculates entropy from an array of integer values.
        :param s: list
        :return: float, entropy value
        '''
        # Convert to integers to avoid runtime errors
        counts = np.bincount(np.array(s, dtype=np.int64))
        # Probabilities of each class label
        percentages = counts / len(s)
        # Calculate entropy
        entropy = 0
        for pct in percentages:
            if pct > 0:
                entropy += pct * np.log2(pct)
        return -entropy

    def _predict(self, x, tree):
        '''
        Helper recursive function, used to predict a single instance (tree traversal).
        :param x: single observation
        :param tree: built tree
        :return: predicted class value
        '''
        # Leaf node - return the stored class value
        if tree.value is not None:
            return tree.value
        feature_value = x[tree.feature]
        # Go to the left
        if feature_value <= tree.threshold:
            return self._predict(x=x, tree=tree.data_left)
        # Go to the right
        if feature_value > tree.threshold:
            return self._predict(x=x, tree=tree.data_right)
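The cell above preserves only _entropy() and the traversal logic of _predict(); the remaining methods described in the list earlier are missing from this extract. Below is a minimal sketch of how they could look, consistent with those descriptions - treat it as one possible reconstruction, not the notebook's original code. The methods belong inside the DecisionTree class (numpy is imported above):

from collections import Counter

def _information_gain(self, parent, left_child, right_child):
    '''
    Helper function, calculates the information gain of a split.
    '''
    num_left = len(left_child) / len(parent)
    num_right = len(right_child) / len(parent)
    # Parent entropy minus the weighted entropy of both children
    return self._entropy(parent) - (
        num_left * self._entropy(left_child) + num_right * self._entropy(right_child)
    )

def _best_split(self, X, y):
    '''
    Helper function, finds the best split over every feature and threshold.
    '''
    best_split = {}
    best_info_gain = -1
    n_rows, n_cols = X.shape
    # Glue features and target together so rows stay aligned after splitting
    df = np.concatenate((X, y.reshape(-1, 1)), axis=1)
    for f_idx in range(n_cols):
        # Every unique feature value is a candidate threshold
        for threshold in np.unique(X[:, f_idx]):
            df_left = df[df[:, f_idx] <= threshold]
            df_right = df[df[:, f_idx] > threshold]
            if len(df_left) > 0 and len(df_right) > 0:
                gain = self._information_gain(df[:, -1], df_left[:, -1], df_right[:, -1])
                if gain > best_info_gain:
                    best_split = {
                        'feature_index': f_idx,
                        'threshold': threshold,
                        'df_left': df_left,
                        'df_right': df_right,
                        'gain': gain,
                    }
                    best_info_gain = gain
    return best_split

def _build(self, X, y, depth=0):
    '''
    Helper recursive function, builds the tree until the stopping criteria are met.
    '''
    n_rows, _ = X.shape
    if n_rows >= self.min_samples_split and depth <= self.max_depth:
        best = self._best_split(X, y)
        # Split further only if it reduces impurity
        if best.get('gain', 0) > 0:
            left = self._build(best['df_left'][:, :-1], best['df_left'][:, -1], depth + 1)
            right = self._build(best['df_right'][:, :-1], best['df_right'][:, -1], depth + 1)
            return Node(
                feature=best['feature_index'],
                threshold=best['threshold'],
                data_left=left,
                data_right=right,
                gain=best['gain'],
            )
    # Leaf node - the most common class in y
    return Node(value=Counter(y).most_common(1)[0][0])

def fit(self, X, y):
    '''
    Trains the decision tree and stores it as the root attribute.
    '''
    self.root = self._build(X, y)

def predict(self, X):
    '''
    Classifies every instance in matrix X.
    '''
    return [self._predict(x, self.root) for x in X]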
The RandomForest class is built on top of a single decision tree and has the following methods:
● The __init__() method holds hyperparameter values for the number of trees in the forest, the minimum samples per split and the maximum depth. It also holds the individually trained decision trees once the model is trained
● The _sample(X, y) method applies bootstrap sampling to the input features and target
● The fit(X, y) method trains the Random Forest classifier
● The predict(X) method makes predictions with the individual decision trees and then applies majority voting for the final prediction
In [48]:
class RandomForest:
    '''
    A class that implements the Random Forest algorithm from scratch.
    '''
    def __init__(self, num_trees=25, min_samples_split=2, max_depth=5):
        self.num_trees = num_trees
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        # Will store individually trained decision trees
        self.decision_trees = []

    @staticmethod
    def _sample(X, y):
        '''
        Helper function used for bootstrap sampling.
        :param X: np.array, features
        :param y: np.array, target
        :return: tuple, bootstrap sample of X and y
        '''
        n_rows = X.shape[0]
        # Sample row indices with replacement
        samples = np.random.choice(a=n_rows, size=n_rows, replace=True)
        return X[samples], y[samples]
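The fit() and predict() methods of RandomForest are also missing from this extract. A minimal sketch consistent with the description above (train one tree per bootstrap sample, then majority-vote), again belonging inside the class:

from collections import Counter

def fit(self, X, y):
    '''
    Trains the forest - one decision tree per bootstrap sample.
    '''
    # Reset the forest if fit() is called more than once
    self.decision_trees = []
    for _ in range(self.num_trees):
        clf = DecisionTree(
            min_samples_split=self.min_samples_split,
            max_depth=self.max_depth,
        )
        # Obtain a bootstrap sample and fit a single tree on it
        _X, _y = self._sample(X, y)
        clf.fit(_X, _y)
        self.decision_trees.append(clf)

def predict(self, X):
    '''
    Predicts with every tree, then applies majority voting per instance.
    '''
    # Shape: (num_trees, num_instances) -> (num_instances, num_trees)
    tree_preds = np.swapaxes([tree.predict(X) for tree in self.decision_trees], 0, 1)
    # Majority vote for each instance
    return [Counter(row).most_common(1)[0][0] for row in tree_preds]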
With both classes implemented, let's test the model on the Iris dataset (the exact train/test split parameters are assumed here):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris['data']
y = iris['target']

# Assumed split parameters - the original split cell is not shown
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [55]:
model = RandomForest()
model.fit(X_train, y_train)
preds = model.predict(X_test)
In [57]:
np.array(preds, dtype=np.int64)
In [58]:
y_test
accuracy_score(y_test, preds)
Finally, let's compare the from-scratch model with scikit-learn's implementation:

from sklearn.ensemble import RandomForestClassifier

sk_model = RandomForestClassifier()
sk_model.fit(X_train, y_train)
sk_preds = sk_model.predict(X_test)
In [61]:
accuracy_score(y_test, sk_preds)