
5. Decision Trees

● Decision trees are a non-parametric model used for both regression and
classification tasks
● This notebook demonstrates how to use decision trees for classification
● Decision trees are constructed from only two elements - nodes and branches
● Decision trees are based on recursion
○ A function that calls itself until some exit condition is met
○ The algorithm is built by recursively evaluating different features
and using the best split feature and criteria at each node
○ You'll learn how the best split feature and criteria are calculated in
the math section
● There are multiple types of nodes a decision tree can have, described below:

● Root node - the node at the top of the tree; it contains the feature that best splits the data
(the single feature that alone classifies the target variable most accurately)
● Decision nodes - nodes where features are evaluated; they have branches both entering and
leaving them
● Leaf nodes - the final nodes, at which predictions are made

How to determine the root node

● Check how every input feature classifies the target variable independently
● If no single feature classifies the target 100% correctly, the resulting splits are considered impure
● The Entropy metric can be used to calculate impurity
○ Formula discussed later
○ For a binary target, values range from 0 (best) to 1 (worst)
● The variable with the lowest entropy (impurity) is used as the root node (a small comparison
example follows this list)
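
● A minimal sketch of that comparison, with made-up feature values and labels; the entropy
helper here is a standalone version of the one implemented later in this notebook:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array([0, 0, 0, 1, 1, 1])           # target variable (made up)
feature_a = np.array([0, 0, 0, 1, 1, 1])   # splits the target perfectly
feature_b = np.array([0, 1, 0, 1, 0, 1])   # splits the target poorly

for name, feature in [('feature_a', feature_a), ('feature_b', feature_b)]:
    # Weighted entropy of the groups produced by splitting on this feature
    weighted = sum(
        (feature == v).mean() * entropy(y[feature == v])
        for v in np.unique(feature)
    )
    print(f'{name}: weighted entropy after split = {weighted:.3f}')

# feature_a produces pure groups (weighted entropy 0), so it would be picked as the root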

Training process
● Determine the root node (discussed earlier)
● Calculate the information gain for a single split
○ Formula discussed later
○ The higher the gain, the better the split
● Do a greedy search (a condensed sketch follows this list)
○ Go over all input features and their unique values (thresholds)
○ Calculate the information gain for every feature/threshold combination
○ Save the best split feature and best split threshold for every node
○ Build the tree recursively
○ Some stopping criterion should be applied when doing so
■ Think of it as the exit condition of a recursive function
■ This could be the maximum depth, the minimum number of samples at a node...
○ If at a leaf node, return the prediction (the most common value)
■ You'll know you're at a leaf node if a stopping criterion has been met or if the split is pure
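
● A condensed, self-contained sketch of that greedy recursive loop, using plain dicts for nodes
and a single made-up feature column; the full class-based implementation appears later in this
notebook:

import numpy as np
from collections import Counter

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def build(x, y, depth=0, max_depth=3, min_samples=2):
    # Exit conditions: pure node, too few samples, or maximum depth reached
    if len(set(y)) == 1 or len(y) < min_samples or depth >= max_depth:
        return {'leaf': Counter(y).most_common(1)[0][0]}

    # Greedy search: try every unique value as a threshold, keep the best gain
    best = None
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = entropy(y) - (len(left) / len(y) * entropy(left)
                             + len(right) / len(y) * entropy(right))
        if best is None or gain > best['gain']:
            best = {'gain': gain, 'threshold': threshold}

    if best is None or best['gain'] == 0:
        return {'leaf': Counter(y).most_common(1)[0][0]}

    # Recurse into the two subsets created by the best threshold
    mask = x <= best['threshold']
    return {
        'threshold': best['threshold'],
        'left': build(x[mask], y[mask], depth + 1, max_depth, min_samples),
        'right': build(x[~mask], y[~mask], depth + 1, max_depth, min_samples),
    }

x = np.array([1.0, 1.5, 2.0, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(build(x, y))  # a nested dict with a single split at threshold 2.0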

Prediction process

● Recursively traverse the tree


● At each node, check the direction of traversal (left or right) based on the input data and the
node's split threshold (a short traversal sketch follows this list)
● When the leaf node is reached, the most common value is returned
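
● A minimal traversal sketch over a small hand-built tree (dict-based and made up here for
illustration; the class-based version appears in the Implementation section):

tree = {'threshold': 2.0, 'left': {'leaf': 0}, 'right': {'leaf': 1}}

def predict_one(node, value):
    # Leaf node: return the stored (most common) value
    if 'leaf' in node:
        return node['leaf']
    # Otherwise follow the branch chosen by the threshold comparison
    branch = node['left'] if value <= node['threshold'] else node['right']
    return predict_one(branch, value)

print([predict_one(tree, v) for v in [1.2, 3.7]])   # [0, 1]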

Math behind

● Essentially, you only need to implement two formulas


○ Entropy
○ Information gain
● Entropy
○ Measures the impurity of the labels at a node
○ Calculated at the node level
○ For a binary target, ranges between 0 (pure) and 1 (maximally impure)
○ The formula itself is written out after the Python example below
● Example in Python:
In [1]:
import numpy as np
from collections import Counter

In [2]:
def entropy(s):
    counts = np.bincount(s)
    percentages = counts / len(s)

    entropy = 0
    for pct in percentages:
        if pct > 0:
            entropy += pct * np.log2(pct)
    return -entropy

In [3]:
s = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(f'Entropy: {np.round(entropy(s), 5)}')

Entropy: 0.88129
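
● For reference, the formula the code above implements is the standard Shannon entropy over the
class proportions $p_c$ (with $0 \log_2 0$ treated as 0):

$E(s) = -\sum_{c} p_c \log_2 p_c$

● For the vector above that is $-(0.7 \log_2 0.7 + 0.3 \log_2 0.3) \approx 0.88129$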

● Information gain:
○ The parent node's entropy minus the weighted average entropy of the child nodes produced by
a split (the formula is written out after the Python example below)
○ The higher the information gain, the better the decision split is

● Example in Python:
In [4]:
def information_gain(parent, left_child, right_child):
    num_left = len(left_child) / len(parent)
    num_right = len(right_child) / len(parent)

    gain = entropy(parent) - (num_left * entropy(left_child) + num_right * entropy(right_child))
    return gain

In [5]:
parent = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
left_child = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
right_child = [0, 0, 0, 0, 1, 1, 1, 1]

print(f'Information gain: {np.round(information_gain(parent, left_child, right_child), 5)}')

Information gain: 0.18094
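
● For reference, the formula the code above implements is the parent's entropy minus the
size-weighted entropies of the children:

$IG = E(\text{parent}) - \left( \frac{n_{\text{left}}}{n_{\text{parent}}} E(\text{left}) + \frac{n_{\text{right}}}{n_{\text{parent}}} E(\text{right}) \right)$

● For the vectors above that is roughly $0.97095 - (0.6 \cdot 0.65002 + 0.4 \cdot 1.0) \approx 0.18094$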

Recursion Crash Course

● A lot of the decision tree implementation boils down to recursion


● Put simply, a recursive function is a function that calls itself
● Some "exit condition" is required if a function will call itself multiple times
○ It's common to write it at the top of the function
● Let's take a look at the simplest example - a recursive function that returns a
factorial of an integer
In [16]:
def factorial(x):
    # Exit condition
    if x == 1:
        return 1
    return x * factorial(x - 1)

print(f'Factorial of 5 is {factorial(5)}')

Factorial of 5 is 120
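
● For intuition, here is how that call unfolds before the results multiply back up:

# factorial(5)
# = 5 * factorial(4)
# = 5 * 4 * factorial(3)
# = 5 * 4 * 3 * factorial(2)
# = 5 * 4 * 3 * 2 * factorial(1)   <- exit condition reached
# = 5 * 4 * 3 * 2 * 1 = 120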

● Recursion is needed in decision tree classifiers to build additional nodes of a tree until some
exit condition is met

Implementation

● We'll need two classes


○ Node - implements a single node of a decision tree
○ DecisionTree - implements the algorithm
● The Node class is here to store the data about the feature, threshold, data going left
and right, information gain, and the leaf node value
○ All are initially set to None
○ The leaf node value is available only for leaf nodes
In [6]:
class Node:
    '''
    Helper class which implements a single tree node.
    '''
    def __init__(self, feature=None, threshold=None, data_left=None, data_right=None, gain=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.data_left = data_left
        self.data_right = data_right
        self.gain = gain
        self.value = value
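
● A quick illustrative instantiation of the class above; the feature index, threshold, gain, and
values below are arbitrary:

# Two leaf nodes: only the predicted value is set, everything else stays None
left_leaf = Node(value=0)
right_leaf = Node(value=1)

# A decision node: stores the split feature index, threshold, its two children,
# and the information gain of the split
decision_node = Node(feature=2, threshold=1.9, data_left=left_leaf,
                     data_right=right_leaf, gain=0.42)

print(decision_node.threshold, decision_node.data_left.value)   # 1.9 0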

● The DecisionTree class contains several methods
● The constructor holds values for min_samples_split and max_depth. These are
hyperparameters. The first one is used to specify a minimum number of samples
required to split a node, and the second one specifies a maximum depth of a tree.
Both are used in recursive functions as exit conditions
● The _entropy(s) function calculates the impurity of an input vector s
● The _information_gain(parent, left_child, right_child) function calculates the information gain
of a split between a parent and its two children
● The _best_split(X, y) function calculates the best splitting parameters for input
features X and a target variable y
○ It does so by iterating over every column in X and every threshold
value in every column to find the optimal split using information
gain
● The _build(X, y, depth) function recursively builds a decision tree until the stopping
criteria are met (the hyperparameters in the constructor)
● The fit(X, y) function calls the _build() function and stores the built tree to the
constructor
● The _predict(x) function traverses the tree to classify a single instance
● The predict(X) function applies the _predict() function to every instance in matrix X
In [7]:
class DecisionTree:
    '''
    Class which implements a decision tree classifier algorithm.
    '''
    def __init__(self, min_samples_split=2, max_depth=5):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.root = None

    @staticmethod
    def _entropy(s):
        '''
        Helper function, calculates entropy from an array of integer values.

        :param s: list
        :return: float, entropy value
        '''
        # Convert to integers to avoid runtime errors
        counts = np.bincount(np.array(s, dtype=np.int64))
        # Probabilities of each class label
        percentages = counts / len(s)

        # Calculate entropy
        entropy = 0
        for pct in percentages:
            if pct > 0:
                entropy += pct * np.log2(pct)
        return -entropy

    def _information_gain(self, parent, left_child, right_child):
        '''
        Helper function, calculates information gain from a parent and two child nodes.

        :param parent: list, the parent node
        :param left_child: list, left child of a parent
        :param right_child: list, right child of a parent
        :return: float, information gain
        '''
        num_left = len(left_child) / len(parent)
        num_right = len(right_child) / len(parent)

        # One-liner which implements the previously discussed formula
        return self._entropy(parent) - (num_left * self._entropy(left_child) +
                                        num_right * self._entropy(right_child))

    def _best_split(self, X, y):
        '''
        Helper function, calculates the best split for given features and target.

        :param X: np.array, features
        :param y: np.array or list, target
        :return: dict
        '''
        best_split = {}
        best_info_gain = -1
        n_rows, n_cols = X.shape

        # For every dataset feature
        for f_idx in range(n_cols):
            X_curr = X[:, f_idx]
            # For every unique value of that feature
            for threshold in np.unique(X_curr):
                # Construct a dataset and split it to the left and right parts
                # Left part includes records lower or equal to the threshold
                # Right part includes records higher than the threshold
                df = np.concatenate((X, y.reshape(1, -1).T), axis=1)
                df_left = np.array([row for row in df if row[f_idx] <= threshold])
                df_right = np.array([row for row in df if row[f_idx] > threshold])

                # Do the calculation only if there's data in both subsets
                if len(df_left) > 0 and len(df_right) > 0:
                    # Obtain the value of the target variable for subsets
                    y = df[:, -1]
                    y_left = df_left[:, -1]
                    y_right = df_right[:, -1]

                    # Calculate the information gain and save the split parameters
                    # if the current split is better than the previous best
                    gain = self._information_gain(y, y_left, y_right)
                    if gain > best_info_gain:
                        best_split = {
                            'feature_index': f_idx,
                            'threshold': threshold,
                            'df_left': df_left,
                            'df_right': df_right,
                            'gain': gain
                        }
                        best_info_gain = gain
        return best_split

    def _build(self, X, y, depth=0):
        '''
        Helper recursive function, used to build a decision tree from the input data.

        :param X: np.array, features
        :param y: np.array or list, target
        :param depth: current depth of a tree, used as a stopping criterion
        :return: Node
        '''
        n_rows, n_cols = X.shape

        # Check to see if a node should be a leaf node
        if n_rows >= self.min_samples_split and depth <= self.max_depth:
            # Get the best split
            best = self._best_split(X, y)
            # If a valid split was found and it isn't pure
            if best.get('gain', 0) > 0:
                # Build a tree on the left
                left = self._build(
                    X=best['df_left'][:, :-1],
                    y=best['df_left'][:, -1],
                    depth=depth + 1
                )
                # Build a tree on the right
                right = self._build(
                    X=best['df_right'][:, :-1],
                    y=best['df_right'][:, -1],
                    depth=depth + 1
                )
                return Node(
                    feature=best['feature_index'],
                    threshold=best['threshold'],
                    data_left=left,
                    data_right=right,
                    gain=best['gain']
                )
        # Leaf node - value is the most common target value
        return Node(
            value=Counter(y).most_common(1)[0][0]
        )

    def fit(self, X, y):
        '''
        Function used to train a decision tree classifier model.

        :param X: np.array, features
        :param y: np.array or list, target
        :return: None
        '''
        # Call a recursive function to build the tree
        self.root = self._build(X, y)

    def _predict(self, x, tree):
        '''
        Helper recursive function, used to predict a single instance (tree traversal).

        :param x: single observation
        :param tree: built tree
        :return: float, predicted class
        '''
        # Leaf node
        if tree.value is not None:
            return tree.value
        feature_value = x[tree.feature]

        # Go to the left
        if feature_value <= tree.threshold:
            return self._predict(x=x, tree=tree.data_left)

        # Go to the right
        if feature_value > tree.threshold:
            return self._predict(x=x, tree=tree.data_right)

    def predict(self, X):
        '''
        Function used to classify new instances.

        :param X: np.array, features
        :return: np.array, predicted classes
        '''
        # Call the _predict() function for every observation
        return [self._predict(x, self.root) for x in X]
Testing

● We'll use the Iris dataset from Scikit-Learn


In [8]:
from sklearn.datasets import load_iris

iris = load_iris()

X = iris['data']
y = iris['target']

● The code below applies a train/test split to the dataset:


In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

● You can now initialize and train the model, and make the predictions afterwards:
In [10]:
model = DecisionTree()
model.fit(X_train, y_train)
preds = model.predict(X_test)

In [11]:
np.array(preds, dtype=np.int64)

In [12]:
y_test

● As you can see, the arrays are identical


● Let's calculate the accuracy to confirm this:
In [13]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, preds)

● As expected, a perfect score was obtained on the test set


Comparison with Scikit-Learn

● We already know our model works well, but let's compare it to the DecisionTreeClassifier
from Scikit-Learn
In [14]:
from sklearn.tree import DecisionTreeClassifier

sk_model = DecisionTreeClassifier()
sk_model.fit(X_train, y_train)
sk_preds = sk_model.predict(X_test)
In [15]:
accuracy_score(y_test, sk_preds)

● Both perform the same, at least accuracy-wise
