
Decision Trees from Scratch

A Decision Tree is a tree-structured classifier that uses a set of rules to predict a target variable. The tree consists of nodes where decisions are made based on feature values, leading to predictions at the leaf nodes.
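
In the from-scratch implementation below, such a tree is stored as nested dictionaries: each internal node is a dict keyed by a feature name, each branch is a feature value, and each leaf is a class label. A purely illustrative example (not necessarily the tree learned from the data further down):

{'Outlook': {'Overcast': 'Yes',
             'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
             'Rainy': {'Temperature': {'Mild': 'Yes', 'Cool': 'No'}}}}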

Process:
- Splitting the data on feature values
- Selecting the best split using an impurity measure (Gini index or Entropy); a Gini sketch follows this list
- Stopping at a pure node or at the maximum depth
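
The implementation further down uses Entropy and Information Gain; as a minimal sketch of the Gini criterion mentioned above (the helper name calculate_gini is illustrative and not part of that implementation):

def calculate_gini(data):
    """
    Calculate the Gini impurity of a list of target labels.
    G = 1 - sum(p_i^2) over the classes present in data.
    """
    total = len(data)
    if total == 0:
        return 0
    counts = {}
    for label in data:
        counts[label] = counts.get(label, 0) + 1
    return 1 - sum((count / total) ** 2 for count in counts.values())

# For the Class A/B example below: calculate_gini(['A'] * 30 + ['B'] * 20) ≈ 0.48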

Implementation:
1. Implement a basic Decision Tree Classifier from scratch.
2. Use Entropy (a measure of disorder) and Information Gain (the reduction in entropy after a split) to determine the best feature to split on.

Class A: 30
Class B: 20
Total: 50
P(A) = 30/50 = 0.6
P(B) = 20/50 = 0.4

Gini Index: G = 1 − Σᵢ pᵢ²  (sum over the n classes, where pᵢ is the proportion of class i)

Entropy: E = −Σᵢ pᵢ log₂(pᵢ)  (sum over the n classes)

Information Gain: IG = E_parent − Σᵢ (nᵢ / n) · Eᵢ  (sum over the c child subsets, where nᵢ is the size of child i, n the total number of samples, and Eᵢ the child's entropy)
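
Plugging in the Class A/B example above as a quick worked check:

G = 1 − (0.6² + 0.4²) = 1 − 0.52 = 0.48
E = −(0.6 · log₂ 0.6 + 0.4 · log₂ 0.4) ≈ 0.971 bits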

import math

# Helper functions
def calculate_entropy(data):
    """
    Calculate the entropy of a dataset.
    data: List of target labels
    """
    total = len(data)
    if total == 0:
        return 0
    counts = {}
    for label in data:
        counts[label] = counts.get(label, 0) + 1
    entropy = 0
    for count in counts.values():
        prob = count / total
        entropy -= prob * math.log2(prob)
    return entropy
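# Quick check (illustrative), matching the Class A/B example above:
# calculate_entropy(['A'] * 30 + ['B'] * 20) ≈ 0.971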

def split_data(dataset, feature_index):
    """
    Split the dataset based on a feature.
    dataset: List of lists where each inner list is a data point
    feature_index: Index of the feature to split on
    """
    splits = {}
    for row in dataset:
        key = row[feature_index]
        if key not in splits:
            splits[key] = []
        splits[key].append(row)
    return splits
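# Example (illustrative):
# split_data([['Sunny', 'No'], ['Rainy', 'Yes']], 0)
# -> {'Sunny': [['Sunny', 'No']], 'Rainy': [['Rainy', 'Yes']]}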

def calculate_information_gain(dataset, feature_index, target_index):
    """
    Calculate the Information Gain for splitting on a specific feature.
    dataset: List of lists where each inner list is a data point
    feature_index: Index of the feature to split on
    target_index: Index of the target variable
    """
    total_entropy = calculate_entropy([row[target_index] for row in dataset])
    splits = split_data(dataset, feature_index)
    total_samples = len(dataset)

    weighted_entropy = 0
    for subset in splits.values():
        prob = len(subset) / total_samples
        subset_entropy = calculate_entropy([row[target_index] for row in subset])
        weighted_entropy += prob * subset_entropy

    information_gain = total_entropy - weighted_entropy
    return information_gain

# Decision Tree Classifier
class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None
        self.features = None

    def fit(self, dataset, features, target_index):
        """
        Build the decision tree.
        dataset: List of lists (rows of data)
        features: List of feature names
        target_index: Index of the target variable
        """
        self.features = features
        # Track candidate features by column index so that recursive calls
        # keep the mapping between feature names and dataset columns intact.
        feature_indices = list(range(len(features)))
        self.tree = self._build_tree(dataset, feature_indices, target_index, depth=0)

    def _build_tree(self, dataset, feature_indices, target_index, depth):
        # Check stopping criteria
        target_values = [row[target_index] for row in dataset]
        if len(set(target_values)) == 1:  # Pure node
            return target_values[0]
        if not feature_indices or (self.max_depth is not None and depth >= self.max_depth):
            # No features left or max depth reached: return the majority class
            return max(set(target_values), key=target_values.count)

        # Find the best feature to split on
        best_feature_index = -1
        best_gain = -float('inf')
        for i in feature_indices:
            gain = calculate_information_gain(dataset, i, target_index)
            if gain > best_gain:
                best_gain = gain
                best_feature_index = i

        if best_gain == 0:  # No further splits improve purity
            return max(set(target_values), key=target_values.count)

        # Split dataset on the best feature
        best_feature = self.features[best_feature_index]
        splits = split_data(dataset, best_feature_index)
        subtree = {}
        remaining_indices = [i for i in feature_indices if i != best_feature_index]

        for value, subset in splits.items():
            subtree[value] = self._build_tree(subset, remaining_indices, target_index, depth + 1)

        return {best_feature: subtree}

    def predict(self, row):
        """
        Predict the class label for a single data point.
        row: List of feature values, in the same order as the feature names passed to fit
        """
        node = self.tree
        while isinstance(node, dict):
            feature = list(node.keys())[0]
            value = row[self.features.index(feature)]  # map feature name back to its column
            node = node[feature].get(value, None)
            if node is None:  # feature value never seen during training
                return None
        return node

# Example Usage
dataset = [
    ['Sunny', 'Hot', 'High', 'No'],
    ['Sunny', 'Hot', 'High', 'No'],
    ['Overcast', 'Hot', 'High', 'Yes'],
    ['Rainy', 'Mild', 'High', 'Yes'],
    ['Rainy', 'Cool', 'Normal', 'Yes'],
    ['Rainy', 'Cool', 'Normal', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Yes'],
    ['Sunny', 'Mild', 'High', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Yes'],
    ['Rainy', 'Mild', 'Normal', 'Yes']
]

features = ['Outlook', 'Temperature', 'Humidity']
target_index = 3

tree = DecisionTree(max_depth=3)
tree.fit(dataset, features, target_index)
print(tree.tree)
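
As a quick illustration (not part of the original example), the trained tree can also classify a new data point; the row must list the feature values in the same order as ['Outlook', 'Temperature', 'Humidity']:

print(tree.predict(['Sunny', 'Cool', 'High']))  # prints the predicted label, or None if this branch was never seen in training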
