ML CLASS 6 Decision Tree Algorithm
The decision tree algorithm is a supervised learning method used for regression and classification, represented as a tree structure with nodes for features and leaves for outcomes. It includes types based on target variables, important terminologies, and measures like Gini index and entropy for evaluating splits. While decision trees are easy to understand and visualize, they can suffer from overfitting and instability, necessitating techniques like pruning and hyperparameter tuning to improve accuracy.
Decision Tree Algorithm
Session by Gayathri Prasad S
Overview
The decision tree algorithm is a supervised learning algorithm that can be used for solving both regression and classification problems. It uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. The tree starts with a root node and ends with decisions made at the leaves. Internal nodes represent features of the dataset, branches represent decision rules, and each leaf node represents an outcome.

Types of Decision Trees
The type of decision tree is based on the type of target variable. It can be of two types:
- Categorical Variable Decision Tree: a decision tree with a categorical target variable.
- Continuous Variable Decision Tree: a decision tree with a continuous target variable.

Important Terminologies Related to Decision Trees
- Root Node: represents the entire population or sample; from this node the population starts dividing based on various features.
- Splitting: the process of dividing a node into two or more sub-nodes.
- Decision Node: the nodes we get after splitting the root node.
- Leaf / Terminal Node: a node that does not split further.
- Pruning: removing sub-nodes of a decision node, i.e., cutting down some nodes to stop overfitting. It can be seen as the opposite of splitting.
- Branch / Sub-Tree: a subsection of the entire tree.
- Parent and Child Node: a node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.

Pictorial Representation
Decision trees follow a Sum of Products (SOP) representation. For a class, every branch from the root of the tree to a leaf node of that class is a product of attribute values, and the different branches ending in that class form a sum.

Attribute Selection
The primary challenge in implementing a decision tree is identifying which attribute to consider at the root node and at each level. Handling this is known as attribute selection, and different Attribute Selection Measures (ASM) exist to identify the attribute to use at each level.

A tree is composed of nodes, and those nodes are chosen by looking for the optimum split of the features. Different criteria exist for this purpose. In the Python implementation of the scikit-learn library, this is controlled by the 'criterion' parameter, the function used to measure the quality of a split; it allows users to choose between 'gini' and 'entropy' (a minimal usage sketch follows below).

Gini
- Pure: in a selected sample of the dataset, all data belongs to the same class.
- Impure: the data is a mixture of different classes.
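As a rough illustration of the 'criterion' parameter described above, the sketch below fits a scikit-learn DecisionTreeClassifier with each of the two impurity measures. The Iris dataset, the train/test split and the random_state values are assumptions chosen purely for illustration, not part of the original slides.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative data: the Iris dataset is an assumption; any labelled dataset would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for criterion in ("gini", "entropy"):
    # 'criterion' selects the impurity measure used to score candidate splits.
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42)
    clf.fit(X_train, y_train)
    print(criterion, accuracy_score(y_test, clf.predict(X_test)))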
Gini Index
The Gini index is a cost function used to evaluate splits in the dataset. A higher value of the Gini index implies higher inequality and higher heterogeneity. The Gini impurity is calculated using the following formula:

    Gini = 1 − Σ pj²

where pj is the probability of class j in the node.
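To make the formula concrete, here is a small sketch that computes the Gini impurity of a node directly from its class labels; the function name and the example label lists are made up for illustration.

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum_j p_j^2, where p_j is the proportion of class j in the node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 two-class node has the maximum value of 0.5.
print(gini_impurity([1, 1, 1, 1]))   # 0.0
print(gini_impurity([0, 0, 1, 1]))   # 0.5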
Entropy
Entropy is a measure of the randomness in the information being processed: the higher the entropy, the harder it is to draw any conclusions from that information. It is calculated as

    Entropy = − Σ pj log2(pj)

with pj again the probability of class j. Constructing a decision tree is all about finding the attribute that returns the highest information gain and the smallest entropy:

    Information Gain = Entropy(before) − Σ Entropy(after)

where "before" is the dataset before the split and each "after" is a subset obtained after the split.

Gini vs Entropy
The Gini index and the entropy have two main differences:
- For a two-class problem, the Gini index takes values in the interval [0, 0.5], whereas entropy takes values in the interval [0, 1].
- Computationally, entropy is more complex since it makes use of logarithms; consequently, the calculation of the Gini index is faster.

Decision Tree Algorithms
The choice of algorithm is also based on the type of target variable:
- ID3 (Iterative Dichotomiser 3)
- C4.5 (successor of ID3)
- CART (Classification And Regression Tree)
- CHAID (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
- MARS (Multivariate Adaptive Regression Splines)

CART
Compared with the other algorithms, CART (Classification and Regression Trees) supports numerical target variables (regression) and constructs binary trees using the feature and threshold that yield the largest information gain at each node. scikit-learn uses an optimised version of the CART algorithm. CART uses the Gini index as the default method to create split points. When the algorithm performs a split, the main goal is to decrease impurity as much as possible: the more the impurity decreases, the more informative power that split gains.

Overfitting
The splitting process continues until the stopping criteria are reached, resulting in fully grown trees. A fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. Ways to reduce overfitting:
- Hyperparameter tuning
- Pruning decision trees
- Random Forest

In pruning, you trim off branches of the tree, i.e., remove decision nodes starting from the leaf nodes, such that the overall accuracy is not disturbed.

Hyperparameters
- min_impurity_split
- max_depth
- min_samples_leaf
- max_leaf_nodes
- max_features
The hyperparameters need to be adjusted carefully in order to obtain a robust decision tree with high out-of-sample accuracy. We do not have to use all of them; depending on the task and the dataset, a couple of them can be enough. A minimal tuning and pruning sketch is given after the disadvantages below.

Advantages of the Decision Tree
- It is simple to understand, as it follows the same process a human follows when making a decision in real life.
- It can be very useful for solving decision-related problems.
- Trees can be visualised.
- It helps to think about all the possible outcomes for a problem.
- It is resistant to outliers, and it requires less data cleaning than other algorithms.

Disadvantages of the Decision Tree
- Decision-tree learners can create over-complex trees that do not generalise the data well, i.e., they can overfit.
- With more class labels, the computational complexity of the decision tree may increase.
- Decision trees can be unstable, because small variations in the data might result in a completely different tree being generated.
- Decision trees may be biased if some classes dominate; it is therefore recommended to balance the dataset prior to fitting.
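As a closing sketch tying together the hyperparameter tuning and pruning ideas above, the example below limits tree growth with max_depth and min_samples_leaf and adds scikit-learn's cost-complexity pruning via ccp_alpha. The breast-cancer dataset, the grid values and the use of GridSearchCV are assumptions for illustration only.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning caps tree growth; post-pruning is done via cost-complexity pruning (ccp_alpha).
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.001, 0.01],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))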
Datasets
https://drive.google.com/file/d/15pc24lVzokKXhPvjqjvgmMNqSc611EoL/view?usp=sharing
https://drive.google.com/file/d/1ailAwduVTt08yG12MYIzq86-Etz4N9kM/view?usp=sharing
https://drive.google.com/file/d/1CV5T2pp3V90eJwURoklFr_Xkg8UqyMDv/view?usp=sharing

Thank You