Classification and Regression Tree
CART is a type of decision tree algorithm used for predictive modeling in statistics and machine learning.
It splits data into subsets based on input features, forming a tree structure. CART is widely used in two
types of problems:
1. Classification Trees:
Goal: Predict the class label for a given set of input features.
At each split, the algorithm tries to maximize the purity of the resulting subsets (using measures
like Gini Index or Entropy).
Example:
o Predicting whether a customer will buy a product based on their age and income.
2. Regression Trees:
Used when the target variable is continuous (e.g., height, weight, price).
At each split, the algorithm minimizes the variance (or other error measures like Mean Squared
Error) within the resulting subsets.
Example:
o Predicting the price of a house based on its size and location.
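The two tree types above can be sketched with scikit-learn (assuming it is installed); the customer data below is made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical training data: each row is [age, income].
X = [[25, 30_000], [40, 80_000], [35, 60_000], [22, 20_000]]
y_buy = [0, 1, 1, 0]                    # classification target: bought? (1 = yes)
y_price = [150.0, 420.0, 310.0, 120.0]  # regression target: e.g. amount spent

clf = DecisionTreeClassifier().fit(X, y_buy)    # splits to maximize purity (Gini)
reg = DecisionTreeRegressor().fit(X, y_price)   # splits to minimize squared error

print(clf.predict([[30, 70_000]]))   # predicted class label
print(reg.predict([[30, 70_000]]))   # predicted continuous value
```

With no depth limit and distinct inputs, both trees fit this tiny training set exactly, which also hints at the overfitting issue discussed later.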
How CART Works:
1. Root Node: The tree starts with the entire training dataset at a single node.
2. Splitting Criteria: At each node, the data are split on the feature and threshold that best increase
purity (Gini Index or Entropy) for classification, or that most reduce variance or Mean Squared
Error for regression.
3. Recursive Splitting: Splitting repeats on each subset until a stopping condition is met (e.g.,
maximum depth or minimum samples per node).
4. Leaf Nodes: End nodes where predictions are made (a single class for classification or an average
value for regression).
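As a sketch of the splitting criteria above, Gini impurity and variance can be computed directly (plain Python; the helper names are illustrative):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def variance(values):
    """Mean squared deviation from the node mean (used for regression splits)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

print(gini(["A"] * 7 + ["B"] * 3))   # ~0.42: impure node
print(gini(["A"] * 10))              # 0.0: pure node
print(variance([10.0, 12.0, 14.0]))  # ~2.67
```

At each node, CART evaluates candidate splits and keeps the one whose child nodes have the lowest weighted impurity (or variance).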
Key Characteristics:
Feature Selection: Automatically selects the most relevant features for splitting.
Advantages:
Simple to understand, interpret, and visualize.
Handles both numerical and categorical data.
Requires little data preparation (no feature scaling needed).
Disadvantages:
Can be unstable (small changes in data may lead to a completely different tree).
Requires pruning or ensemble methods (like Random Forest) for better generalization.
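One way to rein in that overfitting is pre-pruning, sketched below with scikit-learn's `max_depth` parameter on synthetic data (cost-complexity pruning via `ccp_alpha` is another option):

```python
import random
from sklearn.tree import DecisionTreeRegressor

random.seed(0)
X = [[i / 10] for i in range(100)]
y = [row[0] + random.gauss(0, 0.5) for row in X]   # noisy linear target

full = DecisionTreeRegressor().fit(X, y)               # grows until leaves are pure
pruned = DecisionTreeRegressor(max_depth=3).fit(X, y)  # pre-pruned: at most 8 leaves

print(full.get_depth(), pruned.get_depth())  # the full tree is much deeper
```

The shallow tree gives up training accuracy but averages over more points per leaf, which usually generalizes better on noisy data.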
Key Differences:
Classification trees predict a categorical target, split using Gini Index or Entropy, and output the
majority class at each leaf.
Regression trees predict a continuous target, split by minimizing variance or Mean Squared Error,
and output the mean target value at each leaf.
The leaf node value in a decision tree is the final prediction or output made at the end of a branch. It
represents the result of the splitting process for a particular path through the tree.
The leaf node value depends on the type of decision tree being used:
1. Classification Trees:
The leaf node value represents the class label or probability distribution of classes for the data
points in that node.
o Majority Class Rule: The most frequent class in the leaf node is assigned as the
prediction for that path.
o Class Probability: Optionally, the proportions of classes in the leaf can also be used as
probabilities.
Example: a leaf node contains 70% Class A and 30% Class B.
o Majority Class Rule prediction: Class A
o Class probabilities: P(A) = 0.7, P(B) = 0.3
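The majority-class rule and class probabilities for a single leaf can be sketched in plain Python:

```python
from collections import Counter

leaf_labels = ["A"] * 7 + ["B"] * 3   # training labels that landed in this leaf

counts = Counter(leaf_labels)
prediction = counts.most_common(1)[0][0]                         # majority class
probabilities = {c: n / len(leaf_labels) for c, n in counts.items()}

print(prediction)      # A
print(probabilities)   # {'A': 0.7, 'B': 0.3}
```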
2. Regression Trees:
The leaf node value represents the numerical prediction for the data points in that node.
o It is typically the mean (average) of the target variable values of all data points in the
leaf.
Example:
o If a leaf node contains the target values 10, 12, and 14, the leaf value is their mean: 12.
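A regression leaf's value is just the mean of the targets it contains (illustrative numbers):

```python
leaf_targets = [10.0, 12.0, 14.0]   # target values of the data points in this leaf
leaf_value = sum(leaf_targets) / len(leaf_targets)
print(leaf_value)   # 12.0
```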
It is the final output of the tree for any given input data point.
For a given set of features, a path through the tree leads to a specific leaf node, and the value of
that leaf determines the tree's prediction.
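This path-to-leaf idea can be sketched as a tiny hand-built tree of nested dicts (the features and thresholds below are hypothetical):

```python
# Hypothetical two-level tree: split on income, then on age.
tree = {
    "feature": "income", "threshold": 45_000,
    "left":  {"leaf": "no buy"},                   # income <= 45,000
    "right": {"feature": "age", "threshold": 30,   # income > 45,000
              "left":  {"leaf": "buy"},
              "right": {"leaf": "buy"}},
}

def predict(node, sample):
    """Follow splits until a leaf is reached; its value is the prediction."""
    while "leaf" not in node:
        key = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[key]
    return node["leaf"]

print(predict(tree, {"income": 30_000, "age": 25}))   # no buy
print(predict(tree, {"income": 80_000, "age": 40}))   # buy
```

A regression tree works identically, except each `"leaf"` would hold a number (the mean target value) instead of a class label.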