
Unit-3

Decision trees
Representing concepts as Decision trees
Representing concepts as decision trees is a valuable approach in various fields, including
data analysis, machine learning, and artificial intelligence. Decision trees are graphical
representations that model decisions and their possible consequences, displaying choices as
branches and outcomes as leaves. They help in understanding complex relationships and
making predictions based on input features.

In this method, each internal node represents a "test" on an input feature, each branch
represents the outcome of the test, and each leaf node holds a class label or a probability
distribution. The tree is built in a top-down manner, iteratively splitting the data based on the
feature that provides the most significant information gain or reduction in impurity.

Decision trees can be particularly useful in the following ways:

1. Visualization: They provide a visual representation of the decision-making process, making it easier to understand and interpret the model.

2. Handling non-linear relationships: Decision trees can capture non-linear relationships between input features and the target variable, which may not be possible with linear models.

3. Handling both categorical and continuous data: Decision trees can handle both types of
data, allowing for a more versatile approach to problem-solving.

4. Feature selection: During the tree construction process, the algorithm naturally selects the
most important features, which can help in identifying the most relevant predictors in the
dataset.

5. Interpretability: Decision trees are relatively easy to interpret, allowing users to understand
how the model arrives at its predictions or classifications.

However, it's essential to be aware of potential limitations, such as overfitting, where the
model performs well on the training data but poorly on unseen data. To mitigate this,
techniques like pruning or ensemble methods built on decision trees (e.g., Random Forests or
Gradient Boosting Machines) can be employed.
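
As a minimal sketch of these ideas, assuming scikit-learn's `DecisionTreeClassifier` and the Iris dataset (both illustrative choices, not part of these notes):

```python
# A minimal sketch of fitting and inspecting a decision tree with scikit-learn.
# The Iris dataset and the max_depth value are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# Each internal node tests one feature; branches are the test outcomes; leaves hold class labels.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=data.feature_names))  # textual view of the learned tree
print("test accuracy:", tree.score(X_test, y_test))
```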
Recursive induction of Decision trees

Recursive induction is the process of building a decision tree by recursively splitting the data
based on the most informative feature at each step. This method is widely used in various
machine learning algorithms, such as the CART (Classification and Regression Trees)
algorithm, ID3 (Iterative Dichotomizer 3), C4.5, and their extensions like C5.0 and Random
Forests.

Here's a detailed explanation of the recursive induction process:

1. Data preparation: Begin with a dataset containing input features (X) and a target variable
(Y). The dataset is usually split into a training set and a validation or test set.

2. Root node creation: Choose the best feature (based on information gain, Gini impurity, or
other criteria) to split the data. This feature becomes the root node of the decision tree.

3. Splitting: Split the data based on the selected feature's possible values. For example, if the
feature is "temperature," the split might be into "cold" and "not cold" groups. Each resulting
subset of data forms a new node in the decision tree.

4. Recursion: Repeat steps 2 and 3 for each new node created in the previous step, until one
of the following conditions is met:

a. Purity: If a node's data is homogeneous (all samples belong to the same class), it becomes
a leaf node with the corresponding class label.

b. Stopping criteria: A predefined limit on the number of samples in a node, the depth of the
tree, or the minimum required impurity decrease may be set to keep the tree from growing too
large. If any of these conditions is met, the node becomes a leaf node.

c. No further improvement: If no feature can significantly improve the model's performance, the node becomes a leaf node.

5. Tree pruning: In some cases, it's beneficial to remove certain branches from the decision
tree that were overfitting the training data. This process, called pruning, can help improve the
model's performance on unseen data.
6. Output: The final decision tree consists of nodes and branches representing the decision-
making process. Each leaf node holds a class label or a probability distribution, which can be
used to make predictions or classifications.

Recursive induction of decision trees is an iterative process that continues until the stopping
criteria are met. It allows the model to capture complex relationships between input features
and the target variable while maintaining interpretability. However, it's crucial to carefully
choose the appropriate stopping criteria and feature selection methods to avoid overfitting
and ensure the model's generalization ability.
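
To make the recursion concrete, here is a simplified, ID3-style sketch for categorical features that follows the steps above. The function names, toy data, and stopping rules are illustrative assumptions rather than a production implementation.

```python
# A simplified ID3-style sketch of recursive tree induction for categorical features.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the data on `feature`."""
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, features, depth=0, max_depth=5):
    # Purity: all samples share one class -> leaf with that label (step 4a).
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping criteria: no features left or maximum depth reached -> majority-class leaf (step 4b).
    if not features or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]
    # Choose the most informative feature (step 2) and split on its values (step 3).
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        # Recurse on each subset (step 4).
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       remaining, depth + 1, max_depth)
    return tree

# Tiny, hypothetical "play outside?" dataset with categorical features.
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
labels = ["play", "stay", "play", "stay"]
print(build_tree(rows, labels, ["outlook", "windy"]))
# prints a nested dict, e.g. {'outlook': {'sunny': 'play', 'rain': 'stay'}}
```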
Searching for simple trees and Computational complexity
Simple trees, also known as shallow trees or small decision trees, are decision trees with
limited depth or a restricted number of nodes. They are used to balance the trade-off between
model complexity and generalization performance. By limiting the tree's depth or size, we
can reduce the risk of overfitting, which occurs when a model performs well on the training
data but poorly on unseen data.

Computational complexity refers to the time and space requirements needed to execute an
algorithm or solve a problem. In the context of decision trees, we are concerned with both the
time taken to build the tree and the memory required to store it.

1. Time complexity: The time complexity of building a decision tree depends on the number of
nodes and on the cost of choosing a split at each node, which requires evaluating every
candidate feature over the samples that reach that node. For a reasonably balanced tree with
efficient split finding, the overall build time is commonly on the order of O(m · n log n), where
n is the number of samples and m is the number of features; in the worst case (a highly
unbalanced tree that splits off one sample at a time), it can grow to roughly O(m · n^2).

2. Space complexity: The space required to store a decision tree is proportional to the number
of nodes. A tree trained on n samples can have at most n leaves, and therefore O(n) nodes in
total, so the space complexity is O(n). In practice, additional memory is needed to store
metadata such as feature indices, split thresholds, and per-node statistics.
Simple trees can help reduce the computational complexity of decision trees by limiting the
tree's size or depth. This can lead to faster training times and lower memory requirements.
However, it's essential to strike a balance between model complexity and performance, as
overly simplified trees may not capture the underlying relationships in the data effectively.

To find the optimal balance, techniques like cross-validation and tuning hyperparameters
(such as maximum tree depth) can be employed. These methods help ensure that the decision
tree model is both computationally efficient and capable of capturing the essential patterns in
the data.
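
A minimal sketch of this tuning process, assuming scikit-learn's `GridSearchCV` and an illustrative dataset and depth grid:

```python
# A minimal sketch of tuning tree depth with cross-validation (scikit-learn).
# The dataset and the candidate depth values are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Shallower trees are cheaper to build and store and less prone to overfitting;
# cross-validation picks the depth that balances simplicity and accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 6, 8, None]},
    cv=5,
)
search.fit(X, y)
print("best depth:", search.best_params_, "cv accuracy:", round(search.best_score_, 3))
```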
Overfitting, noisy data and pruning
Overfitting, noisy data, and pruning are crucial concepts in machine learning, particularly
when working with decision trees. Let's explore each of these topics in detail.

1. Overfitting: Overfitting occurs when a machine learning model learns the training data too
well, leading to poor performance on unseen or new data. In the context of decision trees,
overfitting can happen when the tree becomes too complex, with many nodes and deep
branches. This complexity allows the model to fit the training data accurately but fails to
generalize well to unseen data, as the model has learned the noise or random fluctuations in
the training data.

2. Noisy data: Noisy data refers to data that contains errors, outliers, or random fluctuations
that do not represent the underlying pattern or relationship between variables. Noisy data can
make it challenging for any machine learning model, including decision trees, to learn the
true pattern and generalize well. In such cases, more complex models may be more prone to
overfitting, as they tend to fit the noise in the data rather than the actual pattern.

3. Pruning: Pruning is a technique used to remove or simplify parts of a decision tree model
to prevent overfitting and improve the model's generalization performance. The main idea
behind pruning is to remove branches or nodes that contribute little to the model's accuracy
but increase its complexity.
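
As a quick illustration of the overfitting described in point 1 above, the following sketch compares an unconstrained decision tree with a depth-limited one; the dataset and depth values are illustrative assumptions, not part of these notes.

```python
# Overfitting illustration: an unconstrained tree fits the training data almost perfectly,
# while its test accuracy reveals the gap between training and generalization performance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [None, 3]:  # None = grow the tree fully; 3 = depth-limited (pre-pruned) tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")
```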

Removing noisy data is a crucial step in machine learning, as it can significantly improve the
performance and accuracy of your models. Here are some common types of noisy data and
techniques to remove them:

Types of noisy data:


Outliers: Data points that are significantly different from the rest of the data.
Noise: Random fluctuations or errors in the data.
Missing values: Data points that are missing or incomplete.
Anomalies: Data points that do not follow the expected pattern or behavior.
Duplicates: Duplicate data points that are identical or very similar.

Techniques to remove noisy data:

Data cleaning: Manually inspecting and correcting errors in the data.
Data preprocessing: Applying algorithms to transform and normalize the data.
Statistical filtering: Using statistical methods to identify and remove outliers.
Machine learning algorithms: Using algorithms like clustering, decision trees, or neural
networks to identify and remove noisy data.
Data visualization: Visualizing the data to identify patterns and anomalies.

Specific techniques for each type of noisy data:

Outliers:
Median absolute deviation (MAD) method: Calculate the median absolute deviation from
the median and remove data points that are more than 3-4 times the MAD away from the
median.
Density-based spatial clustering of applications with noise (DBSCAN): Identify clusters
and remove outliers based on their proximity to other data points.
Noise:
Gaussian mixture model (GMM): Model the data as a mixture of Gaussian distributions
and remove data points that do not fit the model.
Local outlier factor (LOF): Compare each data point's local density with that of its neighbors
and remove points whose density is substantially lower than their neighbors'.
Missing values:
Mean imputation: Replace missing values with the mean of the corresponding feature.
Median imputation: Replace missing values with the median of the corresponding feature.
K-nearest neighbors (KNN) imputation: Replace missing values with a value derived from the
k nearest neighbors (for example, their mean).
Anomalies:
Isolation forest: Identify anomalies by isolating them from other data points using an
ensemble of decision trees.
One-class SVM: Train a support vector machine on normal data points only and flag as
anomalies the points that fall outside the learned boundary.
Duplicates:
Duplicate detection algorithms: Use algorithms like Jaro-Winkler distance or Levenshtein
distance to identify duplicate data points.
Remember that there is no one-size-fits-all solution for removing noisy data. The choice of
technique depends on the specific characteristics of your data and the problem you're trying
to solve.
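
The following sketch applies a few of the techniques above (duplicate removal, median imputation, and MAD-based outlier filtering) with pandas; the DataFrame contents and the 3.5 cutoff are illustrative assumptions.

```python
# A sketch of basic cleaning steps: duplicates, missing values, and outliers.
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp": [21.0, 22.5, 21.8, 95.0, np.nan, 22.1, 22.1],
                   "humidity": [40, 42, 41, 39, 43, 44, 44]})

# Duplicates: drop identical rows.
df = df.drop_duplicates()

# Missing values: median imputation, per column.
df = df.fillna(df.median(numeric_only=True))

# Outliers: keep points within ~3.5 MAD of the median (robust z-score).
med = df["temp"].median()
mad = (df["temp"] - med).abs().median()
robust_z = 0.6745 * (df["temp"] - med) / mad
df = df[robust_z.abs() <= 3.5]

print(df)  # the 95.0 reading is dropped as an outlier
```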

**What are L1 and L2 regularization?**

In machine learning, regularization is a technique used to prevent overfitting by adding a
penalty term to the loss function. The goal is to reduce the model's complexity and prevent it
from becoming too specialized to the training data.
**L1 Regularization (Lasso)**

L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator),
adds a term to the loss function that is proportional to the absolute value of each model
coefficient. The idea is to encourage some coefficients to be zero, effectively removing them
from the model.

Mathematically, L1 regularization adds a term to the loss function:

`L = (1/2) * ||y - Xw||^2 + α * Σ|w_j|`

where `L` is the loss function, `y` is the target vector, `X` is the feature matrix, `w` is the
model coefficient vector, `α` is the regularization strength, and `Σ|w_j|` is the sum of the
absolute values of the coefficients (the L1 norm of `w`).

**L2 Regularization (Ridge)**

L2 regularization, also known as Ridge regression, adds a term to the loss function that is
proportional to the square of each model coefficient. The idea is to discourage large
coefficients by adding a penalty term.

Mathematically, L2 regularization adds a term to the loss function:

`L = (1/2) * ||y - Xw||^2 + α * Σ w_j^2`

where `L` is the loss function, `y` is the target vector, `X` is the feature matrix, `w` is the
model coefficient vector, `α` is the regularization strength, and `Σ w_j^2` is the sum of the
squared coefficients (the squared L2 norm of `w`).

**Example:**

Suppose we have a simple linear regression model with two features (`x1` and `x2`) and one
target variable (`y`). We want to predict `y` using these features.
**L1 Regularization (Lasso)**

Let's say we have a dataset with 10 samples, and our model has coefficients `w1 = 3.5`, `w2 =
2.8`, and an intercept term `b = 0.5`. The loss function without regularization would be:

`L = (1/2) * (y - (x1*w1 + x2*w2 + b))^2`

To add L1 regularization with a strength of `α = 0.5`, we would modify the loss function as
follows:

`L = (1/2) * (y - (x1*w1 + x2*w2 + b))^2 + 0.5 * |w1| + 0.5 * |w2|`

In this case, the L1 regularization term pulls each coefficient toward zero with a constant
penalty of size `α` on its magnitude, regardless of how large the coefficient is. Coefficients
whose contribution to reducing the squared error is smaller than this penalty are driven
exactly to zero, effectively removing the corresponding features from the model.

**L2 Regularization (Ridge)**

Using L2 regularization with a strength of `α = 0.5`, we would modify the loss function as
follows:

`L = (1/2) * (y - (x1*w1 + x2*w2 + b))^2 + 0.5 * w1^2 + 0.5 * w2^2`

In this case, the L2 regularization term discourages large coefficients by adding a penalty
term that grows quadratically with their magnitude.
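
As a quick numeric check of the two penalty terms for the example coefficients above (with `α = 0.5`):

```python
# Penalty terms for the example coefficients above (w1 = 3.5, w2 = 2.8, alpha = 0.5).
w1, w2, alpha = 3.5, 2.8, 0.5
l1_penalty = alpha * (abs(w1) + abs(w2))   # 0.5 * (3.5 + 2.8)    = 3.15
l2_penalty = alpha * (w1**2 + w2**2)       # 0.5 * (12.25 + 7.84) = 10.045
print(l1_penalty, l2_penalty)
```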

By applying L1 or L2 regularization, we can reduce overfitting and improve the
generalization performance of our model.
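
To see the qualitative difference in practice, here is a minimal sketch assuming scikit-learn's `Lasso` and `Ridge` estimators on a synthetic dataset in which only two of ten features are informative; the data generation and `alpha` values are illustrative choices.

```python
# Comparing L1 (Lasso) and L2 (Ridge) regularization on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only two informative features
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))   # uninformative ones driven exactly to zero
print("Ridge coefficients:", np.round(ridge.coef_, 2))   # shrunk toward zero, but rarely exactly zero
```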
**What is Pruning in Machine Learning?**
Pruning in machine learning is a technique used to reduce the complexity of a trained model
by removing unnecessary or redundant parts of the model. The goal is to improve the model's
performance, interpretability, and scalability by reducing its size, computational
requirements, and memory usage.

**Why is Pruning necessary?**

As models become more complex and deeper, they can become prone to overfitting, which
means they become too specialized to the training data and fail to generalize well to new,
unseen data. Pruning helps to address this issue by:

1. **Reducing Overfitting**: By removing unnecessary parameters, pruning helps to reduce
the model's capacity to fit the noise in the training data, which can lead to overfitting.
2. **Improving Interpretability**: By simplifying the model, pruning can make it easier to
understand and interpret the relationships between features and the predictions.
3. **Enhancing Scalability**: Pruning can reduce the computational requirements and
memory usage of the model, making it more suitable for large-scale applications.

**Types of Pruning:**

There are several types of pruning techniques, including:

1. **Post-pruning**: This involves pruning the model after it has been trained.
2. **Pre-pruning**: This involves pruning the model during training.
3. **Layer-wise pruning**: This involves pruning individual layers of the model.
4. **Filter-wise pruning**: This involves pruning individual filters or neurons within a layer.
5. **Weight pruning**: This involves pruning individual weights or connections within a
layer.

**Pruning Algorithms:**

Some popular pruning algorithms include:


1. **L1 Regularization**: This involves adding a penalty term to the loss function that
encourages weights to be zero.
2. **L2 Regularization**: This involves adding a penalty term to the loss function that
encourages weights to be small.
3. **Dropout**: This involves randomly dropping out neurons during training.
4. **Gaussian Pruning**: This involves randomly dropping out neurons based on a Gaussian
distribution.
5. **Thermal Pruning**: This involves dropping out neurons based on their thermal activity.

**Pruning Techniques:**

Some popular pruning techniques include:

1. **Random Pruning**: This involves randomly selecting neurons or weights to prune.
2. **Gradient-based Pruning**: This involves selecting neurons or weights based on their
gradients.
3. **Taylor Series Pruning**: This involves approximating the loss function using a Taylor
series and pruning based on this approximation.
4. **Mutual Information Pruning**: This involves pruning neurons or weights based on their
mutual information.
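
As a concrete illustration of the weight pruning listed above, the sketch below applies simple magnitude-based pruning, zeroing the smallest-magnitude weights of a layer. The weight matrix and the 50% sparsity target are illustrative assumptions; magnitude-based selection is one common criterion alongside the random and gradient-based criteria just described.

```python
# Magnitude-based weight pruning: zero out the smallest-magnitude weights in a layer.
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(size=(8, 8))   # stand-in for a trained layer's weight matrix
sparsity = 0.5                      # fraction of weights to remove

threshold = np.quantile(np.abs(weights), sparsity)
mask = np.abs(weights) >= threshold  # keep only the largest-magnitude weights
pruned = weights * mask

print("weights kept:", int(mask.sum()), "of", weights.size)
```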

**Pruning in Practice:**

Pruning is commonly used in practice in various domains, including:

1. **Computer Vision**: Pruning is used in image classification and object detection tasks to
reduce the computational requirements of deep neural networks.
2. **Natural Language Processing**: Pruning is used in language models and text
classification tasks to reduce the complexity of word embeddings and language models.
3. **Speech Recognition**: Pruning is used in speech recognition systems to reduce the
computational requirements of acoustic models.

In conclusion, pruning is a powerful technique for reducing the complexity of machine
learning models and improving their performance, interpretability, and scalability. By
understanding the different types of pruning techniques and algorithms, we can better apply
this technique in our own projects and improve our models' performance.

What is Decision Tree Pruning?


Decision tree pruning is a technique used to prevent decision trees from overfitting the
training data. Pruning aims to simplify the decision tree by removing parts of it that do not
provide significant predictive power, thus improving its ability to generalize to new data.

Decision Tree Pruning removes unwanted nodes from an overfitted decision tree to make it
smaller, which results in faster, more accurate, and more effective predictions.

Types Of Decision Tree Pruning


There are two main types of decision tree pruning: Pre-Pruning and Post-Pruning.

Pre-Pruning (Early Stopping)


Sometimes the growth of the decision tree is stopped before it gets too complex; this is
called pre-pruning. It is important for preventing overfitting of the training data, which results
in poor performance when the model is exposed to new data.

Some common pre-pruning techniques include:

Maximum Depth: Limit the maximum depth of the decision tree.
Minimum Samples per Leaf: Set a minimum threshold for the number of samples in each leaf
node.
Minimum Samples per Split: Specify the minimum number of samples needed to split a node.
Maximum Features: Restrict the number of features considered for splitting.

By pruning early, we end up with a simpler tree that is less likely to overfit the training
data.
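
In scikit-learn's `DecisionTreeClassifier`, these pre-pruning controls map directly onto constructor parameters; the sketch below is illustrative, and the dataset and parameter values are assumptions.

```python
# Pre-pruning (early stopping) expressed as DecisionTreeClassifier parameters.
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=4,           # Maximum Depth
    min_samples_leaf=5,    # Minimum Samples per Leaf
    min_samples_split=10,  # Minimum Samples per Split
    max_features="sqrt",   # Maximum Features considered at each split
    random_state=0,
).fit(X, y)

print("number of leaves:", tree.get_n_leaves(), "depth:", tree.get_depth())
```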

Post-Pruning (Reducing Nodes)


After the tree is fully grown, post-pruning involves removing branches or nodes to improve
the model’s ability to generalize. Some common post-pruning techniques include:
Cost-Complexity Pruning (CCP): This method assigns a cost to each subtree based on its
accuracy and complexity, then selects the subtree with the lowest cost.
Reduced Error Pruning: Removes branches that do not significantly affect the overall
accuracy.
Minimum Impurity Decrease: Prunes nodes if the decrease in impurity (Gini impurity or
entropy) is below a certain threshold.
Minimum Leaf Size: Removes leaf nodes with fewer samples than a specified threshold.
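
As a sketch of post-pruning in practice, scikit-learn implements cost-complexity pruning through the `ccp_alpha` parameter and the `cost_complexity_pruning_path` helper; the dataset and the way `alpha` is selected below are illustrative assumptions.

```python
# Post-pruning via cost-complexity pruning (CCP) in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha with the best cross-validated accuracy.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean(),
)

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("chosen alpha:", round(best_alpha, 5), "leaves:", pruned.get_n_leaves())
```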
