Unit-4: Machine Learning
A Decision Tree expresses its decision logic in Sum of Products form, also known as Disjunctive Normal Form. In the example illustrated above, the tree predicts whether people use a computer in their daily life. The major challenge in building a Decision Tree is identifying the attribute to place at the root node and at each subsequent level. This process is known as attribute selection. There are two popular attribute selection measures:
1. Information Gain
2. Gini Index
1. Information Gain:
When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
Suppose S is a set of instances, A is an attribute, Sv is the subset of S for which attribute A has the value v, and Values(A) is the set of all possible values of A. Then:
Gain(S, A) = Entropy(S) − Σ over v in Values(A) of (|Sv| / |S|) · Entropy(Sv)
where Entropy(S) = − Σ over classes i of p_i · log2(p_i), and p_i is the proportion of instances in S that belong to class i.
Example:
For the set X = {a, a, a, b, b, b, b, b}:
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy(X) = −(3/8) · log2(3/8) − (5/8) · log2(5/8) ≈ 0.954
Now consider the following small dataset with binary features X, Y, Z and class label C:
X  Y  Z  C
1  1  1  I
1  1  0  I
0  0  1  II
1  0  0  II
The original figures (not reproduced here) show the result of splitting this dataset on feature X, on feature Y, and on feature Z.
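Since those figures are not reproduced here, the following plain-Python sketch recomputes the information gain of each candidate split on the four-row dataset above; the helper functions are our own naming, written just for this illustration.

from collections import Counter
from math import log2

# Toy dataset from the table above: rows of (X, Y, Z, class)
rows = [(1, 1, 1, 'I'), (1, 1, 0, 'I'), (0, 0, 1, 'II'), (1, 0, 0, 'II')]

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature_index):
    # Entropy of the parent minus the weighted entropy of the children
    parent = [r[-1] for r in rows]
    gain = entropy(parent)
    for value in set(r[feature_index] for r in rows):
        child = [r[-1] for r in rows if r[feature_index] == value]
        gain -= (len(child) / len(rows)) * entropy(child)
    return gain

for index, name in enumerate(['X', 'Y', 'Z']):
    print(name, round(information_gain(rows, index), 3))
# Prints (one per line): X 0.311, Y 1.0, Z 0.0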
From these values we can see that the information gain is maximum when we split on feature Y, so feature Y is the best-suited feature for the root node. Moreover, when the dataset is split on feature Y, each child node contains a pure subset of the target variable, so no further splitting is needed. The final tree for this dataset is therefore a single test on Y: if Y = 1 the prediction is class I, and if Y = 0 the prediction is class II.
2. Gini Index
Gini Index is a metric that measures how often a randomly chosen element would be incorrectly classified if it were labeled at random according to the class distribution in the node.
An attribute with a lower Gini index should therefore be preferred.
Sklearn supports the "gini" criterion for the Gini Index, and it is the default value of the criterion parameter.
The formula for the Gini Index is:
Gini = 1 − Σ over classes i of (p_i)²
where p_i is the proportion of instances belonging to class i (a short worked example appears after the list of characteristics below).
The Gini Index is a measure of the inequality or impurity of a distribution, commonly used in decision trees and other machine learning algorithms. It equals 0 when a node is perfectly pure (all instances belong to the same class) and increases as the classes become more evenly mixed, up to a maximum of 1 − 1/k for k classes (0.5 for a binary problem).
Some additional features and characteristics of the Gini Index
are:
It is calculated by summing the squared probabilities of
each outcome in a distribution and subtracting the result
from 1.
A lower Gini Index indicates a more homogeneous or pure
distribution, while a higher Gini Index indicates a more
heterogeneous or impure distribution.
In decision trees, the Gini Index is used to evaluate the
quality of a split by measuring the difference between the
impurity of the parent node and the weighted impurity of
the child nodes.
Compared to other impurity measures like entropy, the Gini Index is slightly faster to compute because it avoids logarithms, and in practice the two measures usually lead to very similar splits.
One disadvantage of the Gini Index is that it tends to favor
splits that create equally sized child nodes, even if they are
not optimal for classification accuracy.
In practice, the choice between using the Gini Index or
other impurity measures depends on the specific problem
and dataset, and often requires experimentation and
tuning.
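As a quick worked illustration of the formula above, the sketch below computes the Gini impurity of a node from its class counts; the counts are invented purely for the example.

def gini_index(class_counts):
    # Gini impurity: 1 minus the sum of squared class probabilities
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

print(gini_index([10, 0]))   # pure node -> 0.0
print(gini_index([5, 5]))    # evenly mixed binary node -> 0.5 (maximum for two classes)
print(gini_index([7, 3]))    # 1 - (0.49 + 0.09) = 0.42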
Example of a Decision Tree Algorithm
Forecasting Activities Using Weather Information
Root node: the whole dataset.
Attribute: "Outlook" (sunny, overcast, rainy).
Subsets: Sunny, Overcast, and Rainy.
Recursive splitting: divide the Sunny subset further, for example according to humidity.
Leaf nodes: activities such as "swimming," "hiking," and "staying inside."
Beginning with the entire dataset as the root node of the decision tree:
1. Determine the best attribute to split the dataset on, based on information gain, which is calculated by the formula: Information gain = Entropy(parent) − [Weighted average] × Entropy(children), where entropy is a measure of impurity or disorder of a set of examples, and the weighted average is based on the number of examples in each child node.
2. Create a new internal node that corresponds to the best attribute and connect it to the root node. For example, if the best attribute is "outlook" (which can take the values "sunny", "overcast", or "rainy"), we create a new node labeled "outlook" and connect it to the root node.
3. Partition the dataset into subsets based on the values of the best attribute. For example, we create three subsets: one for instances where the outlook is "sunny", one for instances where the outlook is "overcast", and one for instances where the outlook is "rainy".
4. Recursively repeat steps 1-3 for each subset until all instances in a given subset belong to the same class or no further splitting is possible. For example, if the subset of instances where the outlook is "overcast" contains only instances where the activity is "hiking", we assign a leaf node labeled "hiking" to this subset. If the subset of instances where the outlook is "sunny" needs further splitting, for example on the humidity attribute, we repeat steps 1-3 for this subset.
5. Assign a leaf node to each subset whose instances all belong to the same class. For example, if the subset of instances where the outlook is "rainy" contains only instances where the activity is "stay inside", we assign a leaf node labeled "stay inside" to this subset.
6. Make predictions by traversing the tree from the root node to the leaf node that corresponds to the instance being classified. For example, if the outlook is "sunny" and the humidity is "high", we follow the "sunny" branch and then the "high humidity" branch and end up at a leaf node labeled "swimming", which is our predicted activity; a minimal scikit-learn sketch of this workflow is given below.
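A minimal scikit-learn sketch of this workflow, assuming a small hand-made weather table (the rows below are illustrative only, not data from the text); categorical attributes are one-hot encoded because DecisionTreeClassifier expects numeric input:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical weather data, for illustration only
data = pd.DataFrame({
    'outlook':  ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'overcast'],
    'humidity': ['high', 'normal', 'high', 'high', 'normal', 'normal'],
    'activity': ['swimming', 'hiking', 'hiking', 'stay inside', 'stay inside', 'hiking'],
})

X = pd.get_dummies(data[['outlook', 'humidity']])   # one-hot encode categorical attributes
y = data['activity']

tree = DecisionTreeClassifier(criterion='entropy')  # split using information gain
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Classify a new instance: sunny outlook with high humidity
sample = pd.DataFrame([{'outlook': 'sunny', 'humidity': 'high'}])
sample = pd.get_dummies(sample).reindex(columns=X.columns, fill_value=0)
print(tree.predict(sample))   # expected prediction for this toy data: 'swimming'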
Advantages of Decision Tree
Easy to understand and interpret, making them accessible
to non-experts.
Handle both numerical and categorical data without
requiring extensive preprocessing.
Provide insights into feature importance for decision-making.
Handle missing values and outliers without significant
impact.
Applicable to both classification and regression tasks.
Disadvantages of Decision Tree
Potential for overfitting.
Sensitivity to small changes in the data.
Limited generalization if the training data is not representative.
Potential bias in the presence of imbalanced data.
Conclusion
Decision trees, a key tool in machine learning, model and predict
outcomes based on input data through a tree-like structure. They
offer interpretability, versatility, and simple visualization, making
them valuable for both classification and regression tasks. While
decision trees have advantages like ease of understanding, they
may face challenges such as overfitting. Understanding their
terminologies and formation process is essential for effective
application in diverse scenarios.
Frequently Asked Questions (FAQs)
1. What are the major issues in decision tree learning?
Major issues in decision tree learning include overfitting, sensitivity
to small data changes, and limited generalization. Ensuring proper
pruning, tuning, and handling imbalanced data can help mitigate
these challenges for more robust decision tree models.
2. How does decision tree help in decision making?
Decision trees aid decision-making by representing complex
choices in a hierarchical structure. Each node tests specific
attributes, guiding decisions based on data values. Leaf nodes
provide final outcomes, offering a clear and interpretable path for
decision analysis in machine learning.
3. What is the maximum depth of a decision tree?
The maximum depth of a decision tree is a hyperparameter that
determines the maximum number of levels or nodes from the root
to any leaf. It controls the complexity of the tree and helps prevent
overfitting.
4. What is the concept of decision tree?
A decision tree is a supervised learning algorithm that models
decisions based on input features. It forms a tree-like structure
where each internal node represents a decision based on an
attribute, leading to leaf nodes representing outcomes.
5. What is entropy in decision tree?
In decision trees, entropy is a measure of impurity or disorder
within a dataset. It quantifies the uncertainty associated with
classifying instances, guiding the algorithm to make informative
splits for effective decision-making.
6. What are the Hyperparameters of decision tree?
1. Max Depth: Maximum depth of the tree.
2. Min Samples Split: Minimum samples required to split an
internal node.
3. Min Samples Leaf: Minimum samples required in a leaf
node.
4. Criterion: The function used to measure the quality of a split, e.g. Gini index or entropy (these map directly onto scikit-learn arguments, as sketched below).
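For reference, these hyperparameters correspond directly to constructor arguments of scikit-learn's DecisionTreeClassifier; the values below are arbitrary examples, not recommendations:

from sklearn.tree import DecisionTreeClassifier

# Example values only; tune them for your own dataset
model = DecisionTreeClassifier(
    max_depth=4,            # maximum depth of the tree
    min_samples_split=10,   # minimum samples required to split an internal node
    min_samples_leaf=5,     # minimum samples required in a leaf node
    criterion='gini',       # function used to measure the quality of a split
)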
Ensemble Learning and Random Forest
Gradient Boosting
Gradient Boosting is a popular boosting algorithm in machine learning, used for both classification and regression tasks. Boosting is an ensemble learning method that trains models sequentially, with each new model trying to correct the errors of the previous one; it combines several weak learners into a strong learner. The two most popular boosting algorithms are:
1. AdaBoost
2. Gradient Boosting
Gradient Boosting
Gradient Boosting is a powerful boosting algorithm that combines
several weak learners into strong learners, in which each new
model is trained to minimize the loss function such as mean
squared error or cross-entropy of the previous model using gradient
descent. In each iteration, the algorithm computes the gradient of
the loss function with respect to the predictions of the current
ensemble and then trains a new weak model to minimize this
gradient. The predictions of the new model are then added to the
ensemble, and the process is repeated until a stopping criterion is
met.
In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels. There is a technique called Gradient Boosted Trees whose base learner is CART (Classification and Regression Trees). For regression problems, gradient-boosted trees are trained by repeatedly fitting a new tree to the current residuals, as sketched below.
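The residual-fitting idea can be sketched in a few lines: each new shallow regression tree is fit to the residuals of the current ensemble, and its shrunken predictions are added to the running prediction. This is a simplified illustration on synthetic data, not a full gradient-boosted-trees implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant prediction
trees = []

for _ in range(100):
    residuals = y - prediction                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)       # weak learner: a shallow CART
    tree.fit(X, residuals)                          # fit the tree to the residuals
    prediction += learning_rate * tree.predict(X)   # add its (shrunken) contribution
    trees.append(tree)

print("Final training MSE:", np.mean((y - prediction) ** 2))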
Clustering analysis, or simply clustering, is an unsupervised learning method that divides the data points into a number of specific batches or groups, such that the data points in the same group have similar properties and data points in different groups have dissimilar properties in some sense. It comprises many different methods based on different notions of distance or similarity.
E.g. K-Means (distance between points), Affinity propagation (graph
distance), Mean-shift (distance between points), DBSCAN (distance
between nearest points), Gaussian mixtures (Mahalanobis distance
to centers), Spectral clustering (graph distance), etc.
Fundamentally, all clustering methods use the same approach: first we calculate similarities, and then we use them to group the data points into clusters. Here we will focus on the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method.
Density-Based Spatial Clustering Of
Applications With Noise (DBSCAN)
Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive notion of "clusters" and "noise". The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical
clustering work for finding spherical-shaped clusters or convex
clusters. In other words, they are suitable only for compact and well-
separated clusters. Moreover, they are also severely affected by the
presence of noise and outliers in the data.
Real-life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in
the figure below.
2. Data may contain noise.
The figure above shows a data set containing non-convex shape
clusters and outliers. Given such data, the k-means algorithm has
difficulties in identifying these clusters with arbitrary shapes.
The core of the DBSCAN algorithm can be written as pseudocode; the fragment below is the cluster-expansion step, where N is the eps-neighborhood of the current core point p and N' is the neighborhood of a point p' found within N:

if |N| >= MinPts:
    N = N ∪ N'
    if p' is not a member of any cluster:
        add p' to cluster C
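For completeness, here is a minimal scikit-learn sketch of DBSCAN on the kind of non-convex data described above; make_moons is used only as an illustrative dataset, and eps and min_samples are example values:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters that K-means struggles with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)   # neighbourhood radius (eps) and MinPts
labels = db.fit_predict(X)            # label -1 marks points classified as noise

print("Clusters found:", len(set(labels) - {-1}))
print("Noise points:", list(labels).count(-1))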
Spectral Clustering
Spectral Clustering is a clustering algorithm that uses the connectivity between data points to form clusters. It uses the eigenvalues and eigenvectors of a similarity (affinity) matrix to project the data into a lower-dimensional space, in which the points are then clustered. It is based on a graph representation of the data, where the data points are represented as nodes and the similarity between data points is represented by weighted edges.
Steps performed for spectral Clustering
Building the Similarity Graph Of The Data: This step builds the
Similarity Graph in the form of an adjacency matrix which is
represented by A. The adjacency matrix can be built in the
following manners:
Epsilon-neighbourhood Graph: A parameter epsilon is fixed beforehand. Then, each point is connected to all the points that lie within its epsilon-radius. If all the distances between points are on a similar scale, then the edge weights (i.e., the distances between the two points) are typically not stored, since they do not provide any additional information. Thus, in this case, the graph built is undirected and unweighted.
K-Nearest Neighbours Graph: A parameter k is fixed beforehand. Then, for two vertices u and v, an edge is directed from u to v only if v is among the k-nearest neighbours of u. Note that this leads to a weighted and directed graph, because it is not always the case that when u has v among its k-nearest neighbours, v also has u among its k-nearest neighbours. To make this graph undirected, one of the following approaches is followed:
1. Connect u and v (with edges in both directions) if either v is among the k-nearest neighbours of u OR u is among the k-nearest neighbours of v.
2. Connect u and v (with edges in both directions) if v is among the k-nearest neighbours of u AND u is among the k-nearest neighbours of v.
Fully-Connected Graph: To build this graph, each point is connected to every other point by an undirected edge weighted by the distance between the two points. Since this approach is used to model local neighbourhood relationships, the Gaussian similarity metric is typically used to compute the weights. (A scikit-learn sketch of spectral clustering follows below.)
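These steps are what scikit-learn's SpectralClustering performs internally; the brief sketch below uses affinity='nearest_neighbors', which corresponds to the k-nearest-neighbour graph construction, with parameter values chosen only for illustration:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Concentric circles: connectivity-based structure that plain K-means misses
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

sc = SpectralClustering(
    n_clusters=2,
    affinity='nearest_neighbors',   # build a k-nearest-neighbour similarity graph
    n_neighbors=10,
    assign_labels='kmeans',
    random_state=0,
)
labels = sc.fit_predict(X)
print(labels[:20])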
One of the primary disadvantages of any clustering technique is that
it is difficult to evaluate its performance. To tackle this problem, the
metric of V-Measure was developed. The calculation of the V-
Measure first requires the calculation of two terms:-
1. Homogeneity: A perfectly homogeneous clustering is one
where each cluster has data-points belonging to the same
class label. Homogeneity describes the closeness of the
clustering algorithm to this perfection.
2. Completeness: A perfectly complete clustering is one
where all data-points belonging to the same class are
clustered into the same cluster. Completeness describes the
closeness of the clustering algorithm to this perfection.
Trivial Homogeneity: It is the case when the number of clusters is
equal to the number of data points and each point is in exactly one
cluster. It is the extreme case when homogeneity is highest while
completeness is minimum.
Trivial Completeness: It is the case when all the data points are
clustered into one cluster. It is the extreme case when homogeneity
is minimum and completeness is maximum.
Formally, homogeneity and completeness are defined in terms of conditional entropy, where C denotes the class labels and K the cluster assignments:
h = 1 − H(C | K) / H(C)
c = 1 − H(K | C) / H(K)
Thus the V-Measure is the harmonic mean of the two:
V = 2 · h · c / (h + c)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
df = pd.read_csv('creditcard.csv')
y = df['Class']
X = df.drop('Class', axis=1)
X.head()
Step 3: Building different clustering models and comparing their V-Measure scores
In this step, 5 different K-Means clustering models will be built, each clustering the data into a different number of clusters.
Python3
# List of V-Measure Scores for different models
v_scores = []
N_Clusters = [2, 3, 4, 5, 6]
a) n_clusters = 2
Python3
# Building the clustering model
kmeans2 = KMeans(n_clusters = 2)
kmeans2.fit(X)
labels2 = kmeans2.predict(X)
v_scores.append(v_measure_score(y, labels2))
b) n_clusters = 3
Python3
# Building the clustering model
kmeans3 = KMeans(n_clusters = 3)
kmeans3.fit(X)
labels3 = kmeans3.predict(X)
v_scores.append(v_measure_score(y, labels3))
c) n_clusters = 4
Python3
# Building the clustering model
kmeans4 = KMeans(n_clusters = 4)
kmeans4.fit(X)
labels4 = kmeans4.predict(X)
v_scores.append(v_measure_score(y, labels4))
d) n_clusters = 5
Python3
# Building the clustering model
kmeans5 = KMeans(n_clusters = 5)
# Training the clustering model
kmeans5.fit(X)
labels5 = kmeans5.predict(X)
v_scores.append(v_measure_score(y, labels5))
e) n_clusters = 6
Python3
# Building the clustering model
kmeans6 = KMeans(n_clusters = 6)
kmeans6.fit(X)
labels6 = kmeans6.predict(X)
v_scores.append(v_measure_score(y, labels6))
plt.bar(N_Clusters, v_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('V-Measure Score')
plt.title('Comparison of different Clustering Models')
plt.show()
The Adjusted Rand Index (ARI) measures the similarity between two clusterings: the higher the ARI value, the closer the two clusterings are to each other. It ranges from -1 to 1, where 1 indicates perfect agreement between the two clusterings, 0 indicates agreement no better than random, and negative values indicate agreement worse than random. The ARI is widely used in machine learning, data mining, and pattern recognition, especially for the evaluation of clustering algorithms.
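A short example of computing the ARI with scikit-learn's adjusted_rand_score, using two made-up label assignments:

from sklearn.metrics import adjusted_rand_score

# Made-up ground-truth classes and cluster assignments, for illustration only
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [1, 1, 0, 0, 0, 0]

print(adjusted_rand_score(true_labels, cluster_labels))      # imperfect agreement
print(adjusted_rand_score(true_labels, [1, 1, 1, 0, 0, 0]))  # same partition under relabeling -> 1.0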
Meta-Learning
Advantages of Meta-learning
1. Meta-Learning offers more speed: Meta-learning
approaches can produce learning architectures that
perform better and faster than hand-crafted models.
2. Better generalization: Meta-learning models can
frequently generalize to new tasks more effectively by
learning to learn, even when the new tasks are very
different from the ones they were trained on.
3. Scaling: Meta-learning can automate the process of
choosing and fine-tuning algorithms, thereby increasing the
potential to scale AI applications.
4. Fewer data required: These approaches assist in the development of more general systems, which can transfer knowledge from one context to another. This reduces the amount of data needed to solve problems in a new context.
5. Improved performance: Meta-learning can help improve
the performance of machine learning models by allowing
them to adapt to different datasets and learning
environments. By leveraging prior knowledge and
experience, meta-learning models can quickly adapt to new
situations and make better decisions.
6. Fewer hyperparameters: Meta-learning can help reduce
the number of hyperparameters that need to be tuned
manually. By learning to optimize these parameters
automatically, meta-learning models can improve their
performance and reduce the need for manual tuning.
Meta-learning Optimization
During the training of a machine learning algorithm, hyperparameters are the configuration values that govern how the model's parameters are learned. These variables have a direct impact on how successfully a model trains. Hyperparameters may be optimized in several ways.
1. Grid Search: The Grid Search technique uses manually specified sets of hyperparameter values. Every combination of hyperparameter values (within the given ranges) is evaluated during a grid search, and the best-performing combination is then selected. Because the search is exhaustive, it can be slow and computationally expensive, which is why it is regarded as the conventional approach. Grid Search may be found in the Sklearn library.
2. Random Search: The random search approach looks for the best hyperparameters for the model by evaluating random combinations of the hyperparameter values. Although it is similar in spirit to grid search, it has been shown to often produce better results for the same computational budget. A disadvantage of random search is that its results can vary from run to run. Random Search may be found in the Sklearn library, and it often outperforms Grid Search when only a few hyperparameters strongly influence performance. Both techniques are sketched below.
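Both techniques are available in scikit-learn as GridSearchCV and RandomizedSearchCV; the sketch below tunes a decision tree on a toy dataset, with search ranges chosen only for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy'],
}

# Grid Search: exhaustively tries every combination in param_grid
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print("Grid search best parameters:", grid.best_params_)

# Random Search: samples a fixed number of random combinations
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                          n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print("Random search best parameters:", rand.best_params_)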
Applications of Meta-learning
Meta-learning algorithms are already in use in various applications,
some of which are:
1. Online learning tasks in reinforcement learning
2. Sequence modeling in Natural language processing
3. Image classification tasks in Computer vision
4. Few-shot learning: Meta-learning can be used to train
models that can quickly adapt to new tasks with limited
data. This is particularly useful in scenarios where the cost
of collecting large amounts of data is prohibitively high,
such as in medical diagnosis or autonomous driving.
5. Model selection: Meta-learning can help automate the
process of model selection by learning to choose the best
model for a given task based on past experience. This can
save time and resources while also improving the accuracy
and robustness of the resulting model.
6. Hyperparameter optimization: Meta-learning can be
used to automatically tune hyperparameters for machine-
learning models. By learning from past experience, meta-
learning models can quickly find the best hyperparameters
for a given task, leading to better performance and faster
training times.
7. Transfer learning: Meta-learning can be used to facilitate
transfer learning, where knowledge learned in one domain
is transferred to another domain. This can be especially
useful in scenarios where data is scarce or where the target
domain is vastly different from the source domain.
8. Recommender systems: Meta-learning can be used to
build better recommender systems by learning to
recommend the most relevant items based on past user
behavior. This can improve the accuracy and relevance of
recommendations, leading to better user engagement and
satisfaction.