Solution For DWDM Problems
To find the best split point for a decision tree for attribute A, you typically need to
evaluate different potential split points and choose the one that maximizes the
separation between classes. The idea is to find a threshold value for attribute A that
best separates the data into different classes.
Here's a step-by-step process to find the best split point for attribute A:
1. Sort the records by the value of attribute A:
A  Class
1  C1
2  C1
2  C2
3  C2
3  C2
4  C1
4  C1
2. Compute the candidate split points as the midpoints between consecutive distinct
values of A:
Midpoint
1.5
2.5
3.5
3. For each candidate split point, compute the weighted Gini index of the two
partitions it produces.
Gini index of a partition = 1 - Σ(p_i)^2, where p_i is the proportion of
samples of class i in the partition. The quality of a split is the weighted
average of the Gini indices of the left (A <= midpoint) and right (A > midpoint)
partitions.
For example, for the first midpoint (1.5):
Left (A <= 1.5): 1 C1, Gini = 1 - (1/1)^2 = 0
Right (A > 1.5): 3 C1 and 3 C2, Gini = 1 - (3/6)^2 - (3/6)^2 = 0.5
Weighted Gini = (1/7)(0) + (6/7)(0.5) ≈ 0.429
Repeating the calculation for the other midpoints:
2.5: Left 2 C1, 1 C2 (Gini ≈ 0.444); Right 2 C1, 2 C2 (Gini = 0.5); weighted ≈ 0.476
3.5: Left 2 C1, 3 C2 (Gini = 0.48); Right 2 C1 (Gini = 0); weighted ≈ 0.343
4. Choose the midpoint that results in the lowest weighted impurity (Gini index, in
this case).
In this example, the midpoint 3.5 gives the lowest weighted Gini index (≈ 0.343), so
it would be chosen as the best split point for attribute A.
Note: The above process is a simplified explanation, and in practice, decision tree
algorithms may use other impurity measures (such as entropy or misclassification
rate) and may consider multiple attributes simultaneously. The specific details can
vary depending on the algorithm being used.
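As a quick numerical check of the procedure above, here is a minimal Python sketch (not part of the original solution; the data list and helper names are ours) that scores every candidate midpoint by the weighted Gini index:

from collections import Counter

# (A value, class label) pairs from the table above, already sorted by A
data = [(1, "C1"), (2, "C1"), (2, "C2"), (3, "C2"), (3, "C2"), (4, "C1"), (4, "C1")]

def gini(labels):
    # Gini index of a set of class labels: 1 - sum(p_i^2)
    n = len(labels)
    return (1.0 - sum((c / n) ** 2 for c in Counter(labels).values())) if n else 0.0

def weighted_gini(split_point):
    # Weighted Gini of the two partitions induced by A <= split_point
    left = [cls for a, cls in data if a <= split_point]
    right = [cls for a, cls in data if a > split_point]
    n = len(data)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Candidate split points: midpoints between consecutive distinct values of A
values = sorted({a for a, _ in data})
midpoints = [(v1 + v2) / 2 for v1, v2 in zip(values, values[1:])]

for m in midpoints:
    print(f"split at {m}: weighted Gini = {weighted_gini(m):.3f}")
print("best split point:", min(midpoints, key=weighted_gini))

Running it reproduces the values above (0.429, 0.476, 0.343) and reports 3.5 as the best split point.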
Make a decision tree for the following database using Gini Index. Indicate all
intermediate steps.
Example Colour Shape Size Class
1 Red Square Big +
2 Blue Square Big +
3 Red Circle Big +
4 Red Circle Small –
5 Green Square Small –
6 Green Square Big –
Solution:
To create a decision tree using the Gini Index, we'll go through the process of
selecting the best split points for each attribute at each level of the tree. Here are the
steps:
Step 1: Calculate the Gini Index for the root node (considering all six examples,
three of class + and three of class –):
Gini(root) = 1 - (3/6)^2 - (3/6)^2 = 0.5
Step 2: Calculate the weighted Gini Index of a split on each attribute:
Colour:
Red branch (examples 1, 3, 4: two +, one –): Gini = 1 - (2/3)^2 - (1/3)^2 ≈ 0.444
Blue branch (example 2: one +): Gini = 0
Green branch (examples 5, 6: two –): Gini = 0
Weighted Gini(Colour) = (3/6)(0.444) + (1/6)(0) + (2/6)(0) ≈ 0.222
Shape:
Square branch (examples 1, 2, 5, 6: two +, two –): Gini = 1 - (2/4)^2 - (2/4)^2 = 0.5
Circle branch (examples 3, 4: one +, one –): Gini = 0.5
Weighted Gini(Shape) = (4/6)(0.5) + (2/6)(0.5) = 0.5
Size:
Big branch (examples 1, 2, 3, 6: three +, one –): Gini = 1 - (3/4)^2 - (1/4)^2 = 0.375
Small branch (examples 4, 5: two –): Gini = 0
Weighted Gini(Size) = (4/6)(0.375) + (2/6)(0) = 0.25
Step 3: Choose the attribute with the lowest weighted Gini Index:
Colour gives the lowest weighted Gini Index (≈ 0.222), so it becomes the root node.
For the Blue branch (example 2), all data points are of class +, so it becomes a
leaf labelled +.
For the Green branch (examples 5, 6), all data points are of class –, so it becomes
a leaf labelled –.
For the Red branch (examples 1, 3, 4: two +, one –), the process is repeated on the
remaining attributes. Splitting on Size gives two pure partitions (Big: examples 1
and 3, both +; Small: example 4, –) with weighted Gini 0, while splitting on Shape
gives weighted Gini ≈ 0.333, so Size is chosen for this branch.
The resulting decision tree is:
                Colour
             /    |     \
          Red    Blue    Green
           |      |        |
         Size     +        –
         /  \
       Big  Small
        |     |
        +     –
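As a cross-check of these calculations, here is a minimal Python sketch (ours, not part of the original solution; the row tuples and helper names are illustrative) that computes the weighted Gini index for each attribute at the root and inside the Red branch:

from collections import Counter

# The six training examples: (Colour, Shape, Size, Class)
rows = [
    ("Red",   "Square", "Big",   "+"),
    ("Blue",  "Square", "Big",   "+"),
    ("Red",   "Circle", "Big",   "+"),
    ("Red",   "Circle", "Small", "-"),
    ("Green", "Square", "Small", "-"),
    ("Green", "Square", "Big",   "-"),
]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(subset, idx):
    # Weighted Gini after splitting `subset` on the attribute in column `idx`
    groups = {}
    for r in subset:
        groups.setdefault(r[idx], []).append(r[3])
    return sum(len(g) / len(subset) * gini(g) for g in groups.values())

for name, idx in [("Colour", 0), ("Shape", 1), ("Size", 2)]:
    print(name, round(weighted_gini(rows, idx), 3))   # 0.222, 0.5, 0.25

red = [r for r in rows if r[0] == "Red"]
for name, idx in [("Shape", 1), ("Size", 2)]:
    print("Red branch,", name, round(weighted_gini(red, idx), 3))   # 0.333, 0.0

It confirms that Colour is chosen at the root and Size inside the Red branch, matching the tree above.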
Given data set, D, the number of attributes, n, and the number of training tuples, |D|, show
that the computational cost of growing a tree is at most n × |D| × log(|D|).
Solution:
To understand the computational cost of growing a decision tree, let's break down
the key components involved in the process. The cost is influenced by the number of
attributes (n), the number of training tuples (|D|), and the structure of the
tree. The commonly used algorithms for growing decision trees, such as CART
(Classification and Regression Trees) or ID3 (Iterative Dichotomiser 3), involve
recursive splitting of data based on attributes.
1. Selecting the best attribute to split on: For each node in the tree, we need
to evaluate each attribute and find the one that minimizes a certain impurity
measure (such as Gini Index or Entropy). This involves going through all
attributes and evaluating potential split points.
2. Splitting the data: Once the best attribute is selected, the data is split into
subsets based on the values of that attribute. This is done for each branch of
the tree.
3. Recursive growth: Steps 1 and 2 are repeated for each branch until a
stopping criterion is met, such as reaching a maximum depth, achieving a
minimum number of samples per leaf, or having all data points in a leaf node
belonging to the same class.
4. Cost bound: At any single level of the tree the node partitions are disjoint, so
together they contain at most |D| tuples, and evaluating all n attributes over
those tuples costs on the order of n × |D| per level. If each split divides the
tuples into non-trivial partitions, the depth of the tree is at most on the order
of log(|D|) (this is the assumption behind the stated bound). Summing the per-level
cost over at most log(|D|) levels gives a total cost of at most n × |D| × log(|D|).
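Written as a summation over the levels of the tree (a standard way of stating the bound, added here for completeness):

Cost = \sum_{\ell = 1}^{\text{depth}} O(n \times |D|) \le O(n \times |D| \times \log|D|)

since the partitions at any one level are disjoint and together contain at most |D| tuples, and the depth is bounded by \log|D| under the assumption above.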
Calculate the gain in the Gini Index when splitting on A and B. Which attribute would the
decision tree induction algorithm choose?
A B Class Label
T F +
T T +
T T +
T F –
T T +
F F –
F F –
F F –
T T –
T F –
Solution:
To calculate the Gini Index and gain for each attribute (A and B), we first need to
compute the Gini Index for the initial dataset and then calculate the Gini Index for
each possible split on A and B. The gain is calculated by subtracting the weighted
sum of the Gini Indices for the resulting subsets from the Gini Index of the original
dataset.
Gini Index = 1 - Σ(p_i)^2, where p_i is the proportion of samples of class i in the
node and m is the number of classes.
For the initial dataset (4 records of class + and 6 records of class –):
Gini Index = 1 - (4/10)^2 - (6/10)^2 = 0.48
Now, let's calculate the Gini Index for each split on A and B:
For A = T (7 records):
P(Class = +) = 4/7, P(Class = -) = 3/7
Gini Index = 1 - (4/7)^2 - (3/7)^2 ≈ 0.4898
For A = F (3 records):
P(Class = +) = 0/3, P(Class = -) = 3/3
Gini Index = 1 - (0/3)^2 - (3/3)^2 = 0
Gain for A:
Gain(A) = Gini Index (Initial) - (7/10 × Gini Index (A = T) + 3/10 × Gini Index (A = F))
        = 0.48 - (0.7 × 0.4898 + 0.3 × 0) ≈ 0.1371
For B = T (4 records):
P(Class = +) = 3/4, P(Class = -) = 1/4
Gini Index = 1 - (3/4)^2 - (1/4)^2 = 0.375
For B = F (6 records):
P(Class = +) = 1/6, P(Class = -) = 5/6
Gini Index = 1 - (1/6)^2 - (5/6)^2 ≈ 0.2778
Gain for B:
Gain(B) = Gini Index (Initial) - (4/10 × Gini Index (B = T) + 6/10 × Gini Index (B = F))
        = 0.48 - (0.4 × 0.375 + 0.6 × 0.2778) ≈ 0.1633
Comparing the two gains, Gain(B) ≈ 0.1633 is larger than Gain(A) ≈ 0.1371, so the
decision tree induction algorithm would choose attribute B for the split.
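These gains can be verified with a short Python sketch (ours, not part of the original solution):

from collections import Counter

# (A, B, Class) tuples from the table above
records = [
    ("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
    ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
    ("T", "T", "-"), ("T", "F", "-"),
]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(idx):
    # Gain = Gini(parent) - weighted Gini of the children after splitting on column idx
    parent = [r[2] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[idx], []).append(r[2])
    weighted = sum(len(g) / len(records) * gini(g) for g in groups.values())
    return gini(parent) - weighted

print("Gain(A) =", round(gini_gain(0), 4))   # ~0.1371
print("Gain(B) =", round(gini_gain(1), 4))   # ~0.1633, so B is chosen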
Identify the attribute that will act as the root node of a decision tree to predict golf play for
following database with Gini Index. Indicate all the intermediate steps.
Outlook Wind PlayGolf
rain strong no
sunny weak yes
overcast weak yes
rain weak yes
sunny strong yes
rain strong no
overcast strong no
Solution:
To find the attribute that will act as the root node of a decision tree using
the Gini Index, we compute the weighted Gini Index of a split on each attribute
and choose the attribute that gives the largest reduction in impurity
(equivalently, the lowest weighted Gini Index).
Step 1: Calculate the Gini Index of the full dataset (4 yes, 3 no):
Gini = 1 - (4/7)^2 - (3/7)^2 ≈ 0.49
Step 2: Calculate the weighted Gini Index for each attribute (Outlook and Wind):
Outlook:
rain (3 records: 1 yes, 2 no): Gini = 1 - (1/3)^2 - (2/3)^2 ≈ 0.444
sunny (2 records: 2 yes): Gini = 0
overcast (2 records: 1 yes, 1 no): Gini = 0.5
Weighted Gini(Outlook) = (3/7)(0.444) + (2/7)(0) + (2/7)(0.5) ≈ 0.333
Reduction = 0.49 - 0.333 ≈ 0.157
Wind:
strong (4 records: 1 yes, 3 no): Gini = 1 - (1/4)^2 - (3/4)^2 = 0.375
weak (3 records: 3 yes): Gini = 0
Weighted Gini(Wind) = (4/7)(0.375) + (3/7)(0) ≈ 0.214
Reduction = 0.49 - 0.214 ≈ 0.276
Step 3: Choose the root attribute.
Wind gives the larger reduction in impurity (≈ 0.276 versus ≈ 0.157 for Outlook), so
Wind acts as the root node of the decision tree.
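The same result can be checked with a short Python sketch (ours; the row tuples and function names are illustrative):

from collections import Counter

# (Outlook, Wind, PlayGolf) rows from the table above
rows = [
    ("rain", "strong", "no"),    ("sunny", "weak", "yes"),
    ("overcast", "weak", "yes"), ("rain", "weak", "yes"),
    ("sunny", "strong", "yes"),  ("rain", "strong", "no"),
    ("overcast", "strong", "no"),
]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(idx):
    groups = {}
    for r in rows:
        groups.setdefault(r[idx], []).append(r[2])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

root = gini([r[2] for r in rows])                      # ~0.49
for name, idx in [("Outlook", 0), ("Wind", 1)]:
    print(name, "reduction =", round(root - weighted_gini(idx), 3))
# Outlook ~0.156, Wind ~0.276, so Wind becomes the root node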
UNIT-4
1. Can we design a method that mines the complete set of frequent item sets without
candidate generation? If yes, explain it with the following table:
TID List of items
001 milk, dal, sugar, bread
002 dal, sugar, wheat, jam
003 milk, bread, curd, paneer
004 wheat, paneer, dal, sugar
005 milk, paneer, bread
006 wheat, dal, paneer, bread
Yes. The FP-growth (frequent-pattern growth) algorithm mines the complete set of
frequent item sets without candidate generation. Instead of the generate-and-test
strategy used by Apriori, FP-growth compresses the database into a frequent-pattern
tree (FP-tree) and then mines that tree recursively with a divide-and-conquer
strategy, so candidate item sets are never generated at all.
Applied to the given table, the method proceeds as follows:
Step 1: Scan the database once and count the support of every item. For the six
transactions the counts are: dal 4, bread 4, paneer 4, milk 3, sugar 3, wheat 3,
jam 1, curd 1. Items that do not meet the minimum support threshold (for example,
jam and curd if we take min_sup = 3, i.e. 50%) are discarded.
Step 2: Build the FP-tree. Reorder the frequent items in every transaction in
descending order of support and insert each reordered transaction into the tree,
sharing common prefixes and incrementing the counts on shared nodes; a header table
links all nodes that carry the same item.
Step 3: Mine the FP-tree. For each frequent item, starting from the least frequent,
collect its conditional pattern base (the prefix paths leading to it), build the
corresponding conditional FP-tree, and recursively grow the frequent item sets that
end with that item.
Because the FP-tree is usually much smaller than the original database and the
mining works by pattern-fragment growth, the complete set of frequent item sets is
found without generating or testing any candidate item sets, which greatly reduces
the computational cost compared with candidate-generation methods.
For contrast, a candidate-generation method such as Apriori proceeds level by level.
On a generic item set {a, b, c, d, e} it would look like this (see the sketch after
this list for the same idea applied to the table above):
First, find the frequent 1-item sets by counting every single item, then generate
candidate 2-item sets by combining the frequent 1-item sets:
{a, b}, {a, c}, {a, d}, {a, e}, {b, c}, {b, d}, {b, e}, {c, d}, {c, e}, {d, e}
Next, count the support of each candidate 2-item set in the dataset and prune the
sets that do not meet the minimum support threshold, leaving, say:
{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}, {d, e}
These two steps are repeated for larger item sets (candidate k-item sets are
generated from the frequent (k-1)-item sets and their support is counted) until no
more frequent item sets can be found, ending, for example, with:
{a, b, c, d}
Every one of these candidate sets has to be generated and counted against the
database, which is exactly the work that FP-growth avoids.
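To make the contrast concrete, here is a minimal Python sketch (ours, not part of the original solution) of the levelwise candidate-generation search applied to the six transactions above; the absolute threshold min_sup = 3 (50%) is an assumption, since the problem statement does not fix one:

# Transactions from the table above
transactions = [
    {"milk", "dal", "sugar", "bread"},
    {"dal", "sugar", "wheat", "jam"},
    {"milk", "bread", "curd", "paneer"},
    {"wheat", "paneer", "dal", "sugar"},
    {"milk", "paneer", "bread"},
    {"wheat", "dal", "paneer", "bread"},
]
MIN_SUP = 3   # assumed absolute support threshold (50% of 6 transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent = {1: [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUP]}
k = 1
while frequent[k]:
    # Generate candidate (k+1)-item sets from the frequent k-item sets, then count them
    candidates = {a | b for a in frequent[k] for b in frequent[k] if len(a | b) == k + 1}
    frequent[k + 1] = [c for c in candidates if support(c) >= MIN_SUP]
    k += 1

for size, sets in frequent.items():
    for s in sets:
        print(size, sorted(s), support(s))

With this threshold it reports the frequent 1-item sets dal, bread, paneer (support 4) and milk, sugar, wheat (support 3), and the frequent 2-item sets {dal, sugar}, {dal, wheat}, {bread, milk} and {bread, paneer} (support 3 each); FP-growth would return the same sets without building the candidate collections.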
Consider the following table to find frequent item sets using vertical data format. Support
threshold 30%
Tid List of items
T01 Milk, biscuits, surf powder, teabags
T02 Teabags, sugar, soap
T03 Milk, sugar, bread, soap
T04 Bread, teabags, biscuits
T05 Chocolates, milk, biscuits
T06 Milk, teabags, bread
T07 Bread, biscuits, chocolate
T08 Milk, surf powder, bread
To find frequent item sets using the vertical data format, we first organise the
data by item rather than by transaction: each item is associated with the set of
transaction IDs (its TID set) in which it appears, and the support of any item set
is then simply the size of the intersection of the TID sets of its items. Here's a
step-by-step breakdown:
Step 1: Convert the data to vertical format (item → TID set):
Item         TID set
Milk         T01, T03, T05, T06, T08
Biscuits     T01, T04, T05, T07
Surf powder  T01, T08
Teabags      T01, T02, T04, T06
Sugar        T02, T03
Soap         T02, T03
Bread        T03, T04, T06, T07, T08
Chocolate    T05, T07
Step 2: Calculate the support of each item from the size of its TID set:
Milk 5/8 (62.5%), Bread 5/8 (62.5%), Biscuits 4/8 (50%), Teabags 4/8 (50%),
Surf powder 2/8 (25%), Sugar 2/8 (25%), Soap 2/8 (25%), Chocolate 2/8 (25%).
Step 3: Identify the frequent 1-item sets, i.e. items whose support is at least 30%
(at least 3 of the 8 transactions): {Milk}, {Bread}, {Biscuits}, {Teabags}.
Step 4: Generate candidate 2-item sets by combining the frequent 1-item sets:
{Milk, Bread}, {Milk, Biscuits}, {Milk, Teabags}, {Bread, Biscuits},
{Bread, Teabags}, {Biscuits, Teabags}.
Step 5: Calculate the support of each candidate 2-item set by intersecting TID sets:
{Milk, Bread} = {T03, T06, T08} → 3/8 (37.5%)
{Milk, Biscuits} = {T01, T05} → 2/8, {Milk, Teabags} = {T01, T06} → 2/8,
{Bread, Biscuits} = {T04, T07} → 2/8, {Bread, Teabags} = {T04, T06} → 2/8,
{Biscuits, Teabags} = {T01, T04} → 2/8.
Step 6: Identify the frequent 2-item sets (support ≥ 30%):
{Milk, Bread}
Step 7: Repeat the process for larger item sets if needed. Here only one 2-item set
is frequent, so no candidate 3-item set can be formed and the mining stops.
At the 30% support threshold the frequent item sets are therefore {Milk}, {Bread},
{Biscuits}, {Teabags} and {Milk, Bread}. Adjust the threshold as needed for your
specific requirements.
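A minimal Python sketch of the vertical-format computation (ours, not part of the original solution; the TID sets are copied from Step 1):

from itertools import combinations

# Vertical representation: item -> set of transaction IDs containing it
tidsets = {
    "Milk":        {"T01", "T03", "T05", "T06", "T08"},
    "Biscuits":    {"T01", "T04", "T05", "T07"},
    "Surf powder": {"T01", "T08"},
    "Teabags":     {"T01", "T02", "T04", "T06"},
    "Sugar":       {"T02", "T03"},
    "Soap":        {"T02", "T03"},
    "Bread":       {"T03", "T04", "T06", "T07", "T08"},
    "Chocolate":   {"T05", "T07"},
}
N = 8
MIN_SUP = 0.30   # 30% of 8 transactions, i.e. at least 3 TIDs

frequent1 = {item: tids for item, tids in tidsets.items() if len(tids) / N >= MIN_SUP}
print("frequent 1-item sets:", sorted(frequent1))

# The support of a 2-item set is the size of the intersection of the two TID sets
for a, b in combinations(sorted(frequent1), 2):
    common = frequent1[a] & frequent1[b]
    if len(common) / N >= MIN_SUP:
        print("frequent 2-item set:", {a, b}, "support:", len(common), "of", N)
# Only {Bread, Milk} reaches the threshold (3/8 = 37.5%)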
A database has four transactions. Let min_sup=60% and min_conf=80%
TID date items_bought
100 10/15/2022 {K, A, B, D}
200 10/15/2022 {D, A, C, E, B}
300 10/19/2022 {C, A, B, E}
400 10/22/2022 {B, A, D}
Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
To find frequent items using the Apriori algorithm and the FP-growth algorithm, let's
start with the Apriori algorithm:
Apriori Algorithm:
Step 1: Scan the database and count the support of each item (candidate 1-item
sets):
A: 4/4 = 100%
B: 4/4 = 100%
C: 2/4 = 50%
D: 3/4 = 75%
E: 2/4 = 50%
K: 1/4 = 25%
Step 2: Identify the frequent 1-item sets, i.e. items with support ≥ 60%:
{A}, {B}, {D}
Step 3: Generate candidate 2-item sets by combining the frequent 1-item sets:
{A, B}, {A, D}, {B, D}
Step 4: Count the support of each candidate 2-item set:
{A, B}: 4/4 = 100%, {A, D}: 3/4 = 75%, {B, D}: 3/4 = 75%
Step 5: Identify the frequent 2-item sets (support ≥ 60%):
{A, B}, {A, D}, {B, D}
Step 6: Generate and check the candidate 3-item set {A, B, D}: it appears in
transactions 100, 200 and 400, so its support is 3/4 = 75% and it is frequent.
The complete set of frequent itemsets found by Apriori is therefore:
{A}, {B}, {D}, {A, B}, {A, D}, {B, D}, {A, B, D}
FP-growth Algorithm:
The FP-growth algorithm builds a frequent pattern tree and uses a divide-and-
conquer strategy to mine frequent itemsets. The efficiency of the FP-growth
algorithm often makes it faster than Apriori, especially for large datasets.
Applying FP-growth to the same database (with min_sup = 60%) yields exactly the same
frequent itemsets: {A}, {B}, {D}, {A, B}, {A, D}, {B, D} and {A, B, D}. Both
algorithms are complete, so they must agree on the result; they differ only in how
efficiently they find it.
The efficiency of Apriori and FP-growth depends on factors such as dataset size,
sparsity, and the specified minimum support threshold. Generally, FP-growth tends
to be more efficient than Apriori, especially for datasets with a large number of
transactions and a low support threshold.
The key reason for the efficiency of FP-growth is its ability to construct a condensed
representation of the dataset (the FP-tree) and avoid the generation of candidate
itemsets, which can be time-consuming in Apriori.
In summary, FP-growth is often more efficient than Apriori, especially for larger
datasets. However, the actual efficiency can vary based on the characteristics of the
dataset and the specific parameters used.
Unit-5
Suppose that the data-mining task is to cluster the following eight points (representing
location) into three clusters: A1 (2;10) ; A2 (2;5) ; A3 (8;4) ; B1 (5;8) ; B2 (7;5) ; B3 (6;4) ;
C1 (1;2) ; C2 (4;9). The distance function is Euclidean distance. Suppose initially we assign
A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to
determine: the three cluster centers after the first round of execution
Answers:
Initial cluster centers:
A1 (2, 10), B1 (5, 8), C1 (1, 2)
Assignment step (each point is assigned to the nearest center by Euclidean
distance):
Cluster 1 (center A1): {A1}
Cluster 2 (center B1): {A3, B1, B2, B3, C2}
Cluster 3 (center C1): {A2, C1}
Update step (each new center is the mean of the points assigned to it):
Cluster 1: (2, 10)
Cluster 2: ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)
Cluster 3: ((2+1)/2, (5+2)/2) = (1.5, 3.5)
Summary:
After the first round of execution the three cluster centers are (2, 10), (6, 6)
and (1.5, 3.5).
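A short Python sketch (ours, not part of the original solution) that reproduces this first round:

# One round of k-means for the eight points, starting from centers A1, B1, C1
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5),  "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
centers = [(2, 10), (5, 8), (1, 2)]

def dist2(p, q):
    # Squared Euclidean distance (sufficient for nearest-center comparisons)
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# Assignment step: each point goes to its nearest center
clusters = {i: [] for i in range(3)}
for name, p in points.items():
    nearest = min(range(3), key=lambda i: dist2(p, centers[i]))
    clusters[nearest].append(name)
print(clusters)   # {0: ['A1'], 1: ['A3', 'B1', 'B2', 'B3', 'C2'], 2: ['A2', 'C1']}

# Update step: recompute each center as the mean of its members
for i, members in clusters.items():
    xs = [points[m][0] for m in members]
    ys = [points[m][1] for m in members]
    print("new center", i + 1, "=", (sum(xs) / len(xs), sum(ys) / len(ys)))
# new centers: (2.0, 10.0), (6.0, 6.0), (1.5, 3.5)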
In general, each round of the k-means algorithm proceeds as follows:
Step 1: Initialization
Choose the number of clusters k and select k initial cluster centers (randomly, or
as specified in the problem statement).
Step 2: Assignment
Assign each point to the nearest cluster center based on Euclidean distance.
Step 3: Update
Calculate the mean (centroid) of each cluster to update the cluster centers.
Repeat the assignment and update steps until the assignments no longer change.
Summary:
After the algorithm converges, the dataset is partitioned into k clusters, and the
cluster centers are the mean coordinates of the points in each cluster.
Cluster the following data into three clusters, using the k-means method.
X y
10.9 12.6
2.3 8.4
8.4 12.6
12.1 16.2
7.3 8.9
23.4 11.3
19.7 18.5
17.1 17.2
3.2 3.4
1.3 22.8
2.4 6.9
2.4 7.1
3.1 8.3
2.9 6.9
11.2 4.4
8.3 8.7
Solution :
Step 1: Initialization
Randomly select initial cluster centers. For simplicity, let's take the first three
data points as the initial centers: (10.9, 12.6), (2.3, 8.4) and (8.4, 12.6).
Step 2: Assignment
Assign each point to the nearest cluster based on Euclidean distance.
Repeat Iterations
Repeat the assignment and update steps until convergence or a
predetermined number of iterations. The points will be reassigned to
clusters, and cluster centers will be updated in each iteration.
Convergence:
The algorithm converges when the cluster assignments and centers do not
change significantly between iterations.
Please note that the actual numerical calculations and iterations depend on
the specific implementation of the algorithm. The given steps provide a
high-level overview of the k-means clustering process.
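As an illustration of the whole procedure on this dataset, here is a minimal Python sketch (ours, not part of the original solution); it uses the same initialisation as above and simply prints whatever clusters it converges to, since the final result depends on the chosen initial centers:

import math

# The 16 (x, y) points from the table above
pts = [(10.9, 12.6), (2.3, 8.4), (8.4, 12.6), (12.1, 16.2), (7.3, 8.9),
       (23.4, 11.3), (19.7, 18.5), (17.1, 17.2), (3.2, 3.4), (1.3, 22.8),
       (2.4, 6.9), (2.4, 7.1), (3.1, 8.3), (2.9, 6.9), (11.2, 4.4), (8.3, 8.7)]
k = 3
centers = pts[:k]   # initial centers: the first three data points, as in Step 1

def assign(centers):
    clusters = [[] for _ in range(k)]
    for p in pts:
        nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def mean(cluster):
    return (sum(x for x, _ in cluster) / len(cluster),
            sum(y for _, y in cluster) / len(cluster))

for _ in range(100):   # iterate until the centers stop moving
    clusters = assign(centers)
    new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    if new_centers == centers:
        break
    centers = new_centers

for c, members in zip(centers, clusters):
    print("center", [round(v, 2) for v in c], "with", len(members), "points")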
Suppose that the data mining task is to cluster points into three clusters, where the points are
A1(2,10),A2(2,5),A3(8,4),B1(5,8),B2(7,5),B3(6,4),C1(1,2),C2(4,9). The distance function is
Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster,
respectively. Use the k-means algorithm to show only the three cluster centers after the first
round of execution.
Sure, let's apply the k-means algorithm to cluster the given points into three clusters
and show the cluster centers after the first round of execution.
Given Points:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)
Initial Assignment:
Initialize the k-means algorithm with three clusters and the given initial centers:
Cluster 1: A1 (2, 10), Cluster 2: B1 (5, 8), Cluster 3: C1 (1, 2).
Assign each point to the nearest center by Euclidean distance:
Cluster 1: {A1}; Cluster 2: {A3, B1, B2, B3, C2}; Cluster 3: {A2, C1}.
Calculate the mean (centroid) of each cluster to update the cluster centers:
Cluster 1: (2, 10); Cluster 2: (6, 6); Cluster 3: (1.5, 3.5).
These are the three cluster centers after the first round of execution; the process
can be repeated for further iterations until convergence.
Given the points x1 = {1, 0}, x2 = {0,1}, x3={2, 1}, and x4 = {3, 3}. Suppose that these
points are randomly clustered into two clusters: C1 = {x1, x3} and C2 = {x2, x4}. Apply one
iteration of the k-means partitional-clustering algorithm and find the new
distribution of elements in clusters. What is the change in the total square error?
Answer:
Initial Clustering:
C1 = {x1, x3}, C2 = {x2, x4}
Centroids of the initial clusters:
M1 = ((1+2)/2, (0+1)/2) = (1.5, 0.5)
M2 = ((0+3)/2, (1+3)/2) = (1.5, 2)
Total square error before reassignment:
E^2 = d^2(x1, M1) + d^2(x3, M1) + d^2(x2, M2) + d^2(x4, M2)
    = 0.5 + 0.5 + 3.25 + 3.25 = 7.5
Reassignment (squared distance of each point to M1 and M2):
x1 (1, 0): 0.5 vs 4.25 → C1
x2 (0, 1): 2.5 vs 3.25 → C1
x3 (2, 1): 0.5 vs 1.25 → C1
x4 (3, 3): 8.5 vs 3.25 → C2
New Clustering:
C1 = {x1, x2, x3}, C2 = {x4}
New centroids: M1 = (1, 2/3), M2 = (3, 3)
New total square error:
E^2 = 4/9 + 10/9 + 10/9 + 0 = 24/9 ≈ 2.67
So after one iteration the point x2 moves from C2 to C1, and the total square error
drops from 7.5 to about 2.67, a reduction of roughly 4.83.
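A small Python sketch (ours, not part of the original solution) that reproduces this iteration and the two error values:

# One k-means iteration for x1..x4, tracking the total square error (SSE)
points = {"x1": (1, 0), "x2": (0, 1), "x3": (2, 1), "x4": (3, 3)}
clusters = {"C1": ["x1", "x3"], "C2": ["x2", "x4"]}   # the given random clustering

def centroid(names):
    xs = [points[n][0] for n in names]
    ys = [points[n][1] for n in names]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def sse(clusters, cents):
    return sum(dist2(points[n], cents[c]) for c, names in clusters.items() for n in names)

cents = {c: centroid(names) for c, names in clusters.items()}
print("centroids:", cents, "SSE before:", sse(clusters, cents))   # (1.5, 0.5), (1.5, 2); SSE = 7.5

# Reassign every point to its nearest centroid
new_clusters = {"C1": [], "C2": []}
for n, p in points.items():
    new_clusters[min(cents, key=lambda c: dist2(p, cents[c]))].append(n)

new_cents = {c: centroid(names) for c, names in new_clusters.items()}
print("new clusters:", new_clusters)                         # C1 = [x1, x2, x3], C2 = [x4]
print("SSE after:", round(sse(new_clusters, new_cents), 2))  # ~2.67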