Solution

Contents

2 Numeric Attributes (2.7 Exercises)
3 Categorical Attributes (3.7 Exercises)
4 Graph Data (4.6 Exercises)
5 Kernel Methods (5.6 Exercises)
6 High-dimensional Data (6.9 Exercises)
7 Dimensionality Reduction (7.6 Exercises)
8 Itemset Mining (8.5 Exercises)
9 Summarizing Itemsets (9.6 Exercises)
10 Sequence Mining (10.5 Exercises)
13 Representative-based Clustering (13.5 Exercises)
14 Hierarchical Clustering (14.4 Exercises)
CHAPTER 1 Data Mining and Analysis
1.7 EXERCISES
Q1. Show that the mean of the centered data matrix Z in Eq. (1.5) is 0.
Answer: The mean of the centered data matrix Z is (1/n) Σ_{i=1}^n z_i = (1/n) Σ_{i=1}^n (x_i − µ) = ((1/n) Σ_{i=1}^n x_i) − µ = µ − µ = 0.
Q2. Prove that
δ_∞(x, y) = lim_{p→∞} δ_p(x, y) = max_{i=1}^{d} |x_i − y_i|
for x, y ∈ R^d.
Answer: We need to show that
lim_{p→∞} (Σ_{i=1}^d |x_i − y_i|^p)^{1/p} = max_{i=1}^{d} |x_i − y_i|
Assume that dimension a is the max, and let m = |x_a − y_a|. For simplicity, we assume that |x_i − y_i| < m for all i ≠ a.
If we multiply and divide the left-hand side by m^p inside the parentheses, we get:
(Σ_{i=1}^d |x_i − y_i|^p)^{1/p} = m (1 + Σ_{i≠a} (|x_i − y_i|/m)^p)^{1/p}
As p → ∞, each term (|x_i − y_i|/m)^p → 0, since m > |x_i − y_i| for all i ≠ a. The finite summation Σ_{i≠a} (|x_i − y_i|/m)^p therefore converges to 0 as p → ∞, as does the exponent 1/p.
Thus δ_∞(x, y) = m × 1^0 = m = |x_a − y_a| = max_{i=1}^d |x_i − y_i|.
Note that the same result is obtained even if we assume that dimensions other than a achieve the maximum value m. In the worst case, we have m = |x_i − y_i| for all d dimensions. In this case, the expression on the LHS becomes
lim_{p→∞} (Σ_{i=1}^d m^p)^{1/p} = lim_{p→∞} m d^{1/p} = m d^0 = m
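This limit is easy to check numerically. The following short sketch (my own illustration; the vectors x and y are made up) shows the L_p distance approaching the maximum coordinate difference as p grows:

```python
# Numerical illustration: the L_p distance tends to the L_infinity (max) distance as p grows.
import numpy as np

x = np.array([1.0, 3.0, -2.0, 0.5])   # arbitrary example vectors
y = np.array([4.0, 1.5, -2.5, 0.0])

def lp_dist(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

for p in [1, 2, 4, 8, 16, 32, 64]:
    print(p, lp_dist(x, y, p))
print("max_i |x_i - y_i| =", np.max(np.abs(x - y)))  # the limiting value, here 3.0
```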
PART ONE: DATA ANALYSIS FOUNDATIONS
CHAPTER 2 Numeric Attributes
2.7 EXERCISES
Q2. Let X and Y be two random variables, denoting age and weight, respectively.
Consider a random sample of size n = 20 from these two variables
X = (69, 74, 68, 70, 72, 67, 66, 70, 76, 68, 72, 79, 74, 67, 66, 71, 74, 75, 75, 76)
Y = (153, 175, 155, 135, 172, 150, 115, 137, 200, 130, 140, 265, 185, 112, 140,
150, 165, 185, 210, 220)
[Figure: plot of the density f(x) over age x from 60 to 80; the vertical axis runs up to about 0.10.]
(d) What is the probability of observing an age of 80 or higher?
Answer: If we leverage the empirical probability mass function, we get:
P (X ≥ 80) = 0/20 = 0
On the other hand, if we model the age X via a normal distribution with mean µ_X and standard deviation σ_X, we get:
P(X ≥ 80) = ∫_{80}^∞ f(x | µ_X, σ_X) dx = 0.010769
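For reference, this value can be reproduced with a short sketch (assuming, as in the text, that the normal is fit with the divide-by-n standard deviation):

```python
# Fit a normal to the age sample and compute the upper-tail probability P(X >= 80).
import numpy as np
from scipy.stats import norm

X = np.array([69, 74, 68, 70, 72, 67, 66, 70, 76, 68, 72, 79,
              74, 67, 66, 71, 74, 75, 75, 76], dtype=float)
mu, sigma = X.mean(), X.std()             # std() uses the divide-by-n convention by default
print(mu, sigma)                          # ~71.45 and ~3.72
print(norm.sf(80, loc=mu, scale=sigma))   # survival function, ~0.0108
```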
Answer: The correlation between age and weight is
ρ_XY = σ_XY/(σ_X σ_Y) = 122.435/√(13.845 · 1369.21) = 0.889
(g) Draw a scatterplot to show the relationship between age and weight.
Answer: The scatterplot is shown in the figure below.
[Scatterplot: X: Age on the horizontal axis (about 65 to 79) versus Y: Weight on the vertical axis (about 110 to 250), one point per (age, weight) pair.]
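A minimal sketch (not part of the original solution) showing how the covariance, the correlation of the previous part, and the scatterplot could be produced with numpy and matplotlib, again using the divide-by-n convention:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([69, 74, 68, 70, 72, 67, 66, 70, 76, 68, 72, 79,
              74, 67, 66, 71, 74, 75, 75, 76], dtype=float)
Y = np.array([153, 175, 155, 135, 172, 150, 115, 137, 200, 130, 140, 265,
              185, 112, 140, 150, 165, 185, 210, 220], dtype=float)

cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))   # biased (divide-by-n) covariance
rho = cov_xy / (X.std() * Y.std())
print(cov_xy, rho)                                  # ~122.4 and ~0.89

plt.scatter(X, Y)          # one point per (age, weight) pair
plt.xlabel("X: Age")
plt.ylabel("Y: Weight")
plt.show()
```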
Q3. Show that the identity in Eq. (2.15) holds, that is,
Σ_{i=1}^n (x_i − µ)² = n(µ̂ − µ)² + Σ_{i=1}^n (x_i − µ̂)²
Answer: We assume for simplicity that all the variables are discrete. A similar
approach can be used for continuous variables.
Consider the random variable x_1 + x_2. Its mean is given as
µ_{x1+x2} = Σ_a Σ_b (a + b) f(a, b)
Since x_1 and x_2 are independent, their joint probability mass function is given as f(a, b) = f(a) · f(b), so that
µ_{x1+x2} = Σ_a Σ_b (a + b) f(a) f(b) = Σ_a a f(a) Σ_b f(b) + Σ_b b f(b) Σ_a f(a) = µ_{x1} + µ_{x2}
In general, we can show that the expected value of the sum of the variables x_i is the sum of their expected values, i.e.,
E[Σ_{i=1}^n x_i] = Σ_{i=1}^n E[x_i]
Now, let us consider the variance of the sum of the random variables:
var(Σ_{i=1}^n x_i) = E[(Σ_{i=1}^n x_i − E[Σ_{i=1}^n x_i])²]
= E[(Σ_{i=1}^n x_i − Σ_{i=1}^n E[x_i])²]
= E[(Σ_{i=1}^n (x_i − E[x_i]))²]
= E[Σ_{i=1}^n (x_i − E[x_i])² + 2 Σ_{i=1}^n Σ_{j>i} (x_i − E[x_i])(x_j − E[x_j])]
= Σ_{i=1}^n E[(x_i − E[x_i])²] + 2 Σ_{i=1}^n Σ_{j>i} cov(x_i, x_j)
= Σ_{i=1}^n var(x_i)
The last step follows from the fact that cov(xi , xj ) = 0 since they are independent.
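The identity can also be checked empirically. The following simulation sketch (with arbitrary scales chosen for the independent variables) compares the variance of the sum against the sum of the variances:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_samples = 5, 200_000
# five independent normal variables with standard deviations 1, 2, ..., 5
samples = rng.normal(loc=0.0, scale=np.arange(1, n_vars + 1), size=(n_samples, n_vars))

var_of_sum = samples.sum(axis=1).var()
sum_of_vars = samples.var(axis=0).sum()
print(var_of_sum, sum_of_vars)   # both close to 1 + 4 + 9 + 16 + 25 = 55
```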
Q5. Define a measure of deviation called mean absolute deviation for a random variable
X as follows:
(1/n) Σ_{i=1}^n |x_i − µ|
Is this measure robust? Why or why not?
Answer: No, it is not robust, since a single outlier can skew the mean absolute
deviation.
Q6. Prove that the expected value of a vector random variable X = (X1 , X2 )T is simply the
vector of the expected value of the individual random variables X1 and X2 as given in
Eq. (2.18).
Answer: This follows directly from the definition of expectation of a vector random
variable. When both X1 and X2 are discrete we have
µ = E[X] = Σ_x x f(x) = Σ_{x1} Σ_{x2} (x_1, x_2)^T f(x_1, x_2) = (µ_{X1}, µ_{X2})^T
Q7. Show that the correlation [Eq. (2.23)] between any two random variables X1 and X2
lies in the range [−1, 1].
Answer: The Cauchy–Schwarz inequality states that for any two vectors x and y in an inner product space, ⟨x, y⟩² ≤ ⟨x, x⟩ · ⟨y, y⟩.
Define the inner product between two random variables X_1 and X_2 as follows:
⟨X_1, X_2⟩ = E[X_1 X_2]
Then
σ_12² = ⟨X_1 − µ_1, X_2 − µ_2⟩² ≤ ⟨X_1 − µ_1, X_1 − µ_1⟩ · ⟨X_2 − µ_2, X_2 − µ_2⟩ = E[(X_1 − µ_1)²] · E[(X_2 − µ_2)²] = σ_1² · σ_2²
Since |σ_12| ≤ σ_1 · σ_2, it follows that the correlation ρ_12 = σ_12/(σ_1 σ_2) lies in the range [−1, 1].
Q8. Given the dataset in Table 2.1, compute the covariance matrix and the generalized
variance.
X1 X2 X3
x1 17 17 12
x2 11 9 13
x3 11 8 19
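The worked answer is not reproduced above; as an illustration of how it could be obtained, the covariance matrix and the generalized variance (the determinant of the covariance matrix) for the data in Table 2.1 can be computed as follows (divide-by-n convention assumed):

```python
import numpy as np

D = np.array([[17, 17, 12],
              [11,  9, 13],
              [11,  8, 19]], dtype=float)
Z = D - D.mean(axis=0)              # centered data matrix
Sigma = (Z.T @ Z) / D.shape[0]      # sample covariance matrix
print(Sigma)
# Generalized variance = det(Sigma); with only n = 3 points the centered data has
# rank at most 2, so the determinant comes out as (numerically) zero.
print("generalized variance =", np.linalg.det(Sigma))
```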
Q9. Show that the outer-product in Eq. (2.31) for the sample covariance matrix is
equivalent to Eq. (2.29).
Answer: Let zi = xi − µ̂ denote a centered data point. The outer product form of
covariance matrix is given as:
Σ̂ = (1/n) Σ_{i=1}^n z_i z_i^T
The (j, k)-th entry of this matrix is
Σ̂(j, k) = (1/n) Σ_{i=1}^n z_ij z_ik = (1/n) Σ_{i=1}^n (x_ij − µ̂_j)(x_ik − µ̂_k) = σ̂_jk
which is exactly the covariance between the j -th and k-th attribute.
Q10. Assume that we are given two univariate normal distributions, NA and NB , and let
their mean and standard deviation be as follows: µA = 4, σA = 1 and µB = 8, σB = 2.
(a) For each of the following values xi ∈ {5, 6, 7} find out which is the more likely
normal distribution to have produced it.
Answer: If we plug each x_i into the equation for the normal distribution, we obtain the following densities:
f_A(5) = 0.242, f_B(5) = 0.065
f_A(6) = 0.054, f_B(6) = 0.121
f_A(7) = 0.004, f_B(7) = 0.176
Based on these values, we can claim that N_A is more likely to have produced 5, but N_B is more likely to have produced 6 and 7.
We can also solve this problem by finding the z-score for each value. We can
then assign a point to the distribution for which it has a lower z-score (in terms
of absolute value). For example, for 5, we have zA (5) = (5 − 4)/1 = 1, and
zB (5) = (5 − 8)/2 = −1.5. Since |zB | > |zA | we can claim that 5 comes from
NA .
For 6 and 7 we have z_A(6) = 2 versus z_B(6) = −1, and z_A(7) = 3 versus z_B(7) = −0.5.
Thus, these values are more likely to have been generated from N_B.
(b) Derive an expression for the point for which the probability of having been
produced by both the normals is the same.
Answer: Plugging in the parameters of NA and NB into the equation for the
normal distribution, and after setting up the equality, we obtain:
(1/√(2π)) e^{−(x−4)²/2} = (1/(2√(2π))) e^{−(x−8)²/8}
2 e^{(x−8)²/8} = e^{(x−4)²/2}
Taking logarithms on both sides and rearranging yields the quadratic equation 3x² − 16x − 8 ln 2 = 0, which we can solve using the general solution for a quadratic equation, x = (−b ± √(b² − 4ac))/(2a). Plugging in the values from above we get x ≈ 5.67.
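As a cross-check (not from the text), the same crossover point can be found numerically with a root finder:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# f_A is N(4, 1) and f_B is N(8, 4); find where the two densities are equal in [4, 8].
f = lambda x: norm.pdf(x, loc=4, scale=1) - norm.pdf(x, loc=8, scale=2)
print(brentq(f, 4, 8))   # ~5.66, in agreement with the quadratic-formula solution
```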
Q11. Consider Table 2.2. Assume that both the attributes X and Y are numeric, and the
table represents the entire population. If we know that the correlation between X
and Y is zero, what can you infer about the values of Y?
X Y
1 a
0 b
1 c
0 a
0 c
Answer: Since the correlation is zero, we have cov(X, Y) = 0, which implies that E[XY] = E[X]E[Y]. From the data we have E[X] = 2/5, E[Y] = (2a + b + 2c)/5, and E[XY] = (a + c)/5. Setting (a + c)/5 = (2/5) · (2a + b + 2c)/5 and simplifying gives a + c = 2b, that is, b must be the average of a and c.
Q12. Under what conditions will the covariance matrix Σ be identical to the correlation matrix, whose (i, j) entry gives the correlation between attributes X_i and X_j? What can you conclude about the two variables?
Answer: If the covariance matrix equals the correlation matrix, this means that for all i and j we have
ρ_ij = σ_ij
σ_ij/(σ_i σ_j) = σ_ij
σ_i σ_j = 1
In particular, taking i = j gives σ_i² = 1. Thus, for the covariance matrix to equal the correlation matrix, every attribute must have unit variance, that is, σ_i = 1 for all i, in which case the covariances and correlations coincide.
CHAPTER 3 Categorical Attributes
3.7 EXERCISES
Q1. Show that for categorical points, the cosine similarity between any two vectors lies in the range cos θ ∈ [0, 1], and consequently θ ∈ [0°, 90°].
Answer:
Σ_12 = E[(X_1 − µ_1)(X_2 − µ_2)^T]
= E[X_1 X_2^T] − E[X_1]µ_2^T − µ_1 E[X_2]^T + µ_1 µ_2^T
= E[X_1 X_2^T] − E[X_1]E[X_2]^T − E[X_1]E[X_2]^T + E[X_1]E[X_2]^T
= E[X_1 X_2^T] − E[X_1]E[X_2]^T
Z=f Z=g
Y=d Y=e Y=d Y=e
X=a 5 10 10 5
X=b 15 5 5 20
X=c 20 10 25 10
Table 3.2. χ² critical values for different p-values and degrees of freedom (q): for example, for q = 5 degrees of freedom, the critical value χ² = 11.070 has p-value = 0.05.
q 0.995 0.99 0.975 0.95 0.90 0.10 0.05 0.025 0.01 0.005
1 — — 0.001 0.004 0.016 2.706 3.841 5.024 6.635 7.879
2 0.010 0.020 0.051 0.103 0.211 4.605 5.991 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 16.750
6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 18.548
Q3. Consider the 3-way contingency table for attributes X, Y, Z shown in Table 3.1.
Compute the χ 2 metric for the correlation between Y and Z. Are they dependent
or independent at the 95% confidence level? See Table 3.2 for χ 2 values.
Answer: Summing out X, we have the new 2-way contingency table between Y and
Z, along with the row/col marginal frequencies:
Z=f Z=g
Y=d 40 40 80
Y=e 25 35 60
65 75 140
The expected counts under the independence assumption are:
Z=f Z=g
Y=d (80 · 65)/140 = 37.14 (80 · 75)/140 = 42.86
Y=e (60 · 65)/140 = 27.86 (60 · 75)/140 = 32.14
Subtracting the expected counts from the observed counts and squaring the differences, we get:
Z=f Z=g
Y=d (2.86)² = 8.16 (−2.86)² = 8.16
Y=e (−2.86)² = 8.16 (2.86)² = 8.16
Dividing each squared difference by the corresponding expected count gives:
Z=f Z=g
Y=d 0.22 0.19
Y=e 0.29 0.25
Finally, summing all these values we obtain χ 2 = 0.22 + 0.19 + 0.29 + 0.25 = 0.95.
With one degree of freedom, the p-value is 0.33, and the observed χ² = 0.95 is well below the 0.05-level critical value of 3.841. Thus we cannot reject the null hypothesis, and we conclude that the two variables are independent.
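The same test can be reproduced with scipy as a sanity check (the continuity correction is disabled to match the hand computation):

```python
import numpy as np
from scipy.stats import chi2_contingency

YZ = np.array([[40, 40],    # row Y=d: counts for Z=f, Z=g
               [25, 35]])   # row Y=e
chi2, pval, dof, expected = chi2_contingency(YZ, correction=False)
print(chi2, pval, dof)      # ~0.96, ~0.33, 1
print(expected)             # matches the expected counts computed above
```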
Q4. Consider the “mixed” data given in Table 3.3. Here X1 is a numeric attribute and
X2 is a categorical one. Assume that the domain of X2 is given as dom(X2 ) = {a, b}.
Answer the following questions.
The mean for X1 is 0.0, whereas the mean vector for X2 is (0.6, 0.4)T . Therefore,
the mean vector for both attributes is µ = (0.0|0.6, 0.4)T .
The covariance matrix is given as
Σ̂ =
          X1       X2 = a    X2 = b
X1        0.92     −0.15      0.15
X2 = a   −0.15      0.24     −0.24
X2 = b    0.15     −0.24      0.24
Q5. In Table 3.3, assuming that X1 is discretized into three bins, as follows:
c1 = (−2, −0.5]
c2 = (−0.5, 0.5]
c3 = (0.5, 2]
X1 X2
0.3 a
−0.3 b
0.44 a
−0.60 a
0.40 a
1.20 b
−0.12 a
−1.60 b
1.60 b
−1.32 a
Answer: The observed counts for the binned X1 versus X2 are:
     a   b
c1   2   1   3
c2   4   1   5
c3   0   2   2
     6   4   10
The expected counts are:
     a     b
c1   1.8   1.2
c2   3     2
c3   1.2   0.8
The degrees of freedom are 2, and the chi-squared value is χ² = 3.89, with p-value = 0.143. At the 5% significance level we cannot reject the null hypothesis that they are independent.
CHAPTER 4 Graph Data
4.6 EXERCISES
Q1. Given the graph in Figure 4.1, find the fixed-point of the prestige vector.
Figure 4.1. Graph for Q1 (nodes a, b, c).
Answer: The adjacency matrix and its transpose for the graph is:
A = [0, 1, 1; 1, 0, 0; 0, 1, 0]        A^T = [0, 1, 0; 1, 0, 1; 1, 0, 0]
We can observe that the prestige vector converges to p = (0.76, 1, 0.57)T or after
normalization to p = (0.548, 0.726, 0.415)T .
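A small power-iteration sketch (my own illustration, using the adjacency matrix above) that reproduces this fixed point:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
p = np.ones(3)
for _ in range(100):
    p = A.T @ p                     # prestige update: p <- A^T p
    p = p / np.linalg.norm(p)       # normalize to unit length each iteration
print(p)                            # ~ (0.548, 0.726, 0.415)
```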
Q2. Given the graph in Figure 4.2, find the fixed-point of the authority and hub vectors.
Figure 4.2. Graph for Q2 (nodes a, b, c).
Answer: The adjacency matrix and its transpose for the graph is:
A = [0, 1, 0; 1, 0, 1; 1, 1, 1]        A^T = [0, 1, 1; 1, 0, 1; 0, 1, 1]
Starting with h_0 = (1, 1, 1)^T and alternately computing a = A^T h and h = A a, the successive (unnormalized) vectors are:
h_0 = (1, 1, 1)^T → a_1 = (2, 2, 2)^T → h_1 = (2, 4, 6)^T → a_2 = (10, 8, 10)^T → h_2 = (8, 20, 28)^T → a_3 = (48, 36, 48)^T →
h_3 = (36, 96, 132)^T ≈ (0.27, 0.73, 1.0)^T → a_4 = (228, 168, 228)^T ≈ (1.0, 0.74, 1.0)^T →
h_4 = (168, 456, 624)^T ≈ (0.27, 0.73, 1.0)^T → a_5 = (1080, 792, 1080)^T ≈ (1.0, 0.73, 1.0)^T
(the approximate values are normalized by dividing by the largest element)
We can observe that the authority and hub vectors converge to a ≈ (1.0, 0.73, 1.0)^T and h ≈ (0.27, 0.73, 1.0)^T, respectively.
Q3. Consider the double star graph given in Figure 4.3 with n nodes, where only nodes
1 and 2 are connected to all other vertices, and there are no other links. Answer the
following questions (treating n as a variable).
(a) What is the degree distribution for this graph?
Figure 4.3. Double star graph for Q3: nodes 1 and 2 are connected to each other and to every other node 3, 4, . . . , n.
For any node i ≥ 3 the clustering coefficient is 1, for example C(3) = 1, since its only two neighbors, nodes 1 and 2, are themselves connected.
(d) What is the clustering coefficient C(G) for the entire graph? What happens to
the clustering coefficient as n → ∞?
Answer: The average clustering coefficient for the graph is:
C(G) = ((n − 2) · 1 + 2 · 2/(n − 1))/n = (n² − 3n + 6)/(n² − n)
As n → ∞, C(G) → 1.
(e) What is the transitivity T(G) for the graph? What happens to T(G) as n → ∞?
Answer: The transitivity is given as:
T(G) = 3(n − 2) / (2·\binom{n−1}{2} + (n − 2)) = 3(n − 2)/((n − 1)(n − 2) + (n − 2)) = 3/n
As n → ∞, T(G) → 0.
(f) What is the average path length for the graph?
Answer: The average path length can be computed as follows:
The sum of all the path lengths over pairs of nodes is therefore:
(n − 1) + (n − 2) + 2 Σ_{i=1}^{n−3} i = 2n − 3 + (n − 3)(n − 2) = n² − 5n + 6 + 2n − 3 = n² − 3n + 3
But there are \binom{n}{2} = n(n − 1)/2 pairs. Thus, the average path length is 2(n² − 3n + 3)/(n² − n). As n → ∞, the average path length tends to 2.
(g) What is the betweenness value for node 1?
Answer: There are n − 2 shortest paths from node 2 to the nodes i ≥ 3, none of which pass through node 1. Furthermore, for each of the \binom{n−2}{2} pairs of nodes i, j ≥ 3 there are two shortest paths (one via node 1 and one via node 2), so only half of them pass through node 1. The betweenness is then given as:
betweenness(1) = \binom{n−2}{2} / (2\binom{n−2}{2} + (n − 1))
= ((n − 2)(n − 3)/2) / ((n − 2)(n − 3) + (n − 1))
= (n² − 5n + 6) / (2(n² − 4n + 5))
As n → ∞, the betweenness goes to 1/2.
(h) What is the degree variance for the graph?
Answer: The variance for the degree can be computed as follows:
Q4. Consider the graph in Figure 4.4. Compute the hub and authority score vectors.
Which nodes are the hubs and which are the authorities?
Figure 4.4. Graph for Q4 (nodes 1, 2, 3, 4, 5).
For the hubs we can directly compute the AAT matrix and compute its dominant
eigenvector
AA^T = [1, 1, 0, 1, 0; 1, 2, 0, 1, 0; 0, 0, 1, 0, 0; 1, 1, 0, 2, 0; 0, 0, 0, 0, 1]
Starting with the initial vector h0 = (1, 1, 1, 1, 1)T , successive iterations give:
h1 = (3, 4, 1, 4, 1)T
h2 = (11, 15, 1, 15, 1)T
h3 = (41, 56, 1, 56, 1)T
h4 = (153, 209, 1, 209, 1)T
After normalizing h4 the hub scores vector is h = (0.46, 0.63, 0.0, 0.63, 0.0)T . The
eigenvalue is obtained as the ratio of the maximum element from the last two
iterations, i.e., 209/56 = 3.73.
For the authorities we can directly compute the AT A matrix and compute its
dominant eigenvector
A^T A = [1, 0, 1, 0, 0; 0, 1, 1, 0, 0; 1, 1, 3, 0, 0; 0, 0, 0, 1, 0; 0, 0, 0, 0, 1]
Starting with the initial vector a0 = (1, 1, 1, 1, 1)T , successive iterations give:
a1 = (2, 2, 5, 1, 1)T
a2 = (7, 7, 19, 1, 1)T
a3 = (26, 26, 71, 1, 1)T
a4 = (97, 97, 265, 1, 1)T
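As a sketch (not the book's pseudocode), power iteration on the two matrices above confirms the hub and authority scores; the vectors here are normalized to unit length rather than by the maximum element:

```python
import numpy as np

AAT = np.array([[1, 1, 0, 1, 0],
                [1, 2, 0, 1, 0],
                [0, 0, 1, 0, 0],
                [1, 1, 0, 2, 0],
                [0, 0, 0, 0, 1]], dtype=float)
ATA = np.array([[1, 0, 1, 0, 0],
                [0, 1, 1, 0, 0],
                [1, 1, 3, 0, 0],
                [0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1]], dtype=float)

def dominant_eigenvector(M, iters=100):
    v = np.ones(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v = v / np.linalg.norm(v)
    return v

print("hub scores      :", dominant_eigenvector(AAT))  # nodes 2 and 4 are the main hubs
print("authority scores:", dominant_eigenvector(ATA))  # node 3 is the main authority
```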
Q5. Prove that in the BA model at time-step t + 1, the probability πt (k) that some node
with degree k in Gt is chosen for preferential attachment is given as
π_t(k) = k · n_t(k) / Σ_i i · n_t(i)
Answer: In the BA model, a node v_j is chosen for preferential attachment with probability proportional to its degree. If the degree of v_j is k, then
π_t(v_j) = k / Σ_i i · n_t(i)
Since there are nt (k) nodes with degree k, the probability that some node with
degree k is chosen is given as
π_t(k) = k · n_t(k) / Σ_i i · n_t(i)
CHAPTER 5 Kernel Methods
5.6 EXERCISES
Q1. Prove that the dimensionality of the feature space for the inhomogeneous polynomial
kernel of degree q is
m = \binom{d+q}{q}
Answer: The dimensionality of the feature space for the inhomogeneous polynomial kernel of degree q equals the number of ways of choosing non-negative integers n_0, n_1, . . . , n_d that sum up to q. This number is in turn equal to the number of (d + 1)-partitions of q objects, where some partitions may be empty (a k-partition is a partitioning of the objects into k disjoint parts).
Let us denote an object with the symbol ∗, and let us denote a partition boundary
with the symbol |. For instance, let q = 2, and let d = 2, and consider the following
partitioning: (∗ ∗ ||). This is mapped to the non-negative numbers n0 = 2, n1 = 0, n2 =
0, since there are two stars in the first partition, and none in the other two partitions.
Likewise, the partitioning (∗||∗) is mapped to the numbers n0 = 1, n1 = 0, n2 = 1, and
so on.
There is a one-to-one mapping between the (d + 1)-partitions and the set of
numbers n0 , n1 , ..., nd . Now the number of (d + 1) partitions can be obtained by
choosing d boundary symbols | out of a total of q + d objects comprising the set of
q distinct objects and d boundary symbols, i.e., \binom{d+q}{q}.
Q2. Consider the data shown in Table 5.1. Assume the following kernel function: K(x_i, x_j) = ‖x_i − x_j‖². Compute the kernel matrix K.
i xi
x1 (4, 2.9)
x2 (2.5, 1)
x3 (3.5, 4)
x4 (2, 2.1)
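The kernel matrix itself is not reproduced above; as an illustration, it can be computed from the points in Table 5.1 as follows (here K is simply the matrix of squared Euclidean distances, as the exercise defines):

```python
import numpy as np

X = np.array([[4.0, 2.9],
              [2.5, 1.0],
              [3.5, 4.0],
              [2.0, 2.1]])
diff = X[:, None, :] - X[None, :, :]    # pairwise coordinate differences
K = np.sum(diff ** 2, axis=-1)          # K[i, j] = ||x_i - x_j||^2
print(K)
```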
Q3. Show that eigenvectors of S and Sl are identical, and further that eigenvalues of Sl
are given as (λi )l (for all i = 1, . . . , n), where λi is an eigenvalue of S, and S is some
n × n symmetric similarity matrix.
Answer: Since S is an n × n symmetric matrix, it has the eigendecomposition
S = UΛU^T
and therefore
S^l = UΛ^l U^T
That is, S and S^l have the same eigenvectors, and for each eigenvalue λ_i of S the corresponding eigenvalue of S^l is λ_i^l.
Given that Su_i = λ_i u_i, we can also derive the eigenvalues and eigenvectors of S^l directly: S^l u_i = S^{l−1}(S u_i) = λ_i S^{l−1} u_i = · · · = λ_i^l u_i.
Q4. The von Neumann diffusion kernel is a valid positive semidefinite kernel if |β| < 1/ρ(S), where ρ(S) is the spectral radius of S. Can you derive better bounds for cases when β > 0 and when β < 0?
When β < 0, then positive eigenvalues always satisfy 1− βλi > 0, so the constraint
applies to all negative eigenvalues. For the largest negative eigenvalue we have 1 −
βλN > 0 which implies that 1 − |β| · |λN | > 0. This means that |β| < 1/|λN |.
Putting the two conditions together, we have |β| < 1/max{|λ_P|, |λ_N|} = 1/ρ(S). The better bounds are that when β > 0 we only need β < 1/λ_P, regardless of how large |λ_N| is, and when β < 0 we only need |β| < 1/|λ_N|, regardless of how large λ_P is.
Q5. Given the three points x1 = (2.5, 1)T , x2 = (3.5, 4)T , and x3 = (2, 2.1)T .
(a) Compute the kernel matrix for the Gaussian kernel assuming that σ 2 = 5.
Answer: We have
‖x_1 − x_2‖² = 1² + 3² = 10
‖x_1 − x_3‖² = 0.5² + 1.1² = 1.46
‖x_2 − x_3‖² = 1.5² + 1.9² = 5.86
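A short sketch (assuming K(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²)) with σ² = 5) that turns these squared distances into the Gaussian kernel matrix:

```python
import numpy as np

X = np.array([[2.5, 1.0],
              [3.5, 4.0],
              [2.0, 2.1]])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # the squared distances above
K = np.exp(-sq / (2 * 5.0))                                  # sigma^2 = 5
print(K)   # off-diagonal entries ~0.368, ~0.864, ~0.557; diagonal entries are 1
```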
(b) Compute the distance of the point φ(x1 ) from the mean in feature space.
Answer: The squared distance in terms of kernel operations is given as
‖φ(x_1) − µ_φ‖² = K(x_1, x_1) − (2/3) Σ_{j=1}^3 K(x_1, x_j) + (1/9) Σ_{i=1}^3 Σ_{j=1}^3 K(x_i, x_j)
CHAPTER 6 High-dimensional Data
6.9 EXERCISES
Q1. Given the gamma function in Eq. (6.6), show the following:
(a) Γ(1) = 1
Answer:
Γ(1) = ∫_0^∞ x^0 e^{−x} dx = ∫_0^∞ e^{−x} dx = [−e^{−x}]_0^∞ = 0 + e^0 = 1
(b) Γ(1/2) = √π
Answer: Let y = x^{1/2}; then dy/dx = (1/2) x^{−1/2}, which implies that
dx = 2x^{1/2} dy = 2y dy
Substituting y = x^{1/2} in Γ(1/2) = ∫_0^∞ x^{−1/2} e^{−x} dx, we get
Γ(1/2) = ∫_0^∞ 2 y^{−1} e^{−y²} y dy = 2 ∫_0^∞ e^{−y²} dy = 2 · (√π/2) · [erf(y)]_0^∞ = √π
Γ(α) = 0 + (α − 1)Γ(α − 1)
Q2. Show that the asymptotic volume of the hypersphere Sd (r) for any value of radius r
eventually tends to zero as d increases.
Answer: Let C = πr². When d is even, Γ(d/2 + 1) = (d/2)!, and the volume of the hypersphere is C^{d/2}/(d/2)!, so we must show that
lim_{d→∞} C^{d/2}/(d/2)! → 0
Using Stirling's approximation (d/2)! ≈ √(πd) (d/(2e))^{d/2}, we get
lim_{d→∞} C^{d/2}/(d/2)! = lim_{d→∞} (1/√(πd)) (2Ce/d)^{d/2} → 0
The last step follows since eventually d will exceed 2Ce.
When d is odd, we have Γ(d/2 + 1) = √π d!!/2^{(d+1)/2}. We first derive an approximation for d!!. Since d is odd, it can be written as d = 2n + 1 for some integer n. Now consider the following:
Figure 6.1. Shape of the L_{0.5} ball of radius 0.5 in 2D (axes X1 and X2 over [0, 1]).
where δ(x, c) is the distance between x and c, which can be specified using the
Lp -norm:
L_p(x, c) = (Σ_{i=1}^d |x_i − c_i|^p)^{1/p}
where p ≠ 0 is any real number. The distance can also be specified using the
L∞ -norm:
L∞ (x, c) = max |xi − ci |
i
Answer the following questions:
(a) For d = 2, sketch the shape of the hyperball inscribed inside the unit square, using
the Lp -distance with p = 0.5 and with center c = (0.5, 0.5)T .
Answer: Using radius r = 0.5, d = 2 and p = 0.5, we get Bd as the set of all
points that satisfy the equation
(|x_1 − 0.5|^{1/2} + |x_2 − 0.5|^{1/2})² = 0.5
|x_1 − 0.5|^{1/2} + |x_2 − 0.5|^{1/2} − √0.5 = 0
Figure 6.2. Shape of the L_∞ ball of radius 0.25 in 2D (axes X1 and X2 over [0, 1]).
(c) Compute the formula for the maximum distance between any two points in
the unit hypercube in d dimensions, when using the Lp -norm. What is the
maximum distance for p = 0.5 when d = 2? What is the maximum distance for the
L∞ -norm?
Answer: Let one corner be 0 = (0, 0, · · · , 0). The diagonally opposite corner is 1 = (1, 1, · · · , 1). The maximum L_p distance between them is
(Σ_{i=1}^d |1 − 0|^p)^{1/p} = d^{1/p}
For p = 0.5 and d = 2, the maximum distance is 2^{1/0.5} = 2² = 4. For the L_∞-norm, the maximum distance is 1.
Q4. Consider the corner hypercubes of length ǫ ≤ 1 inside a unit hypercube. The
2-dimensional case is shown in Figure 6.3. Answer the following questions:
(a) Let ǫ = 0.1. What is the fraction of the total volume occupied by the corner cubes
in two dimensions?
Answer: Each corner occupies ǫ² = 0.1² = 0.01 volume. Since there are four corners, the combined corner volume is 0.04, which is 4% of the total volume of the unit square.
(b) Derive an expression for the volume occupied by all of the corner hypercubes of
length ǫ < 1 as a function of the dimension d. What happens to the fraction of the
volume in the corners as d → ∞?
Answer: In d dimensions, there are 2^d corners, and the volume of each corner is ǫ^d. Thus, the fraction of the volume in the corners is (2ǫ)^d.
As d → ∞, we can see that if ǫ < 0.5, then the volume goes to 0, otherwise, if
ǫ = 0.5, then the corner hypercubes span the entire space, and the volume is 1.
It is reasonable to assume that ǫ ≤ 0.5, since otherwise the corner hypercubes
will overlap, and the combined volume will increase without bound.
(c) What is the fraction of volume occupied by the thin hypercube shell of width ǫ < 1
as a fraction of the total volume of the outer (unit) hypercube, as d → ∞? For
example, in two dimensions the thin shell is the space between the outer square
(solid) and inner square (dashed).
Answer: The edge length for the inner hypercube is 1 − 2ǫ, and thus its volume is (1 − 2ǫ)^d. The volume of the outer unit hypercube is 1^d = 1. Thus, the volume of the thin shell is given as:
1 − (1 − 2ǫ)^d
Since the volume must be non-negative, it follows that ǫ ≤ 0.5. In this case the
volume of the shell approaches 1, i.e., it contains all of the volume of the outer
hypercube.
Q5. Prove Eq. (6.14), that is, lim_{d→∞} P(x^T x ≤ −2 ln(α)) → 0, for any α ∈ (0, 1) and x ∈ R^d.
Answer: Since the coordinates of x are independent standard normal variables, x^T x follows a chi-squared distribution with d degrees of freedom, which has mean d and variance 2d, so by the central limit theorem (x^T x − d)/√(2d) approaches a standard normal variable Z as d → ∞. Because −2 ln(α) is a fixed constant, the standardized threshold (−2 ln(α) − d)/√(2d) → −∞ as d → ∞. The probability of being −∞ standard deviations away from the mean is essentially 0, i.e., lim_{d→∞} P(Z ≤ (−2 ln(α) − d)/√(2d)) = 0.
Q6. Consider the conceptual view of high-dimensional space shown in Figure 6.4. Derive
an expression for the radius of the inscribed circle, so that the area in the spokes
accurately reflects the difference between the volume of the hypercube and the
inscribed hypersphere in d dimensions. For instance, if the length of a half-diagonal
is fixed at 1, then the radius of the inscribed circle is 1/√2 in Figure 6.4a.
[Figure: one sector of the inscribed circle of radius r, showing the inner triangle with base b, height h, central angle α1, and base angles α2.]
Answer: The difference between the volume of the hypercube and hypersphere is
given as
δ_hs = (2r)^d − K_d r^d = (2^d − K_d) r^d = 2^d − K_d
assuming r = 1, and where K_d is as given in Eq. (6.5).
In d dimensions, the inscribed circle is divided into 2^d sectors, with the area of each sector being πr²/2^d.
Now consider the triangle defined by the two sides of the sector of length r, with the third side's length being b. The angle of the inner triangle at the center of the circle is
α_1 = 2π/2^d = π/2^{d−1}
Thus, the two other angles are both equal to
α_2 = (1/2)(π − π/2^{d−1}) = (2^{d−1} − 1)π/2^d
By the law of sines, we have
b/sin(α_1) = r/sin(α_2), or
b = (sin(α_1)/sin(α_2)) r = c_d r
where c_d = sin(α_1)/sin(α_2).
The height of the inner triangle is given as
h = √(r² − (b/2)²) = (√(4 − c_d²)/2) r
The difference between the inscribed circle and inner polygon (set of triangles)
is given as
δ_cp = 2^d (π/2^d − c_d √(4 − c_d²)/4)
Let x denote the extra height of each spoke beyond the inner triangle. For the 2^d spokes (triangles with base b and height x) to account for the missing area we require
2^d · (1/2) · b x = δ_hs + δ_cp
so that
x = (δ_hs + δ_cp)/(2^{d−1} b)
Thus the length of the half-diagonal is h + x, given as
R = h + x = √(4 − c_d²)/2 + (δ_hs + π − 2^{d−2} c_d √(4 − c_d²))/(2^{d−1} c_d)
Q7. Consider the unit hypersphere (with radius r = 1). Inside the hypersphere inscribe
a hypercube (i.e., the largest hypercube you can fit inside the hypersphere). An
example in two dimensions is shown in Figure 6.5. Answer the following questions:
(a) Derive an expression for the volume of the inscribed hypercube for any given
dimensionality d. Derive the expression for one, two, and three dimensions, and
then generalize to higher dimensions.
Answer: In 1d, the inscribed hypercube is identical to the hypersphere. So if
radius of hypersphere is r, then the side length for the cube is l = 2r. Thus:
V(H1 ) = 2r
In 2d, the main diagonal of the square has length 2r, since the diagonal goes through the center of the circle, which has radius r. Therefore the side length of the square satisfies l² + l² = (2r)², or 2l² = 4r², which gives l = √2 r. The volume is then
V(H_2) = l² = 2r²
In 3d, the main diagonal of the cube is still 2r, since the sphere has radius r, but now there are three sides to consider, so l² + l² + l² = (2r)², which gives l = (2/√3) r. The volume is then
V(H_3) = l³ = (8/(3√3)) r³
In general the trend is clear: we have to consider d sides to obtain the main diagonal of length 2r, which gives
l = (2/√d) r
and
V(H_d) = l^d = (2^d/d^{d/2}) r^d
(b) What happens to the ratio of the volume of the inscribed hypercube to the
volume of the enclosing hypersphere as d → ∞? Again, give the ratio in one,
two and three dimensions, and then generalize.
Answer: Let us look at the 1d, 2d, and 3d cases. For 1d we have
V(H_1)/V(S_1) = 2r/2r = 1
For 2d we have
V(H_2)/V(S_2) = 2r²/(πr²) = 2/π = 0.64
For 3d we have
V(H_3)/V(S_3) = (8/(3√3)) r³ / ((4/3)πr³) = 2/(√3 π) = 0.37
Now for the general case we have
V(H_d)/V(S_d) = ((2^d/d^{d/2}) r^d) / ((π^{d/2}/Γ(d/2 + 1)) r^d) = 2^d Γ(d/2 + 1)/(dπ)^{d/2}
Assuming d is even, so that Γ(d/2 + 1) = (d/2)!, we have
V(H_d)/V(S_d) = 2^d (d/2)!/(dπ)^{d/2}
= 2^d · (d/2)/(dπ) · (d/2 − 1)/(dπ) · (d/2 − 2)/(dπ) · · · (d/2 − (d/2 − 1))/(dπ)
= 2^d · (d/2)/(dπ) · ((d/2)/(dπ) − 1/(dπ)) · ((d/2)/(dπ) − 2/(dπ)) · · · ((d/2)/(dπ) − (d/2 − 1)/(dπ))
≤ 2^d ((d/2)/(dπ))^{d/2}
= 2^d / (2π)^{d/2}
= (2/π)^{d/2}
which goes to 0 as d → ∞, so the inscribed hypercube occupies a vanishing fraction of the hypersphere's volume.
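The vanishing ratio can also be checked numerically from the exact expression 2^d Γ(d/2 + 1)/(dπ)^{d/2} (an illustration, not part of the original solution):

```python
from math import gamma, pi

# ratio of the inscribed hypercube's volume to the hypersphere's volume
for d in [1, 2, 3, 5, 10, 20, 50]:
    ratio = 2**d * gamma(d / 2 + 1) / (d * pi) ** (d / 2)
    print(d, ratio)
# d=1 -> 1.0, d=2 -> 0.637, d=3 -> 0.368, and the ratio tends to 0 as d grows
```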
Q8. Assume that a unit hypercube is given as [0, 1]^d, that is, the range is [0, 1] in each dimension. The main diagonal in the hypercube is defined as the vector from (0, . . . , 0, 0) to (1, . . . , 1, 1). For example, when d = 2, the main diagonal goes from (0, 0) to (1, 1). On the other hand, the main anti-diagonal is defined as the vector from (1, . . . , 1, 0) to (0, . . . , 0, 1). For example, for d = 2, the anti-diagonal is from (1, 0) to (0, 1).
(a) Sketch the diagonal and anti-diagonal in d = 3 dimensions, and compute the angle
between them.
Answer: The main diagonal is the vector (1, 1, 1) and the anti-diagonal is the vector (0, 0, 1) − (1, 1, 0) = (−1, −1, 1). The angle between them therefore satisfies cos θ = ((1, 1, 1) · (−1, −1, 1))/(√3 · √3) = −1/3, which implies θ ≈ 109.47°.
(b) What happens to the angle between the main diagonal and anti-diagonal as d →
∞. First compute a general expression for the d dimensions, and then take the
limit as d → ∞.
Answer: The main diagonal is the d-dimensional vector (1, . . . , 1), and the anti-diagonal is (−1, . . . , −1, 1), with d − 1 entries equal to −1. Their dot product is −(d − 2) and each has length √d, so cos θ = −(d − 2)/d.
As d → ∞, cos θ = −1 + 2/d → −1, which implies θ → 180°. In other words, the diagonal and anti-diagonal become anti-parallel: they lie along the same line but point in opposite directions!
closely spaced 2D circles of decreasing radius along the new dimension X3 , to yield
a sphere.
In a similar manner, the 4D hypersphere will be a collection of closely spaced 3D
spheres along the new dimension X4 . Of course we cannot adequately draw a 4D
object, however Fig. 6.6 provides a conceptual sketch.
CHAPTER 7 Dimensionality Reduction
7.6 EXERCISES
X1 X2
8 −20
0 −1
10 −19
10 −20
2 0
That is:
Σ = [17.6, −38; −38, 88.4]
det(Σ − λI) = 0
Thus
λ = (106 ± √(106² − 4 · 111.84))/2 = (106 ± √10788.64)/2 = (106 ± 103.87)/2
Thus
λ_1 = (106 + 103.87)/2 = 209.87/2 = 104.94
and
λ_2 = (106 − 103.87)/2 = 2.13/2 = 1.07
(c) What is the “intrinsic” dimensionality of this dataset (discounting some small
amount of variance)?
Answer: Clearly 104.94/(104.94 + 1.07) = 0.99 fraction of the variance is in the first principal component. Thus, the intrinsic dimensionality of the data is only one.
(d) Compute the first principal component.
Answer: We can compute the first principal component from the equation Σu_i = λ_i u_i. Using λ_1 = 104.94, we can solve for the eigenvector as follows:
[17.6 − 104.94, −38; −38, 88.4 − 104.94] [x; y] = [0; 0]
[−87.34, −38; −38, −16.54] [x; y] = [0; 0]
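A numerical cross-check of the eigenvalues and the first principal component (numpy may return the eigenvector with the opposite sign):

```python
import numpy as np

Sigma = np.array([[17.6, -38.0],
                  [-38.0, 88.4]])
evals, evecs = np.linalg.eigh(Sigma)    # eigh returns eigenvalues in ascending order
print(evals)                            # ~ [1.07, 104.94]
print(evecs[:, -1])                     # first principal component, up to sign
```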
(e) If the µ and 6 from above characterize the normal distribution from which the
points were generated, sketch the orientation/extent of the 2-dimensional normal
density function.
Answer: Figure 7.1 plots the normal distribution shape, along with the first PC
and the mean.
Q2. Given the covariance matrix Σ = [5, 4; 4, 5], answer the following questions:
(a) Compute the eigenvalues of 6 by solving the equation det(6 − λI) = 0.
Answer: We have to solve the following equation:
(5 − λ)² − 16 = 0
λ² − 10λ + 9 = 0
Figure 7.1. Normal distribution: sketch of the 2-dimensional normal density, centered at the mean and elongated along the first principal component (axes X1 from 4 to 7 and X2 from −15 to −10).
(λ − 9)(λ − 1) = 0
Q3. Compute the singular values and the left and right singular vectors of the following
matrix:
A = [1, 1, 0; 0, 0, 1]
Answer: The right singular vectors are the eigenvectors of AT A, and the left
singular vectors are the eigenvectors of AAT . The eigenvalues of both matrices are
the same and are the squares of the singular values.
A^T A = [1, 1, 0; 1, 1, 0; 0, 0, 1]        AA^T = [2, 0; 0, 1]
Applying the power method to AA^T starting from (1, 1)^T, after n iterations the vector is (2^n, 1)^T, or after normalization (2^n/2^n, 1/2^n)^T → (1, 0)^T. The other left singular vector must be orthogonal to the first one, so we have
L = [1, 0; 0, 1]
For the right singular vector note that AT A is rank deficient, having a rank of
2 instead of 3, e.g., the first two columns (and rows) are the same. It is clear that
the r2 = (0, 0, 1)T is an eigenvector which corresponds to the eigenvalue 1. Now, we
can compute the dominant right singular vector via the power method as follows.
Starting with x0 = (1, 1, 1)T , we have
A^T A (1, 1, 1)^T = (2, 2, 1)^T
A^T A (2, 2, 1)^T = (4, 4, 1)^T
A^T A (4, 4, 1)^T = (8, 8, 1)^T
The dominant right singular vector is r_1 = (1/√2)(1, 1, 0)^T = (1/√2, 1/√2, 0)^T, and the third vector is orthogonal to this one, given as r_3 = (−1/√2, 1/√2, 0)^T. Thus, we have
R = [0.707, 0, −0.707; 0.707, 0, 0.707; 0, 1, 0]
A = LΔR^T = [1, 0; 0, 1] [√2, 0, 0; 0, 1, 0] [0.707, 0.707, 0; 0, 0, 1; −0.707, 0.707, 0]
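The factorization can be verified with numpy's SVD (signs and ordering of the singular vectors may differ from the hand computation):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
L, svals, Rt = np.linalg.svd(A)
print(svals)   # [sqrt(2), 1]
print(L)       # identity, up to sign
print(Rt)      # first two rows ~ (0.707, 0.707, 0) and (0, 0, 1)
```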
Q4. Consider the data in Table 7.1. Define the kernel function as follows: K(x_i, x_j) = ‖x_i − x_j‖². Answer the following questions:
(a) Compute the kernel matrix K.
Answer: The kernel matrix is given as:
K = [0, 5.86, 1.46, 4.64; 5.86, 0, 10, 1.46; 1.46, 10, 0, 5.86; 4.64, 1.46, 5.86, 0]
In the above steps, we scale the intermediate vectors by dividing by the largest
element, right after multiplying by K on the left hand side.
The first kernel PC is therefore
c_1 = (1/√2.98)(0.7, 1, 1, 0.7)^T = (0.41, 0.58, 0.58, 0.41)^T
Also, based on the ratio of the largest element of the vector to the previous
value, we conclude that the dominant eigenvalue is η1 = 15.1.
i xi
x1 (4, 2.9)
x4 (2.5, 1)
x7 (3.5, 4)
x9 (2, 2.1)
Q5. Given the two points x1 = (1, 2)T , and x2 = (2, 1)T , use the kernel function
K(x_i, x_j) = (x_i^T x_j)²
This matrix has the same structure as the matrix in Q2 (equal diagonal entries and equal off-diagonal entries); thus the eigenvector corresponding to the dominant eigenvalue is c_1 = (0.707, 0.707)^T.
PART TWO: FREQUENT PATTERN MINING
Table 8.1. Transaction database for Q1
tid itemset
t1 ABCD
t2 ACDF
t3 ACDEG
t4 ABDF
t5 BCG
t6 DFG
t7 ABG
t8 CDFG
8.5 EXERCISES
The candidates for level three and their support values are:
ABC − 1, ABD − 2, ACD − 3, CDF − 3, CDG − 2, CFG − 2, DFG − 3
Out of these the frequent 3-itemsets are ACD, CDF, and DFG.
No more frequent itemsets are possible.
(b) With minsup = 2/8, show how FPGrowth enumerates the frequent itemsets.
[FP-tree: the initial FP-tree built from the database over the frequent items D, A, C, G, B, and F, with item counts along each path.]
In this tree the items D, A, C, G, B and F are frequent, so we output them and recursively project on each item in turn, generating all frequent itemsets ending in that item, as shown next.
[Projected FP-trees: the trees obtained by projecting on F (and recursively on GF, CF, AF), and then on B (and recursively on GB, CB, AB).]
We next process B.
Table 8.2. Vertical database for Q2 (tidset for each item):
A: 1, 3, 5, 6
B: 2, 3, 4, 5, 6
C: 1, 2, 3, 5, 6
D: 1, 6
E: 2, 3, 4, 5
[Projected FP-trees for the suffixes G, C, and A.]
We next process items G, C and A in turn as shown above.
Within G, output CG(3), DCG(2), AG(2), and DG(3).
Within C, output AC(3), DAC(3), DC(4).
Within A, output DA(4).
Q2. Consider the vertical database shown in Table 8.2. Assuming that minsup = 3,
enumerate all the frequent itemsets using the Eclat method.
Answer: The frequent itemsets based on eclat are shown below, along with their
tidsets:
A × 1356
B × 23456
C × 12356
E × 2345
AB × 356
AC × 1356
BC × 2356
BE × 2345
CE × 235
ABC × 356
BCE × 235
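For illustration, here is a minimal Eclat-style sketch over the tidsets of Table 8.2 (my own simplified version, not the book's pseudocode); it intersects tidsets depth-first and prints every frequent itemset with its tidset:

```python
minsup = 3
tidsets = {                      # vertical database from Table 8.2
    "A": {1, 3, 5, 6},
    "B": {2, 3, 4, 5, 6},
    "C": {1, 2, 3, 5, 6},
    "D": {1, 6},
    "E": {2, 3, 4, 5},
}

def eclat(prefix, items):
    # items: list of (item, tidset) pairs that may extend the current prefix
    for i, (item, tids) in enumerate(items):
        if len(tids) < minsup:
            continue
        print(prefix + item, sorted(tids))
        # extensions of the new prefix use only the items that come after this one
        exts = [(item2, tids & tids2) for item2, tids2 in items[i + 1:]]
        eclat(prefix + item, exts)

eclat("", sorted(tidsets.items()))   # prints A, AB, ABC, AC, B, BC, BCE, BE, C, CE, E
```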
Q3. Given two k-itemsets Xa = {x1 , . . ., xk−1 , xa } and Xb = {x1 , . . ., xk−1 , xb } that share the
common (k − 1)-itemset X = {x1 , x2 , . . ., xk−1 } as a prefix, prove that
Answer: We know that d(X_ab) = t(X_a) − t(X_ab). Note that d(X_ab) ∩ t(X_ab) = ∅, since those transactions that contain both x_a and x_b, and those that contain x_a but not x_b, are necessarily disjoint (in the context of t(X)); therefore we have |d(X_ab)| = |t(X_a)| − |t(X_ab)|.
We therefore have sup(X_ab) = sup(X_a) − |d(X_ab)|.
Q4. Given the database in Table 8.3. Show all rules that one can generate from the set
ABE.
Table 8.3. Dataset for Q4
tid itemset
t1 ACD
t2 BCE
t3 ABCE
t4 BDE
t5 ABCE
t6 ABCD
Answer: We first need to compute the support of ABE and all its subsets, which
comprises:
A − 4, B − 5, E − 4, AB − 3, AE − 2, BE − 4, ABE − 2
The set of rules one can generate from ABE is as follows; the support for all the rules is 2, since sup(ABE) = 2:
A −→ BE, confidence c = 2/4 = 0.5
B −→ AE, confidence c = 2/5 = 0.4
E −→ AB, confidence c = 2/4 = 0.5
AB −→ E, confidence c = 2/3 = 0.67
AE −→ B, confidence c = 2/2 = 1.0
BE −→ A, confidence c = 2/4 = 0.5
Q5. Consider the partition algorithm for itemset mining. It divides the database into k
partitions, not necessarily equal, such that D = ∪_{i=1}^k D_i, where D_i is partition i, and for any i ≠ j we have D_i ∩ D_j = ∅. Also let n_i = |D_i| denote the number of transactions in
partition Di . The algorithm first mines only locally frequent itemsets, that is, itemsets
whose relative support is above the minsup threshold specified as a fraction. In the
second step, it takes the union of all locally frequent itemsets, and computes their
support in the entire database D to determine which of them are globally frequent.
Prove that if a pattern is globally frequent in the database, then it must be locally
frequent in at least one partition.
Answer: Suppose, for contradiction, that X is globally frequent but not locally frequent in any partition, that is, sup(X, D_i) < minsup × n_i for every i. Summing over the partitions gives
sup(X, D) = Σ_{i=1}^k sup(X, D_i) < minsup × Σ_{i=1}^k n_i = minsup × |D|
That is, sup(X, D) < minsup × |D|, which is a contradiction. Thus there must exist at least one partition where X is locally frequent.
Q6. Consider Figure 8.1. It shows a simple taxonomy on some food items. Each leaf is
a simple item and an internal node represents a higher-level category or item. Each
item (single or high-level) has a unique integer label noted under it. Consider the
database composed of the simple items shown in Table 8.4. Answer the following questions:
tid itemset
1 2367
2 1 3 4 8 11
3 3 9 11
4 1567
5 1 3 8 10 11
6 3 5 7 9 11
7 4 6 8 10 11
8 1 3 5 8 11
(a) What is the size of the itemset search space if one restricts oneself to only itemsets
composed of simple items?
Answer: The search space size is 2^11, since there are 11 simple items.
(b) Let X = {x1 , x2 , . . . , xk } be a frequent itemset. Let us replace some xi ∈ X with its
parent in the taxonomy (provided it exists) to obtain X′ , then the support of the
new itemset X′ is:
i. more than support of X
ii. less than support of X
iii. not equal to support of X
iv. more than or equal to support of X
v. less than or equal to support of X
Answer: The answer is (iv), more than or equal to support of X. The reason
is that it may be the case that none of the other siblings of xi may occur in the
database, in which case the support of X′ will be the same as that for X. The
support cannot be lower, but it can obviously be equal or higher.
(c) Use minsup = 7/8. Find all frequent itemsets composed only of high-level items
in the taxonomy. Keep in mind that if a simple item appears in a transaction, then
its high-level ancestors are all assumed to occur in the transaction as well.
Answer: If we replace each low-level item by all of the high level ancestors,
we obtain the following set of transactions:
1: 12, 14, 15
2: 12, 14, 13, 15
3: 12, 14, 13, 15
4: 14, 15
5: 12, 14, 13, 15
6: 12, 14, 13, 15
7: 12, 14, 13, 15
8: 12, 14, 13, 15
With minsup = 7/8, an itemset must occur in at least 7 of the 8 transactions. Item 13 appears in only 6 transactions, whereas 12 appears in 7 and both 14 and 15 appear in all 8. The frequent high-level itemsets are therefore 12 (7), 14 (8), 15 (8), {12, 14} (7), {12, 15} (7), {14, 15} (8), and {12, 14, 15} (7).
Q7. Let D be a database with n transactions. Consider a sampling approach for mining
frequent itemsets, where we extract a random sample S ⊂ D, with say m transactions,
and we mine all the frequent itemsets in the sample, denoted as FS . Next, we make
one complete scan of D, and for each X ∈ FS , we find its actual support in the
whole database. Some of the itemsets in the sample may not be truly frequent in
the database; these are the false positives. Also, some of the true frequent itemsets
in the original database may never be present in the sample at all; these are the false
negatives.
Prove that if X is a false negative, then this case can be detected by counting
the support in D for every itemset belonging to the negative border of FS , denoted
Bd − (FS ), which is defined as the set of minimal infrequent itemsets in sample S.
Formally,
Bd^−(F_S) = { Y | sup(Y) < minsup and ∀Z ⊂ Y, sup(Z) ≥ minsup }
Q8. Assume that we want to mine frequent patterns from relational tables. For example
consider Table 8.5, with three attributes A, B, and C, and six records. Each attribute
has a domain from which it draws its values, for example, the domain of A is dom(A) =
{a1 , a2 , a3 }. Note that no record can have more than one value of a given attribute.
tid A B C
1 a1 b1 c1
2 a2 b3 c2
3 a2 b3 c3
4 a2 b1 c1
5 a2 b3 c3
6 a3 b3 c3
For example, sup({a1 , a2 } × {c1 }) = 2, as both records 1 and 4 contribute to its support.
Note, however that the pattern {a1 } × {c1 } has a support of 1, since only record 1
belongs to it. Thus, relational patterns do not satisfy the Apriori property that we
used for frequent itemsets, that is, subsets of a frequent relational pattern can be
infrequent.
We call a relational pattern P = P1 × P2 × · · ·× Pk over attributes X1 , . . ., Xk as valid
iff for all u ∈ Pi and all v ∈ Pj , the pair of values (Xi = u, Xj = v) occurs together in
some record. For example, {a1 , a2 } × {c1 } is a valid pattern since both (A = a1 , C = c1 )
and (A = a2 , C = c1 ) occur in some records (namely, records 1 and 4, respectively),
whereas {a1 , a2 }×{c2 } is not a valid pattern, since there is no record that has the values
(A = a1 , C = c2 ). Thus, for a pattern to be valid every pair of values in P from distinct
attributes must belong to some record.
Given that minsup = 2, find all frequent, valid, relational patterns in the dataset in
Table 8.5.
tid multiset
1 ABCA
2 ABABA
3 CABBA
Level 1: A − 3, B − 3, C − 2
Level 2: AA − 3, AB − 3, AC − 2, BB − 2, BC − 2
Level 3: AAB − 3, AAC − 2, ABB − 2, ABC − 2
Level 4: AABB − 2, AABC − 2
(b) Find all minimal infrequent multisets, that is, those multisets that have no
infrequent sub-multisets.
Answer: In the level-wise approach above we encounter the following minimal
infrequent multisets:
Level 2: CC − 0
Level 3: AAA − 1, BBB − 0, BBC − 1
Table 9.1. Dataset for Q2
Tid Itemset
t1 ACD
t2 BCE
t3 ABCE
t4 BDE
t5 ABCE
t6 ABCD
9.6 EXERCISES
Tid Itemset
1 ACD
2 BCD
3 AC
4 ABD
5 ABCD
6 BCD
[Figure 9.1. Frequent closed itemset lattice for Q4; among the closed itemsets are B(8), ABD(6), BC(5), and ABCD(3).]
Q3. Given the database in Table 9.2, find all minimal generators using minsup = 1.
Q4. Consider the frequent closed itemset lattice shown in Figure 9.1. Assume that the
item space is I = {A, B, C, D, E}. Answer the following questions:
(a) What is the frequency of CD?
Answer: The frequency of CD is 3.
Tid Itemset
1 ACD
2 BCD
3 ACD
4 ABD
5 ABCD
6 BC
(b) Find all frequent itemsets and their frequency, for itemsets in the subset interval
[B, ABD].
Answer: The frequency of all itemsets in the interval is as follows:
B − 8, AB − 6, AD − 6, BD − 6, ABD − 6.
(c) Is ADE frequent? If yes, show its support. If not, why?
Answer: ADE is not frequent since it is not a subset of any closed itemset.
Q5. Let C be the set of all closed frequent itemsets and M the set of all maximal frequent
itemsets for some database. Prove that M ⊆ C .
Q6. Prove that the closure operator c = i ◦ t satisfies the following properties (X and Y are
some itemsets):
(a) Extensive: X ⊆ c(X)
Answer: Let x ∈ X. Consider t(X) = {t|X ⊆ i(t)}. It follows that x ∈ i(t) for all
t ∈ t(X).
Now, c(X) = i(t(X)) = ∩_{t∈t(X)} i(t), which implies that x ∈ c(X). Thus X ⊆ c(X).
Q7. Let δ be an integer. An itemset X is called a δ-free itemset iff for all subsets Y ⊂ X, we
have sup(Y) − sup(X) > δ. For any itemset X, we define the δ-closure of X as follows:
δ-closure(X) = { Y | X ⊂ Y, sup(X) − sup(Y) ≤ δ, and Y is maximal }
Consider the database shown in Table 9.3. Answer the following questions:
(a) Given δ = 1, compute all the δ-free itemsets.
Answer: The δ-free sets and their closures are as follows:
Actually, the definition allows for the empty set to be counted as δ-free (though
it is not very interesting). So, if you do count ∅ as δ-free, then C and D will not
be δ-free, but ∅ will be. In that case your answer for the closure will also differ.
δ-closure(∅) = {C, D}. The final answer should be:
(b) For each of the δ-free itemsets, compute its δ-closure for δ = 1.
Answer: The closures are given in part (a) above.
Q8. Given the lattice of frequent itemsets (along with their supports) shown in Figure 9.2,
answer the following questions:
(a) List all the closed itemsets.
sup(ABCD) ≤ sup(BCD) = 1
sup(ABCD) ≥ sup(ABC) + sup(ACD) − sup(AC) = 3 + 2 − 4 = 1
[Figure 9.2. Lattice of frequent itemsets with their supports, from ∅(6) down to ABCD(1).]
Q9. Prove that if an itemset X is derivable, then so is any superset Y ⊃ X. Using this
observation describe an algorithm to mine all nonderivable itemsets.
Answer: Let X = YZ. From Eqs. (9.4) and (9.3) we conclude that the support of X is bounded by the inclusion–exclusion terms
IE(Y) = Σ_{Y⊆W⊂X} (−1)^{(|X\W|+1)} · sup(W)
Also, let the upper and lower bounds on the support of X be given as
U(X) = min{ IE(Y) : Y ⊆ X, |X \ Y| is odd }
L(X) = max{ IE(Y) : Y ⊆ X, |X \ Y| is even }
Now let Y′ ⊆ X be the subset that minimizes the bound |sup(X) − IE(Y)|, that is, Y′ = arg min_Y |sup(YZ) − IE(Y)|, and let Z′ = X − Y′.
The second line follows from the fact that the transactions that contain Y′ but not Z′ can be broken into two disjoint subsets comprising those that contain a and those that do not.
Now if an itemset X is derivable, then we know that UB(X) − LB(X) = 0, which
immediately implies that all of its supersets will also be derivable.
Finally, this result can be used in an efficient algorithm to mine all non-derivable
itemsets, since we can prune an itemset X and all of its supersets from the search
space the moment we find X is derivable. Any of the algorithms we have studied for
itemset mining can be used with this pruning strategy.
CHAPTER 10 Sequence Mining
10.5 EXERCISES
Q1. Consider the database shown in Table 10.1. Answer the following questions:
(a) Let minsup = 4. Find all frequent sequences.
Answer: The frequent sequences are:
A − 4, G − 4, T − 4
AA − 4, AG − 4, AT − 4, GA − 4, TA − 4, TG − 4
AAT − 4, AGA − 4, ATA − 4, ATG − 4, GAA − 4, TAA − 4
(b) Given that the alphabet is Σ = {A, C, G, T}, how many possible sequences of length k can there be?
Answer: There can be 4^k sequences of length k.
Id Sequence
s1 AATACAAGAAC
s2 GTATGGTGAT
s3 AACATGGCCAA
s4 AAGCGTGGTCAA
Id Sequence
s1 ACGTCACG
s2 TCGA
s3 GACTGCA
s4 CAGTC
s5 AGCT
s6 TGCAGCTC
s7 AGTCAG
Q2. Given the DNA sequence database in Table 10.2, answer the following questions
using minsup = 4
(a) Find the maximal frequent sequences.
Answer: The set of frequent sequences, along with their supports are as
follows:
A − 7, C − 7, G − 7, T − 7
AC − 6, AG − 6, AT − 6, GA − 5, GC − 6, GG − 4, GT − 6, CA − 6, CC − 4,
CG − 6, CT − 5, TA − 5, TC − 6, TG − 5
ACT − 4, AGC − 6, AGT − 5, ATC − 5, GAG − 4, GCA − 4, GCG − 4, GTC − 5,
CAG − 4, CGC − 4, CTC − 4, TCA − 4, TCG − 4
AGTC − 4
We now intersect the vertical lists as follows (only frequent intersections for
prefix A are shown):
t (AC) = (1,2) (1,5) (1,7) (3,3) (3,6) (4,5) (5,3) (6,6) (6,8) (7,4)
t (AG) = (1,3) (1,8) (3,5) (4,3) (5,2) (6,5) (7,2) (7,6)
t (AT) = (1,4) (3,4) (4,4) (5,4) (6,7) (7,3)
t (ACT) = (1,4) (3,4) (6,7) (7,3)
t (AGC) = (1,5) (1,7) (3,6) (4,5) (5,3) (6,8) (7,4)
t (AGT) = (1,4) (4,4) (5,4) (6,7) (7,3)
t (ATC) = (1,5) (1,7) (3,6) (4,5) (6,8) (7,4)
t (AGTC) = (1,5) (1,7) (4,5) (6,8) (7,4)
Q3. Given s = AABBACBBAA and Σ = {A, B, C}. Define support as the number
of occurrence of a subsequence in s. Using minsup = 2, answer the following
questions:
(a) Show how the vertical Spade method can be extended to mine all frequent
substrings (consecutive subsequences) in s.
Answer: The vertical poslists for all consecutive sequences are shown in the
figure below:
Each pair shows the start and stop positions for the pattern in the input
sequence. For example, the pair (8, 9) for BA denotes the fact that the
substring starts at position 8 and ends at position 9.
The frequent substrings are therefore: A, B, AA, BA, BB, BBA.
(b) Construct the suffix tree for s using Ukkonen’s method. Show all intermediate
steps, including all suffix links.
Answer: The different steps of the suffix tree construction method are shown
below.
(c) Using the suffix tree from the previous step, find all the occurrences of the query
q = ABBA allowing for at most two mismatches.
(d) Show the suffix tree when we add another character A just before the $. That is,
you must undo the effect of adding the $, add the new symbol A, and then add $
back again.
Answer: The suffix tree after removing $ is shown in the figure below:
[Suffix tree for s with the terminal $ removed: edge labels are index ranges [i, j] into s (e denotes the current end), and leaves are numbered by the suffix they represent.]
[Suffix tree after appending the new symbol A (before the terminal $ is added back).]
[Final suffix tree after adding the terminal $ back.]
(e) Describe an algorithm to extract all the maximal frequent substrings from a suffix
tree. Show all maximal frequent substrings in s.
Answer: In a suffix tree, each node stands for a substring (the path from root
to that node). So we can extend each node in the suffix tree with a new field
“support”, which is the number of occurrences of the substring the node stands
for. In each internal node, the support is the number of leaf nodes of that
sub-tree. In each leaf node, the support is set to 1. Then we traverse the suffix tree: if we reach an internal node whose support is at least minsup but whose children's supports are all less than minsup, then the path from the root to this internal node is a potential maximal frequent substring. The only other check for maximality is that no character can be added as a prefix that still results in a frequent substring.
In the above example, by using this algorithm we can find the maximal frequent
substrings: AA, BBA.
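As a brute-force cross-check of this procedure (an alternative to the suffix tree, feasible here only because s is short), one can count every substring directly and then filter for maximality:

```python
from collections import Counter

s, minsup = "AABBACBBAA", 2
# count every substring of s
counts = Counter(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))
frequent = {sub for sub, c in counts.items() if c >= minsup}
# a frequent substring is maximal if no one-character extension (prefix or suffix)
# is still frequent
maximal = {sub for sub in frequent
           if not any((ch + sub) in frequent or (sub + ch) in frequent
                      for ch in set(s))}
print(sorted(frequent))   # ['A', 'AA', 'B', 'BA', 'BB', 'BBA']
print(sorted(maximal))    # ['AA', 'BBA']
```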
Q4. Consider a bitvector based approach for mining frequent subsequences. For instance,
in Table 10.1, for s1 , the symbol C occurs at positions 5 and 11. Thus, the bitvector for
C in s1 is given as 00001000001. Because C does not appear in s2 its bitvector can be
omitted for s2 . The complete set of bitvectors for symbol C is
(s1 , 00001000001)
(s3 , 00100001100)
(s4 , 000100000100)
Given the set of bitvectors for each symbol show how we can mine all frequent sub-
sequences by using bit operations on the bitvectors. Show the frequent subsequences
and their bitvectors using minsup = 4.
Answer: The bitvectors for A are:
(s1, 11010110110)
(s2, 0010000010)
(s3, 11010000011)
(s4, 110000000011)
The bitvectors for G are:
(s1, 00000001000)
(s2, 1000110100)
(s3, 00000110000)
(s4, 001010110000)
To find the support of AG, in each input sequence we find the first occurrence of a 1 bit in the bitvector for A. Suppose this first occurrence is at position i; we then set bit b_i = 0 and set all the bits after it to 1, i.e., b_{i+1} . . . b_k are all 1's. Now we take the AND with the bitvector for G for the same input sequence.
sequences. For example, consider s2 . The first occurrence of A is at position 3, so
we set that to 0, and make the remaining occurrences as 1 to obtain: 0001111111.
Now taking the AND with the bitvector for G for s2 we obtain 0000110100. This
means that in s2 , there are three occurrences of a G after an A, namely at positions
5, 6, 8. When we do the bitvector operations for AG, we obtain the following results:
(s1 , 00000001000)
(s2 , 0000110100)
(s3 , 00000110000)
(s4 , 001010110000)
Since there is at least a single 1 bit in each sequence, we have the support of AG as
4. In a similar manner we can obtain all of the remaining frequent sequences listed
in Q1(a).
Q5. Consider the database shown in Table 10.3. Each sequence comprises itemset events
that happen at the same time. For example, sequence s1 can be considered to be a
sequence of itemsets (AB)10 (B)20 (AB)30 (AC)40 , where symbols within brackets are
considered to co-occur at the same time, which is given in the subscripts. Describe
an algorithm that can mine all the frequent subsequences over itemset events. The
itemsets can be of any length as long as they are frequent. Find all frequent itemset
sequences with minsup = 3.
Answer: The sequence mining proceeds as before, where we look for occurrences
of a symbol after the given prefix. However, now we also have to consider the
possible itemset extensions at each time slot. The set of all frequent itemset
sequences is given as follows:
Id Time Items
10 A, B
20 B
s1
30 A, B
40 A, C
20 A, C
s2 30 A, B, C
50 B
10 A
30 B
s3 40 A
50 C
60 B
30 A, B
40 A
s4 50 B
60 C
A − 4, B − 4, C − 4
AA − 4, {A, B} − 3, AB − 4, AC − 4, BA − 3, BB − 4
AAB − 3, AAC − 3, ABB − 3, ABC − 3, {A, B}B − 3
Q6. The suffix tree shown in Figure 10.5 contains all suffixes for the three sequences
s1 , s2 , s3 in Table 10.1. Note that a pair (i, j ) in a leaf denotes the j th suffix of
sequence si .
(a) Add a new sequence s4 = GAAGCAGAA to the existing suffix tree, using the
Ukkonen algorithm. Show the last character position (e), along with the suffixes
(l) as they become explicit in the tree for s4 . Show the final suffix tree after all
suffixes of s4 have become explicit.
Answer: When adding s4 , we find that the following strings up to the current
last character will all be found in the tree: G, GA, GAA, and GAAG. When
looking at character 5, namely C, we find the first difference. At this point
e = 5, and suffixes l = 1, 2, 3, 4 will become explicit. Suffix 5 does not become
explicit, since C is already in the tree. In fact, all the remaining suffixes
CA, CAG, CAGA, CAGAA will be found in the tree and will remain implicit.
Finally, when we consider the terminal character $, all the suffixes will become
explicit, i.e., when e = 10, l = 5, 6, 7, 8, 9, 10. The final suffix tree after adding
sequence s4 is shown (without the last character $).
[Figure omitted: the final suffix tree over s1, s2, s3, and s4, with edges labeled by substrings (e.g., A, G, CAG, T$) and leaves labeled by (sequence, suffix-position) pairs such as (1, 6), (4, 1), and (3, 1).]
(b) Find all closed frequent substrings with minsup = 2 using the final suffix
tree.
Answer: Now based on the tree above, the closed frequent substrings, with
minsup = 2 are:
T-3
AG - 4
GA - 4
CAG - 3
GAAG - 3
CAGAA - 2
GAAGT - 2
Given the following three sequences:
s1: GAAGT
s2: CAGAT
s3: ACGT
Find all the frequent subsequences with minsup = 2, but allowing at most a gap of 1
position between successive sequence elements.
Answer: The frequent sequences with gap are as follows: A(3), C(2), G(3), T(3),
AA(2), AG(3), AT(2), CG(2), GA(2), GT(3), AAT(2), AGT(2), GAT(2).
C H A P T E R 11 Graph Pattern Mining
11.5 EXERCISES
Q1. Find the canonical DFS code for the graph in Figure 11.1. Try to eliminate some codes
without generating the complete search tree. For example, you can eliminate a code
if you can show that it will have a larger code than some other code.
Figure 11.1. Graph for Q1. [Figure omitted: a graph on eight vertices with labels a, a, a, a, b, b, c, d.]
Answer: [Figure omitted: the graph with its vertices numbered according to the canonical DFS code; the canonical numbering assigns a(1), c(2), b(3), a(4), d(5), a(6), b(7), a(8).]
Q2. Given the graph in Figure 11.2. Mine all the frequent subgraphs with minsup = 1. For
each frequent subgraph, also show its canonical code.
Figure 11.2. Graph for Q2. [Figure omitted: a small graph whose vertices are all labeled a.]
0 a
|
1 a
0 1 a a
0 1 a a
1 2 a a
3) extending 1)
a
| \
a a
0 1 a a
1 2 a a
2 0 a a
5) extending 2)
0 a
|
1 a
|
2 a
|
3 a
0 1 a a
1 2 a a
2 3 a a
6) extending 2)
0 a
|
1 a
|\
2 a a 3
0 1 a a
1 2 a a
1 3 a a
7) extending 2)
a
|\
a a
|
a
This subgraph is isomorphic to 5).
0 1 a a
1 2 a a
2 0 a a
2 3 a a
9) extending 5)
0 a
| \
1 a |
| |
2 a |
| /
3 a/
0 1 a a
1 2 a a
2 3 a a
3 0 a a
10) extending 6)
a
| \
a |
| \|
a a
11) extending 8)
0 a
|\
1 a |
|\/
2 a/|
|/
3 a
0 1 a a
1 2 a a
2 0 a a
2 3 a a
3 1 a a
Q3. Consider the graph shown in Figure 11.3. Show all its isomorphic graphs and their
DFS codes, and find the canonical representative (you may omit isomorphic graphs
that can definitely not have canonical codes).
Answer: The figure below shows the potential isomorphic graphs that can have
minimal DFS codes.
A
a
b
A a A
a
b
B A
Figure 11.3. Graph for Q3.
G1 G2 G3
G4 G5 G6
A A A A A A
b
a a a b a a
b a b
A A A B A A A B A b
a b a a A a
a b a a a
b a
b a
b
A B A A A A A A
A
B
a b
a b
B A B A
The DFS codes for G4 and G6 cannot be minimal since they have as first edge
the tuple (0, 1, A, A, b), but all of the other graphs shown have the first edge
(0, 1, A, A, a). In the table below we show the DFS codes for the other four graphs
and indicate the minimal DFS code in bold. Note that the final comparison is
between G2 and G5 , but since (2, 3) < (1, 3), regardless of the labels, G5 wins out.
G1 G2 G3 G5
(0, 1, A, A, a) (0, 1, A, A, a) (0, 1, A, A, a) (0,1,A,A,a)
(1, 2, A, A, b) (1, 2, A, A, a) (1, 2, A, A, b) (1,2,A,A,a)
(2, 0, A, A, a) (2, 0, A, A, b) (2, 0, A, A, a) (2,0,A,A,b)
(1, 3, A, B, a) (1, 3, A, A, b) (2, 3, A, B, a) (2,3,A,B,a)
(0, 4, A, A, b) (0, 4, A, B, a) (0, 4, A, A, b) (1,4,A,A,b)
Q4. Given the graphs in Figure 11.4, separate them into isomorphic groups.
Answer: The groups of isomorphic graphs are as follows: {G1}, {G2, G5}, {G3}, {G4, G6}, {G7}.
Q5. Given the graph in Figure 11.5. Find the maximum DFS code for the graph, subject to
the constraint that all extensions (whether forward or backward) are done only from
the right most path.
Answer: The maximum dfs code is for the graph shown below.
[Figure 11.4 omitted: the graphs G1 through G7 for Q4, with vertex labels a, b, and c.]
(0, 1, C, C)
(1, 2, C, C)
(2, 3, C, C)
(3, 4, C, B)
(4, 5, B, A)
(5, 2, A, C)
(3, 6, C, A)
(6, 2, A, C)
(6, 1, A, C)
(6, 0, A, C)
(3, 1, C, C)
Note that in the minimal code, we consider the back edges from a node before
going depth-first or branching. In the maximum code, we still have to go depth-first,
but the back edges all come at the end; further, we work backwards from
higher-numbered nodes to lower-numbered ones. However, the rightmost path must
be respected; therefore (5, 2, A, C) has to come before (3, 6, C, A).
Q6. For an edge labeled undirected graph G = (V, E), define its labeled adjacency matrix
A as follows:
A(i, j) = L(vi) if i = j, A(i, j) = L(vi, vj) if (vi, vj) ∈ E, and A(i, j) = 0 otherwise
where L(vi ) is the label for vertex vi and L(vi , vj ) is the label for edge (vi , vj ). In other
words, the labeled adjacency matrix has the node labels on the main diagonal, and it
has the label of the edge (vi , vj ) in cell A(i, j ). Finally, a 0 in cell A(i, j ) means that
there is no edge between vi and vj .
Given a particular permutation of the vertices, a matrix code for the graph is
obtained by concatenating the lower triangular submatrix of A row-by-row. For
example, one possible matrix corresponding to the default vertex permutation
v0 v1 v2 v3 v4 v5 for the graph in Figure 11.6 is given as
Figure 11.6. Graph for Q6. [Figure omitted: vertices v0 through v5 with labels a, b, b, b, b, a; edge labels are x on (v0, v1), y on (v1, v2), (v1, v3), (v2, v3), (v2, v4), (v3, v4), and z on (v4, v5).]
a
x b
0 y b
0 y y b
0 0 y y b
0 0 0 0 z a
The code for the matrix above is axb0yb0yyb00yyb0000za. Given the total ordering
on the labels
0<a<b<x <y <z
find the maximum matrix code for the graph in Figure 11.6. That is, among all possible
vertex permutations and the corresponding matrix codes, you have to choose the
lexicographically largest code.
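A small script makes the matrix-code construction concrete. The sketch below is not from the text; the edge list is an assumed encoding reconstructed from the matrix above, and the brute-force search over all vertex permutations returns the lexicographically largest code under the given label order (which coincides with ASCII order here).

from itertools import permutations

# Assumed encoding of the graph in Figure 11.6 (reconstructed from the matrix above).
labels = {'v0': 'a', 'v1': 'b', 'v2': 'b', 'v3': 'b', 'v4': 'b', 'v5': 'a'}
edges = {frozenset(('v0', 'v1')): 'x',
         frozenset(('v1', 'v2')): 'y', frozenset(('v1', 'v3')): 'y',
         frozenset(('v2', 'v3')): 'y', frozenset(('v2', 'v4')): 'y',
         frozenset(('v3', 'v4')): 'y', frozenset(('v4', 'v5')): 'z'}

def matrix_code(perm):
    """Concatenate the lower-triangular labeled adjacency matrix row by row."""
    code = []
    for i, vi in enumerate(perm):
        for vj in perm[:i]:
            code.append(edges.get(frozenset((vi, vj)), '0'))
        code.append(labels[vi])          # diagonal entry = vertex label
    return ''.join(code)

print(matrix_code(('v0', 'v1', 'v2', 'v3', 'v4', 'v5')))  # axb0yb0yyb00yyb0000za

# Lexicographically largest code over all permutations, using 0 < a < b < x < y < z.
print(max(matrix_code(p) for p in permutations(labels)))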
C H A P T E R 12 Pattern and Rule Assessment
12.4 EXERCISES
Answer: We know that conv(X → Y) = (1 − rsup(Y))/(1 − conf(X → Y)). Now if X and Y are
independent, then conf(X → Y) = P(Y|X) = P(Y) = rsup(Y). In this case, we
immediately have conv(X → Y) = 1.
Finally, under independence we have
oddsratio(X → Y) = (P(XY) P(¬X¬Y)) / (P(X¬Y) P(¬XY)) = (P(X)P(Y)P(¬X)P(¬Y)) / (P(X)P(¬Y)P(¬X)P(Y)) = 1
Q3. Show that for a frequent itemset X, the value of the relative lift statistic defined in
Example 12.20 lies in the range
[1 − |D|/minsup, 1]
Q4. Prove that all subsets of a minimal generator must themselves be minimal generators.
Answer: Let X be a minimal generator, with the tidset t(X). Let Y ⊂ X. Since X is
minimal, we must have t(Y) ⊃ t(X). Also, note that t(X) = t(X \ Y ∪ Y) = t(X \ Y) ∩
t(Y).
Assume that Y is not a minimal generator; then there exists a minimal generator
Z ⊂ Y such that t(Z) = t(Y). However, in this case, t((X \ Y) ∪ Z) = t(X \ Y) ∩
t(Z) = t(X \ Y) ∩ t(Y) = t(X), and (X \ Y) ∪ Z is a proper subset of X, which
contradicts the fact that X is a minimal generator. Thus, we conclude that Y must be
a minimal generator.
Q5. Let D be a binary database spanning one billion (10^9) transactions. Because it is
too time consuming to mine it directly, we use Monte Carlo sampling to find
bounds on the frequency of a given itemset X. We run 200 sampling trials Di (i =
1, ..., 200), each with a sample of size 100,000, and we obtain the support values for X in
the various samples, as shown in Table 12.1. The table shows the number of samples
where the support of the itemset was a given value. For instance, in 5 samples its
support was 10,000. Answer the following questions:
(a) Draw a histogram for the table, and calculate the mean and variance of the
support across the different samples.
Answer: Let ni be the number of samples where X has frequency fi. The mean and
variance of the support across the samples are given as
µ = (Σi ni fi) / (Σi ni) = 5400000/200 = 27000
σ^2 = (Σi ni (fi − µ)^2) / (Σi ni) = (1395 × 10^7)/200 = 69750000
The standard deviation is therefore σ = 8351.647. Note that with respect to the
whole dataset, the mean relative support is µ = 27000/10^9 = 2.7 × 10^-5, and
its variance is σ^2 = 6.975 × 10^-11, with standard deviation σ = 8.352 × 10^-6.
(b) Find the lower and upper bound on the support of X at the 95% confidence level.
The support values given should be for the entire database D.
Answer: We assume that support follows a normal distribution, for which the
critical z−value for the 95% confidence interval is 1.96. Thus, the support
interval is given as
(µ − 1.96σ, µ + 1.96σ) = (2.7 × 10^-5 − 1.637 × 10^-5, 2.7 × 10^-5 + 1.637 × 10^-5) = (1.0631, 4.3369) × 10^-5
The observed support value is higher than the minsup by a count of 32500 −
25000 = 7500. We can use the empirical probability mass function from part
(a) to get the p-value of observing a difference of 7500 from 25000. We get
p-value(7500) = P (sup(X) − minsup ≥ 7500) = P (sup(X) ≥ 32500) = 65/200 =
0.325. Since the p-value is quite high, we conclude that the support is not
significantly higher than the minsup value.
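The mean, variance, and normal-approximation interval above follow directly from the histogram of sample supports. The sketch below is not from the text, and the counts dictionary is only a placeholder for Table 12.1, which is not reproduced here.

import math

# Hypothetical histogram: {support value observed in a sample: number of samples}.
hist = {10000: 5, 20000: 60, 30000: 100, 40000: 35}

n_trials = sum(hist.values())
mu = sum(f * n for f, n in hist.items()) / n_trials
var = sum(n * (f - mu) ** 2 for f, n in hist.items()) / n_trials
sd = math.sqrt(var)

# 95% normal-approximation interval for the per-sample support.
z = 1.96
lo, hi = mu - z * sd, mu + z * sd
print(f"mean={mu:.1f}  sd={sd:.1f}  95% CI=({lo:.1f}, {hi:.1f})")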
Q6. Let A and B be two binary attributes. While mining association rules at 30%
minimum support and 60% minimum confidence, the following rule was mined:
A −→ B, with sup = 0.4, and conf = 0.66. Assume that there are a total of 10,000
customers, and that 4000 of them buy both A and B; 2000 buy A but not B, 3500 buy
B but not A, and 500 buy neither A nor B.
Compute the dependence between A and B via the χ 2 -statistic from the corre-
sponding contingency table. Do you think the discovered association is truly a strong
rule, that is, does A predict B strongly? Set up a hypothesis testing framework, writing
down the null and alternate hypotheses, to answer the above question, at the 95%
confidence level. Here are some values of chi-squared statistic for the 95% confidence
level for various degrees of freedom (df):
df χ2
1 3.84
2 5.99
3 7.82
4 9.49
5 11.07
6 12.59
Answer: Let our null hypothesis be: Ho : A and B are independent, and the
alternate hypothesis is Ha : A and B are dependent.
Let us set up the contingency table of observed counts nij:
        B=1    B=0
A=1     4000   2000
A=0     3500    500
Under the null hypothesis of independence, the expected counts eij (computed from the row and column totals) are:
        B=1    B=0
A=1     4500   1500
A=0     3000   1000
The values (nij − eij)^2 / eij are then:
        B=1     B=0
A=1     55.56   166.67
A=0     83.33   250
We then get χ^2 = Σi Σj (nij − eij)^2 / eij = 555.56. This value is way above the 3.84
threshold for 1 degree of freedom. Thus, we can safely reject the null hypothesis,
and we can claim that A and B are highly dependent.
On the other hand this still doesn’t answer the question whether A → B is a
strong rule. For this we can compute the actual support of AB versus the null
hypothesis that they are independent, which is captured by the Lift measure. We
calculate lift(A → B) = sup(AB)/(sup(A) · sup(B)) = 0.40/(0.60 · 0.75) = 0.89. A value of 1 would imply
they are independent, but a value of < 1 implies a negative dependence. Thus, we
can conclude that instead of A being a good predictor of B, one is less likely to buy
B given A!
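The χ^2 computation can be reproduced with a few lines; this sketch (not from the text) derives the expected counts from the marginals of the observed 2 × 2 table and sums the (n − e)^2/e terms.

# Observed 2x2 contingency table for A (rows) and B (columns), from the question.
obs = [[4000, 2000],   # A=1: (B=1, B=0)
       [3500,  500]]   # A=0: (B=1, B=0)

n = sum(sum(row) for row in obs)
row_tot = [sum(row) for row in obs]
col_tot = [sum(obs[i][j] for i in range(2)) for j in range(2)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        e = row_tot[i] * col_tot[j] / n      # expected count under independence
        chi2 += (obs[i][j] - e) ** 2 / e

print(f"chi2 = {chi2:.2f}")                  # about 555.56, with df = 1
print("reject H0 at 95%" if chi2 > 3.84 else "fail to reject H0")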
P A R T THREE CLUSTERING
C H A P T E R 13 Representative-based Clustering
13.5 EXERCISES
Q1. Given the following points: 2, 4, 10, 12, 3, 20, 30, 11, 25. Assume k = 3, and that we
randomly pick the initial means µ1 = 2, µ2 = 4 and µ3 = 6. Show the clusters obtained
using K-means algorithm after one iteration, and show the new means for the next
iteration.
Answer: With the initial means µ1 = 2, µ2 = 4, µ3 = 6, assigning each point to its
closest mean (breaking ties in favor of the lower-index cluster) gives the clusters
C1 = {2, 3}, C2 = {4}, and C3 = {10, 11, 12, 20, 25, 30}. The new means for the next
iteration are
µ1 = 2.5    µ2 = 4    µ3 = 18
For the second iteration, the assignment to the closest mean yields the clusters
C1 = {2, 3}, C2 = {4, 10, 11}, and C3 = {12, 20, 25, 30}, with updated means µ1 = 2.5,
µ2 = 8.33, and µ3 = 21.75.
For the third iteration, the assignment to the closest mean yields the clusters
C1 = {2, 3, 4}, C2 = {10, 11, 12}, and C3 = {20, 25, 30}, with the new means
µ1 = 3    µ2 = 11    µ3 = 25
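The iterations above can be traced with a short script. This sketch is not from the text and assumes the same lower-index tie-break used in the walkthrough.

def kmeans_1d(points, means, iters=10):
    """Plain 1-D K-means; ties go to the lower-index cluster."""
    means = list(means)
    for _ in range(iters):
        clusters = [[] for _ in means]
        for x in points:
            # index of the closest mean (min is stable, so ties pick the first)
            j = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[j].append(x)
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:
            break
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, [2, 4, 6])
print(clusters)   # clusters {2, 3, 4}, {10, 11, 12}, {20, 25, 30} at convergence
print(means)      # [3.0, 11.0, 25.0]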
Q2. Given the data points in Table 13.1, and their probability of belonging to two clusters.
Assume that these points were produced by a mixture of two univariate normal
distributions. Answer the following questions:
(a) Find the maximum likelihood estimate of the means µ1 and µ2 .
Answer: Based on the maximum likelihood equations, we know that:
µi = (Σ_{j=1}^n xj · P(Ci|xj)) / (Σ_{j=1}^n P(Ci|xj))
Q3. Given the two-dimensional points in Table 13.2, assume that k = 2, and that initially
the points are assigned to clusters as follows: C1 = {x1 , x2 , x4 } and C2 = {x3 , x5 }.
Answer the following questions:
(a) Apply the K-means algorithm until convergence, that is, the clusters do not
change, assuming (1) the usual Euclidean distance or the L2 -norm as the distance
X1 X2
x1 0 2
x2 0 0
x3 1.5 0
x4 5 0
x5 5 2
between points, defined as ||xi − xj||_2 = (Σ_{a=1}^d (xia − xja)^2)^{1/2}, and (2) the
Manhattan distance or the L1-norm, defined as ||xi − xj||_1 = Σ_{a=1}^d |xia − xja|.
Answer: First, we consider the Euclidean distance. Initially, the two means are
given as µ1 = (5/3, 2/3)T = (1.67, 0.67)T and µ2 = (6.5/2, 2/2)T = (3.25, 1)T .
We compute the distance of each point to the cluster means, and assign it to
the nearest mean, as follows:
d(xi , µ1 ) d(xi , µ2 ) Cluster
x1 2.1 3.4 c1
x2 1.8 3.4 c1
x3 0.7 2.0 c1
x4 3.4 2.0 c2
x5 3.6 2.0 c2
For the next iteration, we recompute the means, as follows: µ1 = (1.5/3, 2/3)T =
(0.5, 0.67)T and µ2 = (10/2, 2/2)T = (5, 1)T . The new cluster assignments for the
points are follows:
d(xi , µ1 ) d(xi , µ2 ) Cluster
x1 1.42 5.1 c1
x2 0.83 5.1 c1
x3 1.2 3.6 c1
x4 4.5 1.0 c2
x5 4.7 1.0 c2
Since there is no change in the cluster assignments, we stop.
Now we consider the Manhattan distance. From the two means, µ1 =
(1.67, 0.67)T and µ2 = (3.25, 1)T , we compute the distance of each point to the
cluster means, and assign it to the nearest mean, as follows:
d(xi , µ1 ) d(xi , µ2 ) Cluster
x1 3 4.25 c1
x2 2.34 4.25 c1
x3 0.84 2.75 c1
x4 4 2.75 c2
x5 4.66 2.75 c2
The assignments are the same as for the Euclidean distance. For the next iteration,
we recompute the means as follows: µ1 = (1.5/3, 2/3)T = (0.5, 0.67)T and µ2 =
(10/2, 2/2)T = (5, 1)T. The new cluster assignments for the points are as follows:
We have:
Q4. Given the categorical database in Table 13.3. Find k = 2 clusters in this data using
the EM method. Assume that each attribute is independent, and that the domain of
each attribute is {A, C, T}. Initially assume that the points are partitioned as follows:
C1 = {x1 , x4 }, and C2 = {x2 , x3 }. Assume that P (C1 ) = P (C2 ) = 0.5.
X1 X2
x1 A T
x2 A A
x3 C C
x4 A C
Instead of computing the mean for each cluster, generate a partition of the objects
by doing a hard assignment. That is, in the expectation step compute P (Ci |xj ), and
in the maximization step assign the point xj to the cluster with the largest P (Ci |xj )
value, which gives a new partitioning of the points. Show one full iteration of the EM
algorithm and show the resulting clusters.
Answer: Given the initial partition: C1 = {x1 , x4 }, and C2 = {x2 , x3 }, with P (C1 ) =
P (C2 ) = 0.5.
Expectation Step: First consider cluster C1 . For attribute X1 , we have:
P (X1 = A|C1 ) = 2/2 = 1, which implies P (X1 = C|C1 ) = P (X1 = T|C1 ) = 0.
For attribute X2 , we have: P (X2 = A|C1 ) = 0/2 = 0 and P (X2 = C|C1 ) = P (X2 =
T|C1 ) = 1/2.
Likewise for C2 , attributes X1 and X2 we have:
P (X1 = A|C2 ) = P (X1 = C|C2 ) = 1/2, and P (X1 = T|C2 ) = 0, and
P (X2 = A|C2 ) = P (X2 = C|C2 ) = 1/2, and P (X2 = T|C2 ) = 0.
For the points, the posterior probabilities are then
P(C1|x1) = 1, P(C2|x1) = 0
P(C1|x2) = 0, P(C2|x2) = 1
P(C1|x3) = 0, P(C2|x3) = 1
P(C1|x4) = 2/3, P(C2|x4) = 1/3
Maximization step: assigning each point to the cluster with the largest posterior gives
C1 = {x1, x4} and C2 = {x2, x3}, so the partition is unchanged after one full iteration.
Q5. Given the points in Table 13.4, assume that there are two clusters: C1 and C2 , with
µ1 = (0.5, 4.5, 2.5)T and µ2 = (2.5, 2, 1.5)T . Initially assign each point to the closest
mean, and compute the covariance matrices 6 i and the prior probabilities P (Ci ) for
i = 1, 2. Next, answer which cluster is more likely to have produced x8 ?
Answer: For x8 = (2.5, 3.5, 2.8)T , we first compute the distance from the two means
µ1 and µ2 as follows, which defaults to the Euclidean distance since the covariance
matrix is I. We have
X1 X2 X3
x1 0.5 4.5 2.5
x2 2.2 1.5 0.1
x3 3.9 3.5 1.1
x4 2.1 1.9 4.9
x5 0.5 3.2 1.2
x6 0.8 4.3 2.6
x7 2.7 1.1 3.1
x8 2.5 3.5 2.8
x9 2.8 3.9 1.5
x10 0.1 4.1 2.9
P (C1 |x8 ) = P (x8 |C1 )P (C1 )/P (x8 ) = (0.0785c · 0.5)/(0.218 · 0.5 · c) = 0.36
P (C2 |x8 ) = 1 − 0.36 = 0.64
Q6. Consider the data in Table 13.5. Answer the following questions:
(a) Compute the kernel matrix K between the points assuming the following kernel:
K(xi, xj) = 1 + xiT xj
(b) Assume initial cluster assignments of C1 = {x1 , x2 } and C2 = {x3 , x4 }. Using kernel
K-means, which cluster should x1 belong to in the next step?
Answer: Using
||φ(xj) − µi^φ||^2 = K(xj, xj) − (2/ni) Σ_{xa∈Ci} K(xa, xj) + (1/ni^2) Σ_{xa∈Ci} Σ_{xb∈Ci} K(xa, xb)
X1 X2 X3
x1 0.4 0.9 0.6
x2 0.5 0.1 0.6
x3 0.6 0.3 0.6
x4 0.4 0.8 0.5
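Given the kernel K(xi, xj) = 1 + xiT xj from part (a) and the initial clusters C1 = {x1, x2} and C2 = {x3, x4}, the feature-space distances can be evaluated directly from the kernel matrix via the formula above. The sketch below is not from the text.

# Kernel K-means distance of each point to each cluster mean in feature space,
# computed from the kernel matrix only (formula above).
X = [(0.4, 0.9, 0.6), (0.5, 0.1, 0.6), (0.6, 0.3, 0.6), (0.4, 0.8, 0.5)]
K = [[1 + sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

clusters = {0: [0, 1], 1: [2, 3]}   # initial C1 = {x1, x2}, C2 = {x3, x4}

def kernel_dist2(j, members):
    ni = len(members)
    term2 = 2.0 / ni * sum(K[a][j] for a in members)
    term3 = 1.0 / ni**2 * sum(K[a][b] for a in members for b in members)
    return K[j][j] - term2 + term3

for j in range(len(X)):
    d = {i: kernel_dist2(j, m) for i, m in clusters.items()}
    print(f"x{j+1}: d^2 to C1 = {d[0]:.4f}, d^2 to C2 = {d[1]:.4f}")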
Q7. Prove the following equivalence for the multivariate normal density function:
∂/∂µi f(xj | µi, Σi) = f(xj | µi, Σi) Σi^{-1} (xj − µi)
C H A P T E R 14 Hierarchical Clustering
14.4 EXERCISES
Point X1 X2 X3 X4 X5
x1 1 0 1 1 0
x2 1 1 0 1 0
x3 0 0 1 1 0
x4 0 1 0 1 0
x5 1 0 1 0 1
x6 0 1 1 0 0
The similarity between categorical data points can be computed in terms of the
number of matches and mismatches for the different attributes. Let n11 be the number
of attributes on which two points xi and xj assume the value 1, and let n10 denote the
number of attributes where xi takes value 1, but xj takes on the value of 0. Define
n01 and n00 in a similar manner. The contingency table for measuring the similarity is
then given as
xj
1 0
xi 1 n11 n10
0 n01 n00
x2 x3 x4 x5 x6
x1 3/5 3/5 4/5 3/5 4/5
x2 4/5 3/5 4/5 4/5
x3 4/5 4/5 4/5
x4 5/5 4/5
x5 4/5
We pick the least distance and break ties by choosing the cluster with the
smallest index. The first merge is therefore for x1 and x2 . We get the new matrix:
x3 x4 x5 x6
x1 , x2 3/5 3/5 3/5 4/5
x3 4/5 4/5 4/5
x4 5/5 4/5
x5 4/5
The next merge is then, x1 , x2 and x3 , and the new distance matrix is:
x4 x5 x6
x1 , x2 , x3 3/5 3/5 4/5
x4 5/5 4/5
x5 4/5
The smallest distances are between x1 and x3 , and between x2 and x4 . Since
these are disjoint, we can merge them in one step to obtain the two initial
clusters. The new distance matrix is given as:
x2 , x4 x5 x6
x1 , x3 3/5 3/5 3/5
x2 , x4 5/5 3/5
x5 3/5
Next to merge are x1 , x3 and x2 , x4 assuming that smaller indexes are merged
first in cases of tie-breaks. The new distance matrix is:
x5 x6
x1 , x2 , x3 , x4 5/5 3/5
x5 3/5
x2 , x4 x5 , x6
x1 , x3 0.625 0.666
x2 , x4 0.804
1 3 2 4 5 6 Distance
| | | | | |
13 24 | | 1/3
| | | |
| | 56 0.6
| | |
1234 | 0.625
| |
123456
Q2. Given the dataset in Figure 14.1, show the dendrogram resulting from the single-link
hierarchical agglomerative clustering approach using the L1 -norm as the distance
between points
δ(x, y) = Σ_{a=1}^{2} |xa − ya|
Whenever there is a choice, merge the cluster that has the lexicographically smallest
labeled point. Show the cluster merge order in the tree, stopping when you have k = 4
clusters. Show the full distance matrix at each step.
The first pair of points to merge are {c, k}, {d, e}. However, note that the pairs {e, i}
and {g, k} are also at distance 1 from each other. In the single link clustering, these
will also merge in the next step. So we might as well merge these upfront to obtain
two initial clusters, namely {c, g, k} and {d, e, i}, with merge distance 1. The new
distance matrix is then given as
b c, g, k d, e, i f h j
a 2 4 6 4 8 9
b 2 6 4 6 7
c, g, k 2 4 2 2
d, e, i 2 6 6
f 4 5
h 3
9
a
8
b
7
6
c k
5
d e f g h
4
i
3
j
2
1 2 3 4 5 6 7 8 9
Figure 14.1. Dataset for Q2.
The next to merge will be a, b; then that cluster will merge with c, g, k to create
a, b, c, g, k. This cluster will merge with f to create the cluster a, b, c, f, g, k. At
this point there will be 4 clusters, namely {a, b, c, f, g, k}, {d, e, i}, {h}, and {j}. The
process stops at this point since we desire 4 clusters. The dendrogram is given as:
a b c g k f h j d e i
| | | | | | | | | | |
| | cgk | | | dei
ab | | | | |
| | | | | |
abcgk | | | |
| | | | |
abcfgk | | |
| | | |
A B C D E
A 0 1 3 2 4
B 0 3 2 3
C 0 1 3
D 0 5
E 0
Q3. Using the distance matrix from Table 14.2, use the average link method to generate
hierarchical clusters. Show the merging distance thresholds.
Answer: The first pairs to merge are {A, B} and {C, D}, at a distance of 1. The
updated distance matrix is:
C, D E
A, B 2.5 3.5
C, D 4
The next to merge are A, B and C, D, at a distance of 2.5, which yields the matrix:
E
A, B, C, D 3.75
δ(Ci ∪ Cj, Cr) = (1/((ni + nj) nr)) Σ_{x∈Ci} Σ_{y∈Cr} δ(x, y) + (1/((ni + nj) nr)) Σ_{x∈Cj} Σ_{y∈Cr} δ(x, y)
= (ni/(ni + nj)) · (1/(ni nr)) Σ_{x∈Ci} Σ_{y∈Cr} δ(x, y) + (nj/(ni + nj)) · (1/(nj nr)) Σ_{x∈Cj} Σ_{y∈Cr} δ(x, y)
= (ni/(ni + nj)) δ(Ci, Cr) + (nj/(ni + nj)) δ(Cj, Cr)
ni nj nr T nr
+2 µi µj − 2 ni µT T
r µi + nj µr µj (14.1)
nij nij r nij r
Q5. If we treat each point as a vertex, and add edges between two nodes with distance
less than some threshold value, then the single-link method corresponds to a well
known graph algorithm. Describe this graph-based algorithm to hierarchically cluster
the nodes via single-link measure, using successively higher distance thresholds.
Answer: Define the (complete) weighted graph over the points, where the weights
denote the distance. Then for each value of distance, if we restrict the graph to only
those edges with weight at most the chosen value of distance, then the clusters via
single-link are precisely the connected components of the distance restricted graph.
As an example, consider the dataset shown in Figure 14.1. For a distance
threshold of 1, there are only two connected components, namely {c, g, k} and
{d, e, i}. Next, when we raise the threshold to 2, we get two connected components,
namely {a, b, c, d, e, f, g, h, i, k} and {j}. Finally, when the distance threshold is 3, there is only
one connected component. One can verify that these are precisely the clusters
obtained via single-link clustering.
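The correspondence can be checked directly by computing connected components of the threshold graph; the sketch below is not from the text, and the coordinates in the usage example are hypothetical.

# Single-link clusters at a given distance threshold = connected components
# of the graph that keeps only the edges with weight <= threshold.
def single_link_components(points, dist, threshold):
    parent = {p: p for p in points}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry

    pts = list(points)
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if dist(pts[i], pts[j]) <= threshold:
                union(pts[i], pts[j])

    comps = {}
    for p in pts:
        comps.setdefault(find(p), []).append(p)
    return list(comps.values())

# Usage with hypothetical 2-D coordinates and the L1 distance.
coords = {'a': (1, 8), 'b': (2, 7), 'c': (3, 5)}
l1 = lambda p, q: abs(coords[p][0] - coords[q][0]) + abs(coords[p][1] - coords[q][1])
print(single_link_components(coords, l1, 2))   # [['a', 'b'], ['c']]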
C H A P T E R 15 Density-based Clustering
15.5 EXERCISES
Q1. Consider Figure 15.1 and answer the following questions, assuming that we use the
Euclidean distance between points, and that ǫ = 2 and minpts = 3
(a) List all the core points.
Answer: The core points are a, b, c, d, e, f, g, h, i, j, k, n, o, p, q, r, s, t, v, w
(b) Is a directly density reachable from d?
Answer: Yes, since d is a core object and a belongs to N2 (d).
(c) Is o density reachable from i? Show the intermediate points on the chain or the
point where the chain breaks.
Answer: Yes, the intermediate points are i, e, b, c, f , g, j , n, o or i, e, f , j , n,
o.
(d) Is density reachable a symmetric relationship, that is, if x is density reachable
from y, does it imply that y is density reachable from x? Why or why not?
Answer: Density reachable is not a symmetric relationship, since a non-core
object may be reachable from a core object, but the reverse is not necessarily
true. For example u is density reachable from n but n is not density reachable
from u.
(e) Is l density connected to x? Show the intermediate points that make them density
connected or violate the property, respectively.
Answer: Yes, for example, via t, since l is density-reachable from t and x is
also density-reachable from t.
(f) Is density connected a symmetric relationship?
Answer: Yes, by definition. In other words for any two points, there exists a
core point that reaches both of them.
(g) Show the density-based clusters and the noise points.
C1 :{a, d, h, k, p, q, r, s, t, l, v, w, x}
C2 :{b, c, e, f, g, i, j, n, m, o, u}
[Figure 15.1 omitted: the dataset for Q1, points a through x on a grid.]
Q2. Consider the points in Figure 15.2. Define the following distance measures:
L∞(x, y) = max_{i=1}^d |xi − yi|
L_{1/2}(x, y) = (Σ_{i=1}^d |xi − yi|^{1/2})^2
Lmin(x, y) = min_{i=1}^d |xi − yi|
Lpow(x, y) = (Σ_{i=1}^d 2^{i-1} (xi − yi)^2)^{1/2}
(a) Using ǫ = 2, minpts = 5, and L∞ distance, find all core, border, and noise points.
Answer: The core points are c, f, g, k. The border points are b, e, h. The noise
points are a, d, i, j .
(b) Show the shape of the ball of radius ǫ = 4 using the L_{1/2} distance. Using minpts = 3,
show all the clusters found by DBSCAN.
Answer: The ball {z : L_{1/2}(z, x) ≤ 4} is a non-convex, star-like region with concave sides (figure omitted).
Using ǫ = 4, we find that the core points are {b, c, d, e, f, g, h, i, k}, and the
border points are {a, j }. There are no noise points.
Q3. Consider the points shown in Figure 15.2. Define the following two kernels:
K1(z) = 1 if L∞(z, 0) ≤ 1, and 0 otherwise
K2(z) = 1 if Σ_{j=1}^d |zj| ≤ 1, and 0 otherwise
Using each of the two kernels K1 and K2 , answer the following questions assuming
that h = 2:
(a) What is the probability density at e?
Figure 15.2. Dataset for Q2 and Q3. [Figure omitted: points a through k on a 9 × 9 grid.]
Answer: For K1,
f̂(e) = (1/(11 · 2^2)) [K1((e − d)/2) + K1((e − e)/2) + K1((e − f)/2) + K1((e − i)/2)] = (1/44) · 4 = 0.091
For K2, the probability density is the same:
f̂(e) = (1/(11 · 2^2)) [K2((e − d)/2) + K2((e − e)/2) + K2((e − f)/2) + K2((e − i)/2)] = (1/44) · 4 = 0.091
Q4. The Hessian matrix is defined as the set of partial derivatives of the gradient vector
with respect to x. What is the Hessian matrix for the Gaussian kernel? Use the
gradient in Eq. (15.6).
Q5. Let us compute the probability density at a point x using the k-nearest neighbor
approach, given as
f̂(x) = k / (n Vx)
where k is the number of nearest neighbors, n is the total number of points, and Vx is
the volume of the region encompassing the k nearest neighbors of x. In other words,
we fix k and allow the volume to vary based on those k nearest neighbors of x. Given
the following points
2, 2.5, 3, 4, 4.5, 5, 6.1
Find the peak density in this dataset, assuming k = 4. Keep in mind that this may
happen at a point other than those given above. Also, a point is its own nearest
neighbor.
Answer: Since the data is one-dimensional, the volume Vx is simply the distance
from x to its fourth nearest neighbor (for k = 4). The computed density estimates at
the data points are given below.
x p̂(x)
2 4 / (7 × 2)
2.5 4 / (7 × 1.5)
3 4 / (7 × 1)
4 4 / (7 × 1)
4.5 4 / (7 × 1.5)
5 4 / (7 × 1.1)
6.1 4 / (7 × 2.1)
Therefore the peak density is 4/7 = 0.57.
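The table above is easy to reproduce; the sketch below (not from the text) computes the k-nearest-neighbor density estimate at each point, taking the volume to be the distance to the kth nearest neighbor, as in the answer.

# k-nearest-neighbor density estimate in one dimension; the volume V_x is the
# distance to the k-th nearest neighbor (a point is its own nearest neighbor).
def knn_density(x, data, k):
    dists = sorted(abs(x - y) for y in data)
    vol = dists[k - 1]
    return k / (len(data) * vol)

data = [2, 2.5, 3, 4, 4.5, 5, 6.1]
for x in data:
    print(x, round(knn_density(x, data, k=4), 3))
# The peak value 4/(7*1) = 0.571 occurs at x = 3 and x = 4.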
C H A P T E R 16 Spectral and Graph Clustering
16.5 EXERCISES
Q1. Show that if Qi denotes the ith column of the modularity matrix Q, then Σ_{i=1}^n Qi = 0.
Q2. Prove that both the normalized symmetric and asymmetric Laplacian matrices Ls
[Eq. (16.6)] and La [Eq. (16.9)] are positive semidefinite. Also show that the smallest
eigenvalue is λn = 0 for both.
Answer: Let c be any real vector, and let z = Δ^{-1/2} c, so that zi = ci/√di. Then
cT Ls c = zT L z = (1/2) Σ_{i=1}^n Σ_{j=1}^n aij (zi − zj)^2 = (1/2) Σ_{i=1}^n Σ_{j=1}^n aij (ci/√di − cj/√dj)^2
Since aij ≥ 0 and (zi − zj )2 is always non-negative, we immediately have the result
that Ls is positive semidefinite. An immediate consequence is that Ls has real
eigenvalues, and each λi ≥ 0. Now, if Lsi denotes the ith column of Ls , then from
Eq. (16.7) we can see that
√d1 Ls1 + √d2 Ls2 + √d3 Ls3 + · · · + √dn Lsn = 0
That is, Ls is not a full-rank matrix, and at least one eigenvalue has to be zero. Since
all eigenvalues are at least zero, we conclude that λn = 0.
The normalized asymmetric Laplacian matrix La is also positive semidefinite in
the sense that it has the same set of non-negative eigenvalues λi ≥ 0 as the symmetric
Laplacian Ls, and for each eigenvector ui of Ls we have vi = Δ^{-1/2} ui as the
corresponding eigenvector of La.
Now, from Eq. (16.9) we can see that if Lai denotes the ith column of La, then
La1 + La2 + · · · + Lan = 0, which implies that the smallest eigenvalue is λn = 0, following
the same reasoning as for the normalized symmetric Laplacian case above.
Q3. Prove that the largest eigenvalue of the normalized adjacency matrix M [Eq. (16.2)]
is 1, and further that all eigenvalues satisfy the condition that |λi | ≤ 1.
Answer: Let u be an eigenvector of M with eigenvalue λ, normalized so that ||u||_2 = 1. Then
M^k u = λ^k u
Let M′ = M^k. Since M′ is also a Markov (row-stochastic) matrix, we have m′ij ≥ 0 and Σj m′ij = 1,
which implies that m′ij ≤ 1.
Now consider the ith row of M′ and its dot product with u in the expression
M′u = λ^k u; we have
Σj m′ij uj = λ^k ui
However, note that the 1-norm of u, namely ||u||_1 = Σi |ui|, can achieve a maximum
value of √n, since it is a fact that
||u||_2 ≤ ||u||_1 ≤ √n ||u||_2
Since ||u||_2 = 1, we have ||u||_1 ≤ √n. Since each m′ij ≤ 1, we immediately have
Σj m′ij uj ≤ Σj |uj| ≤ √n
Now assume that |λ| > 1. In this case, |λ|^k increases without bound as k → ∞,
whereas Σj m′ij uj can never exceed √n. We conclude that our assumption is false,
and therefore |λ| ≤ 1. Thus, we have |λ| ≤ 1 for all eigenvalues, and 1 is the largest
eigenvalue.
Q4. Show that Σ_{vr∈Ci} cir dr cir = Σ_{r=1}^n Σ_{s=1}^n cir Δrs cis, where ci is the cluster indicator
vector for cluster Ci and Δ is the degree matrix for the graph.
Answer: Note that cir = 1 iff vr ∈ Ci. Also note that Δrs = 0 if r ≠ s, and Δrs = dr
if r = s. Thus, we have
Σ_{r=1}^n Σ_{s=1}^n cir Δrs cis = Σ_{r=1}^n cir Δrr cir = Σ_{vr∈Ci} cir dr cir
Q5. For the normalized symmetric Laplacian Ls, show that for the normalized cut
objective the real-valued cluster indicator vector corresponding to the smallest
eigenvalue λn = 0 is given as cn = (1/√(Σ_{i=1}^n di)) Δ^{1/2} 1.
Answer: We can verify that
Ls cn = Δ^{-1/2} L Δ^{-1/2} · (1/√(Σ_{i=1}^n di)) Δ^{1/2} 1 = (1/√(Σ_{i=1}^n di)) Δ^{-1/2} L 1 = 0
since L1 = 0. Thus cn is an eigenvector of Ls corresponding to λn = 0.
Q6. Given the graph in Figure 16.1, answer the following questions:
(a) Cluster the graph into two clusters using ratio cut and normalized cut.
Answer: The adjacency matrix and the corresponding degree matrix for the
graph are as follows:
A =
0 1 0 1
1 0 1 1
0 1 0 1
1 1 1 0
D =
2 0 0 0
0 3 0 0
0 0 2 0
0 0 0 3
For the ratio cut, we have to find the two smallest eigenvalues and correspond-
ing eigenvectors of L, which are as follows:
λ4 = 0, u4 = (1/2, 1/2, 1/2, 1/2)T and λ3 = 2, u3 = (1/√2, 0, −1/√2, 0)T
For the normalized cut, we have to find the two smallest eigenvalues and
corresponding eigenvectors of La , which are as follows:
λ4 = 0, u4 = (1/2, 1/2, 1/2, 1/2)T and λ3 = 1, u3 = (1/√2, 0, −1/√2, 0)T
λ1 = 2.56 λ2 = 0
√ √
u1 = (0.44, 0.56, 0.44, 0.56)T u2 = (1/ 2, 0, −1/ 2, 0)T
λ1 = 1 λ2 = −0.67
u1 = (1/2, 1/2, 1/2, 1/2)T u2 = (−0.59, 0.39, −0.59, 0.39)T
However, since there is only one positive eigenvalue, we cannot cluster the data
into two groups based only on u1 , since all of its elements have the same value.
For kernel K-means we realize that we first have to modify the matrix K to have
a value of K(x, x) = 1, otherwise, each point will be more similar to other points
than to itself. So we first make sure that the diagonal in the kernel matrix K has
all ones, so that
K = M + I =
1    1/2  0    1/2
1/3  1    1/3  1/3
0    1/2  1    1/2
1/3  1/3  1/3  1
For Kernel K-means we start with a random split of the points. Let’s assume
that C1 = {1, 2} and C2 = {3, 4}. The average of all kernel values within each
P
cluster 1/n2i · x,y∈Ci K(x, y) is 0.7075 for both clusters. Next, the average kernel
P
value of each point x with other points in a given cluster, 1/n1 · y∈Ci K(x, y) is
as follows:
xi C1 C2
1 0.75 0.25
2 0.665 0.33
3 0.25 0.75
4 0.33 0.665
We assign each point to the closest cluster, breaking ties by choosing the
smaller index cluster, which gives the two new clusters C1 = {1, 2} and C2 =
{3, 4}. Since there is no change in the clusters, we stop.
Different answers are possible for different starting random clustering.
(c) Cluster the graph using the MCL algorithm with inflation parameters r = 2 and
r = 2.5.
Answer: For MCL, we first have to add self-loops to the adjacency matrix. The
normalized adjacency matrix is then given as:
A =
1 1 0 1
1 1 1 1
0 1 1 1
1 1 1 1
M =
1/3  1/3  0    1/3
1/4  1/4  1/4  1/4
0    1/3  1/3  1/3
1/4  1/4  1/4  1/4
Here we can see that vertices 2 and 4 are the attractors, and both 1 and 3 are
attracted to them. Thus there is only one cluster comprising all the vertices in the
graph.
The result of using r = 2.5 is the same, since the convergence is even more rapid,
and we obtain the same final matrix as above, with the same clustering.
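One MCL-style iteration (expansion, inflation, re-normalization) can be sketched as follows. This is not the text's algorithm listing; it assumes the row-stochastic normalization used for M above.

import numpy as np

A = np.array([[1, 1, 0, 1],
              [1, 1, 1, 1],
              [0, 1, 1, 1],
              [1, 1, 1, 1]], dtype=float)       # adjacency with self-loops

M = A / A.sum(axis=1, keepdims=True)            # row-stochastic transition matrix

def mcl(M, r=2.0, iters=20):
    for _ in range(iters):
        M = M @ M                               # expansion
        M = M ** r                              # inflation with parameter r
        M = M / M.sum(axis=1, keepdims=True)    # re-normalize rows
    return M

print(np.round(mcl(M, r=2.0), 3))
# Vertices whose columns retain non-zero mass act as attractors; as noted above,
# vertices 2 and 4 attract all rows, giving a single cluster.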
X1 X2 X3
x1 0.4 0.9 0.6
x2 0.5 0.1 0.6
x3 0.6 0.3 0.6
x4 0.4 0.8 0.5
Q7. Consider Table 16.1. Assuming these are nodes in a graph, define the weighted
adjacency matrix A using the linear kernel
A(i, j) = 1 + xiT xj
Cluster the data into two groups using the modularity objective.
Answer: We have
A =
2.33 1.65 1.87 2.18
1.65 1.62 1.69 1.58
1.87 1.69 1.81 1.78
2.18 1.58 1.78 2.05
The dominant eigenvector and the eigenvalue are as follows: λ1 = 0.01465 and
u1 = (0.55, −0.61, −0.38, 0.44)T .
Based on the values, we can see that the two clusters are C1 = {x1 , x4 } and C2 =
{x2 , x3 }.
C H A P T E R 17 Clustering Validation
17.5 EXERCISES
Q1. Prove that the maximum value of the entropy measure in Eq. (17.2) is log k.
Answer: First note that H(T|Ci) for a given cluster Ci has maximum entropy log k,
since in the worst case the members of Ci are equally split among all k partitions,
i.e., nij/ni = 1/k for all j = 1, ..., k. In that case
H(T|Ci) = −Σ_{j=1}^k (1/k) log(1/k) = log k
Since H(T|C) = Σi (ni/n) H(T|Ci) ≤ Σi (ni/n) log k = log k, the maximum value of the
entropy measure is log k.
Q2. Show that if C and T are independent of each other then H(T |C ) = H(T ), and further
that H(C , T ) = H(C ) + H(T ).
= H(T )
Answer: If H(T|C) = 0, then, assuming that ni ≠ 0, it must be that H(T|Ci) = 0 for all
i = 1, ..., r. This implies that for cluster Ci, either nij = 0 or nij = ni for j = 1, ..., k.
However, nij = ni for exactly one value of j, and nij = 0 for the rest. This means that
Ci is identical to Tj, and this implies perfect clustering.
For the reverse direction, it is obvious, since nij = ni = mj for the matching pair Ci and Tj.
I(C, T) = Σ_{i=1}^r Σ_{j=1}^k pij log(pij / (pCi · pTj))
= Σ_{i=1}^r Σ_{j=1}^k pij log pij − Σ_{i=1}^r Σ_{j=1}^k pij log(pCi · pTj)
= −H(C, T) − Σ_{i=1}^r Σ_{j=1}^k pij log pCi − Σ_{i=1}^r Σ_{j=1}^k pij log pTj
= −H(C, T) − Σ_{i=1}^r pCi log pCi − Σ_{j=1}^k pTj log pTj
= H(C) + H(T) − H(C, T)
Q5. Show that the variation of information is 0 only when C and T are identical.
Q6. Prove that the maximum value of the normalized discretized Hubert statistic in
Eq. (17.21) is obtained when FN = FP = 0, and the minimum value is obtained when
TP = TN = 0.
Q7. Show that the Fowlkes–Mallows measure can be considered as the correlation
between the pairwise indicator matrices for C and T , respectively. Define C(i, j ) = 1
if xi and xj (with i ≠ j) are in the same cluster, and 0 otherwise. Define T similarly
for the ground-truth partitions. Define ⟨C, T⟩ = Σ_{i,j=1}^n Cij Tij. Show that
FM = ⟨C, T⟩ / √(⟨T, T⟩ ⟨C, C⟩)
Answer: By definition, C(i, j) = Cij = 1 iff xi and xj belong to the same cluster, and
likewise T(i, j) = Tij = 1 iff xi and xj belong to the same ground-truth partition.
Further, Cij Tij = 1 when both points belong to the same cluster and the same partition.
Thus, ⟨C, T⟩ = TP.
Now, ⟨T, T⟩ is simply the number of pairs of points that are in the same partition.
Some of these are in the same cluster, which comprise the true positives (TP), and
some of these pairs are not in the same cluster, which comprise the false negatives
(FN). Thus, ⟨T, T⟩ = TP + FN.
Similarly, ⟨C, C⟩ is the number of point pairs that are in the same cluster. Some
of these are in the same partition, which comprise the true positives (TP), and some
of these pairs are not in the same partition, which comprise the false positives (FP).
Thus, ⟨C, C⟩ = TP + FP.
Thus, FM = ⟨C, T⟩ / √(⟨T, T⟩ ⟨C, C⟩) = TP / √((TP + FN)(TP + FP)).
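The pairwise view translates directly into code; the sketch below (not from the text) accumulates TP, FP, and FN over all point pairs and evaluates FM = TP/√((TP + FN)(TP + FP)). The labels in the usage example are hypothetical.

from itertools import combinations
from math import sqrt

def fowlkes_mallows(clust, truth):
    """FM from pairwise co-membership indicators over all point pairs (i < j)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(clust)), 2):
        same_c = clust[i] == clust[j]
        same_t = truth[i] == truth[j]
        tp += same_c and same_t
        fp += same_c and not same_t
        fn += (not same_c) and same_t
    return tp / sqrt((tp + fn) * (tp + fp))

# Hypothetical labels for illustration only.
clust = [1, 1, 1, 2, 2, 2]
truth = ['a', 'a', 'b', 'b', 'b', 'b']
print(round(fowlkes_mallows(clust, truth), 3))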
Q8. Show that the silhouette coefficient of a point lies in the interval [−1, +1].
Answer: For a given point xi, if µ_out^min > µ_in, then si = 1 − µ_in/µ_out^min; the maximum value
for si in this case can be 1. On the other hand, if µ_out^min < µ_in, then si = µ_out^min/µ_in − 1; the
minimum value for si in this case can be −1. Thus si ∈ [−1, 1].
Q9. Show that the scatter matrix can be decomposed as S = SW + SB , where SW and SB
are the within-cluster and between-cluster scatter matrices.
Answer: The total scatter matrix is
S = Σ_{j=1}^n (xj − µ)(xj − µ)T = Σ_{i=1}^k Σ_{xj∈Ci} (xj xjT − xj µT − µ xjT + µµT) = Σ_{i=1}^k Σ_{xj∈Ci} xj xjT − n µµT
The within-cluster and between-cluster scatter matrices are
SW = Σ_{i=1}^k Σ_{xj∈Ci} (xj − µi)(xj − µi)T = Σ_{i=1}^k Σ_{xj∈Ci} xj xjT − Σ_{i=1}^k ni µi µiT
SB = Σ_{i=1}^k ni (µi − µ)(µi − µ)T = Σ_{i=1}^k ni µi µiT − n µµT
Adding the two, SW + SB = Σ_{i=1}^k Σ_{xj∈Ci} xj xjT − n µµT = S, as required.
Q10. Consider the dataset in Figure 17.1. Compute the silhouette coefficient for the point
labeled c.
Answer: To answer this question, we first need the clusters. Assume that we find
the following clusters C1 = {a, b, c, d, e}, C2 = {g, i}, C3 = {f, h, j } and C4 = {k}.
The mean distance from c to other points in its own cluster is: µin (c) = (2.92 +
1.5 + 2 + 1)/4 = 1.855.
Based on the distances, c is closest to cluster C3, and the mean of these distances
is µ_out^min(c) = (5 + 4.24 + 4.92)/3 = 4.72. Just for comparison, the average distance to
C2 is (5 + 5.83)/2 = 5.4.
Thus, the silhouette coefficient for c is sc = (4.72 − 1.855)/4.72 = 2.865/4.72 = 0.61.
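The same computation can be scripted. The sketch below is not from the text, and the coordinates are hypothetical stand-ins for Figure 17.1; only the formula s(x) = (µ_out^min − µ_in)/max(µ_out^min, µ_in) is taken from the discussion above.

from math import dist

def silhouette(point, points, clusters):
    """Silhouette coefficient of `point`, given coordinates and a clustering."""
    own = next(c for c, members in clusters.items() if point in members)
    mu_in = (sum(dist(points[point], points[p]) for p in clusters[own] if p != point)
             / (len(clusters[own]) - 1))
    mu_out = min(
        sum(dist(points[point], points[p]) for p in members) / len(members)
        for c, members in clusters.items() if c != own
    )
    return (mu_out - mu_in) / max(mu_out, mu_in)

# Hypothetical coordinates standing in for Figure 17.1.
pts = {'a': (3, 7), 'b': (4, 3), 'c': (2, 4), 'd': (3, 6), 'e': (4, 4),
       'f': (6, 1), 'h': (8, 1), 'j': (7, 2), 'g': (2, 9), 'i': (3, 8), 'k': (8, 5)}
clusters = {'C1': ['a', 'b', 'c', 'd', 'e'], 'C2': ['g', 'i'],
            'C3': ['f', 'h', 'j'], 'C4': ['k']}
print(round(silhouette('c', pts, clusters), 3))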
[Figure 17.1 omitted: the dataset for Q10, points a through k on a grid.]
Q11. Describe how one may apply the gap statistic methodology for determining the
parameters of density-based clustering algorithms, such as DBSCAN and DEN-
CLUE (see Chapter 15).
Answer: Let us consider the two parameters in DBSCAN, namely the radius ǫ and
the minimum number of points minpts. Let us fix ǫ for the moment, and try to
get a good value of minpts. Using D we can compute the number of core points
N(minpts) for different values of minpts. Likewise, given t random samples Ri ,
we can compute the average number of core points µN (minpts), and the standard
deviation σN (minpts). Finally, we can choose the value of minpts that maximizes
the gap N(minpts) − µN (minpts), since we want to see more core points in a
well-clustered dataset than we would expect under the null hypothesis that the data
is randomly generated from the input space.
For estimating ǫ we can compute the number of points within the ǫ-neighborhood
of each point and get the average of these, say, µD (ǫ) in the dataset D. Next, we can
compute the average number of points within the ǫ-neighborhood in each of the
t random datasets Ri . We can then compute the mean and standard deviation of
these averages, i.e., µR and σR . We can then look for those values of ǫ that show a
large gap.
A similar approach can be used to estimate the spread parameter h and the
density threshold ξ in DENCLUE.
P A R T FOUR CLASSIFICATION
C H A P T E R 18 Probabilistic Classification
18.5 EXERCISES
Q1. Consider the dataset in Table 18.1. Classify the new point: (Age=23, Car=truck) via
the full and naive Bayes approach. You may assume that the domain of Car is given
as {sports, vintage, suv, truck}.
Answer: Naive Bayes: Let us consider the naive Bayes approach first, which
assumes that each attribute is independent of the other. That is
For Car, which is categorical, we immediately run into a problem, since the value
truck does not appear in the training set. We could assume that P (truck|H) and
P (truck|L) are both zero. However, we desire to have some small probability of
observing each values in the domain of the attribute. We fix the problem using the
pseudo-count approach of adding a count of one to the observed counts of each
value for each class, as shown in the table below.
H                                          L
P(sports|H) = (1 + 1)/(4 + 4) = 2/8        P(sports|L) = (2 + 1)/(2 + 4) = 3/6
P(vintage|H) = (1 + 1)/(4 + 4) = 2/8       P(vintage|L) = (0 + 1)/(2 + 4) = 1/6
P(suv|H) = (2 + 1)/(4 + 4) = 3/8           P(suv|L) = (0 + 1)/(2 + 4) = 1/6
P(truck|H) = (0 + 1)/(4 + 4) = 1/8         P(truck|L) = (0 + 1)/(2 + 4) = 1/6
Using the above probabilities, we finally obtain
xi a1 a2 a3 Class
x1 T T 5.0 Y
x2 T T 7.0 Y
x3 T F 8.0 N
x4 F F 3.0 Y
x5 F T 7.0 N
x6 F T 4.0 N
x7 F F 5.0 N
x8 T F 6.0 Y
x9 F T 1.0 N
Thus P (23, truck|H) = P (Age ≤ 25, truck|H) − P (Age ≤ 20, truck|H) = 1/5, and
likewise P (23, truck|L) = 1/5.
Therefore, we have
P(H | (23, truck)) ∝ P((23, truck) | H) × P(H) = 1/5 × 4/6 = 0.133
and
P(L | (23, truck)) ∝ P((23, truck) | L) × P(L) = 1/5 × 2/6 = 0.067
Thus we classify (23, truck) as high risk (H) using Full Bayes.
Q2. Given the dataset in Table 18.2, use the naive Bayes classifier to classify the new point
(T, F, 1.0).
Answer: We have P (a1 = T|Y) = 3/4 and P (a1 = T|N) = 1/5, and P (a2 = F |Y) =
2/4 and P (a2 = F |N) = 2/5. The mean and variance for a3 for Y are µY = 5.25 and
σY = 1.71, and those for N are µN = 5 and σN = 2.74. Using the normal density
function, we have P (1.0|Y) = 0.0106 and P (1.0|N) = 0.0502.
Now, P(T, F, 1.0|Y) = 0.75 · 0.5 · 0.0106 = 0.003975 and P(T, F, 1.0|N) = 0.2 · 0.4 ·
0.0502 = 0.004016. Next, using the class priors P(Y) = 4/9 and P(N) = 5/9, we have
P(Y | T, F, 1.0) ∝ 0.003975 · 4/9 = 0.00177 and P(N | T, F, 1.0) ∝ 0.004016 · 5/9 = 0.00223
so the point (T, F, 1.0) is classified as N.
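The computation above can be reproduced with a short script (not from the text); the numeric attribute uses a normal density with the per-class sample mean and variance.

import math

# Training data from Table 18.2: (a1, a2, a3, class)
data = [('T','T',5.0,'Y'), ('T','T',7.0,'Y'), ('T','F',8.0,'N'), ('F','F',3.0,'Y'),
        ('F','T',7.0,'N'), ('F','T',4.0,'N'), ('F','F',5.0,'N'), ('T','F',6.0,'Y'),
        ('F','T',1.0,'N')]

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(a1, a2, a3):
    scores = {}
    for cls in ('Y', 'N'):
        rows = [r for r in data if r[3] == cls]
        p1 = sum(r[0] == a1 for r in rows) / len(rows)      # P(a1 | class)
        p2 = sum(r[1] == a2 for r in rows) / len(rows)      # P(a2 | class)
        vals = [r[2] for r in rows]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)   # sample variance
        prior = len(rows) / len(data)
        scores[cls] = p1 * p2 * normal_pdf(a3, mu, var) * prior
    return max(scores, key=scores.get), scores

print(classify('T', 'F', 1.0))   # predicts N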
Q3. Consider the class means and covariance matrices for classes c1 and c2 :
µ1 = (1, 3)T    µ2 = (5, 5)T
Σ1 =
5 3
3 2
Σ2 =
2 0
0 1
Classify the point (3, 4)T via the (full) Bayesian approach, assuming normally
distributed classes, and P(c1) = P(c2) = 0.5. Show all steps. Recall that the inverse
of a 2 × 2 matrix A = (a b; c d) is given as A^{-1} = (1/det(A)) (d −b; −c a).
Answer: First, compute the inverse covariance matrices. Since det(Σ1) = 10 − 9 = 1 and det(Σ2) = 2,
Σ1^{-1} =
2 −3
−3 5
Σ2^{-1} =
1/2 0
0 1
Now x − µ1 = (3, 4) − (1, 3) = (2, 1) and x − µ2 = (3, 4) − (5, 5) = (−2, −1).
Computing the Mahalanobis distance for c1, we get
(x − µ1)T Σ1^{-1} (x − µ1) = (2 1) (2 −3; −3 5) (2, 1)T = 1
and similarly for c2,
(x − µ2)T Σ2^{-1} (x − µ2) = (−2 −1) (1/2 0; 0 1) (−2, −1)T = 3
The class-conditional densities at x are therefore
f(x|c1) = (1/(2π√det(Σ1))) exp(−1/2) ≈ 0.096 and f(x|c2) = (1/(2π√det(Σ2))) exp(−3/2) ≈ 0.025
With equal priors P(c1) = P(c2) = 0.5, we have P(c1|x) > P(c2|x), so the point (3, 4)T is classified as c1.
C H A P T E R 19 Decision Tree Classifier
19.4 EXERCISES
Table 19.1. Data for Q2: Age is numeric and Car is categorical. Risk gives the class
label for each point: high (H) or low (L). [Table omitted.]
Q2. Given Table 19.1, construct a decision tree using a purity threshold of 100%. Use
information gain as the split point evaluation measure. Next, classify the point
(Age=27,Car=Vintage).
Answer: Let us consider how a complete decision tree is induced for the dataset in
Table 19.1. In the complete dataset we have P(H|D) = PH = 4/6 = 2/3 and
P(L|D) = PL = 2/6 = 1/3. Thus the entropy of D is
H(D) = −((2/3) log2(2/3) + (1/3) log2(1/3)) = −(−0.390 − 0.528) = 0.918
At the root of the decision tree, we consider all possible splits on Age and Car.
For Age, the possible distinct splits to consider are Age ≤ 22.5 and Age ≤ 35, which
were chosen to be the mid-points between the distinct values, namely, 20, 25, and
45, that we observe for Age.
(a) For Age ≤ 22.5, DL includes only the points x2 and x5, whereas DR comprises
the remaining points: x1, x3, x4, and x6. For DL, this yields PL = 0 and PH = 1,
whereas for DR we have PL = 2/4 and PH = 2/4. The weighted entropy is then
H(DL, DR) = (2/6) H(DL) + (4/6) H(DR) = (2/6)(0) − (4/6)((1/2) log2(1/2) + (1/2) log2(1/2)) = −(2/3) log2(1/2) = 0.67
This yields an information gain of 0.918 − 0.67 = 0.248.
(b) In a similar manner we can compute the weighted entropy for Age ≤ 35.
Here DR = {x4} and DL has the remaining points, so that H(DL) = −(2/5) log2(2/5) −
(3/5) log2(3/5) = 0.971 and H(DR) = 0. The split entropy is then H(DL, DR) =
(5/6)(0.971) = 0.809, and the information gain is 0.918 − 0.809 = 0.109, which is not
as high as for Age ≤ 22.5.
Next, we evaluate all possible splits for Car. Note that categorical data, in
general, yields on the order of 2^{|v|−1} possible splits, where v is the set of possible values for the
attribute. This can be reduced to O(|v|) by using a greedy split selection approach.
For Car the possible values are {Sports(S), Vintage(V), SUV(U)}, which yields the
following three distinct splits:
Car ∈        Car ∉
{S}          {V, U}
{V}          {S, U}
{U}          {S, V}
Note that the split Car ∈ {V, U} is essentially the same as the split Car ∈ {S}, the
only difference being that the decision has been “reversed”. It is therefore not
a distinct split, and we do not consider such splits. Next we evaluate the three
categorical splits as follows:
(a) For the split Car ∈ {S}, DL = {x1, x3, x5} and DR = {x2, x4, x6}. For DL, this
yields PL = 2/3 and PH = 1/3, and for DR, PL = 0 and PH = 1. The weighted entropy
of the split is then
H(DL, DR) = (3/6) H(DL) + (3/6) H(DR) = −(3/6)((1/3) log2(1/3) + (2/3) log2(2/3)) − (3/6)(0) = 0.459
Q3. What is the maximum and minimum value of the CART measure [Eq. (19.7)] and
under what conditions?
D
Car ∈ {S}
Yes No
DL DR
Age ≤ 22.5 H
Yes No
DLL DLR
H L
Using this decision tree, the point (Age = 27, Car = Vintage) follows the No branch at the root (since Vintage ∉ {S}) and is classified as H (high risk).
Answer: The minimum value is obtained when the class distribution on both sides of the split
is identical, in which case the summation Σ_{i=1}^k |P(ci|DY) − P(ci|DN)| goes to zero.
Therefore the minimum value is zero.
For the maximum value, first note that if α = nY/n, then 1 − α = nN/n, and the product
α(1 − α) = α − α^2 is maximized by taking the derivative with respect to α and setting
it to zero, i.e., when 1 − 2α = 0, which implies α = 1/2. In other words, the product
in the CART measure is maximized when the two partitions are balanced. Next,
the summation is maximized when each partition is pure, in which case each term
|P(ci|DY) − P(ci|DN)| in the summation can achieve a maximum value of 1. Thus,
over k classes, the maximum value of the CART measure is
2 · (1/2) · (1/2) · k = k/2
Q4. Given the dataset in Table 19.2. Answer the following questions:
(a) Show which decision will be chosen at the root of the decision tree using
information gain [Eq. (19.5)], Gini index [Eq. (19.6)], and CART [Eq. (19.7)]
measures. Show all split points for all attributes.
Answer: For the information gain we use log10 as opposed to log2 since it does
not qualitatively change the results.
Instance a1 a2 a3 Class
1 T T 5.0 Y
2 T T 7.0 Y
3 T F 8.0 N
4 F F 3.0 Y
5 F T 7.0 N
6 F T 4.0 N
7 F F 5.0 N
8 T F 6.0 Y
9 F T 1.0 N
Consider the split for attribute a1 , which has only one possible split, namely
a1 ∈ {T}. The split entropy is given as:
H(DL, DR) = (4/9)[−(1/4) log10(1/4) − (3/4) log10(3/4)] + (5/9)[−(1/5) log10(1/5) − (4/5) log10(4/5)] = 0.2293
Thus the gain is 0.2983 − 0.2293 = 0.0690.
The Gini for the split is
G(DL, DR) = (4/9)(1 − (1/4)^2 − (3/4)^2) + (5/9)(1 − (1/5)^2 − (4/5)^2) = 0.3444
For attribute a3 there are several numeric split points, namely a3 < 3.0, a3 < 4.0,
a3 < 5.0, a3 < 6.0, a3 < 7.0, and a3 < 8.0. The split entropy for each of these cases
is as follows:
(a) For a3 < 3.0 we have
H(DL, DR) = 0 + (8/9)[−(1/2) log10(1/2) − (1/2) log10(1/2)] = 0.2676
The gain is 0.2983 − 0.2676 = 0.0307.
The Gini for the split is
G(DL, DR) = (8/9)(1 − (1/2)^2 − (1/2)^2) = 0.4444
So the best split for all three measures is a1 ∈ {T}. It has the highest gain (0.069),
the lowest Gini value (0.3444), and the highest CART measure (0.5432).
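All three split scores are simple functions of the class counts on the two sides; the sketch below (not from the text) recomputes them for the split a1 ∈ {T}.

import math

def entropy(counts, base=10):
    n = sum(counts)
    return -sum(c / n * math.log(c / n, base) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def evaluate_split(left, right, base=10):
    """left/right are per-class counts on the two sides of a binary split."""
    nl, nr = sum(left), sum(right)
    n = nl + nr
    total = [l + r for l, r in zip(left, right)]
    gain = entropy(total, base) - (nl / n) * entropy(left, base) - (nr / n) * entropy(right, base)
    wgini = (nl / n) * gini(left) + (nr / n) * gini(right)
    cart = 2 * (nl / n) * (nr / n) * sum(abs(l / nl - r / nr) for l, r in zip(left, right))
    return gain, wgini, cart

# Split a1 in {T}: left side has 3 Y and 1 N, right side has 1 Y and 4 N.
print(evaluate_split([3, 1], [1, 4]))   # gain ~ 0.069, Gini ~ 0.344, CART ~ 0.543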
(b) What happens to the purity if we use Instance as another attribute? Do you think
this attribute should be used for a decision in the tree?
Answer: It is not advisable to use the instance id as a split value. The reason is
that in general the id is just some unique identifier for each instance and each
row will thus have a different value. Further, there is no reason to assume that
the value follows any numeric scale. So what might happen is that we might end
up with relatively pure splits, at least on one side, if we only allow binary splits,
and a completely pure split if we allow multi-way splits. However such a split is
of no “predictive” value.
Q5. Consider Table 19.3. Let us make a nonlinear split instead of an axis parallel split,
given as follows: AB − B2 ≤ 0. Compute the information gain of this split based on
entropy (use log2 , i.e., log to the base 2).
A B Class
x1 3.5 4 H
x2 2 4 H
x3 9.1 4.5 L
x4 2 6 H
x5 1.5 7 H
x6 7 6.5 H
x7 2.1 2.5 L
x8 8 4 L
Answer: For the full data, PL = 3/8 and PH = 5/8; therefore the entropy is
H(D) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.954
As for the split, for those points with AB − B^2 ≤ 0, we have DL = {x1, x2, x4, x5, x7}
and DR = {x3, x6, x8}. Therefore the entropy values for each side of the split are
H(DL) = −(4/5) log2(4/5) − (1/5) log2(1/5) = 0.722 and H(DR) = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.918
so the split entropy is H(DL, DR) = (5/8)(0.722) + (3/8)(0.918) = 0.796, and the
information gain is 0.954 − 0.796 = 0.158.
C H A P T E R 20 Linear Discriminant Analysis
20.4 EXERCISES
Q1. Consider the data shown in Table 20.1. Answer the following questions:
(a) Compute µ+1 and µ−1 , and B, the between-class scatter matrix.
Answer: We have µ+1 = (3.75, 3.45)T , and µ−1 = (2.25, 1.55)T . Next, their
difference is d = µ+1 − µ−1 = (1.5, 1.9)T, and finally
B = d dT =
2.25 2.85
2.85 3.61
i xi yi
x1 (4,2.9) 1
x2 (3.5,4) 1
x3 (2.5,1) −1
x4 (2,2.1) −1
(b) Compute S+1 and S−1, and S, the within-class scatter matrix.
Answer: For class +1 we have x1 − µ+1 = (0.25, −0.55)T and x2 − µ+1 = (−0.25, 0.55)T.
Therefore, the scatter matrix for +1 is given as
S1 = 2 ·
0.0625  −0.1375
−0.1375  0.3025
=
0.125  −0.275
−0.275  0.605
For class −1, we have x3 − µ−1 = (0.25, −0.55)T and x4 − µ−1 = (−0.25, 0.55)T,
and thus we also have S−1 = S1; therefore,
S = S1 + S−1 =
0.25  −0.55
−0.55  1.21
(c) Since S is singular, we increase the diagonal entries slightly to make the determinant
non-zero, using
S ≈
0.26  −0.55
−0.55  1.22
so that det(S) = 0.0147, and the inverse of the scatter matrix is
S^{-1} = (1/0.0147) ·
1.22  0.55
0.55  0.26
The best linear discriminant direction is then
w = S^{-1}(µ+1 − µ−1) = (1/0.0147)(2.875, 1.319)T, which normalized to unit length gives w = (0.909, 0.417)T.
(d) Having found the direction w, find the point on w that best separates the two
classes.
Answer: One approach is to project all the four points onto w, and then find
the point of best separation. For instance, projecting x1 onto w, we obtain the
offset along w as x1T w = 4.845. Likewise, we have x2T w = 4.849, x3T w = 2.689, and
x4T w = 2.694. The point on w that best separates the two classes can be taken
as the mid-point of the two closest points from opposite classes, i.e., half-way
between the projections of x2 and x3, given as (4.849 + 2.689)/2 = 3.769.
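A quick numerical check of parts (a) through (d) can be done with NumPy. This sketch is not from the text; the 0.01 added to the diagonal is an assumption standing in for the answer's "increase the diagonal entries slightly", chosen so that det(S) = 0.0147 as above.

import numpy as np

# Data for Q1: class +1 = {x1, x2}, class -1 = {x3, x4}.
Xp = np.array([[4.0, 2.9], [3.5, 4.0]])
Xn = np.array([[2.5, 1.0], [2.0, 2.1]])

mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
d = mu_p - mu_n
B = np.outer(d, d)                          # between-class scatter

Sp = (Xp - mu_p).T @ (Xp - mu_p)            # per-class scatter matrices
Sn = (Xn - mu_n).T @ (Xn - mu_n)
S = Sp + Sn                                 # within-class scatter (singular here)

S_reg = S + 0.01 * np.eye(2)                # small ridge on the diagonal
w = np.linalg.solve(S_reg, d)
w = w / np.linalg.norm(w)                   # unit-length direction

proj = np.r_[Xp, Xn] @ w
print(np.round(w, 3), np.round(proj, 3))
# The midpoint of the closest projections from opposite classes (~3.77)
# gives the separating point on w, as in part (d).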
Q2. Given the labeled points (from two classes) shown in Figure 20.1, and given that the
inverse of the within-class scatter matrix is
0.056 −0.029
−0.029 0.052
Answer: Since we are given the inverse of the scatter matrix, to obtain w we only
need to compute the difference vector of the two means. We have
µ1 = ((2, 3)T + (3, 3)T + (3, 4)T + (5, 8)T + (7, 7)T)/5 = (4, 5)T
µ2 = ((5, 4)T + (6, 5)T + (7, 4)T + (7, 5)T + (8, 2)T + (9, 4)T)/6 = (7, 4)T
µ1 − µ2 = (−3, 1)T
and thus
w = S^{-1}(µ1 − µ2) = (0.056 · (−3) − 0.029 · 1, −0.029 · (−3) + 0.052 · 1)T = (−0.197, 0.139)T
Figure 20.1. Dataset for Q2. [Figure omitted: labeled points from two classes on a 9 × 9 grid.]
Q3. Maximize the objective in Eq. (20.7) by explicitly considering the constraint wT w = 1,
that is, by using a Lagrange multiplier for that constraint.
max_w J(w) = (wT B w)/(wT S w) − α(wT w − 1)
Taking the derivative of J(w) with respect to the vector w, and setting the result
to zero, gives us
Q4. Prove the equality in Eq. (20.19). That is, show that
N1 = Σ_{xi∈D1} Ki KiT − n1 m1 m1T = (Kc1) (In1 − (1/n1) 1n1×n1) (Kc1)T
Answer: We have
N1 = Σ_{xi∈D1} Ki KiT − n1 m1 m1T
Since m1 = (1/n1) Σ_{xi∈D1} Ki, the (a, b) entry of the first term is Σ_{xi∈D1} Kia Kib, and the
(a, b) entry of the second term is (1/n1) Σ_{xi∈D1} Σ_{xj∈D1} Kia Kjb. Thus, for all 1 ≤ a, b ≤ n,
N1(a, b) = Σ_{xi∈D1} Kia Kib − (1/n1) Σ_{xi∈D1} Σ_{xj∈D1} Kia Kjb
which is exactly the (a, b) entry of (Kc1)(In1 − (1/n1) 1n1×n1)(Kc1)T, proving the equality.
CHAPTER 21 Support Vector Machines
21.7 EXERCISES
Q1. Consider the dataset in Figure 21.1, which has points from two classes c1 (triangles)
and c2 (circles). Answer the questions below.
(a) Find the equations for the two hyperplanes h1 and h2 .
Answer: Note that for h1 we have the two points on the line, namely (6, 0) and (1, 10), which gives the slope as m1 = 10/(1 − 6) = −2. Using (6, 0) as the point, we get the equation of the line as

\frac{x_2 - 0}{x_1 - 6} = -2 \implies 2x_1 + x_2 - 12 = 0

For h2 we have the two points on the line (2, 0) and (8, 10), which gives the slope as m2 = 10/(8 − 2) = 5/3. Using (2, 0) as the point, we get the equation of the line as

\frac{x_2 - 0}{x_1 - 2} = \frac{5}{3} \implies 5x_1 - 3x_2 - 10 = 0
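Each hyperplane here is simply the line through two given points. In coefficient form a1·x1 + a2·x2 + b = 0, the coefficients can be obtained as the cross product of the two points in homogeneous coordinates; the sketch below (our own) reproduces the two equations above up to a scalar multiple:

    import numpy as np

    def line_through(p, q):
        """Coefficients (a1, a2, b) of the line a1*x1 + a2*x2 + b = 0 through p and q."""
        return np.cross(np.append(p, 1.0), np.append(q, 1.0))

    print(line_through([6, 0], [1, 10]))   # proportional to ( 2, 1, -12) -> 2x1 + x2 - 12 = 0
    print(line_through([2, 0], [8, 10]))   # proportional to (-5, 3,  10) -> 5x1 - 3x2 - 10 = 0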
[Figure 21.1. Dataset for Q1: points from classes c1 (triangles) and c2 (circles), together with the hyperplanes h1(x) = 0 and h2(x) = 0.]
(b) Find the support vectors and the equation of the maximum-margin hyperplane separating the two classes.
Answer:
[Figure: the convex hulls of the two classes, with the lines passing through the candidate support vectors.]
The figure suggests the support vectors (6, 2) and (7, 6) for the circles and (3, 4) for the triangles. The convex hulls and the lines passing through these support vectors are shown in the figure above.
The optimal hyperplane is h: 4x1 − x2 − 15 = 0, which is exactly half-way between the lines passing through the support vectors, namely 4x1 − x2 − 22 = 0 and 4x1 − x2 − 8 = 0. The margin is 14/√17 = 3.395, which is larger than the margin of 3 obtained for the axis-parallel hyperplane mid-way between (3, 4) and (6, 4).
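The margin value can be verified directly as the distance between the two parallel support lines, |−22 − (−8)| / ‖(4, −1)‖:

    import numpy as np

    w = np.array([4.0, -1.0])
    margin = abs(-22 - (-8)) / np.linalg.norm(w)   # 14 / sqrt(17) ~ 3.395
    print(margin)                                  # larger than the margin 3 of the
                                                   # axis-parallel split between x1 = 3 and x1 = 6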
Table 21.1. Dataset for Q2.
i     xi1   xi2   yi    αi
x1    4     2.9    1    0.414
x2    4     4      1    0
x3    1     2.5   −1    0
x4    2.5   1     −1    0.018
x5    4.9   4.5    1    0
x6    1.9   1.9   −1    0
x7    3.5   4      1    0.018
x8    0.5   1.5   −1    0
x9    2     2.1   −1    0.414
x10   4.5   2.5    1    0
Q2. Given the 10 points in Table 21.1, along with their classes and their Lagrangian multipliers (αi), answer the following questions:
(a) What is the equation of the SVM hyperplane h(x)?
Answer: We have

w = \sum_i \alpha_i y_i x_i
  = 0.414 · (4, 2.9)^T − 0.414 · (2, 2.1)^T + 0.018 · (3.5, 4)^T − 0.018 · (2.5, 1)^T
  = 0.414 · (2, 0.8)^T + 0.018 · (1, 3)^T
  = (0.828 + 0.018, 0.3312 + 0.054)^T
  = (0.846, 0.385)^T

The bias can be recovered from any support vector xi (one with αi > 0) as b = yi − w^T xi; for example, using x1, b = 1 − (0.846 · 4 + 0.385 · 2.9) ≈ −3.50. Thus the SVM hyperplane is h(x) = 0.846 x1 + 0.385 x2 − 3.50.
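The weight vector is just the α-weighted, label-signed sum of the points in Table 21.1, so it is easy to recompute; only the four points with αi > 0 contribute. A sketch (ours), which also recovers the bias from the support vectors:

    import numpy as np

    X = np.array([[4, 2.9], [4, 4], [1, 2.5], [2.5, 1], [4.9, 4.5],
                  [1.9, 1.9], [3.5, 4], [0.5, 1.5], [2, 2.1], [4.5, 2.5]])
    y = np.array([1, 1, -1, -1, 1, -1, 1, -1, -1, 1])
    alpha = np.array([0.414, 0, 0, 0.018, 0, 0, 0.018, 0, 0.414, 0])

    w = (alpha * y) @ X                                    # ~ (0.846, 0.385)
    b = np.mean([y[i] - w @ X[i] for i in np.flatnonzero(alpha)])   # ~ -3.50
    print(w, b)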
(b) What is the distance of x6 from the hyperplane? Is it within the margin of the
classifier?
Answer: The distance of x6 from the hyperplane is given as

h(x6)/‖w‖ = (0.846 · 1.9 + 0.385 · 1.9 − 3.50)/√(0.846² + 0.385²) ≈ −1.16/0.93 ≈ −1.25

Since h(x6) ≈ −1.16 is less than −1 (points on the margin satisfy h(x) = ±1), the point is outside the margin, on the correct side for its class y6 = −1.
(c) Classify the point z = (3, 3)T using h(x) from above.
Answer: We have h(z) = w^T z + b = 0.846 · 3 + 0.385 · 3 − 3.50 = 3.693 − 3.50 ≈ 0.19 > 0, and therefore z is classified as ŷ = +1.
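Putting parts (b) and (c) together numerically, with w and b as computed in part (a) (recovering b by averaging yi − w^T xi over the support vectors is our choice of convention):

    import numpy as np

    w = np.array([0.846, 0.385])
    b = -3.50                       # bias recovered from the support vectors, as in part (a)

    x6 = np.array([1.9, 1.9])
    z = np.array([3.0, 3.0])

    h_x6 = w @ x6 + b                    # ~ -1.16, i.e. beyond the margin value of -1
    dist_x6 = h_x6 / np.linalg.norm(w)   # signed geometric distance, ~ -1.25
    y_hat_z = np.sign(w @ z + b)         # h(z) ~ +0.19  ->  class +1
    print(h_x6, dist_x6, y_hat_z)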
CHAPTER 22 Classification Assessment
22.5 EXERCISES
Q1. True or False:
(a) A classification model must have 100% accuracy (overall) on the training dataset.
Answer: False
(b) A classification model must have 100% coverage (overall) on the training
dataset.
Answer: True
Q2. Given the training database in Table 22.1a and the testing data in Table 22.1b, answer
the following questions:
(a) Build the complete decision tree using binary splits and Gini index as the
evaluation measure (see Chapter 19).
Answer: The Gini for the whole DB is Gini(D) = 1 − ((4/8)² + (4/8)²) = 1 − 0.5 = 0.5. We evaluate the different split points for X, Y, and Z (CL1 denotes the count on the left-hand partition for class 1, Gini(L) is the Gini index of the left-hand side, Gini(L, R) is the Gini of the split point, and Gain = Gini(D) − Gini(L, R)). The best split is:

Split       CL1  CL2  CR1  CR2  Gini(L)  Gini(R)  Gini(L,R)  Gain
Z ∈ {A}      4    0    0    4     0        0         0       0.5
It is clear that the best split point is Z ∈ {A}, since it produces a pure partition
on both sides, and there is no need to split further.
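The split evaluation is easy to reproduce in a few lines of code over the training records of Table 22.1a; the sketch below (ours, not the book's) computes the Gini values and the gain for the split Z ∈ {A}:

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    # Training records of Table 22.1a as (Z, class) pairs
    train = [("A", 1), ("B", 2), ("A", 1), ("A", 1), ("B", 2), ("A", 1), ("B", 2), ("B", 2)]

    left = [c for z, c in train if z == "A"]      # Z in {A}
    right = [c for z, c in train if z != "A"]
    labels = [c for _, c in train]

    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(train)
    gain = gini(labels) - weighted                # 0.5 - 0.0 = 0.5
    print(gini(labels), weighted, gain)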
(b) Compute the accuracy of the classifier on the test data. Also show the per class
accuracy and coverage.
Answer: Our decision tree has only a root node, namely Z ∈ {A}, with class 1 if true, and class 2 if false. There are 4 cases misclassified in the testing data, so the overall error rate is 4/5 = 0.8, that is, the accuracy is 1/5 = 0.2.
For the per-class accuracy and coverage, we compute the true positives for each class and divide them by the number of instances predicted as that class and by the number of instances truly in that class, respectively.
For Class 1, instances 2 and 5 in Table 22.1b belong to class 1, but we predict
instances 1 and 3 to belong to class 1. There is no true positive, since there is no
correct prediction. The accuracy and coverage are both 0.
For class 2, instances 1,3,4 truly belong to class 2, but we predict 2,4,5 as
belonging to class 2. The true positive is instance 4, and since we predict three
cases as belonging to class 2 the accuracy is 1/3, and since the true class also has
three instances, the coverage is also 1/3.
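The same bookkeeping in code, using the true and predicted labels of the five test instances; here "accuracy" is the per-class precision and "coverage" the per-class recall (a sketch, ours):

    # Test instances 1..5: predictions come from the single split Z in {A}
    y_true = [2, 1, 2, 2, 1]
    y_pred = [1, 2, 1, 2, 2]

    for c in (1, 2):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = y_pred.count(c)
        actual = y_true.count(c)
        acc = tp / predicted if predicted else 0.0    # class accuracy (precision)
        cov = tp / actual if actual else 0.0          # class coverage (recall)
        print(c, acc, cov)        # class 1: 0.0, 0.0   class 2: 1/3, 1/3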
Table 22.1. (a) Training data and (b) Testing data.

(a) Training
X   Y  Z  Class
15  1  A  1
20  3  B  2
25  2  A  1
30  4  A  1
35  2  B  2
25  4  A  1
15  2  B  2
20  3  B  2

(b) Testing
X   Y  Z  Class
10  2  A  2
20  1  B  1
30  3  A  2
40  2  B  2
15  1  B  1
Q3. Show that for binary classification the majority voting for the combined classifier
decision in boosting can be expressed as
M^K(x) = \text{sign}\left( \sum_{t=1}^{K} \alpha_t M_t(x) \right)
Answer: We are given that there are two classes, i.e., +1 and −1. Without loss of
generality assume that the weighted sum for the positive class is higher than that
for the negative class, i.e., v+1 (x) > v−1 (x). In this case, the class will be predicted
as +1, using arg max{vj (x)}, with v+1 (x) being the sum of the αt for classifiers that
predict class as +1.
If Mt(x) = +1, then the product αt Mt(x) will be αt; otherwise, the weight will be −αt. Since the weighted sum v+1(x) for class +1 is higher, the sum Σt αt Mt(x) will be positive, and the class will be predicted as +1.
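The equivalence is easy to illustrate on a toy example: the α-weighted vote per class and the sign of the α-weighted sum of the ±1 predictions always select the same class. A small sketch with made-up weights and predictions:

    import numpy as np

    alpha = np.array([0.5, 0.3, 0.2, 0.4])        # classifier weights (made up)
    preds = np.array([+1, -1, +1, +1])            # M_t(x) for a single point x

    v_pos = alpha[preds == +1].sum()              # weighted vote for class +1
    v_neg = alpha[preds == -1].sum()              # weighted vote for class -1
    argmax_class = +1 if v_pos > v_neg else -1

    sign_class = int(np.sign(alpha @ preds))      # sign(sum_t alpha_t M_t(x))
    print(argmax_class, sign_class)               # both +1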
[Figure 22.1. For Q4: the labeled points from classes c1 (triangles) and c2 (circles), with the six hyperplanes h1, ..., h6 learned from different bootstrap samples.]
A similar argument shows that the two expressions agree when the weighted sum for the negative class is higher, in which case both predict the class as −1.
Q4. Consider the 2-dimensional dataset shown in Figure 22.1, with the labeled points
belonging to two classes: c1 (triangles) and c2 (circles). Assume that the six
hyperplanes were learned from different bootstrap samples. Find the error rate for
each of the six hyperplanes on the entire dataset. Then, compute the 95% confidence
interval for the expected error rate, using the t-distribution critical values for different
degrees of freedom (dof) given in Table 22.2.
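The individual error rates have to be read off Figure 22.1, but once they are known the interval is a standard t-interval over the K = 6 bootstrap estimates. The sketch below uses placeholder error rates (the six values are illustrative only, not read from the figure) and the usual critical value t0.025 for dof = 5, which is what Table 22.2 would supply:

    import math

    errors = [0.25, 0.30, 0.20, 0.35, 0.15, 0.30]   # hypothetical per-hyperplane error rates
    K = len(errors)

    mean = sum(errors) / K
    var = sum((e - mean) ** 2 for e in errors) / (K - 1)   # unbiased sample variance
    se = math.sqrt(var / K)

    t_crit = 2.571        # t_{0.025} for dof = 5; in the exercise this comes from Table 22.2
    lo, hi = mean - t_crit * se, mean + t_crit * se
    print(f"mean error = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")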
Q5. Consider the probabilities P(+1|xi) for the positive class obtained for some classifier, along with the true class labels yi, given below. Plot the ROC curve for this classifier.
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
yi +1 −1 +1 +1 −1 +1 −1 +1 −1 −1
P (+1|xi ) 0.53 0.86 0.25 0.95 0.87 0.86 0.76 0.94 0.44 0.86
Answer: We first sort the points in decreasing order of the probabilities, to obtain
x4 x8 x5 x6 x2 x10 x7 x1 x9 x3
yi + + − + − − − + − +
P (+1|xi ) 0.95 0.94 0.87 0.86 0.86 0.86 0.76 0.53 0.44 0.25
The FPR is then simply FP/n2 and TPR is TP/n1 where n1 is the number of +
points, and n2 the number of − points.
The ROC curve is as follows
[ROC curve: TPR versus FPR, stepping from (0, 0) up to (1, 1).]
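The (FPR, TPR) points underlying the curve can be generated directly from the sorted table above; the sketch below (ours) sweeps the threshold down through the distinct probability values, so the three ties at 0.86 move the curve diagonally in a single step:

    # Probabilities and true labels from the table above (already sorted descending)
    probs = [0.95, 0.94, 0.87, 0.86, 0.86, 0.86, 0.76, 0.53, 0.44, 0.25]
    labels = [+1, +1, -1, +1, -1, -1, -1, +1, -1, +1]

    n1 = labels.count(+1)   # 5 positives
    n2 = labels.count(-1)   # 5 negatives

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (p, y) in enumerate(zip(probs, labels)):
        if y == +1:
            tp += 1
        else:
            fp += 1
        # emit a point only when the threshold actually changes (handles the ties at 0.86)
        if i == len(probs) - 1 or probs[i + 1] != p:
            points.append((fp / n2, tp / n1))

    print(points)
    # -> (0,0), (0,0.2), (0,0.4), (0.2,0.4), (0.6,0.6), (0.8,0.6), (0.8,0.8), (1.0,0.8), (1.0,1.0)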