
Contents

1 Data Mining and Analysis
  1.7 Exercises

PART I DATA ANALYSIS FOUNDATIONS

2 Numeric Attributes
  2.7 Exercises

3 Categorical Attributes
  3.7 Exercises

4 Graph Data
  4.6 Exercises

5 Kernel Methods
  5.6 Exercises

6 High-dimensional Data
  6.9 Exercises

7 Dimensionality Reduction
  7.6 Exercises

PART II FREQUENT PATTERN MINING

8 Itemset Mining
  8.5 Exercises

9 Summarizing Itemsets
  9.6 Exercises

10 Sequence Mining
  10.5 Exercises

11 Graph Pattern Mining
  11.5 Exercises

12 Pattern and Rule Assessment
  12.4 Exercises

PART III CLUSTERING

13 Representative-based Clustering
  13.5 Exercises

14 Hierarchical Clustering
  14.4 Exercises

15 Density-based Clustering
  15.5 Exercises

16 Spectral and Graph Clustering
  16.5 Exercises

17 Clustering Validation
  17.5 Exercises

PART IV CLASSIFICATION

18 Probabilistic Classification
  18.5 Exercises

19 Decision Tree Classifier
  19.4 Exercises

20 Linear Discriminant Analysis
  20.4 Exercises

21 Support Vector Machines
  21.7 Exercises

22 Classification Assessment
  22.5 Exercises

CHAPTER 1 Data Mining and Analysis

1.7 EXERCISES

Q1. Show that the mean of the centered data matrix Z in Eq. (1.5) is 0.

Answer: Each centered point is given as: zi = xi − µ. Their mean is therefore:


\[
\frac{1}{n}\sum_{i=1}^{n} z_i = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)
= \frac{1}{n}\sum_{i=1}^{n} x_i - \frac{1}{n}\cdot n\cdot \mu
= \mu - \mu = 0
\]

Q2. Prove that for the Lp-distance in Eq. (1.2), we have
\[
\delta_\infty(\mathbf{x},\mathbf{y}) = \lim_{p\to\infty}\delta_p(\mathbf{x},\mathbf{y}) = \max_{i=1}^{d}\big\{|x_i - y_i|\big\}
\]
for x, y ∈ R^d.

Answer: We have to show that
\[
\lim_{p\to\infty}\left(\sum_{i=1}^{d}|x_i-y_i|^p\right)^{1/p} = \max_{i=1}^{d}\big\{|x_i-y_i|\big\}
\]
Assume that dimension a attains the maximum, and let m = |x_a − y_a|. For simplicity, assume that |x_i − y_i| < m for all i ≠ a.
If we divide and multiply the left-hand side by m^p we get:
\[
\left(m^p\sum_{i=1}^{d}\frac{|x_i-y_i|^p}{m^p}\right)^{1/p} = m\left(1+\sum_{i\ne a}\left(\frac{|x_i-y_i|}{m}\right)^p\right)^{1/p}
\]
As p → ∞, each term (|x_i − y_i|/m)^p → 0, since m > |x_i − y_i| for all i ≠ a. The finite summation Σ_{i≠a}(|x_i − y_i|/m)^p therefore converges to 0 as p → ∞, as does the exponent 1/p.
Thus δ∞(x, y) = m · 1^0 = m = |x_a − y_a| = max_{i=1}^{d}{|x_i − y_i|}.
Note that the same result is obtained even if dimensions other than a achieve the maximum value m. In the worst case, m = |x_i − y_i| for all d dimensions, and the left-hand side becomes
\[
\lim_{p\to\infty}\left(\sum_{i=1}^{d}m^p\right)^{1/p} = \lim_{p\to\infty} m\,d^{1/p} = m\,d^0 = m
\]
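As a quick numerical illustration of this limit (a sketch, not part of the original solution; the two points below are arbitrary examples), the Lp distance approaches the maximum per-coordinate difference as p grows:

```python
import numpy as np

# Hypothetical example points in R^3; any x, y work
x = np.array([1.0, 4.0, 2.0])
y = np.array([3.0, 0.5, 2.5])

def lp_dist(x, y, p):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

for p in [1, 2, 5, 10, 50, 100]:
    print(p, lp_dist(x, y, p))

print("L_inf:", np.max(np.abs(x - y)))   # 3.5 for this example
```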

PART I DATA ANALYSIS FOUNDATIONS
CHAPTER 2 Numeric Attributes

2.7 EXERCISES

Q1. True or False:


(a) Mean is robust against outliers.
Answer: False
(b) Median is robust against outliers.
Answer: True
(c) Standard deviation is robust against outliers.
Answer: False

Q2. Let X and Y be two random variables, denoting age and weight, respectively.
Consider a random sample of size n = 20 from these two variables

X = (69, 74, 68, 70, 72, 67, 66, 70, 76, 68, 72, 79, 74, 67, 66, 71, 74, 75, 75, 76)
Y = (153, 175, 155, 135, 172, 150, 115, 137, 200, 130, 140, 265, 185, 112, 140,
150, 165, 185, 210, 220)

(a) Find the mean, median, and mode for X.


Answer: The mean, median, and mode are:
\[
\mu = \frac{1}{20}\sum_{i=1}^{20}x_i = 1429/20 = 71.45
\]

median = (71 + 72)/2 = 71.5


mode = 74

(b) What is the variance for Y?


Answer: The mean of Y is µY = 3294/20 = 164.7. The variance is:
\[
\sigma_Y^2 = \frac{1}{20}\sum_{i=1}^{20}(y_i-\mu_Y)^2 = 27384.2/20 = 1369.21
\]

(c) Plot the normal distribution for X.


Answer: The mean for X is µX = 71.45, and the variance is σ_X² = 13.8475, with a standard deviation of σX = 3.72.

[Figure: the normal density f(x) = N(x | 71.45, 13.8475), plotted over the range x ∈ [60, 80]]
(d) What is the probability of observing an age of 80 or higher?
Answer: If we leverage the empirical probability mass function, we get:

P (X ≥ 80) = 0/20 = 0

since we do not have anyone with age 80 or more in our sample.


We can also use the normal distribution model, with parameters µX = 71.45 and σX = 3.72, to get:
\[
P(X\ge 80) = \int_{80}^{\infty} N(x\mid \mu_X, \sigma_X^2)\,dx = 0.010769
\]

(e) Find the 2-dimensional mean µ̂ and the covariance matrix Σ̂ for these two variables.
Answer: The mean vector and covariance matrix are:
\[
\hat\mu = (\mu_X, \mu_Y)^T = (71.45, 164.7)^T
\qquad
\hat\Sigma = \begin{pmatrix}\sigma_X^2 & \sigma_{XY}\\ \sigma_{XY} & \sigma_Y^2\end{pmatrix} = \begin{pmatrix}13.8475 & 122.435\\ 122.435 & 1369.21\end{pmatrix}
\]

(f) What is the correlation between age and weight?


Answer:
\[
\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y} = \frac{122.435}{\sqrt{13.8475\cdot 1369.21}} = 0.889
\]

(g) Draw a scatterplot to show the relationship between age and weight.
Answer: The scatterplot is shown in the figure below.
[Figure: scatterplot of Y (Weight, roughly 110-265) against X (Age, roughly 65-79), showing a strong positive linear trend]
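The statistics above can be verified with a short numpy sketch (population formulas, i.e., dividing by n = 20):

```python
import numpy as np

X = np.array([69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
              72, 79, 74, 67, 66, 71, 74, 75, 75, 76])
Y = np.array([153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
              140, 265, 185, 112, 140, 150, 165, 185, 210, 220])

print(X.mean(), np.median(X))      # 71.45, 71.5
print(X.var(), Y.var())            # 13.8475, 1369.21 (ddof=0 by default)
print(np.cov(X, Y, bias=True))     # population covariance matrix
print(np.corrcoef(X, Y)[0, 1])     # ~0.889
```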

Q3. Show that the identity in Eq. (2.15) holds, that is,

\[
\sum_{i=1}^{n}(x_i-\mu)^2 = n(\hat\mu-\mu)^2 + \sum_{i=1}^{n}(x_i-\hat\mu)^2
\]

Answer: Consider the RHS:
\[
\begin{aligned}
n(\hat\mu-\mu)^2 + \sum_{i=1}^{n}(x_i-\hat\mu)^2
&= n(\hat\mu^2 - 2\hat\mu\mu + \mu^2) + \sum_{i=1}^{n}(x_i^2 - 2x_i\hat\mu + \hat\mu^2) \\
&= n\hat\mu^2 - 2n\hat\mu\mu + n\mu^2 + \Big(\sum_{i=1}^{n} x_i^2\Big) - 2n\hat\mu^2 + n\hat\mu^2 \\
&= \Big(\sum_{i=1}^{n} x_i^2\Big) - 2n\hat\mu\mu + n\mu^2 \\
&= \Big(\sum_{i=1}^{n} x_i^2\Big) - 2n\Big(\frac{\sum_{i=1}^{n} x_i}{n}\Big)\mu + \sum_{i=1}^{n}\mu^2 \\
&= \sum_{i=1}^{n}(x_i-\mu)^2
\end{aligned}
\]


Q4. Prove that if xi are independent random variables, then
\[
\mathrm{var}\Big(\sum_{i=1}^{n} x_i\Big) = \sum_{i=1}^{n}\mathrm{var}(x_i)
\]
This fact was used in Eq. (2.12).

Answer: We assume for simplicity that all the variables are discrete. A similar approach can be used for continuous variables.
Consider the random variable x1 + x2. Its mean is given as
\[
\mu_{x_1+x_2} = \sum_{x_1=a}\sum_{x_2=b}(a+b)f(a,b)
\]
Since x1 and x2 are independent, their joint probability mass function factorizes as f(x1, x2) = f(x1)·f(x2). Thus, the mean is given as
\[
\begin{aligned}
\mu_{x_1+x_2} &= \sum_{x_1=a}\sum_{x_2=b}(a+b)f(a)f(b)
= \sum_{x_1=a} f(a)\sum_{x_2=b}(a+b)f(b) \\
&= \sum_{x_1=a} f(a)\Big(\sum_{x_2=b} a f(b) + \sum_{x_2=b} b f(b)\Big)
= \sum_{x_1=a} f(a)\big(a + \mu_{x_2}\big)
= \mu_{x_1} + \mu_{x_2}
\end{aligned}
\]
In general, the expected value of the sum of the variables xi is the sum of their expected values, i.e.,
\[
E\Big[\sum_{i=1}^{n} x_i\Big] = \sum_{i=1}^{n} E[x_i]
\]
Now consider the variance of the sum of the random variables:
\[
\begin{aligned}
\mathrm{var}\Big(\sum_{i=1}^{n} x_i\Big)
&= E\Big[\Big(\sum_{i=1}^{n} x_i - E\Big[\sum_{i=1}^{n} x_i\Big]\Big)^2\Big]
= E\Big[\Big(\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} E[x_i]\Big)^2\Big]
= E\Big[\Big(\sum_{i=1}^{n}(x_i - E[x_i])\Big)^2\Big] \\
&= E\Big[\sum_{i=1}^{n}(x_i - E[x_i])^2 + 2\sum_{i=1}^{n}\sum_{j>i}(x_i - E[x_i])(x_j - E[x_j])\Big] \\
&= \sum_{i=1}^{n} E\big[(x_i - E[x_i])^2\big] + 2\sum_{i=1}^{n}\sum_{j>i}\mathrm{cov}(x_i, x_j)
= \sum_{i=1}^{n}\mathrm{var}(x_i)
\end{aligned}
\]
The last step follows from the fact that cov(xi, xj) = 0 since the variables are independent.

Q5. Define a measure of deviation called mean absolute deviation for a random variable X as follows:
\[
\frac{1}{n}\sum_{i=1}^{n}|x_i-\mu|
\]
Is this measure robust? Why or why not?

Answer: No, it is not robust, since a single outlier can skew the mean absolute
deviation.

Q6. Prove that the expected value of a vector random variable X = (X1 , X2 )T is simply the
vector of the expected value of the individual random variables X1 and X2 as given in
Eq. (2.18).

Answer: This follows directly from the definition of expectation of a vector random variable. When both X1 and X2 are discrete we have
\[
\mu = E[X] = \sum_{x} x f(x) = \sum_{x_1}\sum_{x_2}\begin{pmatrix}x_1\\x_2\end{pmatrix} f(x_1,x_2) = \begin{pmatrix}\mu_{X_1}\\ \mu_{X_2}\end{pmatrix}
\]
Likewise, when both X1 and X2 are continuous we have
\[
\mu = E[X] = \int_{x} x f(x)\,dx = \int_{x_1}\int_{x_2}\begin{pmatrix}x_1\\x_2\end{pmatrix} f(x_1,x_2)\,dx_1\,dx_2 = \begin{pmatrix}\mu_{X_1}\\ \mu_{X_2}\end{pmatrix}
\]
In more detail, assuming that both X1 and X2 are discrete, we have
\[
\mu = E\begin{pmatrix}X_1\\X_2\end{pmatrix} = \sum_{x_1,x_2}\begin{pmatrix}x_1\\x_2\end{pmatrix} f(x_1,x_2)
= \begin{pmatrix}\sum_{x_1,x_2} x_1 f(x_1,x_2)\\[2pt] \sum_{x_1,x_2} x_2 f(x_1,x_2)\end{pmatrix}
= \begin{pmatrix}\sum_{x_1} x_1 f(x_1)\\[2pt] \sum_{x_2} x_2 f(x_2)\end{pmatrix}
= \begin{pmatrix}E[X_1]\\ E[X_2]\end{pmatrix}
= \begin{pmatrix}\mu_{X_1}\\ \mu_{X_2}\end{pmatrix}
\]
where f(x1, x2) = P(X1 = x1, X2 = x2) is the joint probability mass function of X1 and X2, and f(x1) = Σ_{x2} f(x1, x2) and f(x2) = Σ_{x1} f(x1, x2) are the marginal probability mass functions of X1 and X2, respectively. Note that X1 and X2 do not have to be independent for the above to hold.


Q7. Show that the correlation [Eq. (2.23)] between any two random variables X1 and X2
lies in the range [−1, 1].

Answer: The Cauchy-Schwarz inequality states that for any two vectors x and y in an inner product space,
\[
|\langle x, y\rangle|^2 \le \langle x, x\rangle\cdot\langle y, y\rangle
\]
Define the inner product between two random variables X1 and X2 as
\[
\langle X_1, X_2\rangle = E[X_1 X_2]
\]
Expectation is a valid inner product since it satisfies the three conditions: (i) symmetry: E[X1X2] = E[X2X1]; (ii) positive-semidefiniteness: ⟨X1, X1⟩ = E[X1²] ≥ 0; and (iii) linearity: E[(aX1)X2] = aE[X1X2] and E[(X1 + Z)X2] = E[X1X2] + E[ZX2].
Then we have
\[
\begin{aligned}
|\sigma_{12}|^2 &= \mathrm{cov}(X_1,X_2)^2 = \big(E[(X_1-\mu_1)(X_2-\mu_2)]\big)^2 = \big|\langle X_1-\mu_1,\, X_2-\mu_2\rangle\big|^2 \\
&\le \langle X_1-\mu_1, X_1-\mu_1\rangle\cdot\langle X_2-\mu_2, X_2-\mu_2\rangle
= E[(X_1-\mu_1)^2]\cdot E[(X_2-\mu_2)^2] = \sigma_1^2\,\sigma_2^2
\end{aligned}
\]
Since |σ12| ≤ σ1·σ2, it follows that the correlation ρ12 = σ12/(σ1σ2) lies in the range [−1, 1].

Q8. Given the dataset in Table 2.1, compute the covariance matrix and the generalized
variance.

Table 2.1. Dataset for Q8

X1 X2 X3
x1 17 17 12
x2 11 9 13
x3 11 8 19

Answer: The covariance matrix is:
\[
\Sigma = \begin{pmatrix}8.0 & 11.33 & -5.33\\ 11.33 & 16.22 & -8.56\\ -5.33 & -8.56 & 9.56\end{pmatrix}
\]
The generalized variance is:
\[
\det(\Sigma) = -1.38\times 10^{-13} \approx 0
\]
which is essentially zero, since with only three points the centered data matrix has rank at most 2.
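As a cross-check (not part of the original solution), a small numpy sketch that computes the population covariance matrix and its determinant for the three points of Table 2.1:

```python
import numpy as np

D = np.array([[17, 17, 12],
              [11,  9, 13],
              [11,  8, 19]], dtype=float)

Z = D - D.mean(axis=0)            # center the data
Sigma = (Z.T @ Z) / len(D)        # population covariance (divide by n)
print(Sigma)                      # [[8.0, 11.33, -5.33], ...]
print(np.linalg.det(Sigma))       # ~0 (numerically about -1e-13)
```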


Q9. Show that the outer-product in Eq. (2.31) for the sample covariance matrix is
equivalent to Eq. (2.29).

Answer: Let zi = xi − µ̂ denote a centered data point. The outer-product form of the covariance matrix is given as:
\[
\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} z_i z_i^T
\]
Let us consider the entry in cell (j, k); we have:
\[
\hat\Sigma(j,k) = \frac{1}{n}\sum_{i=1}^{n} z_{ij} z_{ik} = \frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\hat\mu_j)(x_{ik}-\hat\mu_k) = \hat\sigma_{jk}
\]
which is exactly the covariance between the j-th and k-th attributes.

Q10. Assume that we are given two univariate normal distributions, NA and NB , and let
their mean and standard deviation be as follows: µA = 4, σA = 1 and µB = 8, σB = 2.
(a) For each of the following values xi ∈ {5, 6, 7} find out which is the more likely
normal distribution to have produced it.
Answer: If we plug each xi into the equation for the normal distribution, we obtain the following:

NA (5) = 0.242 NB (5) = 0.065


NA (6) = 0.054 NB (6) = 0.121
NA (7) = 0.004 NB (7) = 0.176

Based on these values, we can claim that NA is more likely to have produced 5,
but NB is more likely to have produced 6 and 7.
We can also solve this problem by finding the z-score for each value. We can
then assign a point to the distribution for which it has a lower z-score (in terms
of absolute value). For example, for 5, we have zA (5) = (5 − 4)/1 = 1, and
zB (5) = (5 − 8)/2 = −1.5. Since |zB | > |zA | we can claim that 5 comes from
NA .
For 6 and 7 we have:

zA(6) = (6 − 4)/1 = 2,  zB(6) = (6 − 8)/2 = −1
zA(7) = (7 − 4)/1 = 3,  zB(7) = (7 − 8)/2 = −0.5

Thus, these values are more likely to have been generated from NB .
(b) Derive an expression for the point for which the probability of having been
produced by both the normals is the same.
Answer: Plugging the parameters of NA and NB into the equation for the normal distribution, and setting the two densities equal, we obtain:
\[
\frac{1}{\sqrt{2\pi}}\,e^{-\frac{(x-4)^2}{2}} = \frac{1}{2\sqrt{2\pi}}\,e^{-\frac{(x-8)^2}{8}}
\qquad\Longrightarrow\qquad
2\,e^{\frac{(x-8)^2}{8}} = e^{\frac{(x-4)^2}{2}}
\]
Taking ln on both sides yields
\[
\begin{aligned}
\ln(2) + \frac{(x-8)^2}{8} &= \frac{(x-4)^2}{2} \\
2\ln(2) + \frac{x^2}{4} - 4x &= x^2 - 8x \\
\tfrac{3}{4}x^2 - 4x - 2\ln(2) &= 0 \\
0.75x^2 - 4x - 1.4 &= 0
\end{aligned}
\]
We can solve this equation using the quadratic formula x = (−b ± √(b² − 4ac))/(2a). Plugging in the values from above we get x = 5.67.
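The crossing point can be checked numerically with the sketch below (an illustrative verification, not part of the original solution; the density function is written out directly in numpy):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Solve 0.75*x^2 - 4*x - 2*ln(2) = 0 with the quadratic formula
a, b, c = 0.75, -4.0, -2 * np.log(2)
x = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
print(x)                                            # ~5.66
print(normal_pdf(x, 4, 1), normal_pdf(x, 8, 2))     # the two densities agree here
```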

Q11. Consider Table 2.2. Assume that both the attributes X and Y are numeric, and the
table represents the entire population. If we know that the correlation between X
and Y is zero, what can you infer about the values of Y?

Table 2.2. Dataset for Q11

X Y
1 a
0 b
1 c
0 a
0 c

Answer: Since the correlation is zero, we have cov(X, Y) = 0, which implies that E[XY] = E[X]E[Y]. From the data we have
E[XY] = (a + c)/5,  E[X] = 2/5,  E[Y] = (2a + 2c + b)/5
Equating E[XY] with E[X]E[Y] we get
(a + c)/5 = 2(2a + 2c + b)/25
5a + 5c = 4a + 4c + 2b
a + c = 2b
That is, b must be the average of a and c.

Q12. Under what conditions will the covariance matrix Σ be identical to the correlation matrix, whose (i, j) entry gives the correlation between attributes Xi and Xj? What can you conclude about the two variables?


Answer: If the covariance matrix equals the correlation matrix, then for all i and j we have
\[
\rho_{ij} = \sigma_{ij}
\quad\Longrightarrow\quad
\frac{\sigma_{ij}}{\sigma_i\sigma_j} = \sigma_{ij}
\quad\Longrightarrow\quad
\sigma_i\sigma_j = 1
\]
In particular, taking i = j gives σi² = 1, so every attribute must have unit variance (σi = 1 for all i). In other words, the covariance matrix coincides with the correlation matrix exactly when the variables are standardized to unit variance.

CHAPTER 3 Categorical Attributes

3.7 EXERCISES

Q1. Show that for categorical points, the cosine similarity between any two vectors lies in the range cos θ ∈ [0, 1], and consequently θ ∈ [0°, 90°].

Answer: From Section 3.4, we have:
\[
\cos\theta = \frac{s}{d}
\]
where s is the number of matching values between the two categorical vectors xi and xj, and d is the number of attributes. Since s ∈ [0, d], it follows that cos θ ∈ [0, 1], and consequently θ ∈ [0°, 90°].

Q2. Prove that E[(X1 − µ1)(X2 − µ2)^T] = E[X1 X2^T] − E[X1]E[X2]^T.

Answer:
\[
\begin{aligned}
E[(X_1-\mu_1)(X_2-\mu_2)^T] &= E[X_1X_2^T - X_1\mu_2^T - \mu_1 X_2^T + \mu_1\mu_2^T] \\
&= E[X_1X_2^T] - E[X_1]\mu_2^T - \mu_1 E[X_2]^T + \mu_1\mu_2^T \\
&= E[X_1X_2^T] - E[X_1]E[X_2]^T - E[X_1]E[X_2]^T + E[X_1]E[X_2]^T \\
&= E[X_1X_2^T] - E[X_1]E[X_2]^T
\end{aligned}
\]

Table 3.1. Contingency table for Q3

Z=f Z=g
Y=d Y=e Y=d Y=e
X=a 5 10 10 5
X=b 15 5 5 20
X=c 20 10 25 10


Table 3.2. χ 2 Critical values for different p-values for different degrees of freedom (q): For example, for
q = 5 degrees of freedom, the critical value of χ 2 = 11.070 has p-value = 0.05.

q 0.995 0.99 0.975 0.95 0.90 0.10 0.05 0.025 0.01 0.005
1 — — 0.001 0.004 0.016 2.706 3.841 5.024 6.635 7.879
2 0.010 0.020 0.051 0.103 0.211 4.605 5.991 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.833 15.086 16.750
6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 18.548

Q3. Consider the 3-way contingency table for attributes X, Y, Z shown in Table 3.1.
Compute the χ 2 metric for the correlation between Y and Z. Are they dependent
or independent at the 95% confidence level? See Table 3.2 for χ 2 values.

Answer: Summing out X, we have the new 2-way contingency table between Y and
Z, along with the row/col marginal frequencies:

Z=f Z=g
Y=d 40 40 80
Y=e 25 35 60
65 75 140

The expected counts in each cell are then given as follows:

Z=f Z=g
Y=d (80 · 65)/140 = 37.14 (80 · 75)/140 = 42.86
Y=e (60 · 65)/140 = 27.86 (60 · 75)/140 = 32.14

Subtracting the expected from the observed values, and squaring the differences, we get:

Z=f Z=g
Y=d 2.86² = 8.16  (−2.86)² = 8.16
Y=e (−2.86)² = 8.16  2.86² = 8.16

Dividing by the expected counts, gives:

Z=f Z=g
Y=d 0.22 0.19
Y=e 0.29 0.25

Finally, summing all these values we obtain χ 2 = 0.22 + 0.19 + 0.29 + 0.25 = 0.95.
With one degree of freedom, the p-value is 0.33; equivalently, χ² = 0.95 is well below the 0.05 critical value of 3.841 for one degree of freedom. Thus we cannot reject the null hypothesis, and we conclude that the two variables are independent.
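For reference, the same test can be run with scipy (a sketch; scipy.stats.chi2_contingency returns the statistic, p-value, degrees of freedom, and expected counts; correction=False disables Yates' continuity correction so the result matches the hand computation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts for Y (rows d, e) vs Z (columns f, g), after summing out X
obs = np.array([[40, 40],
                [25, 35]])

chi2, pval, dof, expected = chi2_contingency(obs, correction=False)
print(chi2, pval, dof)     # ~0.95, ~0.33, 1
print(expected)            # [[37.14, 42.86], [27.86, 32.14]]
```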

Q4. Consider the “mixed” data given in Table 3.3. Here X1 is a numeric attribute and
X2 is a categorical one. Assume that the domain of X2 is given as dom(X2 ) = {a, b}.
Answer the following questions.

(a) What is the mean vector for this dataset?


(b) What is the covariance matrix?

Answer: We can model the categorical attribute X2 via a bivariate indicator variable X2, defined as follows:
\[
\mathbf{X}_2 = \begin{cases}(1,0)^T & \text{if } X_2 = a\\ (0,1)^T & \text{if } X_2 = b\end{cases}
\]
The mean for X1 is 0.0, whereas the mean vector for X2 is (0.6, 0.4)^T. Therefore, the mean vector for both attributes is µ = (0.0 | 0.6, 0.4)^T.
The covariance matrix, with rows and columns ordered as (X1, X2 = a, X2 = b), is given as
\[
\hat\Sigma = \begin{pmatrix}0.92 & -0.15 & 0.15\\ -0.15 & 0.24 & -0.24\\ 0.15 & -0.24 & 0.24\end{pmatrix}
\]

Q5. For Table 3.3, assume that X1 is discretized into three bins, as follows:

c1 = (−2, −0.5]
c2 = (−0.5, 0.5]
c3 = (0.5, 2]

Answer the following questions:


(a) Construct the contingency table between the discretized X1 and X2 attributes.
Include the marginal counts.
(b) Compute the χ 2 statistic between them.
(c) Determine whether they are dependent or not at the 5% significance level. Use
the χ 2 critical values from Table 3.2.

Table 3.3. Dataset for Q4 and Q5

X1 X2
0.3 a
−0.3 b
0.44 a
−0.60 a
0.40 a
1.20 b
−0.12 a
−1.60 b
1.60 b
−1.32 a


Answer: The contingency table is given as:

a b
c1 2 1 3
c2 4 1 5
c3 0 2 2
6 4 10

The expected counts are:

a b
c1 1.8 1.2
c2 3 2
c3 1.2 0.8

The degrees of freedom are 2, and the chi-square value is χ² = 3.89, with p-value = 0.143. At the 5% significance level we cannot reject the null hypothesis that the two attributes are independent.

CHAPTER 4 Graph Data

4.6 EXERCISES

Q1. Given the graph in Figure 4.1, find the fixed-point of the prestige vector.

[Figure 4.1. Graph for Q1: a directed graph on the vertices a, b, c]

Answer: The adjacency matrix and its transpose for the graph are:
\[
A = \begin{pmatrix}0&1&1\\1&0&0\\0&1&0\end{pmatrix}\qquad
A^T = \begin{pmatrix}0&1&0\\1&0&1\\1&0&0\end{pmatrix}
\]
Let p0 = (1, 1, 1)^T. We can do successive multiplications on the left by A^T, dividing by the maximum value (after several iterations), to obtain the prestige vector:

(1,1,1)^T → (1,2,1)^T → (2,2,1)^T → (2,3,2)^T → (3,4,2)^T → (4,5,3)^T → (5,7,4)^T → (7,9,5)^T → (9,12,7)^T → (12,16,9)^T → (16,21,12)^T → (21,28,16)^T → (28,37,21)^T ∝ (0.76, 1.0, 0.57)^T → (37,49,28)^T ∝ (0.76, 1.0, 0.57)^T

We can observe that the prestige vector converges to p = (0.76, 1, 0.57)^T, or after normalization to p = (0.548, 0.726, 0.415)^T.
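The fixed point can be reproduced with a few lines of numpy (a sketch of the power iteration described above):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)

p = np.ones(3)
for _ in range(50):
    p = A.T @ p                  # one power-iteration step
    p = p / np.linalg.norm(p)    # normalize to unit length

print(p)   # ~[0.548, 0.726, 0.415]
```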

Q2. Given the graph in Figure 4.2, find the fixed-point of the authority and hub vectors.

[Figure 4.2. Graph for Q2: a directed graph on the vertices a, b, c]

Answer: The adjacency matrix and its transpose for the graph are:
\[
A = \begin{pmatrix}0&1&0\\1&0&1\\1&1&1\end{pmatrix}\qquad
A^T = \begin{pmatrix}0&1&1\\1&0&1\\0&1&1\end{pmatrix}
\]
Let h = (1, 1, 1)^T. We can do successive multiplications, alternating between A^T and A, to obtain the authority and hub vectors (via a = A^T h and h = A a):

h = (1,1,1)^T → a = (2,2,2)^T → h = (2,4,6)^T → a = (10,8,10)^T → h = (8,20,28)^T → a = (48,36,48)^T → h = (36,96,132)^T ∝ (0.27, 0.73, 1.0)^T → a = (228,168,228)^T ∝ (1.0, 0.74, 1.0)^T → h = (168,456,624)^T ∝ (0.27, 0.73, 1.0)^T → a = (1080,792,1080)^T ∝ (1.0, 0.73, 1.0)^T

We can observe that the authority and hub vectors converge to:
\[
a = (1, 0.73, 1.0)^T \propto (0.63, 0.46, 0.63)^T
\qquad
h = (0.27, 0.73, 1.0)^T \propto (0.21, 0.58, 0.79)^T
\]

Q3. Consider the double star graph given in Figure 4.3 with n nodes, where only nodes
1 and 2 are connected to all other vertices, and there are no other links. Answer the
following questions (treating n as a variable).
(a) What is the degree distribution for this graph?

[Figure 4.3. Double star graph for Q3: vertices 1 and 2 are connected to each other and to all of the vertices 3, ..., n]

Answer: The degree distribution is given as

k      f(k)
2      (n − 2)/n
n − 1  2/n

(b) What is the mean degree?


Answer: The average or mean degree is:

2(n − 2)/n + 2(n − 1)/n = (2n − 4 + 2n − 2)/n = (4n − 6)/n

(c) What is the clustering coefficient for vertex 1 and vertex 3?


Answer: The clustering coefficients for node 1 and node 3 are:
\[
C(1) = \frac{n-2}{\binom{n-1}{2}} = \frac{2(n-2)}{(n-1)(n-2)} = \frac{2}{n-1}
\qquad\qquad
C(3) = 1
\]

(d) What is the clustering coefficient C(G) for the entire graph? What happens to
the clustering coefficient as n → ∞?
Answer: The average clustering coefficient for the graph is:
\[
C(G) = \frac{(n-2)\cdot 1 + 2\cdot\frac{2}{n-1}}{n} = \frac{n^2-3n+6}{n^2-n}
\]
As n → ∞, C(G) → 1.
(e) What is the transitivity T(G) for the graph? What happens to T(G) as n → ∞?
Answer: The transitivity is given as:
\[
T(G) = \frac{3(n-2)}{2\binom{n-1}{2} + (n-2)} = \frac{3(n-2)}{(n-1)(n-2)+(n-2)} = \frac{3}{n}
\]
As n → ∞, T(G) → 0.
(f) What is the average path length for the graph?
Answer: The average path length can be computed as follows. Paths from vertex 1 to the other n − 1 vertices have length 1; paths from vertex 2 to the remaining n − 2 vertices have length 1; and paths between any two vertices i, j ≥ 3 have length 2.
The sum of all the path lengths over pairs of nodes is therefore:
\[
(n-1) + (n-2) + 2\binom{n-2}{2} = 2n-3 + (n-2)(n-3) = n^2 - 3n + 3
\]
But there are \(\binom{n}{2} = n(n-1)/2\) pairs. Thus, the average path length is 2(n² − 3n + 3)/(n² − n). As n → ∞, the average path length tends to 2.
(g) What is the betweenness value for node 1?
Answer: There are n − 2 shortest paths from vertex 2 to the vertices i ≥ 3, and these do not pass through vertex 1. Furthermore, for each of the \(\binom{n-2}{2}\) pairs of vertices i, j ≥ 3 there are 2 shortest paths, only half of which go through vertex 1. The betweenness of vertex 1 is then given as:
\[
\frac{\binom{n-2}{2}}{2\binom{n-2}{2}+(n-1)} = \frac{(n-2)(n-3)/2}{(n-2)(n-3)+(n-1)} = \frac{n^2-5n+6}{2(n^2-4n+5)}
\]
As n → ∞, the betweenness goes to 1/2.
(h) What is the degree variance for the graph?
Answer: The variance of the degree can be computed as follows:
\[
E[X^2] = \frac{2(n-1)^2 + 4(n-2)}{n} = \frac{2n^2-6}{n}
\qquad
E[X]^2 = \Big(\frac{4n-6}{n}\Big)^2 = \frac{16n^2-48n+36}{n^2}
\]
\[
\mathrm{var}(X) = E[X^2]-E[X]^2 = 2n - 16 + \frac{42}{n} - \frac{36}{n^2}
\]

Q4. Consider the graph in Figure 4.4. Compute the hub and authority score vectors.
Which nodes are the hubs and which are the authorities?
[Figure 4.4. Graph for Q4: a directed graph on the vertices 1-5]

Answer: The adjacency matrix is given as:
\[
A = \begin{pmatrix}0&0&1&0&0\\ 1&0&1&0&0\\ 0&0&0&0&1\\ 0&1&1&0&0\\ 0&0&0&1&0\end{pmatrix}
\]

For the hubs we can directly compute the AA^T matrix and find its dominant eigenvector:
\[
AA^T = \begin{pmatrix}1&1&0&1&0\\ 1&2&0&1&0\\ 0&0&1&0&0\\ 1&1&0&2&0\\ 0&0&0&0&1\end{pmatrix}
\]
Starting with the initial vector h0 = (1, 1, 1, 1, 1)T , successive iterations give:

h1 = (3, 4, 1, 4, 1)T
h2 = (11, 15, 1, 15, 1)T
h3 = (41, 56, 1, 56, 1)T
h4 = (153, 209, 1, 209, 1)T

After normalizing h4 the hub scores vector is h = (0.46, 0.63, 0.0, 0.63, 0.0)T . The
eigenvalue is obtained as the ratio of the maximum element from the last two
iterations, i.e., 209/56 = 3.73.
For the authorities we can directly compute the A^T A matrix and find its dominant eigenvector:
\[
A^TA = \begin{pmatrix}1&0&1&0&0\\ 0&1&1&0&0\\ 1&1&3&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\end{pmatrix}
\]

Starting with the initial vector a0 = (1, 1, 1, 1, 1)T , successive iterations give:

a1 = (2, 2, 5, 1, 1)T
a2 = (7, 7, 19, 1, 1)T
a3 = (26, 26, 71, 1, 1)T
a4 = (97, 97, 265, 1, 1)T

After normalizing a4, the authority score vector is a = (0.325, 0.325, 0.888, 0.003, 0.003)^T. The eigenvalue is obtained as the ratio of the maximum element from the last two iterations, i.e., 265/71 = 3.73.
Vertices 2 and 4 are the good hubs; both point to the high authority node 3.
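The hub and authority scores can be checked with a short HITS-style power-iteration sketch (a verification aid, not part of the original solution):

```python
import numpy as np

A = np.array([[0, 0, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 1, 0]], dtype=float)

h = np.ones(5)
a = np.ones(5)
for _ in range(100):
    a = A.T @ h                   # authority scores from hub scores
    h = A @ a                     # hub scores from authority scores
    a, h = a / np.linalg.norm(a), h / np.linalg.norm(h)

print(np.round(h, 2))   # ~[0.46, 0.63, 0.0, 0.63, 0.0]
print(np.round(a, 2))   # ~[0.33, 0.33, 0.89, 0.0, 0.0]
```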

Q5. Prove that in the BA model at time-step t + 1, the probability πt (k) that some node
with degree k in Gt is chosen for preferential attachment is given as

\[
\pi_t(k) = \frac{k\cdot n_t(k)}{\sum_i i\cdot n_t(i)}
\]


Answer: Based on Eq. (4.14), node vj is chosen with probability πt(vj) proportional to its degree in Gt:
\[
\pi_t(v_j) = \frac{d_j}{\sum_{v_i\in G_t} d_i}
\]
Note that the denominator can also be written as
\[
\sum_{v_i\in G_t} d_i = \sum_i i\cdot n_t(i)
\]
Thus, a node vj with degree k is chosen with probability
\[
\pi_t(v_j) = \frac{k}{\sum_i i\cdot n_t(i)}
\]
Since there are nt(k) nodes with degree k, the probability that some node with degree k is chosen is given as
\[
\pi_t(k) = \frac{k\cdot n_t(k)}{\sum_i i\cdot n_t(i)}
\]

CHAPTER 5 Kernel Methods

5.6 EXERCISES

Q1. Prove that the dimensionality of the feature space for the inhomogeneous polynomial kernel of degree q is
\[
m = \binom{d+q}{q}
\]
Answer: The dimensionality of the feature space for the inhomogeneous poly-
nomial kernel of degree q is equivalent to the number of possible non-negative
numbers n0 , n1 , ..., nd that sum up to q. This number is in turn equivalent to the
number of (d + 1)-partitions of q objects, where some partitions may be empty (a
k-partition is a partitioning of the objects into k disjoint parts).
Let us denote an object with the symbol ∗, and let us denote a partition boundary
with the symbol |. For instance, let q = 2, and let d = 2, and consider the following
partitioning: (∗ ∗ ||). This is mapped to the non-negative numbers n0 = 2, n1 = 0, n2 =
0, since there are two stars in the first partition, and none in the other two partitions.
Likewise, the partitioning (∗||∗) is mapped to the numbers n0 = 1, n1 = 0, n2 = 1, and
so on.
There is a one-to-one mapping between the (d + 1)-partitions and the sets of numbers n0, n1, ..., nd. Now the number of (d + 1)-partitions can be obtained by choosing d boundary symbols | out of a total of q + d objects, comprising the q distinct objects and the d boundary symbols, i.e., \(\binom{d+q}{d} = \binom{d+q}{q}\).

Q2. Consider the data shown in Table 5.1. Assume the following kernel function: K(xi, xj) = ‖xi − xj‖². Compute the kernel matrix K.

Answer: The kernel between x1 and x2 is given as:
\[
K(x_1, x_2) = \|x_1-x_2\|^2 = 1.5^2 + 1.9^2 = 5.86
\]
The complete kernel matrix is given as:
\[
K = \begin{pmatrix}0 & 5.86 & 1.46 & 4.64\\ 5.86 & 0 & 10 & 1.46\\ 1.46 & 10 & 0 & 5.86\\ 4.64 & 1.46 & 5.86 & 0\end{pmatrix}
\]

Table 5.1. Dataset for Q2

i xi
x1 (4, 2.9)
x2 (2.5, 1)
x3 (3.5, 4)
x4 (2, 2.1)

Q3. Show that the eigenvectors of S and S^l are identical, and further that the eigenvalues of S^l are given as (λi)^l (for all i = 1, ..., n), where λi is an eigenvalue of S, and S is some n × n symmetric similarity matrix.

Answer: Since S is symmetric it has real eigenvalues, and its eigendecomposition is given as:
\[
S = U\Lambda U^T
\]
where each column ui of U is an eigenvector of S and Λ is the diagonal matrix of eigenvalues λi arranged in decreasing order of magnitude |λ1| ≥ |λ2| ≥ ... ≥ |λn|.
It is well known that the eigendecomposition of S^l is given as
\[
S^l = U\Lambda^l U^T
\]
That is, S and S^l have the same eigenvectors, and for each eigenvalue λi of S, the corresponding eigenvalue of S^l is λi^l.
Given that S ui = λi ui, we can derive the eigenvalues and eigenvectors of S^l directly as follows:
\[
S^l u_i = S^{l-1}(S u_i) = \lambda_i S^{l-1}u_i = \lambda_i S^{l-2}(S u_i) = \lambda_i^2 S^{l-2}u_i = \cdots = \lambda_i^l u_i
\]
Thus, ui is an eigenvector of S^l and λi^l is the corresponding eigenvalue.

Q4. The von Neumann diffusion kernel is a valid positive semidefinite kernel if |β| < 1/ρ(S), where ρ(S) is the spectral radius of S. Can you derive better bounds for the cases when β > 0 and when β < 0?

Answer: For K to be a positive semidefinite kernel, we require that 1 − βλi ≥ 0 for all i.
Let λP and λN denote the largest-magnitude positive and negative eigenvalues of S, respectively. If β > 0, the negative eigenvalues always satisfy 1 − βλi > 0, so the constraint only applies to the positive eigenvalues. We require 1 − βλP ≥ 0, which implies that β ≤ 1/λP.


When β < 0, the positive eigenvalues always satisfy 1 − βλi > 0, so the constraint applies only to the negative eigenvalues. For the largest-magnitude negative eigenvalue we require 1 − βλN > 0, i.e., 1 − |β|·|λN| > 0, which means that |β| < 1/|λN|.
Putting the two conditions together, we have |β| < 1/max{|λP|, |λN|} = 1/ρ(S). The better bounds are: when β > 0 we need only β < 1/λP, regardless of how large |λN| is, and when β < 0 we need only |β| < 1/|λN|, regardless of how large |λP| is.

Q5. Given the three points x1 = (2.5, 1)T , x2 = (3.5, 4)T , and x3 = (2, 2.1)T .
(a) Compute the kernel matrix for the Gaussian kernel assuming that σ 2 = 5.
Answer: We have
\[
\|x_1-x_2\|^2 = 1^2+3^2 = 10,\qquad
\|x_1-x_3\|^2 = 0.5^2+1.1^2 = 1.46,\qquad
\|x_2-x_3\|^2 = 1.5^2+1.9^2 = 5.86
\]
The Gaussian kernel matrix is then given as follows:
\[
K = \begin{pmatrix}1 & e^{-10/10} & e^{-1.46/10}\\ e^{-10/10} & 1 & e^{-5.86/10}\\ e^{-1.46/10} & e^{-5.86/10} & 1\end{pmatrix}
= \begin{pmatrix}1 & 0.37 & 0.86\\ 0.37 & 1 & 0.56\\ 0.86 & 0.56 & 1\end{pmatrix}
\]

(b) Compute the distance of the point φ(x1 ) from the mean in feature space.
Answer: The squared distance in terms of kernel operations is given as
\[
\|\phi(x_1)-\mu_\phi\|^2 = K(x_1,x_1) - \frac{2}{3}\sum_{j=1}^{3}K(x_1,x_j) + \frac{1}{9}\sum_i\sum_j K(x_i,x_j) = 1 - \frac{2}{3}(2.23) + 0.73 = 0.24
\]
The distance is therefore √0.24 = 0.49.
(c) Compute the dominant eigenvector and eigenvalue for the kernel matrix
from (a).
Answer: We use the power iteration method to find the dominant eigenvector,
as follows: Let x0 = (1, 1, 1)T , then we get Kx0 = (2.23, 1.92, 2.42)T , and after
scaling by the maximum value we have x1 = (0.92, 0.79, 1)T .
For next round we have Kx1 = (2.08, 1.69, 2.24)T , and after scaling we have
x2 = (0.93, 0.76, 1)T .
For next round we have Kx2 = (2.07, 1.65, 2.22)T , and after scaling we have
x3 = (0.93, 0.74, 1)T .
For next round we have Kx3 = (2.07, 1.64, 2.22)T , and after scaling we have
x4 = (0.93, 0.74, 1)T .
Thus, we conclude that λ1 = 2.22, and u1 = x4/‖x4‖ = (1/1.55)·(0.93, 0.74, 1)^T = (0.6, 0.48, 0.64)^T.
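All three parts can be verified with the numpy sketch below (a cross-check, not part of the original solution; note that eigenvectors are determined only up to sign):

```python
import numpy as np

X = np.array([[2.5, 1.0], [3.5, 4.0], [2.0, 2.1]])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
K = np.exp(-sq / (2 * 5.0))                               # Gaussian kernel, sigma^2 = 5
print(np.round(K, 2))

# (b) squared distance of phi(x1) from the mean in feature space
d2 = K[0, 0] - 2 * K[0].mean() + K.mean()
print(d2, np.sqrt(d2))          # ~0.24, ~0.49

# (c) dominant eigenvalue/eigenvector of K
w, V = np.linalg.eigh(K)        # eigenvalues in ascending order
print(w[-1], V[:, -1])          # ~2.22 and ~(0.60, 0.48, 0.64) up to sign
```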

CHAPTER 6 High-dimensional Data

6.9 EXERCISES

Q1. Given the gamma function in Eq. (6.6), show the following:
(a) Γ(1) = 1
Answer:
\[
\Gamma(1) = \int_0^\infty x^{0}e^{-x}\,dx = \int_0^\infty e^{-x}\,dx = \Big[-e^{-x}\Big]_0^\infty = 0 + e^0 = 1
\]

(b) Γ(1/2) = √π
Answer: Let y = x^{1/2}; then dy/dx = (1/2)x^{−1/2}, which implies that dx = 2x^{1/2} dy = 2y dy. Substituting y = x^{1/2} in Γ(1/2) = ∫_0^∞ x^{−1/2} e^{−x} dx, we get
\[
\Gamma(1/2) = 2\int_0^\infty y^{-1}e^{-y^2}\,y\,dy = 2\int_0^\infty e^{-y^2}dy = 2\cdot\frac{\sqrt\pi}{2}\cdot\Big[\operatorname{erf}(z)\Big]_0^\infty
\]
where erf(z) is the Gauss error function
\[
\operatorname{erf}(z) = \frac{2}{\sqrt\pi}\int_0^z e^{-y^2}dy
\]
It is well known that erf(0) = 0 and erf(∞) = 1, thus we have
\[
\Gamma(1/2) = \sqrt\pi\big(\operatorname{erf}(\infty)-\operatorname{erf}(0)\big) = \sqrt\pi(1-0) = \sqrt\pi
\]


(c) Γ(α) = (α − 1)Γ(α − 1)
Answer: We use integration by parts. Let u = x^{α−1}, so that du = (α − 1)x^{α−2} dx. Next, let dv = e^{−x} dx, so that v = ∫ dv = ∫ e^{−x} dx = −e^{−x}. The integration by parts formula states that
\[
\int u\,dv = uv - \int v\,du
\]
Substituting from above, we have
\[
\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}dx = \Big[-x^{\alpha-1}e^{-x}\Big]_0^\infty + \int_0^\infty e^{-x}(\alpha-1)x^{\alpha-2}dx = 0 + (\alpha-1)\Gamma(\alpha-1)
\]

Q2. Show that the asymptotic volume of the hypersphere Sd (r) for any value of radius r
eventually tends to zero as d increases.

Answer: The volume of Sd(r) is given as
\[
\mathrm{vol}(S_d(r)) = \frac{(\pi r^2)^{d/2}}{\Gamma(d/2+1)}
\]
For any given r, πr² is a constant C, so the numerator is simply C^{d/2}.
Consider the case when d is even; then Γ(d/2 + 1) = (d/2)!, and thus
\[
\lim_{d\to\infty}\frac{C^{d/2}}{(d/2)!} \to 0
\]
because (d/2)! will eventually exceed any constant raised to the power d/2. More precisely, using Stirling's approximation n! ≈ √(2πn)(n/e)^n, we have
\[
(d/2)! \simeq \sqrt{\pi d}\,(d/2e)^{d/2}
\]
so that
\[
\lim_{d\to\infty}\frac{C^{d/2}}{(d/2)!} = \lim_{d\to\infty}\frac{1}{\sqrt{\pi d}}\left(\frac{2Ce}{d}\right)^{d/2}\to 0
\]
The last step follows since d eventually exceeds 2Ce.
When d is odd, we have Γ(d/2 + 1) = √π d!!/2^{(d+1)/2}. We first derive an expression for d!!. Since d is odd it can be written as 2n + 1 for some integer n. Now consider the following:
\[
(2n+1)!!\;2^n n! = \big((2n+1)(2n-1)\cdots 1\big)\big(2n\cdot 2(n-1)\cdots 2\big) = (2n+1)!
\]
Since d = 2n + 1, we have d!! = d!/\big(2^{(d-1)/2}((d-1)/2)!\big). We thus have
\[
\lim_{d\to\infty}\frac{C^{d/2}}{\Gamma(d/2+1)} = \lim_{d\to\infty}\frac{1}{\sqrt\pi}\cdot\frac{((d-1)/2)!}{d!}\,(4C)^{d/2} \simeq \lim_{d\to\infty}(4C/d)^{d/2}\to 0
\]
Again, d will eventually exceed 4C and thus the volume goes to 0.
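The vanishing volume is easy to see numerically (a sketch for r = 1; the log-gamma function is used to avoid overflow):

```python
import math

def ball_volume(d, r=1.0):
    # vol(S_d(r)) = pi^(d/2) r^d / Gamma(d/2 + 1)
    return math.exp((d / 2) * math.log(math.pi) + d * math.log(r)
                    - math.lgamma(d / 2 + 1))

for d in [1, 2, 3, 5, 10, 20, 50, 100]:
    print(d, ball_volume(d))   # peaks around d = 5, then rapidly tends to 0
```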

[Figure 6.1. Shape of the L0.5 ball of radius 0.5 in 2D, inscribed inside the unit square]

Q3. The ball with center c ∈ R^d and radius r is defined as
\[
B_d(c, r) = \big\{x\in\mathbb{R}^d \mid \delta(x, c)\le r\big\}
\]
where δ(x, c) is the distance between x and c, which can be specified using the Lp-norm:
\[
L_p(x, c) = \left(\sum_{i=1}^{d}|x_i-c_i|^p\right)^{1/p}
\]
where p ≠ 0 is any real number. The distance can also be specified using the L∞-norm:
\[
L_\infty(x, c) = \max_i\big\{|x_i-c_i|\big\}
\]
Answer the following questions:
(a) For d = 2, sketch the shape of the hyperball inscribed inside the unit square, using
the Lp -distance with p = 0.5 and with center c = (0.5, 0.5)T .
Answer: Using radius r = 0.5, d = 2 and p = 0.5, we get Bd as the set of all points that satisfy the equation
\[
\big(|x_1-0.5|^{0.5} + |x_2-0.5|^{0.5}\big)^2 = 0.5
\quad\Longleftrightarrow\quad
|x_1-0.5|^{0.5} + |x_2-0.5|^{0.5} - \sqrt{0.5} = 0
\]
The shape of the ball is plotted in Fig. 6.1.


(b) With d = 2 and c = (0.5, 0.5)T , using the L∞ -norm, sketch the shape of the ball of
radius r = 0.25 inside a unit square.
Answer: Using radius r = 0.25, d = 2 and p = ∞, we get Bd as the set of all
points that satisfy the equation

max {|x1 − 0.5|, |x2 − 0.5|} = 0.25

The shape of the ball is plotted in Fig. 6.2.

[Figure 6.2. Shape of the L∞ ball of radius 0.25 in 2D, inside the unit square]

[Figure 6.3. For Q4: corner squares of side ε inside the unit square]

(c) Compute the formula for the maximum distance between any two points in
the unit hypercube in d dimensions, when using the Lp -norm. What is the
maximum distance for p = 0.5 when d = 2? What is the maximum distance for the
L∞ -norm?
Answer: Let one corner be 0 = (0, 0, ..., 0). The diagonally opposite corner is 1 = (1, 1, ..., 1). The maximum Lp distance between them is
\[
\left(\sum_{i=1}^{d}|1-0|^p\right)^{1/p} = d^{1/p}
\]
Thus, when p = 0.5 and d = 2, the maximum distance is 2² = 4.
The maximum distance for the L∞-norm is
\[
\max_{i=1}^{d}\{|1-0|\} = 1
\]

Q4. Consider the corner hypercubes of length ǫ ≤ 1 inside a unit hypercube. The
2-dimensional case is shown in Figure 6.3. Answer the following questions:

(a) Let ǫ = 0.1. What is the fraction of the total volume occupied by the corner cubes
in two dimensions?
Answer: Each corner occupies ε² = 0.1² = 0.01 volume. Since there are four corners, the combined corner volume is 0.04, which is 4% of the total volume of the unit square.
(b) Derive an expression for the volume occupied by all of the corner hypercubes of
length ǫ < 1 as a function of the dimension d. What happens to the fraction of the
volume in the corners as d → ∞?
Answer: In d-dimensions, there are 2d corners, and the volume of each corner
is ǫ d . Thus, the fraction of the volume in the corners is (2ǫ)d .
As d → ∞, we can see that if ǫ < 0.5, then the volume goes to 0, otherwise, if
ǫ = 0.5, then the corner hypercubes span the entire space, and the volume is 1.
It is reasonable to assume that ǫ ≤ 0.5, since otherwise the corner hypercubes
will overlap, and the combined volume will increase without bound.
(c) What is the fraction of volume occupied by the thin hypercube shell of width ǫ < 1
as a fraction of the total volume of the outer (unit) hypercube, as d → ∞? For
example, in two dimensions the thin shell is the space between the outer square
(solid) and inner square (dashed).
Answer: The edge length of the inner hypercube is 1 − 2ε, and thus its volume is (1 − 2ε)^d. The volume of the outer unit hypercube is 1^d = 1. Thus, the volume of the thin shell is given as:
\[
1 - (1-2\epsilon)^d
\]
Since the volume must be non-negative, it follows that ε ≤ 0.5. In this case, as d → ∞ the volume of the shell approaches 1, i.e., it contains essentially all of the volume of the outer hypercube.
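A quick numeric illustration of parts (b) and (c) for ε = 0.1 (a sketch, using the expressions derived above):

```python
eps = 0.1
for d in [1, 2, 3, 5, 10, 20, 50]:
    corner_fraction = (2 * eps) ** d          # total volume of the 2^d corner cubes
    shell_fraction = 1 - (1 - 2 * eps) ** d   # volume of the thin shell of width eps
    print(d, corner_fraction, shell_fraction) # corners -> 0, shell -> 1
```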

Q5. Prove Eq. (6.14), that is, lim_{d→∞} P(x^T x ≤ −2 ln(α)) → 0, for any α ∈ (0, 1) and x ∈ R^d.

Answer: Let Y = x^T x. We know that Y follows a χ² distribution with d degrees of freedom. It is known that such a Y can be approximated by a normal distribution with mean µ = d and variance σ² = 2d. In terms of the standard normal distribution we have
\[
Z = (Y-\mu)/\sigma = (Y-d)/\sqrt{2d}
\]
We want to find P(Y ≤ −2 ln(α)), or equivalently P(Z ≤ (−2 ln(α) − d)/√(2d)). Notice that
\[
\lim_{d\to\infty}\frac{-2\ln(\alpha)-d}{\sqrt{2d}} = \lim_{d\to\infty}\frac{-d}{\sqrt{2d}} = \lim_{d\to\infty}-\sqrt{d/2} = -\infty
\]
The probability of being infinitely many standard deviations below the mean is essentially 0, i.e., lim_{d→∞} P(Z ≤ (−2 ln(α) − d)/√(2d)) = 0.

Q6. Consider the conceptual view of high-dimensional space shown in Figure 6.4. Derive an expression for the radius of the inscribed circle, so that the area in the spokes accurately reflects the difference between the volume of the hypercube and the inscribed hypersphere in d dimensions. For instance, if the length of a half-diagonal is fixed at 1, then the radius of the inscribed circle is 1/√2 in Figure 6.4a.

[Figure 6.4. Computing the radius for d = 3: one sector of the inscribed circle of radius r, with inner triangle of base b and height h, and angles α1, α2]

Answer: The difference between the volume of the hypercube and the hypersphere is given as
\[
\delta_{hs} = (2r)^d - K_d r^d = (2^d - K_d)r^d = 2^d - K_d
\]
assuming r = 1, and where Kd is as given in Eq. (6.5).
In d dimensions, the inscribed circle is divided into 2^d sectors, with the area of each sector being πr²/2^d.
Now consider the triangle defined by the two sides of the sector of length r, with the third side of length b. The angle of the inner triangle at the center of the circle is
\[
\alpha_1 = \frac{2\pi}{2^d} = \frac{\pi}{2^{d-1}}
\]
Thus, the two other angles are both equal to
\[
\alpha_2 = \frac{1}{2}\Big(\pi - \frac{\pi}{2^{d-1}}\Big) = \frac{(2^{d-1}-1)\pi}{2^d}
\]
By the law of sines, we have
\[
\frac{b}{\sin(\alpha_1)} = \frac{r}{\sin(\alpha_2)},\qquad\text{or}\qquad
b = \frac{\sin(\alpha_1)}{\sin(\alpha_2)}\,r = c_d\,r
\]
where cd = sin(α1)/sin(α2).
The height of the inner triangle is given as
\[
h = \sqrt{r^2 - \Big(\frac{b}{2}\Big)^2} = \left(\frac{\sqrt{4-c_d^2}}{2}\right)r
\]

Thus the area of the inner triangle (for each sector) is
\[
\frac{1}{2}bh = \left(\frac{c_d\sqrt{4-c_d^2}}{4}\right)r^2
\]
The difference between the inscribed circle and the inner polygon (the union of the 2^d triangles) is given as
\[
\delta_{cp} = 2^d\left(\frac{\pi}{2^d} - \frac{c_d\sqrt{4-c_d^2}}{4}\right)
\]
where we assume that r = 1.
Finally, consider the length of the spokes, i.e., triangles with the same base b. Let x be the height of such a triangle. We require that the total area of the 2^d spoke triangles equals the difference between the volume of the hypercube and the inscribed hypersphere (δhs), plus the area between the circle and the inner polygon (δcp), so that
\[
2^d\cdot\frac{1}{2}\,bx = \delta_{hs}+\delta_{cp}
\qquad\Longrightarrow\qquad
x = \frac{\delta_{hs}+\delta_{cp}}{2^{d-1}b}
\]
Thus the length of the half-diagonal is R = h + x, given as
\[
R = \frac{\sqrt{4-c_d^2}}{2} + \frac{\delta_{hs} + \big(\pi - 2^{d-2}c_d\sqrt{4-c_d^2}\big)}{2^{d-1}c_d}
\]
By fixing r = 1, we obtain the length of the half-diagonal as R; so by doing the opposite, i.e., by fixing R = 1, we obtain r = 1/R as the radius of the inscribed circle. Figure 6.4 illustrates the concept.

Q7. Consider the unit hypersphere (with radius r = 1). Inside the hypersphere inscribe
a hypercube (i.e., the largest hypercube you can fit inside the hypersphere). An
example in two dimensions is shown in Figure 6.5. Answer the following questions:

Figure 6.5. For Q7.


(a) Derive an expression for the volume of the inscribed hypercube for any given
dimensionality d. Derive the expression for one, two, and three dimensions, and
then generalize to higher dimensions.
Answer: In 1D, the inscribed hypercube is identical to the hypersphere. So if the radius of the hypersphere is r, then the side length of the cube is l = 2r. Thus:
\[
V(H_1) = 2r
\]
In 2D, the main diagonal of the square has length 2r, since the diagonal passes through the center of the circle, and the circle has radius r. Therefore the side length of the square satisfies l² + l² = (2r)², or 2l² = 4r², which gives l = √2 r. The volume is then:
\[
V(H_2) = l^2 = 2r^2
\]
In 3D, the main diagonal of the cube is still 2r, but now there are 3 sides to consider, so we have l² + l² + l² = (2r)², which gives l = (2/√3) r. The volume is then:
\[
V(H_3) = l^3 = \frac{8}{3\sqrt3}\,r^3
\]
In general the trend is clear. We have to consider d sides to obtain the main diagonal of length 2r, which gives
\[
l = \frac{2}{\sqrt d}\,r
\qquad\text{and}\qquad
V(H_d) = l^d = \frac{2^d}{d^{d/2}}\,r^d
\]

(b) What happens to the ratio of the volume of the inscribed hypercube to the
volume of the enclosing hypersphere as d → ∞? Again, give the ratio in one,
two and three dimensions, and then generalize.
Answer: Let us look at the 1D, 2D and 3D cases. With r = 1, for 1D we have
\[
\frac{V(H_1)}{V(S_1)} = \frac{2r}{2r} = 1
\]
for 2D we have
\[
\frac{V(H_2)}{V(S_2)} = \frac{2r^2}{\pi r^2} = \frac{2}{\pi} = 0.64
\]
for 3D we have
\[
\frac{V(H_3)}{V(S_3)} = \frac{(8/3\sqrt3)r^3}{(4/3)\pi r^3} = \frac{2}{\sqrt3\,\pi} = 0.37
\]
For the general case we have:
\[
\frac{V(H_d)}{V(S_d)} = \frac{\frac{2^d}{d^{d/2}}r^d}{\frac{\pi^{d/2}}{\Gamma(d/2+1)}r^d} = \frac{2^d\,\Gamma(d/2+1)}{(d\pi)^{d/2}}
\]

Let us assume that d is even, in which case Γ(d/2 + 1) = (d/2)!, so we get:
\[
\begin{aligned}
\frac{V(H_d)}{V(S_d)} &= \frac{2^d\,(d/2)!}{(d\pi)^{d/2}} \\
&= 2^d\left(\frac{d/2}{d\pi}\right)\left(\frac{d/2-1}{d\pi}\right)\left(\frac{d/2-2}{d\pi}\right)\cdots\left(\frac{d/2-(d/2-1)}{d\pi}\right) \\
&\le 2^d\left(\frac{d/2}{d\pi}\right)^{d/2} = 2^d\,\frac{1}{(2\pi)^{d/2}} = \left(\frac{2}{\pi}\right)^{d/2}
\end{aligned}
\]
Since 2 < π, as d → ∞ the ratio goes to 0.

Q8. Assume that a unit hypercube is given as [0, 1]^d, that is, the range is [0, 1] in each dimension. The main diagonal in the hypercube is defined as the vector from (0, ..., 0, 0) to (1, ..., 1, 1). For example, when d = 2, the main diagonal goes from (0, 0) to (1, 1). On the other hand, the main anti-diagonal is defined as the vector from (1, ..., 1, 0) to (0, ..., 0, 1). For example, for d = 2, the anti-diagonal is from (1, 0) to (0, 1).
(a) Sketch the diagonal and anti-diagonal in d = 3 dimensions, and compute the angle between them.
Answer: The main diagonal is (1, 1, 1)^T and the anti-diagonal is (0, 0, 1)^T − (1, 1, 0)^T = (−1, −1, 1)^T. The angle therefore satisfies cos θ = −1/(√3·√3) = −1/3, which implies θ ≈ 109.47°.
(b) What happens to the angle between the main diagonal and anti-diagonal as d → ∞? First compute a general expression for d dimensions, and then take the limit as d → ∞.
Answer: The main diagonal is (1, ..., 1, 1)^T and the anti-diagonal is (−1, ..., −1, 1)^T. The angle therefore satisfies cos θ = −(d − 2)/d.
As d → ∞, cos θ = −1 + 2/d → −1, which implies θ → 180°. In other words, the diagonal and anti-diagonal become collinear, pointing in opposite directions.

Q9. Draw a sketch of a hypersphere in four dimensions.

Answer: The 1D hypersphere is simply an interval along dimension X1. The 2D hypersphere is a collection of closely spaced 1D intervals of decreasing radius along the new dimension X2, yielding a circle. The 3D hypersphere is a collection of closely spaced 2D circles of decreasing radius along the new dimension X3, yielding a sphere.
In a similar manner, the 4D hypersphere will be a collection of closely spaced 3D spheres along the new dimension X4. Of course we cannot adequately draw a 4D object, but Fig. 6.6 provides a conceptual sketch.

[Figure 6.6. Conceptual sketch of a 4D hypersphere]

CHAPTER 7 Dimensionality Reduction

7.6 EXERCISES

Q1. Consider the following data matrix D:

X1 X2
8 −20
0 −1
10 −19
10 −20
2 0

(a) Compute the mean µ and covariance matrix Σ for D.

Answer: The mean vector is µ = (30/5, −60/5)^T = (6, −12)^T.
Centering the data, we get:
\[
Z = \begin{pmatrix}2 & -8\\ -6 & 11\\ 4 & -7\\ 4 & -8\\ -4 & 12\end{pmatrix}
\]
Using the inner-product form, the covariance matrix is then given as:
\[
\Sigma = \frac{1}{5}\begin{pmatrix}4+36+16+16+16 & -16-66-28-32-48\\ -16-66-28-32-48 & 64+121+49+64+144\end{pmatrix}
= \frac{1}{5}\begin{pmatrix}88 & -190\\ -190 & 442\end{pmatrix}
= \begin{pmatrix}17.6 & -38\\ -38 & 88.4\end{pmatrix}
\]

(b) Compute the eigenvalues of Σ.

Answer: The eigenvalues can be found by solving the determinant equation det(Σ − λI) = 0:
\[
(17.6-\lambda)(88.4-\lambda) - 38^2 = \lambda^2 - 106\lambda + 1555.84 - 1444 = \lambda^2 - 106\lambda + 111.84 = 0
\]
Thus
\[
\lambda = \frac{106\pm\sqrt{106^2 - 4\cdot 111.84}}{2} = \frac{106\pm\sqrt{10788.64}}{2} = \frac{106\pm 103.87}{2}
\]
which gives λ1 = 209.87/2 = 104.94 and λ2 = 2.13/2 = 1.07.

(c) What is the "intrinsic" dimensionality of this dataset (discounting some small amount of variance)?
Answer: Clearly a fraction 104.94/(104.94 + 1.07) = 0.99 of the variance is captured by the first principal component. Thus, the intrinsic dimensionality of the data is one.
(d) Compute the first principal component.
Answer: We can compute the first principal component from the equation Σ u1 = λ1 u1. Using λ1 = 104.94, we solve for the eigenvector as follows:
\[
\begin{pmatrix}17.6-104.94 & -38\\ -38 & 88.4-104.94\end{pmatrix}\begin{pmatrix}x\\y\end{pmatrix}
= \begin{pmatrix}-87.34 & -38\\ -38 & -16.54\end{pmatrix}\begin{pmatrix}x\\y\end{pmatrix}
= \begin{pmatrix}0\\0\end{pmatrix}
\]
We thus obtain the equation −87.34x − 38y = 0, i.e., x = −0.435y. Choosing y = 1 gives x = −0.435, which after normalization yields:
\[
u_1 = \frac{1}{1.09}\begin{pmatrix}-0.435\\1\end{pmatrix} = \begin{pmatrix}-0.399\\ 0.917\end{pmatrix}
\]

(e) If the µ and Σ from above characterize the normal distribution from which the points were generated, sketch the orientation/extent of the 2-dimensional normal density function.
Answer: Figure 7.1 sketches the shape of the bivariate normal density, along with the first principal component and the mean.

[Figure 7.1. Orientation of the bivariate normal density, centered at the mean (6, −12)^T and elongated along the first principal component]
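The eigenvalues and the first principal component can be verified with numpy (a sketch using the population covariance, i.e., dividing by n = 5; eigenvectors are determined up to sign):

```python
import numpy as np

D = np.array([[8, -20], [0, -1], [10, -19], [10, -20], [2, 0]], dtype=float)
Z = D - D.mean(axis=0)
Sigma = (Z.T @ Z) / len(D)
print(Sigma)                     # [[17.6, -38], [-38, 88.4]]

w, V = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
print(w)                         # [~1.07, ~104.94]
print(V[:, -1])                  # first PC, ~(-0.40, 0.92) up to sign
```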
 
Q2. Given the covariance matrix
\[
\Sigma = \begin{pmatrix}5 & 4\\ 4 & 5\end{pmatrix}
\]
answer the following questions:
(a) Compute the eigenvalues of Σ by solving the equation det(Σ − λI) = 0.
Answer: We have to solve the following equation:
\[
(5-\lambda)^2 - 16 = 0
\qquad\Longrightarrow\qquad
\lambda^2 - 10\lambda + 9 = 0
\]

Factoring gives (λ − 9)(λ − 1) = 0, which implies that λ1 = 9 and λ2 = 1.


(b) Find the corresponding eigenvectors by solving the equation Σ ui = λi ui.
Answer: For λ1 = 9, we have
\[
\begin{pmatrix}5-9 & 4\\ 4 & 5-9\end{pmatrix}\begin{pmatrix}x\\y\end{pmatrix} = \begin{pmatrix}0\\0\end{pmatrix}
\]
which implies −4x + 4y = 0, or x = y. Thus, the eigenvector corresponding to the dominant eigenvalue is
\[
u_1 = \frac{1}{\sqrt2}\begin{pmatrix}1\\1\end{pmatrix} = \begin{pmatrix}0.707\\0.707\end{pmatrix}
\]
Since the second eigenvector must be orthogonal to the first, we immediately have
\[
u_2 = \frac{1}{\sqrt2}\begin{pmatrix}-1\\1\end{pmatrix} = \begin{pmatrix}-0.707\\0.707\end{pmatrix}
\]

Q3. Compute the singular values and the left and right singular vectors of the following matrix:
\[
A = \begin{pmatrix}1 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}
\]


Answer: The right singular vectors are the eigenvectors of A^T A, and the left singular vectors are the eigenvectors of AA^T. The eigenvalues of both matrices are the same and are the squares of the singular values.
\[
A^TA = \begin{pmatrix}1 & 1 & 0\\ 1 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}
\qquad
AA^T = \begin{pmatrix}2 & 0\\ 0 & 1\end{pmatrix}
\]
It is clear that the eigenvalues of AA^T are λ1 = 2 and λ2 = 1, and thus the singular values are δ1 = √2 and δ2 = 1.
For the left singular vectors, note that AA^T is diagonal, which implies that the standard basis vectors constitute the eigenvectors, i.e., l1 = (1, 0)^T and l2 = (0, 1)^T.
We can also compute the dominant left singular vector via the power method. Starting with x0 = (1, 1)^T, we have
\[
AA^T\begin{pmatrix}1\\1\end{pmatrix} = \begin{pmatrix}2\\1\end{pmatrix},\qquad
AA^T\begin{pmatrix}2\\1\end{pmatrix} = \begin{pmatrix}4\\1\end{pmatrix},\qquad
AA^T\begin{pmatrix}4\\1\end{pmatrix} = \begin{pmatrix}8\\1\end{pmatrix}
\]
After n iterations the vector is (2^n, 1)^T, which after normalization tends to (1, 0)^T. The other left singular vector is orthogonal to the first one, so we have
\[
L = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}
\]
For the right singular vectors, note that A^T A is rank deficient, having rank 2 instead of 3, since the first two columns (and rows) are identical. It is clear that r2 = (0, 0, 1)^T is an eigenvector corresponding to the eigenvalue 1. We can compute the dominant right singular vector via the power method. Starting with x0 = (1, 1, 1)^T, we have
\[
A^TA\,(1,1,1)^T = (2,2,1)^T,\qquad
A^TA\,(2,2,1)^T = (4,4,1)^T,\qquad
A^TA\,(4,4,1)^T = (8,8,1)^T
\]
The dominant right singular vector is r1 = (1/√2)(1, 1, 0)^T = (1/√2, 1/√2, 0)^T, and the third vector is orthogonal to the first two, given as r3 = (−1/√2, 1/√2, 0)^T. Thus, we have
\[
R = \begin{pmatrix}0.707 & 0 & -0.707\\ 0.707 & 0 & 0.707\\ 0 & 1 & 0\end{pmatrix}
\]

Putting it all together we have
\[
A = L\,\Delta\,R^T
= \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}
\begin{pmatrix}\sqrt2 & 0 & 0\\ 0 & 1 & 0\end{pmatrix}
\begin{pmatrix}0.707 & 0.707 & 0\\ 0 & 0 & 1\\ -0.707 & 0.707 & 0\end{pmatrix}
\]
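numpy's SVD routine reproduces the same factorization (a cross-check; singular vectors may differ in sign):

```python
import numpy as np

A = np.array([[1, 1, 0],
              [0, 0, 1]], dtype=float)

L, s, Rt = np.linalg.svd(A, full_matrices=False)
print(L)                       # 2x2, identity up to signs
print(s)                       # [1.414..., 1.0]
print(Rt)                      # rows ~ (0.707, 0.707, 0) and (0, 0, 1)
print(L @ np.diag(s) @ Rt)     # recovers A
```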

Q4. Consider the data in Table 7.1. Define the kernel function as follows: K(xi, xj) = ‖xi − xj‖². Answer the following questions:
(a) Compute the kernel matrix K.
Answer: The kernel matrix is given as:
\[
K = \begin{pmatrix}0 & 5.86 & 1.46 & 4.64\\ 5.86 & 0 & 10 & 1.46\\ 1.46 & 10 & 0 & 5.86\\ 4.64 & 1.46 & 5.86 & 0\end{pmatrix}
\]

(b) Find the first kernel principal component.


Answer: We use the power iteration method, starting with the vector (1, 1, 1, 1)^T:
\[
K\begin{pmatrix}1\\1\\1\\1\end{pmatrix} = \begin{pmatrix}11.96\\17.32\\17.32\\11.96\end{pmatrix}\propto\begin{pmatrix}0.69\\1\\1\\0.69\end{pmatrix},
\qquad
K\begin{pmatrix}0.69\\1\\1\\0.69\end{pmatrix} = \begin{pmatrix}10.52\\15.05\\15.05\\10.52\end{pmatrix}\propto\begin{pmatrix}0.7\\1\\1\\0.7\end{pmatrix}
\]
In each step we scale the intermediate vector by dividing by its largest element, right after multiplying by K.
The first kernel PC is therefore
\[
c_1 = \frac{1}{\sqrt{2.98}}\,(0.7, 1, 1, 0.7)^T = (0.41, 0.58, 0.58, 0.41)^T
\]
Also, based on the ratio of the largest element of the vector to its previous value, we conclude that the dominant eigenvalue is η1 ≈ 15.1.

Table 7.1. Dataset for Q4

i xi
x1 (4, 2.9)
x4 (2.5, 1)
x7 (3.5, 4)
x9 (2, 2.1)


Q5. Given the two points x1 = (1, 2)^T and x2 = (2, 1)^T, use the kernel function
\[
K(x_i, x_j) = (x_i^T x_j)^2
\]
to find the kernel principal component, by solving the equation Kc = η1 c.

Answer: The kernel matrix for the two points is given as
\[
K = \begin{pmatrix}(x_1^Tx_1)^2 & (x_1^Tx_2)^2\\ (x_2^Tx_1)^2 & (x_2^Tx_2)^2\end{pmatrix}
= \begin{pmatrix}25 & 16\\ 16 & 25\end{pmatrix}
\]
This matrix has the same eigenvectors as the matrix in Q2, since both have identical diagonal entries and identical off-diagonal entries. Thus the kernel principal component, i.e., the eigenvector corresponding to the dominant eigenvalue η1 = 41, is c1 = (0.707, 0.707)^T.

PART II FREQUENT PATTERN MINING
Table 8.1. Transaction database for Q1

tid itemset
t1 ABCD
t2 ACDF
t3 ACDEG
t4 ABDF
t5 BCG
t6 DFG
t7 ABG
t8 CDFG

CHAPTER 8 Itemset Mining

8.5 EXERCISES

Q1. Given the database in Table 8.1.


(a) Using minsup = 3/8, show how the Apriori algorithm enumerates all frequent
patterns from this dataset.
Answer: The level one frequent items are:
A − 5, B − 4, C − 5, D − 6, F − 4, G − 5
Candidates for level two and their frequencies are:
AB − 3, AC − 3, AD − 4, AF − 2, AG − 2
BC − 2, BD − 2, BF − 0, BG − 2
CD − 4, CF − 3, CG − 3
DF − 4, DG − 3
FG − 3
Thus the frequent ones are: AB, AC, AD, CD, CF, CG, DF, DG, FG.

The candidates for level three and their support values are:
ABC − 1, ABD − 2, ACD − 3, CDF − 3, CDG − 2, CF G − 2, DF G − 3
Out of these the frequent 3-itemsets are ACD, CDF , and DF G.
No more frequent itemsets are possible.
(b) With minsup = 2/8, show how FPGrowth enumerates the frequent itemsets.


Answer: Count single items, remove infrequent and sort:


D(6), A(5), C(5), G(5), B(4), F (4)
We now construct the FP-tree for the whole DB in that order, as shown below:

[FP-tree over the full database, with items inserted in the order D, A, C, G, B, F (diagram)]
In this tree the items D, A, C, G, B and F are frequent, so
we output them and recursively project on each in turn and
generate all frequent sets ending in those items as shown next.
[Projected (conditional) FP-trees for the suffix F and its further projections on G, C, and A (diagrams)]

We first project on F to get its conditional FP-tree. In this tree we first process B, but it is not frequent.
Next we process G(2), and output GF (2). Within GF ’s tree C is not frequent
but D(2) is, so we output GF D(2).
Next process C(2); output CF (2). Only D(2) is frequent; output DCF (2)
Next process A(2), output AF (2); since D(2) is freq, output DAF (2);
Finally D(4) is freq, so output DF (4).

[Projected FP-trees for the suffix B and its further projections on G, C, and A (diagrams)]

We next process B.


Table 8.2. Dataset for Q2

A B C D E
1 2 1 1 2
3 3 2 6 3
5 4 3 4
6 5 5 5
6 6

GB(2) is freq, output it; no other item is frequent in that tree.


Next process C within FP Tree of B. Output CB(2). With Project CB, nothing
else is frequent.
Next do AB, output AB(3). D is freq, output DAB(2).
Next DB(2) is freq.

[Projected FP-trees for the suffixes G, C, and A (diagrams)]
We next process items G, C and A in turn as shown above.
Within G, output CG(3), DCG(2), AG(2), and DG(3).
Within C, output AC(3), DAC(3), DC(4).
Within A, output DA(4).

Q2. Consider the vertical database shown in Table 8.2. Assuming that minsup = 3,
enumerate all the frequent itemsets using the Eclat method.

Answer: The frequent itemsets based on eclat are shown below, along with their
tidsets:
A × 1356
B × 23456
C × 12356
E × 2345
AB × 356
AC × 1356
BC × 2356
BE × 2345
CE × 235
ABC × 356
BCE × 235
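The same frequent itemsets and tidsets can be generated with a small recursive sketch of Eclat-style tidset intersection (an illustration under the stated minsup = 3 as an absolute count, not the book's pseudocode):

```python
def eclat(prefix, items, minsup, out):
    """items: list of (item, tidset) pairs that are already frequent, in a fixed order."""
    for i, (x, tx) in enumerate(items):
        new_prefix = prefix + [x]
        out.append((new_prefix, tx))
        # extend with the remaining items, intersecting tidsets
        suffix = []
        for y, ty in items[i + 1:]:
            txy = tx & ty
            if len(txy) >= minsup:
                suffix.append((y, txy))
        if suffix:
            eclat(new_prefix, suffix, minsup, out)

tidsets = {'A': {1, 3, 5, 6}, 'B': {2, 3, 4, 5, 6}, 'C': {1, 2, 3, 5, 6},
           'D': {1, 6}, 'E': {2, 3, 4, 5}}
items = [(x, t) for x, t in sorted(tidsets.items()) if len(t) >= 3]

results = []
eclat([], items, 3, results)
for itemset, tids in results:
    print(''.join(itemset), sorted(tids))   # A, AB, ABC, AC, B, BC, BCE, BE, C, CE, E
```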


Q3. Given two k-itemsets Xa = {x1 , . . ., xk−1 , xa } and Xb = {x1 , . . ., xk−1 , xb } that share the
common (k − 1)-itemset X = {x1 , x2 , . . ., xk−1 } as a prefix, prove that

sup(Xab ) = sup(Xa ) − |d(Xab )|

where Xab = Xa ∪ Xb , and d(Xab ) is the diffset of Xab .

Answer: We know that d(Xab) = t(Xa) − t(Xab). Note that d(Xab) ∩ t(Xab) = ∅, since the transactions that contain both xa and xb, and those that contain xa but not xb, are necessarily disjoint (in the context of t(X)); therefore, we have |d(Xab)| = |t(Xa)| − |t(Xab)|.
We therefore have

sup(Xa ) − |d(Xab )| = |t(Xa )| − |d(Xab )|


= |t(Xa )| − |t(Xa )| + |t(Xab )|
= sup(Xab )
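As a concrete illustration on the data of Table 8.2 used in Q2: t(A) = {1, 3, 5, 6} and t(AB) = {3, 5, 6}, so d(AB) = t(A) − t(AB) = {1}, and indeed sup(AB) = sup(A) − |d(AB)| = 4 − 1 = 3.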

Q4. Given the database in Table 8.3, show all rules that one can generate from the set
ABE.
Table 8.3. Dataset for Q4

tid itemset
t1 ACD
t2 BCE
t3 ABCE
t4 BDE
t5 ABCE
t6 ABCD

Answer: We first need to compute the support of ABE and all its subsets, which
comprises:
A − 4, B − 5, E − 4, AB − 3, AE − 2, BE − 4, ABE − 2

The set of rules one can generate from ABE is as follows; the support of every
rule is 2, since sup(ABE) = 2:
A −→ BE, confidence c = 2/4 = 0.5
B −→ AE, confidence c = 2/5 = 0.4
E −→ AB, confidence c = 2/4 = 0.5
AB −→ E, confidence c = 2/3 = 0.67
AE −→ B, confidence c = 2/2 = 1.0
BE −→ A, confidence c = 2/4 = 0.5
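The confidences above can be checked with a short sketch (the supports are hard-coded from the counts computed above; the names are illustrative):

    from itertools import combinations

    sup = {'A': 4, 'B': 5, 'E': 4, 'AB': 3, 'AE': 2, 'BE': 4, 'ABE': 2}
    items = 'ABE'
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            X = ''.join(lhs)
            Y = ''.join(c for c in items if c not in lhs)
            print(f"{X} -> {Y}: sup = {sup[items]}, conf = {sup[items] / sup[X]:.2f}")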

Q5. Consider the partition algorithm for itemset mining. It divides the database into k
partitions, not necessarily equal, such that D = ∪ki=1Di , where Di is partition i, and for
any i ≠ j, we have Di ∩ Dj = ∅. Also let ni = |Di| denote the number of transactions in

partition Di . The algorithm first mines only locally frequent itemsets, that is, itemsets
whose relative support is above the minsup threshold specified as a fraction. In the
second step, it takes the union of all locally frequent itemsets, and computes their
support in the entire database D to determine which of them are globally frequent.
Prove that if a pattern is globally frequent in the database, then it must be locally
frequent in at least one partition.

Answer: Let X be globally frequent, so that rsup(X) = sup(X, D)/|D| ≥ minsup,


which implies sup(X, D) ≥ minsup × |D|.
Assume that X is locally infrequent in all partitions, i.e., sup(X, Di ) < minsup ×
|Di |. Summing up the support over all partitions, we get
sup(X, D) = Σ_{i=1}^k sup(X, Di)
          < Σ_{i=1}^k minsup × ni
          = minsup × Σ_{i=1}^k ni
          = minsup × |D|

That is, sup(X, D) < minsup × |D|, which is a contradiction. Thus there must exist
at least one partition where X is locally frequent.

Q6. Consider Figure 8.1. It shows a simple taxonomy on some food items. Each leaf is
a simple item and an internal node represents a higher-level category or item. Each
item (single or high-level) has a unique integer label noted under it. Consider the
database composed of the simple items shown in Table 8.4. Answer the following
questions:

Table 8.4. Dataset for Q6

tid itemset
1 2367
2 1 3 4 8 11
3 3 9 11
4 1567
5 1 3 8 10 11
6 3 5 7 9 11
7 4 6 8 10 11
8 1 3 5 8 11

(a) What is the size of the itemset search space if one restricts oneself to only itemsets
composed of simple items?

Figure 8.1. Item taxonomy for Q6. [The root has children vegetables (1), grain (14),
fruit (6), and dairy (15). Grain (14) has children bread (12) and rice (5), where
bread (12) has leaves wheat (2), white (3), and rye (4). Dairy (15) has children
yogurt (7), milk (13), and cheese (11), where milk (13) has leaves whole (8), 2% (9),
and skim (10).]

Answer: The search space size is 2^11, since there are 11 simple items.
(b) Let X = {x1 , x2 , . . . , xk } be a frequent itemset. Let us replace some xi ∈ X with its
parent in the taxonomy (provided it exists) to obtain X′ , then the support of the
new itemset X′ is:
i. more than support of X
ii. less than support of X
iii. not equal to support of X
iv. more than or equal to support of X
v. less than or equal to support of X
Answer: The answer is (iv), more than or equal to support of X. The reason
is that it may be the case that none of the other siblings of xi may occur in the
database, in which case the support of X′ will be the same as that for X. The
support cannot be lower, but it can obviously be equal or higher.
(c) Use minsup = 7/8. Find all frequent itemsets composed only of high-level items
in the taxonomy. Keep in mind that if a simple item appears in a transaction, then
its high-level ancestors are all assumed to occur in the transaction as well.
Answer: If we replace each low-level item by all of the high level ancestors,
we obtain the following set of transactions:
1: 12, 14, 15
2: 12, 14, 13, 15
3: 12, 14, 13, 15
4: 14, 15
5: 12, 14, 13, 15
6: 12, 14, 13, 15
7: 12, 14, 13, 15
8: 12, 14, 13, 15


The frequent high-level itemsets (with minsup = 7/8, i.e., absolute support 7) are as follows:
12 − 7, 14 − 8, 15 − 8
{12, 14} − 7, {12, 15} − 7, {14, 15} − 8
{12, 14, 15} − 7

Q7. Let D be a database with n transactions. Consider a sampling approach for mining
frequent itemsets, where we extract a random sample S ⊂ D, with say m transactions,
and we mine all the frequent itemsets in the sample, denoted as FS . Next, we make
one complete scan of D, and for each X ∈ FS , we find its actual support in the
whole database. Some of the itemsets in the sample may not be truly frequent in
the database; these are the false positives. Also, some of the true frequent itemsets
in the original database may never be present in the sample at all; these are the false
negatives.
Prove that if X is a false negative, then this case can be detected by counting
the support in D for every itemset belonging to the negative border of FS , denoted
Bd − (FS ), which is defined as the set of minimal infrequent itemsets in sample S.
Formally,

Bd−(FS) = inf{ Y | sup(Y) < minsup and ∀Z ⊂ Y, sup(Z) ≥ minsup }

where inf returns the minimal elements of the set.

Answer: Let X be a frequent pattern in D, i.e., sup(X, D) ≥ minsup. Assume that
X is a false negative, i.e., X ∉ FS, that is, sup(X, S) < minsup.
Since X is not frequent in S, then by definition there must exist Y ⊆ X such
that Y ∈ Bd−(FS). Since X is frequent in D, it follows that Y ⊆ X is also
frequent in D. Thus, all we have to do is check whether sup(Y, D) ≥ minsup, and if
so, we can conclude that Y and potentially some of its supersets can also be frequent
in D. To find which ones, we then have to recursively generate supersets of Y and
count them in D. All those found to be frequent in D are the false
negative patterns (which we failed to enumerate from S).

Q8. Assume that we want to mine frequent patterns from relational tables. For example
consider Table 8.5, with three attributes A, B, and C, and six records. Each attribute
has a domain from which it draws its values, for example, the domain of A is dom(A) =
{a1 , a2 , a3 }. Note that no record can have more than one value of a given attribute.

Table 8.5. Data for Q8

tid A B C
1 a1 b1 c1
2 a2 b3 c2
3 a2 b3 c3
4 a2 b1 c1
5 a2 b3 c3
6 a3 b3 c3

We define a relational pattern P over some k attributes X1 , X2 , . . ., Xk to be a


subset of the Cartesian product of the domains of the attributes, i.e., P ⊆ dom(X1 ) ×

dom(X2 ) × · · · × dom(Xk ). That is, P = P1 × P2 × · · · × Pk , where each Pi ⊆ dom(Xi ).


For example, {a1 , a2 } × {c1 } is a possible pattern over attributes A and C, whereas
{a1 } × {b1 } × {c1 } is another pattern over attributes A, B and C.
The support of relational pattern P = P1 × P2 × · · · × Pk in dataset D is defined as
the number of records in the dataset that belong to it; it is given as

sup(P) = |{r = (r1, r2, . . ., rn) ∈ D : ri ∈ Pi for all Pi in P}|

For example, sup({a1 , a2 } × {c1 }) = 2, as both records 1 and 4 contribute to its support.
Note, however that the pattern {a1 } × {c1 } has a support of 1, since only record 1
belongs to it. Thus, relational patterns do not satisfy the Apriori property that we
used for frequent itemsets, that is, subsets of a frequent relational pattern can be
infrequent.
We call a relational pattern P = P1 × P2 × · · ·× Pk over attributes X1 , . . ., Xk as valid
iff for all u ∈ Pi and all v ∈ Pj , the pair of values (Xi = u, Xj = v) occurs together in
some record. For example, {a1 , a2 } × {c1 } is a valid pattern since both (A = a1 , C = c1 )
and (A = a2 , C = c1 ) occur in some records (namely, records 1 and 4, respectively),
whereas {a1 , a2 }×{c2 } is not a valid pattern, since there is no record that has the values
(A = a1 , C = c2 ). Thus, for a pattern to be valid every pair of values in P from distinct
attributes must belong to some record.
Given that minsup = 2, find all frequent, valid, relational patterns in the dataset in
Table 8.5.

Answer: The set of all frequent relational patterns are as follows:


{a1 , a2 }, {b1 } − 2
{a1 , a2 }, {c1 } − 2
{a1 , a2 }, {b1 }, {c1 } − 2
{a2 , a3 }, {b3 } − 4
{a2, a3}, {b3}, {c3} − 3
{a2 , a3 }, {c3 } − 3
{a2 }, {b1 , b3 } − 4
{a2 }, {b3 } − 3
{a2 }, {b3 }, {c3 } − 2
{a2 }, {b3 }, {c2 , c3 } − 3
{a2 }, {c1 , c2 } − 2
{a2 }, {c1 , c3 } − 3
{a2}, {c1, c2, c3} − 4
{a2 }, {c2 , c3 } − 3
{a2 }, {c3 } − 2
{b1 }, {c1 } − 2
{b3 }, {c3 } − 3
{b3 }, {c2 , c3 } − 4
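A minimal sketch of the support and validity checks for relational patterns, using the records of Table 8.5 (the encoding of a pattern as a dictionary from attribute index to value set is an assumption made here for illustration):

    from itertools import product

    records = [('a1','b1','c1'), ('a2','b3','c2'), ('a2','b3','c3'),
               ('a2','b1','c1'), ('a2','b3','c3'), ('a3','b3','c3')]

    def support(pattern):
        # pattern: attribute index -> set of allowed values
        return sum(all(r[i] in vals for i, vals in pattern.items()) for r in records)

    def is_valid(pattern):
        # every pair of values from distinct attributes must co-occur in some record
        attrs = sorted(pattern)
        for i, j in ((a, b) for a in attrs for b in attrs if a < b):
            for u, v in product(pattern[i], pattern[j]):
                if not any(r[i] == u and r[j] == v for r in records):
                    return False
        return True

    p = {0: {'a1', 'a2'}, 2: {'c1'}}     # {a1, a2} x {c1}
    print(support(p), is_valid(p))       # 2 True
    q = {0: {'a1', 'a2'}, 2: {'c2'}}     # {a1, a2} x {c2}
    print(support(q), is_valid(q))       # 1 False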

Q9. Given the following multiset dataset:

tid multiset
1 ABCA
2 ABABA
3 CABBA

Using minsup = 2, answer the following:


(a) Find all frequent multisets. Recall that a multiset is still a set (i.e., order is not
important), but it allows multiple occurrences of an item.
Answer: For multisets we have to account for the count of each item in the set.
We can mine all frequent multisets by using a level-wise approach as follows
(each multiset is shown along with its frequency):

Level 1: A − 3, B − 3, C − 2
Level 2: AA − 3, AB − 3, AC − 2, BB − 2, BC − 2
Level 3: AAB − 3, AAC − 2, ABB − 2, ABC − 2
Level 4: AABB − 2, AABC − 2

(b) Find all minimal infrequent multisets, that is, those multisets that have no
infrequent sub-multisets.
Answer: In the level-wise approach above we encounter the following minimal
infrequent multisets:

Level 2: CC − 0
Level 3: AAA − 1, BBB − 0, BBC − 1
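The level-wise enumeration can be sketched with multisets represented as Counters; a multiset is contained in a transaction if no item count is exceeded (the code below is illustrative):

    from collections import Counter
    from itertools import combinations_with_replacement

    transactions = [Counter("ABCA"), Counter("ABABA"), Counter("CABBA")]
    minsup = 2

    def sup(m):
        return sum(all(t[i] >= c for i, c in m.items()) for t in transactions)

    for k in range(1, 5):
        for combo in combinations_with_replacement("ABC", k):
            s = sup(Counter(combo))
            if s >= minsup:
                print(''.join(combo), '-', s)

This prints exactly the frequent multisets listed in part (a).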

Table 9.1. Dataset for Q2

Tid Itemset
t1 ACD
t2 BCE
t3 ABCE
t4 BDE
t5 ABCE
t6 ABCD

CHAPTER 9 Summarizing Itemsets

9.6 EXERCISES

Q1. True or False:


(a) Maximal frequent itemsets are sufficient to determine all frequent itemsets with
their supports.
Answer: False
(b) An itemset and its closure share the same set of transactions.
Answer: True
(c) The set of all maximal frequent sets is a subset of the set of all closed frequent
itemsets.
Answer: True (this is proved in Q5 below, which shows that M ⊆ C).
(d) The set of all maximal frequent sets is the set of longest possible frequent
itemsets.
Answer: False

Q2. Given the database in Table 9.1


(a) Show the application of the closure operator on AE, that is, compute c(AE). Is
AE closed?
Answer: c(AE) = i(t(AE)) = i(3, 5) = ABCE. Thus AE is not closed.
(b) Find all frequent, closed, and maximal itemsets using minsup = 2/6.

Table 9.2. Dataset for Q3

Tid Itemset
1 ACD
2 BCD
3 AC
4 ABD
5 ABCD
6 BCD

Figure 9.1. Closed itemset lattice for Q4: B(8) is the bottom element, BC(5) and
ABD(6) are its supersets, and ABCD(3) is the top element.

Answer: The set of all frequent itemsets F is given as:


A4 , B5 , C5 , D3 , E4
AB3 , AC4 , AD2 , AE2 , BC4 , BD2 , BE4 , CD2 , CE3
ABC3 , ABE2 , ACD2 , ACE2 , BCE3
ABCE2
The set of closed frequent itemsets C is:
B5 , C5 , D 3
AC4 , BC4 , BD2 , BE4
ABC3 , ACD2 , BCE3
ABCE2
The set of all maximal frequent itemsets M is:
BD2 , ACD2 , ABCE2
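The closure computation in part (a) can be verified with a small sketch of the operators t and i over Table 9.1 (the names are illustrative):

    db = {1: set("ACD"), 2: set("BCE"), 3: set("ABCE"),
          4: set("BDE"), 5: set("ABCE"), 6: set("ABCD")}

    def t(itemset):
        # tidset of an itemset
        return {tid for tid, items in db.items() if set(itemset) <= items}

    def i(tids):
        # items common to all the given transactions (assumes tids is nonempty)
        common = set("ABCDE")
        for tid in tids:
            common &= db[tid]
        return ''.join(sorted(common))

    def c(itemset):
        return i(t(itemset))

    print(t("AE"), c("AE"))   # {3, 5} ABCE  -> AE is not closed
    print(c("AC"))            # AC           -> AC is closed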

Q3. Given the database in Table 9.2, find all minimal generators using minsup = 1.

Answer: The set of all minimal generators, along with their supports, is:

A − 4, B − 4, C − 5, D − 5
AB − 2, AC − 3, AD − 3, BC − 3, CD − 4
ABC − 1, ACD − 2

Q4. Consider the frequent closed itemset lattice shown in Figure 9.1. Assume that the
item space is I = {A, B, C, D, E}. Answer the following questions:
(a) What is the frequency of CD?
Answer: The frequency of CD is 3.


Table 9.3. Dataset for Q7

Tid Itemset
1 ACD
2 BCD
3 ACD
4 ABD
5 ABCD
6 BC

(b) Find all frequent itemsets and their frequency, for itemsets in the subset interval
[B, ABD].
Answer: The frequency of all itemsets in the interval is as follows:
B − 8, AB − 6, AD − 6, BD − 6, ABD − 6.
(c) Is ADE frequent? If yes, show its support. If not, why?
Answer: ADE is not frequent since it is not a subset of any closed itemset.

Q5. Let C be the set of all closed frequent itemsets and M the set of all maximal frequent
itemsets for some database. Prove that M ⊆ C .

Answer: Let X ∈ M. By definition of maximality, X is maximal implies it has no


frequent superset. It follows that it has no frequent superset of the same frequency.
Thus X must be closed, i.e., X ∈ C . This proves that M ⊆ C .

Q6. Prove that the closure operator c = i ◦ t satisfies the following properties (X and Y are
some itemsets):
(a) Extensive: X ⊆ c(X)
Answer: Let x ∈ X. Consider t(X) = {t|X ⊆ i(t)}. It follows that x ∈ i(t) for all
t ∈ t(X).
Now, c(X) = i(t(X)) = ∩_{t ∈ t(X)} i(t), which implies that x ∈ c(X). Thus X ⊆ c(X).

(b) Monotonic: If X ⊆ Y then c(X) ⊆ c(Y)


Answer: By definition, c(X) = i(t(X)) and c(Y) = i(t(Y)). Also, X ⊆ Y implies
that t(Y) ⊆ t(X), and further t(X) = t(Y) ∪ (t(X) − t(Y)).
Now, c(X) = i(t(X)) = ∩_{t ∈ t(X)} i(t) = (∩_{t ∈ t(Y)} i(t)) ∩ (∩_{t ∈ t(X)−t(Y)} i(t))
⊆ ∩_{t ∈ t(Y)} i(t) = i(t(Y)) = c(Y).

(c) Idempotent: c(X) = c(c(X))


Answer: Since X ⊆ c(X), it follows from part (a) that c(X) ⊆ c(c(X)).
So, we now have to show that c(c(X)) ⊆ c(X). Let x be an item such that x ∈ c(c(X));
we have to show that x ∈ c(X), i.e., that x ∈ i(t) for every t ∈ t(X).
Since x ∈ c(c(X)), we know that x ∈ i(t) for all t such that c(X) ⊆ i(t). Now, for
every t ∈ t(X) we have c(X) = ∩_{s ∈ t(X)} i(s) ⊆ i(t), so x ∈ i(t) for every t ∈ t(X),
which implies that x ∈ c(X).


Q7. Let δ be an integer. An itemset X is called a δ-free itemset iff for all subsets Y ⊂ X, we
have sup(Y) − sup(X) > δ. For any itemset X, we define the δ-closure of X as follows:

δ-closure(X) = { Y | X ⊂ Y, sup(X) − sup(Y) ≤ δ, and Y is maximal }

Consider the database shown in Table 9.3. Answer the following questions:
(a) Given δ = 1, compute all the δ-free itemsets.
Answer: The δ-free sets and their closures are as follows:

1-free sets their closures


A(4) ACD(3)
B(4) BC(3), BD(3)
C(5) CD(4)
D(5) AD(4), CD(4)
AB(2) ABCD(1)

Actually, the definition allows for the empty set to be counted as δ-free (though
it is not very interesting). So, if you do count ∅ as δ-free, then C and D will not
be δ-free, but ∅ will be. In that case your answer for the closure will also differ.
δ-closure(∅) = {C, D}. The final answer should be:

1-free sets their closures


∅(6) C(5), D(5)
A(4) ACD(3)
B(4) BC(3), BD(3)
AB(2) ABCD(1)

(b) For each of the δ-free itemsets, compute its δ-closure for δ = 1.
Answer: The closures are given in part (a) above.

Q8. Given the lattice of frequent itemsets (along with their supports) shown in Figure 9.2,
answer the following questions:
(a) List all the closed itemsets.


Answer: The closed itemsets are:


A − 6, AB − 5, AC − 4, AD − 3, ABC − 3, ABD − 2, ACD − 2, ABCD − 1
(b) Is BCD derivable? What about ABCD? What are the bounds on their supports.
Answer: Let us derive the support bounds for BCD:

sup(BCD) ≤ min{sup(BC), sup(BD), sup(CD)} = 2
sup(BCD) ≥ sup(BC) + sup(BD) − sup(B) = 3 + 2 − 5 = 0
sup(BCD) ≥ sup(BC) + sup(CD) − sup(C) = 3 + 2 − 4 = 1
sup(BCD) ≥ sup(BD) + sup(CD) − sup(D) = 2 + 2 − 3 = 1
sup(BCD) ≤ sup(BC) + sup(BD) + sup(CD) − sup(B) − sup(C) − sup(D) + sup(∅)
         = 3 + 2 + 2 − 5 − 4 − 3 + 6 = 1

We conclude that sup(BCD) ∈ [1, 1], and thus BCD is derivable.


As for ABCD, we have:

sup(ABCD) ≤ sup(BCD) = 1
sup(ABCD) ≥ sup(ABC) + sup(ACD) − sup(AC) = 3 + 2 − 4 = 1

Thus sup(ABCD) ∈ [1, 1] and it is also derivable.

∅(6)

A(6) B(5) C(4) D(3)

AB(5) AC(4) AD(3) BC(3) BD(2) CD(2)

ABC(3) ABD(2) ACD(2) BCD(1)

ABCD(1)

Figure 9.2. Frequent itemset lattice for Q8.
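The bound computation in Q8(b) can also be sketched in code: for each proper subset Y of X, the inclusion-exclusion sum IE(Y) gives an upper bound on sup(X) when |X \ Y| is odd and a lower bound when it is even (the supports below are read off Figure 9.2; the function names are illustrative):

    from itertools import combinations

    sup = {frozenset(k): v for k, v in {
        '': 6, 'A': 6, 'B': 5, 'C': 4, 'D': 3,
        'AB': 5, 'AC': 4, 'AD': 3, 'BC': 3, 'BD': 2, 'CD': 2,
        'ABC': 3, 'ABD': 2, 'ACD': 2, 'BCD': 1, 'ABCD': 1}.items()}

    def bounds(itemset):
        X = frozenset(itemset)
        upper, lower = [], []
        for r in range(len(X)):                           # proper subsets Y of X
            for Y in map(frozenset, combinations(sorted(X), r)):
                ie = 0
                for k in range(len(Y), len(X)):           # W with Y <= W, W a proper subset of X
                    for W in map(frozenset, combinations(sorted(X), k)):
                        if Y <= W:
                            ie += (-1) ** (len(X - W) + 1) * sup[W]
                (upper if len(X - Y) % 2 == 1 else lower).append(ie)
        return max(lower), min(upper)

    print(bounds('BCD'))     # (1, 1): derivable
    print(bounds('ABCD'))    # (1, 1): derivable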

Q9. Prove that if an itemset X is derivable, then so is any superset Y ⊃ X. Using this
observation describe an algorithm to mine all nonderivable itemsets.


Answer: Let Y ⊆ X and let Z = X \ Y, so that X = YZ. From Eqs. (9.4) and (9.3) we conclude that

sup(YZ̄) = |sup(X) − IE(Y)|

where sup(YZ̄) denotes the number of transactions that contain Y but none of the items in Z, and

IE(Y) = Σ_{Y ⊆ W ⊂ X} (−1)^(|X\W|+1) · sup(W)

Also, the upper and lower bounds on the support of X are given as

U(X) = min{ IE(Y) : Y ⊆ X, |X \ Y| is odd }
L(X) = max{ IE(Y) : Y ⊆ X, |X \ Y| is even }

Now let Y′ ⊆ X be the subset that minimizes |sup(X) − IE(Y)|, and let Z′ = X \ Y′, that is,

Y′ = arg min_Y |sup(X) − IE(Y)| = arg min_Y sup(YZ̄)

Since L(X) = IE(Y) for some Y with |X \ Y| even, we have sup(X) − L(X) = sup(YZ̄) for that Y,
and similarly U(X) − sup(X) = sup(YZ̄) for the subset achieving the upper bound. By the choice
of Y′, both quantities are at least sup(Y′Z̄′):

sup(X) − L(X) ≥ sup(Y′Z̄′)
U(X) − sup(X) ≥ sup(Y′Z̄′)

Adding the two bounds, we have

U(X) − L(X) ≥ 2 · sup(Y′Z̄′)

Let Xa be an extension of X with a new item a ∉ X. Consider the two subsets Y′a and Y′ of Xa,
whose complements in Xa are Z′ and Z′a, respectively. We have |Xa \ Y′a| = |X \ Y′| and
|Xa \ Y′| = |X \ Y′| + 1, so exactly one of the two has an odd-sized complement and the other
an even-sized one; the first contributes to U(Xa) and the second to L(Xa) (or vice versa).
Since L(Xa) is a maximum over even-complement subsets and U(Xa) a minimum over odd-complement
ones, in either case we obtain

sup(Xa) − L(Xa) + U(Xa) − sup(Xa) ≤ sup(Y′aZ̄′) + sup(Y′Z̄′a)

Combining all of the above, we have

U(Xa) − L(Xa) ≤ sup(Y′aZ̄′) + sup(Y′Z̄′a)
              = sup(Y′Z̄′)
              ≤ (1/2) · (U(X) − L(X))


The second line follows from the fact that the transactions that contain Y′ but none of
the items in Z′ can be broken into two disjoint subsets: those that contain a and those
that do not.
Now if an itemset X is derivable, then we know that U(X) − L(X) = 0, which
immediately implies that all of its supersets will also be derivable.
Finally, this result can be used in an efficient algorithm to mine all non-derivable
itemsets, since we can prune an itemset X and all of its supersets from the search
space the moment we find X is derivable. Any of the algorithms we have studied for
itemset mining can be used with this pruning strategy.

C H A P T E R 10 Sequence Mining

10.5 EXERCISES

Q1. Consider the database shown in Table 10.1. Answer the following questions:
(a) Let minsup = 4. Find all frequent sequences.
Answer: The frequent sequences are:
A − 4, G − 4, T − 4
AA − 4, AG − 4, AT − 4, GA − 4, TA − 4, TG − 4
AAT − 4, AGA − 4, ATA − 4, ATG − 4, GAA − 4, TAA − 4
(b) Given that the alphabet is Σ = {A, C, G, T}, how many possible sequences of
length k can there be?
Answer: There can be 4^k sequences of length k.

Table 10.1. Sequence database for Q1

Id Sequence
s1 AATACAAGAAC
s2 GTATGGTGAT
s3 AACATGGCCAA
s4 AAGCGTGGTCAA

Table 10.2. Sequence database for Q2

Id Sequence
s1 ACGTCACG
s2 TCGA
s3 GACTGCA
s4 CAGTC
s5 AGCT
s6 TGCAGCTC
s7 AGTCAG


Q2. Given the DNA sequence database in Table 10.2, answer the following questions
using minsup = 4
(a) Find the maximal frequent sequences.
Answer: The set of frequent sequences, along with their supports are as
follows:
A − 7, C − 7, G − 7, T − 7
AC − 6, AG − 6, AT − 6, GA − 5, GC − 6, GG − 4, GT − 6, CA − 6, CC − 4,
CG − 6, CT − 5, TA − 5, TC − 6, TG − 5
ACT − 4, AGC − 6, AGT − 5, ATC − 5, GAG − 4, GCA − 4, GCG − 4, GTC − 5,
CAG − 4, CGC − 4, CTC − 4, TCA − 4, TCG − 4
AGTC − 4

The maximal frequent sequences are:


AGTC−4, ACT−4, GAG−4, GCG−4, GCA−4, CAG−4, CGC−4, CTC−4,
TCA − 4, TCG − 4
(b) Find all the closed frequent sequences.
Answer: The closed frequent sequences are:
AGTC−4, ACT−4, AGC−6, AGT−5, ATC−5, GTC−5, GAG−4, GCG−4, GCA−4,
CAG−4, CGC−4, CTC−4, TCA−4, TCG−4, AT−6, CT−5, GA−5, GT−6, CA−6,
CG−6, TA−5, TC−6, TG−5, A−7, C−7, G−7, T−7.
(c) Find the maximal frequent substrings.
Answer: The frequent substrings are as follows:
A − 7, C − 7, G − 7, T − 7
AG − 4, CA − 5, TC − 5

The maximal substrings are: AG − 4, CA − 5, TC − 5.


(d) Show how Spade would work on this dataset.
Answer: We create the vertical format for each item as follows:
t (A) = (1,1) (1,6) (2,4) (3,2) (3,7) (4,2) (5,1) (6,4) (7,1) (7,5)
t (C) = (1,2) (1,5) (1,7) (2,2) (3,3) (3,6) (4,1) (4,5) (5,3) (6,3) (6,6) (6,8) (7,4)
t (G) = (1,3) (1,8) (2,3) (3,1) (3,5) (4,3) (5,2) (6,2) (6,5) (7,2) (7,6)
t (T) = (1,4) (2,1) (3,4) (4,4) (5,4) (6,1) (6,7) (7,3)

We now intersect the vertical lists as follows (only frequent intersections for
prefix A are shown):
t (AC) = (1,2) (1,5) (1,7) (3,3) (3,6) (4,5) (5,3) (6,6) (6,8) (7,4)
t (AG) = (1,3) (1,8) (3,5) (4,3) (5,2) (6,5) (7,2) (7,6)
t (AT) = (1,4) (3,4) (4,4) (5,4) (6,7) (7,3)
t (ACT) = (1,4) (3,4) (6,7) (7,3)
t (AGC) = (1,5) (1,7) (3,6) (4,5) (5,3) (6,8) (7,4)
t (AGT) = (1,4) (4,4) (5,4) (6,7) (7,3)
t (ATC) = (1,5) (1,7) (3,6) (4,5) (6,8) (7,4)
t (AGTC) = (1,5) (1,7) (4,5) (6,8) (7,4)


The rest of the frequent sequences can be found in a similar manner.
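A sketch of the Spade-style temporal join used above: an occurrence of the new item is kept if the prefix occurs at an earlier position in the same sequence (tracking only the earliest prefix occurrence is enough for support counting; the position lists are those computed above, and the names are illustrative):

    pos = {
        'A': [(1,1),(1,6),(2,4),(3,2),(3,7),(4,2),(5,1),(6,4),(7,1),(7,5)],
        'C': [(1,2),(1,5),(1,7),(2,2),(3,3),(3,6),(4,1),(4,5),(5,3),(6,3),(6,6),(6,8),(7,4)],
        'G': [(1,3),(1,8),(2,3),(3,1),(3,5),(4,3),(5,2),(6,2),(6,5),(7,2),(7,6)],
        'T': [(1,4),(2,1),(3,4),(4,4),(5,4),(6,1),(6,7),(7,3)],
    }

    def seq_join(prefix_list, item_list):
        # keep (sid, p) of the item if the prefix occurs earlier in sequence sid
        first = {}
        for sid, p in prefix_list:
            first[sid] = min(p, first.get(sid, p))
        return [(sid, p) for sid, p in item_list if sid in first and p > first[sid]]

    def support(poslist):
        return len({sid for sid, _ in poslist})

    t_AG = seq_join(pos['A'], pos['G'])
    t_AGT = seq_join(t_AG, pos['T'])
    t_AGTC = seq_join(t_AGT, pos['C'])
    print(support(t_AG), support(t_AGT), support(t_AGTC))   # 6 5 4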


(e) Show the steps of the PrefixSpan algorithm.
Answer: We show the projections only for sequences with prefix A. The
projected DB for A, DA is as follows. It consists of the remaining sequence,
called the suffix, after the first occurrence of an A in each input sequence:
s1 CGTCACG
s3 CTGCA
s4 GTC
s5 GCT
s6 GCTC
s7 GTCAG
In this DB we count the single items: A(3), C(6), G(6), T(6). So AA is not freq,
but AC, AG and AT all are.
We need to recursively project on each of these sequences.
Project database DA on AC to get DAC :
s1 GTCACG
s3 TGCA
s5 T
s6 TC
s7 AG
only T is frequent, to get ACT(4).

Project DA on AG to get DAG :


s1 TCACG
s3 CA
s4 TC
s5 CT
s6 CTC
s7 TCAG
Counts: A(3), C(6), G(2), T(5). Only C and T are frequent, so we project on
those.
Project DAG on AGC:
s1 ACG
s3 A
s5 T
s6 TC
s7 AG
We don’t find any other frequent item.
Project DAG on AGT:
s1 CACG
s4 C
s6 C
s7 CAG


Only C is frequent, to give AGTC(4).


Project DA on AT:
s1 CACG
s3 GCA
s4 C
s6 C
s7 CAG
only C is frequent, to give ATC(5).
Likewise we will project on other items and get the final answer.

Q3. Given s = AABBACBBAA, and Σ = {A, B, C}. Define support as the number
of occurrence of a subsequence in s. Using minsup = 2, answer the following
questions:
(a) Show how the vertical Spade method can be extended to mine all frequent
substrings (consecutive subsequences) in s.
Answer: We maintain a vertical poslist for each candidate substring.
Each pair in a poslist gives the start and stop positions of an occurrence of the pattern
in the input sequence. For example, the pair (8, 9) for BA denotes the fact that the
substring starts at position 8 and ends at position 9.
The frequent substrings are therefore: A, B, AA, BA, BB, BBA.
(b) Construct the suffix tree for s using Ukkonen’s method. Show all intermediate
steps, including all suffix links.
Answer: The different steps of the suffix tree construction method are shown
below.


(c) Using the suffix tree from the previous step, find all the occurrences of the query
q = ABBA allowing for at most two mismatches.


Answer: The occurrences with at most two mismatches are given as


matches position
AABB 1
ABBA 2
ACBB 5
BBAA 7
CBBA 6

(d) Show the suffix tree when we add another character A just before the $. That is,
you must undo the effect of adding the $, add the new symbol A, and then add $
back again.
Answer: The suffix tree after removing $ is shown in the figure below:

[Figure: suffix tree for AABBACBBAA after removing the terminal $.]

The one after adding a new A is shown below:

[Figure: suffix tree after appending the new character A.]


Then adding $ back again, we get:

[Figure: suffix tree after adding the terminal $ back.]

(e) Describe an algorithm to extract all the maximal frequent substrings from a suffix
tree. Show all maximal frequent substrings in s.
Answer: In a suffix tree, each node stands for a substring (the path from root
to that node). So we can extend each node in the suffix tree with a new field
“support”, which is the number of occurrences of the substring the node stands
for. In each internal node, the support is the number of leaf nodes of that
sub-tree. In each leaf node, the number is set to 1. Then we can traverse the
suffix tree. If we reach to an internal node whose support is at least minsup but
its children’s support are all less than minsup, then the path from root to this
internal node is a potential maximal frequent substring. The only other check
for maximality we have to do is that there is no character that can be added as
a prefix, which still results in a frequent substring.
In the above example, by using this algorithm we can find the maximal frequent
substrings: AA, BBA.

Q4. Consider a bitvector based approach for mining frequent subsequences. For instance,
in Table 10.1, for s1 , the symbol C occurs at positions 5 and 11. Thus, the bitvector for
C in s1 is given as 00001000001. Because C does not appear in s2 its bitvector can be
omitted for s2 . The complete set of bitvectors for symbol C is

(s1 , 00001000001)
(s3 , 00100001100)
(s4 , 000100000100)

Given the set of bitvectors for each symbol show how we can mine all frequent sub-
sequences by using bit operations on the bitvectors. Show the frequent subsequences
and their bitvectors using minsup = 4.


Answer: The bitvectors for A are as follows:

(s1 , 11010110110)
(s2 , 0010000010)
(s3 , 11010000011)
(s4 , 110000000011)

The bitvectors for G are as follows:

(s1 , 00000001000)
(s2 , 1000110100)
(s3 , 00000110000)
(s4 , 001010110000)

To find support of AG, in each of the input sequences, we have to find first
occurrence of bit 1 in the bitvector for A. Suppose we find the first occurrence at
position i, then we make the bit bi = 0 and make all other bits 1’s after bi ; i.e.,
bi+1 ...bk are all 1’s. Now take AND with the bitvector for G for the same input
sequences. For example, consider s2 . The first occurrence of A is at position 3, so
we set that to 0, and make the remaining occurrences as 1 to obtain: 0001111111.
Now taking the AND with the bitvector for G for s2 we obtain 0000110100. This
means that in s2 , there are three occurrences of a G after an A, namely at positions
5, 6, 8. When we do the bitvector operations for AG, we obtain the following results:

(s1 , 00000001000)
(s2 , 0000110100)
(s3 , 00000110000)
(s4 , 001010110000)

Since there is at least a single 1 bit in each sequence, we have the support of AG as
4. In a similar manner we can obtain all of the remaining frequent sequences listed
in Q1(a).
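A small sketch of the bit operations described above, with bitvectors kept as strings of 0/1 (the helper names are illustrative):

    A = {1: "11010110110", 2: "0010000010", 3: "11010000011", 4: "110000000011"}
    G = {1: "00000001000", 2: "1000110100", 3: "00000110000", 4: "001010110000"}

    def follows(first, second):
        # 1-bits of `second` that lie strictly after the first 1-bit of `first`
        i = first.find('1')
        if i < 0:
            return '0' * len(second)
        mask = '0' * (i + 1) + '1' * (len(second) - i - 1)
        return ''.join('1' if m == '1' and b == '1' else '0' for m, b in zip(mask, second))

    AG = {sid: follows(A[sid], G[sid]) for sid in A}
    print(AG[2])                                       # 0000110100, as derived above
    print(sum('1' in bv for bv in AG.values()))        # support of AG = 4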

Q5. Consider the database shown in Table 10.3. Each sequence comprises itemset events
that happen at the same time. For example, sequence s1 can be considered to be a
sequence of itemsets (AB)10 (B)20 (AB)30 (AC)40 , where symbols within brackets are
considered to co-occur at the same time, which is given in the subscripts. Describe
an algorithm that can mine all the frequent subsequences over itemset events. The
itemsets can be of any length as long as they are frequent. Find all frequent itemset
sequences with minsup = 3.

Answer: The sequence mining proceeds as before, where we look for occurrences
of a symbol after the given prefix. However, now we also have to consider the
possible itemset extensions at each time slot. The set of all frequent itemset
sequences is given as follows:


Table 10.3. Sequences for Q5

Id Time Items
10 A, B
20 B
s1
30 A, B
40 A, C
20 A, C
s2 30 A, B, C
50 B
10 A
30 B
s3 40 A
50 C
60 B
30 A, B
40 A
s4 50 B
60 C

A − 4, B − 4, C − 4
AA − 4, {A, B} − 3, AB − 4, AC − 4, BA − 3, BB − 4
AAB − 3, AAC − 3, ABB − 3, ABC − 3, {A, B}B − 3

Q6. The suffix tree shown in Figure 10.5 contains all suffixes for the three sequences
s1 , s2 , s3 in Table 10.1. Note that a pair (i, j ) in a leaf denotes the j th suffix of
sequence si .
(a) Add a new sequence s4 = GAAGCAGAA to the existing suffix tree, using the
Ukkonen algorithm. Show the last character position (e), along with the suffixes
(l) as they become explicit in the tree for s4 . Show the final suffix tree after all
suffixes of s4 have become explicit.
Answer: When adding s4 , we find that the following strings up to the current
last character will all be found in the tree: G, GA, GAA, and GAAG. When
looking at character 5, namely C, we find the first difference. At this point
e = 5, and suffixes l = 1, 2, 3, 4 will become explicit. Suffix 5 does not become
explicit, since C is already in the tree. In fact, all the remaining suffixes
CA, CAG, CAGA, CAGAA will be found in the tree and will remain implicit.
Finally, when we consider the terminal character $, all the suffixes will become
explicit, i.e., when e = 10, l = 5, 6, 7, 8, 9, 10. The final suffix tree after adding
sequence s4 is shown (without the last character $).


[Figure: the final suffix tree containing all suffixes of s1, s2, s3, and s4.]

(b) Find all closed frequent substrings with minsup = 2 using the final suffix
tree.
Answer: Now based on the tree above, the closed frequent substrings, with
minsup = 2 are:
T-3
AG - 4
GA - 4
CAG - 3
GAAG - 3
CAGAA - 2
GAAGT - 2

Q7. Given the following three sequences:

s1 : GAAGT
s2 : CAGAT

s3 : ACGT

Find all the frequent subsequences with minsup = 2, but allowing at most a gap of 1
position between successive sequence elements.

Answer: The frequent subsequences with a maximum gap of 1 are as follows: A(3), C(2), G(3), T(3),
AA(2), AG(3), AT(2), CG(2), GA(2), GT(3), AAT(2), AGT(3), CGT(2), GAT(2).

C H A P T E R 11 Graph Pattern Mining

11.5 EXERCISES

Q1. Find the canonical DFS code for the graph in Figure 11.1. Try to eliminate some codes
without generating the complete search tree. For example, you can eliminate a code
if you can show that it will have a larger code than some other code.

a c

b a d a

b a
Figure 11.1. Graph for Q1.

Answer: First we number the vertices as follows (reading the figure top to bottom,
left to right): 1 = a, 2 = c, 3 = b, 4 = a, 5 = d, 6 = a, 7 = b, 8 = a.

The canonical code is given by the following DFS ordering of the vertices (original
vertex numbers in parentheses): a(6), a(8), a(4), a(1), b(3), c(2), b(7), d(5).

Q2. Given the graph in Figure 11.2. Mine all the frequent subgraphs with minsup = 1. For
each frequent subgraph, also show its canonical code.

a a

a
Figure 11.2. Graph for Q2.

Answer: We show the DFS code for each pattern.

subgraphs with 1 edge:


----------------------
1)


0 a
|
1 a

0 1 a a

subgraphs with 2 edges:


-----------------------
2) extending 1)
0 a
|
1 a
|
2 a

0 1 a a
1 2 a a

3) extending 1)
a
| \
a a

This subgraph is isomorphic to the previous one 2).

subgraphs with 3 edges:


-----------------------
4) extending 2)
0 a
| \
1 a |
| /
2 a

0 1 a a
1 2 a a
2 0 a a

5) extending 2)
0 a
|
1 a
|
2 a
|
3 a


0 1 a a
1 2 a a
2 3 a a

6) extending 2)
0 a
|
1 a
|\
2 a a 3

0 1 a a
1 2 a a
1 3 a a

7) extending 2)
a
|\
a a
|
a
This subgraph is isomorphic to 5).

subgraphs with 4 edges:


----------------------
8) extending 4)
0 a
| \
1 a |
| /
2 a
|
3 a

0 1 a a
1 2 a a
2 0 a a
2 3 a a

9) extending 5)
0 a
| \
1 a |
| |
2 a |
| /
3 a/


0 1 a a
1 2 a a
2 3 a a
3 0 a a

10) extending 6)
a
| \
a |
| \|
a a

This subgraph is isomorphic with 8).

subgraphs with 5 edges:


----------------------

11) extending 8)
0 a
|\
1 a |
|\/
2 a/|
|/
3 a

0 1 a a
1 2 a a
2 0 a a
2 3 a a
3 1 a a

Q3. Consider the graph shown in Figure 11.3. Show all its isomorphic graphs and their
DFS codes, and find the canonical representative (you may omit isomorphic graphs
that can definitely not have canonical codes).

Answer: The figure below shows the potential isomorphic graphs that can have
minimal DFS codes.


[Graph on five vertices, four labeled A and one labeled B, with edges labeled a and b.]
Figure 11.3. Graph for Q3.

[Figure: the candidate vertex orderings G1 through G6 of the graph in Figure 11.3.]

The DFS codes for G4 and G6 cannot be minimal since they have as first edge
the tuple (0, 1, A, A, b), but all of the other graphs shown have the first edge
(0, 1, A, A, a). In the table below we show the DFS codes for the other four graphs
and indicate the minimal DFS code in bold. Note that the final comparison is
between G2 and G5 , but since (2, 3) < (1, 3), regardless of the labels, G5 wins out.

G1 G2 G3 G5
(0, 1, A, A, a) (0, 1, A, A, a) (0, 1, A, A, a) (0,1,A,A,a)
(1, 2, A, A, b) (1, 2, A, A, a) (1, 2, A, A, b) (1,2,A,A,a)
(2, 0, A, A, a) (2, 0, A, A, b) (2, 0, A, A, a) (2,0,A,A,b)
(1, 3, A, B, a) (1, 3, A, A, b) (2, 3, A, B, a) (2,3,A,B,a)
(0, 4, A, A, b) (0, 4, A, B, a) (0, 4, A, A, b) (1,4,A,A,b)

Q4. Given the graphs in Figure 11.4, separate them into isomorphic groups.

Answer: The groups of isomorphic graphs are as follows: {G1 }, {G2 , G5 }, {G3 },
{G4 , G6 }, {G7 },

Q5. Given the graph in Figure 11.5. Find the maximum DFS code for the graph, subject to
the constraint that all extensions (whether forward or backward) are done only from
the right most path.

Answer: The maximum DFS code, subject to the rightmost-path constraint, is given by the edge tuples listed below.


[Seven small graphs G1 through G7, with vertices labeled a and b.]
Figure 11.4. Data for Q4.

[Graph with vertices labeled a, b, and c.]

Figure 11.5. Graph for Q5.


0,1,C,C
1,2,C,C
2,3,C,C
3,4,C,B
4,5,B,A
5,2,A,C
3,6,C,A
6,2,A,C
6,1,A,C
6,0,A,C
3,1,C,C

Note that in the minimal code, we consider the back edges from a node before
going depth-first or branching. In the maximum code, we still have to go depth-first,
but the back edges will all come in the end, further we work backwards from
higher numbered nodes to lower numbered ones. However, rightmost path must
be respected; therefore (5, 2, A, C) has to come before (3, 6, C, A).

Q6. For an edge labeled undirected graph G = (V, E), define its labeled adjacency matrix
A as follows:

A(i, j) = L(vi)        if i = j
A(i, j) = L(vi, vj)    if (vi, vj) ∈ E
A(i, j) = 0            otherwise

where L(vi ) is the label for vertex vi and L(vi , vj ) is the label for edge (vi , vj ). In other
words, the labeled adjacency matrix has the node labels on the main diagonal, and it
has the label of the edge (vi , vj ) in cell A(i, j ). Finally, a 0 in cell A(i, j ) means that
there is no edge between vi and vj .
Given a particular permutation of the vertices, a matrix code for the graph is
obtained by concatenating the lower triangular submatrix of A row-by-row. For
example, one possible matrix corresponding to the default vertex permutation
v0 v1 v2 v3 v4 v5 for the graph in Figure 11.6 is given as


[The graph has six vertices v0, v1, v2, v3, v4, v5 with labels a, b, b, b, b, a,
respectively, and labeled edges (v0, v1) = x, (v1, v2) = y, (v1, v3) = y,
(v2, v3) = y, (v2, v4) = y, (v3, v4) = y, (v4, v5) = z.]
Figure 11.6. Graph for Q6.

a
x b
0 y b
0 y y b
0 0 y y b
0 0 0 0 z a

The code for the matrix above is axb0yb0yyb00yyb0000za. Given the total ordering
on the labels
0<a<b<x <y <z
find the maximum matrix code for the graph in Figure 11.6. That is, among all possible
vertex permutations and the corresponding matrix codes, you have to choose the
lexicographically largest code.

Answer: The maximum code is bybyybyy0b00z0a000x0a, which corresponds to the


matrix shown below, which is for the permutation v3 v2 v4 v1 v5 v0 :
b
y b
y y b
y y 0 b
0 0 z 0 a
0 0 0 x 0 a
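A brute-force sketch confirms this; the graph below is my reconstruction from the example matrix for the default permutation, and the ASCII ordering of the characters 0, a, b, x, y, z happens to agree with the total order given in the question, so plain string comparison works:

    from itertools import permutations

    label = {0: 'a', 1: 'b', 2: 'b', 3: 'b', 4: 'b', 5: 'a'}
    edge = {}
    for u, v, l in [(0,1,'x'), (1,2,'y'), (1,3,'y'), (2,3,'y'),
                    (2,4,'y'), (3,4,'y'), (4,5,'z')]:
        edge[u, v] = edge[v, u] = l

    def matrix_code(perm):
        # concatenate the lower-triangular labeled adjacency matrix row by row
        code = []
        for i, u in enumerate(perm):
            code.extend(edge.get((u, v), '0') for v in perm[:i])
            code.append(label[u])
        return ''.join(code)

    print(matrix_code((0, 1, 2, 3, 4, 5)))   # axb0yb0yyb00yyb0000za
    best = max(permutations(range(6)), key=matrix_code)
    print(best, matrix_code(best))           # should reproduce the maximum code above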

C H A P T E R 12 Pattern and Rule Assessment

12.4 EXERCISES

Q1. Show that if X and Y are independent, then conv(X −→ Y) = 1.

Answer: We know that conv(X → Y) = (1 − rsup(Y)) / (1 − conf(X → Y)). Now if X and Y are
independent, then conf(X → Y) = P(Y|X) = P(Y) = rsup(Y). In this case, we
immediately have conv(X → Y) = 1.

Q2. Show that if X and Y are independent then oddsratio(X −→ Y) = 1.

Answer: If X and Y are independent, then P (XY) = P (X)P (Y). Further,


P(¬XY) = P(Y) − P(XY) = P(Y)(1 − P(X)) = P(¬X)P(Y), and likewise P(X¬Y) =
P(X) − P(XY) = P(X)(1 − P(Y)) = P(X)P(¬Y). Now consider P(¬X¬Y); we have

P(¬X¬Y) = 1 − P(XY) − P(X¬Y) − P(¬XY)
        = 1 − P(XY) − (P(X) − P(XY)) − (P(Y) − P(XY))
        = 1 − P(X) − P(Y) + P(XY)
        = 1 − P(X) − P(Y) + P(X)P(Y)
        = (1 − P(X)) · (1 − P(Y))
        = P(¬X)P(¬Y)

Finally, we have

oddsratio(X → Y) = [P(XY) · P(¬X¬Y)] / [P(X¬Y) · P(¬XY)]
                 = [P(X)P(Y) · P(¬X)P(¬Y)] / [P(X)P(¬Y) · P(¬X)P(Y)]
                 = 1


Table 12.1. Data for Q5

Support No. of samples


10,000 5
15,000 20
20,000 40
25,000 50
30,000 20
35,000 50
40,000 5
45,000 10

Q3. Show that for a frequent itemset X, the value of the relative lift statistic defined in
Example 12.20 lies in the range

[1 − |D|/minsup, 1]

Answer: We have rlift(X, D, Di) = 1 − sup(X, Di)/sup(X, D). It is clear that the maximum value
cannot exceed 1, since support cannot be negative. Now, the minimum support of X
in D can be minsup, whereas the maximum support of X in Di can be |D|; therefore,
the least value of rlift is 1 − |D|/minsup.

Q4. Prove that all subsets of a minimal generator must themselves be minimal generators.

Answer: Let X be a minimal generator, with the tidset t(X). Let Y ⊂ X. Since X is
minimal, we must have t(Y) ⊃ t(X). Also, note that t(X) = t(X \ Y ∪ Y) = t(X \ Y) ∩
t(Y).
Assume that Y is not a minimal generator; then there exists a minimal generator
Z ⊂ Y such that t(Z) = t(Y). However, in this case, t((X \ Y) ∪ Z) = t(X \ Y) ∩
t(Z) = t(X \ Y) ∩ t(Y) = t(X), and (X \ Y) ∪ Z is a proper subset of X, which
contradicts the fact that X is a minimal generator. Thus, we conclude that Y must be
a minimal generator.

Q5. Let D be a binary database spanning one billion (10^9) transactions. Because it is
too time consuming to mine it directly, we use Monte Carlo sampling to find the
bounds on the frequency of a given itemset X. We run 200 sampling trials Di (i =
1 . . . 200), with each sample of size 100, 000, and we obtain the support values for X in
the various samples, as shown in Table 12.1. The table shows the number of samples
where the support of the itemset was a given value. For instance, in 5 samples its
support was 10,000. Answer the following questions:
(a) Draw a histogram for the table, and calculate the mean and variance of the
support across the different samples.


Answer: The histogram is plotted in the figure below:

Let ni be the number of samples where X has support fi. The mean and
variance of the support across the samples are given as:

µ = (Σi ni · fi) / (Σi ni) = 5400000 / 200 = 27000
σ² = (Σi ni · (fi − µ)²) / (Σi ni) = (1395 × 10^7) / 200 = 69750000

The standard deviation is therefore σ = 8351.65. Note that with respect to the
whole dataset, the mean relative support is µ = 27000/10^9 = 2.7 × 10^−5, and
its variance is σ² = 6.975 × 10^−11, with standard deviation σ = 8.352 × 10^−6.
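These numbers follow from a direct weighted computation over Table 12.1, for example:

    counts = {10000: 5, 15000: 20, 20000: 40, 25000: 50,
              30000: 20, 35000: 50, 40000: 5, 45000: 10}
    n = sum(counts.values())                                  # 200 samples
    mean = sum(f * c for f, c in counts.items()) / n          # 27000.0
    var = sum(c * (f - mean) ** 2 for f, c in counts.items()) / n
    print(mean, var, var ** 0.5)                              # 27000.0 69750000.0 8351.6...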
(b) Find the lower and upper bound on the support of X at the 95% confidence level.
The support values given should be for the entire database D.
Answer: We assume that support follows a normal distribution, for which the
critical z−value for the 95% confidence interval is 1.96. Thus, the support
interval is given as

(µ − 1.96 × σ, µ + 1.96 × σ) = (2.7 × 10^−5 − 1.637 × 10^−5, 2.7 × 10^−5 + 1.637 × 10^−5)
                             = (1.0631, 4.3369) × 10^−5

In terms of absolute support we have the bounds (10631, 43369).


(c) Assume that minsup = 0.25, and let the observed support of X in a sample be
sup(X) = 32500. Set up a hypothesis testing framework to check if the support of
X is significantly higher than the minsup value. What is the p-value?
Answer: Since the sample size is 100, 000, minsup = 0.25 corresponds to the
absolute minimum support value of 25000.


The observed support value is higher than the minsup by a count of 32500 −
25000 = 7500. We can use the empirical probability mass function from part
(a) to get the p-value of observing a difference of 7500 from 25000. We get
p-value(7500) = P (sup(X) − minsup ≥ 7500) = P (sup(X) ≥ 32500) = 65/200 =
0.325. Since the p-value is quite high, we conclude that the support is not
significantly higher than the minsup value.

Q6. Let A and B be two binary attributes. While mining association rules at 30%
minimum support and 60% minimum confidence, the following rule was mined:
A −→ B, with sup = 0.4, and conf = 0.66. Assume that there are a total of 10,000
customers, and that 4000 of them buy both A and B; 2000 buy A but not B, 3500 buy
B but not A, and 500 buy neither A nor B.
Compute the dependence between A and B via the χ 2 -statistic from the corre-
sponding contingency table. Do you think the discovered association is truly a strong
rule, that is, does A predict B strongly? Set up a hypothesis testing framework, writing
down the null and alternate hypotheses, to answer the above question, at the 95%
confidence level. Here are some values of chi-squared statistic for the 95% confidence
level for various degrees of freedom (df):
df χ2
1 3.84
2 5.99
3 7.82
4 9.49
5 11.07
6 12.59

Answer: Let our null hypothesis be: Ho : A and B are independent, and the
alternate hypothesis is Ha : A and B are dependent.
Let’s set up the contingency table:

         B=1    B=0    marginal   Prob
A=1      4000   2000   6000       0.60
A=0      3500   500    4000       0.40
marginal 7500   2500   10000
Prob     0.75   0.25

The estimated counts via the null hypothesis are:

B=1 B=0
A=1 4500 1500
A=0 3000 1000

The values (nij − eij)² / eij are given as:

B=1 B=0
A=1 55.56 166.67
A=0 83.33 250


We then get χ² = Σi Σj (nij − eij)²/eij = 555.56. This value is way above the 3.84
threshold for 1 degree of freedom. Thus, we can safely reject the null hypothesis,
and we can claim A and B are highly dependent.
On the other hand this still doesn’t answer the question whether A → B is a
strong rule. For this we can compute the actual support of AB versus the null
hypothesis that they are independent, which is captured by the Lift measure. We
calculate lift(A → B) = sup(AB) / (sup(A) · sup(B)) = 0.40 / (0.60 · 0.75) = 0.89. A value of 1 would imply
they are independent, but a value of < 1 implies a negative dependence. Thus, we
can conclude that instead of A being a good predictor of B, one is less likely to buy
B given A!
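The χ² computation can be reproduced directly from the contingency table (a minimal sketch; the cell counts are those given in the question):

    observed = {('A1', 'B1'): 4000, ('A1', 'B0'): 2000,
                ('A0', 'B1'): 3500, ('A0', 'B0'): 500}
    n = sum(observed.values())
    row = {'A1': 6000, 'A0': 4000}
    col = {'B1': 7500, 'B0': 2500}

    chi2 = 0.0
    for (a, b), o in observed.items():
        e = row[a] * col[b] / n              # expected count under independence
        chi2 += (o - e) ** 2 / e
    print(chi2)                               # about 555.6, far above 3.84 for df = 1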

P A R T THREE CLUSTERING
C H A P T E R 13 Representative-based Clustering

13.5 EXERCISES

Q1. Given the following points: 2, 4, 10, 12, 3, 20, 30, 11, 25. Assume k = 3, and that we
randomly pick the initial means µ1 = 2, µ2 = 4 and µ3 = 6. Show the clusters obtained
using K-means algorithm after one iteration, and show the new means for the next
iteration.

Answer: Starting with the initial means µ1 = 2, µ2 = 4 and µ3 = 6, we assign each


point to the closest mean, which yields the following clusters:

C1 = {2, 3} C2 = {4} C3 = {10, 11, 12, 20, 25, 30}

where we assigned 3 to C1 instead of C2 . The new means are as follows:

µ1 = 2.5 µ2 = 4 µ3 = 18

For the second iteration, the assignment to the closest mean yields the following
clusters:

C1 = {2, 3} C2 = {4, 10, 11} C3 = {12, 20, 25, 30}

The new means are as follows:

µ1 = 2.5 µ2 = 8.33 µ3 = 21.75

For the third iteration, the assignment to the closest mean yields the following
clusters:

C1 = {2, 3, 4} C2 = {10, 11, 12} C3 = {20, 25, 30}

The new means are as follows:

µ1 = 3 µ2 = 11 µ3 = 25

Thereafter, the clusters do not change.
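The iterations above can be traced with a minimal 1-D K-means sketch (ties go to the lower-indexed mean, matching the convention used above; the function name is illustrative):

    def kmeans_1d(points, means, iters=10):
        for _ in range(iters):
            clusters = [[] for _ in means]
            for p in points:                                # assignment step
                j = min(range(len(means)), key=lambda i: abs(p - means[i]))
                clusters[j].append(p)
            means = [sum(c) / len(c) if c else m            # update step
                     for c, m in zip(clusters, means)]
        return means, clusters

    print(kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [2, 4, 6]))
    # converges to means 3, 11, 25 with clusters {2,3,4}, {10,11,12}, {20,25,30}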


Table 13.1. Dataset for Q2

x P (C1 |x) P (C2 |x)


2 0.9 0.1
3 0.8 0.1
7 0.3 0.7
9 0.1 0.9
2 0.9 0.1
1 0.8 0.2

Q2. Given the data points in Table 13.1, and their probability of belonging to two clusters.
Assume that these points were produced by a mixture of two univariate normal
distributions. Answer the following questions:
(a) Find the maximum likelihood estimate of the means µ1 and µ2 .
Answer: Based on the maximum likelihood equations, we know that:
µi = (Σ_{j=1}^n xj · P(Ci |xj)) / (Σ_{j=1}^n P(Ci |xj))

We get the following counts:

x P (C1 |x) P (C2 |x) x · P (C1 |x) x · P (C2 |x)


2 0.9 0.1 1.8 0.2
3 0.8 0.1 2.4 0.3
7 0.3 0.7 2.1 4.9
9 0.1 0.9 0.9 8.1
2 0.9 0.1 1.8 0.2
1 0.8 0.2 0.8 0.2
Sum 3.8 2.1 9.8 13.9

Therefore, µ1 = 9.8/3.8 = 2.58 and µ2 = 13.9/2.1 = 6.62.


(b) Assume that µ1 = 2, µ2 = 7, and σ1 = σ2 = 1. Find the probability that the point
x = 5 belongs to cluster C1 and to cluster C2 . You may assume that the prior
probability of each cluster is equal (i.e., P (C1 ) = P (C2 ) = 0.5), and the prior
probability P (x = 5) = 0.029.
Answer: We know that P (Ci |x) = P (x|Ci )P (Ci )/P (x), and P (x|Ci ) =
f (x|µi , σi ).
Now f(5|2, 1) = (1/√(2π)) · e^(−(5−2)²/2) = 0.0044, and f(5|7, 1) = (1/√(2π)) · e^(−(5−7)²/2) = 0.054.
Thus, P (C1 |x) = 0.0044 · 0.5/0.029 = 0.07, and P (C2 |x) = 0.054 · 0.5/0.029 =
0.93.
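The posterior computation in part (b) is a direct application of Bayes' rule with univariate normal densities (a minimal sketch; names are illustrative):

    from math import exp, pi, sqrt

    def npdf(x, mu, sigma=1.0):
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    x = 5
    f1, f2 = npdf(x, 2), npdf(x, 7)          # 0.0044 and 0.054
    px = 0.5 * f1 + 0.5 * f2                  # about 0.029, the given P(x = 5)
    print(0.5 * f1 / px, 0.5 * f2 / px)       # posteriors P(C1|x) and P(C2|x)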

Q3. Given the two-dimensional points in Table 13.2, assume that k = 2, and that initially
the points are assigned to clusters as follows: C1 = {x1 , x2 , x4 } and C2 = {x3 , x5 }.
Answer the following questions:
(a) Apply the K-means algorithm until convergence, that is, the clusters do not
change, assuming (1) the usual Euclidean distance or the L2 -norm as the distance

Table 13.2. Dataset for Q3

X1 X2
x1 0 2
x2 0 0
x3 1.5 0
x4 5 0
x5 5 2

between points, defined as ‖xi − xj‖_2 = (Σ_{a=1}^d (xia − xja)²)^(1/2), and (2) the
Manhattan distance or the L1-norm, defined as ‖xi − xj‖_1 = Σ_{a=1}^d |xia − xja|.
Answer: First, we consider the Euclidean distance. Initially, the two means are
given as µ1 = (5/3, 2/3)T = (1.67, 0.67)T and µ2 = (6.5/2, 2/2)T = (3.25, 1)T .
We compute the distance of each point to the cluster means, and assign it to
the nearest mean, as follows:
d(xi , µ1 ) d(xi , µ2 ) Cluster
x1 2.1 3.4 c1
x2 1.8 3.4 c1
x3 0.7 2.0 c1
x4 3.4 2.0 c2
x5 3.6 2.0 c2
For the next iteration, we recompute the means, as follows: µ1 = (1.5/3, 2/3)T =
(0.5, 0.67)T and µ2 = (10/2, 2/2)T = (5, 1)T . The new cluster assignments for the
points are follows:
d(xi , µ1 ) d(xi , µ2 ) Cluster
x1 1.42 5.1 c1
x2 0.83 5.1 c1
x3 1.2 3.6 c1
x4 4.5 1.0 c2
x5 4.7 1.0 c2
Since there is no change in cluster assignments, so we stop.
Now we consider the Manhattan distance. From the two means, µ1 =
(1.67, 0.67)T and µ2 = (3.25, 1)T , we compute the distance of each point to the
cluster means, and assign it to the nearest mean, as follows:
d(xi , µ1 ) d(xi , µ2 ) Cluster
x1 3 4.25 c1
x2 2.34 4.25 c1
x3 0.84 2.75 c1
x4 4 2.75 c2
x5 4.66 2.75 c2
The assignments are the same as for Euclidean distance. For the next iteration,
we recompute the means, as follows: µ1 = (1.5/3, 2/3)T = (0.5, 0.67)T and µ2 =
(10/2, 2/2)T = (5, 1)T . The new cluster assignments for the points are follows:


d(xi , µ1 ) d(xi , µ2 ) Cluster


x1 1.83 6 c1
x2 1.17 6 c1
x3 1.67 4.5 c1
x4 5.17 1.0 c2
x5 5.83 1.0 c2
Since there is no change in cluster assignments, so we stop.
(b) Apply the EM algorithm with k = 2 assuming that the dimensions are independent.
Show one complete execution of the expectation and the maximization steps.
Start with the assumption that P (Ci |xj a ) = 0.5 for a = 1, 2 and j = 1, . . . , 5.
Answer: Maximization Step: Since P(Ci |xj) = 0.5 for both the clusters, the
mean for cluster C1 is given as

µ1 = [0.5 · ((0, 2)T + (0, 0)T + (1.5, 0)T + (5, 0)T + (5, 2)T)] / (0.5 · 5) = (2.3, 0.8)T

In fact the mean for cluster C2 is also µ2 = (2.3, 0.8)T.
To compute the covariance matrix for C1, we first subtract the mean from each
point to obtain the centered data points:

(−2.3, 1.2)T, (−2.3, −0.8)T, (−0.8, −0.8)T, (2.7, −0.8)T, (2.7, 1.2)T
The covariance matrix for C1 can be computed in outer-product form, but
ignoring (or setting to zero) the off-diagonal elements, since we assume that
the dimensions are independent:

Σ1 = [0.5 · (diag(5.29, 1.44) + diag(5.29, 0.64) + diag(0.64, 0.64) + diag(7.29, 0.64) + diag(7.29, 1.44))] / (0.5 · 5)
   = diag(25.8, 4.8) / 5
   = diag(5.16, 0.96)
This is also the same for C2, i.e., Σ2 = Σ1.
Finally, the total weight in each cluster is also the same, i.e., P(C1) = 2.5/5 = 0.5,
and thus P(C2) = 0.5.
Expectation Step: Here we have to compute the posterior probability for each
cluster, given a point, using:

P(Cj | xi) = f(xi | µj, Σj) · P(Cj) / P(xi)

We have:


xi      f(xi | µj, Σj)      P(C1 | xi)      P(C2 | xi)


x1 0.0202 0.5 0.5
x2 0.0307 0.5 0.5
x3 0.4816 0.5 0.5
x4 0.0253 0.5 0.5
x5 0.0167 0.5 0.5
This example represents a degenerate case of the EM algorithm, due to the
assumption that P (Ci |xj ) = 0.5, which results in both clusters having the same
prior probability, mean and covariance matrix.

Q4. Given the categorical database in Table 13.3. Find k = 2 clusters in this data using
the EM method. Assume that each attribute is independent, and that the domain of
each attribute is {A, C, T}. Initially assume that the points are partitioned as follows:
C1 = {x1 , x4 }, and C2 = {x2 , x3 }. Assume that P (C1 ) = P (C2 ) = 0.5.

Table 13.3. Dataset for Q4

X1 X2
x1 A T
x2 A A
x3 C C
x4 A C

The probability of an attribute value given a cluster is given as

P(xja | Ci) = (no. of times the symbol xja occurs in cluster Ci) / (no. of objects in cluster Ci)

for a = 1, 2. The probability of a point given a cluster is then given as

P(xj | Ci) = Π_{a=1}^2 P(xja | Ci)

Instead of computing the mean for each cluster, generate a partition of the objects
by doing a hard assignment. That is, in the expectation step compute P (Ci |xj ), and
in the maximization step assign the point xj to the cluster with the largest P (Ci |xj )
value, which gives a new partitioning of the points. Show one full iteration of the EM
algorithm and show the resulting clusters.

Answer: Given the initial partition: C1 = {x1 , x4 }, and C2 = {x2 , x3 }, with P (C1 ) =
P (C2 ) = 0.5.
Expectation Step: First consider cluster C1 . For attribute X1 , we have:
P (X1 = A|C1 ) = 2/2 = 1, which implies P (X1 = C|C1 ) = P (X1 = T|C1 ) = 0.
For attribute X2 , we have: P (X2 = A|C1 ) = 0/2 = 0 and P (X2 = C|C1 ) = P (X2 =
T|C1 ) = 1/2.
Likewise for C2 , attributes X1 and X2 we have:
P (X1 = A|C2 ) = P (X1 = C|C2 ) = 1/2, and P (X1 = T|C2 ) = 0, and
P (X2 = A|C2 ) = P (X2 = C|C2 ) = 1/2, and P (X2 = T|C2 ) = 0.


Assuming attribute independence, we can now compute the following values:


P (x1 |C1 ) = P (X1 = A|C1 ) · P (X2 = T|C1 ) = 1 · 1/2 = 1/2
P (x1 |C2 ) = P (X1 = A|C2 ) · P (X2 = T|C2 ) = 1/2 · 0 = 0

P (x2 |C1 ) = P (X1 = A|C1 ) · P (X2 = A|C1 ) = 1 · 0 = 0


P (x2 |C2 ) = P (X1 = A|C2 ) · P (X2 = A|C2 ) = 1/2 · 1/2 = 1/4

P (x3 |C1 ) = P (X1 = C|C1 ) · P (X2 = C|C1 ) = 0 · 1/2 = 0


P (x3 |C2 ) = P (X1 = C|C2 ) · P (X2 = C|C2 ) = 1/2 · 1/2 = 1/4

P (x4 |C1 ) = P (X1 = A|C1 ) · P (X2 = C|C1 ) = 1 · 1/2 = 1/2


P (x4 |C2 ) = P (X1 = A|C2 ) · P (X2 = C|C2 ) = 1/2 · 1/2 = 1/4

We can then compute:


P (x1 ) = P (x1 |C1 )P (C1 ) + P (x1 |C2 )P (C2 ) = 1/2 · 1/2 + 0 = 1/4
P (x2 ) = P (x2 |C1 )P (C1 ) + P (x2 |C2 )P (C2 ) = 0 + 1/4 · 1/2 = 1/8
P (x3 ) = P (x3 |C1 )P (C1 ) + P (x3 |C2 )P (C2 ) = 0 + 1/4 · 1/2 = 1/8
P (x4 ) = P (x4 |C1 )P (C1 ) + P (x4 |C2 )P (C2 ) = 1/2 · 1/2 + 1/4 · 1/2 = 1/4 + 1/8 = 3/8

Using Bayes theorem we get:


P (C1 |x1 ) = 1
P (C2 |x1 ) = 0

P (C1 |x2 ) = 0
P (C2 |x2 ) = 1

P (C1 |x3 ) = 0
P (C2 |x3 ) = 1

P (C1 |x4 ) = 2/3


P (C2 |x4 ) = 1/3

Maximization Step: Based on the probabilities above, we obtain the partitions


C1 = {x1 , x4 } and C2 = {x2 , x3 }.

Q5. Given the points in Table 13.4, assume that there are two clusters: C1 and C2 , with
µ1 = (0.5, 4.5, 2.5)T and µ2 = (2.5, 2, 1.5)T . Initially assign each point to the closest
mean, and compute the covariance matrices Σi and the prior probabilities P(Ci) for
i = 1, 2. Next, answer which cluster is more likely to have produced x8 ?

Answer: For x8 = (2.5, 3.5, 2.8)T , we first compute the distance from the two means
µ1 and µ2 as follows, which defaults to the Euclidean distance since the covariance
matrix is I. We have

d(x8, µ1)² = ‖x8 − µ1‖² = ‖(2, −1, 0.3)‖² = 5.09
d(x8, µ2)² = ‖x8 − µ2‖² = ‖(0, 1.5, 1.3)‖² = 3.94


Table 13.4. Dataset for Q5

X1 X2 X3
x1 0.5 4.5 2.5
x2 2.2 1.5 0.1
x3 3.9 3.5 1.1
x4 2.1 1.9 4.9
x5 0.5 3.2 1.2
x6 0.8 4.3 2.6
x7 2.7 1.1 3.1
x8 2.5 3.5 2.8
x9 2.8 3.9 1.5
x10 0.1 4.1 2.9

Next, we compute P (x8 |Ci ), as follows

P(x8 | C1) = c · exp{−5.09/2} = 0.0785c
P(x8 | C2) = c · exp{−3.94/2} = 0.1395c

where c = 1/(2π)^(3/2). The probability of point x8 is P(x8) = (0.0785 + 0.1395) · 0.5 · c = 0.218 · 0.5 · c.
Finally, we can compute the posterior probabilities as follows:

P (C1 |x8 ) = P (x8 |C1 )P (C1 )/P (x8 ) = (0.0785c · 0.5)/(0.218 · 0.5 · c) = 0.36
P (C2 |x8 ) = 1 − 0.36 = 0.64

Thus, the point is more likely to have been produced by C2 .

Q6. Consider the data in Table 13.5. Answer the following questions:
(a) Compute the kernel matrix K between the points assuming the following kernel:

K(xi, xj) = 1 + xi^T xj

Answer: The kernel matrix is given as

K = [ 2.33  1.65  1.87  2.18
      1.65  1.62  1.69  1.58
      1.87  1.69  1.81  1.78
      2.18  1.58  1.78  2.05 ]

(b) Assume initial cluster assignments of C1 = {x1 , x2 } and C2 = {x3 , x4 }. Using kernel
K-means, which cluster should x1 belong to in the next step?
Answer: Using

‖φ(xj) − µi^φ‖² = K(xj, xj) − (2/ni) · Σ_{xa ∈ Ci} K(xa, xj) + (1/ni²) · Σ_{xa ∈ Ci} Σ_{xb ∈ Ci} K(xa, xb)


Table 13.5. Data for Q6

X1 X2 X3
x1 0.4 0.9 0.6
x2 0.5 0.1 0.6
x3 0.6 0.3 0.6
x4 0.4 0.8 0.5

The distance of x1 to C1 is given as:

2.33 − (2.33 + 1.65) + 1/4 · (2.33 + 1.65 + 1.65 + 1.62) = 0.163

and the distance to C2 as:

2.33 − (1.87 + 2.18) + 1/4 · (1.81 + 1.78 + 1.78 + 2.05) = 0.135

Thus, x1 should belong to C2 in the next step.
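
A small sketch (mine, based on the formula above) that recomputes the two distances from the kernel matrix:

import numpy as np

K = np.array([[2.33, 1.65, 1.87, 2.18],
              [1.65, 1.62, 1.69, 1.58],
              [1.87, 1.69, 1.81, 1.78],
              [2.18, 1.58, 1.78, 2.05]])

def kernel_dist(j, cluster):
    """Squared distance of point j to the mean of `cluster` in feature space."""
    n = len(cluster)
    return (K[j, j]
            - 2.0 / n * K[j, cluster].sum()
            + 1.0 / n**2 * K[np.ix_(cluster, cluster)].sum())

print(round(kernel_dist(0, [0, 1]), 3), round(kernel_dist(0, [2, 3]), 3))
# Expected: roughly 0.163 and 0.135, so x1 moves to C2.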

Q7. Prove the following equivalence for the multivariate normal density function:

∂/∂µi f (xj |µi , Σi ) = f (xj |µi , Σi ) Σi^{-1} (xj − µi )

Answer: Let g(µi , Σi ) = −(1/2)(xj − µi )^T Σi^{-1} (xj − µi ), so that

f (xj |µi , Σi ) = (2π)^{−d/2} |Σi^{-1}|^{1/2} exp{g(µi , Σi )}

Since Σi is a constant w.r.t. µi , we have

∂/∂µi f (xj |µi , Σi ) = (2π)^{−d/2} |Σi^{-1}|^{1/2} exp{g(µi , Σi )} · ∂/∂µi g(µi , Σi )
                      = f (xj |µi , Σi ) · ∂/∂µi g(µi , Σi )

The partial derivative of g is given as

∂/∂µi g(µi , Σi ) = −(1/2) · 2 Σi^{-1} (xj − µi ) · (−1)
                  = Σi^{-1} (xj − µi )

C H A P T E R 14 Hierarchical Clustering

14.4 EXERCISES

Q1. Consider the 5-dimensional categorical data shown in Table 14.1.

Table 14.1. Data for Q1

Point X1 X2 X3 X4 X5
x1 1 0 1 1 0
x2 1 1 0 1 0
x3 0 0 1 1 0
x4 0 1 0 1 0
x5 1 0 1 0 1
x6 0 1 1 0 0

The similarity between categorical data points can be computed in terms of the
number of matches and mismatches for the different attributes. Let n11 be the number
of attributes on which two points xi and xj assume the value 1, and let n10 denote the
number of attributes where xi takes value 1, but xj takes on the value of 0. Define
n01 and n00 in a similar manner. The contingency table for measuring the similarity is
then given as

xj
1 0
xi 1 n11 n10
0 n01 n00

Define the following similarity measures:


• Simple matching coefficient: SMC(xi , xj ) = (n11 + n00 )/(n11 + n10 + n01 + n00 )
• Jaccard coefficient: JC(xi , xj ) = n11 /(n11 + n10 + n01 )
• Rao's coefficient: RC(xi , xj ) = n11 /(n11 + n10 + n01 + n00 )
Find the cluster dendrograms produced by the hierarchical clustering algorithm under
the following scenarios:

(a) We use single link with RC.


Answer: First, we have to convert from similarities to distances. For RC, the
distance is simply 1 − RC(xi , xj ). The pair-wise RC distance matrix is given as:

x2 x3 x4 x5 x6
x1 3/5 3/5 4/5 3/5 4/5
x2 4/5 3/5 4/5 4/5
x3 4/5 4/5 4/5
x4 5/5 4/5
x5 4/5

We pick the least distance and break ties by choosing the cluster with the
smallest index. The first merge is therefore for x1 and x2 . We get the new matrix:

x3 x4 x5 x6
x1 , x2 3/5 3/5 3/5 4/5
x3 4/5 4/5 4/5
x4 5/5 4/5
x5 4/5

The next merge is then, x1 , x2 and x3 , and the new distance matrix is:

x4 x5 x6
x1 , x2 , x3 3/5 3/5 4/5
x4 5/5 4/5
x5 4/5

The next merge is between x1 , x2 , x3 , and x4 , and so on.


The dendrogram for the clustering steps is as follows:
1 2 3 4 5 6 Distance
| | | | | |
12 | | | | 3/5
| | | | |
123 | | | 3/5
| | | |
1234 | | 3/5
| | |
12345 | 3/5
| |
123456 4/5
(b) We use complete link with SMC.
Answer: The distances based on SMC are given as: 1 − SMC(xi , xj ). The
corresponding distance matrix is given as:
x2 x3 x4 x5 x6
x1 2/5 1/5 3/5 2/5 3/5
x2 3/5 1/5 4/5 3/5
x3 2/5 3/5 2/5
x4 5/5 2/5
x5 3/5

The smallest distances are between x1 and x3 , and between x2 and x4 . Since
these are disjoint, we can merge them in one step to obtain the two initial
clusters. The new distance matrix is given as:

x2 , x4 x5 x6
x1 , x3 3/5 3/5 3/5
x2 , x4 5/5 3/5
x5 3/5

Next to merge are x1 , x3 and x2 , x4 assuming that smaller indexes are merged
first in cases of tie-breaks. The new distance matrix is:
x5 x6
x1 , x2 , x3 , x4 5/5 3/5
x5 3/5

Next, we merge x1 , x2 , x3 , x4 with x6 , and so on. A possible clustering is:


1 3 2 4 6 5 Distance
| | | | | |
13 24 | | 1/5
| | | |
1234 | | 3/5
| | |
12346 | 3/5
| |
123456 5/5
(c) We use group average with JC.
Answer: The JC distances are given as 1 − JC(xi , xj ) with the distance matrix
given as:
x2 x3 x4 x5 x6
x1 2/4 1/3 3/4 2/4 3/4
x2 3/4 1/3 4/5 3/4
x3 2/3 3/4 2/3
x4 5/5 2/3
x5 3/4
First to merge are x1 and x3 , and then x2 and x4 , both at distance 1/3. The resulting group-average distance matrix is:

          x2 , x4   x5      x6
x1 , x3    0.67     0.625   0.708
x2 , x4             0.9     0.708
x5                          0.75

The smallest entry is now 0.625, so the next merge is x1 , x3 with x5 . The new distance matrix is:

               x2 , x4   x6
x1 , x3 , x5    0.744    0.722
x2 , x4                  0.708

Next, x2 , x4 merges with x6 at distance 0.708, and the final merge, of x1 , x3 , x5 with x2 , x4 , x6 , happens at distance 0.737. The resulting dendrogram is:

1 3 5 2 4 6        Distance
| | | | | |
13  | 24  |        1/3
 |  |  |  |
 135   |  |        0.625
  |    |  |
  |    246         0.708
  |     |
  123456           0.737
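
The following Python sketch (not part of the original solution, assuming numpy is available) derives the RC, SMC and JC distances used in parts (a)-(c) from the binary data in Table 14.1:

import numpy as np

X = np.array([[1,0,1,1,0],
              [1,1,0,1,0],
              [0,0,1,1,0],
              [0,1,0,1,0],
              [1,0,1,0,1],
              [0,1,1,0,0]])

def coeffs(a, b):
    n11 = np.sum((a == 1) & (b == 1))
    n10 = np.sum((a == 1) & (b == 0))
    n01 = np.sum((a == 0) & (b == 1))
    n00 = np.sum((a == 0) & (b == 0))
    tot = n11 + n10 + n01 + n00
    smc = (n11 + n00) / tot
    jc  = n11 / (n11 + n10 + n01)
    rc  = n11 / tot
    return smc, jc, rc

for i in range(len(X)):
    for j in range(i + 1, len(X)):
        smc, jc, rc = coeffs(X[i], X[j])
        print(f"x{i+1},x{j+1}: 1-RC={1-rc:.2f}  1-SMC={1-smc:.2f}  1-JC={1-jc:.2f}")
# e.g. x1,x2 should give 1-RC = 3/5, 1-SMC = 2/5, 1-JC = 2/4, as in the matrices above.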

Q2. Given the dataset in Figure 14.1, show the dendrogram resulting from the single-link
hierarchical agglomerative clustering approach using the L1 -norm as the distance
between points

δ(x, y) = Σ_{a=1}^{2} |xa − ya |

Whenever there is a choice, merge the cluster that has the lexicographically smallest
labeled point. Show the cluster merge order in the tree, stopping when you have k = 4
clusters. Show the full distance matrix at each step.

Answer: We start with the distance matrix between the points.

     b  c  d  e  f  g  h  i  j  k
a    2  4  7  6  4  6  8  7  9  5
b       2  7  6  4  4  6  7  7  3
c          5  4  2  2  4  5  5  1
d             1  3  5  7  2  8  6
e                2  4  6  1  7  5
f                   2  4  3  5  3
g                      2  5  3  1
h                         7  3  3
i                            6  6
j                               4

The first pair of points to merge are {c, k}, {d, e}. However, note that the pairs {e, i}
and {g, k} are also at distance 1 from each other. In the single link clustering, these
will also merge in the next step. So we might as well merge these upfront to obtain
two initial clusters, namely {c, g, k} and {d, e, i}, with merge distance 1. The new
distance matrix is then given as

b c, g, k d, e, i f h j
a 2 4 6 4 8 9
b 2 6 4 6 7
c, g, k 2 4 2 2
d, e, i 2 6 6
f 4 5
h 3


[Figure 14.1. Dataset for Q2 (scatter plot of points a–k); figure not reproduced here.]

The next to merge will be a, b; then that cluster will merge with c, g, k to create a, b, c, g, k. This cluster will merge with f to create the cluster {a, b, c, f, g, k}. At this point there will be 4 clusters, namely {a, b, c, f, g, k}, {d, e, i}, {h} and {j }. The process stops at this point since we desire 4 clusters. The dendrogram is given as:

a b c g k f h j d e i
| | | | | | | | | | |
| | cgk | | | dei
ab | | | | |
| | | | | |
abcgk | | | |
| | | | |
abcfgk | | |
| | | |

Table 14.2. Dataset for Q3

A B C D E
A 0 1 3 2 4
B 0 3 2 3
C 0 1 3
D 0 5
E 0

Q3. Using the distance matrix from Table 14.2, use the average link method to generate
hierarchical clusters. Show the merging distance thresholds.


Answer: The first pairs to merge are {A, B} and {C, D}, each at a distance of 1. The
updated distance matrix is:
C, D E
A, B 2.5 3.5
C, D 4
The next to merge are A, B and C, D, at a distance of 2.5, which yields the matrix:

E
A, B, C, D 3.75

Finally, we get a single cluster at distance 3.75.
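
A quick check of the merge heights (my own sketch, assuming numpy and SciPy are available):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 3, 2, 4],
              [1, 0, 3, 2, 3],
              [3, 3, 0, 1, 3],
              [2, 2, 1, 0, 5],
              [4, 3, 3, 5, 0]], dtype=float)   # Table 14.2 (A..E)

Z = linkage(squareform(D), method="average")
print(Z[:, 2])   # merge distances; expected roughly [1, 1, 2.5, 3.75]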

Q4. Prove that in the Lance–Williams formula [Eq. (14.7)]


(a) If αi = ni /(ni + nj ), αj = nj /(ni + nj ), β = 0 and γ = 0, then we obtain the group average measure.
Answer: Let Cij and Cr be two clusters, with nij and nr points. The distance between the two clusters is then given as

δ(Cij , Cr ) = (1/(nij nr )) Σ_{x∈Cij} Σ_{y∈Cr} δ(x, y)
            = (1/((ni + nj )nr )) Σ_{x∈Ci} Σ_{y∈Cr} δ(x, y) + (1/((ni + nj )nr )) Σ_{x∈Cj} Σ_{y∈Cr} δ(x, y)
            = (ni /(ni + nj )) · (1/(ni nr )) Σ_{x∈Ci} Σ_{y∈Cr} δ(x, y) + (nj /(ni + nj )) · (1/(nj nr )) Σ_{x∈Cj} Σ_{y∈Cr} δ(x, y)
            = (ni /(ni + nj )) δ(Ci , Cr ) + (nj /(ni + nj )) δ(Cj , Cr )

We can see that this matches the Lance–Williams formula.

(b) If αi = (ni + nr )/(ni + nj + nr ), αj = (nj + nr )/(ni + nj + nr ), β = −nr /(ni + nj + nr ) and γ = 0, then we obtain Ward's measure.
Answer: Note that µij = (ni µi + nj µj )/(ni + nj ). Let nij = ni + nj and nijr = ni + nj + nr . Consider Ward's formula for the two clusters Cij and Cr ; we have:

δ(Cij , Cr ) = (nij nr /(nij + nr )) ‖µij − µr ‖²
            = (nij nr /nijr ) ( ‖µij ‖² + ‖µr ‖² − 2µr^T µij )
            = (nij nr /(nijr nij²)) ( ni² ‖µi ‖² + nj² ‖µj ‖² + 2ni nj µi^T µj ) + (nij nr /nijr ) ‖µr ‖²
              − 2(nr /nijr ) ( ni µr^T µi + nj µr^T µj )
            = (nr ni² /(nijr nij )) ‖µi ‖² + (nr nj² /(nijr nij )) ‖µj ‖² + (nij nr /nijr ) ‖µr ‖²
              + 2(ni nj nr /(nij nijr )) µi^T µj − 2(nr /nijr ) ( ni µr^T µi + nj µr^T µj )                    (14.1)

Now consider the Lance–Williams formula; we have:

δ(Cij , Cr ) = ((ni + nr )/nijr ) δ(Ci , Cr ) + ((nj + nr )/nijr ) δ(Cj , Cr ) − (nr /nijr ) δ(Ci , Cj )
            = (ni nr /nijr ) ‖µi − µr ‖² + (nj nr /nijr ) ‖µj − µr ‖² − (ni nj nr /(nij nijr )) ‖µi − µj ‖²
            = (ni nr /nijr ) ( ‖µi ‖² + ‖µr ‖² − 2µr^T µi ) + (nj nr /nijr ) ( ‖µj ‖² + ‖µr ‖² − 2µr^T µj )
              − (ni nj nr /(nij nijr )) ( ‖µi ‖² + ‖µj ‖² − 2µi^T µj )
            = (1/(nijr nij )) ( (ni + nj )ni nr − ni nj nr ) ‖µi ‖² + (1/(nijr nij )) ( (ni + nj )nj nr − ni nj nr ) ‖µj ‖²
              + ((ni + nj )nr /nijr ) ‖µr ‖² + 2(ni nj nr /(nij nijr )) µi^T µj − 2(ni nr /nijr ) µr^T µi − 2(nj nr /nijr ) µr^T µj
            = (nr ni² /(nijr nij )) ‖µi ‖² + (nr nj² /(nijr nij )) ‖µj ‖² + (nij nr /nijr ) ‖µr ‖²
              + 2(ni nj nr /(nij nijr )) µi^T µj − 2(nr /nijr ) ( ni µr^T µi + nj µr^T µj )                    (14.2)

We see that Eqs. (14.1) and (14.2) match.
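
Both substitutions can also be confirmed numerically; the following sketch (mine, with assumed random clusters) checks them with numpy:

import numpy as np
rng = np.random.default_rng(0)

Ci, Cj, Cr = rng.normal(size=(3, 2)), rng.normal(size=(4, 2)), rng.normal(size=(5, 2))
ni, nj, nr = len(Ci), len(Cj), len(Cr)

def avg(A, B):                                   # group-average distance
    return np.mean([np.linalg.norm(a - b) for a in A for b in B])
lhs = avg(np.vstack([Ci, Cj]), Cr)
rhs = ni/(ni+nj) * avg(Ci, Cr) + nj/(ni+nj) * avg(Cj, Cr)
print(np.isclose(lhs, rhs))                      # True

def ward(A, B):                                  # Ward distance
    na, nb = len(A), len(B)
    return na*nb/(na+nb) * np.sum((A.mean(0) - B.mean(0))**2)
n = ni + nj + nr
lhs = ward(np.vstack([Ci, Cj]), Cr)
rhs = (ni+nr)/n * ward(Ci, Cr) + (nj+nr)/n * ward(Cj, Cr) - nr/n * ward(Ci, Cj)
print(np.isclose(lhs, rhs))                      # True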

Q5. If we treat each point as a vertex, and add edges between two nodes with distance
less than some threshold value, then the single-link method corresponds to a well
known graph algorithm. Describe this graph-based algorithm to hierarchically cluster
the nodes via single-link measure, using successively higher distance thresholds.

Answer: Define the (complete) weighted graph over the points, where the weights
denote the distance. Then for each value of distance, if we restrict the graph to only
those edges with weight at most the chosen value of distance, then the clusters via
single-link are precisely the connected components of the distance restricted graph.
As an example, consider the dataset shown in Figure 14.1. For a distance threshold of 1, there are only two connected components with more than one point, namely {c, g, k} and {d, e, i}. Next, when we raise the threshold to 2, we get two connected components, namely {a, b, c, d, e, f, g, h, i, k} and {j }. Finally, when the distance is 3 there is only one connected component. One can verify that these are precisely the clusters obtained via single-link clustering.
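
A sketch of this graph-based view (mine, not from the text): threshold the distance graph and report connected components with a tiny union-find. The edge list is assumed to come from the distance matrix of Q2.

parent = {}

def find(u):
    parent.setdefault(u, u)
    while parent[u] != u:
        parent[u] = parent[parent[u]]
        u = parent[u]
    return u

def union(u, v):
    parent[find(u)] = find(v)

def components(points, edges, threshold):
    parent.clear()
    for p in points:
        find(p)
    for u, v, w in edges:
        if w <= threshold:
            union(u, v)
    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

# usage sketch: components("abcdefghijk", edge_list, 1) should give {c,g,k} and {d,e,i}
# plus singletons, matching the single-link clusters at that threshold.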

C H A P T E R 15 Density-based Clustering

15.5 EXERCISES

Q1. Consider Figure 15.1 and answer the following questions, assuming that we use the
Euclidean distance between points, and that ǫ = 2 and minpts = 3
(a) List all the core points.
Answer: The core points are a, b, c, d, e, f, g, h, i, j, k, n, o, p, q, r, s, t, v, w
(b) Is a directly density reachable from d?
Answer: Yes, since d is a core object and a belongs to N2 (d).
(c) Is o density reachable from i? Show the intermediate points on the chain or the
point where the chain breaks.
Answer: Yes, the intermediate points are i, e, b, c, f , g, j , n, o or i, e, f , j , n,
o.
(d) Is density reachable a symmetric relationship, that is, if x is density reachable
from y, does it imply that y is density reachable from x? Why or why not?
Answer: Density reachable is not a symmetric relationship, since a non-core
object may be reachable from a core object, but the reverse is not necessarily
true. For example u is density reachable from n but n is not density reachable
from u.
(e) Is l density connected to x? Show the intermediate points that make them density
connected or violate the property, respectively.
Answer: Yes, for example, via t, since l is density-reachable from t and x is
also density-reachable from t.
(f) Is density connected a symmetric relationship?
Answer: Yes, by definition. In other words for any two points, there exists a
core point that reaches both of them.
(g) Show the density-based clusters and the noise points.


Answer: There are 2 density based clusters

C1 :{a, d, h, k, p, q, r, s, t, l, v, w, x}
C2 :{b, c, e, f, g, i, j, n, m, o, u}

There are no noise points.

[Figure 15.1. Dataset for Q1 (grid of labeled points a–x); figure not reproduced here.]

Q2. Consider the points in Figure 15.2. Define the following distance measures:

L∞ (x, y) = max_{i=1..d} |xi − yi |

L_{1/2} (x, y) = ( Σ_{i=1}^{d} |xi − yi |^{1/2} )²

Lmin (x, y) = min_{i=1..d} |xi − yi |

Lpow (x, y) = ( Σ_{i=1}^{d} 2^{i−1} (xi − yi )² )^{1/2}

(a) Using ǫ = 2, minpts = 5, and L∞ distance, find all core, border, and noise points.
Answer: The core points are c, f, g, k. The border points are b, e, h. The noise
points are a, d, i, j .
(b) Show the shape of the ball of radius ǫ = 4 using the L_{1/2} distance. Using minpts = 3, show all the clusters found by DBSCAN.
Answer: The ball {z : L_{1/2}(z, x) ≤ 4} is not convex; it is a concave, star-like region whose four tips extend a distance of 4 from the center along the coordinate axes (figure omitted here).


The cluster is {a, b, c, d, e, f, g, h, i, k}. The outlier is j .


(c) Using ǫ = 1, minpts = 6, and Lmin , list all core, border, and noise points.
Answer: The core points are b, c, d, e, f, g, h, i, k. The border points are a, j .
There are no noise points.
(d) Using ǫ = 4, minpts = 3, and Lpow , show all clusters found by DBSCAN.
Answer: We first compute the distance matrix between the points, which is given as:

     b     c     d     e     f     g     h     i     j     k
a    1.73  4.36  6.4   6     5.66  6     6.93  7.35  9     4.69
b          2.83  5.83  5.20  4.36  4.36  5.2   6.4   7.35  3
c                4.24  3.32  1.73  1.73  3.32  4.12  4.69  1
d                      1     3     5     7     1.73  6.63  5.20
e                            2     4     6     1.41  5.74  4.24
f                                  2     4     2.45  4.12  2.45
g                                        2     4.24  3     1.41
h                                              6.16  3     2.45
i                                                    5.20  4.9
j                                                          4.36

Using ǫ = 4, we find that the core points are {b, c, d, e, f, g, h, i, k}, and the
border points are {a, j }. There are no noise points.
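
A generic sketch (mine) of the core/border/noise labeling used in parts (a)-(d): D is any precomputed distance matrix; whether a point counts itself toward minpts must follow the convention in use, so that choice is exposed as a flag.

import numpy as np

def label_points(D, eps, minpts, count_self=False):
    n = D.shape[0]
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]      # includes the point itself
    sizes = [len(nb) if count_self else len(nb) - 1 for nb in neighbors]
    core = {i for i in range(n) if sizes[i] >= minpts}
    labels = {}
    for i in range(n):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighbors[i] if j != i):
            labels[i] = "border"
        else:
            labels[i] = "noise"
    return labels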

Q3. Consider the points shown in Figure 15.2. Define the following two kernels:

K1 (z) = 1 if L∞ (z, 0) ≤ 1, and 0 otherwise
K2 (z) = 1 if Σ_{j=1}^{d} |zj | ≤ 1, and 0 otherwise

Using each of the two kernels K1 and K2 , answer the following questions assuming
that h = 2:
(a) What is the probability density at e?


[Figure 15.2. Dataset for Q2 and Q3 (grid of labeled points a–k); figure not reproduced here.]

Answer: For K1 , the probability density at e is given as:

f̂ (e) = (1/(11 · 2²)) [ K1((e − d)/2) + K1((e − e)/2) + K1((e − f )/2) + K1((e − i)/2) ]
      = (1/44) · 4 = 0.091

For K2 , the probability density is the same:

f̂ (e) = (1/(11 · 2²)) [ K2((e − d)/2) + K2((e − e)/2) + K2((e − f )/2) + K2((e − i)/2) ]
      = (1/44) · 4 = 0.091

(b) What is the gradient at e?


Answer: For K1 , the gradient at e is given as:

∇f̂ (e) = (1/(11 · 2⁴)) [ (d − e) + (f − e) + (i − e) ]
       = (1/176) [ (−1, 0)T + (2, 0)T + (0, −1)T ] = (1/176) (1, −1)T

The gradient for K2 is the same.


(c) List all the density attractors for this dataset.
Answer: The density is the maximum at c and g. They both have density
5/44 = 0.114. They are the density attractors.

Q4. The Hessian matrix is defined as the set of partial derivatives of the gradient vector
with respect to x. What is the Hessian matrix for the Gaussian kernel? Use the
gradient in Eq. (15.6).


Answer: For the Gaussian kernel, we know that

∂/∂x K((x − xi )/h) = K((x − xi )/h) · ((xi − x)/h) · (1/h)
                    = (1/h²) K((x − xi )/h) (xi − x)

Taking the derivative of the gradient with respect to x, and noting that ∂(xi − x)/∂x = −I, we get

∂/∂x ∇f̂ (x) = ∂/∂x [ (1/(n h^{d+2})) Σ_{i=1}^{n} K((x − xi )/h) · (xi − x) ]
            = (1/(n h^{d+2})) Σ_{i=1}^{n} [ (1/h²) K((x − xi )/h) (x − xi )(x − xi )T − K((x − xi )/h) I ]
            = (1/(n h^{d+4})) Σ_{i=1}^{n} K((x − xi )/h) [ (x − xi )(x − xi )T − h² I ]
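
This expression can be checked numerically; the sketch below (mine, with assumed random data and bandwidth) compares it against a central finite-difference Hessian.

import numpy as np
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
h, d, n = 0.7, 2, len(X)

def fhat(x):
    z = (x - X) / h
    k = np.exp(-0.5 * np.sum(z**2, axis=1)) / (2*np.pi)**(d/2)
    return k.sum() / (n * h**d)

def hessian_formula(x):
    z = (x - X) / h
    k = np.exp(-0.5 * np.sum(z**2, axis=1)) / (2*np.pi)**(d/2)
    diff = X - x
    H = sum(ki * (np.outer(di, di) - h**2 * np.eye(d)) for ki, di in zip(k, diff))
    return H / (n * h**(d + 4))

x0, eps = np.array([0.3, -0.2]), 1e-3
H_num = np.zeros((d, d))
for a in range(d):
    for b in range(d):
        ea, eb = np.eye(d)[a] * eps, np.eye(d)[b] * eps
        H_num[a, b] = (fhat(x0+ea+eb) - fhat(x0+ea-eb)
                       - fhat(x0-ea+eb) + fhat(x0-ea-eb)) / (4 * eps**2)
print(np.allclose(H_num, hessian_formula(x0), atol=1e-4))   # expect True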

Q5. Let us compute the probability density at a point x using the k-nearest neighbor
approach, given as
f̂ (x) = k / (n Vx )
where k is the number of nearest neighbors, n is the total number of points, and Vx is
the volume of the region encompassing the k nearest neighbors of x. In other words,
we fix k and allow the volume to vary based on those k nearest neighbors of x. Given
the following points
2, 2.5, 3, 4, 4.5, 5, 6.1
Find the peak density in this dataset, assuming k = 4. Keep in mind that this may
happen at a point other than those given above. Also, a point is its own nearest
neighbor.

Answer: Since the data is one-dimensional, the volume Vx is simply the distance
from x to its fourth nearest neighbor (for k = 4). The computed density estimates at
the data points are given below.
x p̂(x)
2 4 / (7 × 2)
2.5 4 / (7 × 1.5)
3 4 / (7 × 1)
4 4 / (7 × 1)
4.5 4 / (7 × 1.5)
5 4 / (7 × 1.1)
6.1 4 / (7 × 2.1)
Therefore the peak density is 4/7 = 0.57.
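
A short sketch (mine) recomputing the estimates in the table above:

xs = [2, 2.5, 3, 4, 4.5, 5, 6.1]
n, k = len(xs), 4

def knn_density(x):
    dists = sorted(abs(x - xi) for xi in xs)   # a point is its own nearest neighbor
    V = dists[k - 1]                           # "volume" = distance to the k-th neighbor
    return k / (n * V)

for x in xs:
    print(x, round(knn_density(x), 3))
# The peak is 4/7, about 0.571, attained at x = 3 and x = 4.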

C H A P T E R 16 Spectral and Graph Clustering

16.5 EXERCISES

Q1. Show that if Qi denotes the ith column of the modularity matrix Q, then Σ_{i=1}^{n} Qi = 0.

Answer: The modularity matrix is given as

Q = (1/tr(Δ)) ( A − (1/tr(Δ)) d dT )

where tr(Δ) = Σ_{i=1}^{n} di . Let Ai denote the ith column of A. Then the ith column of the modularity matrix Q is given as

Qi = (1/tr(Δ)) ( Ai − (di /tr(Δ)) d )

Taking the sum over all the n columns we get

Σ_{i=1}^{n} Qi = (1/tr(Δ)) ( Σ_{i=1}^{n} Ai − (d/tr(Δ)) Σ_{i=1}^{n} di ) = (1/tr(Δ)) (d − d) = 0
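
A quick numeric check (mine, on an assumed random undirected graph):

import numpy as np
rng = np.random.default_rng(2)
A = rng.integers(0, 2, size=(6, 6))
A = np.triu(A, 1); A = A + A.T                  # symmetric 0/1 adjacency, no self-loops
d = A.sum(axis=1)
Q = (A - np.outer(d, d) / d.sum()) / d.sum()
print(np.allclose(Q.sum(axis=0), 0))            # columns of Q sum to zero: True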

Q2. Prove that both the normalized symmetric and asymmetric Laplacian matrices Ls
[Eq. (16.6)] and La [Eq. (16.9)] are positive semidefinite. Also show that the smallest
eigenvalue is λn = 0 for both.

Answer: The normalized symmetric Laplacian matrix is given as Ls = Δ^{−1/2} L Δ^{−1/2}. Let c ∈ Rn be any real vector, and define z = Δ^{−1/2} c. From the derivation in Eq. (16.5), we have

cT Ls c = (Δ^{−1/2} c)T L (Δ^{−1/2} c) = zT L z
        = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} aij (zi − zj )²
        = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} aij ( ci /√di − cj /√dj )²

Since aij ≥ 0 and (zi − zj )2 is always non-negative, we immediately have the result
that Ls is positive semidefinite. An immediate consequence is that Ls has real
eigenvalues, and each λi ≥ 0. Now, if Lsi denotes the ith column of Ls , then from
Eq. (16.7) we can see that
√d1 Ls1 + √d2 Ls2 + √d3 Ls3 + · · · + √dn Lsn = 0

That is, Ls is not a full-rank matrix, and at least one eigenvalue has to be zero. Since
all eigenvalues are at least zero, we conclude that λn = 0.
The normalized asymmetric Laplacian matrix La is also positive semidefinite in the sense that it has the same set of non-negative eigenvalues λi ≥ 0 as the symmetric Laplacian Ls , and for each eigenvector ui of Ls we have vi = Δ^{−1/2} ui as the corresponding eigenvector of La .
Now, from Eq. (16.9) we can see that if Lai denotes the ith column of La , then La1 + La2 + · · · + Lan = 0, which implies that the smallest eigenvalue is λn = 0, following the same reasoning as for the normalized symmetric Laplacian case above.

Q3. Prove that the largest eigenvalue of the normalized adjacency matrix M [Eq. (16.2)]
is 1, and further that all eigenvalues satisfy the condition that |λi | ≤ 1.

Answer: The main property of M is that it is a Markov or stochastic matrix, i.e., each row sums to 1 and each entry mij is non-negative and represents the probability of moving or transitioning from vertex i to vertex j . It is clear that M1 = 1, where 1 ∈ Rn is the n-dimensional vector of ones. Note also that Mk is also a Markov matrix, and represents the k-step transition matrix (for a Markov chain).
We conclude that u = (1/√n) 1 is an eigenvector and λ = 1 is an eigenvalue of M.
Now, let λ be any eigenvalue and u the corresponding eigenvector, so that Mu = λu. We assume that ‖u‖2 = 1, i.e., u is a unit vector. Considering the k-th power of M, we have

Mk u = λk u

Let M′ = Mk . Since the latter is also a Markov matrix, we have m′ij ≥ 0 and Σ_j m′ij = 1, which implies that m′ij ≤ 1.
Now consider the i-th row of M′ and its dot product with u in the expression M′u = λk u; we have

Σ_j m′ij uj = λk ui

However, note that the 1-norm of u, namely ‖u‖1 = Σ_i |ui |, can achieve a maximum value of √n, since it is a fact that

‖u‖2 ≤ ‖u‖1 ≤ √n ‖u‖2

Since ‖u‖2 = 1, we have ‖u‖1 ≤ √n. Since each m′ij ≤ 1, we immediately have Σ_j m′ij uj ≤ Σ_j |uj | ≤ √n.
Now assume that |λ| > 1. In this case, |λ|k increases without bound as k → ∞, whereas Σ_j m′ij uj can never exceed √n. We conclude that our assumption is false, and therefore |λ| ≤ 1. Thus, we have |λ| ≤ 1 for all eigenvalues, and 1 is the largest eigenvalue.
Q4. Show that Σ_{vr ∈Ci} cir dr cir = Σ_{r=1}^{n} Σ_{s=1}^{n} cir Δrs cis , where ci is the cluster indicator vector for cluster Ci and Δ is the degree matrix for the graph.

Answer: Note that cir = 1 iff vr ∈ Ci . Also note that Δrs = 0 if r ≠ s, and Δrs = dr if r = s. Thus, we have

Σ_{r=1}^{n} Σ_{s=1}^{n} cir Δrs cis = Σ_{r=1}^{n} cir Δrr cir = Σ_{vr ∈Ci} cir dr cir

Q5. For the normalized symmetric Laplacian Ls , show that for the normalized cut objective the real-valued cluster indicator vector corresponding to the smallest eigenvalue λn = 0 is given as cn = (1/√(Σ_{i=1}^{n} di )) Δ^{1/2} 1.

Answer: Let us consider the product Ls · (1/√(Σ_{i=1}^{n} di )) Δ^{1/2} 1. The entries of Ls = Δ^{−1/2} L Δ^{−1/2} are

(Ls )rr = (Σ_{j≠r} arj )/dr       and       (Ls )rs = −ars /√(dr ds )   for r ≠ s

and Δ^{1/2} 1 = (√d1 , √d2 , . . . , √dn )T . The rth entry of Ls Δ^{1/2} 1 is therefore

( (Σ_{j≠r} arj )/dr ) · √dr − Σ_{s≠r} ( ars /√(dr ds ) ) · √ds = (Σ_{j≠r} arj )/√dr − (Σ_{s≠r} ars )/√dr = 0

Thus

Ls · (1/√(Σ_{i=1}^{n} di )) Δ^{1/2} 1 = 0 = 0 · (1/√(Σ_{i=1}^{n} di )) Δ^{1/2} 1

which shows that cn = (1/√(Σ_{i=1}^{n} di )) Δ^{1/2} 1 is an eigenvector of Ls corresponding to the smallest eigenvalue λn = 0.

[Figure 16.1. Graph for Q6 (a small graph on vertices 1–4; its adjacency matrix is given in the answer below); figure not reproduced here.]

Q6. Given the graph in Figure 16.1, answer the following questions:
(a) Cluster the graph into two clusters using ratio cut and normalized cut.
Answer: The adjacency matrix and the corresponding degree matrix for the graph are as follows:

        0 1 0 1            2 0 0 0
        1 0 1 1            0 3 0 0
    A = 0 1 0 1        D = 0 0 2 0
        1 1 1 0            0 0 0 3

The Laplacian and normalized asymmetric Laplacian matrices are as follows:

         2 −1  0 −1                         1    −1/2    0   −1/2
        −1  3 −1 −1                       −1/3     1   −1/3  −1/3
    L =  0 −1  2 −1        La = D−1 L =     0    −1/2    1   −1/2
        −1 −1 −1  3                       −1/3   −1/3  −1/3    1

For the ratio cut, we have to find the two smallest eigenvalues and corresponding eigenvectors of L, which are as follows:

λ4 = 0, u4 = (1/2, 1/2, 1/2, 1/2)T        λ3 = 2, u3 = (1/√2, 0, −1/√2, 0)T


For the normalized cut, we have to find the two smallest eigenvalues and corresponding eigenvectors of La , which are as follows:

λ4 = 0, u4 = (1/2, 1/2, 1/2, 1/2)T        λ3 = 1, u3 = (1/√2, 0, −1/√2, 0)T

For a clustering into k = 2 partitions, we can either get C1 = {1, 2, 3} and C2 = {4}, or we can get C1 = {1} and C2 = {2, 3, 4}.
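
The spectra used above can be confirmed with numpy; a quick sketch (mine):

import numpy as np

A = np.array([[0,1,0,1],[1,0,1,1],[0,1,0,1],[1,1,1,0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A
La = np.linalg.inv(D) @ L

print(np.round(np.linalg.eigvalsh(L), 2))                   # eigenvalues of L, smallest two are 0 and 2
print(np.round(np.sort(np.linalg.eigvals(La).real), 2))     # eigenvalues of La, smallest two are 0 and 1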
(b) Use the normalized adjacency matrix M for the graph and cluster it into two
clusters using average weight and kernel K-means, using K = M + I.
Answer: If we use the adjacency matrix, then the two largest eigenvalues and
eigenvectors are:

λ1 = 2.56, u1 = (0.44, 0.56, 0.44, 0.56)T        λ2 = 0, u2 = (1/√2, 0, −1/√2, 0)T

For a clustering into k = 2 partitions, we can either get C1 = {1, 2, 3} and C2 = {4}, or we can get C1 = {1} and C2 = {2, 3, 4}.
If we use the normalized adjacency matrix, we have:

                   0    1/2   0    1/2
                  1/3    0   1/3   1/3
    M = D−1 A =    0    1/2   0    1/2
                  1/3   1/3  1/3    0

The two largest eigenvalues and eigenvectors are:

λ1 = 1 λ2 = −0.67
u1 = (1/2, 1/2, 1/2, 1/2)T u2 = (−0.59, 0.39, −0.59, 0.39)T

However, since there is only one positive eigenvalue, we cannot cluster the data
into two groups based only on u1 , since all of its elements have the same value.

For kernel K-means we realize that we first have to modify the matrix K to have a value of K(x, x) = 1, otherwise each point will be more similar to other points than to itself. So we first make sure that the diagonal of the kernel matrix K is all ones, so that

                  1    1/2   0    1/2
                 1/3    1   1/3   1/3
    K = M + I =   0    1/2   1    1/2
                 1/3   1/3  1/3    1
For kernel K-means we start with a random split of the points. Let us assume that C1 = {1, 2} and C2 = {3, 4}. The average of all kernel values within each cluster, (1/ni²) Σ_{x,y∈Ci} K(x, y), is 0.7075 for both clusters. Next, the average kernel value of each point x with the points in a given cluster, (1/ni ) Σ_{y∈Ci} K(x, y), is as follows:


xi C1 C2
1 0.75 0.25
2 0.665 0.33
3 0.25 0.75
4 0.33 0.665
We assign each point to the closest cluster, breaking ties by choosing the
smaller index cluster, which gives the two new clusters C1 = {1, 2} and C2 =
{3, 4}. Since there is no change in the clusters, we stop.
Different answers are possible for different starting random clustering.
(c) Cluster the graph using the MCL algorithm with inflation parameters r = 2 and
r = 2.5.
Answer: For MCL, we first have to add self-loops to the adjacency matrix. The normalized adjacency matrix is then given as:

        1 1 0 1            1/3  1/3   0   1/3
        1 1 1 1            1/4  1/4  1/4  1/4
    A = 0 1 1 1        M =  0   1/3  1/3  1/3
        1 1 1 1            1/4  1/4  1/4  1/4

Applying the expansion and inflation operations with r = 2, we have

            0.28 0.28 0.17 0.28                 0.30 0.30 0.11 0.30
            0.21 0.29 0.21 0.29                 0.17 0.33 0.17 0.33
    M · M = 0.17 0.28 0.28 0.28     Γ(M, 2) =   0.11 0.30 0.30 0.30
            0.21 0.29 0.21 0.29                 0.17 0.33 0.17 0.33

Continuing in this way, the matrix converges to:

        0 0.5 0 0.5
        0 0.5 0 0.5
    M = 0 0.5 0 0.5
        0 0.5 0 0.5

Here we can see that vertices 2 and 4 are the attractors, and both 1 and 3 are attracted to them. Thus there is only one cluster comprising all the vertices in the graph.
The result of using r = 2.5 is the same, since the convergence is even more rapid, and we obtain the same final matrix as above, with the same clustering.
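
A small sketch (mine) of the expansion/inflation iteration used above; rows are renormalized after inflation, matching the matrices shown in this answer.

import numpy as np

A = np.array([[1,1,0,1],[1,1,1,1],[0,1,1,1],[1,1,1,1]], dtype=float)   # with self-loops
M = A / A.sum(axis=1, keepdims=True)

def mcl(M, r, iters=20):
    for _ in range(iters):
        M = M @ M                                  # expansion
        M = M ** r                                 # inflation
        M = M / M.sum(axis=1, keepdims=True)       # renormalize rows
    return M

print(np.round(mcl(M, r=2), 2))
# Expected to converge to rows (0, 0.5, 0, 0.5): vertices 2 and 4 act as attractors.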

Table 16.1. Data for Q7

X1 X2 X3
x1 0.4 0.9 0.6
x2 0.5 0.1 0.6
x3 0.6 0.3 0.6
x4 0.4 0.8 0.5


Q7. Consider Table 16.1. Assuming these are nodes in a graph, define the weighted
adjacency matrix A using the linear kernel

A(i, j ) = 1 + xi^T xj

Cluster the data into two groups using the modularity objective.

Answer: We have

        2.33 1.65 1.87 2.18
        1.65 1.62 1.69 1.58
    A = 1.87 1.69 1.81 1.78
        2.18 1.58 1.78 2.05

The node degree vector is d = (8.03, 6.54, 7.15, 7.59)T , and thus tr(Δ) = 29.31, and

                     2.20 1.79 1.96 2.08
                     1.79 1.46 1.60 1.69
    (1/tr(Δ)) ddT =  1.96 1.60 1.74 1.85
                     2.08 1.69 1.85 1.97


The modularity matrix is given as

         0.00444 −0.00484 −0.00303  0.00343
        −0.00484  0.00548  0.00323 −0.00387
    Q = −0.00303  0.00323  0.00224 −0.00244
         0.00343 −0.00387 −0.00244  0.00288

The dominant eigenvector and the eigenvalue are as follows: λ1 = 0.01465 and
u1 = (0.55, −0.61, −0.38, 0.44)T .
Based on the values, we can see that the two clusters are C1 = {x1 , x4 } and C2 =
{x2 , x3 }.
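
The following sketch (mine) recomputes this modularity-based split with numpy; since the sign of an eigenvector is arbitrary, the labels C1/C2 may come out swapped, but the partition is the same.

import numpy as np

X = np.array([[0.4, 0.9, 0.6],
              [0.5, 0.1, 0.6],
              [0.6, 0.3, 0.6],
              [0.4, 0.8, 0.5]])
A = 1 + X @ X.T                        # linear-kernel weighted adjacency
d = A.sum(axis=1)
Q = (A - np.outer(d, d) / d.sum()) / d.sum()

vals, vecs = np.linalg.eigh(Q)
u = vecs[:, -1]                        # dominant eigenvector
print(round(vals[-1], 5), np.round(u, 2))
print("C1 =", list(np.where(u > 0)[0] + 1), "C2 =", list(np.where(u <= 0)[0] + 1))
# Expected partition: {x1, x4} and {x2, x3}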

C H A P T E R 17 Clustering Validation

17.5 EXERCISES

Q1. Prove that the maximum value of the entropy measure in Eq. (17.2) is log k.

Answer: First note that H(T |Ci ) for a given cluster Ci has maximum entropy log k, since in the worst case the members of Ci are equally split among all k partitions, i.e., nij /ni = 1/k for all j = 1, . . . , k. In that case we have

H(T |Ci ) = Σ_{j=1}^{k} (1/k) log k = log k

The conditional entropy is then given as

H(T |C ) = Σ_{i=1}^{r} (ni /n) H(T |Ci ) ≤ Σ_{i=1}^{r} (ni /n) log k = log k

with equality when every cluster is equally split, so the maximum value is log k.

Q2. Show that if C and T are independent of each other then H(T |C ) = H(T ), and further
that H(C , T ) = H(C ) + H(T ).

Answer: If T and C are independent, then we have

H(T |C ) = − Σ_{i=1}^{r} Σ_{j=1}^{k} PCi PTj log PTj
         = − Σ_{i=1}^{r} PCi ( Σ_{j=1}^{k} PTj log PTj )
         = H(T ) Σ_{i=1}^{r} PCi
         = H(T )


If C and T are independent, then H(T |C ) = H(T ). Also, by definition H(T |C ) =


H(C , T )− H(C ). Thus we have H(T ) = H(C , T )− H(C ), which implies that H(C , T ) =
H(C ) + H(T ).

Q3. Show that H(T |C ) = 0 if and only if T is completely determined by C .

Answer: If H(T |C ) = 0, then, assuming ni ≠ 0, it must be that H(T |Ci ) = 0 for all i = 1, . . . , r. This implies that for each cluster Ci , either nij = 0 or nij = ni for j = 1, . . . , k; that is, nij = ni for exactly one value of j and nij = 0 for the rest. This means that every Ci falls entirely within one partition Tj , i.e., T is completely determined by C .
For the reverse direction, if T is completely determined by C , then each Ci has a matching Tj with nij = ni , so H(T |Ci ) = 0 for all i, and hence H(T |C ) = 0.

Q4. Show that I(C , T ) = H(C ) + H(T ) − H(T , C ).

Answer: The derivation is given as

I(C , T ) = Σ_{i=1}^{r} Σ_{j=1}^{k} pij log ( pij /(pCi · pTj ) )
          = Σ_{i=1}^{r} Σ_{j=1}^{k} pij log pij − Σ_{i=1}^{r} Σ_{j=1}^{k} pij log (pCi · pTj )
          = −H(C , T ) − Σ_{i=1}^{r} Σ_{j=1}^{k} pij log pCi − Σ_{i=1}^{r} Σ_{j=1}^{k} pij log pTj
          = −H(C , T ) − Σ_{i=1}^{r} pCi log pCi − Σ_{j=1}^{k} pTj log pTj
          = H(C ) + H(T ) − H(C , T )

Q5. Show that the variation of information is 0 only when C and T are identical.

Answer: When C and T are identical, we have

VI(C , T ) = H(T ) + H(C ) − 2I(C , T )


= H(T ) + H(T ) − 2H(T ) = 0

which follows from

I(C , T ) = I(T , T ) = H(T ) + H(T ) − H(T , T ) = H(T )

Q6. Prove that the maximum value of the normalized discretized Hubert statistic in
Eq. (17.21) is obtained when FN = FP = 0, and the minimum value is obtained when
TP = TN = 0.


Answer: If FN = FP = 0, then µT = µC = TP/N. Then we can write Γn as follows

Γn = µT (1 − µT ) / √( µT² (1 − µT )² ) = 1

If TP = TN = 0, then µT = FN/N and µC = FP/N, and µT + µC = (FN + FP)/N = N/N = 1. Eq. (17.21) then gives

Γn = −µT µC / √( µT² µC² ) = −1

which follows from (1 − µT )(1 − µC ) = 1 − (µT + µC ) + µT µC = µT µC .

Q7. Show that the Fowlkes–Mallows measure can be considered as the correlation between the pairwise indicator matrices for C and T , respectively. Define C(i, j ) = 1 if xi and xj (with i ≠ j ) are in the same cluster, and 0 otherwise. Define T similarly for the ground-truth partitions. Define ⟨C, T⟩ = Σ_{i,j=1}^{n} Cij Tij . Show that FM = ⟨C, T⟩ / √(⟨T, T⟩⟨C, C⟩).

Answer: By definition C(i, j ) = Cij = 1 iff xi and xj belong to the same cluster, and likewise T(i, j ) = Tij = 1 iff xi and xj belong to the same ground-truth partition. Further, Cij Tij = 1 when both points belong to the same cluster and the same partition. Thus, ⟨C, T⟩ = TP.
Now, ⟨T, T⟩ is simply the number of pairs of points that are in the same partition. Some of these are in the same cluster, which comprise the true positives (TP), and some of these pairs are not in the same cluster, which comprise the false negatives (FN). Thus, ⟨T, T⟩ = TP + FN.
Similarly, ⟨C, C⟩ is the number of point pairs that are in the same cluster. Some of these are in the same partition, which comprise the true positives (TP), and some of these pairs are not in the same partition, which comprise the false positives (FP). Thus, ⟨C, C⟩ = TP + FP.
Thus, FM = ⟨C, T⟩ / √(⟨T, T⟩⟨C, C⟩) = TP / √((TP + FN)(TP + FP)).

Q8. Show that the silhouette coefficient of a point lies in the interval [−1, +1].

Answer: For a given point xi , if µout^min > µin , then si = 1 − µin /µout^min , and the maximum value of si in this case is 1. On the other hand, if µout^min < µin , then si = µout^min /µin − 1, and the minimum value of si in this case is −1. Thus si ∈ [−1, 1].

Q9. Show that the scatter matrix can be decomposed as S = SW + SB , where SW and SB
are the within-cluster and between-cluster scatter matrices.


Answer: Let us consider the LHS; we have

S = Σ_{j=1}^{n} (xj − µ)(xj − µ)T
  = Σ_{j=1}^{n} ( xj xjT − xj µT − µxjT + µµT )
  = Σ_{i=1}^{k} Σ_{xj ∈Ci} ( xj xjT − xj µT − µxjT + µµT )
  = Σ_{i=1}^{k} Σ_{xj ∈Ci} ( xj xjT + µµT ) − Σ_{i=1}^{k} ni ( µi µT + µµiT )

Now let us look at the RHS; we have

SW + SB = Σ_{i=1}^{k} [ Σ_{xj ∈Ci} (xj − µi )(xj − µi )T + ni (µi − µ)(µi − µ)T ]
        = Σ_{i=1}^{k} Σ_{xj ∈Ci} ( xj xjT − xj µiT − µi xjT + µi µiT ) + Σ_{i=1}^{k} ni ( µi µiT − µi µT − µµiT + µµT )
        = Σ_{i=1}^{k} Σ_{xj ∈Ci} ( xj xjT + µµT ) − Σ_{i=1}^{k} ni ( µi µT + µµiT ) + Σ_{i=1}^{k} Σ_{xj ∈Ci} ( 2µi µiT − xj µiT − µi xjT )
        = Σ_{i=1}^{k} Σ_{xj ∈Ci} ( xj xjT + µµT ) − Σ_{i=1}^{k} ni ( µi µT + µµiT )

The last step follows from the fact that

Σ_{i=1}^{k} Σ_{xj ∈Ci} ( 2µi µiT − xj µiT − µi xjT ) = Σ_{i=1}^{k} ( 2ni µi µiT − ni µi µiT − ni µi µiT ) = 0

Thus S = SW + SB .

Q10. Consider the dataset in Figure 17.1. Compute the silhouette coefficient for the point
labeled c.

Answer: To answer this question, we first need the clusters. Assume that we find the following clusters: C1 = {a, b, c, d, e}, C2 = {g, i}, C3 = {f, h, j } and C4 = {k}.
The mean distance from c to the other points in its own cluster is µin (c) = (2.92 + 1.5 + 2 + 1)/4 = 1.855.
Based on the distances, c is closest to cluster C3 , and the mean of these distances is µout^min (c) = (5 + 4.24 + 4.92)/3 = 4.72. Just for comparison, the average distance to C2 is (5 + 5.83)/2 = 5.4.
Thus, the silhouette coefficient for c is sc = (4.72 − 1.855)/4.72 = 0.61.


[Figure 17.1. Data for Q10 (scatter plot of points a–k); figure not reproduced here.]

Q11. Describe how one may apply the gap statistic methodology for determining the
parameters of density-based clustering algorithms, such as DBSCAN and DEN-
CLUE (see Chapter 15).

Answer: Let us consider the two parameters in DBSCAN, namely the radius ǫ and
the minimum number of points minpts. Let us fix ǫ for the moment, and try to
get a good value of minpts. Using D we can compute the number of core points
N(minpts) for different values of minpts. Likewise, given t random samples Ri ,
we can compute the average number of core points µN (minpts), and the standard
deviation σN (minpts). Finally, we can choose the value of minpts that maximizes
the gap N(minpts) − µN (minpts), since we want to see more core points in a
well-clustered dataset than we would expect under the null hypothesis that the data
is randomly generated from the input space.
For estimating ǫ we can compute the number of points within the ǫ-neighborhood
of each point and get the average of these, say, µD (ǫ) in the dataset D. Next, we can
compute the average number of points within the ǫ-neighborhood in each of the
t random datasets Ri . We can then compute the mean and standard deviation of
these averages, i.e., µR and σR . We can then look for those values of ǫ that show a
large gap.
A similar approach can be used to estimate the spread parameter h and the
density threshold ξ in DENCLUE.
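
A sketch of this recipe for minpts (mine, not from the text); data and the sampling region are assumed inputs, eps is held fixed, and the function names are my own.

import numpy as np

def num_core_points(X, eps, minpts):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return int(np.sum((D <= eps).sum(axis=1) >= minpts))

def gap_for_minpts(data, eps, minpts_values, t=20, rng=np.random.default_rng(0)):
    lo, hi = data.min(axis=0), data.max(axis=0)
    gaps = {}
    for minpts in minpts_values:
        obs = num_core_points(data, eps, minpts)
        null = [num_core_points(rng.uniform(lo, hi, size=data.shape), eps, minpts)
                for _ in range(t)]
        gaps[minpts] = obs - np.mean(null)       # a large gap indicates a good minpts
    return gaps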

P A R T FOUR CLASSIFICATION
C H A P T E R 18 Probabilistic Classification

18.5 EXERCISES

Q1. Consider the dataset in Table 18.1. Classify the new point: (Age=23, Car=truck) via
the full and naive Bayes approach. You may assume that the domain of Car is given
as {sports, vintage, suv, truck}.

Table 18.1. Data for Q1

xi Age Car Class


x1 25 sports L
x2 20 vintage H
x3 25 sports L
x4 45 suv H
x5 20 sports H
x6 25 suv H

Answer: Naive Bayes: Let us consider the naive Bayes approach first, which
assumes that each attribute is independent of the other. That is

P ((23, truck)|H) = P (23|H) × P (truck|H)

P ((23, truck)|L) = P (23|L) × P (truck|L)


For Age, we have DL = {x1 , x3 } and DH = {x2 , x4 , x5 , x6 }. We can estimate the
mean and variance from these labeled subsets, as shown in the table below
                   H                                          L
µH = (20 + 45 + 20 + 25)/4 = 110/4 = 27.5        µL = (25 + 25)/2 = 25
σH = √(425/4) = 10.31                            σL = √(0/2) = 0

Using the univariate normal distribution, we obtain P (23|H) = N(23|µH =


27.5, σH = 10.31) = 0.035, and P (23|L) = N(23|µL = 25, σL = 0) = 0. Note that due
to limited data we obtain σL = 0, which leads to a zero likelihood for 23 to come
from class L.


For Car, which is categorical, we immediately run into a problem, since the value
truck does not appear in the training set. We could assume that P (truck|H) and
P (truck|L) are both zero. However, we desire to have some small probability of
observing each values in the domain of the attribute. We fix the problem using the
pseudo-count approach of adding a count of one to the observed counts of each
value for each class, as shown in the table below.
                   H                                          L
P (sports|H) = (1+1)/(4+4) = 2/8                 P (sports|L) = (2+1)/(2+4) = 3/6
P (vintage|H) = (1+1)/(4+4) = 2/8                P (vintage|L) = (0+1)/(2+4) = 1/6
P (suv|H) = (2+1)/(4+4) = 3/8                    P (suv|L) = (0+1)/(2+4) = 1/6
P (truck|H) = (0+1)/(4+4) = 1/8                  P (truck|L) = (0+1)/(2+4) = 1/6
Using the above probabilities, we finally obtain

P ((23, truck)|H) = P (23|H) × P (truck|H) = 0.035 × 1/8 = 0.0044

P ((23, truck)|L) = P (23|L) × P (truck|L) = 0 × 1/6 = 0


We next compute

P (23, truck) = P ((23, truck)|H) × P (H) + P ((23, truck)|L) × P (L)
             = 0.0044 × 4/6 + 0 × 2/6 = 0.003

We then obtain the posterior probabilities as follows:

P (H|(23, truck)) = P ((23, truck)|H) × P (H) / P (23, truck) = (0.0044 × 4/6)/0.003 ≈ 1
P (L|(23, truck)) = P ((23, truck)|L) × P (L) / P (23, truck) = (0 × 2/6)/0.003 = 0
Thus we classify (23, truck) as high risk (H) using Naive Bayes.
Full Bayes: We now turn out attention to the Full Bayes approach. Here we need
the estimates for the joint probability of Age and Car. Now, there is obviously not
enough data to estimate all the probabilities. Also, for Age, which is a numeric
attribute we cannot get the probability of a specific value. Instead, we will use the
cumulative distribution function for Age. To account for missing values for Car, we
can assume a base occurrence of 1 for each value, but spread out uniformly among
all the intervals for Age. So our estimates for the joint probabilities is based on the
following cumulative occurrence counts:
class Car Age < 20 Age ≤ 20 Age ≤ 25 Age ≤ 40 Age > 40
vintage 1+1/5 2+2/5 2+3/5 2+4/5 2+1
sports 1+1/5 2+2/5 2+3/5 2+4/5 2+1
H
suv 1+1/5 1+2/5 2+3/5 3+4/5 3+1
truck 1/5 2/5 3/5 4/5 1
vintage 1+1/5 1+2/5 1+3/5 1+4/5 1+1
sports 1+1/5 1+2/5 3+3/5 3+4/5 3+1
L
suv 1/5 2/5 3/5 4/5 5/5
truck 1/5 2/5 3/5 4/5 5/5


Table 18.2. Data for Q2

xi a1 a2 a3 Class
x1 T T 5.0 Y
x2 T T 7.0 Y
x3 T F 8.0 N
x4 F F 3.0 Y
x5 F T 7.0 N
x6 F T 4.0 N
x7 F F 5.0 N
x8 T F 6.0 Y
x9 F T 1.0 N

Thus P (23, truck|H) = P (Age ≤ 25, truck|H) − P (Age ≤ 20, truck|H) = 1/5, and likewise P (23, truck|L) = 1/5.
Therefore, we have

P (H|(23, truck)) ∝ P ((23, truck)|H) × P (H) = 1/5 × 4/6 = 0.133

and

P (L|(23, truck)) ∝ P ((23, truck)|L) × P (L) = 1/5 × 2/6 = 0.067

Thus we classify (23, truck) as high risk (H) using Full Bayes.
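
The naive Bayes part of this question can be reproduced with the following sketch (mine, assuming numpy; the dictionary names are my own):

import numpy as np

ages   = {"H": [20, 45, 20, 25], "L": [25, 25]}
cars   = {"H": ["vintage", "suv", "sports", "suv"], "L": ["sports", "sports"]}
domain = ["sports", "vintage", "suv", "truck"]
prior  = {"H": 4/6, "L": 2/6}

def gaussian(x, mu, sigma):
    if sigma == 0:
        return 0.0                       # degenerate case, as in the text
    return np.exp(-(x - mu)**2 / (2*sigma**2)) / (np.sqrt(2*np.pi) * sigma)

post = {}
for c in ("H", "L"):
    mu, var = np.mean(ages[c]), np.var(ages[c])                          # biased (MLE) variance
    p_age = gaussian(23, mu, np.sqrt(var))
    p_car = (cars[c].count("truck") + 1) / (len(cars[c]) + len(domain))  # pseudo-counts
    post[c] = p_age * p_car * prior[c]
total = sum(post.values())
print({c: round(v / total, 3) for c, v in post.items()})   # H should get probability ~1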

Q2. Given the dataset in Table 18.2, use the naive Bayes classifier to classify the new point
(T, F, 1.0).

Answer: We have P (a1 = T|Y) = 3/4 and P (a1 = T|N) = 1/5, and P (a2 = F |Y) =
2/4 and P (a2 = F |N) = 2/5. The mean and variance for a3 for Y are µY = 5.25 and
σY = 1.71, and those for N are µN = 5 and σN = 2.74. Using the normal density
function, we have P (1.0|Y) = 0.0106 and P (1.0|N) = 0.0502.
Now, P (T, F, 1.0|Y) = 0.75 · 0.5 · 0.0106 = 0.003975 and P (T, F, 1.0|N) = 0.2 · 0.4 · 0.0502 = 0.004016. Next we have

P (Y|T, F, 1.0) ∝ 0.003975 · 4/9 = 0.00177

P (N|T, F, 1.0) ∝ 0.004016 · 5/9 = 0.00223


Thus, we predict the class as N.

Q3. Consider the class means and covariance matrices for classes c1 and c2 :

µ1 = (1, 3)        µ2 = (5, 5)

    Σ1 = 5 3        Σ2 = 2 0
         3 2             0 1


Classify the point (3, 4)T via the (full) Bayesian approach, assuming normally distributed classes, and P (c1 ) = P (c2 ) = 0.5. Show all steps. Recall that the inverse of a 2 × 2 matrix A = [ a b ; c d ] is given as A−1 = (1/det(A)) [ d −b ; −c a ].

Answer: First, compute Σ1−1 = [ 2 −3 ; −3 5 ] and Σ2−1 = [ 1/2 0 ; 0 1 ].
Now x − µ1 = (3, 4) − (1, 3) = (2, 1) and x − µ2 = (3, 4) − (5, 5) = (−2, −1).
Computing the Mahalanobis distance, we get for c1

(x − µ1 )T Σ1−1 (x − µ1 ) = (2  1) [ 2 −3 ; −3 5 ] (2  1)T = 1

and for c2 we get

(−2  −1) [ 1/2 0 ; 0 1 ] (−2  −1)T = 3
Also, 1/√det(Σ1 ) = 1, and 1/√det(Σ2 ) = 1/√2 = 0.71.
Thus P (c1 |x) = e^{−1/2} /(e^{−1/2} + 0.71 e^{−3/2} ) = 0.607/(0.607 + 0.158) = 0.79, and P (c2 |x) = 0.21. We thus classify x as c1 .
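
A short numeric check (mine, assuming SciPy is available):

import numpy as np
from scipy.stats import multivariate_normal as mvn

x = np.array([3, 4])
mu1, S1 = np.array([1, 3]), np.array([[5, 3], [3, 2]])
mu2, S2 = np.array([5, 5]), np.array([[2, 0], [0, 1]])

f1 = mvn.pdf(x, mean=mu1, cov=S1) * 0.5
f2 = mvn.pdf(x, mean=mu2, cov=S2) * 0.5
print(round(f1 / (f1 + f2), 2), round(f2 / (f1 + f2), 2))   # expected about 0.79 and 0.21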

Table 19.1. Data for Q2: Age is numeric and Car is categorical. Risk gives the class
label for each point: high (H) or low (L)

Point Age Car Risk


x1 25 Sports L
x2 20 Vintage H
x3 25 Sports L
x4 45 SUV H
x5 20 Sports H
x6 25 SUV H

C H A P T E R 19 Decision Tree Classifier

19.4 EXERCISES

Q1. True or False:


(a) High entropy means that the partitions in classification are “pure.”
Answer: False. High entropy means that the classes are very mixed, and not
pure.
(b) Multiway split of a categorical attribute generally results in more pure partitions
than a binary split.
Answer: True. In a binary split, there is more of a chance that the classes will
be mixed. In multiway partitioning the partitions are smaller and there is more
of a chance of being pure.

Q2. Given Table 19.1, construct a decision tree using a purity threshold of 100%. Use
information gain as the split point evaluation measure. Next, classify the point
(Age=27,Car=Vintage).

Answer: Let us consider how a complete decision tree is induced for the dataset in Table 19.1. In the complete dataset we have P (H|D) = PH = 4/6 = 2/3 and P (L|D) = PL = 2/6 = 1/3. Thus the entropy of D is

H(D) = − ( (2/3) log2 (2/3) + (1/3) log2 (1/3) ) = −(−0.390 − 0.528) = 0.918

At the root of the decision tree, we consider all possible splits on Age and Car.
For Age, the possible distinct splits to consider are Age ≤ 22.5 and Age ≤ 35, which
were chosen to be the mid-points between the distinct values, namely, 20, 25, and
45, that we observe for Age.
(a) For Age ≤ 22.5, DL includes only the points x2 and x5 , whereas DR comprises the remaining points: x1 , x3 , x4 , and x6 . For DL , this yields PL = 0 and PH = 1, whereas for DR we have PL = 2/4 and PH = 2/4. The weighted entropy is then

H(DL , DR ) = (2/6) H(DL ) + (4/6) H(DR )
            = (2/6)(0) + (4/6) ( −(1/2) log2 (1/2) − (1/2) log2 (1/2) )
            = −(2/3) log2 (1/2) = 0.67

This yields an information gain of 0.918 − 0.67 = 0.248.
(b) In a similar manner we can compute the weighted entropy for Age ≤ 35. Here DR = {x4 } and DL has the remaining points, so that H(DL ) = −( (2/5) log2 (2/5) + (3/5) log2 (3/5) ) = 0.971 and H(DR ) = 0. The split entropy is then H(DL , DR ) = (5/6)(0.971) = 0.809, and the information gain is 0.918 − 0.809 = 0.109, which is not as high as for Age ≤ 22.5.
Next, we evaluate all possible splits for Car. Note that a categorical attribute with v distinct values yields, in general, 2^{v−1} − 1 distinct binary splits; this can be reduced to O(v) candidate splits by using a greedy split selection approach. For Car the possible values are {Sports(S), Vintage(V), SUV(U)}, which yields the following three distinct splits:
Car ∈        Car ∉
{S}          {V, U}
{V}          {S, U}
{U}          {S, V}
Note that the split Car ∈ {V, U} is essentially the same as the split Car ∈ {S}, the
only difference being that the decision has been “reversed”. It is therefore not
a distinct split, and we do not consider such splits. Next we evaluate the three
categorical splits as follows:
(a) For the split Car ∈ {S}, DL = {x1 , x3 , x5 } and DR = {x2 , x4 , x6 }. For DL , this yields PL = 2/3 and PH = 1/3, and for DR , PL = 0 and PH = 1. The weighted entropy of the split is then

H(DL , DR ) = (3/6) H(DL ) + (3/6) H(DR )
            = (3/6) ( −(1/3) log2 (1/3) − (2/3) log2 (2/3) ) + (3/6)(0)
            = 0.459


This yields an information gain of 0.918 − 0.459 = 0.459.


(b) For Car ∈ {V}, we get the same information gain as for Age ≤ 35, i.e., 0.109
(c) For Car ∈ {U}, the gain is the same as for Age ≤ 22.5, i.e., 0.248.
Among all the possible split points for both Age and Car, the one with the highest
information gain is Car ∈ {S}, which is chosen as the best split decision at the root
of the decision tree, as shown in Figure 19.1. We therefore make this split and
recursively call the decision tree algorithm on each new subset DL = {x1 , x3 , x5 }
and DR = {x2 , x4 , x6 }.
Notice that for DR all points are already labeled as high risk (H). Since the
partition is already pure, we make it a leaf node, labeled as H. On the other hand,
DL is not completely pure, so we consider partitioning it further. Since all points in
DL have Car ∈ {S}, we cannot use Car to further distinguish the points. Further, for
Age, Age ≤ 22.5 is the only possible split to consider. Note that the entropy of DL is
given as

H(DL ) = − ( (2/3) log2 (2/3) + (1/3) log2 (1/3) ) = 0.918

For Age ≤ 22.5, DLL = {x5 }, whereas DLR = {x1 , x3 }. For DLL , we get PL = 0 and PH = 1, and for DLR , we get PL = 1 and PH = 0. The weighted entropy is then

H(DLL = {5}, DLR = {1, 3}) = (1/3) H(DLL ) + (2/3) H(DLR ) = (1/3)(0) + (2/3)(0) = 0

Thus the information gain is 0.918 − 0 = 0.918. In this example, this is the only possible split decision. After DL is split, we obtain the two new leaves DLL , which is labeled as high-risk (H), and DLR , which is labeled as low-risk (L). The full decision tree is shown in Figure 19.1.
One of the advantages of decision trees, is that each path from the root to a leaf
can be written as a rule. For our example tree above, we obtain the following three
rules
1) R1 : if Car ∈ {S} and Age ≤ 22.5, then Risk = H
2) R2 : if Car ∈ {S} and Age > 22.5, then Risk = L
3) R3 : if Car ∈/ {S}, then Risk = H
This is one of the strengths of decision trees, namely the ability to aid understanding
of the model via simple rules presented to the user.
Once a decision tree model has been built, it can be used to classify new points.
For example, for the test point Age = 27, and Car = Vintage, we can classify the
point by applying the set of decisions starting at the root. First we check whether
Car ∈ {S}. Since this test will be false, we go to the right branch, and since it is a leaf,
we predict the class to be H.

Q3. What is the maximum and minimum value of the CART measure [Eq. (19.7)] and
under what conditions?

D
Car ∈ {S}

Yes No

DL DR
Age ≤ 22.5 H

Yes No

DLL DLR
H L

Figure 19.1. Induced Decision Tree

Answer: The CART measure is given as:

CART(DY , DN ) = 2 (nY /n)(nN /n) Σ_{i=1}^{k} |P (ci |DY ) − P (ci |DN )|

The minimum value is obtained when the class distribution on both sides of the split is identical, in which case the summation Σ_{i=1}^{k} |P (ci |DY ) − P (ci |DN )| goes to zero. Therefore the minimum value is zero.
For the maximum value, first note that if α = nY /n, then 1 − α = nN /n, and the product α(1 − α) = α − α² is maximized by taking the derivative with respect to α and setting it to zero, i.e., 1 − 2α = 0, which implies α = 1/2. In other words, the product in the CART measure is maximized when the two partitions are balanced. Next, the summation is maximized when the two partitions share no classes, for example when each side is pure with a different class; since P (·|DY ) and P (·|DN ) are probability distributions, the summation can then reach its largest possible value Σ_{i=1}^{k} ( P (ci |DY ) + P (ci |DN ) ) = 2. Thus the maximum value of the CART measure is

2 · (1/2) · (1/2) · 2 = 1

Q4. Given the dataset in Table 19.2. Answer the following questions:
(a) Show which decision will be chosen at the root of the decision tree using
information gain [Eq. (19.5)], Gini index [Eq. (19.6)], and CART [Eq. (19.7)]
measures. Show all split points for all attributes.
Answer: For the information gain we use log10 as opposed to log2 since it does
not qualitatively change the results.


Table 19.2. Data for Q4

Instance a1 a2 a3 Class
1 T T 5.0 Y
2 T T 7.0 Y
3 T F 8.0 N
4 F F 3.0 Y
5 F T 7.0 N
6 F T 4.0 N
7 F F 5.0 N
8 T F 6.0 Y
9 F T 1.0 N

The entropy for the whole dataset is given as

H(D) = −(4/9) log10 (4/9) − (5/9) log10 (5/9) = 0.2983

The Gini index for the whole dataset is:

G(D) = 1 − ( (4/9)² + (5/9)² ) = 0.4938

Consider the split for attribute a1 , which has only one possible split, namely a1 ∈ {T}. The split entropy is given as:

H(DL , DR ) = (4/9) ( −(1/4) log10 (1/4) − (3/4) log10 (3/4) ) + (5/9) ( −(1/5) log10 (1/5) − (4/5) log10 (4/5) ) = 0.2293

Thus the gain is 0.2983 − 0.2293 = 0.0690.
The Gini index for the split is

G(DL , DR ) = (4/9) ( 1 − (1/4)² − (3/4)² ) + (5/9) ( 1 − (1/5)² − (4/5)² ) = 0.3444

The CART measure for the split is given as:

CART(DL , DR ) = 2 · (4/9) · (5/9) · ( |3/4 − 1/5| + |1/4 − 4/5| ) = 0.5432
Likewise for attribute a2 , we have only one split, and the split entropy is:

H(DL , DR ) = (5/9) ( −(2/5) log10 (2/5) − (3/5) log10 (3/5) ) + (4/9) ( −(1/2) log10 (1/2) − (1/2) log10 (1/2) ) = 0.2962

with gain 0.2983 − 0.2962 = 0.0021.
The Gini index for the split is

G(DL , DR ) = (4/9) ( 1 − (1/2)² − (1/2)² ) + (5/9) ( 1 − (2/5)² − (3/5)² ) = 0.4889

The CART measure for the split is given as:

CART(DL , DR ) = 2 · (4/9) · (5/9) · ( |1/2 − 2/5| + |1/2 − 3/5| ) = 0.0988


For attribute a3 there are several numeric split points, namely a3 < 3.0, a3 < 4.0,
a3 < 5.0, a3 < 6.0, a3 < 7.0, and a3 < 8.0. The split entropy for each of these cases
is as follows:
(a) For a3 < 3.0 we have

H(DL , DR ) = 0 + (8/9) ( −(1/2) log10 (1/2) − (1/2) log10 (1/2) ) = 0.2676

The gain is 0.2983 − 0.2676 = 0.0307.
The Gini index for the split is

G(DL , DR ) = (8/9) ( 1 − (1/2)² − (1/2)² ) = 0.4444

The CART measure for the split is given as:

CART(DL , DR ) = 2 · (1/9) · (8/9) · ( |1 − 1/2| + |0 − 1/2| ) = 0.1975
(b) For a3 < 4.0, we have

H(DL , DR ) = (2/9) ( −2 · (1/2) log10 (1/2) ) + (7/9) ( −(3/7) log10 (3/7) − (4/7) log10 (4/7) ) = 0.2976

The gain is 0.2983 − 0.2976 = 0.0007.
The Gini index for the split is

G(DL , DR ) = (2/9) ( 1 − (1/2)² − (1/2)² ) + (7/9) ( 1 − (3/7)² − (4/7)² ) = 0.4921

The CART measure for the split is given as:

CART(DL , DR ) = 2 · (2/9) · (7/9) · ( |1/2 − 3/7| + |1/2 − 4/7| ) = 0.0494
(c) For a3 < 5.0 we have

H(DL , DR ) = (1/3) ( −(1/3) log10 (1/3) − (2/3) log10 (2/3) ) + (2/3) ( −2 · (1/2) log10 (1/2) ) = 0.2928

The gain is 0.2983 − 0.2928 = 0.0055.
The Gini index for the split is

G(DL , DR ) = (1/3) ( 1 − (1/3)² − (2/3)² ) + (2/3) ( 1 − (1/2)² − (1/2)² ) = 0.4815

The CART measure for the split is given as:

CART(DL , DR ) = 2 · (1/3) · (2/3) · ( |1/3 − 1/2| + |2/3 − 1/2| ) = 0.1481
(d) For a3 < 6.0 we have

H(DL , DR ) = (5/9) ( −(2/5) log10 (2/5) − (3/5) log10 (3/5) ) + (4/9) ( −2 · (1/2) log10 (1/2) ) = 0.2962

The gain is 0.2983 − 0.2962 = 0.0021.
The Gini index for the split is

G(DL , DR ) = (5/9) ( 1 − (2/5)² − (3/5)² ) + (4/9) ( 1 − (1/2)² − (1/2)² ) = 0.4889

The CART measure for the split is given as:

CART(DL , DR ) = 2 · (5/9) · (4/9) · ( |1/2 − 2/5| + |1/2 − 3/5| ) = 0.0988

(e) For a3 < 7.0 we have

H(DL , DR ) = (2/3) ( −2 · (1/2) log10 (1/2) ) + (1/3) ( −(1/3) log10 (1/3) − (2/3) log10 (2/3) ) = 0.2928

The gain is 0.2983 − 0.2928 = 0.0055.
The Gini index for the split is

G(DL , DR ) = (2/3) ( 1 − (1/2)² − (1/2)² ) + (1/3) ( 1 − (1/3)² − (2/3)² ) = 0.4815

The CART measure for the split is given as:

CART(DL , DR ) = 2 · (2/3) · (1/3) · ( |1/2 − 1/3| + |1/2 − 2/3| ) = 0.1481
3 3 3 2 3 2

(f) For a3 < 8.0 we have

H(DL , DR ) = (8/9) ( −(1/2) log10 (1/2) − (1/2) log10 (1/2) ) + 0 = 0.2676

The gain is 0.2983 − 0.2676 = 0.0307.
The Gini index for the split is

G(DL , DR ) = (8/9) ( 1 − (1/2)² − (1/2)² ) = 0.4444

The CART measure for the split is given as:

CART(DL , DR ) = 2 · (1/9) · (8/9) · ( |0 − 1/2| + |1 − 1/2| ) = 0.1975
9 9 2 2

So the best split for all three measures is a1 ∈ {T}. It has the highest gain (0.069),
the lowest Gini value (0.3444), and the highest CART measure (0.5432).
(b) What happens to the purity if we use Instance as another attribute? Do you think
this attribute should be used for a decision in the tree?
Answer: It is not advisable to use the instance id as a split value. The reason is
that in general the id is just some unique identifier for each instance and each
row will thus have a different value. Further, there is no reason to assume that
the value follows any numeric scale. So what might happen is that we might end


up with relatively pure splits, at least on one side, if we only allow binary splits,
and a completely pure split if we allow multi-way splits. However such a split is
of no “predictive” value.
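
The split evaluations above can be recomputed with the following sketch (mine, assuming numpy); each split of Table 19.2 is described by its (Y, N) class counts on the two sides.

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float); p = p[p > 0] / p.sum()
    return -(p * np.log10(p)).sum()            # log10, as used in this answer

def evaluate(left, right):
    """left/right are (num_Y, num_N) class counts for the two sides of a split."""
    nl, nr = sum(left), sum(right); n = nl + nr
    split_H = nl/n * entropy(left) + nr/n * entropy(right)
    gain = entropy((left[0] + right[0], left[1] + right[1])) - split_H
    gini = (nl/n * (1 - sum((c/nl)**2 for c in left))
            + nr/n * (1 - sum((c/nr)**2 for c in right)))
    cart = 2 * (nl/n) * (nr/n) * sum(abs(l/nl - r/nr) for l, r in zip(left, right))
    return round(gain, 4), round(gini, 4), round(cart, 4)

print(evaluate((3, 1), (1, 4)))   # a1 in {T}: expect about (0.0690, 0.3444, 0.5432)
print(evaluate((2, 3), (2, 2)))   # a2 in {T}: expect about (0.0021, 0.4889, 0.0988)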

Q5. Consider Table 19.3. Let us make a nonlinear split instead of an axis parallel split,
given as follows: AB − B2 ≤ 0. Compute the information gain of this split based on
entropy (use log2 , i.e., log to the base 2).

Table 19.3. Data for Q5

A B Class
x1 3.5 4 H
x2 2 4 H
x3 9.1 4.5 L
x4 2 6 H
x5 1.5 7 H
x6 7 6.5 H
x7 2.1 2.5 L
x8 8 4 L

Answer: For the full data PL = 3/8 and PH = 5/8, therefore the entropy is:

H(D) = −(3/8 log2 3/8 + 5/8 log2 5/8) = 0.954

As for the split, for those points with AB−B2 ≤ 0, we have DL = {x1 , x2 , x4 , x5 , x7 }
and DR = {x3 , x6 , x8 }. Therefore entropy values for each side of the split are:

H(DL ) = −(4/5 log2 4/5 + 1/5 log2 1/5) = 0.722

H(DR ) = −(1/3 log2 1/3 + 2/3 log2 2/3) = 0.918


The split entropy is therefore:

H(DL , DR ) = 5/8 · 0.722 + 3/8 · 0.918 = 0.796

The Information Gain is then: 0.954 − 0.796 = 0.158.

C H A P T E R 20 Linear Discriminant Analysis

20.4 EXERCISES

Q1. Consider the data shown in Table 20.1. Answer the following questions:
(a) Compute µ+1 and µ−1 , and B, the between-class scatter matrix.
Answer: We have µ+1 = (3.75, 3.45)T and µ−1 = (2.25, 1.55)T . Next, their difference is d = µ+1 − µ−1 = (1.5, 1.9)T , and finally

    B = ddT =  2.25  2.85
               2.85  3.61

Table 20.1. Dataset for Q1

i xi yi
x1 (4,2.9) 1
x2 (3.5,4) 1
x3 (2.5,1) −1
x4 (2,2.1) −1

(b) Compute S+1 and S−1 , and S, the within-class scatter matrix.
Answer: For class +1 we have x1 − µ+1 = (0.25, −0.55)T and x2 − µ+1 = (−0.25, 0.55)T . Therefore, the scatter matrix for +1 is given as

    S1 = 2 ·   0.0625  −0.1375    =    0.125  −0.275
              −0.1375   0.3025        −0.275   0.605

For class −1, we have x3 − µ−1 = (0.25, −0.55)T and x4 − µ−1 = (−0.25, 0.55)T , and thus we also have S−1 = S1 . Therefore,

    S = S1 + S−1 =   0.25  −0.55
                    −0.55   1.21


(c) Find the best direction w that discriminates between the classes. Use the fact that the inverse of the matrix A = [ a b ; c d ] is given as A−1 = (1/det(A)) [ d −b ; −c a ].
Answer: Since det(S) = 0, we unfortunately have a degenerate case to deal with. To make the matrix non-singular, we add a small diagonal term as follows:

    S = S + 0.01 I =   0.26  −0.55
                      −0.55   1.22

That is, we increase the diagonal entries slightly to make the determinant non-zero. We have det(S) = 0.0147, and the inverse of the scatter matrix is

    S−1 = (1/0.0147)   1.22  0.55
                       0.55  0.26

The best LD direction is

    w = S−1 (µ+1 − µ−1 ) =   1.22  0.55     1.5    =   2.875
                             0.55  0.26     1.9        1.319

where we have dropped the 1/det(S) factor, since we normalize w anyway. After normalizing, we get

    w = (1/3.163)   2.875    =   0.909
                    1.319        0.417

(d) Having found the direction w, find the point on w that best separates the two
classes.
Answer: One approach is to project all four points onto w, and then find the point of best separation. For instance, projecting x1 onto w, we obtain the offset along w as x1^T w = 4.845. Likewise, we have x2^T w = 4.849, x3^T w = 2.689, and x4^T w = 2.694. The point on w that best separates the two classes can be taken
as the mid-point of the two closest points from opposite classes, i.e., half-way
between the projections of x2 and x3 , given as (4.849 + 2.689)/2 = 3.769.
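
The whole of Q1 can be reproduced with a few lines of NumPy. This is only a
verification sketch (not part of the original solution); the regularization
constant 0.01 follows part (c).

import numpy as np

# Data from Table 20.1
X_pos = np.array([[4.0, 2.9], [3.5, 4.0]])   # class +1
X_neg = np.array([[2.5, 1.0], [2.0, 2.1]])   # class -1

mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
d = mu_pos - mu_neg                          # (1.5, 1.9)
B = np.outer(d, d)                           # between-class scatter

# Within-class scatter matrices and their sum
S_pos = (X_pos - mu_pos).T @ (X_pos - mu_pos)
S_neg = (X_neg - mu_neg).T @ (X_neg - mu_neg)
S = S_pos + S_neg                            # singular for this data

# Regularize as in part (c) and compute the LDA direction
w = np.linalg.solve(S + 0.01 * np.eye(2), d)
w = w / np.linalg.norm(w)                    # ~ (0.909, 0.417)
print(B, S, w)

# Projections of all points onto w: ~ 4.845, 4.849, 2.689, 2.694
print(np.vstack([X_pos, X_neg]) @ w)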

Q2. Given the labeled points (from two classes) shown in Figure 20.1, and given that the
inverse of the within-class scatter matrix is

 
[  0.056  −0.029 ]
[ −0.029   0.052 ]

Find the best linear discriminant line w, and sketch it.

Answer: Since we are given the inverse of the scatter matrix, to obtain w, we only
need to compute the difference vector for the two means. We have

µ1 = ((2, 3)T + (3, 3)T + (3, 4)T + (5, 8)T + (7, 7)T )/5 = (4, 5)T
µ2 = ((5, 4)T + (6, 5)T + (7, 4)T + (7, 5)T + (8, 2)T + (9, 4)T )/6 = (7, 4)T
µ1 − µ2 = (−3, 1)T


[Figure omitted: scatter plot of the labeled points from the two classes, triangles and circles.]
Figure 20.1. Dataset for Q2.

The best LD direction is therefore


     
w = [  0.056  −0.029 ] · [ −3 ]  =  [ −0.197 ]
    [ −0.029   0.052 ]   [  1 ]     [  0.139 ]

Q3. Maximize the objective in Eq. (20.7) by explicitly considering the constraint wT w = 1,
that is, by using a Lagrange multiplier for that constraint.

Answer: The maximization objective with the constraint is given as

max_w J(w) = (wT B w)/(wT S w) − α(wT w − 1)

Taking the derivative of J(w) with respect to the vector w, and setting the result
to zero, gives us

[2 B w (wT S w) − 2 S w (wT B w)] / (wT S w)² − 2αw = 0
⟹ [B w (wT S w) − S w (wT B w)] / (wT S w)² = αw

Multiplying both sides on the left by wT, we get

[wT B w (wT S w) − wT S w (wT B w)] / (wT S w)² = α wT w
⟹ α = 0

since the numerator on the left vanishes and wT w = 1. Plugging α = 0 into the
derivative above leads to the same formulation as shown in Eq. (20.8).


Q4. Prove the equality in Eq. (20.19). That is, show that

N1 = Σ_{xi∈D1} Ki KiT − n1 m1 m1T = (Kc1) (In1 − (1/n1) 1n1×n1) (Kc1)T

Answer: Compare the two sides entry by entry. The (a, b)-th entry of Σ_{xi∈D1} Ki KiT is
Σ_{xi∈D1} Kia Kib. Since m1 = (1/n1) Σ_{xj∈D1} Kj, the (a, b)-th entry of n1 m1 m1T is
(1/n1) Σ_{xi∈D1} Σ_{xj∈D1} Kia Kjb. Therefore the (a, b)-th entry of N1 is

Σ_{xi∈D1} Kia Kib − (1/n1) Σ_{xi∈D1} Σ_{xj∈D1} Kia Kjb

On the other hand, the columns of Kc1 are exactly the vectors Ki for xi ∈ D1, so
(Kc1)(Kc1)T = Σ_{xi∈D1} Ki KiT, and (Kc1) 1n1×n1 (Kc1)T has (a, b)-th entry
Σ_{xi∈D1} Σ_{xj∈D1} Kia Kjb. Subtracting (1/n1) times the latter from the former gives
exactly the entries above, and hence

N1 = (Kc1) (In1 − (1/n1) 1n1×n1) (Kc1)T
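
As a sanity check, the identity can be verified numerically for an arbitrary matrix whose
columns play the role of the Ki vectors. The following short Python sketch (not from the
book; the toy sizes are illustrative) does exactly that.

import numpy as np

rng = np.random.default_rng(0)
n, n1 = 6, 4                      # toy sizes, for illustration only
Kc1 = rng.random((n, n1))         # columns are the vectors K_i for x_i in D1

# Left-hand side: sum of outer products minus n1 * m1 m1^T
m1 = Kc1.mean(axis=1)             # m1 = (1/n1) * sum_i K_i
N1_lhs = sum(np.outer(Kc1[:, i], Kc1[:, i]) for i in range(n1)) - n1 * np.outer(m1, m1)

# Right-hand side: (Kc1)(I - (1/n1) * 1_{n1 x n1})(Kc1)^T
centering = np.eye(n1) - np.ones((n1, n1)) / n1
N1_rhs = Kc1 @ centering @ Kc1.T

print(np.allclose(N1_lhs, N1_rhs))   # True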

CHAPTER 21  Support Vector Machines

21.7 EXERCISES

Q1. Consider the dataset in Figure 21.1, which has points from two classes c1 (triangles)
and c2 (circles). Answer the questions below.
(a) Find the equations for the two hyperplanes h1 and h2 .
Answer: Note that for h1 we have the two points on the line, namely (6, 0) and
(1, 10), which gives the slope as m1 = (10 − 0)/(1 − 6) = −2. Using (6, 0) as the point,
we get the equation of the line as (x2 − 0)/(x1 − 6) = −2 ⟹ 2x1 + x2 − 12 = 0.
For h2 we have the two points on the line (2, 0) and (8, 10), which gives the
slope as m2 = (10 − 0)/(8 − 2) = 5/3. Using (2, 0) as the point, we get the equation of
the line as (x2 − 0)/(x1 − 2) = 5/3 ⟹ 5x1 − 3x2 − 10 = 0.
[Figure omitted: the labeled points from classes c1 (triangles) and c2 (circles), with the
two hyperplanes h1(x) = 0 and h2(x) = 0 drawn through them.]
Figure 21.1. Dataset for Q1.


(b) Show all the support vectors for h1 and h2 .


Answer: The support vectors for h1 are (2, 6), (3, 4) and (6, 2).
The support vectors for h2 are (3, 4) and (7, 6).
(c) Which of the two hyperplanes is better at separating the two classes based on the
margin computation?
Answer: We compute the margins for the two classifiers by computing the
distance from the support vectors to the hyperplanes.
For h1: 2x1 + x2 − 12 = 0, we have w = (2, 1)T and b = −12, so that h1(x) = wT x + b.
Recall that the distance of a point x = (x1, x2)T to the hyperplane is (wT x + b)/‖w‖,
where here ‖w‖ = √(2² + 1²) = √5. Using support vector (3, 4), we get the distance
d(3, 4) = (6 + 4 − 12)/√5 = −2/√5. Using support vector (6, 2), we get the distance
d(6, 2) = (12 + 2 − 12)/√5 = 2/√5. The total margin is then 2 · (2/√5) = 1.79.
For h2: 5x1 − 3x2 − 10 = 0, we have w = (5, −3)T and b = −10, with ‖w‖ = √(5² + 3²) = 5.83.
Using support vector (3, 4), we get the distance d(3, 4) = (15 − 12 − 10)/5.83 = −7/5.83.
Using support vector (7, 6), we get the distance d(7, 6) = (35 − 18 − 10)/5.83 = 7/5.83.
The total margin is then 14/5.83 = 2.4.
Thus h2 is better than h1.
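
A short Python sketch (not part of the original solution) that recomputes these margins is
given below; it assumes, as in the solution, that the closest support vector on each side
determines the margin.

import numpy as np

def signed_dist(w, b, x):
    # Signed distance of point x to the hyperplane w^T x + b = 0
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Candidate hyperplanes from part (a)
h1 = (np.array([2.0, 1.0]), -12.0)    # 2*x1 + x2 - 12 = 0
h2 = (np.array([5.0, -3.0]), -10.0)   # 5*x1 - 3*x2 - 10 = 0

# Closest support vector on each side of each hyperplane (read off the figure)
sv_h1 = [np.array([3.0, 4.0]), np.array([6.0, 2.0])]
sv_h2 = [np.array([3.0, 4.0]), np.array([7.0, 6.0])]

for (w, b), svs in [(h1, sv_h1), (h2, sv_h2)]:
    dists = [signed_dist(w, b, x) for x in svs]
    margin = sum(abs(dv) for dv in dists)   # h1: ~1.79, h2: ~2.40
    print(dists, margin)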
(d) Find the equation of the best separating hyperplane for this dataset, and show
the corresponding support vectors. You can do this without having to solve
the Lagrangian by considering the convex hull of each class and the possible
hyperplanes at the boundary of the two classes.
Answer: Consider the convex hull of each class, and the candidate hyperplanes at the
boundary between the two hulls.
[Figure omitted: the convex hulls of the two classes, the best hyperplane h, and the
parallel lines through the support vectors.]
The figure suggests the support vectors (6, 2) and (7, 6) for the circles and (3, 4) for
the triangles.
The optimal hyperplane is h : 4x1 − x2 − 15 = 0, which is exactly half-way between
the lines passing through the support vectors, namely 4x1 − x2 − 22 = 0 and
4x1 − x2 − 8 = 0. The margin is 14/√17 = 3.395, which is more than that of the
axis-parallel hyperplane mid-way between (3, 4) and (6, 4), which has a margin of 3.

Table 21.1. Dataset for Q2

i      xi1   xi2   yi    αi
x1     4     2.9    1    0.414
x2     4     4      1    0
x3     1     2.5   −1    0
x4     2.5   1     −1    0.018
x5     4.9   4.5    1    0
x6     1.9   1.9   −1    0
x7     3.5   4      1    0.018
x8     0.5   1.5   −1    0
x9     2     2.1   −1    0.414
x10    4.5   2.5    1    0

Q2. Given the 10 points in Table 21.1, along with their classes and their Lagrangian
multipliers (αi), answer the following questions:
(a) What is the equation of the SVM hyperplane h(x)?
Answer: We have

w = Σ_i αi yi xi
  = 0.414 · (4, 2.9)T − 0.414 · (2, 2.1)T + 0.018 · (3.5, 4)T − 0.018 · (2.5, 1)T
  = 0.414 · (2, 0.8)T + 0.018 · (1, 3)T
  = (0.828 + 0.018, 0.3312 + 0.054)T
  = (0.846, 0.385)T

Next, for the offset we can use the support vector x1 (with y1 = +1) to get

b = y1 − wT x1 = 1 − (0.846 · 4 + 0.385 · 2.9) = 1 − (3.384 + 1.1165) = 1 − 4.5005 = −3.50

Thus, the equation of the hyperplane is

h(x) = wT x + b = 0.846 x1 + 0.385 x2 − 3.50

(b) What is the distance of x6 from the hyperplane? Is it within the margin of the
classifier?
Answer: The value of h at x6, which gives the scaled signed distance from the hyperplane, is

h(x6) = (1.9, 1.9) · (0.846, 0.385)T − 3.50 = 1.6074 + 0.7315 − 3.50 = −1.16

Since this signed value is less than −1, the point is outside the margin.
(c) Classify the point z = (3, 3)T using h(x) from above.


Answer: We have

h(z) = (3, 3) · (0.846, 0.385)T − 3.50 = 2.538 + 1.155 − 3.50 = 0.193

Since h(z) > 0, the predicted class is +1.
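
The computations in parts (a)–(c) can be checked with the short NumPy sketch below (not
part of the original solution); it simply recomputes w, b, and the classifier values from
the αi in Table 21.1.

import numpy as np

# Data from Table 21.1: points, classes y, and Lagrangian multipliers alpha
X = np.array([[4, 2.9], [4, 4], [1, 2.5], [2.5, 1], [4.9, 4.5],
              [1.9, 1.9], [3.5, 4], [0.5, 1.5], [2, 2.1], [4.5, 2.5]])
y = np.array([1, 1, -1, -1, 1, -1, 1, -1, -1, 1])
alpha = np.array([0.414, 0, 0, 0.018, 0, 0, 0.018, 0, 0.414, 0])

# Weight vector w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X
# Offset from a support vector, here x1 (y1 = +1): b = y1 - w^T x1
b = y[0] - w @ X[0]
print(w, b)                            # ~ (0.846, 0.385), b ~ -3.50

h = lambda x: w @ x + b
print(h(X[5]))                         # value at x6, ~ -1.16 (outside the margin)
print(np.sign(h(np.array([3, 3]))))    # classify z = (3, 3): +1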

CHAPTER 22  Classification Assessment

22.5 EXERCISES

Q1. True or False:


(a) A classification model must have 100% accuracy (overall) on the training dataset.

Answer: False
(b) A classification model must have 100% coverage (overall) on the training
dataset.
Answer: True

Q2. Given the training database in Table 22.1a and the testing data in Table 22.1b, answer
the following questions:
(a) Build the complete decision tree using binary splits and Gini index as the
evaluation measure (see Chapter 19).
Answer: The Gini index for the whole database is Gini(D) = 1 − ((4/8)² + (4/8)²) = 1 − 0.5 =
0.5. We evaluate the different split points for X, Y, and Z as follows (CL1 denotes the
count of class 1 in the left partition, Gini(L) the Gini index of the left partition,
Gini(L, R) the weighted Gini of the split point, and Gain = Gini(D) − Gini(L, R)).

Split      CL1  CL2  CR1  CR2  Gini(L)  Gini(R)  Gini(L, R)  Gain
X ≤ 15      1    1    3    3   0.5      0.5      0.5         0
X ≤ 20      1    3    3    1   0.375    0.375    0.375       0.125
X ≤ 25      3    3    1    1   0.5      0.5      0.5         0
X ≤ 30      4    3    0    1   0.49     0        0.429       0.071
Y ≤ 1       1    0    3    4   0        0.49     0.429       0.071
Y ≤ 2       2    2    2    2   0.5      0.5      0.5         0
Y ≤ 3       2    4    2    0   0.44     0        0.33        0.17
Z ∈ {A}     4    0    0    4   0        0        0           0.5


It is clear that the best split point is Z ∈ {A}, since it produces a pure partition
on both sides, and there is no need to split further.
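
A small Python sketch (not part of the original solution) that recomputes the Gini gain for
a few of these split points on the training rows of Table 22.1(a) is given below; the row
layout and names are illustrative.

# Training rows of Table 22.1(a): (X, Y, Z, class)
train = [(15, 1, 'A', 1), (20, 3, 'B', 2), (25, 2, 'A', 1), (30, 4, 'A', 1),
         (35, 2, 'B', 2), (25, 4, 'A', 1), (15, 2, 'B', 2), (20, 3, 'B', 2)]

def gini(counts):
    # Gini index 1 - sum_i p_i^2 for a list of class counts
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1 - sum((c / n) ** 2 for c in counts)

def split_gain(pred):
    # Weighted Gini and gain for a boolean split predicate applied to each row
    left = [r for r in train if pred(r)]
    right = [r for r in train if not pred(r)]
    counts = lambda part: [sum(1 for r in part if r[3] == c) for c in (1, 2)]
    n = len(train)
    weighted = len(left) / n * gini(counts(left)) + len(right) / n * gini(counts(right))
    return weighted, gini(counts(train)) - weighted

for name, pred in [("X <= 15", lambda r: r[0] <= 15),
                   ("X <= 20", lambda r: r[0] <= 20),
                   ("X <= 30", lambda r: r[0] <= 30),
                   ("Y <= 3",  lambda r: r[1] <= 3),
                   ("Z in {A}", lambda r: r[2] == 'A')]:
    print(name, split_gain(pred))      # Z in {A} gives the largest gain, 0.5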
(b) Compute the accuracy of the classifier on the test data. Also show the per class
accuracy and coverage.
Answer: Our decision tree has only a root node, namely Z ∈ {A}, with class 1
if true, and class 2 if false. There are 4 cases misclassified in the testing data, so
the overall error rate is 4/5 = 0.8, i.e., the overall accuracy is 1/5 = 0.2.
For the per-class accuracy and coverage, we need to compute the true positives and
divide them by the size of the predicted class and the true class, respectively.
For class 1, instances 2 and 5 in Table 22.1b belong to class 1, but we predict
instances 1 and 3 to belong to class 1. There is no true positive, since there is no
correct prediction. The accuracy and coverage are both 0.
For class 2, instances 1, 3 and 4 truly belong to class 2, but we predict 2, 4 and 5 as
belonging to class 2. The only true positive is instance 4. Since we predict three
cases as belonging to class 2, the accuracy is 1/3, and since the true class also has
three instances, the coverage is also 1/3.

Table 22.1. Data for Q2

(a) Training
X   Y   Z   Class
15  1   A   1
20  3   B   2
25  2   A   1
30  4   A   1
35  2   B   2
25  4   A   1
15  2   B   2
20  3   B   2

(b) Testing
X   Y   Z   Class
10  2   A   2
20  1   B   1
30  3   A   2
40  2   B   2
15  1   B   1

Q3. Show that for binary classification the majority voting for the combined classifier
decision in boosting can be expressed as

M^K(x) = sign( Σ_{t=1}^{K} αt Mt(x) )

Answer: We are given that there are two classes, i.e., +1 and −1. Without loss of
generality assume that the weighted sum for the positive class is higher than that
for the negative class, i.e., v+1(x) > v−1(x). In this case, the class will be predicted
as +1 using arg max_j {vj(x)}, with v+1(x) being the sum of the αt over the classifiers
that predict class +1.
If Mt(x) = +1, then the product αt Mt(x) equals αt; otherwise it equals −αt. Since the
weighted sum v+1(x) for class +1 is higher, the sum Σt αt Mt(x) = v+1(x) − v−1(x) is
positive, and the sign rule also predicts the class as +1.


[Figure omitted: the labeled points from classes c1 (triangles) and c2 (circles), together
with the six hyperplanes h1, . . . , h6 learned from different bootstrap samples.]
Figure 22.1. For Q4.

A similar argument establishes the equivalence of the two expressions when the
weighted sum for the negative class is higher, in which case both rules predict the
class as −1.
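
For intuition, the following tiny Python sketch (not from the book) checks numerically that
the weighted arg-max vote and the sign of the weighted sum agree for an arbitrary set of
non-negative weights, ignoring the case of exact ties.

import numpy as np

rng = np.random.default_rng(1)
K = 5
alpha = rng.random(K)                   # non-negative classifier weights
preds = rng.choice([-1, 1], size=K)     # base-classifier predictions M_t(x)

# Weighted-vote rule: pick the class with the larger summed weight
v_pos = alpha[preds == 1].sum()
v_neg = alpha[preds == -1].sum()
vote_class = 1 if v_pos > v_neg else -1

# Sign rule: sign of the weighted sum of predictions
sign_class = int(np.sign(np.sum(alpha * preds)))

print(vote_class == sign_class)         # True (barring exact ties)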

Table 22.2. Critical values for t-test


dof 1 2 3 4 5 6
tα/2 12.7065 4.3026 3.1824 2.7764 2.5706 2.4469

Q4. Consider the 2-dimensional dataset shown in Figure 22.1, with the labeled points
belonging to two classes: c1 (triangles) and c2 (circles). Assume that the six
hyperplanes were learned from different bootstrap samples. Find the error rate for
each of the six hyperplanes on the entire dataset. Then, compute the 95% confidence
interval for the expected error rate, using the t-distribution critical values for different
degrees of freedom (dof) given in Table 22.2.

Answer: The error rates of the six hyperplanes on the entire dataset (13 points) are:
ε1 = 0, ε2 = 2/13, ε3 = 1/13, ε4 = 2/13, ε5 = 1/13, and ε6 = 3/13.
The estimated mean error rate is therefore µ̂ = 9/(13 × 6) = 1.5/13 = 0.1154, and
the estimated variance is σ̂² = 1/6 × (4 · (0.5/13)² + 2 · (1.5/13)²) = 0.005424, so that
σ̂ = 0.074.
Using K − 1 = 5 degrees of freedom, we have tα/2 = 2.5706. Since σ̂² was computed with
the 1/K factor, the standard error of the mean is σ̂/√(K − 1) (equivalently, the unbiased
sample standard deviation divided by √K). The 95% confidence interval for the expected
error rate is therefore µ̂ ± tα/2 · σ̂/√(K − 1) = 0.1154 ± (2.5706 · 0.074/2.236) =
0.1154 ± 0.0847 = (0.0307, 0.2001).
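
The following Python sketch (not part of the original solution) reproduces this confidence
interval from the six error rates.

import math

errors = [0, 2/13, 1/13, 2/13, 1/13, 3/13]   # error rates of the six hyperplanes
K = len(errors)

mu = sum(errors) / K                                    # ~0.1154
var = sum((e - mu) ** 2 for e in errors) / K            # biased (1/K) variance, ~0.005424
sigma = math.sqrt(var)                                  # ~0.074

t_crit = 2.5706                  # t_{alpha/2} for 5 dof, from Table 22.2
half_width = t_crit * sigma / math.sqrt(K - 1)          # equals t_crit * s / sqrt(K) with unbiased s
print(mu, sigma, (mu - half_width, mu + half_width))    # ~ (0.031, 0.200)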


Q5. Consider the probabilities P(+1|xi) for the positive class obtained from some classifier,
together with the true class labels yi:

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
yi +1 −1 +1 +1 −1 +1 −1 +1 −1 −1
P (+1|xi ) 0.53 0.86 0.25 0.95 0.87 0.86 0.76 0.94 0.44 0.86

Plot the ROC curve for this classifier.

Answer: We first sort the points in decreasing order of the probabilities, to obtain

x4 x8 x5 x6 x2 x10 x7 x1 x9 x3
yi + + − + − − − + − +
P (+1|xi ) 0.95 0.94 0.87 0.86 0.86 0.86 0.76 0.53 0.44 0.25

The FP and TP for different values of thresholds are as shown below.


threshold 1 0.95 0.94 0.87 0.86 0.76 0.53 0.44 0.25
FP 0 0 0 1 3 4 4 5 5
TP 0 1 2 2 3 3 4 4 5

The FPR is then simply FP/n2 and TPR is TP/n1 where n1 is the number of +
points, and n2 the number of − points.
The ROC curve plots the TPR (y-axis) against the FPR (x-axis) over these thresholds;
with n1 = n2 = 5, it passes through the points (0, 0), (0, 0.2), (0, 0.4), (0.2, 0.4),
(0.6, 0.6), (0.8, 0.6), (0.8, 0.8), (1, 0.8), and (1, 1). [Plot omitted.]
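
A short Python sketch (not part of the original solution) that computes these ROC points
from the probabilities and labels:

# Probabilities P(+1|x_i) and true labels from Q5
probs  = [0.53, 0.86, 0.25, 0.95, 0.87, 0.86, 0.76, 0.94, 0.44, 0.86]
labels = [ +1,   -1,   +1,   +1,   -1,   +1,   -1,   +1,   -1,   -1]

n_pos = labels.count(+1)
n_neg = labels.count(-1)

# Sweep the distinct probability values (plus 1.0) as thresholds, in decreasing order
points = []
for thr in sorted(set(probs) | {1.0}, reverse=True):
    tp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == +1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == -1)
    points.append((fp / n_neg, tp / n_pos))     # (FPR, TPR)

print(points)
# [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.6, 0.6),
#  (0.8, 0.6), (0.8, 0.8), (1.0, 0.8), (1.0, 1.0)]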
