tion p(X̃, Z̃), which is defined as

    p(x̃, z̃) = \sum_{x \in x̃} \sum_{z \in z̃} p(x, z).    (3)

For example, for the joint probability p(X, Z) in Equation (1), suppose that the clustering on X is X̃ = {x̃1 = {x1, x2}, x̃2 = {x3}}, and the clustering on Z is Z̃ = {z̃1 = {z1, z2}, z̃2 = {z3}}. Then,

    p(X̃, Z̃) = \begin{bmatrix} 0.4 & 0.2 \\ 0.2 & 0.2 \end{bmatrix}.    (4)

In this work, we model our self-taught clustering algorithm (STC) as performing co-clustering operations on the target data X and the auxiliary data Y simultaneously, while the two co-clusterings share the same feature clustering Z̃ on the feature set Z. Thus, the objective function can be formulated as

    J = I(X, Z) − I(X̃, Z̃) + λ [ I(Y, Z) − I(Ỹ, Z̃) ].    (5)

In Equation (5), I(X, Z) − I(X̃, Z̃) is computed on the co-clusters of the target data X, while I(Y, Z) − I(Ỹ, Z̃) is computed on the auxiliary data Y. λ is a trade-off parameter that balances the influence between the target data and the auxiliary data, which we will test in our experiments. From Equation (5), we can see that, although the two co-clustering objective functions I(X, Z) − I(X̃, Z̃) and I(Y, Z) − I(Ỹ, Z̃) are optimized separately, they share the same feature clustering Z̃. This is the "bridge" that transfers knowledge between the target and auxiliary data.

Our remaining task is to minimize the value of the objective function in Equation (5) (note that, in this paper, the minimization is for fixed numbers of clusters N, M and K). However, minimizing Equation (5) is not an easy task, since it is non-convex and there are currently no good solutions for optimizing this objective function directly. In the following, we rewrite the objective function in Equation (5) in the form of the Kullback-Leibler divergence (KL divergence; Cover & Thomas, 1991), and minimize the reformulated objective function.

3.2. Optimization for Co-clustering

We first define two new probability distributions p̃(X, Z) and q̃(Y, Z) as follows.

Definition 1 Let p̃(X, Z) denote the joint probability distribution of X and Z with respect to the co-clusters (C_X, C_Z); formally,

    p̃(x, z) = p(x̃, z̃) \frac{p(x)}{p(x̃)} \frac{p(z)}{p(z̃)},    (6)

where x ∈ x̃ and z ∈ z̃. Therefore, with regard to Equations (1) and (4), p̃(X, Z) is given by

    p̃(X, Z) = \begin{bmatrix} 0.089 & 0.178 & 0.133 \\ 0.044 & 0.089 & 0.067 \\ 0.067 & 0.133 & 0.200 \end{bmatrix}.    (7)

Likewise, let q̃(Y, Z) denote the joint probability distribution of Y and Z with respect to the co-clusters (C_Y, C_Z). We have

    q̃(y, z) = q(ỹ, z̃) \frac{q(y)}{q(ỹ)} \frac{q(z)}{q(z̃)},    (8)

where y ∈ ỹ and z ∈ z̃.

Using the probability distributions p̃(X, Z) and q̃(Y, Z) defined above, we can reformulate the objective function in Equation (5) into a form based on the KL divergence (Cover & Thomas, 1991).

Lemma 1 When the clusters C_X, C_Y and C_Z are fixed, the objective function in Equation (5) can be reformulated as

    I(X; Z) − I(X̃; Z̃) + λ [ I(Y; Z) − I(Ỹ; Z̃) ]
        = D(p(X, Z) || p̃(X, Z)) + λ D(q(Y, Z) || q̃(Y, Z)),    (9)

where D(· || ·) denotes the KL divergence between two probability distributions (Cover & Thomas, 1991), D(p || q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.

Proof Based on Lemma 2.1 in (Dhillon et al., 2003), I(X, Z) − I(X̃, Z̃) = D(p(X, Z) || p̃(X, Z)). Similarly, I(Y, Z) − I(Ỹ, Z̃) = D(q(Y, Z) || q̃(Y, Z)). Therefore, Lemma 1 follows straightforwardly.

Lemma 1 converts the loss in mutual information into the KL divergence between the distributions p and p̃, and between q and q̃, respectively. However, the probability distributions in Lemma 1 are joint distributions, and are therefore difficult to optimize. Hence, in Lemma 2, we rewrite the objective function in Lemma 1 in a conditional-probability form. We then show how to optimize the objective function in the new form.

Lemma 2 The KL divergence with respect to the joint probability distributions can be reformulated as

    D(p(X, Z) || p̃(X, Z)) = \sum_{x̃ \in X̃} \sum_{x \in x̃} p(x) D(p(Z|x) || p̃(Z|x̃))    (10)
                          = \sum_{z̃ \in Z̃} \sum_{z \in z̃} p(z) D(p(X|z) || p̃(X|z̃)).    (11)
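To make Equations (3), (6) and Lemma 1 concrete, the following is a small numerical sketch, not taken from the paper's implementation. Since Equation (1) is not reproduced in this excerpt, the element-level joint p(X, Z) below is a hypothetical stand-in chosen to be consistent with Equations (4) and (7); the script aggregates it into p(X̃, Z̃), builds p̃(X, Z) via Equation (6), and checks that the mutual-information loss equals the KL divergence, as Lemma 1 states.

```python
import numpy as np

# Hypothetical stand-in for p(X, Z) of Equation (1) (not shown in this excerpt),
# chosen to be consistent with Equations (4) and (7).
p = np.array([[0.15, 0.15, 0.10],
              [0.05, 0.05, 0.10],
              [0.00, 0.20, 0.20]])
cx = np.array([0, 0, 1])        # x1, x2 -> x~1;  x3 -> x~2
cz = np.array([0, 0, 1])        # z1, z2 -> z~1;  z3 -> z~2

# Equation (3): aggregate p(X, Z) into the cluster-level joint p(X~, Z~).
p_clu = np.zeros((2, 2))
np.add.at(p_clu, (cx[:, None], cz[None, :]), p)
print(p_clu)                    # [[0.4 0.2] [0.2 0.2]], i.e. Equation (4)

# Equation (6): p~(x, z) = p(x~, z~) * p(x)/p(x~) * p(z)/p(z~).
px, pz = p.sum(1), p.sum(0)
px_clu = np.bincount(cx, weights=px)
pz_clu = np.bincount(cz, weights=pz)
p_tilde = (p_clu[cx][:, cz]
           * (px / px_clu[cx])[:, None]
           * (pz / pz_clu[cz])[None, :])
print(np.round(p_tilde, 3))     # matches Equation (7) up to rounding

# Lemma 1: I(X, Z) - I(X~, Z~) equals D(p(X, Z) || p~(X, Z)).
def mi(joint):
    outer = joint.sum(1, keepdims=True) @ joint.sum(0, keepdims=True)
    m = joint > 0
    return float(np.sum(joint[m] * np.log(joint[m] / outer[m])))

kl = float(np.sum(p[p > 0] * np.log(p[p > 0] / p_tilde[p > 0])))
print(round(mi(p) - mi(p_clu), 6), round(kl, 6))   # the two values agree
```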
Using a similar argument on Y and Z, we have

    C_Y(y) = \arg\min_{ỹ \in Ỹ} D(q(Z|y) || q̃(Z|ỹ)),    (15)

and

    C_Z(z) = \arg\min_{z̃ \in Z̃} p(z) D(p(X|z) || p̃(X|z̃)) + λ q(z) D(q(Y|z) || q̃(Y|z̃)).    (16)

Based on Equations (14), (15) and (16), an alternative way to minimize the objective function in Equation (9) is derived, as shown in Algorithm 1.

In Algorithm 1, in each iteration, our self-taught clustering algorithm (STC) minimizes the objective function by choosing the best x̃, ỹ and z̃ for each x, y and z.

Theorem 1 In Algorithm 1, let the value of the objective function J in the t-th iteration be

    J(C_X^{(t)}, C_Y^{(t)}, C_Z^{(t)}) = D(p(X, Z) || p̃^{(t)}(X, Z)) + λ D(q(Y, Z) || q̃^{(t)}(Y, Z)).    (17)

Then,

    J(C_X^{(t)}, C_Y^{(t)}, C_Z^{(t)}) ≥ J(C_X^{(t+1)}, C_Y^{(t+1)}, C_Z^{(t+1)}).    (18)

Proof (Sketch) Since, in each iteration, the clustering functions are updated based on Equations (14), (15) and (16), which locally minimize the values of D(p(X, Z) || p̃(X, Z)) and D(q(Y, Z) || q̃(Y, Z)), the objective function is monotonically non-increasing. Theorem 1 follows as a consequence.
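To make the alternating updates concrete, here is a minimal sketch (not the authors' code) of the feature-cluster update in Equation (16). It assumes the joint distributions p and q are stored as dense NumPy matrices over (instance, feature) pairs and that cx, cy, cz are integer cluster-label arrays; the instance-cluster updates of Equations (14) and (15) follow the same pattern with the roles of rows and columns exchanged.

```python
import numpy as np

def kl(a, b, eps=1e-12):
    """D(a || b) for discrete distributions; eps guards against log(0)."""
    m = a > 0
    return float(np.sum(a[m] * (np.log(a[m]) - np.log(b[m] + eps))))

def cond_given_cluster(joint, row_labels, col_labels, n_rc, n_cc):
    """p~(row | column cluster) implied by Equations (6)/(8):
    p~(x | z~) = p(x~, z~)/p(z~) * p(x)/p(x~).  Returns shape (n_rows, n_cc).
    Assumes no cluster is empty."""
    r, c = joint.sum(1), joint.sum(0)
    block = np.zeros((n_rc, n_cc))
    np.add.at(block, (row_labels[:, None], col_labels[None, :]), joint)
    r_clu = np.bincount(row_labels, weights=r, minlength=n_rc)
    c_clu = np.bincount(col_labels, weights=c, minlength=n_cc)
    return block[row_labels] / c_clu[None, :] * (r / r_clu[row_labels])[:, None]

def update_cz(p, q, cx, cy, cz, n_zc, lam):
    """Feature-cluster update of Equation (16): reassign each feature z to the
    cluster minimizing p(z) D(p(X|z)||p~(X|z~)) + lam q(z) D(q(Y|z)||q~(Y|z~))."""
    px_z = p / p.sum(0, keepdims=True)   # p(X|z); assumes every feature occurs
    qy_z = q / q.sum(0, keepdims=True)   # q(Y|z)
    pt = cond_given_cluster(p, cx, cz, cx.max() + 1, n_zc)
    qt = cond_given_cluster(q, cy, cz, cy.max() + 1, n_zc)
    new_cz = np.empty_like(cz)
    for z in range(p.shape[1]):
        costs = [p[:, z].sum() * kl(px_z[:, z], pt[:, k])
                 + lam * q[:, z].sum() * kl(qy_z[:, z], qt[:, k])
                 for k in range(n_zc)]
        new_cz[z] = int(np.argmin(costs))
    return new_cz
```

In a sketch of Algorithm 1, this update would alternate with the analogous updates of C_X and C_Y until the objective value in Equation (17) stops decreasing.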
Note that, although STC is able to minimize the objective function value in Equation (9), it is only able to find a locally optimal one. Finding the globally optimal solution is NP-hard. The next corollary emphasizes the convergence property of our algorithm STC.

Corollary 1 Algorithm 1 converges in a finite number of iterations.

Proof (Sketch) The convergence of our algorithm STC can be proved straightforwardly based on the monotonic decreasing property in Theorem 1 and the finiteness of the solution space.

3.3. Complexity Analysis

We now analyze the computational cost of our algorithm STC. Suppose that the total number of (x, z) co-occurrences in the target data set X is L1, and the total number of (y, z) co-occurrences in the auxiliary data set Y is L2. In each iteration, updating the target instance clustering C_X takes O(N · L1). Updating the auxiliary instance clustering C_Y takes O(M · L2). Moreover, updating the feature clustering C_Z takes O(K · (L1 + L2)). Since the number of iterations is T, the time complexity of our algorithm is O(T · ((K + N) · L1 + (K + M) · L2)). In the following experiments, it is shown that T = 10 is enough for convergence. Usually, the numbers of clusters N, M and K can be considered constants, so that the time complexity of STC is O(L1 + L2).

Considering space complexity, our algorithm needs to store all the (x, z) and (y, z) co-occurrences and their corresponding probabilities. Thus, the space complexity is O(L1 + L2). This indicates that both the time complexity and the space complexity of our algorithm are linear in the input size. We conclude that the algorithm scales well.

4. Experiments

In this section, we evaluate our self-taught clustering algorithm STC on image clustering tasks, and show the effectiveness of STC.

4.1. Data Sets

We conduct our experiments on eight clustering tasks generated from the Caltech-256 image corpus (Griffin et al., 2007). There are a total of 256 categories in the Caltech-256 data set, from which we randomly chose 20 categories. For each category, 70 images are randomly selected to form our clustering tasks. Six binary clustering tasks, one 3-way clustering task, and one 5-way clustering task were generated using these 20 categories, as shown in Table 1. The first column in Table 1 presents the categories with respect to the target unlabeled data. For each clustering task, we used the data from the corresponding categories as target unlabeled data, and the data from the remaining categories as the auxiliary unlabeled data.

For data preprocessing, we used the "bag-of-words" method (Li & Perona, 2005) to represent images in our experiments. Interest points in images are found and described by the SIFT descriptor (Lowe, 2004). Then, we clustered all the interest points to obtain the codebook, and set the number of clusters to 800. Using this codebook, each image can be represented as a vector in the subsequent learning processes.

4.2. Evaluation Criteria

In these experiments, we used entropy to measure the quality of the clustering results, which reveals the purity of clusters. Specifically, the entropy for a cluster x̃ is defined as H(x̃) = −Σ_{c∈C} p(c|x̃) log₂ p(c|x̃), where c represents a category label in the evaluation corpus, and p(c|x̃) is defined as p(c|x̃) = |{x | ℓ(x) = c ∧ x ∈ x̃}| / |x̃|, where ℓ(x) denotes the true label of x in the evaluation corpus. The total entropy for the whole clustering is defined as the weighted sum of the entropies of all the clusters; formally, H(X̃) = Σ_{x̃∈X̃} (|x̃|/n) H(x̃). The quality of a clustering X̃ is evaluated using the entropy H(X̃).

4.3. Empirical Analysis

We compared our algorithm STC to several state-of-the-art clustering methods as baselines. For each baseline method considered below, we have two different options: one is to apply the baseline method to the target data only, which we refer to as separate, and the other is to apply it to the combined data consisting of the target data and the auxiliary data, which we refer to as combined. The first baseline method is a traditional 1D-clustering solution, CLUTO (Zhao & Karypis, 2002), using its default parameters. The second baseline method is clustering the target data under a new feature representation that is first constructed through feature clustering (on the target or the combined data set); this baseline is designed to evaluate the effectiveness of the co-clustering based method as opposed to naively constructing a new data representation for clustering. We refer to this class of baseline methods as Feature Clustering. The third baseline method is an information-theoretic co-clustering method applied to the target (or the combined) data set (Dhillon et al., 2003), which we refer to as Co-clustering.
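Before turning to the results in Table 1, here is a rough sketch of the bag-of-words preprocessing described in Section 4.1: SIFT descriptors are extracted from every image, clustered into an 800-word codebook, and each image is then mapped to a histogram over the visual words. The sketch assumes OpenCV for SIFT and scikit-learn for k-means; the paper does not specify its tooling, so these library choices are ours.

```python
import cv2                      # assumption: OpenCV supplies the SIFT implementation
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(images, n_words=800):
    """Cluster SIFT descriptors from all images into an n_words codebook."""
    sift = cv2.SIFT_create()
    descs = []
    for img in images:                        # images: grayscale uint8 arrays
        _, d = sift.detectAndCompute(img, None)
        if d is not None:
            descs.append(d)
    codebook = KMeans(n_clusters=n_words, n_init=1).fit(np.vstack(descs))
    return codebook, sift

def bow_vector(img, codebook, sift, n_words=800):
    """Represent one image as a normalized histogram over the visual words.
    Assumes the image yields at least one descriptor."""
    _, d = sift.detectAndCompute(img, None)
    hist = np.bincount(codebook.predict(d), minlength=n_words)
    return hist / max(hist.sum(), 1)
```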
Table 1. Performance in terms of entropy for each data set and evaluation method.

                                              CLUTO                Feature Clustering    Co-clustering
Data Set                                      separate  combined   separate  combined    separate  combined    STC
eyeglass vs sheet-music                       0.527     0.966      0.669     0.669       0.630     0.986       0.187
airplane vs ostrich                           0.352     0.696      0.512     0.479       0.426     0.753       0.252
fern vs starfish                              0.865     0.988      0.588     0.953       0.741     0.968       0.575
guitar vs laptop                              0.923     0.965      0.999     0.970       0.925     1.000       0.569
hibiscus vs ketch                             0.371     0.446      0.659     0.649       0.399     0.793       0.252
cake vs harp                                  0.882     0.879      0.998     0.911       0.860     0.996       0.772
car-side, tire, frog                          1.337     1.385      1.362     1.413       1.316     1.275       1.000
cd, comet, vcr, diamond-ring, skyscraper      1.663     1.827      1.755     1.751       1.715     1.772       1.274
Average                                       0.865     1.019      0.943     0.974       0.877     1.068       0.610
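The entropy values in Table 1 follow the criterion defined in Section 4.2. A minimal sketch of that evaluation, applied to a hypothetical toy labeling, is:

```python
import numpy as np

def clustering_entropy(true_labels, cluster_assignments):
    """H(X~) from Section 4.2: the size-weighted sum of per-cluster label
    entropies H(x~) = -sum_c p(c|x~) log2 p(c|x~); lower means purer clusters."""
    labels = np.asarray(true_labels)
    clusters = np.asarray(cluster_assignments)
    n, total = len(labels), 0.0
    for xt in np.unique(clusters):
        members = labels[clusters == xt]
        _, counts = np.unique(members, return_counts=True)
        probs = counts / len(members)
        total += len(members) / n * -np.sum(probs * np.log2(probs))
    return total

# Hypothetical toy example: one pure cluster and one mixed cluster.
print(clustering_entropy(["cake", "cake", "harp", "harp"], [0, 0, 1, 0]))
```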
Figure 2. The entropy curves as a function of different number of feature clusters.

Figure 3. The entropy curves as a function of different trade-off parameter λ.

Figure 4. The entropy curves as a function of different number of iterations.
This baseline is designed to test the effectiveness of our special co-clustering model for self-taught clustering.

Table 1 presents the clustering performance in entropy for each data set and each evaluation method. From this table, we can see that Feature Clustering and Co-clustering perform somewhat worse than CLUTO. This is a little different from the results shown in the previous literature, such as (Dhillon et al., 2003). In our opinion, this is because our self-taught clustering problem focuses on a different situation from the previous ones; that is, the target data are insufficient for traditional clustering algorithms. In our experiments, there are only 70 instances in each category, which is too few to build a good feature clustering partition. Therefore, the performance of Feature Clustering and Co-clustering declines. Moreover, the performance with respect to combined is worse than that with respect to separate in general. We believe this is because the target data and the auxiliary data are more or less independent of each other, so the topics in the combined data set may be biased towards the auxiliary data and thus harm the clustering performance on the target data.

In general, our algorithm STC greatly outperforms the three baseline methods. We observe that the reason for the outstanding performance of STC is that the co-clustering part of STC makes the feature clustering result consistent with the clustering results on both the target data and the auxiliary data. Therefore, using this feature clustering as the new data representation, the clustering performance on the target data is improved.

In our STC algorithm, it is assumed that the number of feature clusters K is already known. However, in reality, this number should be carefully tuned. In these experiments, we tuned this parameter empirically. Figure 2 presents the entropy curves with respect to different numbers of feature clusters given by CLUTO, Feature Clustering, Co-clustering and STC, respectively. The entropy in Figure 2 is the average over the 6 binary image clustering tasks. Note that the curve given by CLUTO never changes, since CLUTO does not incorporate feature clustering. From this figure, we can see that Feature Clustering and Co-clustering perform somewhat unstably as the number of feature clusters increases. We believe the reason is that there are too few instances in each clustering task, which makes the traditional clustering results unreliable. Our algorithm STC incorporates a large amount of auxiliary unlabeled data, so its variance is much smaller than that of the traditional clustering algorithms. STC performs increasingly better in general as the number of feature clusters increases, until the number of feature clusters reaches 32. When the number of feature clusters is greater than 32, the performance of STC becomes insensitive to the number of feature clusters. We believe a number of feature clusters no less than 32 will be sufficient to make STC perform well. In these experiments, we set the number of feature clusters
Basu, S., Banerjee, A., & Mooney, R. J. (2002). Semi-supervised clustering by seeding. Proceedings of the Nineteenth International Conference on Machine Learning (pp. 27–34).

Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 59–68).

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41–75.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley-Interscience.

Daumé III, H., & Marcu, D. (2005). A Bayesian model for supervised clustering with the Dirichlet process prior. Journal of Machine Learning Research, 6, 1551–1577.

Davidson, I., & Ravi, S. S. (2007). Intractability and clustering with constraints. Proceedings of the Twenty-fourth International Conference on Machine Learning (pp. 201–208).

Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 89–98).

Finley, T., & Joachims, T. (2005). Supervised clustering with support vector machines. Proceedings of the Twenty-second International Conference on Machine Learning (pp. 217–224).

Nelson, B., & Cohen, I. (2007). Revisiting probabilistic models for clustering with pair-wise constraints. Proceedings of the Twenty-fourth International Conference on Machine Learning (pp. 673–680).

Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: transfer learning from unlabeled data. Proceedings of the Twenty-fourth International Conference on Machine Learning (pp. 759–766).

Raina, R., Ng, A. Y., & Koller, D. (2006). Constructing informative priors using transfer learning. Proceedings of the Twenty-third International Conference on Machine Learning (pp. 713–720).

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 1566–1581.

Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001). Constrained k-means clustering with background knowledge. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 577–584).

Wu, P., & Dietterich, T. G. (2004). Improving SVM accuracy by training on auxiliary data sources. Proceedings of the Twenty-first International Conference on Machine Learning (pp. 110–117).

Zhao, Y., & Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for document datasets. Proceedings of the Eleventh International Conference on Information and Knowledge Management (pp. 515–524).