
Self-taught Clustering

Wenyuan Dai† [email protected]


Qiang Yang‡ [email protected]
Gui-Rong Xue† [email protected]
Yong Yu† [email protected]
(†) Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
(‡) Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong

Abstract

This paper focuses on a new clustering task, called self-taught clustering. Self-taught clustering is an instance of unsupervised transfer learning, which aims at clustering a small collection of target unlabeled data with the help of a large amount of auxiliary unlabeled data. The target and auxiliary data can differ in topic distribution. We show that even when the target data are not sufficient to allow effective learning of a high quality feature representation, it is possible to learn useful features with the help of the auxiliary data, on which the target data can then be clustered effectively. We propose a co-clustering based self-taught clustering algorithm that tackles this problem by clustering the target and auxiliary data simultaneously, allowing the feature representation from the auxiliary data to influence the target data through a common set of features. Under the new data representation, clustering on the target data can be improved. Our experiments on image clustering show that our algorithm can greatly outperform several state-of-the-art clustering methods when utilizing irrelevant unlabeled auxiliary data.

1. Introduction

Clustering (Jain & Dubes, 1988) aims at partitioning objects into groups, so that objects in the same group are relatively similar, while objects in different groups are relatively dissimilar. Clustering has a long history in machine learning (MacQueen, 1967), and recent work has focused on improving clustering performance using prior knowledge, as in semi-supervised clustering (Wagstaff et al., 2001) and supervised clustering (Finley & Joachims, 2005).

Semi-supervised clustering incorporates pairwise supervision, such as must-link or cannot-link constraints (Wagstaff et al., 2001), to bias the clustering results. Supervised clustering methods learn distance functions from a small sample of auxiliary labeled data (Finley & Joachims, 2005). Different from these clustering problems, in this paper we address a new clustering task in which a large amount of auxiliary unlabeled data is used to enhance the clustering of a small amount of target unlabeled data. In our problem, we have neither labeled data nor pairwise constraint knowledge. All we have are the auxiliary data, which are entirely unlabeled and may be irrelevant to the target data. Our target data consist of a collection of unlabeled data that may be insufficient for learning a good feature representation, so applying clustering directly to these target data may give very poor performance. However, with the help of the auxiliary data, we are able to uncover a good feature set that enables high quality clustering on the target data.

Our problem can be considered an instance of transfer learning, which makes use of knowledge gained from one learning task to improve the performance of another, even when these learning tasks or domains follow different distributions (Caruana, 1997). Since all the data are unlabeled, it is an instance of unsupervised transfer learning (Teh et al., 2006). This unsupervised transfer learning problem can also be viewed as a clustering version of self-taught learning (Raina et al., 2007), which uses irrelevant unlabeled data to help supervised learning. Thus, we refer to our problem as self-taught clustering (or STC for short).

(Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).)
Figure 1. Example of common features among different types of objects, using images as the instance: (a) diamond, (b) platinum, (c) ring, (d) titanium.

To tackle the problem, we observe that the performance of clustering relies heavily on the data representation when the objective function and the distance measure are fixed. Therefore, one way to improve clustering performance is to seek a better data representation. We observe that different objects may share some common or relevant features. For example, in Figure 1, diamond and ring share quite a lot of features about "diamond"; ring and platinum share quite a lot of features about "platinum"; moreover, platinum and titanium share quite a lot of features about "metal". In this situation, the auxiliary data can be used to help uncover a better data representation that benefits the target data set. Our approach to this problem uses co-clustering (Dhillon et al., 2003), so that commonality can be found in the parts of the feature space that correspond to similar semantic meanings.

In our solution to the self-taught clustering problem, two clustering operations, on the target data and on the auxiliary data, are performed together through co-clustering. We extend the information theoretic co-clustering algorithm (Dhillon et al., 2003), which minimizes the loss in mutual information before and after co-clustering. An iterative algorithm is proposed to monotonically reduce the objective function. The experimental results show that our algorithm can greatly improve clustering performance by effectively using auxiliary unlabeled data, as compared to several other state-of-the-art clustering algorithms.

2. Problem Formulation

For clarity, we first define the self-taught clustering task. Let X and Y be two discrete random variables, taking values from the two value sets {x1, ..., xn} and {y1, ..., ym}, respectively. X and Y correspond to the target and auxiliary data. Let Z be a discrete random variable, taking values from the value set {z1, ..., zk}, which corresponds to the common feature space of both target and auxiliary data.

Let p(X, Z) be the joint probability distribution with respect to X and Z, and q(Y, Z) be the joint probability distribution with respect to Y and Z. In general, p(X, Z) and q(Y, Z) can be considered as n × k and m × k matrices, respectively, which can be estimated from data observations. For example, consider the case that x1 = {z1, z3}, x2 = {z2}, and x3 = {z2, z3}. Then, the joint probability distribution p(X, Z) can be estimated as

    p(X, Z) = [ 0.2  0.0  0.2
                0.0  0.2  0.0
                0.0  0.2  0.2 ].                                    (1)

We wish to cluster X into N partitions X̃ = {x̃1, ..., x̃N} and Y into M clusters Ỹ = {ỹ1, ..., ỹM}. Furthermore, Z can be clustered into K feature clusters Z̃ = {z̃1, ..., z̃K}. We use CX : X ↦ X̃, CY : Y ↦ Ỹ and CZ : Z ↦ Z̃ to denote three clustering functions, which map variables in the three value sets to their corresponding clusters. For brevity, in the following, we will use X̃, Ỹ and Z̃ to denote CX(X), CY(Y) and CZ(Z), respectively.

Our objective is to find a good clustering function CX for the target data, with the help of the clusters CY on the auxiliary data and CZ on the common feature space.
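As a concrete illustration of how such a joint distribution can be obtained, the following sketch (ours, not from the paper; function and variable names are our own) normalizes an instance-by-feature co-occurrence matrix into the joint probability table of Equation (1), using the toy example above.

```python
import numpy as np

def joint_distribution(cooccurrence):
    """Normalize an instance-by-feature co-occurrence matrix into a joint
    probability table p(X, Z), as in Equation (1)."""
    counts = np.asarray(cooccurrence, dtype=float)
    return counts / counts.sum()

# Toy data from the paper's example: x1 = {z1, z3}, x2 = {z2}, x3 = {z2, z3}.
counts_xz = [[1, 0, 1],
             [0, 1, 0],
             [0, 1, 1]]
p_xz = joint_distribution(counts_xz)
print(p_xz)   # [[0.2 0.  0.2] [0.  0.2 0. ] [0.  0.2 0.2]] -- matches Equation (1)
```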
3. The Self-taught Clustering Algorithm

In this section, we present our co-clustering based self-taught clustering (STC) algorithm, and then discuss its theoretical properties based on information theory.

3.1. Objective Function for Self-taught Clustering

We extend information theoretic co-clustering (Dhillon et al., 2003) to model our self-taught clustering algorithm. In information theoretic co-clustering, the objective function is defined as minimizing the loss in mutual information between instances and features, before and after co-clustering. Formally, using the target data X and their feature space Z for illustration, the objective function can be expressed as

    I(X; Z) − I(X̃; Z̃),                                             (2)

where I(·; ·) denotes the mutual information between two random variables (Cover & Thomas, 1991), I(X; Z) = Σ_{x∈X} Σ_{z∈Z} p(x, z) log [p(x, z) / (p(x) p(z))]. Moreover, I(X̃; Z̃) corresponds to the joint probability distribution p(X̃, Z̃), which is defined as

    p(x̃, z̃) = Σ_{x∈x̃} Σ_{z∈z̃} p(x, z).                            (3)

For example, for the joint probability p(X, Z) in Equation (1), suppose that the clustering on X is X̃ = {x̃1 = {x1, x2}, x̃2 = {x3}}, and the clustering on Z is Z̃ = {z̃1 = {z1, z2}, z̃2 = {z3}}. Then,

    p(X̃, Z̃) = [ 0.4  0.2
                0.2  0.2 ].                                         (4)

In this work, we model our self-taught clustering algorithm (STC) as performing co-clustering operations on the target data X and the auxiliary data Y simultaneously, while the two co-clusterings share the same feature clustering Z̃ on the feature set Z. Thus, the objective function can be formulated as

    J = I(X; Z) − I(X̃; Z̃) + λ [ I(Y; Z) − I(Ỹ; Z̃) ].              (5)

In Equation (5), I(X; Z) − I(X̃; Z̃) is computed on the co-clusters of the target data X, while I(Y; Z) − I(Ỹ; Z̃) is computed on the auxiliary data Y. λ is a trade-off parameter that balances the influence between the target data and the auxiliary data, which we will test in our experiments. From Equation (5), we can see that, although the two co-clustering objective functions I(X; Z) − I(X̃; Z̃) and I(Y; Z) − I(Ỹ; Z̃) are performed separately, they share the same feature clustering Z̃. This is the "bridge" that transfers knowledge between the target and auxiliary data.

Our remaining task is to minimize the value of the objective function in Equation (5). (In this paper, the minimization is carried out for fixed numbers of clusters N, M and K.) However, minimizing Equation (5) is not an easy task, since it is non-convex and there is currently no good way to optimize this objective function directly. In the following, we rewrite the objective function in Equation (5) in the form of a Kullback-Leibler divergence (KL divergence) (Cover & Thomas, 1991), and minimize the reformulated objective function.

3.2. Optimization for Co-clustering

We first define two new probability distributions p̃(X, Z) and q̃(Y, Z) as follows.

Definition 1 Let p̃(X, Z) denote the joint probability distribution of X and Z with respect to the co-clusters (CX, CZ); formally,

    p̃(x, z) = p(x̃, z̃) [p(x) / p(x̃)] [p(z) / p(z̃)],                (6)

where x ∈ x̃ and z ∈ z̃. Therefore, with regard to Equations (1) and (4), p̃(X, Z) is given by

    p̃(X, Z) = [ 0.089  0.178  0.133
                0.044  0.089  0.067
                0.067  0.133  0.200 ].                              (7)

Likewise, let q̃(Y, Z) denote the joint probability distribution of Y and Z with respect to the co-clusters (CY, CZ). We have

    q̃(y, z) = q(ỹ, z̃) [q(y) / q(ỹ)] [q(z) / q(z̃)],                (8)

where y ∈ ỹ and z ∈ z̃.
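The co-cluster induced distribution in Definition 1 is straightforward to compute from a joint table and two cluster assignments. The sketch below is our own illustration (names are not from the paper); it reproduces the numbers of Equation (7) from Equations (1) and (4).

```python
import numpy as np

def coclustered_joint(p, row_clusters, col_clusters):
    """Compute p~(X, Z) from Definition 1 / Equation (6):
    p~(x, z) = p(x~, z~) * p(x)/p(x~) * p(z)/p(z~)."""
    p = np.asarray(p, dtype=float)
    row_clusters = np.asarray(row_clusters)
    col_clusters = np.asarray(col_clusters)
    p_x = p.sum(axis=1)                        # marginal p(x)
    p_z = p.sum(axis=0)                        # marginal p(z)
    n_rc, n_cc = row_clusters.max() + 1, col_clusters.max() + 1
    # p(x~, z~): aggregate the joint table over the co-clusters, Equation (3).
    p_cluster = np.zeros((n_rc, n_cc))
    for i, ci in enumerate(row_clusters):
        for j, cj in enumerate(col_clusters):
            p_cluster[ci, cj] += p[i, j]
    p_xc = p_cluster.sum(axis=1)               # marginal p(x~)
    p_zc = p_cluster.sum(axis=0)               # marginal p(z~)
    p_tilde = np.empty_like(p)
    for i, ci in enumerate(row_clusters):
        for j, cj in enumerate(col_clusters):
            p_tilde[i, j] = (p_cluster[ci, cj]
                             * p_x[i] / p_xc[ci]
                             * p_z[j] / p_zc[cj])
    return p_tilde

# Joint distribution from Equation (1), clustered as in Equation (4):
# X~ = {{x1, x2}, {x3}}, Z~ = {{z1, z2}, {z3}}.
p_xz = np.array([[0.2, 0.0, 0.2],
                 [0.0, 0.2, 0.0],
                 [0.0, 0.2, 0.2]])
print(coclustered_joint(p_xz, [0, 0, 1], [0, 0, 1]).round(3))
# [[0.089 0.178 0.133]
#  [0.044 0.089 0.067]
#  [0.067 0.133 0.2  ]]  -- matches Equation (7)
```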
Using the probability distributions p̃(X, Z) and q̃(Y, Z) defined above, we can reformulate the objective function in Equation (5) in a form based on the KL divergence (Cover & Thomas, 1991).

Lemma 1 When the clusterings CX, CY and CZ are fixed, the objective function in Equation (5) can be reformulated as

    I(X; Z) − I(X̃; Z̃) + λ [ I(Y; Z) − I(Ỹ; Z̃) ]
        = D(p(X, Z) || p̃(X, Z)) + λ D(q(Y, Z) || q̃(Y, Z)),         (9)

where D(·||·) denotes the KL divergence between two probability distributions (Cover & Thomas, 1991), D(p||q) = Σ_x p(x) log [p(x) / q(x)].

Proof Based on Lemma 2.1 in (Dhillon et al., 2003), I(X; Z) − I(X̃; Z̃) = D(p(X, Z) || p̃(X, Z)). Similarly, I(Y; Z) − I(Ỹ; Z̃) = D(q(Y, Z) || q̃(Y, Z)). Therefore, Lemma 1 follows straightforwardly.

Lemma 1 converts the loss in mutual information into the KL divergence between the distributions p and p̃, and between q and q̃, respectively. However, the probability distributions in Lemma 1 are joint distributions, which are difficult to optimize directly. Hence, in Lemma 2 we rewrite the objective function of Lemma 1 in a conditional probability form, and then show how to optimize the objective function in the new form.

Lemma 2 The KL divergences with respect to the joint probability distributions can be reformulated as

    D(p(X, Z) || p̃(X, Z)) = Σ_{x̃∈X̃} Σ_{x∈x̃} p(x) D(p(Z|x) || p̃(Z|x̃))      (10)
                          = Σ_{z̃∈Z̃} Σ_{z∈z̃} p(z) D(p(X|z) || p̃(X|z̃)).      (11)

Similarly,

    D(q(Y, Z) || q̃(Y, Z)) = Σ_{ỹ∈Ỹ} Σ_{y∈ỹ} q(y) D(q(Z|y) || q̃(Z|ỹ))      (12)
                          = Σ_{z̃∈Z̃} Σ_{z∈z̃} q(z) D(q(Y|z) || q̃(Y|z̃)).      (13)

Proof We only give the proof of Equation (10); Equations (11), (12) and (13) can be derived by identical arguments.

    D(p(X, Z) || p̃(X, Z)) = Σ_{x̃∈X̃} Σ_{z̃∈Z̃} Σ_{x∈x̃} Σ_{z∈z̃} p(x, z) log [p(x, z) / p̃(x, z)].

Since p̃(x, z) = p(x) [p(x̃, z̃) / p(x̃)] [p(z) / p(z̃)] = p(x) p̃(z|x̃), we have

    D(p(X, Z) || p̃(X, Z)) = Σ_{x̃∈X̃} Σ_{z̃∈Z̃} Σ_{x∈x̃} Σ_{z∈z̃} p(x) p(z|x) log [p(x) p(z|x) / (p(x) p̃(z|x̃))]
                          = Σ_{x̃∈X̃} Σ_{x∈x̃} p(x) Σ_{z̃∈Z̃} Σ_{z∈z̃} p(z|x) log [p(z|x) / p̃(z|x̃)]
                          = Σ_{x̃∈X̃} Σ_{x∈x̃} p(x) D(p(Z|x) || p̃(Z|x̃)).

From Lemma 2 and Equation (10), we can see that minimizing D(p(Z|x) || p̃(Z|x̃)) for a single x reduces the value of D(p(X, Z) || p̃(X, Z)), and thus decreases the global objective function in Equation (9). Therefore, if we iteratively choose the best cluster x̃ for each x so as to minimize D(p(Z|x) || p̃(Z|x̃)), the objective function is minimized monotonically. Formally,

    CX(x) = arg min_{x̃∈X̃} D(p(Z|x) || p̃(Z|x̃)).                    (14)

Using a similar argument on Y and Z, we have

    CY(y) = arg min_{ỹ∈Ỹ} D(q(Z|y) || q̃(Z|ỹ)),                     (15)

and

    CZ(z) = arg min_{z̃∈Z̃} [ p(z) D(p(X|z) || p̃(X|z̃)) + λ q(z) D(q(Y|z) || q̃(Y|z̃)) ].   (16)

Based on Equations (14), (15) and (16), an alternative way to minimize the objective function in Equation (9) is derived, as shown in Algorithm 1.

Algorithm 1 The Self-taught Clustering Algorithm: STC
Input: A target unlabeled data set X; an auxiliary unlabeled data set Y; the feature space Z shared by both X and Y; the initial clustering functions CX^(0), CY^(0) and CZ^(0); the number of iterations T.
Output: The final clustering function CX^(T) on the target data X.
Procedure STC
 1: Initialize p(X, Z) and q(Y, Z) based on the data observations on X, Y and Z.
 2: Initialize p̃^(0)(X, Z) based on p(X, Z), CX^(0), CZ^(0), and Equation (6).
 3: Initialize q̃^(0)(Y, Z) based on q(Y, Z), CY^(0), CZ^(0), and Equation (8).
 4: for t ← 1, ..., T do
 5:   Update CX^(t)(X) based on p, p̃^(t−1), and Equation (14).
 6:   Update CY^(t)(Y) based on q, q̃^(t−1), and Equation (15).
 7:   Update CZ^(t)(Z) based on p, q, p̃^(t−1), q̃^(t−1), and Equation (16).
 8:   Update p̃^(t) based on p(X, Z), CX^(t), CZ^(t), and Equation (6).
 9:   Update q̃^(t) based on q(Y, Z), CY^(t), CZ^(t), and Equation (8).
10: end for
11: Return CX^(T) as the final clustering function on the target data X.

In Algorithm 1, in each iteration, our self-taught clustering algorithm (STC) minimizes the objective function by choosing the best x̃, ỹ and z̃ for each x, y and z based on Equations (14), (15) and (16). As discussed above, this reduces the value of the global objective function in Equation (9). In the following theorem, we show the monotonically decreasing property of the objective function under the STC algorithm.

Theorem 1 In Algorithm 1, let the value of the objective function J in the t-th iteration be

    J(CX^(t), CY^(t), CZ^(t)) = D(p(X, Z) || p̃^(t)(X, Z)) + λ D(q(Y, Z) || q̃^(t)(Y, Z)).   (17)

Then,

    J(CX^(t), CY^(t), CZ^(t)) ≥ J(CX^(t+1), CY^(t+1), CZ^(t+1)).    (18)

Proof (Sketch) Since in each iteration the clustering functions are updated based on Equations (14), (15) and (16), which locally minimize the values of D(p(X, Z) || p̃(X, Z)) and D(q(Y, Z) || q̃(Y, Z)), the objective function is monotonically non-increasing. Theorem 1 follows as a consequence.
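To make the update rules concrete, here is a compact sketch of the iterative procedure of Algorithm 1, written by us under some simplifying assumptions: p and q are dense NumPy arrays holding the joint distributions (instance × feature), and a small smoothing constant EPS is added to avoid log(0) and division by zero. It illustrates Equations (14)-(16); it is not the authors' original implementation. For the experiments in Section 4, p and q would come from normalized bag-of-visual-words counts.

```python
import numpy as np

EPS = 1e-12  # smoothing constant (our assumption, not part of the paper)

def cluster_joint(p, rows, cols, n_rc, n_cc):
    """Aggregate a joint table over row/column clusters, as in Equation (3)."""
    pc = np.zeros((n_rc, n_cc))
    np.add.at(pc, (rows[:, None], cols[None, :]), p)
    return pc

def cond_given_row_cluster(p, rows, cols, n_rc, n_cc):
    """Row-cluster conditionals, e.g. p~(z | x~) = p(x~, z~)/p(x~) * p(z)/p(z~)."""
    pc = cluster_joint(p, rows, cols, n_rc, n_cc)            # p(x~, z~)
    p_col = p.sum(axis=0)                                    # p(z)
    p_cc = pc.sum(axis=0)                                    # p(z~)
    p_rc = pc.sum(axis=1)                                    # p(x~)
    return (pc[:, cols] / (p_rc[:, None] + EPS)) * (p_col / (p_cc[cols] + EPS))

def kl_rows(cond, ref):
    """KL(cond_i || ref_j) for every row i of `cond` against every row j of `ref`."""
    log_ratio = np.log((cond[:, None, :] + EPS) / (ref[None, :, :] + EPS))
    return (cond[:, None, :] * log_ratio).sum(axis=2)

def stc(p, q, cx, cy, cz, n_xc, n_yc, n_zc, lam=1.0, iters=10):
    """A sketch of Algorithm 1: alternate the updates of Equations (14)-(16)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    cx, cy, cz = np.asarray(cx), np.asarray(cy), np.asarray(cz)
    for _ in range(iters):
        # Conditionals induced by p~(t-1) and q~(t-1) (steps 8-9 of Algorithm 1).
        ref_zx = cond_given_row_cluster(p,   cx, cz, n_xc, n_zc)   # p~(Z | x~)
        ref_zy = cond_given_row_cluster(q,   cy, cz, n_yc, n_zc)   # q~(Z | y~)
        ref_xz = cond_given_row_cluster(p.T, cz, cx, n_zc, n_xc)   # p~(X | z~)
        ref_yz = cond_given_row_cluster(q.T, cz, cy, n_zc, n_yc)   # q~(Y | z~)
        # Equation (14): reassign each target instance x.
        cx = kl_rows(p / (p.sum(axis=1, keepdims=True) + EPS), ref_zx).argmin(axis=1)
        # Equation (15): reassign each auxiliary instance y.
        cy = kl_rows(q / (q.sum(axis=1, keepdims=True) + EPS), ref_zy).argmin(axis=1)
        # Equation (16): reassign each feature z using both data sets.
        p_z, q_z = p.sum(axis=0), q.sum(axis=0)
        cost = (p_z[:, None] * kl_rows(p.T / (p_z[:, None] + EPS), ref_xz)
                + lam * q_z[:, None] * kl_rows(q.T / (q_z[:, None] + EPS), ref_yz))
        cz = cost.argmin(axis=1)
    return cx, cy, cz

# Toy usage: 6 target instances, 40 auxiliary instances, 8 shared features.
rng = np.random.default_rng(0)
p = rng.random((6, 8));  p /= p.sum()      # joint p(X, Z)
q = rng.random((40, 8)); q /= q.sum()      # joint q(Y, Z)
cx0, cy0, cz0 = rng.integers(0, 2, 6), rng.integers(0, 4, 40), rng.integers(0, 3, 8)
print(stc(p, q, cx0, cy0, cz0, n_xc=2, n_yc=4, n_zc=3, lam=1.0, iters=10))
```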
Note that, although STC minimizes the objective function value in Equation (9), it is only able to find a locally optimal solution; finding the globally optimal solution is NP-hard. The next corollary states the convergence property of our algorithm STC.

Corollary 1 Algorithm 1 converges in a finite number of iterations.

Proof (Sketch) The convergence of our algorithm STC follows straightforwardly from the monotonic decreasing property in Theorem 1 and the finiteness of the solution space.

3.3. Complexity Analysis

We now analyze the computational cost of our algorithm STC. Suppose that the total number of (x, z) co-occurrences in the target data set X is L1, and the total number of (y, z) co-occurrences in the auxiliary data set Y is L2. In each iteration, updating the target instance clustering CX takes O(N · L1), updating the auxiliary instance clustering CY takes O(M · L2), and updating the feature clustering CZ takes O(K · (L1 + L2)). Since the number of iterations is T, the time complexity of our algorithm is O(T · ((K + N) · L1 + (K + M) · L2)). In the following experiments, it is shown that T = 10 is enough for convergence. Usually, the numbers of clusters N, M and K can be considered constants, so that the time complexity of STC is O(L1 + L2).

Considering space complexity, our algorithm needs to store all the (x, z) and (y, z) co-occurrences and their corresponding probabilities. Thus, the space complexity is O(L1 + L2). This indicates that both the time complexity and the space complexity of our algorithm are linear in the input size, so the algorithm scales well.

4. Experiments

In this section, we evaluate our self-taught clustering algorithm STC on image clustering tasks and show its effectiveness.

4.1. Data Sets

We conduct our experiments on eight clustering tasks generated from the Caltech-256 image corpus (Griffin et al., 2007). There are a total of 256 categories in the Caltech-256 data set, from which we randomly chose 20 categories. For each category, 70 images are randomly selected to form our clustering tasks. Six binary clustering tasks, one 3-way clustering task, and one 5-way clustering task were generated using these 20 categories, as shown in Table 1. The first column in Table 1 presents the categories with respect to the target unlabeled data. For each clustering task, we used the data from the corresponding categories as target unlabeled data, and the data from the remaining categories as the auxiliary unlabeled data.

For data preprocessing, we used the "bag-of-words" method (Li & Perona, 2005) to represent the images in our experiments. Interest points in the images are detected and described by the SIFT descriptor (Lowe, 2004). Then, we clustered all the interest points to obtain a codebook, setting the number of clusters to 800. Using this codebook, each image can be represented as a vector in the subsequent learning processes.

4.2. Evaluation Criteria

In these experiments, we used entropy to measure the quality of the clustering results, which reveals the purity of the clusters. Specifically, the entropy of a cluster x̃ is defined as H(x̃) = −Σ_{c∈C} p(c|x̃) log2 p(c|x̃), where c represents a category label in the evaluation corpus, and p(c|x̃) = |{x | ℓ(x) = c ∧ x ∈ x̃}| / |x̃|, where ℓ(x) denotes the true label of x in the evaluation corpus. The total entropy for the whole clustering is defined as the size-weighted sum of the entropies of all the clusters; formally, H(X̃) = Σ_{x̃∈X̃} (|x̃| / n) H(x̃). The quality of a clustering X̃ is evaluated using the entropy H(X̃): the lower, the better.
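As an illustration of this criterion, the sketch below (our own; the example labels are hypothetical) computes the weighted cluster entropy from a list of cluster assignments and the corresponding true category labels.

```python
import numpy as np
from collections import Counter

def clustering_entropy(cluster_ids, true_labels):
    """Weighted cluster entropy, as in Section 4.2:
    H(X~) = sum_{x~} (|x~| / n) * H(x~), with H(x~) = -sum_c p(c|x~) log2 p(c|x~)."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    n = len(cluster_ids)
    total = 0.0
    for cluster in np.unique(cluster_ids):
        labels_in_cluster = true_labels[cluster_ids == cluster]
        size = len(labels_in_cluster)
        probs = np.array(list(Counter(labels_in_cluster).values())) / size
        h = -(probs * np.log2(probs)).sum()      # entropy of one cluster
        total += size / n * h                    # weight by cluster size
    return total

# Hypothetical 2-way clustering of 6 images from two true categories.
print(clustering_entropy([0, 0, 0, 1, 1, 1],
                         ["fern", "fern", "starfish",
                          "starfish", "starfish", "starfish"]))
# ~0.459: cluster 0 is mixed (entropy 0.918), cluster 1 is pure (entropy 0).
```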
4.3. Empirical Analysis

We compared our algorithm STC to several state-of-the-art clustering methods as baselines. For each baseline method considered below, we have two options: one is to apply the baseline method to the target data only, which we refer to as separate; the other is to apply it to the combined data consisting of the target data and the auxiliary data, which we refer to as combined. The first baseline is a traditional one-dimensional clustering solution, CLUTO (Zhao & Karypis, 2002), using its default parameters. The second baseline is clustering the target data under a new feature representation that is first constructed through feature clustering (on the target or the combined data set); this baseline is designed to evaluate the effectiveness of the co-clustering based method as opposed to naively constructing a new data representation for clustering. We refer to this class of baselines as Feature Clustering. The third baseline is information theoretic co-clustering applied to the target (or the combined) data set (Dhillon et al., 2003), which we refer to as Co-clustering. This baseline is designed to test the effectiveness of our specialized co-clustering model for self-taught clustering.

Table 1. Performance in terms of entropy for each data set and evaluation method.

                                              CLUTO                Feature Clustering    Co-clustering
Data Set                                      separate  combined   separate  combined    separate  combined    STC
eyeglass vs sheet-music                       0.527     0.966      0.669     0.669       0.630     0.986       0.187
airplane vs ostrich                           0.352     0.696      0.512     0.479       0.426     0.753       0.252
fern vs starfish                              0.865     0.988      0.588     0.953       0.741     0.968       0.575
guitar vs laptop                              0.923     0.965      0.999     0.970       0.925     1.000       0.569
hibiscus vs ketch                             0.371     0.446      0.659     0.649       0.399     0.793       0.252
cake vs harp                                  0.882     0.879      0.998     0.911       0.860     0.996       0.772
car-side, tire, frog                          1.337     1.385      1.362     1.413       1.316     1.275       1.000
cd, comet, vcr, diamond-ring, skyscraper      1.663     1.827      1.755     1.751       1.715     1.772       1.274
Average                                       0.865     1.019      0.943     0.974       0.877     1.068       0.610

Figure 2. The entropy curves as a function of the number of feature clusters (averaged over the six binary tasks).
Figure 3. The entropy curves as a function of the trade-off parameter λ (fern vs starfish).
Figure 4. The entropy curves as a function of the number of iterations (fern vs starfish).

Table 1 presents the clustering performance in terms of entropy for each data set and each evaluation method. From this table, we can see that Feature Clustering and Co-clustering perform somewhat worse than CLUTO. This differs slightly from the results reported in previous literature such as (Dhillon et al., 2003). In our opinion, this is because our self-taught clustering problem focuses on a different situation from the previous ones: the target data are insufficient for traditional clustering algorithms. In our experiments, there are only 70 instances in each category, which is too few to build a good feature clustering partition. Therefore, the performance of Feature Clustering and Co-clustering declines. Moreover, the performance of combined is generally worse than that of separate. We believe this is because the target data and the auxiliary data are more or less independent of each other, so the topics in the combined data set may be biased towards the auxiliary data, which harms the clustering performance on the target data.

In general, our algorithm STC greatly outperforms the three baseline methods. We observe that the reason for the outstanding performance of STC is that the co-clustering part of STC makes the feature clustering result consistent with the clustering results on both the target data and the auxiliary data. Therefore, using this feature clustering as the new data representation, the clustering performance on the target data is improved.

In our STC algorithm, it is assumed that the number of feature clusters K is known in advance. However, in reality, this number should be carefully tuned. In these experiments, we tuned this parameter empirically. Figure 2 presents the entropy curves with respect to different numbers of feature clusters given by CLUTO, Feature Clustering, Co-clustering and STC, respectively. The entropy in Figure 2 is the average over the six binary image clustering tasks. Note that the curve given by CLUTO never changes, since CLUTO does not incorporate feature clustering. From this figure, we can see that Feature Clustering and Co-clustering behave somewhat unstably as the number of feature clusters increases. We believe the reason is that there are too few instances in each clustering task, which makes the traditional clustering results unreliable. Our algorithm STC incorporates a large amount of auxiliary unlabeled data, so its variance is much smaller than that of the traditional clustering algorithms. STC generally performs increasingly better as the number of feature clusters grows, until the number of feature clusters reaches 32; beyond 32, the performance of STC becomes insensitive to this parameter. We believe any number of feature clusters no less than 32 is sufficient for STC to perform well. In these experiments, we set the number of feature clusters to 32.
We next tested the choice of the trade-off parameter λ in our algorithm STC (refer to Equation (5)). In general, it is difficult to determine the value of the trade-off parameter λ theoretically. Instead, in this work, we tuned this parameter empirically on the data set fern vs starfish. Figure 3 presents the entropy curve given by STC as the trade-off parameter λ changes. From this figure, it can be seen that, when λ decreases, which implies a lower weight on the auxiliary unlabeled data, the performance of STC declines rapidly. On the other hand, when λ is sufficiently large, i.e. λ > 1, the performance of STC is relatively insensitive to the parameter λ. This indicates that the auxiliary data can help the clustering on the target data in our clustering tasks. In these experiments, we set the trade-off parameter λ to one, which is the best point in Figure 3.

Since our algorithm STC is iterative, its convergence property is also important to evaluate. Theorem 1 and Corollary 1 have already established the convergence of STC theoretically. Here, we analyze the convergence of STC empirically. Figure 4 shows the entropy curve given by STC for different numbers of iterations on the data set fern vs starfish. From this figure, we can see that STC converges well after 7 iterations, and the performance of STC reaches its lowest entropy when STC converges. This indicates that our algorithm STC converges very fast and very well. In these experiments, we set the number of iterations T to 10, which we believe is enough for STC to converge.

5. Related Work

In this section, we review several lines of research related to our work, including semi-supervised clustering, supervised clustering and transfer learning.

Semi-supervised clustering improves clustering performance by incorporating additional constraints provided by a few labeled data, in the form of must-links (two examples must be in the same cluster) and cannot-links (two examples cannot be in the same cluster) (Wagstaff et al., 2001). It finds a balance between satisfying the pairwise constraints and optimizing the original clustering criterion function. In addition to (Wagstaff et al., 2001), Basu et al. (2002) used a small amount of labeled data to generate initial seed clusters in K-means and to constrain the K-means algorithm with the labeled data. Basu et al. (2004) generalized the previous semi-supervised clustering algorithms and proposed a probabilistic framework based on hidden Markov random fields that combines the constraints and clustering distortion measures in a general framework. Recent semi-supervised clustering works include (Nelson & Cohen, 2007; Davidson & Ravi, 2007).

Supervised clustering is another branch of work designed to improve clustering performance with the help of a collection of auxiliary labeled data. To address the supervised clustering problem, Finley and Joachims (2005) proposed an SVM-based supervised clustering algorithm that optimizes a variety of different clustering functions. Daumé III and Marcu (2005) developed a Bayesian framework for supervised clustering based on a Dirichlet process prior.

Transfer learning emphasizes the transfer of knowledge across different domains or tasks. For example, multi-task learning (Caruana, 1997) or clustering (Teh et al., 2006) learns common knowledge among different related tasks. Wu and Dietterich (2004) investigated methods for improving SVM classifiers with auxiliary training data sources. Raina et al. (2006) proposed to learn logistic regression classifiers by incorporating labeled data from irrelevant categories, constructing an informative prior from the irrelevant labeled data. Raina et al. (2007) proposed a new learning strategy known as self-taught learning, which utilizes irrelevant unlabeled data to enhance classification performance.

In this paper, we propose a new clustering framework called self-taught clustering, which is an instance of unsupervised transfer learning. The basic idea is to use irrelevant unlabeled data to help the clustering of a small amount of target data. To the best of our knowledge, our self-taught clustering problem is novel in capturing a large class of machine learning problems.

6. Conclusions and Future Work

In this paper, we investigated an unsupervised transfer learning problem called self-taught clustering, and developed a solution that uses unlabeled auxiliary data to help improve the clustering results on the target data. We proposed a co-clustering based self-taught clustering algorithm (STC) to solve this problem. In our algorithm, two co-clusterings are performed simultaneously on the target data and the auxiliary data to uncover the shared feature clusters. Our empirical results show that the auxiliary data can help the target data to construct a better feature clustering as the data representation. Under the new data representation, the clustering performance on the target data is indeed enhanced, and our algorithm can greatly outperform several state-of-the-art clustering methods in the experiments.
In this work, we tackled self-taught clustering by finding a better feature representation using co-clustering. In the future, we will explore other ways of finding common feature representations.

Acknowledgements

Qiang Yang thanks Hong Kong CERG grant 621307 and CAG grant HKBU1/05C. We thank the anonymous reviewers for their greatly helpful comments.

References

Bach, F. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. Proceedings of the Twenty-first International Conference on Machine Learning (pp. 6–13).

Basu, S., Banerjee, A., & Mooney, R. J. (2002). Semi-supervised clustering by seeding. Proceedings of the Nineteenth International Conference on Machine Learning (pp. 27–34).

Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 59–68).

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41–75.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley-Interscience.

Daumé III, H., & Marcu, D. (2005). A Bayesian model for supervised clustering with the Dirichlet process prior. Journal of Machine Learning Research, 6, 1551–1577.

Davidson, I., & Ravi, S. S. (2007). Intractability and clustering with constraints. Proceedings of the Twenty-fourth International Conference on Machine Learning (pp. 201–208).

Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 89–98).

Finley, T., & Joachims, T. (2005). Supervised clustering with support vector machines. Proceedings of the Twenty-second International Conference on Machine Learning (pp. 217–224).

Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset (Technical Report 7694). California Institute of Technology.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall.

Li, F.-F., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (pp. 524–531).

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.

MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 1:281–297).

Nelson, B., & Cohen, I. (2007). Revisiting probabilistic models for clustering with pair-wise constraints. Proceedings of the Twenty-fourth International Conference on Machine Learning (pp. 673–680).

Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: transfer learning from unlabeled data. Proceedings of the Twenty-fourth International Conference on Machine Learning (pp. 759–766).

Raina, R., Ng, A. Y., & Koller, D. (2006). Constructing informative priors using transfer learning. Proceedings of the Twenty-third International Conference on Machine Learning (pp. 713–720).

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 1566–1581.

Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001). Constrained k-means clustering with background knowledge. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 577–584).

Wu, P., & Dietterich, T. G. (2004). Improving SVM accuracy by training on auxiliary data sources. Proceedings of the Twenty-first International Conference on Machine Learning (pp. 110–117).

Zhao, Y., & Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for document datasets. Proceedings of the Eleventh International Conference on Information and Knowledge Management (pp. 515–524).