DAC: Deep Autoencoder-Based Clustering, A General Deep Learning Framework of Representation Learning
1 Introduction
Clustering is the task of grouping samples such that the ones in the same group are more similar to each other than to the ones in other groups. Nowadays, clustering serves as a basic and essential pre-processing step in many real-world applications. For example, it can help with fake news identification [6], document analysis [16], marketing and sales, etc. Specifically, clustering algorithms can extract useful information for these applications by grouping data according to a variety of similarity metrics and grouping schemes. For example, similar patches can be used for image denoising [1–3] or depth enhancement [9], and clustering can be used to find good similar patches [8].
To properly assign samples to different groups (called clusters), meaningful feature values of the samples need to be obtained first. However, in real-world applications, the data we get is often high-dimensional [5] and usually contains noise, making clustering difficult. For example, in the MNIST dataset [7], each input hand-written digit image has 784 pixels. While we know that some pixels (e.g. the ones at image corners) might not be as useful as others (e.g. the ones around image centers), it is difficult to manually distinguish them for clustering.
Traditional dimensionality reduction algorithms, namely Principal Component Analysis (PCA) [10], Linear Discriminant Analysis (LDA) [4], and Canonical Correlation Analysis (CCA) [13], can be used to reduce the number of features. In addition, feature selection algorithms can be used to select a set of useful and noiseless values from the original features. These algorithms aim to extract the core information from redundant and correlated high-dimensional input features. However, they often fail for two main reasons. First, most of them require complex mathematical analysis, which is both difficult and time consuming. Second, there is no single approach that works for all types of datasets. Different datasets can have different dimensions and data sizes, and might even be used in totally different applications; some datasets are linear and some are non-linear. As a result, it is difficult to find an approach that generally works on all types of datasets.
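As a point of reference, such a linear baseline takes only a few lines with scikit-learn. This is a sketch, not part of our pipeline; `X` is a placeholder for an (n_samples, 784) feature matrix.

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=10)        # keep the 10 leading principal components
X_reduced = pca.fit_transform(X)  # X: (n_samples, 784) raw feature matrix
```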
Recently, owing to the emergence of powerful deep neural networks, deep learning-based approaches have been introduced to learn better data representations and achieve appealing performance improvements for clustering algorithms. One simple approach is to learn representations using deep autoencoders. Specifically, the original high-dimensional input features are fed into an encoder that generates a low-dimensional output. This output is further fed to a decoder that tries to recover the raw input data as well as possible. However, most existing approaches [11, 15] use images as input and thus rely on convolutional neural networks.
In this paper, we propose Deep Autoencoder-based Clustering (DAC), a simple but more general framework for representation learning that takes feature vectors as input. Thus, our approach can be applied to more general datasets. In addition, we propose a scheme to adaptively weight all input features according to the group labels, and we combine this estimated weight with the loss function during training. Experimental results show that our approach can effectively improve the performance of the K-Means clustering algorithm on different types of datasets, namely MNIST, Fashion-MNIST [14], as well as the Human Activities and Postural Transitions (HAPT) dataset [12].
The rest of the paper is organized as follows: in section 2 we give an overview of our deep autoencoder-based clustering. We then describe the deep autoencoder for representation learning in more detail in section 3. We finally show experimental results in section 4 and conclude in section 5.
Fig. 1. Overview of our Deep Autoencoder-based Clustering on the MNIST dataset. The autoencoder (consisting of an encoder and a decoder) tries to encode and decode the input features such that the decoded output is as close to the input as possible. The input size is 28×28 = 784; the size of the learned low-dimensional representation is 10. In the testing stage, the learned encoder output is fed into the classic K-Means algorithm for clustering.
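For clarity, the test-time pipeline in Figure 1 can be summarized in a few lines. This is a minimal sketch, not the authors' released code: `encoder` stands for the trained encoder module and `X_test` for the normalized test feature matrix.

```python
import torch
from sklearn.cluster import KMeans

def cluster_with_encoder(encoder, X_test, n_clusters=10):
    """Encode the raw features, then run classic K-Means on the learned codes."""
    encoder.eval()
    with torch.no_grad():
        codes = encoder(torch.as_tensor(X_test, dtype=torch.float32))
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(codes.numpy())
```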
3.1 Encoder
The encoder aims to encode or compress the input data into a smaller representation while preserving as much key information as possible. As shown in Figure 1, the encoder consists of 8 layers, including the input layer and the learned representation output layer. The input layer is normalized such that all its values lie in the range of (0, 1). Specifically, starting from the input, each larger layer is fully connected to the next smaller layer, followed by a couple of activation layers.
There are mainly two types of activation layers, ReLU and Tanh, as shown in Equations 1 and 2. Adding the ReLU layers introduces non-linearity to our model, making it more robust to non-linear input data. The Tanh layer, on the other hand, transforms the data into a normalized range of (−1, 1) to alleviate the gradient vanishing/exploding problem.
$$\mathrm{ReLU}(x) = \max(0, x) \quad (1)$$

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \quad \text{where} \quad \cosh(x) = \frac{e^{x} + e^{-x}}{2}, \;\; \sinh(x) = \frac{e^{x} - e^{-x}}{2} \quad (2)$$
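As a concrete illustration, a PyTorch sketch of such an encoder is given below. The 784-to-10 input/output sizes follow Figure 1, but the hidden widths (256, 64) and the exact layer count are our assumptions, since the precise dimensions are only given in the figure.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder: each fully connected layer maps to a smaller
    one and is followed by ReLU and Tanh activations, per the text above."""
    def __init__(self, in_dim=784, code_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Tanh(),
            nn.Linear(256, 64), nn.ReLU(), nn.Tanh(),
            nn.Linear(64, code_dim),  # learned low-dimensional representation
        )

    def forward(self, x):
        return self.net(x)
```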
3.2 Decoder
The decoder aims to decode or decompress the encoded output to reconstruct the original input data as well as possible. It contains nine layers, including the input layer, which is the output of the encoder, and the final output layer. Specifically, each smaller layer is fully connected to the next larger layer, followed by a Tanh activation layer. In addition, the decoder has a Sigmoid activation layer (shown in Equation 3) at the final stage to enforce that the output values lie in the range of (0, 1).
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \quad (3)$$
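A matching decoder sketch follows, mirroring the encoder with Tanh activations and a final Sigmoid; the hidden widths are again our assumptions.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder: each fully connected layer maps to a larger one
    followed by Tanh, with a final Sigmoid so outputs lie in (0, 1)."""
    def __init__(self, code_dim=10, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim, 64), nn.Tanh(),
            nn.Linear(64, 256), nn.Tanh(),
            nn.Linear(256, out_dim), nn.Sigmoid(),  # reconstruction in (0, 1)
        )

    def forward(self, z):
        return self.net(z)
```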
Clustering Weight Intuitively, the weight of an input feature should be large if both of the following conditions are met. First, the sampled values of that feature within the same group/cluster have small differences. Second, the sampled values across different groups/clusters have large differences. Thus, the weight is computed as:
$$w_i = \frac{\sum_{l_p = l_q} e^{-(x_{ip} - x_{iq})^2}}{\sum_{l_p = l_q} 1} \cdot \frac{\sum_{l_p \neq l_q} \left(1 - e^{-(x_{ip} - x_{iq})^2}\right)}{\sum_{l_p \neq l_q} 1} \quad (5)$$
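A direct NumPy rendering of Equation 5 is sketched below, with one assumption flagged in the code: self-pairs (p = q) are excluded from the same-label average, which the equation leaves implicit.

```python
import numpy as np

def clustering_weights(X, labels):
    """Sketch of Equation 5 for an (n_samples, n_features) matrix X and a
    1-D integer label array `labels`; returns one weight per feature."""
    n, d = X.shape
    same = labels[:, None] == labels[None, :]  # (n, n) mask where l_p == l_q
    diff = ~same                               # mask where l_p != l_q
    np.fill_diagonal(same, False)              # drop p == q pairs (assumption)
    w = np.empty(d)
    for i in range(d):
        # pairwise affinity e^{-(x_ip - x_iq)^2} for feature i
        a = np.exp(-(X[:, i][:, None] - X[:, i][None, :]) ** 2)
        w[i] = a[same].mean() * (1.0 - a[diff]).mean()
    return w
```

On MNIST, `X` would hold the 1000 sampled training images flattened to 784-D vectors and `labels` their digit classes, producing the weight map of Figure 2.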
Fig. 2. A map of the clustering weights computed for the MNIST dataset using 1000 samples from the training set. Pixels at boundaries and corners are less important than the ones around image centers.
Figure 2 shows a map of the clustering weights computed for the MNIST dataset using 1000 samples from the training set. Pixels at boundaries and corners are less important than the ones around image centers and thus have smaller weights (white means larger weights).
Final Objective Function The final objective function combines the clustering-weighted MSE loss and a standard L2 norm regularization, as shown in Equation 6. Here the L2 norm regularization Lr is computed over all parameters of the autoencoder, and β is a balancing factor with a default value of 0.00001.
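Based on this description, a plausible form of the objective is the simple sum L = L_w + β·L_r; the sketch below uses our own symbol names (`l_w` for the clustering-weighted MSE, `l_r` for the regularizer), not necessarily the paper's exact notation.

```python
import torch

def dac_loss(x, x_hat, w, model, beta=1e-5):
    """Sketch of the final objective: clustering-weighted MSE plus an L2
    penalty over all autoencoder parameters. w is the 1-D tensor of
    per-feature weights from Equation 5."""
    l_w = torch.mean(w * (x - x_hat) ** 2)                 # weighted MSE
    l_r = sum(p.pow(2).sum() for p in model.parameters())  # L2 regularization Lr
    return l_w + beta * l_r
```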
4 Experimental Results
4.1 Dataset
We evaluate our approach on the classic MNIST hand-written digits dataset. This dataset has 50,000 images in the training set and 10,000 images in the testing set, covering 10 groups in total. We show some samples of the MNIST dataset in Figure 3.
To evaluate our framework, we apply our trained encoder to the testing dataset. We then compare the representations generated by our trained encoder against the raw input features by feeding both to the K-Means algorithm. To measure the performance of the clustering algorithms, we use the Adjusted Rand Index (ARI). Specifically, this metric computes a similarity between two clustering results by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and ground-truth clustering results. The proposed approach is denoted as DAC.
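ARI is available off the shelf in scikit-learn; for example:

```python
from sklearn.metrics import adjusted_rand_score

# ARI is 1.0 for identical partitions and close to 0.0 for random labelings;
# y_true and y_pred are 1-D arrays of ground-truth and predicted cluster labels.
ari = adjusted_rand_score(y_true, y_pred)
```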
We implement our framework in Python and PyTorch and test it on a desktop with an RTX 2080-Ti GPU. We train the autoencoder for 200 epochs using the Adam optimization algorithm. The initial learning rate is set to 0.003 and decreases with the number of epochs during training.
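A minimal training sketch under these settings is shown below. The exact decay schedule is not specified, so the exponential decay here is purely illustrative; `encoder`, `decoder`, `weights`, `dac_loss`, and `train_loader` refer to the earlier sketches and are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(encoder, decoder)  # encoder/decoder from the sketches above
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)  # assumed decay

for epoch in range(200):
    for x in train_loader:  # batches of normalized 784-D feature vectors
        loss = dac_loss(x, model(x), weights, model)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # shrink the learning rate as training progresses
```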
Table 1 shows the quantitative performance of the proposed approach in terms of ARI. Compared to the raw K-Means algorithm, our approach (DAC) boosts K-Means performance from 0.3477 to 0.6624, a 90.5% improvement. We also show some reconstruction results of our trained autoencoder in Figure 4, which show that it can properly reconstruct the raw input hand-written digits.
Table 1. Clustering performance (ARI) on the MNIST testing set.

        K-Means   DAC
ARI     0.3477    0.6624
Fig. 4. Sample results of our trained autoencoder on the MNIST dataset. Top: raw input images. Bottom: reconstructed images.
To test the robustness of our approach to different data types, we apply our method to two other datasets: Fashion-MNIST [14] and the Human Activities and Postural Transitions (HAPT) dataset [12].
Fashion-MNIST is a dataset similar to MNIST, with the same image format and image size. It has 60,000 training images and 10,000 testing images. The only difference is the content: it contains images of 10 types of clothes. The ten categories are shown in Table 2. We show some samples of this dataset in Figure 5.
The Human Activities and Postural Transitions dataset was captured with a smartphone's sensors [12]. The authors recorded 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz using the embedded accelerometer and gyroscope of the device, a Samsung Galaxy S II smartphone. There are 30 volunteers whose ages are in the range of 19-48 years. In the data capturing experiment, each volunteer performed one of twelve activities. There are six basic activities: three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs, and walking upstairs). Another six postural transitions that occur between the static postures have also been added to the dataset: stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie, and lie-to-stand. All twelve types of activities are shown in Table 3.
The sensor signals (accelerometer and gyroscope) were then denoised with noise filters. The authors then sampled the signals in fixed-width sliding windows of 2.56 sec with 50% overlap (128 readings per window), leading to a sample size of 561 features. Each sample is captured while the volunteer performs one type of activity. 70% of the volunteers were randomly selected to generate the training set and 30% were selected to generate the testing set. In total, this dataset has 7,767 samples for training and 3,162 samples for testing.
Fig. 6. Sample results of our trained autoencoder on the Fashion-MNIST dataset. Top: raw input images. Bottom: reconstructed images.
We apply our method to the Fashion-MNIST dataset and report the results in Table 4. As Fashion-MNIST is a more complex dataset, we modified the autoencoder; the modified architecture is shown in Figure 7. Compared to using the raw input features in K-Means clustering, our method boosts ARI from 0.3039 to 0.4702, an improvement of 54.7%.
We then apply our method to the HAPT dataset and report the results in Table 5. As this dataset's inputs are of lower dimension than MNIST's, we modified the autoencoder accordingly; the modified architecture is shown in Figure 8. Even on this temporal sequence dataset, our method effectively improves the K-Means algorithm's performance by 30%. These results also show that our method can be generally applied to other data types. We also show some reconstruction results of our trained autoencoder in Figure 6, which show that it can properly reconstruct the raw input fashion images.
Table 4. Clustering performance (ARI) on the Fashion-MNIST testing set.

        K-Means   DAC
ARI     0.3039    0.4702
Table 5. Clustering performance (ARI) on the HAPT testing set.

        K-Means   DAC
ARI     0.4290    0.5594
5 Conclusion
References
1. Buades, A., Coll, B., Morel, J.: A non-local algorithm for image denoising. In: IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 60–65 (2005)
2. Chen, F., Zhang, L., Yu, H.: External patch prior guided internal clustering for image denois-
ing. In: IEEE International Conference on Computer Vision (ICCV), pp. 603–611 (2015)
3. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-d transform-
domain collaborative filtering. IEEE Transactions on Image Processing 16(8), 2080–2095
(2007)
4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, New York (2001)
5. Han, J., Pei, J., Kamber, M.: Data mining: concepts and techniques. Elsevier (2011)
6. Hosseinimotlagh, S., Papalexakis, E.E.: Unsupervised content-based identification of fake
news articles with tensor decomposition ensembles. In: Proceedings of the Workshop on
Misinformation and Misbehavior Mining on the Web (MIS2) (2018)
7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
8. Lu, S.: Good similar patches for image denoising. In: 2019 IEEE Winter Conference on
Applications of Computer Vision (WACV), pp. 1886–1895. IEEE (2019)
9. Lu, S., Ren, X., Liu, F.: Depth enhancement via low-rank matrix completion. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp. 3390–3397 (2014)
10. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572 (1901)
11. Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., Carin, L.: Variational autoencoder
for deep learning of images, labels and captions. Advances in neural information processing
systems 29, 2352–2360 (2016)
12. Reyes-Ortiz, J.L., Oneto, L., Samà, A., Parra, X., Anguita, D.: Transition-aware human ac-
tivity recognition using smartphones. Neurocomputing 171, 754–767 (2016)
13. Sun, Q.S., Zeng, S.G., Liu, Y., Heng, P.A., Xia, D.S.: A new method of feature fusion and its
application in image recognition. Pattern Recognition 38(12), 2437–2448 (2005)
14. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
15. Yang, X., Deng, C., Zheng, F., Yan, J., Liu, W.: Deep spectral clustering using dual au-
toencoder network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) (2019)
16. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets.
In: Proceedings of the eleventh international conference on Information and knowledge man-
agement, pp. 515–524 (2002)