Ijdkp 030207
2, March 2013
ABSTRACT
An intrinsic problem of classifiers based on machine learning (ML) methods is that their learning time grows as the size and complexity of the training dataset increases. For this reason, it is important to have efficient computational methods and algorithms that can be applied to large datasets, so that the machine learning tasks can still be completed in reasonable time. In this context, we present in this paper a simple and more accurate process to speed up ML methods. An unsupervised clustering algorithm is combined with the Expectation-Maximization (EM) algorithm to develop an efficient Hidden Markov Model (HMM) training. The proposed process consists of two steps. In the first step, training instances with similar inputs are clustered, and a weight factor which represents the frequency of these instances is assigned to each representative cluster. The Dynamic Time Warping technique is used as a dissimilarity function to cluster similar examples. In the second step, all formulas in the classical HMM training algorithm (EM) associated with the number of training instances are modified to include the weight factor in the appropriate terms. This process significantly accelerates HMM training while maintaining the same initial, transition and emission probability matrices as those obtained with the classical HMM training algorithm. Accordingly, the classification accuracy is preserved. Depending on the size of the training set, speedups of up to 2200 times are possible when the training set contains about 100,000 instances. The proposed approach is not limited to training HMMs; it can be employed for a large variety of ML methods.
KEYWORDS
Dynamic Time Warping, clustering, Hidden Markov Models.
1. INTRODUCTION
The main problem of classifiers based on machine learning (ML) methods is that their learning time grows as the size and complexity of the training dataset increases, yet training with only a small amount of data leads to unreliable performance. Indeed, due to advances in technology, the size and dimensionality of the data sets used in machine learning tasks have grown very large and continue to grow by the day. For this reason, it is important to have efficient computational methods and algorithms that can be applied to very large data sets, so that the machine learning tasks can still be completed in reasonable time. Speeding up the estimation of model parameters is therefore a common concern across ML methods, and it raises several challenges. Many methods have been employed to overcome this problem. They can be classified into two categories. The first category consists of methods which compress the data, either by reducing the number of instances [2, 3, 4, 5] or by reducing the set of features (attributes) which characterize the instances [16, 20, 21]. The second category consists of methods which reduce running time at the execution and compilation level [16, 22, 23].
DOI: 10.5121/ijdkp.2013.3207
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.2, March 2013
This paper is an extended version of our previous work described in [1]. That paper presented a simple process to speed up ML methods, in which an unsupervised clustering algorithm is combined with the Expectation-Maximization (EM) algorithm to develop an efficient Hidden Markov Model (HMM) training. The proposed process consists of two steps. The first one is a preprocessing step that reduces the number of training instances without any information loss. In this step, training instances with similar inputs are clustered, and a weight factor which represents the frequency of these instances is assigned to each representative cluster (index). In [1], the Euclidean distance was used as the metric to compare two observation sequences. There is increasing evidence that this metric gives poor accuracy for the classification and clustering of time-dependent sequences: the Euclidean distance is widely known to be very sensitive to distortion along the time axis [6][7]. In spite of that, it is used in many fields because of its ease of implementation and its time and space efficiency. In this paper, we introduce Dynamic Time Warping (DTW) as the dissimilarity function. DTW addresses the problem of distortion along the time axis. It allows non-linear alignments between two sequences of different lengths, so that sequences which are similar but locally out of phase can be matched, as shown in Figure 1.
Figure 1. A warping between two temporal signals (matching by stretching and scaling along the time axis).
The Euclidean distance between two sequences can be seen as a special case of DTW that is only defined when the two sequences have the same length. DTW is a way to solve a vast range of time-dependent sequence problems, and it is widely used in various disciplines. In bioinformatics, Aach and Church successfully applied DTW to cluster RNA expression data [8]. In chemical engineering, it has been used for the synchronization and monitoring of batch processes [9]. DTW has been effectively used to align biometric data, such as gait [10], signatures [11], fingerprints [12], and ECGs [13]. Rath and Manmatha have successfully applied DTW to the problem of indexing repositories of handwritten historical documents [14], and it is also used in indexing video motion streams [15]. In robotics, Schmill et al. demonstrate a technique that utilizes DTW to cluster robots' sensory outputs [17]. In music, Zhu and Shasha have exploited DTW to query music databases with snippets of hummed phrases [18, 19]. The superiority of DTW over the Euclidean distance for these tasks has been demonstrated by many authors [8, 6, 13]. However, the greater accuracy of DTW comes at a cost. Depending on the length of the sequences, DTW is typically hundreds or thousands of times slower than the Euclidean distance. Nevertheless, extensive research has been performed on how to accelerate DTW computations, in particular for one-dimensional (or low-dimensional) real-valued sequences, often referred to as time series.
The main contributions of this paper are summarized as follows: (i) we investigate the DTW technique for clustering time-dependent sequences; compared to the previous work [1], DTW provides more accurate clusters than the Euclidean metric; (ii) we evaluate DTW performance in terms of speed; DTW is many times slower than the Euclidean distance, but the overall time taken by learning with DTW-based clustering is still many times less than the time taken without clustering; (iii) we also evaluate the generalization ability of DTW-based clustering across other databases.
The remainder of this paper is structured as follows. The next two sections briefly review the HMM, to which we apply our proposed technique, and the Dynamic Time Warping technique. Section 4 describes the proposed technique for a reduced-time learning algorithm. The obtained results are discussed in Section 5. Section 6 discusses the possibility of extending the proposed technique to another database. Finally, Section 7 concludes the paper.
$1 \le i, j \le N$.
Furthermore, an optimal warping path between X and Y is a warping path p* having minimal total cost among all possible warping paths. The DTW distance DTW(X, Y) between X and Y is then defined as the total cost of p*, i.e. the minimum total cost over all warping paths between X and Y.
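The minimum over all warping paths can be computed by dynamic programming. The following Python sketch is only an illustration: it assumes numeric sequences, the absolute difference between two elements as local cost, and the usual three step directions (advance in X, advance in Y, or advance in both); none of these choices is fixed by the definition above, and the name `dtw_distance` is hypothetical.

```python
def dtw_distance(x, y):
    """DTW distance between two numeric sequences, using |a - b| as local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # D[i][j] = minimal total cost of a warping path aligning x[:i] with y[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible predecessor cells
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# A stretched copy of a sequence is at distance 0, even with a different length.
stretched = dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 2, 3])
# A shifted sequence of the same length is at a strictly positive distance.
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
print(stretched, shifted)
```

The table holds (N+1)(M+1) cells for sequences of lengths N and M, which is where the quadratic cost of the plain algorithm comes from.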
Example: Suppose we have two HMMs, each with two states. The possible observations which can be emitted by each state are O = {1, 2, 3, 4, 5, 6, 7, 8}. Some possible sequences produced by the two HMMs are {1234567, 1222234, 1123334, 1222344, ...}. In the first sequence each observation occurs only once, whereas the other sequences contain multiple occurrences of some observations. When the Euclidean metric is used to cluster these sequences, we get four clusters, but with the DTW technique we get only two clusters, because the DTW distance between any two of the last three sequences is equal to 0. Consequently, the first sequence can be assigned to the first HMM and the other three sequences can be assigned to the second HMM. The maximum likelihood confirms this decision. Thus, DTW is more accurate than the Euclidean distance, since it produces meaningful clusters.
Once clustering is completed, we use each instance in the cluster table as a new training instance. Its weight factor can be found in the associated frequency field. In many studies, clusters with a small weight are considered noisy instances and are removed. This is not appropriate in all studies. In our experiments, each cluster describes a specific behavior associated with the considered class. For example, anger and disgust are two universal facial expressions which can be displayed in different ways, so each cluster of a class (facial expression) can be a specific description of that facial expression rather than a noisy instance. In this case, only a human expert can remove such clusters to improve classification accuracy.
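The example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the clustering procedure used in the experiments: the greedy single-pass assignment, the zero threshold, the absolute-difference local cost and all function names are hypothetical choices made here.

```python
def dtw(x, y):
    """DTW distance with |a - b| local cost, keeping only two rows of the table."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for a in x:
        curr = [inf]
        for j, b in enumerate(y, start=1):
            curr.append(abs(a - b) + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def euclidean(x, y):
    """Euclidean distance; treated as infinite when the lengths differ."""
    if len(x) != len(y):
        return float("inf")
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def cluster(sequences, dist, threshold=0.0):
    """Greedy clustering: a sequence joins the first representative within
    `threshold`, otherwise it opens a new cluster.
    Returns a table of [representative, weight] rows."""
    table = []
    for seq in sequences:
        for row in table:
            if dist(row[0], seq) <= threshold:
                row[1] += 1  # one more instance represented by this cluster
                break
        else:
            table.append([seq, 1])
    return table


seqs = [
    [1, 2, 3, 4, 5, 6, 7],
    [1, 2, 2, 2, 2, 3, 4],
    [1, 1, 2, 3, 3, 3, 4],
    [1, 2, 2, 2, 3, 4, 4],
]
by_euclid = cluster(seqs, euclidean)
by_dtw = cluster(seqs, dtw)
print(len(by_euclid), len(by_dtw))
```

With these four sequences the Euclidean metric keeps every sequence in its own cluster, while DTW groups the last three under one representative with weight 3.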
These computations are normally performed for all training instances. In our case, they are performed for the new set of training instances, which consists of the clustered training set, so they are iterated for m=1..ki (ki is the number of clusters of class i), and each formula is multiplied by Ck(i,j+1) (the number of similar instances). As a result, the transition probability from state Si to state Sj and the emission probability of the observation Vk from state Sj are still weighted by the full number of instances (redundant and non-redundant). Moreover, the running time needed to compute all these expectations over all iterations is significantly reduced.
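The effect of the weight factor can be sketched in Python. This is a minimal illustration, not the authors' Matlab code: it assumes a discrete-observation HMM, the standard scaled forward-backward recursions from Rabiner's tutorial [24], observation symbols coded as 0-based integers, and a hypothetical cluster format of (sequence, weight) pairs. One re-estimation step multiplies every expected count by the cluster weight.

```python
def forward_backward(seq, pi, A, B):
    """Scaled forward-backward pass; returns state posteriors gamma[t][i]
    and transition posteriors xi[t][i][j] for one observation sequence."""
    N, T = len(pi), len(seq)
    alpha = [[pi[i] * B[i][seq[0]] for i in range(N)]]
    scale = [sum(alpha[0])]
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        row = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][seq[t]]
               for j in range(N)]
        scale.append(sum(row))
        alpha.append([a / scale[t] for a in row])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][seq[t + 1]] * beta[t + 1][j]
                             for j in range(N)) / scale[t + 1]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][seq[t + 1]] * beta[t + 1][j] / scale[t + 1]
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    return gamma, xi


def weighted_em_step(clusters, pi, A, B):
    """One EM re-estimation step over (sequence, weight) pairs.
    Assumes at least one sequence has length >= 2."""
    N, M = len(pi), len(B[0])
    pi_num = [0.0] * N
    A_num = [[0.0] * N for _ in range(N)]
    A_den = [0.0] * N
    B_num = [[0.0] * M for _ in range(N)]
    B_den = [0.0] * N
    total = 0.0
    for seq, w in clusters:
        gamma, xi = forward_backward(seq, pi, A, B)
        total += w
        for i in range(N):
            pi_num[i] += w * gamma[0][i]
            for t, obs in enumerate(seq):
                B_num[i][obs] += w * gamma[t][i]
                B_den[i] += w * gamma[t][i]
            for t in range(len(seq) - 1):
                A_den[i] += w * gamma[t][i]
                for j in range(N):
                    A_num[i][j] += w * xi[t][i][j]
    new_pi = [p / total for p in pi_num]
    new_A = [[A_num[i][j] / A_den[i] for j in range(N)] for i in range(N)]
    new_B = [[B_num[i][k] / B_den[i] for k in range(M)] for i in range(N)]
    return new_pi, new_A, new_B


# Illustrative starting parameters: 2 states, 3 observation symbols.
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]]
clusters = [([0, 1, 2], 3), ([2, 2, 0, 1], 2)]
# The same data with every duplicate written out and weight 1.
expanded = [(seq, 1) for seq, w in clusters for _ in range(w)]
weighted = weighted_em_step(clusters, pi0, A0, B0)
plain = weighted_em_step(expanded, pi0, A0, B0)
```

Here `weighted` and `plain` agree up to floating-point rounding, although the weighted step runs forward-backward on two sequences instead of five.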
5. EXPERIMENTAL RESULTS
The experiments were conducted on both (i) information extracted from videos of the MMI database [27] and (ii) synthetically generated instances. Our aim is to analyze the capability of the proposed method to reduce the number of training instances without information loss and to reduce training time. Our source code is written in Matlab, and our experiments were run on an Intel Dual-Core 1.47 GHz machine with 2 GB of memory.
[Table residue: row label "Time of EM training/s"; values "10 19 24 25".]
The results in this table are computed as the mean of ten (10) executions of the two algorithms. In terms of processing speed, EM training takes 0.5304 s to learn 100 training sequences, whereas EM-based cluster training takes 0.0936 s, which makes it about 5.7 times faster. With 1,000 training sequences, EM-based cluster training is 32.3 times faster than EM training. With 10,000 training sequences, it is 228.26 times faster. Finally, with 100,000 training sequences, it is 2361.87 times faster.
However, clustering with the Euclidean distance takes less time than clustering with the DTW technique in all cases. One of the methods proposed in the literature could be used to speed up the DTW computation: the complexity of this technique is O(N.M), and it can be reduced to O(N+M) as in [30]. In the experiments conducted on these databases, the number of clusters obtained with the Euclidean metric is the same as with the DTW technique. This can be explained by the fact that the sequences of the new training set (the representative/index clusters) are genuinely different, with no multiple occurrences of an observation within the same sequence. To evaluate the ability of DTW to build fewer, more accurate clusters, we use another database (see the next section).
Although DTW-based clustering consumes more time than Euclidean-based clustering, the overall learning time (clustering + training) with either Euclidean or DTW clustering is less than the learning time without clustering. The experiments show that the gap in training time between the two algorithms widens dramatically with the number of training sequences, in favor of the proposed modified EM training algorithm.
These results suggest that the proposed method is especially useful with very large datasets. They show that the proposed process reduces not only computation time but also, significantly, storage requirements. Because HMM profiles are usually trained with a set of sequences that are known to belong to a single category, the clustering process produces a small number of clusters. The representatives (indexes) of these clusters form the new set of training instances.
Experimental results show that the EM and EM-based cluster training algorithms yield similar initial (π), transition (A) and emission (B) probability matrices. Consequently, the classification accuracy is maintained.
As explained before, the number of clusters obtained with DTW is smaller than the number obtained with the Euclidean metric, and it matches the real number of clusters very well. This can be explained by the fact that, with the Euclidean metric, many sequences are considered different even though they are similar but not synchronized. This means that DTW is more accurate for clustering than the Euclidean metric. We can also observe that the training time after DTW-based clustering is less than the training time after Euclidean-based clustering, which is explained by the greater number of clusters built with the Euclidean metric. Finally, the proposed idea can in general be adapted to all ML methods whose processing involves a number of iterations equal to the number of instances. The proposed process reduces the number of iterations and consequently reduces the training time of these ML methods.
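As a hedged illustration of this generalization, the Python sketch below applies the same weighting to a method other than HMM training. The least-squares gradient, the toy data and all names are hypothetical and are not taken from the experiments; the point is only that a per-instance sum can be replaced by a per-cluster sum multiplied by the cluster frequency.

```python
from collections import Counter


def gradient(instances, w, b):
    """Gradient of the summed squared error of y ~ w*x + b
    over (x, y, weight) triples."""
    gw = gb = 0.0
    for x, y, weight in instances:
        err = w * x + b - y
        gw += weight * 2 * err * x
        gb += weight * 2 * err
    return gw, gb


# 1,000 instances, but only three distinct ones.
data = [(1.0, 2.0)] * 500 + [(2.0, 3.5)] * 300 + [(3.0, 6.5)] * 200
full = [(x, y, 1) for x, y in data]
# Collapse exact duplicates and keep their frequency as the weight.
compressed = [(x, y, n) for (x, y), n in Counter(data).items()]

g_full = gradient(full, w=0.5, b=0.1)
g_comp = gradient(compressed, w=0.5, b=0.1)
print(len(full), len(compressed))
```

The loop over `compressed` runs three iterations instead of one thousand and returns the same gradient up to floating-point rounding.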
7. CONCLUSION
In this paper, a simple and more accurate process to speed up ML methods is studied. An unsupervised clustering algorithm is combined with the EM algorithm to develop an efficient HMM training that tackles the problem of HMM training time on large datasets. The proposed process consists of two steps. In the first step, training instances with similar inputs are grouped into one cluster, and each cluster is weighted with the number of similar instances. The DTW technique is used to accurately cluster time-dependent training sequences. In the second step, all formulas associated with the number of training instances are modified to include the weight factor in the appropriate terms. Through empirical experiments, we demonstrated that the EM-based cluster training algorithm significantly reduces HMM training time and storage requirements. Furthermore, the proposed approach is not limited to HMMs; it can be employed for a large variety of ML methods. In future work, we plan to explore new techniques to cluster time-dependent sequences more accurately, to further reduce training time, to further improve classification accuracy, and to test these methods on different ML methods.
REFERENCES
[1] Ghanem K., (2012) A simple method to speed up Machine learning methods: application to Hidden Markov Models. CDKP, International conference on Data mining and Knowledge Processing, Dubai.
[2] Wilson D.R. and Martinez T.R., (2000) Reduction Techniques for Instance-Based Learning Algorithms. Machine Learning. 38, 257-286.
[3] Linde Y., Buzo A. and Gray R.M., (1980) An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications. 702-710.
[4] Gersho A. and Gray R.M., (1992) Vector Quantization and Signal Compression. Kluwer Academic Publishers.
[5] Antonio M.P., José C. Segura, Antonio J. Rubio, Pedro Garcia and José L. Pérez, (1996) Discriminative codebook design using multiple VQ in HMM-based speech recognizers. IEEE Trans. on Speech and Audio Processing. 4 (2), 89-95.
[6] Bar-Joseph Z., Gerber G., Gifford D., Jaakkola T. and Simon I., (2002) A new approach to analyzing gene expression time series data. In Proceedings of the 6th Annual International Conference on Research in Computational Molecular Biology. pp. 39-48.
[7] Chu S., Keogh E., Hart D. and Pazzani M., (2002) Iterative deepening dynamic time warping for time series. In Proc. 2nd SIAM International Conference on Data Mining.
[8] Aach J. and Church G., (2001) Aligning gene expression time series with time warping algorithms. Bioinformatics. Volume 17, pp. 495-508.
[9] Gollmer K. and Posten C., (1995) Detection of distorted pattern using dynamic time warping algorithm and application for supervision of bioprocesses. On-Line Fault Detection and Supervision in Chemical Process Industries.
[10] Gavrila D.M. and Davis L.S., (1995) Towards 3-d model-based tracking and recognition of human movement: a multi-view approach. In International Workshop on Automatic Face- and Gesture-Recognition. pp. 272-277.
[11] Munich M. and Perona P., (1999) Continuous dynamic time warping for translation-invariant curve alignment with applications to signature verification. In Proceedings of 7th International Conference on Computer Vision, Korfu, Greece. pp. 108-115.
[12] Kovacs-Vajna Z.M., (2000) A fingerprint verification system based on triangular matching and dynamic time warping. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 11, November, pp. 1266-1276.
[13] Caiani E.G., Porta A., Baselli G., Turiel M., Muzzupappa S., Pieruzzi F., Crema C., Malliani A. and Cerutti S., (1998) Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular volume. IEEE Computers in Cardiology. pp. 73-76.
[14] Rath T. and Manmatha R., (2002) Word image matching using dynamic time warping. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR03), Vol. II, pp. 521-527.
[15] Ng J. and Gong S., (2002) Learning Intrinsic Video Content using Levenshtein Distance in Graph Partitioning. In Proc. European Conference on Computer Vision, Volume 4, pp. 670-684, Copenhagen, Denmark, May 2002.
[16] Yu L. and Liu H., (2004) Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research (JMLR). 5, 1205-1224.
[17] Schmill M., Oates T. and Cohen P., (1999) Learned models for continuous planning. In 7th International Workshop on Artificial Intelligence and Statistics.
[18] Hu N., Dannenberg R.B. and Tzanetakis G., (2003) Polyphonic Audio Matching and Alignment for Music Retrieval. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 03).
[19] Zhu Y. and Shasha D., (2003) Warping Indexes with Envelope Transforms for Query by Humming. SIGMOD 2003. pp. 181-192.
[20] Ozcift A. and Gulten A., (2012) A robust multi-class feature selection strategy based on rotation forest ensemble algorithm for diagnosis of Erythemato-Squamous diseases. Journal of Medical Systems. 36 (2), 941-951.
[21] Mark A. Hall and Lloyd A. Smith, (1999) Feature Selection for Machine Learning: Comparing a Correlation-based Filter Approach to the Wrapper. In Proceedings of 12th International Florida AI Research. AAAI Press, pp. 235-239. USA.
[22] Thomson J., O'Boyle M., Fursin G. and Bjorn F., Reducing Training Time in a One-shot Machine Learning-based Compiler. 22nd International Workshop, Proc. Languages and Compilers for Parallel Computing. pp. 399-407. USA.
[23] Rajat R., Anand M. and Andrew Y. Ng, (2009) Large-scale Deep Unsupervised Learning using Graphics Processors. In ACM Proceedings of the 26th International Conference on Machine Learning. 382, 110. Canada.
[24] Rabiner L.R., (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 77, 257-286.
[25] Velichko V.M. and Zagoruyko N.G., (1970) Automatic Recognition of 200 Words. International Journal of Man-Machine Studies, 2, 223-234.
[26] Carroll J.M. and Russell J., (1997) Facial Expression in Hollywood's Portrayal of Emotion. J. Personality and Social Psychology. 72, 164-176.
[27] MMI Database: http://www.mmifacedb.com/
[28] Ghanem K. and Caplier A., (2011) Occurrence order detection of face features deformations in posed expressions. ICIEIS, Part II, CCIS 252, pp. 56-66, Springer-Verlag.
[29] Sakoe H. and Chiba S., (1971) A Dynamic Programming Approach to Continuous Speech Recognition. In Proceedings of the Seventh International Congress on Acoustics, volume 3, pp. 65-69. Akadémiai Kiadó, Budapest.
[30] K. Selcuk Candan, R. Rossini and M.L. Sapino, (2012) sDTW: Computing DTW Distances using Locally Relevant Constraints based on Salient Feature Alignments. The 38th International Conference on Very Large Data Bases, August 27th-31st, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 11.