International Journal of Modern Engineering Research (IJMER) Vol.2, Issue.4, July-Aug 2012 pp-1955-1957 ISSN: 2249-6645
I. Introduction
In recent years, the volume of data held by organizations has grown, driven by factors such as automated data acquisition and reduced storage costs. As a result, there has also been growing interest in computational algorithms that can extract relevant information from recorded data. Data mining is the process of applying various methods and techniques to databases with the objective of extracting information hidden in large amounts of data. A frequently used method is cluster analysis, which can be defined as the process of partitioning data into a certain number of clusters (or groups) such that the objects within a group are similar to one another (internal homogeneity) and different from the objects of the other groups (external heterogeneity); that is, patterns in the same cluster should be similar to each other, while patterns in different clusters should not [1]. More formally, given a set of N input patterns X = {x1, ..., xN}, where each xj = (xj1, ..., xjp) is a p-dimensional vector and each measure xji is an attribute (or variable) of the dataset, a clustering process seeks a K-partition of X, denoted by C = {C1, ..., CK} (K <= N).

Artificial neural networks (ANNs) are an important computational tool with strong neurobiological inspiration, widely used to solve complex problems that cannot be handled with traditional algorithmic solutions [13]. Applications of ANNs include pattern recognition, signal analysis and processing, analysis tasks, diagnosis and prognosis, data classification and clustering. Previous work presented a simple and efficient algorithm to cluster distributed datasets, based on multiple parallel self-organizing maps (SOM), called partSOM [5]. The algorithm is particularly interesting in situations where the data volume is very large or when data privacy and security policies forbid data consolidation into a single location.

This work extends that approach by presenting a strategy for efficient cluster analysis in distributed databases using SOM and K-means. The strategy is to apply the SOM algorithm separately to each distributed dataset (horizontal partitions of the data) to obtain a representative subset of each local dataset. These representative subsets are then sent to a central site, which fuses the results and applies the SOM and K-means algorithms to obtain the final result.

The remainder of the article is organized as follows: section 2 presents a brief review of distributed data clustering algorithms and section 3 describes the main aspects of the SOM. Section 4 presents the proposed algorithm and describes the methodology. Finally, section 5 presents conclusions and future research directions.
II. Distributed Data Clustering
Cluster analysis algorithms group data based on the similarities between patterns. The complexity of the cluster analysis process increases with data cardinality (N, the number of objects in a database) and dimensionality (p, the number of attributes). Clustering methods range from largely heuristic to statistical ones. Several algorithms have been developed based on different strategies, including hierarchical clustering, vector quantization, graph theory, fuzzy logic, neural networks and others. A recent survey of cluster analysis algorithms is presented in Xu and Wunsch [1].

Searching for clusters in high-dimensional databases is a non-trivial task. Some common algorithms, such as traditional agglomerative hierarchical methods, are unsuitable for large datasets. Increasing the number of attributes of each entry not only increases the processing time of the algorithm but also hinders the identification of the clusters. An alternative approach is to divide the database into partitions and to cluster each one separately.

Some current applications have databases so large that they cannot be kept entirely in main memory, even on robust machines. Kantardzic [2] points out three approaches to solve this problem:
a) The data are stored in secondary memory and data subsets are clustered separately. A subsequent stage is needed to merge the results;
b) Usage of an incremental clustering algorithm. Each element is individually stored in main memory and associated with one of the existing groups or allocated to a new group;
c) Usage of a parallel implementation. Several algorithms work simultaneously on the stored data.

Two approaches are usually used to partition a dataset. The first, and more usual, is to divide the database horizontally, creating homogeneous subsets of the data. Each algorithm operates on the same attributes.
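As a small illustration of the horizontal division just described, the sketch below splits a toy database by rows, so that every subset keeps all p attributes. The array `database`, its size and the number of partitions are hypothetical values chosen only for the example.

```python
import numpy as np

# Hypothetical toy database: N = 10 objects (rows), p = 3 attributes (columns).
database = np.arange(30, dtype=float).reshape(10, 3)

# Horizontal partitioning: split by rows, so each subset is homogeneous
# (it holds different objects but the same p attributes).
horizontal_parts = np.array_split(database, 3, axis=0)

for part in horizontal_parts:
    print(part.shape)
```

Stacking the parts back in their original order recovers the full database, which is why the order of objects matters when partial results are merged later.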
Another approach is to divide the database vertically, creating heterogeneous data subsets. In this case, each algorithm operates on the same records but handles different sets of attributes.

Recent work on distributed data clustering includes Forman and Zhang [3], who describe a technique to parallelize several algorithms in order to obtain greater efficiency in the data mining process of multiple distributed databases. The authors stress the need to reduce communication.

Several organizations maintain geographically distributed databases as a way of increasing the security of their information. In that way, if security policies fail, an intruder has access to only part of the existing information. Vaidya and Clifton [18] address vertically partitioned databases using a distributed K-means algorithm. Jagannathan et al. [11] present a variant of the K-means algorithm for clustering horizontally partitioned databases. Oliveira and Zaïane [9] proposed a spatial data transformation method, called RBT, for protecting attribute values when sharing data for clustering; the method is independent of any clustering algorithm.

In databases with a large number of attributes, another approach sometimes used is to perform the analysis considering only a subset of the attributes instead of all of them. An obvious difficulty of this approach is identifying which attributes are most important for cluster identification. Papers following this approach have frequently used statistical methods such as Principal Component Analysis (PCA) and Factor Analysis to treat this problem. Kargupta et al. [10] presented a PCA-based technique called Collective Principal Component Analysis (CPCA) for cluster analysis of high-dimensional heterogeneous distributed databases. The authors were concerned with reducing data transfer rates among distributed sites.

Other works consider partitioning the attributes into subsets while still using every subset in the data mining process. This is of particular interest for preserving all the characteristics present in the initial dataset. He et al. [12] analyzed the influence of data types on the clustering process and presented a strategy that divides the attributes into two subsets, one with the numerical attributes and the other with the categorical ones. They then propose to cluster each subset separately, using an algorithm appropriate to each type. The cluster results were combined in a new database, which was again submitted to a clustering algorithm for categorical data.

III. Self-Organizing Map
The self-organizing feature map (SOM) has been widely used as a tool for visualization of high-dimensional data. Important features include information compression while preserving the topological and metric relationships of the primary data items [14]. A SOM is composed of two layers of neurons, an input layer and an output layer. A neighborhood relation between neurons defines the topology of the map. Training is similar to neural competitive learning, except that not only the best matching unit (BMU, or c) is updated but also its neighbors. Each input is mapped to a BMU, the neuron whose weight vector is most similar to the presented data.

A number of methods for visualizing data relations in a trained SOM have been proposed [17], such as multiple views of component planes, mesh visualization using projections, and 2D and 3D surface plots of distance matrices. The U-matrix method [17] enables visualization of the topological relations of the neurons in an organized SOM. A gradient image (2D) or a surface plot is generated by computing distances between adjacent neurons. High values in the U-matrix encode dissimilarities between neurons and correspond to cluster borders. Strategies for cluster detection using the U-matrix were proposed by Costa and Netto [16]. The algorithms were developed for automatic partitioning and labeling of a trained SOM network. The result is a segmented SOM output with regions of neurons describing the data clusters.

IV. Proposed Methodology
Distributed clustering algorithms usually work in two stages. Initially, the data are analyzed locally, in each unit that is part of the distributed database. In a second stage, a central instance gathers the partial results and combines them into an overall result.

This section presents a strategy for clustering similar objects located in distributed databases, using parallel self-organizing maps and the K-means algorithm. The process is divided into three stages:
a) The traditional SOM algorithm is applied locally in each one of the distributed bases, in order to elect a representative subset of the input data;
b) The traditional SOM is applied again, this time to the representatives of each one of the distributed bases, which are unified in a central unit;
c) The K-means algorithm is applied over the trained self-organizing map to create a definitive result.

The proposed algorithm consists of six steps. Step 1 applies local clustering to each local dataset (horizontal partitions of the database) using the traditional SOM. Thus, the algorithm is applied to an attribute subset in each of the remote units, obtaining a reference vector from each data subset. This reference vector, known as the codebook, is the trained self-organizing map.

In step 2, the input data are projected onto the map obtained in the previous step, in each local unit. Each input is presented to the trained map, and the index of the closest vector (BMU) in the codebook is stored in an index vector. A data index is thus created based on representative objects instead of the original objects. Although they differ from the original dataset, the representative objects referenced by the index vector are very similar to the original data, since preservation of data topology is an important characteristic of the SOM.

In step 3, each remote unit sends its index vector and reference vector to the central unit, which is responsible for unifying all partial results. An additional advantage of the proposed algorithm is that the amount of transferred data is considerably reduced, since index vectors have only one column (containing an integer value) and the codebook is usually much smaller than the original data. Thus, the proposed algorithm addresses the reduction of data transfer and communication overhead.

Step 4 is responsible for receiving the index vector and the codebook from each local unit and combining the partial results to rebuild a database from the received data. To rebuild each dataset, the entries of the index vector are replaced by the corresponding vectors in the codebook. The datasets are combined by juxtaposing the partial datasets; however, it is important to ensure that objects are in the same order as in the original datasets. Note that the new database is slightly different from the original data, but the data topology is maintained.

In step 5, the SOM algorithm is applied again, this time over the complete database obtained in step 4. The expectation is that the results obtained in this step can be generalized as being equivalent to clustering the entire database. The data obtained after step 4, which serve as input to step 5, are close to the original values, because the corresponding codebook vectors are representatives of the input dataset.

In step 6, the K-means algorithm is applied over the final trained map in order to improve the quality of the results.
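The six steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the SOM is a small NumPy-only online version (4x4 grid, Gaussian neighborhood, linearly decaying learning rate), distances are Euclidean, K-means is plain Lloyd iteration, the local units hold horizontal partitions so the rebuilt datasets are stacked by rows, and all function names, parameters and the synthetic two-blob data are hypothetical.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM online and return its codebook."""
    rng = np.random.default_rng(seed)
    # Grid position of every neuron, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    # Initialise the weights from randomly drawn input patterns.
    codebook = data[rng.integers(0, len(data), rows * cols)].astype(float)
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            decay = 1.0 - step / total        # goes from 1 towards 0
            sigma = sigma0 * decay + 0.3      # shrinking neighborhood radius
            bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Move the BMU and its grid neighbors towards the input.
            codebook += lr0 * decay * h[:, None] * (x - codebook)
            step += 1
    return codebook

def nearest(points, centers):
    """Index of the closest center (the BMU, for a codebook) per point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd K-means; returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest(points, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return nearest(points, centers)

def local_stage(data, seed):
    """Steps 1-3: train a local SOM, project the data, return what is sent."""
    codebook = train_som(data, seed=seed)      # step 1
    index_vector = nearest(data, codebook)     # step 2
    return index_vector, codebook              # step 3 (the message)

def central_stage(messages, k):
    """Steps 4-6: rebuild the data, train a central SOM, cluster its neurons."""
    # Step 4: replace each index by its codebook vector; row-stacking is an
    # assumption that fits horizontal partitions (unit order is preserved).
    rebuilt = np.vstack([codebook[idx] for idx, codebook in messages])
    central_map = train_som(rebuilt, seed=123)             # step 5
    neuron_labels = kmeans(central_map, k)                 # step 6
    labels = neuron_labels[nearest(rebuilt, central_map)]  # label each object
    return labels, rebuilt

# Synthetic example: two blobs spread over two hypothetical local units.
rng = np.random.default_rng(42)
blob_a = rng.normal(0.0, 0.3, size=(60, 2))
blob_b = rng.normal(5.0, 0.3, size=(60, 2))
units = [np.vstack([blob_a[:30], blob_b[:30]]),
         np.vstack([blob_a[30:], blob_b[30:]])]
messages = [local_stage(data, seed=i) for i, data in enumerate(units)]
labels, rebuilt = central_stage(messages, k=2)
print(rebuilt.shape, np.unique(labels))
```

In this sketch each unit transmits 60 integers plus a 16-row codebook instead of its 60 original rows; the saving grows with the number of objects and attributes held by each unit. For attribute-wise (vertical) partitions, the rebuilt datasets would instead be joined column-wise, for example with `np.hstack`.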
V. Conclusions
The self-organizing map, an unsupervised neural network learning strategy, has been widely used in clustering applications. However, the SOM approach is normally applied to single, local datasets. Previous work introduced partSOM, an efficient SOM-based strategy to perform data clustering on geographically distributed databases. However, the SOM and partSOM approaches have some limitations in presenting results. In this work we combine the partSOM strategy with an alternative approach for cluster detection using the K-means algorithm.
