Object-Oriented Software Architecture Recovery Using A New Hybrid Clustering Algorithm
Abstract-To recover high-level software architecture from existing systems, this paper defines the Weighted Directed Class Graph (WDCG) to represent object-oriented software. The WDCG reflects not only static information about the lowest-level composition of the software but also dynamic information about the running software. A new hybrid clustering algorithm based on hierarchical clustering and partition clustering is proposed for recovering high-level software architecture from the WDCG. Four metrics are introduced to measure the effect of the new clustering algorithm on software architecture recovery. Experimental results show that our algorithm performs best in terms of software clustering quality, authoritativeness and extremity of cluster distribution.

Keywords-software architecture; clustering; WDCG;

I. INTRODUCTION

Software architecture acts as a shared mental model of a system expressed at a high level of abstraction [1], and it plays an important role in at least six aspects of software development: understanding, reuse, construction, evolution, analysis and management [2]. However, the original software architecture tends to deviate from the actual system because of software maintenance and software evolution [13]. In addition, many open source and other software systems lack their original documentation. Without a high-level software abstraction, software engineers spend much time on program understanding, because thousands of lines of source code are hard to comprehend. Moreover, without the software architecture, a software maintenance engineer is often forced to modify the source code without a thorough understanding of its organization [4]. It is therefore important to recover software architecture from existing systems.

Many approaches and techniques have been proposed in the literature to support software architecture recovery [3]. In semiautomatic and automatic software architecture recovery, clustering is commonly used [4-9]. Clustering analysis is not a new field, and it has been applied in many disciplines to discover similarities between artifacts. Recently, clustering analysis has also been applied in software engineering to discover patterns within data. The software clustering problem consists of finding a good-quality clustering of software modules based on the relationships among the modules. These relationships typically take the form of dependencies between modules. Unfortunately, previous studies on clustering algorithms for software architecture recovery have two shortcomings. First, most studies used a static dependency graph [4, 5, 6, 8] as the clustering data set, which does not represent the dynamic information of the running software. Second, the clustering algorithms used for recovering software architecture [8] were not sufficiently tailored to the nature of the data set and the features of the software.

In this paper, we extract the Weighted Directed Class Graph (WDCG) from existing systems and use the coupling between classes as the edge weights. The WDCG reflects not only static information about the software but also dynamic information about the running software. Based on the nature of the WDCG and the features of object-oriented software, we propose a new hybrid algorithm based on hierarchical clustering and partition clustering for software architecture recovery. Experimental results show that our approach is effective.

The paper is organized as follows. Section II presents related research and the problem description, together with the definition of the Weighted Directed Class Graph. Section III describes our hybrid clustering algorithm. Section IV presents the experimental results and analysis. Finally, we give the conclusions.

II. RELATED RESEARCH AND PROBLEM DESCRIPTION

Mancoridis et al. [4] extracted the file dependency graph from the source code and used a clustering algorithm based on a genetic algorithm to partition the graph in a way that derives the high-level subsystem structure from the component-level relationships. Mahdavi et al. [5] extracted weighted and non-weighted file dependency graphs from the source code and used a multiple hill climbing approach to implement software architecture recovery. Saeed et al. [6] extracted the function dependency graph with the Rigi tool and presented a new clustering algorithm, called the ‘combined’ algorithm, to implement software architecture recovery. Chiricota et al. [7] proposed a clustering algorithm based on graph theory to support component identification. Dietrich et al. [8] used the class dependency graph to represent programs and proposed using the Girvan-Newman clustering algorithm to compute the modular structure of programs. Pourhaji Kazem et al. [9] presented a new genetic algorithm for clustering the Weighted Module Dependency Graph. Bittencourt et al. [10] suggested that the k-means clustering algorithm performed best in terms of
Authorized licensed use limited to: Istinye Universitesi. Downloaded on February 26,2023 at 01:10:32 UTC from IEEE Xplore. Restrictions apply.
In the Weighted Directed Class Graph (WDCG), a larger edge weight indicates higher coupling between the two vertices that the edge connects, and such vertices should usually be partitioned into the same module. Based on this principle, we propose a new hybrid clustering algorithm that combines hierarchical clustering and partition clustering for software architecture recovery. The algorithm first uses hierarchical clustering to find the kernels of the clusters and then partitions the other vertices into the kernels. Fig. 3 shows the description of our hybrid clustering algorithm in detail.

Hybrid Clustering Algorithm

After the kernels of the clusters are formed, the other vertices are partitioned into the kernels according to the coupling between the vertices and the kernels. Fig. 5 shows a partition of software S.

Figure 5. A partition of software S
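
The two phases described above (forming kernels, then partitioning the remaining vertices) can be illustrated with a minimal Python sketch. It is not the algorithm of Fig. 3, only a sketch under stated assumptions: the WDCG is given as a dictionary of directed edge weights, coupling between two groups is taken as the total weight of the edges between them in both directions, and kernel formation stops when no pair of clusters reaches a user-chosen `threshold`. The function names `coupling`, `find_kernels` and `assign_rest`, the stopping rule and the linkage choice are all hypothetical.

```python
from itertools import combinations


def coupling(weights, group_a, group_b):
    """Total edge weight between two vertex groups, counting both directions."""
    return sum(weights.get((a, b), 0.0) + weights.get((b, a), 0.0)
               for a in group_a for b in group_b)


def find_kernels(vertices, weights, threshold):
    """Phase 1 (assumed rule): repeatedly merge the most strongly coupled pair
    of clusters until no pair reaches `threshold`; clusters with more than
    one vertex are returned as kernels."""
    clusters = [frozenset([v]) for v in vertices]
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: coupling(weights, *pair))
        if coupling(weights, a, b) < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return [set(c) for c in clusters if len(c) > 1]


def assign_rest(vertices, weights, kernels):
    """Phase 2: attach every vertex outside the kernels to the kernel
    it is most strongly coupled to (measured against the original kernels)."""
    if not kernels:
        # Assumption: without kernels, every vertex stays in its own module.
        return [{v} for v in vertices]
    modules = [set(k) for k in kernels]
    assigned = set().union(*kernels)
    for v in vertices:
        if v in assigned:
            continue
        best = max(range(len(kernels)),
                   key=lambda i: coupling(weights, [v], kernels[i]))
        modules[best].add(v)
    return modules
```

Calling `find_kernels` and then passing its result to `assign_rest` yields one module per kernel. Using total rather than average coupling favours larger clusters, so an average-linkage variant may be preferable for graphs with very uneven degree.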
(MST) and is motivated by the way human perception works. The idea of the algorithm is as follows: determine the minimum spanning tree of graph G and then remove the edges that are “unusually” large compared with their neighboring edges. These edges are called inconsistent, and they are expected to connect vertices from different clusters.

3) Comparison: We introduced four metrics, namely software clustering quality, authoritativeness, extremity of cluster distribution and stability [20], and compared the three algorithms for software architecture recovery on these four metrics.

B. Experimental results and analysis

1) Software clustering quality: We defined Software Clustering Quality (SCQ), with reference to Modularization Quality (MQ) [4], to measure the cohesion and coupling of modules.

Definition 8. Software S consists of K modules. Following MQ [4], SCQ is the average cohesion of the modules minus the average coupling between pairs of modules:

$$\mathrm{SCQ} = \frac{\sum_{i} \mathrm{cohesion}_{i}}{K} - \frac{\sum_{i,j} \mathrm{coupling}_{i,j}}{K(K-1)/2}, \tag{4}$$

where K > 1.

We calculated the SCQ value of the three algorithms for log4j 1.1.3. Fig. 6 shows the relation between the number of clusters and the SCQ value.

[Plot: SCQ (0 to 2) against number of clusters (0 to 12); series HPCA, HCA, MSTCA]
Figure 6. Relation between the number of clusters and SCQ

For one algorithm, we can determine the number of

$$\mathrm{MoJo}(A, B) = \min(\mathrm{mno}(A, B), \mathrm{mno}(B, A)). \tag{5}$$

Definition 10. Given that partition B is the authoritative partition, the similarity quality between partition A and partition B is defined as:

$$\mathrm{SimQua}(A, B) = \left(1 - \frac{\mathrm{MoJo}(A, B)}{n}\right) \times 100\%, \tag{6}$$

where n is the number of entities to be clustered.

In our experiment, the authoritative decompositions were produced by software engineers who are familiar with the three software systems in Table I. Fig. 7 shows the similarity quality between the authoritative decomposition and the clustering results for six versions.

[Plot: SimQua (0% to 100%) against version ID; one series per algorithm]
Figure 7. SimQua for each algorithm

Our algorithm has the best authoritativeness of the three algorithms.

3) Extremity of cluster distribution: Neither huge clusters nor singletons are usual in architectural components. Huge clusters would reduce the cohesion of the software, and singletons would increase its coupling. Wu et al. [20] proposed a measure called non-extreme distribution (NED), which is defined as:
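
A short Python sketch can make the idea behind NED concrete. It assumes the formulation usually attributed to [20], in which NED is the fraction of entities that belong to clusters whose size is neither too small nor too large. The function name `ned` and the default size bounds of 5 and 100 are assumptions made for illustration; they are not taken from this paper and should be adjusted to the system under study.

```python
def ned(cluster_sizes, min_size=5, max_size=100):
    """Fraction of entities that sit in non-extreme clusters.

    `cluster_sizes` lists the number of entities in each cluster.
    A cluster counts as non-extreme when min_size <= size <= max_size
    (both bounds are illustrative assumptions).
    """
    total = sum(cluster_sizes)
    if total == 0:
        return 0.0
    non_extreme = sum(s for s in cluster_sizes if min_size <= s <= max_size)
    return non_extreme / total
```

Under these assumptions, a partition made only of singletons gets a NED of 0, while a partition whose clusters all fall inside the bounds gets a NED of 1, which matches the intuition that higher values indicate a less extreme cluster distribution.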
forms singletons and huge clusters, so the NED value of MST is the smallest.

[Plot: NED value (0 to 1) against version ID; series HPCA, HCA, MSTCA]
Figure 8. NED value for each algorithm

4) Stability: A good algorithm should be stable enough to produce similar clusters when small changes happen, but it should still produce different clusters when architectural changes happen. To account for the similarity of two consecutive versions of the same software, we first obtained the partitions Pi and Pi+1 of the two versions, and then obtained the partitions Pi' and Pi+1' by deleting from Pi and Pi+1 the classes that differ between the two versions. We measured algorithm stability by comparing the similarity quality between Pi' and Pi+1'. Table II shows the similarity quality for two consecutive versions of the same software for the three algorithms.

TABLE II. RELATIVE STABILITY FOR EACH ALGORITHM

Software   Version            Similarity quality
                              HPCA      HCA       MSTCA
Log4j      1.04    1.1.3      90.67%    92.50%    100%
Jedit      2.3     2.4        85.12%    99.14%    100%
Junit      4.1     4.3.1      98.46%    100%      100%

V. CONCLUSION

In this paper, we defined the Weighted Directed Class Graph (WDCG) to represent existing systems. The WDCG reflects not only static information about the software but also dynamic information about the running software. Based on the nature of the WDCG and the features of object-oriented software, we proposed a new hybrid algorithm based on hierarchical clustering and partition clustering. We compared three algorithms for software architecture recovery on four metrics. Experimental results show that our algorithm performs best in terms of software clustering quality, authoritativeness and extremity of cluster distribution, and that it is effective for software architecture recovery.

ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China Grant 60873031, and Research Foundation of Huazhong University of Science and Technology (0125921001).

REFERENCES

[1] R. Holt, “Software architecture as a shared mental model,” Proc. Workshop on Software Architecture (ASERC 01), 2001.
[2] D. Garlan, “Software architecture: a roadmap,” Proc. The Future of Software Engineering (ICSE 00), ACM Press, 2000, pp. 91-101.
[3] S. Ducasse and D. Pollet, “Software Architecture Reconstruction: A Process-Oriented Taxonomy,” IEEE Transactions on Software Engineering, vol. 35, no. 4, pp. 573-591, 2009.
[4] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen, and E. Gansner, “Using automatic clustering to produce high-level system organizations of source code,” Proc. Workshop on Program Comprehension, IEEE Computer Society Press, 1998, pp. 45-52.
[5] K. Mahdavi, M. Harman, and R. M. Hierons, “A multiple hill climbing approach to software module clustering,” Proc. Software Maintenance (ICSM 03), IEEE Computer Society Press, 2003, pp. 315-324.
[6] M. Saeed, O. Maqbool, H.A. Babri, S.Z. Hassan and S.M. Sarwar, “Software Clustering Techniques and the Use of Combined Algorithm,” Proc. Software Maintenance and Reengineering (CSMR 03), IEEE Computer Society Press, 2003, pp. 301-306.
[7] Y. Chiricota, F. Jourdan and G. Melancon, “Software components capture using graph clustering,” Proc. Program Comprehension (IWPC 03), IEEE Computer Society Press, 2003, pp. 217-226.
[8] J. Dietrich, V. Yakovlev, C. McCartin, G. Jenson and M. Duchrow, “Cluster Analysis of Java Dependency Graphs,” Proc. Software Visualization (SOFTVIS 08), ACM Press, 2008, pp. 91-94.
[9] A.A. Pourhaji Kazem and S. Lotfi, “An Evolutionary Approach for Partitioning Weighted Module Dependency Graphs,” Proc. Innovations in Information Technology, IEEE Computer Society Press, 2007, pp. 252-256.
[10] R.A. Bittencourt and D.D. Serey Guerrero, “Comparison of Graph Clustering Algorithms for Recovering Software Architecture Module Views,” Proc. Software Maintenance and Reengineering (CSMR 09), IEEE Computer Society Press, 2009, pp. 251-254.
[11] J. Eder, G. Kappel, and M. Schrefl, “Coupling and Cohesion in Object-Oriented Systems,” Technical Report, Univ. of Klagenfurt, 1994.
[12] http://www.headwaysoftware.com/products/structure101/index.php.
[13] C. Riva, “View-Based Software Architecture Reconstruction,” PhD thesis, Technical Univ. of Vienna, 2004.
[14] J.K. Lee, S.J. Jung, S.D. Kim, W.H. Jang and D.H. Ham, “Component identification method with coupling and cohesion,” Proc. Software Engineering Conference (APSEC 01), IEEE Computer Society Press, 2001, pp. 79-86.
[15] http://logging.apache.org/log4j/1.2/index.html.
[16] http://www.jedit.org/.
[17] http://www.junit.org/.
[18] O. Maqbool and H.A. Babri, “Hierarchical Clustering for Software Architecture Recovery,” IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 759-780, 2007.
[19] O. Grygorash, Y. Zhou and Z. Jorgensen, “Minimum Spanning Tree Based Clustering Algorithms,” Proc. Tools with Artificial Intelligence (ICTAI 06), IEEE Computer Society Press, 2006, pp. 73-81.
[20] J. Wu, A.E. Hassan, and R.C. Holt, “Comparison of Clustering Algorithms in the Context of Software Evolution,” Proc. Software Maintenance (ICSM 05), IEEE Computer Society Press, 2005, pp. 525-535.
[21] V. Tzerpos and R.C. Holt, “MoJo: A distance metric for software clusterings,” Proc. Reverse Engineering (WCRE 99), IEEE Computer Society Press, 1999, pp. 187-193.