Working Session:
Information Retrieval Based Approaches in Software Evolution
Andrian Marcus¹, Andrea De Lucia², Jane Huffman Hayes³, Denys Poshyvanyk¹

¹ Department of Computer Science, Wayne State University, Detroit, MI 48202, 313 577 5408
² Dipart. di Matem. e Informatica, Università di Salerno, Via ponte don Melillo, 84084, Fisciano (SA), Italy, +39 089 963376
³ Department of Computer Science, University of Kentucky, 301 Rose Street, Lexington, KY 40506, 859 257 3171
Abstract

During software evolution a collection of related artifacts with different representations is created. Some of these are composed of structured data (e.g., analysis data), some contain semi-structured information (e.g., source code), and many include unstructured information (e.g., text). Research efforts exist that are trying to extract, represent, and analyze the unstructured information in software. Information retrieval (IR) techniques have been used quite successfully in recent years to represent and extract textual information from software artifacts, with application to many maintenance tasks.

This working session will focus on the state of the art in the application of IR-based techniques to support software maintenance activities. The session aims to identify the main research and practical issues in the field, to determine future work directions, and to foster collaborations among the participants.

1. Introduction and Rationale

Software comprises a multitude of artifacts; some of them are intended to be read by the compiler, while many others are intended to be read by developers. This is especially true during software evolution, when developers have to deal with large software, often written by others.

The user-centric information is often expressed in natural language and it is embedded in documentation and source code. This information is very important for the developers to understand a great deal of the why and what of the software system, as much as the source code is useful to understand the how of the software. Natural language external documentation (e.g., requirements, design documents, user manual, etc.), comments, and identifiers in the source code encode to a large degree the domain of the software and capture design decisions, change requests, developer information, etc. This unstructured information is referred to as semantic, as opposed to structural, which is expressed mainly by the source code and other data-intensive artifacts, such as analysis information.

The single developer/maintainer development model did not need to capture much of this information, as the working and long-term memory of the developer often sufficed to store it. Today, the increasing size and complexity of software require large development groups, often distributed geographically, so storing and sharing the semantic information has become necessary. More than that, given the large amount of it, tools are necessary for its storage, retrieval, and analysis, before it is delivered to the users.

2. State of the Art

In the past decade, researchers proposed information retrieval (IR) models to address these problems related to the semantic information in existing software. Early models were used to construct software libraries [13] and more recent work focused on specific software maintenance or development tasks such as:
• Traceability link recovery [1, 5, 8, 12, 15]
• Concept location [17, 19, 24]
• Software and web site modularization and reverse engineering [9, 10, 14, 21]
• Requirements engineering [3, 18]
• Software reuse [7, 13, 23]
• Impact analysis [2]
• Quality assessment and software measurement [11, 16, 20], etc.

These IR-based approaches to software engineering problems differ not only in their scope, but also in their underlying indexing mechanism, corpus construction, or data analysis method. A general model can be described with the following steps:
1. A corpus is created using the source code and other linguistic software artifacts, such as the external documentation. Various processing methods are employed in the corpus construction, some based on natural language processing techniques, such as word stemming. Each document in the corpus corresponds to a specific software element, such as a file, a class, or a method.
2. An IR method is used to index the corpus, such as vector space models [22], Latent Semantic Indexing [6], Bayes classifiers, or other probabilistic models [4], etc. A semantic space of the software system is created.
3. A similarity measure between the documents in the corpus is defined and similarities are computed among the corresponding software elements. These measures are commonly referred to as semantic similarities.
4. The semantic similarities are used to solve the maintenance or development task at hand. Some approaches combine these measures with additional data extracted with structural software analysis tools, such as dependencies, software change data, execution traces, test cases, etc.

3. Open Issues and Problems

The working session has several complementary goals. First, it aims at clearly defining the state of the art in the field, briefly described above. As the field grows, researchers and practitioners need to agree on a common terminology, as the current work by different groups is somewhat incoherent. We need to assess how far this field has come to date and how far it can go in the future.

In addition, we want to identify which issues are already answered by research and ready for practical applications and which are still open or unaddressed.

Several questions will be directly addressed during the working session and many more will be raised on the spot:
• How can we refine and improve the general model presented above? Does the model suit all current and future applications?
• Do certain IR methods suit specific software maintenance problems, or can we use any of them for any task?
• Is the field mature enough to talk about benchmarking?
• What new applications in software evolution exist for the IR-based approaches?
• What are the major practical problems with the current state of the art: efficiency, scalability, recall and precision, etc.? Are there specific problems associated with different IR methods?
• Who among the current researchers can collaborate on future projects?
• Is there available software produced by any research group? Can we initiate and maintain an open source effort in the area?
• How can we best integrate IR methods with other techniques for the analysis of unstructured information (e.g., natural language processing)? What is the trade-off?
• How can we bridge the work of the software maintenance community and other groups from areas like requirements engineering, programming languages, etc.?
• Is there a need for future, organized meetings like this working session?

4. Session Format

The working session will last 90 minutes and will consist of three parts.

It will start with short interactive presentations given by some of the participants, which will be solicited in advance and selected by the organizers. These presentations will focus on existing approaches and techniques.

Following these presentations, all the participants will take part in an open brainstorming session, which will focus on identifying open issues in the field, new challenges, etc. Questions will be asked and answers provided by the participants.

The final part will be devoted to recapitulating the unanswered items from the previous two parts and to building a roadmap for future events, research, and collaborations among the participants.

5. Expected Outcome of the Session

A website for the working session will be developed and maintained by the organizers. The discussions and presentations from the session will be summarized and publicized on the website and other appropriate venues.

We expect that this session will be the first in a succession of future events that will focus on this research area and will also include related fields.

6. References

[1] Antoniol, G., Canfora, G., Casazza, G., De Lucia, A., and Merlo, E., "Recovering Traceability Links between Code and Documentation", IEEE Transactions on Software Engineering, 28, 10, October 2002, pp. 970-983.

[2] Antoniol, G., Canfora, G., Casazza, G., and De Lucia, A., "Identifying the Starting Impact Set of a Maintenance Request: A Case Study", in Proceedings 4th European Conference on Software Maintenance and Reengineering (CSMR'00), Zurich, Switzerland, February 29 - March 03 2000, pp. 227-230.

[3] Cleland-Huang, J., Settimi, R., Duan, C., and Zou, X., "Utilizing Supporting Evidence to Improve Dynamic Requirements Traceability", in Proceedings International Requirements Engineering Conference (RE'05), Paris, France, 2005, pp. 135-144.

[4] Crestani, F., Lalmas, M., Van Rijsbergen, C. J., and Campbell, I., "Is this document relevant?…probably: a survey of probabilistic models in information retrieval", ACM Computing Surveys, 30, 4, 1998, pp. 528-552.

[5] De Lucia, A., Fasano, F., Oliveto, R., and Tortora, G., "Enhancing an Artefact Management System with Traceability Recovery Features", in Proceedings IEEE International Conference on Software Maintenance (ICSM'04), Chicago, IL, September 11-17 2004, pp. 306-315.

[6] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41, 1990, pp. 391-407.

[7] Frakes, W., "Software Reuse Through Information Retrieval", in Proceedings 20th Hawaii International Conference On System Sciences (HICSS'87), Kona, HI, January 1987, pp. 530-535.

[8] Hayes, J. H., Dekhtyar, A., and Sundaram, S. K., "Advancing Candidate Link Generation for Requirements Tracing: The Study of Methods", IEEE Transactions on Software Engineering, 32, 1, January 2006, pp. 4-19.

[9] Kawaguchi, S., Garg, P. K., Matsushita, M., and Inoue, K., "Mudablue: An automatic categorization system for open source repositories", in Proceedings the 11th Asia-Pacific Software Engineering Conference (APSEC'04), 2004, pp. 184-193.

[10] Kuhn, A., Ducasse, S., and Girba, T., "Enriching Reverse Engineering with Semantic Clustering", in Proceedings IEEE Working Conference On Reverse Engineering (WCRE'05), Pittsburgh, PA, November 8-11 2005, pp. 113-122.

[11] Lawrie, D., Feild, H., and Binkley, D., "Leveraged Quality Assessment Using Information Retrieval Techniques", in Proceedings 14th IEEE International Conference on Program Comprehension (ICPC'06), Athens, Greece, June 14-16 2006, pp. 149-158.

[12] Lormans, M. and Van Deursen, A., "Can LSI help Reconstructing Requirements Traceability in Design and Test?", in Proceedings 10th European Conference on Software Maintenance and Reengineering (CSMR'06), Bari, Italy, March 12 2006, pp. 47-56.

[13] Maarek, Y. S., Berry, D. M., and Kaiser, G. E., "An Information Retrieval Approach for Automatically Constructing Software Libraries", IEEE Transactions on Software Engineering, 17, 8, 1991, pp. 800-813.

[14] Maletic, J. I. and Marcus, A., "Supporting Program Comprehension Using Semantic and Structural Information", in Proceedings 23rd International Conference on Software Engineering (ICSE'01), Toronto, Ontario, Canada, May 12-19 2001, pp. 103-112.

[15] Marcus, A., Maletic, J. I., and Sergeyev, A., "Recovery of Traceability Links Between Software Documentation and Source Code", International Journal of Software Engineering and Knowledge Engineering, 15, 5, October 2005, pp. 811-836.

[16] Marcus, A. and Poshyvanyk, D., "The Conceptual Cohesion of Classes", in Proceedings IEEE International Conference on Software Maintenance (ICSM'05), Budapest, Hungary, September 25-30 2005, pp. 133-142.

[17] Marcus, A., Sergeyev, A., Rajlich, V., and Maletic, J., "An Information Retrieval Approach to Concept Location in Source Code", in Proceedings 11th IEEE Working Conference on Reverse Engineering (WCRE'04), Delft, The Netherlands, November 9-12 2004, pp. 214-223.

[18] och Dag, J. N., Gervasi, V., Brinkkemper, S., and Regnell, B., "A Linguistic-Engineering Approach to Large-Scale Requirements Management", IEEE Software, 22, 1, 2005, pp. 32-39.

[19] Poshyvanyk, D., Guéhéneuc, Y.-G., Marcus, A., Antoniol, G., and Rajlich, V., "Combining Probabilistic Ranking and Latent Semantic Indexing for Feature Identification", in Proceedings 14th IEEE International Conference on Program Comprehension (ICPC'06), Athens, Greece, June 14-16 2006, pp. 137-148.

[20] Poshyvanyk, D. and Marcus, A., "The Conceptual Coupling Metrics for Object-Oriented Systems", in Proceedings 22nd IEEE International Conference on Software Maintenance (ICSM'06), Philadelphia, PA, September 25-27 2006, to appear.

[21] Ricca, F., Tonella, P., Girardi, C., and Pianta, E., "An Empirical Study on Keyword-based Web Site Clustering", in Proceedings 12th IEEE International Workshop on Program Comprehension (IWPC'04), Bari, Italy, 2004, pp. 204-213.

[22] Salton, G. and McGill, M., Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

[23] Ye, Y. and Fischer, G., "Supporting Reuse by Delivering Task-Relevant and Personalized Information", in Proceedings IEEE/ACM International Conference on Software Engineering (ICSE'02), Orlando, FL, May 19-25 2002, pp. 513-523.

[24] Zhao, W., Zhang, L., Liu, Y., Sun, J., and Yang, F., "SNIAFL: Towards a Static Non-Interactive Approach to Feature Location", ACM Transactions on Software Engineering and Methodology, 2006, to appear.
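
As an illustration of the general model outlined in Section 2 (corpus construction, indexing, and similarity computation), the following is a minimal sketch and not the method of any cited approach. It assumes a plain vector space model with TF-IDF weighting and cosine similarity; the corpus contents, the element names (`Parser.parse`, `Parser.error`, `Logger.write`), and the helper functions are all hypothetical, and no stemming or stop-word removal is applied.

```python
import math
import re
from collections import Counter

# Step 1 (assumed toy corpus): one "document" per software element,
# built here from hypothetical comments; names are illustrative only.
corpus = {
    "Parser.parse": "parse the input file and build the syntax tree",
    "Parser.error": "report a syntax error found in the input file",
    "Logger.write": "write a log message to the output stream",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Step 2: index the corpus as sparse TF-IDF vectors (vector space model)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts.values() for term in c)
    # Assumed weighting: idf = log(N / df) + 1, so no term gets weight zero.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    return {
        name: {t: tf * idf[t] for t, tf in c.items()}
        for name, c in counts.items()
    }


def cosine(u, v):
    """Step 3: cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


index = build_index(corpus)
sim_parser = cosine(index["Parser.parse"], index["Parser.error"])
sim_logger = cosine(index["Parser.parse"], index["Logger.write"])
```

In this toy corpus the two parser elements share more vocabulary than the parser and logger elements, so `sim_parser` comes out larger than `sim_logger`. Step 4 would then use such similarities for a concrete task, for example ranking elements against a query in concept location.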