Mining Textual Data for Software Engineering Tasks
Latifa Guerrouj
McGill University
3661 Peel St., Canada H3A 1X1
Mobile: (+1) 514-791-0085
Email: [email protected]
Web: http://latifaguerrouj.ca/

Benjamin C. M. Fung
McGill University
3661 Peel St., Canada H3A 1X1
Phone: (+1) 514-398-3360
Fax: (+1) 514-398-7193
Email: [email protected]
Web: http://dmas.lab.mcgill.ca/fung/index.htm

David Lo
Singapore Management University
80 Stamford Road
Singapore 178902
Email: [email protected]
Web: http://www.mysmu.edu/faculty/davidlo/

Foutse Khomh
École Polytechnique de Montréal
2500, chemin de la Polytechnique, Montréal (Québec) H3T 1J4
Phone: (+1) 514-340-4711
Fax: (+1) 514-340-5139
Email: [email protected]
Web: http://khomh.net/

Abdelwahab Hamou-Lhadj
Concordia University
515 St. Catherine, West
Montréal, H3G 2W1 Canada
Phone: (+1) 514-848-2424 ext 7949
Email: [email protected]
Web: http://users.encs.concordia.ca/ abdelw/index.html

Abstract—Software development artifacts produced during the development process are of different types. Some are structured, such as the source code and execution traces, while others are unstructured, like source code comments, identifiers, bug reports, usage logs, etc. Such data embeds significant knowledge about software projects that can help software developers make technical and business decisions.

While the focus has been extensively on source code in the past, researchers have recently investigated the textual information (e.g., identifiers and comments) contained in software artifacts or informal documentation (e.g., StackOverflow, email threads, change logs, bug reports, etc.) about the software systems. Automatic techniques and tools have been developed to generate and/or mine unstructured data to gain insight about the software development process or assist development teams in tasks like software traceability, feature/concept location, source code vocabulary normalization, bug localization, and summarization.

The tutorial will start with an introduction of textual information in source code and/or documentation. Next, we will present automatic techniques and tools to generate and mine unstructured data and discuss related challenges. We will also present examples of major software engineering tasks making use of unstructured data mining along with scenarios of their application and the most recent contributions relevant to each task. Specifically, we will focus on automatic source code vocabulary normalization, summarization, and crash report analysis for fault localisation. Finally, we will discuss with the audience the successes and failures in achieving the full potential of such tasks in a software development context as well as possible improvements and research directions. The tutorial will provide novices with a common framework about major software engineering tasks leveraging textual information, while for experts, the tutorial can be an interesting opportunity to discuss challenges, document the state of the art and practice, encourage cross-fertilization across various research areas ranging from mining software repositories to natural language processing and text retrieval, and establish foreseeable collaborations between researchers.

I. MOTIVATION

Software development project knowledge is grounded in rich data. For example, source code, check-ins, bug reports, work items, and test executions are recorded in software repositories such as version control systems (Git, Subversion, Mercurial, CVS) and issue-tracking systems (Bugzilla, JIRA, Trac), and the information about user experiences of interacting with software is typically stored in log files or informal documentation such as StackOverflow.

While there has been extensive research on static analysis of source code, recent studies have exploited textual information used in source code of software systems or trapped in informal documentation (e.g., email threads, StackOverflow posts, etc.). The purpose is to develop automatic software engineering techniques, gain insights into and understand software projects, and support the decision-making process.

Major software engineering tasks have leveraged textual information. For example, in the context of software traceability, researchers have made use of textual information to trace code to documents (e.g., requirements) [1], [2]; they also suggested lightweight techniques of linking code to documentation such as email threads [3] and StackOverflow [4], as well as tracing code examples to documentation [5]. Textual information has also been exploited in feature/concept location [6], [7], [8], source code vocabulary normalization [9], [10], and summarization of complex artifacts involving release notes [11], StackOverflow [12], and bug reports [13]. Such approaches have been developed with the aim of guiding developers and practitioners towards a better understanding of their software projects and the way they evolve.
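To make two of these tasks concrete, the following is a minimal sketch and not a reimplementation of any cited approach. It normalizes source code vocabulary by splitting identifiers and expanding abbreviations, then ranks source files against a bug report by TF-IDF cosine similarity, a standard text retrieval baseline. The abbreviation dictionary, file names, and bug report text are hypothetical, and the cited tools rely on considerably more elaborate techniques.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation dictionary, used only for this illustration.
ABBREVIATIONS = {"msg": "message", "cnt": "count", "usr": "user", "btn": "button"}


def split_identifier(identifier):
    """Split an identifier on underscores, digits, and camelCase boundaries."""
    words = []
    for part in re.split(r"[_\W\d]+", identifier):
        # Match acronyms (uppercase runs), capitalized words, or lowercase runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def normalize(text):
    """Map raw text (code or prose) to a list of expanded, lowercase terms."""
    terms = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        for word in split_identifier(token):
            terms.append(ABBREVIATIONS.get(word, word))
    return terms


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        # Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, artifacts):
    """Rank artifact names by textual similarity to the query, best first."""
    names = list(artifacts)
    docs = [normalize(artifacts[name]) for name in names]
    # The query is included in the corpus so that it shares the same IDF weights.
    vectors = tfidf_vectors(docs + [normalize(query)])
    query_vec = vectors[-1]
    scores = {name: cosine(query_vec, vec) for name, vec in zip(names, vectors)}
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Hypothetical source files and bug report.
artifacts = {
    "Mailer.java": "class Mailer { void sendUsrMsg(String msg) { } int msgCnt; }",
    "LoginForm.java": "class LoginForm { void onLoginBtnClick() { } String usrName; }",
}
report = "Message count is wrong after sending a message to a user"
print(rank(report, artifacts))
```

Without the normalization step, the report's words "message", "count", and "user" would not match identifiers such as `msgCnt` or `sendUsrMsg` at all; with it, the mailer file ranks above the login form for this report.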
While solutions provided for these engineering tasks demonstrated promising results, there are many challenges left concerned with mining textual information, using it in the development of the above-mentioned tasks, as well as integrating and adopting such solutions into software development processes.

The goals of this tutorial are to discuss the use of textual information, its related challenges and open questions, tools and techniques for mining such data, as well as ways of integrating and exploiting them in major software engineering tasks to fully reap their benefits.

We invite both novices and experts to this tutorial, which will be an opportunity to share tools, techniques, and experiences in the field. We also plan, after the presentation of the tutorial, to disseminate the presented research by opening up a discussion and involving participants in sharing their opinions. We invite researchers and practitioners interested in improving, integrating, and adopting the use and mining of textual information in their software engineering tools and thus software development and maintenance activities. The tutorial encourages both academic researchers and industrial practitioners to exchange ideas and collaborate.

II. TOPICS

The tutorial will focus on the presentation of recent techniques and tools used to generate and mine textual information as well as software engineering tasks making use of such rich data.

The tutorial will explain, present, and discuss the following:
1) Textual information in source code and informal documentation;
2) Benefits of using textual information in software engineering tasks;
3) Recent tools and techniques used to generate and mine textual information;
4) Challenges related to mining textual information;
5) Major software engineering tasks using textual information;
6) Source code vocabulary normalization and how it makes use of textual information, along with recent automatic approaches;
7) Summarization of software artifacts, with recent automatic approaches in this area;
8) Bug localization, how it makes use of textual information, and how the instructors could improve it by leveraging text in crash reports;
9) Open research challenges and possible solutions.

III. PRESENTERS’ EXPERIENCE IN THE AREA AND TOPICS OF THEIR PRESENTATIONS

Latifa Guerrouj performed her past studies on context-aware source code vocabulary normalization. Vocabulary normalization aligns the vocabulary found in the source code with that found in other software artifacts (e.g., test cases, requirements, specifications, design, etc.). Latifa developed automatic context-aware source code vocabulary normalization approaches by mining textual information in source code [14], [15], [16], [17], [18]. She also investigated the use of normalization in the context of feature location using textual information and dynamic analysis [19]. Recently, she suggested a new approach summarizing Android API classes and methods discussed in StackOverflow using n-gram language models and machine learning techniques [12]. Latifa is the co-organizer of the International Workshop on Software Analytics (SWAN’15). In this tutorial, she will focus on how text found in source code or informal documentation can be mined and exploited in the context of engineering tasks, namely source code vocabulary normalization and summarization of software artifacts.

David Lo’s research work focuses on software engineering and data mining. He investigates how techniques from these two research areas could benefit and complement each other. In the software engineering area, his research includes software specification mining/protocol inference, mining software repositories, program analysis, software testing, and automated debugging. Technique-wise, he investigates a composition of techniques including static analysis, dynamic analysis, data mining, information retrieval, and natural language processing. In the data mining area, he works on frequent pattern mining, discriminative pattern mining, and social network mining. David contributed to the analysis of software text with the aim of aiding software developers in performing their various tasks. Examples of his works relevant to this tutorial involve enhanced techniques making use of text and version history for bug localization [20], a large-scale investigation of issue trackers from GitHub [21], accurate information retrieval-based bug localization based on bug reports [22], interactive fault localization leveraging simple user feedback [23], and automatic duplicate bug report detection with a combination of information retrieval and topic modeling [24]. David is the co-organizer of the first International Workshop on Machine Learning and Information Retrieval for Software Evolution (MALIR-SE), co-located with ASE 2013. In this tutorial, David will focus on techniques for mining text and its use for bug localization.

Foutse Khomh leads the SoftWare Analytics and Technologies (SWAT) Lab, which applies analytic techniques to empower development teams with insightful and actionable information about their activities. The SWAT team also builds tools to assess and improve the quality of software systems. Early models and tools proposed by SWAT members are already being used in the industry. Among Foutse’s research works related to this tutorial, we note those on challenges and issues of mining crash reports [25], tracing back the history of commits in low-tech reviewing environments [26], supplementary bug fixes vs. re-opened bugs [27], improving bug localization using correlations in crash reports [28], classifying field crash reports for fixing bugs: a case study of Mozilla Firefox [29], and a text-based approach to classify change requests [30]. Foutse co-founded the International Workshop on Release
Engineering (RELENG) in 2013 and has been co-organizing it since then. In this tutorial, Foutse will show his recent work on using crash reports for the improvement of bug localization and identifying highly impactful bugs.

IV. GOALS AND EXPECTED RESULTS

This tutorial targets both novices and experts working in the field of software maintenance and evolution who are interested in the analysis of software text, its mining, and its practical use in the context of software engineering tasks. For experts, it will provide an informal interactive forum to exchange ideas and experiences, streamline research making use of textual information, identify some common ground of their work, and share lessons and challenges, thereby articulating a vision for the future of software engineering.

The intended outcomes of this tutorial are:
1) Make clear (for novices) what textual information is and the techniques for mining it;
2) Explore the different contemporary software engineering techniques making use of textual data;
3) Stimulate discussions, interest, and understanding in integrating textual information in software engineering tasks and the software development process;
4) Bridge the gap between theory and practice by bringing together researchers and practitioners interested in analysing software text for software engineering tasks;
5) Discuss challenges, experiences, and lessons, and explore the different possible strategies to overcome the challenges faced and move towards promising solutions to essential problems;
6) Build a common framework of major automatic approaches making use of textual information;
7) Advance the state of the art and practice in software engineering.

V. OUTLINE

1) Introduction to software text and tools to generate and mine such data, by David Lo.
2) Exploration of major software engineering tasks making use of textual data, by Foutse Khomh.
3) Presentation of source code vocabulary normalization along with examples of recently published automatic source code vocabulary normalization approaches, by Latifa Guerrouj.
4) Presentation of summarization of software artifacts along with examples of recently published automatic summarization approaches, by Latifa Guerrouj.
5) Presentation of bug localization with examples of the most recent automatic approaches in this area, by David Lo.
6) Exploration of recent ways to improve bug localization using crash reports and to identify impactful bugs, by Foutse Khomh.
7) Summary and recap of the tutorial, by David Lo, Latifa Guerrouj, and Foutse Khomh.

VI. TARGET AUDIENCE

This tutorial is intended for both novices and experts, academics and industrial practitioners. It will provide participants with an understanding of software text, techniques to mine it from source code or documentation, and ways of adopting and integrating it in major engineering tasks. Additionally, novices will be able to understand engineering tasks such as vocabulary normalization, bug localization, and summarization and how they exploit textual data to fully reap its benefits. The tutorial will show scenarios of the presented approaches and how they can help to guide developers during their tasks as well as to improve software maintenance and evolution.

We will also discuss the limitations and challenges of the most recent related techniques and how these issues can be addressed and mitigated.

Participants are encouraged to talk about their recent works related to the tutorial (if any) and share their experiences and the major challenges they faced. Experts will be there to guide them and provide them with feedback.

VII. FORMAT

We propose a 2-hour tutorial. The first hour will be dedicated to 1) an introduction of textual data by the presenters, 2) major software engineering tasks leveraging such data, 3) concrete examples of recent automatic approaches on source code vocabulary normalization and summarization, and 4) related discussions by participants. The second hour will be devoted to 5) bug localization, 6) its enhancement using crash reports as well as ways of identifying impactful bugs, 7) discussion by participants, and 8) summary and recap.

We encourage discussions so as to develop an in-depth understanding of the presented topics for novices. Experts are invited to enrich the discussions by providing opinions and moderating a discussion on the state of the art and state of the practice of software engineering tasks making use of textual data.

VIII. ACKNOWLEDGEMENT

Special thanks to Giuliano Antoniol and Massimiliano Di Penta for all their valuable feedback on this tutorial.
IX. CONTRIBUTORS’ BIOGRAPHY

Latifa Guerrouj is a Postdoctoral Research Fellow at McGill University, Canada. She received her Ph.D. from the Department of Computing and Software Engineering (DGIGL) of École Polytechnique de Montréal, Canada. Her research interests involve empirical software engineering, software analytics, data mining, and big data software engineering. Latifa is serving as an organizing and program committee member for several international conferences and workshops including ICSME’16, ICSME’15, SANER’15, SWAN’15, ICSM’14, SCAM’14, MSR’14/13, WCRE’13/12, ICST’12, and MUD’12/13. She is a member of ACM and IEEE.

Benjamin C. M. Fung is an Associate Professor of Information Studies (SIS) at McGill University and a Research Scientist in the National Cyber-Forensics and Training Alliance Canada (NCFTA Canada). He received a Ph.D. degree in computing science from Simon Fraser University in 2007. Dr. Fung has over 80 refereed publications that span the prestigious research forums of data mining, privacy protection, cyber forensics, services computing, and building engineering. His data mining works in crime investigation and authorship analysis have been reported by media worldwide. His research has been supported in part by the Discovery Grants and Strategic Project Grants from the Natural Sciences and Engineering Research Council of Canada (NSERC), Insight Development Grants from the Social Sciences and Humanities Research Council (SSHRC), Defence Research and Development Canada (DRDC), Fonds de recherche du Québec - Nature et technologies (FRQNT), and NCFTA Canada. Dr. Fung is a licensed professional engineer in software engineering, and is currently affiliated with the Data Mining and Security Lab at SIS.

David Lo is an Assistant Professor in the School of Information Systems at Singapore Management University. He received his PhD in computer science from the School of Computing, National University of Singapore, in 2008. Before that, he studied at the School of Computer Engineering, Nanyang Technological University, and graduated with a B.Eng (Hons I) in 2004. David works at the intersection of software engineering and data mining. His research interests include dynamic program analysis, specification mining, and pattern mining. He is a member of the IEEE and the ACM.

Foutse Khomh is an Assistant Professor at the École Polytechnique de Montréal, where he heads the SWAT Lab on software analytics and cloud engineering research (http://swat.polymtl.ca/). Prior to this position he was a Research Fellow at Queen’s University (Canada), working with the Software Reengineering Research Group and the NSERC/RIM Industrial Research Chair in Software Engineering of Ultra Large Scale Systems. He received his Ph.D. in Software Engineering from the University of Montreal in 2010, under the supervision of Yann-Gaël Guéhéneuc. His main research interest is in the field of empirical software engineering, with an emphasis on developing techniques and tools to improve software quality. Over the years, he has applied many text mining techniques to solve multiple software engineering problems. He co-founded the International Workshop on Release Engineering (http://releng.polymtl.ca) and was one of the editors of the first special issue on Release Engineering in the IEEE Software magazine.

Abdelwahab Hamou-Lhadj is a tenured Associate Professor in ECE, Concordia University. His research interests include software modeling, software behavior analysis, software maintenance and evolution, and anomaly detection systems. He holds a Ph.D. degree in Computer Science from the University of Ottawa (2005). He is a Licensed Professional Engineer in Quebec, and a long-standing member of IEEE and ACM.
REFERENCES

[1] N. Ali, Y.-G. Guéhéneuc, and G. Antoniol, “Trustrace: Mining software repositories to improve the accuracy of requirement traceability links,” IEEE Transactions on Software Engineering, vol. 39, no. 5, pp. 725–741, 2013.
[2] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo, “Recovering traceability links between code and documentation,” IEEE Transactions on Software Engineering, vol. 28, no. 10, pp. 970–983, 2002.
[3] A. Bacchelli, M. Lanza, and R. Robbes, “Linking e-mails and source code artifacts,” in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, 2010, pp. 375–384.
[4] P. C. Rigby and M. P. Robillard, “Discovering essential code elements in informal documentation,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13, 2013, pp. 832–841.
[5] S. Subramanian, L. Inozemtseva, and R. Holmes, “Live API documentation,” in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014, 2014, pp. 643–652.
[6] D. Liu, A. Marcus, D. Poshyvanyk, and V. Rajlich, “Feature location via information retrieval based filtering of a single scenario execution trace,” in ASE’07, 2007, pp. 234–243.
[7] D. Poshyvanyk, Y.-G. Guéhéneuc, A. Marcus, G. Antoniol, and V. Rajlich, “Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval,” IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 420–432, 2007.
[8] T. Eisenbarth, R. Koschke, and D. Simon, “Locating features in source code,” IEEE Transactions on Software Engineering, pp. 210–224, March 2003.
[9] L. Guerrouj, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, “Tidier: an identifier splitting approach using speech recognition techniques,” Journal of Software: Evolution and Process, pp. 575–599, 2013.
[10] E. Enslen, E. Hill, L. L. Pollock, and K. Vijay-Shanker, “Mining source code to automatically split identifiers for software analysis,” in Proceedings of the 6th International Working Conference on Mining Software Repositories, 2009, pp. 71–80.
[11] L. Moreno, G. Bavota, M. Di Penta, R. Oliveto, and A. Marcus, “How can I use this method,” in Proceedings of the 37th International Conference on Software Engineering, ser. ICSE 2015, 2015.
[12] L. Guerrouj, D. Bourque, and P. Rigby, “Leveraging informal documentation to summarize classes and methods in context,” in Proceedings of the 37th International Conference on Software Engineering, ser. ICSE 2015, 2015.
[13] S. Rastkar, G. C. Murphy, and G. Murray, “Summarizing software artifacts: a case study of bug reports.” ACM, 2010, pp. 505–514.
[14] L. Guerrouj, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, “An experimental investigation on the effects of context on source code identifiers splitting and expansion,” Empirical Software Engineering, vol. 19, no. 6, pp. 1706–1753, 2014.
[15] L. Guerrouj, M. Di Penta, G. Antoniol, and Y.-G. Guéhéneuc, “Tidier: An identifier splitting approach using speech recognition techniques,” Journal of Software Maintenance - Research and Practice, p. 31, 2011.
[16] L. Guerrouj, “Normalizing source code vocabulary to support program comprehension and software quality,” in Proceedings of the 2013 International Conference on Software Engineering, 2013, pp. 1385–1388.
[17] L. Guerrouj, P. Galinier, Y.-G. Guéhéneuc, G. Antoniol, and M. Di Penta, “Tris: a fast and accurate identifiers splitting and expansion algorithm,” in Proc. of the International Working Conference on Reverse Engineering (WCRE’12), 2012, pp. 103–112.
[18] N. Madani, L. Guerrouj, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, “Recognizing words from source code identifiers using speech recognition techniques,” in Proceedings of the 14th European Conference on Software Maintenance and Reengineering (CSMR 2010), March 15-18 2010, Madrid, Spain. IEEE CS Press, 2010.
[19] B. Dit, L. Guerrouj, D. Poshyvanyk, and G. Antoniol, “Can better identifier splitting techniques help feature location?” in Proc. of the International Conference on Program Comprehension (ICPC), Kingston, 2011, pp. 11–20.
[20] S. Wang and D. Lo, “Version history, similar report, and structure: Putting them together for improved bug localization,” in Proceedings of the 22nd International Conference on Program Comprehension. ACM, 2014, pp. 53–63.
[21] T. F. Bissyandé, D. Lo, L. Jiang, L. Réveillère, J. Klein, and Y. L. Traon, “Got issues? Who cares about it? A large scale investigation of issue trackers from GitHub.” IEEE, 2013, pp. 188–197.
[22] J. Zhou, H. Zhang, and D. Lo, “Where should the bugs be fixed? - More accurate information retrieval-based bug localization based on bug reports,” in Proceedings of the 34th International Conference on Software Engineering, 2012, pp. 14–24.
[23] L. Gong, D. Lo, L. Jiang, and H. Zhang, “Interactive fault localization leveraging simple user feedback.” IEEE Computer Society, 2012, pp. 67–76.
[24] A. T. Nguyen, T. T. Nguyen, T. N. Nguyen, D. Lo, and C. Sun, “Duplicate bug report detection with a combination of information retrieval and topic modeling,” in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, 2012, pp. 70–79.
[25] L. An and F. Khomh, “Challenges and issues of mining crash reports,” in 1st IEEE International Workshop on Software Analytics, SWAN 2015, Montreal, QC, Canada, March 2, 2015, 2015, pp. 5–8.
[26] Y. Jiang, B. Adams, F. Khomh, and D. M. German, “Tracing back the history of commits in low-tech reviewing environments,” in Proceedings of the 8th International Symposium on Empirical Software Engineering and Measurement (ESEM), Torino, Italy, September 2014.
[27] L. An, F. Khomh, and B. Adams, “Supplementary Bug Fixes vs. Re-opened Bugs.” IEEE Computer Society, 2014, pp. 205–214.
[28] S. Wang, F. Khomh, and Y. Zou, “Improving bug localization using correlations in crash reports,” in MSR, pp. 247–256.
[29] T. Dhaliwal, F. Khomh, and Y. Zou, “Classifying field crash reports for fixing bugs: A case study of Mozilla Firefox.” in ICSM. IEEE, 2011, pp. 333–342.
[30] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G. Guéhéneuc, “Is it a bug or an enhancement?: A text-based approach to classify change requests,” in Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, 2008, pp. 23:304–23:318.
