Applications of data mining in software engineering
Quinn Taylor*
Department of Computer Science,
Brigham Young University,
Provo, UT 84602, USA
E-mail: [email protected]
*Corresponding author
Christophe Giraud-Carrier
Department of Computer Science,
Brigham Young University,
Provo, UT 84602, USA
E-mail: [email protected]
Copyright © 2010 Inderscience Enterprises Ltd.
1 Introduction
Software systems are inherently complex and difficult to conceptualise. This complexity,
compounded by intricate dependencies and disparate programming paradigms, slows
development and maintenance activities, leads to faults and defects, and ultimately
increases the cost of software. Most software development organisations develop some
sort of processes to manage software development activities. However, as in most other
areas of business, software processes are often based only on hunches or anecdotal
experience, rather than on empirical data.
Consequently, many organisations are ‘flying blind’ without fully understanding the
impact of their process on the quality of the software that they produce. This is generally
not due to apathy about quality, but rather to the difficulty inherent in discovery and
measurement. Software quality is not simply a function of lines of code, bug count,
number of developers, man-hours, money or previous experience – although it involves
all those things – and it is never the same for any two organisations.
Software metrics have long been a standard tool for assessing quality of software
systems and the processes that produce them. However, there are pitfalls associated
with the use of metrics. Managers often rely on metrics that they can easily obtain
and understand, which may be worse than using no metrics at all. Metrics can
seem interesting, yet be uninformative, irrelevant, invalid or not actionable. Truly
valuable metrics may be unavailable or difficult to obtain. Metrics can be difficult to
conceptualise and changes in metrics can appear unrelated to changes in process.
At the same time, software engineering activities generate a vast amount of data that, if
harnessed properly through data mining techniques, can help provide insight into many
parts of software development processes. Although many processes are domain- and
organisation-specific, there are many common tasks which can benefit from such
insight, and many common types of data which can be mined. Our purpose here is to
bring software engineering to the attention of our community as an attractive testbed
for data mining applications and to show how data mining can significantly contribute
to software engineering research.
The paper is organised as follows. In Section 2, we briefly discuss related work,
pointing to surveys and venues dedicated to recent applications of data mining to
software engineering. Section 3 describes the sources of software data available for
mining and Section 4 provides a brief, but broad, survey of current practices in this
domain. Section 5 discusses issues specific to mining software engineering data and
prerequisites for success. Finally, Section 6 concludes the paper.
2 Related work
Over the years, Xie (2010) has compiled and maintained an (almost exhaustive)
online bibliography on mining software engineering data. He
also presented tutorials on that subject at the International Conference on Knowledge
Discovery in Databases in 2006 and at the International Conference on Software
Engineering in 2007, 2008 and 2009 (e.g., see Xie et al., 2007). Many of the
publications we cite here are also included in Xie’s bibliography and tutorials.
The Mining Software Repositories (MSR) Workshop, co-located with the
International Conference on Software Engineering, was originally established in 2004.
Papers published in MSR focus on many of the same issues we have discussed in
this survey and the goal of the workshops is to increase understanding of software
development practices through data mining. Beyond tools and applications, topics
include assessment of mining quality, models and meta-models, exchange formats,
replicability and reusability, data integration and visualisation techniques.
Finally, Kagdi et al. (2007) have recently published a comprehensive survey of
approaches for MSR in the context of software evolution. Although their survey is
narrower in scope than the overview given here, it has greater depth of analysis, presents
a detailed taxonomy of software evolution data mining methodologies and identifies a
number of related research issues that require further investigation.
3 Sources of software engineering data

The first step in the knowledge discovery process is to gain understanding about the
data that is available and the business goals that drive the process. This is essential for
software engineering data mining endeavours, because unavailability of data for mining
is a factor that limits the questions which can be effectively answered.
In this section, we describe software engineering data that are available for data
mining and analysis. Current software development processes involve several types of
resources from which software-related artefacts can be obtained. Software ‘artefacts’
are a product of software development processes. Artefacts are generally lossy and thus
cannot provide a full history or context, but they can help piece together understanding
and provide further insight. There are many data sources in software engineering. In
this paper, we focus only on four major groups and describe how they may be used for
mining software engineering data.
First, the vast majority of collaborative software development organisations utilise
revision control software1 (e.g., CVS, Subversion, Git, etc.) to manage the ongoing
development of digital assets that may be worked on by a team of people. Such systems
maintain a historical record of each revision and allow users to access and revert to
previous versions. By extension, this provides a way to analyse historical artefacts
produced during software development, such as the number of lines written, the authors
who wrote particular lines, or any number of common software metrics.
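As a concrete illustration, a per-author churn metric of the kind described above might be computed from extracted revision history as follows. This is a minimal sketch: the record format and field names are hypothetical, not those of any particular revision control tool.

```python
from collections import defaultdict

# Hypothetical, simplified commit records, as one might extract them from
# a revision control history (e.g., `git log --numstat` output).
commits = [
    {"author": "alice", "added": 120, "removed": 15},
    {"author": "bob",   "added": 30,  "removed": 2},
    {"author": "alice", "added": 45,  "removed": 60},
]

def churn_by_author(commits):
    """Aggregate a simple churn metric (lines added + removed) per author."""
    churn = defaultdict(int)
    for c in commits:
        churn[c["author"]] += c["added"] + c["removed"]
    return dict(churn)
```

Real extraction pipelines would, of course, parse the tool's native log format and track files and timestamps as well.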
Second, most large organisations (and many smaller ones) also use a system for
tracking software defects. Bug tracking software (such as Bugzilla, JIRA, FogBugz,
etc.) associates bugs with meta-information (status, assignee, comments, dates and
milestones, etc.) that can be mined to discover patterns in software development
processes, including the time-to-fix, defect-prone components, problematic authors, etc.
Some bug trackers are able to correlate defects with source code in a revision system.
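For instance, a time-to-fix measure can be derived directly from the dates a tracker stores. The record layout below is illustrative only, not the schema of Bugzilla or any other tracker.

```python
from datetime import datetime

# Illustrative bug records with the kind of meta-information a tracker stores.
bugs = [
    {"id": 1, "opened": "2010-01-04", "closed": "2010-01-11", "component": "ui"},
    {"id": 2, "opened": "2010-01-05", "closed": "2010-02-04", "component": "core"},
]

def time_to_fix_days(bug):
    """Days elapsed between a bug being opened and being closed."""
    fmt = "%Y-%m-%d"
    opened = datetime.strptime(bug["opened"], fmt)
    closed = datetime.strptime(bug["closed"], fmt)
    return (closed - opened).days
```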
Third, virtually all software development teams use some form of electronic
communication (e-mail, instant messaging, etc.) as part of collaborative development
(communication in small teams may be primarily or exclusively verbal, but such cases
are inconsequential from a data mining perspective). Text mining techniques can be
applied to archives of such communication to gain insight into development processes,
bugs and design decisions.
Fourth, software documentation and knowledge bases can be mined to provide
further insight into software development processes. This approach is useful to
organisations that use the same processes across multiple projects and want to examine
a process in terms of overall effectiveness or fitness for a given project. Although
knowledge bases may contain source code, this approach focuses primarily on retrieval
of information from natural language text.
4 Survey of current practices

In this section, we discuss several data mining techniques and provide examples of ways
they have been applied to software engineering data. Many of these techniques may
be applied to software process improvement. We attempt to emphasise innovative and
promising approaches and how they can benefit software organisations.
4.1.1 Association rules

Śliwerski et al. (2005) have used association rules to study the link between changes
and fixes in CVS and Bugzilla data for Eclipse and Mozilla. Their approach is to
identify fix-inducing changes, or those changes which cause a problem that must later
be fixed (closely related are fix-inducing fixes, or bug ‘fixes’ which require a subsequent
fix-on-fix). They identify several applications, including: characterisation and filtering
of problematic change properties, analysis of error-proneness and prevention of
fix-inducing changes by guiding programmers. Interestingly, they also find that the
likelihood of a change being fix-inducing (problematic) is greatest on Fridays.
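A minimal sketch of the underlying association-rule idea follows, computing the confidence of a hypothetical rule such as "committed on Friday → fix-inducing" over toy change records. The data is invented for illustration and is not from the Eclipse or Mozilla study.

```python
# Toy change records: weekday of commit and whether a later fix implicated it.
changes = [
    ("Fri", True), ("Fri", True), ("Fri", False),
    ("Mon", False), ("Mon", True), ("Tue", False),
]

def confidence(records, antecedent_day):
    """Confidence of the rule: (day == antecedent_day) -> fix-inducing."""
    matching = [fix for day, fix in records if day == antecedent_day]
    return sum(matching) / len(matching) if matching else 0.0
```

In a real association-rule miner, rules over many change properties would be generated and filtered by support and confidence thresholds; this shows only the confidence computation for one fixed rule.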
Wasylkowski et al. (2007) have done work in automated detection of anomalies in
object usage models, which are collections of typical or ‘correct’ usage composed of
sequences of method calls, such as calling hasNext() before next() on an Iterator
object. Their Jadet tool learns and checks method call sequences from Java code patterns
to deduce correct usage and identify anomalies. They test their approach on five large
open-source programs and successfully identify previously unknown defects, as well as
‘code smells’ that are subject to further scrutiny.
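The core of such usage-model checking can be sketched as follows. This is not Jadet itself, merely a toy checker for the hasNext()-before-next() pattern over invented call traces.

```python
# Toy per-object call sequences, of the kind a Jadet-style analysis might
# extract from Java code using an Iterator.
traces = [
    ["hasNext", "next", "hasNext", "next"],
    ["hasNext", "next"],
    ["next"],                      # anomalous: next() without a prior hasNext()
]

def violates(trace, guard="hasNext", call="next"):
    """True if `call` ever occurs without a fresh preceding `guard` call."""
    armed = False
    for m in trace:
        if m == guard:
            armed = True
        elif m == call:
            if not armed:
                return True
            armed = False   # each next() consumes one preceding hasNext()
    return False

anomalies = [t for t in traces if violates(t)]
```

Jadet learns which patterns are "typical" from the code base itself rather than taking them as given, which is what makes the approach able to find previously unknown defects.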
Weimer and Necula (2005) focus on improving the effectiveness of detecting
software errors. They note that most verification tools require software specifications,
the creation of which is difficult, time-consuming and error-prone. Their algorithm
learns specifications from observations of error handling, based on the premise that
programs often make mistakes along exceptional control-flow paths even when they
normally behave correctly. Tests which force a program into error control flows
have proven effective. The focus is on learning rules of temporal safety [similar to
Wasylkowski et al. (2007)] to infer correct API usage. They test several existing Java
programs and demonstrate improvements in discovery of specifications versus existing
data mining techniques.
Christodorescu et al. (2007) explore a related technique: automatic construction of
specifications consistent with malware by mining of execution patterns which are present
in known malware and absent in benign programs. They seek to improve the current
process of manually creating specifications that identify malevolent behaviour from
observations of known malware. Not only is the output of this technique usable by
malware detection software, but also by security analysts seeking to understand malware.
4.1.2 Classification
Large software organisations frequently use bug tracking software to manage defects
and correlate them with fixes. Bugs are assigned a severity and routed to someone
within the organisation. Classification and assignment can sometimes be automated, but
are often done by humans, especially when a bug is incorrectly filed by the reporter or
by the bug database. Anvik et al. (2005, 2006) and Anvik (2006) have researched automatic
classification of defects by severity (‘triage’), and Čubranić and Murphy (2004) have
studied methods for determining who should fix a bug. Both approaches use data mining
and learning algorithms to determine which bugs are similar and how a specific bug
should be classified.
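A drastically simplified sketch of text-based triage follows, assigning a report to the developer whose past reports share the most words. Real systems use proper learning algorithms over much richer features; the data and the nearest-overlap rule here are purely illustrative.

```python
# Toy historical assignments: developer -> concatenated past report text.
history = {
    "dev_ui":  "button layout rendering glitch window resize",
    "dev_net": "socket timeout connection retry dns lookup",
}

def triage(report, history):
    """Assign a report to the developer whose history shares the most words."""
    words = set(report.lower().split())
    def overlap(dev):
        return len(words & set(history[dev].split()))
    return max(history, key=overlap)
```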
Work by Kim and Ernst (2007) focused on classification of warnings and errors
and, specifically, the ability to suggest to programmers which should be fixed first.
Their motivations include the high false-positive rates and spurious warnings typical
of automatic bug-finding tools. They present a history-based prioritisation scheme that
mines software change history data to tell if and when certain types of errors were
fixed. The intuition is that categories of warnings that were fixed in previous software
changes are likely to be important. They report significant improvements in prioritisation
accuracy over three existing tools.
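The prioritisation intuition can be sketched as ranking warning categories by their historical fix rate. The categories and history below are invented, and the actual scheme of Kim and Ernst is considerably more refined.

```python
from collections import defaultdict

# Toy history of (warning category, fixed in a later change?) observations.
warning_history = [
    ("null-deref", True), ("null-deref", True), ("null-deref", False),
    ("unused-var", False), ("unused-var", False), ("unused-var", True),
]

def prioritise(history):
    """Rank warning categories by the fraction historically fixed, highest first."""
    fixed, total = defaultdict(int), defaultdict(int)
    for cat, was_fixed in history:
        total[cat] += 1
        fixed[cat] += was_fixed
    return sorted(total, key=lambda c: fixed[c] / total[c], reverse=True)
```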
Nainar et al. (2007) use statistical debugging methods together with dynamic code
instrumentation and examination of the execution state of software. They expand on
the use of simple predicates (such as branch choices and function return values) by
adding compound Boolean predicates. They describe such predicates, how they may
be measured, evaluation of predicate ‘interestingness’ and pruning of uninteresting
predicates. They show how their approach is robust to sparse random sampling typical
of post-deployment statistical debugging and provide empirical results to substantiate
their research.
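A toy version of such predicate scoring rates each predicate by the fraction of runs, among those in which it held, that failed. This is a much cruder statistic than the interestingness measures Nainar et al. actually use, and the run data is invented.

```python
# Toy run observations: the predicates that held in each run, and the outcome.
runs = [
    ({"x>0", "ret<0"}, "fail"),
    ({"x>0"}, "pass"),
    ({"ret<0"}, "fail"),
    ({"x>0"}, "pass"),
]

def interestingness(pred, runs):
    """Fraction of runs where `pred` held that failed (a crude failure score)."""
    held = [outcome for preds, outcome in runs if pred in preds]
    return sum(o == "fail" for o in held) / len(held) if held else 0.0
```

Compound predicates would be formed by conjoining or disjoining simple ones before scoring, then pruning those that score no better than their parts.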
4.2 Text mining

Text mining is an area of data mining with extremely broad applicability. Rather than
requiring data in a very specific format (e.g., numerical data, database entries, etc.), text
mining seeks to discover previously unknown information from textual data. Because
many artefacts in software engineering are text-based, there are many rich sources of
data from which information may be extracted. We examine several current applications
of text mining and their implications for software development processes.
Code duplication is a chronic problem which complicates maintenance and evolution
of software systems. Ducasse et al. (1999) propose a visual approach which is
language-independent, overcoming a major stumbling block of virtually all existing
code duplication techniques. Although their approach requires no language-specific
parsing, it is able to detect significant amounts of code duplication. This and other
similar approaches help alleviate the established problems of code duplication – such
as unsynchronised fixes, code bloat, architectural decay and flawed inheritance and
abstraction – which frequently contribute to diminished functionality or performance.
Duplication of bug reports is also common, especially in organisations with
widespread or public-facing test and development activities. Runeson et al. (2007)
have applied natural language processing and text mining to bug databases to detect
duplicates. They use standard methods such as tokenisation, stemming, removal of
stop words and measures of set similarity to evaluate whether bug reports are in fact
duplicates. Because text mining is computationally expensive, they also use temporal
windowing to detect duplicates only within a certain period of time of the ‘master’
record. A case study of Sony Ericsson bug data has yielded success rates between 40%
and 66%.
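A minimal sketch of the set-similarity-plus-windowing idea follows, using Jaccard similarity over whitespace tokens. No stemming or stop-word removal is performed here, and the threshold and window values are arbitrary rather than those of the Sony Ericsson study.

```python
def jaccard(a, b):
    """Set similarity between two whitespace-tokenised reports."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def is_duplicate(new, master, new_day, master_day, threshold=0.5, window=30):
    """Flag a duplicate only if similar enough and within the time window."""
    return abs(new_day - master_day) <= window and jaccard(new, master) >= threshold
```

The temporal window is what keeps the pairwise comparison tractable: each new report is compared only against recent masters, not the whole database.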
Tan et al. (2007) have presented preliminary work that addresses an extremely
common occurrence: inconsistencies between source code and inline comments. The
authors observe that out-of-sync comments and code point to one of two problems:
1 bad code inconsistent with correct comments
2 bad comments inconsistent with correct code.
The former indicates existing bugs; the latter can ‘mislead programmers to introduce
bugs in subsequent versions’. However, differences between intent and implementation
are difficult to detect automatically. The authors have created a tool (iComment) which
combines natural language processing, machine learning, statistics and program analysis
to automatically analyse comments and detect inconsistencies. Their tests on four large
code bases achieved accuracy of 90.8–100% and successfully detected a variety of such
inconsistencies, due to both bad code and bad comments.
Locating code which implements specific functionality is important in software
maintenance, but can be difficult, especially if the comments do not contain words
relevant to the functionality. Chen et al. (2001) propose a novel approach for locating
code segments by examining CVS comments, which they claim often describe the
changed lines and functionality, and generally apply for many future versions. The
comments can then be associated with the lines known to have been changed,
enabling users to search for specific functionality based on occurrences of search terms.
Obviously, the outcome depends on CVS comment quality.
Large software projects require a high degree of communication through both direct
and indirect mediums. Bird et al. (2006) mine the text of e-mail communications
between contributors to open-source software (OSS). This approach allows them
to detect and represent social networks that exist in the open-source community,
characterise interactions between contributors and identify roles such as ‘chatterers’
and ‘changers’. The in-degree and out-degree of e-mail responses are analysed and
communication is correlated with repository commit activity. These techniques were
applied to the Apache mailing lists and were able to successfully construct networks of
major contributors.
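Computing in-degree and out-degree from extracted reply pairs is straightforward; the pairs below are invented, and real mining must first resolve e-mail aliases to people, which Bird et al. address.

```python
from collections import defaultdict

# Toy reply pairs extracted from a mailing-list archive: (sender, replied_to).
replies = [("alice", "bob"), ("alice", "carol"), ("bob", "alice"), ("dave", "alice")]

def degrees(replies):
    """Out-degree (replies sent) and in-degree (replies received) per person."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for sender, target in replies:
        out_deg[sender] += 1
        in_deg[target] += 1
    return dict(out_deg), dict(in_deg)
```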
A very recent application of text mining is analysis of the lexicon (vocabulary)
which programmers use in source code. While identifier names are meaningless to a
compiler, they can be an important source of information for humans. Effective and
accurate identifiers can reduce the time and effort required to understand and maintain
code.
Antoniol et al. (2007) have examined the lexicon used during software evolution.
Their research studies not only the objective quality of identifier choices, but also how
the lexicon evolves over time. Evidence has been found to indicate that evolution of
the lexicon is more constrained than overall program evolution, which they attribute to
factors such as lack of advanced tool support for lexicon-related tasks.
Atkins et al. (1999) attempt to quantify the effects of a software tool on developer
effort. Software tools can improve software quality, but are expensive to acquire,
deploy and maintain, especially in large organisations. They present a method for tool
evaluation that correlates tool usage statistics with estimates of developer effort. Their
approach is inexpensive, observational, non-intrusive in nature and includes controls for
confounding variables; the subsequent analysis allows managers to accurately quantify
the impact of a tool on developer effort. Cost-benefit analyses provide empirical data
(although possibly from dissimilar domains) that can influence decisions about investing
in specific tools.
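In the simplest case, correlating tool-usage statistics with effort estimates reduces to a correlation coefficient. The sketch below hand-rolls Pearson's r over invented per-developer data; the controls for confounding variables that Atkins et al. describe are well beyond this toy.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

usage  = [1, 2, 3, 4]      # hypothetical tool invocations per developer
effort = [10, 8, 6, 4]     # hypothetical effort estimate (hours)
```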
5 Discussion

Data mining is only as good as the results it produces. Its effectiveness may be
constrained by the quantity or quality of available data, computational cost, stakeholder
buy-in or return on investment. Some data or tasks are difficult to mine and ‘mining
common sense’ is a waste of effort, so choosing battles wisely is critical to the success
of any data mining endeavour.
Automatable tasks are potentially valuable targets for data mining. Because software
development is so human-oriented, people are generally the most valuable resources
in a software organisation. Anything that reduces menial workload frees those people
to perform tasks which only humans can do.
For example, large organisations may benefit substantially from automation of bug
report triage and assignment. Automatic analysis and reporting of defect detection, error
patterns and exception testing can be highly beneficial and the costs of computing
resources to accomplish these tasks are very reasonable. Text analysis of source code for
duplication, out-of-sync comments and code, and localisation of specific functionality
could also be extremely valuable to maintenance engineers.
Data mining is most effective at finding new information in large amounts of data.
Complex software processes will generally benefit more from data mining techniques
than simpler, more lightweight processes that are already well-understood. However,
information gained from large processes will also have more confounding factors and
be more difficult to interpret and put into action. Changes to software process are not
trivial and the effects that result from changes are not always what one might expect.
Equally important to remember is the fact that data mining is not a panacea or ‘silver
bullet’ that improves software all by itself. Information gleaned from mining activities
must be correctly analysed and properly implemented if it is to change anything. Data
mining can only answer questions that are effectively articulated and implemented and
good intentions cannot rescue bad data (or no data).
Data miners and software development organisations wishing to employ data mining
techniques should carefully consider the costs and benefits of mining their data. The cost
to an organisation – whether in man-hours, computing resources or data preparation –
must be low enough, relative to the expected benefit, for mining to be worthwhile in a given application.
In order to make a difference in more areas of software engineering, data mining needs
to be more accessible and easier to adapt to tasks of interest. There is a great need for
tools which can automatically clean or filter data, a problem which is intractable in the
general case but possible for specific domains where data is in a known format.
In addition to automated ‘software-aware’ data mining tools, we see a need for
research and tools aimed at simplifying the process of connecting data mining tools to
common sources of software data, as discussed in Section 3. Currently, it is common
for each new tool to re-implement problems which have already been solved by another
tool, perhaps only because the solutions have not been published or generalised.
Because many data mining tasks (e.g., text mining) are extremely computationally
expensive, replication of effort is a major concern. Tools that help simplify centralised
extraction and caching of results will make widespread data mining more appealing to
large software organisations; the same tools can make collaborative data mining research
more effective. The ability to share data among colleagues or collaborators without
replication amortises the cost of even the most time-intensive operations. Removing the
‘do-it-all-yourself’ requirement will open many possibilities.
Intuitive client-side analysis and visualisation tools can help spur adoption among
those responsible for applying newly-discovered information. Most current tools,
although extremely powerful, are targeted at individuals with strong fundamental
understanding of machine learning, statistics, databases, etc. A greater emphasis on
creating approachable tools for the layperson with interest in using mined data will
increase the value (or at least an organisation’s perception of value) of the data itself.
Just as with any tool, data mining techniques can be used either well or poorly. As data
mining techniques become more popular and widespread, there is a tendency to treat
data mining as a hammer and any available data as a nail. If unchecked, this can be a
significant drain on resources.
Software practitioners must carefully consider which, if any, data mining technique
is appropriate for a given task. Despite the many commonalities in software development
artefacts and data, no two organisations or software systems are identical. Because
improvement depends on intelligent interpretation of information, and the information
that can be obtained depends on the available data, knowledge of one’s data is just as
crucial in software development as it is in other domains. Thus, we reiterate that the
first step is to understand what data is available, then decide whether that can provide
useful insights, and if so, how to analyse it.
6 Summary
We have identified reasons why software engineering is a good fit for data mining,
including the inherent complexity of development, pitfalls of raw metrics and the
difficulties of understanding software processes.
We discussed four main sources of software ‘artefact’ data:
1 version control systems
2 bug trackers
3 electronic developer communication
4 software documentation and knowledge bases
References
Alonso, O., Devanbu, P.T. and Gertz, M. (2006) ‘Extraction of contributor information from
software repositories’, available at
http://wwwcsif.cs.ucdavis.edu/~alonsoom/contributor_information_adg.pdf.
Antoniol, G., Guéhéneuc, Y.G., Merlo, E. and Tonella, P. (2007) ‘Mining the lexicon used by
programmers during software evolution’, in Proceedings of the IEEE International Conference
on Software Maintenance, pp.14–23.
Anvik, J. (2006) ‘Automating bug report assignment’, in Proceedings of the 28th International
Conference on Software Engineering, pp.937–940.
Anvik, J., Hiew, L. and Murphy, G.C. (2005) ‘Coping with an open bug repository’, in Proceedings
of the OOPSLA Workshop on Eclipse Technology eXchange, pp.35–39.
Anvik, J., Hiew, L. and Murphy, G.C. (2006) ‘Who should fix this bug?’, in Proceedings of the 28th
International Conference on Software Engineering, pp.361–370.
Atkins, D., Ball, T., Graves, T. and Mockus, A. (1999) ‘Using version control data to evaluate
the impact of software tools’, in Proceedings of the 21st International Conference on Software
Engineering, pp.324–333.
Ball, T., Kim, J.M., Porter, A.A. and Siy, H.P. (1997) ‘If your version control system could talk. . . ’,
in Proceedings of the Workshop on Process Modelling and Empirical Studies of Software
Engineering.
Bird, C., Gourley, A., Devanbu, P., Gertz, M. and Swaminathan, A. (2006) ‘Mining email social
networks’, in Proceedings of the International Workshop on Mining Software Repositories,
pp.137–143.
Canfora, G. and Cerulo, L. (2005) ‘Impact analysis by mining software and change request
repositories’, in Proceedings of the 11th IEEE International Software Metrics Symposium, p.29.
Chen, A., Chou, E., Wong, J., Yao, A.Y., Zhang, Q., Zhang, S. and Michail, A. (2001) ‘Cvssearch:
searching through source code using CVS comments’, in Proceedings of the IEEE International
Conference on Software Maintenance, pp.364–373.
Christodorescu, M., Jha, S. and Kruegel, C. (2007) ‘Mining specifications of malicious behavior’, in
Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the
ACM SIGSOFT Symposium on The Foundations of Software Engineering, pp.5–14.
Čubranić, D. and Murphy, G.C. (2004) ‘Automatic bug triage using text classification’, in Proceedings
of the 16th International Conference on Software Engineering & Knowledge Engineering,
pp.92–97.
Dickinson, W., Leon, D. and Podgurski, A. (2001) ‘Finding failures by cluster analysis of execution
profiles’, in Proceedings of the 23rd International Conference on Software Engineering,
pp.339–348.
Ducasse, S., Rieger, M. and Demeyer, S. (1999) ‘A language independent approach for detecting
duplicated code’, in Proceedings of the IEEE International Conference on Software Maintenance,
pp.109–118.
Gall, H.C. and Lanza, M. (2006) ‘Software evolution: analysis and visualization’, in Proceedings of
the 28th International Conference on Software Engineering, pp.1055–1056.
Hassan, A.E. (2006) ‘Mining software repositories to assist developers and support managers’, in
Proceedings of the 22nd IEEE International Conference on Software Maintenance, pp.339–342.
Howison, J. and Crowston, K. (2004) ‘The perils and pitfalls of mining sourceforge’, in Proceedings
of the International Workshop on Mining Software Repositories.
Kagdi, H., Collard, M.L. and Maletic, J.I. (2007) ‘A survey and taxonomy of approaches for mining
software repositories in the context of software evolution’, Journal of Software Maintenance and
Evolution: Research and Practice, Vol. 19, No. 2, pp.77–131.
Kagdi, H., Yusuf, S. and Maletic, J.I. (2006) ‘Mining sequences of changed-files from version
histories’, in Proceedings of the International Workshop on Mining Software Repositories,
pp.47–53.
Kim, S. and Ernst, M.D. (2007) ‘Which warnings should I fix first?’, in Proceedings of the 6th Joint
Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium
on the Foundations of Software Engineering, pp.45–54.
Liblit, B., Naik, M., Zheng, A.X., Aiken, A. and Jordan, M.I. (2005) ‘Scalable statistical bug
isolation’, in Proceedings of the ACM SIGPLAN Conference on Programming Language Design
and Implementation, pp.15–26.
Liu, C. and Han, J. (2006) ‘Failure proximity: a fault localization-based approach’, in Proceedings
of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering,
pp.46–56.
Livshits, B. and Zimmermann, T. (2005) ‘Dynamine: finding common error patterns by mining
software revision histories’, ACM SIGSOFT Software Engineering Notes, Vol. 30, No. 5,
pp.296–305.
Lotka, A.J. (1926) ‘The frequency distribution of scientific productivity’, Journal of the Washington
Academy of Sciences, Vol. 16, No. 12, pp.317–324.
Mendonca, M. and Sunderhaft, N. (1999) ‘Mining software engineering data: a survey’, Data &
Analysis Center for Software (DACS) State-of-the-Art Report, No. DACS-SOAR-99-3.
Mens, T. and Demeyer, S. (2001) ‘Future trends in software evolution metrics’, in Proceedings of the
4th International Workshop on Principles of Software Evolution, pp.83–86.
Mockus, A., Eick, S.G., Graves, T.L. and Karr, A.F. (1999) ‘On measurement and analysis of software
changes’, Technical report, National Institute of Statistical Sciences.
Mockus, A., Weiss, D.M. and Zhang, P. (2003) ‘Understanding and predicting effort in software
projects’, in Proceedings of the 25th International Conference on Software Engineering,
pp.274–284.
Nainar, P.A., Chen, T., Rosin, J. and Liblit, B. (2007) ‘Statistical debugging using compound Boolean
predicates’, in Proceedings of the International Symposium on Software Testing and Analysis,
pp.5–15.
Newby, G.B., Greenberg, J. and Jones, P. (2003) ‘Open source software development and Lotka’s
Law: bibliometric patterns in programming’, Journal of the American Society for Information
Science and Technology, Vol. 54, No. 2, pp.169–178.
Robles, G., González-Barahona, J.M. and Ghosh, R.A. (2004) ‘Gluetheos: automating the retrieval and
analysis of data from publicly available software repositories’, in Proceedings of the International
Workshop on Mining Software Repositories, pp.28–31.
Runeson, P., Alexandersson, M. and Nyholm, O. (2007) ‘Detection of duplicate defect reports using
natural language processing’, in Proceedings of the 29th International Conference on Software
Engineering, pp.499–510.
Scotto, M., Sillitti, A., Succi, G. and Vernazza, T. (2006) ‘A non-invasive approach to product metrics
collection’, Journal of Systems Architecture, Vol. 52, No. 11, pp.668–675.
Shirabad, J.S., Lethbridge, T.C. and Matwin, S. (2001) ‘Supporting software maintenance by mining
software update records’, in Proceedings of the IEEE International Conference on Software
Maintenance, pp.22–31.
Śliwerski, J., Zimmermann, T. and Zeller, A. (2005) ‘When do changes induce fixes?’, ACM SIGSOFT
Software Engineering Notes, Vol. 30, No. 4, pp.1–5.
Tan, L., Yuan, D., Krishna, G. and Zhou, Y. (2007) ‘/*icomment: bugs or bad comments?*/’, in
Proceedings of the 21st ACM Symposium on Operating Systems Principles, pp.145–158.
Wasylkowski, A., Zeller, A. and Lindig, C. (2007) ‘Detecting object usage anomalies’, in Proceedings
of the 6th Joint Meeting of the European Software Engineering Conference and the ACM
SIGSOFT Symposium on The Foundations of Software Engineering, pp.35–44.
Weimer, W. and Necula, G.C. (2005) ‘Mining temporal specifications for error detection’, in
Proceedings of the 11th International Conference on Tools and Algorithms for the Construction
and Analysis of Systems, pp.461–476.
Xie, T. (2010) ‘Bibliography on mining software engineering data’, available at
http://ase.csc.ncsu.edu/dmse.
Xie, T., Pei, J. and Hassan, A.E. (2007) ‘Mining software engineering data’, in Proceedings of the
29th International Conference on Software Engineering, pp.172–173.
Zhang, S., Wang, Y., Yuan, F. and Ruan, L. (2007) ‘Mining software repositories to understand the
performance of individual developers’, in Proceedings of the 31st Annual International Computer
Software and Applications Conference, pp.625–626.
Zimmermann, T., Weißgerber, P., Diehl, S. and Zeller, A. (2005) ‘Mining version histories to guide
software changes’, IEEE Transactions on Software Engineering, Vol. 31, No. 6, pp.429–445.
Notes
1 Revision control is sometimes also identified by the acronyms VCS for version control
system and SCM for source control management.