A Metric For Software Readability
Raymond P.L. Buse and Westley R. Weimer
ABSTRACT

In this paper, we explore the concept of code readability and investigate its relation to software quality. With data collected from human annotators, we derive associations between a simple set of local code features and human notions of readability. Using those features, we construct an automated readability measure and show that it can be 80% effective, and better than a human on average, at predicting readability judgments. Furthermore, we show that this metric correlates strongly with two traditional measures of software quality, code changes and defect reports. Finally, we discuss the implications of this study for programming language design and engineering practice. For example, our data suggests that comments, in and of themselves, are less important than simple blank lines to local judgments of readability.

Categories and Subject Descriptors

D.2.9 [Management]: Software quality assurance (SQA); D.2.8 [Software Engineering]: Metrics

General Terms

Measurement, Human Factors

Keywords

software readability, program understanding, machine learning, software maintenance, code metrics, FindBugs

1. INTRODUCTION

We define “readability” as a human judgment of how easy a text is to understand. The readability of a program is related to its maintainability, and is thus a critical factor in overall software quality. Typically, maintenance will consume over 70% of the total lifecycle cost of a software product [5]. Aggarwal claims that source code readability and documentation readability are both critical to the maintainability of a project [1]. Other researchers have noted that the act of reading code is the most time-consuming component of all maintenance activities [29, 36, 38]. Furthermore, maintaining software often means evolving software, and modifying existing code is a large part of modern software engineering [35]. Readability is so significant, in fact, that Elshoff and Marcotty proposed adding a development phase in which the program is made more readable [11]. Knight and Myers suggested that one phase of software inspection should be a check of the source code for readability [26]. Haneef proposed the addition of a dedicated readability and documentation group to the development team [19].

We hypothesize that everyone who has written code has some intuitive notion of this concept, and that program features such as indentation (e.g., as in Python [43]), choice of identifier names [37], and comments are likely to play a part. Dijkstra, for example, claimed that the readability of a program depends largely upon the simplicity of its sequencing control, and employed that notion to help motivate his top-down approach to system design [10].

We present a descriptive model of software readability based on simple features that can be extracted automatically from programs. This model of software readability correlates strongly both with human annotators and also with external notions of software quality, such as defect detectors and software changes.

To understand why an empirical and objective model of software readability is useful, consider the use of readability metrics in natural languages. The Flesch-Kincaid Grade Level [12], the Gunning-Fog Index [18], the SMOG Index [31], and the Automated Readability Index [24] are just a few examples of readability metrics that were developed for ordinary text. These metrics are all based on simple factors such as average syllables per word and average sentence length. Despite their relative simplicity, they have each been shown to be quite useful in practice. Flesch-Kincaid, which has been in use for over 50 years, has not only been integrated into popular text editors including Microsoft Word, but has also become a United States governmental standard. Agencies, including the Department of Defense, require many documents and forms, internal and external, to have a Flesch readability grade of 10 or below (DOD MIL-M-38784B). Defense contractors also are often required to use it when they write technical manuals.

These metrics, while far from perfect, can help organizations gain some confidence that their documents meet goals for readability very cheaply, and have become ubiquitous for that reason.
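To make the simplicity of such formulas concrete, consider the Flesch-Kincaid Grade Level, shown here only for comparison (its coefficients are the standard published ones, not something derived in this paper). It is commonly given as:

\[
\text{Grade} = 0.39\,\frac{\text{total words}}{\text{total sentences}} \;+\; 11.8\,\frac{\text{total syllables}}{\text{total words}} \;-\; 15.59
\]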
We believe that similar metrics, targeted specifically at source code and backed with empirical evidence for effectiveness, can serve an analogous purpose in the software domain.

It is important to note that readability is not the same as complexity, for which some existing metrics have been empirically shown useful [44]. Brooks claims that complexity is an “essential” property of software; it arises from system requirements, and cannot be abstracted away [13]. Readability, on the other hand, is purely “accidental.” In the Brooks model, software engineers can only hope to control accidental difficulties: coincidental readability can be addressed far more easily than intrinsic complexity.

While software complexity metrics typically take into account the size of classes and methods, and the extent of their interactions, the readability of code is based primarily on local, line-by-line factors. It is not related, for example, to the size of a piece of code. Furthermore, our notion of readability arises directly from the judgments of actual human annotators who need not be familiar with the purpose of the system. Complexity factors, on the other hand, may have little relation to what makes code understandable to humans. Previous work [34] has shown that attempting to correlate artificial code complexity metrics directly to defects is difficult, but not impossible. In this study, we have chosen to target readability directly both because it is a concept that is independently valuable, and also because developers have great control over it. We show in Section 4 that there is indeed a significant correlation between readability and quality.

The main contributions of this paper are:

• An automatic software readability metric based on local features. Our metric correlates strongly with both human annotators and also external notions of software quality.

• A survey of 120 human annotators on 100 code snippets that forms the basis for our metric. We are unaware of any published software readability study of comparable size (12,000 human judgments).

• A discussion of the features involved in that metric and their relation to software engineering and programming language design.

There are a number of possible uses for an automated readability metric. It may help developers to write more readable software by quickly identifying code that scores poorly. It can assist project managers in monitoring and maintaining readability. It can serve as a requirement for acceptance. It can even assist inspections by helping to target effort at parts of a program that may need improvement.

The structure of this paper is as follows. In Section 2 we investigate the extent to which our study group agrees on what readable code looks like, and in Section 3 we determine a small set of features that is sufficient to capture the notion of readability for a majority of annotators. In Section 4 we discuss the correlation between our readability metric and external notions of software quality. We discuss some of the implications of this work on programming language design in Section 5, place our work in context in Section 6, discuss possibilities for extension in Section 7, and conclude in Section 8.

2. HUMAN READABILITY ANNOTATION

A consensus exists that readability is an essential determining characteristic of code quality [1, 5, 10, 11, 19, 29, 34, 35, 36, 37, 38, 44], but there is no consensus about which factors contribute most to human notions of software readability. A previous study by Tenny looked at readability by testing comprehension of several versions of a program [42]. However, such an experiment is not sufficiently fine-grained to extract precise features. In that study, the code samples were large, and thus the perceived readability arose from a complex interaction of many features, potentially including the purpose of the code. In contrast, we choose to measure the software readability of small (7.7 lines on average) selections of code. Using many short code selections increases our ability to tease apart which features are most predictive of readability. We now describe an experiment designed to extract a large number of readability judgments over short code samples from a group of human annotators.

Formally, we can characterize software readability as a mapping from a code sample to a finite score domain. In this experiment, we present a sequence of short code selections, called snippets, through a web interface. Each annotator is asked to individually score each snippet based on a personal estimation of how readable it is. There are two important parameters to consider: snippet selection policy and score range.

2.1 Snippet Selection Policy

We claim that the readability of code is very different from that of natural languages. Code is highly structured and consists of elements serving different purposes, including design, documentation, and logic. These issues make the task of snippet selection an important concern. We have designed an automated policy-based tool that extracts snippets from Java programs.

First, snippets should be relatively short to aid feature discrimination. However, if snippets are too short, then they may obscure important readability considerations. Second, snippets should be logically coherent to give the annotators the best chance at appreciating their readability. We claim that they should not span multiple method bodies and that they should include adjacent comments that document the code in the snippet. Finally, we want to avoid generating snippets that are “trivial.” For example, the readability of a snippet consisting entirely of boilerplate import statements or entirely of comments is not particularly meaningful.

Consequently, an important tradeoff exists such that snippets must be as short as possible to adequately support analysis, yet must be long enough to allow humans to make significant judgments on them. Note that it is not our intention to “simulate” the reading process, where context may be important to understanding. Quite the contrary: we intend to eliminate context and complexity to a large extent and instead focus on the “low-level” details of readability. We do not imply that context is unimportant; we mean only to show that there exists a set of local features that, by themselves, have a strong impact on readability and, by extension, software quality.

With these considerations in mind, we restrict snippets for Java programs as follows. A snippet consists of precisely three consecutive simple statements [16], the most basic unit of a Java program. Simple statements include field declarations, assignments, function calls, breaks, continues, throws, and returns.
We find by experience that snippets with fewer such instructions are sometimes too short for a meaningful evaluation of readability, but that three statements are generally both adequate to cover a large set of local features and sufficient for a fine-grained feature-based analysis.

A snippet does include preceding or in-between lines that are not simple statements, such as comments, function headers, blank lines, or headers of compound statements like if-else, try-catch, while, switch, and for. Furthermore, we do not allow snippets to cross scope boundaries. That is, a snippet neither spans multiple methods nor starts inside a compound statement and then extends outside it. We find that with this set of policies, over 90% of statements in all of the programs we considered (see Figure 8) are candidates for incorporation in some snippet. The few non-candidate lines are usually found in functions that have fewer than three statements. This snippet definition is specific to Java but extends directly to similar languages like C and C++.
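The policy above is straightforward to mechanize. The following is a minimal sketch of such an extractor, not the authors' actual tool: it works on raw source lines and approximates “simple statement” and “scope boundary” with crude textual tests, whereas a real implementation would consult a Java parser.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch (not the authors' tool) of the snippet policy described
 * above: a snippet is three consecutive simple statements plus any comments,
 * blank lines, or compound-statement headers that appear before or between
 * them. "Simple statement" is approximated crudely as a line ending in ';';
 * a line ending in '}' is treated as a scope boundary.
 */
public class SnippetExtractor {

    /** Returns each snippet as one string of consecutive source lines. */
    public static List<String> extract(List<String> sourceLines) {
        List<String> snippets = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int statements = 0;

        for (String line : sourceLines) {
            String t = line.trim();
            if (t.endsWith("}")) {        // closing brace: never cross a scope boundary
                current.clear();
                statements = 0;
                continue;
            }
            current.add(line);            // comments, blank lines, and headers ride along
            if (t.endsWith(";")) {        // crude test for a simple statement
                statements++;
            }
            if (statements == 3) {        // emit a three-statement snippet
                snippets.add(String.join("\n", current));
                current.clear();
                statements = 0;
            }
        }
        return snippets;
    }
}
```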
Figure 2: Distribution of readability scores made by
120 human annotators on code snippets taken from
several open source projects (see Figure 8).
Figure 5: Annotator agreement with a model obtained by averaging the scores of 100 annotators, with the addition of our metric.

Average  Maximum  Feature Name
   X        X     line length (characters)
   X        X     identifiers
   X        X     identifier length
   X        X     indentation (preceding whitespace)
   X        X     keywords
   X        X     numbers
   X              comments
   X              periods
   X              commas
   X              spaces
   X              parenthesis
   X              arithmetic operators
   X              comparison operators
   X              assignments (=)
   X              branches (if)
   X              loops (for, while)
   X              blank lines
            X     occurrences of any single character
            X     occurrences of any single identifier
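To illustrate how local these features are, the sketch below computes a few of them (average line length, average identifiers per line, and maximum indentation) for a snippet. It is not the authors' implementation, and the identifier test is a deliberately rough regular expression that also matches keywords.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A rough sketch of per-snippet feature extraction, not the authors' tool. */
public class SnippetFeatures {

    // Deliberately crude: any word starting with a letter or underscore.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    public static double averageLineLength(String[] lines) {
        double total = 0;
        for (String line : lines) total += line.length();
        return total / lines.length;
    }

    public static double averageIdentifiersPerLine(String[] lines) {
        double total = 0;
        for (String line : lines) {
            Matcher m = IDENTIFIER.matcher(line);
            while (m.find()) total++;
        }
        return total / lines.length;
    }

    public static int maxIndentation(String[] lines) {
        int max = 0;
        for (String line : lines) {
            int spaces = 0;
            while (spaces < line.length() && line.charAt(spaces) == ' ') spaces++;
            max = Math.max(max, spaces);
        }
        return max;
    }

    public static void main(String[] args) {
        String[] snippet = {
            "// update the running total",
            "int total = base + offset;",
            "count++;",
            "notifyListeners(total);"
        };
        System.out.printf("avg line length: %.1f%n", averageLineLength(snippet));
        System.out.printf("avg identifiers/line: %.1f%n", averageIdentifiersPerLine(snippet));
        System.out.printf("max indentation: %d%n", maxIndentation(snippet));
    }
}
```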
Figure 8: The open source Java programs used as benchmarks.

Project Name    Version  KLOC  Maturity  Description
JasperReports   2.04      269     6      Dynamic content
Hibernate*      2.1.8     189     6      Database
jFreeChart*     1.0.9     181     5      Data rep.
FreeCol*        0.7.3     167     3      Game
jEdit*          4.2       140     5      Text editor
Gantt Project   3.0       130     5      Scheduling
soapUI          2.0.1      98     6      Web services
Xholon          0.7        61     4      Simulation
Risk            1.0.9.2    34     4      Game
JSch            0.1.37     18     3      Security
jUnit*          4.4         7     5      Software dev.
jMencode        0.64        7     3      Video encoding
Figure 7: Relative power of features as determined by a singleton (one-feature-at-a-time) analysis. The direction of correlation for each is also indicated.

…in the group. Two interesting observations arise. First, for all groups except graduate students, our automatic metric agrees with the human average more closely than the humans agree with each other. We suspect that the difference with respect to graduates may be a reflection of the more diverse background of the graduate student population, their more sophisticated opinions, or some other external factor. Second, we see a gradual trend toward increased agreement with experience.

We investigated which features have the most predictive power by re-running our all-annotators analysis using only one feature at a time. The relative magnitude of the performance of the classifier is indicative of the comparative importance of each feature. Figure 7 shows the result of that analysis with the magnitudes normalized between zero and one.

We find, for example, that factors like ‘average line length’ and ‘average number of identifiers per line’ are very important to readability. Conversely, ‘average identifier length’ is not, in and of itself, a very predictive factor; neither are if constructs, loops, or comparison operators. Section 5 includes a discussion of some of the possible implications of this result.

We prefer this singleton feature analysis to a leave-one-out analysis (which judges feature power based on decreases in classifier performance) because the latter may be misleading due to significant feature overlap. Such overlap occurs when two or more features, though different, capture the same underlying phenomenon.
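A hedged sketch of this singleton analysis appears below. Classifier training and scoring are abstracted behind a hypothetical interface (in practice this could wrap, for example, a cross-validated Weka classifier); the names here are illustrative, not the authors' API.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch of the singleton (one-feature-at-a-time) analysis described above.
 * The evaluator hook is hypothetical: it should train and score a model that
 * is allowed to see only the named feature.
 */
public class SingletonFeatureAnalysis {

    /** Hypothetical hook: build and score a model restricted to one feature. */
    public interface SingleFeatureEvaluator {
        double performance(String featureName);   // e.g., agreement with annotators
    }

    /** Returns each feature's relative power, normalized to [0, 1]. */
    public static Map<String, Double> relativePower(String[] features,
                                                    SingleFeatureEvaluator eval) {
        Map<String, Double> raw = new TreeMap<>();
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (String f : features) {
            double p = eval.performance(f);
            raw.put(f, p);
            min = Math.min(min, p);
            max = Math.max(max, p);
        }
        Map<String, Double> normalized = new TreeMap<>();
        double span = (max == min) ? 1.0 : (max - min);
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            normalized.put(e.getKey(), (e.getValue() - min) / span);
        }
        return normalized;
    }
}
```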
4. CORRELATING READABILITY WITH SOFTWARE QUALITY

In the previous section we constructed an automated model of readability that mimics human judgments. We implemented our model in a tool that assesses the readability of programs using a fixed classifier. In this section we use that tool to investigate whether our model of readability compares favorably with external conventional metrics of software quality. Specifically, we first look for a correlation between readability and FindBugs, a popular static bug-finding tool [22]. Second, we look for a similar correlation with changes to code between versions of several large open source projects. We chose FindBugs defects and version changes related to code churn in part because they can be measured objectively. Finally, we look for trends in code readability across those projects.

The set of open source Java programs we have employed as benchmarks can be found in Figure 8. They were selected because of their relative popularity, diversity in terms of development maturity and application domain, and availability in multiple versions from SourceForge, an open source software repository. Maturity is self reported, and categorized by SourceForge into 1-planning, 2-pre-alpha, 3-alpha, 4-beta, 5-production/stable, 6-mature, 7-inactive. Note that some projects present multiple releases at different maturity levels; in such cases we selected the release for the maturity level indicated.

Running our readability tool (including feature detection and the readability judgment) was quite rapid. For example, the 98K lines of code in soapUI took less than 16 seconds to process on a machine with a 2GHz processor and a disk with a maximum 150 MBytes/sec transfer rate.
Figure 9: f-measure for using readability to predict functions with FindBugs defect reports and functions which change between releases.

Figure 10: Mean ratio of the classifier probabilities (predicting ‘less readable’) assigned to functions that contained a FindBugs defect or that will change in the next version to those that were not. For example, Risk functions with FindBugs errors were assigned a probability of ‘less readable’ that was nearly 150% greater on average than the probabilities assigned to functions without such defects.

4.1 Readability Correlations

Our first experiment attempts to correlate defects detected by FindBugs with our readability metric at the function level. We first ran FindBugs on the benchmarks, noting defect reports. Second, we extracted all of the functions and partitioned them into two sets: those containing at least one reported defect, and those containing none. We normalized function set sizes to avoid bias between programs for which more or fewer defects were reported. We then ran the already-trained classifier on the set of functions, recording an f-measure for “contains a bug” with respect to the classifier judgment of “less readable.”

Our second experiment relates future code churn to readability. Version-to-version changes capture another important aspect of code quality. This experiment used the same setup as the first, but used readability to predict which functions will be modified between two successive releases of a program. For this experiment, “successive release” means the two most recent stable versions. In other words, instead of “contains a bug” we attempt to predict “is going to change soon.” We consider a function to have changed in any case where the text is not exactly the same, including changes to whitespace. Whitespace is normally ignored in program studies, but since we are specifically focusing on readability we deem it relevant.

Figure 9 summarizes the results of these two experiments. Guessing randomly yields an f-measure of 0.5 and serves as a baseline, while 1.0 represents a perfect upper bound. The average f-measure over 11 of our benchmarks for the FindBugs correlation is 0.61. The average f-measure for version changes over all 12 of our benchmarks is 0.63. It is important to note that our goal is not perfect correlation with FindBugs or any other source of defect reports: projects can run FindBugs directly rather than using our metric to predict its output. Instead, we argue that our readability metric has general utility and is correlated with multiple notions of software quality.

A second important consideration is the magnitude of the difference. We claim that classifier probabilities (i.e., continuous output vs. discrete classifications) are useful in evaluating readability. Figure 10 presents this data in the form of a ratio: the mean probability assigned by the classifier to functions positive for FindBugs defects or version changes to functions without these features. A ratio over 1 (i.e., > 100%) for many of the projects indicates that the functions with these features tend to have lower readability scores than functions without them. For example, in the jMencode and soapUI projects, functions judged less readable by our metric were dramatically more likely to contain FindBugs defect reports, and in the JasperReports project less-readable methods were very likely to change in the next version.

For both of these external quality indicators we found that our tool exhibits a substantial degree of correlation. Predicting based on our readability metric yields an f-measure over 0.7 in some cases. Again, our goal is not a perfect correlation with version changes and code churn. These moderate correlations do, however, imply a connection between code readability, as described by our model, and defects and upcoming code changes.
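To make the evaluation concrete, the following is a rough sketch (not the authors' tooling) of how such an f-measure can be computed once each function has been labeled; the boolean arrays are hypothetical stand-ins for per-function results.

```java
/**
 * Sketch of the evaluation in Section 4.1: treating "judged less readable"
 * as a prediction of "contains a FindBugs defect report" (or "will change"),
 * and summarizing the agreement as an f-measure.
 */
public class FMeasureSketch {

    public static double fMeasure(boolean[] predictedLessReadable, boolean[] actualPositive) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < predictedLessReadable.length; i++) {
            if (predictedLessReadable[i] && actualPositive[i]) tp++;
            else if (predictedLessReadable[i] && !actualPositive[i]) fp++;
            else if (!predictedLessReadable[i] && actualPositive[i]) fn++;
        }
        double precision = tp == 0 ? 0 : (double) tp / (tp + fp);
        double recall = tp == 0 ? 0 : (double) tp / (tp + fn);
        return (precision + recall == 0) ? 0 : 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Hypothetical per-function labels for a small example.
        boolean[] predicted = { true, true, false, false, true, false };
        boolean[] defective = { true, false, false, true, true, false };
        System.out.printf("f-measure: %.2f%n", fMeasure(predicted, defective));
    }
}
```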
4.2 Software Lifecycle

To further investigate the relation of our readability metric to external factors, we investigate changes over long periods of time. Figure 11 shows how readability tends to change over the lifetime of a project. To construct this figure we selected several projects with rich version histories and calculated the average readability level over all of the functions in each.

Note that newly-released versions of open source projects are not always more stable than their predecessors. Projects often undergo major overhauls or add additional crosscutting features. Consider jUnit, which has recently adopted a “completely different API . . . [that] depends on new features of Java 5.0 (annotations, static import. . . )” [15]. We thus conducted an additional experiment to measure readability against maturity and stability.

Figure 12 plots project readability against project maturity, as self-reported by developers. The data shows a noisy upward trend implying that projects that reach maturity tend to be more readable. The results of these two experiments would seem not to support the Fred Brooks argument that “all repairs tend to destroy the structure, to increase the entropy and disorder of the system . . . as time passes, the system becomes less and less well ordered” [14] for the readability component of “order”. While Brooks was not speaking specifically of readability, a lack of readability can be a strong manifestation of disorder.
Figure 11: Average readability metric of all functions in a project as a function of project lifetime. Note that over time, the readability of some projects tends to decrease, while it gradually increases in others.

Figure 12: Average readability metric of all functions in a project as a function of self-reported project maturity, with best-fit linear trend line. Note that projects of greater maturity tend to exhibit greater readability.

5. DISCUSSION

This study includes a significant amount of empirical data about the relation between local code features and readability. We believe that this information may have implications for the way code should be written and evaluated, and for the design of programming languages. However, we caution that this data may only be truly relevant to our annotators; it should not be interpreted to represent a comprehensive or universal model for readability.

To start, we found that the length of identifier names has almost no influence on readability (0% relative predictive power). Recently there has been a significant movement toward “self documenting code,” which is often characterized by long and descriptive identifier names and few abbreviations. The movement has had particular influence on the Java community. Furthermore, naming conventions, like the “Hungarian” notation which seeks to encode typing information into identifier names, may not be as advisable as previously thought [39]. While descriptive identifiers certainly can improve readability, perhaps some additional attention should be paid to the fact that they may also reduce it; our study indicates the net gain may be near zero.

For example, forcing readers to process long names, where shorter ones would suffice, may negatively impact readability. Furthermore, identifier names are not always an appropriate place to encode documentation. There are many cases where it would be more appropriate to use comments, possibly associated with variable or field declarations, to explain program behavior. Long identifiers may be confusing, or even misleading. We believe that in many cases sophisticated integrated development environments (IDEs) and specialized static analysis tools designed to aid in software inspections (e.g., [3]) may constitute a better approach to the goal of enhancing program understanding.

Unlike identifiers, comments are a very direct way of communicating intent. One might expect their presence to increase readability dramatically. However, we found that comments are only moderately well-correlated with readability (33% relative power). One conclusion may be that while comments can enhance readability, they are typically used in code segments that started out less readable: the comment and the unreadable code effectively balance out. The net effect would appear to be that comments are not always, in and of themselves, indicative of high or low readability.

The number of identifiers and characters per line has a strong influence on our readability metric (100% and 96% relative power respectively). It would appear that just as long sentences are more difficult to understand, so are long lines of code. Our findings support the conventional wisdom that programmers should keep their lines short, even if it means breaking up a statement across multiple lines.
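The following illustrative (and deliberately trivial) example is not from the study; it simply shows the kind of restructuring the preceding paragraph favors: the same statement written as one long line and then broken across several short lines.

```java
/** Illustrative only: one long line versus the same computation on short lines. */
public class LineLengthExample {
    public static void main(String[] args) {
        int itemCount = 7, unitPrice = 12, shipping = 5, discount = 3;

        // One long line packs many identifiers and operators together:
        int totalLong = (itemCount * unitPrice) + shipping - discount + (itemCount > 5 ? 0 : shipping);

        // The same statement broken across short lines, as the data above favors:
        int totalShort = (itemCount * unitPrice)
                + shipping
                - discount
                + (itemCount > 5 ? 0 : shipping);

        System.out.println(totalLong == totalShort);   // prints: true
    }
}
```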
When designing programming languages, readability is an important concern. Languages might be designed to force or encourage improved readability by considering the implications of various design and language features on this metric. For example, Python enforces a specific indentation scheme in order to aid comprehension [43, 32]. In our experiments, the importance of character count per line suggests that languages should favor the use of constructs, such as switch statements and pre- and post-increment, that encourage short lines. Our conclusion, that readability does not appear to be negatively impacted by repeated characters or words, runs counter to the common perception that operator overloading is necessarily confusing. Finally, our data suggests that languages should add additional keywords if it means that programs can be written with fewer new identifiers.
The techniques presented in this paper offer a means for conducting such experiments.

6. RELATED WORK

Previously, we identified several of the many automated readability metrics that are in use today for natural language [12, 18, 24, 31]. While we have not found analogous metrics targeted at source code (as presented in this paper), some metrics do exist outside the realm of traditional language. For example, utility has been claimed for a readability metric for computer generated math [30], for the layout of treemaps [4], and for hypertext [20].

Perhaps the bulk of work in the area of source code readability today is based on coding standards (e.g., [2, 6, 41]). These conventions are primarily intended to facilitate collaboration by maintaining uniformity between code written by different developers. Style checkers such as PMD [9] and The Java Coding Standard Checker are employed as a means to automatically enforce these standards.

We also note that machine learning has, in the past, been used for defect prediction, typically by training on data from source code repositories (e.g., [7, 17, 23, 25]). We believe that machine learning has substantial advantages over traditional statistics and that much room yet exists for the exploitation of such techniques in the domains of Software Engineering and Programming Languages.

7. FUTURE WORK

The techniques presented in this paper should provide an excellent platform for conducting future readability experiments, especially with respect to unifying even a very large number of judgments into an accurate model of readability.

While we have shown that there is significant agreement between our annotators on the factors that contribute to code readability, we would expect each annotator to have personal preferences that lead to a somewhat different weighting of the relevant factors. It would be interesting to investigate whether a personalized or organization-level model, adapted over time, would be effective in characterizing code readability. Furthermore, readability factors may also vary significantly based on application domain. Additional research is needed to determine the extent of this variability, and whether specialized models would be useful.

Another possibility for improvement would be an extension of our notion of local code readability to include broader features. While most of our features are calculated as average or maximum value per line, it may be useful to consider the size of compound statements, such as the number of simple statements within an if block. For this study, we intentionally avoided such features to help ensure that we were capturing readability rather than complexity. However, in practice, achieving this separation of concerns is likely to be less compelling.

Readability measurement tools present their own challenges in terms of programmer access. We suggest that such tools could be integrated into an IDE, such as Eclipse, in the same way that natural language readability metrics are incorporated into word processors. Software that seems readable to the author may be quite difficult for others to understand [19]. Such a system could alert programmers as such instances arise, in a way similar to the identification of syntax errors.

Finally, in line with conventional readability metrics, it would be worthwhile to express our metric using a simple formula over a small number of features (the PCA from Section 3.2 suggests this may be possible). Using only the truly essential and predictive features would allow the metric to be adapted easily into many development processes. Furthermore, with a smaller number of coefficients, the readability metric could be parameterized or modified in order to better describe readability in certain environments, or to meet more specific concerns.

8. CONCLUSION

It is important to note that the metric described in this paper is not intended as the final or universal model of readability. Rather, we have shown how to produce a metric for software readability from the judgments of human annotators, relevant specifically to those annotators. In fact, we have shown that it is possible to create a metric that agrees with these annotators as much as they agree with each other by only considering a relatively simple set of low-level code features. In addition, we have seen that readability, as described by this metric, exhibits a significant level of correlation with more conventional metrics of software quality, such as defects, code churn, and self-reported stability. Furthermore, we have discussed how considering the factors that influence readability has potential for improving programming language design and engineering practice with respect to this important dimension of software quality.

9. REFERENCES

[1] K. Aggarwal, Y. Singh, and J. K. Chhabra. An integrated measure of software maintainability. In Proceedings of the Annual Reliability and Maintainability Symposium, pages 235–241, September 2002.
[2] S. Ambler. Java coding standards. Softw. Dev., 5(8):67–71, 1997.
[3] P. Anderson and T. Teitelbaum. Software inspection using CodeSurfer. In WISE '01: Proceedings of the First Workshop on Inspection in Software Engineering, July 2001.
[4] B. B. Bederson, B. Shneiderman, and M. Wattenberg. Ordered and quantum treemaps: Making effective use of 2d space to display hierarchies. ACM Trans. Graph., 21(4):833–854, 2002.
[5] B. Boehm and V. R. Basili. Software defect reduction top 10 list. Computer, 34(1):135–137, 2001.
[6] L. W. Cannon, R. A. Elliott, L. W. Kirchhoff, J. H. Miller, J. M. Milner, R. W. Mitze, E. P. Schan, N. O. Whittington, H. Spencer, D. Keppel, and M. Brader. Recommended C Style and Coding Standards: Revision 6.0. Specialized Systems Consultants, Inc., Seattle, Washington, June 1990.
[7] T. J. Cheatham, J. P. Yoo, and N. J. Wahl. Software testing: a machine learning experiment. In CSC '95: Proceedings of the 1995 ACM 23rd Annual Conference on Computer Science, pages 135–141, 1995.
[8] T. Y. Chen, F.-C. Kuo, and R. Merkel. On the statistical properties of the f-measure. In QSIC '04: Fourth International Conference on Quality Software, pages 146–153, 2004.
[9] T. Copeland. PMD Applied. Centennial Books, Alexandria, VA, USA, 2005.
[10] E. W. Dijkstra. A Discipline of Programming. Prentice Hall PTR, 1976.
[11] J. L. Elshoff and M. Marcotty. Improving computer program readability to aid modification. Commun. ACM, 25(8):512–521, 1982.
[12] R. F. Flesch. A new readability yardstick. Journal of Applied Psychology, 32:221–233, 1948.
[13] F. P. Brooks, Jr. No silver bullet: essence and accidents of software engineering. Computer, 20(4):10–19, 1987.
[14] F. P. Brooks, Jr. The Mythical Man-Month: Essays on Software Engineering, 20th Anniversary Edition. Addison-Wesley Professional, August 1995.
[15] A. Goncalves. Get acquainted with the new advanced features of JUnit 4. DevX, http://www.devx.com/Java/Article/31983, 2006.
[16] J. Gosling, B. Joy, and G. L. Steele. The Java Language Specification. The Java Series. Addison-Wesley, Reading, MA, USA, 1996.
[17] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault incidence using software change history. IEEE Trans. Softw. Eng., 26(7):653–661, 2000.
[18] R. Gunning. The Technique of Clear Writing. McGraw-Hill International Book Co, New York, 1952.
[19] N. J. Haneef. Software documentation and readability: a proposed process improvement. SIGSOFT Softw. Eng. Notes, 23(3):75–77, 1998.
[20] A. E. Hatzimanikatis, C. T. Tsalidis, and D. Christodoulakis. Measuring the readability and maintainability of hyperdocuments. Journal of Software Maintenance, 7(2):77–90, 1995.
[21] G. Holmes, A. Donkin, and I. Witten. Weka: A machine learning workbench. In Proceedings of the Second Australia and New Zealand Conference on Intelligent Information Systems, 1994.
[22] D. Hovemeyer and W. Pugh. Finding bugs is easy. SIGPLAN Not., 39(12):92–106, 2004.
[23] T. M. Khoshgoftaar, E. B. Allen, N. Goel, A. Nandi, and J. McMullan. Detection of software modules with high debug code churn in a very large legacy system. In ISSRE '96: Proceedings of the Seventh International Symposium on Software Reliability Engineering, page 364, Washington, DC, USA, 1996. IEEE Computer Society.
[24] J. P. Kincaid and E. A. Smith. Derivation and validation of the automated readability index for use with technical materials. Human Factors, 12:457–464, 1970.
[25] P. Knab, M. Pinzger, and A. Bernstein. Predicting defect densities in source code files with decision tree learners. In MSR '06: Proceedings of the 2006 International Workshop on Mining Software Repositories, pages 119–125, 2006.
[26] J. C. Knight and E. A. Myers. Phased inspections and their implementation. SIGSOFT Softw. Eng. Notes, 16(3):29–35, 1991.
[27] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence, 14(2):1137–1145, 1995.
[28] R. Likert. A technique for the measurement of attitudes. Archives of Psychology, 140:44–53, 1932.
[29] L. E. Deimel, Jr. The uses of program reading. SIGCSE Bull., 17(2):5–14, 1985.
[30] S. MacHaffie, R. McLeod, B. Roberts, P. Todd, and L. Anderson. A readability metric for computer-generated mathematics. Technical report, Saltire Software, http://www.saltire.com/equation.html, retrieved 2007.
[31] G. H. McLaughlin. SMOG grading: a new readability formula. Journal of Reading, May 1969.
[32] R. J. Miara, J. A. Musselman, J. A. Navarro, and B. Shneiderman. Program indentation and comprehensibility. Commun. ACM, 26(11):861–867, 1983.
[33] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[34] N. Nagappan and T. Ball. Use of relative code churn measures to predict system defect density. In ICSE '05: Proceedings of the 27th International Conference on Software Engineering, pages 284–292, 2005.
[35] C. V. Ramamoorthy and W.-T. Tsai. Advances in software engineering. Computer, 29(10):47–58, 1996.
[36] D. R. Raymond. Reading source code. In CASCON '91: Proceedings of the 1991 Conference of the Centre for Advanced Studies on Collaborative Research, pages 3–16. IBM Press, 1991.
[37] P. A. Relf. Tool assisted identifier naming for improved software readability: an empirical study. In Proceedings of the 2005 International Symposium on Empirical Software Engineering, November 2005.
[38] S. Rugaber. The use of domain knowledge in program understanding. Ann. Softw. Eng., 9(1-4):143–192, 2000.
[39] C. Simonyi. Hungarian notation. MSDN Library, November 1999.
[40] S. E. Stemler. A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research and Evaluation, 9(4), 2004.
[41] H. Sutter and A. Alexandrescu. C++ Coding Standards: 101 Rules, Guidelines, and Best Practices. Addison-Wesley Professional, 2004.
[42] T. Tenny. Program readability: Procedures versus comments. IEEE Trans. Softw. Eng., 14(9):1271–1279, 1988.
[43] A. Watters, G. van Rossum, and J. C. Ahlstrom. Internet Programming with Python. MIS Press/Henry Holt Publishers, New York, 1996.
[44] E. J. Weyuker. Evaluating software complexity measures. IEEE Trans. Softw. Eng., 14(9):1357–1365, 1988.