Chapter 1
Feature Generation and Engineering
for Software Analytics
Xin Xia
Faculty of Information Technology, Monash University, Australia
David Lo
School of Information Systems, Singapore Management University, Singapore
Abstract
1.1 Introduction
1.2 Features for Defect Prediction
    1.2.1 File-level Defect Prediction
        1.2.1.1 Code Features
        1.2.1.2 Process Features
    1.2.2 Just-in-time Defect Prediction
    1.2.3 Prediction Models and Results
1.3 Features for Crash Release Prediction for Apps
    1.3.1 Complexity Dimension
    1.3.2 Time Dimension
    1.3.3 Code Dimension
    1.3.4 Diffusion Dimension
    1.3.5 Commit Dimension
    1.3.6 Text Dimension
    1.3.7 Prediction Models and Results
1.4 Features from Mining Monthly Reports to Predict Developer Turnover
    1.4.1 Working Hours
    1.4.2 Task Report
    1.4.3 Project
    1.4.4 Prediction Models and Results
1.5 Summary
Abstract
This chapter provides an introduction to feature generation and engineering
for software analytics. Specifically, we show how domain-specific features
can be designed and used to automate three software engineering tasks: (1)
detecting defective software modules (defect prediction), (2) identifying
crashing mobile app releases (crash release prediction), and (3) predicting
who will leave a software team (developer turnover prediction). For each of
the three tasks, different sets of features are extracted from a diverse set
of software artifacts and used to build predictive models.
1.1 Introduction
As developers work on a project, they leave behind many digital artifacts.
These digital trails can provide insights into how software is developed and
provide a rich source of information to help improve development practices.
For instance, GitHub hosts more than 57M repositories, and is currently used
by more than 20M developers [1]. As another example, Stack Overflow has
more than 3.9M registered users, 8.8M questions, and 41M comments [58].
The productivity of software developers and testers can be improved using
information from these repositories.
There have been a number of studies in software engineering which focus on
building predictive models by mining a wide variety of software data collected
from systems, their repositories and relevant online resources [3, 6, 35, 59, 60].
For example, in defect prediction [35,60], developers aim to predict whether a
module/class/method/change contains bugs, and they build a predictive model
by extracting features from historical modules/classes/methods/changes
with known labels (i.e., buggy or clean). In bug priority prediction [53], devel-
opers aim to predict the priority level of a bug when it is submitted, and they
build a predictive model by leveraging features from historical bug reports
with known priority levels. In practice, the performance of a predictive model
will be largely affected by the features used to build the model. For example,
Rahman and Devanbu investigated the impact of different types of features on
the performance of defect prediction, and found that process features perform
better than code features [45]. However, feature identification and generation
from software artifacts and repositories are challenging since (1) software
engineering data are complex, and (2) domain knowledge is required to identify
effective features.
Features can be extracted from various types of software artifacts, e.g.,
source code, bug reports, code reviews, commit logs, and email lists. Even in
the same software artifacts, there are various ways to extract features. For
example, to extract features from source code, trace features (e.g., statement
coverage) can be extracted by running the source code and analyzing its
execution trace, code features (e.g., code complexity) by leveraging static
analysis tools (e.g., SciTools), textual features (e.g., readability and term
frequency) by using text mining techniques, and process features (e.g., the
number of developers who changed the code) by mining the change history of the code.
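To make this concrete, the short sketch below (an illustration, not code from the chapter) mines one simple process feature, the number of distinct developers who ever changed a file, from a Git history; the repository path and file name are hypothetical.

    import subprocess

    def distinct_authors(repo_path, file_path):
        """Count distinct developers who changed a file (a simple process feature)."""
        # %an prints the author name of each commit that touched the file.
        out = subprocess.run(
            ["git", "-C", repo_path, "log", "--follow",
             "--pretty=format:%an", "--", file_path],
            capture_output=True, text=True, check=True)
        authors = {line.strip() for line in out.stdout.splitlines() if line.strip()}
        return len(authors)

    # Hypothetical usage:
    # print(distinct_authors("/path/to/repo", "src/Main.java"))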
In this chapter, we aim to provide an introduction to feature generation
and engineering for software analytics, and show how domain-specific features
are extracted and used for three software engineering use cases, i.e., defect
prediction, crash release prediction, and developer turnover prediction. These
three case studies extract different kinds of features from software artifacts,
and build predictive models based on these features. Some features used in
these three case studies are related, while others are problem-specific.
The remainder of the chapter is structured as follows. Section 1.2 describes
features used in defect prediction. Section 1.3 presents features used in crash
release prediction for apps. Section 1.4 elaborates on the features generated
from monthly reports for developer turnover prediction. Section 1.5 concludes the
chapter and discusses future directions.
\[
  \mathrm{LCOM3} = \frac{\frac{1}{a}\sum_{j=1}^{a} m(A_j) \; - \; m}{1 - m} \tag{1.1}
\]
In the above equation, m is the number of methods in a class, a is the
number of attributes in a class, and m(A_j) is the number of methods
that access attribute A_j.
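As a worked illustration of Equation (1.1), the sketch below computes LCOM3 from a hypothetical mapping of each method in a class to the set of attributes it accesses; the representation and function name are assumptions for the example, not an API from the chapter.

    def lcom3(method_attrs, attributes):
        """Compute LCOM3 as in Equation (1.1).

        method_attrs: dict mapping method name -> set of attributes it accesses
        attributes:   list of all attribute names in the class
        """
        m = len(method_attrs)   # number of methods
        a = len(attributes)     # number of attributes
        if m <= 1 or a == 0:
            return 0.0          # LCOM3 is undefined for m <= 1; 0 is used here by convention
        # m(A_j): number of methods that access attribute A_j
        m_of_a = [sum(attr in accessed for accessed in method_attrs.values())
                  for attr in attributes]
        return (sum(m_of_a) / a - m) / (1 - m)

    # Hypothetical example: two methods sharing no attributes -> LCOM3 = 1 (low cohesion)
    print(lcom3({"get_x": {"x"}, "get_y": {"y"}}, ["x", "y"]))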
7. Others:
est (RF). In the last family, four ensemble learning methods are investigated:
Bagging, Adaboost, Rotation Forest, and Random Subspace. Unlike the other
models, ensemble learning models are built from multiple base classifiers.
Yang et al. compared the performance of different prediction models for
just-in-time defect prediction and found that EALR performed best among the
studied models: it can detect 33% of defective changes when inspecting only
20% of the LOC [66]. Similar results were reported by Yan et al. [63], who
found that EALR achieved the best performance in file-level defect prediction:
it can detect 34% of defective files when inspecting 20% of the LOC.
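The percentage of defective modules detected when inspecting 20% of the LOC can be computed as in the following sketch, assuming each module comes with a predicted risk score, a size in LOC, and a buggy/clean label; modules are ranked here by score per LOC as a simple effort-aware ordering (an illustrative choice, not necessarily the exact EALR ranking).

    import numpy as np

    def recall_at_effort(scores, loc, buggy, effort=0.20):
        """Fraction of defective modules found when inspecting `effort` of the
        total LOC, inspecting modules in descending order of score / LOC."""
        scores, loc, buggy = map(np.asarray, (scores, loc, buggy))
        order = np.argsort(-(scores / loc))       # effort-aware ranking, densest first
        budget = effort * loc.sum()               # LOC budget, e.g. 20% of total LOC
        within = np.cumsum(loc[order]) <= budget  # modules that fit inside the budget
        return buggy[order][within].sum() / max(buggy.sum(), 1)

    # Hypothetical usage: scores = model.predict(X)
    # print(recall_at_effort(scores, loc, labels))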
second training subset. Similarly, a second classifier is trained on the second
training subset and used to obtain the textual scores for the first training
subset. In the prediction phase, for a new release, text mining classifiers
built on all of the training releases are used to compute the values of the
textual features. Different text mining classifiers can be used to build the
textual features; our prior study used five types of textual classifiers to
calculate the scores of the textual features: a fuzzy set classifier [67], a
naive Bayes classifier [33], a naive Bayes multinomial classifier [33], a
discriminative naive Bayes multinomial classifier [50], and a complement naive
Bayes classifier [47].
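A minimal sketch of this two-fold scheme is shown below, assuming two training subsets of release texts with crash labels (hypothetical variable names) and using scikit-learn's multinomial naive Bayes as a stand-in for the five classifiers listed above.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def textual_scores(texts_a, labels_a, texts_b, labels_b):
        """Two-fold textual scoring: each training release is scored by a
        classifier trained on the other training subset."""
        def fit_and_score(train_texts, train_labels, score_texts):
            vec = CountVectorizer()
            clf = MultinomialNB()
            clf.fit(vec.fit_transform(train_texts), train_labels)
            # Assumes labels are 0 (non-crashing) / 1 (crashing); column 1 is
            # the predicted probability of the crashing class.
            return clf.predict_proba(vec.transform(score_texts))[:, 1]

        scores_a = fit_and_score(texts_b, labels_b, texts_a)  # subset B scores subset A
        scores_b = fit_and_score(texts_a, labels_a, texts_b)  # subset A scores subset B
        return scores_a, scores_b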
Readability features, which refer to the ease with which a reader can
understand the task report, are also collected. The readability of a text is
measured by the number of syllables per word and the length of its sentences.
Readability measures can be used to estimate how many years of education a
reader needs to read the text without difficulty [11,12]. Amazon.com uses
readability measures to inform customers about the difficulty of books.
Readability features of the task report are used to complement its statistics
features, since readability can also be an indicator of a developer's working
attitude. Following the prior study on the state of the art in readability,
nine readability features are used: Flesch [12], SMOG (simple measure of
gobbledygook) [31], Kincaid [26], Coleman-Liau [10], Automated Readability
Index [48], Dale-Chall [11], difficult words [11], Linsear Write [2], and
Fog [15]. These readability features can be extracted using a Python package
named textstat (https://pypi.python.org/pypi/textstat).
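For instance, the nine measures above roughly correspond to the following textstat calls (a sketch with a made-up report text; exact function names may vary across textstat versions).

    import textstat

    # Hypothetical task-report text.
    report = "The developer completed the refactoring task and updated the unit tests."

    readability_features = {
        "flesch":          textstat.flesch_reading_ease(report),
        "smog":            textstat.smog_index(report),
        "kincaid":         textstat.flesch_kincaid_grade(report),
        "coleman_liau":    textstat.coleman_liau_index(report),
        "ari":             textstat.automated_readability_index(report),
        "dale_chall":      textstat.dale_chall_readability_score(report),
        "difficult_words": textstat.difficult_words(report),
        "linsear_write":   textstat.linsear_write_formula(report),
        "fog":             textstat.gunning_fog(report),
    }
    print(readability_features)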
1.4.3 Project
The features in this category describe the project that a developer is working
on in each month. The working environment and the other members of the project
can have an important effect on a developer's working experience. For example,
good collaboration with other members of the project can improve a developer's
work efficiency and experience. For each month, the following measures of the
project the developer is working on are calculated: the number of project
members, the sum, mean and standard deviation of the working hours of project
members, and the number of members who changed compared with the previous month.
4. p{N} hour std: the standard deviation of the working hours of project members in the Nth month.
5. p{N} person change: the number of project members who changed compared with the previous month, in the Nth month.
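A sketch of how these project measures could be derived is shown below, assuming a hypothetical monthly-report table with one row per developer, project, and month; the person change is counted here as the symmetric difference between consecutive months' member sets, which is one possible interpretation.

    import pandas as pd

    # Hypothetical monthly-report table: one row per (developer, project, month).
    reports = pd.DataFrame({
        "developer": ["alice", "bob", "carol", "alice", "bob"],
        "project":   ["P1", "P1", "P1", "P1", "P1"],
        "month":     [1, 1, 1, 2, 2],
        "hours":     [160, 150, 120, 170, 140],
    })

    # Per-project, per-month aggregates: member count and working-hour statistics.
    project_stats = (reports.groupby(["project", "month"])["hours"]
                     .agg(member_num="count", hour_sum="sum",
                          hour_mean="mean", hour_std="std")
                     .reset_index())

    # Person change: members who differ from the previous month's team.
    members = reports.groupby(["project", "month"])["developer"].agg(set).to_dict()
    person_change = {
        (p, m): len(devs ^ members.get((p, m - 1), devs))
        for (p, m), devs in members.items()
    }
    project_stats["person_change"] = [
        person_change[(p, m)]
        for p, m in zip(project_stats["project"], project_stats["month"])
    ]
    print(project_stats)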
1.5 Summary
In this chapter, we present three case studies to demonstrate how features
can be generated from different software artifacts for different software
engineering problems. The generated features can be used as input to a machine
learning engine (e.g., a classification algorithm) to automate software tasks
or to better manage projects. We hope this chapter inspires more researchers
and developers to dig into software artifacts and generate more powerful
features, to further improve the performance of existing software analytics
solutions or to build new automated solutions that address the pain points of
software developers.
Nowadays, the performance of many predictive models developed to improve
software engineering tasks is highly dependent on manually constructed
features. However, significant expert knowledge is required to identify
domain-specific features. It would therefore be interesting to investigate
methods that automatically generate features from raw data. Deep learning is a
promising direction that can be used to automatically learn advanced features
from the multitude of raw data available in software repositories, APIs, blog
posts, etc. Some recent studies have shown the potential of deep learning to
solve many software analytics problems (e.g., defect prediction [56,65],
similar bug detection [64], and linkable knowledge detection [62]) with
promising results. Thus, it would be interesting to use deep learning
techniques to relieve the heavy workload involved in manually crafting
domain-specific features for various software engineering tasks and applications.
Bibliography
[11] Edgar Dale and Jeanne S Chall. A formula for predicting readability:
Instructions. Educational research bulletin, pages 37–54, 1948.
[12] Rudolf Franz Flesch. How to write plain English: A book for lawyers and
consumers. Harpercollins, 1979.
[13] Baljinder Ghotra, Shane McIntosh, and Ahmed E Hassan. Revisiting the
impact of classification techniques on the performance of defect prediction
models. In ICSE, pages 789–800. IEEE Press, 2015.
[14] Todd L Graves, Alan F Karr, James S Marron, and Harvey Siy. Predicting
fault incidence using software change history. IEEE Transactions on
software engineering, 26(7):653–661, 2000.
[15] Robert Gunning. The Technique of Clear Writing. 1952.
[16] Philip J Guo, Thomas Zimmermann, Nachiappan Nagappan, and Brendan Murphy.
Characterizing and predicting which bugs get fixed: an empirical study of
Microsoft Windows. In Software Engineering, 2010 ACM/IEEE 32nd International
Conference on, volume 1, pages 495–504. IEEE, 2010.
[17] Tibor Gyimothy, Rudolf Ferenc, and Istvan Siket. Empirical validation
of object-oriented metrics on open source software for fault prediction.
IEEE Transactions on Software engineering, 31(10):897–910, 2005.
[18] Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell.
A systematic literature review on fault prediction performance in software
engineering. TSE, 38(6):1276–1304, 2012.
[19] Ahmed E Hassan. Predicting faults using the complexity of code changes.
In Proceedings of the 31st International Conference on Software Engi-
neering, pages 78–88. IEEE Computer Society, 2009.
[20] B. Henderson-Sellers. Object-Oriented Metrics, Measures of Complexity.
Prentice Hall, 1996.
[21] Qiao Huang, Xin Xia, and David Lo. Supervised vs unsupervised models:
A holistic look at effort-aware just-in-time defect prediction. In Proceed-
ings of the 33nd International Conference on Software Maintenance and
Evolution. IEEE, 2017, to appear.
[22] James J Jiang and Gary Klein. Supervisor support and career anchor
impact on the career satisfaction of the entry-level information systems
professional. Journal of management information systems, 16(3):219–
240, 1999.
[23] Marian Jureczko and Lech Madeyski. Towards identifying software
project clusters with regard to defect prediction. In Proceedings of the 6th
International Conference on Predictive Models in Software Engineering,
page 9. ACM, 2010.
[24] Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris
Mockus, Anand Sinha, and Naoyasu Ubayashi. A large-scale empirical
study of just-in-time quality assurance. IEEE Transactions on Software
Engineering, 39(6):757–773, 2013.
[25] Sunghun Kim, Thomas Zimmermann, E James Whitehead Jr, and Andreas Zeller.
Predicting faults from cached history. In Proceedings of the 29th international
conference on Software Engineering, pages 489–498. IEEE Computer Society, 2007.
[26] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S
Chissom. Derivation of new readability formulas (automated readabil-
ity index, fog count and flesch reading ease formula) for navy enlisted
personnel. Technical report, DTIC Document, 1975.
[27] Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch.
Benchmarking classification models for software defect prediction: A pro-
posed framework and novel findings. IEEE Transactions on Software
Engineering, 34(4):485–496, 2008.
[28] Paul Luo Li, James Herbsleb, Mary Shaw, and Brian Robinson. Experi-
ences and results from initiating field defect prediction and product test
prioritization efforts at abb inc. In Proceedings of the 28th international
conference on Software engineering, pages 413–422. ACM, 2006.
[29] R. Martin. Oo design quality metrics - an analysis of dependencies. IEEE
Trans. Software Eng., 20(6):476–493, 1994.
[30] Shinsuke Matsumoto, Yasutaka Kamei, Akito Monden, Ken-ichi Matsumoto, and
Masahide Nakamura. An analysis of developer metrics for fault prediction. In
Proceedings of the 6th International Conference on Predictive Models in
Software Engineering, page 18. ACM, 2010.
[31] G Harry Mc Laughlin. Smog grading-a new readability formula. Journal
of reading, 12(8):639–646, 1969.
[32] T.J. McCabe. A complexity measure. IEEE Trans. Software Eng.,
2(4):308–320, 1976.
[33] Andrew McCallum, Kamal Nigam, et al. A comparison of event models for
naive bayes text classification. In AAAI-98 Workshop on Learning for Text
Categorization, 1998.
[34] Thilo Mende and Rainer Koschke. Revisiting the evaluation of defect
prediction models. In Proceedings of the 5th International Conference on
Predictor Models in Software Engineering, page 7. ACM, 2009.
[35] Tim Menzies, Andrew Butcher, David Cok, Andrian Marcus, Lucas Layman,
Forrest Shull, Burak Turhan, and Thomas Zimmermann. Local versus global lessons
for defect prediction and effort estimation. IEEE Transactions on software
engineering, 39(6):822–834, 2013.
[36] Tim Menzies, Jeremy Greenwald, and Art Frank. Data mining static
code attributes to learn defect predictors. IEEE transactions on software
engineering, 33(1):2–13, 2007.
[37] Audris Mockus and David M Weiss. Predicting risk of software changes.
Bell Labs Technical Journal, 5(2):169–180, 2000.
[38] Raimund Moser, Witold Pedrycz, and Giancarlo Succi. A comparative
analysis of the efficiency of change metrics and static code attributes for
defect prediction. In Proceedings of the 30th international conference on
Software engineering, pages 181–190. ACM, 2008.
[39] John C. Munson and Taghi M. Khoshgoftaar. The detection of fault-
prone programs. IEEE Transactions on Software Engineering, 18(5):423–
433, 1992.
[40] Nachiappan Nagappan, Thomas Ball, and Andreas Zeller. Mining metrics
to predict component failures. In Proceedings of the 28th international
conference on Software engineering, pages 452–461. ACM, 2006.
[41] Hector M Olague, Letha H Etzkorn, Sampson Gholston, and Stephen
Quattlebaum. Empirical validation of three software metrics suites to
predict fault-proneness of object-oriented classes developed using highly
iterative or agile software development processes. IEEE Transactions on
software Engineering, 33(6), 2007.
[42] Nancy Pekala. Holding on to top talent. Journal of Property management,
66(5):22–22, 2001.
[43] Daryl Posnett, Vladimir Filkov, and Premkumar Devanbu. Ecological
inference in empirical software engineering. In Proceedings of the 2011
26th IEEE/ACM International Conference on Automated Software En-
gineering, pages 362–371. IEEE Computer Society, 2011.
[44] Ranjith Purushothaman and Dewayne E Perry. Toward understanding
the rhetoric of small source code changes. IEEE Transactions on Software
Engineering, 31(6):511–526, 2005.
[45] Foyzur Rahman and Premkumar Devanbu. How, and why, process met-
rics are better. In Proceedings of the 2013 International Conference on
Software Engineering, pages 432–441. IEEE Press, 2013.
[46] Foyzur Rahman, Daryl Posnett, Abram Hindle, Earl Barr, and Premkumar
Devanbu. Bugcache for inspections: hit or miss? In Proceedings of the 19th ACM
SIGSOFT symposium and the 13th European conference on Foundations of software
engineering, pages 322–331. ACM, 2011.
[47] Jason D Rennie, Lawrence Shih, Jaime Teevan, David R Karger, et al.
Tackling the poor assumptions of naive bayes text classifiers. In ICML,
2003.
[61] Xin Xia, Emad Shihab, Yasutaka Kamei, David Lo, and Xinyu Wang.
Predicting crashing releases of mobile applications. In Proceedings of
the 10th ACM/IEEE International Symposium on Empirical Software
Engineering and Measurement, page 29. ACM, 2016.
[62] Bowen Xu, Deheng Ye, Zhenchang Xing, Xin Xia, Guibin Chen, and
Shanping Li. Predicting semantically linkable knowledge in developer
online forums via convolutional neural network. In Proceedings of the
31st IEEE/ACM International Conference on Automated Software Engi-
neering, pages 51–62. ACM, 2016.
[63] Meng Yan, Yicheng Fang, David Lo, Xin Xia, and Xiaohong Zhang. File-
level defect prediction: Unsupervised vs. supervised models. In Proceed-
ings of the 11th ACM/IEEE International Symposium on Empirical Soft-
ware Engineering and Measurement. IEEE, 2017, to appear.
[64] Xinli Yang, David Lo, Xin Xia, Lingfeng Bao, and Jianling Sun. Com-
bining word embedding with information retrieval to recommend similar
bug reports. In Software Reliability Engineering (ISSRE), 2016 IEEE
27th International Symposium on, pages 127–137. IEEE, 2016.
[65] Xinli Yang, David Lo, Xin Xia, Yun Zhang, and Jianling Sun. Deep
learning for just-in-time defect prediction. In Software Quality, Reliability
and Security (QRS), 2015 IEEE International Conference on, pages 17–
26. IEEE, 2015.
[66] Yibiao Yang, Yuming Zhou, Jinping Liu, Yangyang Zhao, Hongmin Lu,
Lei Xu, Baowen Xu, and Hareton Leung. Effort-aware just-in-time defect
prediction: simple unsupervised models could be better than supervised
models. In Proceedings of the 2016 24th ACM SIGSOFT Internation-
al Symposium on Foundations of Software Engineering, pages 157–168.
ACM, 2016.
[67] H.J. Zimmermann. Fuzzy Set Theory and Its Applications, Second Revised
Edition. 1992.