Article
Tool Support for Improving Software Quality in Machine
Learning Programs
Kwok Sun Cheng 1, Pei-Chi Huang 1, Tae-Hyuk Ahn 2 and Myoungkyu Song 1,*
Abstract: Machine learning (ML) techniques discover knowledge from large amounts of data, and modeling in ML is becoming essential to software systems in practice. ML research communities have focused on the accuracy and efficiency of ML models, while less attention has been paid to validating the quality of those models. Validating ML applications is a challenging and time-consuming process for developers since prediction accuracy heavily relies on the generated models. ML applications are written in a largely data-driven programming style on top of black-box ML frameworks, and all of the datasets and the ML application need to be investigated individually. Thus, ML validation tasks take considerable time and effort. To address this limitation, we present MLVAL, a novel quality validation technique that increases the reliability of ML models and applications. Our approach helps developers inspect the training data and the features generated for the ML model. Such data validation is important and beneficial to software quality since the quality of the input data affects the speed and accuracy of training and inference. Inspired by software debugging/validation for reproducing reported bugs, MLVAL takes as input an ML application and its training datasets to build the ML models, helping ML application developers easily reproduce and understand anomalies in the ML application. We have implemented an Eclipse plugin for MLVAL that allows developers to validate the prediction behavior of their ML applications, the ML model, and the training data in the Eclipse IDE. In our evaluation, we used 23,500 documents in the bioengineering research domain. We assessed the ability of the MLVAL validation technique to effectively help ML application developers: (1) investigate the connection between the produced features and the labels in the training model, and (2) detect errors early to secure the quality of models from better data.
Our approach reduces the cost of engineering efforts to validate problems, improving data-centric workflows of ML application development.

Keywords: software quality; anomaly detection; quality validation; machine learning applications

Citation: Cheng, K.S.; Huang, P.-C.; Ahn, T.-H.; Song, M. Tool Support for Improving Software Quality in Machine Learning Programs. Information 2023, 14, 53. https://doi.org/10.3390/info14010053
experienced in prior application domains, to increase task efficiency and correctness, rather
than investigating individually all of the datasets with the ML application.
Maintaining and evolving ML data is inherently more complex and challenging than maintaining software code [6]. Recently, interactive visualization principles and approaches from the human-computer interaction community have been discussed as a way to validate ML models and applications [7]. In light of advances in ML technologies, collaboration between ML applications and users is needed to fulfill probabilistic tasks with improved understanding and performance. For example, machine intelligence tasks are integrated into ML applications to build user trust, and reasoning about the uncertainty of model outputs draws on expert knowledge and model steering.
Interactive ML techniques are employed to enhance ML outputs with more interpretable models, increasing user trust and reliably achieving the principle of defect prevention. However, ML models built from a large corpus of training data still suffer from inconsistent and unpredictable behaviors. Many ML applications behave inconsistently and react differently from one user to the next as the underlying model changes over time through learning, task nuances, and customizable configurations. For example, a recommendation application for autocompletion predicts different results after a model update. This inconsistent and unpredictable behavior can confuse users, erode their confidence, and raise uncertainty due to a lack of data availability, quality, and management.
Modeling ML data often brings forth a continuous, labor-intensive effort in data
wrangling and exploratory analysis. Developers typically struggle to identify underlying
problems in data quality when building their own models. Over time, such validation tasks become overwhelming and difficult because of the enormous amount of training data. Traditional software engineering practice provides concise and expressive abstractions, with encapsulation and modular design, that keep code maintainable and limit technical debt [8–10]. Such abstraction techniques and maintenance activities can effectively express desired behavior in software logic. ML frameworks offer the powerful ability to create useful applications with prediction models in a cost-efficient manner; however, ML applications often incur ML-specific issues, compounding costs for several reasons. First, ML models may imperceptibly degrade abstraction or modularity. Second,
ML frameworks are considered to be black boxes, leading to large masses of glue code.
Lastly, ML applications cannot implement desired behavior effectively without dependency
on external data. Changing external data may affect application behaviors unexpectedly.
Regarding maintainable code, static analysis is traditionally used to detect data dependen-
cies. Regarding ML models, it is challenging to identify arbitrary detrimental effects in the
ML application that uses the model.
To address this problem, we present a novel validation approach that increases the
reliability of ML models and applications, called MLVAL, that supports an interactive
visualization interface. Our approach helps developers explore and inspect the training
data and validate and compare the generated features across ML model versions. These interactive features mitigate the challenges of building ML software and of collaboration among developers and researchers, enhancing transparency and supporting reasoning about the behaviors of ML models. MLVAL takes as input an ML application and its training datasets to build and validate the ML models. An interactive environment then guides developers to easily reproduce and understand anomalies in the ML application. Inspired by software debugging practices for reproducing reported bugs, MLVAL enables developers to trace why ML applications behave unexpectedly. We
have implemented an Eclipse plugin for MLVAL, which allows developers to validate the
prediction behavior of their ML applications, the ML model, and the training data within
an interactive environment, Eclipse IDE. The developers interact with MLVAL to update or
refine the original model behavior that may evolve over time as their understanding of the
data model improves, while building an ML model to accurately capture relationships in
the data. As repetitive data comparison tasks are error-prone and time-consuming, MLVAL
aids developers in maintaining and evolving large complex ML applications by comparing
and highlighting differences between old and new data versions that are stored, tracked,
and versioned.
We evaluated the optimization and detection quality of MLVAL. We used 23,500 documents in the bioengineering research domain and added 458,085 related documents to the dataset, obtained by removing duplicates from 387,673 reference papers and 140,765 cited papers. We assessed MLVAL's ability in terms of how effectively our validation
technique can help developers investigate the connection between the produced features
and the labels in the training model and the relationship between the training instances
and the instances the model predicts. This paper discusses the design, implementation,
and evaluation of MLVAL, making major contributions as follows:
• A novel maintenance technique that helps ML application developers or end users detect and correct anomalies in the application's reasoning when predictions fail to achieve the functional requirements.
• A prototype, open-source, plug-in implementation in the Eclipse IDE that blends data maintenance features (i.e., model personalization, data version diff, and data visualization) with the IDE platform to avoid separating code, data, and model maintenance activities.
• A thorough case study that validates our approach by applying a text corpus in the
bioengineering domain, demonstrating MLVAL’s effectiveness in the model training
and tuning processes.
We expand on prior work [11], adding more details about the design and implementation of MLVAL's validation ability for ML applications. The rest of this article is structured
as follows: Section 2 compares MLVAL with the related state of the art. Section 3 de-
scribes several design principles for our approach. Section 4 discusses the design and
implementation of MLVAL. Section 5 highlights the workflow of MLVAL, supporting a
human-in-the-loop approach. Section 6 presents the experimental results that we have con-
ducted to evaluate MLVAL. Section 7 discusses the limitations of our approach. Section 8
outlines conclusions and future work directions.
2. Related Work
Amershi et al. [12] conduct a case study at Microsoft and find that the software
development based on ML is completely different from traditional software development.
For example, managing data in ML is inherently more complex than doing the same with
software code; building, customizing, and extending ML models requires appropriate,
sufficient knowledge of ML, and separating tangled ML components can affect others
during model training and optimization. Yang et al. [13] report their surveys where
most non-experts who are not formally educated in ML (e.g., bioengineering researchers)
simply exploit ML models although they are unable to interpret the models. The internal
mechanism of learning algorithms remains unknown to them, leading to unexpected
prediction results. Cai et al. [14] survey learning hurdles for ML and reveal that application developers using ML frameworks struggle to understand mathematical and theoretical concepts. They identify a specific desire for ML frameworks to better support self-directed learning for understanding the conceptual underpinnings of ML algorithms.
Cai et al. [15] interview medical researchers interacting with diagnostic ML tools
and find that the tool needs to provide both more information on summary statistics and
tutorials for the user interface to understand how they can most effectively interact with the
tool in practice. Amershi et al. [16] propose several design guidelines that provide a clear
recommendation for effective user interaction with ML-based systems. For example, an evolving learning model may produce different outputs for identical inputs over time. Users may be confused by such inconsistent and uncertain behaviors of ML applications, which can reduce their confidence and lead to disuse of ML frameworks. Their
design guidelines suggest that ML applications store previous interactions, maintaining
recent histories and allowing the user to use those association histories efficiently.
Table 1. Visual analysis feature comparison with an existing approach for data exploration and evolution.

Interactive Support. Goal: allow users to observe how the input data and hyperparameters affect the prediction results. Existing approach: visual analysis tool support focusing on data iteration with hyperparameter updates. MLVAL: Eclipse IDE plug-in application incorporating code editing, program running, and model behavior validation.

Model/Feature View. Goal: visualizing learned features to understand, explore, and validate models for the best performance model. Existing approach: histogram style in multiple boxes. MLVAL: tabular style in a tab widget.
Although the above approaches are similar to our approach that allows developers to
(1) explore data changes and to (2) optimize model performance, we focus on helping both
developers and end users (who are often domain experts or practitioners) of the application
3. Design Principles
For ML developers, it is challenging to reason about why a model fails to classify
test data instances or behaves inaccurately. End users without substantial computational expertise may similarly question its reliability when an ML application provides no explanation or its model is too complex to understand. Therefore, we summarize below our
design principles for MLVAL that support designing and maintaining ML applications and
data sets.
P1. Help users secure knowledge and discover insights. Our data visualization
approach should help developers understand and discover insight into what types of
abstract data an ML application transforms into logical and meaningful representations.
For example, our approach enables a user to determine what types of features an ML
application (e.g., deep neural networks) encodes at particular layers. It allows users to
inspect how features evolve during model training and to find potential anomalies
with the model.
P2. Help users adapt their experiences by learning the differences between their
current and previous context over time. Our validation approach should help developers
or end users systematically personalize or adapt an ML application that they interact with.
Validating an ML application is closely associated with personalization to detect and fix anomalies (mistakes) that cause predictions to fail against classification requirements. For example, our approach shows how an iterative train–feedback–correct cycle can allow users to correct mistakes made by a trained classifier, enabling better model selection by personalizing an ML application with different inputs.
P3. Help users convert raw data into input/output formats to feed to an ML applica-
tion. Our data diff approach should help developers transform real-world raw data on a
problem statement into a clean sample for a concrete ML formulation. For example, our
approach allows a user to inspect changes to what columns are available or how data is
coded when a user converts data into tables, standardizing formats and resolving data
quality tasks such as preprocessing analysis of incorrect or missing values. We aim to re-
duce tedious and error-prone efforts of data wrangling tasks when each difference between
the data samples is redundant but not exactly identical.
Figure 1. The Eclipse plug-in implementation of MLVAL for validation support of ML applications
and models.
Figure 2 shows Data Navigation View , which contains the training data, the testing
data, and the raw data. Data Navigation View allows a user to select and inspect individual
datasets. It also lets a user expand or group a list of datasets for models of interest. Data Navigation View responds to a user command action that
recognizes a double-clicking event on each dataset and then automatically enables the
corresponding plug-in view, such as Model Feature View or PDF View , to display the selected
instance, including the training, testing, or raw dataset.
Model Feature View (marked in Figure 1) shows feature vectors for each preprocessed data instance, encoded as
a numerical representation of objects. In the right of Figure 2, Model Feature View displays
the features and the label consisting of the model when a user wants to explore individual
datasets by executing a double-click command action on the training or testing datasets
from Data Navigation View . Model Feature View subsequently displays the user-selected
feature on the multiple tabs, which allows a user to inspect different data instances by
using a tabbed widget interface.
To experiment with attribute combinations to gain insights, John wants to clean up
by comparing the dataset with the previous dataset version before feeding the dataset to
an ML algorithm. He finds an interesting correlation between attributes by using the Data
Diff View in our tool. For example, Data Diff View allows a user to find some attributes
that have a tail-heavy distribution by highlighting data differences. The input/output of
the Model Creation View allows a user to examine various attribute combinations, which
helps a user determine if some attributes are not very useful or to create new interesting
attributes. For example, the attribute Ao is much more correlated with the attribute Ap than with the attribute Aq or Ar. Our tool assists in performing this iterative process in
which John analyzes his output to gain more insight into the exploration process.
Figure 2. The tool screenshot of Data Navigation View and Model Feature View.
At the left of Figure 3, Data Side-by-side View compares the tabular representations
based on a pair of the old and new versions of the ML model. For example, a user selects the
old version “Model Version (12-21-2021 11:01 AM)” and the new version “Model Version
(12-28-2021 3:57 PM)” and then executes a command action “Compare” on the pop-up
menu, which automatically toggles Data Diff View .
Figure 3. The tool screenshot of Data Side-by-side View and Data Diff View.
Data Diff View at the right of Figure 3 shows two table views to allow a user to
investigate the differences between a pair of the old and new versions of the ML model. For
example, the first table view shows the old version “Model Version (12-21-2021 11:01 AM)”,
and the second table view shows the new version “Model Version (12-28-2021 3:57 PM)”. In the table, each row represents a document instance of the datasets, and the columns consist of the instance ID, the numerical feature vectors, and the label. We highlight the updated part(s) evolving from the old version to the new version if the difference is greater than
the threshold that a user configures. For example, Figure 3 shows that the third feature of
instance 2986147 has evolved from 0.077 to 0.377 when a user computes the differences
between the old and new versions of the model by using the diff threshold 0.2.
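As an illustration of this thresholded comparison, the following minimal Python sketch flags feature values whose absolute change between two stored model versions exceeds a user-configured threshold. The CSV layout and column names here are assumptions for illustration, not MLVAL's actual storage format.

```python
import pandas as pd

def diff_feature_tables(old_csv, new_csv, threshold=0.2):
    """Flag feature values that changed by more than `threshold`
    between an old and a new model version (illustrative sketch)."""
    old = pd.read_csv(old_csv, index_col="instance_id")
    new = pd.read_csv(new_csv, index_col="instance_id")
    # Align on shared instances and feature columns before comparing.
    shared = old.index.intersection(new.index)
    cols = [c for c in old.columns if c in new.columns and c != "label"]
    delta = (new.loc[shared, cols] - old.loc[shared, cols]).abs()
    changed = delta[delta > threshold].stack()  # keeps only above-threshold cells
    for (instance, feature), d in changed.items():
        print(f"instance {instance}: {feature} changed by {d:.3f} "
              f"({old.loc[instance, feature]} -> {new.loc[instance, feature]})")

# Example: flag the differences reported in Figure 3 with a 0.2 threshold.
# diff_feature_tables("model_12-21-2021.csv", "model_12-28-2021.csv", threshold=0.2)
```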
The Model History View in Figure 1 allows John to reproduce a learning model
easily on any dataset whenever he secures a new (or old) dataset in the future. It helps
him to experiment with various transformations (e.g., feature scaling), while inspecting
and finding the best combination of hyperparameters. For example, Model History View
stores both inputs and outputs, such as loss function, optimizer, iteration, input unit/layer,
hidden unit/layer, output unit/layer, loss, accuracy, precision, recall, and F1 score. Existing
models, predictors, hyperparameters, and other metrics are reused as much as possible,
making it easy to implement a baseline ML application quickly.
Figure 4 shows Model History View , which contains a list of archived ML models that a
user has built previously for exploratory analyses during the experimental investigation. A
user enables Model History View by executing the command action “Open Model History
View” on the pop-up menu on the selected training datasets from Data Navigation View . In
Model History View , a user can investigate detailed information about individual models
such as creation dates, model parameters, produced accuracy, and so on.
Using the Model History View and the Model Creation View aids John in discovering
a great combination of hyperparameters by generating all possible combinations of hy-
perparameters. For example, given the input of two hyperparameters p1 and p2 along
with three and four values [p1 :{10, 20, 30}, p2 :{100, 200, 300, 400}], the Model Creation View
evaluates all 3 × 4 = 12 combinations of the specified p1 and p2 hyperparameter values.
The Model History View searches for the best score that may continually evolve and improve
the performance.
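The exhaustive enumeration described above can be sketched in a few lines of Python; the hyperparameter names p1 and p2 follow the example in the text, and the scoring function is a stand-in for building and evaluating a model.

```python
from itertools import product

def train_and_evaluate(params):
    # Placeholder: in the tool this would build and score a model with the given
    # hyperparameters; here it returns a dummy value so the sketch runs end to end.
    return -abs(params["p1"] - 20) - abs(params["p2"] - 200)

# Hypothetical grid matching the example in the text: 3 x 4 = 12 combinations.
grid = {"p1": [10, 20, 30], "p2": [100, 200, 300, 400]}

best_score, best_params = float("-inf"), None
for p1, p2 in product(grid["p1"], grid["p2"]):
    params = {"p1": p1, "p2": p2}
    score = train_and_evaluate(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)  # the combination with the highest score
```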
Figure 5 shows Model Creation View , which allows a user to build an ML model incre-
mentally within an interactive environment. To create and optimize a model, a user selects
the features while entering a combination of 15 parameters such as the number of layers,
iterations, loss function, etc. For example, a user chooses 1 input layer, 10 input units, relu
activation, 10 input shapes, 2 hidden layers, 5 hidden units, 1 output layer, 10 output units,
categorical_crossentropy loss function, adam optimizer, accuracy as metrics, and 5 iterations.
After configuring the parameters, a user executes the model creation by clicking the “Build
Model” button. To build a model based on an ML framework (e.g., Keras), MLVAL pa-
rameterizes a separate script program, which is a template implemented in Python. Then,
the synthesized script program takes as input the user-entered parameters and imports the Sequential class to group a linear stack of layers into a tf.keras.Model. The model, then, is passed to KerasClassifier, which is an implementation of the scikit-learn classifier
API in Keras. Lastly, the script program returns the result to MLVAL, which reports the
result to a user in Training Result View .
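A minimal sketch of what such a parameterized template script might look like is shown below, assuming a Keras Sequential model wrapped in a scikit-learn-compatible classifier; the layer sizes mirror the example configuration above, and the wrapper import path varies across Keras versions (newer setups use scikeras.wrappers instead).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=10, input_units=10, hidden_layers=2, hidden_units=5,
                output_units=10, loss="categorical_crossentropy",
                optimizer="adam", metrics=("accuracy",)):
    """Assemble a Sequential model from user-entered parameters (illustrative sketch)."""
    model = keras.Sequential()
    model.add(layers.Dense(input_units, activation="relu", input_shape=(input_shape,)))
    for _ in range(hidden_layers):
        model.add(layers.Dense(hidden_units, activation="relu"))
    model.add(layers.Dense(output_units, activation="softmax"))
    model.compile(loss=loss, optimizer=optimizer, metrics=list(metrics))
    return model

# Wrap the builder in a scikit-learn-style classifier; in newer environments this
# class lives in scikeras.wrappers rather than tf.keras.wrappers.scikit_learn.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
clf = KerasClassifier(build_fn=build_model, epochs=5, batch_size=32, verbose=0)
# clf.fit(X_train, y_train); clf.score(X_test, y_test)
```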
Model Creation View allows a user to build a model as well as investigate the corre-
sponding result in Training Result View , as shown in Figure 6. Training Result View reports
five evaluation metrics for the resulting model: accuracy, loss, precision, recall,
and F1 score. It also informs a user about the used parameters such as loss function, opti-
mizer function, metrics, and the number of iterations. For example, Figure 6 shows that the
model is generated with 82.1% accuracy, 86.8% precision, 87.3% recall, and 87.0% F1 score.
John experiments and validates a model. For example, to avoid overfitting, John
chooses the best value of the hyperparameter with the help of MLVAL to produce a model
with the lowest generalization error. Given this model, John investigates an ML application
by running the produced model on the test datasets. From the Data Navigation View , he
selects one of the test datasets and opens the Model Test View , which displays the model
import option to allow him to evaluate the final model on the test dataset and determine if
it is better than the model currently in use. For example, from the output of the Model Test
View , he computes a confidence interval for the generalization error.
Figure 7 shows Model Test View , which allows a user to select one of the optimized
models in the drop-down list and displays the model training result (accuracy, loss, pre-
cision, recall, and F1 score) and the model parameter information (a loss function, an
optimizer, metrics, iterations, the number of input layers, etc.). To enable Model Test View , a
user selects one of the test datasets from Data Navigation View and executes the command “Run Testing Dataset” on the pop-up menu. Given the model, a user computes the classification prediction with the selected test dataset by clicking the “Predict” button, and MLVAL runs
a separate script program implemented in Python. Then, the script program injects the test
dataset into the selected model that a user has trained and uses it to make predictions on
the test dataset.
Figure 8 shows Test Result View , which implements two tab views: (1) Results and
(2) Prediction. The Results tab view reports to the user the evaluation results (accuracy,
loss, precision, recall, and F1 score) when a user makes predictions on the new test dataset.
For example, Figure 8 shows 83.8% accuracy, 86.6% precision, 86.9% recall, and 85.8% F1
score when a user applies the model to the test dataset. The Prediction tab view shows the
features, the corresponding ground truth, and the produced labels for each instance of the
test dataset, where we highlight the differences between the ground truth and the labels.
For example, Figure 8 indicates that the model classifies the first three instances with the
labels 9, 5, and 7 rather than the label 1.
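The evaluation reported in Test Result View can be reproduced in spirit with scikit-learn's metrics; the sketch below assumes the ground-truth and predicted labels are available as arrays, and the weighted averaging mode is an assumption rather than the tool's documented setting.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def report_predictions(y_true, y_pred):
    """Summarize prediction quality and list misclassified instances (sketch)."""
    print(f"accuracy:  {accuracy_score(y_true, y_pred):.3f}")
    print(f"precision: {precision_score(y_true, y_pred, average='weighted'):.3f}")
    print(f"recall:    {recall_score(y_true, y_pred, average='weighted'):.3f}")
    print(f"f1:        {f1_score(y_true, y_pred, average='weighted'):.3f}")
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        if t != p:
            # Mismatches correspond to the highlighted rows in the Prediction tab.
            print(f"instance {i}: ground truth {t}, predicted {p}")

# report_predictions(y_test, predicted_labels)
```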
Data Visualization View shows the prediction result through data visualization for
relevant topic clusters regarding text categorization. In Figure 9, Data Visualization View
illustrates the data set with a graphical representation by using the network visualization.
Each node indicates a document instance, and each edge denotes a similarity relationship between a pair of document instances. A subgroup (cluster) connecting similar instances is highlighted with the same color, while data instances in different clusters have no relationship with each other. For example, in Figure 9, we visualize the 23,500-document dataset with the K-means clustering algorithm [45] and 10 clusters, using 10 colors for nodes and edges. A user interacts with Data Visualization View to configure the number of clusters and the clusters to focus on, making the structure of the dataset easier to understand. The prototype
tool supports K-means and hierarchical clustering algorithms [45]. An unsupervised
ML algorithm, K-means takes as input the dataset vectors, finds the patterns between
data instances, and groups them together based on their similarity. Similar to K-means,
hierarchical clustering is also an unsupervised ML algorithm; however, the cluster analysis
builds a hierarchy of clusters. The implementation uses the JavaScript library vis.js
(https://github.com/visjs, accessed on 12 January 2023) that dynamically handles datasets
and manipulates the data interaction.
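A hedged sketch of the clustering step behind Data Visualization View is given below, assuming TF-IDF document vectors and scikit-learn's KMeans with 10 clusters as in Figure 9; the node/edge records are only a schematic input for a vis.js-style network, and the similarity threshold for drawing edges is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def cluster_documents(texts, n_clusters=10, edge_threshold=0.5):
    """Cluster documents and emit node/edge records for a network view (sketch)."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)
    # One node per document; the cluster id drives the node color.
    nodes = [{"id": i, "group": int(labels[i])} for i in range(len(texts))]
    # Draw an edge between sufficiently similar document pairs.
    sims = cosine_similarity(vectors)
    edges = [{"from": i, "to": j}
             for i in range(len(texts)) for j in range(i + 1, len(texts))
             if sims[i, j] >= edge_threshold]
    return nodes, edges
```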
Our tool brings John more awareness of various design choices and helps him personalize his learning model better and faster than traditional black-box ML applications. As mentioned in design principle P1 in Section 3, the aforementioned plug-in views let users compare
data features and model performance across versions in an interactive visualization envi-
ronment. Our interactive approach for discovering, managing, and versioning the data
needed for ML applications can help obtain insight into what types of abstract data an ML
application can transform into logical representations and into understanding relationships
between data, features, and algorithms.
As mentioned in design principle P2, our plug-in applications assist users exploring
ML pipelines in understanding, validating, and isolating anomalies during model changes.
Such evolving models can be monitored in our IDE plug-in views to determine whether they cause less accurate predictions over time as features or labels are altered in unexpected
ways. As mentioned in design principle P3, our visualization views incorporated with a
development environment can ease the transition from model debugging to error analysis
and mitigate the burden of workflow switching during the model inspection and compar-
ison processes. Our visual summary and graphical representation views can help users
focus on improving feature selection and capture better data quality by examining feature
distribution and model performance.
5. Approach
Figure 10 illustrates the overview of MLVAL’s workflow that supports a human-in-
the-loop approach. Our approach MLVAL helps developers maintain ML applications by
exploring neural network design decisions, inspecting the training data sets, and validating
the generated models.
Figure 10. The overview of our human-in-the-loop workflow for reaching target accuracy for a
learning model faster, for maximizing accuracy by combining human and machine intelligence, and
for increasing efficiency in assisting end user tasks with machine learning.
5.1. Preprocessing
One of the important steps in text mining, ML, or natural language processing tech-
niques is text preprocessing [46,47]. We conduct tokenization, stop-word removal, filtering,
lemmatization, and stemming for the preprocessing step with the document datasets. We
perform text normalization for classification tasks on the datasets by using natural language
processing (NLP) algorithms such as stemming and lemmatization. Stemming, which sets aside the context of the word occurrence, reduces different grammatical word forms in a document that share an identical stem to a common form. Lemmatization, which further considers the part of speech (POS) and the context of the word, removes inflectional endings and uses the base form of a word.
To generalize each word (term) in terms of its derivational and inflectional suffixes,
we leverage a stemming method to process the datasets. We discover each word form with
its base form when the concept is similar but the word form is divergent. For example,
text segmentation, truncation, and suffix removal are applied for clustering, categorization,
and summarization for the document datasets. For example, {“expects”, “expected”,
“expecting”} can be normalized to {“expect”}. We exploit an NLP stemmer that truncates a
word at the nth symbol, preserves n tokens, and combines singular and plural forms for
matching the root or stem (morphological constituents) of derived words [48].
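A minimal preprocessing sketch using NLTK is shown below, covering tokenization, stop-word removal, lemmatization, and stemming as described above; the pipeline we use cites Stanford CoreNLP [48], so this is an illustration of the steps rather than the exact toolchain.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Required resources: nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def preprocess(text):
    """Tokenize, drop stop words, lemmatize, then stem each remaining token."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

# Example: "expects", "expected", and "expecting" all normalize to "expect".
print(preprocess("The model expects, expected, and is expecting results."))
```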
We measure the relevance between a document $D_i$ and a cluster $C_j$ by the cosine similarity between $D_i$ and the topic vector $T_{C_j}$:

$$\mathrm{Relevance}(D_i, C_j) = \cos(D_i, T_{C_j}) = \frac{D_i^{T} \cdot T_{C_j}}{\lVert D_i \rVert \cdot \lVert T_{C_j} \rVert}$$

We take as input the document attributes $D_i$ and the cluster $C_j$, where $D_i = [w_{t,D_i} \mid t \in V]$ and $T_{C_j} = [w_{t,T_{C_j}} \mid t \in V]$. $V$ is the vocabulary set occurring in the text tokens of the topic words and the document attributes, and $w_{t,D_i}$ and $w_{t,T_{C_j}}$ indicate the term weights. $\lVert D_i \rVert$ and $\lVert T_{C_j} \rVert$ are the magnitudes of the vectors $D_i$ and $T_{C_j}$, and $\cdot$ is the dot product between vectors.
We complement cosine similarity by applying the following BM25 [52] computation method:

$$\mathrm{Relevance}(D_i, C_j) = \mathrm{BM25}(D_i, T_{C_j}) = \sum_{i=1}^{n} \mathrm{Idf}_{Q_i} \cdot \frac{f(D_i, T_{C_j}) \cdot (k_1 + 1)}{f(D_i, T_{C_j}) + k_1 \cdot \left(1 - b + b \cdot \frac{|s|}{avgdl}\right)}$$

Here, $f(D_i, T_{C_j})$ is the term frequency, $\mathrm{Idf}_{Q_i}$ is the inverse document frequency of the query term, $|s|$ is the document length, $avgdl$ is the average document length in the corpus, and $k_1$ and $b$ are the free parameters of BM25.
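To make the two relevance measures concrete, the following Python sketch computes the cosine score and a BM25-style score from token lists; the parameter values k1 = 1.2 and b = 0.75 are conventional defaults rather than values reported here, and the idf table is assumed to be precomputed.

```python
import math
from collections import Counter

def cosine_relevance(doc_tokens, topic_tokens):
    """Cosine similarity between term-frequency vectors of a document and a topic."""
    d, t = Counter(doc_tokens), Counter(topic_tokens)
    dot = sum(d[w] * t[w] for w in d.keys() & t.keys())
    norm = (math.sqrt(sum(v * v for v in d.values()))
            * math.sqrt(sum(v * v for v in t.values())))
    return dot / norm if norm else 0.0

def bm25_relevance(doc_tokens, topic_tokens, idf, avgdl, k1=1.2, b=0.75):
    """BM25-style score of a document against a topic's terms (illustrative sketch)."""
    freq = Counter(doc_tokens)
    score = 0.0
    for term in topic_tokens:
        f = freq[term]
        score += idf.get(term, 0.0) * (f * (k1 + 1)) / (
            f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score
```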
We analyzed Document Object Models (DOM) and parsed the DOM trees to export
diverse metadata for the documents, including the digital object identifier (DOI), abstract,
the author keywords, the article title, and other publication information. For PDF files, we converted them to text format by using a PDF-to-Text converter [55]. We combined
keywords with boolean search operators (AND, OR, AND NOT) in the search execution
syntax to produce relevant query results.
We wrote a batch script program that reduced the labor-intensive investigation, which
was typically performed manually. This program conducted web crawling to collect the
hypertext structure of the web in academic literature, finding the research papers of interest.
Given a collection of query keywords, the crawler program takes as input parameters
for application programming interfaces (APIs) such as Elsevier APIs and NCBI APIs.
Regarding the NCBI search, the program uses https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
(accessed on 12 January 2023) to build an HTTP request. Regarding the Elsevier APIs,
the program parameterizes https://api.elsevier.com/content/article/ (accessed on 12 January
2023) to create an HTTP request. The API key grants permission to execute a limited number of data requests from the APIs, subject to quota limits. The web crawler iteratively calls
the search component to collect data while controlling the throttling rate. We conducted
the experiment with an Intel Xeon CPU 2.40 GHz. The next section details our study and
result.
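A sketch of this kind of throttled crawler is shown below, built on the NCBI E-utilities and Elsevier article endpoints mentioned above; the query parameters, API-key handling, and sleep interval are illustrative assumptions rather than the exact script we used.

```python
import time
import requests

NCBI_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
ELSEVIER_BASE = "https://api.elsevier.com/content/article/doi/"

def search_ncbi(query, api_key, retmax=100):
    # Build an E-utilities search request for a keyword query.
    params = {"db": "pubmed", "term": query, "retmax": retmax,
              "retmode": "json", "api_key": api_key}
    return requests.get(NCBI_BASE, params=params, timeout=30).json()

def fetch_elsevier(doi, api_key):
    # Retrieve article metadata from the Elsevier article API by DOI.
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    return requests.get(ELSEVIER_BASE + doi, headers=headers, timeout=30).json()

def crawl(queries, api_key, delay=0.4):
    """Iteratively query the search APIs while respecting a throttling rate."""
    results = []
    for q in queries:
        results.append(search_ncbi(q, api_key))
        time.sleep(delay)  # stay under the per-key request quota
    return results
```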
(Figure: two accuracy plots; y-axis “Accuracy” from 0.00 to 1.00, with x-axis ranges of 0–100 and 2–10.)
7. Threats to Validity
Regarding our evaluation with a large document corpus, in terms of external validity,
we may not generalize our results beyond the bioengineering domain that we explored
for our evaluation. Our scope is limited to datasets found in bibliographic document
repositories such as NCBI and Elsevier. We acknowledge the likelihood that we might
have failed to notice the inclusion of other relevant datasets. Other online repositories (e.g.,
CiteSeerX) and online forums (e.g., Stack Overflow) may provide more related documents
or articles. Our future work includes diverse datasets in different categories to improve the
external validity of our outcomes.
In terms of internal validity, the participants in our initial data collection phase read
the text descriptions of the title, the abstract summary, and the introduction section of each
document then checked whether the contexts and discussions were related to the topics of
interest. The inherent ambiguity in such settings may introduce a threat to investigator bias.
In other words, the participants' own judgment may have influenced how they determined the relevant research papers via domain-specific analysis. To guard against this threat, we
leveraged Fleiss’ Kappa [54] to reconcile a conflict when we estimated the agreement level
among data investigators.
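For reference, Fleiss' Kappa over the investigators' labels can be computed with statsmodels as in the sketch below; the three-rater, binary-relevance ratings shown are only an assumed example of the agreement check.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows: documents, columns: raters; values: assigned category (e.g., 1 = relevant, 0 = not).
ratings = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])
table, _ = aggregate_raters(ratings)   # per-document counts for each category
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```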
In terms of construct validity, the accuracy of controlling a threshold used in the bug
detection in ML models directly affects our tool’s ability and performance in capturing
anomalies by highlighting differences between ML model versions. In our future work, we
will design an enhanced algorithm to automatically adjust such a contrasting threshold.
Our interactive environment mechanism makes use of an ML framework (e.g., Keras),
which is able to build a model with a combination of the hyperparameters. Thus, the
soundness of our approach is dependent on the effectiveness of the ML framework we
have exploited. This limitation can be overcome by plugging in ML frameworks that are
more resilient to classification for text corpora.
Author Contributions: Conceptualization, M.S.; Methodology, M.S.; Software, K.S.C. and M.S.;
Validation, K.S.C. and M.S.; Formal analysis, M.S.; Investigation, M.S.; Resources, K.S.C.; Data
curation, K.S.C.; Writing-original draft, K.S.C., P.-C.H., T.-H.A. and M.S.; Writing-review & editing,
K.S.C., P.-C.H., T.-H.A. and M.S.; Visualization, K.S.C.; Supervision, M.S.; Project administration,
M.S.; Funding acquisition, M.S. All authors have read and agreed to the published version of the
manuscript.
Funding: This work was partially supported by NSF Grant No. OIA-1920954.
Data Availability Statement: The source code is publicly available at https://figshare.com/articles/software/Software_Download/21785153 (accessed on 12 January 2023).
References
1. Cai, C.J.; Reif, E.; Hegde, N.; Hipp, J.; Kim, B.; Smilkov, D.; Wattenberg, M.; Viegas, F.; Corrado, G.S.; Stumpe, M.C.; et al.
Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI
Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–14.
2. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer
with deep neural networks. Nature 2017, 542, 115–118. [CrossRef] [PubMed]
3. Miner, A.S.; Milstein, A.; Schueller, S.; Hegde, R.; Mangurian, C.; Linos, E. Smartphone-based conversational agents and responses
to questions about mental health, interpersonal violence, and physical health. JAMA Intern. Med. 2016, 176, 619–625. [CrossRef]
[PubMed]
4. Urmson, C.; Whittaker, W.R. Self-driving cars and the urban challenge. IEEE Intell. Syst. 2008, 23, 66–68. [CrossRef]
5. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.;
et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
6. Muller, M.; Lange, I.; Wang, D.; Piorkowski, D.; Tsay, J.; Liao, Q.V.; Dugan, C.; Erickson, T. How data science workers work with
data: Discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing
Systems, Glasgow, Scotland, UK, 4–9 May 2019; pp. 1–15.
7. Andrienko, N.; Andrienko, G.; Fuchs, G.; Slingsby, A.; Turkay, C.; Wrobel, S. Visual Analytics for Data Scientists; Springer
International Publishing: Berlin/Heidelberg, Germany, 2020.
8. Fowler, M. Refactoring: Improving the Design of Existing Code; Addison-Wesley Professional: Boston, MA, USA, 2000.
9. Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Crespo, J.F.; Dennison, D. Hidden
technical debt in machine learning systems. Adv. Neural Inf. Process. Syst. 2015, 28, 2503–2511.
10. Ousterhout, J.K. A Philosophy of Software Design; Yaknyam Press: Palo Alto, CA, USA, 2018; Volume 98.
11. Cheng, K.S.; Ahn, T.H.; Song, M. Debugging Support for Machine Learning Applications in Bioengineering Text Corpora. In
Proceedings of the 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), Los Alamitos, CA,
USA, 27 June–1 July 2022; pp. 1062–1069.
12. Amershi, S.; Begel, A.; Bird, C.; DeLine, R.; Gall, H.; Kamar, E.; Nagappan, N.; Nushi, B.; Zimmermann, T. Software engineering
for machine learning: A case study. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering:
Software Engineering in Practice (ICSE-SEIP), Montreal, QC, Canada, 25–31 May 2019; pp. 291–300.
13. Yang, Q.; Suh, J.; Chen, N.C.; Ramos, G. Grounding interactive machine learning tool design in how non-experts actually build
models. In Proceedings of the 2018 Designing Interactive Systems Conference, Hong Kong, China, 9–13 June 2018; pp. 573–584.
14. Cai, C.J.; Guo, P.J. Software developers learning machine learning: Motivations, hurdles, and desires. In Proceedings of the 2019
IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), Memphis, TN, USA, 14–18 October 2019;
pp. 25–34.
15. Cai, C.J.; Winter, S.; Steiner, D.; Wilcox, L.; Terry, M. “Hello AI”: Uncovering the onboarding needs of medical practitioners for
human-AI collaborative decision-making. Proc. ACM Hum.-Comput. Interact. 2019, 3, 1–24. [CrossRef]
16. Amershi, S.; Weld, D.; Vorvoreanu, M.; Fourney, A.; Nushi, B.; Collisson, P.; Suh, J.; Iqbal, S.; Bennett, P.N.; Inkpen, K.; et al.
Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems,
Glasgow, Scotland, UK, 4–9 May 2019; pp. 1–13.
17. Amberkar, A.; Awasarmol, P.; Deshmukh, G.; Dave, P. Speech recognition using recurrent neural networks. In Proceedings of the
2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), Coimbatore, India, 1–3 March
2018; pp. 1–4.
18. Dlamini, Z.; Francies, F.Z.; Hull, R.; Marima, R. Artificial intelligence (AI) and big data in cancer and precision oncology. Comput.
Struct. Biotechnol. J. 2020, 18, 2300–2311. [CrossRef]
19. Vaishya, R.; Javaid, M.; Khan, I.H.; Haleem, A. Artificial Intelligence (AI) applications for COVID-19 pandemic. Diabetes Metab.
Syndr. Clin. Res. Rev. 2020, 14, 337–339. [CrossRef]
20. Ye, D.; Chen, G.; Zhang, W.; Chen, S.; Yuan, B.; Liu, B.; Chen, J.; Liu, Z.; Qiu, F.; Yu, H.; et al. Towards playing full moba games
with deep reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 621–632.
21. Nikzad-Khasmakhi, N.; Balafar, M.; Feizi-Derakhshi, M.R. The state-of-the-art in expert recommendation systems. Eng. Appl.
Artif. Intell. 2019, 82, 126–147. [CrossRef]
22. Xu, W. Toward human-centered AI: A perspective from human-computer interaction. Interactions 2019, 26, 42–46. [CrossRef]
23. Zanzotto, F.M. Human-in-the-loop artificial intelligence. J. Artif. Intell. Res. 2019, 64, 243–252. [CrossRef]
24. Shneiderman, B. Human-centered artificial intelligence: Reliable, safe & trustworthy. Int. J. Hum.-Comput. Interact. 2020,
36, 495–504.
25. Shneiderman, B. Human-centered artificial intelligence: Three fresh ideas. AIS Trans. Hum.-Comput. Interact. 2020, 12, 109–124.
[CrossRef]
26. Müller-Schloer, C.; Tomforde, S. Organic Computing-Technical Systems for Survival in the Real World; Springer: Berlin/Heidelberg,
Germany, 2017.
27. Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security;
Chapman and Hall/CRC: London, UK, 2018; pp. 99–112.
28. Lee, C.J.; Teevan, J.; de la Chica, S. Characterizing multi-click search behavior and the risks and opportunities of changing
results during use. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information
Retrieval, Gold Coast, QLD, Australia, 1–6 July 2014; pp. 515–524.
29. De Graaf, M.; Allouch, S.B.; Van Dijk, J. Why do they refuse to use my robot?: Reasons for non-use derived from a long-term
home study. In Proceedings of the 2017 12th ACM/IEEE International Conference on Human-Robot Interaction, Vienna, Austria,
6–9 March 2017; pp. 224–233.
30. Jaech, A.; Ostendorf, M. Personalized language model for query auto-completion. arXiv 2018, arXiv:1804.09661.
31. Norman, D.A. How might people interact with agents. Commun. ACM 1994, 37, 68–71. [CrossRef]
32. Horvitz, E. Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems, Pittsburgh, PA, USA, 15–20 May 1999; pp. 159–166.
33. Höök, K. Steps to take before intelligent user interfaces become real. Interact. Comput. 2000, 12, 409–426. [CrossRef]
34. Choudhury, M.D.; Lee, M.K.; Zhu, H.; Shamma, D.A. Introduction to this special issue on unifying human computer interaction
and artificial intelligence. Hum.-Comput. Interact. 2020, 35, 355–361. [CrossRef]
35. Luger, E.; Sellen, A. “Like Having a Really Bad PA” The Gulf between User Expectation and Experience of Conversational
Agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, San Jose, CA, USA, 7–12 May
2016; pp. 5286–5297.
36. Purington, A.; Taft, J.G.; Sannon, S.; Bazarova, N.N.; Taylor, S.H. “Alexa is my new BFF” Social Roles, User Satisfaction, and
Personification of the Amazon Echo. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in
Computing Systems, Denver, CO, USA, 6–11 May 2017; pp. 2853–2859.
37. Defy Media. Damn You Auto Correct! Available online: http://www.damnyouautocorrect.com/ (accessed on 12 January 2023).
38. Clark, L.; Pantidi, N.; Cooney, O.; Doyle, P.; Garaialde, D.; Edwards, J.; Spillane, B.; Gilmartin, E.; Murad, C.; Munteanu, C.;
et al. What makes a good conversation? Challenges in designing truly conversational agents. In Proceedings of the 2019 CHI
Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK, 4–9 May 2019; pp. 1–12.
39. Yang, Q.; Steinfeld, A.; Rosé, C.; Zimmerman, J. Re-examining whether, why, and how human-AI interaction is uniquely difficult
to design. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April
2020; pp. 1–13.
40. Nielsen, J. Ten Usability Heuristics; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2005.
41. Kumar, B.A.; Goundar, M.S. Usability heuristics for mobile learning applications. Educ. Inf. Technol. 2019, 24, 1819–1833.
[CrossRef]
42. Boukhelifa, N.; Bezerianos, A.; Chang, R.; Collins, C.; Drucker, S.; Endert, A.; Hullman, J.; North, C.; Sedlmair, M. Challenges in
Evaluating Interactive Visual Machine Learning Systems. IEEE Comput. Graph. Appl. 2020, 40, 88–96. [CrossRef]
43. Hohman, F.; Wongsuphasawat, K.; Kery, M.B.; Patel, K. Understanding and visualizing data iteration in machine learning.
In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020;
pp. 1–13.
44. Hohman, F.; Kahng, M.; Pienta, R.; Chau, D.H. Visual analytics in deep learning: An interrogative survey for the next frontiers.
IEEE Trans. Vis. Comput. Graph. 2018, 25, 2674–2693. [CrossRef] [PubMed]
45. Xu, R.; Wunsch, D. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [CrossRef] [PubMed]
46. Uysal, A.K.; Gunal, S. The impact of preprocessing on text classification. Inf. Process. Manag. 2014, 50, 104–112. [CrossRef]
47. Ayedh, A.; Tan, G.; Alwesabi, K.; Rajeh, H. The effect of preprocessing on arabic document categorization. Algorithms 2016, 9, 27.
[CrossRef]
48. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language
Processing Toolkit. In Proceedings of the ACL (System Demonstrations), Baltimore, MD, USA, 22–27 June 2014.
49. Rehurek, R.; Sojka, P. Software framework for topic modelling with large corpora. In Proceedings of the LREC Citeseer, Valletta,
Malta, 17–23 May 2010.
50. Hansen, P.; Jaumard, B. Cluster analysis and mathematical programming. Math. Program. 1997, 79, 191–215. [CrossRef]
51. Singhal, A. Modern information retrieval: A brief overview. IEEE Data Eng. Bull. 2001, 24, 35–43.
52. Robertson, S.; Zaragoza, H.; Taylor, M. Simple BM25 extension to multiple weighted fields. In Proceedings of the CIKM,
Washington, DC, USA, 8–13 November 2004.
53. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
54. Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378. [CrossRef]
55. Apache PDFBox—A Java PDF Library. Available online: https://pdfbox.apache.org/ (accessed on 12 January 2023).
56. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications.
Neurocomputing 2017, 234, 11–26. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.