
Master of Science in Software Engineering

May 2019

Data Quality Model for Machine learning

Nitesh Varma Rudraraju


Varun Boyanapally

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in
partial fulfilment of the requirements for the degree of Master of Science in Software
Engineering. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any
sources other than those listed in the bibliography and identified as references. They further
declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:
Author(s):
Nitesh Varma Rudraraju
E-mail: [email protected]

Varun Boyanapally
E-mail: [email protected]

University advisor:
Dr. Samuel A. Fricker
Department of Software Engineering

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57
ABSTRACT

Context: Machine learning is a part of artificial intelligence, an area that is growing continuously. Most internet-related services, such as social media, email spam filtering, e-commerce sites, and search engines, now use machine learning. The quality of machine learning output relies on the input data, so the input data is crucial: good-quality input data can give a better outcome for the machine learning system. In order to achieve quality data, a data scientist can apply a data quality model to the data used for machine learning. A data quality model can help data scientists monitor and control the input data of machine learning. However, there is no considerable amount of research on data quality attributes and data quality models for machine learning.

Objectives: The primary objectives of this paper are to find and understand the state of the art and state of practice on data quality attributes for machine learning and to develop a data quality model for machine learning in collaboration with data scientists.

Methods: This paper consists of two studies: 1) a literature review conducted across different databases to identify literature on data quality attributes and data quality models for machine learning; 2) an in-depth interview study conducted to better understand and verify the data quality attributes identified in the literature review, carried out in collaboration with data scientists from multiple locations. A total of 15 interviews were performed, and based on the results we proposed a data quality model built on the interviewees' perspectives.

Result: We identified 16 data quality attributes as important, based on the perspective of the experienced data scientists who were interviewed in this study. With these selected data quality attributes, we propose a data quality model with which data scientists can monitor and improve the quality of data for machine learning; the effects of these data quality attributes on machine learning are also stated.

Conclusion: This study signifies the importance of data quality, for which we propose a data quality model for machine learning based on the industrial experiences of data scientists. Filling this research gap benefits machine learning practitioners and data scientists who intend to identify quality data for machine learning. In order to prove that the data quality attributes in the data quality model are important, a further experiment can be conducted, which is proposed as future work.

Keywords: Machine learning, Data Quality Attributes, Data Quality Model.

ACKNOWLEDGEMENTS

The journey of our Master's thesis has been wonderful; we were able to enhance both our technical and personal knowledge along the way. This thesis would not have been successful without the contribution of all those people who have been our inspiration, support, and encouragement throughout the work.

We express our deepest gratitude to our remarkable supervisor, Dr. Samuel A. Fricker, for his unconditional support, motivation, and outstanding guidance throughout our Master's thesis. Without his endless patience, experience, and discussions our thesis would not have been completed. He played a prominent role in our thesis, and we are honoured to have had such an ideal supervisor.

We would like to extend our heartfelt thanks to Marcel Würsch, Daniel Ostler, Ina Schiering, Anna Samuelsson, Frank höppner, Oskar Eriksson, Sabri Karagönen, Misbah Uddin, Matteo Landrò, Federica Citterio, Henrik Sjökvist, Simon Strandberg, David Rydéna, Viktor Hallman, Mikael Huss, Jon Warghed and Josefin Rosén for allotting their valuable time to share their exceptional knowledge with us.

All the interviewees played a major role by providing answers based on their immense knowledge and experience, which helped us improve our thesis.

Finally, we would like to extend our heartfelt love and gratitude to our parents, family, and friends for their moral support and for backing us in every situation of our lives. We are forever grateful to our parents for providing us with these beautiful opportunities and for believing in us. We dedicate all our success and achievements to our parents and brothers.

CONTENTS
ABSTRACT ......................................................................................................................................................... III
ACKNOWLEDGEMENTS ................................................................................................................................. IV
CONTENTS ........................................................................................................................................................... V
LIST OF TABLES .............................................................................................................................................. VII
LIST OF FIGURES ..........................................................................................................................................VIII
1 INTRODUCTION ......................................................................................................................................... 1
1.1 CONTEXT: ................................................................................................................................................ 1
1.2 RESEARCH PROBLEM: .............................................................................................................................. 1
1.3 AIM AND OBJECTIVES .............................................................................................................................. 2
1.4 STRUCTURE OF THIS THESIS DOCUMENT: ................................................................................................ 2
2 BACKGROUND & RELATED WORK ..................................................................................................... 4
2.1 BACKGROUND: ........................................................................................................................................ 4
2.1.1 Artificial Intelligence: ........................................................................................................................ 4
2.1.2 Machine learning: .............................................................................................................................. 4
2.1.3 Data quality dimensions & Data quality attributes: .......................................................................... 6
2.1.4 Data Quality Model:........................................................................................................................... 6
2.1.5 Data quality: ....................................................................................................................................... 7
2.2 RELATED WORK: ..................................................................................................................................... 7
3 RESEARCH METHOD .............................................................................................................................. 10
3.1 RESEARCH QUESTIONS ........................................................................................................................... 10
3.2 LITERATURE REVIEW............................................................................................................................. 10
3.2.1 Search Method: ................................................................................................................................ 11
3.2.2 Search Process: -.............................................................................................................................. 11
3.2.3 Selection of Database:...................................................................................................................... 12
3.2.4 Database Search: - ........................................................................................................................... 12
3.2.5 Title Screening: ................................................................................................................................ 13
3.2.6 Abstract Screening: .......................................................................................................................... 13
3.2.7 Inclusion/exclusion criteria: ............................................................................................................. 14
3.2.8 Quality Assessment:.......................................................................................................................... 14
3.3 IN-DEPTH INTERVIEW METHOD: ............................................................................................................ 16
3.3.1 Interview Method: ............................................................................................................................ 16
3.3.2 Interview Questionnaire and Trello board: ..................................................................................... 17
3.3.3 Information of interviewees: - .......................................................................................................... 19
3.3.4 Analysis Protocol for In-depth Interview Method: ........................................................................... 20
4 RESULTS AND ANALYSIS ...................................................................................................................... 24
4.1 SEARCH RESULTS FOR THE LITERATURE REVIEW: .................................................................................. 24
4.2 ANALYSIS OF LITERATURE REVIEW ....................................................................................................... 25
4.3 RESULTS OF IN-DEPTH INTERVIEW: ....................................................................................................... 30
4.3.1 Summarizing Interviews: .................................................................................................................. 30
4.3.2 Labelled code: .................................................................................................................................. 34
4.4 ANALYSIS OF IN-DEPTH INTERVIEW STUDY: ......................................................................................... 35
4.4.1 Research Implications: ..................................................................................................................... 38
5 DATA QUALITY MODEL: ....................................................................................................................... 40
5.1.1 Comparison with ISO/IEC 25012: ................................................................................................... 41
6 DISCUSSION ............................................................................................................................................... 44
6.1 LIMITATIONS ......................................................................................................................................... 47
6.2 THREATS TO VALIDITY........................................................................................................................... 48
6.2.1 Internal validity ................................................................................................................................ 48
6.2.2 External validity ............................................................................................................................... 48

6.2.3 Construct validity ............................................................................................................................. 48
6.2.4 Conclusion validity ........................................................................................................................... 49
7 CONCLUSION & FUTURE WORK ........................................................................................................ 50
7.1 CONCLUSION ......................................................................................................................................... 50
7.2 FUTURE WORK ....................................................................................................................................... 50
REFERENCES ..................................................................................................................................................... 52
APPENDIX: .......................................................................................................................................................... 57

LIST OF TABLES

Table 3-1 Data Base Search String .......................................................................................... 12


Table 3-2: Results of Search String.......................................................................................... 13
Table 3-3 Criteria for Quality assessment ................................................................................ 15
Table 3-4: Information of Interviewees ................................................................................... 20
Table 3-5: Categorizing Data Quality Attributes ..................................................................... 22
Table 3-6: Applying codes to themes ........................................................................ 22
Table 3-7: Implementing the coding process ........................................................................... 23
Table 4-1: Selected Articles ..................................................................................................... 24
Table 4-2: Selected Publications ............................................................................................. 25
Table 4-3: Selected Data Quality Attributes ............................................................................ 27
Table 4-4: Attributes Identified from Published reports .......................................................... 28
Table 4-5: Labelled code: ......................................................................................................... 35
Table 4-6: Occurrence of Data Quality Attributes in Interviews and selected articles ............ 36
Table 4-7: Application of Data Quality attributes .................................................................... 38
Table 5-1: Attributes and Sub-Attributes ................................................................................. 41
Table 6-1: Limitations .............................................................................................................. 48
Table 0-1: Interview 1: Important Attributes ........................................................................... 58
Table 0-2: Interview 2: Important attributes ............................................................................ 61
Table 0-3: Interview 2: Not Important attributes ..................................................................... 61
Table 0-4: Interview 2: Attributes depending on application .................................................. 62
Table 0-5: Interview 3: Important Attributes ........................................................................... 64
Table 0-6: Interview 3: Not Important Attributes .................................................................... 65
Table 0-7: Interview 4: Important attributes ............................................................................ 68
Table 0-8: Interview 5: Important attributes ............................................................................ 71
Table 0-9: Interview 5: Not Important attributes ..................................................................... 71
Table 0-10: Interview 6: Important attributes .......................................................................... 74
Table 0-11: Interview 6: Not Important attributes ................................................................... 74
Table 0-12: Interview 7: Important attributes .......................................................................... 76
Table 0-13: Interview 7: Not Important attributes ................................................................... 77
Table 0-14: Interview 8: Important attributes .......................................................................... 79
Table 0-15: Interview 9: Important attributes .......................................................................... 82
Table 0-16: Interview 10: Important attributes ........................................................................ 84
Table 0-17: Interview 11: Important attributes ........................................................................ 88
Table 0-18: Interview 11: Not Important attributes ................................................................. 88
Table 0-19: Interview 12: Important attributes ........................................................................ 90
Table 0-20: Interview 12: Not Important attributes ................................................................. 91
Table 0-21: Suggested data Quality Attributes by the interview ............................................. 91
Table 0-22: Interview 13: Important attributes ........................................................................ 94
Table 0-23: Interview 14: Important attributes ........................................................................ 96
Table 0-24: Interview 15: Important attributes ........................................................................ 98
Table 0-25: Interview 15: Not Important attributes ................................................................. 99

LIST OF FIGURES

Figure 1-1 Structure of the Thesis .............................................................................................. 3


Figure 2-1 Types of Machine Learning Algorithms .................................................................. 6
Figure 3-1 Data Base Search .................................................................................................... 12
Figure 3-2 Interview Questionnaire - I ..................................................................................... 17
Figure 3-3 Interview Questionnaire - II ................................................................................... 18
Figure 3-4 Trello Board Questionnaire .................................................................................... 19
Figure 3-6: Conventional Content Analysis ............................................................................. 21
Figure 5-1 Data Quality Model for Machine Learning[95] ..................................................... 41
Figure 5-2: Selection of data quality attributes ........................................................................ 43

1 INTRODUCTION

1.1 Context:
Artificial intelligence (AI) studies problems of intricate data processing that often have their roots in some aspect of biological information processing [1]. In theory, artificial intelligence can be described as the development of a computer's ability to perform tasks that make use of visual perception, speech recognition, and decision making [1]. The development of such AI systems is receiving increasing attention; for example, Bonseyes, which offered us an environment in which to execute our research, is an open and growing AI platform. “Bonseyes will transform AI
development from a cloud-centric model, dominated by large internet companies, to an edge device-
centric model through a marketplace and an open AI platform” [2].
Machine learning can be stated as an application of artificial intelligence where users have the luxury of providing data to the machine and letting the machine do the learning by itself [3]. Applied AI generally refers to automation, for example intelligent trading, while generalised AI refers to handling any day-to-day activity; this latter area has provided the motivation to develop machine learning, often referred to as a subset of AI [3]. A current example of the state of the art in machine learning and AI is the self-driving car, which uses both technologies to the core [3]. Machine learning systems identify objects in images, transcribe speech into text, match news items, posts, or products to users' interests, and select relevant search results. Deep learning refers to the class of techniques applied in such applications; it can be described as an era in machine learning research with the strong aim of moving machine learning closer to artificial intelligence [4]. Deep learning falls into the category of machine learning but is based on learned data representations rather than task-specific algorithms. Deep learning can be supervised (e.g. classification), semi-supervised, or unsupervised (e.g. pattern analysis). It uses several layers of non-linear processing for transformations, and each layer uses the result of the previous layer as its input. Mostly, “modern deep learning models are highly dependent on ANN (artificial neural networks) use latent variables organised in terms of layers” [4].
Data is of the utmost importance in machine learning, as machine learning works mainly on data [5]. The quality of data in machine learning therefore also plays a major role. In order to maintain good data quality, a data scientist can apply a data quality model to the data used for machine learning. A data quality model contains data quality attributes with which data scientists (people who have experience in training machine learning systems) can improve the quality of data. There are articles related to data quality models, but we could not find any relevant literature on a data quality model for machine learning, and this is the main reason our study contributes a data quality model for machine learning. Data quality attributes relevant to machine learning can be identified in multiple articles; with the help of these attributes, we conduct multiple interviews to identify which of them are important for machine learning. With these results, we develop a data quality model that can be applied to data for machine learning.

1.2 Research Problem:


The problem with data in machine learning is that a data scientist needs to give appropriate data to the machine learning system that he or she plans to train. If we consider training a model with image data, the data scientist needs to obtain the right image data to train the model. Here the selection of data needs to focus mainly on the objects in the image; the data scientist needs to provide data that has the proper context, and this is sometimes problematic.

Example: A data scientist builds a classifier that can differentiate between humans and dogs. The data scientist needs image data of both dogs and humans; the model will not work if the data contains only images of dogs or only images of humans, as the machine reads whatever images it is given as data. Another problem is that, even if we provide image data of dogs and humans in equal amounts, the given data might still not be suitable.

Say we train the model with images that all have the same or a similar background; the machine then treats the background as part of the pattern. If we give it images of humans with one background and dogs with another, the machine learns the background as well, and when an image of a dog is unclear and its background matches the humans' background, the machine will read the dog as a human. So we must provide the image data in such a way that the objects of interest (humans and dogs in the example above) appear with both similar and different backgrounds and in different locations, so that the object of interest can be recognized easily [6].
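
To make this kind of check concrete, the following is a minimal sketch (the labels and counts are hypothetical, not data from this study) of how a data scientist might inspect the class balance of an image training set before training:

```python
# Minimal sketch: verify that an image training set is not dominated by one
# class before training. The label list below is hypothetical example data.
from collections import Counter

training_labels = ["dog", "dog", "human", "dog", "human", "dog"]  # hypothetical

counts = Counter(training_labels)
total = sum(counts.values())

for label, count in counts.items():
    print(f"{label}: {count} images ({count / total:.0%} of the data)")

# A heavily skewed distribution (e.g. 90% dogs) is an early warning that the
# classifier may simply learn to predict the majority class.
```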

Our study mainly focuses on the quality of data: by developing a data quality model we share data quality attributes with which data scientists can check whether the selected data is of good quality or not. This data quality model helps in characterizing the quality of data sets for machine learning algorithms [7].

1.3 Aim and objectives


The main aim of this study is to consolidate research on data quality for machine learning and find the relevant data needed to proceed with the interview process. The key idea is to develop a data quality model for machine learning in collaboration with data scientists who have experience in machine learning.
The above aim is achieved through the following objectives:
1. To understand the current state of the art and state of practice on the data quality models available for data in machine learning.
2. To conduct a literature review and extract data quality attributes for machine learning from the available resources.
3. To perform an in-depth interview study to assess the selected data quality attributes and develop our data quality model.

1.4 Structure of this Thesis Document:


Section 1: Introduction of our study: context, research problem, structure of the thesis, and aim and objectives.
Section 2: Background, related work, knowledge gap, and success criteria for our research.
Section 3: Details of the research method: the research questions and the selected research methods, i.e. the literature review and the in-depth interview study.
Section 4: Detailed explanation of the results and analysis of the literature review and the in-depth interview study; the answers to RQ1, RQ1.1, RQ2, and RQ2.1 are stated in this section.
Section 5: Discussion of the data quality model that we developed and a comparison of the ISO 25012 model with our study.
Section 6: Discussion of the entire document, threats to validity, and limitations of our research.
Section 7: Final conclusion and future work of our research.

Figure 1-1 Structure of the Thesis
2 BACKGROUND & RELATED WORK
This section presents the background of the key concepts used throughout this study and the related work done previously that is similar to our study.

2.1 Background:
This study mainly relates to machine learning and to the data quality attributes used to define the quality of data, with which a data scientist can monitor and control the data for a machine learning algorithm. Data scientists can use these data quality attributes to monitor and improve the quality of the data used to train machine learning systems.

2.1.1 Artificial Intelligence:


Artificial intelligence is used in various areas and is becoming more popular nowadays [8]. Artificial intelligence refers to the simulation of human intelligence by machines, mainly computer systems [8].

2.1.2 Machine learning:


The key component of all machine learning systems is data [5]. Machine learning (ML) is a subset of artificial intelligence (AI), and both are currently trending topics in computing. Artificial intelligence and machine learning have changed tremendously over the past few years, and their research is widely used in government and defence industries [9]. Machine learning is defined as computer systems performing a task without being given specific instructions, relying instead on example data and past experience [10]. According to Arthur Samuel in 1959, “computers have the ability to learn without being explicitly programmed” [11]. A data scientist trains machine learning algorithms with data sets collected for the problem at hand; using this data, the data scientist can train the algorithms so that the system can solve complex tasks [12]. By training with more good-quality data, these systems can solve tasks based on the trained data instead of pure intuition, and they can also adapt to new situations based on newly trained data [12]. This is where our study comes to light: the data scientist needs to train the machine learning algorithms with quality data. Using this study, machine learning practitioners can have a good idea from the start of the type of data they need to train machine learning systems so that the systems give the best-quality outcome.

Machine learning Algorithms:

Machine learning algorithms are programs that are trained on and analyse input data to predict an output within an adequate range. As a data scientist trains these algorithms with new data, they learn and improve their operations, enhancing performance and developing intelligence over time [13].

Types of machine learning algorithms that are commonly used (a minimal sketch contrasting the first two types follows this list):

1. Supervised learning: the algorithm learns a function that connects the inputs to their desired outputs.
2. Unsupervised learning: the algorithm is given only inputs, without desired outputs, and the goal is for the system to find structure in the data by itself.
3. Semi-supervised learning: both labelled and unlabelled data are used to generate a suitable function or classifier.
4. Reinforcement learning: the system learns and adapts to a given situation by observing the real world and the outcomes of its actions.
5. Transduction: similar to supervised learning, but the system tries to predict new outputs based on the training inputs, training outputs, and new inputs.
6. Learning to learn: the algorithm learns its inductive bias from previous experience [14].
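
As referenced above, the following minimal sketch contrasts the first two types using scikit-learn (assumed to be installed) on a built-in toy data set rather than on any data from this thesis:

```python
# Minimal sketch contrasting supervised and unsupervised learning with
# scikit-learn on a built-in toy dataset (not data from this study).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the algorithm is given inputs X together with desired outputs y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised training accuracy:", clf.score(X, y))

# Unsupervised: the algorithm sees only X and must find structure by itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [(km.labels_ == k).sum() for k in range(3)])
```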

Few Machine Learning Algorithms mentioned in our study:

Neural networks are modelled on the human brain's way of processing information, which can give more accurate results than other regression models. This model is used for recognizing patterns. The patterns they recognize are numerical vectors, into which all real-world data is translated [15].

Recurrent Neural Network (RNN) is a type of neural network that uses the output of the previous step
as input to the current step[16].

Long Short-Term Memory networks (“LSTM”) are a kind of recurrent neural network used for processing long sequences of data. This model is widely used in handwriting recognition and speech recognition [16].

Convolutional Neural Networks (“CNN”) are specialized neural networks mainly used for image recognition and classification. This model automatically detects the important features without any supervision. For example, given many pictures of cars and bikes, it learns the distinctive features of each class by itself [17].
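
For illustration, below is a minimal, hypothetical sketch of a small CNN for a two-class image problem such as cars versus bikes, assuming TensorFlow/Keras is available and 32x32 RGB inputs; it is not an architecture taken from any of the cited works:

```python
# Minimal hypothetical CNN sketch (TensorFlow/Keras assumed installed); the
# input size, layer sizes, and class setup are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                # 32x32 RGB images (assumed)
    layers.Conv2D(16, (3, 3), activation="relu"),   # learn local image features
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),          # probability of one class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```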

Regression is a form of predictive modelling that captures the relation between dependent and independent variables. This model is generally used for forecasting and time series analysis [18].

Logistic regression is a classification algorithm used to predict the probability of occurrence of an event by fitting data to a logistic function. This model is used when the dependent variable is categorical and is widely applied in the medical field and the social sciences [19].

Support Vector Machines (SVM) are supervised learning models used for classification and regression problems. They solve linear and non-linear problems and work well for many practical problems. SVMs perform particularly well when classifying small data sets [20].

Random forest is a simple and widely used algorithm for classification and regression. A random forest is a combination of many randomized decision tree predictors, merged together to obtain a more stable and accurate prediction. The more trees in the forest, the more robust and accurate the prediction [21].
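
As a small illustration of how such classifiers are used in practice, the sketch below (scikit-learn assumed installed; the built-in data set is only an example, not data from this study) fits a random forest and an SVM on the same data and compares their test accuracy:

```python
# Minimal sketch comparing a Random Forest and an SVM classifier on a small
# built-in dataset (scikit-learn assumed installed; not data from this study).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

for name, model in [
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("SVM", SVC(kernel="rbf", gamma="scale")),
]:
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```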

Below are other major types of machine learning algorithms:


• Concept learning (CL)
• Decision trees (DT)
• Bayesian belief networks (BBN)
• Genetic algorithms (GA) and genetic programming (GP)
• Instance-based learning (IBL)
• Inductive logic programming (ILP)
• Analytical learning (AL)[22].

Figure 2-1 Types of Machine Learning Algorithms

2.1.3 Data quality dimensions & Data quality attributes:


Data quality dimensions are concepts that define the quality of data. Quality dimensions describe various aspects of data quality such as relevancy, accessibility, and understandability [23]. As per Wang and Strong, a data quality dimension is “a set of data quality attributes that represent a single aspect or construct of data quality” [23]. This study mainly focuses on the data quality attributes that are useful in selecting appropriate data to train machine learning systems. A “data quality attribute” can be used to measure the performance of the data in its defined way [24].
2.1.4 Data Quality Model:
Data quality models enhance traditional models; the reason for this is to use the traditional model as a base to represent a data quality dimension and any data dimension related to it. A data quality model allows us to analyse a set of requirements for data quality and helps represent them in terms of a conceptual schema; the model also makes all data quality dimensions accessible and queryable through a logical schema. In these models we usually find process models that support the design and analysis of quality improvement actions. Using such a data quality model, we can trace the selected data from its source and see all the changes the data has gone through until it reaches its final, present stage of usage. This can help in detecting the root cause of poor data quality and in designing actions to improve it [25].

This model provides us with a method that describes a specific data element; it clarifies a set of information and the state in which the data is expected to exist, and it also gives us the additional attributes that are needed to precisely identify this model [26].
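
Purely as an illustration of the idea, and not as the model proposed in this thesis, a data quality model can be expressed programmatically as a set of named attributes, each paired with a function that measures the attribute on a data set (pandas assumed installed; the attribute choices and data below are hypothetical):

```python
# Illustrative sketch only (not the model proposed in this thesis): a data
# quality model expressed as named attributes, each paired with a function
# that measures that attribute on a pandas DataFrame.
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Fraction of cells that are not missing."""
    return 1.0 - df.isna().mean().mean()

def uniqueness(df: pd.DataFrame) -> float:
    """Fraction of rows that are not duplicates."""
    return 1.0 - df.duplicated().mean()

quality_model = {"completeness": completeness, "uniqueness": uniqueness}

data = pd.DataFrame({"age": [25, None, 40, 25], "label": ["a", "b", "c", "a"]})
report = {name: round(check(data), 2) for name, check in quality_model.items()}
print(report)  # e.g. {'completeness': 0.88, 'uniqueness': 0.75}
```

In such a sketch, each attribute could also be given a threshold, so that a data set is flagged for improvement when a measured value falls below it.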
2.1.5 Data quality:
Quality is defined as “fitness for use”, as stated by Juran [27]. We can define “data quality” as data that is best suited for data consumers to use [28]. The need for high-quality data has been stated by both researchers and machine learning practitioners for quite some time [28]. Nowadays, data quality measures for a specific problem are developed on an ad hoc basis, and the evaluation of data quality mainly depends on the needs and expectations of stakeholders [29]. Poor data quality has negative social and economic impacts, costing billions of dollars. The quality literature states that quality assessment depends on the consumers who decide to use the products; similarly, data quality assessment depends on the data consumers who use the data. This matters because consumers nowadays have many choices and authority over their computing environment and the data they want to use in it [30]. Poor data quality hurts employee morale, multiplies trust issues in the organization, and makes it more difficult to keep projects in order [31]. At the operational level, the effects of poor data quality are decreased customer satisfaction, increased cost, and decreased employee job satisfaction; at the tactical level, re-engineering becomes difficult [31]. Root causes in products and poor data are the main contributors to an “information ecology” that is unsuitable for the information age [32]. In recent years, references to poor data quality and its impact have increasingly appeared in the literature, in social media, and in many well-known publications [24], [33].

2.2 Related Work:


While conducting our research on this topic, we noticed that there is very limited research on data quality for machine learning. As per our research, we could not find a relevant study that focuses mainly on a data quality model for machine learning, which makes our study all the more needed in industry at the moment. Starting our research, we found very few articles about data quality attributes for machine learning algorithms, making it hard for us to find more data quality attributes; this is one of the reasons why we include the already existing data quality model ISO 25012 in our study [34]. We also extract attributes from published reports that are closely related to our study. Several data quality models exist [35]–[43], and from these we selected the ISO 25012 data quality model for our study. The reason for selecting ISO 25012 is that ISO is the International Organization for Standardization, which develops and publishes international standards; the people who develop standards for ISO are experts in their sector and come from all over the world [44]. From our research, we found no strong literature evidence that existing data quality models are suitable for machine learning systems; with this study we would like to check whether the existing data quality models and their attributes are suitable for machine learning systems.

Data quality: We found a few related articles on data quality and its importance, which are key aspects of our study.

Article 1: In this article [45], the author states that data quality is defined and viewed in many ways; the most widely accepted definition of data quality is stated in terms of “fitness for purpose”. The author says that data quality is often viewed in terms of accuracy or the absence of noise, but there are further perspectives such as completeness and timeliness. The article mainly focuses on the accuracy view, i.e. noise, where noise can be defined as incorrect data that occurs due to intentional or unintentional errors at capture time. The author states that in that study, more than 73% of respondents claimed that data quality is important, that manual data collection should be avoided to improve data quality, and that “external measures” can be implemented for triangulation purposes.
Article 2: In this article, the author defines data quality as “fitness for use”, meaning that data of good quality in one case may not be good-quality data in another case. The author states that ensuring data quality is widely recognized as an important activity, yet in practice few people consider data quality a top priority. According to the author, data quality is best described via multiple attributes or dimensions such as accuracy, completeness, consistency, and timeliness. The article also details the importance of data quality and discusses various approaches and techniques that can improve the data [46].

Pre-Processing: Data pre-processing is an important process in which data scientists select relevant data and resolve several problems such as noisy data and redundant data [47]. This is where the data quality model comes into place: it lets data scientists monitor and choose the right data for their system. The data quality model consists of data quality characteristics that are useful for data scientists to check the data against and to remove irrelevant data that is not needed. The following are a few of the articles related to our study that help in understanding what pre-processing is and how data quality attributes are useful in machine learning.

Article 1: In this article, the author mentions the reasons why data pre-processing is done. The author says that real-world data is not clean: it is often incomplete, e.g. lacking attribute values, lacking attributes of interest, or containing only aggregate data. The author also mentions that data is unclean due to noisy data containing errors and inconsistent data containing discrepancies in codes or names. If the quality of the data decreases, then the quality of the end results also decreases; duplicate or missing data can lead to incorrect analysis. These are the reasons for conducting data pre-processing. As per the author, a few examples of well-accepted multi-dimensional views for measuring data quality are accuracy, completeness, consistency, and timeliness [48].

Article 2: In this article, the author states that data selected from the real world is not clean; it is incomplete and noisy, giving wrong and duplicate values. This kind of poor data quality results in poor-quality mining results. The author even states the following: “In order to get quality data, the data in the database need to be checked for accuracy, completeness, consistency, timeliness, believability, interpretability and accessibility. Incomplete data is an unavoidable problem in dealing with most of the real-world data sources” [49].
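
A minimal sketch of such pre-processing steps with pandas (assumed installed) is shown below; the column names and values are hypothetical and only illustrate removing duplicates, unifying inconsistent codes, dropping implausible values, and imputing missing ones:

```python
# Minimal pre-processing sketch with pandas (assumed installed); the column
# names and data are hypothetical, not taken from the thesis.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 29, 310],           # missing and implausible values
    "country": ["SE", "se", "se", "DE", "SE"],  # inconsistent coding
})

clean = raw.drop_duplicates(subset="customer_id").copy()   # remove redundant rows
clean["country"] = clean["country"].str.upper()            # unify inconsistent codes
clean = clean[clean["age"].between(0, 120) | clean["age"].isna()].copy()  # drop noise
clean["age"] = clean["age"].fillna(clean["age"].median())  # impute missing values
print(clean)
```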

Examples of data quality attributes and how the attributes are used in machine learning according
to the above articles:

Example 1: Timeliness is one of the data quality attributes that affects data quality. Consider a situation where we manage a distribution sales bonus every month for all the primary sales representatives in an electronics company. Many sales representatives fail to submit their sales records on time by the end of the month, and there are many corrections and adjustments to the records after the month's end. For that month, the data stored in the database is incomplete: the month-end data is not updated in a timely way, which affects the data quality negatively. If, instead, we receive the data on time, then the data is correct and its quality is considered good [48].

Example 2: Interpretability is a data quality attribute that indicates how easy the data is to understand. Assume a database contains accounting codes, and the sales department has no understanding of how to interpret those codes; the data then has low interpretability [48].

Example 3: Believability indicates the users' trust in the data. Consider a database that had several errors at some point in time; the errors have since been corrected, but because they caused several problems in the past, the users no longer trust the data [48].

Other examples: Accuracy is a data quality attribute that can be used to check that facts are recorded correctly, for example about the disposition of a crime case. Completeness can be used to ensure that all relevant information is recorded, consistency can help in unifying the format for recording related information, and timeliness can be used to record information promptly after the disposition [46].
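
For illustration, the sketch below (pandas assumed installed; the sales records are hypothetical, echoing the timeliness example above) measures completeness and timeliness on a small table:

```python
# Minimal sketch of measuring some of the attributes discussed above on a
# hypothetical sales-records table (pandas assumed installed).
import pandas as pd

sales = pd.DataFrame({
    "rep_id": [101, 102, 103, 104],
    "amount": [2500.0, None, 1800.0, 950.0],
    "reported_on": pd.to_datetime(["2019-01-31", "2019-02-07",
                                   "2019-01-30", "2019-01-31"]),
})
deadline = pd.Timestamp("2019-01-31")

completeness = 1.0 - sales["amount"].isna().mean()        # non-missing fraction
timeliness = (sales["reported_on"] <= deadline).mean()    # reported by deadline
print(f"completeness: {completeness:.2f}, timeliness: {timeliness:.2f}")
```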

Knowledge gap:

In the search for articles in various databases on data quality models for machine learning, we found that there is a lack of research on this topic; research on a data quality model for machine learning has not been consolidated. There are methods such as instance selection, discretization, and feature selection to improve the quality of data, but these methods are more expensive for the complex data involved in machine learning [6]. For the data used to train machine learning systems, there is a lack of strong literature that can be used to improve data quality. Existing data quality models are available, but there is no strong literature evidence stating that any of them is suitable for machine learning. So, in this research, we plan to develop a data quality model for machine learning that can be used to improve the quality of data for machine learning systems. Using such a data quality model can also bring computational cost benefits compared to other methods and can help in setting the right pricing for data, which attracts customers. As we found very little research on data quality models for machine learning, we are working on this research to find relevant data quality characteristics suitable for machine learning, based on which we develop a data quality model for machine learning from the interviewees' perspective.

Success criteria:

The data quality model that we are about to develop is used to improve the quality of data. This model can be used to filter the right data that a data scientist needs for the machine learning systems they are about to train; it helps data scientists predict the quality of data using data quality attributes and, based on that prediction, improve the quality of the data used to train machine learning systems. The data quality model we develop will assist data scientists in evaluating the quality of data for machine learning and help them reason about existing data sets.

3 RESEARCH METHOD
This section describes the research questions and the methods chosen to address them, namely the literature review and the in-depth interview study. The research questions are defined in Section 3.1, the literature review method addressing Research Question 1 is described in Section 3.2, and the in-depth interview study method addressing Research Question 2 is described in Section 3.3.

3.1 Research questions


With the aim and objectives in mind, we formulated the following research questions:

RQ.1: What is the state of the art of data quality models for machine learning?
Motivation: To gather data on data quality models: which data quality models have been identified so far, which data quality attributes they contain, and how these affect machine learning, which might be useful for the development of a data quality model for machine learning.

RQ.1.1: What are the data quality attributes used to characterize the quality of data used in machine learning systems?
Motivation: To conduct interviews in industry, we need to gather literature on data quality attributes that affect machine learning and on data quality models, based on which we formulate the interview questionnaire. In order to conduct the interviews and develop a data quality model, we need good knowledge of the data quality attributes and data quality models available so far.

RQ.2: How do data scientists characterize data quality attributes for machine learning?

RQ.2.1: How are the quality attributes of a data quality model applicable to data for machine learning?
Motivation (RQ.2 and RQ.2.1): These questions were formulated to see on what basis data scientists identify the selected data quality attributes as important or not important for machine learning, for what specific reasons they place attributes on that list, and to learn their opinion, based on their industrial experience, on how the selected data quality attributes affect, or are applicable to, data for machine learning; also to find any additional data quality attributes missing from our research, so that based on the results of these questions we could develop our data quality model.

Table 3-1: Research Questions

RQ1 and RQ1.1 are answered using the literature review method, based on which we formulate our interview questionnaire and share it with the interviewees. RQ2 and RQ2.1 are answered using the in-depth interview study, after we extract the data from the interviews.

3.2 Literature Review


This section describes the procedure followed in conducting the literature review. In this search process we follow a reference article [50] that we felt is closely related to our study, and we carry out our literature study accordingly.

Motivation:
For the topic selected in the research problem, a literature review presents the general and specialized literature relevant to the problem and discusses it critically. "A literature review is an objective, thorough summary and critical analysis of the relevant available research and non-research literature on the topic being studied" [51]. Through a literature review, new ideas are generated and duplication of results is avoided [52]. The literature review is conducted following the guidelines provided by J. Rowley and F. Slack. This method helps in obtaining base knowledge of the research domain by providing analysis, elucidation, and validation of results. The results from this method provide input to the in-depth interview study. The literature review aims to collect and assess the base of current literature, which helps enhance the trust of the readers [53].

The reason for not choosing a systematic literature review (SLR) is that an SLR is conducted in a rigid, standardized way, is a time-consuming process, and supports evidence-based practice by summarising studies based on high-quality evidence; it is less suited to collecting the primary data needed for our research, as in this study we also focus on the authors' theories in the articles. Due to the confined availability of literature and to stay within the time schedule of our research plan, we selected a literature review instead [54], [55].

Systematic mapping can be conducted to get an overview of a research topic, but investigating the evidence in the literature is done using systematic reviews, and such reviews do not include primary data. So, this method was avoided [56].

3.2.1 Search Method:


Keyword-based search was chosen as the search method for our research over snowballing, as we could not identify any fundamental papers from which to spawn the search: there is very limited research on this topic, and we could not find any related articles on data quality models for machine learning. One might ask about backward snowballing given the limited number of papers, but backward snowballing is also not suitable, as we could not find articles suited to our study; most were rejected by the exclusion criteria [57]. So we chose keyword-based search in order to find relevant data on data quality attributes, data quality models, and attributes of data quality models in different fields. A keyword-based search looks for matching documents that contain one or more words specified by the user [58].
While using a set of keywords, this search focuses on finding structural information among objects in a database [58]. Adding "AND" and "OR" to the search strings can help refine the search; in our case, keywords like "Data quality model" AND "Attributes" gave us better results than a regular search [59]. With keyword-based searches, it is hard to distinguish between words that are spelt the same way but mean something different, which often yields results that are completely irrelevant to the search [59].

3.2.2 Search Process: -


Initially, we searched based on keywords and gathered 342 articles from different databases. After title and abstract screening, we selected 27 articles as relevant and discarded 315 papers as irrelevant. These 27 articles were taken mainly from four databases. We performed in-depth screening on the 27 articles and obtained 10 relevant articles; 17 articles were removed based on implicit and explicit criteria.

Figure 3-1 Data Base Search

In this keyword-based search process we used the keywords listed below to obtain results on data quality models and data quality attributes for machine learning.

S.No. 1
Keywords: Data quality model, data quality attributes, machine learning
Search string: data AND ("quality model" OR "quality dimensions") AND "machine learning"
Databases: arXiv, ACM Digital Library, IEEE Xplore

S.No. 2
Keywords: Data quality model, data quality attributes, machine learning
Search string: (TITLE-ABS-KEY ("quality model" OR "quality dimension") AND ALL ("machine learning") AND TITLE-ABS-KEY (data))
Databases: Scopus, IEEE Xplore

Table 3-1 Data Base Search String

3.2.3 Selection of Database:


The selection of databases is based on the relevance of the articles found in our previous experience: we found that Scopus, IEEE Xplore, ACM DL, and arXiv gave the most relevant articles containing data on machine learning. So, we limit our search to these particular databases, to which we have access as students.

3.2.4 Database Search: -

In this literature review we have made use of below-mentioned databases to gather necessary articles to
further proceed with our research and after searching in Scopus, IEEE Xplore, ACM Library, arXiv
databases we have come across thousands of articles and to make the search easier and to reduce the
search results we have modified the keywords for the search several times. Our view on each database
and the results that we have found are as follows:

arXiv is more algorithm-oriented than engineering-oriented; all the articles we found there relate to machine learning data, but very few contain data on data quality for machine learning.

In IEEE Xplore the keyword search returned very few results related to our research.

Using Scopus, we found results related to our study; most of the articles were either on data quality models in other fields or core machine learning articles. There were also a few articles related to data quality attributes for machine learning.

The ACM Digital Library likewise did not show results directly related to our study; the results were based on machine learning, but none of the articles contained data on data quality models or attributes for machine learning.
The following table presents the number of articles found in the different databases after performing the search process.

Database Years searched Total Articles found

Scopus 1980 - 2018 195

IEEE Xplore 1980 - 2018 10

ACM Library 1980 - 2018 12

arXiv 1980 - 2018 125


Table 3-2: Results of Search String

3.2.5 Title Screening:


In title screening, we searched for keywords related to a data quality model for machine learning and to data quality models in other fields related to machine learning, so that we could find the attributes to use when interviewing data scientists and determine which quality attributes are most relevant to machine learning.

3.2.6 Abstract Screening:


While conducting abstract screening we came across many articles in other languages, articles to which we have no access, and duplicates. After excluding those, we conducted abstract screening on the remaining articles and mainly looked for articles in which an experiment, survey or literature study had been conducted on data quality models, data quality attributes, characteristics or dimensions for machine learning. We also selected articles that describe quality dimensions, characteristics or attributes in sufficient depth for further in-depth screening.

3.2.7 Inclusion/exclusion criteria:
Based on the objectives of our research we have formulated these inclusion and exclusion criteria.

Inclusion:
• Articles that present attributes, characteristics or dimensions related to a data quality model for machine learning.
• Articles that contain information on data quality for machine learning.
• Articles on data quality models in which an experiment, survey or literature study was conducted on data quality attributes, dimensions or characteristics for machine learning.
• Published reports that describe any data quality attribute or data quality model for machine learning.

Exclusion:
• Articles that do not specify any data quality attributes, dimensions or characteristics for machine learning.
• Articles that do not address data quality for machine learning.
• Articles that do not report any experiment, survey or literature study on data quality attributes, characteristics or dimensions related to machine learning.
• Articles in languages other than English.
• Articles that do not define the data quality attributes they mention.
• Articles to which we do not have access.

3.2.8 Quality Assessment:


The effectiveness of the literature needs to be evaluated, and this is done by performing a quality assessment. The quality assessment was performed while extracting data from the articles. The appraisal criteria were taken from the Centre for Reviews and Dissemination (CRD) 2009 guidance for reviews, adjusted to this study, and the criteria for the quality assessment were formulated as per the article [50] and are listed below. From the search string we selected 10 articles, and through additional search strategies we selected 3 publications; the quality assessment was conducted on all 10 articles and 3 publications. Depending on the quality scores obtained, articles and publications would be eliminated or excluded; based on our scores, we did not eliminate any of the articles or publications (10 articles and 3 publications) from our research. As this research is carried out by two authors, the quality assessment can be made more precise and rechecked. The first author conducted a quality assessment of the selected 10 articles and 3 publications, and the second author conducted the same assessment individually. After accumulating the results, both authors discussed the quality assessment and reached a conclusion based on the pros and cons. As there were no major differences between the authors' results, the second author's quality assessment was used. The table below presents the quality assessment criteria.

C1. Are the research questions clearly stated with respect to method, outcome and design?
    Y: The research questions are defined clearly.
    P: The research method is not defined clearly, but the research questions are.
    N: The research questions are not defined clearly.

C2. Is the search strategy suitable and satisfactory?
    Y: The authors searched at least 3 libraries with a focus on the topic and additionally searched publications.
    N: The authors searched 3 or more libraries but did not search publications.

C3. Are the inclusion and exclusion criteria suitable and satisfactory?
    Y: The inclusion/exclusion criteria are stated clearly.
    P: The inclusion/exclusion criteria are not stated clearly.
    N: The inclusion/exclusion criteria are not stated.

C4. Are the criteria used to assess the quality of the selected articles appropriate?
    Y: The authors selected appropriate criteria and also provided a quality assessment score.
    P: The authors considered appropriate criteria but did not provide a quality assessment score.
    N: Neither the criteria used to assess quality nor the quality assessment scores of the selected articles are mentioned.

C5. Were appropriate preventive measures taken to reduce bias and errors?
    Y: Preventive measures were taken to overcome bias, with each author reviewing the articles repeatedly and the authors later discussing them together.
    P: Particular steps were followed for the review.
    N: No preventive measures were considered to reduce bias.

C6. Were the details extracted from the selected articles satisfactory?
    Y: Each article's information is presented in detail, extracting all the details and discussions present in the relevant paper.
    P: Only a summary of the selected article is mentioned.
    N: The information is not extracted and can only be retrieved from the articles themselves.

C7. Do the concluded results precisely reflect the evidence?
    Y: The conclusions are backed up by the results.
    N: The conclusions are not backed up by the results.

Table 3-3: Criteria for Quality assessment

The criteria in the table above are formulated based on the Centre for Reviews and Dissemination (2009) guidance. The symbols in the table are scored as Y (Yes) = 1, P (Partially) = 0.5, N (No) = 0. These values are used to derive a quality score for each article, which indicates how well the extracted data qualifies.
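As a minimal illustration of how these values could be turned into a score, the Python sketch below sums the Y/P/N ratings over criteria C1-C7 for one article; the example ratings are invented, and summing the values into a total out of 7 is our assumption about how the quality score is derived.

# Minimal sketch: aggregating Y/P/N ratings on criteria C1-C7 into a quality score.
# The ratings below are hypothetical; the summation is an assumed aggregation rule.

SCORE = {"Y": 1.0, "P": 0.5, "N": 0.0}

def quality_score(ratings):
    # Sum the numeric values of the ratings given to one article.
    return sum(SCORE[r] for r in ratings.values())

example_article = {"C1": "Y", "C2": "Y", "C3": "P", "C4": "Y",
                   "C5": "P", "C6": "Y", "C7": "N"}

print(quality_score(example_article))  # 5.0 out of a maximum of 7.0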

3.3 In-Depth Interview Method:
This section describes the procedure, results and data analysis of the in-depth interviews.
For this method we followed the process from a reference article [60] that we felt is closest to our study for in-depth interviews.

3.3.1 Interview Method:


We selected the in-depth interview as our research method because we want to conduct open-ended interviews with our target audience (data scientists who have experience working with machine learning systems), to learn about the characteristics of data quality attributes and to find out whether the selected attributes fit a data quality model based on their experience. As mentioned in [61], we want to conduct interviews with open-ended questions in a semi-structured format. An in-depth interview helps the interviewer learn in depth about the respondent's feelings, knowledge and experience of the subject, and in a semi-structured interview we can change the questions between interviewees, depending on the previous interviewee's results and suggestions; by doing this we can obtain more accurate and up-to-date results [61]. We conduct the in-depth interviews one after the other, based on the availability of time. For example, once a data scientist has been interviewed, and before interviewing the next data scientist, we analyse the data and, if necessary, reframe the questions based on the previous results. This iteration is carried out until we obtain the required results and no new characteristics emerge [62].

We planned to interview multiple data scientists who have experience working with machine learning: industry practitioners from the Bonseyes AI marketplace and data scientists with industrial experience from other organizations. We selected data scientists who have previously worked with machine learning systems and have experience with data for machine learning. We prepared the questionnaire from the literature review of existing quality models and data quality attributes, and we expected information about which existing data quality attributes are suitable for machine learning data. After obtaining the required data, we used conventional content analysis to analyse the gathered data and address the research gap. Conventional content analysis aims to describe a phenomenon and is used with such a study design [63]. As the data is collected through interviews and the results are based on the interviewees' perspectives, we use open-ended questions, which give us more flexibility to obtain the results we want.

The reasons for not choosing other methods such as survey, Experiment and case study are:

• A survey is conducted on a broad number of individuals to find their characteristics and is typically administered to people from different organisations with selected qualifications. In our case, we are interviewing data scientists from Bonseyes and other selected organisations, so this method is not suitable for our study [64].
• An experiment, on the other hand, is an investigation of a hypothesis that is tested by measuring the effect of one or more dependent or independent variables [64]. Our first priority is to develop a data quality model, as no such model is available yet. We must therefore create the data quality model first; only then could an experiment test it in industry. This is why we did not select the experiment as our research methodology. If we had enough time after developing the data quality model, and if a company permitted it, we would test the model by applying the company's data to it. In reality we do not have the time and resources, as the thesis project must meet its deadlines, so we recommend testing our data quality model in industry as future work.
• We did not consider a case study as our research method because a case study would mainly concentrate on the tools used by the data scientists, on what the data scientists do in a company and on their contribution to that company; in this project a case study would not concentrate on the data quality model and the characteristics of a quality model for machine learning data [64]. A case study mainly describes in detail how data scientists work and which tools and resources they use, but this is not our ambition. Our main aim is to build a quality model for data for machine learning, so we cannot perform a case study.

3.3.2 Interview Questionnaire and Trello board:


Before starting each interview, we shared our interview questionnaire with the interviewee so that they could get a basic idea of our study; the remaining explanation was given by us in the phone call.

The interview questionnaire is shown below:

Figure 3-2 Interview Questionnaire - I

Figure 3-3 Interview Questionnaire - II

In the interview process we used a Trello board, on which each interviewee sorted all the data quality attributes into either the important or the not important list. This helped us record the interview data with ease.

Below is the Trello board that we shared with each interviewee:

Figure 3-4 Trello Board Questionnaire

3.3.3 Information of interviewees: -


The table below gives the details of all 15 interviewees who participated in our in-depth industry study and shared their knowledge to help develop a data quality model for machine learning:

Interview Number | Interviewee | Experience | Algorithms worked with
1  | Interviewee-1  | 5 years  | Not specified
2  | Interviewee-2  | 7 years  | A neural network, regression and deep learning
3  | Interviewee-3  | 1 year   | Gbn classifiers for direct commercials, sentiment text analysis
4  | Interviewee-4  | 6 years  | Tree-based models (e.g. gradient boosting), equational models such as neural networks (convolutional, recurrent), unsupervised techniques (principal component analysis, robust principal component analysis, autoencoders, support vector data descriptors), support vector machines
5  | Interviewee-5  | 3 years  | Regression algorithms, logistic algorithms and nav
6  | Interviewee-6  | 2 years  | Humble methods like Random Forest, gradient boosting methods
7  | Interviewee-7  | 1 year   | Deep learning, neural networks, recommendation algorithms
8  | Interviewee-8  | 1 year   | Computer vision, image classification, object detection, Random Forest, gradient boosting methods, logistic regression
9  | Interviewee-9  | 4 years  | Classification and regression, various neural network models, supervised and unsupervised models
10 | Interviewee-10 | 4 years  | Statistical linear models, neural networks in various designs, regression trees, language models
11 | Interviewee-11 | 4 years  | Neural networks like LSTM networks and other ordinary neural networks, logistic regression and simple decision trees
12 | Interviewee-12 | 6 years  | Linear regression, neural networks
13 | Interviewee-13 | 10 years | Deep learning, efficient tree models
14 | Interviewee-14 | 5 years  | Algorithms that could identify people with gambling addiction
15 | Interviewee-15 | 18 years | Deep learning algorithms, gradient boosting and Random Forests

Table 3-4: Information of Interviewees

3.3.4 Analysis Protocol for In-depth Interview Method:


Qualitative content analysis was chosen as the data analysis method for this study. The rationale for choosing it is that it focuses on analysing text data, which includes narrations, open-ended survey questions, interviews, observations, books and manuals [63]. As our study includes interview data and data from the literature, content analysis is the best-fitting analysis technique. Among the content analysis techniques, conventional content analysis was chosen because it fits best when the existing theory or literature on a topic is limited and it mainly concentrates on describing a phenomenon [63].

Qualitative research has merits such as the richness of the collected data, but for research purposes this data needs to be interpreted and coded in a definitive way; qualitative content analysis is an example of such an approach [65]. Qualitative content analysis is a "research method for the subjective interpretation of the content of text data through the systematic classification process of coding and identifying themes or patterns" [63]. A merit of conventional content analysis is that the information is extracted from the participants of the study without imposing any predetermined theoretical views [65]. According to Hsieh and Shannon, conventional content analysis is appropriate when the existing theory or literature on the research topic is limited [63].

We conduct our data analysis following this research article [63]; the steps are as follows:

Figure 3-5: Conventional Content Analysis

1) Selection of the sample to be analysed: -

The sample data we analyse in this step consists of the interview transcripts and the field notes. The interview data gathered from the interviews was transcribed into interview transcripts, and the field notes are the qualitative notes we wrote while conducting the interviews. We repeatedly read, analysed and sought to understand the text data in these transcripts and field notes, and wrote down the complete picture of them as complete scripts.

2) Defining the categories: -

In this phase, the categories are formed based on the number of occurrences of each data quality attribute across all 15 interviews. After considering all the interview transcripts, field notes and Trello boards, we categorized the data quality attributes based on how often each attribute appeared in the important list across the 15 interviews. For example, the data quality attribute Size occurred 9 times in the important lists across all 15 interview results. There are 15 categories in total, because we conducted 15 interviews, and the Size attribute falls under category number 9 as it occurred 9 times.

S.No  Attribute     Occurrence (category)
1)    Size          9
2)    Diversity     7
3)    Portability   4
Table 3-5: Categorizing Data Quality Attributes

All the data quality attributes were categorized in this manner.
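The categorization step amounts to counting, for each attribute, how many of the 15 important lists it appears in. The Python sketch below shows this counting with collections.Counter; the three important lists are invented stand-ins for illustration and do not reproduce the actual interview responses.

# Minimal sketch of the categorization step: count how often each attribute
# appears in the interviewees' "important" lists (hypothetical example lists).
from collections import Counter

important_lists = [
    ["Accuracy", "Size", "Relevancy"],         # interviewee 1 (hypothetical)
    ["Accuracy", "Diversity", "Size"],         # interviewee 2 (hypothetical)
    ["Accuracy", "Relevancy", "Portability"],  # interviewee 3 (hypothetical)
]

occurrence = Counter(attr for lst in important_lists for attr in lst)

# The count for each attribute is its category number (1 to 15 in this study).
for attribute, count in occurrence.most_common():
    print(attribute, count)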

3) Outline the coding process: -

In the next phase, all the data quality attributes gathered from the important lists are taken, after repeatedly analysing every single word of the transcripts so that no minute detail is lost, for quality reasons. We label these data quality attributes with codes; once codes have been allocated to all the attributes, they are associated with themes. Assigning themes to the coded data quality attributes gives us the acceptance or rejection criterion for each attribute, and the accepted data quality attributes can be used in building the data quality model.
For example, after analysing the transcripts we code Satisfaction as A7 and allocate it the theme "Accepted by 50% or more interviewees", because its occurrence is 9, meaning that 9 interviewees accepted this attribute as important and 6 interviewees selected it as not important. Consequently, this attribute is selected for the data quality model.

4) Implementing the coding process: -

The themes are divided into two types: 1) Accepted by 50% or more interviewees, 2) Rejected by 50% or more interviewees. The reason for selecting these two themes is that we could not include in our data quality model every data quality attribute that some interviewee selected as important. Each data scientist has different priorities when selecting data quality attributes and tends to select attributes based on their previous projects. For this reason, all 26 data quality attributes gathered from the literature review were given to all interviewees, and we noticed that every one of the 26 attributes was selected as important by at least 4 interviewees. For quality and accuracy, we included only the attributes selected as important in more than 50% of the interviews. The themes we selected are therefore based on the occurrence of the attributes across all interviews, as shown below:

Codes Themes

A7: Satisfaction Accepted by 50% or more interviewees

A16: Precision Rejected by 50% or more interviewees

A10: Context Coverage Accepted by 50% or more interviewees


Table 3-6: Applying codes to themes

5) Analysing the results of the coding process: -

In this phase, all the above steps are applied, one by one, to all the data quality attributes selected in the important lists. First, the data quality attributes are categorised by occurrence, then each attribute is labelled with a code and allocated a theme, and finally everything is reordered by number of occurrences so that readers can understand it better. For example:

S.No  Attribute          Code  Occurrence (category)  Theme
1)    Satisfaction       A7    9                      Accepted by 50% or more interviewees
2)    Precision          A16   6                      Rejected by 50% or more interviewees
3)    Context Coverage   A10   8                      Accepted by 50% or more interviewees
Table 3-7: Implementing the coding process

The same procedure was carried out for all the data quality attributes.
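The acceptance rule behind the themes can be sketched as follows: an attribute is accepted when it was placed in the important list in more than half of the 15 interviews, which in practice means an occurrence of at least 8. The occurrence values below follow the examples in Tables 3-6 and 3-7; treating the 50% boundary as a strict majority of the 15 interviews is our reading of the rule.

# Minimal sketch of the theme assignment used in the coding process.
# Occurrence values follow the worked examples above (Satisfaction, Precision,
# Context Coverage); the strict-majority threshold is our interpretation.

TOTAL_INTERVIEWS = 15

occurrences = {
    "A7: Satisfaction": 9,
    "A16: Precision": 6,
    "A10: Context Coverage": 8,
}

def theme(occurrence):
    if occurrence > TOTAL_INTERVIEWS / 2:  # selected as important by a majority
        return "Accepted by 50% or more interviewees"
    return "Rejected by 50% or more interviewees"

for code, count in occurrences.items():
    print(code, "->", theme(count))
# A7 and A10 end up accepted (candidates for the model); A16 is rejected.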

4 RESULTS AND ANALYSIS

4.1 Search results for the literature review:


In this literature review we selected the 10 articles below, which are relevant to our research. These articles contain the data quality attributes that we use in our further research, and they allow us to understand the importance and effect of these attributes on machine learning.
The table below lists the articles selected while performing the literature review.
Selected articles:

S. No Selected Articles

1 Sivogolovko, Elena. "The Influence of Data Quality on Clustering Outcomes." Databases


and Information Systems VII: Selected Papers from the Tenth International Baltic
Conference, DB & IS 2012. Vol. 249. IOS Press, 2013.

2 Kiefer, Cornelia. "Assessing the Quality of Unstructured Data: An Initial Overview."


LWDA. 2016.

3 Sivogolovko, Elena, and Boris Novikov. "Validating cluster structures in data mining
tasks." Proceedings of the 2012 Joint EDBT/ICDT Workshops. ACM, 2012.

4 Blake, Roger, and Paul Mangiameli. "The effects and interactions of data quality and
problem complexity on classification." Journal of Data and Information Quality (JDIQ)2.2
(2011): 8.

5 Alshareet, Osama, et al. "Incorporation of ISO 25010 with machine learning to develop a
novel quality in use prediction system (QiUPS)." International Journal of System
Assurance Engineering and Management: 1-10.

6 Daniel, Florian, et al. "Quality Control in Crowdsourcing: A Survey of Quality Attributes,


Assessment Techniques, and Assurance Actions." ACM Computing Surveys (CSUR) 51.1
(2018): 7.

7 Gong, Zhiqiang, Ping Zhong, and Weidong Hu. "Diversity in Machine Learning." arXiv
preprint arXiv:1807.01477 (2018).

8 Henderson, Peter, et al. "Deep reinforcement learning that matters." Thirty-Second AAAI
Conference on Artificial Intelligence. 2018.

9 Huang, Yi-Min, and Shu-Xin Du. "Weighted support vector machine for classification
with uneven training class sizes." 2005 International Conference on Machine Learning
and Cybernetics. Vol. 7. IEEE, 2005.

10 Papernot, Nicolas, et al. "Towards the science of security and privacy in machine
learning." arXiv preprint arXiv:1611.03814 (2016).
Table 4-1: Selected Articles

While making progress with our database search we kept looking for more data quality attributes for our research, and we went through various published reports on data quality attributes. Many of these reports concern attributes that match the data quality attributes we had already selected, but we also found a few reports that resembled one another and described one common problem, so we planned to include that data quality attribute, called "Fairness", in our study. Below are the publications we came across for this data quality attribute; we allocated the report numbers R1-R3 to them.

R1 https://www.bostonmagazine.com/news/2018/02/23/artificial-intelligence-race-dark-skin-bias/

R2 https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

R3 https://www.washingtonpost.com/news/monkey-cage/wp/2016/10/17/can-an-algorithm-be-
racist-our-analysis-is-more-cautious-than-
propublicas/?noredirect=on&utm_term=.ef5acb376ecf
Table 4-2: Selected Publications

4.2 Analysis of the literature review


RQ1) What is the state of art of quality model for data on Machine Learning?

ANSWER: RQ1 is answered using a literature review by gathering evidence on the existing data quality literature in the given databases. In this literature review we gathered and synthesized all the existing information on data quality models. As the main motive of the study is to collect information on data quality models and their characteristics, we chose the literature review as the apt method for RQ1 [66]. In this database search we did not come across any articles containing data on data quality models for machine learning, which makes our study more useful for other researchers. After selecting 10 articles to answer RQ1.1, we studied those articles very closely and extracted the following data quality attributes, which are very useful for conducting the interviews and for developing a data quality model for machine learning.

RQ1.1) What are the data quality attributes that are used to characterize the quality of data used
in the machine learning system?

ANSWER:
After the in-depth screening of all the selected articles, we obtained the following attributes that are suitable for machine learning. These data quality attributes are useful in building the data quality model for machine learning; the following table lists the different data quality attributes for machine learning. The overall definition of each data quality attribute is stated at the top of the cell in the definition column, and the definitions extracted from the selected articles are stated below the overall definition.

Selected data Quality attributes:

Selected Attributes Definitions according to articles Total


articles

1,2,3,4,6 Accuracy It can be considered as the agreement of the recorded value 5


with either an attribute of a real-world entity, a value stored
in another database, or the results of an arithmetic
computation [67].

These metrics compare the automatically annotated data to


parts of the data which represent the real world, such as
manually annotated gold standard corpora. Statistical
classifiers are evaluated by comparing them to gold
standards and by determining how many of the classified
entities really belong to a class[68].

Accuracy means that the recorded value is in conformity


with the actual value. Accuracy in benchmarks can be easily
measured with some external metric [69].

agreement with either an attribute of a real-world entity, a


value stored in another database, or the results of an
arithmetic computation [70].

In this article, the authors note that many synonyms are used to refer to accuracy, such as goodness, correctness and quality [71].

1,3,4 Completeness As data having all values recorded and as the measure of 3
how complete the representation of the target domain in our
database [67].

Completeness can be defined in two different ways: as data


having all values recorded, as a measure of how complete
the representation of the target domain in our database [69].

Complete data has been defined as data having all values


recorded [70].

1,3,4,6 Consistency Consistency can be considered as the case when the 4


representation of the data value is the same in all cases and
as format and definitional uniformity within and across all
comparable datasets [67].

Consistency can be considered as format and definitional


uniformity within and across all comparable datasets [69].

Consistency is defined as the representation of the data


value is the same in all cases and as format and definitional
uniformity within and across all comparable datasets [70].

Data consistency is commonly interpreted as the similarity


between outputs produced
by different workers in response to the same input. It has
been studied, for instance,
in the context of peer consistency evaluation [71].

1,3,4,6 Timeliness Volatility refers to the time period between real-world 4


change and next change which makes original data invalid.
Also defined currency as the age of data units used to
produce the information products [67].

Timeliness definition consists of two parts: volatility and


currency. Volatility refers to the time period between real-
world change and next change which makes original data

invalid. Currency refers to the age of data units used to
produce information products [69].

It is defined in 3 different events: The first occurs with a


change in the real world, the second when that change is
recorded as data in an information system, and the third on
the use of that data[70].

The timeliness of data is the property of outputs to be


available in useful time for further processing. Timeliness is
especially important in near Realtime crowdsourcing
scenarios it is also been studied as “reaction time” [71].

2 Relevancy The ability to retrieve data that can satisfy the requirements 1
of the customers[68].

2 Interpretability The extent to which data is appropriate languages, symbols, 1


and units and the definition is clear[68].

5 Effectiveness It is the efficacy to produce determined results alongside 1


with accuracy and completeness to attain project goals[72].

5 Efficiency Coherence to attain accuracy and completeness using the 1


available resources which can be either financial or
human[72].

5 Satisfaction The level to which the user is satisfied with the objectives, 1
confident and comfortable with the system[72].

5 Context The level to which a system can be used with efficiency, 1


coverage effectiveness, satisfaction, and freedom from risk in a more
flexible way with the specified requirements[72].

5 Freedom from The level to which a system mitigates the expected risk to 1
risk people in the designated contexts, property or the
environment in use and economic resources in use[72].

10 Privacy A privacy policy which can be used by all data consumers to 1


manage access to sources with regard to their personalized
data [73].

8 Reproducibility The extent to which the data can be reproducible or 1


repeatable that allows others to follow the results and
continue to train new machine learning systems[74].

9 Size The maximum amount of data that can vary depending on 1


the type of input data[75].

7 Diversity The given data which can be used in different levels of a 1


machine learning system[76].
Table 4-3: Selected Data Quality Attributes

The attribute below is not from the literature review; we identified it from multiple published reports:

Articles Attribute Definition according to published reports Total

R1, R2, R3   Fairness   The ratio to which the machine is trained with data of all kinds of races (human or any other race) so that the machine can recognize all races equally [77]–[79].   4
Table 4-4: Attributes Identified from Published reports

Attributes and their effects on machine learning:

The attributes below were extracted from our analysis of the articles given above. For each attribute we give a definition based on our understanding of the selected articles, together with its influence on machine learning and how it can help a data scientist monitor and improve the quality of data, as stated in the selected articles.

Accuracy: Data is accurate when the data values stored in the database correspond to real-world values, or the extent to which data is correct, reliable and certified.
Influence: Accuracy provides a benchmark that can easily be measured with some external metric.

Completeness: The ability of an information system to represent every meaningful state of the
represented real-world system
Influence: This dimension helps in finding missing values. Accepted completeness levels affect clustering quality more than accuracy levels do.

Consistency: The extent to which data is presented in the same format and compatible with previous
data
Influence: Helps in finding the representation of the data value is the same in all cases. Ensures that
there were no conflicts within or between data sets

Timeliness: Timeliness refers only to the delay between a change of a real-world state and the
resulting modification of the information system state
Influence: This is used to find the age of the data that a data scientist is training the machine learning
system with

Relevancy: To retrieve the data based on the requirement of the end-user or targeted customers.
Influence: Helps in finding the right material for the right machine learning system and helps in training
the algorithm with relevant data

Interpretability: To extract the data with the right language, units and symbols
Influence: The necessity for interpretability comes from incompleteness in the problem formalisation,
meaning that for certain problems or tasks it is not enough to get the answer

Effectiveness: The capability to produce the desired output from the extracted data
Influence: This helps in making sure that all the objectives have been fulfilled based on the requirements
of the end-user

Efficiency: Using the available resources, whether human or financial, the coherence to attain accuracy and completeness is efficiency.
Influence: Uses less data to achieve the accuracy and completeness objectives and gives the best results.

Satisfaction: The extent to which the end-user is satisfied with the trained data
Influence: Gives and improves the level of satisfaction of the end-user

Context coverage: The level to which the system can be re-trained with the data that matches the end-
user's requirements
Influence: We get to know whether we have covered all the given attributes for a machine learning
system

Freedom from risk: The level to which the extracted data can mitigate the risk to data consumers or
the resources that are being utilised.
Influence: This makes sure that the machine learning system is trained without any risks

Diversity: To extract the data that can be used in various stages of data training
Influence: The selected data will be used in all levels to train a machine learning system

Reproducibility: The degree to which the data can reproduce the same results and allow others to
continue to train new machine learning systems
Influence: We can track the available data and reproduce the same results as the previously trained
model

Size: The size of the extracting data which varies depending on the input data
Influence: Helps in selecting the appropriate data so that the size of input decreases

Privacy: Privacy of data that is being extracted by the data consumers to manage access to data sources
regarding the personal data
Influence: This is useful in securing data privacy when a data scientist is training a system that needs
security as the main preference

Fairness: To extract the data in the equal ratio, so that the machine learning system can recognise
everyone equally
Influence: This helps in maintaining equality while selecting the trainable data

ISO 25012 Model:

We plan to compare our literature review results with the attributes of the ISO 25012 data quality model, so we added the quality attributes from both the literature review and ISO 25012 to the interview questionnaire shared with each interviewee. The reason for the comparison is the lack of strong evidential literature on data quality models for machine learning; in order to develop such a model we need a strong evidential base showing that not every existing data quality model can be used to define the quality of data for machine learning. ISO 25012 provides this base and can help us show that not all of its data quality attributes are suitable for machine learning, and that machine learning needs a separate data quality model with its own attributes. Another benefit is that ISO 25012 contains a few data quality attributes that are missing from our literature review results but may be useful for machine learning. To identify which data quality attributes of the ISO model are suitable for machine learning, we included them in the interview questionnaire so that experienced data scientists could assess which attributes are best suited to be added to our data quality model for machine learning.

The data quality attributes below, taken from the ISO 25012 data quality model, are used in our interviews along with the literature review results:

• Accuracy
• Completeness
• Credibility
• Currentness
• Portability
• Efficiency
• Traceability
• Understandability
• Availability
• Recoverability
• Reproducibility
• Accessibility
• Confidentiality
• Precision

4.3 Results of In-Depth Interview:


We conducted our interviews through Skype or Google Hangouts and, with the help of the Trello board, recorded the interviewees' answers. Before starting each Skype interview, we sent the interviewee a document containing the interviewers' profiles, our thesis outline, the Trello link and a few questions describing what to do on the Trello board.

Before starting each call, we asked every interviewee for permission to record the audio of the interview, for quality purposes.

On the Trello board we created three columns: data quality attributes, important, and not important. Under the data quality attributes column we listed the 26 attributes as cards. The interviewee was asked to read the definition provided on each attribute card, with the open option of asking us about any attribute they did not understand. They were then asked to drag the attributes into the important and not important columns from their perspective. For each attribute placed in the important column they had to state, in the comment box of that attribute's card, why they chose it; for the attributes placed in the not important column they had to state the overall reason for their choice in a reason card.

These are the attributes present in the Data quality attributes column: -

1. Accuracy
2. Completeness
3. Consistency
4. Timeliness
5. Relevancy
6. Efficiency
7. Satisfaction
8. Interpretability
9. Effectiveness
10. Context coverage
11. Freedom from risk
12. Credibility
13. Currentness
14. Accessibility
15. Confidentiality
16. Precision
17. Traceability
18. Understandability
19. Availability
20. Portability
21. Recoverability
22. Size
23. Fairness
24. Privacy
25. Reproducibility
26. Diversity

4.3.1 Summarizing Interviews:


Below is a summary of each interview. In these summaries we list all the important and not important attributes selected by each interviewee. The reasons the interviewees gave for placing attributes in a particular list can be found in the appendix. We refer to interviews 1 to 15 as I1-I15.

Interview 1(I1):

The interviewee has selected 9 data quality attributes as important: Accuracy, Completeness, Context
Coverage, Freedom from risk, Relevancy, Satisfaction, Credibility, Precision, Understandability

Interviewee selected the rest of the 13 quality attributes as not important: Consistency, Effectiveness,
Efficiency, Interpretability, Timeliness, Accessibility, Availability, Confidentiality, Currentness, Size,
Portability, Traceability, Recoverability

Interview 2(I2):
The interviewee has selected 10 data quality attributes as important: Accuracy, Completeness, Freedom
from risk, Relevancy, Satisfaction, Precision, Understandability, Effectiveness, Interpretability,
Fairness.

Interviewee selected 8 quality attributes as not important: Context Coverage, Credibility, Consistency,
Accessibility, Availability, Recoverability, Reproducibility, size.

The interviewee also created a separate "depending upon the application" column on the Trello board, for attributes that are either important or not important depending on the application; the remaining 8 data quality attributes, which were not placed under important or not important, are: Privacy, Efficiency, Timeliness, Currentness, Confidentiality, Portability, Traceability, Diversity.

Interview 3(I3):

The interviewee has selected 11 data quality attributes as important: Accuracy, Relevancy, Satisfaction,
Effectiveness, Credibility, Consistency, Accessibility, Currentness, Traceability, Interpretability,
Availability

Interviewee selected 8 quality attributes as not important: Completeness, Precision, Understandability,


Fairness, Recoverability, Timeliness, Confidentiality, Freedom from Risk.

Apart from the attributes sorted into important and not important, a few data quality attributes were not placed in either list: Size, Context coverage, Reproducibility, Privacy, Efficiency, Portability, Diversity.

Interview 4(I4):

The interviewee has selected 11 data quality attributes as important: Accuracy, Availability, Timeliness,
Currentness, Traceability, Completeness, Interpretability, Accessibility, Reproducibility,
Confidentiality, Efficiency

Interviewee selected rest of the 15 quality attributes as not important: Precision, Relevancy,
Understandability, Privacy, Portability, Fairness, Freedom from Risk, Diversity, Effectiveness, Context
Coverage, Size, Credibility, Satisfaction, Consistency, Recoverability

Interview 5(I5):

The interviewee has selected 17 data quality attributes as important: Relevancy, Context Coverage,
Interpretability, Understandability, Efficiency, Freedom from Risk, Satisfaction, Effectiveness,
Reproducibility, Precision, Accuracy, Currentness, Fairness, Credibility, Confidentiality, Size,
Recoverability

Interviewee selected the rest of the 9 quality attributes as not important: Portability, Completeness,
Consistency, Timeliness, Accessibility, Privacy, Diversity, Traceability, Availability.

Interview 6(I6):

The interviewee has selected 23 data quality attributes as important: Efficiency, Reproducibility,
Accuracy, Consistency, Availability, Fairness, Timeliness, Credibility, Relevancy, Diversity, Size,
Recoverability, Satisfaction, Context Coverage, Portability, Traceability, Currentness, Effectiveness,
Precision, Privacy, Confidentiality, Understandability, Freedom from Risk

Interviewee selected the rest of the 3 quality attributes as not important: Completeness, Interpretability,
Accessibility.

Interview 7(I7):

The interviewee has selected 9 data quality attributes as important: Accuracy, Relevancy, Accessibility,
Completeness, Interpretability, Size, Timeliness, Diversity, Privacy

Interviewee selected 1 quality attributes as not important that is Fairness.

Apart from the attributes sorted into important and not important, a few data quality attributes were not placed in either list: Consistency, Efficiency, Satisfaction, Effectiveness, Freedom from Risk, Credibility, Currentness, Confidentiality, Precision, Traceability, Understandability, Availability, Portability, Recoverability, Context Coverage, Reproducibility.

Interview 8(I8):
The interviewee has selected 13 data quality attributes as important: Accuracy, Consistency,
Completeness, Interpretability, Relevancy, Context Coverage, Precision, Traceability, Diversity,
Effectiveness, Privacy, Size, Portability

Interviewee selected the rest of the 13 quality attributes as not important: Timeliness, Efficiency,
Satisfaction, Freedom from Risk, Credibility, Confidentiality, Accessibility, Understandability,
Availability, Fairness, Recoverability, Reproducibility

Interview 9(I9):

The interviewee has selected 12 data quality attributes as important: Relevancy, Accuracy,
Reproducibility, Satisfaction, Privacy, Interpretability, Freedom from Risk, Credibility, Traceability,
Size, Understandability, Availability

Interviewee selected rest of the 14 quality attributes as not important: Completeness, Consistency,
Diversity, Timeliness, Fairness, Efficiency, Effectiveness, Context Coverage, Currentness,
Accessibility, Recoverability, Precision, Portability, Confidentiality.

Interview 10(I10):

The interviewee has selected 12 data quality attributes as important: Relevancy, Accuracy,
Reproducibility, Completeness, Understandability, Interpretability, Efficiency, Fairness, Portability,
Confidentiality, Currentness, Credibility

Interviewee selected rest of the 14 quality attributes as not important: Consistency, Timeliness,
Satisfaction, Effectiveness, Size, Diversity, Availability, Privacy, Recoverability, Freedom from Risk,
Precision, Accessibility, Context Coverage, Traceability.

Interview 11(I11):

The interviewee has selected 18 data quality attributes as important: Relevancy, Accuracy, Credibility,
Currentness, Traceability, Interpretability, Recoverability, Effectiveness, Efficiency, Context Coverage,
Completeness, Satisfaction, Confidentiality, Consistency, Availability, Privacy, Fairness,
Reproducibility

Interviewee selected the rest of the 8 quality attributes as not important: Timeliness, Diversity, Precision,
Understandability, Size, Portability, Accessibility, Freedom From Risk.

Interview 12(I12):

The interviewee has selected 12 data quality attributes as important: Relevancy, Accuracy, Diversity,
Recoverability, Availability, Fairness, Accessibility, Size, Effectiveness, Context Coverage,
Consistency, Completeness

Interviewee selected rest of the 14 quality attributes as not important: Reproducibility, Portability,
Privacy, Understandability, Traceability, Confidentiality, Precision, Currentness, Credibility,
Efficiency, Freedom from Risk, Timeliness, Interpretability, Satisfaction.

In addition, this interviewee suggested two data quality attributes that might be useful for machine learning from his perspective: proper test methods and one single truth. We did not include these attributes in our model or discussion because we could not find any existing literature on them.

Interview 13(I13):

The interviewee has selected 17 data quality attributes as important: Relevancy, Accuracy, Context
Coverage, Completeness, Consistency, Efficiency, Satisfaction, Effectiveness, Reproducibility,
Diversity, Portability, Size, Privacy, Confidentiality, Traceability, Interpretability, Credibility

Interviewee selected the rest of the 9 quality attributes as not important: Timeliness, Freedom from Risk,
Currentness, Accessibility, Understandability, Precision, Recoverability, Fairness, Availability.

Interview 14(I14):

The interviewee has selected 16 data quality attributes as important: Accuracy, Size, Precision,
Timeliness, Currentness, Credibility, Satisfaction, Traceability, Reproducibility, Understandability,
Interpretability, Effectiveness, Efficiency, Fairness, Diversity, Availability

Interviewee selected rest of the 10 quality attributes as not important those are Accessibility,
Confidentiality, Context Coverage, Privacy, Freedom from Risk, Completeness, Portability,
Recoverability, Consistency, Relevancy.

Interview 15(I15):

The interviewee has selected 16 data quality attributes as important: Relevancy, Accuracy, Fairness,
Credibility, Diversity, Currentness, Availability, Efficiency, Traceability, Effectiveness, Privacy,
Completeness, Size, Accessibility, Understandability, Context Coverage

Interviewee selected rest of the 10 quality attributes as not important those are Timeliness, Satisfaction,
Freedom from Risk, Confidentiality, Interpretability, Reproducibility, Consistency, Recoverability,
Precision, Portability.

Other Feedbacks:

We invited another person for an interview; for some reason he forwarded our request to one of his colleagues. The feedback we received says that machine learning problems are diverse, as they differ with the type of algorithm being used. He therefore notes that the importance of each quality attribute may vary depending on the machine learning algorithm, and that a perfectly clean dataset (one whose typical quality measures indicate good health) does not guarantee successful machine learning.

4.3.2 Labelled code:


In the table below we label the attributes with codes; these codes help us identify the attributes and allocate themes to them.

S.no Attributes Code

1) Accuracy A1

2) Completeness A2

3) Consistency A3

4) Timeliness A4

5) Relevancy A5

6) Efficiency A6

7) Satisfaction A7

8) Interpretability A8

9) Effectiveness A9

10) Context Coverage A10

11) Freedom from Risk A11

12) Credibility A12

13) Currentness A13

14) Accessibility A14

15) Confidentiality A15

16) Precision A16

17) Traceability A17

18) Understandability A18

19) Availability A19

20) Portability A20

21) Recoverability A21

22) Size A22

23) fairness A23

24) Privacy A24

25) Reproducibility A25

26) Diversity A26


Table 4-5: Labelled codes

4.4 Analysis of In-Depth Interview Study:


The table below gives the details of all the data quality attributes that occurred as important in the in-depth interview study. The occurrences reflect the attributes placed in the important list by the 15 interviewees. The interview numbers of those who selected each attribute are given in the interviewees column, and the articles from which we extracted each attribute are specified in the selected articles column.

S.No  Attribute  Occurrence  Interviewees who selected the attribute  Selected articles (Tables 4-1, 4-2) from which the attribute was considered

1 Accuracy 15 I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, 1,2,3,4,6, ISO
I11, I12, I13, I14, I15 25012

2 Relevancy 13 I1, I2, I3, I5, I6, I7, I8, I9, I10, I11, 2
I12, I13, I15

3 Interpretability 11 I2, I3, I4, I5, I7, I8, I9, I10, I11, I13, 2
I14

4 Completeness 10 I1, I2, I4, I7, I8, I10, I11, I12, I13, I15 1,3,4, ISO 25012

5 Credibility 10 I1, I3, I5, I6, I9, I10, I11, I13, I14, I15 ISO 25012

6 Effectiveness 10 I2, I3, I5, I6, I8, I11, I12, I13, I14, I15 5

7 Satisfaction 9 I1, I2, I3, I5, I6, I9, I11, I13, I14 5

8 Traceability 9 I3, I6, I8, I9, I10, I11, I13, I14, I15 ISO 25012

9 Size 9 I5, I6, I7, I8, I9, I12, I13, I14, I15 9

10 Understandability 8 I1, I2, I4, I5, I6, I9, I14, I15 ISO 25012

11 Context Coverage 8 I1, I5, I6, I8, I11, I12, I13, I15 5

12 Fairness 8 I2, I5, I6, I10, I11, I12, I14, I15 R1, R2, R3

13 Currentness 8 I3, I4, I5, I6, I10, I11, I14, I15 ISO 25012

14 Availability 8 I3, I4, I6, I9, I11, I12, I14, I15 ISO 25012

15 Efficiency 8 I4, I5, I6, I10, I11, I13, I14, I15 5, ISO 25012

16 Reproducibility 8 I4, I5, I6, I9, I10, I11, I13, I14 8

17 Diversity 7 I6, I7, I8, I12, I13, I14 ,I15 7

18 Privacy 7 I6, I7, I8, I9, I11, I13, I15 10

19 Precision 6 I1, I2, I5, I6,I8, I14 ISO 25012

20 Consistency 6 I3, I6, I8, I11, I12, I13 1,3,4,6, ISO 25012

21 Confidentiality 6 I4, I5, I6, I10, I11, I13 ISO 25012

22 Freedom from 5 I1, I2, I5, I6, I9 5


Risk

23 Accessibility 5 I3, I4, I7, I12, I15 ISO 25012

24 Timeliness 4 I4, I6, I7, I14 1,3,4,6

25 Recoverability 4 I5, I6, I11, I12 ISO 25012

26 Portability 4 I6, I8, I10,I 13 ISO 25012


Table 4-6: Occurrence of Data Quality Attributes in Interviews and selected articles

RQ2) How do data scientists characterize data quality attributes for machine learning?

Answer: Below are the data quality attributes with the highest occurrence in our in-depth interview study, i.e. the attributes characterized as important by the largest number of the data scientists we interviewed. Each data scientist summarized every attribute they selected as important; the descriptions below characterize the attributes selected by more than 50% of the data scientists.

Accuracy: To check whether the data is accurate or not is one of the most important attributes.

Relevancy: The end user will not be satisfied if the system is not trained with relevant data, making this
attribute important.

Interpretability: This is related to understandability, explainability and important for gaining the trust
of users.

Completeness: In machine learning system it is hard to deal with unseen classes, that makes this
attribute important.

Credibility: This is related to completeness and accuracy which both are important.

Effectiveness: This helps in reducing the training time and also is related to completeness.

Satisfaction: This is similar to relevancy as the user will not be satisfied if the final output does not
meet the user’s needs.

Traceability: This attribute helps in increasing the users' trust in the data and the security and trustworthiness of the data. It also helps in tracking the main source of a change so that we do not deviate from it.

Size: In machine learning systems, the more data, the better. In general, having less data causes worse outcomes and negatively affects performance.

Understandability: This attribute is important for the designers of Machine Learning system. If a
designer has trouble understanding the data then he might make wrong assumptions, leading to issues
in the trained system.

Context coverage: When the user is not satisfied with the end system then the system needs to be
retrained with additional data or with a different target.

Fairness: This is needed in the long run as per ethics and population is considered.

Currentness: Old data is worthless for a model that aims to produce effective outcomes in the real world, so it is really important to have up-to-date data.

Availability: Accessing the data is a key point in building a good model; if the data cannot be accessed, it cannot be used.

Efficiency: This attribute is related to Accuracy and Completeness, as both are important.

Reproducibility: Data which is used to produce the results can also be continued to train new models
and should be able to reproduce the same results in order to verify every step of the data pipeline.

RQ2.1) How are the data quality attributes of the data quality model applicable to data for machine learning?

Answer: The attributes listed below were selected for the data quality model by the largest number of the data scientists we interviewed. Each data scientist summarized every attribute they selected as important and described its effect on machine learning systems. The table below gives the effect of each data quality attribute on a machine learning system, according to more than 50% of the data scientists interviewed in our study.

S.No Attributes Applicability for Machine Learning

1 Accuracy This attribute helps in checking whether the selected data is accurate or
not making the system more reliable

2 Relevancy Helps to train the system with the appropriate data that can satisfy all
the end-users’ requirements

3 Interpretability This is useful in understanding the results and helps in applying the
results in real-time

4 Completeness This attribute helps in handling the unseen classes in the machine
learning system

5 Credibility Helps in training the system with more trustable and reputable data for
better quality

6 Effectiveness The more effective the data the less amount of data is needed to train
the system

7 Satisfaction Used while making the users satisfaction as the main priority, designing
the system with the appropriate data

8 Traceability This makes the data trustworthy and the source accountable

9 Size With the bigger data, we get a better outcome. The performance can
vary from system to system

10 Understandability With this attribute, if the designer understands the data well, then he
might be able to train the system accurately so that we get required
results

11 Context Coverage This is useful while retraining the Machine Learning system with
additional data to satisfy the user's requirements

12 Fairness This helps in improving all types of imbalances and give more accurate
results.

13 Currentness It is useful for the systems where the data is supposed to be from a time
series.

14 Availability It is helpful in reusing the available data as finding the appropriate data
is always time-consuming.

15 Efficiency This is useful in selecting less data with which we can get great results.

16 Reproducibility It can be useful while keeping track of the data and also can be used as a
backup database.
Table 4-7: Application of Data Quality attributes

4.4.1 Research Implications:

We validated the data quality attributes selected by the interviewees. These quality attributes were identified from the interview transcripts. The research implications of the data quality attributes were analysed based on references related to improving the quality of data.

1 Understandability: This attribute enables users to interpret and express the information in appropriate languages and symbols for a specific context of use. According to [80], this attribute is defined as the extent to which the data is expressed in appropriate languages and symbols and the extent to which its definitions are clear.

2 Fairness: A machine learning system is fair when it is trained with data that represents all groups, e.g. different ethnicities, in balanced proportions so that it recognises all groups equally. According to [81], this attribute measures individual- and group-level discrimination ratios for ML products (a minimal sketch of such a group-level ratio check is given after this list).

3 Size: The size of the data is the maximum amount of data available, which varies depending on the type of input data. According to [82], weighting attributes in mining tasks reduces the impact of noisy attributes.

4 Currentness: This attribute identifies whether the information is up to date. According to [83], a value is identified as current even when there are time-related changes to it; in this case, it refers to the extent to which the database is up to date.

5 Efficiency: Efficiency is the ability to attain accuracy and completeness using the available resources, whether human or financial. According to [80], it is the capability of the data to provide suitable performance relative to the amount of resources used.

6 Availability: The degree to which the data can be retrieved by authorized users for that context of use. According to [84], the system should be available according to the specified requirements; since the systems considered there are mission-critical, the system stays offline until the fault is fixed.

7 Relevancy: Retrieving the data based on the requirements of the end-user or targeted customers. According to [85], it is the extent to which the selected data is helpful and applicable for the task at hand.

8 Context Coverage: The degree to which the system can be re-trained with data that matches the end-user's requirements. According to [86], it concerns a system that can be used with effectiveness, efficiency, freedom from risk and satisfaction in its specified context of use.

9 Reproducibility: The degree to which the data can reproduce the same results and allow others to continue training new machine learning systems. According to [87], it is the closeness of agreement between results obtained under changed conditions of measurement.

10 Traceability: The extent to which the source of the information, including the owner and/or author, and any changes made to the information can be verified. According to [88], it is the ability to trace the location of a product or entity through recorded information.

11 Satisfaction: The extent to which the end-user is satisfied with the trained data. According to [89], performance quality has an indirect effect on behavioural intentions through satisfaction.

12 Effectiveness: The capability to produce the desired output from the extracted data. According to [90], it is the design's ability to achieve the desired behaviour using object-oriented design techniques and concepts.

13 Credibility: The extent to which the information is reputable, objective and trustworthy. According to [91], credibility is the quality or power to elicit belief.

14 Completeness: The ability of data to represent every meaningful state of the represented real-world system. According to [92], it is the extent to which all the data is included in the registry database.

15 Accuracy: Data is accurate when the data values stored in the database correspond to real-world values, or the extent to which data is correct, reliable and certified. According to [93], it is the measurement or representation of correctness or precision.

16 Interpretability: Extracting the data with the right language, units and symbols for better understandability. According to [94], if the input parameters are interpretable, the measurement can easily be interpreted.
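As a concrete illustration of the Fairness attribute described in item 2 of this list, the Python sketch below computes a simple group-level ratio of positive outcomes between the groups in the training data. The column names group and label and the toy data are hypothetical, and this ratio is only an illustrative indicator; it is not the measurement procedure of [81].

```python
import pandas as pd


def group_positive_ratio(df: pd.DataFrame, group_col: str, label_col: str) -> float:
    """Ratio between the lowest and highest positive-label rate across groups.

    A value close to 1.0 means the groups receive positive labels at similar
    rates; a value far below 1.0 hints at an imbalance worth investigating.
    """
    rates = df.groupby(group_col)[label_col].mean()
    return rates.min() / rates.max()


# Hypothetical training data with a sensitive attribute and a binary label.
data = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b", "b"],
        "label": [1, 1, 0, 1, 0, 0],
    }
)
print(group_positive_ratio(data, "group", "label"))  # 0.5 for this toy data
```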

5 DATA QUALITY MODEL:
This data quality model is built on the results and occurrence counts presented above; the attributes selected for this data quality model are based on how often each attribute was marked as important by the 15 interviewees. Each selected attribute was marked as important by 50% or more of the interviewees [37]. The remaining attributes were marked as not important by 50% or more of the interviewees, making them less qualified for this quality model compared to the selected attributes. The 10 attributes that are not listed in our data quality model can still be added through future research: if more data scientists share their view on the importance of an attribute that is not included in the list, it can be added to the data quality attributes of our data quality model. The data quality attributes of our Data Quality Model are listed below:

• Accuracy
• Relevancy
• Interpretability
• Completeness
• Credibility
• Effectiveness
• Satisfaction
• Traceability
• Size
• Understandability
• Context Coverage
• Fairness
• Currentness
• Availability
• Efficiency
• Reproducibility

Characterisation of sub-attributes: To divide attributes into sub-attributes, we mainly considered the opinions of the interviewees. From the interviews, interview transcripts and field notes, we found that more than 50% of the interviewees (data scientists) mentioned relations between attributes and also mentioned sub-attributes for a few of the attributes, based on the provided definitions and their industrial experience. Based on the interviewees' opinions we formulated the table below, which states the attributes and their sub-attributes for the above-mentioned data quality attributes of the data quality model.

S.no Categories Sub-attributes

1) Accuracy Efficiency

2) Completeness Effectiveness

3) Relevancy Satisfaction

4) Interpretability Understandability

5) Credibility Traceability

6) Size None

7) Fairness None

8) Context coverage None

9) Currentness None

10) Availability None

11) Reproducibility None
Table 5-1: Attributes and Sub-Attributes

Hierarchical Structure for Data Quality Model:

The data quality model for machine learning below is developed based on the interviewees' perspective. The hierarchical structure of this data quality model is constructed based on this article [95]. The reason for selecting a hierarchical structure is that a pictorial representation makes the data easier to understand for readers and practitioners. The reference selected for structuring this data quality model is an ISO data quality model proposed by the International Organization for Standardization; the information proposed by this organization is published after being approved by various experienced people related to the particular topic. They proposed a hierarchical structure in this article, based on which we built the data quality model with a hierarchical structure shown below.

Figure 5-1: Data Quality Model for Machine Learning [95]

5.1.1 Comparison with ISO/IEC 25012:


When the list of data quality attributes selected for the data quality model is compared with the attributes of the ISO 25012 data quality model, we noticed that a total of 11 out of 16 quality attributes come from the attributes we extracted in our literature review. A few data quality attributes are common to both our literature review and the ISO 25012 model.

The data quality attributes mentioned in the ISO 25012 data quality model that match attributes in our data quality model are listed below [34]:

• Accuracy
• Completeness
• Credibility
• Currentness
• Efficiency
• Traceability
• Understandability
• Availability
• Reproducibility

The remaining attributes that are selected in the data quality model but not mentioned in ISO 25012 are [34]:

• Relevancy
• Interpretability
• Effectiveness
• Satisfaction
• Size
• Context Coverage
• Fairness

Not all attributes in the ISO 25012 data quality model are suitable for the machine learning systems considered here. The attributes that are not included in our data quality model may still be important for other machine learning systems that we did not come across in our interview study.

Selection of Data Quality Attributes from Literature Review and ISO 25012 Data Quality
Model:

The figure below compares the data quality attributes from our literature review study with those of the ISO 25012 data quality model. The attributes in the figure are shown in two colours; the attributes in green are the ones selected for the data quality model for machine learning. A total of 10 data quality attributes are selected from the literature review column, whereas 8 data quality attributes are selected from the ISO 25012 data quality model. Accuracy and Completeness appear in both columns.

Figure 5-2: Selection of data quality attributes

Comparison conclusion:

From figure 5-2 we can see that 8 out of the 14 ISO 25012 attributes were selected, which corresponds to an acceptance rate of about 57% among the interviewees. Every attribute added to the data quality model for machine learning needs a high acceptance rate from data scientists. This acceptance is based entirely on the perspectives of these 15 interviewees; new attributes can be added to the list through further research, and the attributes can be tested by performing an experiment in further studies.

6 DISCUSSION
The main aim of this study is to develop a data quality model for machine learning; in our research we found no strong literature presenting a predefined data quality model for machine learning, which makes this study all the more needed in industry. The reason for selecting this research gap is that machine learning is a booming area right now and, as mentioned above, data is of utmost importance because machine learning mainly works on data [5]. Data is used to train machine learning systems to perform tasks, and data scientists train these systems. Training machine learning systems with good-quality data increases their performance and their ability to solve complex tasks [12]. To achieve good data quality for machine learning systems, data scientists can follow models such as a data quality model for machine learning. Such a data quality model consists of data quality attributes which help to monitor, control and increase the quality of data. However, data scientists are struggling because there are no data quality models for machine learning (based on the literature available so far). Therefore, we need to identify data quality attributes for machine learning and develop a data quality model. To achieve this, we formulated objectives, and these objectives are achieved by our research methods.

Objective 1: Identify the data quality attributes for machine learning; this is achieved through the literature review.

Objective 2: Assess the data quality attributes for machine learning selected from the literature in order to develop a data quality model; this is achieved through the in-depth interview method, after which we develop the data quality model based on the analysed interview results (i.e. on the interviewees' perspective).

We chose to start from the existing data quality attributes for machine learning algorithms that we extracted through the literature review search. We cannot say for certain that all data quality attributes extracted from the literature will improve the selection of data for all machine learning algorithms, as there is no literature evidence available; for this reason, we planned to interview data scientists who have experience with machine learning algorithms. By performing in-depth interviews we learned which data quality attributes affect which machine learning algorithms; this interview study also helped in identifying the data quality attributes that are commonly used for different machine learning algorithms, which we later used in developing a data quality model for machine learning. We cannot say with 100% certainty that this data quality model suits every machine learning algorithm, as it is developed entirely from the interviewees' perspective and the interviewees did not cover all machine learning algorithms; however, they mentioned the algorithms they previously worked with, and the data quality model we suggest is applicable to all the machine learning algorithms mentioned by the interviewees in table 3-3.

The attributes that were added to our data quality model occurred the greatest number of times as important attributes in our interviews and were selected by experienced interviewees.

Example: Accuracy is the data quality attribute in our data quality model with the highest occurrence rate; taking it as an example, we can say that this attribute is useful for all the machine learning algorithms and systems that our interviewees have worked with.

Below we discuss in detail how our aim and objectives were achieved. First, we found articles regarding data quality for machine learning in section 2.1, where we stated how important data and its quality are for machine learning, together with the advantages and disadvantages related to data quality for machine learning. As the authors of this thesis, we wanted our readers to know the importance of data quality before going in depth into the research.

Objective 1 is achieved by implementing a literature review method, through which we found relevant literature regarding different data quality attributes and their effects on machine learning, presented in section 4. In this literature review we first performed the search process; among the many kinds of search process we selected keyword-based search. After performing abstract and title screening, we found 27 articles; after in-depth screening based on implicit and explicit criteria, and after a detailed discussion between us (the two authors of this thesis), 10 articles were selected, which are listed in table 4-1. We also searched online publications regarding data quality attributes for machine learning in order to find any new data quality attributes and found three publications which mention the fairness attribute, listed in table 4-2. After analysing these articles, we extracted different data quality attributes for machine learning and their summarized definitions in section 4, where the influence of those data quality attributes on machine learning is also described.

Objective 2 is achieved by performing in-depth interviews. We conducted interviews based on the different attributes that we identified from the literature review. These data quality attributes were listed on the Trello board in a column named 'data quality attributes', and we also created two other columns, 'important' and 'not important'. This Trello board made it easier for interviewees to sort the attributes into the different columns based on their perspective and experience. We also formulated a few questions that were asked during the interviews, which helped us understand more about the data quality attributes, which attributes should be selected for the data quality model, and the reasons behind placing each attribute in a specific column. With the help of these interviews we were able to learn the interviewees' perspective on the different data quality attributes and which attributes should be selected for the data quality model. The results are specified in section 4.

After conducting the interviews, we gathered the data from the Trello board, audio recordings, interview transcripts and field notes, and analysed all of it using a data analysis method. We chose qualitative content analysis as the data analysis method and, among the content analysis techniques, selected conventional content analysis, described in section 3.3.4. We labelled our data quality attributes in table 4-5. Conventional content analysis mainly considers the following steps to analyse the data: sample, categories, outlining the process, and themes. First, we analysed the interview transcripts; next, we categorized the attributes based on their occurrence in the 'important' column; then we applied the themes to the attributes along with the label and categories (occurrence), which are available in section 3-6. With the help of this data analysis we were able to answer research question 2, which concerns the characterization of data quality attributes by the interviewees (data scientists). With the help of the selected attributes gathered from the data analysis of the interviews, we were able to develop a data quality model for machine learning (developed entirely from the interviewees' perspective) in section 5. With this we achieved our key idea, which is to develop a data quality model for machine learning (Figure 5-1). In addition, we have also described how the data quality attributes of the model we developed are applicable to machine learning. Finally, we compared our data quality model for machine learning with the existing data quality model ISO/IEC 25012 in order to find out how many data quality attributes match the existing data quality model.

This research also helps to re-confirm that the attributes selected in our research show a positive effect on machine learning algorithms; in parallel, we can say that a few attributes selected from the literature review are not suitable for all machine learning algorithms.

Example: Timeliness and Freedom from risk are examples of attributes with the lowest number of occurrences among the data quality attributes we extracted from our literature study; these attributes were selected as important by 4-5 interviewees but as not important by 10-11 interviewees, which is why they are not included in our data quality model.

We selected a total of 16 out of 26 data quality attributes for our data quality model. The criterion for selecting an attribute for the data quality model is that it has at least 8 occurrences as important in our interviews, which makes it suitable for inclusion in our model (a small sketch of this selection step is given below).
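The selection criterion described above can be expressed as a simple filter over the occurrence counts, as sketched below in Python. The counts shown are invented placeholders and not the actual tallies from our interviews.

```python
# Hypothetical counts of "important" votes out of 15 interviewees.
occurrences = {
    "Accuracy": 15,
    "Relevancy": 12,
    "Timeliness": 4,
    "Freedom from risk": 5,
}

THRESHOLD = 8  # at least 8 of 15 interviewees must have marked the attribute as important

selected = [name for name, count in occurrences.items() if count >= THRESHOLD]
rejected = [name for name, count in occurrences.items() if count < THRESHOLD]

print("selected:", selected)  # ['Accuracy', 'Relevancy']
print("rejected:", rejected)  # ['Timeliness', 'Freedom from risk']
```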

Example: Relevancy is one of the attributes in our data quality model; according to the interviewees, this attribute has a positive influence on machine learning algorithms, as it helps to train the system with the appropriate data that can satisfy all the requirements of the end-users.

This study may give a good base for machine learning practitioners and data scientists who have little experience with a particular machine learning algorithm included in our study; they can get a good idea of the type of quality data they need to select to train their machine learning algorithm.

Example: If a data scientist wants to train deep learning or neural network models, they can use Accuracy, Completeness, Fairness or other data quality attributes from our data quality model to select quality data to train the model.

From our study we would like to add that we could not find any existing literature evidence that any existing data quality model is suitable for machine learning data. To address this gap, we developed a data quality model for machine learning which includes only data quality attributes relevant to machine learning.

Example: ISO 25012 has attributes that are suitable for machine learning algorithms, but not all attributes in that data quality model are suitable for all machine learning algorithms, which makes the model less suitable for machine learning overall. This is one of the reasons for developing a dedicated data quality model for machine learning.

What is known: Existing literature covers the data quality attributes that we found in our literature review study; these data quality attributes are applicable to various machine learning algorithms and are already listed in these articles [57], [59], [60], [62, p. 250]. Existing data quality models are also available, but our study did not identify strong literature evidence stating that any of these data quality models can be applied to machine learning data. From the existing literature we also know what the selected data quality attributes are and how they affect machine learning. In our research, however, we wanted to find out how effective these data quality attributes are in terms of their applicability to data used in machine learning systems, from the interviewees' perspective. We were also keen to identify new data quality attributes for machine learning and their applicability and, after gathering the different data quality attributes, we planned to develop a data quality model (based on the interviewees' perspective) using these attributes.

New things we produced from our study: We identified a new data quality attribute, Fairness, from these published reports [67]–[69], and our interviews indicate that this data quality attribute has a positive effect on data quality for machine learning systems. We gathered multiple data quality attributes from various studies and, with the help of the data scientists, developed a data quality model based on the interviewees' opinions; according to the interviewed data scientists, this model can be used to improve the quality of data for machine learning.

Difference of opinion between related work and our results: We noticed that a few of the data quality attributes mentioned in related articles such as [34], [57], [59]–[62] seem to have less effect on the quality of data in machine learning, based on the opinions of the data scientists we interviewed in our study.

Contribution:

Our contribution from this research is that we have developed a separate data quality model for machine learning (based on the interviewees' perspective) which is useful for data scientists in improving the quality of data for machine learning. With the help of this data quality model, data scientists can monitor, control and improve the quality of data used to train machine learning systems. As we developed the data quality model based on the perspective of interviewees who are experienced data scientists working in industry, we are able to say that this data quality model for machine learning is very useful in industry. This data quality model can be consulted by anyone who is interested in training machine learning systems; by referring to it they can identify higher-quality data to train their machine learning systems, which eventually leads to better results.

How this data quality model can be used:


Example:
As mentioned above, a data scientist trains machine learning algorithms. Consider a data scientist who wants to train a machine learning algorithm; to do so they need data, since it is the data that is used to train the algorithm. Based on the requirements, the data scientist gathers the data to train the machine learning algorithm, but before training, the data needs to be pre-processed: the collected data cannot be used directly because it is raw data. Pre-processing helps in removing noisy, irrelevant, inconsistent and incomplete data, turning the collected data into a clean data set. Here the data scientist can use the developed data quality model to monitor, control and improve the quality of the data. The collected raw data is taken into pre-processing, where the data scientist consults the data quality model and, with the help of its data quality attributes, chooses the appropriate data for the algorithm. For example, a data scientist training neural networks could consider the data quality model during data pre-processing and monitor the quality of the data in the following manner (a minimal sketch of such checks is given after this list):

Accuracy: check how accurate the data is; if it is not accurate, the data should be removed or replaced with accurate data based on the requirements.

Relevancy: check whether the data considered is relevant according to the end-user's perspective or the desired output; if not, the data is replaced with relevant data.
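The Python sketch below illustrates how such accuracy and relevancy checks could look inside a pre-processing step. The column names, the valid temperature range and the list of relevant columns are hypothetical placeholders used only for this illustration; they are not part of the data quality model itself.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, relevant_columns) -> pd.DataFrame:
    """Apply simple relevancy and accuracy checks before training."""
    # Relevancy: keep only the columns that matter for the end-user's task.
    df = df[[c for c in relevant_columns if c in df.columns]]

    # Accuracy: drop rows whose 'temperature' value lies outside a plausible
    # range (the range is an assumption made only for this illustration).
    if "temperature" in df.columns:
        df = df[df["temperature"].between(-50, 60)]

    # Completeness-related clean-up: remove rows that still have missing values.
    return df.dropna()


raw = pd.DataFrame(
    {
        "temperature": [21.5, 999.0, 18.2, None],
        "humidity": [0.45, 0.50, 0.45, 0.49],
        "comment": ["ok", "sensor glitch", "ok", "ok"],
    }
)
clean = preprocess(raw, relevant_columns=["temperature", "humidity"])
print(clean)  # only the rows that passed the accuracy check remain
```

In practice, the valid ranges and the set of relevant columns would come from the end-user requirements that the Accuracy and Relevancy attributes refer to.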

We would like to point out a research gap: more data quality attributes can be identified for machine learning, and the research can be extended by interviewing more data scientists who have experience with machine learning algorithms that were missing from our study. Conducting an experiment on this data quality model could help in proving the usefulness of the model for industry.

All the sections and work in this thesis were discussed and documented by both researchers to reduce conflicts between us and to improve the quality of our study.

6.1 Limitations
The table below states the limitations of this project:

S.no Limitations Effect on our research

1 We have data quality attributes for only a few machine learning systems.
Effect: a limited number of data quality attributes results in a weak base for our research.

2 We could not interview a larger number of data scientists due to the limited amount of time and resources.
Effect: interviewing more data scientists could give us results for more machine learning algorithms.

3 The results gathered and interpreted in this project are based on the interviewees' perspective.
Effect: as the model is opinion based, we cannot confirm or evaluate our data quality model.

4 We were not able to verify the results, as no experiment was performed.
Effect: without an experiment, we cannot verify that this data quality model works in industry.

Table 6-1: Limitations

6.2 Threats to validity


Validity can be defined as the degree of trustworthiness of the research results. Cochran recalled that, when Sir Ronald Fisher was asked what should be done in nonexperimental studies to clarify the step from association to causation, he replied, 'make your theories elaborate' [96]. There are mainly four perspectives of validity threats [60].
6.2.1 Internal validity
Internal validity concerns the factors that affect both the process and the results [60].

Formulation of the search string: The formulation of the search string is the most important task in a keyword-based search. In our literature review we implemented a keyword-based search in order to obtain articles from databases, as described in our method above. To get effective results (relevant articles) we needed to formulate correct and accurate search strings; to arrive at the final search string we formulated many different search strings and weighed all their pros and cons to get articles related to our research. This procedure was carried out carefully until we acquired the correct search string, and the whole procedure was assessed by our supervisor. The formulation of the search string might otherwise affect both the process and the results.

6.2.2 External validity


External validity concerns the generalizability of the results [60]. For example, we considered 15 experienced data scientists whose experience ranges from 1 to 20 years, as mentioned above in the in-depth interview method. We conducted interviews with all of them and gathered results from those interviews. We gave equal priority to each interviewee by considering all of their opinions on the data quality attributes, which helped us generalize the results and obtain a better final outcome. A second measure was that we generalized the results by giving all the data scientists the same questionnaire.

6.2.3 Construct validity


Construct validity concerns the problems that occur behind the "research and observation" [60]. For example, when conducting an interview with a data scientist there might be problems with the questionnaire, such as an unclear question or an attribute that the interviewee does not understand. To overcome this problem, we conducted open-ended interviews, which benefit both interviewee and interviewer. Another measure is that the data we gather from the interviews is kept anonymous, which means that the data is not shared with any researcher, organization or third party. Gathering data from many data scientists also helps us avoid mono-operation bias [97].

6.2.4 Conclusion validity


Conclusion validity concerns whether accurate results can be gathered [60]. For example, we conducted interviews with experienced data scientists working in different organizations, and each interview was carried out in a single session. To ensure that we would get accurate results, we practised beforehand how to conduct the interviews and which data we would acquire from them, and we conducted a pilot interview in advance so that we would gather accurate results.

7 CONCLUSION & FUTURE WORK
7.1 Conclusion
To begin our study, we selected keyword-based search as our literature review method; using this method we found 10 relevant articles related to our research. After a thorough analysis of these 10 articles, we extracted a total of 16 data quality attributes that affect the quality of data in machine learning algorithms. These 10 articles were selected based on our inclusion and exclusion criteria; the inclusion criteria covered any article related to machine learning, data quality attributes, machine learning algorithms or data quality dimensions. In addition, we found an ISO 25012 article that includes a data quality model for web applications. We used this existing data quality model to check whether any of its attributes could be useful in building our data quality model for machine learning. Using all the data quality attributes extracted from our study and the ISO 25012 model, we started looking for interviewees with experience of working with machine learning algorithms. We selected 15 interviewees who showed the most interest in our study and conducted an in-depth interview study. We shared a questionnaire with each interviewee containing questions that helped us understand their background and experience. We recorded all interview data through audio calls, and in the questionnaire we provided each interviewee with a link to a Trello board through which they could share their opinion on the data quality attributes we selected. The Trello board has 3 columns: the list of attributes, important, and not important; the interviewee can select each attribute and drag it to either the important or the not important list. Each attribute on the Trello board includes a definition and a comment box where interviewees can state their reason for placing the attribute in a particular list. After the interviews, we transcribed all the interviewees' calls together with the Trello board results, analysed the data using the conventional content analysis method, and extracted the results. Based on the analysis we ordered the attributes so that the attribute with the greatest number of occurrences across all 15 interviews is at the top of our results and the attributes with the lowest number of occurrences are at the bottom. In this way we created a data quality model based on the results of our interview analysis. As this data quality model is built on the opinions of the interviewees, an experiment on the data quality model is needed to evaluate each data quality attribute. We also included the limitations of our work and the structure of the final data quality model.

7.2 Future work


As this research has the limitations mentioned above, we were unable to interview more experienced data scientists with experience of other machine learning systems not mentioned in our study. In addition, very few articles on data quality attributes for machine learning are available, giving our research a weak base. Building on this research, we expect future work with a stronger background and with more interviewees who have worked with machine learning algorithms not included in our study. The research can be continued with more data quality attributes and more experienced data scientists, giving the study a stronger background than the existing one. It can also be extended with experiments to evaluate all the data quality attributes and confirm that they are suited to this data quality model for machine learning. In the future we also expect to see a fully developed data quality model resulting from more research on machine learning; there is also scope for interviewing data scientists from different regions and cultures, making the study more reliable.

REFERENCES
[1] D. Marr, “Artificial intelligence—A personal view,” Artif. Intell., vol. 9, no. 1, pp. 37–
48, Aug. 1977.
[2] “BONSEYES – The Artificial Intelligence Marketplace.” [Online]. Available:
https://bonseyes.com/. [Accessed: 14-Jun-2019].
[3] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Eds., Machine Learning: An
Artificial Intelligence Approach. Berlin Heidelberg: Springer-Verlag, 1983.
[4] M. Najafabadi, F. Villanustre, T. Khoshgoftaar, N. Seliya, R. Wald, and E.
Muharemagic, “Deep learning applications and challenges in big data analytics,” J. Big
Data, vol. 2, Dec. 2015.
[5] Z. Ghahramani, “Probabilistic machine learning and artificial intelligence,” Nature, vol.
521, no. 7553, pp. 452–459, May 2015.
[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp.
436–444, May 2015.
[7] S. Wagner, Software Product Quality Control. Berlin Heidelberg: Springer-Verlag,
2013.
[8] A. Mellit, S. Kalogirou, L. Hontoria, and S. Shaari, “Artificial intelligence techniques
for sizing photovoltaic systems: A review,” Renew. Sustain. Energy Rev., vol. 13, Feb.
2009.
[9] S. Bhattacharya, B. Czejdo, R. Agrawal, E. Erdemir, and B. Gokaraju, “Open Source
Platforms and Frameworks for Artificial Intelligence and Machine Learning,” 2018, pp.
1–4.
[10] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and
prospects,” Science, vol. 349, no. 6245, pp. 255–260, Jul. 2015.
[11] P. Ongsulee, “Artificial intelligence, machine learning and deep learning,” 2017, pp. 1–
6.
[12] P. Lison, “An Introduction to Machine Learning.”
[13] “A guide to machine learning algorithms and their applications.” [Online]. Available:
https://www.sas.com/en_gb/insights/articles/analytics/machine-learning-
algorithms.html. [Accessed: 14-Jun-2019].
[14] T. O. Ayodele, “Types of Machine Learning Algorithms,” New Adv. Mach. Learn., Feb.
2010.
[15] “A Beginner’s Guide to Neural Networks and Deep Learning,” Skymind. [Online].
Available: http://skymind.ai/wiki/neural-network. [Accessed: 16-Jun-2019].
[16] “Understanding LSTM Networks -- colah’s blog.” [Online]. Available:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/. [Accessed: 16-Jun-2019].
[17] S. Saha, “A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way,”
Towards Data Science, 15-Dec-2018. [Online]. Available:
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-
networks-the-eli5-way-3bd2b1164a53. [Accessed: 16-Jun-2019].
[18] R. Bhatia, “Top 6 Regression Algorithms Used In Analytics & Data Mining,” Analytics
India Magazine, 19-Sep-2017. .
[19] “What is Logistic Regression?,” Statistics Solutions. .
[20] “Support-vector machine,” Wikipedia. 13-Jun-2019.
[21] “Random Decision Forest - an overview | ScienceDirect Topics.” [Online]. Available:
https://www.sciencedirect.com/topics/computer-science/random-decision-forest.
[Accessed: 16-Jun-2019].

[22] D. Zhang, “Applying Machine Learning Algorithms In Software Development,” in
Proceedings of the 2000 Monterey Workshop on Modeling Software System Structures in
a Fastly Moving Scenario, 2000, pp. 275–291.
[23] H. Huang, B. Stvilia, C. Jörgensen, and H. Bass, “Prioritization of Data Quality
Dimensions and Skills Requirements in Genome Annotation Work,” JASIST, vol. 63, pp.
195–207, Jan. 2012.
[24] R. Y. Wang and D. M. Strong, “Beyond Accuracy: What Data Quality Means to Data
Consumers,” J. Manag. Inf. Syst., vol. 12, no. 4, pp. 5–33, 1996.
[25] L. Liu and M. T. Özsu, Eds., Encyclopedia of Database Systems. Springer US, 2009.
[26] “Quality Data Model,” p. 36.
[27] N. Arrais, “Quality control Handbook,” Rev. Adm. Empres., vol. 6, pp. 157–159, Jun.
1966.
[28] R. Y. Wang, V. C. Storey, and C. P. Firth, “A Framework for Analysis of Data Quality
Research,” IEEE Trans Knowl Data Eng, vol. 7, no. 4, pp. 623–640, Aug. 1995.
[29] L. L. Pipino, Y. W. Lee, and R. Y. Wang, “Data Quality Assessment,” Commun ACM,
vol. 45, no. 4, pp. 211–218, Apr. 2002.
[30] D. M. Strong, Y. Lee, and R. Y. Wang, “Data Quality in Context,” Commun. ACM, vol.
40, Aug. 2002.
[31] T. Redman, “The Impact of Poor Data Quality on the Typical Enterprise.,” Commun.
ACM, vol. 41, Feb. 1998.
[32] T. H. Davenport and L. Prusak, Information Ecology: Mastering the Information and
Knowledge Environment, 1st ed. New York, NY, USA: Oxford University Press, Inc.,
1997.
[33] T. C. Redman, Data quality for the information age. Boston: Artech House, 1996.
[34] I. Rafique, P. Lew, M. Q. Abbasi, and Z. Li, “Information Quality Evaluation
Framework: Extending ISO 25012 Data Quality Model,” vol. 6, no. 5, p. 6, 2012.
[35] C. Calero, A. Caro, and M. Piattini, “An Applicable Data Quality Model for Web Portal
Data Consumers,” World Wide Web, vol. 11, no. 4, pp. 465–484, Dec. 2008.
[36] C. Moraga, M. Moraga, C. Calero, and A. Caro, “SQuaRE-Aligned Data Quality Model
for Web Portals,” presented at the Proceedings - International Conference on Quality
Software, 2009, pp. 117–122.
[37] A. Caro, C. Calero, I. Caballero, and M. Piattini, “Defining a Data Quality Model for
Web Portals,” 2006, vol. 4255, pp. 363–374.
[38] A. Koronios, S. Lin, and J. Gao, “A Data Quality Model for Asset Management in
Engineering Organisations.,” presented at the Proceedings of the 2005 International
Conference on Information Quality, ICIQ 2005, 2005.
[39] C. Moraga, M. Á. Moraga, A. Caro, and C. Calero, SPDQM: SQuaRE-Aligned Portal
Data Quality Model. .
[40] M. Hejč and J. Hrebícek, “Primary Environmental Data Quality Model: Proposal of a
Prototype of Model Concept,” 2008.
[41] E. Oviedo, J. N. Mazón, and J. J. Zubcoff, “Towards a data quality model for open data
portals,” in 2013 XXXIX Latin American Computing Conference (CLEI), 2013, pp. 1–8.
[42] A. Caro, C. Calero, and M. Piattini, “A Portal Data Quality Model For Users And
Developers.,” 2007, pp. 462–476.
[43] A. Caro, C. Calero, I. Caballero, and M. Piattini, “Towards a Data Quality Model for
Web Portals - Research in Progress.,” 2006, pp. 325–331.
[44] “ISO - International Organization for Standardization,” ISO. [Online]. Available:
http://www.iso.org/cms/render/live/en/sites/isoorg/home.html. [Accessed: 16-Jun-2019].

[45] G. A. Liebchen, “Data sets and data quality in software engineering,” in in: PROMISE
‘08: Proceedings of the 4th International Workshop on Predictor Models in Software
Engineering, 2008, pp. 39–44.
[46] G. K. Tayi and D. P. Ballou, “Examining Data Quality,” Commun ACM, vol. 41, no. 2,
pp. 54–57, Feb. 1998.
[47] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, “Data Preprocessing for Supervised
Learning,” Int. J. Comput. Sci., vol. 1, pp. 111–117, Jan. 2006.
[48] Jiawei Han and Micheline Kamber, “Data MIning Concepts and Techniques.”
[49] M. Kumar and A. Kalia, “Preprocessing and Symbolic Representation of Stock Data,” in
2012 Second International Conference on Advanced Computing Communication
Technologies, 2012, pp. 83–88.
[50] I. Nurdiani, J. Börstler, and S. Fricker, “The Impacts of Agile and Lean Practices on
Project Constraints: A Tertiary Study,” J. Syst. Softw., vol. 119, Jun. 2016.
[51] “Undertaking a literature review: a step-by-step approach. - PubMed - NCBI.” [Online].
Available: https://www-ncbi-nlm-nih-gov.miman.bib.bth.se/pubmed/18399395.
[Accessed: 23-Jul-2019].
[52] “PhD Research and Dissertation Writing in Statistics - Web & Digital Communications |
Montana State University.” [Online]. Available:
http://www.math.montana.edu/jobo/phdprep/. [Accessed: 15-Jun-2019].
[53] “What do we know about Testing practices in Software Startups?” [Online]. Available:
http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1156328&dswid=6978.
[Accessed: 23-Jul-2019].
[54] A. Nimmakayala and V. S. A. Gudivada, The significance of Software Engineering
Management in Software projects : A study on Project Management success factors, an
ideal Project Manager and the current state of Project Management Education. 2018.
[55] C. Okoli and K. Schabram, “A Guide to Conducting a Systematic Literature Review of
Information Systems Research,” 2010.
[56] K. Petersen, R. Feldt, S. Mujtaba, and M. Mattsson, “Systematic Mapping Studies in
Software Engineering,” p. 10.
[57] S. Jalali and C. Wohlin, “Systematic literature studies: database searches vs. backward
snowballing,” in Proceedings of the ACM-IEEE international symposium on Empirical
software engineering and measurement - ESEM ’12, Lund, Sweden, 2012, p. 29.
[58] J. X. Yu, Keyword search in databases / Jeffrey Xu Yu, Lu Qin, and Lijun Chang.
Breinigsville, Pa.]: Morgan & Claypool, 2010.
[59] D. Tümer, M. A. Shah, and Y. Bitirim, “An Empirical Evaluation on Semantic Search
Performance of Keyword-Based and Semantic Search Engines: Google, Yahoo, Msn and
Hakia,” in 2009 Fourth International Conference on Internet Monitoring and
Protection, 2009, pp. 51–55.
[60] R. Berntsson Svensson, T. Gorschek, B. Regnell, R. Torkar, A. Shahrokni, and R. Feldt,
“Quality Requirements in Industrial Practice – An Extended Interview Study at Eleven
Companies,” Softw. Eng. IEEE Trans. On, vol. 38, pp. 1–1, Nov. 2011.
[61] Z. Milena, G. Dainora, and A. Stancu, “QUALITATIVE RESEARCH METHODS: A
COMPARISON BETWEEN FOCUS-GROUP AND IN-DEPTH INTERVIEW,” Ann.
Fac. Econ., vol. 4, pp. 1279–1283, May 2008.
[62] B. DiCicco‐Bloom and B. F. Crabtree, “The qualitative research interview,” Med. Educ.,
vol. 40, no. 4, pp. 314–321, 2006.
[63] H.-F. Hsieh and S. Shannon, “Three Approaches to Qualitative Content Analysis,” Qual.
Health Res., vol. 15, pp. 1277–88, Dec. 2005.
[64] V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label? improving data quality
and data mining using multiple, noisy labelers,” in Proceeding of the 14th ACM
SIGKDD international conference on Knowledge discovery and data mining - KDD 08,
Las Vegas, Nevada, USA, 2008, p. 614.
[65] F. Moretti et al., “A standardized approach to qualitative content analysis of focus group
discussions from different countries,” Patient Educ. Couns., vol. 82, no. 3, pp. 420–428,
Mar. 2011.
[66] D. Budgen and P. Brereton, “Performing Systematic Literature Reviews in Software
Engineering,” in Proceedings of the 28th International Conference on Software
Engineering, New York, NY, USA, 2006, pp. 1051–1052.
[67] E. Sivogolovko, “The influence of data quality on clustering outcomes,” Front. Artif.
Intell. Appl., vol. 249, pp. 95–105, Jan. 2013.
[68] C. Kiefer, “Assessing the Quality of Unstructured Data: An Initial Overview,” in LWDA,
2016.
[69] E. Sivogolovko and B. Novikov, “Validating cluster structures in data mining tasks,”
2012, pp. 245–250.
[70] R. H. Blake and P. Mangiameli, “The Effects and Interactions of Data Quality and
Problem Complexity on Data Mining,” in ICIQ, 2008.
[71] F. Daniel, P. Kucherbaev, C. Cappiello, B. Benatallah, and M. Allahbakhsh, “Quality
Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques and
Assurance Actions,” ACM Comput. Surv., vol. 51, no. 1, pp. 1–40, Jan. 2018.
[72] O. Alshareet, A. Itradat, I. Doush, and A. Quttoum, “Incorporation of ISO 25010 with
machine learning to develop a novel quality in use prediction system (QiUPS),” Int. J.
Syst. Assur. Eng. Manag., vol. 9, pp. 1–10, Jun. 2017.
[73] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman, “Towards the Science of Security
and Privacy in Machine Learning,” ArXiv161103814 Cs, Nov. 2016.
[74] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep
Reinforcement Learning that Matters,” ArXiv170906560 Cs Stat, Sep. 2017.
[75] Yi-Min Huang and Shu-Xin Du, “Weighted support vector machine for classification
with uneven training class sizes,” in 2005 International Conference on Machine
Learning and Cybernetics, 2005, vol. 7, p. 4365–4369 Vol. 7.
[76] Z. Gong, P. Zhong, and W. Hu, “Diversity in Machine Learning,” IEEE Access, vol. 7,
pp. 64323–64350, 2019.
[77] “MIT Researcher: AI Has a Race Problem, and We Need to Fix It,” Boston Magazine,
23-Feb-2018. .
[78] J. L. Julia Angwin, “Machine Bias,” ProPublica, 23-May-2016. [Online]. Available:
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-
sentencing. [Accessed: 14-Jun-2019].
[79] S. Corbett-Davies, E. Pierson, A. Feller, and S. Goel, “A computer program used for bail
and sentencing decisions was labeled biased against blacks. It’s actually not that clear.,”
Washington Post, 17-Oct-2016.
[80] J. Vaníček, “Software and Data Quality (Jakost software a dat).”
[81] N. Bantilan, “Themis-ml: A Fairness-Aware Machine Learning Interface for End-To-
End Discrimination Discovery and Mitigation,” J. Technol. Hum. Serv., vol. 36, no. 1,
pp. 15–30, Jan. 2018.
[82] Yogita and D. Toshniwal, “A Framework for Outlier Detection in Evolving Data
Streams by Weighting Attributes in Clustering,” Procedia Technol., vol. 6, pp. 214–222,
2012.
[83] H. Veregin, “Data Quality Parameters.”
[84] A. Rago, C. Marcos, and A. Diaz-Pace, “Uncovering quality-attribute concerns in use
case specifications via early aspect mining,” Requir. Eng., vol. 18, Mar. 2011.

[85] P. Mandal, “Data Quality in Statistical Process Control,” Total Qual. Manag. Bus.
Excell., vol. 15, no. 1, pp. 89–103, Jan. 2004.
[86] L. Garcés, A. Ampatzoglou, P. Avgeriou, and E. Y. Nakagawa, “Quality attributes and
quality models for ambient assisted living software systems: A systematic mapping,” Inf.
Softw. Technol., vol. 82, pp. 121–138, Feb. 2017.
[87] JCGM, “Evaluation of measurement data — Guide to the expression of
uncertainty in measurement.”
[88] M. M. Aung and Y. S. Chang, “Traceability in a food supply chain: Safety and quality
perspectives,” Food Control, vol. 39, pp. 172–184, May 2014.
[89] D. A. Baker and J. L. Crompton, “Quality, satisfaction and behavioral intentions,” Ann.
Tour. Res., vol. 27, no. 3, pp. 785–804, Jul. 2000.
[90] J. Bansiya and C. G. Davis, “A hierarchical model for object-oriented design quality
assessment,” IEEE Trans. Softw. Eng., vol. 28, no. 1, pp. 4–17, Jan. 2002.
[91] S. Moussa and M. Touzani, “The perceived credibility of quality labels: a scale
validation with refinement,” Int. J. Consum. Stud., vol. 32, no. 5, pp. 526–533, Sep.
2008.
[92] D. M. Parkin and F. Bray, “Evaluation of data quality in the cancer registry: Principles
and methods Part II. Completeness,” Eur. J. Cancer, vol. 45, no. 5, pp. 756–764, Mar.
2009.
[93] S. C. Guptill and J. L. Morrison, Elements of Spatial Data Quality. Elsevier, 2013.
[94] F. Li, S. Nastic, and S. Dustdar, “Data Quality Observation in Pervasive Environments,”
2012, pp. 602–609.
[95] C. Lynnes, G. Leptoukh, S. Shen, D. Tong, and R. Bagwell, Improving Data Quality
Information for NASA Earth Observation Data. 2012.
[96] J. Maxwell, “Designing a Qualitative Study,” Qual. Res., Jan. 2008.
[97] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén,
Experimentation in Software Engineering. Berlin Heidelberg: Springer-Verlag, 2012.

APPENDIX:

Interview 1: Screenshots of Trello board:

Important attributes and reasons:

S.no Attributes Reasons

1 Accuracy This is probably one of the most important points when working on machine learning, that the data you have is accurate.

2 Completeness This is an important factor as well, as we know ML systems have a hard time dealing with unseen classes.

3 Context Coverage If the end system is not satisfying to the user, then the system would need to be retrained either with additional data or with another target.

4 Freedom from risk An important point that should be regarded when designing a system. Especially if designing a system that directly interacts with humans.

5 Relevancy If we don't have the relevant data, we can not train systems that will satisfy the users.

6 Satisfaction Similar to relevancy, if the output does not meet the user's needs then it is not a good system.

7 Credibility Relates to completeness and accuracy which both are very important points already.

8 Precision Again, this relates to accuracy which is very important.

9 Understandability This is important for the designer of an ML system. If he cannot understand the data then he might make assumptions that are incorrect, thus leading to potential problems in the trained system.

Table 0-1: Interview 1: Important Attributes

Not Important attributes and reason on the whole:

• Consistency
• Effectiveness
• Efficiency
• Interpretability
• Timeliness
• Accessibility
• Availability
• Confidentiality
• Currentness
• Size
• Portability
• Traceability
• Recoverability

Reason:
Most of the items he added to the "Not Selected" attributes are due to the following reasons:

• Problems that might occur through this item are automatically handled by well-trained ML models (e.g. some inconsistency, etc.)
• Points I believe are not super important for ML in general (e.g. the time it requires to acquire the data, the time required to train a model, etc.)

Interview 2 Screenshots of Trello Board:

Important attributes and reasons:

S.no Important Attributes Reasons

1 Accuracy Application on a real-world problem, once deployed

2 Completeness In many applications probably hard to compile "complete" dataset, reflecting each possibility/condition

3 Freedom from Risk (unclear to me, before the interview) Related to completeness

4 Relevancy Related to completeness

5 Satisfaction If the end user is satisfied, the systems fulfil its purpose

6 Precision Difference to accuracy? Given an example of fairness: important in respect to ethics of AI in the long run

7 Understandability I would rather call it "Explainability" -> see research topic of Explainable AI. Important for gaining the trust of the end user

8 Effectiveness Related to completeness; can reduce training time

9 Interpretability Related to understandability/explainability

10 Fairness On the long run of AI and its acceptance in population, is important -> ethics!

Table 0-2: Interview 2: Important attributes

Not Important attributes and reasons:

S.No Not Important Attributes Reason for Not Important

1 Context Coverage See Completeness; seems similar to me

2 Credibility Related to completeness and freedom from risk; also, artificially generated data might train network well

3 Consistency Always the option to transform data into corresponding/required format; or the possibility of "sensor/data fusion" of different input types

4 Accessibility Data is value for the company; expert knowledge and labour is represented by the value of data

5 Availability Similar to accessibility

6 Recoverability Rather a risk than a feature (reverse engineering)

7 Reproducibility See recoverability; for research might be important, to prove objectively the quality of ML-Software; for industries probably less important

8 Size None

Table 0-3: Interview 2: Not Important attributes

Important Attributes Depending on application and reasons:

Our interviewee created a new list, "Depending on the application". The table below contains both the important and the not important reasons depending on the application; the remaining data quality attributes which were not placed under important or not important have been categorized into this table.

S.No Attributes Important depending on the application / Not Important

1 Privacy
Important: For applications like hospital applications, privacy is important because the data is very sensitive.
Not important: For car applications privacy is not an important attribute, because the data in this application is open, like street signs and signals.

2 Efficiency
Important: From a business perspective quite important.
Not important: From a research perspective not so important.

3 Timeliness
Important: Where there is a big change in the real-world representation, for example a mobile application, it is important.
Not important: Where the real world changes slowly, for example the production of cars, it takes a lot of time for the real world to change; in such situations it is not important.

4 Currentness
Important: Similar to timeliness.
Not important: Similar to timeliness.

5 Confidentiality
Important: Depends on the application.
Not important: Depends on the application.

6 Portability
Important: Might be important for other algorithms.
Not important: Not important for the algorithms that I have worked on.

7 Traceability
Important: Yes, since there is privacy, I would like to know who is working on it.
Not important: It might not be important for other applications.

8 Diversity
Important: It is a bit important, as data precision is a bit expensive on experts, and if you are going to use it for this purpose it is good.
Not important: It might not be important for other applications.

Table 0-4: Interview 2: Attributes depending on the application

Interview 3 Screenshots of Trello Board:

S.no Attributes Reasons

1 Accuracy As I work with hands-on industry applications of machine learning, this is of great importance. The hardest part of my job is to go from theoretical examples with, say, MNIST data sets, to real-world data, which are much harder to create ML models from.

2 Relevancy I usually work with companies' own data but trying to complete it with other data sources is very time-consuming. E.g. SMHI's API is very efficient (they only offer GRIB-formatted files).

3 Satisfaction Brings home the bacon :)

4 Effectiveness My customers are satisfied when I perform something good enough in as little time as possible, to reduce the project cost, so this is very important.

5 Credibility Not-Specified

6 Consistency Very time-saving.

7 Accessibility I believe this is very important. Sweden has a principle of public data, but even I, as a data scientist, think it is hard to find, collect and work with.

8 Currentness Current data would be of great importance as a complement to companies' own data.

9 Traceability This increases data trustability.

10 Interpretability Important to understand my result and apply the model in the industry

11 Availability Today, I think it is hard and time-consuming to find data.

Table 0-5: Interview 3: Important Attributes

Not Important attributes and reason:

S.no Attributes Reasons

1 Completeness Completeness is of course appreciated, and vital to some extent.


Nevertheless, it's also part of my job to find creative
estimates/workarounds for incomplete data (using hidden Markov
models or train an ML on complete data rows to complete the rows
lacking info).

2 Precision Not that important in my job. Time efficiency (during development)


and user-friendliness are more important. Today, the companies I've
worked with are very immature in DS/ML so un-precise data (to some
degree) is usually good enough for company improvement.

3 Understandability The model output has to be understandable - the inputs


understandability is of less importance.

4 Fairness "artificial" data fairness reduces the real-world similarity of data. The
idea of ML application is to mimic human actions, based on historical
data. I prefer legal adjustments instead, like GDPR, preventing ML
from taking decisions on its own, when it comes to subjects important
for the individual (for example when applying for a job or a bank
loan).

5 Recoverability I think it is my responsibility as a data scientist to keep copies and


maintain recoverability.

64
6 Timeliness As long as the delay is constant a known, I do not think this is very
important.

7 Confidentiality As long as the data distribution is legal, this does not bother me.

8 Freedom from I believe this is a political and juridical question. As long as the data
Risk distribution is legal, it does not bother me as a data scientist.
Table 0-6: Interview 3: Not Important Attributes
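
To illustrate the "train a model on the complete rows" workaround mentioned in the Completeness row above, a minimal, hypothetical Python sketch is shown below. The column names (temperature, humidity, wind_speed), the toy values, and the choice of a linear model are our own assumptions for illustration; the interviewee did not specify an implementation.

# Minimal sketch: fill gaps in one column by training on the rows where it is present.
# All column names and values below are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "temperature": [12.0, 15.5, 14.2, None, 16.1, None],   # column with missing values
    "humidity":    [0.71, 0.63, 0.66, 0.59, 0.61, 0.73],   # fully observed
    "wind_speed":  [3.1, 4.0, 3.6, 5.2, 4.4, 2.9],         # fully observed
})

features, target = ["humidity", "wind_speed"], "temperature"
complete = df[df[target].notna()]    # rows usable for training
missing = df[df[target].isna()]      # rows whose target value we estimate

model = LinearRegression().fit(complete[features], complete[target])
df.loc[df[target].isna(), target] = model.predict(missing[features])
print(df)                            # the two gaps are now imputed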

Leftover data Quality Attributes and reasons:

The following data quality attributes were not placed in either the important or the not important list; the interviewee's reason for not selecting them is given as a whole.

• Size
• Context coverage
• Reproducibility
• Privacy
• Efficiency
• Portability
• Diversity

Reasons:

• Our interviewee mentioned that these attributes were not selected because they depend on the model, data type, and context, and because he did not come across some of them.

Interview 4 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Availability If the data isn't available, it can't be used.

2 Timeliness Old data is worthless for any type of application or analysis that aims to
have an effect on the world. Old data is just for playing.

3 Currentness Same as the Timeliness

4 Traceability To make it secure, credible, and accountable.

5 Accuracy It's more important to have good enough data quickly than to have perfect
data later. It's up to us to have a verified view of how inaccurate the data
is as it comes in and model the risk that is imposed by this inaccuracy, so
that any unwanted effects of the inaccuracy may be avoided.

6 Completeness Depends on the use case but it's hard to prove that the analysis is fair and
accountable if you can't show that the data covered all relevant strata and
variables.

7 Interpretability For compliance reasons (GDPR for example) and also to not be laughed
at by customers, citizens, or other stakeholders affected by the analysis.

8 Accessibility We can't afford to leave people out of data science because they have
some type of disability.

9 Reproducibility It speeds up innovation and improves existing applications.

10 Confidentiality To enable trust within the organisation.

11 Efficiency The return on investment needs to be determined.

Table 0-7: Interview 4: Important attributes

Not Important attributes and reason on the whole:

• Precision
• Relevancy
• Understandability
• Privacy
• Portability
• Fairness
• Freedom from Risk
• Diversity
• Effectiveness
• Context Coverage
• Size
• Credibility
• Satisfaction
• Consistency
• Recoverability

Reasons:

Most of the items he added to the "Not Selected" attributes are due to the following reasons: -
• Many of these are "nice" but not "necessary", or in some cases, functions to other attributes.
• If the same aspect of the data can be measured through some other variable, possibly called an
"instrumental variable", Relevancy may not be very important.
• If you have accurate (enough) and complete (enough) data and know what you're doing, Fairness
shouldn't become an issue.
• There's always a risk. Have proper systems and processes in place to mitigate it.
• Satisfaction can be achieved on any data, with the right instructions, training, and proof activities.

Interview 5 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Relevancy Interpretability and Understandability feed Relevancy and each other;
they do not substitute for one another but complement each other

2 Context Coverage Not-Specified

3 Interpretability Not-Specified

4 Understandability Not-Specified

5 Efficiency If it's not reproducible, it's meaningless to have the most complex
models

6 Freedom from It's also important to have the freedom to produce better output.
Risk

7 Satisfaction It's really important to convince the shareholders, otherwise, you will
be unemployed :) So it's one of the most important things.

8 Effectiveness Not-Specified

9 Reproducibility Not-Specified

10 Precision Not-Specified

11 Accuracy Not-Specified

12 Currentness Especially in time series, it's really important to have current attributes
so it can have a great output

13 Fairness • Especially it's really important in the deep learning algorithms which
train with images.
• ML algorithms are at their worst when it comes to fairness. They only use the
past data, so just because the previous guys from your race had a bad
score, it shouldn't affect yours.

14 Credibility If you cannot trust the data, how can you be sure about the output of
the algorithm?

15 Confidentiality To be honest, it's not important for me but important for the companies
because of the legal legislation. I support transparency in the
companies.

16 Size It's important depending on the type of the model, but generally having
less data causes worse outputs unless it creates unbalance in the data.

17 Recoverability To have a more efficient ML pipeline, recoverability is really
important; understanding the past outputs makes it easier to
fine-tune the model.
Table 0-8: Interview 5: Important attributes

Not Important attributes and reason:

S.no Attributes Reasons

1 Portability Generally, algorithms are designed based on the special needs of the
company or the problem. So, if you take a model to another company, it will
probably fail, so it's not that important

2 Completeness Sparse data may work perfectly sometimes, so it's not that important.

3 Consistency Data can be transformed into the same format as long as it contains the same
data; however, DWH teams should standardize data for use in the algorithms.

4 Timeliness It's really similar to Currentness?

5 Accessibility Not-Specified

6 Privacy This attribute is the same as Credibility.

7 Diversity We can use model-based attributes, so it's not that important.

8 Traceability As soon as I have the data, I don't care that much about the source of the data :)

9 Availability Not-Specified
Table 0-9: Interview 5: Not Important attributes

Interview 6 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Efficiency While it is very important, it is quite hard to measure, especially when
the pipeline is complex and/or a human is involved in the process.

2 Reproducibility Not-Specified

3 Accuracy The bulk of the work in machine learning is verifying and cleaning
inaccurate data that comes from external sources, i.e., IT infrastructure,
app events, etc.

4 Consistency Oftentimes models are evaluated against earlier models to get a sense of
performance increases or drops. It would be difficult to do that without
a consistent data schema.

5 Availability Not-Specified

6 Fairness It is an active discussion area that stems from sustainability and fairness
practices in the business domain.

7 Timeliness What timeliness means varies a lot. In settings where interaction
with users is on a faster timescale, e.g., e-commerce and streaming services,
it is heavily important.

8 Credibility Not-Specified

9 Relevancy It is a constant exploration for data scientists to figure out what is
relevant or not. Not all features contribute to model generation equally;
some influence it more than others.

10 Diversity May not be possible due to the lack of data coverage

11 Size The more data, the better. However, it negatively affects the
performance, and hence the time to react, as well as adding noise from the
model not being robust enough.

12 Recoverability Not-Specified

13 Satisfaction Having low confidence in the end system can influence the user to
disregard the output generated by the system. This beats the whole
purpose of designing the system.

14 Context Coverage Not-Specified

15 Portability Like most IT solutions, it is hard to achieve.

16 Traceability Not-Specified

17 Currentness Not-Specified

18 Effectiveness Not-Specified

19 Precision Not-Specified

20 Privacy Not-Specified

21 Confidentiality Not-Specified

22 Understandability Not-Specified

23 Freedom from Risk Not-Specified
Table 0-10: Interview 6: Important attributes

Not Important attributes and reason:

S.no Attributes Reasons

1 Completeness It is difficult to get data that can meet future business information demand
since it is a moving target in a way.

2 Interpretability It is not clear cut in many cases, especially when the model generation
process involves complex processes, e.g., neural network, optimization,
etc.

3 Accessibility Not-Specified
Table 0-11: Interview 6: Not Important attributes

Interview 7 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Accessibility Taken for granted, without access to data we cannot train the system.

2 Accuracy Inaccurate inputs lead to inaccurate outputs, no matter what
model you use

3 Relevancy Do not feed the model with irrelevant variables as it just creates noise in
the results

4 Completeness Not-Specified

5 Interpretability This attribute helps in turn to interpret the black box model

6 Size This is important but not the most important. If you have a large amount of data
that you can't explain, then there is no point in maintaining such
huge data

7 Timeliness If the context of analysis requires it (e.g. fraud detection), timeliness
becomes crucial

8 Diversity Not-Specified

9 Privacy GDPR compliant data


Table 0-12: Interview 7: Important attributes

Not Important attributes and reason:

S.no Attributes Reasons

1 Fairness Not-Specified
Table 0-13: Interview 7: Not Important attributes

Leftover data Quality Attributes and reason on the whole:

The following data quality attributes were not placed in either the important or the not important list; the interviewee's reason for not selecting them is given as a whole.

• Consistency
• Efficiency
• Satisfaction
• Effectiveness
• Freedom from Risk
• Credibility
• Currentness
• Confidentiality
• Precision
• Traceability
• Understandability
• Availability
• Portability
• Recoverability
• Context Coverage
• Reproducibility

Reasons: -

Our interviewee has mentioned the following reasons for not selecting the above attributes into either list:
• Most of them are redundant; for example, Effectiveness is a combination of Relevancy and
Completeness, and Credibility is a combination of Accuracy and Relevancy.
• Regarding attributes like Portability: to me, all data is portable; only if you don't have access is it not
portable.
• He didn't come across some of the attributes, and some of them he finds a bit similar to each other.

Interview 8 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Accuracy It's better to have less data than made-up data that doesn't represent the
sample

2 Consistency Standardization is a necessary step in data preparation

3 Completeness It's possible to impute missing values, but there is a minimum amount of data
that should be filled in for every record

4 Interpretability In the end you have to report answers to some business stakeholder, who
has to understand the problem and the outcome; otherwise, he won't use
any model

5 Relevancy It's important to gather data that are related to the problem to be solved

6 Context That mentions so many factors that it's not easy to explain, but it's very
Coverage comprehensive.

7 Precision That makes me think of reconciliation of character strings written with a
misspelling or other types, while they are actually the same (see the
sketch after this table)

8 Traceability Especially if the model is part of a scientific trial or research

9 Diversity You need your sample to vary across categories and features, to
make sure that every combination is well represented

10 Effectiveness My view is that if you don't obtain accurate results, probably there is
something missing in the data

11 Privacy Essential in an enterprise context, since there could be big fines

12 Size Again, for modelling, you need to make sure that you have enough data

13 Portability that depends on the type of data: doesn't apply with tables, but does for
example with images
Table 0-14: Interview 8: Important attributes
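
As an illustration of the string-reconciliation idea in the Precision row above, a minimal Python sketch using the standard-library difflib module is shown below. The reference list, the observed values, and the similarity cutoff of 0.8 are hypothetical choices of ours; the interviewee did not prescribe a particular method.

# Minimal sketch: map misspelled strings onto their canonical spellings.
from difflib import get_close_matches

reference = ["Stockholm", "Gothenburg", "Malmo", "Karlskrona"]            # assumed canonical values
observed = ["Stokholm", "Gothenberg", "Malmo", "Karlscrona", "Uppsala"]   # assumed raw values

def reconcile(value, candidates, cutoff=0.8):
    # Return the closest canonical spelling, or the value itself if nothing clears the cutoff.
    match = get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return match[0] if match else value

print([reconcile(v, reference) for v in observed])
# Misspellings are mapped to the reference spellings; "Uppsala" has no close match and is left unchanged.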

Not Important attributes and reason on the whole:

• Timeliness
• Efficiency
• Satisfaction
• Freedom from Risk
• Credibility
• Confidentiality
• Accessibility

• Understandability
• Availability
• Fairness
• Recoverability
• Reproducibility

Reasons: -

Most of the items were added to the "Not Selected" list due to the following reasons: -
• Some of the attributes are too subjective and difficult to assess
• Some of the attributes are not an issue for data quality, but for modelling

Interview 9 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Accuracy Very important, if accuracy is poor then results can't be trusted

2 Relevancy If you don't have adequate data, then no model will solve the problem

3 Reproducibility Important for the reliability of results

4 Satisfaction Generally important if the model is to be sold

5 Privacy Important that e.g. GDPR is not violated

6 Interpretability If data can't be understood, then modelling is done in the dark.

7 Freedom from Risk Important for high-risk applications

8 Credibility Important, if data can't be trusted then the outputs can't be either

9 Traceability Important for the trustworthiness of the data

10 Size Important that we don't have too little data

11 Understandability Same as interpretability

12 Availability If data can't be accessed, then it can't be used


Table 0-15: Interview 9: Important attributes

Not Important attributes and reason on the whole:

• Completeness
• Consistency
• Diversity
• Timeliness
• Fairness
• Efficiency
• Effectiveness
• Context Coverage
• Currentness
• Accessibility
• Recoverability
• Precision
• Portability
• Confidentiality

Reason:

Most of the items he added to the "Not Selected" attributes are due to the following reasons:

• Most of these are "nice-to-haves" which will improve the quality of life of the person working with
the data. However, they are not crucial, except in rare edge cases.

Interview 10 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Accuracy No model is better than the data it is built on. If the data does not
accurately describe the truth, the model will not be able to learn the
true distribution.

2 Reproducibility Important not only for others but also for own sanity. You should
always be able to reproduce your own results as exactly as possible so
that you can verify that every step of the data pipeline does what it is
supposed to do.

3 Completeness Might depend a little, depending on the use case, the data and the
model, but on the whole, extrapolating outside the realm of the training
data is never wise.

4 Understandability Very important. See for example the case when NASA lost an orbiter
because of different teams using American vs European units.

5 Interpretability This falls in the category of domain knowledge for me. It is possible to
do a lot of things without knowing very much at all about the data, and
often you can reach very good results anyway. But having domain
knowledge is often the key that helps you do clever feature engineering
and extract new information from the data that is perhaps not visible at
first.

6 Efficiency If the data is too expensive to get, we will not get it. Assessing the
feasibility of getting the data we want is therefore important.

7 Fairness This goes for all kinds of class imbalances, not just related to people.
From a data science perspective, we want our models to give as precise
outcomes as possible, so we need to always take care to give the model
an accurate representation of the distribution of the true data and
validate the performance on all classes before we decide which trade-offs we are happy to make (see the sketch after this table).

8 Relevancy If the data does not contain the information necessary to answer the
question at hand, you will not get an answer, no matter how good an
algorithm you choose.

9 Portability If moving the data risks corrupting it, we can never be sure of the
integrity of the information in the data.

10 Confidentiality Depends of course on the nature of the data. Impossible to give a
general answer. If confidentiality is needed, of course, it is important. If
not, it is not important. Most datasets fall in the latter category, but
being a consultant working with other companies' data means that we
have strict rules on how to store and access it most of the time.

11 Currentness Depends on the data and the task at hand. Do we suspect the
distribution of the data out in the real world has changed over time or
not? If yes, having current data is very important.

12 Credibility The trust in the data is irrelevant. What matters is if the data is correct
or not. How we decide that is, however, a more intricate question, and
ultimately is dependent on trust in some measuring equipment or other
people.
Table 0-16: Interview 10: Important attributes
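
To illustrate the per-class validation the interviewee recommends in the Fairness row above, a minimal Python sketch using scikit-learn is shown below. The synthetic, deliberately imbalanced data set and the logistic-regression classifier are placeholders we chose for illustration only.

# Minimal sketch: check performance on every class, not just overall accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy data where roughly 90% of the samples belong to class 0.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Per-class precision and recall expose weaknesses that overall accuracy can hide.
print(classification_report(y_test, model.predict(X_test)))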

Not Important attributes and reason on the whole:

• Consistency
• Timeliness
• Satisfaction
• Effectiveness
• Size
• Diversity
• Availability
• Privacy
• Recoverability
• Freedom from Risk
• Precision
• Accessibility
• Context Coverage
• Traceability

Reason:

Most of the items he added to the "Not Selected" attributes are due to the following reasons:

• Many of these attributes could be good to have or make life as a data scientist easier, but they are not
essential to make a good and reproducible analysis. That has been my criterion for sorting. Some
of them also affect the quality of the outcome (size, etc.), but those are problems that any good
data scientist has to be able to handle and take into account in some way when producing results.

Interview 11 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Credibility Without data being trustable, data scientists and other stakeholders will
not believe in the model output

2 Currentness If you're modelling who will buy something next week, data needs to
be up to date; otherwise, you might try to sell a product that's already
been bought, for example

3 Traceability Important to be able to keep track of, e.g., if an attribute suddenly changes
its meaning. If it does and you don't know, the model might have severe
defects

4 Interpretability Interpretability is becoming very important within data science, not least
because GDPR demands a business entity to explain why someone was
contacted/declined etc

5 Recoverability The model won't live long without being able to be recovered

6 Effectiveness Not-Specified

7 Relevancy More important than having many attributes.

8 Efficiency This is very important

9 Context Not-Specified
Coverage

10 Completeness Not-Specified

11 Satisfaction Not-Specified

12 Confidentiality In regard to GDPR, very important

13 Accuracy If you have bad data you will have a bad model

14 Consistency The longer your history, the better the model. If data is not consistent, you
will not have a long history to work with.

15 Availability Often in business cases, you're lacking the data you're looking for. It can
be very time consuming to retrieve this.

16 Privacy The GDPR-regulations are very strict, you MUST have a privacy policy
in order to not risk being fined heavily

17 Fairness As a society, it is very important to consider fairness, in order to cope
with human bias

18 Reproducibility It's really important to keep track of the data you've used. If something goes wrong,
you are able to figure out what went wrong.
Table 0-17: Interview 11: Important attributes

Not Important attributes and reason:

S.No Not Important Attributes Reason for Not Important

1 Diversity Not-Specified

2 Timeliness Heavily dependent on what model you're building. Real-time is
preferred but, in many businesses, not achieved

3 Precision Accuracy is often more important than precision. Precision is, of
course, valuable, but less so than other attributes

4 Understandability Depends on the model, you don't need to understand attributes from
pictures as long as the machine finds relevant correlations

5 Size As long as you have an environment fast enough, size is not an
important factor

6 Portability This is of less importance

7 Accessibility This hasn't come across as an issue for me as a data scientist

8 Freedom From Risk This is very contextual; it depends on what models are produced. If
you're predicting some kind of consumer behaviour, this is not a key
issue
Table 0-18: Interview 11: Not Important attributes

Interview 12 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Accuracy Not-Specified

2 Diversity It's a must in order to have high quality in all areas

3 Recoverability Very important. A lost database can be a huge impact

4 Availability Generally important if the model is to be sold

5 Fairness Not a problem in general but can be very critical in a few use cases

6 Accessibility Depending on the size of the organization. Important in a big
organization to spread the insights

7 Size Depending on the quality of the data. With high quality, it might be
worth processing huge amounts of data

8 Effectiveness Speeding up the process is important. Anything that can do that is of great
value

9 Context Coverage The platform should support Python (and R) in order to be seen as an
effective data science platform

10 Consistency Most machine learning systems need to have the same input format, so
without this it will be difficult to do anything (see the sketch after this table)

11 Relevancy Very important. This makes our insights more valuable

12 Completeness Depends on whether you have control over the completeness or not, and
whether the incompleteness is random or not
Table 0-19: Interview 12: Important attributes
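
To illustrate the "same input format" point in the Consistency row above, a minimal Python sketch of a schema check with pandas is shown below. The expected columns and dtypes are hypothetical; the interviewee did not describe a concrete mechanism.

# Minimal sketch: verify that an incoming batch matches the expected input schema.
import pandas as pd

EXPECTED_SCHEMA = {"customer_id": "int64", "signup_date": "datetime64[ns]", "spend": "float64"}

def check_schema(df):
    # Return a list of violations; an empty list means the batch is usable as-is.
    problems = []
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            problems.append("missing column: " + column)
        elif str(df[column].dtype) != dtype:
            problems.append(column + ": expected " + dtype + ", got " + str(df[column].dtype))
    return problems

batch = pd.DataFrame({"customer_id": [1, 2], "signup_date": ["2019-01-03", "2019-02-11"], "spend": [120.5, 88.0]})
batch["signup_date"] = pd.to_datetime(batch["signup_date"])
print(check_schema(batch))   # prints [] when the batch conforms to the schema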

Not Important attributes and reason:

S.No Not Important Attributes Reason for Not Important

1 Reproducibility Not-Specified

2 Portability Not-Specified

3 Privacy From a data science perspective not important at all. From a legal
point of view more important

4 Understandability Good to have but far from a game changer

5 Traceability Not-Specified

6 Confidentiality Depending on what the law says.

7 Precision It all depends on how much that will affect the value loss of your
output

8 Currentness Can be a cost-saving procedure; accepting lower quality here can
actually save money

9 Credibility As long as you get the valuable thing out of the system, it is OK

10 Efficiency This is more like an investment by the company. More resources
available give higher quality, but the insights generated might not
cover the costs

11 Freedom from Risk Not-Specified

12 Timeliness You only lose the time it takes to deploy, so I don't see this as a big
problem

13 Interpretability This is not important at all. From my experience, it's valuable to
have a data scientist understanding the raw data and being part of
the process of building the structured data mart

14 Satisfaction Not-Specified
Table 0-20: Interview 12: Not Important attributes

Suggested data quality attributes for machine learning by our interviewee:

Based on our interviewee's experience, he shared with us the two data quality attributes below, which he thinks will be a great addition to our work.

S.No Suggested Data Quality Attributes Definition

1 Proper test methods This is from a person with real-life experiences. Having a test system
and proper deployment routines is very important in order to reach
high quality.

2 One single truth Having one single definition of KPI's in an organization is very
important.
Table 0-21: Suggested data quality attributes by the interviewee

Interview 13 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Relevancy Very important. Often a lot of time is lost when working from wrong
assumptions

2 Context In general, very important to have the appropriate data and the right
Coverage amount of data

3 Accuracy Hard to model data with noisy features or labels

4 Completeness Incomplete data need extra care in pre-processing (and production) so it's
an extra cognitive load

5 Consistency Saves time when updating models

6 Efficiency See accuracy and completeness cards

7 Satisfaction See relevancy

8 Effectiveness If the data is effective, we can manage with a smaller amount of data

9 Reproducibility Important for debugging, reporting, trust

10 Diversity Important, but can be difficult to do data fusion

11 Portability Can be important for e.g. medical records

12 Size Important for deciding on the model and validation method

13 Privacy See the other card about confidentiality

14 Confidentiality It can be important to have anonymized data, for example in medical and
financial applications
15 Traceability Important for "debugging" models / data pipelines

16 Interpretability It's important that the end users are able to understand the features that
the model is basing its predictions on to be able to improve the models.
Then we can e.g. remove redundant features or construct new ones

17 Credibility Similar to accuracy


Table 0-22: Interview 13: Important attributes

Not Important attributes and reason on the whole:

• Timeliness
• Freedom from Risk
• Currentness
• Accessibility
• Understandability
• Precision
• Recoverability
• Fairness
• Availability

Reason:

Most of the items he added to the "Not Selected" attributes are due to the following reasons:

• Most of these are rated as "not important" because they have not been a factor in projects I have
done, although I can see they might be

Interview 14 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Accuracy If you put bad things in, you get bad things out

2 Size Simplifies a lot

3 Precision Allows for easier training of data

4 Timeliness Results from a model get old

5 Currentness Same as timeliness

6 Credibility Credibility allows people to act on your machine model and your
results

7 Satisfaction This is the true measure of feedback

8 Traceability This goes back to credibility

9 Reproducibility Allows other people to trust results and for you to verify your own

10 Understandability Goes back to satisfaction and to interacting as a business unit

11 Interpretability Same as above

12 Effectiveness This is what you are improving

13 Efficiency Same as above

14 Fairness Taking the bias out of the machine

15 Diversity Taking the bias out of the models

16 Availability Business requirement and due to law


Table 0-23: Interview 14: Important attributes

Not Important attributes and reason on the whole:

• Accessibility
• Confidentiality
• Context Coverage
• Privacy
• Freedom from Risk
• Completeness
• Portability
• Recoverability
• Consistency
• Relevancy

Reason:

Most of the items he added to the "Not Selected" attributes are due to the following reasons:

• Machine learning is able to filter out some of the problems, or better project management can be used to
avoid the issues.

Interview 15 Screenshots of Trello Board:

Important attributes and reasons:

S.no Attributes Reasons

1 Relevancy Not-Specified

2 Fairness I would like to say unbiased and include much more than race

3 Accuracy Correct data is key to correct results

4 Credibility Always important. Garbage in garbage out

5 Diversity Not-Specified

6 Currentness Depends on what you are modelling. If it is e.g., churn or credit risk
this is highly important

7 Availability To be able to access the data is key to be able to build good models.
This is often a problem

8 Efficiency This can be very important depending on what you mean and what
you are modelling. But I believe the team of developers/people
working with the data should represent diversity in order to capture
different aspects of values and so on... If not, the risk is high that the
data suffer from human bias

9 Traceability Important, but automatically available if you have a solid platform,
which is very important

10 Effectiveness Whether we can produce the determined results does, of course, depend
on the data but also on the model itself

11 Privacy Important when working with sensitive data

12 Completeness This is, of course, important, but depending on what you mean you
can always use advanced imputation methods

13 Size A lot of data is often required to train complex algorithms, but it is
relative I guess

14 Accessibility Not-Specified

15 Understandability The interpretability should, of course, come from the modelling
itself but garbage in garbage out - you need to know what it is you
are using to solve the issue

16 Context Coverage Depends on the problem


Table 0-24: Interview 15: Important attributes

Not Important attributes and reason:

S.No Not Important Attributes Reason for Not Important

1 Timeliness This is more a platform/data collection issue than a data
quality issue

2 Freedom from Don't quite understand. For me, it is the outcome that could be
Risk risky

3 Confidentiality Depends on what the exercise is

4 Interpretability I believe the modelling itself makes the data understandable. That is
our goal

5 Reproducibility The flow/development of the model needs to be reproducible, but
the data does not. It could be data from a certain event that does not
happen any more

6 Consistency Can be taken care of in data prep

7 Recoverability This has more to do with the platform than the data itself?

8 Precision Is taken care of during modelling

9 Portability Many algorithms can be executed on the edge or in the database,
where the data is

10 Satisfaction This is not a data quality attribute


Table 0-25: Interview 15: Not Important attributes

