An Analysis of Data Quality Requirements for Machine Learning
Original Article
Received: 10 June 2023 Revised: 21 July 2023 Accepted: 06 August 2023 Published: 26 August 2023
Abstract - This study explores the importance of meeting data quality standards in the context of Machine Learning (ML) development pipelines, delving into why good data is crucial to confidently deploying ML models. The primary goal of this research is to isolate and examine the most important aspects of data quality inside ML pipelines and how they affect model performance and generalizability. Through an in-depth analysis of multiple phases within the ML pipeline, encompassing data collection, preprocessing, model training, and validation, the study highlights the complex connection between data quality and ML model performance. It underscores the importance of data quality in reducing bias, improving predictive accuracy, and making ML models more robust to outside influences, and it elaborates on the possible consequences of ignoring data quality issues by highlighting the difficulties posed by data noise, incompleteness, and bias. The data quality criteria set forth encompass accuracy, consistency, completeness, relevance, and ethical considerations. The study's relevance rests on providing a holistic perspective on the crucial importance of data quality within the landscape of ML development. The survey results give ML professionals and businesses a better appreciation of the role of high-quality data in building trustworthy ML models. Recognizing and meeting the corresponding data quality needs facilitates trust in ML model outputs, adoption of ethical data practices, and effective dissemination of ML tools.
Keywords - Data innovation, Data ecosystems, Machine learning, Data quality, Data management.
Data management in academia is often delegated to individuals or small groups working on a specific project, who have complete autonomy over creating and maintaining their own data gathering, storage, and sharing infrastructures. To maintain consistency and collaboration among teams, however, researchers in industry often use various independent platforms for data gathering, processing, and storage. The standard ISO/IEC 25024 addresses the latter issue by guiding organizations as they define data quality assurance standards and methods for monitoring them quantitatively.
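To make "monitoring them quantitatively" concrete, the sketch below computes three common data quality measurements (completeness, uniqueness, and range conformance) over a small table. This is a minimal illustration in Python using pandas; the metric definitions are common simplifications of the kinds of measures ISO/IEC 25024 catalogues, not the standard's normative formulas, and the example data are invented.

import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    # Fraction of non-null cells across the whole table.
    return 1.0 - df.isna().sum().sum() / df.size

def uniqueness(df: pd.DataFrame, key_columns) -> float:
    # Fraction of rows whose key columns identify a unique record.
    return len(df.drop_duplicates(subset=key_columns)) / len(df)

def range_conformance(values: pd.Series, low, high) -> float:
    # Fraction of values inside the agreed-upon domain (nulls count as violations).
    return values.between(low, high).mean()

df = pd.DataFrame({"id": [1, 2, 2, 4], "age": [34, None, 130, 28]})
print(f"completeness:    {completeness(df):.2f}")                      # 0.88 (one missing cell)
print(f"uniqueness(id):  {uniqueness(df, ['id']):.2f}")                # 0.75 (one duplicate id)
print(f"age in [0,120]:  {range_conformance(df['age'], 0, 120):.2f}")  # 0.50

Thresholds on such measurements (for example, flagging a table whose completeness drops below an agreed level) are one simple way an organization could operationalize the quantitative monitoring the standard calls for.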
2.1. Implementation Follows Careful Planning for Data Quality
Our research was conducted with the intention of aiding ML professionals and data managers in the early stages of their quest to improve data quality. Practitioners may make more informed decisions about what measures to take for data quality control, assurance, and improvement if they have a firm grasp of the relevant standards. While the review's focus is not on the work involved in establishing particular data quality standards, assessment criteria, or methods for assessing data quality, we provide examples when applicable. This piece has two aims. The first is to educate professionals on the relevant standards and best practices for data quality in the Machine Learning (ML) community. The second involves compiling research from the last several years and classifying its advice according to well-established data quality metrics in the discipline of data management. Our goal is to streamline the process by which businesses and individuals can get their data management systems ready for machine learning and plan for potential problems that may crop up throughout various phases of ML development.

3. Research and Methodology
Articles for this review were chosen based on the research objectives, inclusion criteria, and search method detailed below. Through thematic coding, we were able to expand our understanding of the growth of ML and the significance of data quality management within the field as a whole by analyzing the selected publications.

Articles chosen ahead of time: Based on our experience working with ML models, we compiled a list of six papers [3, 23, 32, 34, 35, 58] on data quality planning and, more specifically, documentation.

Automatic search: To find relevant publications, we used Google Scholar to search for titles containing our study topics' keywords; searching article titles alone helps weed out irrelevant items. The results were then narrowed down by reading the papers' titles and abstracts, and only those deemed relevant were retained. In the first step, we used the query "allintitle: ''data quality'' (''machine learning'' OR ''AI'')" to search the whole of Google Scholar, which returned 185 results. We stopped after reviewing the first 30 results since so few fulfilled our inclusion criteria. Seven papers [12, 19, 21, 25, 27, 28, 63] were kept after abstract review.

Snowballing: Reading and assessing the articles chosen using the aforementioned methods led us to other publications that addressed our study concerns. These papers were selected based on the descriptions supplied by the citing authors and then evaluated against our inclusion criteria; this method yielded eight articles [5, 9, 11, 36, 48, 53, 55, 57]. Because we were also interested in the research of authors who cited [64], we performed a forward search of papers citing that work, which led us to one further item [53].
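The title-only logic behind the "allintitle" query can be expressed as a short predicate, shown below as a hedged sketch. The candidate titles are stand-ins for exported search results, and the word-boundary check on "AI" is our own precaution against substring matches; neither reproduces Google Scholar's internal behavior exactly.

import re

def matches_query(title: str) -> bool:
    # Mirror of the search logic: the title must mention "data quality"
    # and either "machine learning" or the standalone word "AI".
    t = title.lower()
    return "data quality" in t and (
        "machine learning" in t or re.search(r"\bai\b", t) is not None
    )

# Stand-in for a batch of exported search results.
candidates = [
    "Data Quality for Machine Learning Tasks",
    "Towards Automated Data Quality Management for Machine Learning",
    "A Survey of Deep Learning Applications to Autonomous Vehicle Control",
]

print([t for t in candidates if matches_query(t)])  # keeps the first two titles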
Thematic Coding
Our coding scheme included a separate column to call out any peculiar data quality concerns or needs that ML may impose.

We would like to delimit the scope of our findings before sharing them. Our primary focus was on theoretical frameworks that may be used to specify and design data quality requirements in ML; however, we made sure to take note of any applicable methods described in the literature. When it comes to preparing datasets for ML, several of the publications we examined went beyond merely "planning" data quality to provide guidance to data practitioners and managers.

Because separate communities have traditionally tackled these areas, the connections between them are murky at best. Nonetheless, we make an effort to demonstrate their substantial overlap in Figure 2. According to Rising's [59] conception, justice concerns circumstances and outcomes, whereas ethics focuses on the choices that produce those outcomes. In this light, data ethics concerns how professionals use data to safeguard individuals' rights to secrecy and transparency and the safety and well-being of themselves and the environment [6]. Data justice, by contrast, tackles disparities in how individuals are portrayed and dealt with based on the data they provide [69]. Data feminism identifies the power relations in society as the root cause of these inequalities and advocates for actions that address them [20].

Figure 2 shows how these works draw attention to the ways in which data-centric technology may either exacerbate or alleviate systemic problems in people's everyday lives. Interested readers are encouraged to pursue these issues independently, since we did not actively seek out these perspectives and because space and time restrictions prevented us from discussing them in the depth they merit.

Our research also uncovered a second scoping difficulty associated with the nature of the data itself. For instance, we discovered that software tools (for data management or for validating input or output data) may moderate training data quality.

Fig. 3 The data processing structure in ISO/IEC 5259 (upper part) mapped to our data quality pipeline (bottom part); diagram adapted from Chang [18]
[Fig. 2 residue; recoverable labels: data feminism, power relations, advocacy, rights, opportunities and intersectionality]
[Pipeline figure residue; recoverable labels: ML building, ML verification & testing, ML deployment, ML monitoring, with iteration between stages]
The data collection and labeling for the planned project [36] would need supervision, topic experience, and specialization.

The goal is to inspire requesters to establish standards of fair and courteous treatment of data workers in the workplace.
Fig. 5 A sample pipeline for a situation with multiple models and datasets
3.2.1. Data Collection Standards
Data heterogeneity may take the form of differing methods of information input and output. The continual data flow from sensors and online applications makes automated data collection a key feature of production ML. Software developers bear some of the burden for guaranteeing high data quality in situations like these, because they may create systems that send actionable warnings to users when problems are detected (such as when a feature is absent or has an unexpected value) [57].

3.2.2. Verifying and Updating Existing Data
For data to be useful in an ML system, they must first be checked and cleaned once they have been acquired; data quality assurance tasks are heavily weighted at this point in the machine learning development process. Bertossi and Geerts [12] provide an example of how XAI approaches might be used to identify the causes of data inconsistencies and then recommend the most effective corrective measures.
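A minimal sketch of the alert-on-collection and check-after-acquisition pattern described in the two subsections above: each incoming record is tested against an expected schema, and an actionable warning is logged when a feature is absent or has an unexpected value. The schema, feature names, and thresholds here are illustrative assumptions of ours, not part of any cited system.

import logging

# Illustrative expected schema: feature name -> validator returning
# True when the observed value is acceptable.
EXPECTED_SCHEMA = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "country": lambda v: isinstance(v, str) and len(v) == 2,
}

def validate_record(record: dict) -> list:
    """Return human-readable problems found in one incoming record."""
    problems = []
    for feature, is_valid in EXPECTED_SCHEMA.items():
        if feature not in record:
            problems.append(f"feature '{feature}' is absent")
        elif not is_valid(record[feature]):
            problems.append(f"feature '{feature}' has unexpected value {record[feature]!r}")
    return problems

logging.basicConfig(level=logging.WARNING)
for rec in [{"age": 34, "country": "US"}, {"age": 180}]:
    for msg in validate_record(rec):
        logging.warning("data collection alert: %s", msg)

In a production setting, such warnings would feed an alerting channel rather than a local log, but the structure (declared expectations, per-record checks, actionable messages) is the same.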
However, data practitioners should still be mindful of recording their activities whenever feasible, even if formal data cleaning methods have not been utilized (for example, by following pre-defined procedures or publishing in advance the replicable code used to prepare the data).
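One lightweight way to follow this advice is to make every preparation step a named, logged function, so that the exact sequence applied to a dataset becomes a publishable record in itself. The sketch below is our own illustration (the step names and log format are assumptions), not a method prescribed by the reviewed literature.

import functools
import json
import pandas as pd

PREPARATION_LOG = []

def logged_step(fn):
    """Record each preparation step so the pipeline can be replayed and audited."""
    @functools.wraps(fn)
    def wrapper(df, *args, **kwargs):
        entry = {"step": fn.__name__, "args": repr(args), "rows_in": len(df)}
        out = fn(df, *args, **kwargs)
        entry["rows_out"] = len(out)
        PREPARATION_LOG.append(entry)
        return out
    return wrapper

@logged_step
def drop_incomplete_rows(df):
    return df.dropna()

@logged_step
def clip_to_valid_range(df, column, low, high):
    return df[df[column].between(low, high)]

df = pd.DataFrame({"age": [34, None, 130, 28]})
df = drop_incomplete_rows(df)
df = clip_to_valid_range(df, "age", 0, 120)
print(json.dumps(PREPARATION_LOG, indent=2))  # the provenance record to publish alongside the data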
References
[1] Amina Adadi, and Mohammed Berrada, “Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI),” IEEE
Access, vol. 6, pp. 52138–52160, 2018. [CrossRef] [Google Scholar] [Publisher Link]
[2] Ariful Islam Anik, and Andrea Bunt, “Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote
Transparency,” Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–13, 2021. [CrossRef] [Google
Scholar] [Publisher Link]
[3] Lora Aroyo et al., “Data Excellence for AI: Why Should You Care?,” Interactions, vol. 29, no. 2, pp. 66–69, 2022. [CrossRef] [Google
Scholar] [Publisher Link]
[4] Alejandro Barredo Arrieta et al., “Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward
Responsible AI,” Information Fusion, vol. 58, pp. 82-115, 2020. [CrossRef] [Google Scholar] [Publisher Link]
[5] Rob Ashmore, Radu Calinescu, and Colin Paterson, “Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges,”
ACM Computing Surveys, vol. 54, no. 5, pp. 1–39, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[6] Jacqui Ayling, and Adriane Chapman, “Putting AI Ethics to Work: Are the Tools Fit for Purpose?,” AI and Ethics, vol. 2, pp. 405–429,
2022. [CrossRef] [Google Scholar] [Publisher Link]
[7] Yang Baolong, Wu Hong, and Zhang Haodong, “Research and Application of Data Management Based on Data Management Maturity
Model (DMM),” Proceedings of the 2018 10th International Conference on Machine Learning and Computing, pp. 157–160, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[8] Rachel K. E. Bellamy et al., “AI Fairness 360: An Extensible Toolkit for Detecting and Mitigating Algorithmic Bias,” IBM Journal of
Research and Development, vol. 63, no. 4-5, pp. 4:1 - 4:15, 2019. [CrossRef] [Google Scholar] [Publisher Link]
[9] Emily M. Bender, and Batya Friedman, “Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling
Better Science,” Transactions of the Association for Computational Linguistics, vol. 6, pp. 587–604, 2018. [CrossRef] [Google Scholar]
[Publisher Link]
[10] Emily M. Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?,” Proceedings of the 2021 ACM
Conference on Fairness, Accountability, and Transparency, pp. 610–623, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[11] Laure Berti-Equille, “Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation,” WWW '19: The World Wide Web
Conference, pp. 2580–2586, 2019. [CrossRef] [Google Scholar] [Publisher Link]
[12] Leopoldo Bertossi, and Floris Geerts, “Data Quality and Explainable AI,” Journal of Data and Information Quality, vol. 12, no. 2, pp. 1–
9, 2020. [CrossRef] [Google Scholar] [Publisher Link]
[13] Andrew Black, and Peter van Nederpelt, “Dimensions of Data Quality (DDQ) Research Paper,” DAMA NL Foundation, pp. 1-113, 2020.
[Google Scholar] [Publisher Link]
[14] Tolga Bolukbasi et al., “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings,” Proceedings of
the 30th International Conference on Neural Information Processing Systems, pp. 4356–4364, 2016. [Google Scholar] [Publisher Link]
[15] Rishi Bommasani et al., “On the Opportunities and Risks of Foundation Models.” ArXiv, pp. 1-214, 2021. [CrossRef] [Google Scholar]
[Publisher Link]
[16] Paula Branco, Luís Torgo, and Rita P. Ribeiro, “A Survey of Predictive Modeling on Imbalanced Domains,” ACM Computing Surveys,
vol. 49, no. 2, pp. 1–50, 2016. [CrossRef] [Google Scholar] [Publisher Link]
[17] Samuel Budd, Emma C. Robinson, and Bernhard Kainz, “A Survey on Active Learning and Human-in-the-Loop Deep Learning for
Medical Image Analysis,” Medical Image Analysis, vol. 71, p. 102062, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[18] Wo Chang, “ISO/IEC JTC 1/SC 42(AI)/WG 2(Data) Data Quality for Analytics and Machine Learning (ML),” Information Technology
Laboratory, 2022. [Google Scholar] [Publisher Link]
[19] Haihua Chen, Jiangping Chen, and Junhua Ding, “Data Evaluation and Enhancement for Quality Improvement of Machine Learning,”
IEEE Transactions on Reliability, vol. 70, no. 2, pp. 831–847, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[20] Catherine D’Ignazio, and Lauren F. Klein, Data Feminism, Cambridge: Massachusetts Institute of Technology, 2020. [Google Scholar]
[Publisher Link]
[21] Lisa Ehrlinger et al., “A DaQL to Monitor Data Quality in Machine Learning Applications,” International Conference on Database and
Expert Systems Applications, pp. 227–237, 2019. [CrossRef] [Google Scholar] [Publisher Link]
[22] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, “From Data Mining to Knowledge Discovery in Databases,” AI
Magazine, vol. 17, no. 3, pp 37–54, 1996. [CrossRef] [Google Scholar] [Publisher Link]
[23] Timnit Gebru et al., “Datasheets for Datasets,” Communications of the ACM, vol. 64, no. 12, pp. 86–92, 2021. [CrossRef] [Google Scholar]
[Publisher Link]
[24] Fernando Gualo et al., “Data Quality Certification using ISO/IEC 25012: Industrial Experiences,” Journal of Systems and Software, vol.
176, p. 110938, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[25] Venkat Gudivada, Amy Apon, and Junhua Ding, “Data Quality Considerations for Big Data and Machine Learning: Going Beyond Data
Cleaning and Transformations,” International Journal on Advances in Software, vol. 10, no. 1, pp. 1–20, 2017. [Google Scholar]
[Publisher Link]
[26] David Gundry, and Sebastian Deterding, “Trading Accuracy for Enjoyment? Data Quality and Player Experience in Data Collection
Games,” Proceedings of the CHI Conference on Human Factors in Computing Systems, no. 156, pp. 1–14, 2022. [CrossRef] [Google
Scholar] [Publisher Link]
[27] Nitin Gupta et al., “Data Quality for Machine Learning Tasks,” Proceedings of the 27th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, pp. 4040–4041, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[28] Thilo Hagendorff, “Linking Human and Machine Behavior: A New Approach to Evaluate Training Data Quality for Beneficial Machine
Learning,” Minds and Machines, vol. 31, pp. 563–593, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[29] Haibo He, and Edwardo A. Garcia, “Learning from Imbalanced Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21,
no. 9, pp. 1263–1284, 2009. [CrossRef] [Google Scholar] [Publisher Link]
[30] Deborah Henderson, and Susan Earley, DAMA-DMBOK: Data Management Body of Knowledge, 2nd ed., Technics Publications, p. 624,
2017. [Google Scholar] [Publisher Link]
[31] Fred Hohman et al., “Understanding and Visualizing Data Iteration in Machine Learning,” Proceedings of the 2020 CHI Conference on
Human Factors in Computing Systems, pp. 1–13, 2020. [CrossRef] [Google Scholar] [Publisher Link]
[32] Sarah Holland et al., “The Dataset Nutrition Label,” Data Protection and Privacy, vol. 12, no. 12, 2020. [Google Scholar] [Publisher Link]
[33] Andreas Holzinger, “From Machine Learning to Explainable AI,” World Symposium on Digital Intelligence for Systems and Machines
(DISA’18), 2018. [CrossRef] [Google Scholar] [Publisher Link]
[34] Sara Hooker, “Moving Beyond ‘Algorithmic Bias is a Data Problem’,” Patterns, vol. 2, no. 4, p. 100241, 2021. [CrossRef] [Google
Scholar] [Publisher Link]
[35] Ben Hutchinson et al., “Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure,”
Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 560–575, 2021. [CrossRef] [Google
Scholar] [Publisher Link]
[36] Eun Seo Jo, and Timnit Gebru, “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning,” Proceedings
of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 306–316, 2020. [CrossRef] [Google Scholar] [Publisher Link]
[37] Michael I. Jordan, and Tom M. Mitchell, “Machine Learning: Trends, Perspectives, and Prospects,” Science, vol. 349, no. 6245, pp. 255–
260, 2015. [CrossRef] [Google Scholar] [Publisher Link]
[38] Ashish Juneja, and Nripendra Narayan Das, “Big Data Quality Framework: Pre-Processing Data in Weather Monitoring Application,”
International Conference on Machine Learning, Big Data, Cloud, and Parallel Computing (COMITCon’19), pp. 559–563, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[39] Daniel S. Katz et al., “Software vs. Data in the Context of Citation,” PeerJ Preprints, pp. 1-4, 2016. [CrossRef] [Google Scholar]
[Publisher Link]
[40] Guy Katz et al., “Towards Proving the Adversarial Robustness of Deep Neural Networks,” Arxiv, pp. 19-26, 2017. [CrossRef] [Google
Scholar] [Publisher Link]
[41] Sunho Kim et al., “Organizational Process Maturity Model for IoT Data Quality Management,” Journal of Industrial Information
Integration, vol. 26, p. 100256, 2022. [CrossRef] [Google Scholar] [Publisher Link]
[42] Laura Koesten et al., “Everything you Always Wanted to Know about a Dataset: Studies in Data Summarisation,” International Journal
of Human-Computer Studies, vol. 135, p. 102367, 2020. [CrossRef] [Google Scholar] [Publisher Link]
[43] Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl, “Machine Learning Operations (MLOps): Overview, Definition, and
Architecture,” ArXiv, 2022. [CrossRef] [Publisher Link]
[44] Sampo Kuutti et al., “A Survey of Deep Learning Applications to Autonomous Vehicle Control,” IEEE Transactions on Intelligent
Transportation Systems, vol. 22, no. 2, pp. 712–733, 2020. [CrossRef] [Google Scholar] [Publisher Link]
[45] Aleksander Madry et al., “Towards Deep Learning Models Resistant to Adversarial Attacks,” ArXiv, 2017. [CrossRef] [Google Scholar]
[Publisher Link]
[46] Ninareh Mehrabi et al., “A Survey on Bias and Fairness in Machine Learning,” ACM Computing Surveys, vol. 54, no. 6, pp. 1–35, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[47] Merino Jorge et al., “A Data Quality in Use Model for Big Data,” Future Generation Computer Systems, vol. 63, pp. 123–130, 2016.
[CrossRef] [Google Scholar] [Publisher Link]
[48] Margaret Mitchell et al., “Model Cards for Model Reporting,” Proceedings of the Conference on Fairness, Accountability, and
Transparency, pp. 220–229, 2019. [CrossRef] [Google Scholar] [Publisher Link]
[49] Tanushree Mitra, Clayton J. Hutto, and Eric Gilbert, “Comparing Person-and Process-Centric Strategies for Obtaining Quality Data on
Amazon Mechanical Turk,” Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1345–1354,
2015. [CrossRef] [Google Scholar] [Publisher Link]
[50] Jose G. Moreno-Torres et al., “A Unifying View on Dataset Shift in Classification,” Pattern Recognition, vol. 45, no. 1, pp. 521–530,
2012. [CrossRef] [Google Scholar] [Publisher Link]
[51] Eirini Ntoutsi et al., “Bias in Data-Driven Artificial Intelligence Systems–An Introductory Survey,” Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery, pp. 1-14, 2020. [CrossRef] [Google Scholar] [Publisher Link]
[52] Andrei Paleyes, Raoul-Gabriel Urma, and Neil D. Lawrence, “Challenges in Deploying Machine Learning: A Survey of Case Studies,”
ACM Computing Surveys, vol. 55, no. 6, pp. 1–29, 2022. [CrossRef] [Google Scholar] [Publisher Link]
[53] Amandalynne Paullada et al., “Data and its (dis)Contents: A Survey of Dataset Development and Use in Machine Learning Research,”
Patterns, vol. 2, no. 11, pp. 1-14, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[54] Kai Petersen et al., “Systematic Mapping Studies in Software Engineering,” Proceedings of the 12th International Conference on
Evaluation and Assessment in Software Engineering, pp. 68–77, 2008. [CrossRef] [Google Scholar] [Publisher Link]
[55] Joelle Pineau et al., “Improving Reproducibility in Machine Learning Research (a Report from the NeurIPS 2019 Reproducibility
Program),” Journal of Machine Learning Research, vol. 22, no. 1, pp. 7459–7478, 2021. [Google Scholar] [Publisher Link]
[56] Claudio Santos Pinhanez et al., “Integrating Machine Learning Data with Symbolic Knowledge from Collaboration Practices of Curators
to Improve Conversational Systems,” Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–13, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[57] Neoklis Polyzotis et al., “Data Lifecycle Challenges in Production Machine Learning: A survey,” ACM SIGMOD Record, vol. 47, no. 2,
pp. 17–28, 2018. [CrossRef] [Google Scholar] [Publisher Link]
[58] Jorge Ramírez et al., “On the State of Reporting in Crowdsourcing Experiments and a Checklist to Aid Current Practices,” Proceedings
of the ACM on Human-Computer Interaction, vol. 5, no. 2, pp. 1–34, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[59] Jimmy Rising, “Justice and Ethics,” Massachusetts Institute of Technology MIT, Cambridge, MA, Report., 2002. [Publisher Link]
[60] Anna Rogers, Tim Baldwin, and Kobi Leins, “Just What do You Think you’re Doing, Dave? A Checklist for Responsible Data Use in
NLP,” ArXiv, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[61] Yuji Roh, Geon Heo, and Steven Euijong Whang, “A Survey on Data Collection for Machine Learning: A Big Data-AI Integration
Perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 4, pp. 1328–1347, 2019. [CrossRef] [Google Scholar]
[Publisher Link]
[62] Annabel Rothschild et al., “Towards Fair and Pro-Social Employment of Digital Pieceworkers for Sourcing Machine Learning Training
Data,” Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–9, 2022. [CrossRef] [Google Scholar]
[Publisher Link]
[63] Tammo Rukat, Dustin Lange, Sebastian Schelter, and Felix Biessmann, “Towards Automated Data Quality Management for Machine
Learning,” Proceedings of the Workshop on MLOps Systems at the 3rd Conference on Machine Learning and Systems, pp. 1–3, 2020.
[Google Scholar] [Publisher Link]
[64] Nithya Sambasivan et al., “‘Everyone Wants to do the Model Work, Not the Data Work’: Data Cascades in High-Stakes AI,” Proceedings
of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[65] Sebastian Schelter et al., “Deequ-Data Quality Validation for Machine Learning Pipelines,” Proceedings of the Machine Learning Systems
Workshop at the Conference on Neural Information Processing Systems, 2018. [Publisher Link]
[66] Shreya Shankar et al., “No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing
World,” ArXiv, 2017. [CrossRef] [Google Scholar] [Publisher Link]
[67] Daniel Staegemann et al., “Determining Potential Failures and Challenges in Data-Driven Endeavors: A Real World Case Study Analysis,”
Proceedings of the 5th International Conference on Internet of Things, Big Data and Security, pp. 453–460, 2020. [CrossRef] [Google
Scholar] [Publisher Link]
[68] Ikbal Taleb et al., “Big Data Quality Framework: A Holistic Approach to Continuous Quality Management,” Journal of Big Data, vol. 8,
pp. 1–41, 2021. [CrossRef] [Google Scholar] [Publisher Link]
[69] Linnet Taylor, “What is Data Justice? The Case for Connecting Digital Rights and Freedoms Globally,” Big Data and Society, vol. 4, no.
2, pp. 1-14, 2017. [CrossRef] [Google Scholar] [Publisher Link]
[70] Divy Thakkar et al., “When is Machine Learning Data Good?: Valuing in Public Health Datafication,” Proceedings of the CHI Conference
on Human Factors in Computing Systems, pp. 1–16, 2022. [CrossRef] [Google Scholar] [Publisher Link]
[71] Jennifer Wortman Vaughan, “Making Better Use of the Crowd: How Crowdsourcing can Advance Machine Learning Research,” Journal
of Machine Learning Research, vol. 18, no. 1, pp. 1-46, 2017. [Google Scholar] [Publisher Link]
[72] April Yi Wang et al., “What Makes a Well-Documented Notebook? A Case Study of Data Scientists’ Documentation Practices in Kaggle,”
Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7, 2021. [CrossRef] [Google Scholar]
[Publisher Link]
[73] Ding Wang, Shantanu Prabhat, and Nithya Sambasivan, “Whose AI dream? In Search of the Aspiration in Data Annotation,” Proceedings
of the CHI Conference on Human Factors in Computing Systems, pp. 1–16, 2022. [CrossRef] [Google Scholar] [Publisher Link]
[74] Richard Y. Wang, and Diane M. Strong, “Beyond Accuracy: What Data Quality Means to Data Consumers,” Journal of Management Information Systems, vol. 12, no. 4, pp. 5–33, 1996. [CrossRef] [Google Scholar] [Publisher Link]
[75] Martin J. Willemink et al., “Preparing Medical Imaging Data for Machine Learning,” Radiology, vol. 295, no. 1, pp. 4–15, 2020. [CrossRef] [Google Scholar] [Publisher Link]
[76] Eric Wong, and Zico Kolter, “Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope,” Proceedings
of the International Conference on Machine Learning, pp. 5286–5295, 2018. [Google Scholar] [Publisher Link]
[77] Amrapali Zaveri et al., “Quality Assessment for Linked Data: A Survey,” Semantic Web, vol. 7, no. 1, pp. 63–93, 2016. [CrossRef]
[Google Scholar] [Publisher Link]
[78] Sandeep Rangineni, Arvind Kumar Bhardwaj, and Divya Marupaka, “An Overview and Critical Analysis of Recent Advances in
Challenges Faced in Building Data Engineering Pipelines for Streaming Media,” The Review of Contemporary Scientific and Academic
Studies, vol. 3, no. 6, pp. 1-5, 2023. [CrossRef] [Publisher Link]
[79] Divya Marupaka, Sandeep Rangineni, and Arvind Kumar Bhardwaj, “Data Pipeline Engineering in the Insurance Industry: A Critical
Analysis of ETL Frameworks, Integration Strategies, and Scalability,” International Journal of Creative Research Thoughts, vol. 11, no.
6, pp. 530-539, 2023. [CrossRef] [Publisher Link]
[80] Sandeep Rangineni, Divya Marupaka, and Arvind Kumar Bhardwaj, “An Examination of Machine Learning in the Process of Data
Integration,” SSRG International Journal of Computer Trends and Technology, vol. 71, no. 6, 2023. [CrossRef] [Publisher Link]