Contemporary Issues in Computing (CIC) : ISBN: 978-1-948012-16-4
This is an open access article distributed under the Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Article History:
Received 26 October 2020
Accepted 27 November 2020
Available online 03 December 2020

ABSTRACT

The new Data Innovation Strategy gives us a chance to bring this idea to a more concrete realization and to consider how it could become part of the production process. Big data is a tremendous source of data and information flowing from systems all the way to end users, and big data are growing very rapidly in all scientific areas. While the potential of these massive data is undoubtedly significant, fully understanding them requires new ways of thinking and novel learning strategies to address the various challenges. Operating on knowledge of this scale also requires suitable equipment, and it motivates the simulation of data and the use of machine learning techniques. In this paper, we describe the idea of data veracity in the context of big data, review relevant machine learning work, and examine machine learning strategies, tools, and statistical quality models.

KEYWORDS

Data Quality Management, Measurement Errors, Data Gaps, Machine Learning, Supervised Learning, Data Quality, Data Cleaning, Quality Control.
1. INTRODUCTION

Recent advances in information and communication technology (ICT) have led to the mass production of data by social networks, sensor networks, and other Internet-based applications in domains such as healthcare. The massive amount of data generated at high speed from different web sources is called Big Data (Mehmood et al., 2016). Big Data is commonly characterized by volume, variety, and velocity. Veracity, however, is a further property of Big Data that is growing in prominence and concerns the emerging issue of trust, or quality, associated with the use of data. Data quality (DQ) is defined as data "fitness for use" (Panahy et al., 2013); this broad definition conveys the idea that data is used for specific purposes, and high-quality data is data that is good enough to allow its users to meet their objectives. In the domain of Big Data, research on quality is still at an early stage, triggering the need for more work in this area. Data quality assessment is a fundamental prerequisite for data improvement, and the purpose of data quality assessment is to determine the quality level of the data (Philip et al., 2010).

Lack of data quality manifests itself through missing data, duplicate data, highly correlated variables, large numbers of variables, and outliers. Low-quality data can pose serious problems both for building ML models and for big data applications. Statistical techniques such as missing-data imputation, outlier detection, data transformations, dimensionality reduction, robust statistics, cross-validation, and bootstrapping play a central role in data quality management.

Over the previous decade, machine learning techniques have been widely adopted in numerous large and complex data-intensive fields such as medicine, astronomy, and biology, because these techniques offer potential solutions for mining the knowledge hidden in the data. However, as the era of big data arrives, collections of data sets have become so large and complex that they are hard to handle with traditional learning methods, since the established process of learning from conventional data sets was not designed for, and does not work well with, high volumes of data. For example, most conventional machine learning algorithms are designed for data that can be loaded entirely into memory (Chen et al., 2014), an assumption that no longer holds in the context of big data. Thus, although learning from this diverse data is expected to bring major scientific and engineering advances, along with improvements in our quality of life, it also brings enormous challenges (Slavakis et al., 2014).

Organizations frequently overestimate the quality of their data and underestimate the consequences of poor-quality data. The results of bad data can range from significant to disastrous. Data quality problems can cause projects to fail and lead to lost revenue, weakened customer relationships, and customer turnover. Organizations are routinely fined for not having a sound regulatory compliance process, and high-quality data is at the core of regulatory compliance. The Data Warehousing Institute (TDWI) estimates that poor data quality costs businesses in the United States over $700 billion every year (TDWI, 2016).
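Two of the statistical techniques listed above, missing-data imputation and outlier detection, can be illustrated with a short sketch. This is our own minimal example, not the authors' method; the function names and the z-score threshold are illustrative choices, and only the Python standard library is used:

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def zscore_outliers(values, threshold=3.0):
    """Return indices of values lying more than `threshold` standard
    deviations from the mean (a common, simple outlier criterion)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]

# Example: fill a gap, then flag an extreme reading.
readings = [10.0, 12.0, None, 11.0]
print(impute_mean(readings))                    # the gap becomes the mean 11.0
print(zscore_outliers([10.0] * 30 + [1000.0]))  # flags the last index
```

In practice, mean imputation is only safe when data is missing at random; median or model-based imputation is often preferred for skewed or systematically missing data.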
Cite The Article: Tahamina Yesmin and Kisalaya Chakrabarti (2020). Improving Data Validation Using Machine Learning, A New Ways of Seeing Big Data. Topics in Intelligent Computing and Industry Design, 2(1): 54-58.

Machine learning has tremendous potential for, and is a vital part of, big data analysis (Sun et al., 2014). Figure 1 shows the formation of machine learning: it is a combination of computer science, engineering, and statistics, and any field that needs to interpret and act upon data can benefit from it.

Figure 1: Formation of Machine Learning

2. DEFINITION AND CLASSIFICATION OF MACHINE LEARNING

Machine learning is a field of research that formally focuses on the theory, performance, and properties of learning systems and algorithms. It is a highly interdisciplinary field building upon ideas from many different fields, such as artificial intelligence, optimization theory, information theory, statistics, cognitive science, and optimal control, and many other disciplines of science, engineering, and mathematics. Owing to its use in a wide range of applications, machine learning has reached almost every scientific domain and has had great impact on science and society. It has been applied to a variety of problems, including recommendation engines, recognition systems, informatics and data mining, and autonomous control systems.

In general, the field of machine learning is divided into three subdomains: supervised learning, unsupervised learning, and reinforcement learning (Adam et al., 2008). Briefly, supervised learning requires training with labeled data that has inputs and desired outputs. In contrast with supervised learning, unsupervised learning does not require labeled training data, and the environment provides only inputs without desired targets. Reinforcement learning enables learning from feedback received through interactions with an external environment.

Based on these three essential learning paradigms, a great many theoretical mechanisms and application services have been proposed for handling data tasks. From a data processing perspective, supervised and unsupervised learning mainly focus on data analysis, while reinforcement learning is preferred for decision-making problems. Another point is that most traditional machine-learning-based systems are designed under the assumption that all the collected data can be loaded entirely into memory for centralized processing. However, as data keeps growing larger and larger, existing machine learning techniques face great difficulties when they are required to handle such extreme volumes of data. Today, there is a pressing need to develop efficient and intelligent learning methods to cope with future data processing demands (Mitchell, 1997; Russell et al., 1995; Cherkassky et al., 2007; Mitchell, 2006; Rudin et al., 2014; Bishop, 2006; Adam et al., 2008; Jones, 2014; Langford, 2005; Bekkerman et al., 2003; Weiss et al., 2016).

3. HOW ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING ALGORITHMS USE DATA

Machine learning is a broad subfield of artificial intelligence, and most discussions revolve around learning algorithms that use data to build models that can then be applied to new, unseen data. An algorithm learns its rules from the examples contained in the training data. Most discussions, including the present one, concern so-called supervised machine learning, which uses labeled data. The discussion would likewise apply to unsupervised learning, where no labeled data is available and groupings of the data are derived instead. The manner in which algorithms are trained, by splitting the available data into training, test, and validation sets, is fairly involved and is not part of the discussion of this paper. Likewise, reinforcement learning, where an element of trial and error is involved in the training stage, is not considered here.

Great advances have been made in image recognition, where an algorithm is built by learning to classify images. For instance, many photographs containing houses are analyzed to produce an algorithm for recognizing houses. The rules derived from that analysis are then tested, for accuracy, against another set of pictures that also contains houses. Other examples include: data from past crime events that help to predict where and when crime is most likely to occur; data from social media posts (for example, the use of specific words or combinations of words) used to predict whether someone is likely to click on an advertisement; data from past health assessments and their outcomes that help to understand and predict which patients have a particular disease or are at elevated risk of suffering from a fatal illness; and browsing history used to predict a person's income.

It is important to note that most learning algorithms cannot easily go beyond the data they use to learn patterns, the so-called training data. While computer scientists make every effort to learn from the available data, once a machine learning model is deployed and used in practice, it is not generally (or easily) possible to verify its success or potential harm.

When machine learning algorithms are used, at least three distinct data sets are involved. What follows is a simplified description of one kind of machine learning algorithm, intended to give a better understanding of the kinds of data involved in commonly used instances of machine learning.

When an algorithm is deployed, it is fed with new, unseen features (input data). These are evaluated against the model parameters in order to take actions or make decisions; for example, the browsing history of people for whom it is not yet known whether they will click on a particular advertisement. The algorithm then produces derived scores (or estimates, inferences, derived actions, or output data) that are generated when the unseen data are fed into the machine learning algorithm.

4. DATA QUALITY FUNDAMENTALS

Because of an increased reliance upon data to support business decisions, research in the area of data quality has made great strides in recent years. Researchers have sought to define what is meant by such terms as accuracy, completeness, and believability, and to determine which quality factors are most significant when evaluating the quality of our data. [9] shows the various contrasts between the practitioner and the academic perspective on the components of data quality. While some researchers list more than 20 dimensions for determining data quality, we will concentrate exclusively on data accuracy for this study. Most specialists include accuracy in their list of data quality components, and we will follow the definitions and ideas of (Wand et al., 1996).

Accuracy is viewed as how close a measurement, or data record, is to the real-world situation it represents. It is often viewed as an intrinsic data quality component, independent of its context within the system.
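As a concrete reading of this definition, accuracy can be estimated by comparing records against a trusted gold-standard version of the same data. The sketch below is our own illustration with hypothetical field names; the paper does not prescribe an implementation:

```python
def record_accuracy(records, reference, key="id"):
    """Fraction of records that exactly match the gold-standard record
    with the same key; a crude proxy for closeness to the real world."""
    ref_by_key = {r[key]: r for r in reference}
    matches = sum(1 for r in records if r == ref_by_key.get(r[key]))
    return matches / len(records)

# Hypothetical example: one of two observed records disagrees with the reference.
gold = [{"id": 1, "price": 10.0}, {"id": 2, "price": 20.0}]
observed = [{"id": 1, "price": 10.0}, {"id": 2, "price": 19.5}]
print(record_accuracy(observed, gold))  # 0.5
```

Exact matching is the simplest possible criterion; for numeric fields, a tolerance-based comparison is usually more appropriate.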
While much more complex and powerful metrics exist to detail data quality, we chose to concentrate on accuracy because it is the easiest to control in our test data sets and it is of universal concern to analysts and managers.

Data quality research is principally advanced by computer science and information systems researchers. Computer science researchers address data quality issues related to the identification of duplicate data, resolving inconsistencies in data, imputation of missing data, and linking and integrating related data obtained from multiple sources (Ganti et al., 2013). Computer scientists use algorithmic approaches based on statistical methods to address these issues (Osborne, 2013). Information systems researchers, on the other hand, study data quality issues from a systems point of view (McGilvray, 2008). For instance, they investigate the contribution of user interfaces to data quality problems. Although these researchers also confront data quality issues, the size of their data sets does not compare to big data and machine learning settings.

5. DATA VALIDATION

National statistical institutes (NSI) perform data validation to test the reliability of delivered data. Data that appears to be either clearly wrong or possibly wrong is sent back to the data providers for correction or comment. This is especially important in the domain of administrative data sent by data providers (for example, municipalities, cantons/states, government units, universities, or schools), where two-way communication with these providers is possible and essential, as these data are not collected for statistical purposes in the first place. Until now, such data validation was performed in mostly two ways: either manually, by eye-balling the received data, or automatically, through thresholds and logical tests. In certain cases the automation (not performed by machine learning) is sufficiently advanced that the data providers have to correct all apparent errors and have to give their approval for questionable cases. As a result, the data that arrives at the national statistical institutes is of higher quality and more quickly processed, and the procedure is more resource-efficient. We anticipate similar results from the use of machine learning in this domain, in a way that would probably not replace but complement the existing automatic tools, as shown in Figure 2.

The idea to apply machine learning to data validation emerged rather by coincidence. Colleagues from the same unit were discussing a particular surprising pattern in their data in a meeting with data providers. The surprising pattern turned out to be a minor phenomenon that had valid reasons to exist. This prompted some reflection about patterns in the data. Among the few tens of thousands of data points per year (and in other administrative data we are talking even of 6 or 7 digits per year) there may be "correct" patterns, there may be systematic errors, and there may be mistakes that are recognizable by humans as wrong. The question therefore arose whether such patterns and "outliers" can be recognized by a machine learning algorithm, or in other words by an automatic tool that goes beyond ex-ante defined thresholds or logical tests (Ruiz et al., 2018).

Figure 2: Big Data Validation Process

6. BIG DATA

Conventional data uses a centralized database architecture in which large and complex problems are solved by a single computer system. Centralized architecture is costly and inadequate for processing huge amounts of data. Big data relies on a distributed database architecture in which a large block of data is handled by dividing it into several smaller pieces. The solution to a problem is then computed by several different computers in a given computer network; the computers communicate with one another in order to find the solution (Weiss et al., 2016). The distributed database provides better computing at lower cost and also improves performance compared with a centralized database system. This is because centralized architecture relies on mainframes, which are not as economical as the microprocessors in a distributed database system. In addition, a distributed database has more computational power than the centralized database system used to manage conventional data.

6.1 Assessing data quality for improvement of data

Many criteria can be looked into when assessing the quality of data for AI applications. In general, data quality includes many different issues; for example, questions of completeness, accuracy, consistency, timeliness, duplication, validity, availability and provenance (Burt et al., 2018).

One of the problems of big data is that the sheer size of the data tends to convince us that findings based on such large-scale data must be accurate. However, if the data quality is not taken into account, this assumption might not hold. Data quantity is only one criterion for the accuracy of measuring or predicting something. To tackle statistical accuracy based on data and determine how well we can represent the real world, it needs to be assessed alongside data quality (Caliskan et al., 2017).

6.2 Data Quality Challenges in Big Data

Data quality is an issue that has been studied for quite a few years now. However, the attention has mainly been on the data in operational databases and data warehouses. Recently, researchers have begun investigating data quality issues beyond operational and warehousing data. In contrast to the case of relational databases, NoSQL systems for big data use a wide range of data models. The research question is: do we need a different data quality management approach for each NoSQL system? In big data and machine learning domains, data is acquired from multiple vendors. Data is also generated by crowdsourcing, which is supplemented by user-contributed data through mobile and web applications. How do we assess the veracity and accuracy of crowdsourced and user-contributed data? The proliferation of digital channels and mobile computing is generating more data than ever before.

What is the effect of cloud deployments on data quality? Should data quality assessments move beyond column analysis in relational databases and address issues related to complex data transformations, integration of data from diverse data sources, and aggregations that provide insights into data? (Venkat et al., 2017).

6.2.1 Dealing with Missing Data

Missing data is a significant concern in big data. From a statistical point of view, missing data is classified into one of three categories: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR) (Allison, 2001). In MCAR, as the name suggests, there is no pattern to the missing data: data is missing independently of both observed and unobserved data, and the missing and observed values have similar distributions. In other words, an MCAR sample is simply a subset of the data. MAR is a misnomer, and another name such as Missing Conditionally At Random better captures its meaning: given the observed data, data is missing independently of the unobserved data. There may be systematic differences between the missing and observed values, but these differences can be entirely explained by other observed variables. MCAR implies MAR, but not the other way around. In MNAR, missing observations are related to the unobserved data itself; as a result, the observed data is a biased sample.

High rates of missing data require careful attention regardless of the analysis method used. If MCAR or MAR conditions hold for a variable, such variables can be dropped across all observations in the data set. However, dropping MNAR variables may lead to results that are strongly biased.

6.2.2 Data Quality Monitoring

This component builds on the functionality of the Data Quality Assessment component. Its essential function is to continuously monitor data quality and report results through dashboards and alerts. Data quality is affected by processes that acquire data from the outside world through batch and real-time feeds.
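A continuous monitor of this kind can be reduced to a small sketch: compute per-field completeness for each incoming batch and raise an alert when a threshold is crossed. This is our own illustration, not a described implementation; the function names and the 5% threshold are assumptions:

```python
def field_null_rates(records):
    """Per-field fraction of missing (None) values in a batch of records."""
    fields = sorted({f for r in records for f in r})
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}

def quality_alerts(records, max_null_rate=0.05):
    """Fields whose null rate exceeds the alert threshold, for dashboards."""
    return {f: rate for f, rate in field_null_rates(records).items()
            if rate > max_null_rate}

# Hypothetical batch: both fields are 25% null, above the 5% threshold.
batch = [{"price": 101.2, "vendor": "A"},
         {"price": None,  "vendor": "B"},
         {"price": 99.8,  "vendor": None},
         {"price": 100.4, "vendor": "A"}]
print(quality_alerts(batch))
```

A production monitor would track many more metrics (duplicates, range violations, freshness) and compare each batch against historical baselines rather than a fixed threshold.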
For instance, an organization may acquire data feeds from numerous vendors. Issues arise when different values are reported for the same data item. For example, financial services organizations receive real-time price data from multiple vendors. If the timestamped prices for a financial instrument received from the vendors differ, how does the system select one of the values as the correct price? A vendor source hierarchy is one approach to handling this problem. It requires imposing an ordering on the vendors based on their reputation: if multiple prices are available for an instrument, the price is selected from the vendor with the highest reputation. An approach to improving data quality using the conflicting data values about the same item originating from multiple sources is presented in (Müller et al., 2012).

Another source of data quality degradation arises when systems are merged or upgraded. The systems being consolidated may be operating under different sets of business rules and user interfaces. The data may also overlap between the systems; worse still, the overlapping data may be contradictory. System consolidations typically require considerable manual effort, and decisions made in resolving data issues must be recorded as metadata. Finally, transformations, integration, and aggregation also contribute to data quality decay. These processes may entail loss of data precision and scale, and the replacement of old identifiers with new ones.

6.2.3 Data Storage, Retrieval & Purging

This component provides persistent storage, query mechanisms, and management functionality to secure and retrieve data. Both relational and NoSQL database systems are used to realize this functionality (Gudivada et al., 2016). Several data models and query languages are provided to efficiently store and query structured, semi-structured, and unstructured data (Gudivada et al., 2016). Rules-driven processing is used to purge expired and stale data.

In Table 1, shown below, we present the performance of some big data tools, conveniently comparing data validation processes.
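The vendor source hierarchy described in Section 6.2.2 can be sketched directly: impose a reputation ordering on vendors and, for each instrument, keep the price from the best-ranked vendor. This is our own minimal illustration; the identifiers and rank values are hypothetical:

```python
def resolve_by_vendor_rank(quotes, vendor_rank):
    """Resolve conflicting prices for the same instrument by keeping the
    quote from the vendor with the best (lowest) reputation rank.
    quotes: iterable of (instrument, vendor, price) tuples."""
    best = {}
    for instrument, vendor, price in quotes:
        rank = vendor_rank[vendor]
        if instrument not in best or rank < best[instrument][0]:
            best[instrument] = (rank, price)
    return {inst: price for inst, (_, price) in best.items()}

# Two vendors disagree on the same instrument; vendor "v2" outranks "v1".
quotes = [("XYZ", "v1", 101.0), ("XYZ", "v2", 100.5), ("ABC", "v1", 55.0)]
print(resolve_by_vendor_rank(quotes, {"v1": 2, "v2": 1}))
```

A static ranking is the simplest policy; richer schemes weight vendors per instrument class or fall back to the next vendor when the preferred feed is stale.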
7. CONCLUSION

Machine learning is essential for addressing the challenges posed by big data and for uncovering hidden patterns, knowledge, and insights from huge data, with the ultimate objective of turning the potential of the latter into real value for business decision-making and scientific analysis. Poor data quality can give rise to unfair or otherwise incorrect machine learning systems. Consequently, an understanding of data quality can help to understand and mitigate the potential issues of such systems. Machine learning systems and algorithms that use data require broader and more flexible ways of assessing and addressing data quality in relation to several, and interconnected, fundamental rights.

REFERENCES

Adam, B., I.F.C. Smith, F. Asce. 2008. Reinforcement learning for structural control, J Comput Civil Eng, 22(2), 133–139.

Allison, P.D. 2001. Missing Data, SAGE Publications.

Bekkerman, R., E.Y. Ran, N. Tishby, Y. Winter. 2003. Distributional word clusters vs. words for text categorization, J Mach Learn Res, 3, 1183–1208.

Bishop, C.M. 2006. Pattern recognition and machine learning, (Springer, New York).

Burt, A., et al. 2018. Beyond Explainability: A Practical Guide to Managing Risk in Machine Learning Models. The Future of Privacy Forum White Paper.

Caliskan, A., J.J. Bryson, A. Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases, Science, 356(6334), 183–186.
Chen, X.W., X. Lin. 2014. Big data deep learning: challenges and perspectives, IEEE Access, 2, 514–525.

Cherkassky, V., F.M. Mulier. 2007. Learning from data: concepts, theory, and methods, (John Wiley & Sons, New Jersey).

Ganti, V., A.D. Sarma. 2013. Data Cleaning: A Practical Perspective, ser. Synthesis Lectures on Data Management, Morgan & Claypool Publishers.

Gudivada, V., D. Rao, V. Raghavan. 2016. Renaissance in database management: Navigating the landscape of candidate systems, IEEE Computer, 49(4), 31–42.

Gudivada, V., N.D. Rao, V.V. Raghavan. 2014. NoSQL systems for big data management, in 2014 IEEE World Congress on Services. Los Alamitos, CA, USA: IEEE Computer Society, 190–197.

Jones, N. 2014. Computer science: the learning machines, Nature, 505(7482), 146–148.

Langford, J. 2005. Tutorial on practical prediction theory for classification, J Mach Learn Res, 6(3), 273–306.

Lee, Y., D. Strong, B. Kahn, R. Wang. 2002. AIMQ: A Methodology for Information Quality Assessment, Information and Management, 133–146.

McGilvray, D. 2008. Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, San Francisco, CA: Morgan Kaufmann Publishers Inc.

Mehmood, A., I. Natgunanathan, Y. Xiang, G. Hua, S. Guo. 2016. Protection of Big Data Privacy, IEEE Access, 4, 1821–1834.

Mitchell, T.M. 1997. Machine learning, (McGraw-Hill, New York).

Mitchell, T.M. 2006. The discipline of machine learning, (Carnegie Mellon University, School of Computer Science, Machine Learning Department).

Müller, H.J., C. Freytag, U. Leser. 2012. Improving data quality by source analysis, J. Data and Information Quality, 2(4), 1–15, 38.

Osborne, J.W. 2013. Best practices in data cleaning: a complete guide to everything you need to do before and after collecting your data. Thousand Oaks, CA: SAGE.

Owen, S., R. Anil, T. Dunning, E. Friedman. 2011. Mahout in Action, Manning Publications Co.

Panahy, P., F. Sidi, L. Affendey, M. Jabar, H. Ibrahim, A. Mustapha. 2013. Discovering dependencies among data quality dimensions: A validation of instrument. J. Appl. Sci., 13, 95–105.

Philip, A.K., P. Woodall. 2010. A Hybrid Approach to Assessing Data Quality.

Rudin, C., K.L. Wagstaff. 2014. Machine learning for science and society, Mach Learn, 95(1), 1–9.

Ruiz, C., S. Federal. 2018. Improving Data Validation using Machine Learning, Neuchâtel, Switzerland, 18-20 September.

Russell, S., P. Norvig. 1995. Artificial intelligence: a modern approach, (Prentice-Hall, Englewood Cliffs).

Slavakis, K., G.B. Giannakis, G. Mateos. 2014. Modeling and optimization for big data analytics: (statistical) learning tools for our era of data deluge, IEEE Signal Proc Mag, 31(5), 18–31.

Sun, Y. et al. 2014. Organizing and Querying the Big Sensing Data with Event-Linked Network in the Internet of Things, International Journal of Distributed Sensor Networks, 14, 11.

TDWI. 2016. The data warehousing institute. Available: https://tdwi.org/Home.aspx.

Venkat, N., Gudivada, A., Apon, J. Ding. 2017. Data Quality Considerations for Big Data and Machine Learning: Going Beyond Data Cleaning and Transformations, Department of Computer Science, East Carolina University, USA, International Journal on Advances in Software, 10, 1-2.

Wand, Y., R. Wang. 1996. Anchoring Data Quality Dimensions in Ontological Foundations, Communications of the ACM, 39, 86–95.

Weiss, K., T.M. Khoshgoftaar, D.D. Wang. 2016. A survey of transfer learning, J Big Data, 3, 9. DOI: 10.1186/s40537-016-0043-6.