Data Validation
Data validation means checking the accuracy and quality of source data before using, importing, or otherwise processing it. Different types of validation can be performed depending on destination constraints or objectives. Data validation is a form of data cleansing.
When moving and merging data, it's important to make sure that data from different sources and repositories conforms to business rules and does not become corrupted due to inconsistencies in type or context. The goal is to create data that is consistent, accurate, and complete, so as to prevent data loss and errors during a move.
In data warehousing, data validation is often performed prior to the ETL (Extract, Transform, Load) process. A data validation test is performed so that analysts can get insight into the scope and nature of data conflicts. Data validation is a general term, however, and can be performed on any type of data, including data within a single application (such as Microsoft Excel) or when merging simple data within a single data store.
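As a minimal illustration of such a pre-import check (the field names and business rules below are hypothetical, not drawn from any particular system), a small Python routine can reject records whose types or values do not conform before two sources are merged:

```python
# Minimal pre-merge validation sketch; field names and rules are hypothetical.

def validate_record(record: dict) -> list:
    """Return a list of problems found in a single record."""
    problems = []
    if not isinstance(record.get("customer_id"), int):
        problems.append("customer_id must be an integer")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)):
        problems.append("amount must be numeric")
    elif amount < 0:
        problems.append("amount must not be negative")
    return problems

source_a = [{"customer_id": 1, "amount": 19.99}]
source_b = [{"customer_id": "2", "amount": -5}]  # inconsistent type and value

merged, rejected = [], []
for rec in source_a + source_b:
    errors = validate_record(rec)
    if errors:
        rejected.append((rec, errors))
    else:
        merged.append(rec)

print(f"accepted: {len(merged)}, rejected: {len(rejected)}")
```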
When validating data mining models, measures generally fall into the categories of accuracy, reliability, and usefulness.
Accuracy is a measure of how well the model correlates an outcome with the attributes
in the data that has been provided. There are various measures of accuracy, but all
measures of accuracy are dependent on the data that is used. In reality, values might be
missing or approximate, or the data might have been changed by multiple processes.
Particularly in the exploration and development phase, you might decide to accept a certain amount of error in the data, especially if the data is fairly uniform in its
characteristics. For example, a model that predicts sales for a particular store based on past
sales can be strongly correlated and very accurate, even if that store consistently used the
wrong accounting method. Therefore, measurements of accuracy must be balanced by
assessments of reliability.
Reliability assesses the way that a data mining model performs on different data sets. A
data mining model is reliable if it generates the same type of predictions or finds the
same general kinds of patterns regardless of the test data that is supplied.
For example, the model that you generate for the store that used the wrong accounting
method would not generalize well to other stores, and therefore would not be reliable.
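One way to probe reliability in this sense, sketched here with scikit-learn on synthetic data rather than real sales records, is to score the same kind of model on several different splits of the data and look at the spread of the scores:

```python
# Sketch: assess reliability by scoring one model on several data splits.
# The synthetic regression data stands in for real sales records.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("accuracy (mean R^2):", scores.mean())
print("reliability (spread):", scores.std())  # large spread -> unreliable
```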
Usefulness includes various metrics that tell you whether the model provides useful
information. For example, a data mining model that correlates store location with sales
might be both accurate and reliable, but might not be useful, because you cannot generalize
that result by adding more stores at the same location. Moreover, it does not answer the
fundamental business question of why certain locations have more sales. You might also find
that a model that appears successful in fact is meaningless, because it is based on cross-
correlations in the data.
Every organization will have its own set of rules for storing and maintaining data.
Setting basic data validation rules will assist your company in maintaining
organized standards that will make working with data more efficient. Most Data
Validation procedures will run one or more of these checks to ensure that the data
is correct before it is stored in the database.
● Data Type Check
● Code Check
● Range Check
● Format Check
● Consistency Check
● Uniqueness Check
● Presence Check
● Length Check
● Look Up
1) Data Type Check
A Data Type Check ensures that data entered into a field is of the correct data type. A field, for example, may only accept numeric data. The system should then reject any data containing other characters, such as letters or special symbols, and display an error message.
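A minimal sketch of a Data Type Check in Python (the numeric-only field is hypothetical):

```python
# Data Type Check sketch: accept only numeric input for a field.
def is_numeric(raw: str) -> bool:
    """True if the raw input parses as a number."""
    try:
        float(raw)
        return True
    except ValueError:
        return False

print(is_numeric("42"))    # True
print(is_numeric("42a"))   # False -> reject and show an error message
```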
2) Code Check
A Code Check ensures that a field is chosen from a valid list of values or that
certain formatting rules are followed. For example, it is easier to verify the validity
of a postal code by comparing it to a list of valid codes. Other items, such as
country codes and NAICS industry codes, can be approached in the same way.
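A membership test against a known list is usually enough; the set of codes below is hypothetical:

```python
# Code Check sketch: compare input against a list of valid codes.
VALID_COUNTRY_CODES = {"US", "GB", "DE", "FR", "IN"}

def is_valid_code(value: str) -> bool:
    return value.strip().upper() in VALID_COUNTRY_CODES

print(is_valid_code("gb"))   # True
print(is_valid_code("XX"))   # False -> not an accepted code
```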
3) Range Check
A Range Check will determine whether the input data falls within a given range.
Latitude and longitude, for example, are frequently used in geographic data.
Latitude should be between -90 and 90, and longitude should be between -180 and
180. Any values outside of this range are considered invalid.
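The latitude/longitude rule above translates directly into a pair of comparisons, sketched here in Python:

```python
# Range Check sketch for geographic coordinates.
def in_range(lat: float, lon: float) -> bool:
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

print(in_range(51.5, -0.12))   # True
print(in_range(123.0, 10.0))   # False: latitude outside [-90, 90]
```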
4) Format Check
Many data types have a predefined format. A Format Check will ensure that the
data is in the correct format. Date fields, for example, are stored in a fixed format
such as “YYYY-MM-DD” or “DD-MM-YYYY.” If the date is entered in any other format, it will be rejected. A National Insurance number, for another example, follows the pattern LL 99 99 99 L, where L can be any letter and 9 can be any digit.
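Both formats can be expressed as regular expressions; the sketch below checks only the shape of the values, not (for dates) whether the date actually exists:

```python
# Format Check sketch using regular expressions.
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")               # YYYY-MM-DD
NI_RE = re.compile(r"^[A-Z]{2} \d{2} \d{2} \d{2} [A-Z]$")  # LL 99 99 99 L

print(bool(DATE_RE.match("2023-04-01")))   # True
print(bool(DATE_RE.match("01-04-2023")))   # False: wrong format
print(bool(NI_RE.match("QQ 12 34 56 C")))  # True
```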
5) Consistency Check
A Consistency Check confirms that the data entered is logically consistent with related fields in the same record. For example, a delivery date should not fall before the corresponding shipping date.
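A sketch of such a cross-field rule (the shipping and delivery fields are a hypothetical example):

```python
# Consistency Check sketch: one field constrains another.
from datetime import date

def dates_consistent(shipped: date, delivered: date) -> bool:
    """A delivery date earlier than the shipping date is impossible."""
    return delivered >= shipped

print(dates_consistent(date(2023, 4, 1), date(2023, 4, 3)))  # True
print(dates_consistent(date(2023, 4, 5), date(2023, 4, 3)))  # False
```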
6) Uniqueness Check
Some data, such as IDs or e-mail addresses, are inherently unique. These fields in a
database should most likely have unique entries. A Uniqueness Check ensures
that an item is not entered into a database more than once.
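A sketch of this check against an in-memory set of existing values (a real system would query the database instead):

```python
# Uniqueness Check sketch: reject an e-mail address already on file.
existing_emails = {"ada@example.com", "alan@example.com"}

def is_unique(email: str) -> bool:
    return email.lower() not in existing_emails

print(is_unique("grace@example.com"))  # True
print(is_unique("ADA@example.com"))    # False: duplicate entry
```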
7) Presence Check
A Presence Check ensures that all mandatory fields are not left blank. If someone
tries to leave the field blank, an error message will be displayed, and they will be
unable to proceed to the next step or save any other data that they have entered. A
key field, for example, cannot be left blank in most databases.
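A minimal sketch, with a hypothetical set of mandatory fields:

```python
# Presence Check sketch: report any mandatory fields left blank.
REQUIRED_FIELDS = ("id", "name", "email")

def missing_fields(record: dict) -> list:
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

print(missing_fields({"id": 7, "name": "Ada", "email": "ada@example.com"}))  # []
print(missing_fields({"id": 7, "name": ""}))  # ['name', 'email']
```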
8) Length Check
A Length Check ensures that the appropriate number of characters is entered into the field. It verifies that the entered character string is neither too short nor too long. Consider a password that must be at least 8 characters long; the Length Check ensures that the field contains at least 8 characters.
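The password rule above, sketched with an assumed upper bound as well:

```python
# Length Check sketch: enforce minimum (and an assumed maximum) length.
def length_ok(value: str, minimum: int = 8, maximum: int = 64) -> bool:
    return minimum <= len(value) <= maximum

print(length_ok("hunter2"))        # False: only 7 characters
print(length_ok("correct horse"))  # True
```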
9) Look Up
Look Up assists in reducing errors in a field with a limited set of values. It consults a table to find the acceptable values. For example, because there are only 7 possible days in a week, the list of acceptable values for a day-of-week field is small and easy to check against.
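The day-of-week example as a sketch, consulting a small lookup table:

```python
# Look Up sketch: consult a fixed table of acceptable values.
DAYS_OF_WEEK = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}

def is_valid_day(value: str) -> bool:
    return value.strip().capitalize() in DAYS_OF_WEEK

print(is_valid_day("wed"))      # True
print(is_valid_day("Someday"))  # False
```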
What are the Methods to Perform Data Validation?
There are various methods for Data Validation available, and each method includes
specific features for the best Data Validation process.
● Validation by Scripts
● Validation by Programs
1) Validation by Scripts
In this method, developers write the validation logic themselves in a scripting language. This offers full control over the rules, but completing the process effectively requires extensive knowledge and hand-coding.
2) Validation by Programs
A) Open Source Tools
Open-source options are cost-effective, and developers can save further money if the tools are cloud-based. OpenRefine and SourceForge are two excellent examples of open-source tools.
B) Enterprise Tools
Various enterprise tools are available for the Data Validation process. Enterprise tools are secure and stable, but they require infrastructure and are more expensive than open-source tools. FME, for instance, is a tool used to repair and validate data.
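To make the script-based method concrete, here is a minimal sketch that strings several of the checks above together over a batch of records; all field names and rules are hypothetical:

```python
# Sketch of "validation by scripts": run several checks over a batch.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # rough shape check only

def validate(record: dict) -> list:
    errors = []
    if not record.get("name"):                       # presence check
        errors.append("name is required")
    if not EMAIL_RE.match(record.get("email", "")):  # format check
        errors.append("email format is invalid")
    if not 0 <= record.get("age", -1) <= 130:        # range check
        errors.append("age out of range")
    return errors

batch = [
    {"name": "Ada", "email": "ada@example.com", "age": 36},
    {"name": "", "email": "not-an-email", "age": 200},
]
for rec in batch:
    print(rec.get("name") or "<blank>", "->", validate(rec) or "OK")
```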
What are the Steps to Perform Data Validation?
Step 1: Determine the Data Sample
If you have a large amount of data to validate, you will need a sample rather than the entire dataset. To ensure the project's success, you must first understand and decide on the volume of the data sample, as well as the acceptable error rate.
Step 2: Database Validation
During the Database Validation process, you must ensure that all requirements are met by the existing database. To compare source and target data fields, the unique IDs and the number of records must be determined.
Determine the overall size of the data and how much of the source data is required for the targeted validation, and then search for inconsistencies, duplicate records, incorrect formats, and null field values.
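A sketch of this step using an in-memory SQLite database; the table and column names are hypothetical:

```python
# Sketch of Step 2: compare record counts and unique IDs between
# source and target tables. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE target_orders (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO source_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO target_orders VALUES (1, 10.0), (3, 30.0);  -- one row lost
""")

src_ids = {row[0] for row in conn.execute("SELECT id FROM source_orders")}
tgt_ids = {row[0] for row in conn.execute("SELECT id FROM target_orders")}

print("source rows:", len(src_ids), "target rows:", len(tgt_ids))
print("missing from target:", src_ids - tgt_ids)  # -> {2}

# Null-value check on a mandatory field in the target table
nulls = conn.execute(
    "SELECT COUNT(*) FROM target_orders WHERE amount IS NULL").fetchone()[0]
print("null amounts in target:", nulls)
```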