
Data Validation

Data validation involves checking data accuracy and quality before use. Common validation types include data type checks, code checks, range checks, format checks, consistency checks, uniqueness checks, presence checks, length checks, and lookups. Validation is often done using scripts or programs and involves determining data samples, validating databases and data formats, and identifying inconsistencies.

Uploaded by

rishabh28072002
Copyright
© All Rights Reserved

DATA VALIDATION

Data validation means checking the accuracy and quality of source data before using,
importing, or otherwise processing data. Different types of validation can be performed
depending on destination constraints or objectives. Data validation is a form of data
cleansing.

Why perform data validation?

When moving and merging data, it is important to make sure that data from different sources
and repositories conforms to business rules and does not become corrupted due to
inconsistencies in type or context. The goal is to create data that is consistent, accurate,
and complete, so as to prevent data loss and errors during a move.

When is data validation performed?

In data warehousing, data validation is often performed prior to the ETL (Extract,
Transform, Load) process. A data validation test is performed so that analysts can gain
insight into the scope or nature of data conflicts. Data validation is a general term,
however, and can be performed on any type of data, including data within a single
application (such as Microsoft Excel) or simple data being merged within a single data store.

Definition of Criteria for Validating Data Mining Models

Measures of data mining generally fall into the categories of accuracy, reliability, and
usefulness.

Accuracy is a measure of how well the model correlates an outcome with the attributes
in the data that has been provided. There are various measures of accuracy, but all
measures of accuracy are dependent on the data that is used. In reality, values might be
missing or approximate, or the data might have been changed by multiple processes.
Particularly in the phase of exploration and development, you might decide to accept a
certain amount of error in the data, especially if the data is fairly uniform in its
characteristics. For example, a model that predicts sales for a particular store based on past
sales can be strongly correlated and very accurate, even if that store consistently used the
wrong accounting method. Therefore, measurements of accuracy must be balanced by
assessments of reliability.

Reliability assesses the way that a data mining model performs on different data sets. A
data mining model is reliable if it generates the same type of predictions or finds the
same general kinds of patterns regardless of the test data that is supplied.

For example, the model that you generate for the store that used the wrong accounting
method would not generalize well to other stores, and therefore would not be reliable.
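
One simple way to put numbers on accuracy and reliability is sketched below. It assumes classification-style predictions that can be compared to known outcomes with `==`; the function names `accuracy` and `accuracy_spread` are illustrative and do not come from any particular data mining tool. A large spread between test sets suggests the model is not reliable, even if one of its accuracy scores is high.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the known outcomes."""
    if len(predictions) != len(actuals) or not actuals:
        raise ValueError("predictions and actuals must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


def accuracy_spread(test_sets):
    """Difference between the best and worst accuracy across several test sets.

    `test_sets` is a list of (predictions, actuals) pairs, one per test set.
    """
    scores = [accuracy(p, a) for p, a in test_sets]
    return max(scores) - min(scores)
```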

Usefulness includes various metrics that tell you whether the model provides useful
information. For example, a data mining model that correlates store location with sales
might be both accurate and reliable, but might not be useful, because you cannot generalize
that result by adding more stores at the same location. Moreover, it does not answer the
fundamental business question of why certain locations have more sales. You might also find
that a model that appears successful in fact is meaningless, because it is based on cross-
correlations in the data.

What are the Types of Data Validation?

Every organization will have its own set of rules for storing and maintaining data.
Setting basic data validation rules will assist your company in maintaining
organized standards that will make working with data more efficient. Most Data
Validation procedures will run one or more of these checks to ensure that the data
is correct before it is stored in the database.

The following are the common Data Validation Types:

● Data Type Check

● Code Check

● Range Check

● Format Check

● Consistency Check

● Uniqueness Check

● Presence Check

● Length Check

● Look Up

1) Data Type Check

A Data Type check ensures that data entered into a field is of the correct data
type. A field, for example, may only accept numeric data. The system should then
reject any data containing other characters, such as letters or special symbols, and
an error message should be displayed.
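
A minimal sketch of such a check in Python follows. The field name `quantity` and both function names are hypothetical, and "numeric" is taken here to mean ASCII digits only, so signs and decimal points are rejected.

```python
def is_numeric_field(value: str) -> bool:
    """Return True only if the field contains ASCII digits and nothing else."""
    return value.isascii() and value.isdigit()


def check_quantity(value: str) -> str:
    # Reject the entry with an error message instead of storing bad data.
    if not is_numeric_field(value):
        raise ValueError(f"quantity must be numeric, got {value!r}")
    return value
```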

2) Code Check

A Code Check ensures that a field's value is chosen from a valid list of values or that
certain formatting rules are followed. For example, the simplest way to verify the validity
of a postal code is to compare it against a list of valid codes. Other items, such as
country codes and NAICS industry codes, can be approached in the same way.
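
As a rough illustration, a code check can be a membership test against a reference list. The set below is a small hypothetical sample; a real list would be loaded from an authoritative source.

```python
# Hypothetical sample of valid codes, for illustration only.
VALID_COUNTRY_CODES = frozenset({"US", "GB", "IN", "DE", "FR"})


def check_country_code(code: str) -> bool:
    """Return True if the code, ignoring case and surrounding spaces, is in the list."""
    return code.strip().upper() in VALID_COUNTRY_CODES
```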

3) Range Check

A Range Check will determine whether the input data falls within a given range.
Latitude and longitude, for example, are frequently used in geographic data.
Latitude should be between -90 and 90, and longitude should be between -180 and
180. Any values outside of this range are considered invalid.
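
The latitude and longitude limits above could be checked as follows. This is a sketch; the function name and the error wording are illustrative.

```python
def check_coordinates(latitude: float, longitude: float) -> list[str]:
    """Return a list of error messages; an empty list means both values are in range."""
    errors = []
    if not -90 <= latitude <= 90:
        errors.append(f"latitude {latitude} is outside -90..90")
    if not -180 <= longitude <= 180:
        errors.append(f"longitude {longitude} is outside -180..180")
    return errors
```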

4) Format Check

Many data types have a predefined format. A Format Check will ensure that the
data is in the correct format. Date fields, for example, are stored in a fixed format
such as “YYYY-MM-DD” or “DD-MM-YYYY.” If the date is entered in any
other format, it will be rejected. Similarly, a UK National Insurance number looks like
this: LL 99 99 99 L, where L stands for a letter and 9 stands for a digit.
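
Both formats can be sketched with regular expressions. The example assumes the “YYYY-MM-DD” date layout and uppercase letters in the insurance number; it checks only the shape described above, not any further rules the issuing authority may apply.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
NI_PATTERN = re.compile(r"[A-Z]{2} [0-9]{2} [0-9]{2} [0-9]{2} [A-Z]")


def is_iso_date(text: str) -> bool:
    """True if text is laid out as YYYY-MM-DD and names a real calendar date."""
    if not DATE_PATTERN.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, but not a real date (for example 2023-02-30)
    return True


def is_ni_number_format(text: str) -> bool:
    """True if text has the shape LL 99 99 99 L."""
    return NI_PATTERN.fullmatch(text) is not None
```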

5) Consistency Check

A Consistency Check is a type of logical check that ensures data is entered in a
logically consistent manner. Checking if the delivery date for a parcel is after the
shipping date is one example.
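
The parcel example could look like this in Python. The sketch follows the wording above literally, so a delivery on the same day as shipping is treated as inconsistent; whether that is right depends on the business rule.

```python
from datetime import date


def delivered_after_shipped(shipped: date, delivered: date) -> bool:
    """True if the delivery date falls strictly after the shipping date."""
    return delivered > shipped
```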

6) Uniqueness Check

Some data, such as IDs or e-mail addresses, are inherently unique. These fields in a
database should most likely have unique entries. A Uniqueness Check ensures
that an item is not entered into a database more than once.
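
A sketch of a uniqueness check over a list of e-mail addresses follows. It assumes that addresses differing only in letter case or surrounding spaces count as the same entry, which is a choice made here and not a rule from the text.

```python
from collections import Counter


def find_duplicates(values):
    """Return the normalized values that appear more than once, in sorted order."""
    counts = Counter(v.strip().lower() for v in values)
    return sorted(v for v, n in counts.items() if n > 1)
```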

7) Presence Check

A Presence Check ensures that no mandatory field is left blank. If someone
tries to leave such a field blank, an error message will be displayed, and they will be
unable to proceed to the next step or save any other data that they have entered. A
key field, for example, cannot be left blank in most databases.
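
A minimal presence check might look like the following. The field names in `REQUIRED_FIELDS` are hypothetical, and a value made up only of spaces is treated as blank.

```python
# Hypothetical mandatory fields, for illustration only.
REQUIRED_FIELDS = ("customer_id", "email")


def missing_fields(record: dict) -> list[str]:
    """Return the names of mandatory fields that are absent or blank."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            missing.append(name)
    return missing
```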

8) Length Check

A Length Check ensures that the appropriate number of characters are entered
into the field. It verifies that the entered character string is neither too short nor too
long. Consider a password that must be at least 8 characters long. The Length
Check ensures that the field contains 8 or more characters.
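
A length check with a minimum of 8 characters can be sketched as below. The upper limit of 64 is an assumption added for illustration, since the text gives no maximum.

```python
def check_length(text: str, min_len: int = 8, max_len: int = 64) -> bool:
    """True if the string is neither too short nor too long."""
    return min_len <= len(text) <= max_len
```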

9) Look Up

Look Up assists in reducing errors in a field with a limited set of values. It consults
a table to find acceptable values. A day-of-week field, for example, has only 7 possible
values, so the list to check against is short.
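
The days-of-the-week example can be sketched as a look-up table. Suggesting the closest valid value with the standard library's `difflib.get_close_matches` is an extra added here to show how a look-up can help reduce errors; the function name `look_up_day` is illustrative.

```python
import difflib

DAYS_OF_WEEK = (
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
)


def look_up_day(value: str):
    """Return (is_valid, day); for invalid input, day is the closest match or None."""
    candidate = value.strip().title()
    if candidate in DAYS_OF_WEEK:
        return True, candidate
    close = difflib.get_close_matches(candidate, DAYS_OF_WEEK, n=1)
    return False, (close[0] if close else None)
```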

What are the Methods to Perform Data Validation?

There are various methods for Data Validation available, and each method includes
specific features for the best Data Validation process.

These methods to perform Data Validation are as follows:

● Validation by Scripts

● Validation by Programs

1) Validation by Scripts

● In this method, the validation process is carried out using a scripting
language such as Python, which is used to write the entire script for the
validation process.
● To ensure that all necessary information is within the required quality
parameters, you can compare data values and structure to your defined rules.
● This method of Data Validation can be time-consuming depending on the
complexity and size of the data set you are validating.
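
A script of this kind can be sketched as a table of rules applied to every record. The field names and the rules themselves are hypothetical; in practice the records would be read from a file or database instead of being passed in as a list.

```python
from typing import Callable

Rule = Callable[[str], bool]

# Hypothetical rules: one predicate per field.
RULES: dict[str, Rule] = {
    "id": lambda v: v.isdecimal(),
    "age": lambda v: v.isdecimal() and 0 <= int(v) <= 120,
    "email": lambda v: "@" in v,
}


def validate_records(records):
    """Return (row_number, field, value) for every value that breaks its rule."""
    errors = []
    for row_number, record in enumerate(records, start=1):
        for field, rule in RULES.items():
            value = record.get(field, "")  # a missing field is checked as an empty string
            if not rule(value):
                errors.append((row_number, field, value))
    return errors
```

Keeping the rules in one table means a new check can be added without touching the loop.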

2) Validation by Programs

● Many software programs are available to help you validate data.
● Because these programs have been developed to understand your rules and
the file structures you are working with, this method of validation is very
simple.
● The ideal tool will allow you to incorporate validation into every step of
your workflow without requiring a deep understanding of the underlying
format.

The different programs that can be used are:

● Open Source Tools

● Enterprise Tools

A) Open Source Tools

Open-source options are cost-effective, and developers can save further money if the
tools are cloud-based. However, in order to complete the process effectively, this
method necessitates extensive knowledge and hand-coding. OpenRefine is one example
of an open-source tool, and SourceForge hosts many other open-source projects.

B) Enterprise Tools

For the Data Validation process, various enterprise tools are available. Enterprise
tools are secure and stable, but they require infrastructure and are more expensive
than open-source tools. For instance, FME is an enterprise tool used to repair and
validate data.

What are the Steps to perform Data Validation?

The steps carried out to perform Data Validation are as follows:

● Determine Data Sample

● Database Validation

● Data Format Validation

Step 1: Determine Data Sample

If you have a large amount of data to validate, you will need a sample rather than
the entire dataset. To ensure the project’s success, you must first understand and
decide on the volume of the data sample, as well as the error rate.
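
Drawing a sample and measuring its error rate could be sketched as follows. The sample size, the fixed seed, and the `is_valid` callback are all assumptions supplied by the caller; the text does not prescribe how to choose them.

```python
import random


def draw_sample(records, sample_size, seed=0):
    """Pick a reproducible random sample without replacement from a list of records."""
    rng = random.Random(seed)
    return rng.sample(records, min(sample_size, len(records)))


def error_rate(sample, is_valid):
    """Fraction of sampled records that fail the supplied validity check."""
    if not sample:
        return 0.0
    failures = sum(1 for record in sample if not is_valid(record))
    return failures / len(sample)
```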

Step 2: Database Validation

You must ensure that all requirements are met with the existing database during the
Database Validation process. To compare source and target data fields, unique IDs
and the number of records must be determined.
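
The comparison of record counts and unique IDs can be sketched like this, assuming the IDs from each side have already been read into Python lists. The function name and the keys of the returned dictionary are illustrative.

```python
def compare_source_and_target(source_ids, target_ids):
    """Summarize record counts and ID mismatches between source and target."""
    source, target = set(source_ids), set(target_ids)
    return {
        "source_count": len(source_ids),
        "target_count": len(target_ids),
        "missing_in_target": sorted(source - target),
        "unexpected_in_target": sorted(target - source),
    }
```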

Step 3: Data Format Validation

Determine the overall state of the data and the changes the source data requires to
match the target, and then search for inconsistencies, duplicate data,
incorrect formats, and null field values.
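
The search for duplicates, incorrect formats, and null values can be sketched as a simple profile of the rows. The column names `id` and `created`, the “YYYY-MM-DD” date layout, and the treatment of empty strings as null are all assumptions made for illustration.

```python
import re
from collections import Counter

DATE_FORMAT = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def profile_rows(rows, key="id", date_field="created"):
    """Count duplicate keys, null or empty values, and badly formatted dates."""
    key_counts = Counter(row.get(key) for row in rows)
    return {
        "duplicate_keys": sorted(
            k for k, n in key_counts.items() if k is not None and n > 1
        ),
        "null_values": sum(
            1 for row in rows for value in row.values() if value in (None, "")
        ),
        # Empty dates are counted as nulls above, not as format errors.
        "bad_dates": sum(
            1 for row in rows
            if row.get(date_field) and not DATE_FORMAT.fullmatch(row[date_field])
        ),
    }
```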

What are the Benefits of Data Validation?

Some of the benefits of Data Validation are as follows:

● It is cost-effective because catching errors during dataset collection saves time
and money later.
● It removes duplication from the dataset, which makes the data simpler to use
and compatible with other processes.
● With improved information collection, data validation can directly help to
improve the business.
● It yields a consistent data structure: a standardized database and cleaned
dataset information.

What are the Limitations of Data Validation?

Some of the limitations of Data Validation are as follows:


● An organization's data may be spread across multiple databases that are not
kept in sync. As a result, data may be out of date, which can cause issues
when validating the data.
● When you have a large database, the process of data validation can be
time-consuming, particularly if you have to perform the validation manually.
