Data Quality and Data Preprocessing
For a period of time following each month, the data stored in the database are incomplete. However,
once all of the data are received, they are correct. The fact that the month-end data are not updated
in a timely fashion has a negative impact on the data quality.
Two other factors affecting data quality are believability and
interpretability. Believability reflects how much the data are trusted by users,
while interpretability reflects how easily the data are understood. Suppose that a database, at
one point, had several errors, all of which have since been corrected. The past errors, however,
had caused many problems for sales department users, and so they no longer trust the data. The
data also use many accounting codes, which the sales department does not know how to
interpret. Even though the database is now accurate, complete, consistent, and timely, sales
department users may regard it as of low quality due to poor believability and interpretability.
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies. If users believe the data
are dirty, they are unlikely to trust the results of any data mining that has been applied.
Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable
output. Although most mining routines have some procedures for dealing with incomplete or
noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting
the data to the function being modeled. Therefore, a useful preprocessing step is to run your
data through some data cleaning routines.
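To make these routines concrete, the following minimal sketch (not part of the original text) uses Python with pandas to fill in missing values, flag an outlier with a simple interquartile-range rule, and resolve inconsistent categorical labels; the customer records and column names such as annual_salary and branch are hypothetical.

```python
import pandas as pd

# Hypothetical customer records with typical "dirty data" problems:
# missing values, an extreme salary outlier, and inconsistent branch labels.
df = pd.DataFrame({
    "age": [23, 35, None, 41, 29],
    "annual_salary": [48000, 52000, 61000, 9900000, None],  # 9,900,000 is an outlier
    "branch": ["NY", "ny", "Boston", "NY", "boston"],
})

# Fill in missing values with the median of each numeric attribute.
for col in ["age", "annual_salary"]:
    df[col] = df[col].fillna(df[col].median())

# Identify outliers with a simple interquartile-range rule and remove them.
q1, q3 = df["annual_salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["annual_salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Resolve inconsistent categorical values by mapping them to canonical names.
df["branch"] = df["branch"].str.lower().map({"ny": "New York", "boston": "Boston"})

print(df)
```

In practice the choice between removing, capping, or smoothing a flagged value depends on the application; the rule above simply illustrates one common policy.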
Getting back to your task at AllElectronics, suppose that you would like to include data from
multiple sources in your analysis. This would involve integrating multiple databases, data
cubes, or files (i.e., data integration). Yet some attributes representing a given concept may
have different names in different databases, causing inconsistencies and redundancies. For
example, the attribute for customer identification may be referred to as customer_id in one data
store and cust_id in another. Naming inconsistencies may also occur for attribute values. For
example, the same first name could be registered as “Bill” in one database, “William” in
another, and “B.” in a third. Furthermore, you suspect that some attributes may be inferred
from others (e.g., annual revenue). Having a large amount of redundant data may slow down
or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must
be taken to help avoid redundancies during data integration. Typically, data cleaning and data
integration are performed as a preprocessing step when preparing data for a data warehouse.
Additional data cleaning can be performed to detect and remove redundancies that may have
resulted from data integration.
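As an illustration of this kind of schema reconciliation and redundancy detection (a minimal sketch, not a prescribed procedure), the pandas code below renames cust_id to customer_id before merging two hypothetical extracts, and then drops an attribute that turns out to be perfectly correlated with, and hence derivable from, another.

```python
import pandas as pd

# Hypothetical extracts from two data stores that name the customer key differently.
store_a = pd.DataFrame({"customer_id": [1, 2, 3],
                        "monthly_revenue": [4.0, 7.5, 3.2]})
store_b = pd.DataFrame({"cust_id": [1, 2, 3],
                        "annual_revenue": [48.0, 90.0, 38.4],
                        "first_name": ["Bill", "William", "B."]})

# Reconcile the schema: map both keys to one canonical attribute name before merging.
store_b = store_b.rename(columns={"cust_id": "customer_id"})
merged = store_a.merge(store_b, on="customer_id", how="inner")

# Check for redundancy: annual_revenue may be derivable from monthly_revenue.
corr = merged["monthly_revenue"].corr(merged["annual_revenue"])
if abs(corr) > 0.99:  # near-perfect correlation suggests a redundant attribute
    merged = merged.drop(columns=["annual_revenue"])

print(merged.columns.tolist())
```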
“Hmmm,” you wonder, as you consider your data even further. “The data set I have selected
for analysis is HUGE, which is sure to slow down the mining process. Is there a way I can
reduce the size of my data set without jeopardizing the data mining results?” Data
reduction obtains a reduced representation of the data set that is much smaller in volume, yet
produces the same (or almost the same) analytical results. Data reduction strategies
include dimensionality reduction
techniques (e.g., wavelet transforms and principal components analysis), attribute subset
selection (e.g., removing irrelevant attributes), and attribute construction (e.g., where a small
set of more useful attributes is derived from the original set).
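A brief sketch of dimensionality reduction, assuming scikit-learn is available and using synthetic data rather than the AllElectronics tables: principal components analysis keeps just enough components to explain a chosen share of the variance, so far fewer attributes need to be mined.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for a wide customer table: 1,000 records whose
# 20 numeric attributes are noisy mixtures of only 4 underlying factors.
factors = rng.normal(size=(1000, 4))
X = factors @ rng.normal(size=(4, 20)) + 0.05 * rng.normal(size=(1000, 20))

# Dimensionality reduction: keep just enough principal components to
# explain 95% of the variance instead of all 20 original attributes.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # typically (1000, 20) -> (1000, 4)
```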
Getting back to your data, you have decided, say, that you would like to use a distance-based
mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or
clustering. Such methods provide better results if the data to be analyzed have
been normalized, that is, scaled to a smaller range such as [0.0, 1.0]. Your customer data, for
example, contain the attributes age and annual salary. The annual salary attribute usually
takes much larger values than age. Therefore, if the attributes are left unnormalized, the
distance measurements taken on annual salary will generally outweigh distance measurements
taken on age. Discretization and concept hierarchy generation can also be useful, where raw
data values for attributes are replaced by ranges or higher conceptual levels. For example, raw
values for age may be replaced by higher-level concepts, such as youth, adult, or senior.
Discretization and concept hierarchy generation are powerful tools for data mining in that they
allow data mining at multiple abstraction levels. Normalization, data discretization, and
concept hierarchy generation are forms of data transformation. You soon realize such data
transformation operations are additional data preprocessing procedures that would contribute
toward the success of the mining process.
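The sketch below (illustrative only; the values, cut points, and column names are assumptions) shows min-max normalization of age and annual salary to [0.0, 1.0], and discretization of age into the higher-level concepts youth, adult, and senior, using pandas.

```python
import pandas as pd

# Hypothetical customer attributes; the values are illustrative.
customers = pd.DataFrame({
    "age": [19, 34, 52, 67, 45],
    "annual_salary": [28000, 61000, 84000, 39000, 72000],
})

# Min-max normalization: rescale both attributes to [0.0, 1.0] so that
# annual_salary no longer dominates distance computations over age.
normalized = (customers - customers.min()) / (customers.max() - customers.min())

# Discretization / concept hierarchy generation: replace raw age values
# with the higher-level concepts youth, adult, and senior.
customers["age_group"] = pd.cut(customers["age"],
                                bins=[0, 24, 59, 120],
                                labels=["youth", "adult", "senior"])

print(normalized.round(2))
print(customers[["age", "age_group"]])
```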
Figure 3.1 summarizes the data preprocessing steps described here. Note that the previous
categorization is not mutually exclusive. For example, the removal of redundant data may be
seen as a form of data cleaning, as well as data reduction.
In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing
techniques can improve data quality, thereby helping to improve the accuracy and efficiency
of the subsequent mining process. Data preprocessing is an important step in the knowledge
discovery process, because quality decisions must be based on quality data. Detecting data
anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs
for decision making.