Why Data Preprocessing?

► Data in the real world is dirty


 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42”, Birthday=“03/07/1997”
 e.g., rating was “1, 2, 3”, now rating is “A, B, C”
 e.g., discrepancy between duplicate records
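
As a minimal sketch (pandas assumed; the table, column names, and the range check below are hypothetical), such dirty values can be flagged before cleaning:

    import pandas as pd

    # hypothetical records illustrating the three kinds of dirt above
    df = pd.DataFrame({
        "occupation": ["engineer", " ", "teacher"],
        "salary":     [52000, -10, 48000],
        "rating":     ["1", "A", "B"],
    })

    incomplete   = df["occupation"].str.strip() == ""  # blank stands in for a missing value
    noisy        = df["salary"] < 0                    # outside any plausible range
    inconsistent = df["rating"].str.isdigit()          # old numeric codes mixed with new letter codes

    print(df[incomplete | noisy | inconsistent])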

Why Is Data Dirty?


 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Why Is Data Preprocessing Important?

■ No quality data, no quality mining results!
o Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading statistics
o A data warehouse needs consistent integration of quality data
■ Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality


► A well-accepted multidimensional view:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility
► Broad categories:
 Intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
 Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
 Data integration
o Integration of multiple databases, data cubes, or files
 Data reduction
o Obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results

Forms of Data Preprocessing

(figure: overview of the forms of data preprocessing, i.e., cleaning, integration, and reduction)

Data Cleaning
1. Importance
 “Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
 “Data cleaning is the number one problem in data warehousing”—
DCI survey
2. Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with
 a global constant, e.g., “unknown” (which may act as a new class!)
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or a decision tree.
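
As a rough pandas sketch of the automatic strategies above (the table, class column, and sentinel value are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "cls":    ["a", "a", "b", "b", "b"],
        "income": [30.0, None, 50.0, None, 70.0],
    })

    # a global constant: a sentinel such as -1 (or "unknown" for a categorical attribute)
    by_constant = df["income"].fillna(-1)

    # the attribute mean over all tuples
    by_mean = df["income"].fillna(df["income"].mean())

    # the attribute mean within the same class: smarter
    by_class_mean = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))

    print(by_class_mean.tolist())  # [30.0, 30.0, 50.0, 60.0, 70.0]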

Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistent naming conventions
 Other data problems that require data cleaning
 duplicate records
 incomplete data
 inconsistent data
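
A common way to smooth such noise (not detailed in this extract) is binning; a minimal pandas sketch that replaces each value by the mean of its equal-depth bin, using illustrative price values:

    import pandas as pd

    prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])  # sorted values

    # partition into equal-depth (equal-frequency) bins, then smooth by bin means
    bins = pd.qcut(prices, q=3, labels=False)
    smoothed = prices.groupby(bins).transform("mean")
    print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]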

Data Integration
 Data integration:
► Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
► Integrate metadata from different sources
 Entity identification problem:
► Identify real-world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
► For the same real-world entity, attribute values from different sources
are different
► Possible reasons: different representations, different scales, e.g.,
metric vs. British units
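
A minimal sketch of schema integration and conflict resolution (pandas assumed; the tables, key names, and the kg/lb conflict are hypothetical):

    import pandas as pd

    a = pd.DataFrame({"cust_id": [1, 2], "weight_kg": [70.0, 82.5]})
    b = pd.DataFrame({"cust_num": [1, 2], "weight_lb": [154.3, 181.9]})

    # schema integration: map A.cust-id and B.cust-# onto one key name
    b = b.rename(columns={"cust_num": "cust_id"})

    # resolve a value conflict caused by different scales: pounds -> kilograms
    b["weight_kg"] = b["weight_lb"] / 2.20462

    merged = a.merge(b[["cust_id", "weight_kg"]], on="cust_id", suffixes=("_a", "_b"))
    print((merged["weight_kg_a"] - merged["weight_kg_b"]).abs())  # residual conflicts, near zero here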

Handling Redundancy in Data Integration


o Redundant data often occur when integrating multiple databases
 Object identification: the same attribute or object may have different names in different databases
 Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
o Redundant attributes can often be detected by correlation analysis (see the sketch below)
o Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
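
A rough sketch of detecting a derivable attribute by correlation analysis (pandas/numpy assumed; the data and the 0.95 threshold are illustrative):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    sales = rng.uniform(100, 1000, size=200)
    df = pd.DataFrame({
        "monthly_sales": sales,
        "annual_revenue": sales * 12 + rng.normal(0, 5, size=200),  # derivable from monthly_sales
        "region_code": rng.integers(1, 5, size=200),
    })

    # a correlation coefficient near 1 suggests one attribute is derivable from the other
    corr = df.corr().abs()
    redundant = [(i, j) for i in corr.columns for j in corr.columns
                 if i < j and corr.loc[i, j] > 0.95]
    print(redundant)  # [('annual_revenue', 'monthly_sales')]
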
Data Reduction Strategies
◇ Why data reduction?
⇰ A database/data warehouse may store terabytes of data
⇰ Complex data analysis/mining may take a very long time to run on the complete data set
◇ Data reduction
⇰ Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
◇ Data reduction strategies (two are sketched below)
⇰ Data cube aggregation
⇰ Dimensionality reduction, e.g., remove unimportant attributes
⇰ Data compression
⇰ Numerosity reduction, e.g., fit data into models
⇰ Discretization and concept hierarchy generation
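
A minimal sketch of two of these strategies, numerosity reduction by random sampling and dimensionality reduction by principal components (numpy and scikit-learn assumed; the synthetic data are illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # synthetic data: 20 observed attributes driven by only 5 latent factors
    latent = rng.normal(size=(10000, 5))
    X = latent @ rng.normal(size=(5, 20)) + 0.01 * rng.normal(size=(10000, 20))

    # numerosity reduction: analyze a random sample instead of the full set
    sample = X[rng.choice(len(X), size=1000, replace=False)]

    # dimensionality reduction: keep enough principal components for 95% of the variance
    reduced = PCA(n_components=0.95).fit_transform(sample)
    print(sample.shape, "->", reduced.shape)  # about (1000, 20) -> (1000, 5)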
