Chapter 2 - Data Preprocessing

The document discusses the importance of data preprocessing for data mining. It describes how real-world data is often dirty: incomplete, noisy, or inconsistent. It then explains the major tasks of data preprocessing: cleaning, integration, and reduction.

Chapter 2

Data Preprocessing

Eng. Ali sheak Ahmed

[email protected]
090-7731966

* Data Mining: Concepts and Techniques 1

Outline

■ Why preprocess the data?

■ Descriptive data summarization
■ Data cleaning
■ Data integration and transformation
■ Data reduction


Why Data Preprocessing?
■ Data in the real world is dirty
  ■ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ■ e.g., occupation=“ ”
  ■ noisy: containing errors or outliers
    ■ e.g., Salary=“-10”
  ■ inconsistent: containing discrepancies in codes or names
    ■ e.g., Age=“42” Birthday=“03/07/1997”
    ■ e.g., Was rating “1,2,3”, now rating “A, B, C”
    ■ e.g., discrepancy between duplicate records
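The three kinds of dirtiness listed above can be checked mechanically. The following is a minimal sketch in plain Python, not a method taken from the slides. It assumes records are dictionaries; the field names, the helper `find_issues`, the reference date, and the reading of 03/07/1997 as month/day/year are all illustrative assumptions.

```python
from datetime import date

def find_issues(record, today):
    """Return a list of data-quality problems found in one record (hypothetical helper)."""
    issues = []
    # Incomplete: a required attribute is missing or blank.
    if not str(record.get("occupation", "")).strip():
        issues.append("incomplete: occupation is empty")
    # Noisy: a value outside its plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: negative salary")
    # Inconsistent: the stated age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied = today.year - birthday.year - (0 if had_birthday else 1)
        if abs(implied - age) > 1:
            issues.append("inconsistent: age does not match birthday")
    return issues

# The slide's own examples, combined into one record.
dirty = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)}
clean = {"occupation": "engineer", "salary": 5000, "age": 26, "birthday": date(1997, 3, 7)}

reference_day = date(2024, 1, 1)  # assumed analysis date
for problem in find_issues(dirty, reference_day):
    print(problem)
```

Passing the reference date explicitly keeps the age check reproducible; a real pipeline would more likely use the date on which the data was collected.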

Why Is Data Dirty?
■ Incomplete data may come from
  ■ “Not applicable” data value when collected
  ■ Different considerations between the time when the data was collected and when it is analyzed
  ■ Human/hardware/software problems
■ Noisy data (incorrect values) may come from
  ■ Faulty data collection instruments
  ■ Human or computer error at data entry
  ■ Errors in data transmission
■ Inconsistent data may come from
  ■ Different data sources
  ■ Functional dependency violation (e.g., modifying some linked data)
■ Duplicate records also need data cleaning


Why Is Data Preprocessing Important?

■ No quality data, no quality mining results!
  ■ Quality decisions must be based on quality data
    ■ e.g., duplicate or missing data may cause incorrect or even misleading statistics
  ■ A data warehouse needs consistent integration of quality data
■ Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse


Multi-Dimensional Measure of Data Quality

■ A well-accepted multidimensional view:
  ■ Accuracy
  ■ Completeness
  ■ Consistency
  ■ Timeliness
  ■ Believability
  ■ Value added
  ■ Interpretability
  ■ Accessibility
■ Broad categories:
  ■ Intrinsic, contextual, representational, and accessibility


Major Tasks in Data Preprocessing

■ Data cleaning
  ■ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
■ Data integration
  ■ Integration of multiple databases, data cubes, or files
■ Data reduction
  ■ Obtains a representation that is reduced in volume but produces the same or similar analytical results


Forms of Data Preprocessing


Data Cleaning

■ Importance
  ■ “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  ■ “Data cleaning is the number one problem in data warehousing” (DCI survey)
■ Data cleaning tasks
  ■ Fill in missing values
  ■ Identify outliers and smooth out noisy data
  ■ Correct inconsistent data
  ■ Resolve redundancy caused by data integration


How to Handle Missing Data?
■ Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill it in automatically with
  ■ a global constant: e.g., “unknown”, a new class?!
  ■ the attribute mean
  ■ the attribute mean for all samples belonging to the same class: smarter
  ■ the most probable value: inference-based, such as a Bayesian formula or decision tree
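The automatic strategies above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming rows are dictionaries and a missing value is represented by `None`; the attribute names (`income`, `risk`) and the function names are hypothetical. The inference-based strategy (Bayesian formula or decision tree) is not shown.

```python
from collections import defaultdict
from statistics import mean

def drop_missing_label(rows, label):
    """Ignore the tuple: drop rows whose class label is missing."""
    return [r for r in rows if r[label] is not None]

def fill_constant(rows, attr, constant="unknown"):
    """Fill missing values of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Fill missing values of attr with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]

def fill_class_mean(rows, attr, label):
    """Fill missing values of attr with the mean over rows of the same class.

    Assumes every class has at least one observed value of attr.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    means = {cls: mean(vals) for cls, vals in observed.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]} for r in rows]

rows = [
    {"income": 30, "risk": "high"},
    {"income": 50, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 90, "risk": "low"},
    {"income": 110, "risk": "low"},
    {"income": None, "risk": "low"},
]

print(fill_mean(rows, "income")[2]["income"])              # overall mean
print(fill_class_mean(rows, "income", "risk")[2]["income"])  # mean of the "high" class
```

In this toy data the overall mean sits between the two classes, so it fits neither well; the class-conditional mean keeps each filled value close to its own group, which is why the slide calls it "smarter".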

Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
  ■ faulty data collection instruments
  ■ data entry problems
  ■ data transmission problems
  ■ technology limitation
  ■ inconsistency in naming convention
■ Other data problems which require data cleaning
  ■ duplicate records
  ■ incomplete data
  ■ inconsistent data
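Two of the problems above, noisy values and duplicate records, lend themselves to a short sketch. The slides do not name a detection technique, so the interquartile-range (IQR) rule used here is a swapped-in, commonly used choice rather than the author's method. The function names and sample data are hypothetical.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def drop_duplicates(records):
    """Remove exact duplicate records, keeping the first occurrence.

    Assumes all field values are hashable.
    """
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

salaries = [48, 50, 51, 52, 53, 55, 300]
outliers = flag_outliers(salaries)
print(outliers)

customers = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 1, "name": "Ann"},
]
unique_customers = drop_duplicates(customers)
print(len(unique_customers))
```

Flagging is deliberately separate from removal: whether an extreme value is an error or a genuine observation usually needs a domain decision.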


Data Integration
■ Data integration:
  ■ Combines data from multiple sources into a coherent store
■ Schema integration: e.g., A.cust-id ≡ B.cust-#
  ■ Integrate metadata from different sources
■ Entity identification problem:
  ■ Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
■ Detecting and resolving data value conflicts
  ■ For the same real world entity, attribute values from different sources are different
  ■ Possible reasons: different representations, different scales, e.g., metric vs. British units
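Schema integration and value-conflict detection can be illustrated together. The sketch below assumes two hypothetical sources: source A keys customers by `cust-id` and stores weight in kilograms, while source B keys them by `cust-#` and stores weight in pounds. The `weight_*` attributes, the tolerance, and the function `integrate` are assumptions made for illustration. Name-based entity identification (Bill Clinton = William Clinton) is not covered.

```python
LB_TO_KG = 0.45359237  # one avoirdupois pound in kilograms

def integrate(source_a, source_b, tolerance_kg=0.5):
    """Merge two customer sources into one store keyed by customer id.

    Returns (merged, conflicts); conflicts lists the ids whose weights
    disagree between the sources by more than tolerance_kg.
    """
    merged = {}
    conflicts = []
    # Source A already uses metric units; only the key name is normalized.
    for row in source_a:
        merged[row["cust-id"]] = {"cust_id": row["cust-id"], "weight_kg": row["weight_kg"]}
    # Source B: map cust-# onto the same key and convert pounds to kilograms.
    for row in source_b:
        cust_id = row["cust-#"]
        weight_kg = round(row["weight_lb"] * LB_TO_KG, 1)
        if cust_id not in merged:
            merged[cust_id] = {"cust_id": cust_id, "weight_kg": weight_kg}
        elif abs(merged[cust_id]["weight_kg"] - weight_kg) > tolerance_kg:
            conflicts.append(cust_id)  # keep A's value, report the conflict
    return merged, conflicts

source_a = [
    {"cust-id": 1, "weight_kg": 70.0},
    {"cust-id": 2, "weight_kg": 82.0},
]
source_b = [
    {"cust-#": 1, "weight_lb": 154.3},
    {"cust-#": 2, "weight_lb": 150.0},
    {"cust-#": 3, "weight_lb": 200.0},
]

merged, conflicts = integrate(source_a, source_b)
print(sorted(merged), conflicts)
```

Converting to a common unit before comparing is what separates a mere difference in scale (customer 1) from a real value conflict (customer 2).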


Handling Redundancy in Data Integration

■ Redundant data often occur when integrating multiple databases
  ■ Object identification: The same attribute or object may have different names in different databases
  ■ Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
■ Redundant attributes may be detectable by correlation analysis
■ Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
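As a rough illustration of correlation analysis for redundancy detection, the sketch below computes the Pearson correlation coefficient between every pair of numeric attributes and reports pairs whose absolute correlation exceeds a threshold. The threshold of 0.95, the attribute names, and the sample values are assumptions; the slides do not specify which correlation measure to use.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (a constant list has zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def redundant_pairs(table, threshold=0.95):
    """Return attribute pairs whose absolute correlation reaches the threshold."""
    names = list(table)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(table[a], table[b])) >= threshold:
                pairs.append((a, b))
    return pairs

table = {
    "monthly_revenue": [10, 20, 30, 40, 50],
    "annual_revenue": [120, 240, 360, 480, 600],  # derived: 12 x monthly
    "employees": [5, 3, 8, 2, 7],
}

print(redundant_pairs(table))
```

A high correlation only suggests redundancy. Whether one attribute can really be dropped still depends on knowing how it was derived.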


Data Reduction Strategies

■ Why data reduction?
  ■ A database/data warehouse may store terabytes of data
  ■ Complex data analysis/mining may take a very long time to run on the complete data set
■ Data reduction
  ■ Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
■ Data reduction strategies
  ■ Data cube aggregation
  ■ Dimensionality reduction, e.g., remove unimportant attributes
  ■ Data compression
  ■ Numerosity reduction, e.g., fit data into models
  ■ Discretization and concept hierarchy generation
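Two of the strategies above are easy to sketch in plain Python: data cube aggregation (rolling quarterly sales up to yearly totals) and numerosity reduction (replacing data points by the two parameters of a fitted line). The sample figures and function names are hypothetical, and least-squares line fitting is only one possible model for "fit data into models".

```python
from collections import defaultdict

def aggregate_by_year(sales):
    """Data cube aggregation: roll (year, quarter, amount) rows up to yearly totals."""
    totals = defaultdict(int)
    for year, _quarter, amount in sales:
        totals[year] += amount
    return dict(totals)

def fit_line(xs, ys):
    """Numerosity reduction: keep only the slope and intercept of a least-squares line.

    Assumes xs contains at least two distinct values.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

sales = [
    (2022, "Q1", 224), (2022, "Q2", 408), (2022, "Q3", 350), (2022, "Q4", 586),
    (2023, "Q1", 300), (2023, "Q2", 416), (2023, "Q3", 380), (2023, "Q4", 612),
]
yearly = aggregate_by_year(sales)
print(yearly)  # eight rows reduced to two

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # four points reduced to two parameters
```

Both reductions trade detail for volume: the yearly totals can no longer answer per-quarter questions, and the fitted line reproduces the points only as well as the model fits.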


End
