Module2 DataPreprocessing

Data Preprocessing in Data Mining

Uploaded by

Ceejay Estigoy


Data Mining:

Concepts and Techniques

(3rd ed.)

— Chapter 3 —

Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview

■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary

Data Mining: Concepts and Techniques 2

What went wrong?
Imagine that you are a manager at AllElectronics and have been
charged with analyzing the company’s data with respect to your
branch’s sales. You immediately set out to perform this task. You
carefully inspect the company’s database and data warehouse,
identifying and selecting the attributes or dimensions (e.g., item,
price, and units sold) to be included in your analysis. Alas! You
notice that several of the attributes for various tuples have no
recorded value. For your analysis, you would like to include
information as to whether each item purchased was advertised as on
sale, yet you discover that this information has not been recorded.
Furthermore, users of your database system have reported errors,
unusual values, and inconsistencies in the data recorded for some
transactions.


Data Quality: Why Preprocess the Data?
■ Measures for data quality: A multidimensional view
■ Accuracy: correct or wrong, accurate or not
■ Completeness: not recorded, unavailable, …
■ Consistency: some modified but some not, dangling, …
■ Timeliness: timely update?
■ Believability: how much are the data trusted to be correct?
■ Interpretability: how easily the data can be
understood?


Major Tasks in Data Preprocessing
■ Data cleaning
■ Fill in missing values, smooth noisy
data, identify or remove outliers, and
resolve inconsistencies
■ Data integration
■ Integration of multiple databases, data
cubes, or files
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data discretization
■ Normalization
■ Concept hierarchy generation
Figure 1. Forms of data preprocessing



Data Cleaning
■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission error
■ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

■ e.g., Occupation = “ ” (missing data)

■ noisy: containing noise, errors, or outliers

■ e.g., Salary = “−10” (an error)

■ inconsistent: containing discrepancies in codes or names, e.g.,

■ Age = “42”, Birthday = “03/07/2010”

■ Was rating “1, 2, 3”, now rating “A, B, C”

■ discrepancy between duplicate records

■ Intentional (e.g., disguised missing data)

■ Jan. 1 as everyone’s birthday?


Incomplete (Missing) Data

■ Data is not always available

■ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
■ Missing data may be due to
■ equipment malfunction
■ data inconsistent with other recorded data and thus deleted
■ data not entered due to misunderstanding
■ certain data not considered important at the time of entry
■ history or changes of the data not registered
■ Missing data may need to be inferred
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
■ Fill in the missing value manually: tedious and often infeasible
■ Fill it in automatically with
■ a global constant : e.g., “unknown”, a new class?!
■ the attribute mean
■ the attribute mean for all samples belonging to the same
class: smarter
■ the most probable value: inference-based such as Bayesian
formula or decision tree
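The automatic fill-in strategies listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the records, the class labels, and the helper names are hypothetical, and `None` is assumed to mark a missing value. The class-wise version also assumes every class has at least one observed value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (class_label, income); None marks a missing value.
records = [
    ("low_risk", 50.0),
    ("low_risk", None),
    ("low_risk", 70.0),
    ("high_risk", 20.0),
    ("high_risk", None),
    ("high_risk", 30.0),
]

def fill_with_constant(rows, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [(label, constant if value is None else value) for label, value in rows]

def fill_with_mean(rows):
    """Replace missing values with the mean of all observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(label, overall if value is None else value) for label, value in rows]

def fill_with_class_mean(rows):
    """Replace missing values with the mean of observed values in the same class."""
    observed = defaultdict(list)
    for label, value in rows:
        if value is not None:
            observed[label].append(value)
    class_means = {label: mean(values) for label, values in observed.items()}
    return [(label, class_means[label] if value is None else value)
            for label, value in rows]

by_constant = fill_with_constant(records)
by_mean = fill_with_mean(records)
by_class_mean = fill_with_class_mean(records)
```

In this toy data the overall mean pulls both gaps toward the same number, while the class-wise mean keeps the two classes apart, which is the sense in which the slide calls it "smarter".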
Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments

■ data entry problems

■ data transmission problems

■ technology limitation

■ inconsistency in naming convention

■ Other data problems which require data cleaning

■ duplicate records

■ incomplete data

■ inconsistent data

How to Handle Noisy Data?

■ Binning
■ first sort data and partition into (equal-frequency) bins

■ then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

■ Regression
■ smooth by fitting the data into regression functions

■ Clustering
■ detect and remove outliers

■ Combined computer and human inspection

■ detect suspicious values and have a human check them (e.g., deal with possible outliers)
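The binning approach above can be made concrete with a short sketch. The price values are illustrative, the function names are hypothetical, and the equal-frequency split assumes the number of values divides evenly by the number of bins.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by means flattens each bin to a single value, while smoothing by boundaries keeps the extremes of each bin and moves the interior values to whichever extreme is nearer.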

Data Cleaning as a Process
■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution)
■ Check field overloading
■ Check uniqueness rule, consecutive rule and null rule
■ Use commercial tools
■ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
■ Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

■ Data migration and integration
■ Data migration tools: allow transformations to be specified
■ ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
■ Integration of the two processes
■ Iterative and interactive (e.g., Potter’s Wheel)
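As a rough illustration of the uniqueness, consecutive, and null rules mentioned under discrepancy detection, the checks below flag violations in single columns. The function names, the sample columns, and the set of null markers are assumptions made for this sketch; the consecutive check also assumes a non-empty column of integers.

```python
def violates_unique_rule(values):
    """Unique rule: every value of the attribute must differ from all others."""
    return len(set(values)) != len(values)

def violates_consecutive_rule(values):
    """Consecutive rule: no gaps between the lowest and highest value, all unique."""
    return sorted(values) != list(range(min(values), max(values) + 1))

def null_positions(values, null_markers=("", "?", None)):
    """Null rule: report positions holding a marker that stands for 'missing'."""
    return [i for i, v in enumerate(values) if v in null_markers]

# Hypothetical columns used only for illustration.
customer_ids = [7, 8, 8, 9]
check_numbers = [101, 102, 104, 105]
occupations = ["clerk", "", "engineer", None]

duplicate_ids = violates_unique_rule(customer_ids)
gap_in_checks = violates_consecutive_rule(check_numbers)
missing_occupations = null_positions(occupations)
```

Checks like these only detect discrepancies; deciding how to correct them still falls to the scrubbing and auditing steps described above.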
Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ Schema integration: e.g., A.cust-id ≡ B.cust-#
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from different sources are
different
■ Possible reasons: different representations, different scales, e.g., metric
vs. British units
Handling Redundancy in Data Integration

■ Redundant data often occur when multiple databases are integrated
■ Object identification: The same attribute or object may
have different names in different databases
■ Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
■ Redundant attributes can often be detected by
correlation analysis and covariance analysis
■ Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
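To show how correlation analysis can expose a derived attribute, the sketch below computes covariance and the Pearson correlation coefficient for numeric attributes. The attribute names and values are hypothetical, and annual revenue is assumed to be exactly 12 times monthly revenue purely for illustration.

```python
from math import sqrt
from statistics import mean

def covariance(a, b):
    """Population covariance: mean product of deviations from the two means."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def pearson_correlation(a, b):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))

# Hypothetical attributes drawn from two sources being integrated.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]  # derived attribute
units_returned = [3.0, 1.0, 4.0, 2.0]

r_derived = pearson_correlation(monthly_revenue, annual_revenue)
r_other = pearson_correlation(monthly_revenue, units_returned)
```

A coefficient at or very near 1 (or -1) suggests one attribute carries no information beyond the other and is a candidate for removal; a high correlation alone does not prove redundancy, so it is a prompt for inspection rather than an automatic rule.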
Histogram Analysis
■ Divide data into buckets and
store average (sum) for each
bucket
■ Partitioning rules:
■ Equal-width: equal bucket
range
■ Equal-frequency (or
equal-depth)
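The two partitioning rules can be compared on the same data with a small sketch. The price list and function names are hypothetical, and the equal-width version assumes the values are not all identical (otherwise the bucket width would be zero).

```python
def equal_width_buckets(values, n_buckets):
    """Split the value range into n_buckets intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        # The maximum value goes into the last bucket rather than a new one.
        index = min(int((v - low) / width), n_buckets - 1)
        buckets[index].append(v)
    return buckets

def equal_frequency_buckets(values, n_buckets):
    """Split the sorted values so each bucket holds roughly the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
            for i in range(n_buckets)]

def summarize(buckets):
    """Keep only (count, sum) per bucket instead of the raw values."""
    return [(len(b), sum(b)) for b in buckets]

prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 20, 30]
width_summary = summarize(equal_width_buckets(prices, 3))
frequency_summary = summarize(equal_frequency_buckets(prices, 3))
```

On skewed data like this, equal-width buckets end up with very uneven counts, while equal-frequency buckets keep the counts balanced at the cost of uneven ranges.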

Clustering
■ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
■ Can be very effective if data is clustered but not if data is
“smeared”
■ Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms
■ Cluster analysis will be studied in depth in Chapter 10
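Storing only a cluster representation can be sketched as follows. The cluster assignments are assumed to be given already (no clustering algorithm is run here), the points are hypothetical, and diameter is taken to mean the largest distance between any two points in a cluster.

```python
from itertools import combinations
from math import dist

def centroid(points):
    """Coordinate-wise mean of the points in one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def diameter(points):
    """Largest distance between any two points in the cluster."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)

# Hypothetical clusters; in practice these come from a clustering algorithm.
clusters = {
    "a": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "b": [(10.0, 10.0), (13.0, 14.0)],
}

# Reduced representation: one (centroid, diameter) pair per cluster.
summary = {name: (centroid(pts), diameter(pts)) for name, pts in clusters.items()}
```

The reduction replaces every point with two numbers per cluster, which is only a faithful summary when the points really do group tightly.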

Sampling

■ Sampling: obtaining a small sample of size s to represent the whole data set of size N
■ Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
■ Key principle: Choose a representative subset of the data
■ Simple random sampling may have very poor performance in
the presence of skew
■ Develop adaptive sampling methods, e.g., stratified sampling
■ Note: Sampling may not reduce database I/Os (page at a time)

Types of Sampling

■ Simple random sampling

■ There is an equal probability of selecting any particular item

■ Sampling without replacement

■ Once an object is selected, it is removed from the population

■ Sampling with replacement

■ A selected object is not removed from the population

■ Stratified sampling:
■ Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
■ Used in conjunction with skewed data
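The sampling types above map directly onto Python's standard `random` module. The sketch below is illustrative only: the customer data, the stratum function, and the 20% fraction are hypothetical, and the fixed seed is there just to make the run repeatable.

```python
import random
from collections import defaultdict

def srswor(data, s, rng):
    """Simple random sample without replacement: no item is drawn twice."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample with replacement: an item may be drawn again."""
    return rng.choices(data, k=s)

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw about the same fraction from every stratum (at least one item each)."""
    strata = defaultdict(list)
    for item in data:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed only to make the sketch repeatable

# Hypothetical skewed data set: 95 "young" and 5 "senior" customers.
customers = [("young", i) for i in range(95)] + [("senior", i) for i in range(5)]

without_replacement = srswor(customers, 10, rng)
with_replacement = srswr(customers, 10, rng)
stratified = stratified_sample(customers, lambda c: c[0], 0.2, rng)
```

With only 5% of the data in the "senior" stratum, a simple random sample of 10 can easily miss it entirely, whereas the stratified sample is guaranteed to include it.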

Sampling: With or without Replacement

[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
Sampling: Cluster or Stratified Sampling

[Figure: raw data alongside a cluster/stratified sample]
Data Cube Aggregation

■ The lowest level of a data cube (base cuboid)

■ The aggregated data for an individual entity of interest
■ E.g., a customer in a phone calling data warehouse
■ Multiple levels of aggregation in data cubes
■ Further reduce the size of data to deal with
■ Reference appropriate levels
■ Use the smallest representation which is enough to solve the
task
■ Queries regarding aggregated information should be answered
using data cube, when possible
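Moving from a finer to a coarser level of aggregation can be sketched as a roll-up over cube cells. The cell keys, the sales figures, and the `roll_up` helper are hypothetical; a real data cube would be served by a warehouse or OLAP engine rather than a dictionary.

```python
from collections import defaultdict

def roll_up(cells, keep):
    """Sum the measure over every dimension whose position is not listed in keep."""
    totals = defaultdict(int)
    for key, measure in cells.items():
        totals[tuple(key[i] for i in keep)] += measure
    return dict(totals)

# Hypothetical cells keyed by (year, quarter) with a sales measure.
quarterly_sales = {
    (2023, "Q1"): 10,
    (2023, "Q2"): 20,
    (2023, "Q3"): 15,
    (2023, "Q4"): 25,
    (2024, "Q1"): 30,
    (2024, "Q2"): 5,
}

# Keep only the year dimension: six cells shrink to two.
annual_sales = roll_up(quarterly_sales, keep=(0,))
```

If the analysis only needs annual totals, the two-cell result is the smallest representation that answers the question, so the quarterly detail need not be carried into mining.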

Data Reduction 3: Data Compression
■ String compression
■ There are extensive theories and well-tuned algorithms
■ Typically lossless, but only limited manipulation is possible without expansion
■ Audio/video compression
■ Typically lossy compression, with progressive refinement
■ Sometimes small fragments of signal can be reconstructed without reconstructing the whole
■ Time sequences are not audio
■ Typically short and varying slowly with time
■ Dimensionality and numerosity reduction may also be considered as forms of data compression
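The difference between lossless and lossy compression can be seen in a few lines. The sketch uses Python's standard `zlib` module for the lossless case; rounding a made-up signal to one decimal place stands in for lossy compression and is an assumption of this example, not a technique taken from the slides.

```python
import zlib

# Lossless: the original bytes are recovered exactly after decompression.
text = ("AllElectronics sales record; " * 50).encode("utf-8")
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# Lossy (illustrative stand-in): keep fewer digits of a hypothetical signal.
signal = [0.12, 0.19, 0.31, 0.28, 0.52, 0.49]
approximated = [round(x, 1) for x in signal]
largest_error = max(abs(a - b) for a, b in zip(signal, approximated))
```

The compressed text expands back to an identical copy, while the rounded signal can only approximate the original, with an error bounded by the rounding step.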
Data Compression

[Figure: original data to compressed data and back is lossless; original data to approximated data is lossy]
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
■ The attribute with the most distinct values is placed at
the lowest level of the hierarchy
■ Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values
province_or_state 365 distinct values
city 3567 distinct values
street 674,339 distinct values
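The distinct-value heuristic above amounts to a sort. In the sketch below the location counts come from the slide, while the function name and the time-attribute counts are hypothetical and serve only to show the exception.

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from fewest distinct values (top) to most (bottom)."""
    return sorted(distinct_counts, key=distinct_counts.get)

location_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
location_hierarchy = generate_hierarchy(location_counts)

# Hypothetical counts for time attributes, where the heuristic breaks down.
time_counts = {"year": 20, "month": 12, "weekday": 7}
time_hierarchy = generate_hierarchy(time_counts)
```

For location the result runs country, province_or_state, city, street, as expected. For the time attributes the same rule puts weekday at the top and year at the bottom, which is the kind of exception the slide warns about and why generated hierarchies still need a manual check.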
Summary
■ Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
■ Data cleaning: e.g. missing/noisy values, outliers
■ Data integration from multiple sources:
■ Entity identification problem; Remove redundancies; Detect
inconsistencies
■ Data reduction
■ Dimensionality reduction; Numerosity reduction; Data
compression
■ Data transformation and data discretization
■ Normalization; Concept hierarchy generation