Mod1 DM Part2


Data Pre Processing

Data preprocessing is a data mining technique that transforms raw data into an understandable format.
Real-world data are often incomplete, noisy, and inconsistent; preprocessing is a proven method of resolving such issues.
Data preprocessing prepares raw data for further processing.
Data preprocessing is used in database-driven applications such as customer relationship management and rule-based applications.


Data Preprocessing
Preprocess Steps
Data cleaning
Data integration
Data transformation
Data reduction
Why Data Preprocessing?
Data in the real world is dirty

incomplete: lacking attribute values, lacking certain attributes of interest, or

containing only aggregate data

e.g., occupation=“ ”
noisy: containing errors or outliers

e.g., Salary=“-10”

inconsistent: containing discrepancies in codes or names


Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:


Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a representation that is much smaller in volume but produces the same or similar analytical results
Forms of data preprocessing
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values
per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter
Use the most probable value to fill in the missing value: inference-based
such as Bayesian formula or decision tree
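As a rough illustration of the mean-based and class-conditional (per-class mean) filling strategies, here is a minimal pandas sketch; the "income" attribute, "class" label, and values are hypothetical:

```python
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "class":  ["low", "low", "high", "high", "high"],
    "income": [30000, None, 80000, None, 90000],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```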
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, bin medians, or bin boundaries (a sketch follows after this list)
Cluster Analysis

Clustering: detect and remove outliers


Regression
Regression: smooth the data by fitting it to regression functions (e.g., a linear fit y = x + 1), replacing observed values such as Y1 with their smoothed values Y1’ on the fitted line.
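A minimal sketch of the binning method mentioned above, assuming a small hypothetical list of sorted prices; it partitions the values into equi-depth bins and smooths them by bin means and by bin boundaries:

```python
import numpy as np

# Hypothetical sorted attribute values (e.g., prices)
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

n_bins = 3
# Equi-depth (equal-frequency) partitioning: each bin holds the same number of values
bins = np.array_split(values, n_bins)

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value is replaced by the closer bin boundary
by_bounds = np.concatenate(
    [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]
)

print(by_means)
print(by_bounds)
```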
Data cleaning as a process
Discrepancy detection

Use metadata

Field overloading

Unique rules

Consecutive rules

Null rules



Data Integration
Data integration:
combines data from multiple sources.
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from different sources
are different
possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundant Data in Data Integration

Redundant data often occur when multiple databases are integrated
The same attribute may have different names in different databases
One attribute may be a “derived” attribute in another table.
Redundancy may be detected by correlation analysis
Careful integration of the data from multiple sources can help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation analysis (a sketch follows below)
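A small sketch of correlation analysis for redundancy detection, on a hypothetical integrated table in which annual_income is derived from monthly_income; a coefficient near +1 or -1 flags a likely redundant attribute:

```python
import pandas as pd

# Hypothetical integrated data: annual_income is derived from monthly_income
df = pd.DataFrame({
    "monthly_income": [2500, 4000, 3200, 5100, 2800],
    "annual_income":  [30000, 48000, 38400, 61200, 33600],
    "age":            [23, 45, 31, 52, 28],
})

# Pearson correlation matrix; values near +1 or -1 suggest redundancy
print(df.corr(method="pearson"))
```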
Data Transformation

Smoothing: remove noise from data


Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization

min-max normalization
Min-max normalization performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values for attribute A. Min-max normalization maps a value v of A to v’ in the range [new_minA, new_maxA] by computing:
v’ = ((v – minA) / (maxA – minA)) * (new_maxA – new_minA) + new_minA
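A minimal sketch of the min-max formula above, assuming the new range [0, 1] and hypothetical income values:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Linearly map the values of an attribute to [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Hypothetical income values: 73600 maps to about 0.716 when min=12000, max=98000
income = [12000, 73600, 98000, 54000]
print(min_max_normalize(income))
```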
Data Transformation: Normalization
Z-score Normalization:
In z-score normalization, the values of attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v’ by computing:
v’ = (v – Ā) / σA
where Ā and σA are the mean and the standard deviation, respectively, of attribute A.
This method of normalization is useful when the actual minimum and maximum of attribute A are unknown.
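A corresponding sketch of z-score normalization on the same hypothetical values (NumPy's default here is the population standard deviation):

```python
import numpy as np

def z_score_normalize(v):
    """Normalize an attribute using its mean and standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

# Hypothetical income values; the result has mean 0 and standard deviation 1
income = [12000, 73600, 98000, 54000]
print(z_score_normalize(income))
```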
Data Transformation: Normalization
Normalization by Decimal Scaling
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A.
The number of decimal places moved depends on the maximum absolute value of A.
A value v of A is normalized to v’ by computing v’ = v / 10^j, where j is the smallest integer such that Max(|v’|) < 1.
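A short sketch of decimal scaling; the sample values are hypothetical, chosen so that the maximum absolute value 986 gives j = 3:

```python
import numpy as np

def decimal_scaling_normalize(v):
    """Divide by 10**j, the smallest power of ten that brings every
    |normalized value| below 1."""
    v = np.asarray(v, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

# Max absolute value is 986, so j = 3 and -986 is normalized to -0.986
print(decimal_scaling_normalize([-986, 351, 917]))
```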
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results

Why data reduction? — A database/data warehouse may store

terabytes of data. Complex data analysis may take a very long time

to run on the complete data set.

Data reduction strategies
Data cube aggregation

Attribute subset selection

Dimensionality reduction

Numerosity reduction

Discretization and concept hierarchy generation


Data cube aggregation

Aggregation operations are applied to the data in the construction of a data cube; the resulting cube stores the data in summarized, reduced form.



Attribute subset selection

Irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed. Methods include (a forward-selection sketch follows this list):
Stepwise forward selection
Stepwise backward elimination
Combination of forward selection and backward elimination
Decision tree induction
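As a rough sketch of stepwise forward selection, the loop below greedily adds whichever remaining attribute most improves a classifier's cross-validated accuracy; the data set and the decision-tree scorer are assumptions for illustration, not part of the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)

selected, remaining, best_score = [], list(X.columns), 0.0
while remaining:
    # Score each candidate attribute when added to the current subset
    scores = {
        attr: cross_val_score(DecisionTreeClassifier(random_state=0),
                              X[selected + [attr]], y, cv=5).mean()
        for attr in remaining
    }
    attr, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:      # stop once no attribute improves the score
        break
    selected.append(attr)
    remaining.remove(attr)
    best_score = score

print(selected, best_score)
```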



Dimensionality reduction
Encoding mechanisms are used to reduce the data size.
Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector X, transforms it into a
numerically different vector, X’, of wavelet coefficients.

Principal components analysis, or PCA


Unlike attribute subset selection, which reduces the attribute set size by
retaining a subset of the initial set of attributes, PCA “combines” the
essence of attributes by creating an alternative, smaller set of variables.
Data compression
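A brief sketch of PCA as described above, projecting a hypothetical data set with 4 attributes onto its 2 strongest principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # hypothetical 100 tuples, 4 attributes

pca = PCA(n_components=2)              # keep the 2 strongest components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```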
Numerosity reduction
The data are replaced or estimated by alternative, smaller data representations such as parametric models (which need to store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
Regression and Log-Linear Models
Histograms
Clustering
Sampling
Data compression
Regression and Log-Linear Models

Regression and log-linear models can be used to approximate the


given data.
In linear regression, the data are modeled to fit a straight line:
y = wx + b
where y is the response variable, x is the predictor variable, and w and b are the regression coefficients.
Log-linear models approximate discrete multidimensional
probability distributions.
This allows a higher-dimensional data space to be constructed from
lower dimensional spaces.
Log-linear models are therefore also useful for dimensionality
reduction
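A minimal sketch of regression-based (parametric) numerosity reduction: instead of storing all of the synthetic (x, y) pairs below, only the fitted parameters w and b need to be kept:

```python
import numpy as np

# Synthetic data roughly following y = 2x + 1 with some noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2 * x + 1 + rng.normal(scale=0.5, size=200)

# Least-squares fit of y = w*x + b; only w and b are stored
w, b = np.polyfit(x, y, deg=1)
print(f"fitted: y = {w:.2f} * x + {b:.2f}")
```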
Histograms
Histograms use binning to approximate data distributions and are a popular form of data reduction.
The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

Sampling
Sampling can be used as a data reduction technique because it allows a
large data set to be represented by a much smaller random sample (or
subset) of the data.



Sampling
Simple random sample without replacement (SRSWOR) of size s
Simple random sample with replacement (SRSWR) of size s
Cluster sample
Stratified sample
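A compact sketch of the simple random sampling schemes on a hypothetical data set of 1,000 tuples, plus a proportional stratified sample; cluster sampling follows the same idea with a grouping step:

```python
import pandas as pd

# Hypothetical data set of 1,000 tuples with a class attribute
df = pd.DataFrame({
    "id": range(1000),
    "class": ["A"] * 700 + ["B"] * 300,
})

s = 100
srswor = df.sample(n=s, replace=False, random_state=0)   # SRSWOR of size s
srswr  = df.sample(n=s, replace=True,  random_state=0)   # SRSWR of size s

# Stratified sample: draw 10% from each class so proportions are preserved
stratified = df.groupby("class", group_keys=False).sample(frac=0.1, random_state=0)

print(len(srswor), len(srswr), stratified["class"].value_counts().to_dict())
```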



Data Discretization and Concept
Hierarchy Generation
Data discretization techniques can be used to reduce the number of
values for a given continuous attribute by dividing the range of the
attribute into intervals.

Interval labels can then be used to replace actual data values.

Supervised discretization
Unsupervised discretization
Top-down discretization or splitting
Bottom-up discretization or merging
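A small sketch of unsupervised, top-down (splitting) discretization of a continuous age attribute into interval labels with pandas; the boundaries and labels (youth, middle_aged, senior) are assumptions for illustration:

```python
import pandas as pd

# Hypothetical continuous age values
ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 34, 36, 40, 45, 46, 52, 70])

# Replace numeric ages with higher-level interval labels
labels = pd.cut(ages, bins=[0, 20, 50, 100],
                labels=["youth", "middle_aged", "senior"])
print(labels.value_counts())
```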
Data Discretization and Concept
Hierarchy Generation
A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.

Concept hierarchies can be used to reduce the data by


collecting and replacing low-level concepts (such as numerical
values for the attribute age) with higher-level concepts (such as
youth, middle-aged, or senior).

