Data Preprocessing: Why Preprocess The Data?

The document discusses data preprocessing which includes data cleaning, integration, transformation, reduction, and discretization. Data cleaning involves filling in missing values, identifying and handling outliers, resolving inconsistencies, and addressing redundancy from data integration. Data integration merges data from multiple sources which requires schema integration and resolving object matching issues. Data transformation includes normalization and aggregation.


Data Preprocessing

 Why preprocess the data?

 Data cleaning

 Data integration and transformation

 Data reduction

 Discretization and concept hierarchy generation

 Summary



Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or
names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Why Is Data Preprocessing
Important?
 No quality data, no quality mining results.
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
 Data warehouse needs consistent integration of quality
data
 Data extraction, cleaning, and transformation comprise the
majority of the work of building a data warehouse



Multi-Dimensional Measure of Data
Quality
 A well-accepted multidimensional view:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Interpretability
 Accessibility



Major Tasks in Data
Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same or
similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for
numerical data
Forms of Data Preprocessing



Data Preprocessing

 Why preprocess the data?

 Data cleaning

 Data integration and transformation

 Data reduction

 Discretization and concept hierarchy generation

 Summary



Data Cleaning
 Importance
 “Data cleaning is one of the biggest and number one
problems in data warehousing”
 Data cleaning tasks

 Fill in missing values

 Identify outliers and smooth out noisy data

 Correct inconsistent data

 Resolve redundancy caused by data integration



Missing Data

 Data is not always available


 E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 Missing data may need to be inferred.



How to Handle Missing Data?
1. Ignore the tuple:
 usually done when the class label is missing (assuming the task is classification).
 This method is not very effective unless the tuple contains several attributes with missing values.
 Not effective when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: time-consuming and may not be feasible given a large data set with many missing values.

How to Handle Missing Data?
3. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label
like “Unknown” or infinity.

4. Use the attribute mean to fill in the missing value:
For example, suppose that the average income of customers is $56,000. Use this value to replace the missing value for income.

5. Use the attribute mean for all samples belonging to the same class:
For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit-risk category as that of the given tuple.
How to Handle Missing Data?

6. Use the most probable value to fill in the missing value:
inference-based methods such as Bayesian inference or decision tree induction can be used.
 For example, using the other customer attributes in our data
set, we may construct a decision tree to predict the missing
values for income.
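The following is a minimal sketch of strategies 4-6 using pandas and scikit-learn; the column names (income, age, credit_risk) and the use of DecisionTreeRegressor for the inference-based fill are illustrative assumptions, not part of the slides.

# Sketch: three ways to fill missing income values (hypothetical data)
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "income": [45000, None, 56000, None, 61000, 52000],
    "age": [25, 32, 41, 29, 50, 38],
    "credit_risk": ["low", "high", "low", "high", "low", "high"],
})

# 4. Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 5. Fill with the mean of samples in the same class (credit_risk)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))

# 6. Predict the most probable value from the other attributes
known = df[df["income"].notna()]
model = DecisionTreeRegressor().fit(known[["age"]], known["income"])
df["income_tree"] = df["income"]
missing = df["income"].isna()
df.loc[missing, "income_tree"] = model.predict(df.loc[missing, ["age"]])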



Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data



How to Handle Noisy Data?
1.Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
2.Regression
 smooth by fitting the data into regression functions
3.Clustering
 detect and remove outliers
4.Combined computer and human inspection
 detect suspicious values and have them checked by a human (e.g., to deal with possible outliers)



Noisy Data Handling Methods:
Binning
 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately the same number of samples



Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
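A small sketch of the same equi-depth binning and smoothing in plain Python, assuming the price list from the slide:

# Sketch: equal-frequency binning with smoothing by means and by boundaries
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closer of the two boundaries
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]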



Noisy Data Handling Methods:
Regression
 Data can be smoothed by fitting the data to a function, such as
with regression.
 Linear regression involves finding the “best” line to fit two
attributes (or variables), so that one attribute can be used to
predict the other.
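As a rough sketch, smoothing by simple linear regression with NumPy; the x and y values below are illustrative noisy observations, not data from the slides:

# Sketch: smooth one attribute by regressing it on another (least-squares line fit)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])    # noisy observations of roughly y = x + 1

slope, intercept = np.polyfit(x, y, deg=1)  # fit the "best" straight line
y_smoothed = slope * x + intercept          # replace noisy values with fitted values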



Regression
(figure: a data point (X1, Y1) and its fitted value Y1' on the regression line y = x + 1)



Noisy Data Handling Methods:
Cluster Analysis
 Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.”



Cluster Analysis



Data Integration
Data integration:
 the merging of data from multiple data stores.

Issues to be considered during data integration

1. Schema integration & object matching
 Entity identification problem: the same attribute may have different
names in two different tables
 E.g., A.cust-id = B.cust-# (A and B are two different tables)
 Can be solved by integrating metadata from different sources.
 Metadata can be used to help avoid errors in schema integration.
 The metadata may also be used to help transform the data (e.g., where
data codes for pay type in one database may be “H” and “S”, and 1
and 2 in another).



Data Integration

2. Redundant data often occur when integrating multiple databases
 Object identification: The same attribute or object may
have different names in different databases
 Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality



Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson’s product moment


coefficient)

i 1 (ai  A)(bi  B) 
n n
( ai bi )  n A B
rA, B   i 1
( n  1) A B (n  1) A B

 where n is the number of tuples, \bar{A} and \bar{B} are the respective
means of A and B, σ_A and σ_B are the respective standard
deviations of A and B, and \sum a_i b_i is the sum of the AB cross-product.
 If r_{A,B} > 0, A and B are positively correlated (A's values increase
as B's). The higher the value, the stronger the correlation.
 r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
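A brief sketch of computing r_{A,B} directly from the definition with NumPy; the two attribute vectors are illustrative:

# Sketch: Pearson correlation coefficient between two numeric attributes A and B
import numpy as np

A = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
B = np.array([10.0, 14.0, 19.0, 27.0, 30.0])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
# Equivalent shortcut: np.corrcoef(A, B)[0, 1]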
Correlation Analysis (Nominal Data)

 Χ2 (chi-square) test
\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

 The larger the Χ2 value, the more likely the variables are
related
 The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count

Correlation Analysis (Nominal
Data)
 For categorical (discrete) data, a correlation relationship
between two attributes, A and B, can be discovered by a chi-
square test.
 Suppose A has c distinct values, namely a1, a2, ..., ac.
 B has r distinct values, namely b1, b2, ..., br.
 The data tuples described by A and B can be shown as a
contingency table, with the c values of A making up the
columns and the r values of B making up the rows.
 Let (Ai, Bj) denote the joint event that attribute A takes on value ai
and attribute B takes on value bj, that is, (A = ai, B = bj).
Each and every possible (Ai, Bj) joint event has its own cell (or
slot) in the table.



 The chi-square value can be computed as
\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \quad e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}
 where o_{ij} is the observed (actual) count of the joint event (A_i, B_j), e_{ij} is its expected count, and N is the number of data tuples
 The chi-square statistic tests the hypothesis that A and B are independent.
 The test is based on a significance level, with (r-1)(c-1) degrees of freedom.



Chi-Square Calculation: An
Example
 Suppose that a group of 1,500 people was surveyed. The gender
of each person was noted. Each person was polled as to whether
their preferred type of reading material was fiction or
nonfiction. Thus, we have two attributes, gender and preferred
reading. Find the correlation between these two attributes



Chi-Square Calculation: An Example

                             male        female       Sum (row)
Like science fiction         250 (90)    200 (360)      450
Not like science fiction      50 (210)  1000 (840)     1050
Sum (col.)                   300        1200           1500

 Χ2 (chi-square) calculation (numbers in parentheses are the expected counts, calculated based on the equation on the previous slide)
\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93

 It shows that like_science_fiction and gender are strongly correlated in the given group
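A short sketch that reproduces the chi-square value for the table above using plain NumPy (scipy.stats.chi2_contingency would give an equivalent result):

# Sketch: chi-square test of independence for the gender / preferred_reading table
import numpy as np

observed = np.array([[250, 200],     # like science fiction: male, female
                     [50, 1000]])    # not like science fiction: male, female

row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
N = observed.sum()

expected = row_sums * col_sums / N                    # e_ij = count(row) * count(col) / N
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)                                           # ≈ 507.9 (the slide's 507.93)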
Data Integration
3.Detection and resolution of data value conflicts.
 For the same real-world entity, attribute values from different
sources may differ.
E.g., a weight attribute may be stored in metric units in one system
and British imperial units in another.
 For a hotel chain, the price of rooms in different cities may
involve not only different currencies but also different services
(such as free breakfast) and taxes.
 An attribute in one system may be recorded at a lower level of
abstraction than the “same” attribute in another. For example,
the total sales in one database may refer to one branch of All
Electronics, while an attribute of the same name in another
database may refer to the total sales for All Electronics stores in
a given region.



Data Transformation
 A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be
identified with one of the new values
Methods
 Smoothing: remove noise from data (binning, clustering, regression)
 Normalization: scaled to fall within a small, specified range such as –
1.0 to 1.0 or 0.0 to 1.0
 Attribute/feature construction
 New attributes constructed / added from the given ones
 Aggregation: summarization or aggregation operations are applied to data
 Generalization: concept hierarchy climbing
 Low-level/primitive/raw data are replaced by higher-level concepts

Data Transformation - Normalization
 Min-max normalization: to [new_minA, new_maxA]
   v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A
 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
   Then $73,600 is mapped to \frac{73,600 - 12,000}{98,000 - 12,000}\,(1.0 - 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):
   v' = \frac{v - \mu_A}{\sigma_A}
 Ex. Let μ = 54,000, σ = 16,000. Then \frac{73,600 - 54,000}{16,000} = 1.225
 Normalization by decimal scaling:
   v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
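A minimal sketch of the three normalization schemes with NumPy, reusing the income figures from the example above:

# Sketch: min-max, z-score, and decimal-scaling normalization
import numpy as np

v = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# Z-score normalization (mean and standard deviation of the attribute itself)
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
scaled = v / 10 ** j

print(minmax[2])   # ≈ 0.716, i.e. $73,600 maps to about 0.716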
Data Reduction

 Why data reduction?


 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time
to run on the complete data set
 Data reduction
 Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the
same) analytical results

Data Reduction

 Data reduction strategies


 Data Cube Aggregation
 Attribute Subset Selection
 Dimensionality reduction, e.g., remove unimportant
attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)
 Regression and Log-Linear Models
 Histograms,
 Clustering,
 Sampling
Data Reduction 1: Data Cube
Aggregation
 Data cubes store multidimensional aggregated information.
 Data cubes provide fast access to precomputed, summarized
data, thereby benefiting on-line analytical processing as well
as data mining.
 Queries regarding aggregated information should be
answered using data cube, when possible.
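As a loose illustration of the idea, aggregating quarterly sales up to yearly totals with pandas mimics the kind of precomputed summary a data cube stores; the column names and figures below are hypothetical:

# Sketch: aggregating detailed (quarterly) sales up to a coarser (yearly) level
import pandas as pd

sales = pd.DataFrame({
    "year": [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount": [224, 408, 350, 586, 300, 420, 380, 600],
})

# Reduced representation: one row per year instead of one per quarter
annual = sales.groupby("year", as_index=False)["amount"].sum()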



Data Reduction 1: Data Cube Aggregation



Data Reduction 2: Attribute
Subset Selection
 Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).

 The goal of attribute subset selection is to find a minimum set


of attributes such that the resulting probability distribution of
the data classes is as close as possible to the original
distribution obtained using all attributes.

 It reduces the number of attributes appearing in the


discovered patterns, helping to make the patterns easier to
understand.



Attribute Subset Selection -
Techniques
1.Stepwise forward selection:
 The procedure starts with an empty set of attributes as the
reduced set.
 The best of the original attributes is determined and added to
the reduced set.
 At each subsequent iteration or step, the best of the remaining
original attributes is added to the set.

2. Stepwise backward elimination:


 The procedure starts with the full set of attributes.
 At each step, it removes the worst attribute remaining in the set.



Attribute Subset Selection -
Techniques
3. Combination of forward selection and backward
elimination:
 The stepwise forward selection and backward elimination methods can be combined.
 At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
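A rough sketch of greedy stepwise forward selection, scoring each candidate attribute by cross-validated classifier accuracy; scikit-learn, the iris data set, and the decision-tree scorer are illustrative assumptions:

# Sketch: greedy stepwise forward selection of attributes
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []

def score(cols):
    # mean cross-validated accuracy using only the chosen attribute columns
    return cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=5).mean()

for _ in range(2):                      # keep the 2 "best" attributes
    best = max(remaining, key=lambda c: score(selected + [c]))
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)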



Attribute Subset Selection -
Techniques
4. Decision tree induction:
 Decision tree induction constructs a flowchart-like structure
where each internal (nonleaf) node denotes a test on an
attribute, each branch corresponds to an outcome of the test,
and each external (leaf) node denotes a class prediction.
 At each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
 When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
 All attributes that do not appear in the tree are assumed to be
irrelevant.
 The set of attributes appearing in the tree form the reduced
subset of attributes.
Attribute Subset Selection - Techniques



Data Reduction 3: Dimensionality
Reduction
 Dimensionality reduction
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
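As an illustration of one of these techniques, a minimal PCA sketch via NumPy's SVD, keeping the top two principal components; the data matrix is randomly generated for illustration:

# Sketch: dimensionality reduction with PCA (top-k principal components via SVD)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))          # 100 tuples, 5 attributes

X_centered = X - X.mean(axis=0)        # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
X_reduced = X_centered @ Vt[:k].T      # project onto the first k principal components
print(X_reduced.shape)                 # (100, 2)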

Data Reduction 4: Numerosity Reduction

 Reduce data volume by choosing alternative, smaller forms


of data representation
 Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
 Ex.: Log-linear models
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and
Log-Linear Models
 Linear regression: Y = w X + b
 A random variable, y (called a response variable), can be modeled as
a linear function of another random variable.

Multiple regression: Y = b0 + b1 X1 + b2 X2
 Multiple linear regression is an extension of (simple) linear
regression, which allows a response variable, y, to be modeled as a
linear function of two or more predictor variables.
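A brief sketch of fitting such a multiple regression with NumPy least squares, so that only the coefficients (b0, b1, b2) need to be stored in place of the raw tuples; the data values are illustrative:

# Sketch: multiple regression Y = b0 + b1*X1 + b2*X2 via least squares
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = np.array([6.1, 6.9, 11.2, 11.8, 16.0])

design = np.column_stack([np.ones_like(X1), X1, X2])       # columns: 1, X1, X2
(b0, b1, b2), *_ = np.linalg.lstsq(design, Y, rcond=None)
# Store only (b0, b1, b2); Y can then be reconstructed approximately from X1 and X2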

Parametric Data Reduction: Regression
and Log-Linear Models
Log-linear models
 Given a set of tuples in n dimensions (e.g., described by n
attributes), we can consider each tuple as a point in an n-
dimensional space.
 Log-linear models can be used to estimate the probability
of each point in a multidimensional space for a set of
discretized attributes, based on a smaller subset of
dimensional combinations.
 This allows a higher-dimensional data space to be
constructed from lower dimensional spaces.
 Log-linear models are thus useful for dimensionality
reduction and data smoothing



Histogram Analysis

 Divide data into buckets and store average (sum) for


each bucket
Partitioning rules:
Equal-width: In an equal-width histogram, the width of each
bucket range is uniform.

Equal-frequency (or equidepth): In an equal-frequency


histogram, the buckets are created so that, roughly, the
frequency of each bucket is constant (that is, each bucket
contains roughly the same number of contiguous data
samples).
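A small sketch contrasting equal-width and equal-frequency buckets with NumPy, reusing the price list from the binning example:

# Sketch: equal-width vs. equal-frequency histogram buckets
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 buckets with uniform range
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)                              # uniform ranges, varying counts

# Equal-frequency: boundaries at the 1/3 and 2/3 quantiles
q_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(q_edges)                                    # varying ranges, ~4 values per bucket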
Histogram Analysis
V-Optimal: If we consider all of the possible histograms for a
given number of buckets, the V-Optimal histogram is the one
with the least variance. Histogram variance is a weighted sum
of the original values that each bucket represents, where bucket
weight is equal to the number of values in the bucket.

MaxDiff: A MaxDiff histogram considers the difference between
each pair of adjacent values. A bucket boundary is established
between the pairs having the b-1 largest differences,
where b is the user-specified number of buckets.



Clustering
 Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
 Can be very effective if data is clustered but not if data is
“smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many kinds of clustering algorithms.

Sampling

 Sampling: obtaining a small sample s to represent the whole data set N
 Key principle: Choose a representative subset of the data
Simple random sample without replacement (SRSWOR) of size s:
 This is created by drawing s of the N tuples from D (s < N), where the
probability of drawing any tuple in D is 1/N, that is, all tuples are
equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This is
similar to SRSWOR, except that each time a tuple is drawn from D, it
is recorded and then replaced. That is, after a tuple is drawn, it is
placed back in D so that it may be drawn again.
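A tiny sketch of both schemes with NumPy's random generator; D here is just an illustrative array of 100 tuple identifiers:

# Sketch: simple random sampling without and with replacement
import numpy as np

rng = np.random.default_rng(seed=0)
D = np.arange(100)                                 # stand-in for N = 100 data tuples
s = 10

srswor = rng.choice(D, size=s, replace=False)      # each tuple can be drawn at most once
srswr = rng.choice(D, size=s, replace=True)        # a tuple may be drawn more than once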

Sampling

Cluster sample: If the tuples in D are grouped into M mutually


disjoint “clusters,” then an SRS of s clusters can be obtained,
where s < M
Stratified sample: If D is divided into mutually disjoint parts
called strata, a stratified sample of D is generated by obtaining
an SRS at each stratum
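A short sketch of a stratified sample with pandas, drawing a simple random sample within each stratum; the column names (customer, age_group) are hypothetical:

# Sketch: stratified sampling - an SRS of 2 tuples from each stratum (age group)
import pandas as pd

df = pd.DataFrame({
    "customer": range(12),
    "age_group": ["young", "middle", "senior"] * 4,
})

stratified = df.groupby("age_group").sample(n=2, random_state=0)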


