Unit 1: Data Preprocessing
Introduction
Real-world databases often contain noisy, missing, and inconsistent data.
Preprocessing techniques are used to improve data quality and, in turn, the
efficiency and accuracy of mining results
• Data Cleaning: removes noise and corrects inconsistencies in the data
• Data Integration: merges data from multiple sources into a coherent data
store such as a data warehouse
• Data Reduction: reduces the data size by performing aggregations,
eliminating redundant features, and clustering
• Data Transformation: scales data to fall within a smaller range, such as
0.0 to 1.0, which can improve the accuracy and efficiency of mining
algorithms involving distance measures (a scaling sketch follows below)
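As a small illustration of the transformation step, here is a minimal sketch in Python of scaling an attribute into the range 0.0 to 1.0; min-max normalization is one common way to do this, and the values below are hypothetical, not from the source.

import numpy as np

# Hypothetical attribute values (e.g., incomes) to be scaled into [0.0, 1.0].
v = np.array([12000.0, 35000.0, 58000.0, 73600.0, 98000.0])

# Min-max normalization: map the smallest value to 0.0 and the largest to 1.0.
v_scaled = (v - v.min()) / (v.max() - v.min())
print(v_scaled)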
Why Preprocess the Data?
• Data Quality: Data have quality if they satisfy
the requirements of the intended use.
• Many factors make up data quality:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
Cont.,
• Several attributes for various tuples may have no recorded values; such
missing data lowers quality and leads to reported errors, unusual values,
and inconsistencies
• The data you wish to analyze by data mining techniques may be
– Incomplete (lacking attribute values or containing only aggregate data)
– Inaccurate or noisy (containing incorrect attribute values that deviate
from the expected)
– Inconsistent (containing discrepancies in the department codes used to
categorize items)
Accuracy, Completeness and Consistency
• Inaccurate, incomplete, and inconsistent data are common properties of
large databases and data warehouses
• Reasons:
– The data collection instruments used may be faulty
– Human or computer errors occurring at data entry
– Users may purposely submit incorrect values for mandatory fields when
they do not wish to disclose personal information (e.g., date of birth)
– Errors in data transmission
– Technology limitations, such as limited buffer size for coordinating
synchronized data transfer and computation
– Incorrect data may also result from inconsistencies in naming
conventions or formats in input fields (e.g., dates)
Timeliness
• Timeliness also affects data quality
– Example: All Electronics updates its sales details at the end of each month
– Some sales managers fail to update their records before the last day of
the month
– The updated details are then subject to further corrections and adjustments
– The fact that the month-end data are not updated in a timely fashion has a
negative impact on data quality
Believability and Interpretability
• Believability reflects how much the data are
trusted by users/employees
• Interpretability reflects how easily the data are understood
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Cleaning
• Data cleaning routines attempt to
– Fill in missing values
– smooth out noise while identifying outliers
– correct inconsistencies in the data
– Resolve redundancy caused by data integration
Missing Values
1. Ignore the tuple
2. Fill in the missing value manually: time consuming and may not be
feasible for large data sets
3. Use a global constant to fill in the missing value, such as "Unknown" or ∞
4. Use a measure of central tendency: the mean for symmetric data and the
median for skewed data (see the sketch below)
5. Use the attribute mean or median for all samples belonging to the same
class as the given tuple
6. Use the most probable value to fill in the missing value
– Determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction
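A minimal sketch of strategy 4 in Python with pandas, assuming a toy table with a numeric income attribute; the column names and values are illustrative, not from the source.

import pandas as pd

# Hypothetical data with missing "income" values.
df = pd.DataFrame({
    "age":    [23, 35, 41, 29, 52, 38],
    "income": [48000, None, 61000, None, 83000, 57000],
})

# Fill with a measure of central tendency: the mean for roughly symmetric
# data, the median for skewed data.
df["income_mean"]   = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())
print(df)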
Noisy Data
• Noise is a random error or variance in a measured variable
– Boxplots and scatter plots can be used to identify outliers, which may
represent noise
– Example: for the attribute "price", we may have to smooth out the data to
remove noise (an IQR-based boxplot rule is sketched below)
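A minimal sketch of the boxplot (interquartile-range) rule for flagging outliers, using hypothetical price values; the 1.5 × IQR threshold is the usual convention, not something prescribed by these notes.

import numpy as np

# Hypothetical "price" values; the last one is an obvious outlier.
price = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 210])

# Boxplot rule: values more than 1.5 * IQR beyond the quartiles are flagged.
q1, q3 = np.percentile(price, [25, 75])
iqr = q3 - q1
outliers = price[(price < q1 - 1.5 * iqr) | (price > q3 + 1.5 * iqr)]
print(outliers)   # -> [210]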
Smoothing techniques
• Binning: smooths a sorted data value by consulting its neighbourhood,
i.e., the values around it
• The sorted values are distributed into a number of buckets, or bins, and
then local smoothing is performed
Smoothing techniques
• Smoothing by bin means
• Smoothing by bin medians
• Smoothing by bin boundaries (the minimum and maximum values of each bin
serve as the bin boundaries; see the sketch below)
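A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the price values and the choice of three bins are illustrative assumptions.

import numpy as np

# Sorted hypothetical "price" data.
price = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency (equal-depth) binning into 3 bins of 3 values each.
bins = price.reshape(3, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)

# Smoothing by bin boundaries: each value is replaced by the closer of the
# bin's minimum or maximum value.
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)
print(by_bounds)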
Regression
• Regression is a technique that conforms data values to a function
• Linear regression involves finding the "best" line to fit two attributes,
so that one attribute can be used to predict the other (a minimal fit is
sketched below)
• Multiple linear regression is an extension of linear regression in which
more than two attributes are involved and the data are fit to a
multidimensional surface
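A minimal sketch of smoothing by linear regression, assuming hypothetical noisy (x, y) pairs: a least-squares line is fitted and each y is replaced by its predicted value.

import numpy as np

# Hypothetical noisy (x, y) pairs where y is roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Fit the "best" line y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)

# Smoothed values: replace each y by the value predicted from x.
y_smooth = a * x + b
print(a, b)
print(y_smooth)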
• Outlier analysis: outliers may be detected by clustering
• Example: similar values are organized into groups, or clusters; values
that fall outside of the set of clusters are considered outliers (see the
sketch below)
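A minimal sketch of clustering-based outlier detection using scikit-learn's KMeans; the data, the choice of two clusters, and the distance threshold are illustrative assumptions, not the notes' prescription.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D values; 15.0 lies far from both natural groups.
values = np.array([2.0, 2.2, 1.9, 2.1, 8.0, 8.3, 7.9, 8.1, 15.0]).reshape(-1, 1)

# Cluster the values, then flag points far from their nearest cluster center.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
dist = np.min(km.transform(values), axis=1)   # distance to nearest center
threshold = dist.mean() + 2 * dist.std()
outliers = values[dist > threshold].ravel()
print(outliers)   # -> [15.]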
Data Integration
• The semantic heterogeneity and structure of the data pose great
challenges in data integration
– Entity identification problem: how can we match schema and objects from
different sources?
– Correlation tests on numeric and nominal data: measure how strongly the
values of one attribute imply those of another
Entity Identification Problem
• A number of issues must be considered during data integration
– Schema integration and object matching can be tricky
• Matching equivalent real-world entities from multiple data sources is
known as the entity identification problem
– Example: do cust_id in one source and customer_id in another refer to the
same attribute, possibly with different representations and scales?
• Metadata for each attribute include its name, meaning, data type, range
of values permitted for the attribute, and null rules for handling blanks
and zeros
• Such metadata may also be used to help avoid errors in schema integration
Redundancy and Correlation Analysis
• An attribute may be redundant if it can be derived from another attribute
or set of attributes
• Careful integration of data from multiple sources can help reduce or
avoid redundancies and inconsistencies
• Some redundancies can be detected by correlation analysis, e.g., the
correlation coefficient for numeric data and the chi-square test for
nominal data (a numeric sketch follows below)
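A minimal sketch of correlation analysis for numeric data; the two price attributes and their values are hypothetical.

import numpy as np

# Hypothetical numeric attributes from two integrated sources.
price_usd = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
price_eur = np.array([ 9.1, 13.8, 18.3, 22.9, 27.6, 32.1])

# Correlation coefficient: a value near +1 or -1 suggests one attribute can
# largely be derived from the other, i.e. it may be redundant.
r = np.corrcoef(price_usd, price_eur)[0, 1]
print(round(r, 3))   # close to 1.0 for these values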
Normalization by Decimal Scaling
• Normalizes by moving the decimal point of
values of attribute A
• The number of decimal places moved depends on the maximum absolute
value of A
• A value v of A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1
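A minimal sketch of decimal scaling in Python; the attribute values are illustrative.

import numpy as np

# Hypothetical attribute A with values ranging from -986 to 917.
A = np.array([-986.0, 120.0, 340.0, 917.0])

# Decimal scaling: divide by 10^j, where j is the smallest integer such
# that the largest absolute normalized value is below 1.
j = int(np.ceil(np.log10(np.abs(A).max())))
A_scaled = A / 10 ** j
print(j, A_scaled)   # j = 3 -> [-0.986, 0.12, 0.34, 0.917]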
• When decision tree induction is used for attribute subset selection, a tree is constructed from the
given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of
attributes appearing in the tree form the reduced subset of attributes.
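A minimal sketch of this idea using scikit-learn's DecisionTreeClassifier on synthetic data; the data, depth limit, and class rule are assumptions made for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 4 candidate attributes, only the first two are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Induce a decision tree, then keep only the attributes it actually used;
# attributes absent from the tree are assumed irrelevant.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
print("reduced attribute subset:", used)   # typically [0, 1]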
Thank You