Data Preprocessing
By
G. Shobhana
K. Lakshmi Kanth
R. Siva Narayana
Introduction
Real-world databases are highly susceptible to noisy, missing and inconsistent data
Preprocessing techniques are used to improve the quality of the data and, in turn, the efficiency and accuracy of the mining results
• Data Cleaning: removes noise and corrects inconsistencies in the data
• Data Integration: merges data from multiple sources into a coherent data store, such as a data warehouse
• Data Reduction: reduces data size by performing aggregations, eliminating redundant features and clustering
• Data Transformation: data are scaled to fall within a smaller range, such as 0.0 to 1.0; this can improve the accuracy and efficiency of mining algorithms involving distance measures
Why Preprocess the Data?
• Data Quality: Data have quality if they satisfy
the requirements of the intended use.
• Many factors comprise data quality, including
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
• When several attributes of various tuples have no recorded values, the data are said to have missing values; missing data, errors, unusual values and inconsistencies all reduce data quality
• The data you wish to analyze by data mining techniques may be
– Incomplete (lacking attribute values or containing only aggregate data)
– Inaccurate or noisy (having incorrect attribute values that deviate from the expected)
– Inconsistent (containing discrepancies in the department codes used to categorize items)
Accuracy, Completeness and Consistency
• Inaccurate, incomplete and inconsistent data are common properties of large databases and data warehouses
• Reasons:
– The data collection instruments used may be faulty
– Human or computer errors occurring at data entry
– Users may purposely submit incorrect values for mandatory fields when they do not wish to disclose personal information (e.g., date of birth)
– Errors in data transmission
– Technology limitations, such as limited buffer size for coordinating synchronized data transfer and computation
– Incorrect data may also result from inconsistencies in naming conventions or formats in input fields (e.g., dates)
Timeliness
• Timeliness also affects data quality
– Example: AllElectronics updates its sales details at the end of each month
– Some sales managers fail to submit their records before the last day of the month
– The submitted records are later followed by corrections and adjustments
– The fact that the month-end data are not updated in a timely fashion has a negative impact on data quality
Believability and Interpretability
• Believability reflects how much the data are
trusted by users/employees
• Interpretability reflects how easily the data are understood
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Cleaning
• Filling the missing values, smoothing noisy
data, identifying or removing outliers and
resolving inconsistencies
• If the data are dirty, the mining results are also likely to be unreliable
• Dirty data can cause confusion for the mining procedure, resulting in unreliable output
• Data cleaning methods are used to overcome these problems
Data Integration
• Data integration combines data from multiple sources, such as multiple databases, data cubes or flat files
– Different attribute names can cause inconsistencies and redundancies (e.g., cust_id vs. customer_id)
– Naming inconsistencies may also occur (e.g., first name, middle name and last name stored differently across sources)
– A large amount of redundant data may slow down or confuse the knowledge discovery process
• In addition to data cleaning, steps must be taken to avoid redundancies during data integration
Data Reduction
• Obtains a reduced representation of the dataset that is much smaller in volume, yet produces the same (or almost the same) analytical results
• Strategies
– Dimensionality Reduction
– Numerosity Reduction
• Dimensionality Reduction: data encoding schemes are applied to obtain a reduced or compressed representation of the original data (a PCA sketch follows this list)
– Compression techniques (wavelet transforms, Principal Component Analysis)
– Attribute subset selection (removing irrelevant attributes)
– Attribute construction (a small set of more useful attributes derived from the original set)
• Numerosity Reduction: data are replaced by alternative, smaller representations using
– Parametric models (regression and log-linear models)
– Non-parametric models (histograms, clusters, sampling or data aggregation)
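To illustrate dimensionality reduction, here is a minimal sketch using scikit-learn's Principal Component Analysis on randomly generated data; the dataset size and the number of retained components are assumptions chosen only for illustration.

```python
# Minimal dimensionality-reduction sketch using PCA.
# The random data, column count and number of components are assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 tuples, 5 numeric attributes

pca = PCA(n_components=2)              # keep only 2 principal components
X_reduced = pca.fit_transform(X)       # reduced representation of the data

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```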
Data Transformation
• Normalization, discretization and concept hierarchy generation are useful transformations, where raw attribute values are replaced by ranges or higher conceptual levels
– Ex: raw age values replaced by youth, adult, senior (a sketch of these transformations follows below)
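A small sketch of two such transformations with pandas: min-max normalization into [0.0, 1.0] and discretization of age into concept-level labels. The sample values and the age cut-points (25, 60) are assumptions for illustration.

```python
# Sketch of min-max normalization to [0.0, 1.0] and discretization of "age"
# into higher-level concepts. Sample values and age cut-points are assumptions.
import pandas as pd

df = pd.DataFrame({"age": [19, 34, 47, 68],
                   "income": [18000, 42000, 71000, 30000]})

# Min-max normalization: scale income into the range [0.0, 1.0]
mn, mx = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - mn) / (mx - mn)

# Concept hierarchy generation: replace raw age with youth/adult/senior
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 60, 120],
                         labels=["youth", "adult", "senior"])
print(df)
```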
Data Cleaning
• Data cleaning routines attempt to
– Fill in missing values
– smooth out noise while identifying outliers
– correct inconsistencies in the data
– Resolve redundancy caused by data integration
Missing Values
1. Ignore the tuple
2. Fill in the missing value manually: time-consuming and may not be feasible for large datasets
3. Use a global constant to fill in the missing value, such as "unknown" or ∞
4. Use a measure of central tendency: the mean for symmetric data, the median for skewed data
5. Use the attribute mean or median of all samples belonging to the same class as the given tuple (a pandas sketch of strategies 3–5 follows this list)
6. Use the most probable value to fill in the missing value
– Determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction
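A minimal sketch of strategies 3–5 using pandas; the DataFrame, its columns and the class labels are assumptions for illustration.

```python
# Sketch of missing-value strategies 3-5 with pandas.
# The DataFrame, its columns and the class labels are assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, np.nan, 52000, np.nan, 61000],
})

# 3. Fill with a global constant (here: infinity)
df["income_const"] = df["income"].fillna(np.inf)

# 4. Fill with a measure of central tendency (mean; use median for skewed data)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 5. Fill with the mean of all samples belonging to the same class as the tuple
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
print(df)
```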
Noisy Data
• Noise is a random error or variance in a
measured variable
– Boxplots and scatter plots can be used to identify outliers, which may represent noise
– Ex: for the attribute "price", we may have to smooth out the data to remove the noise
Smoothing techniques
• Binning: smooths a sorted data value by consulting its neighbourhood, i.e. the values around it
• The sorted values are distributed into a number of buckets, or bins, and then local smoothing is performed (a bin-means sketch follows this list)
• Smooth by bin means
• Smooth by bin medians
• Smooth by bin boundaries (the minimum and maximum values serve as the bin boundaries; each value is replaced by the closer boundary)
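A minimal sketch of smoothing by bin means; the "price" values and the number of bins are assumptions for illustration.

```python
# Sketch of smoothing by bin means: sort the values, split them into
# equal-depth bins, then replace every value in a bin by the bin mean.
# The "price" values and the number of bins are assumptions.
import numpy as np

price = np.array([8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34])
sorted_price = np.sort(price)

bins = np.array_split(sorted_price, 3)       # 3 equal-depth (equal-frequency) bins
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
# Smoothing by bin boundaries would instead replace each value by the
# closer of the bin's minimum and maximum.
```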
• Regression is a technique that conforms data values to a function
• Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other (see the sketch below)
• Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface
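A minimal sketch of smoothing by linear regression with NumPy; the attribute values below are assumptions for illustration.

```python
# Sketch of smoothing by linear regression: fit the best-fitting line between
# two attributes so that one can be used to predict (smooth) the other.
# The attribute values x and y are assumptions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # noisy measurements of the second attribute

slope, intercept = np.polyfit(x, y, deg=1)     # best-fit line: y ~ slope*x + intercept
y_smoothed = slope * x + intercept             # values conformed to the fitted function
print(slope, intercept)
print(y_smoothed)
```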
• Outlier analysis: outliers may be detected by clustering
– Ex: similar values are organized into groups, or clusters; values that fall outside the set of clusters are considered outliers (a clustering sketch follows below)
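A minimal sketch of clustering-based outlier detection using scikit-learn's DBSCAN, which labels points that belong to no cluster as -1; the one-dimensional data and the eps/min_samples settings are assumptions for illustration.

```python
# Sketch of outlier detection by clustering: values outside every cluster are
# treated as outliers. DBSCAN labels such points -1.
# The data and the eps/min_samples settings are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0], [1.1], [0.9], [5.0], [5.2], [4.9], [12.0]])  # 12.0 lies far from both groups

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)              # e.g. [ 0  0  0  1  1  1 -1 ]
print(X[labels == -1])     # the detected outlier(s): [[12.]]
```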
Data cleaning as a Process
• The first step is discrepancy detection, i.e. finding inconsistencies (differences) in the data
• Discrepancies can be caused by several factors
– Poorly designed data entry forms with many optional fields
– Human errors in data entry
– Deliberate errors (users unwilling to divulge information)
– Data decay (e.g., outdated addresses)
– Errors introduced by the data integration process
• Data discrepancy detection
– Use metadata (knowledge about the data: the data type and domain of each attribute, acceptable values, basic statistical descriptions) to identify anomalies
– Take care with inconsistent data representations (e.g., "2020/01/03" vs. "03/01/2020")
– Field overloading: an error source that results when developers squeeze new attribute definitions into unused portions of already defined attributes (e.g., an attribute whose values use only 31 of its 32 bits)
• The data should also be examined for unique rules, consecutive rules and null rules
– Unique rule: each value of the given attribute must be different from all other values of that attribute
– Consecutive rule: there can be no missing values between the lowest and highest values of the attribute, and all values must also be unique
– Null rule: specifies the use of blanks, question marks, special characters or other strings that may indicate the null condition (a pandas sketch follows below)
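A minimal sketch of checking the unique rule and the null rule on one attribute with pandas; the DataFrame, the cust_id column and the strings treated as null markers are assumptions for illustration.

```python
# Sketch of checking the unique rule and the null rule with pandas.
# The DataFrame and the null-marker strings are assumptions.
import pandas as pd

df = pd.DataFrame({"cust_id": ["C1", "C2", "C2", "C4", "?"]})

# Unique rule: each value of the attribute must be different
violations = df[df["cust_id"].duplicated(keep=False)]

# Null rule: strings such as "?", "" or "NA" may indicate the null condition
null_markers = {"?", "", "NA"}
nulls = df[df["cust_id"].isin(null_markers)]

print(violations)
print(nulls)
```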
• A number of commercial tools can aid in discrepancy detection
– Data scrubbing tools: use simple domain knowledge (postal codes, spell checking, ID numbers) to detect errors and make corrections in the data
– Data auditing tools: analyze the data to discover rules and relationships (correlations, clusters, basic statistical descriptions) and detect data that violate them
• Once we find data discrepancies, we need to define and apply a series of transformations to correct them
• Commercial tools can assist in the data transformation step
• Data migration tools: allow simple transformations to be specified (e.g., replace the string "gender" by "sex")
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transforms through a graphical user interface (GUI)
Data Integration
• The semantic heterogeneity and structure of the data pose great challenges in data integration
– Entity identification problem: how can we match schema and objects from different sources?
– Correlation tests on numeric and nominal data: measure how strongly one attribute implies another
Entity Identification Problem
• A number of issues must be considered during data integration
– Schema integration and object matching can be tricky
• Matching equivalent real-world entities from multiple data sources is known as the entity identification problem
– Ex: how can we be sure that cust_id in one source and customer_id in another refer to the same attribute, given different representations and different scales?
• Metadata for each attribute (name, meaning, data type, range of values permitted for the attribute, null rules for handling blanks and zeros) can help
• Such metadata may also be used to help avoid errors in schema integration
Redundancy and Correlation Analysis
• An attribute may be redundant if it can be derived from another attribute
• Careful integration of data from multiple sources may help reduce or avoid redundancies
• Some redundancies can be detected by correlation
analysis
• Given two attributes, correlation analysis can measure
how strongly one attribute implies the other, based on
the available data
– For nominal data, we use the chi-square (χ²) test
– For numeric attributes, we use the correlation coefficient and covariance
χ² Correlation Test for Nominal Data
• The correlation relationship between two nominal attributes, A and B, can be discovered by a chi-square (χ²) test
• Suppose A has c distinct values, a1, a2, …, ac, and B has r distinct values, b1, b2, …, br
• The data tuples are described by a contingency table, with the c values of A making up the columns and the r values of B making up the rows
• The chi-square statistic is computed as:
  χ² = Σ_{i=1..c} Σ_{j=1..r} (o_ij − e_ij)² / e_ij
– o_ij is the observed frequency (actual count) of the joint event (A_i, B_j)
– e_ij is the expected frequency, computed as
  e_ij = ( count(A = a_i) × count(B = b_j) ) / n
– n is the total number of data tuples
• The chi-square statistic tests the hypothesis that A and B are independent, i.e., that there is no correlation between them
• The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom
• In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows (a scipy sketch follows below):
– If p-value <= alpha: significant result; reject the null hypothesis (H0); the attributes are dependent (correlated)
– If p-value > alpha: not a significant result; fail to reject the null hypothesis (H0); the attributes are independent
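A minimal sketch of this test using scipy.stats.chi2_contingency on a small contingency table; the observed counts are assumptions for illustration, and correction=False is used so the statistic matches the plain formula above.

```python
# Sketch of the chi-square test for two nominal attributes using a
# contingency table. The observed counts are assumptions.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250,  200],      # rows: values of B
                     [ 50, 1000]])     # columns: values of A

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(chi2, p_value, dof)              # dof = (r-1) x (c-1) = 1 here
print(expected)                        # expected frequencies e_ij

alpha = 0.05
if p_value <= alpha:
    print("Reject H0: the attributes are correlated (dependent)")
else:
    print("Fail to reject H0: the attributes appear independent")
```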
Correlation Coefficient for Numeric Data
• Also known as Pearson's product-moment correlation coefficient
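A brief sketch of computing Pearson's correlation coefficient for two numeric attributes with NumPy; the sample values are assumptions for illustration.

```python
# Sketch of Pearson's product-moment correlation coefficient between two
# numeric attributes. The sample values are assumptions.
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

r = np.corrcoef(a, b)[0, 1]   # r lies in [-1, +1]
print(r)                      # r > 0: positively correlated; r < 0: negatively correlated
```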