Data Preprocessing: G.A.Putri Saptawati

Data preprocessing techniques are used to improve the quality of data mining results. These techniques include data cleaning to handle missing values and noise, data integration to combine data from multiple sources, data transformation to consolidate data into appropriate forms for mining, and data reduction to reduce the volume of data while maintaining integrity. The goal of these preprocessing steps is to address issues like incomplete, noisy, and inconsistent data that could influence data mining processes and pattern detection.

Uploaded by

Dito Kartiko

Data Preprocessing

G.A.Putri Saptawati

The need for data preprocessing

Problems with huge real-world databases:
  Incomplete data: missing values
  Noisy data
  Inconsistent data
These problems influence the data mining process, especially the patterns mined.

Techniques:
  Data cleaning
  Data integration
  Data transformation
  Data reduction
These techniques improve the quality of the patterns mined and/or the time required for the actual mining.

Data Cleaning: Missing values

Tuples have no recorded value for several attributes.
Ignore the tuple
Fill in the missing value:
  Using a global constant
  Using measured values: attribute mean, most probable value
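
The options above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the sample rows, the `MISSING` marker, the `"Unknown"` constant, and all function names are assumptions made for the example. The most frequent observed value is used here as a simple stand-in for the "most probable value", which in practice may be inferred with a model instead.

```python
from statistics import mean, mode

MISSING = None  # assumed marker for "no recorded value"

def drop_incomplete(rows):
    """Ignore the tuple: keep only rows with no missing attribute."""
    return [r for r in rows if MISSING not in r.values()]

def fill_constant(rows, attr, constant="Unknown"):
    """Fill missing values of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is MISSING else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Fill missing values of a numeric attr with the mean of the observed values."""
    observed_mean = mean(r[attr] for r in rows if r[attr] is not MISSING)
    return [{**r, attr: observed_mean if r[attr] is MISSING else r[attr]} for r in rows]

def fill_mode(rows, attr):
    """Fill missing values of attr with the most frequent observed value
    (a simple stand-in for the 'most probable value')."""
    most_common = mode(r[attr] for r in rows if r[attr] is not MISSING)
    return [{**r, attr: most_common if r[attr] is MISSING else r[attr]} for r in rows]

# Hypothetical sample data
rows = [
    {"age": 25, "city": "Bandung"},
    {"age": MISSING, "city": "Jakarta"},
    {"age": 35, "city": MISSING},
    {"age": 30, "city": "Bandung"},
]

print(drop_incomplete(rows))      # only the two complete rows remain
print(fill_mean(rows, "age"))     # the missing age becomes the mean of 25, 35, 30
print(fill_mode(rows, "city"))    # the missing city becomes the most frequent city
```

Ignoring tuples is the simplest option but discards data; the fill strategies keep every tuple at the cost of introducing estimated values.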

Data Cleaning: Noisy data

Noise: random error or variance in a measured variable.
Binning:
  Smooth a sorted data value by consulting its neighborhood
  Local smoothing
Clustering:
  Detect outliers by grouping similar values
Regression:
  Smooth data by fitting the data to a function, such as a regression
  Linear regression, multiple linear regression
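
Two of the smoothing techniques above can be illustrated briefly. The sketch below assumes equal-frequency bins smoothed by their bin means (one of several possible binning variants) and a simple least-squares line for regression; the function names and sample values are hypothetical.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [intercept + slope * x for x in xs]

# Hypothetical sample values
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Binning is local because each value is adjusted only by its neighbours in the same bin, whereas regression smooths every value against one global function.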

Data Integration

Combine data from multiple sources into a coherent data store.
Schema integration: entity identification problem
Redundancy: detected by correlation analysis
Detection and resolution of data value conflicts: semantic heterogeneity and different representations
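
As an illustration of detecting redundancy by correlation analysis, the sketch below computes the Pearson correlation coefficient between two numeric attributes. The 0.95 threshold, the function names, and the sample attributes are assumptions for the example; a suitable threshold depends on the data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

REDUNDANCY_THRESHOLD = 0.95  # hypothetical cut-off, chosen for illustration

def is_redundant(xs, ys, threshold=REDUNDANCY_THRESHOLD):
    """Flag two attributes as redundant when they are strongly correlated."""
    return abs(pearson(xs, ys)) >= threshold

# Hypothetical attributes coming from two different sources
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]   # same fact, different unit
weight_kg = [70.0, 52.0, 80.0, 61.0]

print(is_redundant(height_cm, height_in))   # strongly correlated, so flagged
print(is_redundant(height_cm, weight_kg))   # weakly correlated in this sample
```

The height example also shows a data value conflict of the "different representation" kind: the two sources record the same quantity in different units.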

Data Transformation

Data are transformed or consolidated into forms appropriate for mining.
Involves:
  Smoothing
  Aggregation
  Generalisation
  Normalisation
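
Normalisation can be sketched with two common schemes, min-max scaling and z-score scaling. The slides do not say which scheme is intended, so both are shown as assumptions; the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max].
    Assumes the values are not all equal (otherwise the range is zero)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre values on their mean and scale by the population standard deviation.
    Assumes the values are not all equal (otherwise the deviation is zero)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical attribute values
incomes = [10, 20, 30]
print(min_max(incomes))   # → [0.0, 0.5, 1.0]
print(z_score(incomes))   # values centred on 0
```

Min-max scaling preserves the relative spacing of the values within a fixed range, while z-score scaling is useful when the minimum and maximum are unknown or dominated by outliers.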

Data Reduction

Obtain a reduced representation of the data set that is much smaller in volume, while maintaining the integrity of the original data.
Strategies:
  Data cube aggregation
  Dimension reduction
  Data compression
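
The first strategy, data cube aggregation, can be illustrated by rolling quarterly figures up to annual totals. The sketch below uses hypothetical sales records and function names; it shows a single roll-up along the time dimension rather than a full data cube.

```python
from collections import defaultdict

# Hypothetical records: (year, quarter, sales amount)
quarterly_sales = [
    (2021, "Q1", 224), (2021, "Q2", 408), (2021, "Q3", 350), (2021, "Q4", 586),
    (2022, "Q1", 300), (2022, "Q2", 416), (2022, "Q3", 380), (2022, "Q4", 594),
]

def aggregate_by_year(records):
    """Roll quarterly records up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)

annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)   # → {2021: 1568, 2022: 1690}
```

Eight records are reduced to two, and the annual totals still agree with the original data, which is the sense in which the reduced representation maintains integrity for analyses at the yearly level.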