Data Preprocessing Unit 2
Data cleaning
Data cleaning helps us remove inaccurate, incomplete, and incorrect data from
the dataset. Some techniques used in data cleaning are −
Standard values can be used to fill in the missing values manually, but only for a small
dataset.
The attribute's mean and median values can be used to replace missing values for normally and
non-normally distributed data respectively (see the pandas sketch after this list).
Tuples can be ignored if the dataset is quite large and many values are missing within a tuple.
The most appropriate value can be estimated with regression or decision tree algorithms and used to fill in the missing value.
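A minimal sketch of the mean/median and row-dropping strategies above, using pandas; the column names and values are invented purely for illustration.

    # Handling missing values: fill with mean/median, or drop incomplete rows.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":    [25, 32, np.nan, 41, 29],
        "income": [50000, np.nan, 62000, np.nan, 48000],
    })

    # Fill with the attribute's mean (suits roughly normal distributions).
    df["age"] = df["age"].fillna(df["age"].mean())

    # Fill with the attribute's median (more robust for skewed distributions).
    df["income"] = df["income"].fillna(df["income"].median())

    # Alternatively, drop tuples (rows) that still contain missing values;
    # reasonable only when the dataset is large and few rows are affected.
    df = df.dropna()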
Noisy Data
Noisy data are data that cannot be interpreted by a machine and contain
unnecessary or faulty values. Some ways to handle them are −
Binning − This method smooths noisy data. The sorted data are divided into equal-sized bins,
and a smoothing method is then applied to each bin: smoothing by bin means (each value in a
bin is replaced by the bin's mean), smoothing by bin medians (each value is replaced by the
bin's median), or smoothing by bin boundaries (each value is replaced by the closest of the
bin's minimum and maximum values). A small sketch of smoothing by bin means follows this list.
Regression − Regression functions are used to smooth the data. Regression can be
linear (one independent variable) or multiple (several independent
variables).
Clustering − It groups similar data into clusters and can also be used to find outliers,
since values that fall outside every cluster may be treated as noise.
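A small sketch of smoothing by bin means, assuming equal-frequency bins; the values and the choice of three bins are invented for illustration.

    # Smoothing by bin means: each value is replaced by the mean of its bin.
    import numpy as np

    values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
    n_bins = 3

    # Sort the data and split it into equal-frequency bins.
    bins = np.array_split(np.sort(values), n_bins)

    # Replace every value in a bin with that bin's mean.
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print(smoothed)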
Data integration
Data from multiple sources, such as databases, flat files, and data warehouses, are combined
into a single coherent store; conflicts between schemas and attribute values have to be
resolved during this step.
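As a hedged sketch of integration, the pandas snippet below merges two hypothetical sources on a shared key; the table contents and the key name customer_id are assumptions made for illustration.

    # Integrating two sources by joining them on a common key.
    import pandas as pd

    crm = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "name":        ["Ana", "Ben", "Cara"],
    })
    billing = pd.DataFrame({
        "customer_id": [1, 2, 4],
        "balance":     [120.0, 0.0, 35.5],
    })

    # Keep only customers that appear in both sources.
    combined = pd.merge(crm, billing, on="customer_id", how="inner")
    print(combined)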
Data transformation
In this step, the format or structure of the data is changed to make it
suitable for the mining process. Methods for data transformation are −
Attribute Selection − New attributes are constructed from the given attributes to help
the mining process.
Concept Hierarchy Generation − Attributes are generalised from a lower level to a higher
level in a concept hierarchy, for example from city to country.
Aggregation − Data are summarised and stored in a more compact form; the quality of the
result depends on the quality and quantity of the underlying data. A sketch of attribute
construction and aggregation follows this list.
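Below is a hedged pandas sketch of attribute construction and aggregation; the sales table, its columns, and the derived revenue attribute are assumptions made purely for illustration.

    # Deriving a new attribute and aggregating the data.
    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2022, 2022, 2023, 2023],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "units":   [120, 150, 130, 170],
        "price":   [10.0, 10.0, 11.0, 11.0],
    })

    # Attribute construction: derive a new attribute from existing ones.
    sales["revenue"] = sales["units"] * sales["price"]

    # Aggregation: summarise quarterly figures into yearly totals.
    yearly = sales.groupby("year", as_index=False)[["units", "revenue"]].sum()
    print(yearly)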
Data reduction
It increases storage efficiency and reduces the volume of stored data while
producing almost the same analytical results. Analysis becomes harder with
huge amounts of data, so reduction is used to keep the dataset manageable.
Data Compression
The data are encoded in a compressed representation; the reduction is lossless if the
original data can be reconstructed exactly, and lossy otherwise.
Numerosity Reduction
The volume of data is reduced by storing a smaller representation of the data, for
example the parameters of a model fitted to the data, instead of the whole dataset,
while preserving the information needed for analysis.
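A minimal sketch of parametric numerosity reduction: instead of keeping every point, a simple linear model is fitted and only its coefficients are stored; the synthetic data below is invented for illustration.

    # Parametric numerosity reduction: keep the model, not the raw points.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(1000, dtype=float)
    y = 3.0 * x + 7.0 + rng.normal(scale=2.0, size=x.size)

    # Store just the slope and intercept instead of the 1000 raw values.
    slope, intercept = np.polyfit(x, y, deg=1)

    # The original values can be approximated from the model when needed.
    y_approx = slope * x + intercept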
Dimensionality reduction
The number of attributes (dimensions) under consideration is reduced, keeping only the
features that carry most of the information; Principal Component Analysis (PCA) and
wavelet transforms are common techniques.
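As one common technique, the scikit-learn sketch below projects a synthetic five-attribute dataset onto its two principal components; the data and the choice of two components are assumptions made for illustration.

    # Dimensionality reduction with PCA.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))         # 200 samples, 5 attributes

    # Project the data onto the 2 directions of greatest variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (200, 2)
    print(pca.explained_variance_ratio_)  # variance captured by each component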