05 DS Data Preprocessing - Cleaning
Data preprocessing is an important step in the data mining process. It refers to cleaning,
transforming, and integrating data in order to make it ready for analysis; the goal is to improve
the quality of the data and to make it more suitable for the specific data mining task. Some
common steps in data preprocessing include:
❖ Data Cleaning
❖ Data Integration
❖ Data Transformation
❖ Data Reduction
❖ Data Discretization
❖ Data Normalization
Data Cleaning
Real-world data can have many irrelevant and missing parts. Data cleaning is done to handle
them; it involves the handling of missing data, noisy data, and so on. A minimal missing-value
sketch appears below, followed by two approaches for smoothing noisy data.
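As a minimal sketch of missing-data handling (the data and column name here are hypothetical), a missing numeric value can be filled with the column mean using pandas; dropping the incomplete rows is another common option:

import numpy as np
import pandas as pd

# Hypothetical data set with one missing price value
df = pd.DataFrame({"price": [4.0, 8.0, np.nan, 21.0, 25.0]})

# Fill the missing value with the attribute mean
# (alternatively, drop incomplete rows with df.dropna())
df["price"] = df["price"].fillna(df["price"].mean())
print(df)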
• Regression:
Data can be smoothed by fitting it to a regression function. The regression used may be linear
(having one independent variable) or multiple (having several independent variables); a small
sketch follows these two approaches.
• Clustering:
This approach groups similar data values into clusters. Values that fall outside all clusters can
then be treated as outliers.
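The sketch below is a minimal illustration of smoothing by simple linear regression; the data and variable names are made up for illustration:

import numpy as np

# Hypothetical noisy observations of y measured at positions x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit a linear regression y = a*x + b, then replace each value
# with its fitted (smoothed) counterpart
a, b = np.polyfit(x, y, deg=1)
y_smooth = a * x + b
print(y_smooth)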
• Binning Method:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local
smoothing.
For Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
In this example, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values).
In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is
9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value. In general, the larger the width, the
greater the effect of the smoothing. Alternatively, bins may be equal width,
where the interval range of values in each bin is constant.
Binning is also used as a discretization technique.
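A minimal pure-Python sketch of the three smoothing variants on the price data above, using equal-frequency bins of size 3:

# Sorted price data from the example above
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# Partition into equal-frequency bins of size 3
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean
means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin medians: every value becomes its bin's median
medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer
# of the bin minimum and bin maximum
boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
              for b in bins]

print(means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]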
Data Transformation
This step transforms the data into forms appropriate for the mining process. It involves the
following ways:
1. Normalization: It is done in order to scale the data values into a specified range (-1.0 to 1.0 or
0.0 to 1.0); a sketch follows this list.
2. Attribute Construction: In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization: This is done to replace the raw values of a numeric attribute by interval labels
or conceptual labels.
4. Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher
level in the hierarchy.
For Example: The attribute “city” can be generalized to “country”.
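As an example of the first strategy, min-max normalization rescales values into the range [0.0, 1.0] via v' = (v - min) / (max - min); a minimal sketch with made-up values:

import numpy as np

# Hypothetical attribute values
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0.0, 1.0]
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # [0.    0.125 0.25  0.5   1.   ]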
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in
volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be
more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions are detected
and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as
parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods
such as clustering, sampling, and the use of histograms; a sampling sketch follows this list.
• Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher
conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation
of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they
allow the mining of data at multiple levels of abstraction.
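As a concrete sketch of numerosity reduction by simple random sampling without replacement (the sizes here are arbitrary):

import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical data set of 10,000 records
data = rng.normal(loc=50, scale=10, size=10_000)

# Keep a 1% simple random sample without replacement
sample = rng.choice(data, size=100, replace=False)

# The much smaller sample approximates statistics of the full data
print(data.mean(), sample.mean())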
Data Integration
Data mining often requires data integration—the merging of data from multiple data stores. Careful integration can
help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy
and speed of the subsequent data mining process.
Semantic heterogeneity and differences in the structure of data pose great challenges in data integration;
a minimal merge sketch follows.
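The pandas sketch below shows the mechanical merge step for two hypothetical source tables whose key attributes are named differently; entity identification (matching cust_id to customer_id) is exactly where semantic heterogeneity shows up:

import pandas as pd

# Two hypothetical source tables using different key names
orders = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [10.0, 25.0, 5.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Pune", "Delhi"]})

# Resolve the schema mismatch, then merge into one data set
customers = customers.rename(columns={"customer_id": "cust_id"})
merged = orders.merge(customers, on="cust_id", how="left")
print(merged)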
Data Discretization
In discretization, the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20,
etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-
level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be
defined for the same attribute to accommodate the needs of various users.
Concept Hierarchy Generation for Nominal Data
Attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for
nominal attributes are implicit within the database schema and can be automatically defined at the schema
definition level.
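A brief pandas sketch of discretizing a numeric age attribute into interval labels and into higher-level conceptual labels (the cut points are chosen arbitrarily for illustration):

import pandas as pd

ages = pd.Series([5, 13, 22, 35, 47, 68])

# Replace raw ages with interval labels
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])

# A higher conceptual level of the same hierarchy: youth / adult / senior
concepts = pd.cut(ages, bins=[0, 20, 60, 100],
                  labels=["youth", "adult", "senior"])
print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))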