Data Preprocessing
• The representation and quality of the data used in an analysis is the first and foremost concern any analyst must address.
• Data preprocessing is a data mining technique that transforms raw data into an understandable format, because real-world data is often incomplete, inconsistent or even erroneous.
• Data preprocessing resolves such issues.
• Issues of data quality include:
o Noise and outliers
o Missing data
o Duplicate data
• Data preprocessing ensures that the subsequent data mining processes are free from such errors.
• It is a prerequisite for data mining: it prepares the raw data for the core mining processes.
• Data preprocessing is part of a larger, complex phase known as ETL (Extraction, Transformation and Loading).
• ETL involves extracting data from multiple sources, transforming it into a standardized format and loading it into the data mining system for analysis.
Data Preprocessing Methods
Raw data is highly vulnerable to missing values, noise and inconsistency, and the quality of the data affects the data mining results. So there is a need to improve the quality of the data in order to improve the mining results. To achieve better results, raw data is pre-processed to enhance its quality and make it error free. This eases the mining process.
Data preprocessing is performed in the following stages:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
1. Data cleaning
First, the raw or noisy data goes through the process of cleansing.
In data cleansing, missing values are filled in, noisy data is smoothed, inconsistencies are resolved, and outliers are identified and removed in order to clean the data. To elaborate further:
a) Handling missing values:
It is often found that many of the database tuples or records do not have any
recorded values for some attributes. Such cases of missing values are filled by
different methods, as described below.
i) Fill in the missing value manually: Naturally, manually filling each missing value is laborious and time consuming, so it is practical only when the missing values are few in number. Other methods, described below, deal with missing values when the dataset is very large or when there are many missing values.
ii) Use of some global constant in place of the missing value: In this method, missing values are replaced by a global label such as ‘Unknown’ or -∞. Although this is one of the easiest approaches to dealing with missing values, it should be avoided where the mining program might mistake the repeated occurrences of a global label such as ‘Unknown’ for an interesting pattern. Hence, this method should be used with caution.
iii) Use the attribute mean to fill in the missing value: Fill in the missing values
for each attribute with the mean of other data values of the same attribute.
This is a better way to handle missing values in a dataset.
iv) Use the most probable value to fill in the missing value: Another efficient method is to fill in the missing values with values determined by tools such as Bayesian formalism, decision tree induction or other inference-based tools. This is one of the best methods, as it uses most of the information already present to predict the missing values and is less biased than the previous methods. The only difficulty with this method is the complexity of performing the analysis.
v) Ignore the tuple: If the tuple contains more than one missing value and the other methods are not applicable, then the best strategy is to ignore the whole tuple. This is commonly done when the class label is missing or when the tuple has missing values for most of its attributes. This method should not be used if the percentage of missing values per attribute varies significantly.
Note: Using the attribute mean to fill in missing values is the most common technique used by data mining tools to handle missing values. However, one can always use knowledge of probability to fill in these values, as sketched below.
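A minimal sketch of the global-constant and attribute-mean approaches, assuming pandas is available; the DataFrame, its column names (marks, grade) and the values in it are hypothetical:

```python
import pandas as pd

# Hypothetical student records with missing values in two attributes.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "marks": [72.0, None, 64.0, None, 80.0],
    "grade": ["B", "A", None, "B", "A"],
})

# (ii) Global constant: replace a missing categorical value with a label
#      such as 'Unknown' -- to be used with caution, as noted above.
df["grade"] = df["grade"].fillna("Unknown")

# (iii) Attribute mean: replace missing numeric values with the column mean.
df["marks"] = df["marks"].fillna(df["marks"].mean())

print(df)
```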
b) Handling noisy data
• Most data mining algorithms are adversely affected by noisy data.
• Noise can be defined as unwanted variance or some random error in a measured variable.
• Noise is removed from the data by ‘smoothing’. The methods used for data smoothing are as follows:
i) Binning methods
• The Binning method is used to divide the values of an attribute into bins or
buckets.
• It is commonly used to convert one type of attribute to another type.
• For example, it may be necessary to convert a real-valued numeric attribute
like temperature to a nominal attribute with values cold, cool, warm, and hot
before its processing.
• This is also called ‘discretizing’ a numeric attribute.
• There are two types of discretization, namely, equal interval and equal
frequency.
• In equal interval binning, we calculate a bin size and then put the samples
into the appropriate bin.
• In equal frequency binning, we allow the bin sizes to vary, with our goal being
to choose bin sizes so that every bin has about the same number of samples
in it.
o The idea is that if each bin has the same number of samples, no bin, or
sample, will have greater or lesser impact on the results of data
mining.
• To understand this process, consider a dataset of the marks of 50 students. The process divides this dataset, on the basis of the marks, into 10 bins for this example.
In the case of equal interval binning, we create the bins 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90 and 90-100. If most students have marks between 60 and 80, a few bins will be full while most bins (e.g., 0-10, 10-20, 90-100) will have very few entries. It may therefore be better to divide this dataset on an equal frequency basis: with the same 50 students and the same 10 bins, instead of creating fixed ranges such as 0-10, 10-20 and so on, we first sort the student records by marks in descending (or ascending) order. The 5 students with the highest marks go into the first bin, the next 5 students into the next bin, and so on. If students at a bin boundary have the same marks, the bin ranges can be shifted so that students with the same marks fall into one common bin.
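Both kinds of binning can be sketched with pandas, assuming it is available; the 50 synthetic marks below are hypothetical, and pd.cut / pd.qcut stand in for equal interval and equal frequency binning respectively:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical marks of 50 students, concentrated roughly between 60 and 80.
marks = pd.Series(np.clip(rng.normal(70, 12, 50), 0, 100).round())

# Equal interval binning: ten fixed-width bins covering 0-100.
equal_width = pd.cut(marks, bins=range(0, 101, 10), include_lowest=True)

# Equal frequency binning: ten bins holding roughly 5 students each;
# duplicates='drop' merges bins whose boundaries coincide (tied marks).
equal_freq = pd.qcut(marks, q=10, duplicates="drop")

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```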
ii) Clustering or outlier analysis
• Clustering or outlier analysis is a method that allows detection of outliers by
clustering.
• In clustering, values which are common or similar are organized into groups
or ‘clusters’, and those values which lie outside these clusters are termed as
outliers or noise.
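A minimal sketch of clustering-based outlier detection, assuming NumPy and scikit-learn are available; the two-attribute data, the choice of k-means with two clusters and the mean-plus-three-standard-deviations threshold are illustrative assumptions, not a prescribed method:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two hypothetical numeric attributes: most points fall into two tight groups,
# plus one point that lies far away from both.
data = np.vstack([
    rng.normal([20.0, 20.0], 1.5, size=(40, 2)),
    rng.normal([60.0, 60.0], 1.5, size=(40, 2)),
    [[95.0, 5.0]],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Distance of each point from the centre of its own cluster.
dist = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Points unusually far from their cluster centre are flagged as outliers.
threshold = dist.mean() + 3 * dist.std()
print(data[dist > threshold])
```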
iii) Regression
• Regression is another method which allows data smoothing by fitting the data to some function.
• For example, linear regression is one of the most widely used methods; it aims at finding the line that best fits the values of two variables or attributes (the line of best fit).
• The primary purpose of this is to predict the value of one variable using the other.
• Similarly, multiple regression is used when more than two variables are involved. Regression fits the data to mathematical equations, which in turn removes noise and hence smooths the dataset.
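A minimal sketch of smoothing by linear regression, assuming NumPy is available; the noisy x/y values and the use of np.polyfit for the least-squares line are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical pair of attributes: y depends roughly linearly on x, plus noise.
x = np.linspace(0, 10, 50)
y = 3.0 * x + 5.0 + rng.normal(0.0, 2.0, size=x.size)

# Least-squares fit of the best straight line through the noisy values.
slope, intercept = np.polyfit(x, y, deg=1)

# Replace the noisy observations with the fitted ("smoothed") values,
# which can also be used to predict y for new values of x.
y_smoothed = slope * x + intercept
print(round(slope, 2), round(intercept, 2))
```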
2. Data integration
• One of the most necessary steps to be taken during data analysis is data integration.
• Data integration is a process which combines data from a plethora of
sources (such as multiple databases, flat files or data cubes) into a unified
data store.
• During data integration, a number of tricky issues have to be considered.
• For example, how can the data analyst or the analyzing machine be sure that student_id in one database and student_number in another database refer to the same entity? This is referred to as the problem of entity identification.
• The solution to this problem lies in ‘metadata’. Databases and data warehouses contain metadata, which is data about data. The data analyst refers to this metadata to avoid errors during the process of data integration.
• Another issue, which may be caused by schema integration, is redundancy. In database terms, an attribute is said to be redundant if it can be derived from another table (of the same database).
• Mistakes in attribute naming can also lead to redundancies in the resulting dataset. A number of tools are used to integrate data from different sources into one unified schema; a minimal merge is sketched below.
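A minimal sketch of integrating two sources whose keys name the same entity differently (the student_id / student_number example above), assuming pandas is available; the tables and their contents are hypothetical:

```python
import pandas as pd

# Two hypothetical sources describing the same students under different keys.
registrar = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chloe"],
})
exam_office = pd.DataFrame({
    "student_number": [2, 3, 4],
    "marks": [64, 80, 71],
})

# Metadata tells us that student_id and student_number refer to the same
# entity, so one key is renamed before merging into a unified data store.
unified = registrar.merge(
    exam_office.rename(columns={"student_number": "student_id"}),
    on="student_id",
    how="outer",
)
print(unified)
```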
3. Data transformation
• When the values of one attribute are small compared to those of other attributes, that attribute will not have much influence on the mining of information, since both its values and the variation within the attribute are smaller than those of the other attributes.
• Thus, data transformation is a process in which data is consolidated or
transformed into some other standard forms which are better suited for
data mining.
• All attributes should be transformed to a similar scale for clustering to be
effective unless we wish to give more weight to some attributes that are
comparatively large in scale.
• The two most popular and widely used data transformation techniques are normalization and standardization.
Normalization
• In normalization, all the attributes are converted to a normalized score, i.e., rescaled to the range [0, 1]. The problem with normalization is outliers.
• If there is an outlier, it will tend to crunch all of the other values down toward zero. To understand this, suppose the range of students’ marks is 35 to 45 out of 100.
• Then 35 will be considered as 0 and 45 as 1, and students will be distributed
between 0 to 1 depending upon their marks.
• But if there is one student having marks 90, then it will act as an outlier and
in this case, 35 will be considered as 0 and 90 as 1.
• Now, it will crunch most of the values down toward the value of zero. In this
scenario, the solution is standardization.
Standardization
• In case of standardization, the values are all spread out so that we have a
standard deviation of 1.
• Generally, there is no rule for when to use normalization versus
standardization.
• However, if your data has outliers, use standardization, otherwise use
normalization.
• Using standardization tends to make the remaining values for all of the
other attributes fall into similar ranges since all attributes will have the
same standard deviation of 1.
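A minimal sketch contrasting the two transformations on the marks example above, assuming NumPy is available; the specific mark values are hypothetical:

```python
import numpy as np

# Hypothetical marks: most lie between 35 and 45, with one outlier at 90.
marks = np.array([35, 38, 40, 41, 43, 45, 90], dtype=float)

# Min-max normalization to [0, 1]: the outlier becomes 1 and crunches the
# remaining values down toward 0.
normalized = (marks - marks.min()) / (marks.max() - marks.min())

# Standardization (z-score): zero mean and unit standard deviation, so the
# non-outlier values stay spread out on a comparable scale.
standardized = (marks - marks.mean()) / marks.std()

print(normalized.round(2))
print(standardized.round(2))
```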
4. Data reduction
• It is often seen that when complex data analysis and mining processes are carried out over very large datasets, they take so long that the whole data mining or analysis process becomes unviable.
• Data reduction techniques come to the rescue in such situations.
• Using data reduction techniques, a dataset can be represented in a reduced
manner without actually compromising the integrity of original data.
• Data reduction is all about reducing the dimensions (referring to the total
number of attributes) or reducing the volume.
• Moreover, mining when carried out on reduced datasets often results in
better accuracy and proves to be more efficient.
• There are many methods to reduce large datasets to yield useful
knowledge. A few among them are:
i. Dimension reduction: In data warehousing, a ‘dimension’ equips us with structured labeling information, but not all dimensions (attributes) are necessary at any one time. Dimension reduction uses algorithms such as Principal Component Analysis (PCA). With such algorithms, one can detect and remove redundant and weakly relevant attributes or dimensions (see the sketch after this list).
ii. Numerosity reduction: It is a technique which is used to choose smaller
forms of data representation for reducing the dataset volume.
iii. Data compression: We can also use data compression techniques to reduce the dataset size. These techniques are classified as lossy and lossless compression techniques, in which encoding mechanisms (e.g. Huffman coding) are used.
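A minimal sketch of dimension reduction with PCA (item i above), assuming NumPy and scikit-learn are available; the synthetic dataset with redundant attributes and the 95% variance threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical dataset with 5 attributes, two of which are redundant
# because they are linear combinations of the other three.
base = rng.normal(size=(100, 3))
data = np.hstack([base, 2.0 * base[:, :1], base[:, 1:2] + base[:, 2:3]])

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(data)

print(data.shape, "->", reduced.shape)   # typically (100, 5) -> (100, 3)
print(pca.explained_variance_ratio_.round(3))
```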