3-Data Pre-Processing
3-Data Pre-Processing
2
Steps Involved
What is Data
For instance, days in a
week : {Monday, Tuesday,
Wednesday, Thursday, Another example could
Friday, Saturday, Sunday} be the Boolean set :
is a category because its {True, False}
value is always taken
from this set.
Features
whose values
are taken from
a defined set
of values.
Categorical Attributes
Numerical : Features whose values are continuous or integer-valued. They are represented by numbers
and possess most of the properties of numbers.
Missing values :
Eliminate rows with missing data :
• Simple and sometimes effective strategy. Fails if
many objects have missing values. If a feature has
mostly missing values, then that feature itself can
also be eliminated.
A dataset may include data objects which are duplicates of one another.
It may happen when say the same person submits a form more than
once. The term deduplication is often used to refer to the process of
dealing with duplicates.
• Binning method:
• Clustering
• Combined computer and human inspection
• Regression
Binning method is used to smoothing data or to handle noisy
data. In this method, the data is first sorted and then the sorted
values are distributed into a number of buckets or bins.
As binning methods consult the neighbourhood of values, they
perform local smoothing.