Arba Minch University Arba Minch Institute of Technology Faculty of Computing & Software Engineering
Introduction to
Data Mining &
Data Warehouse
MR. Addisu M. (Asst. Prof)
Garbage In Garbage Out
(GIGO)
CHAPTER THREE
DATA PREPROCESSING
02/28/2022 2
What is Data Pre-processing?
• Data preprocessing is a technique used to convert raw data
into a clean data set.
• It transforms the raw data into a useful and efficient format.
• Data preprocessing is used for representing complex structures with
attributes, discretization of continuous attributes, binarization of
attributes, converting discrete attributes to continuous, and dealing with
missing and unknown attribute values. Various visualization techniques
provide valuable help in data preprocessing.
• The quality of the data should be checked before applying machine
learning or data mining algorithms.
Why process the data?
• Data in the real world may be,
• Inaccurate data (missing data)
– data is not continuously collected,
– existence of duplication within data,
– problem of data gathering tools
– a human mistake during data entry …
– technical problems with biometrics …
– containing mistakes in codes or names …
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization, aggregation, and generalization
• Data reduction
– Obtains a reduced representation in volume but produces the same or similar
analytical results
Forms of data preprocessing
How is Data Preprocessing performed?
Major Tasks in Data Preprocessing
• Data cleaning
– The process of removing incorrect, incomplete, and inaccurate data from the
datasets
– There are several techniques in data cleaning
– Handling missing values:
– Standard values like “Not Available” or “NA” can be used to replace the
missing values
– Missing values can also be filled in manually, but this is not recommended
when the dataset is big.
– The attribute’s mean value can be used to replace a missing value.
– The missing value can be replaced by the most probable value, predicted
using regression or decision tree algorithms.
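As a minimal sketch of the mean-replacement technique above, the following Python snippet fills missing entries with the attribute's mean. The function name fill_missing_with_mean, the use of None to mark a missing value, and the sample ages are assumptions made for illustration; they do not come from the slides.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


# Hypothetical 'age' attribute with two missing entries.
ages = [25, None, 31, 40, None]
filled_ages = fill_missing_with_mean(ages)
print(filled_ages)
```

The mean is computed from the observed values only, so the missing entries do not distort it.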
Major Tasks in Data Preprocessing
• Data cleaning
– There are several techniques in data cleaning
– Noisy data: generally means random error or unnecessary data points
– Some of the methods to handle noisy data:
– Binning: smooths sorted data by consulting its ‘neighbourhood’. First, the data
is sorted, and then the sorted values are distributed into a number of
equal-size ‘buckets’ or bins.
– There are three methods for smoothing the data in a bin:
– Smoothing by bin means
– Smoothing by bin medians
– Smoothing by bin boundaries
– Regression: helps to handle data when unnecessary data is present. For
analysis purposes, regression helps to decide which variables are suitable for
analysis
– Clustering: used for finding outliers and also for grouping data
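The three bin-smoothing methods listed above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function names and the sample price values are assumptions, the bins are equal-frequency bins, and ties in boundary smoothing are resolved toward the lower boundary by choice.

```python
from statistics import mean, median


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    if not ordered or n_bins <= 0 or len(ordered) % n_bins != 0:
        raise ValueError("this sketch needs a non-empty list whose length is a multiple of n_bins")
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical unsorted price data.
prices = [21, 4, 28, 8, 25, 15, 34, 21, 24]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins, by_means, by_medians, by_boundaries, sep="\n")
```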
Major Tasks in Data Preprocessing
• Data integration
– The process of combining multiple sources into a single dataset
– There are some problems to be considered during data integration
– Schema integration: integrates metadata from different sources
– Entity identification problem: identifying entities from multiple databases.
E.g., the system or user should know that student_id in one database and
student_name in another database belong to the same entity.
– Detecting and resolving data value conflicts: data taken from different
databases may differ when merged
– attribute values from one DB may differ from those in another DB
– For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”
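One way to resolve the date-format conflict above is to convert each source's dates to a single common format before merging. The sketch below assumes two hypothetical sources, db_a (MM/DD/YYYY) and db_b (DD/MM/YYYY), and uses ISO 8601 (YYYY-MM-DD) as the target format. The source names and the target format are illustrative assumptions.

```python
from datetime import datetime

# Assumed date format of each (hypothetical) source database.
SOURCE_FORMATS = {
    "db_a": "%m/%d/%Y",  # MM/DD/YYYY
    "db_b": "%d/%m/%Y",  # DD/MM/YYYY
}


def to_iso_date(text, source):
    """Parse a date string using its source's format and return YYYY-MM-DD."""
    return datetime.strptime(text, SOURCE_FORMATS[source]).date().isoformat()


# The same calendar day written in the two source formats.
date_from_a = to_iso_date("02/28/2022", "db_a")
date_from_b = to_iso_date("28/02/2022", "db_b")
print(date_from_a, date_from_b)
```

A string such as "03/04/2022" is ambiguous on its own, so the format has to be known per source. It cannot be reliably guessed from the value.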
Major Tasks in Data Preprocessing
• Data reduction
– Helps reduce the volume of data, which makes analysis easier yet
produces the same or almost the same result
– The integrity of the data must be ensured while reducing it
– Reduces the volume of the original data and represents it in a much smaller
volume
Major Tasks in Data Preprocessing
• Data reduction
– Some of the techniques in data reduction are
– Numerosity reduction: data are replaced or estimated by an alternative,
smaller form of data representation
– Data compression: the compressed form of the data can be lossless or lossy
– When there is no loss of information during compression, it is called
lossless compression
– Whereas lossy compression discards some information,
ideally only the unnecessary information
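The difference between lossless and lossy reduction can be illustrated with a small Python sketch. It uses the standard zlib module for lossless compression and, as a deliberately simple stand-in for a lossy scheme, rounds the values to one decimal place. The sensor readings and the choice of rounding are assumptions for illustration only.

```python
import zlib

# Hypothetical, highly repetitive sensor readings.
readings = [21.347, 21.352, 21.349, 21.351] * 50
raw = ",".join(f"{r:.3f}" for r in readings).encode("utf-8")

# Lossless: the original bytes are recovered exactly after decompression.
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)
print(len(raw), len(compressed), restored == raw)

# Lossy (illustrative): rounding keeps the overall level but discards detail
# that cannot be recovered afterwards.
lossy_readings = [round(r, 1) for r in readings]
print(sorted(set(lossy_readings)))
```

Whether the discarded detail is truly unnecessary depends on the analysis that follows, which is why lossy reduction needs more care than lossless compression.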
Major Tasks in Data Preprocessing
• Data Transformation
– A change made in the format or structure of the data
– Can be simple or complex, based on the requirements
– There are several methods in data transformation.
– Smoothing: means removing noise from the dataset
– How is noise removed? Using techniques such as binning,
regression, clustering, …
– Attribute construction: new attributes are constructed from the
existing set of attributes in order to build a new data set that eases
data mining
– E.g., a data set of measurements of different plots may have the
height and width of each plot. So, it is possible to construct a new attribute ‘area’
from the attributes ‘height’ and ‘width’
– also helps in understanding relations among the attributes
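The plot example can be sketched in a few lines of Python. The records and their values are hypothetical, and the new ‘area’ attribute is simply the product of the two existing attributes.

```python
# Hypothetical plot measurements with the two existing attributes.
plots = [
    {"height": 10.0, "width": 4.0},
    {"height": 7.5, "width": 2.0},
]

# Attribute construction: derive 'area' from 'height' and 'width',
# leaving the original records unchanged.
plots_with_area = [{**p, "area": p["height"] * p["width"]} for p in plots]
print(plots_with_area)
```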
Major Tasks in Data Preprocessing
• Data Transformation
– There are several methods in data transformation.
– Aggregation: data is stored and presented in the form of a summary.
Data sets from multiple sources are integrated into a data
analysis description
– Discretization: continuous data is split into intervals
– replacing values of numeric data with interval labels
– E.g., values for the attribute ‘age’ can be replaced by interval
labels such as (0-10, 11-20, …) or (kid, youth, adult, senior)
– Normalization: a method of scaling the data so that it can be
represented in a smaller range, e.g., from -1.0 to 1.0.
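Normalization to the range -1.0 to 1.0 can be sketched with min-max scaling, one common normalization method (the slides do not say which method is intended, so this choice is an assumption). The function name and the income values are also hypothetical.

```python
def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_min == old_max:
        raise ValueError("cannot rescale a constant attribute")
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]


# Hypothetical 'income' attribute.
incomes = [200, 300, 400, 600, 1000]
normalized = min_max_normalize(incomes)
print(normalized)
```

The relative spacing between values is preserved. Only the range changes.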
Discretization
• Three types of attributes:
– Nominal: values from an unordered set
– Used for labelling or naming variables, without any quantitative value
– E.g., country, gender, color, …
– Ordinal: values from an ordered set
– E.g., first, second, …; good, neutral, bad, …
– Continuous: real numbers, can be interval or ratio variables
– E.g., temperature in degrees Celsius/Fahrenheit, height, mass, distance, …
• Discretization: divide the range of a continuous attribute into intervals
– why?
– Some classification algorithms only accept categorical attributes.
– Reduce data size by discretization
– Prepare for further analysis
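Dividing a continuous attribute into labelled intervals can be sketched as below, using the age labels mentioned earlier (kid, youth, adult, senior). The cut points 13, 25 and 60 are hypothetical and were chosen only for illustration; the slides do not define them.

```python
from bisect import bisect_right

# Hypothetical cut points: ages below 13 are 'kid', 13-24 'youth',
# 25-59 'adult', and 60 or above 'senior'.
AGE_CUT_POINTS = [13, 25, 60]
AGE_LABELS = ["kid", "youth", "adult", "senior"]


def discretize_age(age):
    """Map a numeric age to the label of the interval it falls into."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return AGE_LABELS[bisect_right(AGE_CUT_POINTS, age)]


ages = [5, 13, 24, 25, 59, 60, 80]
age_labels = [discretize_age(a) for a in ages]
print(age_labels)
```

After this step the attribute is categorical, so it can be passed to algorithms that only accept categorical attributes.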
Discretization
• Used to transform attributes that are in a continuous format into intervals
Thank You