
Arba Minch University Arba Minch Institute of Technology Faculty of Computing & Software Engineering

Data preprocessing is an important step in preparing raw data for analysis. It involves cleaning the data by handling missing values and outliers, integrating multiple data sources, reducing the data volume through techniques like dimensionality reduction and data cube aggregation, and transforming the data for modeling algorithms. The major tasks in data preprocessing are data cleaning, data integration, data reduction, and data transformation, which prepare the raw data into a format suitable for mining useful patterns.

Uploaded by

Mustefa Mohammed

Arba Minch University

Arba Minch Institute of Technology


Faculty of Computing & Software Engineering

Introduction to
Data Mining &
Data Warehouse
Mr. Addisu M. (Asst. Prof.)
Garbage In Garbage Out
(GIGO)
CHAPTER THREE
DATA PREPROCESSING
02/28/2022 2
What is Data Pre-processing?
• Data preprocessing is a technique used to convert raw data into a clean data set.
• It transforms the raw data into a useful and efficient format.
• Data preprocessing covers representing complex structures with attributes, discretizing continuous attributes, binarizing attributes, converting discrete attributes to continuous ones, and dealing with missing and unknown attribute values. Various visualization techniques provide valuable help in data preprocessing.
• The quality of the data should be checked before applying machine learning or data mining algorithms.
Why process the data?
• Data in the real world may be:
– Inaccurate data (missing data)
   – data is not continuously collected,
   – a mistake in data entry,
   – technical problems with biometrics …
– The presence of noisy data (erroneous data and outliers)
   – problem of data gathering tools,
   – a human mistake during data entry …
– Inconsistent
   – existence of duplication within data,
   – human data entry,
   – containing mistakes in codes or names …
• No quality data, no quality mining results!
• In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization, aggregation, and generalization
• Data reduction
– Obtains a reduced representation in volume but produces the same or similar analytical results
Forms of data preprocessing

How is Data Preprocessing performed?

Major Tasks in Data Preprocessing
• Data cleaning
– The process of removing incorrect, incomplete, and inaccurate data from a dataset
– Techniques for handling missing values:
   – Replace missing values with a standard value such as “Not Available” or “NA”.
   – Fill in missing values manually; this is not recommended when the dataset is big.
   – Replace a missing value with the attribute’s mean value.
   – Replace a missing value with the most probable value, predicted using regression or a decision tree algorithm.
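The first and third techniques for missing values can be sketched in a few lines of Python. This is only an illustration: the `ages` list, the use of `None` to mark a missing value, and the helper names are assumptions made for the example, not part of the slides.

```python
from statistics import mean

def fill_with_constant(values, placeholder="NA"):
    """Replace every missing value (None) with a standard placeholder."""
    return [placeholder if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]

# Hypothetical 'age' attribute with two missing entries.
ages = [20, None, 30, 25, None]

print(fill_with_constant(ages))  # missing entries become "NA"
print(fill_with_mean(ages))      # missing entries become the mean of 20, 30, 25
```

Filling with a constant keeps the gap visible to later steps, while filling with the mean keeps the attribute numeric; which one fits depends on the algorithm that consumes the data.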
Major Tasks in Data Preprocessing
• Data cleaning
– Noisy data: data containing random error or unnecessary data points
– Some of the methods to handle noisy data:
   – Binning: the data is first sorted, and the sorted values are then distributed into a number of equal-sized ‘buckets’ or bins. Each value is then smoothed by consulting its ‘neighbourhood’, i.e. the other values in its bin. There are three methods for smoothing the data in a bin:
      – Smoothing by bin mean
      – Smoothing by bin median
      – Smoothing by bin boundary
   – Regression: helps to handle data when unnecessary data is present. For analysis purposes, regression helps to decide which variables are suitable for the analysis.
   – Clustering: used for finding outliers and also for grouping data
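The three bin-smoothing methods can be sketched as follows. The price values, the bin size of 3, and the rule that a value exactly halfway between the two boundaries goes to the lower one are assumptions chosen for the example.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins holding bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_mean(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_median(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    """Replace every value with the closer of the bin's minimum and maximum.
    Ties go to the lower boundary (an assumption of this sketch)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Hypothetical price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_mean(bins))      # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_median(bins))    # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundary(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```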
Major Tasks in Data Preprocessing
• Data integration
– The process of combining multiple sources into a single dataset
– Some problems to be considered during data integration:
   – Schema integration: integrating metadata from different sources
   – Entity identification problem: identifying the same entity across multiple databases. E.g., the system or user should know that student_id in one database and student_name in another database belong to the same entity.
   – Detecting and resolving data value conflicts: values taken from different databases may differ when they are merged
      – Attribute values from one DB may differ from those in another DB
      – For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”
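One way to resolve the date-format conflict is to convert every source to a single common format before merging. The sketch below assumes two hypothetical sources, `db_a` and `db_b`, whose formats are known in advance; the format cannot be guessed reliably from a value such as 03/04/2022.

```python
from datetime import datetime

# Assumed date format of each (hypothetical) source database.
SOURCE_FORMATS = {
    "db_a": "%m/%d/%Y",  # MM/DD/YYYY
    "db_b": "%d/%m/%Y",  # DD/MM/YYYY
}

def to_iso_date(text, source):
    """Parse a date string using its source's format and return YYYY-MM-DD."""
    parsed = datetime.strptime(text, SOURCE_FORMATS[source])
    return parsed.date().isoformat()

# The same calendar day written in the two conventions.
print(to_iso_date("02/28/2022", "db_a"))
print(to_iso_date("28/02/2022", "db_b"))
```

Both calls yield the same ISO date, so records from the two sources can then be compared or joined on that attribute.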
Major Tasks in Data Preprocessing
• Data reduction
– Reduces the volume of data, which makes analysis easier, yet produces the same or almost the same result
– Ensures the integrity of the data while reducing it
– Represents the original data in a much smaller volume

Techniques of Data Reduction


Major Tasks in Data Preprocessing
• Data reduction
– Some of the techniques in data reduction are:
   – Dimensionality reduction: necessary for real-world applications because the data size is big
      – Eliminates outdated, unwanted, or redundant variables/attributes
      – Combines and merges attributes of the data without losing its original characteristics

ID No         Name           Mobile Number  Region
RAMiT/125/11  Tesfaye Ayele  091 698 7463   SNNPR
RNS/0125/10   Tsion Demisew  091 145 8321   Addis Ababa

– If we know the mobile number, then we can know the region. So we need to remove that one dimension (Region):

ID No         Name           Mobile Number
RAMiT/125/11  Tesfaye Ayele  091 698 7463
RNS/0125/10   Tsion Demisew  091 145 8321
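The table example can be sketched in Python as a check followed by a drop. The helper names are hypothetical, and the check only confirms that, within the sample rows, each mobile number maps to a single region; it does not prove the rule for all data.

```python
records = [
    {"ID No": "RAMiT/125/11", "Name": "Tesfaye Ayele",
     "Mobile Number": "091 698 7463", "Region": "SNNPR"},
    {"ID No": "RNS/0125/10", "Name": "Tsion Demisew",
     "Mobile Number": "091 145 8321", "Region": "Addis Ababa"},
]

def is_determined_by(rows, key, target):
    """Return True if every value of `key` maps to exactly one `target` value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[key], row[target]) != row[target]:
            return False
    return True

def drop_attribute(rows, name):
    """Return copies of the rows without the given attribute."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

# Drop 'Region' only if it is redundant given 'Mobile Number'.
if is_determined_by(records, "Mobile Number", "Region"):
    reduced = drop_attribute(records, "Region")
else:
    reduced = records

print(reduced)
```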
Major Tasks in Data Preprocessing
• Data reduction
– Reduces the volume of data, which makes analysis easier, yet produces the same or almost the same result
– Some of the techniques in data reduction are:
   – Data cube aggregation: used to aggregate data
      – It is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set
      – E.g., suppose you have the data of all electronics sales per quarter for the years 2018 to 2022
      – If you want to get the annual sales per year, you just have to aggregate the sales per quarter for each year
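The quarter-to-year roll-up in the example can be sketched as below. The sales figures are made up for illustration, and only two of the years are shown to keep the example short.

```python
from collections import defaultdict

# Hypothetical electronics sales keyed by (year, quarter).
quarterly_sales = {
    (2018, "Q1"): 200, (2018, "Q2"): 300, (2018, "Q3"): 250, (2018, "Q4"): 350,
    (2019, "Q1"): 220, (2019, "Q2"): 310, (2019, "Q3"): 270, (2019, "Q4"): 400,
}

def annual_totals(sales):
    """Aggregate quarterly sales up one level, to one total per year."""
    totals = defaultdict(int)
    for (year, _quarter), amount in sales.items():
        totals[year] += amount
    return dict(totals)

print(annual_totals(quarterly_sales))  # {2018: 1100, 2019: 1200}
```

The aggregated result holds one value per year instead of four, which is the reduction in volume; the quarterly detail is no longer available at that level.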
Major Tasks in Data Preprocessing
• Data reduction
– Reduces the volume of data, which makes analysis easier, yet produces the same or almost the same result
– Some of the techniques in data reduction are:
   – Numerosity reduction: the data is replaced or estimated by an alternative, smaller form of data representation
   – Data compression: the compressed form of the data can be lossless or lossy
      – When there is no loss of information during compression, it is called lossless compression
      – Lossy compression, by contrast, discards some information, ideally only the unnecessary part, so the original data can only be approximated
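The difference between the two kinds of compression can be illustrated with a small sketch. It uses Python's standard `zlib` module for the lossless case and simple rounding as a stand-in for a lossy scheme; the sensor readings are invented, and rounding is only one of many possible lossy methods.

```python
import zlib

# Hypothetical sensor readings, repeated to give the compressor something to work with.
readings = [21.4378, 21.4391, 21.4402, 21.4388] * 50
raw = ",".join(f"{r:.4f}" for r in readings).encode("utf-8")

# Lossless: the decompressed bytes are identical to the original bytes.
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)
print(len(raw), len(compressed), restored == raw)

# Lossy (illustrated by rounding): fewer digits are stored,
# and the original readings cannot be recovered exactly.
lossy = [round(r, 1) for r in readings]
print(lossy[:4], lossy == readings)
```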
Major Tasks in Data Preprocessing
• Data Transformation
– A change made to the format or structure of the data
– Can be simple or complex, based on the requirements
– Some methods in data transformation:
   – Smoothing: removing noise from the dataset
      – How is noise removed? Using techniques such as binning, regression, clustering, …
   – Attribute construction: new attributes are constructed from the existing set of attributes in order to build a new data set that eases data mining
      – E.g., a data set of measurements of different plots may have the height and width of each plot, so it is possible to construct a new attribute ‘area’ from the attributes ‘height’ and ‘width’
      – Also helps in understanding relations among the attributes
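The plot example of attribute construction can be written as a short sketch. The plot records and the `add_area` helper are hypothetical.

```python
# Hypothetical plot measurements.
plots = [
    {"plot": "A", "height": 20.0, "width": 15.0},
    {"plot": "B", "height": 12.5, "width": 8.0},
]

def add_area(rows):
    """Return copies of the rows with a constructed 'area' attribute."""
    return [{**row, "area": row["height"] * row["width"]} for row in rows]

for row in add_area(plots):
    print(row["plot"], row["area"])
```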
Major Tasks in Data Preprocessing
• Data Transformation
– Some methods in data transformation:
   – Aggregation: data is stored and presented in the form of a summary. A data set gathered from multiple sources is integrated into a data analysis description.
   – Discretization: continuous data is split into intervals
      – Replaces the values of numeric data with interval labels
      – E.g., values of the attribute ‘age’ can be replaced by interval labels such as (0-10, 11-20, …) or (kid, youth, adult, senior)
   – Normalization: scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0
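Normalization to the range -1.0 to 1.0 can be sketched with min-max scaling. The slides do not name a specific method, so min-max scaling is an assumption here, as are the income values.

```python
def min_max_scale(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_min == old_max:
        raise ValueError("cannot scale an attribute whose values are all equal")
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

# Hypothetical 'income' attribute.
incomes = [12000, 30000, 48000]
print(min_max_scale(incomes))  # [-1.0, 0.0, 1.0]
```

The scaled values keep their relative order and spacing; only the range changes.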
Discretization
• Three types of attributes:
– Nominal: values from an unordered set
   – Used for labelling or naming variables, without any quantitative value
   – E.g., country, gender, color, …
– Ordinal: values from an ordered set
   – E.g., first, second, …; good, neutral, bad, …
– Continuous: real numbers; can be interval or ratio variables
   – E.g., temperature in degrees Celsius/Fahrenheit, height, mass, distance, …
• Discretization: divide the range of a continuous attribute into intervals
– Why?
   – Some classification algorithms only accept categorical attributes.
   – Reduce data size by discretization
   – Prepare for further analysis
Discretization
• Used to transform attributes that are in a continuous format
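Discretizing a continuous attribute into labelled intervals can be sketched as follows. The labels follow the (kid, youth, adult, senior) example, but the cut points 12, 24, and 59 are assumptions picked for illustration only.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds for the first three labels.
AGE_UPPER_BOUNDS = [12, 24, 59]
AGE_LABELS = ["kid", "youth", "adult", "senior"]

def age_label(age):
    """Map a numeric age to the label of the interval it falls into."""
    if age < 0:
        raise ValueError("age cannot be negative")
    return AGE_LABELS[bisect_left(AGE_UPPER_BOUNDS, age)]

ages = [4, 12, 17, 35, 60, 71]
print([age_label(a) for a in ages])
# ['kid', 'kid', 'youth', 'adult', 'senior', 'senior']
```

After this step the attribute takes only four distinct values, which suits algorithms that accept categorical attributes only.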
Thank You
