Arba Minch University Arba Minch Institute of Technology Faculty of Computing & Software Engineering

Data preprocessing is an important step in preparing raw data for analysis. It involves cleaning the data by handling missing values and outliers, integrating multiple data sources, reducing the data volume through techniques like dimensionality reduction and data cube aggregation, and transforming the data for modeling algorithms. The major tasks in data preprocessing are data cleaning, data integration, data reduction, and data transformation, which prepare the raw data into a format suitable for mining useful patterns.

Uploaded by

Mustefa Mohammed


Arba Minch University

Arba Minch Institute of Technology

Faculty of Computing & Software Engineering

Introduction to Data Mining & Data Warehouse
Mr. Addisu M. (Asst. Prof)
Garbage In, Garbage Out (GIGO)
CHAPTER THREE
DATA PREPROCESSING
02/28/2022 2
What is Data Preprocessing?
• Data preprocessing is a technique used to convert raw data into a clean data set.
• It transforms the raw data into a useful and efficient format.
• Data preprocessing is used for representing complex structures with attributes, discretization of continuous attributes, binarization of attributes, converting discrete attributes to continuous, and dealing with missing and unknown attribute values. Various visualization techniques provide valuable help in data preprocessing.
• The quality of the data should be checked before applying machine learning or data mining algorithms.
Why preprocess the data?
• Data in the real world may be:
– Inaccurate (missing data): data is not continuously collected, a mistake in data entry, technical problems with biometrics, …
– Noisy (erroneous data and outliers): a problem with data gathering tools, a human mistake during data entry, …
– Inconsistent: existence of duplication within data, human data entry, mistakes in codes or names, …
• No quality data, no quality mining results!
• In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization, aggregation, and generalization
• Data reduction
– Obtains a reduced representation in volume that produces the same or similar analytical results
Forms of data preprocessing
[Figure: forms of data preprocessing]
How is Data Preprocessing performed?
[Figure: how data preprocessing is performed]
Major Tasks in Data Preprocessing
• Data cleaning
– The process of removing incorrect, incomplete, and inaccurate data from the datasets
– Techniques in data cleaning:
  – Handling missing values:
    – Standard values like “Not Available” or “NA” can be used to replace the missing values.
    – Missing values can also be filled in manually, but this is not recommended when the dataset is big.
    – The attribute’s mean value can be used to replace a missing value.
    – Using regression or decision tree algorithms, the missing value can be replaced by the most probable value.
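The first and third missing-value strategies listed above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `ages` list, the use of `None` to mark a missing value, and the function names are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical attribute values; None marks a missing value (an assumption).
ages = [23, None, 31, 27, None, 39]


def fill_with_constant(values, placeholder="NA"):
    """Replace each missing value with a standard placeholder such as "NA"."""
    return [placeholder if v is None else v for v in values]


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


filled_constant = fill_with_constant(ages)
filled_mean = fill_with_mean(ages)
print(filled_constant)
print(filled_mean)
```

Filling with a placeholder keeps the gap visible to later steps, while filling with the mean produces a fully numeric column at the cost of understating the attribute's variance.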
Major Tasks in Data Preprocessing
• Data cleaning
– Techniques in data cleaning:
  – Noisy data: generally means random error or unnecessary data points
  – Some of the methods to handle noisy data:
    – Binning: smooths sorted data values by consulting their ‘neighbourhood’. The data is first sorted, and the sorted values are then distributed into a number of equal-size ‘buckets’ or bins.
    – There are three methods for smoothing the data in a bin:
      – Smoothing by bin mean
      – Smoothing by bin median
      – Smoothing by bin boundary
    – Regression: helps handle data when unnecessary data is present. For analysis purposes, regression helps decide which variables are suitable for analysis.
    – Clustering: used for finding outliers and also for grouping data
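The three bin-smoothing methods can be illustrated with a short Python sketch. It assumes equal-frequency bins and a value count divisible by the number of bins; the `prices` list and all function names are hypothetical, chosen only for the example.

```python
from statistics import mean, median


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size.

    Assumes len(values) is divisible by n_bins.
    """
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_mean(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundary(bins):
    """Replace every value with the closer of the bin's minimum or maximum.

    Ties go to the lower boundary (a choice made for this sketch).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical data
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_mean(bins))
print(smooth_by_median(bins))
print(smooth_by_boundary(bins))
```

With these nine values and three bins, the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34], and each method replaces the values inside a bin without moving any value to another bin.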
Major Tasks in Data Preprocessing
• Data integration
– The process of combining multiple sources into a single dataset
– Some problems to be considered during data integration:
  – Schema integration: integrates metadata from different sources
  – Entity identification problem: identifying entities from multiple databases. E.g., the system or user should know that student_id in one database and student_name in another database belong to the same entity.
  – Detecting and resolving data value conflicts: data taken from different databases may differ when merged
    – Attribute values from one DB may differ from those in another DB
    – For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”
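The date-format conflict mentioned above is commonly resolved by converting every source to one agreed representation before merging. The sketch below assumes that each source's format is known in advance; the source names `db_a` and `db_b` and the choice of ISO `YYYY-MM-DD` as the target are illustrative assumptions.

```python
from datetime import datetime

# Assumption: the date format used by each source database is known in advance.
SOURCE_FORMATS = {
    "db_a": "%m/%d/%Y",  # MM/DD/YYYY
    "db_b": "%d/%m/%Y",  # DD/MM/YYYY
}


def to_iso_date(raw, source):
    """Parse a date string with its source's format and return YYYY-MM-DD."""
    parsed = datetime.strptime(raw, SOURCE_FORMATS[source])
    return parsed.date().isoformat()


same_day_a = to_iso_date("02/28/2022", "db_a")
same_day_b = to_iso_date("28/02/2022", "db_b")
print(same_day_a, same_day_b)
```

The format has to come from the source's metadata rather than from the value itself, because a string such as "03/04/2022" is valid under both conventions.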
Major Tasks in Data Preprocessing
• Data reduction
– Reduces the volume of data, which makes analysis easier yet produces the same or almost the same result
– Ensures the integrity of the data while reducing it
– Represents the original data in a much smaller volume

Techniques of Data Reduction
Major Tasks in Data Preprocessing
• Data reduction
– Some of the techniques in data reduction are:
  – Dimensionality reduction: necessary for real-world applications, as the data size is big
    – Eliminates outdated, unwanted, or redundant variables/attributes
    – Combines and merges attributes of the data without losing its original characteristics

ID No         Name           Mobile Number  Region
RAMiT/125/11  Tesfaye Ayele  091 698 7463   SNNPR
RNS/0125/10   Tsion Demisew  091 145 8321   Addis Ababa

    – If we know the mobile number, then we can know the region. So we need to reduce that one dimension:

ID No         Name           Mobile Number
RAMiT/125/11  Tesfaye Ayele  091 698 7463
RNS/0125/10   Tsion Demisew  091 145 8321
Major Tasks in Data Preprocessing
• Data reduction
– Reduces the volume of data, which makes analysis easier yet produces the same or almost the same result
– Some of the techniques in data reduction are:
  – Data cube aggregation: used to aggregate data
    – A multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set
    – E.g., suppose you have the data of all electronics sales per quarter for the years 2018 to 2022
    – If you want the annual sales per year, you just have to aggregate the sales per quarter for each year
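The quarterly-to-annual roll-up described above amounts to summing along the quarter dimension. The following sketch shows this with plain Python; the sales figures are invented purely for illustration and cover only two of the years, and the `(year, quarter)` key layout is an assumption.

```python
from collections import defaultdict

# Hypothetical quarterly sales figures: (year, quarter) -> sales amount.
quarterly_sales = {
    (2018, "Q1"): 200, (2018, "Q2"): 250, (2018, "Q3"): 300, (2018, "Q4"): 350,
    (2019, "Q1"): 220, (2019, "Q2"): 260, (2019, "Q3"): 310, (2019, "Q4"): 400,
}


def aggregate_by_year(sales):
    """Sum the quarterly amounts for each year, dropping the quarter dimension."""
    annual = defaultdict(int)
    for (year, _quarter), amount in sales.items():
        annual[year] += amount
    return dict(annual)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)
```

Eight quarterly records shrink to two annual ones, which is the reduction in volume the technique aims for when the analysis only needs yearly totals.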
Major Tasks in Data Preprocessing
• Data reduction
– Reduces the volume of data, which makes analysis easier yet produces the same or almost the same result
– Some of the techniques in data reduction are:
  – Numerosity reduction: data are replaced or estimated by an alternative, smaller form of data representation
  – Data compression: the compressed form of data can be lossless or lossy
    – When there is no loss of information during compression, it is called lossless compression
    – Lossy compression, in contrast, discards some information, ideally only what is unnecessary
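The difference between lossless and lossy compression can be seen in a small Python sketch. It uses the standard-library `zlib` module for the lossless case and simple rounding as a stand-in for a lossy scheme; the sensor `readings` are hypothetical, and rounding is only one of many possible lossy reductions.

```python
import zlib

# Hypothetical sensor readings, repeated to give the compressor some redundancy.
readings = [21.4871, 21.4902, 21.5013, 21.4988] * 50
raw = ",".join(f"{r:.4f}" for r in readings).encode("utf-8")

# Lossless: decompressing returns exactly the original bytes.
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

# Lossy (illustrative): rounding to one decimal place permanently discards detail.
rounded = [round(r, 1) for r in readings]

print(len(raw), len(compressed), restored == raw)
print(rounded[:4])
```

After rounding, the four distinct readings become indistinguishable, so the original values cannot be recovered; whether that lost detail was "unnecessary" depends on the analysis that follows.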
Major Tasks in Data Preprocessing
• Data Transformation
– A change made in the format or structure of the data
– Can be simple or complex, based on the requirements
– Some methods in data transformation:
  – Smoothing: removing noise from the dataset
    – How is noise removed? Using techniques such as binning, regression, clustering, …
  – Attribute construction: new attributes are constructed from the existing set of attributes in order to build a new data set that eases data mining
    – E.g., a data set of measurements of different plots may have the height and width of each plot. So it is possible to construct a new attribute ‘area’ from the attributes ‘height’ and ‘width’
    – Also helps in understanding relations among the attributes
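The plot example above can be written out as a short Python sketch of attribute construction. The records, the `plot_id` field, and the measurements are hypothetical; only the idea of deriving ‘area’ from ‘height’ and ‘width’ comes from the slide.

```python
# Hypothetical plot measurements.
plots = [
    {"plot_id": "P1", "height": 20.0, "width": 15.0},
    {"plot_id": "P2", "height": 12.5, "width": 8.0},
]


def add_area(records):
    """Return new records with an 'area' attribute built from height and width."""
    return [{**r, "area": r["height"] * r["width"]} for r in records]


plots_with_area = add_area(plots)
for record in plots_with_area:
    print(record)
```

The function returns new records instead of modifying the input, so the original data set stays available for comparison.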
Major Tasks in Data Preprocessing
• Data Transformation
– Some methods in data transformation:
  – Aggregation: data is stored and presented in the form of a summary. The data set, which comes from multiple sources, is integrated into a data analysis description.
  – Discretization: continuous data is split into intervals
    – Replaces values of numeric data with interval labels
    – E.g., values for the attribute ‘age’ can be replaced by interval labels such as (0-10, 11-20, …) or (kid, youth, adult, senior)
  – Normalization: a method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0
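Normalization into the range -1.0 to 1.0 and the replacement of ages by interval labels can both be sketched briefly in Python. Min-max scaling is assumed here as the normalization method, since the slide does not name one; the `incomes` and `ages` values and the function names are hypothetical, and integer ages are assumed.

```python
def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max].

    Assumes the values are not all identical (otherwise the span is zero).
    """
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]


def age_interval_label(age):
    """Map an integer age to a label such as '0-10', '11-20', '21-30', ..."""
    if age <= 10:
        return "0-10"
    low = ((age - 1) // 10) * 10 + 1
    return f"{low}-{low + 9}"


incomes = [20, 40, 60, 100]   # hypothetical values
ages = [0, 10, 11, 20, 21, 47]  # hypothetical values

normalized = min_max_normalize(incomes)
labels = [age_interval_label(a) for a in ages]
print(normalized)
print(labels)
```

Min-max scaling preserves the relative spacing of the values, whereas the interval labels deliberately discard it in exchange for a small set of categories.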
Discretization
• Three types of attributes:
– Nominal: values from an unordered set
  – Used for labelling or naming variables, without any quantitative value
  – E.g., country, gender, color, …
– Ordinal: values from an ordered set
  – E.g., first, second, …; good, neutral, bad, …
– Continuous: real numbers; can be interval or ratio variables
  – E.g., temperature in degrees Celsius/Fahrenheit, height, mass, distance, …
• Discretization: divide the range of a continuous attribute into intervals
– Why?
  – Some classification algorithms only accept categorical attributes.
  – Reduce data size by discretization
  – Prepare for further analysis
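One simple way to divide the range of a continuous attribute into intervals is equal-width discretization, sketched below. The slide does not say which discretization method is intended, so equal-width is an assumption; the `temperatures` list and the function name are likewise hypothetical.

```python
def equal_width_discretize(values, n_intervals):
    """Return, for each value, the index of the equal-width interval it falls in.

    Assumes max(values) > min(values).
    """
    low, high = min(values), max(values)
    width = (high - low) / n_intervals
    indices = []
    for v in values:
        idx = int((v - low) / width)
        # The maximum value would land one past the end, so clamp it
        # into the last interval.
        indices.append(min(idx, n_intervals - 1))
    return indices


temperatures = [12.0, 15.5, 18.0, 21.0, 24.0, 30.0]  # hypothetical readings
interval_ids = equal_width_discretize(temperatures, 3)
print(interval_ids)
```

The resulting interval indices form an ordinal attribute that can be passed to algorithms accepting only categorical input.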
Discretization
• Used to transform attributes that are in a continuous format
Thank You
