CT075!3!2-DTM-Topic 5-Data Preprocessing PART 1

The document discusses data preprocessing tasks including data cleaning, integration, and transformation. It describes the need for data preprocessing to handle issues like missing data, noisy data, outliers, and inconsistencies. Specific techniques discussed include data cleaning tasks like filling in missing values, identifying outliers, and correcting inconsistencies. It also covers handling missing data through methods like ignoring values, imputing means or modes, and data smoothing techniques like binning and clustering to handle noisy data.

Uploaded by

kishanselvarajah80

Data Management

CT075-3-2

Data Preprocessing (PART 1)

Topic & Structure of Lesson

• Need for data preparation

• Multidimensional view of data quality
• Major tasks in data preprocessing
– Data cleaning

Database Architecture
Data Preprocessing

• Why preprocess the data?

• Data cleaning
• Data integration
• Data transformation

Why Data Preprocessing?

Dedicated Tool for Data Cleaning

Source: http://sampleclean.org

Multi-Dimensional Measure of Data Quality

Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization
• Duplication

Data Preprocessing

• Why preprocess the data?

• Data cleaning
• Data integration
• Data transformation
• Summary

Data Cleaning

• Data cleaning tasks

– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data

Sample Dataset

How many errors can you identify in this sample dataset?

Missing Data

• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data that was inconsistent with other recorded data and was therefore deleted
– data not entered due to misunderstanding
– certain data not being considered important at the time of entry
– failure to record the history of, or changes to, the data

How to Handle Missing Data?

1. Ignore the tuple (instance): usually done when the class label is missing
2. Fill in the missing value manually: tedious and often infeasible
3. Use a global constant to fill in the missing value: e.g., “unknown” (but this may be mistaken for a new class)
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter
6. Use the most probable value to fill in the missing value
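
Methods 4 and 5 can be sketched in a few lines of Python. This is only an illustration: the records, the `segment` class label, and the function names are hypothetical, and `None` is assumed to mark a missing income value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records; None marks a missing income value.
records = [
    {"customer": "a", "segment": "gold", "income": 80.0},
    {"customer": "b", "segment": "gold", "income": None},
    {"customer": "c", "segment": "basic", "income": 30.0},
    {"customer": "d", "segment": "basic", "income": 40.0},
    {"customer": "e", "segment": "basic", "income": None},
]

def impute_global_mean(rows, attr):
    """Method 4: fill missing values with the mean of all observed values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def impute_class_mean(rows, attr, label):
    """Method 5: fill missing values with the mean of the same class.

    Assumes every class has at least one observed value.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    fills = {cls: mean(vals) for cls, vals in observed.items()}
    return [
        {**r, attr: fills[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]

global_filled = impute_global_mean(records, "income")
class_filled = impute_class_mean(records, "income", "segment")
print([r["income"] for r in global_filled])
print([r["income"] for r in class_filled])
```

With the class-based fill, the missing "gold" income is taken from other "gold" customers rather than from the overall average, which is why the slide calls method 5 smarter.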

Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitations
– inconsistency in naming conventions
• Other data problems that require data cleaning
– duplicate records
– incomplete data
– inconsistent data

How to Handle Noisy Data?
1. Binning method:
– first sort the data and partition it into (equi-depth) bins
– then smooth by bin means, smooth by bin boundaries, etc.
2. Clustering
– detect and remove outliers
3. Regression
– smooth by fitting the data to regression functions
4. Combined computer and human inspection
– detect suspicious values automatically and have a human check them

1. Binning Method


Binning – Data Smoothing
• Why do we need data smoothing?


Binning Method

• Equal-depth (frequency) partitioning:
– Divides the range into N intervals, each containing approximately the same number of samples
– Good data scaling
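
A minimal sketch of equal-depth partitioning, assuming that any leftover values go to the first bins when the number of samples is not divisible by N. The function name `equal_depth_bins` is illustrative; the price list is the one used in this deck's binning example.

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of near-equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the
        # number of values is not divisible by n_bins (an assumption).
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(bins)
```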

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
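
The two smoothing rules can be checked with a short Python sketch. The function names are illustrative, and sending ties to the lower boundary is an assumption (the slide data contains no ties). The means are left unrounded here, so bins 2 and 3 give 22.75 and 29.25 where the slide shows 23 and 29.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (unrounded) bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption, not from the slide).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```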

Example: “3 Mean Smoothing”


Example: Mean Smoothing - Centering


Example: Median Smoothing


2. Clustering


Cluster Analysis

What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms

General Applications of Clustering

• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns

What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
– high intra-class similarity (within a class)
– low inter-class similarity (between 2 classes)
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
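
One rough way to check these two properties is to treat distance as the inverse of similarity: objects in the same cluster should be close together, and objects in different clusters far apart. The sketch below assumes Manhattan distance and uses the two clusters that the movie example in this deck ends with; the function names are illustrative.

```python
from itertools import combinations, product

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def mean_within(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(manhattan(p, q) for p, q in pairs) / len(pairs)

def mean_between(cluster_a, cluster_b):
    """Average distance over all pairs drawn from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(manhattan(p, q) for p, q in pairs) / len(pairs)

cluster_1 = [(1.0, 1.0), (1.5, 2.0)]                      # M1, M2
cluster_2 = [(3.0, 4.0), (5.0, 7.0), (3.5, 5.0),
             (4.5, 5.0), (3.5, 4.5)]                      # M3 to M7

within_1 = mean_within(cluster_1)
within_2 = mean_within(cluster_2)
between = mean_between(cluster_1, cluster_2)
print(within_1, within_2, between)
```

Small within-cluster averages compared with the between-cluster average indicate high intra-class and low inter-class similarity.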

Typical Requirements of Clustering in Data Mining
• Scalability: must work well on large data sets, not only on small ones
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Ability to handle high dimensionality
• Interpretability and usability

Partitioning Algorithms: Basic Concept

• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
– k-means: Each cluster is represented by the center of the cluster.

The K-Means Clustering Method
The k-means algorithm is implemented in 5 steps:
• Step 1: Ask the user how many clusters k the data set should be partitioned into.
• Step 2: Randomly select k records to be the initial cluster center locations.
• Step 3: For each record, find the nearest cluster center. Thus, in a sense, each cluster center “owns” a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, . . . , Ck.
• Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
• Step 5: Repeat steps 3 and 4 until convergence or termination.
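
The five steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the `max_iter` and `seed` parameters, and the handling of empty clusters are assumptions. The distance defaults to the sum of absolute differences, as in this deck's worked example; k-means is more commonly run with Euclidean distance.

```python
import random

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def k_means(points, k, distance=manhattan, max_iter=100, seed=0):
    """Return (centers, assignment) for a list of numeric tuples.

    Step 1 is the caller's choice of k.
    """
    rng = random.Random(seed)
    # Step 2: pick k records at random as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each record to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: distance(p, centers[c]))
            for p in points
        ]
        # Step 5: stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each center to the centroid (mean) of its records.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assignment = k_means(points, 2)
print(centers, assignment)
```

Because the initial centers are random, different seeds can lead to different cluster numbering, and on less well-separated data to different final clusters.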

The K-Means Clustering Method
• Example
[Figure: four scatter-plot panels, each with both axes running from 0 to 10]

Manhattan distance

Manhattan distance: the sum of the absolute differences between the coordinates of two points; used here to find the cluster center nearest to each record.
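
Written as a formula for two points p and q with n attributes, followed by a worked case using the values (1.5, 2) and (1, 1) from this deck's movie example:

```latex
d(p, q) = \sum_{i=1}^{n} \lvert p_i - q_i \rvert
\qquad\text{e.g.}\qquad
d\bigl((1.5,\, 2),\ (1,\, 1)\bigr) = \lvert 1.5 - 1 \rvert + \lvert 2 - 1 \rvert = 1.5
```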

Data Mining: Concepts and Techniques

Example
k-means algorithm: consider the following dataset, consisting of ratings on two variables (A and B) for each of seven movies.

Movie  A    B
M1     1.0  1.0
M2     1.5  2.0
M3     3.0  4.0
M4     5.0  7.0
M5     3.5  5.0
M6     4.5  5.0
M7     3.5  4.5

Example

Steps 1 and 2: Let's choose two seeds at random

Movie  A    B
M1     1.0  1.0
M4     5.0  7.0

Example

Steps 3 & 4: Compute the distance from each movie to each cluster center over the two attributes, using the sum of absolute differences (Manhattan distance) for simplicity (K-means method)

Example

DISTANCE FROM CLUSTERS

Initial cluster centers: C1 = (1, 1), C2 = (5, 7)

Movie  A    B    Distance to C1  Distance to C2  Allocation to nearest cluster
M1     1    1    0               10              C1
M2     1.5  2    1.5             8.5             C1
M3     3    4    5               5               C1, C2 (tie)
M4     5    7    10              0               C2
M5     3.5  5    6.5             3.5             C2
M6     4.5  5    7.5             2.5             C2
M7     3.5  4.5  6               4               C2
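
The distance table can be reproduced with a few lines of Python. The dictionary layout and variable names are illustrative; a movie that is equally close to both centers is reported with both labels, as the slide does for M3.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}
centers = {"C1": (1.0, 1.0), "C2": (5.0, 7.0)}  # the two seeds, M1 and M4

table = {}
for name, point in movies.items():
    dists = {c: manhattan(point, center) for c, center in centers.items()}
    best = min(dists.values())
    # Keep every center at the minimum distance so that ties stay visible.
    nearest = [c for c, d in dists.items() if d == best]
    table[name] = (dists["C1"], dists["C2"], nearest)
    print(name, dists["C1"], dists["C2"], ", ".join(nearest))
```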

Example

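STEP 5

Updated cluster centers (C1, C2) and the original seeds:

       A     B
C1     1.83  2.33
C2     3.9   5.1
SEED1  1     1
SEED2  5     7

C1 is the mean of M1, M2, and M3; C2 is the mean of M3, M4, M5, M6, and M7 (the tied M3 is counted in both clusters).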
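
The arithmetic behind the updated centers can be checked as below. The slide's values are reproduced when M3, which was equally close to both seeds, is counted in both clusters; that membership is inferred from the numbers rather than stated on the slide. The `centroid` helper is illustrative.

```python
def centroid(points):
    """Mean of each coordinate across the points, rounded to 2 decimals."""
    return tuple(round(sum(xs) / len(points), 2) for xs in zip(*points))

movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}

# M3 appears in both lists because it tied between the two seeds.
c1 = centroid([movies[m] for m in ("M1", "M2", "M3")])
c2 = centroid([movies[m] for m in ("M3", "M4", "M5", "M6", "M7")])
print(c1, c2)
```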

Example
DISTANCE FROM CLUSTERS

Updated cluster centers: C1 = (1.83, 2.33), C2 = (3.9, 5.1)

Movie  A    B    Distance to C1  Distance to C2  Allocation to nearest cluster
M1     1    1    2.16            7               C1
M2     1.5  2    0.66            5.5             C1
M3     3    4    2.84            2               C2
M4     5    7    7.84            3               C2
M5     3.5  5    4.34            0.5             C2
M6     4.5  5    5.34            0.7             C2
M7     3.5  4.5  3.84            1               C2

Cluster 1 -> M1, M2
Cluster 2 -> M3, M4, M5, M6, M7

3. Regression


Regression

[Figure: plot of the dependent variable (y) against the independent variable (x)]

Regression is the attempt to explain the variation in a dependent variable using the variation in independent variables.
Regression thus describes how the dependent variable changes with the independent variable(s); on its own it does not establish causation.
If the independent variable(s) sufficiently explain the variation in the dependent variable, the model can be used for prediction.
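
As a sketch of how regression can smooth noisy data, the example below fits a straight line by ordinary least squares and replaces each observed y with the fitted value. The observations are hypothetical values scattered around y = x + 1, and the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical noisy observations scattered around y = x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.8, 3.1, 3.9, 5.0]

slope, intercept = fit_line(xs, ys)
# Smoothing: replace each noisy y with the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
print([round(y, 2) for y in smoothed])
```

The fitted line can also be used for prediction by evaluating `slope * x + intercept` at a new x.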

Regression
[Figure: the line y = x + 1; a value X1 on the x-axis maps to the fitted value Y1’ on the line]

Summary

• Data preparation is a big issue for mining
• Data preparation includes
– Data cleaning
– Data integration
– Data transformation
• A lot of methods have been developed, but this is still an active area of research; there is no perfect method.

Question & Answer Session

Q&A


Next Topic

Data Integration and Transformation
