CT075-3-2-DTM-Topic 5-Data Preprocessing PART 1
CT075-3-2
Database Architecture
Data Preprocessing
Why Data Preprocessing?
Dedicated Tool for Data Cleaning
Major Tasks in Data Preprocessing
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Integration of multiple databases, data cubes, or files
• Normalization
• Duplication
Data Preprocessing
Data Cleaning
Sample Dataset
How to Handle Missing Data?
Noisy Data
1. Binning Method
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
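The binning example above can be reproduced with a short sketch. The function names are illustrative, not from the slides, and the tie rule in `smooth_by_boundaries` (a value equally far from both boundaries goes to the lower one) is an assumption, since the slide data contains no such tie.

```python
def equal_depth_bins(sorted_values, depth):
    """Split an already sorted list into consecutive bins of `depth` values."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: on a tie the lower boundary wins.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 4)
means = smooth_by_means(bins)
boundaries = smooth_by_boundaries(bins)

print(bins)
print(means)       # exact means are 9.0, 22.75 and 29.25; the slide shows them rounded
print(boundaries)
```

The exact bin means are 9, 22.75 and 29.25, which round to the 9, 23 and 29 shown on the slide.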
Example: “3 Mean Smoothing”
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
– Create thematic maps in GIS by clustering feature spaces
– Detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
What Is Good Clustering?
Typical Requirements of Clustering in Data Mining
• Scalability: many algorithms work well only on small data sets; a good method should also scale to large ones
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Ability to handle high dimensionality
• Interpretability and usability
Partitioning Algorithms: Basic Concept
The K-Means Clustering Method
The k-means algorithm is implemented in 5 steps:
• Step 1: Ask the user how many clusters k the data set should be partitioned into.
• Step 2: Randomly assign k records to be the initial cluster center locations.
• Step 3: For each record, find the nearest cluster center. In a sense, each cluster center “owns” a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, ..., Ck.
• Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
• Step 5: Repeat steps 3 and 4 until convergence or termination.
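The five steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation. The function name `k_means`, the `initial_centers`, `max_iter` and `seed` parameters, and the use of Euclidean distance are all assumptions made for the example. The sample points are hypothetical.

```python
import math
import random


def k_means(records, k, initial_centers=None, max_iter=100, seed=None):
    """Plain k-means on tuples of numbers; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1 is the caller's choice of k.
    # Step 2: pick k records as the initial cluster centers.
    centers = list(initial_centers) if initial_centers else rng.sample(records, k)
    labels = None
    for _ in range(max_iter):  # termination: give up after max_iter rounds
        # Step 3: assign each record to its nearest center (Euclidean here).
        new_labels = [
            min(range(k), key=lambda j: math.dist(r, centers[j]))
            for r in records
        ]
        # Step 5: convergence means no record changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each center to the centroid (mean) of its records.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = k_means(points, k=2, initial_centers=[(1, 1), (8, 8)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Passing `initial_centers` explicitly makes the run repeatable. Leaving it out falls back to the random choice described in Step 2.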
The K-Means Clustering Method
• Example
[Figure: four scatter-plot panels, each with x and y axes running from 0 to 10]
Manhattan distance
Movie  A    B
M1     1.0  1.0
M2     1.5  2.0
M3     3.0  4.0
M4     5.0  7.0
M5     3.5  5.0
M6     4.5  5.0
M7     3.5  4.5
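The Manhattan distance between two records is the sum of the absolute differences of their attribute values, here |A1 - A2| + |B1 - B2|. A small sketch using the movie table follows. The names `manhattan` and `movies` are chosen for illustration.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Attribute values (A, B) taken from the movie table.
movies = {
    "M1": (1.0, 1.0),
    "M2": (1.5, 2.0),
    "M3": (3.0, 4.0),
    "M4": (5.0, 7.0),
    "M5": (3.5, 5.0),
    "M6": (4.5, 5.0),
    "M7": (3.5, 4.5),
}

print(manhattan(movies["M1"], movies["M4"]))  # |1 - 5| + |1 - 7| = 10.0
```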
Example
Initial cluster centers (seeds):
M1  1.0  1.0
M4  5.0  7.0
Example
Cluster centers:
     A    B
C1   1    1
C2   5    7
Distance to each center and allocation to nearest cluster:
     A    B    C1   C2   NEAREST CLUSTER
M1   1    1    0    10   C1
M3   3    4    5    5    C1, C2
M4   5    7    10   0    C2
M7   3.5  4.5  6    4    C2
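The allocation step can be checked for all seven movies with the sketch below. It uses Manhattan distance and the seeds M1 and M4 as on the slides. The helper `nearest` is a hypothetical name, and it returns every center at minimal distance so that a tie such as M3 is visible instead of being resolved silently.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}
seeds = {"C1": movies["M1"], "C2": movies["M4"]}


def nearest(point, centers):
    """Return (distances, owners); owners lists every center at minimal distance."""
    dists = {name: manhattan(point, c) for name, c in centers.items()}
    best = min(dists.values())
    return dists, [name for name, d in dists.items() if d == best]


allocation = {}
for name, point in movies.items():
    dists, owners = nearest(point, seeds)
    allocation[name] = owners
    print(name, dists, owners)
```

M3 is 5 away from both seeds, so `nearest` reports both C1 and C2 for it. How such a tie is broken is a design choice that the algorithm itself does not fix.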
Example
STEP 5
        A     B
C1      1.83  2.33
C2      3.9   5.1
SEED1   1     1
SEED2   5     7
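The updated centers are the attribute-wise means of the records in each cluster. The sketch below reproduces the values shown, under one assumption that is inferred from the numbers rather than stated on the slides: the tied record M3 is counted in both clusters, giving C1 = {M1, M2, M3} and C2 = {M3, M4, M5, M6, M7}. The names `centroid` and `clusters` are illustrative.

```python
def centroid(points):
    """Attribute-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}

# Assumption: M3 (equidistant from both seeds) is included in both clusters.
clusters = {
    "C1": ["M1", "M2", "M3"],
    "C2": ["M3", "M4", "M5", "M6", "M7"],
}

new_centers = {
    name: tuple(round(v, 2) for v in centroid([movies[m] for m in members]))
    for name, members in clusters.items()
}
print(new_centers)  # {'C1': (1.83, 2.33), 'C2': (3.9, 5.1)}
```

Placing M3 in only one of the two clusters would give different centroids, so the tie-breaking rule affects the result of this step.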
Example
DISTANCE FROM CLUSTERS
     A    B    C1    C2   NEAREST CLUSTER
M1   1    1    2.16  7    C1
M2   1.5  2    0.66  5.5  C1
M3   3    4    2.84  2    C2
M4   5    7    7.84  3    C2
M5   3.5  5    4.34  0.5  C2
M6   4.5  5    5.34  0.7  C2
M7   3.5  4.5  3.84  1    C2
Cluster 1 -> M1, M2
Cluster 2 -> M3, M4, M5, M6, M7
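The re-allocation against the updated centers can be recomputed with a few lines. This is a sketch that takes the rounded centroids (1.83, 2.33) and (3.9, 5.1) as given and rounds each Manhattan distance to two decimals. The variable names are illustrative.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}
centers = {"C1": (1.83, 2.33), "C2": (3.9, 5.1)}  # rounded centroids

table = {}
clusters = {name: [] for name in centers}
for name, point in movies.items():
    d1 = round(manhattan(point, centers["C1"]), 2)
    d2 = round(manhattan(point, centers["C2"]), 2)
    owner = "C1" if d1 <= d2 else "C2"  # a tie would go to C1 in this sketch
    table[name] = (d1, d2, owner)
    clusters[owner].append(name)
    print(name, d1, d2, owner)

print(clusters)  # {'C1': ['M1', 'M2'], 'C2': ['M3', 'M4', 'M5', 'M6', 'M7']}
```

With these centers no record is tied any more, and M3 now belongs to cluster 2 only.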
3. Regression
Regression
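[Figure: regression line y = x + 1 on x and y axes; at x = X1, the observed value Y1 is smoothed to the fitted value Y1']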
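Smoothing by regression fits a function to the data and replaces each observed value with the value on the fitted line. A minimal sketch using an ordinary least-squares straight line follows. The function names and the sample points are hypothetical and do not come from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def smooth_by_regression(xs, ys):
    """Replace each observed y with the fitted value at the same x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations scattered around the line y = x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.9, 3.1, 3.8, 5.0]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
print(slope, intercept)
print(smoothed)
```

Each smoothed value plays the role of Y1' in the figure: the point on the line directly above or below the observed Y1.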
Summary
Question & Answer Session
Q&A