
CT075-3-2-DTM-Topic 5-Data Preprocessing PART 1

The document discusses data preprocessing tasks including data cleaning, integration, and transformation. It describes the need for data preprocessing to handle issues like missing data, noisy data, outliers, and inconsistencies. Specific techniques discussed include data cleaning tasks like filling in missing values, identifying outliers, and correcting inconsistencies. It also covers handling missing data through methods like ignoring values, imputing means or modes, and data smoothing techniques like binning and clustering to handle noisy data.

Data Management

CT075-3-2

Data Preprocessing (PART 1)


Topic & Structure of Lesson

• Need for data preparation


• Multidimensional view of data quality
• Major tasks in data preprocessing
– Data cleaning

Database Architecture
Data Preprocessing

• Why preprocess the data?


• Data cleaning
• Data integration
• Data transformation

Why Data Preprocessing?

Dedicated Tool for Data Cleaning

Source: http://sampleclean.org


Multi-Dimensional Measure of
Data Quality

Major Tasks in Data Preprocessing
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Normalization
• Duplication

Data Preprocessing

• Why preprocess the data?


• Data cleaning
• Data integration
• Data transformation
• Summary

Data Cleaning

• Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data

Sample Dataset

How many errors can you identify in this sample dataset?
Missing Data

• Data is not always available


– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data

• Missing data may be due to
– equipment malfunction
– data that was inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data not being considered important at the time of entry
– history or changes of the data not being recorded

How to Handle Missing Data?

1. Ignore the tuple (instance): usually done when the class label is missing
2. Fill in the missing value manually: tedious and often infeasible
3. Use a global constant to fill in the missing value: e.g., “unknown” (which may end up acting as a new class)
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter
6. Use the most probable value to fill in the missing value
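Options 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `income` and `risk` values are made up for the example, missing values are assumed to be represented as `None`, and the function names are hypothetical.

```python
from statistics import mean

def fill_with_mean(values):
    """Option 4: replace None with the mean of all observed values."""
    observed = [v for v in values if v is not None]
    overall_mean = mean(observed)
    return [overall_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Option 5: replace None with the mean of the observed values in the same class.

    Assumes every class has at least one observed value.
    """
    observed_by_class = {}
    for value, label in zip(values, labels):
        if value is not None:
            observed_by_class.setdefault(label, []).append(value)
    class_mean = {label: mean(vs) for label, vs in observed_by_class.items()}
    return [class_mean[label] if value is None else value
            for value, label in zip(values, labels)]

# Hypothetical data: customer income (in thousands) with two missing entries.
income = [30, None, 50, 100, None, 120]
risk = ["low", "low", "low", "high", "high", "high"]

overall = fill_with_mean(income)               # both gaps filled with 75
per_class = fill_with_class_mean(income, risk) # gaps filled with 40 and 110
```

The class-conditional version fills the "low" gap with 40 and the "high" gap with 110, whereas the global mean puts 75 in both places, which is why the slide calls option 5 smarter.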

Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming conventions
• Other data problems that require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
1. Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
boundaries, etc.
2. Clustering
– detect and remove outliers
3. Regression
– smooth by fitting the data to regression functions
4. Combined computer and human inspection
– detect suspicious values and have a human check them

1. Binning Method



Binning – Data Smoothing
• Why do we need data smoothing?


Binning Method

• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately the same number of samples
– Good data scaling

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
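The price example above can be reproduced with a short Python sketch. The function names are hypothetical, the data is assumed to split evenly into the requested number of bins, and two choices are assumptions of this sketch rather than something the slide states: bin means are rounded to whole dollars, and a value exactly halfway between the boundaries goes to the lower one.

```python
def equi_depth_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to a whole number."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these inputs `by_means` and `by_boundaries` match the two smoothed listings on the slide.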

Example: “3 Mean Smoothing”



Example: Mean Smoothing - Centering


Example: Median Smoothing



2. Clustering



Cluster Analysis

What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no
predefined classes
• Typical applications
– As a stand-alone tool to get insight into data
distribution
– As a preprocessing step for other algorithms
General Applications of Clustering

• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature
spaces
– detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns

What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
– high intra-class similarity (within a cluster)
– low inter-class similarity (between two clusters)
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

Typical Requirements of Clustering in Data Mining
• Scalability: work well on large data sets, not only on small ones
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Ability to handle high dimensionality
• Interpretability and usability

Partitioning Algorithms: Basic Concept

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning condition.
– k-means: each cluster is represented by the center of the cluster.

The K-Means Clustering Method
The k-means algorithm is implemented in 5 steps:
• Step 1: Ask the user how many clusters k the data set should be partitioned into.
• Step 2: Randomly select k records to be the initial cluster center locations.
• Step 3: For each record, find the nearest cluster center. Thus, in a sense, each cluster center “owns” a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, ..., Ck.
• Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
• Step 5: Repeat steps 3 and 4 until convergence or termination.
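The five steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: records are tuples of numbers, distance is Euclidean (via `math.dist`, Python 3.8+), an empty cluster simply keeps its previous center, and the names `k_means`, `centroid`, `max_iter` and `seed` are choices made for this sketch.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(records, k, max_iter=100, seed=0):
    # Step 1: k is supplied by the caller.
    rng = random.Random(seed)
    # Step 2: pick k records at random as the initial cluster centers.
    centers = rng.sample(records, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign every record to its nearest center.
        clusters = [[] for _ in range(k)]
        for record in records:
            nearest = min(range(k), key=lambda i: math.dist(record, centers[i]))
            clusters[nearest].append(record)
        # Step 4: move each center to the centroid of its cluster.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 5: stop once the centers no longer move (convergence).
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups of points.
low = [(1.0, 1.0), (1.5, 2.0), (1.0, 2.0)]
high = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(low + high, k=2)
```

The `max_iter` cap corresponds to the "termination" alternative in step 5; without it, the loop relies on the centers becoming stable.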

The K-Means Clustering Method
• Example [figure: four scatter-plot panels, each with both axes running from 0 to 10]
Manhattan distance

Manhattan distance: the sum of the absolute differences between the coordinates of two points. It is used here to find the cluster center nearest to each record.
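As a small illustration, the Manhattan distance between two points can be written as a one-line sum in Python. The function name is a choice made for this sketch, and points are assumed to be equal-length sequences of numbers.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))

# Distance between (3, 4) and (5, 7): |3 - 5| + |4 - 7| = 2 + 3 = 5
d = manhattan_distance((3.0, 4.0), (5.0, 7.0))
```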

Data Mining: Concepts and Techniques


Example
k-means algorithm: consider the following dataset, which records ratings on two variables (A and B) for each of seven movies.

Movie A B
M1 1.0 1.0
M2 1.5 2.0
M3 3.0 4.0
M4 5.0 7.0
M5 3.5 5.0
M6 4.5 5.0
M7 3.5 4.5

Example

Steps 1 and 2: Let k = 2 and choose two seeds at random

Movie A B
M1 1.0 1.0
M4 5.0 7.0
Example

Steps 3 & 4 (k-means method): Compute the distance from each movie to each seed over the two attributes, using the sum of absolute differences (Manhattan distance) for simplicity.
Example

DISTANCE FROM CLUSTERS

Centers: C1 = (1, 1), C2 = (5, 7)

Movie   A     B     Distance from C1   Distance from C2   Allocation to nearest cluster
M1      1     1     0                  10                 C1
M2      1.5   2     1.5                8.5                C1
M3      3     4     5                  5                  C1, C2
M4      5     7     10                 0                  C2
M5      3.5   5     6.5                3.5                C2
M6      4.5   5     7.5                2.5                C2
M7      3.5   4.5   6                  4                  C2
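The distance table for the two seeds can be recomputed with a short Python sketch. The names `movies`, `seeds` and `allocate` are choices made for this illustration; a tie is reported by listing every center at the minimum distance, which is how M3 ends up allocated to both C1 and C2.

```python
movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}
seeds = {"C1": movies["M1"], "C2": movies["M4"]}

def manhattan(p, q):
    """Sum of absolute differences over the two attributes."""
    return sum(abs(a - b) for a, b in zip(p, q))

def allocate(point, centers):
    """Return the distance to each center and all centers at minimum distance."""
    distances = {name: manhattan(point, c) for name, c in centers.items()}
    best = min(distances.values())
    nearest = [name for name, d in distances.items() if d == best]
    return distances, nearest

table = {name: allocate(point, seeds) for name, point in movies.items()}
for name, (distances, nearest) in table.items():
    print(name, distances["C1"], distances["C2"], ", ".join(nearest))
```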

Example

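STEP 5

New cluster centers compared with the original seeds. C1 is the centroid of M1, M2 and M3; C2 is the centroid of M3, M4, M5, M6 and M7 (M3 was tied, so it is counted in both clusters).

        A      B
C1      1.83   2.33
C2      3.9    5.1
SEED1   1      1
SEED2   5      7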
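The centroid values on this slide are reproduced when M3, which was equally distant from both seeds, is counted in both clusters. The sketch below checks that arithmetic; counting a tied record twice is what the slide's numbers imply, not a standard k-means rule, and the variable names are hypothetical.

```python
movies = {
    "M1": (1.0, 1.0), "M2": (1.5, 2.0), "M3": (3.0, 4.0), "M4": (5.0, 7.0),
    "M5": (3.5, 5.0), "M6": (4.5, 5.0), "M7": (3.5, 4.5),
}

def centroid(names):
    """Coordinate-wise mean of the named movies, rounded to two decimals."""
    points = [movies[n] for n in names]
    return tuple(round(sum(axis) / len(points), 2) for axis in zip(*points))

# M3 appears in both membership lists because of the tie.
c1 = centroid(["M1", "M2", "M3"])
c2 = centroid(["M3", "M4", "M5", "M6", "M7"])
```

For comparison, leaving M3 out of the second cluster would give a C2 of roughly (4.13, 5.38) instead of (3.9, 5.1).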

Example
DISTANCE FROM CLUSTERS

Centers: C1 = (1.83, 2.33), C2 = (3.9, 5.1)

Movie   A     B     Distance from C1   Distance from C2   Allocation to the nearest cluster
M1      1     1     2.16               7                  C1
M2      1.5   2     0.66               5.5                C1
M3      3     4     2.84               2                  C2
M4      5     7     7.84               3                  C2
M5      3.5   5     4.34               0.5                C2
M6      4.5   5     5.34               0.7                C2
M7      3.5   4.5   3.84               1                  C2

Cluster 1 -> M1, M2
Cluster 2 -> M3, M4, M5, M6, M7
3. Regression



Regression

Dependent variable (y)

Independent variable (x)

Regression is the attempt to explain the variation in a dependent variable using the variation in independent variables.
Regression thus describes how the dependent variable changes with the independent variables; on its own it does not establish causation.
If the independent variable(s) sufficiently explain the variation in the dependent variable, the model can be used for prediction.

Regression
[Figure: regression line y = x + 1 on x and y axes, with the points X1, Y1 and Y1' marked]
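Smoothing by regression can be sketched as fitting a straight line by ordinary least squares and replacing each observed y with the fitted value. This is a minimal illustration for one independent variable; the function names and the noisy sample values are made up for the example, and the first data set is chosen to lie exactly on the line y = x + 1 from the figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace every observed y by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

xs = [1.0, 2.0, 3.0, 4.0]
exact = [2.0, 3.0, 4.0, 5.0]   # points lying exactly on y = x + 1
noisy = [2.2, 2.8, 4.1, 4.9]   # hypothetical noisy measurements

slope, intercept = fit_line(xs, exact)
smoothed = smooth_by_regression(xs, noisy)
```

For the exact data the fit recovers a slope and intercept of 1; for the noisy data the smoothed values all lie on the single fitted line.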

Summary

• Data preparation is a major issue in data mining
• Data preparation includes
– Data cleaning
– Data integration
– Data transformation
• Many methods have been developed, but this is still an active area of research and there is no perfect method.

Question & Answer Session

Q&A



Next Topic

Data Integration and Transformation