Unit 2 - Data Preprocessing

This document provides an overview of data preprocessing techniques for data mining. It discusses why data preprocessing is important, common reasons why real-world data is dirty or incomplete, and major tasks in data preprocessing including data cleaning, integration, transformation, reduction, and discretization. Specific techniques are described for handling missing data, noisy data, and data integration. Data transformation techniques like normalization, aggregation, and attribute construction are also covered. The goal of data preprocessing is to prepare raw data for data mining by cleaning noise and inconsistencies, filling in missing values, and reducing data volume so the results of data mining algorithms are more accurate and useful.

Uploaded by

vikasbhowate

St. Vincent Pallotti College of Engineering & Technology

Data Warehousing and Mining (BEIT701T)
7th Sem B.E. (IT)

Presented By
Samir Siddiqui
CR FINAL YEAR IT
Department of Information Technology

December 22, 2022 Data Mining: Concepts and Techniques 2
Why Data Preprocessing?
* Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“ ”
  - noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” Birthday=“03/07/1997”
    - e.g., Was rating “1,2,3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records
Why Is Data Dirty?
* Incomplete data may come from
  - “Not applicable” data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
* Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
* Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modify some linked data)
* Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
* Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse



Major Tasks in Data Preprocessing
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains reduced representation in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing



How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
* Fill in the missing value manually: tedious + infeasible?
* Fill it in automatically with
  - a global constant: e.g., “unknown”, a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree
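
The automatic fill-in strategies listed above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, the `MISSING` marker, and the toy `income` and `label` values are all made up for the example. The inference-based option (Bayesian formula or decision tree) is not covered here.

```python
from collections import defaultdict

MISSING = None  # hypothetical marker for a missing attribute value


def fill_with_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is MISSING else v for v in values]


def fill_with_mean(values):
    """Replace every missing value with the mean of the known values."""
    known = [v for v in values if v is not MISSING]
    mean = sum(known) / len(known)
    return [mean if v is MISSING else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace a missing value with the mean of its own class.

    Assumes every class has at least one known value.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not MISSING:
            by_class[c].append(v)
    class_mean = {c: sum(vs) / len(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is MISSING else v
            for v, c in zip(values, labels)]


# Toy data (illustrative only): two tuples have no income recorded.
income = [10.0, MISSING, 20.0, 40.0, MISSING]
label = ["A", "A", "A", "B", "B"]

print(fill_with_mean(income))
print(fill_with_class_mean(income, label))
```

With the class-wise mean, the missing value in class A is filled from the class A tuples only, which is why the slide calls this option smarter than the overall attribute mean.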

How to Handle Noisy Data?
* Binning
  - first sort data and partition into (equal-frequency) bins
  - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
* Regression
  - smooth by fitting the data into regression functions
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and check by human (e.g., deal with possible outliers)



Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
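
The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, the input is assumed to be already sorted with a length divisible by the number of bins, and ties in boundary smoothing are sent to the lower boundary, which is an assumption the slide does not state.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted data into bins holding the same number of values.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace each value with its bin mean, rounded to a whole dollar."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value with the closer of the bin's min and max.

    Ties go to the lower boundary (an assumption made for this sketch).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```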
Regression

[Figure: a data point (X1, Y1) is smoothed to Y1’ on the fitted line y = x + 1; axes x and y]
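
Smoothing by regression can be sketched as an ordinary least-squares line fit followed by replacing each observed y with its fitted value. The helper names and the toy points below are made up for illustration; the points were chosen so that the fitted line is y = x + 1, matching the line in the figure.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Toy noisy observations (illustrative only).
xs = [0, 1, 2, 3]
ys = [1.5, 1.5, 2.5, 4.5]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(smooth_by_regression(xs, ys))
```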



Cluster Analysis

Data Integration
* Data integration:
  - Combines data from multiple sources into a coherent store
* Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
* Entity identification problem:
  - Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
* Detecting and resolving data value conflicts
  - For the same real world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units

Data Transformation
* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction
  - New attributes constructed from the given ones



Data Transformation: Normalization
* Min-max normalization: to [new_min_A, new_max_A]
  \[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
  - Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
* Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
  - Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
* Normalization by decimal scaling:
  \[ v' = \frac{v}{10^{j}} \]
  where j is the smallest integer such that max(|v'|) < 1
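
The three normalization formulas can be checked numerically with a small Python sketch. The function names are hypothetical. The income figures come from the examples on this slide; the values passed to `decimal_scaling` are made up for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))  # slide example: 0.716
print(z_score(73_600, 54_000, 16_000))            # slide example: 1.225

scaled, j = decimal_scaling([-986, 917])          # illustrative values
print(scaled, j)
```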
Data Reduction Strategies
* Why data reduction?
  - A database/data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
* Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
* Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction, e.g., remove unimportant attributes
  - Data compression
  - Numerosity reduction, e.g., fit data into models
  - Discretization and concept hierarchy generation



Data Cube Aggregation
* The lowest level of a data cube (base cuboid)
  - The aggregated data for an individual entity of interest
  - E.g., a customer in a phone calling data warehouse
* Multiple levels of aggregation in data cubes
  - Further reduce the size of data to deal with
* Reference appropriate levels
  - Use the smallest representation which is enough to solve the task
* Queries regarding aggregated information should be answered using data cube, when possible
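
Rolling a base cuboid up to a coarser level of aggregation can be sketched as a group-and-sum over the dimensions that are kept. The `roll_up` helper, the dimension names, and the call-minute figures are all hypothetical, loosely following the phone calling example above.

```python
from collections import defaultdict


def roll_up(facts, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    totals = defaultdict(float)
    for dims, measure in facts:
        key = tuple(dims[name] for name in keep)
        totals[key] += measure
    return dict(totals)


# Base-level facts: call minutes per customer, year, and quarter (made up).
calls = [
    ({"customer": "c1", "year": 2021, "quarter": "Q1"}, 30.0),
    ({"customer": "c1", "year": 2021, "quarter": "Q2"}, 20.0),
    ({"customer": "c1", "year": 2022, "quarter": "Q1"}, 10.0),
    ({"customer": "c2", "year": 2021, "quarter": "Q1"}, 5.0),
    ({"customer": "c2", "year": 2021, "quarter": "Q3"}, 15.0),
]

per_customer_year = roll_up(calls, keep=("customer", "year"))
per_customer = roll_up(calls, keep=("customer",))

print(per_customer_year)
print(per_customer)
```

A query that only needs yearly totals can be answered from `per_customer_year`, the smallest representation that is enough for that task, without touching the quarterly rows.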
Attribute Subset Selection
* Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
* Heuristic methods (due to exponential # of choices):
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction

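
Step-wise forward selection can be sketched as a greedy loop: start from the empty set and repeatedly add the attribute that improves a quality score the most, stopping when nothing helps. The `forward_selection` function, the `min_gain` threshold, and the toy `score` below are assumptions made for illustration; in practice the score would come from something like classifier accuracy or a statistical significance test.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy step-wise forward selection.

    `score` maps a list of attributes to a number (higher is better).
    """
    selected = []
    remaining = list(attributes)
    best = score(selected)
    while remaining:
        # Evaluate adding each remaining attribute to the current set.
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break  # no attribute gives a real improvement
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected


# Hypothetical usefulness of each attribute; unlisted attributes add nothing.
WEIGHTS = {"A4": 0.5, "A1": 0.3, "A6": 0.2}


def score(subset):
    """Toy score: sum of the made-up weights of the chosen attributes."""
    return sum(WEIGHTS.get(a, 0.0) for a in subset)


chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score)
print(chosen)
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score least.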


Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with root test A4?, second-level tests A1? and A6?, and leaves Class 1, Class 2, Class 1, Class 2]

Reduced attribute set: {A1, A4, A6}



Dimensionality Reduction: Principal Component Analysis (PCA)
* Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
* Steps
  - Normalize input data: each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., principal components
  - Each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing “significance” or strength
  - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
* Works for numeric data only
* Used when the number of dimensions is large
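
The PCA steps can be illustrated with a dependency-free Python sketch that finds only the first (strongest) principal component. Two simplifications are assumed here and are not from the slides: the data is only mean-centered (the attributes are taken to be on comparable scales already), and the component is found by power iteration on the covariance matrix instead of a full eigendecomposition. All function names and the toy points are hypothetical.

```python
import math


def mean_center(rows):
    """Subtract each attribute's mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Unit vector of greatest variance, found by power iteration."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # zero covariance: no direction of variance to find
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Coordinate of each centered row along the given component."""
    return [sum(a * b for a, b in zip(r, component)) for r in centered]


# Toy 2-D points lying on a line (illustrative only).
points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
centered = mean_center(points)
pc = first_component(covariance(centered))
scores = project(centered, pc)

print(pc)
print(scores)
```

Because these toy points have all their variance along one direction, keeping only `scores` (one number per point instead of two) loses no information; on real data the dropped components carry the low-variance remainder.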
Principal Component Analysis

[Figure: data plotted on the original axes X1 and X2, with the principal component axes Y1 and Y2 overlaid]



Chapter 2: Data Preprocessing
* Why preprocess the data?
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary
Summary
* Data preparation or preprocessing is a big issue for both data warehousing and data mining
* Descriptive data summarization is needed for quality data preprocessing
* Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
* Many methods have been developed, but data preprocessing is still an active area of research
Question Bank
* Q1. What is the need for data preprocessing? Explain in brief. [6M][S-17], [6M][W-16], [5M][S-19]
* Q2. Summarize the data preprocessing steps in brief. [7M][W-17], [7M][S-18]
* Q3. What is data cleaning? Explain different methods of data cleaning. [7M][W-17], [6M][W-16]
* Q4. What is data transformation? Explain different methods of transformation. [8M][S-17]
* Q5. Write short notes on:
  - a. Missing value  b. Noisy data  c. Cluster  d. Outlier
* Q6. Write a short note on data cleaning. OR How can data cleaning be handled in preprocessing? [6M][S-18], [3M][S-16]
* Q7. What is data reduction? Explain different methods of data reduction. [7M][W-17], [7M][S-18], [4M][S-16], [7M][W-16], [4M][S-19]
* Q8. What is normalization? Explain various types of normalization techniques with examples. [7M][S-18]



Question Bank
* Q9. Explain data discretization and concept hierarchy generation. [6M][S-17], [7M][S-19]
* Q10. What are the measures of data dispersion? [4M][S-19]
* Q11. What is the need for multidimensional analysis? [5M][S-16]
* Q12. Write short notes on:
  - a. Binning  b. Regression  c. Clustering  d. Smoothing
  - e. Generalization  f. Aggregation
* Q13. Explain min-max normalization and z-score normalization. [7M][W-17], [4M][S-16], [6M][S-19]
* Q14. Explain the various issues to be considered in data integration. Also give the various forms of preprocessing. [6M][S-16]
* Q15. What are the challenges in data preprocessing?
