Data Preprocessing in Data Mining

This document provides an overview of data preprocessing techniques for data mining. It discusses why data preprocessing is important, common reasons why real-world data is dirty or incomplete, and major tasks in data preprocessing including data cleaning, integration, transformation, reduction, and discretization. Specific techniques are described for handling missing data, noisy data, and data integration. Data transformation techniques like normalization, aggregation, and attribute construction are also covered. The goal of data preprocessing is to prepare raw data for data mining by cleaning noise and inconsistencies, filling in missing values, and reducing data volume so the results of data mining algorithms are more accurate and useful.

Uploaded by

vikasbhowate

St. Vincent Pallotti College of Engineering & Technology

Data Warehousing and Mining (BEIT701T)
7th Sem B.E. (IT)

Presented By
Samir Siddiqui
CR FINAL YEAR IT
Department of Information Technology

December 22, 2022 Data Mining: Concepts and Techniques 2
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
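
The three kinds of dirtiness listed above can be illustrated with a small check. This is a minimal sketch, not a general-purpose validator: the record layout, the field names (`occupation`, `salary`, `age`, `birth_year`), the reference year, and the one-year tolerance are all assumptions made for illustration.

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "occupation": "engineer", "salary": 52000, "age": 42, "birth_year": 1980},
    {"id": 2, "occupation": "", "salary": 48000, "age": 30, "birth_year": 1992},
    {"id": 3, "occupation": "clerk", "salary": -10, "age": 35, "birth_year": 1987},
    {"id": 4, "occupation": "nurse", "salary": 61000, "age": 42, "birth_year": 1997},
]

REFERENCE_YEAR = 2022  # assumed year in which the ages were recorded


def find_problems(record, reference_year=REFERENCE_YEAR):
    """Return a list of labels describing what is wrong with one record."""
    problems = []
    # Incomplete: an attribute value is missing or blank.
    if not str(record.get("occupation", "")).strip():
        problems.append("incomplete")
    # Noisy: a value that cannot be correct, such as a negative salary.
    if record["salary"] < 0:
        problems.append("noisy")
    # Inconsistent: the stated age disagrees with the birth year.
    # A one-year tolerance allows for birthdays later in the year.
    if abs((reference_year - record["birth_year"]) - record["age"]) > 1:
        problems.append("inconsistent")
    return problems


report = {r["id"]: find_problems(r) for r in records}
print(report)
```

Real checks depend on the domain: which attributes may legitimately be blank, and which value ranges are plausible, has to come from knowledge of the data.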
Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was collected and when it is analyzed
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Why Is Data Preprocessing Important?

 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading statistics
 A data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains a representation that is reduced in volume but produces the same or similar analytical results
 Data discretization
 Part of data reduction, but of particular importance, especially for numerical data

Forms of Data Preprocessing

How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant: e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or decision tree
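
The first three automatic strategies above can be sketched in a few lines. This is an illustrative sketch only: the sample data, the `None` marker for a missing value, and the function names are assumptions, and the inference-based strategy (Bayesian formula or decision tree) is not shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (class label, income); None marks a missing value.
samples = [
    ("low", 20000), ("low", None), ("low", 30000),
    ("high", 80000), ("high", 100000), ("high", None),
]


def fill_with_constant(rows, constant="unknown"):
    """Replace every missing value with a global constant."""
    return [(c, constant if v is None else v) for c, v in rows]


def fill_with_mean(rows):
    """Replace every missing value with the mean of the observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]


def fill_with_class_mean(rows):
    """Replace a missing value with the mean of the observed values in its class.

    Assumes every class has at least one observed value.
    """
    by_class = defaultdict(list)
    for c, v in rows:
        if v is not None:
            by_class[c].append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]


print(fill_with_constant(samples))
print(fill_with_mean(samples))
print(fill_with_class_mean(samples))
```

In this toy data the overall mean (57,500) sits between the two classes, while the class means (25,000 and 90,000) stay close to the other members of each class, which is why the slide calls the class-conditional mean "smarter".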

How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data to regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and have a human check them (e.g., deal with possible outliers)

Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
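
The price example above can be reproduced with a short sketch. The function names are hypothetical, the bin means are rounded to whole dollars to match the slide, and ties in boundary smoothing go to the lower boundary, which is an arbitrary choice the slide does not specify.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size.

    Assumes len(values) is divisible by n_bins, as in the price example.
    """
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value by the mean of its bin, rounded to a whole number."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        # Ties go to the lower boundary (an arbitrary choice).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Note that the exact means of Bin 2 and Bin 3 are 22.75 and 29.25; the slide shows them rounded to 23 and 29.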
Regression

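[Figure: plot of the regression line y = x + 1 against x, with the point X1 on the x-axis and its fitted value Y1’ marked]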
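
Smoothing by regression can be sketched with an ordinary least-squares line fit. The observations below are invented so that the fitted line comes out as y = x + 1, the line named on this slide; the function name `fit_line` is also hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations scattered around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.5, 2.0, 4.5, 5.0, 6.0]

slope, intercept = fit_line(xs, ys)
# Each observed y is replaced by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
print(smoothed)
```

Each noisy value is pulled onto the line, so the observation at x = 2 (2.0) becomes 3.0 after smoothing.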

Cluster Analysis

Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources are different
 Possible reasons: different representations, different scales, e.g., metric vs. British units
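
A rough sketch of these integration steps follows: two sources are mapped onto one schema, a scale difference is converted away, and remaining value conflicts are reported. Everything here is assumed for illustration: the attribute names, the unified schema, the tolerance, and the height values, which are invented and not real measurements.

```python
# Two hypothetical sources describing the same customer.
# Source A uses "cust_id" and stores height in centimetres;
# source B uses "cust_no" and stores height in inches.
source_a = [{"cust_id": 1, "name": "Bill Clinton", "height_cm": 188.0}]
source_b = [{"cust_no": 1, "name": "William Clinton", "height_in": 74.0}]

CM_PER_INCH = 2.54


def from_source_a(row):
    """Map a source-A row onto the unified schema."""
    return {"customer_id": row["cust_id"], "name": row["name"],
            "height_cm": row["height_cm"]}


def from_source_b(row):
    """Map a source-B row onto the unified schema, converting inches to cm."""
    return {"customer_id": row["cust_no"], "name": row["name"],
            "height_cm": row["height_in"] * CM_PER_INCH}


def integrate(rows_a, rows_b, tolerance_cm=1.0):
    """Merge rows by customer_id and list the value conflicts that remain."""
    merged, conflicts = {}, []
    for row in map(from_source_a, rows_a):
        merged[row["customer_id"]] = row
    for row in map(from_source_b, rows_b):
        existing = merged.get(row["customer_id"])
        if existing is None:
            merged[row["customer_id"]] = row
            continue
        if existing["name"] != row["name"]:
            conflicts.append((row["customer_id"], "name",
                              existing["name"], row["name"]))
        if abs(existing["height_cm"] - row["height_cm"]) > tolerance_cm:
            conflicts.append((row["customer_id"], "height_cm",
                              existing["height_cm"], row["height_cm"]))
    return merged, conflicts


merged, conflicts = integrate(source_a, source_b)
print(conflicts)
```

After unit conversion the heights agree within tolerance, so only the name difference is reported. Deciding that the two names refer to the same person is the entity identification problem, which this sketch leaves to a human or a separate matching step.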

Data Transformation

 Smoothing: remove noise from data

 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones

Data Transformation: Normalization
 Min-max normalization: to [new_min_A, new_max_A]
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
 Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
 Z-score normalization (μ: mean, σ: standard deviation):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
 Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
 Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1
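
The three normalization methods on this slide can be checked numerically with a short sketch. The function names are hypothetical, the decimal-scaling search starts at j = 0 (an assumption, so values are never scaled up), and the values -986 and 917 used for decimal scaling are invented for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide every value by 10**j so that the largest magnitude is below 1.

    The search for j starts at 0, an assumption of this sketch.
    """
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73600, 12000, 98000), 3))
print(z_score(73600, 54000, 16000))
print(decimal_scaling([-986, 917]))
```

The first two calls use the income figures from the slide's examples; the third scales a range of -986 to 917 by 10^3.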
Data Reduction Strategies

 Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run on the complete data set
 Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
 Data reduction strategies
 Data cube aggregation
 Dimensionality reduction, e.g., remove unimportant attributes
 Data compression
 Numerosity reduction, e.g., fit data into models
 Discretization and concept hierarchy generation

Data Cube Aggregation

 The lowest level of a data cube (base cuboid)

 The aggregated data for an individual entity of interest
 E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation that is sufficient to solve the task
 Queries regarding aggregated information should be answered using the data cube, when possible
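
Rolling data up to a coarser level of aggregation can be sketched as a group-and-sum. The call records, customer names, and the `roll_up` helper below are all hypothetical; a real data cube would precompute such aggregates rather than scan the rows each time.

```python
from collections import defaultdict

# Hypothetical call records: (customer, year, quarter, minutes).
calls = [
    ("alice", 2021, "Q1", 120), ("alice", 2021, "Q2", 90),
    ("alice", 2022, "Q1", 60),
    ("bob", 2021, "Q1", 30), ("bob", 2022, "Q3", 45), ("bob", 2022, "Q4", 15),
]


def roll_up(rows, key):
    """Sum the minutes over every group of rows that share key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Aggregate away the quarter, then also the year.
per_customer_year = roll_up(calls, key=lambda r: (r[0], r[1]))
per_customer = roll_up(calls, key=lambda r: r[0])
print(per_customer_year)
print(per_customer)
```

Six quarterly rows shrink to four yearly totals and then to two per-customer totals; a query about yearly usage only needs the middle level.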
Attribute Subset Selection
 Feature selection (i.e., attribute subset selection):
 Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
 Reduces the number of attributes appearing in the discovered patterns, making them easier to understand
 Heuristic methods (due to exponential # of choices):
 Step-wise forward selection
 Step-wise backward elimination
 Combining forward selection and backward elimination
 Decision-tree induction
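
Step-wise forward selection, the first heuristic above, can be sketched as a greedy loop. The scoring function here is a toy stand-in with invented usefulness values and a fixed per-attribute penalty; in practice the score would come from something like a statistical significance test or the accuracy of a model trained on the candidate subset.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedy step-wise forward selection.

    Starts from the empty set and repeatedly adds the attribute that
    improves score(subset) the most, stopping when the best remaining
    attribute improves the score by no more than min_gain.
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score - best <= min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Toy scoring function (an assumption for illustration): each attribute has
# an invented usefulness, and every attribute used costs a fixed penalty.
USEFULNESS = {"A1": 0.30, "A2": 0.02, "A3": 0.01,
              "A4": 0.40, "A5": 0.03, "A6": 0.20}
PENALTY = 0.05


def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - PENALTY * len(subset)


reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
print(reduced)
```

The loop evaluates at most n + (n - 1) + ... + 1 subsets instead of all 2^n, which is the point of using a heuristic. Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.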

Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

                A4?
              /     \
          A1?         A6?
         /   \       /   \
   Class 1 Class 2 Class 1 Class 2

> Reduced attribute set: {A1, A4, A6}

Dimensionality Reduction: Principal Component Analysis (PCA)
 Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
 Steps
 Normalize input data: each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data vector is a linear combination of the k principal component vectors
 The principal components are sorted in order of decreasing “significance” or strength
 Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
 Works for numeric data only
 Used when the number of dimensions is large
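
The idea can be sketched for the simplest case, k = 1, using only the standard library. This sketch finds the first principal component by power iteration on the covariance matrix, which is a stand-in for the full eigendecomposition a numerical library would perform. The data are invented, and normalization is reduced to mean-centering because both attributes are already on the same scale.

```python
def mean_center(rows):
    """Subtract each column's mean so every attribute is centred at zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of mean-centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(cov, iterations=200):
    """Approximate the dominant unit eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0] * d  # starting vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical 2-D data lying close to the line y = x.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
centered = mean_center(data)
pc1 = first_principal_component(covariance_matrix(centered))
# Reduce each 2-D point to a single coordinate along the first component.
projected = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
print(pc1)
print(projected)
```

Because the points lie almost on a line, the single projected coordinate keeps nearly all of the variance, and the second (weak) component can be dropped.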
Principal Component Analysis

[Figure: principal component axes Y1 and Y2]

Chapter 2: Data Preprocessing

 Why preprocess the data?

 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Summary
 Data preparation or preprocessing is a big issue for both data warehousing and data mining
 Descriptive data summarization is needed for quality data preprocessing
 Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
 A lot of methods have been developed, but data preprocessing is still an active area of research
Question Bank
 Q1. What is the need for data preprocessing? Explain in brief. [6M][S-17], [6M][W-16], [5M][S-19]
 Q2. Summarize the data preprocessing steps in brief. [7M][W-17], [7M][S-18]
 Q3. What is data cleaning? Explain different methods of data cleaning. [7M][W-17], [6M][W-16]
 Q4. What is data transformation? Explain different methods of transformation. [8M][S-17]
 Q5. Write short notes on:
 a. Missing value b. Noisy data c. Cluster d. Outlier
 Q6. Write a short note on data cleaning. OR How can data cleaning be handled in preprocessing? [6M][S-18], [3M][S-16]
 Q7. What is data reduction? Explain different methods of data reduction. [7M][W-17], [7M][S-18], [4M][S-16], [7M][W-16], [4M][S-19]
 Q8. What is normalization? Explain various types of normalization techniques with examples. [7M][S-18]

Question Bank

 Q9. Explain data discretization and concept hierarchy generation. [6M][S-17], [7M][S-19]
 Q10. What are the measures of data dispersion? [4M][S-19]
 Q11. What is the need for multidimensional analysis? [5M][S-16]
 Q12. Write short notes on:
 a. Binning b. Regression c. Clustering d. Smoothing
 e. Generalization f. Aggregation
 Q13. Explain min-max normalization and z-score normalization. [7M][W-17], [4M][S-16], [6M][S-19]
 Q14. Explain the various issues to be considered in data integration. Also give the various forms of preprocessing. [6M][S-16]
 Q15. What are the challenges in data preprocessing?
