The document discusses various techniques for data pre-processing including data reduction, parametric and non-parametric methods, linear and log-linear regression models, histograms, clustering, sampling, discretization, and concept hierarchy generation. The goal of data pre-processing is to prepare raw data for further analysis by cleaning, transforming, reducing dimensionality, and handling missing values. Common techniques include feature selection, normalization, aggregation, and converting data into appropriate formats.

Lecture 4 – Data Pre-processing

Fall 2010

Dr. Tariq MAHMOOD


NUCES (FAST) – KHI
 Reduce data volume by choosing alternative,
smaller forms of data representation
 Parametric methods
 Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
 Example: Log-linear models—obtain the value at
a point in m-D space as the product of values on
appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering,
sampling.
December 8, 2021 Data Mining: Concepts and Techniques 2
 Linear regression: Data are modeled to fit a straight
line

 Often uses the least-squares method to fit the line


 Multiple regression: allows a response variable Y to
be modeled as a linear function of a multidimensional
feature vector
 Log-linear model: approximates discrete
multidimensional probability distributions



 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify
the line and are to be estimated by using the
data at hand
 Using the least-squares criterion on the known
value pairs (X1, Y1), (X2, Y2), …
 Multiple regression: Y = b0 + b1 X1 + b2 X2.
 Many nonlinear functions can be transformed
into the above
 Log-linear models:
 The multi-way table of joint probabilities is
approximated by a product of lower-order
tables
 Probability: p(a, b, c, d) ≈ α_ab · β_ac · χ_ad · δ_bcd
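To make the linear-regression case concrete, here is a minimal pure-Python sketch (illustrative, not from the slides; the function name is my own) of estimating the coefficients w and b of Y = wX + b with the closed-form least-squares formulas:

```python
def least_squares_fit(xs, ys):
    """Estimate w and b for y = w*x + b by minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    w = num / den
    b = mean_y - w * mean_x
    return w, b

# Data lying exactly on y = 2x + 1 recovers w = 2, b = 1
w, b = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

Once w and b are stored, the original (x, y) pairs can be discarded, which is exactly the parametric-reduction idea above.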
 Divide data into buckets and store average (sum) for each
bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth)
 V-optimal: choose the boundaries that minimize the histogram
variance (a weighted sum of the variance of the original values
each bucket represents)
 MaxDiff: for β buckets, set bucket boundaries between the pairs
of adjacent sorted values with the β–1 largest differences



 Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
 Can be very effective if data is clustered but not if data is
“smeared”
 Hierarchical clustering is possible, and cluster
representations can be stored in multi-dimensional
index tree structures
 There are many choices of clustering definitions and
clustering algorithms
 Cluster analysis will be studied in depth in Chapter 7
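As a small illustration of storing only a cluster representation (not from the slides; the function name is my own), a cluster of points can be replaced by its centroid and diameter:

```python
from math import dist  # Euclidean distance, Python 3.8+

def cluster_summary(cluster):
    """Replace a cluster of points (tuples of equal dimension)
    by its (centroid, diameter) representation."""
    n = len(cluster)
    d = len(cluster[0])
    centroid = tuple(sum(p[i] for p in cluster) / n for i in range(d))
    # Diameter: largest pairwise distance within the cluster (O(n^2) sketch)
    diameter = max(dist(p, q) for p in cluster for q in cluster)
    return centroid, diameter

# Four corners of a square collapse to one centroid plus a diameter
centroid, diameter = cluster_summary([(0, 0), (2, 0), (0, 2), (2, 2)])
```

The original points can then be discarded, keeping only these two summary values per cluster.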



 Sampling: obtaining a small sample s to
represent the whole data set N
 Allows a mining algorithm to run in complexity
that is potentially sub-linear in the size of the
data
 Choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods
 Stratified sampling:
 Approximate the percentage of each class (or
subpopulation of interest) in the overall
database
 Used in conjunction with skewed data.



Sampling: With or Without Replacement

[Figure: raw data reduced by SRSWOR (simple random sampling without
replacement) and SRSWR (simple random sampling with replacement)]
[Figure: raw data vs. the corresponding cluster/stratified sample]



 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary



 Three types of attributes:
 Nominal — values from an unordered set, e.g., color,
profession
 Ordinal — values from an ordered set, e.g., military or
academic rank
 Continuous — numeric values, e.g., integers or real numbers
 Discretization:
 Divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical
attributes
 Reduce data size by discretization
 Prepare for further analysis



 Discretization
 Reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals
 Interval labels can then be used to replace actual data values
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Concept hierarchy formation
 Recursively reduce the data by collecting and replacing low
level concepts (such as numeric values for age) by higher
level concepts (such as young, middle-aged, or senior)
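Replacing numeric values with interval labels (and with the higher-level concepts of a hierarchy) can be sketched as follows; the age boundaries and labels are hypothetical, chosen only to mirror the young/middle-aged/senior example above:

```python
import bisect

def discretize(value, boundaries, labels):
    """Map a numeric value to the label of its interval.
    boundaries[i] is the inclusive upper bound of interval i."""
    return labels[bisect.bisect_left(boundaries, value)]

# Hypothetical age hierarchy: (-inf, 39], (39, 59], (59, inf)
AGE_BOUNDS = [39, 59]
AGE_LABELS = ["young", "middle-aged", "senior"]
```

For example, `discretize(45, AGE_BOUNDS, AGE_LABELS)` yields `"middle-aged"`, so the raw age can be replaced by the concept label for analysis.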



 Typical methods: All the methods can be applied recursively
 Binning (covered above)
 Top-down split, unsupervised,
 Histogram analysis (covered above)
 Top-down split, unsupervised
 Clustering analysis (covered above)
 Either top-down split or bottom-up merge, unsupervised
 Entropy-based discretization: supervised, top-down split
 Interval merging by 2 Analysis: unsupervised, bottom-up
merge
 Segmentation by natural partitioning: top-down split,
unsupervised



 Given a set of samples S, if S is partitioned into two intervals
S1 and S2 using boundary T, the class information entropy after
partitioning is

    I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)

 Entropy is calculated based on the class distribution of the
samples in the set. Given m classes, the entropy of S1 is

    Entropy(S1) = − Σ_{i=1}^{m} p_i · log2(p_i)

where p_i is the probability of class i in S1


 The boundary that minimizes the entropy function over all
possible boundaries is selected as a binary discretization
 The process is recursively applied to partitions obtained until
some stopping criterion is met
 Such a boundary may reduce data size and improve
classification accuracy
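A minimal sketch of this boundary search (illustrative; function names are my own, and no stopping criterion or recursion is shown) that evaluates I(S, T) at every candidate boundary and keeps the minimum:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(samples):
    """samples: list of (value, class_label) pairs. Return (I(S, T), T)
    for the boundary T minimizing the weighted entropy after the split."""
    samples = sorted(samples)
    n = len(samples)
    best = None
    for i in range(1, n):
        if samples[i - 1][0] == samples[i][0]:
            continue  # candidate boundaries lie only between distinct values
        T = (samples[i - 1][0] + samples[i][0]) / 2
        left = [c for v, c in samples[:i]]
        right = [c for v, c in samples[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or info < best[0]:
            best = (info, T)
    return best
```

On a cleanly separable attribute the chosen boundary falls between the two classes and I(S, T) drops to zero, which is exactly the supervised signal this method exploits.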
 Merging-based (bottom-up) vs. splitting-based methods
 Merge: Find the best neighboring intervals and merge them
to form larger intervals recursively
 ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
 Initially, each distinct value of a numerical attr. A is
considered to be one interval
 χ² tests are performed for every pair of adjacent intervals
 Adjacent intervals with the lowest χ² values are merged
together, since a low χ² value for a pair indicates similar class
distributions
 This merge process proceeds recursively until a predefined
stopping criterion is met (such as significance level)
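The χ² statistic for one pair of adjacent intervals can be computed as in a standard 2×k contingency test (a sketch, not ChiMerge in full; the merge loop and significance threshold are omitted, and the function name is my own):

```python
def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals.
    counts_a, counts_b: dicts mapping class label -> count in that interval."""
    classes = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    total = n_a + n_b
    chi2 = 0.0
    for c in classes:
        col = counts_a.get(c, 0) + counts_b.get(c, 0)
        # expected count for each cell under independence: row_total * col_total / total
        for obs, n_row in ((counts_a.get(c, 0), n_a), (counts_b.get(c, 0), n_b)):
            exp = n_row * col / total
            if exp:
                chi2 += (obs - exp) ** 2 / exp
    return chi2
```

Two intervals with identical class distributions score χ² = 0 and are the first candidates to be merged; disjoint distributions score high and stay separate.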



 A simple 3-4-5 rule can be used to segment numeric
data into relatively uniform, “natural” intervals.
 If an interval covers 3, 6, 7 or 9 distinct values at
the most significant digit, partition the range into 3
equi-width intervals
 If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals
 If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals
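The rule above maps the count of distinct most-significant-digit values to 3, 4, or 5 equi-width intervals. A minimal sketch (my own function name; it uses plain equi-width splits throughout, whereas a full treatment may split 7 distinct values as 2-3-2):

```python
def three_4_5_split(low, high, msd_width):
    """Partition [low, high) per the 3-4-5 rule. msd_width is the unit of
    the most significant digit, e.g., 1000 for a range of a few thousand."""
    distinct = round((high - low) / msd_width)
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    elif distinct in (1, 5, 10):
        k = 5
    else:
        raise ValueError("range not covered by the 3-4-5 rule")
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]
```

For instance, the range (−$1,000, $2,000) covers 3 distinct values at msd = $1,000, so it splits into the 3 intervals used in the worked example that follows.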



Example: discretizing profit data with the 3-4-5 rule

Step 1: examine the data: Min = −$351, Low (5th percentile) = −$159,
High (95th percentile) = $1,838, Max = $4,700
Step 2: msd = $1,000; rounding Low and High to the msd gives
Low′ = −$1,000 and High′ = $2,000, i.e., the range (−$1,000 … $2,000)
Step 3: this range covers 3 distinct values at the msd, so partition
into 3 equi-width intervals: (−$1,000 … $0], ($0 … $1,000],
($1,000 … $2,000]
Step 4: adjust the boundaries to cover Min and Max: the first interval
shrinks to (−$400 … $0] (since Min = −$351), and a new interval
($2,000 … $5,000] is added to cover Max, giving (−$400 … $0],
($0 … $1,000], ($1,000 … $2,000], ($2,000 … $5,000]
Step 5: recursively apply the rule to each interval:
(−$400 … $0] → 4 intervals of $100: (−$400 … −$300], (−$300 … −$200],
(−$200 … −$100], (−$100 … $0]
($0 … $1,000] → 5 intervals of $200: ($0 … $200], ($200 … $400],
($400 … $600], ($600 … $800], ($800 … $1,000]
($1,000 … $2,000] → 5 intervals of $200: ($1,000 … $1,200],
($1,200 … $1,400], ($1,400 … $1,600], ($1,600 … $1,800],
($1,800 … $2,000]
($2,000 … $5,000] → 3 intervals of $1,000: ($2,000 … $3,000],
($3,000 … $4,000], ($4,000 … $5,000]
 Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by
explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute
levels) by the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state,
country}
 Some hierarchies can be automatically
generated based on the analysis of the number
of distinct values per attribute in the data set
 The attribute with the most distinct values is
placed at the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter,
year

country — 15 distinct values
province_or_state — 365 distinct values
city — 3,567 distinct values
street — 674,339 distinct values
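This ordering heuristic is easy to sketch (illustrative only; the function name and the record layout as a list of dicts are my own choices): count distinct values per attribute and sort descending, so the attribute with the most distinct values sits at the lowest level.

```python
def order_hierarchy(table, attrs):
    """Order attributes from most distinct values (lowest hierarchy level)
    to fewest (highest level). table: list of dicts, one per record."""
    counts = {a: len({row[a] for row in table}) for a in attrs}
    return sorted(attrs, key=lambda a: counts[a], reverse=True)

# One country with three cities: city has more distinct values,
# so it lands below country in the generated hierarchy
rows = [{"country": "US", "city": c} for c in ["NY", "LA", "SF"]]
levels = order_hierarchy(rows, ["country", "city"])
```

As the slide notes, exceptions such as weekday (7 values) vs. month (12 values) show why the heuristic still needs human review.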
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary



 Data preparation or preprocessing is a big issue
for both data warehousing and data mining
 Descriptive data summarization is needed for
quality data preprocessing
 Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
 Many methods have been developed, but data
preprocessing is still an active area of research
 D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments.
Communications of the ACM, 42:73-78, 1999
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons,
2003
 T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure;
or, how to build a data quality browser. SIGMOD’02
 H. V. Jagadish et al. Special issue on data reduction techniques. Bulletin of the
Technical Committee on Data Engineering, 20(4), December 1997
 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
 E. Rahm and H. H. Do. Data cleaning: problems and current approaches. IEEE Bulletin
of the Technical Committee on Data Engineering, 23(4), 2000
 V. Raman and J. Hellerstein. Potter's Wheel: an interactive framework for data cleaning
and transformation. VLDB’01
 T. Redman. Data Quality: Management and Technology. Bantam Books, 1992
 Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations.
Communications of the ACM, 39:86-95, 1996
 R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE
Trans. Knowledge and Data Engineering, 7:623-640, 1995



1. What is meant by symmetric and skewed data? [5]

2. Describe techniques for smoothing out data. [10]

3. Why is it important to carry out descriptive data
summarization? Justify your response through a
fictitious quantile-quantile plot. [5]

4. Why is it necessary to carry out correlation analysis? [5]

5. Describe “data cube aggregation” and its advantages. [5]

6. Can you suggest some change(s) to the state-of-
the-art data pre-processing activity? [10]