Class-Data Preprocessing-II
Yashvardhan Sharma
CS F415
Transaction Data
• A special type of record data, where
• each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased are
the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
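As a minimal sketch (plain Python; the representation is illustrative, not from the slides), the table above can be stored as item sets keyed by TID:

# Each transaction is a set of items, keyed by transaction ID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Example query: how many transactions contain both Beer and Diaper?
count = sum(1 for items in transactions.values() if {"Beer", "Diaper"} <= items)
print(count)  # 2 (transactions 3 and 4)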
Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a
distinct attribute
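A minimal sketch (assuming NumPy; the height/weight attributes are illustrative) of such a data matrix, one row per object and one column per attribute:

import numpy as np

# Rows = data objects, columns = numeric attributes (e.g., height in m, weight in kg).
X = np.array([
    [1.7, 65.0],
    [1.8, 80.0],
    [1.6, 55.0],
])
print(X.shape)  # (3, 2): three points in a 2-dimensional attribute space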
Document-Term Matrix
• Each document becomes a ‘term’ vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
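A minimal sketch (plain Python; the two example documents are illustrative) of building such a matrix by counting term occurrences:

from collections import Counter

docs = [
    "team play ball score game game lost season",
    "coach coach ball score lost",
]
vocab = sorted({term for doc in docs for term in doc.split()})

# Each row is a term-frequency vector over the shared vocabulary.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[t] for t in vocab])
print(vocab)
print(matrix)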
Graph Data
• Examples: Generic graph and HTML Links
(Figure: a generic weighted graph.)

<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
(Figure: a sequence of transactions; each element of the sequence is a set of items/events.)
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean.)
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers (outliers = exceptions!), and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains a reduced representation that is much smaller in volume but
produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
(Figure: the four forms — data cleaning, data integration, data transformation, and data reduction.)
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—
Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• data that was inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not considered important at the time of entry
• history or changes of the data not being registered
• Missing data may need to be inferred.
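A minimal sketch (assuming pandas; the income column is illustrative) of two common remedies — dropping incomplete tuples or filling with the attribute mean:

import numpy as np
import pandas as pd

# Sales data with missing customer incomes.
df = pd.DataFrame({"income": [50_000, np.nan, 72_000, np.nan, 61_000]})

dropped = df.dropna()                                # remove tuples with missing values
filled = df.fillna({"income": df["income"].mean()})  # infer: fill with the mean
print(filled)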
Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
• Other data problems that require data cleaning
• duplicate records
• incomplete data
• inconsistent data
Noise
• Noise refers to modification of original values
• Example: distortion of a person’s voice when talking on a poor phone connection
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
• The most straightforward approach, but outliers may dominate the presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• Divides the range into N intervals, each containing approximately the
same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
• Sorted data (e.g., by price)
• 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
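A minimal sketch (plain Python) reproducing the equi-depth partitioning and smoothing-by-means steps above:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # sorted data

# Equal-width reference: W = (B - A) / N = (34 - 4) / 3 = 10
N = 3
W = (max(prices) - min(prices)) / N

# Equi-depth partitioning: N bins with the same number of values.
size = len(prices) // N
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: replace each value by its bin's mean.
smoothed = [round(sum(b) / len(b)) for b in bins for _ in b]
print(smoothed)  # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]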
Cluster Analysis
(Figure: data points grouped into clusters; values falling outside all clusters can be treated as outliers.)
Regression
(Figure: noisy data smoothed by fitting a regression line, e.g., y = x + 1; a point (X1, Y1) is replaced by its fitted value Y1'.)
Outliers
• Outliers are data objects with characteristics that are
considerably different from those of most other data objects in
the data set
Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
• Major issue when merging data from heterogeneous sources
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Data Reduction Strategies
• A data warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on the complete
data set
• Data reduction
• Obtain a reduced representation of the data set that is much smaller in volume
yet produces the same (or almost the same) analytical results
• Data reduction strategies
• Data cube aggregation
• Dimensionality reduction — remove unimportant attributes
• Data Compression
• Discretization and concept hierarchy generation
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)
• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• More “stable” data
• Aggregated data tends to have less variability
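A minimal sketch (assuming pandas; the city/region data is illustrative) of aggregation as a change of scale — city sales rolled up to regions:

import pandas as pd

sales = pd.DataFrame({
    "city":   ["Pilani", "Jaipur", "Mumbai", "Pune"],
    "region": ["North", "North", "West", "West"],
    "sales":  [120, 180, 300, 260],
})

# Aggregating cities into regions: fewer objects, more "stable" totals.
by_region = sales.groupby("region")["sales"].sum()
print(by_region)  # North 300, West 560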
Data Cube Aggregation
• The lowest level of a data cube
• the aggregated data for an individual entity of interest
• e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
• Further reduce the size of data to deal with
• Reference appropriate levels
• Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using the
data cube, when possible
Sample Cube
(Figure: a sales data cube with dimensions Date (1Qtr–4Qtr), product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico); the sum margins hold precomputed aggregates such as total annual sales of TV in U.S.A. and total Q1 sales in all countries.)
Sampling
• Sampling is used in data mining because processing the entire set of data
of interest is too expensive or time consuming.
Sampling …
• The key principle for effective sampling is the following:
• using a sample will work almost as well as using the entire data set, if the
sample is representative
Types of Sampling
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are selected for the
sample.
• In sampling with replacement, the same object can be picked up more than once
• Stratified sampling
• Split the data into several partitions; then draw random samples from each
partition
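A minimal sketch (assuming NumPy; the population and class labels are illustrative) of the three schemes above:

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)                  # population of 100 objects
labels = np.repeat([0, 1], [90, 10])   # skewed classes: 90% vs 10%

without_repl = rng.choice(data, size=10, replace=False)  # item removed once selected
with_repl = rng.choice(data, size=10, replace=True)      # same object may repeat

# Stratified: draw from each class partition in proportion to its size.
strata = [rng.choice(data[labels == c],
                     size=max(1, round(0.1 * (labels == c).sum())),
                     replace=False)
          for c in np.unique(labels)]
print(without_repl, with_repl, np.concatenate(strata))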
Sampling
• Allow a mining algorithm to run with complexity potentially sublinear
in the size of the data
• Choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence of
skew
• Develop adaptive sampling methods
• Stratified sampling:
• Approximate the percentage of each class (or subpopulation of interest) in the
overall database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a time).
Sampling
(Figures: the raw data set compared with a cluster/stratified sample drawn from it.)
Sample Size
(Figure: samples of decreasing size drawn from the same data set; finer structure is lost as the sample shrinks.)
• What sample size is necessary to get at least one
object from each of 10 groups?
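A minimal sketch (plain Python; assumes the 10 groups are equally likely) estimating by simulation the probability that a sample of size n covers all groups:

import random

def prob_all_groups(n, groups=10, trials=10_000):
    """Estimate P(a sample of size n contains at least one object from every group)."""
    hits = 0
    for _ in range(trials):
        seen = {random.randrange(groups) for _ in range(n)}
        hits += (len(seen) == groups)
    return hits / trials

for n in (10, 20, 40, 60):
    print(n, prob_all_groups(n))  # coverage probability rises with n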
Data Dimensionality
• From a theoretical point of view, increasing the number of
features should lead to better performance; in practice, however,
adding features beyond a point often hurts performance (the curse
of dimensionality).
Dimensionality Reduction
• Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Techniques
• Principal Component Analysis
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(Figure: a decision tree testing A4? at the root, with A1? and A6? below; attributes not used in the tree are discarded, giving the reduced attribute set {A1, A4, A6}.)
Dimensionality Reduction (cont’d)
• Idea: represent data in terms of basis vectors in a lower dimensional space
(embedded within the original space).
Principal Component Analysis
• Given N data vectors in k dimensions, find c ≤ k orthogonal
vectors that can best be used to represent the data
• The original data set is reduced to one consisting of N data vectors on c
principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component
vectors
• Works for numeric data only
• Used when the number of dimensions is large
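A minimal sketch (assuming NumPy; the function and variable names are my own) of PCA via eigendecomposition of the covariance matrix:

import numpy as np

def pca(X, c):
    """Project an N x k data matrix X onto its top-c principal components."""
    Xc = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(Xc, rowvar=False)           # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:c]]  # c highest-variance directions
    return Xc @ top                          # N x c reduced representation

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
print(pca(X, 2).shape)  # (100, 2)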
(Figure: PCA rotating the original axes X1, X2 into new axes Y1, Y2 aligned with the directions of greatest variance.)
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of
variation in data
(Figure: 2-D data with the direction of largest variation indicated.)
Dimensionality Reduction: PCA
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
(Figure: the two eigenvectors of the covariance matrix overlaid on the data.)
Dimensionality Reduction: PCA
(Figure: an image reconstructed from progressively fewer principal components — 206, 160, 120, 80, 40, and 10 dimensions.)
PCA: Motivation
• Choose directions such that the total variance of the data
is maximized (maximize total variance)
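In standard notation (a sketch; the symbols w and Σ are not from the slide), this is

\[
\max_{\mathbf{w}} \; \mathbf{w}^{\top} \Sigma\, \mathbf{w} \quad \text{subject to} \quad \lVert \mathbf{w} \rVert = 1,
\]

where Σ is the covariance matrix of the centered data; the maximizer is the eigenvector of Σ with the largest eigenvalue.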
Principal Component Analysis (PCA)
• Dimensionality reduction implies information loss; PCA
preserves as much information as possible by minimizing
the reconstruction error:
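A sketch of the standard objective (notation assumed, not from the slide): writing \(\hat{\mathbf{x}}_i\) for the projection of \(\mathbf{x}_i\) onto the top-c principal components, PCA minimizes

\[
E = \sum_{i=1}^{N} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert^{2}.
\]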