
Data Mining and Business Intelligence
Data Preprocessing
Module 2
Created/Adopted/Modified for
Data Mining and Business Intelligence – MCA II Semester
Vidya Vikas Institute of Engineering & Technology
Mysore
2023-24
GPD
Data Preprocessing
 Today’s real-world databases contain data that is noisy, incomplete (missing data), and inconsistent.
 Why? Their huge size (often several gigabytes or more) and their likely origin from
multiple, heterogeneous sources.
 Low-quality data will lead to low-quality mining results.
 How can the data be preprocessed in order to help improve the quality of the data
and, consequently, of the mining results?
 How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?
Data Preprocessing Techniques
 Data cleaning can be applied to remove noise and correct inconsistencies in data.
 Data integration merges data from multiple sources into a coherent data store
such as a data warehouse.
 Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.
 Data transformations (e.g., normalization) may be applied, where data are scaled
to fall within a smaller range like 0.0 to 1.0.
 This can improve the accuracy and efficiency of mining algorithms involving
distance measurements.
Data Preprocessing Techniques
 Data cleaning : remove noise and correct inconsistencies in data.
 Data integration : merge data from multiple sources into a coherent data store,
like a data warehouse.
 Data reduction : reduce data size by aggregating, eliminating redundant features,
or clustering.
 Data transformation : convert data from one form to another.
What is Data Preprocessing? — Major Tasks
q Data cleaning
q Handle missing data, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
q Data integration
q Integration of multiple databases, data cubes, or files
q Data reduction
q Dimensionality reduction
q Numerosity reduction
q Data compression
q Data transformation and data discretization
q Normalization
q Concept hierarchy generation
4
Why Preprocess the Data? — Data Quality Issues
q Measures for data quality: A multidimensional view
q Accuracy: correct or wrong, accurate or not
q Completeness: not recorded, unavailable, …
q Consistency: some modified but some not, dangling, …
q Timeliness: timely update?
q Believability: how much are the data trusted to be correct?
q Interpretability: how easily the data can be understood?

5
Data Cleaning
Data Cleaning
q Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, and transmission error
q Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
q e.g., Occupation = “ ” (missing data)
q Noisy: containing noise, errors, or outliers
q e.g., Salary = “−10” (an error)
q Inconsistent: containing discrepancies in codes or names, e.g.,
q Age = “42”, Birthday = “03/07/2010”
q Was rating “1, 2, 3”, now rating “A, B, C”
q discrepancy between duplicate records
q Intentional (e.g., disguised missing data)
q Jan. 1 as everyone’s birthday?

7
Incomplete (Missing) Data
q Data is not always available
q E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
q Missing data may be due to
q Equipment malfunction
q Inconsistent with other recorded data and thus deleted
q Data were not entered due to misunderstanding
q Certain data may not be considered important at the time of entry
q Did not register history or changes of the data
q Missing data may need to be inferred

8
How to Handle Missing Data?
q Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
q Fill in the missing value manually: tedious + infeasible?
q Fill it in automatically with
q a global constant : e.g., “unknown”, a new class?!
q the attribute mean
q the attribute mean for all samples belonging to the same class: smarter
q the most probable value: inference-based such as Bayesian formula or decision
tree

9
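As a concrete illustration of the automatic fill-in options, here is a minimal pandas sketch; the income values and class labels are made up for the example.

```python
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "income": [42000, None, 38000, None, 51000, 47000],
    "class":  ["low", "low", "low", "high", "high", "high"],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples belonging to the same class
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```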
Noisy Data
q Noise: random error or variance in a measured variable
q Incorrect attribute values may be due to
q Faulty data collection instruments
q Data entry problems
q Data transmission problems
q Technology limitation
q Inconsistency in naming convention
q Other data problems
q Duplicate records
q Incomplete data
q Inconsistent data

10
How to Handle Noisy Data?
q Binning
q First sort data and partition into (equal-frequency) bins
q Then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
q Regression
q Smooth by fitting the data into regression functions
q Clustering
q Detect and remove outliers
q Semi-supervised: Combined computer and human inspection
q Detect suspicious values and check by human (e.g., deal with possible outliers)

11
Handling Noisy Data : Binning
 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the width of intervals
will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately same number of
samples
 Good data scaling
 Managing categorical attributes can be tricky
9
Handling Noisy Data : Binning
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
10
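The same smoothing can be scripted; a small numpy sketch over the price values above, assuming three equal-frequency bins.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equi-depth) partitioning into 3 bins of 4 values each
bins = np.array_split(np.sort(prices), 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [np.full(len(b), int(round(b.mean()))) for b in bins]

# Smoothing by bin boundaries: each value snaps to the closer bin boundary
by_boundaries = [np.where(b - b.min() < b.max() - b, b.min(), b.max())
                 for b in bins]

print(by_means)       # bin means 9, 23, 29 repeated within each bin
print(by_boundaries)  # [4 4 4 15], [21 21 25 25], [26 26 26 34]
```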
Handling Noisy Data : Clustering
 Partition the data set into clusters based on similarity, and store only a cluster
representation (e.g., centroid and diameter).
 There are many choices of clustering definitions and clustering algorithms.
 Some clustering algorithms will omit outliers.
– These are probably noise.
Binning vs. Clustering
[Figure: the same data partitioned by equal-width (distance) binning, equal-depth (frequency) binning, and K-means clustering; K-means clustering leads to better results]
12
Data Cleaning as a Process
q Data discrepancy detection
q Use metadata (e.g., domain, range, dependency, distribution)
q Check field overloading
q Check uniqueness rule, consecutive rule and null rule
q Use commercial tools
q Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
q Data auditing: by analyzing data to discover rules and relationships to detect violators
(e.g., correlation and clustering to find outliers)
q Data migration and integration
q Data migration tools: allow transformations to be specified
q ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations
through a graphical user interface
q Integration of the two processes
q Iterative and interactive (e.g., Potter’s Wheel)
12
Data Integration
Data Integration
q Data integration
q Combining data from multiple sources into a coherent store
q Schema integration: e.g., A.cust-id ≡ B.cust-#
q Integrate metadata from different sources
q Entity identification:
q Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
q Detecting and resolving data value conflicts
q For the same real world entity, attribute values from different sources are
different
q Possible reasons: different representations, different scales, e.g., metric vs.
British units

14
Handling Redundancy in Data Integration
q Redundant data occur often when integration of multiple databases
q Object identification: The same attribute or object may have different names in
different databases
q Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
q Redundant attributes may be detected by correlation analysis and covariance
analysis
q Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

15
Correlation Analysis (for Categorical Data)
q Χ2 (chi-square) test:

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

q Null hypothesis: The two distributions are independent
q The cells that contribute the most to the Χ2 value are those whose actual count is
very different from the expected count
q The larger the Χ2 value, the more likely the variables are related
q Note: Correlation does not imply causality
q # of hospitals and # of car thefts in a city are correlated
q Both are causally linked to a third variable: population
16
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

How to derive the expected count 90? 450/1500 * 300 = 90

q Χ2 (chi-square) calculation (numbers in parentheses are expected counts, calculated
based on the data distribution in the two categories):

\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

q It shows that like_science_fiction and play_chess are correlated in the group
q We can reject the null hypothesis of independence at a confidence level of 0.001
17
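For reference, the same Χ2 value can be reproduced with SciPy's chi-square test on the contingency table; a minimal sketch, assuming SciPy is available (Yates' correction is disabled so the result matches the hand calculation).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = like science fiction (yes/no),
# columns = play chess (yes/no)
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(expected)  # expected counts: [[90, 360], [210, 840]]
print(chi2)      # ~507.93
print(p_value)   # far below 0.001, so independence is rejected
```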
Variance for Single Variable (Numerical Data)
q The variance of a random variable X provides a measure of how much the value of
X deviates from the mean or expected value of X:

\sigma^2 = \mathrm{var}(X) = E[(X-\mu)^2] =
\begin{cases}
\sum_{x} (x-\mu)^2 f(x) & \text{if } X \text{ is discrete} \\
\int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx & \text{if } X \text{ is continuous}
\end{cases}

q where σ2 is the variance of X, σ is called the standard deviation, and µ = E[X] is the
mean or expected value of X
q That is, variance is the expected value of the squared deviation from the mean
q It can also be written as:
\sigma^2 = \mathrm{var}(X) = E[(X-\mu)^2] = E[X^2] - \mu^2 = E[X^2] - (E[X])^2
q Sample variance is the average squared deviation of the data values x_i from the
sample mean \hat{\mu}:
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2
18
Covariance for Two Variables
q Covariance between two variables X1 and X2:

\sigma_{12} = E[(X_1-\mu_1)(X_2-\mu_2)] = E[X_1 X_2] - \mu_1\mu_2 = E[X_1 X_2] - E[X_1]E[X_2]

where µ1 = E[X1] is the respective mean or expected value of X1; similarly for µ2
q Sample covariance between X1 and X2:

\hat{\sigma}_{12} = \frac{1}{n}\sum_{i=1}^{n}(x_{i1}-\hat{\mu}_1)(x_{i2}-\hat{\mu}_2)

q Sample covariance is a generalization of the sample variance:

\hat{\sigma}_{11} = \frac{1}{n}\sum_{i=1}^{n}(x_{i1}-\hat{\mu}_1)(x_{i1}-\hat{\mu}_1) = \frac{1}{n}\sum_{i=1}^{n}(x_{i1}-\hat{\mu}_1)^2 = \hat{\sigma}_1^2

q Positive covariance: If σ12 > 0
q Negative covariance: If σ12 < 0
q Independence: If X1 and X2 are independent, σ12 = 0, but the reverse is not true
q Some pairs of random variables may have a covariance of 0 but not be independent
q Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
19
Example: Calculation of Covariance
q Suppose two stocks X1 and X2 have the following values in one week:
q (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
q Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
q Covariance formula:

\sigma_{12} = E[(X_1-\mu_1)(X_2-\mu_2)] = E[X_1 X_2] - \mu_1\mu_2

q Its computation can be simplified as: \sigma_{12} = E[X_1 X_2] - E[X_1]E[X_2]
q E(X1) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
q E(X2) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
q σ12 = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
q Thus, X1 and X2 rise together since σ12 > 0
20
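The same covariance can be checked with numpy; note that np.cov divides by n − 1 by default, so bias=True is passed to match the population-style formula used above.

```python
import numpy as np

x1 = np.array([2, 3, 5, 4, 6])     # stock X1
x2 = np.array([5, 8, 10, 11, 14])  # stock X2

# Direct computation: E[X1*X2] - E[X1]*E[X2]
cov_direct = np.mean(x1 * x2) - np.mean(x1) * np.mean(x2)

# Same result from numpy's covariance matrix (bias=True divides by n, not n-1)
cov_matrix = np.cov(x1, x2, bias=True)

print(cov_direct)        # ~4.0
print(cov_matrix[0, 1])  # ~4.0, so the two stocks tend to rise together
```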
Correlation between Two Numerical Variables
q Correlation between two variables X1 and X2 is the standardized covariance, obtained
by normalizing the covariance with the standard deviation of each variable:

\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}

q Sample correlation for two attributes X1 and X2:

\hat{\rho}_{12} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_1 \hat{\sigma}_2} = \frac{\sum_{i=1}^{n}(x_{i1}-\hat{\mu}_1)(x_{i2}-\hat{\mu}_2)}{\sqrt{\sum_{i=1}^{n}(x_{i1}-\hat{\mu}_1)^2}\,\sqrt{\sum_{i=1}^{n}(x_{i2}-\hat{\mu}_2)^2}}

where n is the number of tuples, µ1 and µ2 are the respective means of X1 and X2,
and σ1 and σ2 are the respective standard deviations of X1 and X2
q If ρ12 > 0: X1 and X2 are positively correlated (X1’s values increase as X2’s do)
q The higher the value, the stronger the correlation
q If ρ12 = 0: independent (under the same assumption as discussed for covariance)
q If ρ12 < 0: negatively correlated
21
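Continuing the stock example, the sample correlation can be read off numpy's correlation matrix; a short sketch.

```python
import numpy as np

x1 = np.array([2, 3, 5, 4, 6])
x2 = np.array([5, 8, 10, 11, 14])

# Pearson correlation matrix; the off-diagonal entry is rho_12
rho_12 = np.corrcoef(x1, x2)[0, 1]

print(rho_12)  # ~0.94: a strong positive (near-linear) relationship
```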
Visualizing Changes of Correlation Coefficient
q Correlation coefficient value range: [–1, 1]
q A set of scatter plots shows sets of points and their correlation coefficients
changing from –1 to 1
22
Covariance Matrix
q The variance and covariance information for the two variables X1 and X2
can be summarized as a 2 × 2 covariance matrix:

\Sigma = E[(X-\mu)(X-\mu)^T] = E\left[\begin{pmatrix} X_1-\mu_1 \\ X_2-\mu_2 \end{pmatrix}\begin{pmatrix} X_1-\mu_1 & X_2-\mu_2 \end{pmatrix}\right]

= \begin{pmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)] \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}

q Generalizing it to d dimensions gives a d × d covariance matrix.

23
Variance, Covariance, Correlation
 Variance provides insight into how much individual data points deviate from (i.e.,
spread out around or cluster around) the central tendency, which is the mean.
 If an attribute has near-zero variance or has a constant value for the majority of the
data points, it does not contribute much to the variability and can be flagged as
redundant.
Variance, Covariance, Correlation
 Covariance between two variables measures how they change together.
 A positive covariance indicates that the two variables tend to increase or decrease
simultaneously, while a negative covariance indicates an inverse relationship.
 High positive covariance values suggest redundant attributes that move together,
and high negative covariance values indicate attributes that move in opposite
directions.
 Identifying these strong correlations helps detect potential redundancies in the
dataset.
Variance, Covariance, Correlation
 Correlation is a standardized measure of the linear relationship between two
variables. It quantifies the strength and direction of their linear association.
 High absolute correlation coefficients (close to +1 or −1) indicate strong linear
relationships between two attributes. Such attributes provide similar information
and may be redundant for the analysis.
Variance, Covariance, Correlation
 By leveraging variance, covariance, and correlation measures during data
integration, analysts can identify redundant attributes and make informed decisions
on feature selection or dimensionality reduction.
 Removing redundant attributes can improve the efficiency and interpretability of
machine learning models while retaining essential information for accurate
predictions or analysis.
Data Reduction
Data Reduction Techniques
 Data reduction techniques are methods used to reduce the volume, size, or
complexity of large datasets while preserving as much relevant information as
possible.
 These techniques are valuable when dealing with massive datasets that may be
computationally expensive or challenging to analyze in their original form.
 By reducing data, analysts can perform tasks more efficiently and effectively
without sacrificing critical insights.
Data Reduction Techniques
 Data reduction techniques are methods used to reduce the volume,
size, or complexity of large datasets while preserving as much
relevant information as possible.
Dimensionality Reduction
 Dimensionality reduction is a data reduction technique that aims to reduce the
number of features or variables in a dataset while retaining the most relevant
information.
 It is used to simplify complex datasets with high-dimensional attributes,
making them easier to analyze, visualize, and process.
 High-dimensional data can lead to computational challenges, increased
memory requirements, and reduced performance in machine learning
algorithms.
 Dimensionality reduction helps in overcoming these issues and improving
the efficiency and effectiveness of data analysis.
Dimensionality Reduction
 Two main approaches: Feature Selection & Feature Extraction
 1. Feature Selection: Identify and select a subset of the original features that are
most informative or relevant to the analysis.
 These methods evaluate the importance of each feature based on statistical
metrics or machine learning algorithms. Keep only those that contribute
significantly to the target variable or model performance.
 Attribute Subset Selection is an example method.
Dimensionality Reduction
 Two main approaches: Feature Selection & Feature Extraction
 2. Feature Extraction: Feature extraction methods create new, lower-dimensional
features by combining or transforming the original features.
 These methods seek to capture the essential information contained in the data
while reducing its dimensionality.
 Discrete Wavelet Transform (DWT), Principal Component Analysis (PCA), and
Singular Value Decomposition (SVD) are common feature extraction techniques.
Dimensionality Reduction – Discrete Wavelet Transform
 DWT is a signal processing technique used to break down data into
multiple frequency components at different resolutions. It transforms
data from the time or spatial domain into the frequency domain.
 Wavelets: Small oscillating functions that are shifted and scaled to
analyze data at various levels of detail.
 DWT captures both global trends (low-frequency components) and
local details (high-frequency components).
 Wavelet transforms have many real-world applications, including the
compression of fingerprint images, computer vision, analysis of time-
series data, and data cleaning.
Dimensionality Reduction – Discrete Wavelet Transform
 Dimensionality Reduction Using DWT:
 DWT compresses data by eliminating insignificant high-frequency details while
retaining the core information in the lower frequencies.
 This reduces the number of dimensions (attributes) without losing essential data
patterns.
Dimensionality Reduction – Discrete Wavelet Transform
Dimensionality Reduction – DWT Steps
 Apply DWT: Break the data into different resolution levels using wavelets.

 Separate Coefficients:
 Low-frequency components (Approximation): Capture the general trend of
the data.
 High-frequency components (Details): Capture noise and minor fluctuations.
 Discard High-Frequency Components: These usually represent noise or less
important data, so they can be removed, resulting in fewer attributes.
 Reconstruct Data: Use only the low-frequency components to form a
compressed version of the original dataset.
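A minimal sketch of these steps using the PyWavelets package (assumed to be installed as pywt); the signal, wavelet choice, and decomposition level are illustrative only.

```python
import numpy as np
import pywt

# Hypothetical signal: a smooth trend plus high-frequency noise
t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 3 * t) + 0.2 * np.random.randn(256)

# 1. Apply DWT: decompose into approximation + detail coefficients
coeffs = pywt.wavedec(signal, "haar", level=3)
approx, *details = coeffs

# 2./3. Keep the low-frequency approximation, discard the high-frequency details
kept = [approx] + [np.zeros_like(d) for d in details]

# 4. Reconstruct a compressed version of the data from the kept coefficients
reconstructed = pywt.waverec(kept, "haar")

print(len(signal), "->", len(approx), "retained coefficients")
```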
Dimensionality Reduction – Principal Components Analysis
 Principal Components Analysis is a statistical technique used to reduce
the dimensionality of data by transforming the data into a set of
linearly uncorrelated variables called Principal Components.
 Suppose that the data to be reduced consist of tuples or data vectors
described by n attributes or dimensions.
 Principal components analysis searches for k n-dimensional
orthogonal vectors that can best be used to represent the data, where
k ≤ n.
 The original data are thus projected onto a much smaller space,
resulting in dimensionality reduction.
Dimensionality Reduction – Principal Components Analysis
 Principal components analysis searches for k n-dimensional
orthogonal vectors that can best be used to represent the data, where
k ≤ n.
Principal Component Analysis (PCA)
q PCA: A statistical procedure that uses an orthogonal transformation to convert a set
of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components
q The original data are projected onto a much smaller space, resulting in
dimensionality reduction
q Method: Find the eigenvectors of the covariance matrix; these eigenvectors define
the new space
[Figure: a ball travels in a straight line; data from three cameras contain much redundancy]

56
Dimensionality Reduction – PCA Process
 Standardize the data: Ensure that each feature has a mean of 0 and unit variance.
 Compute the covariance matrix: Find the relationships between all the features.
 Calculate eigenvectors and eigenvalues: These determine the directions (principal
components) and the magnitude of variance in each direction.
 Select principal components: Choose the top components that capture the most
variance and discard the rest.
 Transform the data: Project the data onto the selected principal components,
reducing the dimensionality.
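These steps map directly onto scikit-learn's PCA; a hedged sketch on a made-up five-attribute dataset (the library and random data are assumptions, not part of the slides).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 100 tuples described by 5 correlated attributes
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
noise = 0.05 * rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + noise])

# Step 1: standardize; steps 2-3 (covariance, eigenvectors) happen inside PCA
X_std = StandardScaler().fit_transform(X)

# Steps 4-5: keep the top k = 2 principal components and project onto them
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X.shape, "->", X_reduced.shape)          # (100, 5) -> (100, 2)
print(pca.explained_variance_ratio_.round(3))  # variance captured per component
```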
Dimensionality Reduction – Attribute Subset Selection
 Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).
 The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained
using all attributes.
 Mining on a reduced set of attributes has an additional benefit: It
reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
Dimensionality Reduction – Attribute Subset Selection
Methods
 1. Stepwise Forward Selection:

 Start with no attributes, add one at a time.


 Adds the attribute that improves the model the most.
 Stops when no further improvement is seen.
 2. Stepwise Backward Elimination:
 Start with all attributes, remove the least important one at a
time.
 Stops when further removal decreases model performance.
Dimensionality Reduction – Attribute Subset Selection
Methods
 3. Combination of Forward & Backward Selection :
 Combines adding and removing attributes step-by-step.
 Flexible
approach that refines the model as attributes are
added or removed.
 4. Decision Tree Induction :
 Automatically selects the most relevant features by splitting
data at each node.
 The tree grows based on the most important attributes,
irrelevant ones are ignored.
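Stepwise forward selection (and, with direction="backward", backward elimination) is available in scikit-learn as SequentialFeatureSelector; a small illustrative sketch on a synthetic dataset, assuming a reasonably recent scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 attributes, only a few of which are informative
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

# Stepwise forward selection: start with no attributes and greedily add the one
# that improves cross-validated accuracy most, stopping at 3 attributes
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)

print(selector.get_support())  # boolean mask of the selected attribute subset
```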
Dimensionality Reduction – Attribute Subset Selection
Methods
Attribute Creation (Feature Generation)
q Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones
q Three general methodologies
q Attribute extraction
q Domain-specific
q Mapping data to new space (see: data reduction)
q E.g., Fourier transformation, wavelet transformation, manifold approaches (not
covered)
q Attribute construction
q Combining features (see: discriminative frequent patterns in Chapter on
“Advanced Classification”)
q Data discretization

60
Data Reduction Techniques
 Data reduction techniques are methods used to reduce the volume,
size, or complexity of large datasets while preserving as much
relevant information as possible.
Numerosity Reduction / Data Reduction
 Also known as data size reduction
 Methods/techniques that aim at reducing the number of records/objects/rows in
consideration.
 That is, a reduced representation of the dataset
 Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex analysis may take a very long time to run on the complete data set
Numerosity Reduction / Data Reduction
 Methods for data reduction (data size reduction or numerosity reduction):
 Regression and log-linear models
 Histograms, clustering, sampling
 Data cube aggregation
 Data compression
Data Reduction: Parametric vs. Non-Parametric Methods
q Reduce data volume by choosing alternative, smaller forms of data representation
q Parametric methods (e.g., regression)
q Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
q Ex.: Log-linear models—obtain value at a point in m-D space as the product on
appropriate marginal subspaces
q Non-parametric methods
q Do not assume models
q Major families: histograms, clustering, sampling, …
[Figures: linear regression of tip vs. bill; a histogram; clustering on the raw data; stratified sampling]
26
Parametric Data Reduction: Regression Analysis
q Regression analysis: A collective name for techniques for the modeling and analysis
of numerical data consisting of values of a dependent variable (also called response
variable or measurement) and of one or more independent variables (also known as
explanatory variables or predictors)
q The parameters are estimated so as to give a "best fit" of the data
q Most commonly the best fit is evaluated by using the least squares method, but
other criteria have also been used
q Used for prediction (including forecasting of time-series data), inference, hypothesis
testing, and modeling of causal relationships
[Figure: data points (X1, Y1) fitted by the regression line y = x + 1]

27
Linear and Multiple Regression
q Linear regression: Y = w X + b
q Data modeled to fit a straight line
q Often uses the least-square method to fit the line
q Two regression coefficients, w and b, specify the line
and are to be estimated by using the data at hand
q Applying the least squares criterion to the known values
of Y1, Y2, …, and X1, X2, …
q Nonlinear regression:
q Data are modeled by a function which is a nonlinear
combination of the model parameters and depends
on one or more independent variables
q The data are fitted by a method of successive
approximations
28
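As a parametric reduction, only the two fitted coefficients w and b need to be stored. A short numpy sketch with made-up data roughly following y = x + 1.

```python
import numpy as np

# Hypothetical data roughly following y = x + 1 with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 5.8, 7.1])

# Least-squares fit of the straight line Y = w*X + b
w, b = np.polyfit(x, y, deg=1)

# The data set is now summarized by just two stored parameters
print(f"Y = {w:.2f} * X + {b:.2f}")
print("reconstructed values:", np.round(w * x + b, 2))
```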
Histogram
 Divide the data into bins and aggregate the values within each bin.
 Instead of representing individual data points, you group data into intervals and
 get the count of data points in each bin, or
 calculate summary statistics, such as the mean, median, sum, or count, for each bin.
 This reduces the number of data points and can give you a good overview of the
data's distribution without the need to retain each individual value.
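A small numpy sketch of histogram-based reduction, reusing the price values from the binning example; the bucket edges and summaries shown are illustrative.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning of the value range into 3 buckets
counts, edges = np.histogram(prices, bins=3)

# Reduced representation: per-bucket count and mean instead of raw values
bucket_ids = np.digitize(prices, edges[1:-1])
bucket_means = [prices[bucket_ids == i].mean() for i in range(len(counts))]

print("bucket edges:", edges)        # [ 4. 14. 24. 34.]
print("counts per bucket:", counts)  # [3 3 6]
print("bucket means:", np.round(bucket_means, 1))  # [ 7.  19.  27.7]
```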
Histogram
 Example:
[Figure: raw data values on the left and the resulting bins/histogram on the right]
Histogram Analysis
q Divide data into buckets and store average (sum) for each bucket
q Partitioning rules:
q Equal-width: equal bucket range
q Equal-frequency (or equal-depth)
[Figure: equal-width histogram with price buckets from 10,000 to 90,000]
30
Clustering
q Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and
diameter) only
q Can be very effective if data is clustered but not if data
is “smeared”
q Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
q There are many choices of clustering definitions and
clustering algorithms
q Cluster analysis will be studied in depth in Chapter 10

31
Sampling
q Sampling: obtaining a small sample s to represent the whole data set N
q Allow a mining algorithm to run in complexity that is potentially sub-linear to the
size of the data
q Key principle: Choose a representative subset of the data
q Simple random sampling may have very poor performance in the presence of
skew
q Develop adaptive sampling methods, e.g., stratified sampling:
q Note: Sampling may not reduce database I/Os (page at a time)

34
Types of Sampling
q Simple random sampling: equal probability of selecting any particular item
q Sampling without replacement
q Once an object is selected, it is removed from the population
q Sampling with replacement
q A selected object is not removed from the population
q Stratified sampling
q Partition (or cluster) the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
[Figures: simple random sampling of the raw data; stratified sampling]
35
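A pandas sketch contrasting simple random sampling with stratified sampling on a hypothetical, skewed customer table; the column names are made up.

```python
import pandas as pd

# Hypothetical, skewed customer table: many "regular", few "premium" customers
customers = pd.DataFrame({
    "customer_id": range(1000),
    "segment": ["regular"] * 900 + ["premium"] * 100,
})

# Simple random sampling without replacement: may under-represent "premium"
srs = customers.sample(frac=0.1, random_state=42)

# Stratified sampling: draw ~10% from each segment separately
stratified = customers.groupby("segment", group_keys=False).sample(
    frac=0.1, random_state=42)

print(srs["segment"].value_counts())
print(stratified["segment"].value_counts())  # ~90 regular, 10 premium
```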
Data Cube Aggregation
 Data cube aggregation is a process used in multidimensional data analysis to
summarize and condense data across multiple dimensions.
 It involves creating a multidimensional representation of data, known as a data
cube, and then aggregating or summarizing the data along different dimensions to
provide insights and facilitate efficient analysis.
Data Cube Aggregation
 For data reduction, we can use data cube aggregation to summarize and compress
large datasets, making them more manageable and efficient to analyze.
 It helps in reducing the volume of data while preserving essential information for
data mining tasks.
 By using data cube aggregation techniques to summarize and reduce the dataset,
data miners can work with more manageable datasets while still retaining the
critical information required for meaningful analysis and valuable discoveries.
Data Cube Aggregation
q The lowest level of a data cube (base cuboid)
q The aggregated data for an individual entity of
interest
q E.g., a customer in a phone calling data warehouse
q Multiple levels of aggregation in data cubes
q Further reduce the size of data to deal with
q Reference appropriate levels
q Use the smallest representation which is enough to
solve the task
q Queries regarding aggregated information should be
answered using data cube, when possible
36
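A pandas sketch of climbing aggregation levels, from daily sales at the base level to monthly and annual summaries; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical daily sales: the base cuboid of a (date, branch) cube
daily = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=365, freq="D"),
    "branch": ["Mysore"] * 365,
    "sales": 1000.0,
})

# Roll up to monthly totals per branch (a higher-level cuboid)
monthly = daily.groupby([daily["date"].dt.to_period("M"), "branch"])["sales"].sum()

# Roll up further to annual totals per branch
annual = daily.groupby([daily["date"].dt.year, "branch"])["sales"].sum()

print(monthly.head(3))
print(annual)
```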
Data Reduction Techniques
 Data reduction techniques are methods used to reduce the volume,
size, or complexity of large datasets while preserving as much
relevant information as possible.
Data Compression
q String compression
q There are extensive theories and well-tuned algorithms
q Typically lossless, but only limited manipulation is possible without expansion
q Audio/video compression
q Typically lossy compression, with progressive refinement
q Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
q Time sequence is not audio
q Typically short and varies slowly with time
q Data reduction and dimensionality reduction may also be considered as forms of
data compression
[Figure: lossy vs. lossless compression — original data to compressed data (lossless) and original data to approximated data (lossy)]
37
Wavelet Transform: A Data Compression Technique
q Wavelet Transform
q Decomposes a signal into different
frequency subbands
q Applicable to n-dimensional signals
q Data are transformed to preserve relative
distance between objects at different levels
of resolution
q Allow natural clusters to become more
distinguishable
q Used for image compression

38
Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelet basis functions]
q Discrete wavelet transform (DWT) for linear signal processing, multi-resolution
analysis
q Compressed approximation: Store only a small fraction of the strongest of the
wavelet coefficients
q Similar to discrete Fourier transform (DFT), but better lossy compression, localized
in space
q Method:
q Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
q Each transform has 2 functions: smoothing, difference
q Applies to pairs of data, resulting in two sets of data of length L/2
q Applies the two functions recursively, until it reaches the desired length
39
Why Wavelet Transform?
q Use hat-shape filters
q Emphasize region where points cluster
q Suppress weaker information in their boundaries
q Effective removal of outliers
q Insensitive to noise, insensitive to input order
q Multi-resolution
q Detect arbitrary shaped clusters at different scales
q Efficient
q Complexity O(N)
q Only applicable to low dimensional data
41
Data Transformation
Data Transformation
 Data transformation is a preprocessing technique used to convert
data into a suitable format for analysis, modeling, and visualization.
 The goal of data transformation is to improve the quality,
distribution, and suitability of the data for specific tasks.
 The data are transformed or consolidated so that the resulting
mining process may be more efficient, and the patterns found may
be easier to understand.
 A function that maps the entire set of values of a given attribute to
a new set of replacement values s.t. each old value can be
identified with one of the new values
29
Data Transformation
 Methods
 1. Smoothing: Remove noise from data
 2. Attribute/feature construction: New attributes constructed from the given ones
 3. Aggregation: Summarization, data cube construction
 4. Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 5. Discretization
 6. Concept hierarchy climbing
30
Data Transformation
 Smoothing: Remove noise from the data.
 Binning, Regression, Clustering.
 Attribute Construction
 New attributes are constructed and added from the given set of attributes to
help the mining process
 Aggregation: Summary or aggregation operations are applied to the data
 Data Cube construction
 Example: Daily sales data aggregated so as to compute monthly and annual
total amounts
31
Normalization
 Changing measurement units from meters to inches for height, or
from kilograms to pounds for weight, may lead to very different
results.
 In general, expressing an attribute in smaller units will lead to a
larger range for that attribute, and thus tend to give such an
attribute greater effect or “weight.”
 To help avoid dependence on the choice of measurement units, the
data should be normalized or standardized.
 This involves transforming the data to fall within a smaller or
common range such as [−1, 1] or [0.0, 1.0].
32
Normalization
 Normalizing the data attempts to give all attributes an equal weight.
 The terms standardize and normalize are used interchangeably in data
preprocessing. (In statistics, they mean different things.)
 There are many methods for data normalization. We study:
 min-max normalization,
 z-score normalization, and
 normalization by decimal scaling.
 For our discussion, let A be a numeric attribute with n observed values, v1, v2, . . . , vn.
33
Normalization
 Min-max normalization: to [new_minA, new_maxA]

v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A

 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716
 Min-max normalization performs a linear transformation on the original data.
 Min-max normalization preserves the relationships among the original data values.
 It will encounter an “out-of-bounds” error if a future input case for normalization
falls outside of the original data range for A.
34
Normalization
 Z-score normalization (μ: mean, σ: standard deviation):

v' = \frac{v - \mu_A}{\sigma_A}

Z-score: the distance between the raw score and the population mean in units of
the standard deviation
 Ex. Let μ = 54,000 and σ = 16,000. Then \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225
 In z-score normalization (or zero-mean normalization), the values for an attribute, A,
are normalized based on the mean (i.e., average) and standard deviation of A.
 This method of normalization is useful when the actual minimum and maximum of
attribute A are unknown, or when there are outliers that dominate the min-max
normalization.
35
Normalization
 Normalization by decimal scaling:

v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1

 Normalization by decimal scaling normalizes by moving the decimal point of values
of attribute A.
 The number of decimal points moved depends on the maximum absolute value of A.
 Example: if the values of A range from −986 to 917, the maximum absolute value is
986, so each value is divided by 1,000 (i.e., j = 3); −986 normalizes to −0.986 and
917 normalizes to 0.917.
36
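The three normalization methods side by side, as a small numpy sketch using the income example from these slides.

```python
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization with the stated mean and standard deviation
mu, sigma = 54_000.0, 16_000.0
z_score = (income - mu) / sigma

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / 10**j

print(np.round(min_max, 3))         # 73,600 maps to 0.716
print(np.round(z_score, 3))         # 73,600 maps to 1.225
print(np.round(decimal_scaled, 3))  # j = 5, so 73,600 maps to 0.736
```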
Discretization
 We differentiate three types of attributes:
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic rank
 Numeric—real numbers, e.g., integer or real numbers
 Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 It can be Supervised or Unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification
37
Data Discretization Methods
q Binning
q Top-down split, unsupervised
q Histogram analysis
q Top-down split, unsupervised
q Clustering analysis
q Unsupervised, top-down split or bottom-up merge
q Decision-tree analysis
q Supervised, top-down split
q Correlation (e.g., χ2) analysis
q Unsupervised, bottom-up merge
q Note: All the methods can be applied recursively
45
Simple Discretization: Binning
q Equal-width (distance) partitioning
q Divides the range into N intervals of equal size: uniform grid
q if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
q The most straightforward, but outliers may dominate presentation
q Skewed data is not handled well
q Equal-depth (frequency) partitioning
q Divides the range into N intervals, each containing approximately same number
of samples
q Good data scaling
q Managing categorical attributes can be tricky
46
Example: Binning Methods for Data Smoothing
q Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
47
Discretization Without Supervision: Binning vs. Clustering
[Figure: the same data discretized by equal-width (distance) binning, equal-depth (frequency) binning, and K-means clustering; K-means clustering leads to better results]
48
Discretization by Classification & Correlation Analysis
q Classification (e.g., decision tree analysis)

q Supervised: Given class labels, e.g., cancerous vs. benign


q Using entropy to determine split point (discretization point)
q Top-down, recursive split
q Details to be covered in Chapter “Classification”
q Correlation analysis (e.g., Chi-merge: χ2-based discretization)

q Supervised: use class information


q Bottom-up merge: Find the best neighboring intervals (those having similar
distributions of classes, i.e., low χ2 values) to merge
q Merge performed recursively, until a predefined stopping condition

49
Concept Hierarchy Generation
q Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is
usually associated with each dimension in a data warehouse
q Concept hierarchies facilitate drilling and rolling in data warehouses to view data
in multiple granularity
q Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
q Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
q Concept hierarchy can be automatically formed for both numeric and nominal
data—For numeric data, use discretization methods shown
50
Concept Hierarchy Generation for Nominal Data
q Specification of a partial/total ordering of attributes explicitly at the schema level
by users or experts
q street < city < state < country
q Specification of a hierarchy for a set of values by explicit data grouping
q {Urbana, Champaign, Chicago} < Illinois
q Specification of only a partial set of attributes
q E.g., only street < city, not others
q Automatic generation of hierarchies (or attribute levels) by the analysis of the
number of distinct values
q E.g., for a set of attributes: {street, city, state, country}

51
Automatic Concept Hierarchy Generation
q Some hierarchies can be automatically generated based on the analysis of the
number of distinct values per attribute in the data set
q The attribute with the most distinct values is placed at the lowest level of the
hierarchy
q Exceptions, e.g., weekday, month, quarter, year

country            15 distinct values
province_or_state  365 distinct values
city               3,567 distinct values
street             674,339 distinct values

52
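A sketch of this heuristic: count distinct values per attribute and order the attributes accordingly, with the fewest-valued attribute at the top of the hierarchy; the location table below is hypothetical.

```python
import pandas as pd

# Hypothetical location table
df = pd.DataFrame({
    "street":  ["MG Road", "100 Ft Road", "JLB Road",
                "Brigade Road", "Sayyaji Rao Road", "Anna Salai"],
    "city":    ["Mysore", "Bengaluru", "Mysore",
                "Bengaluru", "Mysore", "Chennai"],
    "state":   ["Karnataka", "Karnataka", "Karnataka",
                "Karnataka", "Karnataka", "Tamil Nadu"],
    "country": ["India", "India", "India", "India", "India", "India"],
})

# Count distinct values per attribute; fewer distinct values -> higher level
levels = df.nunique().sort_values()
print(levels)

# Lowest level (most distinct values) first, highest level last
print("generated hierarchy:", " < ".join(reversed(levels.index.tolist())))
# street < city < state < country
```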
Data Preprocessing Techniques
 Data cleaning can be applied to remove noise and correct inconsistencies in data.
 Data integration merges data from multiple sources into a coherent data store
such as a data warehouse.
 Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.
 Data transformations (e.g., normalization) may be applied, where data are scaled
to fall within a smaller range like 0.0 to 1.0.
 This can improve the accuracy and efficiency of mining algorithms involving
distance measurements.
