CS822 Data Mining: Week 3

The document discusses data preprocessing in data mining, emphasizing the importance of data quality, which includes accuracy, completeness, consistency, timeliness, believability, and interpretability. It outlines major tasks such as data cleaning, integration, reduction, and transformation, detailing methods to handle issues like missing values, noisy data, and redundancy. Additionally, it covers techniques for dimensionality reduction and correlation analysis to enhance data analysis and mining results.


1

CS822
Data
Mining
Instructor: Dr. Muhammad Tahir

2
Data Preprocessing

3
Data Quality: Why Preprocess the
Data?
• Preprocessing improves data quality along several dimensions:
• Accuracy: correctness of the data. Incorrect attribute values may result from faulty instruments, human error, or intentional (disguised) entries.
• Completeness: all required information is available. Data can be incomplete when values are not always recorded, such as customer information for sales transaction data.
• Consistency: data from all sources agree, i.e., there are no discrepancies in codes or values across systems.
• Timeliness: data are up to date. For example, some store branches may sync their sales records with a delay.
• Believability: reflects how much the data are trusted by users.
• Interpretability: reflects how easily the data can be understood.
Major Tasks in Data Preprocessing
• Data cleaning routines work to “clean” the data by
filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving
inconsistencies.
• Data integration is the process of integrating data from
multiple sources (databases, data cubes, or files).
• Data reduction obtains a reduced representation of
the data set that is much smaller in volume but
produces the same (or almost the same) mining results.
• Data transformation converts the data into
forms appropriate for mining, leading to better mining results.
5
Data Preprocessing
Overview

6
Data Cleaning
• Data in the real world is dirty: it contains lots of potentially incorrect values,
e.g., due to faulty instruments, human or computer error, or transmission errors
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete Data (Missing Values)
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of
entry
• history or changes of the data were not registered
• Missing data may need to be inferred
How to Handle Missing Values?
• Ignore the sample: not effective when the percentage of ignored
samples is too high.
• Fill in the missing value manually: tedious and often infeasible.
• Fill it in automatically (see the sketch below) with
• a global constant, e.g., “unknown” (which effectively creates a new class)
• the attribute mean over all data
• the attribute mean over all samples belonging to the same class
• the most probable value, based on a statistical model.
9
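A minimal pandas sketch of the automatic fill-in options above; the income and class columns are made up for illustration.

```python
# Hypothetical data: fill missing income with a global constant, the
# attribute mean, or the mean of samples in the same class.
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, None, 62_000, None, 48_000],
    "class":  ["A", "A", "B", "B", "A"],
})

df["income_const"] = df["income"].fillna(-1)                    # global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())    # attribute mean
df["income_class_mean"] = df["income"].fillna(                  # class-wise mean
    df.groupby("class")["income"].transform("mean"))
print(df)
```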
Noisy Data
• Noise: random error in data
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data 10
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers) 11
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading i.e. storing multiple types of information
within a single data field (column) in a dataset
• Check
• uniqueness rule, i.e., each value of a given attribute must be different from
all other values of that attribute
• consecutive rule, i.e., there can be no missing values between the lowest and
highest values of the attribute, and all values must be unique (e.g., check numbers)
• null rule, i.e., whether a particular field is allowed to have null (empty)
values, and how such values are recorded

12
Data Cleaning as a Process
• Use Commercial Tools for
• Data scrubbing (also called data cleansing) is the process of:
• Identifying and correcting inaccurate, incomplete, or irrelevant data.
• Removing duplicate records.
• Fixing structural errors (like typos, incorrect formatting, or wrongly
placed data).
• Handling missing values by filling them with appropriate substitutes or
removing incomplete records.
• Ensuring data consistency across datasets.

• Data auditing refers to the process of systematically reviewing, assessing,
and verifying data to ensure its accuracy, consistency, completeness,
and compliance with predefined standards or rules.

13
Data Cleaning as a Process
• Data migration and integration
• Data migration tools are software solutions used to move data from one
system, storage, or format to another. These tools play a crucial role in data
management, particularly when organizations upgrade systems, consolidate
databases, or move to cloud platforms.
• ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Integration of the two processes
• refers to the combination of data preprocessing (data cleaning, transformation,
integration, etc.) with the actual data mining process (pattern discovery,
modeling, analysis).

14
Data Integration

15
Data Integration
• Combining data from multiple sources into a single,
unified view (a coherent data store).
• In data mining, data often comes from different
databases, files, or systems.
• To analyze it effectively, you need all data combined in
a single format/location.
• Example:
Combining customer data from a CRM system with sales
data from an ERP system into a data warehouse.

16
Schema Integration
• The process of merging schemas (data structures) from
different sources into a consistent schema.
• Challenge:
• Different systems may label or structure the same data differently.
• Example:
• One system has A.cust-id, while another uses B.cust-#.
• Both refer to the same concept — customer ID — so they must be
mapped together.
• Key Step:
• Integrating metadata (information about data) to match up
fields/attributes correctly across sources.
17
Entity Identification Problem
• The challenge of identifying that records from
different sources actually refer to the same real-
world entity (person, company, product, etc.).
• Example:
• One database lists "Bill Clinton".
• Another lists "William Clinton".
• Both refer to the same person, but they need to be
recognized as such.
• Why it’s important:
• If not resolved, duplicate records can skew analysis or cause
incorrect results.
18
Detecting and Resolving Data Value
Conflicts
• Even when the same entity is correctly identified, different sources
may report different attribute values for the same data point.
• Example:
• A person’s weight in one system is recorded as 70 kg, while another system
says 154 lbs.
• These are equivalent, but they are in different scales (metric vs imperial).
• Other conflicts can happen with formats (DD/MM/YYYY vs MM/DD/YYYY) or
naming styles ("John Smith" vs "Smith, John").
• Key Task:
• Detect these differences and resolve them into a single, accurate value.

19
Handling Redundancy in Data
Integration

20
Handling Redundancy in Data
Integration
• Redundant Data in Data Integration
• When combining data from multiple databases or sources,
the same information may appear in multiple places, often
in different formats or levels of detail.
• This creates redundancies, which can lead to data
duplication, inconsistencies, and wasted storage space.
• Problems
• Object Identification
• Derivable Data

21
Handling Redundancy in Data
Integration
• Object Identification
• The same object (entity) or attribute (field) may have different
names across different databases.
• Example: In Database A, a customer’s ID is stored as cust_id.
• In Database B, the same field is called customer_number.
• These need to be identified as the same attribute to correctly
integrate the data.
• Without matching equivalent fields across sources, data
integration will fail, and redundant data will accumulate.

22
Handling Redundancy in Data
Integration
• Derivable Data
• Sometimes, an attribute in one table can be derived
from attributes in another table.
• Example:
• A table might store quarterly revenue for a company.
• Another table stores annual revenue, which is simply the
sum of the quarterly revenues.
• Key Point:
• These derived attributes can introduce redundancy
because they can be computed instead of being stored
directly.
23
Handling Redundancy in Data
Integration
• Detecting Redundant Attributes
• Redundant attributes (like annual revenue and sum of
quarterly revenue) can sometimes be detected using:
• Correlation Analysis:
• Measures the statistical relationship between two
attributes.
• High correlation suggests the attributes may be redundant
or related in some way.
• Covariance Analysis:
• Measures how two attributes vary together.
• Helps to detect attributes that provide similar
information. 24
Handling Redundancy in Data
Integration
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality

25
Correlation Analysis (Nominal Data)
• Correlation analysis is used on attributes or features to determine their
dependence on one another.
• The χ² (chi-square) test is used to find the correlation (dependency) between two nominal features:

$\chi^2 = \sum \frac{(o_i - e_i)^2}{e_i}$

where $o_i$ is the observed (actual) count and $e_i$ is the expected count.
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual
count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population 26
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction    50 (210)   1000 (840)       1050
Sum (col.)                 300         1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated from the data distribution in the two categories):

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$

• It shows that like_science_fiction and play_chess are
correlated in the group
27
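A small sketch of the same test with scipy; the expected counts and χ² value match the slide (Yates' correction is disabled so the result agrees with the hand calculation).

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)            # [[ 90. 360.] [210. 840.]]
print(round(chi2, 2))      # 507.93
print(p < 0.05)            # True -> the two attributes are correlated
```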
Visually Evaluating Correlation
• The result of correlation analysis is typically
a value between -1 and +1.
• This value is called the correlation
coefficient.
• The sign (+ or -) shows the direction of the
relationship.
• The magnitude (how close it is to 1 or -1)
shows the strength of the relationship.

28
Covariance (Numeric Data)
• Covariance is similar to correlation
• Measures how two variables change together.
• Correlation coefficient:
• Standardized version of covariance.
• Scales the relationship to always fall between -1 and +1
• Positive covariance: If Cov(A, B) > 0, then A and B both tend to be larger
than their expected values.
• Negative covariance: If Cov(A, B) < 0, then if A is larger than its expected
value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent then Cov(A, B) = 0, but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are
not independent. Only under some additional assumptions (e.g., the
data follow multivariate normal distributions) does a covariance of 0
imply independence
29
Covariance versus Correlation

• Covariance tells you if two variables move in the same


direction or opposite directions — but not how strong the
relationship is.
Correlation does everything covariance does, but it also
tells you the strength of the relationship and is easy to
interpret (because it’s on a -1 to +1 scale).
• Example
• If Cov(X, Y) = 150, you know X and Y are positively related, but
you don’t know if that’s a weak or strong relationship.
• If Correlation(X, Y) = 0.92, you know X and Y are strongly
positively correlated. 30
Covariance: An Example

• Covariance: $Cov(A,B) = E[(A - \bar{A})(B - \bar{B})]$, which can be simplified in computation as $Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$
• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0. 31
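A numpy sketch reproducing the stock example (population covariance, i.e., dividing by N, which matches the slide's value of 4):

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

cov_ab = (A * B).mean() - A.mean() * B.mean()   # E(A·B) - E(A)·E(B)
print(cov_ab)                                   # 4.0
print(np.cov(A, B, bias=True)[0, 1])            # same value via numpy
print(round(np.corrcoef(A, B)[0, 1], 2))        # 0.94: strong positive correlation
```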
Data Reduction

32
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete dataset.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression (lossy/lossless) 33
Data Reduction 1: Dimensionality
Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering and
outlier analysis, becomes less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection) 34
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or
more other attributes
• E.g., purchase price of a product and the amount of sales
tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining
task at hand
• E.g., students' ID is often irrelevant to the task of predicting
students' GPA
35
Heuristic Search in Attribute
Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence assumption:
choose by significance tests
• Best step-wise feature selection:
• The best single-attribute is picked first
• Then the next best attribute conditioned on the first, and so on
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking 36
Attribute Creation (Feature
Generation)
• Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation,
manifold approaches (not covered)
• Attribute construction
• Combining features (see: discriminative frequent patterns in
Chapter 7)
• Data discretization
37
Data Reduction 2: Numerosity
Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods
• Non-parametric methods

38
Parametric methods

39
Parametric methods
• Parametric methods for data reduction involve
summarizing a large dataset using a fixed number of
parameters.
• These methods assume that the data follows a known
statistical distribution and reduce the dataset while
retaining important patterns.

40
Key Parametric Methods for Data
Reduction
1. Regression Models (Linear Regression)
• Concept: Instead of storing an entire dataset, a mathematical
function (model) is used to approximate the relationship between
variables.
• Example:
• Suppose we have 1000 data points showing how house prices depend
on their size.
• A linear regression model can approximate this relationship using the
formula: Price=50000+200×Size
• Instead of storing all 1000 points, we store only the equation and
predict prices based on size.
41
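A short sketch of this idea with numpy; the house-price data is synthetic and generated to roughly follow the slide's equation, so only two parameters need to be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
size = rng.uniform(50, 300, 1000)                        # 1000 house sizes
price = 50_000 + 200 * size + rng.normal(0, 5_000, 1000)

slope, intercept = np.polyfit(size, price, deg=1)        # keep 2 parameters, not 1000 points
print(round(intercept), round(slope))                    # ~50000, ~200
print(round(intercept + slope * 120))                    # predicted price for a 120-unit house
```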
Key Parametric Methods for Data
Reduction
2. Principal Component Analysis (PCA)
• Concept: PCA reduces the number of dimensions while
retaining the most important information.
• Example:
• A dataset with 10 attributes (e.g., height, weight, age,
income, education, etc.) can be reduced to 2 or 3 principal
components that explain most of the variance.
• This reduces storage and speeds up computation while
preserving essential trends.

42
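A minimal scikit-learn sketch of this reduction; random data stands in for the 10 real attributes.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(1).normal(size=(500, 10))   # 500 rows, 10 attributes
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (500, 2)
print(pca.explained_variance_ratio_)     # variance retained by each component
```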
Key Parametric Methods for Data
Reduction
3. Logarithmic Data Reduction
• Concept: Transforming data using logarithmic functions
to compress large values.
• Example:
• Instead of storing a dataset of 1 million records with large
numbers (e.g., 1000000, 500000, etc.), we can store their
logarithmic values: log⁡(1000000)=6, log⁡(500000)=5.7
• This helps in compressing data while keeping relative
differences intact.

43
Key Parametric Methods for Data
Reduction
4. Data Approximation (Curve Fitting)
• Concept: Instead of storing all data points, an equation
approximates the trend.
• Example:
• A dataset of temperature variations over a year (365
days) can be approximated using a sine wave function
rather than storing every single value.

44
Non-Parametric methods

45
Non-Parametric methods
• Non-parametric methods for data reduction do not
assume any predefined statistical distribution or
mathematical model.
• Instead, they reduce the dataset while preserving key
patterns and relationships.
• These methods are more flexible than parametric
methods because they adapt to the structure of the
data.

46
Key Non-Parametric Methods for
Data Reduction
1. Sampling
• Concept: Instead of analyzing the entire dataset, a
smaller, representative subset is used.
• Example:
• Suppose we have 1 million customer records. Instead of
using all data points, we randomly select 10,000 customers
that reflect the overall trends.
• This reduces computational cost while maintaining accuracy.

47
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data

48
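A pandas sketch of the sampling variants above, on a made-up customer table (the column names are illustrative only).

```python
import pandas as pd

df = pd.DataFrame({"customer_id": range(1_000),
                   "segment": ["A"] * 700 + ["B"] * 300})

srs_wor = df.sample(n=100, replace=False, random_state=0)   # SRS without replacement
srs_wr = df.sample(n=100, replace=True, random_state=0)     # SRS with replacement
stratified = (df.groupby("segment", group_keys=False)       # ~10% from each stratum
                .apply(lambda g: g.sample(frac=0.1, random_state=0)))
print(len(srs_wor), len(srs_wr), len(stratified))           # 100 100 100
```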
Sampling: With or without
Replacement

[Figure: raw data reduced by SRS (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement).]

49
Sampling: Cluster or Stratified
Sampling
[Figure: raw data (left) and the corresponding cluster/stratified sample (right).]

50
Key Non-Parametric Methods for
Data Reduction
2. Clustering-Based Data Reduction
• Concept: Groups similar data points together and
stores only the cluster representatives.
• Example:
• Suppose we have 100,000 customer records with different
shopping behaviours.
• Using k-Means clustering, we can group them into 5
customer types and store only these representative
profiles, rather than all individual records.

51
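A scikit-learn sketch of this idea; synthetic two-feature data stands in for the real shopping behaviour.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(100_000, 2))       # stand-in behaviour features
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

representatives = km.cluster_centers_    # 5 profiles instead of 100,000 records
labels = km.labels_                      # which profile each record belongs to
print(representatives.shape)             # (5, 2)
```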
Key Non-Parametric Methods for
Data Reduction
3. Dimensionality Reduction using Feature
Selection
• Concept: Removes irrelevant or redundant attributes
while keeping important ones.
• Example:
• A dataset with 100 attributes (e.g., height, weight, income,
education, ZIP code, etc.) might contain irrelevant
attributes.
• If ZIP code does not impact customer spending, we remove it
to make analysis more efficient.

52
Key Non-Parametric Methods for
Data Reduction
4. Binning (Histogram-Based Reduction)
• Concept: Groups continuous data into a smaller number of bins to simplify storage and analysis.
• Example:
• A dataset has 1 million income values ranging from $10,000 to $200,000.
• Instead of storing all values, we group them into bins:
• Low Income: $10,000 – $40,000
• Middle Income: $40,001 – $100,000
• High Income: $100,001 – $200,000
• This keeps the distribution intact while reducing the number of stored values.
[Figure: histogram of the income values, one bar per bin.]
Key Non-Parametric Methods for
Data Reduction
5. Discretization
• Concept: Converts continuous numerical values into
categorical values.
• Example:
• Instead of storing exact student scores (e.g., 89.4, 92.1,
76.8), we group them into letter grades:
• A (80–100)
• B (60–79)
• C (40–59)
• This makes analysis easier and reduces storage requirements.

54
Data Transformation

55
Data Transformation
• A function that maps the entire set of values of a given attribute
to a new set of replacement values such that each old value can
be identified with one of the new values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing 56
Normalization
• Min-max normalization: to [new_min_A, new_max_A]

$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$

• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

• Z-score normalization (μ: mean, σ: standard deviation):

$v' = \frac{v - \mu_A}{\sigma_A}$

• Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

57
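A minimal sketch of both normalizations applied to the income example above:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
```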
Normalization by Decimal
Scaling
• Definition: Normalization by decimal scaling is a data preprocessing method
that rescales numerical values so that they fall within a range between -1 and 1.
This is done by dividing each data value by a power of 10 determined by the
maximum absolute value in the dataset:

$v' = \frac{v}{10^j}$

where:
• v′ is the normalized value,
• v is the original value,
• j is the smallest integer such that the maximum absolute value of v′
is less than 1.

58
Normalization by Decimal
Scaling
Example: Consider a dataset with values: [150, 300, 1200, 5000]
1. Find the maximum absolute value: 5000.
2. Determine j: the smallest j such that 5000 / 10^j is less than 1.
   j = 4 (since 5000 / 10^4 = 0.5)
3. Normalize each value:
   150 / 10^4 = 0.015
   300 / 10^4 = 0.03
   1200 / 10^4 = 0.12
   5000 / 10^4 = 0.5
• The transformed dataset becomes [0.015, 0.03, 0.12, 0.5],
ensuring values remain within the range [-1, 1].
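A small sketch of decimal scaling as defined above:

```python
import math

def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1    # smallest j with max_abs / 10**j < 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([150, 300, 1200, 5000])
print(j, scaled)    # 4 [0.015, 0.03, 0.12, 0.5]
```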
Discretization

60
Discretization
• Discretization is a data preprocessing technique used in
data mining and machine learning that transforms
continuous numerical data into discrete categories or
bins.
• This process is particularly useful when working with
algorithms that require categorical input or when
reducing the complexity of large datasets.

61
Discretization
Why is Discretization Important?
1.Improves Interpretability: Converting continuous values into
discrete intervals makes the data easier to understand and
analyze.
2.Enhances Model Performance: Some machine learning
models, such as decision trees and rule-based classifiers, perform
better with categorical data.
3.Reduces Noise: Grouping values into bins helps smooth out
minor variations and reduces the impact of small fluctuations in
data.
4.Facilitates Pattern Recognition: Many patterns become more
evident when similar values are grouped together.
62
Types of Discretization Techniques
There are two main categories of discretization:
1.Supervised Discretization: Uses class labels to
determine optimal binning.
2.Unsupervised Discretization: Does not use class
labels and instead follows predefined rules.

63
Types of Discretization Techniques
1. Equal-Width Binning (Unsupervised)
• The range of values is divided into intervals of equal
size.
• Example: Suppose we have student scores ranging from
0 to 100. If we use equal-width binning with 4 bins, the
intervals might be:
• 0–25 (Low)
• 26–50 (Medium)
• 51–75 (High)
• 76–100 (Very High)
• Limitation: Uneven data distribution can lead to unbalanced bins.
Types of Discretization Techniques
2. Equal-Frequency Binning (Unsupervised)
• Each bin contains approximately the same number of
data points.
• Example: If we have 100 data points and want 5 bins,
each bin will contain around 20 values.
• Advantage: Ensures each category has an equal
number of instances.
• Limitation: Bin ranges may be uneven, making it
harder to interpret.

65
Types of Discretization Techniques
3. Entropy-Based Binning (Supervised)
• Uses information gain to determine optimal binning.
• Commonly used in decision trees.
• Example: If we are categorizing patients based on their
cholesterol levels and their likelihood of heart disease,
entropy-based binning will create intervals that best
separate the risk groups.

66
Types of Discretization Techniques
4. Clustering-Based Binning (Unsupervised or
Supervised)
• Uses clustering algorithms like k-Means to group similar
values.
• Example: If we have customer purchase amounts, k-
Means can cluster them into “low spenders,” “moderate
spenders,” and “high spenders.”

67
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate
presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing
approximately same number of samples
• Good data scaling
• Managing categorical attributes can be tricky 68
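A pandas sketch contrasting the two partitioning schemes, using the price data from the worked binning example later in these slides:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=3)    # 3 intervals of equal width
equal_depth = pd.qcut(prices, q=3)      # 3 bins with ~equal numbers of samples
print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```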
Smoothing by Bin Means –
Numerical Example
• Smoothing by bin means is a data preprocessing
technique in which the values in a bin are replaced with
the mean (average) value of that bin. This technique
reduces noise and smooths the data while maintaining
the overall trend.
Steps in Smoothing by Bin Means:
1.Sort the data in ascending order.
2.Divide the data into bins (equal-width or equal-
frequency).
3.Compute the mean of each bin.
4. Replace each value in the bin with the mean of that bin.
Smoothing by Bin Means –
Numerical Example
Example
Given Data:
5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 1: Sort the Data
Already sorted: 5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 2: Divide into Bins
Let’s divide the data into 3 equal-width bins:
• Bin 1: (5, 18, 20)
• Bin 2: (25, 27, 30, 35)
• Bin 3: (40, 50, 60)
Step 3: Compute the Mean for Each Bin
• Bin 1 Mean: = ≈14.33
• Bin 2 Mean: = = 29.25
• Bin 3 Mean: = = 50
• Step 4: Replace Each Value with the Bin Mean 70
Smoothing by Bin Means –
Numerical Example

Final Smoothed Data:


• 14.33, 14.33, 14.33, 29.25, 29.25, 29.25, 29.25, 50, 50,
50
71
Smoothing by Bin Boundaries –
Numerical Example
Smoothing by bin boundaries is a data preprocessing technique
used in data mining. It involves replacing the values in each
bin with the closest boundary value, reducing noise and
variation while maintaining the overall distribution.
Steps in Smoothing by Bin Boundaries:
1.Sort the Data: Arrange the given data in ascending order.
2.Divide into Bins: Partition the data into equal-width or
equal-frequency bins.
3.Replace Values: Each value in a bin is replaced with the
closest boundary value (either the minimum or maximum
value of that bin).
72
Smoothing by Bin Boundaries –
Numerical Example
Example
Given Dataset:
5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 1: Sort the Data
Already sorted: 5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 2: Divide into Bins
Let’s divide the data into 3 equal-width bins:
• Bin 1: (5, 18, 20)
• Bin 2: (25, 27, 30, 35)
• Bin 3: (40, 50, 60)
Step 3: Apply Smoothing by Bin Boundaries
• Replace each value with the closest bin boundary.
73
Smoothing by Bin Boundaries –
Numerical Example

Final Smoothed Data:


• 5, 20, 20, 25, 25, 35, 35, 40, 60, 60
74
Simple Discretization: Binning
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
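A short sketch reproducing both smoothing variants on the price data above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # equi-depth, 4 values per bin

by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)     # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)    # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```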
75
Measuring Data Similarity and
Dissimilarity

76
Measuring Data Similarity and
Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
77
Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p dimensions
• Two modes

$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

• Dissimilarity matrix
• n data points, but registers only the distance
• A triangular matrix
• Single mode

$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
78
Proximity Measure for Nominal
Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
• Method 1: Simple matching
• m: # of matches, p: total # of variables

$d(i,j) = \frac{p - m}{p}$

• Method 2: Use a large number of binary attributes
• creating a new binary attribute for each of the M nominal states

79
Proximity Measure for Binary
Attributes
• A contingency table for binary data: for objects i and j, let q = number of attributes equal to 1 for both, r = number equal to 1 for i but 0 for j, s = number equal to 0 for i but 1 for j, and t = number equal to 0 for both.
• Distance measure for symmetric binary variables:

$d(i,j) = \frac{r + s}{q + r + s + t}$

• Distance measure for asymmetric binary variables:

$d(i,j) = \frac{r + s}{q + r + s}$

• Jaccard coefficient (similarity measure for asymmetric binary variables):

$sim_{Jaccard}(i,j) = \frac{q}{q + r + s}$

• Note: the Jaccard coefficient is the same as “coherence”
80
Dissimilarity between Binary
Variables
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
• Example
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

• Gender is a symmetric attribute


• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N 0

$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
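A tiny sketch computing d(Jack, Mary) from the table (Y/P mapped to 1, N to 0; the symmetric attribute Gender is excluded):

```python
jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1 .. Test-4
mary = [1, 0, 1, 0, 1, 0]

q = sum(a == 1 and b == 1 for a, b in zip(jack, mary))   # both 1
r = sum(a == 1 and b == 0 for a, b in zip(jack, mary))   # Jack 1, Mary 0
s = sum(a == 0 and b == 1 for a, b in zip(jack, mary))   # Jack 0, Mary 1

print(round((r + s) / (q + r + s), 2))   # 0.33
```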
Standardizing Numeric Data
• Z-score:

$z = \frac{x - \mu}{\sigma}$

• x: raw score to be standardized, μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of the standard deviation
• negative when the raw score is below the mean, “+” when above
• An alternative way: calculate the mean absolute deviation

$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

• standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
• Using mean absolute deviation is more robust than using standard deviation
Example: Data Matrix and
Dissimilarity Matrix
Data Matrix

point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Dissimilarity Matrix (with Euclidean Distance)

       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0
Distance on Numeric Data:
Minkowski Distance
• Minkowski distance: A popular distance measure

$d(i,j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
84
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are
different between two binary vectors

$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

• h = 2: Euclidean (L2 norm) distance

$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

• h → ∞: “supremum” (Lmax norm, L∞ norm) distance
• This is the maximum difference between any component
(attribute) of the vectors
Example: Data Matrix and Dissimilarity
Matrix
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)
       x1   x2   x3   x4
x1     0
x2     5    0
x3     3    6    0
x4     6    1    7    0

Euclidean (L2)
       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0

Supremum (L∞)
       x1   x2   x3   x4
x1     0
x2     3    0
x3     2    5    0
x4     3    1    5    0
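A scipy sketch reproducing the three distance matrices for the four points above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])   # x1 .. x4

for name, metric in [("Manhattan (L1)", "cityblock"),
                     ("Euclidean (L2)", "euclidean"),
                     ("Supremum (Linf)", "chebyshev")]:
    print(name)
    print(np.round(squareform(pdist(X, metric=metric)), 2))
```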
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th
object in the f-th variable by

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$

• compute the dissimilarity using methods for interval-scaled
variables
Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
• One may use a weighted formula to combine their effects:

$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

• f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• Treat $z_{if}$ as interval-scaled
88
Cosine Similarity
• A document can be represented by thousands of attributes,
each recording the frequency of a particular word (such as
keywords) or phrase in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d
89
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
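A small numpy sketch checking the same result:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94
```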
You are welcome

91
