CS822 Data Mining: Week 3

The document discusses data preprocessing in data mining, emphasizing the importance of data quality, which includes accuracy, completeness, consistency, timeliness, believability, and interpretability. It outlines major tasks such as data cleaning, integration, reduction, and transformation, detailing methods to handle issues like missing values, noisy data, and redundancy. Additionally, it covers techniques for dimensionality reduction and correlation analysis to enhance data analysis and mining results.


1

CS822
Data
Mining
Instructor: Dr. Muhammad Tahir

2
Data Preprocessing

3
Data Quality: Why Preprocess the
Data?
• Preprocessing improves data quality along several dimensions:
• Accuracy: correctness of the data. Incorrect attribute values may result from faulty instruments, human error, or intentional (disguised) entries.
• Completeness: all required information is available. Data can be incomplete when values are not always recorded, such as customer information for sales transaction data.
• Consistency: data from all sources agree, i.e., there are no discrepancies in codes or values across systems.
• Timeliness: data are up to date. For example, some store branches may sync their sales records with a delay.
• Believability: reflects how much the data are trusted by users.
• Interpretability: reflects how easily the data can be understood.
Major Tasks in Data Preprocessing
• Data cleaning routines work to “clean” the data by
filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving
inconsistencies.
• Data integration is the process of integrating data from
multiple sources (databases, data cubes, or files).
• Data reduction obtains a reduced representation of
the data set that is much smaller in volume but
produces the same (or almost the same) mining results.
• Data transformation converts the data into
forms appropriate for mining, leading to better mining results.
5
Data Preprocessing
Overview

6
Data Cleaning
• Data in the real world is dirty: it contains lots of potentially incorrect values,
e.g., due to faulty instruments, human or computer error, or transmission errors
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete Data (Missing Values)
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of
entry
• history or changes of the data were not registered
• Missing data may need to be inferred
How to Handle Missing Values?
• Ignore the sample: not effective when the percentage of ignored
samples is too high.
• Fill in the missing value manually: tedious and often infeasible.
• Fill it in automatically (see the sketch below) with
• a global constant, e.g., “unknown” (which effectively creates a new class)
• the attribute mean over all data
• the attribute mean over all samples belonging to the same class
• the most probable value, based on a statistical model.
9
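A minimal pandas sketch of the automatic fill-in options above; the income and class columns are made up for illustration.

```python
# Hypothetical data: fill missing income with a global constant, the
# attribute mean, or the mean of samples in the same class.
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, None, 62_000, None, 48_000],
    "class":  ["A", "A", "B", "B", "A"],
})

df["income_const"] = df["income"].fillna(-1)                    # global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())    # attribute mean
df["income_class_mean"] = df["income"].fillna(                  # class-wise mean
    df.groupby("class")["income"].transform("mean"))
print(df)
```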
Noisy Data
• Noise: random error in data
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data 10
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers) 11
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading i.e. storing multiple types of information
within a single data field (column) in a dataset
• Check
• uniqueness rule, i.e., each value of a given attribute must be different from
all other values of that attribute
• consecutive rule, i.e., there can be no missing values between the lowest and
highest values of the attribute, and all values must be unique (e.g., check numbers)
• null rule, i.e., whether a particular field is allowed to have null (empty)
values, and how such values are recorded

12
Data Cleaning as a Process
• Use Commercial Tools for
• Data scrubbing (also called data cleansing) is the process of:
• Identifying and correcting inaccurate, incomplete, or irrelevant data.
• Removing duplicate records.
• Fixing structural errors (like typos, incorrect formatting, or wrongly
placed data).
• Handling missing values by filling them with appropriate substitutes or
removing incomplete records.
• Ensuring data consistency across datasets.

• Data auditing refers to the process of systematically reviewing, assessing,
and verifying data to ensure its accuracy, consistency, completeness,
and compliance with predefined standards or rules.

13
Data Cleaning as a Process
• Data migration and integration
• Data migration tools are software solutions used to move data from one
system, storage, or format to another. These tools play a crucial role in data
management, particularly when organizations upgrade systems, consolidate
databases, or move to cloud platforms.
• ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Integration of the two processes
• refers to the combination of data preprocessing (data cleaning, transformation,
integration, etc.) with the actual data mining process (pattern discovery,
modeling, analysis).

14
Data Integration

15
Data Integration
• Combining data from multiple sources into a single,
unified view (a coherent data store).
• In data mining, data often comes from different
databases, files, or systems.
• To analyze it effectively, you need all data combined in
a single format/location.
• Example:
Combining customer data from a CRM system with sales
data from an ERP system into a data warehouse.

16
Schema Integration
• The process of merging schemas (data structures) from
different sources into a consistent schema.
• Challenge:
• Different systems may label or structure the same data differently.
• Example:
• One system has A.cust-id, while another uses B.cust-#.
• Both refer to the same concept — customer ID — so they must be
mapped together.
• Key Step:
• Integrating metadata (information about data) to match up
fields/attributes correctly across sources.
17
Entity Identification Problem
• The challenge of identifying that records from
different sources actually refer to the same real-
world entity (person, company, product, etc.).
• Example:
• One database lists "Bill Clinton".
• Another lists "William Clinton".
• Both refer to the same person, but they need to be
recognized as such.
• Why it’s important:
• If not resolved, duplicate records can skew analysis or cause
incorrect results.
18
Detecting and Resolving Data Value
Conflicts
• Even when the same entity is correctly identified, different sources
may report different attribute values for the same data point.
• Example:
• A person’s weight in one system is recorded as 70 kg, while another system
says 154 lbs.
• These are equivalent, but they are in different scales (metric vs imperial).
• Other conflicts can happen with formats (DD/MM/YYYY vs MM/DD/YYYY) or
naming styles ("John Smith" vs "Smith, John").
• Key Task:
• Detect these differences and resolve them into a single, accurate value.

19
Handling Redundancy in Data
Integration

20
Handling Redundancy in Data
Integration
• Redundant Data in Data Integration
• When combining data from multiple databases or sources,
the same information may appear in multiple places, often
in different formats or levels of detail.
• This creates redundancies, which can lead to data
duplication, inconsistencies, and wasted storage space.
• Problems
• Object Identification
• Derivable Data

21
Handling Redundancy in Data
Integration
• Object Identification
• The same object (entity) or attribute (field) may have different
names across different databases.
• Example: In Database A, a customer’s ID is stored as cust_id.
• In Database B, the same field is called customer_number.
• These need to be identified as the same attribute to correctly
integrate the data.
• Without matching equivalent fields across sources, data
integration will fail, and redundant data will accumulate.

22
Handling Redundancy in Data
Integration
• Derivable Data
• Sometimes, an attribute in one table can be derived
from attributes in another table.
• Example:
• A table might store quarterly revenue for a company.
• Another table stores annual revenue, which is simply the
sum of the quarterly revenues.
• Key Point:
• These derived attributes can introduce redundancy
because they can be computed instead of being stored
directly.
23
Handling Redundancy in Data
Integration
• Detecting Redundant Attributes
• Redundant attributes (like annual revenue and sum of
quarterly revenue) can sometimes be detected using:
• Correlation Analysis:
• Measures the statistical relationship between two
attributes.
• High correlation suggests the attributes may be redundant
or related in some way.
• Covariance Analysis:
• Measures how two attributes vary together.
• Helps to detect attributes that provide similar
information. 24
Handling Redundancy in Data
Integration
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality

25
Correlation Analysis (Nominal Data)
• Correlation analysis is used on attributes or features to determine their
dependence on one another.
• The χ² (chi-square) test is used to find the correlation (dependency) between two nominal features:

$\chi^2 = \sum \frac{(o_i - e_i)^2}{e_i}$

where $o_i$ is the observed (actual) count and $e_i$ is the expected count.
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual
count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population 26
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction    50 (210)   1000 (840)       1050
Sum (col.)                 300         1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated from the data distribution in the two categories):

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$

• It shows that like_science_fiction and play_chess are
correlated in the group
27
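A small sketch of the same test with scipy; the expected counts and χ² value match the slide (Yates' correction is disabled so the result agrees with the hand calculation).

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)            # [[ 90. 360.] [210. 840.]]
print(round(chi2, 2))      # 507.93
print(p < 0.05)            # True -> the two attributes are correlated
```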
Visually Evaluating Correlation
• The result of correlation analysis is typically
a value between -1 and +1.
• This value is called the correlation
coefficient.
• The sign (+ or -) shows the direction of the
relationship.
• The magnitude (how close it is to 1 or -1)
shows the strength of the relationship.

28
Covariance (Numeric Data)
• Covariance is similar to correlation
• Measures how two variables change together.
• Correlation coefficient:
• Standardized version of covariance.
• Scales the relationship to always fall between -1 and +1
• Positive covariance: If Cov(A, B) > 0, then A and B both tend to be larger
than their expected values.
• Negative covariance: If Cov(A, B) < 0, then if A is larger than its expected
value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent then Cov(A, B) = 0, but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are
not independent. Only under some additional assumptions (e.g., the
data follow multivariate normal distributions) does a covariance of 0
imply independence
29
Covariance versus Correlation

• Covariance tells you if two variables move in the same


direction or opposite directions — but not how strong the
relationship is.
Correlation does everything covariance does, but it also
tells you the strength of the relationship and is easy to
interpret (because it’s on a -1 to +1 scale).
• Example
• If Cov(X, Y) = 150, you know X and Y are positively related, but
you don’t know if that’s a weak or strong relationship.
• If Correlation(X, Y) = 0.92, you know X and Y are strongly
positively correlated. 30
Covariance: An Example

• Covariance: $Cov(A,B) = E[(A - \bar{A})(B - \bar{B})]$, which can be simplified in computation as $Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$
• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0. 31
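A numpy sketch reproducing the stock example (population covariance, i.e., dividing by N, which matches the slide's value of 4):

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

cov_ab = (A * B).mean() - A.mean() * B.mean()   # E(A·B) - E(A)·E(B)
print(cov_ab)                                   # 4.0
print(np.cov(A, B, bias=True)[0, 1])            # same value via numpy
print(round(np.corrcoef(A, B)[0, 1], 2))        # 0.94: strong positive correlation
```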
Data Reduction

32
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete dataset.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression (lossy/lossless) 33
Data Reduction 1: Dimensionality
Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering and
outlier analysis, becomes less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection) 34
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or
more other attributes
• E.g., purchase price of a product and the amount of sales
tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining
task at hand
• E.g., students' ID is often irrelevant to the task of predicting
students' GPA
35
Heuristic Search in Attribute
Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence assumption:
choose by significance tests
• Best step-wise feature selection:
• The best single-attribute is picked first
• Then the next best attribute conditioned on the first, and so on
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking 36
Attribute Creation (Feature
Generation)
• Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation,
manifold approaches (not covered)
• Attribute construction
• Combining features (see: discriminative frequent patterns in
Chapter 7)
• Data discretization
37
Data Reduction 2: Numerosity
Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods
• Non-parametric methods

38
Parametric methods

39
Parametric methods
• Parametric methods for data reduction involve
summarizing a large dataset using a fixed number of
parameters.
• These methods assume that the data follows a known
statistical distribution and reduce the dataset while
retaining important patterns.

40
Key Parametric Methods for Data
Reduction
1. Regression Models (Linear Regression)
• Concept: Instead of storing an entire dataset, a mathematical
function (model) is used to approximate the relationship between
variables.
• Example:
• Suppose we have 1000 data points showing how house prices depend
on their size.
• A linear regression model can approximate this relationship using the
formula: Price=50000+200×Size
• Instead of storing all 1000 points, we store only the equation and
predict prices based on size.
41
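A short sketch of this idea with numpy; the house-price data is synthetic and generated to roughly follow the slide's equation, so only two parameters need to be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
size = rng.uniform(50, 300, 1000)                        # 1000 house sizes
price = 50_000 + 200 * size + rng.normal(0, 5_000, 1000)

slope, intercept = np.polyfit(size, price, deg=1)        # keep 2 parameters, not 1000 points
print(round(intercept), round(slope))                    # ~50000, ~200
print(round(intercept + slope * 120))                    # predicted price for a 120-unit house
```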
Key Parametric Methods for Data
Reduction
2. Principal Component Analysis (PCA)
• Concept: PCA reduces the number of dimensions while
retaining the most important information.
• Example:
• A dataset with 10 attributes (e.g., height, weight, age,
income, education, etc.) can be reduced to 2 or 3 principal
components that explain most of the variance.
• This reduces storage and speeds up computation while
preserving essential trends.

42
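A minimal scikit-learn sketch of this reduction; random data stands in for the 10 real attributes.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(1).normal(size=(500, 10))   # 500 rows, 10 attributes
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (500, 2)
print(pca.explained_variance_ratio_)     # variance retained by each component
```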
Key Parametric Methods for Data
Reduction
3. Logarithmic Data Reduction
• Concept: Transforming data using logarithmic functions
to compress large values.
• Example:
• Instead of storing a dataset of 1 million records with large
numbers (e.g., 1000000, 500000, etc.), we can store their
logarithmic values: log⁡(1000000)=6, log⁡(500000)=5.7
• This helps in compressing data while keeping relative
differences intact.

43
Key Parametric Methods for Data
Reduction
4. Data Approximation (Curve Fitting)
• Concept: Instead of storing all data points, an equation
approximates the trend.
• Example:
• A dataset of temperature variations over a year (365
days) can be approximated using a sine wave function
rather than storing every single value.

44
Non-Parametric methods

45
Non-Parametric methods
• Non-parametric methods for data reduction do not
assume any predefined statistical distribution or
mathematical model.
• Instead, they reduce the dataset while preserving key
patterns and relationships.
• These methods are more flexible than parametric
methods because they adapt to the structure of the
data.

46
Key Non-Parametric Methods for
Data Reduction
1. Sampling
• Concept: Instead of analyzing the entire dataset, a
smaller, representative subset is used.
• Example:
• Suppose we have 1 million customer records. Instead of
using all data points, we randomly select 10,000 customers
that reflect the overall trends.
• This reduces computational cost while maintaining accuracy.

47
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data

48
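A pandas sketch of the sampling variants above, on a made-up customer table (the column names are illustrative only).

```python
import pandas as pd

df = pd.DataFrame({"customer_id": range(1_000),
                   "segment": ["A"] * 700 + ["B"] * 300})

srs_wor = df.sample(n=100, replace=False, random_state=0)   # SRS without replacement
srs_wr = df.sample(n=100, replace=True, random_state=0)     # SRS with replacement
stratified = (df.groupby("segment", group_keys=False)       # ~10% from each stratum
                .apply(lambda g: g.sample(frac=0.1, random_state=0)))
print(len(srs_wor), len(srs_wr), len(stratified))           # 100 100 100
```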
Sampling: With or without
Replacement

[Figure: raw data reduced by SRS (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement).]

49
Sampling: Cluster or Stratified
Sampling
[Figure: raw data (left) and the corresponding cluster/stratified sample (right).]

50
Key Non-Parametric Methods for
Data Reduction
2. Clustering-Based Data Reduction
• Concept: Groups similar data points together and
stores only the cluster representatives.
• Example:
• Suppose we have 100,000 customer records with different
shopping behaviours.
• Using k-Means clustering, we can group them into 5
customer types and store only these representative
profiles, rather than all individual records.

51
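A scikit-learn sketch of this idea; synthetic two-feature data stands in for the real shopping behaviour.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(100_000, 2))       # stand-in behaviour features
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

representatives = km.cluster_centers_    # 5 profiles instead of 100,000 records
labels = km.labels_                      # which profile each record belongs to
print(representatives.shape)             # (5, 2)
```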
Key Non-Parametric Methods for
Data Reduction
3. Dimensionality Reduction using Feature
Selection
• Concept: Removes irrelevant or redundant attributes
while keeping important ones.
• Example:
• A dataset with 100 attributes (e.g., height, weight, income,
education, ZIP code, etc.) might contain irrelevant
attributes.
• If ZIP code does not impact customer spending, we remove it
to make analysis more efficient.

52
Key Non-Parametric Methods for
Data Reduction
4. Binning (Histogram-Based Reduction)
• Concept: Groups continuous data into a smaller number of bins to simplify storage and analysis.
• Example:
• A dataset has 1 million income values ranging from $10,000 to $200,000.
• Instead of storing all values, we group them into bins:
• Low Income: $10,000 – $40,000
• Middle Income: $40,001 – $100,000
• High Income: $100,001 – $200,000
• This keeps the distribution intact while reducing the number of stored values.
[Figure: histogram of the income values, one bar per bin.]
Key Non-Parametric Methods for
Data Reduction
5. Discretization
• Concept: Converts continuous numerical values into
categorical values.
• Example:
• Instead of storing exact student scores (e.g., 89.4, 92.1,
76.8), we group them into letter grades:
• A (80–100)
• B (60–79)
• C (40–59)
• This makes analysis easier and reduces storage requirements.

54
Data Transformation

55
Data Transformation
• A function that maps the entire set of values of a given attribute
to a new set of replacement values such that each old value can
be identified with one of the new values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing 56
Normalization
• Min-max normalization: to [new_min_A, new_max_A]

$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$

• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

• Z-score normalization (μ: mean, σ: standard deviation):

$v' = \frac{v - \mu_A}{\sigma_A}$

• Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

57
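A minimal sketch of both normalizations applied to the income example above:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
```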
Normalization by Decimal
Scaling
• Definition: Normalization by decimal scaling is a data preprocessing method
that rescales numerical values so that they fall within a range between -1 and 1.
This is done by dividing each data value by a power of 10 determined by the
maximum absolute value in the dataset:

$v' = \frac{v}{10^j}$

where:
• v′ is the normalized value,
• v is the original value,
• j is the smallest integer such that the maximum absolute value of v′
is less than 1.

58
Normalization by Decimal
Scaling
Example: Consider a dataset with values: [150, 300, 1200, 5000]
1. Find the maximum absolute value: 5000.
2. Determine j: the smallest j such that 5000 / 10^j is less than 1.
   j = 4 (since 5000 / 10^4 = 0.5)
3. Normalize each value:
   150 / 10^4 = 0.015
   300 / 10^4 = 0.03
   1200 / 10^4 = 0.12
   5000 / 10^4 = 0.5
• The transformed dataset becomes [0.015, 0.03, 0.12, 0.5],
ensuring values remain within the range [-1, 1].
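A small sketch of decimal scaling as defined above:

```python
import math

def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1    # smallest j with max_abs / 10**j < 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([150, 300, 1200, 5000])
print(j, scaled)    # 4 [0.015, 0.03, 0.12, 0.5]
```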
Discretization

60
Discretization
• Discretization is a data preprocessing technique used in
data mining and machine learning that transforms
continuous numerical data into discrete categories or
bins.
• This process is particularly useful when working with
algorithms that require categorical input or when
reducing the complexity of large datasets.

61
Discretization
Why is Discretization Important?
1.Improves Interpretability: Converting continuous values into
discrete intervals makes the data easier to understand and
analyze.
2.Enhances Model Performance: Some machine learning
models, such as decision trees and rule-based classifiers, perform
better with categorical data.
3.Reduces Noise: Grouping values into bins helps smooth out
minor variations and reduces the impact of small fluctuations in
data.
4.Facilitates Pattern Recognition: Many patterns become more
evident when similar values are grouped together.
62
Types of Discretization Techniques
There are two main categories of discretization:
1.Supervised Discretization: Uses class labels to
determine optimal binning.
2.Unsupervised Discretization: Does not use class
labels and instead follows predefined rules.

63
Types of Discretization Techniques
1. Equal-Width Binning (Unsupervised)
• The range of values is divided into intervals of equal
size.
• Example: Suppose we have student scores ranging from
0 to 100. If we use equal-width binning with 4 bins, the
intervals might be:
• 0–25 (Low)
• 26–50 (Medium)
• 51–75 (High)
• 76–100 (Very High)
• Limitation: Uneven data distribution can lead to unbalanced bins.
Types of Discretization Techniques
2. Equal-Frequency Binning (Unsupervised)
• Each bin contains approximately the same number of
data points.
• Example: If we have 100 data points and want 5 bins,
each bin will contain around 20 values.
• Advantage: Ensures each category has an equal
number of instances.
• Limitation: Bin ranges may be uneven, making it
harder to interpret.

65
Types of Discretization Techniques
3. Entropy-Based Binning (Supervised)
• Uses information gain to determine optimal binning.
• Commonly used in decision trees.
• Example: If we are categorizing patients based on their
cholesterol levels and their likelihood of heart disease,
entropy-based binning will create intervals that best
separate the risk groups.

66
Types of Discretization Techniques
4. Clustering-Based Binning (Unsupervised or
Supervised)
• Uses clustering algorithms like k-Means to group similar
values.
• Example: If we have customer purchase amounts, k-
Means can cluster them into “low spenders,” “moderate
spenders,” and “high spenders.”

67
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate
presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing
approximately same number of samples
• Good data scaling
• Managing categorical attributes can be tricky 68
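A pandas sketch contrasting the two partitioning schemes, using the price data from the worked binning example later in these slides:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=3)    # 3 intervals of equal width
equal_depth = pd.qcut(prices, q=3)      # 3 bins with ~equal numbers of samples
print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```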
Smoothing by Bin Means –
Numerical Example
• Smoothing by bin means is a data preprocessing
technique in which the values in a bin are replaced with
the mean (average) value of that bin. This technique
reduces noise and smooths the data while maintaining
the overall trend.
Steps in Smoothing by Bin Means:
1.Sort the data in ascending order.
2.Divide the data into bins (equal-width or equal-
frequency).
3.Compute the mean of each bin.
4. Replace each value in the bin with the mean of that bin.
Smoothing by Bin Means –
Numerical Example
Example
Given Data:
5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 1: Sort the Data
Already sorted: 5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 2: Divide into Bins
Let’s divide the data into 3 equal-width bins:
• Bin 1: (5, 18, 20)
• Bin 2: (25, 27, 30, 35)
• Bin 3: (40, 50, 60)
Step 3: Compute the Mean for Each Bin
• Bin 1 Mean: = ≈14.33
• Bin 2 Mean: = = 29.25
• Bin 3 Mean: = = 50
• Step 4: Replace Each Value with the Bin Mean 70
Smoothing by Bin Means –
Numerical Example

Final Smoothed Data:


• 14.33, 14.33, 14.33, 29.25, 29.25, 29.25, 29.25, 50, 50,
50
71
Smoothing by Bin Boundaries –
Numerical Example
Smoothing by bin boundaries is a data preprocessing technique
used in data mining. It involves replacing the values in each
bin with the closest boundary value, reducing noise and
variation while maintaining the overall distribution.
Steps in Smoothing by Bin Boundaries:
1.Sort the Data: Arrange the given data in ascending order.
2.Divide into Bins: Partition the data into equal-width or
equal-frequency bins.
3.Replace Values: Each value in a bin is replaced with the
closest boundary value (either the minimum or maximum
value of that bin).
72
Smoothing by Bin Boundaries –
Numerical Example
Example
Given Dataset:
5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 1: Sort the Data
Already sorted: 5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 2: Divide into Bins
Let’s divide the data into 3 equal-width bins:
• Bin 1: (5, 18, 20)
• Bin 2: (25, 27, 30, 35)
• Bin 3: (40, 50, 60)
Step 3: Apply Smoothing by Bin Boundaries
• Replace each value with the closest bin boundary.
73
Smoothing by Bin Boundaries –
Numerical Example

Final Smoothed Data:


• 5, 20, 20, 25, 25, 35, 35, 40, 60, 60
74
Simple Discretization: Binning
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
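A short sketch reproducing both smoothing variants on the price data above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # equi-depth, 4 values per bin

by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)     # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)    # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```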
75
Measuring Data Similarity and
Dissimilarity

76
Measuring Data Similarity and
Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
77
Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p dimensions
• Two modes

$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

• Dissimilarity matrix
• n data points, but registers only the distance
• A triangular matrix
• Single mode

$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
78
Proximity Measure for Nominal
Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
• Method 1: Simple matching
• m: # of matches, p: total # of variables

$d(i,j) = \frac{p - m}{p}$

• Method 2: Use a large number of binary attributes
• creating a new binary attribute for each of the M nominal states

79
Proximity Measure for Binary
Attributes
• A contingency table for binary data: for objects i and j, let q = number of attributes equal to 1 for both, r = number equal to 1 for i but 0 for j, s = number equal to 0 for i but 1 for j, and t = number equal to 0 for both.
• Distance measure for symmetric binary variables:

$d(i,j) = \frac{r + s}{q + r + s + t}$

• Distance measure for asymmetric binary variables:

$d(i,j) = \frac{r + s}{q + r + s}$

• Jaccard coefficient (similarity measure for asymmetric binary variables):

$sim_{Jaccard}(i,j) = \frac{q}{q + r + s}$

• Note: the Jaccard coefficient is the same as “coherence”
80
Dissimilarity between Binary
Variables
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
• Example
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

• Gender is a symmetric attribute


• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N 0

$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
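A tiny sketch computing d(Jack, Mary) from the table (Y/P mapped to 1, N to 0; the symmetric attribute Gender is excluded):

```python
jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1 .. Test-4
mary = [1, 0, 1, 0, 1, 0]

q = sum(a == 1 and b == 1 for a, b in zip(jack, mary))   # both 1
r = sum(a == 1 and b == 0 for a, b in zip(jack, mary))   # Jack 1, Mary 0
s = sum(a == 0 and b == 1 for a, b in zip(jack, mary))   # Jack 0, Mary 1

print(round((r + s) / (q + r + s), 2))   # 0.33
```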
Standardizing Numeric Data
• Z-score:

$z = \frac{x - \mu}{\sigma}$

• x: raw score to be standardized, μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of the standard deviation
• negative when the raw score is below the mean, “+” when above
• An alternative way: calculate the mean absolute deviation

$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

• standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
• Using mean absolute deviation is more robust than using standard deviation
Example: Data Matrix and
Dissimilarity Matrix
Data Matrix

point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Dissimilarity Matrix (with Euclidean Distance)

       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0
Distance on Numeric Data:
Minkowski Distance
• Minkowski distance: A popular distance measure

$d(i,j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
84
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are
different between two binary vectors

$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

• h = 2: Euclidean (L2 norm) distance

$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

• h → ∞: “supremum” (Lmax norm, L∞ norm) distance
• This is the maximum difference between any component
(attribute) of the vectors
Example: Data Matrix and Dissimilarity
Matrix
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)
       x1   x2   x3   x4
x1     0
x2     5    0
x3     3    6    0
x4     6    1    7    0

Euclidean (L2)
       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0

Supremum (L∞)
       x1   x2   x3   x4
x1     0
x2     3    0
x3     2    5    0
x4     3    1    5    0
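A scipy sketch reproducing the three distance matrices for the four points above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])   # x1 .. x4

for name, metric in [("Manhattan (L1)", "cityblock"),
                     ("Euclidean (L2)", "euclidean"),
                     ("Supremum (Linf)", "chebyshev")]:
    print(name)
    print(np.round(squareform(pdist(X, metric=metric)), 2))
```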
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th
object in the f-th variable by

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$

• compute the dissimilarity using methods for interval-scaled
variables
Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
• One may use a weighted formula to combine their effects:

$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

• f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• Treat $z_{if}$ as interval-scaled
88
Cosine Similarity
• A document can be represented by thousands of attributes,
each recording the frequency of a particular word (such as
keywords) or phrase in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d
89
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
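A small numpy sketch checking the same result:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94
```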
You are welcome

91
