Modified Module 2-DM
MODULE 2
Chapter 2: Data Preprocessing
Data Preprocessing
Types of data?
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Similarity and Dissimilarity measures.
What is Data?
Data is a collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, or feature.
A collection of attributes describes an object.
An object is also known as a record, point, case, sample, entity, or instance.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of attributes…
1) Nominal:
Items are differentiated by a simple naming system.
They may have numbers assigned to them, but these are not actual numbers; they simply serve as labels for reference.
They are 'categorical', i.e., they belong to a definable category.
Ex: ID numbers, eye color, zip codes
Types of attributes…
2) Ordinal:
Values have some kind of order given by their position on a scale.
The order of items can be defined by assigning numbers indicating relative position; letters can also be assigned.
They are 'categorical'.
Arithmetic is not meaningful – only the ordering property holds.
Ex:
Rankings (taste of potato chips on a scale from 1–10)
Grades in {A, B, C, D, E}
Height in {small, medium, large}
Types of attributes…
3) Interval:
Measured along a scale in which each position is equidistant from the next; differences are meaningful, but there is no true zero point.
Ex:
Calendar dates
Temperature in Celsius/Fahrenheit
Types of attributes…
4) Ratio:
Numbers can be compared as multiples of one another:
One person can be twice as tall as another person.
The number zero is meaningful – the scale has a true zero point.
Ex:
The difference between a person of age 35 and a person of age 38 is the same as the difference between people aged 12 and 15 (35 to 38 = 3, 12 to 15 = 3), and a 30-year-old is twice the age of a 15-year-old.
Ratio data can be multiplied and divided.
Types of attributes…
Interval and ratio data measure quantities and
hence are quantitative.
Ex: length, time, count
Types of attributes…
Nominal (symbolic, categorical): values from an unordered set
Ordinal: values from an ordered set
Continuous (interval or ratio): e.g., age in years
If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
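A minimal sketch of this view, assuming NumPy and a made-up set of records with attributes (age, income, years employed):

```python
import numpy as np

# Four data objects sharing the same fixed numeric attributes
# (age, income, years_employed), viewed as points in 3-D space.
records = np.array([
    [25, 50000, 2],
    [40, 80000, 15],
    [33, 62000, 8],
    [25, 51000, 3],
])

print(records.shape)  # (4, 3): four points in a 3-dimensional attribute space
```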
Document Data
Each document becomes a 'term' vector: each term is an attribute, and the value of the attribute is the number of times the term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
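A small sketch of how such term vectors can be built; the term_vector helper and the sample text are illustrative, not part of the original slides:

```python
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab):
    # Count how often each vocabulary term occurs in the text.
    words = text.lower().split()
    return [words.count(term) for term in vocab]

print(term_vector("team play game game", vocabulary))
# [1, 0, 1, 0, 0, 2, 0, 0, 0, 0]
```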
Transaction Data

Graph Data
Nodes – atoms
Links – chemical bonds
Ex: Benzene molecule, C6H6
Mining Substructures
Which substructures occur frequently in a chemical compound?
Is the presence of one substructure associated with the presence of others? Patterns?
Spatio-Temporal Data
Has spatial attributes – position and area.
Ex: weather data; Earth sciences data – temperature and pressure measured at points on a latitude–longitude grid.
[Figure: Average Monthly Temperature of land and ocean]
Time-series data
A special type of sequential data: each record is a time series, i.e., a series of measurements taken over time.
Ex: a financial data set whose objects are the time series of the daily prices of various stocks.
Time series have temporal autocorrelation: if two measurements are close in time, their values are often similar.
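A short sketch of temporal autocorrelation, assuming NumPy and an invented price series; lag-1 autocorrelation is the Pearson correlation between the series and itself shifted by one step:

```python
import numpy as np

prices = np.array([100.0, 101.5, 101.0, 102.3, 103.1, 102.8, 104.0])

def lag1_autocorr(x):
    # Correlation between consecutive measurements: x[t] vs. x[t+1].
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(round(lag1_autocorr(prices), 3))  # close to 1 for a smooth series
```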
Data Quality: Why Preprocess the Data?
Measures of data quality include accuracy, completeness, consistency, timeliness, believability, and interpretability; real-world data rarely satisfies all of these without preprocessing.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
Data in the real world is dirty: there is lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission errors.
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., Occupation=" " (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary="−10" (an error)
inconsistent: containing discrepancies in codes or names
e.g., Age="42" vs. Birthday="03/07/2010"
rating was "1, 2, 3", now rating is "A, B, C"
discrepancies between duplicate records
intentional (e.g., disguised missing data)
e.g., Jan. 1 recorded as everyone's birthday
Incomplete (Missing) Data
Data is not always available. Missing values may arise from technology limitations, incomplete data collection, or inconsistencies that caused data to be discarded.
How to Handle Noisy Data?
Binning
first sort the data and partition it into (equal-frequency) bins
then smooth by bin means, bin medians, or bin boundaries (see the sketch below)
Clustering
detect and remove outliers
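A minimal sketch of smoothing by equal-frequency bin means; the function name is our own, and the price values are the classic textbook illustration:

```python
def smooth_by_bin_means(values, depth):
    # Sort, partition into bins of `depth` values, then replace
    # each value by the mean of its bin.
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_vals = data[i:i + depth]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```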
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration
Redundant data occur often when multiple databases are integrated. Redundant attributes may be detected by correlation analysis: the χ² (chi-square) test for nominal data, and the correlation coefficient or covariance for numeric data.
Correlation Analysis (Numeric Data)
The correlation between attributes A and B is measured by the (Pearson) correlation coefficient:

r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A\,\sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means, and \sigma_A and \sigma_B are the respective standard deviations of A and B.
[Figure: scatter plots showing correlations ranging from –1 to 1.]
Correlation (viewed as linear relationship)
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, A and B, and then take their dot product
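A sketch of this statement, assuming NumPy and reusing the stock prices from the covariance example below; with the sample (n−1) convention, the standardized dot product matches Pearson's r:

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize: subtract the mean, divide by the (sample) standard deviation.
A_std = (A - A.mean()) / A.std(ddof=1)
B_std = (B - B.mean()) / B.std(ddof=1)

r = np.dot(A_std, B_std) / (len(A) - 1)
print(round(r, 4))                        # 0.9407
print(round(np.corrcoef(A, B)[0, 1], 4))  # 0.9407, same value from NumPy
```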
Covariance (Numeric Data)
Covariance is similar to correlation:

Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n} (a_i-\bar{A})(b_i-\bar{B})}{n} = E(A \cdot B) - \bar{A}\,\bar{B}

Correlation coefficient: r_{A,B} = \frac{Cov(A,B)}{\sigma_A\,\sigma_B}

Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together since Cov(A, B) > 0.
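The worked example can be checked with NumPy (a sketch; ddof=0 selects the population covariance E(A·B) − Ā·B̄ used above):

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov = np.mean(A * B) - A.mean() * B.mean()
print(cov)                         # 4.0
print(np.cov(A, B, ddof=0)[0, 1])  # 4.0, same value
```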
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation:
Linear regression
Histogram
Clustering
Sampling
Parametric Data Reduction: Regression and Log-Linear Models
Linear regression
Data modeled to fit a straight line
Multiple regression
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model
Approximates discrete multidimensional probability distributions
Regression Analysis
Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka explanatory variables or predictors).
The parameters are estimated so as to give a "best fit" of the data.
Most commonly the best fit is evaluated using the least squares method, but other criteria have also been used.
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.
[Figure: data points fitted by the line y = x + 1; Y1 is an observed value and Y1' its prediction at X1.]
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are estimated from the data at hand by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
Useful for dimensionality reduction and data smoothing
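A minimal sketch of least-squares estimation of w and b for Y = wX + b, using the closed-form solution for simple linear regression (the data points are made up):

```python
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least squares: w = S_xy / S_xx, b = mean_y - w * mean_x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    return w, mean_y - w * mean_x

w, b = fit_line([1, 2, 3, 4], [2.1, 2.9, 4.2, 4.8])
print(w, b)  # w = 0.94, b = 1.15
```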
Histogram Analysis
Divide data into buckets and store the average (or sum) for each bucket.
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): each bucket holds roughly the same number of samples
[Figure: histogram of a price attribute, with buckets from 10,000 to 100,000 and counts from 0 to 40.]
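A sketch of the two partitioning rules (bucket boundaries and bucket contents only; the data values are illustrative):

```python
def equal_width_edges(values, k):
    # k buckets of equal range between min and max.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_frequency_bins(values, k):
    # k buckets holding (roughly) the same number of sorted values.
    data = sorted(values)
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_edges(prices, 3))     # [4.0, 14.0, 24.0, 34.0]
print(equal_frequency_bins(prices, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```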
Discretization and Normalization: Overview
Discretization/Quantization:
Supervised: entropy-based
Unsupervised: equal-width and equal-frequency (histogram) binning
Normalization:
Min-max
Z-score
Decimal scaling
Binarization
Three types of attributes:
Nominal – values from an unordered set
Ordinal – values from an ordered set
Continuous – real numbers
Discretization:
Divide the range of a continuous attribute into intervals.
Some classification algorithms only accept categorical attributes.
Discretization also reduces data size.
Approaches:
Binning – unsupervised, does not depend on the class labels.
Entropy-based – supervised, dependent on the class label.
Example – attribute values with class labels:
Value: 10  15   7  10   6  13   7
Class:  +   +   +   +   −   −   +
Entropy-Based Discretization: Procedure
1. Sort the attribute values to be discretized, S.
2. Bisect the values so that the resulting two intervals have minimum entropy (see the sketch below):
   i. Consider each value T as a possible split point, where T is the midpoint of each pair of consecutive attribute values.
   ii. Compute the information gain before and after choosing T as the split point: Gain = E(S) − E(T, S).
   iii. Select the T that gives the highest information gain as the optimal split.
3. Repeat step 2 on the interval with the highest entropy until a user-specified number of intervals is reached or some other stopping criterion is met.
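A sketch of one bisection step (step 2), assuming values paired with class labels as in the small example above; best_split and its helper are our own names:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(pairs):
    # Try the midpoint of each pair of consecutive sorted values as T
    # and keep the split with the highest information gain.
    data = sorted(pairs)          # sort by attribute value
    labels = [c for _, c in data]
    base = entropy(labels)        # E(S)
    best_t, best_gain = None, -1.0
    for i in range(1, len(data)):
        if data[i - 1][0] == data[i][0]:
            continue              # no midpoint between equal values
        t = (data[i - 1][0] + data[i][0]) / 2
        left, right = labels[:i], labels[i:]
        e_ts = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
        if base - e_ts > best_gain:
            best_t, best_gain = t, base - e_ts
    return best_t, best_gain

pairs = [(10, '+'), (15, '+'), (7, '+'), (10, '+'), (6, '-'), (13, '-'), (7, '+')]
print(best_split(pairs))  # splits at T = 6.5, isolating the '-' value 6
```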
Normalization
Scale attribute values to fall within a small, specified range.
Min-max: v' = (v − min)/(max − min) × (new_max − new_min) + new_min
Z-score: v' = (v − mean)/σ
Decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
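A minimal sketch of the three formulas, applied to the age exercise later in this deck (value 35, min 13, max 70, std 12.94 as quoted there; the mean 29.96 is computed from that data):

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1
    return v / (10 ** j)

print(round(min_max(35, 13, 70), 3))        # 0.386
print(round(z_score(35, 29.96, 12.94), 3))  # 0.389
print(decimal_scaling(35, 2))               # 0.35 (max age 70 needs j = 2)
```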
Similarity and Dissimilarity
• Important in clustering, some classification, and anomaly detection.
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different are two data objects
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Objects with a Single Attribute
p and q are the attribute values for two data objects.
Nominal: d = 0 if p = q, else 1; s = 1 if p = q, else 0
Ordinal: d = |p − q| / (n − 1), with values mapped to ranks 0 … n − 1; s = 1 − d
Interval/Ratio: d = |p − q|; s = 1 / (1 + d) is one common choice
Dissimilarities between Data Objects with Multiple Attributes
• For ordinal attributes: the attribute ranks need to be normalized.
• Refer to Example 2.21 on page 75 of the textbook.
Dissimilarities between Data Objects with Multiple Numeric Attributes
• Euclidean Distance

dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions and p_k and q_k are, respectively, the kth attributes of data objects p and q.
Euclidean Distance: Example

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

[Figure: the four points plotted in the x–y plane.]
Manhattan (or city block) distance
d(i, j) = |x_{i1} − x_{j1}| + |x_{i2} − x_{j2}| + · · · + |x_{ip} − x_{jp}|
Minkowski Distance
• Minkowski distance is a generalization of Euclidean distance. Given two objects p and q:

dist = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions, and p_k and q_k are, respectively, the kth attributes of data objects p and q.
Minkowski Distance: Examples

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

r = 1: Manhattan (L1) distance
L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

r = 2: Euclidean (L2) distance
L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
Distance Matrix
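A short sketch computing the L1 and L2 matrices above from the four points (the function name is our own):

```python
def minkowski(p, q, r):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1, p2, p3, p4

for r in (1, 2):  # r = 1: Manhattan (L1); r = 2: Euclidean (L2)
    for p in points:
        print([round(minkowski(p, q, r), 3) for q in points])
```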
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness); distances are never negative.
2. d(p, q) = d(q, p) for all p and q (symmetry).
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality).
where d(p, q) is the distance (dissimilarity) between data objects p and q.
• A distance that satisfies all these properties is a metric.
Cosine Similarity
If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||).
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
||d1|| = (3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.315
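The computation above can be verified with a few lines (a sketch; vectors copied from the example):

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))   # 5
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(6)  ≈ 2.449
print(round(dot / (norm1 * norm2), 4))     # cos(d1, d2) ≈ 0.315
```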
Extended Jaccard (Tanimoto) coefficient
• A variation of the Jaccard coefficient (JC)
• Used for document data
• Reduces to the Jaccard coefficient for binary attributes
Correlation (Pearson's Correlation)
• If correlation between two variables x and y is -1,
they are negatively correlated.
• If one increases, the other decreases and vice versa.
• If correlation between two variables x and y is
+1, they are positively correlated.
• Either both increase or both decrease.
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
The attribute age is given below in ascending order:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
(b) What other methods are there for data smoothing?
Discuss the issues to consider during data integration.
What are the value ranges of the following normalization methods?
(a) min-max normalization
(b) z-score normalization
(c) normalization by decimal scaling
Use these methods to normalize the following group of data: 200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
Using the data for age given in the smoothing exercise above, answer the following:
(a) Use min-max normalization to transform the
value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value
35 for age, where the standard deviation of age is
12.94 years.
(c) Use normalization by decimal scaling to
transform the value 35 for age.
d) Comment on which method you would prefer to
use for the given data, giving reasons as to why.
Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods:
(a) equal-frequency (equal-depth) partitioning
(b) equal-width partitioning
(c) clustering
Briefly outline how to compute the dissimilarity
between objects described by the following:
(a) Nominal attributes
(b) Asymmetric binary attributes
(c) Numeric attributes
Given two objects represented by the tuples (22, 1, 42,
10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two
objects.
(b) Compute the Manhattan distance between the two
objects.
(c) Compute the Minkowski distance between the two
objects, using q = 3.
It is important to define or select similarity measures in data analysis. However, there is no commonly accepted subjective similarity measure; results can vary depending on the similarity measures used. Nonetheless, seemingly different similarity measures may be equivalent after some transformation. Suppose we have the following 2-D data set:
(a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6), as a query, rank the database points based on similarity with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the transformed data to rank the data points.