
Data Mining:

Concepts and Techniques


(3rd ed.)

MODULE 2

1
Chapter 2: Data Preprocessing

 Data Preprocessing
 Types of data?
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Similarity and Dissimilarity measures.
2
What is Data?
 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
 Examples: eye color of a person, temperature, etc.
 Attribute is also known as variable, field, characteristic, or feature
 A collection of attributes describe an object
 Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of attributes ….
1) Nominal:
 items differentiated by a simple naming
system
 They may have numbers assigned to them, but these are not treated as
actual numbers; they simply serve as labels or references.
 They are ‘categorical’, i.e., they belong to a definable category.
 Ex: ID numbers, eye color, zip codes
Types of attributes…
2) Ordinal:
 They have some kind of order by their position on
the scale.
 order of items can be defined by assigning numbers
– relative position.
 letters can also be assigned.
 they are ‘categorical’
 cannot do arithmetic – only ordering property.
 Ex:
rankings (taste of potato chips on a scale from 1 –
10)
Grades in {A, B, C, D, E}
Height in {small, medium, large}
Types of attributes ….

3) Interval:
 Measured along a scale in which each position is equidistant from the next.
 Differences between pairs of values are meaningful and comparable.
 There is no true zero point, so values cannot be meaningfully multiplied or divided.
 Ex:
Calendar dates
Temperature in Celsius/Fahrenheit
Types of attributes….

4) Ratio
 Numbers can be compared as multiples of one another.
 One person can be twice as tall as another person.
 Zero is meaningful: there is a true zero point.
Ex:
 The difference between a person of age 35 and a person of age 38 is the
same as the difference between people who are 12 and 15 (35 to 38 = 3,
12 to 15 = 3).
 Ratio data can be multiplied and divided.
Types of attributes…
 Interval and ratio data measure quantities and
hence are quantitative.
 Ex: length, time, count
Types of attributes…
 Nominal (symbolic, categorical)
 Values from an unordered set

 Ex: {red, yellow, blue, ….}

 Ordinal :
 Values from an ordered set

 Ex: {good, better, best}

 Continuous : real numbers


 Ex: {-9.8, 3.9,…..}
Discrete and Continuous Attributes
Depending on the number of values :-
 Discrete Attribute

 Has only a finite or countably infinite set of values


 Examples: zip codes, counts, or the set of words in a collection of
documents
 Often represented as integer variables.
 Note: binary attributes are a special case of discrete attributes
 Continuous Attribute
 Has real numbers as attribute values
 Examples: temperature, height, or weight.
 Practically, real values can only be measured and represented
using a finite number of digits.
 Continuous attributes are typically represented as floating-point
variables.
Types of Attributes Summary
 There are different types of attributes
 Nominal

 Examples: ID numbers, eye color, zip codes


 Ordinal
 Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}
 Interval
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
 Ratio
 Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
 The type of an attribute depends on which of the
following properties it possesses:
 Distinctness: =, ≠
 Order: <, >
 Addition: +, −
 Multiplication: ×, ÷

 Nominal attribute: distinctness


 Ordinal attribute: distinctness & order
 Interval attribute: distinctness, order & addition
 Ratio attribute: all 4 properties
Types of Attributes Summary
Classify the following attributes as :-
- binary, discrete or continuous

- Qualitative(nominal or ordinal) or quantitative

(interval or ratio)
1) Age in years

Ans: discrete, quantitative, ratio


2) Brightness as measured by a light meter
Ans: continuous, quantitative, ratio
3) Bronze, silver and gold medals as awarded at
Olympics
Ans: Discrete, qualitative, ordinal.
Types of data sets
 Record
 Data Matrix
 Document Data
 Transaction Data
 Graph
 World Wide Web
 Molecular Structures
 Ordered
 Spatial Data
 Temporal Data
 Sequential Data
 Genetic Sequence Data
Record Data
 Data that consists of a collection of records, each of which consists of a
fixed set of attributes
 Data stored in flat files
 e.g., Excel or text/CSV files
 Or in an RDBMS

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
 If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
 Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
 Each document becomes a ‘term’ vector,
 each term is a component (attribute) of the vector,
 the value of each component is the number of times the
corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
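As an illustration (not part of the original slides), a document–term matrix like the one above can be built with scikit-learn's CountVectorizer; the two documents below are made-up examples.

```python
# Minimal sketch (assumes scikit-learn is installed); the documents are made-up examples.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "team play play score game game lost season",
    "coach ball score lost timeout",
]

vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())     # the terms (attributes)
print(term_matrix.toarray())                  # term counts per document
```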
Transaction Data

 A special type of record data, where


 each record (transaction) involves a set of items.

 For example, consider a grocery store. The set of

products purchased by a customer during one


shopping trip constitute a transaction, while the
individual products that were purchased are the
items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
 Examples: generic graphs and HTML links
• A graph is sometimes a more convenient and powerful representation of data.
• Can be used to capture relationships between data objects.
• Data objects themselves can be graphs.
• Ex: a set of linked web pages can be represented as a graph.
(Figure: a small generic graph with labeled nodes and edge weights.)
Chemical Data as a Graph
 Data with objects that are graphs
 Objects have sub-objects that have relationships
 Ex: structure of chemical compounds
Nodes – atoms
Links – chemical bonds
Benzene molecule: C6H6

Mining substructures
 Which substructures occur frequently in a chemical compound?
 Is the presence of one substructure associated with that of another?


Ordered Data

 Attributes have relationships that involve order in


time/space
 Extension of a record data
 Each record has a time associated with it
 Each attribute can also be given a time stamp.
Ordered Data
 Sequences of transactions (each element of the sequence is a set of items/events)
• Patterns?
• People who buy a DVD player tend to buy DVDs in the period
immediately following the purchase of the DVD player.
Ordered Data
 Genomic sequence data – sequences of individual entities
(letters/words), with no time stamp
 Ex: genetic information of animals/plants in the form of
sequences of genes/nucleotides.
• Human genetic code sequence, built from the 4 nucleotide bases A, T, G and C:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
• Mining on gene data – capture the structure and properties of genes
• Mining biological sequence data – BIOINFORMATICS
Ordered Data
 Spatio-temporal data
 Spatial attributes – position and area
 Ex: weather data
 Earth sciences data – temperature and pressure measured at points on a
latitude–longitude grid
(Figure: average monthly temperature of land and ocean.)
Time-series data
 A special type of sequential data
 Each record is a time-series i.e. a series of
measurements taken over time
 Ex: financial data set has objects which are the time
series of the daily prices of various stocks.
 Have temporal autocorrelation
 If two measurements are close in time, then their
values are often similar.
Chapter 2: Data Preprocessing

 Data Preprocessing
 Types of data?
 Data Preprocessing
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Similarity and Dissimilarity measures.
26
Data Quality: Why Preprocess the Data?

 Measures for data quality: A multidimensional view


 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update?
 Believability: how far the data can be trusted to be correct?
 Interpretability: how easily the data can be
understood?

27
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
28
Chapter 2: Data Preprocessing

 Data Preprocessing
 Types of data?
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Similarity and Dissimilarity measures.
29
Data Cleaning
 Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., instrument faulty, human or computer error, transmission error
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
30
Incomplete (Missing) Data

 Data is not always available


 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 history or changes of the data were not registered
 Missing data may need to be inferred
31
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the
same class: smarter
 the most probable value: inference-based such as
Bayesian formula or decision tree
32
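For illustration only (not part of the slides), here is a minimal pandas sketch of two of these strategies: filling with the overall attribute mean and with the class-conditional mean. The column names and values are made up.

```python
# Minimal sketch (assumes pandas is installed); column names and values are made-up examples.
import pandas as pd

df = pd.DataFrame({
    "income": [50.0, None, 70.0, None, 90.0],
    "class":  ["A", "A", "B", "B", "B"],
})

# Fill missing values with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of all samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```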
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning


 duplicate records

 incomplete data

 inconsistent data

33
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins

 then one can smooth by bin means, smooth by bin

median, smooth by bin boundaries, etc.


 Regression
 smooth by fitting the data into regression functions

 Clustering
 detect and remove outliers

 Combined computer and human inspection


 detect suspicious values and check by human (e.g.,

deal with possible outliers)

34
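As a sketch (not from the slides), equal-frequency binning and smoothing by bin means can be done with pandas; the values below are made-up examples.

```python
# Minimal sketch of equal-frequency binning and smoothing by bin means
# (assumes pandas is installed); the values are made-up examples.
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(values, q=3, labels=False)

# Smooth by bin means: replace each value with the mean of its bin
smoothed = values.groupby(bins).transform("mean")
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```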
Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)

 Check field overloading

 Check uniqueness rule, consecutive rule and null rule

 Use commercial tools

 Data scrubbing: use simple domain knowledge (e.g., postal

code, spell-check) to detect errors and make corrections


 Data auditing: by analyzing data to discover rules and

relationship to detect violators (e.g., correlation and clustering


to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified

 ETL (Extraction/Transformation/Loading) tools: allow users to

specify transformations through a graphical user interface


 Integration of the two processes
 Iterative and interactive (e.g., Potter's Wheel)

35
Chapter 3: Data Preprocessing

 Data Preprocessing
 Types of data?
 Data Preprocessing
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Similarity and Dissimilarity measures.
36
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales, e.g.,
metric vs. British units
37
Handling Redundancy in Data Integration

 Redundant data occur often when integration of multiple


databases
 Object identification: The same attribute or object
may have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
38
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test

    χ² = Σ (Observed − Expected)² / Expected
 The larger the Χ2 value, the more likely the variables are
related
 The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population

39
Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated based on the data distribution in the two categories)

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
 It shows that like_science_fiction and play_chess are
correlated in the group
40
 For this 2 × 2 table, the degrees of freedom are (2
- 1)(2 - 1) = 1.
 For 1 degree of freedom, the χ2 value needed to
reject the hypothesis at the 0.001 significance level
is 10.828.
 Since our computed value is above this, we can
reject the hypothesis and conclude that the two
attributes are (strongly) correlated for the given
group of people.
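As a quick check (not part of the slides), the same statistic can be computed with SciPy's chi2_contingency on the contingency table above.

```python
# Minimal sketch (assumes SciPy is installed): chi-square test for the table above.
from scipy.stats import chi2_contingency

observed = [[250, 200],    # like science fiction
            [50, 1000]]    # not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(dof)       # 1
print(expected)  # [[ 90. 360.] [210. 840.]]
```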

Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson’s product


moment coefficient)

i 1 (ai  A)(bi  B) 
n n
(ai bi )  n A B
rA, B   i 1
(n  1) A B (n  1) A B

where n is the number of tuples, A and B are the respective


means of A and B, σA and σB are the respective standard deviation
of A and B, and Σ(aibi) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A's values increase as B's do).
The higher the value, the stronger the correlation.
 rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
43
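A minimal numeric illustration (not from the slides): NumPy's corrcoef computes rA,B directly; the values below are made-up examples.

```python
# Minimal sketch (assumes NumPy is installed); the values are made-up examples.
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

r = np.corrcoef(a, b)[0, 1]   # Pearson correlation coefficient
print(r)                      # close to +1: strong positive correlation
```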
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.

44
Correlation (viewed as linear relationship)
 Correlation measures the linear relationship
between objects
 To compute correlation, we standardize data
objects, A and B, and then take their dot product

    a′k = (ak − mean(A)) / std(A)
    b′k = (bk − mean(B)) / std(B)
    correlation(A, B) = A′ · B′

45
Covariance (Numeric Data)
 Covariance is similar to correlation

    Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n)·Σᵢ (aᵢ − Ā)(bᵢ − B̄)

    Correlation coefficient:  r(A,B) = Cov(A, B) / (σA·σB)

where n is the number of tuples, Ā and B̄ are the respective means or
expected values of A and B, and σA and σB are the respective standard
deviations of A and B.
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger
than their expected values.
 Negative covariance: If CovA,B < 0 then if A is larger than its expected
value, B is likely to be smaller than its expected value.
 Independence: CovA,B = 0 but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence 46
Co-Variance: An Example
 It can be simplified in computation as

    Cov(A, B) = E(A·B) − Ā·B̄ = (Σᵢ aᵢbᵢ)/n − Ā·B̄
 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
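The same example can be checked numerically (illustration only; assumes NumPy is installed).

```python
# Minimal sketch (assumes NumPy is installed): verify Cov(A, B) for the stock example above.
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B

# Population covariance, as on the slide: E(A*B) - E(A)*E(B)
cov_ab = np.mean(a * b) - np.mean(a) * np.mean(b)
print(cov_ab)  # ~4.0 -> positive, so A and B tend to rise together
```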
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
48
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes

 Wavelet transforms

 Principal Components Analysis (PCA)

 Feature subset selection, feature creation

 Numerosity reduction (some simply call it: Data Reduction)

 Regression and Log-Linear Models

 Histograms, clustering, sampling

 Data cube aggregation

 Data compression

49
Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

50
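As an illustrative sketch (not from the slides), PCA-based dimensionality reduction can be done with scikit-learn; the data here are random made-up points.

```python
# Minimal sketch (assumes scikit-learn and NumPy are installed); the data are random made-up points.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 objects with 5 numeric attributes

pca = PCA(n_components=2)               # keep only the 2 strongest components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component
```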
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data
mining task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA

51
52
Attribute subset selection​..

1. Stepwise forward selection: The procedure starts with an empty set of


attributes as the reduced set. The best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of
attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The
stepwise forward selection and backward elimination methods can be combined so
that, at each step, the procedure selects the best attribute and removes the worst
from among the remaining attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart like
structure where each internal (nonleaf) node denotes a test on an attribute, each
branch corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction. At each node, the algorithm chooses the “best” attribute
to partition the data into individual classes.
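For illustration (not from the slides), scikit-learn's SequentialFeatureSelector (available in recent versions) performs stepwise forward or backward selection around any estimator; the dataset here is synthetic.

```python
# Minimal sketch (assumes a recent scikit-learn); the dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

# Stepwise forward selection: greedily add the attribute that helps most at each step
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=3, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes
```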
Data Reduction 2: Numerosity Reduction
 Reduce data volume by choosing alternative, smaller
forms of data representation
 Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model

parameters, store only the parameters, and discard


the data (except possible outliers)
 Ex.: Log-linear models—obtain value at a point in m-

D space as the product on appropriate marginal


subspaces
 Non-parametric methods
 Do not assume models

 Major families: histograms, clustering, sampling, …

54
Numerosity Reduction

 Linear regression
 Histogram
 Clustering
 Sampling

55
Parametric Data Reduction: Regression
and Log-Linear Models
 Linear regression
 Data modeled to fit a straight line

 Often uses the least-square method to fit the line

 Multiple regression
 Allows a response variable Y to be modeled as a

linear function of multidimensional feature vector


 Log-linear model
 Approximates discrete multidimensional probability

distributions

56
Regression Analysis
 Regression analysis: a collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more independent
variables (aka explanatory variables or predictors)
 The parameters are estimated so as to give a "best fit" of the data
 Most commonly the best fit is evaluated by using the least squares method,
but other criteria have also been used
 Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
(Figure: fitted line y = x + 1 through points (X1, Y1).)

57
Regress Analysis and Log-Linear Models
 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
 Using the least squares criterion to the known values of Y1, Y2, …, X1,
X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
 Useful for dimensionality reduction and data smoothing
58
Histogram Analysis
 Divide data into buckets and store the average (sum) for each bucket
 Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth)
(Figure: example equal-width histogram with buckets from 10,000 to 100,000.)
59
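As a sketch (not part of the slides), the two partitioning rules can be compared with pandas; the price values are made-up examples.

```python
# Minimal sketch (assumes pandas is installed); the prices are made-up examples.
import pandas as pd

prices = pd.Series([12000, 18000, 23000, 35000, 41000, 52000, 58000, 76000, 90000])

# Equal-width buckets: each bucket spans the same price range
equal_width = pd.cut(prices, bins=3)

# Equal-frequency (equal-depth) buckets: each bucket holds roughly the same number of values
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```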
Histogram



Data Cube Aggregation

 The lowest level of a data cube (base cuboid)


 The aggregated data for an individual entity of interest
 E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to
solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
61
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
63
Data Transformation
 Discretization
 Supervised

 Entropy – based
 Unsupervised
 Equal width and equal frequency
 Normalization
 Min-max

 Z-score

 Decimal scaling

 Binarization
Discretization/Quantization
 Three types of attributes:
 Nominal – values from an unordered set

 Ordinal – values from an ordered set

 Continuous – real numbers

 Discretization :
 Divide the range of a continuous attribute into

intervals
 Some classification algorithms only accept categorical attributes
 Reduce data size by discretization

 Prepare for further analysis


Transformation by Discretization
Some algorithms require nominal/discrete attributes
Discretization methods
 Unsupervised
 Independent of the class label

 Ex: Equal width binning, equal frequency

binning

 Supervised
 Dependent on the class label

 Ex: entropy based binning


Unsupervised Discretization
Equal Width binning…
 Advantages :
 Simple and easy to implement

 Produce a reasonable abstraction of data

 Disadvantages :
 Unsupervised

 If the values are heavily concentrated in one range (skewed data),
most bins will be nearly empty and the binning may be useless for the
data mining task.
Entropy Based Discretization- Supervised
 Uses the class info present in the data
 Entropy(info content) is calculated based on the
class label
 Tries to find the best split so that bins are as
pure as possible.
 Pure bin : majority of the values in a bin should
correspond to the same class.
 Purity of a bin is measured using its entropy
 Entropy
 Zero – perfectly pure bin

 Max (1) – impure – equal class distribution


Entropy Based Discretization – Supervised…
(Figure: example attribute values 10, 15, 7, 10, 6, 13, 7 with class labels
+, +, +, +, −, −, +, and the entropy computed for candidate intervals.)
Procedure…
1. Sort the attribute values to be discretized, S.
2. Bisect the values so that the resulting two intervals have minimum entropy:
   i. Consider each value T as a possible split point, where T is the midpoint
      of each pair of consecutive attribute values.
   ii. Compute the information gain for each candidate split point T:
       Gain = E(S) − E(T, S)
   iii. Select the T that gives the highest information gain as the optimum split.
3. Repeat step 2 on another interval (the one with the highest entropy) until a
   user-specified number of intervals is reached or some stopping criterion is met.
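A minimal sketch of a single split of this procedure (not from the slides; assumes NumPy is installed); the attribute values and class labels are made-up examples.

```python
# Minimal sketch of a single entropy-based split (assumes NumPy is installed);
# the attribute values and class labels are made-up examples.
import numpy as np

def entropy(labels):
    """Entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split(values, labels):
    """Return the midpoint split T with the highest information gain Gain = E(S) - E(T, S)."""
    order = np.argsort(values)
    values, labels = np.asarray(values, dtype=float)[order], np.asarray(labels)[order]
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(len(values) - 1):
        t = (values[i] + values[i + 1]) / 2.0              # candidate split point
        left, right = labels[values <= t], labels[values > t]
        e_split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - e_split
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

values = [3, 4, 7, 8, 12, 15]
labels = ["+", "+", "+", "-", "-", "-"]
print(best_split(values, labels))  # split at 7.5 separates the classes perfectly (gain = 1.0)
```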
Normalization
 Scale attribute values to fall within a small-
specified range.
 Min-max

 Z-score

 Decimal scaling
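A minimal sketch of all three methods (not from the slides; assumes NumPy is installed); the values are made-up examples.

```python
# Minimal sketch of min-max, z-score, and decimal-scaling normalization
# (assumes NumPy is installed); the values are made-up examples.
import numpy as np

x = np.array([120.0, 350.0, 400.0, 860.0])

# Min-max normalization to the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer with max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max)         # [0.     0.3108 0.3784 1.    ] (approx.)
print(decimal_scaled)  # [0.12 0.35 0.4  0.86]
```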
Similarity and Dissimilarity
• Important concepts – used in clustering, some classification methods, and
anomaly detection.
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different are two data objects
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Objects with a Single Attribute
p and q are the attribute values for two data objects.
(Table of similarity/dissimilarity definitions by attribute type not reproduced here.)

Dissimilarities between Data Objects with Multiple Numeric Attributes
• For ordinal attributes: the attribute ranks need to be normalized first.
• Refer to Example 2.21 on page 75 of the textbook.
Dissimilarities between Data Objects with Multiple Numeric Attributes
• Euclidean Distance

    dist(p, q) = sqrt( Σₖ (pₖ − qₖ)² ),  k = 1 … n

where n is the number of dimensions and pₖ and qₖ are, respectively, the kth
attributes of data objects p and q.
• Standardization is necessary if scales differ.
Euclidean Distance
(Figure: the four points plotted in the x–y plane.)

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
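As a quick check (not part of the slides), the distance matrix above can be reproduced with SciPy's cdist.

```python
# Minimal sketch (assumes SciPy and NumPy are installed): reproduce the distance matrix above.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

print(np.round(cdist(points, points, metric="euclidean"), 3))
# e.g. dist(p1, p2) = sqrt((0 - 2)^2 + (2 - 0)^2) = 2.828
```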
Manhattan (or city block) distance
d(i, j) = |xi1 −xj1| + |xi2 −xj2| + · · · + |xip −xjp|

Both the Euclidean and the Manhattan distance satisfy the


following mathematical properties:
•Non-negativity: d(i, j) ≥ 0: Distance is a non-negative number.
•Identity of indiscernibles: d(i, i) = 0: the distance of an object to itself is 0.
•Symmetry: d(i, j) = d( j, i): Distance is a symmetric function
•Triangle inequality: d(i, j) ≤ d(i, k)+d(k, j): Going directly from object i to
object j in space is no more than making a detour over any other object k

Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance. Given two objects p and q:

    dist(p, q) = ( Σₖ |pₖ − qₖ|^r )^(1/r),  k = 1 … n

where r is a parameter, n is the number of dimensions and pₖ and qₖ are,
respectively, the kth attributes of data objects p and q.
Minkowski Distance: Examples

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Manhattan Distance (r = 1)
L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

Euclidean Distance (r = 2)
L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
Common Properties of a Distance
• Distances, such as the Euclidean distance, have
some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness: distances are never negative.)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between data objects p and q.
• A distance that satisfies all these properties is a
metric
Common Properties of a Similarity

• Similarities, also have some well known


properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data


objects), p and q.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only
binary attributes
• Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attributes values


= (M11) / (M01 + M10 + M11)

Binary means there will be only 0’s & 1’s


SMC versus Jaccard: Example
p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)


M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
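The same counts can be computed directly (illustration only; assumes NumPy is installed).

```python
# Minimal sketch (assumes NumPy is installed): SMC and Jaccard for the vectors p and q above.
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = int(np.sum((p == 1) & (q == 1)))
m00 = int(np.sum((p == 0) & (q == 0)))
m10 = int(np.sum((p == 1) & (q == 0)))
m01 = int(np.sum((p == 0) & (q == 1)))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)
jaccard = m11 / (m01 + m10 + m11)

print(smc)      # 0.7
print(jaccard)  # 0.0
```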


SMC & JC
• SMC - Counts both presences and absences equally.
Used for objects with symmetric binary attributes.
• Can be used to find students who answered similarly
in a test – true/false questions
• JC is used to handle objects with asymmetric binary attributes.
• Ex: in a transaction database (TDB):
• The number of products not purchased is far greater than the number purchased.
• SMC would say all transactions are very similar.
• Use JC
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1  d2) / ||d1|| ||d2|| ,
where  indicates vector dot product and || d || is the length of vector d.

• Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1  d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449

cos( d1, d2 ) = .3150


Cos(x, y) = 0 indicates the vectors are completely dissimilar (orthogonal).
Cos(x, y) = 1 indicates the vectors point in the same direction; the nearer the value is
to 1, the more similar the documents are.
Here cos(d1, d2) ≈ 0.315, so the two documents are only weakly similar.
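A minimal numeric check of the example above (illustration only; assumes NumPy is installed).

```python
# Minimal sketch (assumes NumPy is installed): verify the cosine similarity above.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))  # ~0.315
```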
Extended Jaccard Coefficient (Tanimoto)

• Variation of JC
• Used for document data
• Reduces to Jaccard for binary attributes
Correlation
Pearson’s Correlation
• If correlation between two variables x and y is -1,
they are negatively correlated.
• If one increases, the other decreases and vice versa.
• If correlation between two variables x and y is
+1, they are positively correlated.
• Either both increase or both decrease.

98
Summary
 Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entity identification problem

 Remove redundancies

 Detect inconsistencies

 Data reduction
 Dimensionality reduction

 Numerosity reduction

 Data compression

 Data transformation and data discretization


 Normalization

 Concept hierarchy generation

 Similarity and Dissimilarity


99
References
 D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of
ACM, 42:73-78, 1999
 A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
 J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
 H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning:
Language, model, and algorithms. VLDB'01
 M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07
 H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), Dec. 1997
 H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining
Perspective. Kluwer Academic, 1998
 J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
 V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and
Transformation, VLDB’2001
 T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
 R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995
100
Sample Questions
 Data quality can be assessed in terms of several
issues, including accuracy, completeness, and
consistency. For each of the above three issues,
discuss how data quality assessment can depend
on the intended use of the data, giving examples.
Propose two other dimensions of data quality.
 In real-world data, tuples with missing values for
some attributes are a common occurrence.
Describe various methods for handling this
problem.

101
 The attribute age is given below in ascending order:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
 (a) Use smoothing by bin means to smooth these

data, using a bin depth of 3. Illustrate your steps.


Comment on the effect of this technique for the
given data.
 (b) How might you determine outliers in the data?

 (c) What other methods are there for data

smoothing?

102
 Discuss issues to consider during data
integration.
 What are the value ranges of the following
normalization methods?
 (a) min-max normalization

 (b) z-score normalization

 (c) z-score normalization using the mean

absolute deviation instead of standard


deviation
 (d) normalization by decimal scaling

103
 Use these methods to normalize the following
group of data: 200,300,400,600,1000
 (a) min-max normalization by setting min = 0

and max = 1
 (b) z-score normalization

 (c) z-score normalization using the mean

absolute deviation instead of standard


deviation
 (d) normalization by decimal scaling

104
 Using the data for age given on slide 102, answer the following:
 (a) Use min-max normalization to transform the
value 35 for age onto the range [0.0, 1.0].
 (b) Use z-score normalization to transform the value
35 for age, where the standard deviation of age is
12.94 years.
 (c) Use normalization by decimal scaling to
transform the value 35 for age.
 d) Comment on which method you would prefer to
use for the given data, giving reasons as to why.

105
 Suppose a group of 12 sales price records has
been sorted as follows:
5,10,11,13,15,35,50,55,72,92,204,215.
Partition them into three bins by each of the
following methods:
 (a) equal-frequency (equal-depth) partitioning

 (b) equal-width partitioning

106
 Briefly outline how to compute the dissimilarity
between objects described by the following:
 (a) Nominal attributes
 (b) Asymmetric binary attributes
 (c) Numeric attributes
 Given two objects represented by the tuples (22, 1, 42,
10) and (20, 0, 36, 8):
 (a) Compute the Euclidean distance between the two
objects.
 (b) Compute the Manhattan distance between the two
objects.
 (c) Compute the Minkowski distance between the two
objects, using q = 3.
107
 It is important to define or select similarity measures in data
analysis. However, there is no commonly accepted subjective
similarity measure. Results can vary depending on the
similarity measures used. Nonetheless, seemingly different
similarity measures may be equivalent after some
transformation. Suppose we have the following 2-D data set:

 (a) Consider the data as 2-D data points. Given a new data point, x
= (1.4,1.6) as a query, rank the database points based on similarity
with the query using Euclidean distance, Manhattan distance,
supremum distance, and cosine similarity.
 (b) Normalize the data set to make the norm of each data point equal to 1.
Use Euclidean distance on the transformed data to rank the data points.
108
