Data Mining: Data: Lecture Notes For Chapter 2
Tan,Steinbach, Kumar
4/18/2004
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, or feature
[Table: example data set in which rows are objects and columns are attributes, including Taxable Income and Cheat]
Attribute Values
Attribute values are numbers or symbols assigned to an attribute
Distinction between attributes and attribute values
Same attribute can be mapped to different attribute values
Measurement of Length
The way you measure an attribute may not match the attribute's properties.
[Figure: line segments A, B, C, and D with two different numeric labelings of their lengths]
Types of Attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio
Examples: temperature in Kelvin, length, time, counts
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Attribute Type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)

Attribute Type: Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)

Attribute Type: Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Nominal
Comments: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
Transformation: new_value = a * old_value
Comments: Length can be measured in meters or feet.
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Record
  Data Matrix
  Document Data
  Transaction Data
Graph
  World Wide Web
  Molecular Structures
Ordered
  Spatial Data
  Temporal Data
  Sequential Data
  Genetic Sequence Data
Curse of Dimensionality
Sparsity
Resolution
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes
[Table: records with Tid 1-10 and attributes Refund, Marital Status, Taxable Income, and Cheat]
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Projection of x load   Projection of y load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1
Document Data
[Table: document-term matrix with one column per term: timeout, season, coach, game, score, team, ball, lost, play, win]
Transaction Data
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
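The transaction table can be held in memory as one set of items per transaction. The sketch below is only an illustration: the names `transactions` and `item_counts` are hypothetical, and counting how many transactions contain each item is just one simple thing such a representation makes easy.

```python
from collections import Counter

# Transactions from the table: each TID maps to the set of items bought.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def item_counts(transactions):
    """Count how many transactions contain each item."""
    counts = Counter()
    for items in transactions.values():
        counts.update(items)
    return counts


counts = item_counts(transactions)
print(counts.most_common())
```

Using a set per transaction reflects that a transaction records which items occur, not how many attributes every record shares.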
Graph Data
Chemical Data
Ordered Data
Sequences of transactions
Items/Events
Ordered Data
Ordered Data
Spatio-Temporal Data
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Noise
Outliers
Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
Missing Values
Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one another
Major issue when merging data from heterogeneous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
Data reduction
  Reduce the number of attributes or objects
Change of scale
  Cities aggregated into regions, states, countries, etc.
Aggregated data tends to have less variability
Aggregation
Variation of Precipitation in Australia
Sampling
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Sampling
Types of Sampling
Stratified sampling
Split the data into several partitions; then draw random samples from each partition
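A minimal sketch of stratified sampling, assuming each record carries a label that defines its partition. The function name `stratified_sample`, the `key` callback, and the fixed per-partition sample size are illustrative choices, not something the slides prescribe.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` random records from each partition defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = min(per_stratum, len(members))
        # Sample without replacement inside each partition.
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 90 records in partition "a", 10 in partition "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], per_stratum=5)
print(sample)
```

Drawing the same number from every partition keeps the small partition "b" represented, which a simple random sample of 10 records might miss.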
Sample Size
8000 points
2000 Points
500 Points
Sample Size
What sample size is necessary to get at least one object from each of 10 groups?
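One way to explore this question is to compute the probability directly. The sketch below assumes the 10 groups are equally sized and that objects are drawn independently with replacement (both are assumptions, not stated on the slide), and uses inclusion-exclusion over the groups that could be missed. The name `prob_all_groups` is hypothetical.

```python
from math import comb


def prob_all_groups(sample_size, groups=10):
    """Probability that a sample hits every group at least once.

    Assumes equally likely groups and independent draws with replacement.
    """
    return sum(
        (-1) ** j * comb(groups, j) * (1 - j / groups) ** sample_size
        for j in range(groups + 1)
    )


for size in (10, 20, 40, 60, 100):
    print(size, round(prob_all_groups(size), 4))
```

Under these assumptions a sample of exactly 10 objects almost never covers all 10 groups, and the probability climbs toward 1 only as the sample grows well beyond the number of groups.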
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Randomly generate 500 points
Compute difference between max and min distance between any pair of points
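The experiment can be sketched as follows. Drawing the points uniformly from the unit hypercube, dividing the difference by the minimum distance, and taking log10 are assumptions made here so that values for different dimensions are comparable; `relative_contrast` is a hypothetical name.

```python
import math
import random


def relative_contrast(dim, n_points=500, seed=0):
    """log10((max_dist - min_dist) / min_dist) over all pairs of random points."""
    rng = random.Random(seed)
    # Assumption: points are uniform in the unit hypercube of the given dimension.
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    d_min, d_max = min(dists), max(dists)
    return math.log10((d_max - d_min) / d_min)


for dim in (2, 10, 50):
    print(dim, relative_contrast(dim, n_points=200))
```

The value should shrink as `dim` grows, which is the sense in which distances between points become less meaningful in high dimensions.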
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
Techniques
Principal Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
Goal is to find a projection that captures the largest amount of variation in data
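For two-dimensional data, the direction that captures the largest variation can be found in closed form from the sample covariance matrix. The sketch below is restricted to 2-D points as a simplifying assumption; `first_principal_direction` is a hypothetical name, and a general implementation would use an eigendecomposition instead.

```python
import math


def first_principal_direction(points):
    """Unit vector along the direction of largest variance for 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))


# Hypothetical points lying on the line x2 = x1.
e = first_principal_direction([(0, 0), (1, 1), (2, 2), (3, 3)])
print(e)
```

Projecting each centered point onto `e` gives a one-dimensional representation that keeps as much of the variance as any single direction can.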
[Figure: data plotted on axes x1 and x2 with projection direction e]
Construct a neighbourhood graph
For each pair of points in the graph, compute the shortest path distances (geodesic distances)
Irrelevant features
contain no information that is useful for the data mining task at hand
Example: students' ID is often irrelevant to the task of predicting students' GPA
Techniques:
Brute-force approach:
Try all possible feature subsets as input to the data mining algorithm
Embedded approaches:
Feature selection occurs naturally as part of the data mining algorithm
Filter approaches:
Wrapper approaches:
Use the data mining algorithm as a black box to find best subset of attributes
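A wrapper-style search can be sketched by treating the evaluation step as a black-box scoring function and, in the brute-force spirit, scoring every non-empty subset. Everything named here (`best_subset`, `toy_score`, the feature names) is hypothetical; in practice `score` would train and evaluate the data mining algorithm on the chosen attributes.

```python
from itertools import combinations


def best_subset(features, score):
    """Score every non-empty subset of `features` and return the best one."""
    best, best_score = None, float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


def toy_score(subset):
    """Stand-in for a model evaluation: rewards useful features, penalizes size."""
    useful = {"hours_studied", "attendance"}
    return len(useful & set(subset)) - 0.1 * len(subset)


features = ["student_id", "hours_studied", "attendance"]
best, best_score = best_subset(features, toy_score)
print(best, best_score)
```

Exhaustive enumeration visits 2^n - 1 subsets, so it is only workable for a small number of features.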
Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
Three general methodologies:
Feature Extraction
domain-specific
combining features
Frequency
[Figure panels: Data, Equal frequency, K-means]
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
Simple functions: x^k, log(x), e^x, |x|
Standardization and Normalization
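Standardization and normalization can be sketched as below. The slide does not define them, so this uses two common conventions as assumptions: a z-score with the population standard deviation, and min-max scaling to [0, 1]. The function names are hypothetical.

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # raises ZeroDivisionError below if all values are equal
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
scaled = min_max_normalize([10, 20, 30])
print(z)
print(scaled)
```

Both transformations map each old value to exactly one new value, so the original values can still be identified after the change.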
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Euclidean Distance
Euclidean Distance
\[ dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \]
Where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Euclidean Distance
[Figure: scatter plot of points p1, p2, p3, and p4]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix
      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
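The distance matrix for the four points can be reproduced with a few lines of code. This is a small sketch rather than a reference implementation; `euclidean` and `matrix` are illustrative names.

```python
import math

# Coordinates of the four points from the example.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def euclidean(p, q):
    """Square root of the summed squared differences over all attributes."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


names = sorted(points)
matrix = {
    a: {b: round(euclidean(points[a], points[b]), 3) for b in names}
    for a in names
}
for a in names:
    print(a, [matrix[a][b] for b in names])
```

The matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal need to be computed in practice.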
Minkowski Distance
\[ dist = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} \]
Where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1    p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

L2    p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0

L∞    p1   p2   p3   p4
p1    0    2    3    5
p2    2    0    1    3
p3    3    1    0    2
p4    5    3    2    0

Distance Matrix
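A short sketch of the Minkowski distance with the parameter r made explicit. Treating r = infinity as the maximum absolute difference (the limiting case of the formula) is handled as a special branch; the function name `minkowski` is illustrative.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = math.inf gives the max difference."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))         # L1 distance
print(minkowski(p1, p4, 2))         # L2 (Euclidean) distance
print(minkowski(p1, p4, math.inf))  # L-infinity distance
```

Note that r selects the kind of distance, while the number of dimensions n is simply the length of the two input tuples.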
Mahalanobis Distance
\[ \mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^T \]
\(\Sigma\) is the covariance matrix of the input data X
\[ \Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k) \]
Mahalanobis Distance
Covariance Matrix:
[Figure: points A, B, and C]
Mahal(A,B) = 5
Mahal(A,C) = 4
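The formula can be checked numerically for the two-dimensional case. The covariance matrix and the coordinates of A, B, and C used below are illustrative assumptions, since they are not given in the text here; they were chosen so that the results agree with the stated values Mahal(A,B) = 5 and Mahal(A,C) = 4. The function `mahalanobis_2d` is a hypothetical helper that follows the formula as written, without a square root.

```python
def mahalanobis_2d(p, q, cov):
    """(p - q) * inverse(cov) * (p - q)^T for 2-D points and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    return (
        dx * (inv[0][0] * dx + inv[0][1] * dy)
        + dy * (inv[1][0] * dx + inv[1][1] * dy)
    )


# Assumed values for illustration only.
cov = ((0.3, 0.2), (0.2, 0.3))
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahalanobis_2d(A, B, cov))
print(mahalanobis_2d(A, C, cov))
```

With these assumed values, C is farther from A than B is in Euclidean terms, yet B has the larger Mahalanobis value because the step from A to B runs against the direction in which the data varies most.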
Distances, such as the Euclidean distance, have some well known properties.
1. d(p, q) ≥ 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
Common situation is that objects, p and q, have only binary attributes
Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
J = number of 11 matches / number of not-both-zero attribute values = (M11) / (M01 + M10 + M11)
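The Jaccard coefficient J can be computed directly from the match counts. The sketch below uses made-up binary vectors, and returning 0.0 when both vectors are all zeros is a convention chosen here, not something the formula defines.

```python
def jaccard(p, q):
    """J = M11 / (M01 + M10 + M11) for two equal-length binary vectors."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    denom = m01 + m10 + m11
    # Assumption: define J as 0.0 when no attribute is 1 in either vector.
    return m11 / denom if denom else 0.0


# Hypothetical binary vectors.
p = [1, 0, 0, 1, 0]
q = [1, 1, 0, 0, 0]
print(jaccard(p, q))
```

M00 is deliberately left out of the computation, so attributes that are absent from both objects do not inflate the similarity.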
Cosine Similarity
If d1 and d2 are two document vectors, then cos( d1, d2 ) = (d1 • d2) / (||d1|| ||d2||),
where • indicates vector dot product and || d || is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 5 / (6.481 * 2.449) = 0.3150
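The worked example can be verified with a short function. This is a minimal sketch; `cosine_similarity` is an illustrative name and the code assumes neither vector is all zeros.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
cos = cosine_similarity(d1, d2)
print(round(cos, 4))
```

Because only the angle between the vectors matters, scaling a document vector (for example, doubling every term count) leaves the similarity unchanged.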
Correlation
Correlation measures the linear relationship between objects
To compute correlation, we standardize data objects, p and q, and then take their dot product
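The standardize-then-dot-product recipe can be sketched as follows. The scaling is an assumption made here: values are standardized with the population standard deviation and the dot product is divided by the number of attributes n, which yields a result in [-1, 1]. The name `correlation` is illustrative.

```python
import statistics


def correlation(p, q):
    """Standardize both objects, then take their dot product scaled by 1/n."""
    n = len(p)

    def standardize(v):
        mean = statistics.fmean(v)
        std = statistics.pstdev(v)  # assumes the values are not all equal
        return [(x - mean) / std for x in v]

    ps, qs = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(ps, qs)) / n


print(correlation([1, 2, 3], [2, 4, 6]))
print(correlation([1, 2, 3], [6, 4, 2]))
```

A perfect increasing linear relationship gives a value close to 1 and a perfect decreasing one gives a value close to -1.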
Sometimes attributes are of many different types, but an overall similarity is needed.
Density
Simplest approach is to divide region into a number of rectangular cells of equal volume and define density as # of points the cell contains
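A minimal sketch of this cell-based density for two-dimensional points. It assumes square cells of side `cell_size` aligned to the origin, which is one special case of equal-volume rectangular cells; `grid_density` is a hypothetical name.

```python
from collections import Counter


def grid_density(points, cell_size):
    """Map each cell index (column, row) to the number of points it contains."""
    return Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )


# Hypothetical points in the plane.
points = [(0.1, 0.2), (0.4, 0.9), (1.5, 0.5), (1.7, 0.1), (1.2, 0.8)]
density = grid_density(points, cell_size=1.0)
print(density)
```

Cells that contain no points simply do not appear in the result, which keeps the representation compact for sparse data.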
Euclidean density is the number of points within a specified radius of the point
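This center-based definition translates into a simple count. In the sketch below, counting points at exactly the radius and counting the center itself when it belongs to the data set are assumptions; `center_density` is a hypothetical name.

```python
import math


def center_density(points, center, radius):
    """Number of points whose Euclidean distance to `center` is at most `radius`."""
    return sum(1 for p in points if math.dist(p, center) <= radius)


# Hypothetical points in the plane.
points = [(0, 0), (1, 0), (0, 2), (3, 3)]
print(center_density(points, center=(0, 0), radius=1.5))
```

The result depends strongly on the chosen radius: a very small radius gives every point a density of 1, while a very large one gives every point the size of the data set.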