Data Mining Chapter 2 Data Preprocessing
Data Preprocessing
What is an attribute?
• An attribute is a property or characteristic of an object. Examples:
eye color of a person, temperature, etc.
• An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object. An object is also known
as a record, point, case, sample, entity, or instance.
• Attribute values are numbers or symbols assigned to an attribute.
• The same attribute can be mapped to different attribute values. Example:
height can be measured in feet or meters.
• Different attributes can be mapped to the same set of values.
Example: attribute values for ID and age are both integers, but the
properties of the values differ: ID has no limit, while age has a
minimum and maximum value.
Types of Attributes (Approach 1)
• Nominal
– The values of a nominal attribute are just different names, i.e., nominal
attributes provide only enough information to distinguish one object
from another. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color
• Ordinal
– The values of an ordinal attribute provide enough information to order
objects. (<, >)
– Examples: hardness of minerals, {good, better, best}, grades, street numbers
• Interval
– For interval attributes, the differences between values are meaningful,
i.e., a unit of measurement exists. (+, −)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
• Ratio
– For ratio attributes, both differences and ratios are meaningful. (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass,
length, electrical current
Types of Attribute (Approach 2)
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Types of Attribute (Approach 3)
• Character
– Values are represented as a character or a set of characters (a string).
• Number
– Values are represented as numbers, which may be whole numbers or
decimal numbers.
Types of data sets
Record
• Data that consists of a collection of records, each of which consists
of a fixed set of attributes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m-by-n matrix, where there
are m rows, one for each object, and n columns, one for each
attribute
Projection of x load | Projection of y load | Distance | Load | Thickness
10.23                | 5.27                 | 15.22    | 2.7  | 1.2

Document-term matrix (each row is a document, each column a term count):

           team  coach  play  ball  score  game  win  lost  timeout  season
Document 1   3     0     5     0     2      6     0    2      0        2
Document 2   0     7     0     2     1      0     0    3      0        0
Document 3   0     1     0     0     1      2     2    0      3        0
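The two tables above can be held in ordinary nested lists; a minimal sketch, with the document-term values taken from the example above:

```python
# Each row of a data matrix is one object, each column one numeric attribute:
# [projection of x load, projection of y load, distance, load, thickness]
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
]

# A document-term matrix is the same idea: rows are documents,
# columns are term frequencies.
terms = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]
doc_term = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],  # Document 1
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],  # Document 2
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],  # Document 3
]

# Look up how often "game" occurs in Document 1:
print(doc_term[0][terms.index("game")])  # 6
```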
Types of data sets
Transaction Data
• A special type of record data, where each record (transaction) involves a set of
items.
• For example, consider a grocery store. The set of products purchased by a
customer during one shopping trip constitutes a transaction, while the individual
products that were purchased are the items
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
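The transaction table above can be sketched as sets of items keyed by TID; counting how many transactions contain each item (its support count) is a common first step in analyzing such data:

```python
from collections import Counter

# The TID -> items table above, each transaction a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count how many transactions contain each item (the support count).
support = Counter(item for items in transactions.values() for item in items)
print(support["Milk"])  # 4 -- Milk appears in 4 of the 5 transactions
```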
Types of data sets
Graph
• Consists of nodes (vertices) connected by edges
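A minimal sketch of graph data as an adjacency list, with nodes mapped to their neighbors (toy graph invented for illustration):

```python
# Undirected toy graph: each node maps to the nodes it shares an edge with.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B"],
}

# The degree of a node is the number of edges touching it:
print(len(graph["A"]))  # 2
```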
Types of data sets
• Ordered
– Consists of sequences of transactions.
• Spatial Data
– Spatial data, also known as geospatial data, is information
about a physical object that can be represented by
numerical values in a geographic coordinate system.
• Temporal Data
– Temporal data describes the evolution of an object
characteristic over a period of time, e.g., d = f(t).
• Sequential Data
– Data arranged in sequence.
Important Characteristics of Structured Data
Dimensionality
• A Data Dimension is a set of data attributes pertaining to something of
interest to a business. Dimensions are things like "customers", "products",
"stores" and "time".
– Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it
occupies.
• Definitions of density and distance between points, which is critical for clustering
and outlier detection, become less meaningful
– Purpose of dimensionality reduction:
• Avoid the curse of dimensionality
• Reduce the amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
– Techniques:
• Principal Component Analysis
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Dimensionality Reduction:
PCA
– Goal is to find a projection that captures the largest amount of variation in the data.
– Find the eigenvectors of the covariance matrix.
– The eigenvectors define the new space.
ISOMAP (a non-linear technique)
– Construct a neighborhood graph.
– For each pair of points in the graph, compute the shortest-path distances (geodesic
distances).
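The PCA steps above can be sketched with NumPy (the toy data and the use of `numpy` are assumptions for illustration, not part of the original slides):

```python
import numpy as np

# Toy 2-D data invented for illustration.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                 # center each attribute
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]       # sort components by explained variance
components = eigvecs[:, order]          # eigenvectors define the new space

# Project onto the first principal component (largest variation):
X_reduced = Xc @ components[:, :1]
print(X_reduced.shape)  # (6, 1)
```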
Feature Subset Selection
– Another way to reduce the dimensionality of data: remove redundant and
irrelevant features.
• Redundant features
– Duplicate much or all of the information contained in one or more other attributes.
– Example: the purchase price of a product and the amount of sales tax paid.
• Irrelevant features
– Contain no information that is useful for the data mining task at hand.
– Example: students' IDs are often irrelevant to the task of predicting students' GPA.
Feature Subset Selection:
Techniques:
– Brute-force approach:
• Try all possible feature subsets as input to the data mining algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm
– Filter approaches:
• Features are selected before the data mining algorithm is run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find the best subset of attributes
Feature Creation
– Create new attributes that can capture the important information in a data set much more
efficiently than the original attributes.
– Three general methodologies:
• Feature Extraction: domain-specific
• Mapping Data to New Space
• Feature Construction: combining features
Sparsity and Density
• Sparsity and density are terms used to describe the percentage
of cells in a database table that are not populated and
populated, respectively. The sparsity and density percentages
sum to 100%.
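A minimal sketch of the sparsity/density computation, assuming an unpopulated cell is represented as `None`:

```python
# Toy table invented for illustration: 3 of the 6 cells are populated.
table = [
    [1, None, 3],
    [None, None, 6],
]

cells = [c for row in table for c in row]
density = 100 * sum(c is not None for c in cells) / len(cells)
sparsity = 100 - density
print(sparsity, density)  # 50.0 50.0 -- the two always sum to 100
```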
Outlier Detection Approaches
• Type 1:
– Determine the outliers with no prior knowledge of the data. This is a
learning approach analogous to unsupervised learning.
• Type 2:
– Model both normality and abnormality. Analogous to supervised
learning.
• Type 3:
– Model normality only. A semi-supervised learning approach.
Data Integration
• Combines data from multiple sources into a coherent store.
• Integrates metadata from different sources (schema integration).
• Problems:
– Entity identification problem
– Different sources have different values for the same attributes
– Data redundancy
• These problems arise mainly because of different representations,
different scales, etc.
How to handle redundant data in data integration?
• Redundant data can often be detected by correlation
analysis.
• Step-wise and careful integration of data from multiple sources
may help to improve mining speed and quality.
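Correlation analysis for redundancy can be sketched as follows, using the earlier price/sales-tax example (the values are invented, and Pearson's r is computed by hand):

```python
from math import sqrt

# Hypothetical attributes: tax is exactly 8% of price, so the two
# attributes carry duplicate information.
price = [100.0, 250.0, 80.0, 300.0, 150.0]
tax = [8.0, 20.0, 6.4, 24.0, 12.0]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A correlation near +/-1 flags one of the attributes as redundant:
print(round(pearson(price, tax), 3))  # 1.0
```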
Data Transformation
• Changing data from one form to another form.
• Approaches:
– Smoothing: Remove noise from data.
– Aggregation: Summarizations of data
– Generalization: Hierarchy climbing of data
– Normalization: Scaled to fall within a small specified range.
Types
– Min-max normalization:
• V’ = ((V − min) / (max − min)) × (new_max − new_min) + new_min
– Z-score normalization:
• V’ = (V − mean) / stand_dev
– Normalization by decimal scaling:
• V’ = V / 10^j, where j is the smallest integer such that max(|V’|) < 1
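The three normalization formulas can be sketched as follows (sample values invented):

```python
from statistics import mean, pstdev

values = [200.0, 300.0, 400.0, 600.0, 1000.0]

# Min-max normalization to [new_min, new_max] = [0, 1]:
vmin, vmax = min(values), max(values)
new_min, new_max = 0.0, 1.0
minmax = [(v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min
          for v in values]

# Z-score normalization: (v - mean) / standard deviation.
mu, sigma = mean(values), pstdev(values)
zscore = [(v - mu) / sigma for v in values]

# Decimal scaling: divide by 10^j, with the smallest j giving max(|v'|) < 1.
j = 0
while max(abs(v) for v in values) / 10 ** j >= 1:
    j += 1
decimal = [v / 10 ** j for v in values]

print(minmax[0], decimal[-1])  # 0.0 0.1
```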
Data Aggregation:
Combining two or more attributes (or objects)
into a single attribute (or object).
• Purpose
– Data reduction: Reduce the number of attributes
or objects
– Change of scale: Cities aggregated into regions,
states, countries, etc
– More “stable” data: Aggregated data tends to have
less variability
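A minimal aggregation sketch (city names and sales figures invented): many daily records collapse into one total per city, reducing the number of objects and changing the scale.

```python
from collections import defaultdict

# Hypothetical daily sales records: (city, amount).
records = [
    ("Kathmandu", 120), ("Kathmandu", 80),
    ("Pokhara", 60), ("Pokhara", 90), ("Pokhara", 30),
]

# Aggregate to one total per city.
totals = defaultdict(int)
for city, amount in records:
    totals[city] += amount

print(dict(totals))  # {'Kathmandu': 200, 'Pokhara': 180}
```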
Data Reduction:
• A warehouse may store terabytes of data, so complex data
mining may take a very long time to run on the complete data set.
• Data reduction obtains a reduced representation of the data set that is
much smaller in volume yet produces the same (or almost the same)
analytical results.
• The warehouse is also used to carry out day-to-day business functions such as
ERP (Enterprise Resource Planning) and CRM (Customer Relationship Management).