Lecture 2
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a
finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
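Because a floating-point variable holds only finitely many digits, a stored continuous value is an approximation of the real quantity. A minimal Python illustration:

```python
# Finite precision: 0.1 and 0.2 have no exact binary representation,
# so their sum is not exactly 0.3.
a = 0.1 + 0.2
print(a)                     # slightly off from 0.3
print(abs(a - 0.3) < 1e-9)   # compare continuous values with a tolerance
```

This is why equality tests on continuous attributes are usually done with a tolerance rather than `==`.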
Types of data sets
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
Important Characteristics of Structured Data
– Dimensionality: Curse of Dimensionality
– Sparsity: Only presence counts
– Resolution: Patterns depend on the scale
Record Data
Data that consists of a collection of records, each of
which consists of a fixed set of attributes
Example attributes: Tid, Refund, Marital Status, Taxable Income, Cheat
Document Data: each document is represented as a term vector of word counts.

           team coach play ball score game win lost timeout season
Document 1    3     0    5    0     2    6   0    2       0      2
Document 2    0     7    0    2     1    0   0    3       0      0
Document 3    0     1    0    0     1    2   2    0       3      0
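A sketch, in Python with a made-up two-document corpus, of how such a term-count matrix can be built:

```python
from collections import Counter

# Hypothetical mini-corpus (not the documents from the table above).
docs = [
    "team wins game after timeout in final season",
    "coach lost ball but team kept score in game",
]

# Vocabulary = sorted set of all terms across the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a row of term counts (a "term vector").
matrix = [[Counter(doc.split()).get(term, 0) for term in vocab] for doc in docs]

for row in matrix:
    print(row)
```

Real systems would also lowercase, strip punctuation, and remove stop words; the sketch keeps only the counting step.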
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute
a transaction, while the individual products that were
purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
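The table above maps naturally onto Python sets. A small sketch, with a hypothetical `support_count` helper (the number of transactions containing a given itemset, a quantity used later in association analysis):

```python
# The TID table, as a dict mapping transaction id -> set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support_count(itemset):
    # Count the transactions whose item set contains the given itemset.
    return sum(itemset <= items for items in transactions.values())

print(support_count({"Milk"}))            # Milk appears in transactions 1, 3, 4, 5
print(support_count({"Beer", "Diaper"}))  # Beer and Diaper together in 3 and 4
```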
Graph Data
Examples: Generic graph and HTML links
[Figure: a small generic graph with numbered nodes, not reproduced here]
<a href="papers/papers.html#bbbb">Data Mining</a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li>
<a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li>
<a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Example:
The same person appearing under multiple email addresses (duplicate data)
Data cleaning
The process of dealing with data quality problems such as duplicates
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into
a single attribute (or object)
Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc.
More “stable” data
Aggregated data tends to have less variability
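The "less variability" point can be checked with a quick sketch using synthetic data (the daily-sales scenario and all numbers below are made up for illustration):

```python
import random
import statistics

random.seed(0)
# Hypothetical daily sales: 12 months x 30 days, mean 100, std dev 20.
daily = [[random.gauss(100, 20) for _ in range(30)] for _ in range(12)]

daily_values = [x for month in daily for x in month]
monthly_means = [statistics.mean(month) for month in daily]

# Aggregated (monthly) data varies far less than the raw daily data.
print(round(statistics.stdev(daily_values), 1))
print(round(statistics.stdev(monthly_means), 1))
```

Averaging 30 independent daily values shrinks the standard deviation by roughly a factor of sqrt(30), which is why aggregated data looks more "stable".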
Sampling
Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the
data and the final data analysis.
Statisticians sample because obtaining the entire set of data of
interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire
set of data of interest is too expensive or time consuming.
The key principle for effective sampling is the following:
using a sample will work almost as well as using the entire
data set, if the sample is representative.
A sample is representative if it has approximately the same
property (of interest) as the original set of data.
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular item
Stratified sampling
Split the data into several partitions; then draw random samples
from each partition
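Both schemes can be sketched in a few lines of Python. The class labels, sizes, and sampling fraction below are all hypothetical; the point is that stratified sampling guarantees each partition is represented, which simple random sampling does not:

```python
import random
from collections import Counter

random.seed(1)
# Hypothetical imbalanced data set: 90 items of class "a", 10 of class "b".
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

def stratified_sample(items, frac):
    # Split the data into partitions by class label...
    strata = {}
    for label, value in items:
        strata.setdefault(label, []).append((label, value))
    # ...then draw a simple random sample from each partition.
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * frac))
        sample.extend(random.sample(group, k))
    return sample

sample = stratified_sample(data, 0.1)
print(Counter(label for label, _ in sample))  # 9 of class "a", 1 of class "b"
```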
Curse of Dimensionality
When dimensionality increases, data becomes increasingly
sparse in the space that it occupies.
Definitions of density and distance between points, which are
critical for clustering and outlier detection, become less
meaningful.
Illustration: randomly generate 500 points and compute the
difference between the max and min distance between any pair
of points.
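The illustration above can be sketched in plain Python. As a simplifying assumption, distances are measured from one reference point rather than over all pairs (cheaper, same qualitative effect): as the dimension grows, the relative gap between the max and min distance shrinks.

```python
import math
import random

random.seed(0)

def relative_contrast(dim, n=500):
    # Randomly generate n points in the unit hypercube of the given dimension.
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    # Distances from the first point to all the others.
    dists = [math.dist(pts[0], q) for q in pts[1:]]
    # Relative difference between the max and min distance.
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100):
    print(dim, relative_contrast(dim))
```

The printed ratio drops sharply with dimension, which is exactly why distance-based notions of density become less meaningful in high dimensions.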
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining
algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
Techniques
Principal Component Analysis (PCA)
Singular Value Decomposition
Others: supervised and non-linear techniques
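PCA and SVD are closely related: centering the data and taking its SVD yields the principal components. A minimal NumPy sketch on made-up 2-D data stretched along one axis:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data with much more variance along the first axis.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

# PCA via SVD: center the data, factor it, keep the top component.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X1 = Xc @ Vt[:1].T  # project each point onto the first principal component

# Fraction of total variance captured by the first component.
explained = S[0] ** 2 / (S ** 2).sum()
print(round(float(explained), 2))
```

Dropping the second coordinate here loses little information, which is the sense in which dimensionality reduction preserves the data.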
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
duplicate much or all of the information contained in one
or more other attributes
Example: purchase price of a product and the amount of
sales tax paid
Irrelevant features
contain no information that is useful for the data mining
task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection
Techniques:
Brute-force approach:
Try all possible feature subsets as input to data mining
algorithm
Embedded approaches:
Feature selection occurs naturally as part of the data mining
algorithm
Filter approaches:
Features are selected before data mining algorithm is run
Wrapper approaches:
Use the data mining algorithm as a black box to find best
subset of attributes
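A filter approach can be as simple as a variance threshold: features that are (nearly) constant carry little information, so they are dropped before any mining algorithm runs. A sketch on a tiny made-up data matrix (the threshold value is an arbitrary assumption):

```python
import statistics

# Hypothetical data: 4 objects x 3 features.
rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.0],
]

threshold = 0.5
cols = list(zip(*rows))  # transpose: one tuple per feature
# Keep only features whose variance exceeds the threshold.
keep = [j for j, col in enumerate(cols) if statistics.pvariance(col) > threshold]
print(keep)  # feature 1 is constant and feature 2 nearly constant, so only 0 remains
```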
Feature Creation
Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes
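A classic instance is deriving a ratio feature from raw measurements. The mass/volume numbers below are invented for illustration: density = mass / volume can separate objects that neither raw attribute separates on its own.

```python
# Hypothetical raw attributes: mass and volume of two samples.
samples = [
    {"mass": 7.8, "volume": 2.0},
    {"mass": 5.4, "volume": 2.0},
]

# Created feature: density, a single attribute that captures what
# previously required looking at mass and volume together.
for s in samples:
    s["density"] = s["mass"] / s["volume"]

print([s["density"] for s in samples])
```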
Thanks