CIS62283 02 PreProcessing
[Figure: data table with features as columns and objects as rows; sample row: 5, No, Divorced, 95K, Yes]
• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight.
  • Practically, real values can only be measured and represented using a finite number of digits.
  • Continuous attributes are typically represented as floating-point variables.
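The finite-precision point above can be seen directly in a floating-point variable. The following is a minimal Python sketch; the values 0.1, 0.2, and 0.3 are chosen only for illustration and do not come from the slides.

```python
import math

# 0.1 and 0.2 have no exact binary representation, so their sum is
# stored with a small rounding error.
total = 0.1 + 0.2

print(total == 0.3)              # exact comparison fails
print(math.isclose(total, 0.3))  # tolerance-based comparison succeeds
print(f"{0.1:.20f}")             # shows the digits actually stored for 0.1
```

For this reason, continuous attribute values are usually compared with a tolerance rather than with exact equality.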
Any Questions?
Types of Datasets
• Record
  • Data Matrix
  • Document Data
  • Transaction Data
• Graph
  • World Wide Web
  • Molecular Structures
• Ordered
  • Spatial Data
  • Temporal Data
  • Sequential Data
  • Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Transaction Data
• A special type of record data, where
  • each record (transaction) involves a set of items.
  • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
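The table above maps naturally onto sets of items keyed by transaction ID. The following is a minimal Python sketch using the five transactions from the table; the helper name `fraction_containing` is hypothetical and serves only to show how item sets can be queried.

```python
from collections import Counter

# Each transaction (keyed by TID) is a set of items, as in the table above.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each individual item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

def fraction_containing(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(item_counts["Milk"])                                    # 4
print(fraction_containing({"Diaper", "Milk"}, transactions))  # 0.6
```

Sets are used here because a transaction records which items were bought, not the order in which they were bought.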
Graph Data
• Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; labels: "Items/Events", "An element of the sequence"]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
[Figure: Average Monthly Temperature of land and ocean]
Data Mining Process
Steps of Data Mining
1. Learning the application domain:
  • relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing: (may take 60% of effort!)
4. Data reduction and transformation:
  • Find useful features, dimensionality/variable reduction, invariant representation.
5. Choosing functions of data mining
  • summarization, classification, regression, association, clustering.
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
  • visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples:
  • Same person with multiple email addresses
• Data cleaning
  • Process of dealing with duplicate data issues
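The duplicate-data example above (one person, several email addresses) can be sketched as a small cleaning step. This is a minimal Python sketch, not a prescribed method: the field names `name`, `birth_date`, and `email`, the sample records, and the rule that a normalized name plus birth date identifies a person are all assumptions made for illustration.

```python
def merge_duplicate_people(records):
    """Merge records that appear to describe the same person.

    Assumption: two records refer to the same person when their
    normalized name and birth date match (a hypothetical matching rule).
    """
    merged = {}
    for record in records:
        key = (record["name"].strip().lower(), record["birth_date"])
        person = merged.setdefault(
            key,
            {"name": record["name"].strip(),
             "birth_date": record["birth_date"],
             "emails": []},
        )
        email = record["email"].strip().lower()
        if email not in person["emails"]:
            person["emails"].append(email)
    return list(merged.values())

# Hypothetical records: the first two describe the same person.
records = [
    {"name": "Ann Lee", "birth_date": "1990-01-01", "email": "ann@example.com"},
    {"name": "ann lee ", "birth_date": "1990-01-01", "email": "a.lee@example.org"},
    {"name": "Bob Ray", "birth_date": "1985-05-05", "email": "bob@example.com"},
]

people = merge_duplicate_people(records)
print(len(people))          # 2
print(people[0]["emails"])  # both addresses kept under one person
```

In practice the matching rule is the hard part; exact matching on normalized fields is only the simplest possible choice.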
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  • Data reduction
    • Reduce the number of attributes or objects
  • Change of scale
    • Cities aggregated into regions, states, countries, etc.
  • More “stable” data
    • Aggregated data tends to have less variability
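The change-of-scale case above (cities aggregated into states) can be sketched as a group-and-sum. This is a minimal Python sketch; the city names, states, and sales amounts are hypothetical values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical (city, state, sales) objects.
city_sales = [
    ("Dallas", "Texas", 120.0),
    ("Austin", "Texas", 80.0),
    ("Miami", "Florida", 95.0),
    ("Tampa", "Florida", 55.0),
    ("Orlando", "Florida", 50.0),
]

def aggregate_by_state(rows):
    """Combine city-level objects into one total per state."""
    totals = defaultdict(float)
    for _city, state, amount in rows:
        totals[state] += amount
    return dict(totals)

state_sales = aggregate_by_state(city_sales)
print(len(city_sales), "objects reduced to", len(state_sales))
print(state_sales)
```

Five city-level objects become two state-level objects, which is the data-reduction effect listed under Purpose.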
Sampling
• Sampling is the main technique employed for data selection.
  • It is often used for both the preliminary investigation of the data and the final data analysis.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Sampling …
• The key principle for effective sampling is the following:
  • Using a sample will work almost as well as using the entire data set, if the sample is representative
  • A sample is representative if it has approximately the same property (of interest) as the original set of data
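One way to check representativeness is to compare the property of interest on the sample and on the full data set. This is a minimal Python sketch that takes the mean as the property of interest; the population of integers, the sample size, and the seed are assumptions made for illustration.

```python
import random
from statistics import mean

def sample_mean_gap(population, sample_size, seed=0):
    """Absolute difference between the sample mean and the population mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)  # without replacement
    return abs(mean(sample) - mean(population))

# Hypothetical population: the integers 1..10000 (mean 5000.5).
population = list(range(1, 10001))

gap = sample_mean_gap(population, 1000)
print(f"gap between sample mean and population mean: {gap:.1f}")
```

A small gap relative to the spread of the data suggests that, for this property, the sample can stand in for the full data set.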
Types of Sampling
• Simple Random Sampling
  • There is an equal probability of selecting any particular item
• Stratified sampling
  • Split the data into several partitions; then draw random samples from each partition
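The two sampling types above can be sketched with the standard library. This is a minimal Python sketch; the records, the `fraction` parameter, and the rule of drawing at least one item per partition are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def simple_random_sample(items, k, rng):
    """Every item has the same probability of being selected."""
    return rng.sample(items, k)

def stratified_sample(items, key, fraction, rng):
    """Split items into partitions by `key`, then sample each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one item from every partition.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical records: 90 labelled "No" and 10 labelled "Yes".
records = [("No", i) for i in range(90)] + [("Yes", i) for i in range(10)]

rng = random.Random(0)
simple = simple_random_sample(records, 10, rng)
stratified = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)

print(Counter(label for label, _ in simple))      # class mix varies by draw
print(Counter(label for label, _ in stratified))  # 9 "No" and 1 "Yes"
```

With a rare class, a simple random sample may miss it entirely, while the stratified sample keeps each partition's share.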
Dimensionality Reduction
• Purpose:
  • Reduce amount of time and memory required by data mining algorithms
  • Allow data to be more easily visualized
  • May help to eliminate irrelevant features or reduce noise
• Techniques
  • Principal Component Analysis
  • Singular Value Decomposition
  • Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in data
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
[Figure: data plotted on axes x1 and x2]
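The PCA steps above can be sketched for two attributes, where the eigenvectors of the 2x2 covariance matrix have a closed form. This is a minimal Python sketch and not a general implementation: it assumes exactly two attributes and at least two points, and the sample points and function names are hypothetical.

```python
import math

def principal_axes_2d(points):
    """Return (mean, first_axis, eigenvalues) for a list of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix (largest first).
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = (half_trace + spread, half_trace - spread)

    # Unit eigenvector that belongs to the largest eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    first_axis = (math.cos(angle), math.sin(angle))
    return (mean_x, mean_y), first_axis, eigenvalues

def project(points, mean, axis):
    """Coordinate of each centered point along the given axis."""
    return [(x - mean[0]) * axis[0] + (y - mean[1]) * axis[1] for x, y in points]

# Illustrative points lying exactly on the line x2 = x1.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, first_axis, eigenvalues = principal_axes_2d(points)
scores = project(points, mean, first_axis)

print(first_axis)   # points along the diagonal
print(eigenvalues)  # all variation is captured by the first axis
print(scores)       # one coordinate per point in the new space
```

Because these points lie on a line, the second eigenvalue is zero, so the single projected coordinate keeps all of the variation.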
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
  • duplicate much or all of the information contained in one or more other attributes
  • Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
  • contain no information that is useful for the data mining task at hand
  • Example: students' ID is often irrelevant to the task of predicting students' GPA
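The redundancy in the price and sales-tax example above shows up as a perfect linear correlation. This is a minimal Python sketch; the prices and the 8% tax rate are hypothetical, and using Pearson correlation to flag redundancy is one possible check rather than a method stated in the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

TAX_RATE = 0.08  # hypothetical flat sales-tax rate

prices = [10.0, 25.0, 40.0, 80.0]          # hypothetical purchase prices
taxes = [price * TAX_RATE for price in prices]

r = pearson(prices, taxes)
print(f"correlation between price and tax: {r:.6f}")
```

A correlation at or near 1 means the tax attribute adds no information beyond the price, so one of the two can be dropped.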
Feature Subset Selection
• Techniques:
  • Brute-force approach:
    • Try all possible feature subsets as input to data mining algorithm
  • Embedded approaches:
    • Feature selection occurs naturally as part of the data mining algorithm
  • Filter approaches:
    • Features are selected before data mining algorithm is run
  • Wrapper approaches:
    • Use the data mining algorithm as a black box to find best subset of attributes
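The brute-force approach, combined with a black-box scorer as in the wrapper approach, can be sketched as follows. This is a minimal Python sketch; the feature names and `toy_score` are hypothetical stand-ins for a real data mining algorithm and its evaluation.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of the features (2**n - 1 subsets)."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset

def best_subset(features, score):
    """Try every subset and keep the one the black-box `score` rates highest."""
    return max(all_feature_subsets(features), key=score)

def toy_score(subset):
    """Hypothetical scorer: rewards useful features, penalizes subset size."""
    useful = {"hours_studied", "attendance"}
    return len(useful & set(subset)) - 0.1 * len(subset)

features = ["student_id", "hours_studied", "attendance"]

print(len(list(all_feature_subsets(features))))  # 7 subsets for 3 features
print(best_subset(features, toy_score))
```

The number of subsets doubles with every added feature, which is why brute force is only practical for small feature sets.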
Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
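A created feature is often a simple combination of existing attributes. This is a minimal Python sketch that assumes objects described by `mass` and `volume` and derives a `density` attribute; the attribute names and values are hypothetical and not taken from the slides.

```python
def add_density(objects):
    """Return copies of the objects with a derived `density` attribute."""
    return [{**obj, "density": obj["mass"] / obj["volume"]} for obj in objects]

# Hypothetical objects with two original attributes.
objects = [
    {"name": "sample_a", "mass": 10.0, "volume": 2.0},
    {"name": "sample_b", "mass": 9.0, "volume": 3.0},
]

enriched = add_density(objects)
for obj in enriched:
    print(obj["name"], obj["density"])
```

If the task depends on the material rather than the size of an object, the single derived attribute may carry the relevant information more compactly than mass and volume do separately.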