Machine Learning
Data
• Visualization
Attribute Type: Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test
Attribute Type: Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
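As a rough illustration of the operations listed for nominal and ratio attributes, the sketch below computes the mode and entropy of a nominal attribute and the geometric mean of a ratio attribute. The sample values and the helper name `entropy` are invented for this example.

```python
import math
from collections import Counter
from statistics import geometric_mean, mode

# Hypothetical sample data, invented for illustration only.
eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]  # nominal
mass_kg = [2.0, 8.0, 4.0]                                         # ratio

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Nominal attributes only support equality tests, so counting is all we need.
most_common_color = mode(eye_color)
color_entropy = entropy(eye_color)

# Ratios are meaningful for ratio attributes, so a geometric mean makes sense.
typical_mass = geometric_mean(mass_kg)

print(most_common_color, color_entropy, typical_mass)
```

Taking a geometric mean of zip codes would run without error but would be meaningless, which is the point of distinguishing attribute types.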
Attribute Level: Ratio
  Allowed Transformations: new_value = a * old_value
  Comments: Length can be measured in meters or feet.
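A minimal sketch of the ratio-scale transformation new_value = a * old_value, using the meters-to-feet case mentioned in the comment. The sample lengths and the function name `rescale` are assumptions made for illustration.

```python
METERS_TO_FEET = 3.28084  # approximate conversion factor

def rescale(values, a):
    """Apply the ratio-scale transformation new_value = a * old_value."""
    return [a * v for v in values]

lengths_m = [2.0, 4.0, 10.0]  # hypothetical lengths in meters
lengths_ft = rescale(lengths_m, METERS_TO_FEET)

# The ratio between two objects is the same in either unit.
ratio_m = lengths_m[1] / lengths_m[0]
ratio_ft = lengths_ft[1] / lengths_ft[0]

print(lengths_ft, ratio_m, ratio_ft)
```

Because multiplying by a constant leaves ratios unchanged, statements such as "twice as long" stay true after the change of unit.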
Discrete and continuous attributes
• Discrete attribute
  Has only a finite or countably infinite set of values
  Examples: zip codes, counts, or the set of words in a collection of documents
  Often represented as integer variables.
  Note: binary attributes are a special case of discrete attributes
• Continuous attribute
  Has real numbers as attribute values
  Examples: temperature, height, or weight.
  Practically, real values can only be measured and represented using a finite number of digits.
  Continuous attributes are typically represented as floating-point variables.
• Record
  Data matrix
  Document data
  Transaction data
• Graph
  World Wide Web
  Molecular structures
• Ordered
  Spatial data
  Temporal (time series) data
  Sequential data
  Genetic sequence data
[Figure: document data as a document-term matrix with term columns timeout, season, coach, game, score, team, ball, lost, play, win]
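Document data is commonly stored as one term-count vector per document. The sketch below builds such vectors for the ten terms in the figure. The two example documents and the function name `term_vector` are invented for illustration.

```python
from collections import Counter

# Term columns taken from the figure.
vocabulary = ["timeout", "season", "coach", "game", "score",
              "team", "ball", "lost", "play", "win"]

# Hypothetical documents, invented for illustration.
documents = [
    "team coach play ball score game win game",
    "coach ball score game lost timeout season",
]

def term_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

# One row per document, one column per term.
doc_term_matrix = [term_vector(d, vocabulary) for d in documents]

for row in doc_term_matrix:
    print(row)
```

Each row is a record with one discrete (count) attribute per term, which is why document data is listed under record data.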
Jeff Howbert Introduction to Machine Learning Winter 2012 13
Transaction data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
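The five transactions above can be sketched as sets of items and then converted to a binary item matrix, one common way to turn transaction data into record data. The variable names are assumptions made for this example.

```python
# The transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Fixed column order: every distinct item, sorted alphabetically.
items = sorted(set().union(*transactions.values()))

# Binary representation: 1 if the item occurs in the transaction, else 0.
binary_rows = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}

# Number of transactions that contain each item.
support_counts = {
    item: sum(item in basket for basket in transactions.values())
    for item in items
}

print(items)
print(binary_rows)
print(support_counts)
```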
[Figure: graph data examples, a generic graph and a web page whose HTML links point to Data Mining, Graph Partitioning, Parallel Solution of Sparse Linear System of Equations, and N-Body Computation and Dense Linear System Solvers]
• Sequences of transactions
  [Figure labels: Items/Events; An element of the sequence]
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
• Spatio-temporal data
  Average monthly temperature of land and ocean
[Figure: Data → Selection → Target data → Preprocessing → Preprocessed data → Transformation → Transformed data → Machine learning → Patterns → Interpretation / evaluation → Knowledge]
• Handling missing values
  Eliminate data objects
  Estimate missing values
  Ignore the missing value during analysis
  Replace with all possible values (weighted by their probabilities)
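The first three strategies for missing values can be sketched as follows. The ages are invented, `None` stands in for a missing value, and mean imputation is only one simple way to estimate a missing value.

```python
from statistics import mean

# Hypothetical attribute with missing values marked as None.
ages = [23.0, None, 31.0, 27.0, None, 39.0]

# Strategy 1: eliminate data objects that have a missing value.
complete = [a for a in ages if a is not None]

# Strategy 2: estimate missing values, here with the mean of the
# observed values (an assumption; other estimators are possible).
fill_value = mean(complete)
imputed = [fill_value if a is None else a for a in ages]

# Strategy 3: ignore missing values during analysis, e.g. compute a
# statistic over the observed values only while keeping all objects.
observed_mean = mean(a for a in ages if a is not None)

print(complete)
print(imputed)
print(observed_mean)
```

Elimination shrinks the data set from six objects to four, while imputation keeps all six at the cost of inventing two values.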
• Example:
  Same person with multiple email addresses
• Data cleaning
  Includes the process of dealing with duplicate data issues
• Aggregation
• Sampling
• Discretization and binarization
• Attribute transformation
• Feature creation
• Feature selection
  Choose subset of existing features
• Dimensionality reduction
  Create smaller number of new features through linear or nonlinear combination of existing features
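As one possible illustration of discretization and binarization from the list above, the sketch below applies equal-width binning to a continuous attribute and then one-hot encodes the bin index. Equal-width binning, the sample heights, and both function names are choices made for this example, not something the slides prescribe.

```python
def equal_width_bins(values, n_bins):
    """Discretize: map each value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        bins.append(min(idx, n_bins - 1))  # the maximum goes in the last bin
    return bins

def one_hot(bin_index, n_bins):
    """Binarize: one binary attribute per bin."""
    return [int(i == bin_index) for i in range(n_bins)]

heights_cm = [150.0, 162.0, 171.0, 180.0, 195.0]  # hypothetical data
bins = equal_width_bins(heights_cm, 3)
binary = [one_hot(b, 3) for b in bins]

print(bins)
print(binary)
```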
• Purpose
  Data reduction
    Reduce the number of attributes or objects
  Change of scale
    Cities aggregated into regions, states, countries, etc.
  More stable data
    Aggregated data tends to have less variability
[Figure: standard deviation of average monthly precipitation; standard deviation of average yearly precipitation]
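The claim that aggregated data tends to have less variability can be checked on a toy data set. The precipitation values below are invented: a fixed seasonal pattern plus a small offset per year. This simplifies the comparison in the figure, which is presumably based on real measurements.

```python
from statistics import mean, pstdev

# Hypothetical monthly precipitation (mm): seasonal pattern + yearly offset.
seasonal = [10, 20, 30, 40, 50, 60, 60, 50, 40, 30, 20, 10]
year_offsets = [-5, 0, 5]
monthly = [[s + off for s in seasonal] for off in year_offsets]

# Variability at monthly resolution: spread over all 36 monthly values.
all_months = [v for year in monthly for v in year]
monthly_std = pstdev(all_months)

# Variability after aggregating each year to its average.
yearly_means = [mean(year) for year in monthly]
yearly_std = pstdev(yearly_means)

print(monthly_std, yearly_std)
```

Averaging over a year removes the seasonal swings, so only the year-to-year differences remain in the aggregated values.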
Definition:
A function that maps the entire set of
values of a given attribute to a new set
of replacement values, such that each
old value can be identified with one of
the new values.
• Simple functions
  Examples of transform functions:
  x^k, log(x), e^x, |x|
  Often used to make the data more like some standard distribution, to better satisfy assumptions of a particular algorithm.
  Example: discriminant analysis explicitly models each class distribution as a multivariate Gaussian
  [Figure: log(x) transform]
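A short sketch of the four simple transform functions applied to made-up values. The exponent k = 2 and the base-10 logarithm are arbitrary choices for this example, since the slide does not fix either.

```python
import math

values = [1.0, 10.0, 100.0, 1000.0]  # hypothetical right-skewed data

powered = [x ** 2 for x in values]             # x^k with k = 2
logged = [math.log10(x) for x in values]       # log(x), base 10 chosen here
exponentiated = [math.exp(x) for x in [0.0, 1.0]]  # e^x
absolute = [abs(x) for x in [-3.0, 2.0]]       # |x|

# After the log transform the values are evenly spaced, so the
# long right tail of the original data is pulled in.
gaps_before = [b - a for a, b in zip(values, values[1:])]
gaps_after = [b - a for a, b in zip(logged, logged[1:])]

print(logged)
print(gaps_before, gaps_after)
```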
• Standardization or normalization
  Usually involves making attribute:
    mean = 0
    standard deviation = 1
    in MATLAB, use zscore() function
  Important when working in Euclidean space and attributes have very different numeric scales.
  Also necessary to satisfy assumptions of certain algorithms.
  Example: principal component analysis (PCA) requires each attribute to be mean-centered (i.e. have mean subtracted from each value)
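For readers without MATLAB, standardization and mean-centering can be sketched in plain Python. The function names `zscore` and `mean_center` and the income values are assumptions for this example. The sketch uses the sample standard deviation (n - 1 denominator), which, as far as I know, is also the default in MATLAB's zscore().

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]

def mean_center(values):
    """Subtract the mean from each value, as PCA requires."""
    mu = mean(values)
    return [v - mu for v in values]

incomes = [20000.0, 35000.0, 50000.0, 95000.0]  # hypothetical attribute
standardized = zscore(incomes)
centered = mean_center(incomes)

print(standardized)
print(centered)
```

After standardization, an attribute measured in dollars and one measured in years contribute on comparable scales to a Euclidean distance.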
• Fourier transform
  Can be used to eliminate noise present in the time domain
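The idea of removing noise via the Fourier transform can be sketched with a naive discrete Fourier transform in pure Python. This is an idealized example: the "noise" is a single high-frequency sinusoid, and the signal, cutoff, and function names are all assumptions. Real noise is usually broadband, so a low-pass filter attenuates it rather than removing it exactly.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

n = 32
clean = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]         # 2 cycles
noise = [0.3 * math.sin(2 * math.pi * 12 * t / n) for t in range(n)]  # 12 cycles
noisy = [c + z for c, z in zip(clean, noise)]

# Low-pass filter: keep frequency bins up to the cutoff, plus their
# mirror-image bins at the end of the spectrum, and zero the rest.
cutoff = 4
spectrum = dft(noisy)
filtered = [X if (k <= cutoff or k >= n - cutoff) else 0
            for k, X in enumerate(spectrum)]
denoised = idft(filtered)

max_error = max(abs(d - c) for d, c in zip(denoised, clean))
print(max_error)
```

In the frequency domain the two components occupy separate bins, which is what makes them easy to separate there and hard to separate sample by sample in the time domain.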
Let's use a tool that's good at those things
PowerPoint isn't it
[Figure: Iris virginica flower with petal width and petal length labeled]
Species is class label
Image credit: Iris virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.