Supervised and Unsupervised Learning
Ciro Donalek
Ay/Bi 199 – April 2011
Summary
• KDD and Data Mining Tasks
• Finding the optimal approach
• Supervised Models
– Neural Networks
– Multi-Layer Perceptron
– Decision Trees
• Unsupervised Models
– Different Types of Clustering
– Distances and Normalization
– K-means
– Self Organizing Maps
• Combining different models
– Committee Machines
– Introducing a Priori Knowledge
– Sleeping Expert Framework
Knowledge Discovery in Databases
• KDD may be defined as: "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".
• KDD is an interactive and iterative process involving several steps.
You got your data: what’s next?
What kind of analysis do you need? Which model is more appropriate for it? …
Clean your data!
• Data preprocessing transforms the raw data into a format that will be more easily and effectively processed for the purpose of the user.
• Some tasks:
• sampling: selects a representative subset from a large population of data;
• noise treatment
• strategies to handle missing data: sometimes your rows will be incomplete; not all parameters are measured for all samples.
• normalization
• feature extraction: pulls out specified data that is significant in some particular context.
Use standard formats!
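As a minimal illustration of the sampling task above, the sketch below (a hypothetical helper, not from the slides) draws a simple random subset; a fixed seed makes the selection reproducible.

```python
import random

def sample_subset(data, frac, seed=0):
    """Simple random sampling: select a subset of size frac * len(data).

    Stratified sampling would be needed if class proportions in the
    population must be preserved in the subset.
    """
    rng = random.Random(seed)
    k = max(1, int(len(data) * frac))
    return rng.sample(data, k)

subset = sample_subset(list(range(1000)), 0.1)  # 100 distinct items
```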
Missing Data
• Missing data are a part of almost all research, and we all have to decide how to deal with them.
• Complete Case Analysis: use only rows with all the values present
• Available Case Analysis
• Substitution
– Mean Value: replace the missing value with the mean value for that particular attribute
– Regression Substitution: we can replace the missing value with a historical value from similar cases
– Matching Imputation: for each unit with a missing y, find a unit with similar values of x in the observed data and take its y value
– Maximum Likelihood, EM, etc.
• Some DM models can deal with missing data better than others.
• Which technique to adopt really depends on your data.
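The simplest of the substitution strategies above, mean-value imputation, can be sketched in a few lines (missing entries are marked with None; the data layout is made up for illustration):

```python
def mean_impute(rows, col):
    """Replace missing values (None) in column `col` with the mean
    of the observed values in that column (mean-value substitution)."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[col] is None:
            r[col] = mean
    return rows

data = [[1.0, 2.0], [3.0, None], [5.0, 6.0]]
mean_impute(data, 1)  # the missing value becomes (2.0 + 6.0) / 2 = 4.0
```

Note that mean substitution shrinks the variance of the attribute; regression or matching imputation avoids that at the cost of more modeling.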
Data Mining
• A crucial task within the KDD process.
• Data Mining is about automating the process of searching for patterns in the data.
• In more detail, the most relevant DM tasks are:
– association
– sequence or path analysis
– clustering
– classification
– regression
– visualization
Finding the Solution via Purpose
• You have your data, what kind of analysis do you need?
• Regression
– predict new values based on the past; inference
– compute the new values for a dependent variable based on the values of one or more measured attributes
• Classification:
– divide samples into classes
– uses a training set of previously labeled data
• Clustering
– partitioning of a data set into subsets (clusters) so that data in each subset ideally share some common characteristics
• Crisp classification
– given an input, the classifier returns its label
• Probabilistic classification
– given an input, the classifier returns the probabilities of it belonging to each class
– useful when some mistakes can be more costly than others (e.g., give me only data classified with >90% probability)
– winner-take-all and other rules
• assign the object to the class with the highest probability (WTA)
• …but only if its probability is greater than 40% (WTA with thresholds)
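The winner-take-all rules above fit in a few lines; the class names below are hypothetical examples.

```python
def wta(probs, threshold=0.0):
    """Winner-take-all: return the class with the highest probability,
    but only if that probability exceeds `threshold` (else reject)."""
    label = max(probs, key=probs.get)
    return label if probs[label] > threshold else None

probs = {"star": 0.45, "galaxy": 0.35, "qso": 0.20}
wta(probs)                 # plain WTA: 'star'
wta(probs, threshold=0.5)  # WTA with threshold: rejected (None)
```

Rejecting low-confidence inputs is exactly what makes probabilistic output useful when some mistakes are costlier than others.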
Regression / Forecasting
• Data table: statistical correlation
– mapping without any prior assumption on the functional form of the data distribution;
– machine learning algorithms are well suited for this.
• Curve fitting
– find a well-defined and known function underlying your data;
– theory / expertise can help.
Machine Learning
• To learn: to get knowledge of by study, experience,
or being taught.
• Types of Learning
• Supervised
• Unsupervised
Unsupervised Learning
• The model is not provided with the correct results
during the training.
• Can be used to cluster the input data into classes on the basis of their statistical properties only.
• Cluster significance and labeling.
• The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.
Supervised Learning
• Training data includes both the input and the
desired results.
• For some examples the correct results (targets) are
known and are given in input to the model during
the learning process.
• The construction of a proper training, validation and test set (Bok) is crucial.
• These methods are usually fast and accurate.
• Have to be able to generalize: give the correct
results when new data are given in input without
knowing a priori the target.
Generalization
• Refers to the ability to produce reasonable outputs for inputs not encountered during the training.
After repeating this process for a sufficiently large number of training cycles, the network will usually converge.
Hidden Units
• The best number of hidden units depends on:
– number of inputs and outputs
– number of training cases
– the amount of noise in the targets
– the complexity of the function to be learned
– the activation function
• Too few hidden units => high training and generalization error, due to underfitting and high statistical bias.
• Too many hidden units => low training error but high generalization error, due to overfitting and high variance.
• Rules of thumb don't usually work.
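One way to see why too many hidden units inflate the variance: the number of free parameters grows linearly with the hidden-layer size. A small counting helper (an illustration, not from the slides):

```python
def mlp_param_count(n_in, n_hidden, n_out):
    """Trainable parameters in a one-hidden-layer MLP:
    input->hidden weights and biases, (n_in + 1) * n_hidden,
    plus hidden->output ones, (n_hidden + 1) * n_out."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

mlp_param_count(4, 5, 3)    # 43 parameters
mlp_param_count(4, 100, 3)  # 803 parameters: far easier to overfit
```

With a fixed number of training cases, every extra hidden unit adds capacity that must be constrained by data, which is why the best size depends on the factors listed above rather than on a rule of thumb.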
Activation and Error Functions
Activation Functions
Results: confusion matrix
Results: completeness and contamination
Exercise: compute completeness and contamination for the previous confusion matrix (test set).
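A sketch of the exercise, assuming the usual definitions: completeness of a class is the fraction of its true members that are recovered, and contamination is the fraction of objects assigned to the class that actually belong to other classes. The confusion matrix below is made up for illustration, not the one from the slides.

```python
def completeness_contamination(cm, cls):
    """cm[i][j] = number of objects of true class i classified as class j.
    Completeness: fraction of true members of `cls` recovered (recall).
    Contamination: fraction of objects assigned to `cls` that are
    actually from other classes (1 - precision)."""
    true_pos = cm[cls][cls]
    truths = sum(cm[cls])                    # all true members of cls
    assigned = sum(row[cls] for row in cm)   # all objects labeled cls
    return true_pos / truths, 1 - true_pos / assigned

cm = [[45, 5],
      [10, 40]]
completeness_contamination(cm, 0)  # (0.9, ~0.18): 45/50 recovered, 10/55 wrong
```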
Decision Trees
• Another classification method.
• A decision tree is a set of simple rules, such as "if the sepal length is less than 5.45, classify the specimen as setosa."
• Decision trees are also nonparametric because they do not require any assumptions about the distribution of the variables in each class.
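The quoted rule is a single node; a full tree just chains such threshold tests. A toy hand-written sketch (the first rule is the one quoted above; the petal-width split is invented for illustration and not from the source):

```python
def classify_iris(sepal_length, petal_width):
    """Tiny hand-written decision tree for the iris example.
    Root split is the rule quoted in the text; the second split
    on petal width is hypothetical."""
    if sepal_length < 5.45:
        return "setosa"
    return "virginica" if petal_width > 1.7 else "versicolor"

classify_iris(5.0, 0.2)  # 'setosa'
classify_iris(6.3, 2.0)  # 'virginica'
```

In practice the thresholds are learned from a labeled training set by greedily choosing, at each node, the split that best separates the classes.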
Unsupervised Learning
Types of Clustering
• Types of clustering:
– HIERARCHICAL: finds successive clusters using previously established clusters
• agglomerative (bottom-up): start with each element in a separate cluster and merge them according to a given property
• divisive (top-down)
– PARTITIONAL: usually determines all clusters at once
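A minimal sketch of the bottom-up strategy on 1-D points: start with singleton clusters and repeatedly merge the two clusters whose means are closest (centroid linkage, one possible choice of merge property) until k clusters remain.

```python
def agglomerate(points, k):
    """Agglomerative (bottom-up) clustering of 1-D points with
    centroid linkage: merge the two clusters with closest means."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                mi = sum(clusters[i]) / len(clusters[i])
                mj = sum(clusters[j]) / len(clusters[j])
                d = abs(mi - mj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

agglomerate([1.0, 1.1, 5.0, 5.1, 9.0], 2)  # two clusters of sizes 2 and 3
```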
Distances
• Determine the similarity between two clusters and
the shape of the clusters.
In case of strings…
• The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
– measures the minimum number of substitutions required to change one string into the other
• The Levenshtein (edit) distance is a metric for measuring the amount of difference between two sequences.
– it is defined as the minimum number of edits needed to transform one string into the other.
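Both string distances are easy to state in code; Levenshtein is the classic dynamic-programming recurrence over insertions, deletions and substitutions.

```python
def hamming(a, b):
    """Hamming distance: positions at which equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Levenshtein (edit) distance: minimum number of insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (x != y))) # substitution
        prev = cur
    return prev[-1]

hamming("karolin", "kathrin")     # 3
levenshtein("kitten", "sitting")  # 3
```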
RANGE (Min-Max Normalization): subtracts the minimum value of an attribute from each value of the attribute and then divides the difference by the range of the attribute. It has the advantage of preserving exactly all relationships in the data, without adding any bias.
SOFTMAX: is a way of reducing the influence of extreme values or outliers in the data without removing them from the data set. It is useful when you have outlier data that you wish to include in the data set while still preserving the significance of data within a standard deviation of the mean.
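Sketches of both normalizations; the softmax form below, a logistic squashing of the standardized value, is one common variant and is assumed here rather than taken from the slides.

```python
import math

def minmax(xs):
    """Min-max (range) normalization: scale values linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def softmax_norm(xs, lam=1.0):
    """Softmax normalization (one common form): logistic squashing of
    the standardized value. Values within about one standard deviation
    of the mean map almost linearly; outliers are compressed smoothly
    into (0, 1) instead of dominating the range."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [1 / (1 + math.exp(-(x - mean) / (lam * std))) for x in xs]

minmax([2.0, 4.0, 6.0])              # [0.0, 0.5, 1.0]
softmax_norm([1.0, 2.0, 3.0, 100.0]) # outlier squashed, order preserved
```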
K-means
K-means: how it works
K-means: pros and cons
Learning K
• Find a balance between two variables: the number of
clusters (K) and the average variance of the clusters.
• Minimize both values
• As the number of clusters increases, the average
variance decreases (up to the trivial case of k=n and
variance=0).
• Some criteria:
– BIC (Bayesian Information Criterion)
– AIC (Akaike Information Criterion)
– Davies-Bouldin Index
– Confusion Matrix
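The K-versus-variance trade-off can be seen with a minimal 1-D K-means (an illustrative sketch, not an optimized implementation): the average within-cluster variance shrinks as K grows, down to zero at K = n.

```python
import random

def kmeans_1d(points, k, iters=50, seed=0):
    """Minimal 1-D K-means: returns the centers and the average
    within-cluster variance of the final assignment."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: (p - centers[c]) ** 2)].append(p)
        # update step: recompute each center as its cluster mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    var = sum(min((p - c) ** 2 for c in centers) for p in points) / len(points)
    return centers, var

data = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2]
_, v1 = kmeans_1d(data, 1)
_, v3 = kmeans_1d(data, 3)  # v3 < v1: more clusters, lower variance
```

Criteria such as BIC and AIC penalize the number of clusters precisely to stop this monotone decrease from always favoring larger K.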
Self Organizing Maps
SOM topology
SOM Prototypes
SOM Training
Competitive and Cooperative Learning
SOM Update Rule
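The standard SOM update rule moves every prototype toward the input, scaled by the learning rate and a neighborhood function centered on the winning unit. A sketch for a 1-D map with a Gaussian neighborhood (an illustration of the standard rule, not the exact notation of the slides):

```python
import math

def som_update(weights, x, winner, alpha, sigma):
    """One SOM update step on a 1-D map of prototype vectors:
    w_j <- w_j + alpha * h(j, winner) * (x - w_j),
    with a Gaussian neighborhood h centered on the winning unit."""
    for j, w in enumerate(weights):
        h = math.exp(-((j - winner) ** 2) / (2 * sigma ** 2))
        weights[j] = [wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)]
    return weights

weights = [[0.0, 0.0], [1.0, 1.0]]
som_update(weights, [1.0, 0.0], winner=0, alpha=0.5, sigma=1.0)
# the winner moves halfway toward x; its neighbor moves less
```

During training both alpha and sigma are typically decreased over time, shifting the map from coarse ordering to fine tuning.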
Parameters
DM with SOM
SOM Labeling
Localizing Data
Cluster Structure
Cluster Structure ‐ 2
Component Planes
Relative Importance
How accurate is your clustering?
Trajectories
Combining Models
Commiee Machines
A priori knowledge
Sleeping Experts