Introduction To Big Data and Data Mining

This document provides an introduction to knowledge discovery and data mining from big data. It discusses how data mining can be used to extract useful information from large complex datasets in business, science, and healthcare. Specifically, it covers predictive modeling using classification algorithms like decision trees. Decision trees operate by splitting training data into purer subsets based on attribute values, with the goal of minimizing impurity at each node as measured by metrics like the Gini index.


Introduction to Knowledge Discovery and Data Mining from Big Data

By Alex Cheung
Introduction
Mining Big Data: Motivation
▪ Today's digital society has seen enormous data growth in both commercial and scientific databases
▪ Data mining is becoming a commonly used tool to extract information from large and complex datasets
▪ Examples:
  ▪ Helps provide better customer service in business/commercial settings
  ▪ Helps scientists in hypothesis formation

(Figure: example data sources – Homeland Security, Business Data, Geo-spatial Data, Computational Simulations, Sensor Networks, Scientific Data)
Data Mining for Life and Health Sciences
▪ Recent technological advances are helping to generate large amounts of both medical and genomic data
  • High-throughput experiments/techniques
    - Gene and protein sequences
    - Gene-expression data
    - Biological networks and phylogenetic profiles
  • Electronic Medical Records
    - IBM-Mayo Clinic partnership has created a DB of 5 million patients
    - Single Nucleotide Polymorphisms (SNPs)
▪ Data mining offers a potential solution for the analysis of large-scale data
  • Automated analysis of patient history for customized treatment
  • Prediction of the functions of anonymous genes
  • Identification of putative binding sites in protein structures for drug/chemical discovery

(Figure: protein interaction network)

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data

(Figure: data mining at the intersection of statistics, machine learning/AI/pattern recognition, and database systems)
Data Mining as Part of the
Knowledge Discovery Process
Data Mining Tasks...
• Clustering
• Predictive Modeling
• Anomaly Detection
• Association Rules

(Figure: the four tasks arranged around a central data table)
Predictive Modeling: Classification
General Approach for Building a Classification Model

(Figure: a training set whose attributes are categorical or quantitative, plus a class label, is fed to a learning algorithm to induce a classifier model; the model is then applied to a test set)
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Predicting functions of proteins
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace

Commonly Used Classification Models

• Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines

• Ensemble Classifiers
– Boosting, Bagging, Random Forests
Classification Model: Decision Tree

Model for predicting credit worthiness:

  Employed?
    No  → Class: No
    Yes → Education?
            Graduate → Class: Yes
            {High school, Undergrad} → Number of years?
                                         > 7 yrs → Yes
                                         < 7 yrs → No
Constructing a Decision Tree

Splitting on Employed:
  Employed = Yes: Worthy: 4, Not Worthy: 3
  Employed = No:  Worthy: 0, Not Worthy: 3

Splitting on Education:
  Graduate:                Worthy: 2, Not Worthy: 2
  High School/Undergrad:   Worthy: 2, Not Worthy: 4

Key computation – the class-count matrix for a candidate split:

                   Worthy   Not Worthy
  Employed = Yes     4          3
  Employed = No      0          3
Constructing a Decision Tree

(Figure: the training records partitioned into the subsets Employed = Yes and Employed = No)
Design Issues of Decision Tree Induction

• How should training records be split?


– Method for specifying test condition
• depending on attribute types
– Measure for evaluating the goodness of a test
condition

• How should the splitting procedure stop?


– Stop splitting if all the records belong to the same
class or have identical attribute values
– Early termination
How to determine the Best Split

• Greedy approach:
– Nodes with purer class distribution are
preferred

• Need a measure of node impurity:

High degree of impurity Low degree of impurity


Measure of Impurity: GINI
• Gini index for a given node t:

  GINI(t) = 1 - Σ_j [p(j | t)]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

– Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information
Measure of Impurity: GINI
• Gini index for a given node t:

  GINI(t) = 1 - Σ_j [p(j | t)]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

– For a 2-class problem (p, 1 - p):

  GINI = 1 - p² - (1 - p)² = 2p(1 - p)
Computing Gini Index of a Single Node

P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0

P(C1) = 1/6,  P(C2) = 5/6
Gini = 1 - (1/6)² - (5/6)² = 0.278

P(C1) = 2/6,  P(C2) = 4/6
Gini = 1 - (2/6)² - (4/6)² = 0.444
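The three node computations above can be reproduced with a small helper (a minimal sketch, not tied to any particular library):

```python
def gini(counts):
    """Gini index of a node, given per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
```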
Computing Gini Index for a Collection of Nodes

• When a node p is split into k partitions (children):

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at parent node p.

• Choose the attribute that minimizes the weighted average Gini index of the children
• The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT
Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought

Split on B? → Yes: Node N1, No: Node N2

  Gini(N1) = 1 - (5/6)² - (1/6)² = 0.278
  Gini(N2) = 1 - (2/6)² - (4/6)² = 0.444

  Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
  Gain = 0.486 - 0.361 = 0.125
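A sketch of the weighted-Gini gain computation for this split (class counts (5, 1) and (2, 4) are read off the Gini formulas above; parent counts (7, 5) give the 0.486):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def weighted_gini(children):
    """children: one per-class count list per child node."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [7, 5]              # class counts before the split on B
children = [[5, 1], [2, 4]]  # N1 (B = Yes) and N2 (B = No)
gain = gini(parent) - weighted_gini(children)
print(round(gain, 3))  # 0.125
```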
Continuous Attributes: Computing Gini Index

• Use binary decisions based on one value
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

Example count matrix for the split value 80:

         ≤ 80   > 80
  Yes      0      3
  No       3      4
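The exhaustive scan described above (build the count matrix and compute the Gini index for every candidate value v) can be sketched as follows; the attribute values and labels are illustrative only. Note the nested loops make this O(n²), matching the "computationally inefficient" remark:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (v, weighted Gini)."""
    pairs = list(zip(values, labels))
    classes = sorted(set(labels))
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2.0
        # rescan the data for each candidate v (the inefficient simple method)
        left = [sum(1 for x, y in pairs if x <= v and y == c) for c in classes]
        right = [sum(1 for x, y in pairs if x > v and y == c) for c in classes]
        n = len(pairs)
        w = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
        if best is None or w < best[1]:
            best = (v, w)
    return best

vals = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labs = ['N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N']
v, g = best_threshold(vals, labs)
print(v, round(g, 3))  # 97.5 0.3
```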
Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are
employed)
– Can easily handle redundant or irrelevant attributes (unless the attributes
are interacting)
• Disadvantages:
– Space of possible decision trees is exponentially large. Greedy approaches
are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
Handling Interactions

+ : 1000 instances, o : 1000 instances
Entropy(X) : 0.99, Entropy(Y) : 0.99

(Figure: two interacting attributes X and Y; neither is informative on its own)

Handling Interactions

+ : 1000 instances, o : 1000 instances
Entropy(X) : 0.99, Entropy(Y) : 0.99, Entropy(Z) : 0.98

Adding Z as a noisy attribute generated from a uniform distribution: attribute Z will be chosen for splitting!

(Figure: scatter plots of Z vs. X and Z vs. Y)
Limitations of single attribute-based decision boundaries

Both positive (+) and negative (o) classes are generated from skewed Gaussians with centers at (8,8) and (12,12) respectively.
Model Overfitting
Classification Errors

• Training errors (apparent errors)


– Errors committed on the training set

• Test errors
– Errors committed on the test set

• Generalization errors
– Expected error of a model over random
selection of records from same distribution
Example Data Set
Two-class problem:
+ : 5200 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 200 noisy instances added
o : 5200 instances
  • Generated from a uniform distribution

10% of the data is used for training and 90% of the data is used for testing
Increasing Number of Nodes in Decision Trees

Decision Tree with 4 nodes
(Figure: the tree and its decision boundaries on the training data)

Decision Tree with 50 nodes
(Figure: the tree and its decision boundaries on the training data)

Which tree is better?
Model Overfitting

Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large
Model Overfitting

Using twice the number of data instances

• If the training data is under-representative, testing errors increase and training errors decrease as the number of nodes increases
• Increasing the size of the training data reduces the difference between training and testing errors at a given number of nodes
Reasons for Model Overfitting

• Presence of Noise

• Lack of Representative Samples

• Multiple Comparison Procedure


Effect of Multiple Comparison Procedure

• Consider the task of predicting whether the stock market will rise/fall in the next 10 trading days
• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:

  Day 1: Up, Day 2: Down, Day 3: Down, Day 4: Up, Day 5: Down,
  Day 6: Down, Day 7: Up, Day 8: Up, Day 9: Up, Day 10: Down
Effect of Multiple Comparison Procedure

• Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst that makes the most correct predictions

• Probability that at least one analyst makes at least 8 correct predictions:

  P(# correct ≥ 8) = (C(10,8) + C(10,9) + C(10,10)) / 2¹⁰ ≈ 0.0547
  P(at least one of 50 analysts gets ≥ 8 correct) = 1 - (1 - 0.0547)⁵⁰ ≈ 0.94
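The probability of at least one analyst getting 8 or more of 10 coin-flip guesses right can be checked directly (a minimal sketch using binomial counts):

```python
from math import comb

# Probability a single analyst gets >= 8 of 10 fair-coin guesses right
p_one = sum(comb(10, k) for k in range(8, 11)) / 2 ** 10
print(round(p_one, 4))  # 0.0547

# Probability that at least one of 50 independent analysts does so
p_any = 1 - (1 - p_one) ** 50
print(round(p_any, 2))  # 0.94
```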
Effect of Multiple Comparison Procedure

• Many algorithms employ the following greedy strategy:
  – Initial model: M
  – Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
  – Keep M′ if the improvement Δ(M, M′) > α

• Often, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}

• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
Effect of Multiple Comparison: Example

Use an additional 100 noisy variables generated from a uniform distribution, along with X and Y as attributes.
Use 30% of the data for training and 70% of the data for testing.

(Figure: test error using all attributes vs. using only X and Y as attributes)
Notes on Overfitting

• Overfitting results in decision trees that are


more complex than necessary

• Training error does not provide a good


estimate of how well the tree will perform
on previously unseen records
• Need ways for incorporating model
complexity into model development
Evaluating Performance of Classifier

• Model Selection
– Performed during model building
– Purpose is to ensure that model is not overly
complex (to avoid overfitting)

• Model Evaluation
– Performed after model has been constructed
– Purpose is to estimate performance of
classifier on previously unseen data (e.g., test
set)
Methods for Classifier Evaluation
• Holdout
  – Reserve k% for training and (100-k)% for testing
• Random subsampling
  – Repeated holdout
• Cross validation
  – Partition data into k disjoint subsets
  – k-fold: train on k-1 partitions, test on the remaining one
  – Leave-one-out: k = n
• Bootstrap
  – Sampling with replacement
  – .632 bootstrap: acc_boot = (1/b) Σ_{i=1..b} (0.632 × acc_i + 0.368 × acc_s), where acc_i is the accuracy on bootstrap sample i and acc_s is the accuracy on the original sample
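The bookkeeping these resampling methods share can be sketched with a minimal k-fold index split (the classifier itself is omitted; function and variable names are illustrative):

```python
import random

def kfold_indices(n, k, seed=0):
    """Partition record indices 0..n-1 into k disjoint folds for cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # reproducible random order
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 5)
for fold in folds:
    train = [j for f in folds if f is not fold for j in f]
    # train a classifier on `train`, evaluate on `fold` (model omitted here)
print(len(folds))  # 5
```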
Application on Biomedical Data
Application: SNP Association Study
• Given: a patient data set that has genetic variations (SNPs) and their associated phenotype (disease).
• Objective: finding a combination of genetic characteristics that best defines the phenotype under study.

              SNP1   SNP2   …   SNPM   Disease
  Patient 1     1      1    …     1       1
  Patient 2     0      1    …     1       1
  Patient 3     1      0    …     0       0
  …             …      …    …     …       …
  Patient N     1      1    …     1       1

Genetic variation in patients (SNPs) as a binary matrix, and survival/disease (Yes/No) as the class label.
SNP (Single Nucleotide Polymorphism)

• Definition of SNP (Wikipedia)
  – A SNP is defined as a single base change in a DNA sequence that occurs in a significant proportion (more than 1 percent) of a large population

  Individual 1: AGCGTGATCGAGGCTA
  Individual 2: AGCGTGATCGAGGCTA
  Individual 3: AGCGTGAGCGAGGCTA   ← SNP
  Individual 4: AGCGTGATCGAGGCTA
  Individual 5: AGCGTGATCGAGGCTA

  Each SNP has 3 values: (GG / GT / TT), i.e., (mm / Mm / MM)

• How many SNPs in the human genome? About 10,000,000
Why are SNPs interesting?
• In human beings, 99.9 percent of bases are the same.
• The remaining 0.1 percent makes a person unique.
  – Different attributes / characteristics / traits:
    • how a person looks,
    • diseases a person develops.
• These variations can be:
  – Harmless (change in phenotype)
  – Harmful (diabetes, cancer, heart disease, Huntington's disease, and hemophilia)
  – Latent (variations found in coding and regulatory regions are not harmful on their own, and the change in each gene only becomes apparent under certain conditions, e.g., susceptibility to lung cancer)
Issues in SNP Association Study
• In disease association studies number of SNPs varies
from a small number (targeted study) to a million (GWA
Studies)
• Number of samples is usually small
• Data sets may have noise or missing values.
• Phenotype definition is not trivial (ex. definition of survival)
• Environmental exposure, food habits, etc., add more
variability even among individuals defined under the same
phenotype
• Genetic heterogeneity among individuals for the same
phenotype
Existing Analysis Methods
• Univariate analysis: each single SNP is tested against the phenotype for correlation and ranked.
  – Feasible, but doesn't capture the existing true combinations.
• Multivariate analysis: groups of SNPs of size two or more are tested for possible association with the phenotype.
  – Infeasible, but captures any true combinations.
• These two approaches are used to identify biomarkers.
• Some approaches employ classification methods like SVMs to classify cases and controls.
Discovering SNP Biomarkers
• Given a SNP data set of Myeloma patients, find a combination of SNPs that best predicts survival.
  • 3404 SNPs selected from various regions of the chromosome
  • 70 cases (patients who survived shorter than 1 year)
  • 73 controls (patients who survived longer than 3 years)

Complexity of the problem:
• Large number of SNPs (over a million in GWA studies) and small sample size
• Complex interaction among genes may be responsible for the phenotype
• Genetic heterogeneity among individuals sharing the same phenotype (due to environmental exposure, food habits, etc.) adds more variability
• Complex phenotype definition (e.g., survival)
Discovering SNP Biomarkers

The odds ratio measures whether two groups have the same odds of an event:
  OR = 1: odds of the event are equal in both groups
  OR > 1: odds of the event are higher in cases
  OR < 1: odds of the event are higher in controls

                      Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case          a             b
         Control       c             d

The odds ratio is invariant to row and column scaling.
P-value
• P-value
  – Statistical terminology for a probability value
  – The probability that we get an odds ratio as extreme as the one observed by random chance
  – Computed by using the chi-square statistic or Fisher's exact test
    • The chi-square statistic is not valid if the number of entries in a cell of the contingency table is small
    • Fisher's exact test is a statistical test to determine if there are nonrandom associations between two categorical variables
    • If we are testing whether the value is higher than expected by random chance, using Fisher's exact test: p-value = 1 - hygecdf(a - 1, a+b+c+d, a+c, a+b)
  – P-values are often expressed in terms of the negative log of the p-value, e.g., -log10(0.005) = 2.3
Discovering SNP Biomarkers

(Figure: odds ratio vs. p-value for the 3404 SNPs, highlighting three markers – highest p-value with moderate odds ratio; highest odds ratio with moderate p-value; moderate odds ratio with moderate p-value)
Example: Highest p-value, moderate odds ratio

                      Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case        (a) 40        (b) 30
         Control     (c) 19        (d) 54

Odds ratio = (a·d)/(b·c) = (40 × 54) / (30 × 19) = 3.79

P-value = 1 - hygecdf(a - 1, a+b+c+d, a+c, a+b)
        = 1 - hygecdf(39, 143, 59, 70)

-log10(p-value) = 3.85
Example: Highest odds ratio, moderate p-value

                      Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case        (a) 7         (b) 63
         Control     (c) 1         (d) 72

Odds ratio = (a·d)/(b·c) = (7 × 72) / (63 × 1) = 8

P-value = 1 - hygecdf(a - 1, a+b+c+d, a+c, a+b)
        = 1 - hygecdf(6, 143, 8, 70)

-log10(p-value) = 1.56
Example: the same table scaled × 10

                      Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case        (a) 70        (b) 630
         Control     (c) 10        (d) 720

Odds ratio = (a·d)/(b·c) = (70 × 720) / (630 × 10) = 8

P-value = 1 - hygecdf(a - 1, a+b+c+d, a+c, a+b)
        = 1 - hygecdf(69, 1430, 80, 700)

-log10(p-value) = 6.56
Example: the same table scaled × 20

                      Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case        (a) 140       (b) 1260
         Control     (c) 20        (d) 1440

Odds ratio = (a·d)/(b·c) = (140 × 1440) / (1260 × 20) = 8

P-value = 1 - hygecdf(a - 1, a+b+c+d, a+c, a+b)
        = 1 - hygecdf(139, 2860, 160, 1400)

-log10(p-value) = 11.9
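The pattern in the tables above (same odds ratio, shrinking p-value as the table is scaled up) can be checked with a small stdlib sketch; `fisher_upper_p` mirrors the one-sided `1 - hygecdf(a-1, n, a+c, a+b)` computation used in the slides:

```python
from math import comb

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

def fisher_upper_p(a, b, c, d):
    """One-sided Fisher p-value: P(top-left cell >= observed a)."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    denom = comb(n, row1)
    return sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1)) / denom

# Odds ratio is invariant to scaling the whole table; the p-value is not.
assert odds_ratio(7, 63, 1, 72) == odds_ratio(70, 630, 10, 720) == 8.0
p1 = fisher_upper_p(7, 63, 1, 72)
p10 = fisher_upper_p(70, 630, 10, 720)
print(p10 < p1)  # True: more data, same odds ratio, stronger significance
```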
Issues with Traditional Methods

• Each SNP is tested and ranked individually
• Individual SNP associations with the true phenotype are not distinguishable from a random permutation of the phenotype

Top-ranked SNP: -log10(p-value) = 3.8; odds ratio = 3.7

Van Ness et al. 2009

However, most reported associations are not robust: of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated.
Evaluating the Utility of Univariate Rankings for Myeloma Data

• Biased evaluation: feature selection performed first, then leave-one-out cross validation with SVM
• Clean evaluation: leave-one-out cross validation with SVM, with feature selection repeated inside each fold

Random permutation test
• 10,000 random permutations of the real phenotype are generated.
• For each one, leave-one-out cross validation using SVM.
• Accuracies larger than 65% are highly significant (p-value < 10⁻⁴).
Nearest Neighbor Classifier
Nearest Neighbor Classifiers
• Basic idea:
  – If it walks like a duck and quacks like a duck, then it's probably a duck

(Figure: compute the distance from the test record to the training records, then choose the k "nearest" records)
Nearest-Neighbor Classifiers
• Requires three things
– The set of stored records
– Distance metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

• To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
Nearest Neighbor Classification…
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
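The procedure above can be sketched in a few lines, assuming Euclidean distance and majority vote; the toy points and labels are hypothetical:

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; majority vote among the k nearest."""
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), 'duck'), ((1, 2), 'duck'), ((2, 1), 'duck'),
         ((8, 8), 'goose'), ((9, 8), 'goose'), ((8, 9), 'goose')]
print(knn_predict(train, (1.5, 1.5), k=3))  # duck
print(knn_predict(train, (8.5, 8.5), k=3))  # goose
```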
Clustering
Clustering
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized
Applications of Clustering
• Applications:
  – Gene expression clustering
  – Clustering of patients based on phenotypic and genotypic factors for efficient disease diagnosis
  – Market segmentation
  – Document clustering
  – Finding groups of driver behaviors based upon patterns of automobile motions (normal, drunken, sleepy, rush hour driving, etc.)

(Figure: clustered gene-expression heat map. Courtesy: Michael Eisen)
Notion of a Cluster can be Ambiguous

(Figure: the same set of points interpreted as two, four, or six clusters. How many clusters?)
Similarity and Dissimilarity Measures
• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Is lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Euclidean Distance
• Euclidean Distance

  d(x, y) = √( Σ_{k=1..n} (x_k - y_k)² )

  where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

• Correlation is another commonly used proximity measure
Types of Clusterings

• A clustering is a set of clusters


• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
– A division of data objects into
non-overlapping subsets (clusters)
such that each data object is in
exactly one subset

• Hierarchical clustering
– A set of nested clusters organized
as a hierarchical tree
Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive


– In non-exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities
Clustering Algorithms

• K-means and its variants

• Hierarchical clustering

• Other types of clustering


K-means Clustering

• Partitional clustering approach


• Number of clusters, K, must be specified
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid
• The basic algorithm is very simple
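The basic algorithm above, as a minimal sketch (Euclidean distance, random initial centroids; the toy points are hypothetical):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means: pick initial centroids, assign, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # recompute each centroid as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged: assignments no longer change
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```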
Example of K-means Clustering
K-means Clustering – Details

• The centroid is (typically) the mean of the points in the


cluster

• Initial centroids are often chosen randomly


– Clusters produced vary from one run to another

• ‘Closeness’ is measured by Euclidean distance,


cosine similarity, correlation, etc

• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters
• The most common measure is Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster centroid
  – To get SSE, we square these errors and sum them:

    SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist²(m_i, x)

    where x is a data point in cluster C_i and m_i is the representative point (centroid) for cluster C_i
  – Given two sets of clusters, we prefer the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
Two different K-means Clusterings

(Figure: the original points, an optimal clustering, and a sub-optimal clustering)
Limitations of K-means

• K-means has problems when clusters are


of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data


contains outliers.
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)


Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

(Figure: a six-point example with its nested clusters and the corresponding dendrogram)
Strengths of Hierarchical Clustering

• Do not have to assume any particular number of


clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level

• They may correspond to meaningful taxonomies


– Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one
cluster (or k clusters) left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point
(or there are k clusters)

• Traditional hierarchical algorithms use a similarity or


distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique

• Basic algorithm is straightforward


1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

• Key operation is the computation of the proximity of


two clusters
– Different approaches to defining the distance between
clusters distinguish the different algorithms
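Steps 1-6 above, sketched with MIN (single-link) proximity and a stop-at-k-clusters condition; the naive pairwise search stands in for the proximity-matrix update:

```python
from math import dist

def single_link(points, k):
    """Agglomerative clustering with MIN (single-link) proximity."""
    clusters = [[p] for p in points]      # step 2: each point is a cluster
    while len(clusters) > k:              # steps 3-6: merge closest pair
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # MIN proximity: distance between the closest pair of points
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)    # merge, implicitly updating proximities
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(sorted(len(c) for c in single_link(pts, 2)))  # [2, 2]
```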
Starting Situation

• Start with clusters of individual points and a proximity matrix

(Figure: points p1, p2, p3, p4, p5, … and their pairwise proximity matrix)
Intermediate Situation
• After some merging steps, we have some clusters

(Figure: clusters C1-C5 and their proximity matrix)
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

(Figure: clusters C1-C5 with C2 and C5 marked for merging)
After Merging
• The question is "How do we update the proximity matrix?"

(Figure: the proximity matrix after merging C2 and C5; the entries involving the new cluster C2 ∪ C5 are unknown)
How to Define Inter-Cluster Distance

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error

(Figure: two clusters and the proximity matrix; which entries define their similarity?)
Other Types of Cluster Algorithms
• Hundreds of clustering algorithms

• Some clustering algorithms


– K-means
– Hierarchical
– Statistically based clustering algorithms
• Mixture model based clustering
– Fuzzy clustering
– Self-organizing Maps (SOM)
– Density-based (DBSCAN)

• Proper choice of algorithms depends on the type of


clusters to be found, the type of data, and the objective
Cluster Validity
• For supervised classification we have a variety of measures
to evaluate how good our model is
– Accuracy, precision, recall

• For cluster analysis, the analogous question is how to


evaluate the “goodness” of the resulting clusters?

• But “clusters are in the eye of the beholder”!

• Then why do we want to evaluate them?


– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Clusters found in Random Data

(Figure: the same random points clustered by DBSCAN, K-means, and complete link; each algorithm imposes some cluster structure)
Different Aspects of Cluster Validation

• Distinguishing whether non-random structure actually exists in the data

• Comparing the results of a cluster analysis to externally known results,


e.g., to externally given class labels

• Evaluating how well the results of a cluster analysis fit the data without
reference to external information

• Comparing the results of two different sets of cluster analyses to


determine which is better

• Determining the ‘correct’ number of clusters


Using Similarity Matrix for Cluster Validation

• Order the similarity matrix with respect to


cluster labels and inspect visually.
Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

DBSCAN
Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

K-means
Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

Complete Link
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of
cluster validity, are classified into the following three types of indices.
– External Index: Used to measure the extent to which cluster labels
match externally supplied class labels.
• Entropy
– Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or
clusters.
• Often an external or internal index is used for this function, e.g.,
SSE or entropy
• For further details please see "Introduction to Data
Mining", Chapter 8.
– http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
Clustering Microarray Data
Clustering Microarray Data
• Microarray analysis allows the monitoring of the activities of many genes over many different conditions
• Data: expression profiles of approximately 3606 genes of E. coli are recorded for 30 experimental conditions
• The SAM (Significance Analysis of Microarrays) package from Stanford University is used for the analysis of the data and to identify the genes that are substantially differentially upregulated in the dataset; 17 such genes are identified for study purposes
• Hierarchical clustering is performed and plotted using TreeView

(Figure: a gene-by-condition expression matrix, with genes Gene1, Gene2, … in rows and conditions C1-C7 in columns)
Clustering Microarray Data…
CLUTO for Clustering Microarray Data
• CLUTO (Clustering Toolkit), George Karypis (UofM)
  http://glaros.dtc.umn.edu/gkhome/views/cluto/
• CLUTO can also be used for clustering microarray data
Issues in Clustering Expression Data
• Similarity uses all the conditions
– We are typically interested in sets of genes that are
similar for a relatively small set of conditions
• Most clustering approaches assume that an
object can only be in one cluster
– A gene may belong to more than one functional group
– Thus, overlapping groups are needed
• Can either use clustering that takes these factors
into account or use other techniques
– For example, association analysis
Clustering Packages
• Mathematical and Statistical Packages
– MATLAB
– SAS
– SPSS
– R
• CLUTO (Clustering Toolkit) George Karypis (UM)
http://glaros.dtc.umn.edu/gkhome/views/cluto/
• Cluster Michael Eisen (LBNL/UCB) (microarray)
http://rana.lbl.gov/EisenSoftware.htm
http://genome-www5.stanford.edu/resources/restech.shtml (more microarray
clustering algorithms)
• Many others
– KDNuggets http://www.kdnuggets.com/software/clustering.html
Association Analysis
Association Analysis
• Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record
• Rules Discovered:
  {Milk} --> {Coke} (s=0.6, c=0.75)
  {Diaper, Milk} --> {Beer} (s=0.4, c=0.67)
• Applications
– Marketing and Sales Promotion
– Supermarket shelf management
– Traffic pattern analysis (e.g., rules such as "high congestion on
Intersection 58 implies high accident rates for left turning traffic")
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule
mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach: Two Steps
  – Frequent Itemset Generation
    • Generate all itemsets whose support ≥ minsup
  – Rule Generation
    • Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is computationally expensive
Efficient Pruning Strategy (Ref: Agrawal & Srikant 1994)
• If an itemset is infrequent, then all of its supersets must also be infrequent
  (figure: an itemset found to be infrequent, with all of its supersets pruned from the lattice)
Illustrating Apriori Principle
• Minimum Support = 3
• Generate Items (1-itemsets), then Pairs (2-itemsets) – no need to generate candidates involving Coke or Eggs – then Triplets (3-itemsets)
• If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
• With support-based pruning: 6 + 6 + 1 = 13 candidates
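The 6 + 6 + 1 = 13 count can be checked with a minimal level-wise candidate generator. The five transactions below are the textbook's standard toy baskets (assumed here), in which Coke and Eggs fall below minsup = 3:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
]
minsup = 3

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))   # 6 distinct items
candidates = [frozenset([i]) for i in items]
examined = 0
while candidates:
    examined += len(candidates)
    frequent = [c for c in candidates if support(c) >= minsup]
    freq_set = set(frequent)
    # Apriori generation: join frequent k-itemsets into (k+1)-itemsets,
    # keeping only those all of whose k-subsets are frequent (pruning).
    candidates = list({a | b for a in frequent for b in frequent
                       if len(a | b) == len(a) + 1
                       and all(frozenset(s) in freq_set
                               for s in combinations(a | b, len(a)))})
print(examined)  # 13 candidates examined, vs 6C1 + 6C2 + 6C3 = 41 without pruning
```

Only one triplet, {Bread, Milk, Diaper}, survives the subset check, matching the slide's count.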
Association Measures
• Association measures evaluate the strength of an
association pattern
– Support and confidence are the most commonly used
– The support, s(X), of an itemset X is the number of transactions that contain all the items of the itemset
  • Frequent itemsets have support ≥ specified threshold
• Different types of itemset patterns are distinguished by a measure and a threshold
– The confidence of an association rule is given by
  conf(X → Y) = s(X ∪ Y) / s(X)
• Estimate of the conditional probability of Y given X
• Other measures can be more useful
– H-confidence
– Interest
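These two measures can be computed directly on a small hypothetical transaction list, chosen here so that it reproduces the two rules from the earlier slide ({Milk} --> {Coke} and {Diaper, Milk} --> {Beer}):

```python
# Toy transactions (hypothetical data, five baskets).
transactions = [
    {"Milk", "Coke"},
    {"Milk", "Coke", "Diaper"},
    {"Milk", "Diaper", "Beer"},
    {"Milk", "Coke", "Diaper", "Beer"},
    {"Diaper", "Beer"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """conf(X -> Y) = s(X u Y) / s(X): estimate of P(Y | X)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Milk", "Coke"}))       # s = 0.6
print(confidence({"Milk"}, {"Coke"}))  # c ≈ 0.75
```

Measures such as interest and h-confidence are computed from the same support counts.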
Application on Biomedical Data
Mining Differential Coexpression (DC)
• Differential expression ➔ Differential coexpression
• Differential Expression (DE)
  – Traditional analysis targets the changes of expression level
  (figure: expression level of a gene over samples in controls and cases)
[Golub et al., 1999], [Pan 2002], [Cui and Churchill, 2003] etc.
Differential Coexpression (DC)
• Differential Coexpression (DC)
  – Targets changes of the coherence of expression
  (figure: matrix of expression values – expression of a set of genes over samples in controls and cases)
• Question: Is this gene interesting, i.e. associated w/ the phenotype?
  Answer: No, in terms of differential expression (DE). However, what if there are another two genes ……? Yes! [Kostka & Spang, 2005]
• Biological interpretations of DC: dysregulation of pathways, mutation of transcriptional factors, etc.
[Silva et al., 1995], [Li, 2002], [Kostka & Spang, 2005], [Rosemary et al., 2008], [Cho et al. 2009] etc.
Differential Coexpression (DC)
• Existing work on differential coexpression
– Pairs of genes with differential coexpression
• [Silva et al., 1995], [Li, 2002], [Li et al., 2003], [Lai et al. 2004]
– Clustering based differential coexpression analysis
• [Ihmels et al., 2005], [Watson., 2006]
– Network based analysis of differential coexpression
• [Zhang and Horvath, 2005], [Choi et al., 2005], [Gargalovic et al. 2006],
[Oldham et al. 2006], [Fuller et al., 2007], [Xu et al., 2008]
– Beyond pair-wise (size-k) differential coexpression
• [Kostka and Spang., 2004], [Prieto et al., 2006]
– Gene-pathway differential coexpression
• [Rosemary et al., 2008]
– Pathway-pathway differential coexpression
• [Cho et al., 2009]
Existing DC work is “full-space”
• Full-space differential coexpression
  – Full-space measures: e.g. correlation difference
• May have limitations due to the heterogeneity of
  – Causes of a disease (e.g. genetic difference)
  – Populations affected (e.g. demographic difference)
• Motivation: such subspace patterns may be missed by full-space models
Extension to Subspace Differential Coexpression
• Definition of Subspace Differential Coexpression Pattern
  – A set of k genes = {g1, g2, …, gk}
  – The fraction of samples in class A on which the k genes are coexpressed
  – The fraction of samples in class B on which the k genes are coexpressed
  – The contrast between these two fractions serves as a measure of subspace differential coexpression (SDC)
• Problem: given n genes, find all the subsets of genes, s.t. SDC ≥ d
Details in [Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010]
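As a rough sketch of the idea (not the exact statistic of the PSB 2010 paper), a gene set can be scored by the difference between its coexpression fractions in the two classes; the "all genes high together" test below is a deliberately crude stand-in for a real coexpression criterion:

```python
import numpy as np

def coexpression_fraction(expr, threshold=1.0):
    """Fraction of samples (columns) on which all genes (rows) are
    simultaneously highly expressed -- a crude coexpression test."""
    return float((expr > threshold).all(axis=0).mean())

def sdc(expr_a, expr_b):
    """Toy subspace-DC score: difference of the coexpression
    fractions in class A vs class B."""
    return abs(coexpression_fraction(expr_a) - coexpression_fraction(expr_b))

rng = np.random.default_rng(1)
k, n = 3, 100
expr_a = rng.normal(0, 1, (k, n))   # class A: incoherent expression
expr_b = rng.normal(0, 1, (k, n))
expr_b[:, :60] = 2.0                # class B: coherent on 60% of samples
print(sdc(expr_a, expr_b))          # large score -> subspace DC pattern
```

Note that this score is computed over a subset of samples implicitly (only the coherent columns contribute), which is what distinguishes it from a full-space correlation difference.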
Computational Challenge
• Problem: given n genes, find all the subsets of genes, s.t. SDC ≥ d
• Given n genes, there are 2^n candidates of SDC pattern!
• How to effectively handle the combinatorial search space?
• Similar motivation and challenge as biclustering, but here: differential biclustering!
Direct Mining of Differential Patterns
• Refined SDC measure: “direct”
• A measure M is antimonotonic if ∀ A,B: A ⊆ B ➔ M(A) ≥ M(B)
Details in [Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010]
[Fang, Pandey, Gupta, Steinbach and Kumar, IEEE TKDE 2011]
An Association-analysis Approach
• Refined SDC measure
• A measure M is antimonotonic if ∀ A,B: A ⊆ B ➔ M(A) ≥ M(B)
  – When a candidate is disqualified, prune all of its supersets [Agrawal et al. 1994]
• Advantages:
  1) Systematic & direct
  2) Completeness
  3) Efficiency
A 10-gene Subspace DC Pattern
• Enriched with the TNF-α/NFkB signaling pathway (6/10 overlap with the pathway, P-value: 1.4×10^-5)
• Coexpressed on ≈ 10% of the samples in one class vs ≈ 60% in the other
• Suggests that the dysregulation of the TNF-α/NFkB pathway may be related to lung cancer
(www.ingenuity.com: enriched Ingenuity subnetwork)
Data Mining Book
For further details and sample
chapters see
www.cs.umn.edu/~kumar/dmbook
References
• Book
• Computational Approaches for Protein Function Prediction, Gaurav Pandey, Vipin Kumar and Michael Steinbach, to be published by
John Wiley and Sons in the Book Series on Bioinformatics in Fall 2007
• Conferences/Workshops
• Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study, Gaurav Pandey, Michael
Steinbach, Rohit Gupta, Tushar Garg and Vipin Kumar, to appear, ACM SIGKDD 2007
• Incorporating Functional Inter-relationships into Algorithms for Protein Function Prediction, Gaurav Pandey and Vipin Kumar, to appear,
ISMB satellite meeting on Automated Function Prediction 2007
• Comparative Study of Various Genomic Data Sets for Protein Function Prediction and Enhancements Using Association Analysis, Rohit
Gupta, Tushar Garg, Gaurav Pandey, Michael Steinbach and Vipin Kumar, To be published in the proceedings of the Workshop on Data
Mining for Biomedical Informatics, held in conjunction with SIAM International Conference on Data Mining, 2007
• Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery, Hui Xiong, X. He, Chris Ding, Ya Zhang, Vipin
Kumar and Stephen R. Holbrook, pp 221-232, Proc. of the Pacific Symposium on Biocomputing, 2005
• Feature Mining for Prediction of Degree of Liver Fibrosis, Benjamin Mayer, Huzefa Rangwala, Rohit Gupta, Jaideep Srivastava, George
Karypis, Vipin Kumar and Piet de Groen, Proc. Annual Symposium of American Medical Informatics Association (AMIA), 2005
• Technical Reports
• Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study, Gaurav Pandey, Michael
Steinbach, Rohit Gupta, Tushar Garg, Vipin Kumar, Technical Report 07-007, March 2007, Department of Computer Science, University of
Minnesota
• Computational Approaches for Protein Function Prediction: A Survey, Gaurav Pandey, Vipin Kumar, Michael Steinbach, Technical Report
06-028, October 2006, Department of Computer Science, University of Minnesota