Introduction To Big Data and Data Mining
Knowledge Discovery and Data Mining from Big Data
By Alex Cheung
Introduction
Mining Big Data: Motivation
▪ Today’s digital society has seen
enormous data growth in both
commercial and scientific databases
▪ Examples: business data, geo-spatial data,
computational simulations, sensor networks,
scientific data
▪ Helps provide better customer service
in business/commercial settings
▪ Helps scientists in hypothesis
formation
Data Mining for Life and Health Sciences
▪ Recent technological advances are helping to
generate large amounts of both medical and
genomic data
• High-throughput experiments/techniques
- Gene and protein sequences
- Gene-expression data
- Biological networks and phylogenetic profiles
• Electronic Medical Records
- IBM-Mayo Clinic partnership has created a DB of 5
million patients
- Single Nucleotide Polymorphisms (SNPs)
[Figure: core data mining tasks illustrated on a data table – Clustering, Predictive Modeling, Anomaly Detection, Association Rules (e.g., market-basket items such as Milk)]
Predictive
Modeling:
Classification
General Approach for Building a
Classification Model
[Table: training records with two categorical attributes, one quantitative attribute, and a class label]
[Figure: a classifier is learned from the Training Set, producing a Model that is then applied to the Test Set]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
• Ensemble Classifiers
– Boosting, Bagging, Random Forests
Classification Model: Decision Tree
[Figure: decision tree – an internal node splits on Education (Graduate vs. {High school, Undergrad}), the Graduate branch splits on Number of years (> 7 yrs / < 7 yrs), with Yes/No class labels at the leaves]
Constructing a Decision Tree
[Figure: split on Employed – the Yes branch has Worthy: 4, Not Worthy: 3; the No branch has Worthy: 0, Not Worthy: 3]
Key Computation: the class counts for each value of the split attribute

                 Worthy   Not Worthy
Employed = Yes     4          3
Employed = No      0          3
Constructing a Decision Tree
[Figure: the data set partitioned into the Employed = Yes and Employed = No subsets]
Design Issues of Decision Tree Induction
• Greedy approach:
– Nodes with purer class distribution are
preferred
[Figure: split on attribute B – Yes → Node N1, No → Node N2]
Gini(N1) = 1 – (5/6)^2 – (1/6)^2 = 0.278
Gini(N2) = 1 – (2/6)^2 – (4/6)^2 = 0.444
Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 – 0.361 = 0.125
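The arithmetic above can be reproduced in a few lines of pure Python. This is a sketch: the parent class counts of (7, 5) are my inference from the stated parent Gini of 0.486, not values shown on the slide.

```python
def gini(counts):
    """Gini index of a node from its class counts: 1 - sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = gini([7, 5])                      # ~0.486 (assumed parent class counts)
g1, g2 = gini([5, 1]), gini([2, 4])        # ~0.278 and ~0.444 for nodes N1, N2
weighted = (6 / 12) * g1 + (6 / 12) * g2   # ~0.361, each child weighted by its size
gain = parent - weighted                   # ~0.125
```

The split with the largest gain (largest drop in weighted impurity) is preferred by the greedy induction procedure.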
Continuous Attributes: Computing Gini Index
Handling interactions
+ : 1000 instances
o : 1000 instances
Entropy (X) : 0.99
Entropy (Y) : 0.99
Entropy (Z) : 0.98
Z is a noisy attribute generated from a uniform distribution, yet attribute Z will be chosen for splitting!
[Figure: the two classes plotted in the X–Y plane and against the noisy attribute Z]
Limitations of single attribute-based decision boundaries
• Test errors
– Errors committed on the test set
• Generalization errors
– Expected error of a model over random
selection of records from same distribution
Example Data Set
Two class problem:
+ : 5200 instances
• 5000 instances generated from
a Gaussian centered at (10,10)
• 200 noisy instances added
o : 5200 instances
• Generated from a uniform
distribution
Decision Tree
Decision Tree
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large
Model Overfitting
• Presence of Noise
• Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the most
correct predictions
• Model Selection
– Performed during model building
– Purpose is to ensure that model is not overly
complex (to avoid overfitting)
• Model Evaluation
– Performed after model has been constructed
– Purpose is to estimate performance of
classifier on previously unseen data (e.g., test
set)
Methods for Classifier Evaluation
• Holdout
– Reserve k% for training and (100-k)% for testing
• Random subsampling
– Repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
• Bootstrap
– Sampling with replacement
– .632 bootstrap: accuracy = 0.632 × (accuracy on the held-out records) + 0.368 × (accuracy on the training data), averaged over the bootstrap samples
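The cross-validation variants above can be sketched in pure Python (the helper name is hypothetical; leave-one-out falls out as the special case k = n):

```python
def kfold_splits(n, k):
    """Partition record indices 0..n-1 into k disjoint folds and yield
    (train_indices, test_indices) pairs, one per fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        # train on the other k-1 partitions, test on the remaining one
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Every record lands in exactly one test fold, so each record is used for testing exactly once across the k iterations.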
Application
on
Biomedical
Data
Application : SNP Association Study
• Given: A patient data set that has genetic variations
(SNPs) and their associated Phenotype (Disease).
• Objective: Finding a combination of genetic characteristics
that best defines the phenotype under study.
SNP data:
• 3404 SNPs selected from various regions of the chromosome
• 70 cases (patients who survived shorter than 1 year)
• 73 controls (patients who survived longer than 3 years)
Discovering SNP Biomarkers
• Given a SNP data set of Myeloma patients (3404 SNPs), find a
combination of SNPs that best predicts survival.
[Figure: candidate biomarkers – one labeled “Highest p-value, moderate odds ratio”, another “Highest odds ratio, moderate p-value”]
Biomarker (SNPs)

                 Has Marker   Lacks Marker
CLASS  Case      (a) 40       (b) 30
       Control   (c) 19       (d) 54

–log10(p-value) = 3.85
Example …
Biomarker (SNPs)

                 Has Marker   Lacks Marker
CLASS  Case      (a) 7        (b) 63
       Control   (c) 1        (d) 72

–log10(p-value) = 1.56
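The value for this 2×2 table can be reproduced with the odds ratio plus a one-sided Fisher exact test. The choice of a one-sided Fisher test is my assumption, though it does give –log10(p) ≈ 1.56 for the counts a=7, b=63, c=1, d=72:

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Odds ratio ad/bc of a 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value: probability, under the hypergeometric
    null, of seeing a or more marker-carrying cases."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    return sum(comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
               for x in range(a, min(row1, col1) + 1))
```

For this table the odds ratio is (7 × 72) / (63 × 1) = 8, and the Fisher p-value is about 0.027, i.e. –log10(p) ≈ 1.56.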
Example …
Biomarker (SNPs)
[Table: the same 2×2 table (Has Marker / Lacks Marker) with all counts × 10]
–log10(p-value) = 6.56
Example …
Biomarker (SNPs)
[Table: the same 2×2 table (Has Marker / Lacks Marker) with all counts × 20]
–log10(p-value) = 11.9
Issues with Traditional Methods
However, most reported associations are not robust: of the 166 putative
associations which have been studied three or more times, only 6 have
been consistently replicated.
Evaluating the Utility of Univariate Rankings
for Myeloma Data
Feature selection on the full data set, followed by
leave-one-out cross validation with SVM:
Biased Evaluation
Evaluating the Utility of Univariate Rankings
for Myeloma Data
Feature selection on the full data set, then leave-one-out cross
validation with SVM: Biased Evaluation
Leave-one-out cross validation with SVM, with feature selection
repeated inside each fold: Clean Evaluation
Random Permutation Test
• 10,000 random permutations of the real phenotype are generated.
• For each one, leave-one-out cross validation using SVM is performed.
• Accuracies larger than 65% are highly significant (p-value < 10^-4).
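The permutation idea can be sketched in miniature with a fixed prediction vector instead of rerunning the SVM (a simplification, not the authors' pipeline): shuffle the labels many times and ask how often chance does as well as the real labels.

```python
import random

def permutation_pvalue(preds, labels, n_perm=1000, seed=0):
    """Empirical p-value: fraction of label permutations on which the
    fixed predictions score at least as well as on the real labels."""
    rng = random.Random(seed)
    acc = lambda y: sum(p == t for p, t in zip(preds, y)) / len(y)
    observed = acc(labels)
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)            # break any real association
        if acc(perm) >= observed:
            hits += 1
    return hits / n_perm
```

A classifier that genuinely tracks the phenotype yields a tiny p-value; one that merely matches the class proportions yields a p-value near 1.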
Nearest Neighbor
Classifier
Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck
[Figure: compute the distance from the test record to the training records]
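The idea can be sketched as a brute-force k-nearest-neighbor classifier (a minimal sketch; the function names are mine):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, x, k=1):
    """Classify x by majority vote among its k nearest training records."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda rec: euclidean(rec[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

There is no model-building step: all the work happens at prediction time, which is why nearest-neighbor methods are called lazy learners.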
• Hierarchical clustering
– A set of nested clusters organized
as a hierarchical tree
Other Distinctions Between Sets of Clusters
• K-means clustering
– Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them
– Given two sets of clusters, we prefer the one with the smallest
error
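The SSE computation described above, as a short sketch:

```python
def sse(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to the
    centroid of the cluster it is assigned to."""
    return sum(sum((p_i - c_i) ** 2
                   for p_i, c_i in zip(p, centroids[k]))
               for p, k in zip(points, assignment))
```

Given two candidate clusterings of the same data, the one with the smaller SSE is preferred.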
[Figure: original points and the resulting clusters]
Strengths of Hierarchical Clustering
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point
(or there are k clusters)
[Figure: starting situation – each point (p1, p2, p3, p4, p5, …) is its own cluster, with a point-to-point Proximity Matrix]
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1, C2, C3, C4, C5 and their cluster-to-cluster Proximity Matrix]
Intermediate Situation
• We want to merge the two closest clusters (C2 and
C5) and update the proximity matrix.
[Figure: clusters C1–C5 and their Proximity Matrix, with C2 and C5 highlighted as the closest pair]
After Merging
• The question is “How do we update the proximity
matrix?”
[Figure: Proximity Matrix after merging – rows/columns C1, C2 ∪ C5, C3, C4, with the entries involving C2 ∪ C5 marked “?”]
How to Define Inter-Cluster Distance
[Figure: points p1–p5 and their Proximity Matrix – which entry defines the similarity between two clusters?]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective
function
– Ward’s Method uses squared error
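The MIN, MAX, and Group Average choices can be sketched as functions of the cross-cluster pairwise distances (the function names here are mine):

```python
def cross_distances(c1, c2, dist):
    """All pairwise distances between a point in cluster c1 and a point in c2."""
    return [dist(p, q) for p in c1 for q in c2]

def single_link(c1, c2, dist):    # MIN: closest pair across the two clusters
    return min(cross_distances(c1, c2, dist))

def complete_link(c1, c2, dist):  # MAX: farthest pair across the two clusters
    return max(cross_distances(c1, c2, dist))

def group_average(c1, c2, dist):  # mean of all cross-cluster distances
    d = cross_distances(c1, c2, dist)
    return sum(d) / len(d)
```

At each agglomerative step, whichever of these the algorithm uses determines which two clusters are merged next.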
Other Types of Cluster Algorithms
• Hundreds of clustering algorithms
[Figure: random points clustered by DBSCAN, K-means, and Complete Link]
Different Aspects of Cluster Validation
• Evaluating how well the results of a cluster analysis fit the data without
reference to external information
[Figure: similarity matrix with points sorted by DBSCAN cluster labels]
Using Similarity Matrix for Cluster Validation
[Figure: similarity matrix with points sorted by K-means cluster labels]
Using Similarity Matrix for Cluster Validation
[Figure: similarity matrix with points sorted by Complete Link cluster labels]
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of
cluster validity can be classified into the following three types of indices.
– External Index: Used to measure the extent to which cluster labels
match externally supplied class labels.
• Entropy
– Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or
clusters.
• Often an external or internal index is used for this function, e.g.,
SSE or entropy
• For further details please see “Introduction to Data
Mining”, Chapter 8.
– http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
Clustering
Microarray
Data
Clustering Microarray Data
• Microarray analysis allows the monitoring of the
activities of many genes over many different
conditions
• Data: Expression profiles of approximately 3606
genes of E. coli are recorded for 30 experimental
conditions
[Figure: expression matrix – genes (Gene1, Gene2, …) as rows, conditions (C1–C7, …) as columns]
• SAM (Significance Analysis of Microarrays) package
from Stanford University is used for the analysis of
the data and to identify the genes that are
substantially differentially upregulated in the dataset –
17 such genes are identified for study purposes
• Applications
– Marketing and Sales Promotion
– Supermarket shelf management
– Traffic pattern analysis (e.g., rules such as "high congestion on
Intersection 58 implies high accident rates for left turning traffic")
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule
mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
If an itemset is infrequent,
then all of its supersets
must also be infrequent
[Figure: itemset lattice – once an itemset is found to be infrequent, all of its supersets are pruned]
Illustrating Apriori Principle
[Figure: with Minimum Support = 3, candidates are generated level by level – Items (1-itemsets), then Pairs (2-itemsets), then Triplets (3-itemsets)]
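The level-wise procedure can be sketched in pure Python. The five-transaction market-basket data below is the textbook example I assume the slide's figure uses; with minimum support 3 it finds four frequent items and four frequent pairs, and no frequent triplet survives.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining: a k-itemset is only a candidate
    if every one of its (k-1)-subsets was frequent (Apriori pruning)."""
    transactions = [frozenset(t) for t in transactions]
    support = lambda s: sum(1 for t in transactions if s <= t)
    frequent = {s for t in transactions for i in t
                if support(s := frozenset([i])) >= minsup}
    result = {s: support(s) for s in frequent}
    k = 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune candidates that contain an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= minsup}
        result.update((c, support(c)) for c in frequent)
        k += 1
    return result
```

The pruning step is exactly the Apriori principle in action: an infrequent itemset never gets its supersets counted.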
[Figure: expression level of a gene in controls vs. cases]
[Golub et al., 1999], [Pan 2002], [Cui and Churchill, 2003] etc.
Differential Coexpression (DC)
• Differential Coexpression (DC)
– Targets changes of the coherence of expression
[Figure: gene expression heatmap – genes as rows, controls and cases as columns]
Question: Is this gene interesting, i.e. associated with the phenotype?
[Silva et al., 1995], [Li, 2002], [Kostka & Spang, 2005], [Rosemary et al., 2008], [Cho et al. 2009] etc.
Differential Coexpression (DC)
• Existing work on differential coexpression
– Pairs of genes with differential coexpression
• [Silva et al., 1995], [Li, 2002], [Li et al., 2003], [Lai et al. 2004]
– Clustering based differential coexpression analysis
• [Ihmels et al., 2005], [Watson., 2006]
– Network based analysis of differential coexpression
• [Zhang and Horvath, 2005], [Choi et al., 2005], [Gargalovic et al. 2006],
[Oldham et al. 2006], [Fuller et al., 2007], [Xu et al., 2008]
– Beyond pair-wise (size-k) differential coexpression
• [Kostka and Spang., 2004], [Prieto et al., 2006]
– Gene-pathway differential coexpression
• [Rosemary et al., 2008]
– Pathway-pathway differential coexpression
• [Cho et al., 2009]
Existing DC work is “full-space”
Motivation:
Such subspace patterns
may be missed by full-space models
Extension to Subspace Differential Coexpression
Problem: given n genes,
find all the subsets of genes
s.t. SDC ≥ d
Details in [Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010]
Computational Challenge
Problem: given n genes, find all the subsets of genes s.t. SDC ≥ d
(the number of candidate subsets grows exponentially: 2^n >> n)
A measure M is antimonotonic
if ∀ A,B: A ⊆ B ⇒ M(A) ≥ M(B)
SDC is only approximately (≈) antimonotonic
Details in [Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010]
[Fang, Pandey, Gupta, Steinbach and Kumar, IEEE TKDE 2011]
An Association-analysis Approach
A measure M is antimonotonic if
V A,B: A B ➔ M(A) >= M(B)
Advantages:
1) Systematic & direct
2) Completeness
3) Efficiency
Prune all the
supersets
[ Agrawal et al. 1994]
A 10-gene Subspace DC Pattern
• Conferences/Workshops
• Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study, Gaurav Pandey, Michael
Steinbach, Rohit Gupta, Tushar Garg and Vipin Kumar, to appear, ACM SIGKDD 2007
• Incorporating Functional Inter-relationships into Algorithms for Protein Function Prediction, Gaurav Pandey and Vipin Kumar, to appear,
ISMB satellite meeting on Automated Function Prediction 2007
• Comparative Study of Various Genomic Data Sets for Protein Function Prediction and Enhancements Using Association Analysis, Rohit
Gupta, Tushar Garg, Gaurav Pandey, Michael Steinbach and Vipin Kumar, To be published in the proceedings of the Workshop on Data
Mining for Biomedical Informatics, held in conjunction with SIAM International Conference on Data Mining, 2007
• Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery, Hui Xiong, X. He, Chris Ding, Ya Zhang, Vipin
Kumar and Stephen R. Holbrook, pp 221-232, Proc. of the Pacific Symposium on Biocomputing, 2005
• Feature Mining for Prediction of Degree of Liver Fibrosis, Benjamin Mayer, Huzefa Rangwala, Rohit Gupta, Jaideep Srivastava, George
Karypis, Vipin Kumar and Piet de Groen, Proc. Annual Symposium of American Medical Informatics Association (AMIA), 2005
• Technical Reports
• Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study, Gaurav Pandey, Michael
Steinbach, Rohit Gupta, Tushar Garg, Vipin Kumar, Technical Report 07-007, March 2007, Department of Computer Science, University of
Minnesota
• Computational Approaches for Protein Function Prediction: A Survey, Gaurav Pandey, Vipin Kumar, Michael Steinbach, Technical Report
06-028, October 2006, Department of Computer Science, University of Minnesota