
Data Mining

Data mining is the process of analyzing large datasets to uncover previously unknown patterns. For example, analyzing sales records showed that beers and diapers are frequently bought together, so placing them together increased sales. Data mining aims to find relationships, patterns, and models in data. The knowledge discovery in databases (KDD) process involves selecting data, preprocessing, transforming, mining patterns, and interpreting results. Common techniques include regression, clustering, neural networks, and naive Bayesian classification. Data mining has applications in marketing, finance, insurance, and bioinformatics like analyzing gene expression from microarray experiments.

Uploaded by

Jam One

Data Mining and Bioinformatics

April 30, 2004

What is Data Mining?


Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage. (SAS Institute)
Example: detecting suspicious credit card transactions

A Newer Definition
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

The Beers and Diapers Story


Analyze sales records
Beers & diapers frequently occur together in customer orders
Put beers next to diapers
Sales volume increases dramatically
Explanation?
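
A minimal sketch of how such a co-occurrence could be found, assuming each order is recorded as a set of item names. The orders below are hypothetical and made up for illustration only; they are not the data behind the story.

```python
from collections import Counter
from itertools import combinations

# Hypothetical customer orders (illustrative data only).
orders = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

# Count how often each unordered pair of items appears in the same order.
pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1

# The most frequent pair and its support (fraction of orders containing it).
top_pair, top_count = pair_counts.most_common(1)[0]
support = top_count / len(orders)
print(top_pair, top_count, support)
```

Counting pairs only shows that two items are bought together often; it does not by itself explain why.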

Why Do Data Mining


Do you know the differences between the following concepts?
Data, information, knowledge

Difference between data mining and data analysis


The latter is more specific

What do We Aim to Mine?


Relationships and summaries
Models (global summary of a data set)
Linear equations, clusters, graphs, tree structures
Prediction, classification, interpretation

Patterns (local, restricted regions)


Recurrent patterns, rules
Unusualness - anomaly detection

Analogy to data compression

The Whole KDD Process


KDD: Knowledge Discovery in Databases
Selecting the target data
Preprocessing the data
Transforming them if necessary
Performing data mining to extract patterns and relationships
Interpreting and assessing the discovered structures

Data Mining Techniques


Many of them originate from statistics, machine learning, or pattern recognition
General steps:
Determine the nature and structure of the representation to be used
Decide how to quantify and compare how well different representations fit the data (score function)
Choose an algorithmic process to optimize the score function
Decide what principles of data management are required to implement the algorithm efficiently

Example: Regression analysis

X = aY + b


Credit card spending vs Annual income
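
A minimal sketch of the regression example, assuming X is credit card spending and Y is annual income in the form X = aY + b. The score function here is the sum of squared errors, and ordinary least squares is the optimization step. The function name `fit_line` and the data points are hypothetical, chosen only so the result is easy to check.

```python
def fit_line(ys, xs):
    """Least-squares fit of x = a*y + b; returns (a, b)."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_x = sum(xs) / n
    # Slope = covariance(y, x) / variance(y); intercept from the means.
    cov = sum((y - mean_y) * (x - mean_x) for y, x in zip(ys, xs))
    var = sum((y - mean_y) ** 2 for y in ys)
    a = cov / var
    b = mean_x - a * mean_y
    return a, b

# Hypothetical data (in thousands): spending lies exactly on 0.1 * income + 0.5.
income = [30.0, 45.0, 60.0, 80.0, 100.0]
spending = [3.5, 5.0, 6.5, 8.5, 10.5]

a, b = fit_line(income, spending)
print(a, b)
```

Real data would not lie exactly on a line, so the fitted a and b would only minimize the squared error rather than reproduce every point.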

Techniques
Regression/Fitting
Clustering
Neural networks
Bayesian networks
Hidden Markov models

Example: Naive Bayesian


outlook   temp   humidity   windy   play
sunny     mild   high       false   no
sunny     hot    mild       true    yes
rainy     cool   high       false   yes

sunny     cool   high       true    ?

Naive Bayesian - Continued


9 yes samples (out of 14):
2 sunny, 3 cool, 3 high, 3 true
Prob of yes: 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053

5 no samples (out of 14):
3 sunny, 1 cool, 4 high, 3 true
Prob of no: 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206

Yes / No = 20.5% / 79.5%
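
A minimal sketch of the arithmetic above for the query (sunny, cool, high, true). It assumes that 3 of the 9 yes samples have windy = true, which is the count that reproduces the stated 0.0053. The helper name `class_score` is hypothetical.

```python
from fractions import Fraction

def class_score(class_count, total, feature_counts):
    """Prior times the per-feature conditional frequencies for one class."""
    score = Fraction(class_count, total)
    for count in feature_counts:
        score *= Fraction(count, class_count)
    return score

# Counts per class for: sunny, cool, high, true.
yes = class_score(9, 14, [2, 3, 3, 3])  # assumption: 3 of the 9 yes samples are windy
no = class_score(5, 14, [3, 1, 4, 3])

# Normalize the two scores so they sum to 1.
p_yes = yes / (yes + no)
p_no = no / (yes + no)

print(round(float(yes), 4), round(float(no), 4))  # 0.0053 0.0206
print(f"{float(p_yes):.1%} / {float(p_no):.1%}")   # 20.5% / 79.5%
```

The two products are unnormalized scores; only after dividing by their sum do they become the 20.5% / 79.5% split.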

Clustering
Iterative clustering
K-means

Hierarchical clustering
Agglomerative method

Probabilistic model-based clustering


EM (Expectation Maximization)
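
A minimal sketch of iterative clustering with K-means, assuming 2-D points and squared Euclidean distance. Seeding the centroids with the first k points is an assumption made here to keep the run deterministic; the points themselves are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    # Assumption: use the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the procedure has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Hypothetical points forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

The result of K-means depends on the initial centroids, so practical use usually tries several different seeds.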

Data Mining Applications


Interdisciplinary
Statistics, databases, machine learning, pattern recognition, AI, visualization, etc.

Applications:
Marketing: sales model
Finance: loan decision
Insurance: risk analysis
Telecom: load prediction
Web/text mining
Surveillance: security
Bioinformatics

In Bioinformatics
Analysis of microarray data
Mining free text
Structural genomics: protein crystallization
Predicting structure from sequence

Common theme: complex data, fast growing (outgrowing our processing power)

Hybridization of Sample to Probe

Data Collection and Preprocessing


Microarray Expression Data
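Fluorescence level
Noisy
Examples (rows): Gene 1, Gene 2, ..., Gene M
Features (columns): Experiment 1, Experiment 2, ..., Experiment N
Values for Experiment 1 and Experiment 2: 1083 1585 170 1464 398 302
Values for Experiment N: 1115 511 751
Category: Y, X, X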

Data Representations

Microarray Experiment Result

Machine Learning Tasks


Design of Microarrays
Probes (67 features) with fluorescence value
Learn to choose the best probes for a new gene

Biological Applications of Microarrays


Classify new examples
Predict the functional category of genes
Cluster genes based on similarity
Cluster experimental conditions
Learn a Bayesian network (that captures the joint probability distribution over the expression levels of genes)

A Support Vector Machine

Cluster Analysis

Bayesian Network

Machine Learning Tasks (cont'd)


Medical Applications of Microarrays
Cell disease classification
Predicting existing disease classes
Predicting the prognosis
Predicting the drug response of different patients

Disease Diagnosis Models

Factors That Affect Drug Response

Wrap It Up
Data mining has great potential
Danger: don't over-predict
S&P index = function of the previous year's butter production, cheese production, sheep population in Bangladesh and US?

Finally - don't expect it to answer all questions
