Data Mining at UVA: New Horizons in Teaching and Learning Conference

This document summarizes a presentation on data mining at UVA. It discusses the commercial and scientific motivations for data mining, including finding patterns in large datasets. Data mining can help discover useful information that humans may miss. The presentation covers classification techniques like decision trees and neural networks. It provides examples of software for data mining, demonstrating SAS Enterprise Miner, R with the Rattle package, and Weka.

Copyright
© Attribution Non-Commercial (BY-NC)
Data Mining at UVA

New Horizons in Teaching and Learning Conference
May 21-24, 2007
Kathy Gerber, ITC Research Computing
[email protected]
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data (e.g., GEOSS)
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is
not readily evident
• Human analysts may take weeks to discover useful
information
• Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What is Data Mining?
• Many Definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Summary of SAS DM Process - SEMMA
• Sample the data by creating one or more data tables.
The sample should be large enough to contain the
significant information, yet small enough to process.
• Explore the data by searching for anticipated
relationships, unanticipated trends, and anomalies in
order to gain understanding and ideas.
• Modify the data by creating, selecting, and transforming
the variables to focus the model selection process.
• Model the data by using the analytical tools to search for
a combination of the data that reliably predicts a desired
outcome.
• Assess the data by evaluating the usefulness and
reliability of the findings from the data mining process.
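
The five SEMMA steps can be read as a small pipeline. The sketch below is only an illustration in plain Python, not SAS Enterprise Miner code: the table, the column names `x` and `y`, and the one-variable cut-off model are all hypothetical choices made to keep the example short.

```python
import random
import statistics

def sample(table, n, seed=0):
    """Sample: draw n rows at random, without replacement."""
    return random.Random(seed).sample(table, min(n, len(table)))

def explore(rows):
    """Explore: summarize the numeric input to see its range and center."""
    values = [r["x"] for r in rows]
    return {"min": min(values), "max": max(values), "mean": statistics.mean(values)}

def modify(rows, summary):
    """Modify: add a transformed variable, x rescaled to the range [0, 1]."""
    span = (summary["max"] - summary["min"]) or 1.0
    return [{**r, "x_scaled": (r["x"] - summary["min"]) / span} for r in rows]

def model(rows):
    """Model: pick the cut-off on x_scaled that misclassifies the fewest rows."""
    def errors(cut):
        return sum((r["x_scaled"] > cut) != r["y"] for r in rows)
    return min(sorted({r["x_scaled"] for r in rows}), key=errors)

def assess(cut, rows):
    """Assess: fraction of rows that the cut-off rule classifies correctly."""
    return sum((r["x_scaled"] > cut) == r["y"] for r in rows) / len(rows)

# Hypothetical table: the outcome y is True exactly when x exceeds 50.
table = [{"x": x, "y": x > 50} for x in range(0, 100, 5)]

rows = sample(table, 12)
summary = explore(rows)
prepared = modify(rows, summary)
cut = model(prepared)
score = assess(cut, prepared)
print(summary, cut, score)
```

Here the rule is assessed on the same rows it was fitted to, which keeps the sketch short but overstates reliability; a real assessment would use rows held back from the modeling step.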
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
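
As a minimal sketch of the training/test division described above, the following Python splits a set of records, builds a model on the training part, and measures accuracy on the held-out part. The records, the 30% test fraction, and the deliberately trivial majority-class model are assumptions chosen for illustration; any classifier could take its place.

```python
import random
from collections import Counter

def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(training_set):
    """Build a trivial model that always predicts the most common class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    return sum(model(attrs) == label for attrs, label in test_set) / len(test_set)

# Hypothetical records, each a pair (attributes, class).
records = [({"size": s}, "big" if s > 6 else "small") for s in range(10)]

training_set, test_set = split_records(records)
model = fit_majority_class(training_set)
test_accuracy = accuracy(model, test_set)
print(len(training_set), len(test_set), test_accuracy)
```

The model never sees the test records while it is being built, so the reported accuracy estimates how it would do on previously unseen records.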
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Illustrating Classification Task

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: Training Set → Learning algorithm (Induction) → Learn Model → Model → Apply Model (Deduction) → Test Set]
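
The induction and deduction steps can be sketched with the ten training records and five test records from the slide. The learner below is a small greedy decision tree that picks splits by Gini impurity; this is one possible learning algorithm chosen for illustration, not necessarily the one used in the presentation, and the function names are hypothetical.

```python
from collections import Counter

# Training set from the slide: (Attrib1, Attrib2, Attrib3 in thousands, Class)
TRAIN = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"), ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"), ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"), ("No", "Small", 90, "Yes"),
]
# Test set from the slide: the class is unknown.
TEST = [
    ("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
    ("No", "Small", 95), ("No", "Large", 67),
]

def gini(rows):
    """Gini impurity of the class labels in a group of rows."""
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def passes(row, col, value):
    """Numeric attributes are tested with <=, categorical ones with ==."""
    if isinstance(value, (int, float)):
        return row[col] <= value
    return row[col] == value

def learn(rows):
    """Induction: grow a tree until every leaf holds a single class."""
    labels = Counter(r[-1] for r in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1:
        return majority
    best = None
    for col in range(len(rows[0]) - 1):
        for value in sorted({r[col] for r in rows}):
            left = [r for r in rows if passes(r, col, value)]
            right = [r for r in rows if not passes(r, col, value)]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, col, value, left, right)
    if best is None:
        return majority
    _, col, value, left, right = best
    return (col, value, learn(left), learn(right))

def apply_model(model, row):
    """Deduction: follow the tests down the tree to a leaf label."""
    while isinstance(model, tuple):
        col, value, if_true, if_false = model
        model = if_true if passes(row, col, value) else if_false
    return model

model = learn(TRAIN)
predictions = [apply_model(model, row) for row in TEST]
print(predictions)
```

A tree grown until every leaf is pure reproduces the training classes exactly, which says nothing about its accuracy on the test set; in practice the tree would be pruned or limited in depth.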
Software Demonstrations
• SAS Enterprise Miner
• R Rattle
• Weka
SAS Enterprise Miner
Screenshot – EM Tutorial Workflow
R Rattle
• Install R 2.5.0
• > source("http://www.ggobi.org/downloads/install.r")
• > install("rattle", dep=TRUE)
Weka
Slide Credits
• R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
• SAS Enterprise Miner tutorial
• Eibe Frank, Machine Learning with Weka
• Tan, Steinbach, Kumar, “Introduction to Data Mining”


Versions and References for Software Used Today
• SAS 9.1.3 EAS with Enterprise Miner
– UVA licensed software
– http://rescomp.virginia.edu
• R 2.5.0 with Rattle (open source)
– Open source
• Weka (open source)
– Ian Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
• Not demonstrated, but see also Insightful Miner and Orange