Data Mining at UVA: New Horizons in Teaching and Learning Conference
The Data Gap
[Figure: total new disk (TB) since 1995 and number of analysts, by year, 1995 to 1999; vertical axis from 0 to 3,500,000]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What is Data Mining?
• Many Definitions
– Non-trivial extraction of implicit, previously
unknown and potentially useful information from
data
– Exploration and analysis, by automatic or
semi-automatic means, of large quantities of data
in order to discover meaningful patterns
Summary of SAS DM Process: SEMMA
• Sample the data by creating one or more data tables.
The sample should be large enough to contain the
significant information, yet small enough to process.
• Explore the data by searching for anticipated
relationships, unanticipated trends, and anomalies in
order to gain understanding and ideas.
• Modify the data by creating, selecting, and transforming
the variables to focus the model selection process.
• Model the data by using the analytical tools to search for
a combination of the data that reliably predicts a desired
outcome.
• Assess the data by evaluating the usefulness and
reliability of the findings from the data mining process.
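The five SEMMA steps can be walked through in a few lines of code. The sketch below is only an illustration in plain Python, not the SAS tooling the slide refers to. The function name `semma_sketch`, the synthetic data, and the rule that generates the outcome are all assumptions made up for the example.

```python
import random
import statistics


def semma_sketch(seed=0):
    """Walk through Sample, Explore, Modify, Model, Assess on made-up data."""
    rng = random.Random(seed)

    # Hypothetical population: one numeric variable x and a binary outcome y.
    # The rule "y = 1 when x > 60" is invented purely for illustration.
    xs = [rng.uniform(0, 100) for _ in range(10_000)]
    population = [(x, int(x > 60)) for x in xs]

    # Sample: a subset large enough to be informative, small enough to process.
    sample = rng.sample(population, 500)

    # Explore: simple summaries to get a feel for the data.
    mean_x = statistics.mean(x for x, _ in sample)
    positive_rate = statistics.mean(y for _, y in sample)

    # Modify: rescale x to the range [0, 1] before modeling.
    lo = min(x for x, _ in sample)
    hi = max(x for x, _ in sample)
    scaled = [((x - lo) / (hi - lo), y) for x, y in sample]

    # Model: choose the cut-off on scaled x that best predicts y
    # on the training part of the sample.
    train, holdout = scaled[:350], scaled[350:]

    def accuracy(threshold, rows):
        return sum(int(x > threshold) == y for x, y in rows) / len(rows)

    candidates = [i / 100 for i in range(101)]
    best = max(candidates, key=lambda t: accuracy(t, train))

    # Assess: check reliability on rows the model has not seen.
    return {
        "mean_x": mean_x,
        "positive_rate": positive_rate,
        "threshold": best,
        "holdout_accuracy": accuracy(best, holdout),
    }


if __name__ == "__main__":
    print(semma_sketch())
```

Holding back part of the sample for the Assess step is one common way to estimate reliability; it is a choice made for this sketch, not something SEMMA itself prescribes.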
What is (not) Data Mining?
What is not Data Mining?
What is Data Mining?
[Figure labels: Training Set; Apply Model; Test Set]
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
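Applying a model to test rows whose Class is unknown (Tid 11 and 12) can be sketched as follows. The training rows are not recoverable from the slide, so `TRAINING_SET` below is hypothetical and exists only to make the example run. The 1-nearest-neighbour rule and the `distance` function are likewise assumptions, chosen because they are short, not because the slide names them.

```python
# Hypothetical training rows: (Tid, Attrib1, Attrib2, Attrib3 in K, Class).
# These values are invented for illustration; they are not from the slide.
TRAINING_SET = [
    (1, "Yes", "Large", 125, "No"),
    (2, "No", "Medium", 100, "No"),
    (3, "No", "Small", 70, "No"),
    (4, "No", "Large", 95, "Yes"),
    (5, "No", "Small", 85, "Yes"),
    (6, "Yes", "Medium", 120, "No"),
]

# Test rows as shown on the slide; the Class column is unknown ("?").
TEST_SET = [
    (11, "No", "Small", 55),
    (12, "Yes", "Medium", 80),
]


def distance(a, b):
    """Count categorical mismatches and add a scaled gap in Attrib3."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    return mismatches + abs(a[2] - b[2]) / 100


def apply_model(training_set, test_set):
    """Give each test row the Class of its nearest training row."""
    predictions = {}
    for tid, *attrs in test_set:
        nearest = min(training_set, key=lambda row: distance(attrs, row[1:4]))
        predictions[tid] = nearest[4]
    return predictions


if __name__ == "__main__":
    for tid, label in apply_model(TRAINING_SET, TEST_SET).items():
        print(tid, label)
```

The predicted labels depend entirely on the invented training rows, so they say nothing about the classes the original slide intended for Tid 11 and 12.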
Software Demonstrations