D2K Tutorial
Supercomputing 2003
Loretta Auvil
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
217.265.8021
[email protected]
Outline
What is It?
[Chart: effort (%) spent in each phase of the data mining process: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation]
• D2K Infrastructure
D2K API, data flow environment,
distributed computing framework
and runtime system
• D2K Modules
Computational units written in Java
that follow the D2K API
• D2K Itineraries
Modules that are connected to form
an application
• D2K Toolkit
User interface for specifying and executing itineraries; provides the rapid application development environment
• D2K-Driven Applications
Applications that use D2K modules,
but do not need to run in the D2K
Toolkit
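The slides do not show the D2K API itself. Purely as a hypothetical illustration of the module/itinerary idea (the ToyModule interface, class names, and wiring below are invented for this sketch and are not the actual D2K API):

import java.util.List;

// Hypothetical sketch only: NOT the real D2K API.
// A "module" is a computational unit; an "itinerary" wires modules together.
interface ToyModule {
    Object fire(Object input);               // consume one input, produce one output
}

class Binner implements ToyModule {          // a toy data-prep module
    public Object fire(Object input) {
        double x = (Double) input;
        return (int) Math.floor(x * 10);     // crude equal-width binning of [0,1)
    }
}

class Printer implements ToyModule {         // a toy sink module
    public Object fire(Object input) {
        System.out.println("bin = " + input);
        return null;
    }
}

public class ToyItinerary {
    public static void main(String[] args) {
        // "Itinerary": modules connected so one's output feeds the next's input.
        List<ToyModule> pipeline = List.of(new Binner(), new Printer());
        Object data = 0.37;
        for (ToyModule m : pipeline) data = m.fire(data);   // prints: bin = 3
    }
}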
1. Workspace
2. Resource Panel
3. Modules
4. Models
5. Itineraries
6. Visualizations
7. Generated Visualizations
8. Generated Models
9. Component Information
10. Toolbar
11. Console
Data Prep Module: Performs functions to select, clean, or transform the data
• Binning, Normalizing, Feature Selection, etc.
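As a generic sketch of two of the transforms named above (equal-width binning and min-max normalization; an illustration only, not the D2K modules' implementation):

import java.util.Arrays;

public class DataPrep {
    // Equal-width binning: map each value to one of `bins` intervals.
    static int[] bin(double[] x, int bins) {
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();
        double width = (max - min) / bins;
        int[] out = new int[x.length];
        for (int i = 0; i < x.length; i++)
            out[i] = Math.min((int) ((x[i] - min) / width), bins - 1); // max value falls in last bin
        return out;
    }

    // Min-max normalization to [0, 1].
    static double[] normalize(double[] x) {
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++)
            out[i] = (x[i] - min) / (max - min);
        return out;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.5, 4.0, 10.0};
        System.out.println(Arrays.toString(bin(x, 3)));       // [0, 0, 1, 2]
        System.out.println(Arrays.toString(normalize(x)));    // [0.0, 0.166..., 0.333..., 1.0]
    }
}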
Properties Symbol
If a “P” is shown in the lower left
corner of the module, then the
module has properties that can be
set before execution
Resource Panel
The area to the left of the Workspace that contains the components necessary for data analysis
• Modules
• Models
• Itineraries
• Visualizations
• Component Information
• Shows detailed information about components of D2K
• Shows module information, inputs, outputs, and property descriptions
• Shows itinerary annotation
• Generated Visualizations
• Shows visualizations generated during this session
• Provides ability to save these visualizations for later use
• Generated Models
• Shows models generated during this session
• Provides ability to save these models for later use
• Preferences
• Written to a file called “d2k.props”
• Set up automatically the first time D2K is installed
• Changed via Edit menu… Preferences…
• Some changes require a restart of D2K
• Check the User Manual for more details (available online)
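For those unfamiliar with Java properties files, d2k.props can be read like any java.util.Properties file. A minimal sketch of the mechanism (the key name "workspace.dir" is invented for illustration and is not a documented D2K preference):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ReadProps {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("d2k.props")) {
            props.load(in);                      // key=value lines, # comments
        }
        // "workspace.dir" is a hypothetical example key, not a real D2K one.
        System.out.println(props.getProperty("workspace.dir", "<unset>"));
    }
}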
CLASSIFICATION
NAÏVE BAYESIAN
• Naïve assumption: feature independence
• P(x_i | C) is estimated as the relative frequency of examples having value x_i for feature i in class C
• Computationally easy! (a sketch follows)
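A minimal sketch of this estimate-and-multiply scheme over discrete features (an illustration only, not the D2K module; the +1 smoothing and the toy weather data are additions the slide does not mention):

import java.util.*;

public class NaiveBayes {
    // counts.get(c).get(i).get(v) = number of class-c examples whose
    // feature i has value v; classCounts.get(c) = number of class-c examples
    Map<String, List<Map<String, Integer>>> counts = new HashMap<>();
    Map<String, Integer> classCounts = new HashMap<>();
    int numFeatures;

    void train(String[][] X, String[] y) {
        numFeatures = X[0].length;
        for (int n = 0; n < X.length; n++) {
            String c = y[n];
            classCounts.merge(c, 1, Integer::sum);
            counts.computeIfAbsent(c, k -> {
                List<Map<String, Integer>> perFeature = new ArrayList<>();
                for (int i = 0; i < numFeatures; i++) perFeature.add(new HashMap<>());
                return perFeature;
            });
            for (int i = 0; i < numFeatures; i++)
                counts.get(c).get(i).merge(X[n][i], 1, Integer::sum);
        }
    }

    String classify(String[] x) {
        int total = classCounts.values().stream().mapToInt(Integer::intValue).sum();
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (String c : classCounts.keySet()) {
            int nc = classCounts.get(c);
            double logp = Math.log((double) nc / total);       // prior P(C)
            for (int i = 0; i < numFeatures; i++) {
                int count = counts.get(c).get(i).getOrDefault(x[i], 0);
                // relative frequency of x_i in class C, with a crude +1
                // smoothing so an unseen value does not zero out the product
                logp += Math.log((count + 1.0) / (nc + 2.0));
            }
            if (logp > bestLog) { bestLog = logp; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        String[][] X = {{"sunny","hot"}, {"sunny","mild"}, {"rainy","mild"}, {"rainy","hot"}};
        String[] y = {"no", "no", "yes", "yes"};
        NaiveBayes nb = new NaiveBayes();
        nb.train(X, y);
        System.out.println(nb.classify(new String[]{"rainy", "mild"}));  // prints: yes
    }
}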
What if scenarios…
• Click on petal-width of 1.3:1.9
• Now the probability of Iris-versicolor is 66.37%
What if scenarios… continuing the conditional probability calculations
• Click on petal-length of 3.95:5.32
• Click on sepal-length of 5.28:6.15
• Now the probability of Iris-versicolor is 94.99%
CLASSIFICATION
Decision Trees
ASSOCIATION RULES
• Given
• Database of transactions
• Each transaction contains a set of items
• Find all rules X->Y that correlate the presence of one set of items X with another set of items Y
• Example: When a customer buys bread and butter, they buy milk 85% of the time
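For reference, the two standard measures behind such rules (implied above but not defined on these slides):
• support(X -> Y): the fraction of transactions that contain every item in X and Y
• confidence(X -> Y) = support(X ∪ Y) / support(X), i.e., P(Y | X)
The "85% of the time" in the bread-and-butter example is the rule's confidence.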
• While association rules are easy to understand, they are not always useful
Useful
On Fridays, convenience store customers often purchase diapers and beer together
Trivial
Customers who purchase maintenance agreements are very likely to purchase large appliances
Inexplicable
When a new Super Store opens, one of the most commonly sold items is light bulbs
Transaction ID    Items
1                 { 1, 2, 3 }
2                 { 1, 3 }
3                 { 1, 4 }
4                 { 2, 5, 6 }
For minimum support = 50% (2 transactions) and minimum confidence = 50%
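A compact level-wise (Apriori-style) search run on exactly this four-transaction database (a sketch only, not the D2K association-rule module) finds the frequent itemsets {1} 75%, {2} 50%, {3} 50%, and {1, 3} 50%:

import java.util.*;

public class Apriori {
    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
            Set.of(1, 2, 3), Set.of(1, 3), Set.of(1, 4), Set.of(2, 5, 6));
        int minCount = 2;  // minimum support of 50% over 4 transactions

        // Level 1 candidates: every single item that occurs in the database.
        Map<Set<Integer>, Integer> frequent = new LinkedHashMap<>();
        Set<Integer> items = new TreeSet<>();
        db.forEach(items::addAll);
        List<Set<Integer>> level = new ArrayList<>();
        for (int i : items) level.add(Set.of(i));

        while (!level.isEmpty()) {
            // Count each candidate's support; keep the frequent ones.
            List<Set<Integer>> seeds = new ArrayList<>();
            for (Set<Integer> cand : level) {
                int count = 0;
                for (Set<Integer> t : db) if (t.containsAll(cand)) count++;
                if (count >= minCount) {
                    frequent.put(cand, count);
                    seeds.add(cand);
                }
            }
            // Join frequent k-itemsets to build (k+1)-item candidates.
            Set<Set<Integer>> next = new LinkedHashSet<>();
            for (int a = 0; a < seeds.size(); a++)
                for (int b = a + 1; b < seeds.size(); b++) {
                    Set<Integer> u = new TreeSet<>(seeds.get(a));
                    u.addAll(seeds.get(b));
                    if (u.size() == seeds.get(a).size() + 1) next.add(u);
                }
            level = new ArrayList<>(next);
        }
        frequent.forEach((s, c) ->
            System.out.println(s + "  " + (100 * c / db.size()) + "%"));
        // Prints [1] 75%, [2] 50%, [3] 50%, [1, 3] 50%
    }
}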
• Find all rules that have "Brats" in the antecedent and "mustard" in the consequent
• These rules may help determine which additional items must be sold together to make it highly likely that mustard will also be sold
Strengths
• It produces easy-to-understand results
• It supports undirected data mining
• It works on variable-length data
• Rules are relatively easy to compute
Weaknesses
• Its cost grows exponentially with the number of items
• It is difficult to determine the optimal number of items
• It discounts rare items
• It provides limited support for data attributes
• It produces many rules
• For large numbers of attribute-value combinations, considerable CPU and memory resources are consumed
[Figure: Partial Product Taxonomy, a tree of product categories such as Frozen Foods and General Foods]
CLUSTERING
• KMeans clustering
• Creates a sample set of Number of Clusters rows chosen from an input table of examples, used as the initial cluster centers
• These initial clusters undergo a series of assignment/refinement iterations, resulting in a final cluster model (see the sketch below)
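A minimal sketch of that assignment/refinement loop (an illustration only, not the D2K KMeans module; the data, k, and iteration cap are made up):

import java.util.*;

public class KMeans {
    public static void main(String[] args) {
        double[][] X = {{1,1},{1.2,0.8},{0.9,1.1},{8,8},{8.2,7.9},{7.8,8.1}};
        int k = 2;
        Random rnd = new Random(42);

        // Initial centers: k distinct rows sampled from the table.
        double[][] centers = new double[k][];
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < X.length; i++) idx.add(i);
        Collections.shuffle(idx, rnd);
        for (int c = 0; c < k; c++) centers[c] = X[idx.get(c)].clone();

        int[] assign = new int[X.length];
        for (int iter = 0; iter < 20; iter++) {
            // Assignment: each example goes to its nearest center.
            for (int i = 0; i < X.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(X[i], centers[c]) < dist(X[i], centers[best])) best = c;
                assign[i] = best;
            }
            // Refinement: each center moves to the mean of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[X[0].length];
                int n = 0;
                for (int i = 0; i < X.length; i++)
                    if (assign[i] == c) { n++; for (int d = 0; d < sum.length; d++) sum[d] += X[i][d]; }
                if (n > 0) for (int d = 0; d < sum.length; d++) centers[c][d] = sum[d] / n;
            }
        }
        System.out.println(Arrays.toString(assign));   // e.g. [0, 0, 0, 1, 1, 1]
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;  // squared Euclidean distance (monotone, so fine for argmin)
    }
}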
• Buckshot clustering
• Creates a sample of size Sqrt(Number of Clusters * Number of Examples), chosen at random from the table of examples
• This sampling is sent through the hierarchical agglomerative clustering module to form Number of Clusters clusters. These clusters' centroids are used as the initial "means" for the cluster assignment module. The assignment module, once it has made refinements, outputs the final Cluster Model
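A sketch of just the Buckshot seeding step (the numbers are made up, and the hierarchical agglomerative clustering itself is omitted):

import java.util.*;

public class BuckshotSample {
    public static void main(String[] args) {
        int n = 10_000, k = 25;
        int sampleSize = (int) Math.ceil(Math.sqrt((double) k * n));  // sqrt(k*n) = 500
        System.out.println("sample size = " + sampleSize);

        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) rows.add(i);
        Collections.shuffle(rows, new Random(7));
        List<Integer> sample = rows.subList(0, sampleSize);
        // `sample` holds row indices; HAC on these rows would yield k centroids,
        // which then seed the same assignment/refinement loop shown for KMeans.
        System.out.println("first few sampled rows: " + sample.subList(0, 5));
    }
}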
• Coverage clustering
• Creates a sample set from the input table containing approximately the minimum number of samples needed such that every example in the input table is within distance Distance Threshold (% of Maximum) of at least one example in the sample set
• This sampling is sent through the hierarchical agglomerative clustering module to form Number of Clusters clusters. These clusters' centroids are used as the initial "means" for the cluster assignment module. The assignment module, once it has made refinements, outputs the final Cluster Model
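A greedy sketch of that coverage criterion in one dimension (an illustration only; the data and 10% threshold are made up, and a single greedy pass yields an approximately minimal sample):

import java.util.*;

public class CoverageSample {
    public static void main(String[] args) {
        double[] x = {0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0};  // 1-D examples
        double thresholdPct = 0.10;                          // 10% of maximum

        double maxDist = 0;
        for (double a : x) for (double b : x) maxDist = Math.max(maxDist, Math.abs(a - b));
        double threshold = thresholdPct * maxDist;           // = 1.0 here

        List<Double> sample = new ArrayList<>();
        for (double v : x) {
            boolean covered = false;
            for (double s : sample) if (Math.abs(v - s) <= threshold) covered = true;
            if (!covered) sample.add(v);                     // v becomes a new sample point
        }
        System.out.println(sample);                          // [0.0, 5.0, 9.9]
    }
}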
• Fractionation
• Sorts the initial examples (converted to clusters) by a key attribute denoted by Sort Attribute
• The set of sorted clusters is then segmented into equal partitions of size maxPartitionsize
• Each of these partitions is then passed through the agglomerative clusterer to produce numberOfClusters clusters
• The clusters from all partitions are gathered together and the entire process is repeated until only Number of Clusters clusters remain. The sorting step encourages similar clusters to fall into the same partitions
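A sketch of that outer loop on 1-D values (an illustration only, not the D2K module; a simple closest-pair centroid merge stands in for the agglomerative clusterer, and the data are made up):

import java.util.*;

public class Fractionation {
    record Cluster(double centroid, int weight) {}

    public static void main(String[] args) {
        double[] values = {9.1, 1.0, 8.8, 1.2, 5.0, 9.0, 0.9, 5.2, 4.8, 1.1};
        int k = 3, maxPartitionSize = 4;

        // Start: every example is a singleton cluster; the value itself
        // plays the role of the Sort Attribute.
        List<Cluster> clusters = new ArrayList<>();
        for (double v : values) clusters.add(new Cluster(v, 1));

        while (clusters.size() > k) {
            clusters.sort(Comparator.comparingDouble(Cluster::centroid));
            List<Cluster> gathered = new ArrayList<>();
            for (int start = 0; start < clusters.size(); start += maxPartitionSize) {
                List<Cluster> part = new ArrayList<>(
                    clusters.subList(start, Math.min(start + maxPartitionSize, clusters.size())));
                // Agglomerate this partition: merge the two closest
                // (adjacent, since sorted) clusters until at most k remain.
                while (part.size() > k) {
                    int bi = 0;
                    for (int i = 1; i < part.size() - 1; i++)
                        if (part.get(i + 1).centroid() - part.get(i).centroid()
                            < part.get(bi + 1).centroid() - part.get(bi).centroid()) bi = i;
                    Cluster a = part.get(bi), b = part.remove(bi + 1);
                    int w = a.weight() + b.weight();
                    part.set(bi, new Cluster(
                        (a.centroid() * a.weight() + b.centroid() * b.weight()) / w, w));
                }
                gathered.addAll(part);
            }
            clusters = gathered;   // repeat until only k clusters remain
        }
        clusters.forEach(System.out::println);   // three clusters near 1.0, 5.0, 9.0
    }
}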
PARALLEL COORDINATES
SCATTERPLOT
• Provides a step-by-step interface to guide the user in data analysis
• Uses the same D2K modules
• Provides a way to capture different experiments (streams)