D2K Tutorial
Supercomputing 2003
Loretta Auvil
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
217.265.8021
[email protected]
Outline
What is It?
[Chart: effort (%) spent in each phase of the data mining process: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation]
• D2K Infrastructure
D2K API, data flow environment,
distributed computing framework
and runtime system
• D2K Modules
Computational units written in Java
that follow the D2K API
• D2K Itineraries
Modules that are connected to form
an application
• D2K Toolkit
User interface for specifying and executing itineraries; provides the rapid application development environment
• D2K-Driven Applications
Applications that use D2K modules,
but do not need to run in the D2K
Toolkit
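The slides do not show the D2K API itself. Purely as a hypothetical illustration of the module/itinerary idea (the ToyModule interface, class names, and wiring below are invented for this sketch and are not the actual D2K API):

import java.util.List;

// Hypothetical sketch only: NOT the real D2K API.
// A "module" is a computational unit; an "itinerary" wires modules together.
interface ToyModule {
    Object fire(Object input);               // consume one input, produce one output
}

class Binner implements ToyModule {          // a toy data-prep module
    public Object fire(Object input) {
        double x = (Double) input;
        return (int) Math.floor(x * 10);     // crude equal-width binning of [0,1)
    }
}

class Printer implements ToyModule {         // a toy sink module
    public Object fire(Object input) {
        System.out.println("bin = " + input);
        return null;
    }
}

public class ToyItinerary {
    public static void main(String[] args) {
        // "Itinerary": modules connected so one's output feeds the next's input.
        List<ToyModule> pipeline = List.of(new Binner(), new Printer());
        Object data = 0.37;
        for (ToyModule m : pipeline) data = m.fire(data);   // prints: bin = 3
    }
}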
1. Workspace
2. Resource Panel
3. Modules
4. Models
5. Itineraries
6. Visualizations
7. Generated Visualizations
8. Generated Models
9. Component Information
10. Toolbar
11. Console
Data Prep Module: Performs functions to select, clean, or transform the data
• Binning, Normalizing, Feature Selection, etc.
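As a generic sketch of two of the transforms named above (equal-width binning and min-max normalization; an illustration only, not the D2K modules' implementation):

import java.util.Arrays;

public class DataPrep {
    // Equal-width binning: map each value to one of `bins` intervals.
    static int[] bin(double[] x, int bins) {
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();
        double width = (max - min) / bins;
        int[] out = new int[x.length];
        for (int i = 0; i < x.length; i++)
            out[i] = Math.min((int) ((x[i] - min) / width), bins - 1); // max value falls in last bin
        return out;
    }

    // Min-max normalization to [0, 1].
    static double[] normalize(double[] x) {
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++)
            out[i] = (x[i] - min) / (max - min);
        return out;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.5, 4.0, 10.0};
        System.out.println(Arrays.toString(bin(x, 3)));       // [0, 0, 1, 2]
        System.out.println(Arrays.toString(normalize(x)));    // [0.0, 0.166..., 0.333..., 1.0]
    }
}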
Properties Symbol
If a “P” is shown in the lower left
corner of the module, then the
module has properties that can be
set before execution
Resource Panel
The area to the left of the Workspace that contains the components necessary for data analysis
• Modules
• Models
• Itineraries
• Visualizations
• Component Information
• Shows detailed information about components of D2K
• Shows module information, inputs, outputs, and property descriptions
• Shows itinerary annotation
• Generated Visualizations
• Shows visualizations generated during this session
• Provides ability to save these visualizations for later use
• Generated Models
• Shows models generated during this session
• Provides ability to save these models for later use
• Preferences
• Written to a file called “d2k.props”
• Set up automatically the first time D2K is installed
• Changed via Edit menu… Preferences…
• Some changes require a restart of D2K
• Check the User Manual for more details (available online)
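For those unfamiliar with Java properties files, d2k.props can be read like any java.util.Properties file. A minimal sketch of the mechanism (the key name "workspace.dir" is invented for illustration and is not a documented D2K preference):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ReadProps {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("d2k.props")) {
            props.load(in);                      // key=value lines, # comments
        }
        // "workspace.dir" is a hypothetical example key, not a real D2K one.
        System.out.println(props.getProperty("workspace.dir", "<unset>"));
    }
}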
CLASSIFICATION
NAÏVE BAYESIAN
• Naïve assumption: feature independence
• P(x_i | C) is estimated as the relative frequency of examples having value x_i for feature i in class C
• Computationally easy! (a sketch follows)
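A minimal sketch of this estimate-and-multiply scheme over discrete features (an illustration only, not the D2K module; the +1 smoothing and the toy weather data are additions the slide does not mention):

import java.util.*;

public class NaiveBayes {
    // counts.get(c).get(i).get(v) = number of class-c examples whose
    // feature i has value v; classCounts.get(c) = number of class-c examples
    Map<String, List<Map<String, Integer>>> counts = new HashMap<>();
    Map<String, Integer> classCounts = new HashMap<>();
    int numFeatures;

    void train(String[][] X, String[] y) {
        numFeatures = X[0].length;
        for (int n = 0; n < X.length; n++) {
            String c = y[n];
            classCounts.merge(c, 1, Integer::sum);
            counts.computeIfAbsent(c, k -> {
                List<Map<String, Integer>> perFeature = new ArrayList<>();
                for (int i = 0; i < numFeatures; i++) perFeature.add(new HashMap<>());
                return perFeature;
            });
            for (int i = 0; i < numFeatures; i++)
                counts.get(c).get(i).merge(X[n][i], 1, Integer::sum);
        }
    }

    String classify(String[] x) {
        int total = classCounts.values().stream().mapToInt(Integer::intValue).sum();
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (String c : classCounts.keySet()) {
            int nc = classCounts.get(c);
            double logp = Math.log((double) nc / total);       // prior P(C)
            for (int i = 0; i < numFeatures; i++) {
                int count = counts.get(c).get(i).getOrDefault(x[i], 0);
                // relative frequency of x_i in class C, with a crude +1
                // smoothing so an unseen value does not zero out the product
                logp += Math.log((count + 1.0) / (nc + 2.0));
            }
            if (logp > bestLog) { bestLog = logp; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        String[][] X = {{"sunny","hot"}, {"sunny","mild"}, {"rainy","mild"}, {"rainy","hot"}};
        String[] y = {"no", "no", "yes", "yes"};
        NaiveBayes nb = new NaiveBayes();
        nb.train(X, y);
        System.out.println(nb.classify(new String[]{"rainy", "mild"}));  // prints: yes
    }
}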
What if scenarios…
• Click on petal-width of 1.3:1.9
• Now the probability of Iris-versicolor is 66.37%
What if scenarios… continuing the conditional probability calculations
• Click on petal-length of 3.95:5.32
• Click on sepal-length of 5.28:6.15
• Now the probability of Iris-versicolor is 94.99%
CLASSIFICATION
Decision Trees
ASSOCIATION RULES
• Given
• Database of transactions
• Each transaction contains a set of items
• Find all rules X->Y that correlate the presence of one set of items X with another set of items Y
• Example: When a customer buys bread and butter, they buy milk 85% of the time
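For reference, the two standard measures behind such rules (implied above but not defined on these slides):
• support(X -> Y): the fraction of transactions that contain every item in X and Y
• confidence(X -> Y) = support(X ∪ Y) / support(X), i.e., P(Y | X)
The "85% of the time" in the bread-and-butter example is the rule's confidence.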
• While association rules are easy to understand, they are not always useful
Useful
On Fridays, convenience store customers often purchase diapers and beer together
Trivial
Customers who purchase maintenance agreements are very likely to purchase large appliances
Inexplicable
When a new Super Store opens, one of the most commonly sold items is light bulbs
Transaction ID    Items
1                 { 1, 2, 3 }
2                 { 1, 3 }
3                 { 1, 4 }
4                 { 2, 5, 6 }
For minimum support = 50% (2 transactions) and minimum confidence = 50%
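A compact level-wise (Apriori-style) search run on exactly this four-transaction database (a sketch only, not the D2K association-rule module) finds the frequent itemsets {1} 75%, {2} 50%, {3} 50%, and {1, 3} 50%:

import java.util.*;

public class Apriori {
    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
            Set.of(1, 2, 3), Set.of(1, 3), Set.of(1, 4), Set.of(2, 5, 6));
        int minCount = 2;  // minimum support of 50% over 4 transactions

        // Level 1 candidates: every single item that occurs in the database.
        Map<Set<Integer>, Integer> frequent = new LinkedHashMap<>();
        Set<Integer> items = new TreeSet<>();
        db.forEach(items::addAll);
        List<Set<Integer>> level = new ArrayList<>();
        for (int i : items) level.add(Set.of(i));

        while (!level.isEmpty()) {
            // Count each candidate's support; keep the frequent ones.
            List<Set<Integer>> seeds = new ArrayList<>();
            for (Set<Integer> cand : level) {
                int count = 0;
                for (Set<Integer> t : db) if (t.containsAll(cand)) count++;
                if (count >= minCount) {
                    frequent.put(cand, count);
                    seeds.add(cand);
                }
            }
            // Join frequent k-itemsets to build (k+1)-item candidates.
            Set<Set<Integer>> next = new LinkedHashSet<>();
            for (int a = 0; a < seeds.size(); a++)
                for (int b = a + 1; b < seeds.size(); b++) {
                    Set<Integer> u = new TreeSet<>(seeds.get(a));
                    u.addAll(seeds.get(b));
                    if (u.size() == seeds.get(a).size() + 1) next.add(u);
                }
            level = new ArrayList<>(next);
        }
        frequent.forEach((s, c) ->
            System.out.println(s + "  " + (100 * c / db.size()) + "%"));
        // Prints [1] 75%, [2] 50%, [3] 50%, [1, 3] 50%
    }
}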
• Find all rules that have "Brats" in the antecedent and "mustard" in the consequent
• These rules may help determine which additional items must be sold together to make it highly likely that mustard will also be sold
Strengths
• It produces easy-to-understand results
• It supports undirected data mining
• It works on variable-length data
• Rules are relatively easy to compute
Weaknesses
• Its cost grows exponentially with the number of items
• It is difficult to determine the optimal number of items
• It discounts rare items
• It provides limited support for data attributes
• It produces many rules
• For large numbers of attribute-value combinations, considerable CPU and memory resources are consumed
[Figure: Partial Product Taxonomy, a tree of product categories such as Frozen Foods and General Foods]
CLUSTERING
• KMeans clustering
• Creates a sample set of Number of Clusters rows chosen from an input table of examples, used as the initial cluster centers
• These initial clusters undergo a series of assignment/refinement iterations, resulting in a final cluster model (see the sketch below)
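A minimal sketch of that assignment/refinement loop (an illustration only, not the D2K KMeans module; the data, k, and iteration cap are made up):

import java.util.*;

public class KMeans {
    public static void main(String[] args) {
        double[][] X = {{1,1},{1.2,0.8},{0.9,1.1},{8,8},{8.2,7.9},{7.8,8.1}};
        int k = 2;
        Random rnd = new Random(42);

        // Initial centers: k distinct rows sampled from the table.
        double[][] centers = new double[k][];
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < X.length; i++) idx.add(i);
        Collections.shuffle(idx, rnd);
        for (int c = 0; c < k; c++) centers[c] = X[idx.get(c)].clone();

        int[] assign = new int[X.length];
        for (int iter = 0; iter < 20; iter++) {
            // Assignment: each example goes to its nearest center.
            for (int i = 0; i < X.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(X[i], centers[c]) < dist(X[i], centers[best])) best = c;
                assign[i] = best;
            }
            // Refinement: each center moves to the mean of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[X[0].length];
                int n = 0;
                for (int i = 0; i < X.length; i++)
                    if (assign[i] == c) { n++; for (int d = 0; d < sum.length; d++) sum[d] += X[i][d]; }
                if (n > 0) for (int d = 0; d < sum.length; d++) centers[c][d] = sum[d] / n;
            }
        }
        System.out.println(Arrays.toString(assign));   // e.g. [0, 0, 0, 1, 1, 1]
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;  // squared Euclidean distance (monotone, so fine for argmin)
    }
}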
• Buckshot clustering
• Creates a sample of size Sqrt(Number of Clusters * Number of Examples), chosen at random from the table of examples
• This sampling is sent through the hierarchical agglomerative clustering module to form Number of Clusters clusters. These clusters' centroids are used as the initial "means" for the cluster assignment module. The assignment module, once it has made refinements, outputs the final Cluster Model
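A sketch of just the Buckshot seeding step (the numbers are made up, and the hierarchical agglomerative clustering itself is omitted):

import java.util.*;

public class BuckshotSample {
    public static void main(String[] args) {
        int n = 10_000, k = 25;
        int sampleSize = (int) Math.ceil(Math.sqrt((double) k * n));  // sqrt(k*n) = 500
        System.out.println("sample size = " + sampleSize);

        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) rows.add(i);
        Collections.shuffle(rows, new Random(7));
        List<Integer> sample = rows.subList(0, sampleSize);
        // `sample` holds row indices; HAC on these rows would yield k centroids,
        // which then seed the same assignment/refinement loop shown for KMeans.
        System.out.println("first few sampled rows: " + sample.subList(0, 5));
    }
}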
• Coverage clustering
• Creates a sample set from the input table containing approximately the minimum number of samples needed such that every example in the input table is within distance Distance Threshold (% of Maximum) of at least one example in the sample set
• This sampling is sent through the hierarchical agglomerative clustering module to form Number of Clusters clusters. These clusters' centroids are used as the initial "means" for the cluster assignment module. The assignment module, once it has made refinements, outputs the final Cluster Model
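A greedy sketch of that coverage criterion in one dimension (an illustration only; the data and 10% threshold are made up, and a single greedy pass yields an approximately minimal sample):

import java.util.*;

public class CoverageSample {
    public static void main(String[] args) {
        double[] x = {0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0};  // 1-D examples
        double thresholdPct = 0.10;                          // 10% of maximum

        double maxDist = 0;
        for (double a : x) for (double b : x) maxDist = Math.max(maxDist, Math.abs(a - b));
        double threshold = thresholdPct * maxDist;           // = 1.0 here

        List<Double> sample = new ArrayList<>();
        for (double v : x) {
            boolean covered = false;
            for (double s : sample) if (Math.abs(v - s) <= threshold) covered = true;
            if (!covered) sample.add(v);                     // v becomes a new sample point
        }
        System.out.println(sample);                          // [0.0, 5.0, 9.9]
    }
}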
• Fractionation
• Sorts the initial examples (converted to clusters) by a key attribute denoted by Sort Attribute
• The set of sorted clusters is then segmented into equal partitions of size maxPartitionsize
• Each of these partitions is then passed through the agglomerative clusterer to produce numberOfClusters clusters
• The clusters from all partitions are gathered together and the entire process is repeated until only Number of Clusters clusters remain. The sorting step encourages similar clusters to fall into the same partitions
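A sketch of that outer loop on 1-D values (an illustration only, not the D2K module; a simple closest-pair centroid merge stands in for the agglomerative clusterer, and the data are made up):

import java.util.*;

public class Fractionation {
    record Cluster(double centroid, int weight) {}

    public static void main(String[] args) {
        double[] values = {9.1, 1.0, 8.8, 1.2, 5.0, 9.0, 0.9, 5.2, 4.8, 1.1};
        int k = 3, maxPartitionSize = 4;

        // Start: every example is a singleton cluster; the value itself
        // plays the role of the Sort Attribute.
        List<Cluster> clusters = new ArrayList<>();
        for (double v : values) clusters.add(new Cluster(v, 1));

        while (clusters.size() > k) {
            clusters.sort(Comparator.comparingDouble(Cluster::centroid));
            List<Cluster> gathered = new ArrayList<>();
            for (int start = 0; start < clusters.size(); start += maxPartitionSize) {
                List<Cluster> part = new ArrayList<>(
                    clusters.subList(start, Math.min(start + maxPartitionSize, clusters.size())));
                // Agglomerate this partition: merge the two closest
                // (adjacent, since sorted) clusters until at most k remain.
                while (part.size() > k) {
                    int bi = 0;
                    for (int i = 1; i < part.size() - 1; i++)
                        if (part.get(i + 1).centroid() - part.get(i).centroid()
                            < part.get(bi + 1).centroid() - part.get(bi).centroid()) bi = i;
                    Cluster a = part.get(bi), b = part.remove(bi + 1);
                    int w = a.weight() + b.weight();
                    part.set(bi, new Cluster(
                        (a.centroid() * a.weight() + b.centroid() * b.weight()) / w, w));
                }
                gathered.addAll(part);
            }
            clusters = gathered;   // repeat until only k clusters remain
        }
        clusters.forEach(System.out::println);   // three clusters near 1.0, 5.0, 9.0
    }
}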
PARALLEL COORDINATES
SCATTERPLOT
• Provides a step-by-step interface to guide the user in data analysis
• Uses the same D2K modules
• Provides a way to capture different experiments (streams)