Oracle Data Miner
October 2013
Denny Wong
Oracle Data Mining Technologies
10 Van de Graff Drive
Burlington, MA 01803
USA
[email protected]
Contents
Introduction
Importing the Workflow
  Create the User
  Load the Table
  Import the Workflow
Workflow Overview
  Modeling
  Scoring
  Deployment Use Case
Workflow Run
Generating Workflow Script Files
  Deploy Script Options
  Generate Script UI Wizard
  Script Files Specifications
    Master Script: <Workflow name>_Run.sql
    Cleanup Script: <Workflow name>_Drop.sql
    Workflow Image: <Workflow name>.png
    Node Script: <Node name>.sql
Running Workflow Script Files
  Variable Definitions
  Control Table
    Table Structure
    Column Descriptions
Scheduling Workflow Script Files
  SQL Developer
  Oracle Enterprise Manager Jobs
Conclusions
Appendix
Introduction
Integrating data mining with an end user application has been one of the more challenging assignments for developers of advanced analytics. With Data Miner 4.0, this effort has become considerably easier to accomplish. Data analysts and application developers can now work together to develop advanced analytics and easily embed them into applications.
Data analysts can use Data Miner workflows to define, build, and test their analytic methodologies. When they are satisfied with the results, they can use the new script generation feature to hand off a set of SQL scripts to the application developer. The application developer can take these standard Oracle SQL scripts and integrate them easily into their applications. Since all of the results of Oracle Data Miner are database objects, there is no concern about moving data and models from non-Oracle database systems.
Although the primary audience for this paper is application developers, data analysts, database administrators, and IT management can benefit from reading it as well.
Objectives/Benefits:
- Application Developers
  o Describes the structure of the generated scripts and their run-time behavior.
  o Shows how the generated scripts can be scheduled to run in the database.
- Data Analysts
  o Describes how to generate a SQL script for all or part of a workflow.
- Database Administrators and IT Management
  o Provides an understanding of the entire process and how it all runs within the Oracle Database environment.
The business benefit lies in the ability to quickly integrate advanced analytics into existing applications without incurring the high cost and complexity of integrating additional non-database platforms.
The instInsurCustData.sql script can be invoked from the SQL Worksheet of SQL Developer. Make sure that the script is run from within the user schema.
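For example (a minimal sketch; the file path is hypothetical and should point to wherever you saved the demo script):
-- Run as a script in the SQL Worksheet while connected as the user schema
>@"C:\demo\instInsurCustData.sql"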
Once imported, the workflow should look like the picture below:
Workflow Overview
The workflow is composed of two distinct processes contained within a single lineage: modeling (top) and scoring (bottom). Both processes use the demo data INSUR_CUST_LTV_SAMPLE as their input data source.
Modeling
The modeling process builds a classification SVM model to predict whether a customer will buy insurance. The model coefficients are persisted to a database table for viewing, and this table may provide a basis for application integration.
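As a hedged illustration (not necessarily the exact SQL the generated node script uses), the coefficients of a linear-kernel SVM model can be extracted with the DBMS_DATA_MINING package:
-- Works for linear-kernel SVM models only; the model name comes from the demo workflow
>SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_SVM('CLAS_SVM_MODEL'));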
Scoring
The scoring process uses the SVM model created by the modeling lineage to make predictions for the customer data. The prediction result is persisted to a database view. This view always reflects the predictions for the current input data; for example, if the input table is refreshed with new data, the view automatically captures the predictions for the new data. This view, too, may provide a basis for application integration.
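A hedged sketch of such a view (the view name, column list, and identifier column are illustrative; the generated script's SQL may differ), using the built-in SQL scoring functions:
>CREATE OR REPLACE VIEW SCORED_CUSTOMERS_V AS
 SELECT CUSTOMER_ID,
        PREDICTION(CLAS_SVM_MODEL USING *) AS PREDICTED_BUY_INSURANCE,
        PREDICTION_PROBABILITY(CLAS_SVM_MODEL USING *) AS PROBABILITY
 FROM   INSUR_CUST_LTV_SAMPLE;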
Workflow Run
The workflow must be run before it can be deployed. Right-click the INSUR_CUST_LTV_SAMPLE BUILD node and select Force Run | Selected Node and Children from the menu to run the whole workflow.
Once the workflow run has completed, the workflow should look like the picture below:
Generating Workflow Script Files
Deploy Script Options
When deploying a workflow, the Deploy menu provides options that determine which nodes are included in the generated script:
- Selected node, dependent nodes, and children
  o Generate a script for the selected node(s), its parent nodes, and its immediate child nodes.
  o For example, if the Apply node is selected, a script will be generated for these nodes: INSUR_CUST_LTV_SAMPLE BUILD, Class Build, INSUR_CUST_LTV_SAMPLE APPLY, Apply, and SCORED_CUSTOMERS.
- Selected node and connected nodes
  o Generate a script for the selected node(s) and all nodes that are connected to this node.
  o For example, if the Apply node is selected, a script will be generated for all nodes in the workflow.
Alternatively, to generate a script for the entire workflow, you can multi-select all nodes in the workflow and choose any of the above options from a selected node.
To generate a script for the demo workflow, right-click the INSUR_CUST_LTV_SAMPLE BUILD node and select the Deploy | Selected node and connected nodes item from the menu.
Generate Script UI Wizard
Next you need to specify the Script Directory where the scripts will be saved. The Script Directory name defaults to the workflow name. A new directory will be created to store all the generated scripts.
Running Workflow Script Files
Variable Definitions
The Master script defines several substitution variables:
- Variables that allow you to change the names of the objects that are input to the Node-level scripts, such as tables/views and models. By default, these names are the original table/view and model names.
- A variable that allows you to change the name of the Control table (see below). By default, this name is the workflow name.
- A variable that indicates whether named objects should be deleted first before they are generated by the script.
For the demo workflow, the following variable definitions are generated in the Master script:
-- Substitution Variable Definition Section: Override default object names here
DEFINE WORKFLOW_OUTPUT = 'codegen_workflow'
-- Drop user named objects (e.g. model, output table)? TRUE or FALSE
DEFINE DROP_EXISTING_OBJECTS = 'TRUE'
-- From Node: "INSUR_CUST_LTV_SAMPLE BUILD"
DEFINE DATA_SOURCE_1 = '"DMUSER"."INSUR_CUST_LTV_SAMPLE"'
-- From Node: "INSUR_CUST_LTV_SAMPLE APPLY"
DEFINE DATA_SOURCE_2 = '"DMUSER"."INSUR_CUST_LTV_SAMPLE"'
-- From Node: "Class Build"
DEFINE MODEL_1 = '"DMUSER"."CLAS_SVM_MODEL"'
-- From Node: "MODEL_COEFFCIENTS"
DEFINE CREATE_TABLE_2 = '"DMUSER"."MODEL_COEFFCIENTS"'
-- From Node: "SCORED_CUSTOMERS"
DEFINE CREATE_VIEW_3 = '"DMUSER"."SCORED_CUSTOMERS_V"'
The Control table name variable, WORKFLOW_OUTPUT, defaults to the workflow name. The drop variable, DROP_EXISTING_OBJECTS, defaults to TRUE, which indicates that all existing named objects should be removed before they are generated. The Node-specific object variables default to their original names in the workflow. For example, the generated Node variable DATA_SOURCE_1 allows the user to override the input table used in the INSUR_CUST_LTV_SAMPLE BUILD Data Source node. It is expected that the input data sources have the same columns as the originally referenced input data sources; missing column names could result in a run-time failure.
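For example (a hedged sketch; the replacement table name is hypothetical), you could edit the definition in the Master script to point the build node at a different table with the same columns:
-- Override the build data source (hypothetical table name)
DEFINE DATA_SOURCE_1 = '"DMUSER"."INSUR_CUST_LTV_NEW"'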
Control Table
When the Master script is run, it first creates the Control table using the name specified in the control table name variable. The purposes of the Control table are the following:
- Generated objects, such as views, models, and text specifications, are registered in this table.
- Logical nodes in the workflow are able to look up their input objects and register their output objects in the Control table.
- The Cleanup script uses the table to determine which objects need to be dropped.
- For advanced users, the Control table provides the internal names of objects that are not readily accessible via the workflows today. For example, users can find the model test result tables by viewing the Control table (see the sample query after this list).
- By using different control table names along with different output variable names, the generated script can be used concurrently to generate and manage different results. This may be useful if the input data sources contain different sets of data that you wish to mine independently. In this use case, the application would be responsible for saving the name of the control table so that it can be utilized when rerunning or dropping the generated results.
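A hedged sample query against the Control table (assuming the default control table name used by the demo workflow):
-- List everything the script registered, most recent first
>SELECT node_name, output_name, output_type, creation_time
 FROM   "codegen_workflow"
 ORDER  BY creation_time DESC;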
Table Structure
The Control table is defined with the following structure:
CREATE TABLE "&WORKFLOW_OUTPUT"
(
NODE_ID VARCHAR2(30) NOT NULL,
NODE_NAME VARCHAR2(30) NOT NULL,
NODE_TYPE VARCHAR2(30) NOT NULL,
MODEL_ID VARCHAR2(30),
MODEL_NAME VARCHAR2(65),
MODEL_TYPE VARCHAR2(35),
OUTPUT_NAME VARCHAR2(30) NOT NULL,
OUTPUT_TYPE VARCHAR2(30) NOT NULL,
ADDITIONAL_INFO VARCHAR2(65),
CREATION_TIME TIMESTAMP(6) NOT NULL,
COMMENTS VARCHAR2(4000 CHAR)
)
Column Descriptions
The following describes how the columns in the table are used:

Column Name        Description                                        Examples
NODE_ID            ID of the workflow node.                           10001, 10002
NODE_NAME          Name of the workflow node.
NODE_TYPE          Type of the workflow node.
MODEL_ID           ID of the generated model.                         10101, 10102
MODEL_NAME         Name of the model.                                 CLAS_GLM_1_6
MODEL_TYPE         Type of the model.
OUTPUT_NAME        Name of the generated output object.
OUTPUT_TYPE        Type of the generated output object.
ADDITIONAL_INFO    Additional information about the output object.
CREATION_TIME      Time when the object was created.
COMMENTS           Comments on the generated object.
To run the deployed workflow, invoke the Master script in SQL*Plus:
>@"C:\code gen\codegen workflow\codegen_workflow_Run.sql"
For subsequent runs, invoke the Cleanup script first to delete previously generated objects, and then run the Master script:
>@"C:\code gen\codegen workflow\codegen_workflow_Drop.sql"
>@"C:\code gen\codegen workflow\codegen_workflow_Run.sql"
After the script is run successfully, you can query the Control table to examine the generated objects:
>select * from "codegen_workflow";
For example, the Create Table node, MODEL_COEFFCIENTS, produced an output table MODEL_COEFFCIENTS that persists the coefficient data extracted from the generated SVM model.
To examine the coefficient data, you can query the output table:
>select * from "DMUSER"."MODEL_COEFFCIENTS";
Scheduling Workflow Script Files
SQL Developer
SQL Developer provides a graphical interface for developers to define Scheduler Jobs. A SQL*Plus script job is required to run the SQL script files, but the Script job type is only supported in the 12c database. The Job definition invokes the Master script as a script file using a full file path. The user can decide whether the Job should be run on a schedule or on demand. The Job execution can be monitored within the application, and the result will be either a success or a reported failure.
The following system privileges are required for the account that runs the scheduled SQL script files (see the example grants below):
- CREATE CREDENTIAL
- CREATE EXTERNAL JOB
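For example (a hedged sketch; the grantee schema is illustrative), a privileged user can grant these as follows:
-- Run as a DBA; replace DMUSER with the account that runs the scripts
>GRANT CREATE CREDENTIAL TO DMUSER;
>GRANT CREATE EXTERNAL JOB TO DMUSER;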
A credential is an Oracle Scheduler object that holds a user name and password pair in a dedicated database object. A SQL*Plus script job uses a host credential to authenticate itself with a database instance or the operating system so that the SQL*Plus executable can run. In addition, the job may point to a connect credential that contains a database credential, which is used to connect SQL*Plus to the database before running the script.
To create credentials, right-click the Credentials item in the Connections navigator as shown below.
First we will create the database host credential, and then the database credential.
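Alternatively, credentials can be created programmatically; a minimal sketch (the credential names, user names, and passwords are placeholders), assuming the 12c DBMS_CREDENTIAL package:
BEGIN
  -- Host credential: an operating system account that can run SQL*Plus
  DBMS_CREDENTIAL.CREATE_CREDENTIAL(
    credential_name => 'HOST_CRED',
    username        => 'oracle',
    password        => 'os_password');
  -- Database credential: the schema that owns the generated objects
  DBMS_CREDENTIAL.CREATE_CREDENTIAL(
    credential_name => 'DB_CRED',
    username        => 'DMUSER',
    password        => 'db_password');
END;
/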
Next we will use the Job wizard to define a new Scheduler job. Click the Job item in the Connections
navigator to launch the wizard.
In the Job Details step, enter the Job name and description. Select the Script job type and SQL*Plus script type, and enter the full path names of the cleanup and master scripts in the script window. You can specify whether the job will run one time only or repeat. The following shows the job scheduled to run on the first Sunday of each month at midnight.
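For reference, a roughly equivalent job could be defined directly with DBMS_SCHEDULER (a hedged sketch: the job name, credential names, and file path are illustrative, and the connect credential is attached as a job attribute):
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'CODEGEN_WORKFLOW_JOB',
    job_type        => 'SQL_SCRIPT',
    job_action      => '@"C:\code gen\codegen workflow\codegen_workflow_Run.sql"',
    credential_name => 'HOST_CRED',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=MONTHLY;BYDAY=1SUN;BYHOUR=0',
    enabled         => FALSE);
  -- Attach the database credential used to connect SQL*Plus before the script runs
  DBMS_SCHEDULER.SET_ATTRIBUTE('CODEGEN_WORKFLOW_JOB', 'connect_credential_name', 'DB_CRED');
  DBMS_SCHEDULER.ENABLE('CODEGEN_WORKFLOW_JOB');
END;
/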
In the Destination step, select the Local database and the credentials created above.
In the Notification step, you can set up an email notification based on the job status. For this to work, an email server must be configured on the host machine. For now, we will skip to the Summary step and click Finish to create the job.
Once the job is created, you can monitor it within SQL Developer. To monitor the job, right-click the job name, CODEGEN WORKFLOW, in the Jobs folder in the Connections navigator to open the job viewer as shown below.
Click the Run Log tab to see the status of the job. Here you can see when the job started running, how long it took to complete, CPU usage, and so on. In case of failure, you should see an error message.
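The same information is exposed through the standard Scheduler dictionary views; a hedged example (the view and columns are standard, the job name is the one created above):
-- Recent runs of the job, newest first
>SELECT job_name, status, actual_start_date, run_duration
 FROM   USER_SCHEDULER_JOB_RUN_DETAILS
 ORDER  BY actual_start_date DESC;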
To examine the script run result, click the Job Script Output tab in the bottom window, select the result in the OUTPUT column, and right-click to bring up the context menu. Then select Single Row View to open the viewer with the script run result.
Oracle Enterprise Manager Jobs
Once logged in, use the Jobs link in the Job Activity section at the bottom of the page to launch the Job
creation page.
Select the SQL Script type in the Create Job drop-down list and click Go. This will take you to the Create
Job page where you can define the new job.
In the General tab, enter the following Job name and description and add a target database where the
job will be run.
In the Parameters tab, enter the full path names of the cleanup and master scripts in the SQL Script edit
control.
In the Credentials tab, enter the database host credential and the database credential.
In the Schedule tab, you can specify if the job will be run one time only or repeating. The following
shows the job will be scheduled to run on the last day of each month at midnight.
In the Access tab, you can set up an email notification based on the job status. For this to work, an email server must be configured on the host machine.
Once you have entered all the settings, click Submit to create the job.
Conclusions
This white paper shows how easy it is to deploy a Data Miner workflow to a target or production database using the SQL script generation feature. Furthermore, Oracle Enterprise Manager on the target database can be used to schedule the scripts to run at any desired time interval. Because Oracle's data mining and advanced analytics operate natively inside the Oracle Database, mining insights and predictions also remain inside the database, where they can be accessed by SQL queries from OBIEE. Predictive model results can be called interactively from OBIEE reports and dashboards. With Oracle, all the data mining, advanced analysis, and results stay inside the database, providing a simpler architecture, better security, better scalability, and a single source of truth.
Appendix
The following table describes the objects that are generated by each node script.
Script Type and Output:
- Create Table Node: Creates a table or view that persists the input data.
- Data Explore Node: Creates a table that contains statistical data of the input data.
- Aggregate Node, Anomaly Detection Query Node (12c), Apply Node, Apply Text Node, Clustering Query Node (12c), Data Source Node, Feature Extraction Query Node (12c), Filter Columns Node, Filter Columns Details Node, Filter Rows Node, Join Node, Model Details Node, Prediction Query Node (12c), Sample Node, SQL Query Node, Transform Node: Creates a view reflecting the output of the node.
- Association Build Node: A model is created for each model build specification.
- Classification Build Node (Class Build):
  o Creates Build and Test (if necessary) data.
  o For each embedded text transformation (12c), Oracle Text objects are created (Policy, Lexer, Stoplist).
  o A model is created for each model build specification.
  o GLM Model Row Diagnostics table (if row diagnostics is turned on).
  o Each Model Test will have one table for each of the following tests per target value (up to 100 maximum target values): Lift.
- Regression Build Node:
  o For each embedded text transformation (12c), Oracle Text objects are created (Policy, Lexer, Stoplist).
  o A model is created for each model build specification.
  o Each Model Test will have one table generated for each of the following test results: Test Metric, Residual Plot data.
- Classification Test Node: Same specifications as Class Build, except no models are created.
- Regression Test Node: Same specifications as Regression Build, except no models are created.
- Model Node, Text Reference Node, Graph Node: No code generated.