Aster Analytics Foundation User Guide 0621 Update 1
Teradata, Applications-Within, Aster, BYNET, Claraview, DecisionCast, Gridscale, MyCommerce, QueryGrid, SQL-MapReduce, Teradata
Decision Experts, "Teradata Labs" logo, Teradata ServiceConnect, Teradata Source Experts, WebAnalyst, and Xkoto are trademarks or registered
trademarks of Teradata Corporation or its affiliates in the United States and other countries.
Adaptec and SCSISelect are trademarks or registered trademarks of Adaptec, Inc.
Amazon Web Services, AWS, [any other AWS Marks used in such materials] are trademarks of Amazon.com, Inc. or its affiliates in the United
States and/or other countries.
AMD Opteron and Opteron are trademarks of Advanced Micro Devices, Inc.
Apache, Apache Avro, Apache Hadoop, Apache Hive, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the
Apache Software Foundation in the United States and/or other countries.
Apple, Mac, and OS X are registered trademarks of Apple Inc.
Axeda is a registered trademark of Axeda Corporation. Axeda Agents, Axeda Applications, Axeda Policy Manager, Axeda Enterprise, Axeda Access,
Axeda Software Management, Axeda Service, Axeda ServiceLink, and Firewall-Friendly are trademarks and Maximum Results and Maximum
Support are servicemarks of Axeda Corporation.
CENTOS is a trademark of Red Hat, Inc., registered in the U.S. and other countries.
Cloudera, CDH, [any other Cloudera Marks used in such materials] are trademarks or registered trademarks of Cloudera Inc. in the United States,
and in jurisdictions throughout the world.
Data Domain, EMC, PowerPath, SRDF, and Symmetrix are registered trademarks of EMC Corporation.
GoldenGate is a trademark of Oracle.
Hewlett-Packard and HP are registered trademarks of Hewlett-Packard Company.
Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other
countries.
Intel, Pentium, and XEON are registered trademarks of Intel Corporation.
IBM, CICS, RACF, Tivoli, and z/OS are registered trademarks of International Business Machines Corporation.
Linux is a registered trademark of Linus Torvalds.
LSI is a registered trademark of LSI Corporation.
Microsoft, Active Directory, Windows, Windows NT, and Windows Server are registered trademarks of Microsoft Corporation in the United States
and other countries.
NetVault is a trademark or registered trademark of Dell Inc. in the United States and/or other countries.
Novell and SUSE are registered trademarks of Novell, Inc., in the United States and other countries.
Oracle, Java, and Solaris are registered trademarks of Oracle and/or its affiliates.
QLogic and SANbox are trademarks or registered trademarks of QLogic Corporation.
Quantum and the Quantum logo are trademarks of Quantum Corporation, registered in the U.S.A. and other countries.
Red Hat is a trademark of Red Hat, Inc., registered in the U.S. and other countries. Used under license.
SAP is the trademark or registered trademark of SAP AG in Germany and in several other countries.
SAS and SAS/C are trademarks or registered trademarks of SAS Institute Inc.
Simba, the Simba logo, SimbaEngine, SimbaEngine C/S, SimbaExpress and SimbaLib are registered trademarks of Simba Technologies Inc.
SPARC is a registered trademark of SPARC International, Inc.
Symantec, NetBackup, and VERITAS are trademarks or registered trademarks of Symantec Corporation or its affiliates in the United States and
other countries.
Unicode is a registered trademark of Unicode, Inc. in the United States and other countries.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Other product and company names mentioned herein may be the trademarks of their respective owners.
The information contained in this document is provided on an "as-is" basis, without warranty of any kind, either express
or implied, including the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
Some jurisdictions do not allow the exclusion of implied warranties, so the above exclusion may not apply to you. In no
event will Teradata Corporation be liable for any indirect, direct, special, incidental, or consequential damages,
including lost profits or lost savings, even if expressly advised of the possibility of such damages.
The information contained in this document may contain references or cross-references to features, functions, products, or services that are not
announced or available in your country. Such references do not imply that Teradata Corporation intends to announce such features, functions,
products, or services in your country. Please consult your local Teradata Corporation representative for those features, functions, products, or
services available in your country.
Information contained in this document may contain technical inaccuracies or typographical errors. Information may be changed or updated
without notice. Teradata Corporation may also make improvements or changes in the products or services described in this information at any time
without notice.
To maintain the quality of our products and services, we would like your comments on the accuracy, clarity, organization, and value of this
document. Please e-mail: [email protected]
Any comments or materials (collectively referred to as "Feedback") sent to Teradata Corporation will be deemed non-confidential. Teradata
Corporation will have no obligation of any kind with respect to Feedback and will be free to use, reproduce, disclose, exhibit, display, transform,
create derivative works of, and distribute the Feedback and derivative works thereof without limitation on a royalty-free basis. Further, Teradata
Corporation will be free to use any ideas, concepts, know-how, or techniques contained in such Feedback for any purpose whatsoever, including
developing, manufacturing, or marketing products or services incorporating Feedback.
Copyright © 2015 - 2016 by Teradata. All Rights Reserved.
Table of Contents
Preface
    Overview
    Conventions Used in This Guide
        Typefaces
        Notation Conventions
        Command Shell Text Conventions
    Contact Teradata Global Technical Support (GTS)
    About Teradata Aster
    About This Document
        Version History
Chapter 1: Introduction
    Introduction
    Analytics at Scale: Full Dataset Analysis
    Introduction to Teradata Aster SQL-MapReduce®
        What is MapReduce?
        Aster Database SQL-MapReduce
    SQL-MapReduce Query Syntax
    SQL-MapReduce with Multiple Inputs
        Benefits of Multiple Inputs
        How Multiple Inputs are Processed
        Types of SQL-MapReduce Inputs
        Semantic Requirements for SQL-MapReduce Functions
        Use Cases and Examples for Multiple Inputs
        SQL-MapReduce Multiple Input FAQ
    Aster Analytics Function Product Bundles
    Aster Analytics Functions by Product Bundle
        Premium Path
        Premium Relationship
        Analytics Foundation
        Premium Graph
        Aster Analytics
        Aster Scoring SDK
    Aster Analytics Functions by Category
        Time Series, Path, and Attribution Analysis
        Pattern Matching with Teradata Aster nPath
        Statistical Analysis
        Text Analysis
        Cluster Analysis
        Naive Bayes
        Ensemble Methods
        Association Analysis
        Graph Analysis
        Aster Scoring SDK
        NeuralNet
        Data Transformation
        Aster Database Utilities
Chapter 2: Installing Aster Analytics Functions
    Installing Aster Analytics Functions
    Aster Analytics Function Version Numbers
    Finding Function Version Numbers
    Aster Analytics Compatibility Matrix
    Aster Analytics Function Packages
    Downloading an Aster Analytics Function Package
    Getting Install and Uninstall Scripts
        Scripts for the Schema PUBLIC
        Scripts for a Specified Schema
    Installing an Aster Analytics Function Package
        Set Default Schema for Function Users
        Set Permissions to Allow Users to Run Functions
        Testing the Functions
    Updating an Aster Analytics Function Package
    Installing a Function in a Specific Schema
    Managing Files with ACT Commands
    Usage Notes
        Enclosing Database Object Names in Double Quotation Marks
        Boolean Argument Values
        Column Specification Arguments
        DATE Columns
        BC/BCE Timestamps
        Creating a Timestamp Column
        Granting CREATE Privileges
        Adding Model File Locations to the Default Search Path
        Connecting to Aster Database Using Authentication Cascading
        Connecting to Aster Database Using SSL JDBC Connections
        Error Message Delays
        Sparse Tables and Dense Tables
        Permanent Tables As Output of Driver-Based Functions
        Input Table Aliases
Chapter 3: Time Series, Path, and Attribution Analysis
    Time Series, Path, and Attribution Analysis
    Arima
        Summary
        Background
        Usage
        Example
    ArimaPredictor
        Summary
        Usage
        Example
    Attribution
        Summary
        Background
    Attribution (Multiple-Input Version)
        Summary
        Usage
        Example
    Attribution (Single-Input Version)
        Summary
        Usage
        Examples
    Burst
        Summary
        Usage
        Examples
    Change-Point Detection Functions
        Summary
        Background
    ChangePointDetection
        Summary
        Usage
        Examples
    RtChangePointDetection
        Summary
        Usage
        Examples
    Convergent Cross-Mapping
    CCMPrepare
        Usage
        Example
    CCM
        Summary
        Usage
        Examples
    DTW
        Summary
        Usage
        Example
    DWT
        Summary
        Background
        Usage
        Example
    DWT2D
        Summary
        Background
        Usage
        Example
    FrequentPaths
        Summary
        Background
        Usage
        Examples
    IDWT
        Summary
        Usage
        Example
    IDWT2D
        Summary
        Usage
        Example
    Interpolator
        Summary
        Usage
        Examples
    Path Analysis Functions
        Summary
    Path_Generator
        Summary
        Usage
        Example
    Path_Summarizer
        Summary
        Usage
        Example
    Path_Start
        Summary
        Usage
        Example
    Path_Analyzer
        Summary
        Usage
        Example
    SAX2
        Summary
        Background
        Usage
        Examples
    SeriesSplitter
        Summary
        Background
        Usage
        Examples
    Sessionize
        Summary
        Background
        Usage
        Example
    Shapelet Functions
        Overview
    UnsupervisedShapelet
        Summary
        Usage
        Example
        Troubleshooting
    SupervisedShapeletTrainer
        Summary
        Usage
        Example
        Troubleshooting
    SupervisedShapeletClassifier
        Summary
        Usage
        Example
    VARMAX
        Summary
        Usage
        Examples
Chapter 4: Pattern Matching with Teradata Aster nPath
    Pattern Matching with Teradata Aster nPath
    nPath
        Summary
        Usage
    Pattern Matching
        Greedy Pattern Matching
    Symbols
        LAG Expressions in Symbol Predicates
    Filters
        Example
    Result: Applying Aggregate Functions
        Example 1
        Example 2
        Example 3
    nPath Examples
        Clickstream Data Examples
        Range-Matching Examples
Chapter 5: Statistical Analysis
    Statistical Analysis
    Approximate Distinct Count
        Summary
        Background
        Usage
        Example
    Approximate Percentile
        Summary
        Background
        Usage
        Example
    CMAVG
        Summary
        Background
        Usage
        Example
    ConfusionMatrix
        Summary
        Background
        Usage
        Example
    Correlation
        Summary
        Usage
        Examples
    CoxPH
        Summary
        Background
        Usage
        Example
    CoxPredict
        Summary
        Background
        Usage
        Examples
    Hypothesis-Test Mode
        Summary
        Usage
        Examples
    CoxSurvFit
        Summary
        Background
        Usage
        Example
    CrossValidation
        Summary
        Usage
        Example
    Distribution Matching
        Summary
    Best-Match Mode
        Summary
        Usage
        Examples
    EMAVG
        Summary
        Background
        Usage
        Example
    FMeasure
        Summary
        Background
        Usage
        Examples
    GLM
        Summary
        Background
        Usage
        Examples
    GLMPredict
        Summary
        Usage
        Examples
    Hidden Markov Model Functions
        Overview
        Models and Descriptions
        Aster Distributed Platforms
    HMMUnsupervisedLearner
        Summary
        Usage
        Example
    HMMSupervisedLearner
        Summary
        Usage
        Example
    HMMEvaluator
        Summary
        Usage
        Example
    HMMDecoder
        Summary
        Usage
        Examples
    Histogram
        Summary
        Background
        Usage
        Examples
    KNN
        Summary
        Background
        Usage
        Example
    LARS Functions
        Summary
        Background
    LARS
Summary......................................................................................................................................................... 551
Usage................................................................................................................................................................551
Examples......................................................................................................................................................... 554
LARSPredict............................................................................................................................................................... 559
Summary......................................................................................................................................................... 559
Usage................................................................................................................................................................559
Examples......................................................................................................................................................... 561
Linear Regression.......................................................................................................................................................564
Summary......................................................................................................................................................... 564
Background.....................................................................................................................................................564
Usage................................................................................................................................................................565
Example........................................................................................................................................................... 566
LRTEST....................................................................................................................................................................... 567
Summary......................................................................................................................................................... 567
Background.....................................................................................................................................................567
Usage................................................................................................................................................................568
Example........................................................................................................................................................... 569
Percentile.....................................................................................................................................................................572
Summary......................................................................................................................................................... 572
Usage................................................................................................................................................................573
Example........................................................................................................................................................... 574
Principal Component Analysis................................................................................................................................ 575
Summary......................................................................................................................................................... 575
Background.....................................................................................................................................................576
Usage................................................................................................................................................................576
Example........................................................................................................................................................... 578
PCAPlot.......................................................................................................................................................................588
Summary......................................................................................................................................................... 588
Usage................................................................................................................................................................589
Example........................................................................................................................................................... 590
RandomSample.......................................................................................................................................................... 591
Summary......................................................................................................................................................... 591
Usage................................................................................................................................................................591
Examples......................................................................................................................................................... 594
Sample......................................................................................................................................................................... 599
Summary......................................................................................................................................................... 599
Usage................................................................................................................................................................600
Examples......................................................................................................................................................... 603
Shapley Value Functions...........................................................................................................................................609
Summary......................................................................................................................................................... 609
Background.....................................................................................................................................................609
GenerateCombination...............................................................................................................................................610
Usage................................................................................................................................................................610
Examples......................................................................................................................................................... 611
SortCombination....................................................................................................................................................... 611
Usage................................................................................................................................................................611
Examples......................................................................................................................................................... 612
AddOnePlayer............................................................................................................................................................ 612
Usage................................................................................................................................................................613
Examples......................................................................................................................................................... 615
SMAVG....................................................................................................................................................................... 629
Summary......................................................................................................................................................... 629
Background.....................................................................................................................................................630
Usage................................................................................................................................................................630
Example........................................................................................................................................................... 631
Support Vector Machines......................................................................................................................................... 633
SparseSVM Functions............................................................................................................................................... 634
SparseSVMTrainer.................................................................................................................................................... 634
Summary......................................................................................................................................................... 634
Usage................................................................................................................................................................635
Example........................................................................................................................................................... 637
SparseSVMPredictor................................................................................................................................................. 640
Summary......................................................................................................................................................... 640
Usage................................................................................................................................................................640
Example........................................................................................................................................................... 642
SVMModelPrinter..................................................................................................................................................... 645
Summary......................................................................................................................................................... 645
Usage................................................................................................................................................................645
Examples......................................................................................................................................................... 646
DenseSVM Functions................................................................................................................................................647
DenseSVMTrainer.....................................................................................................................................................648
Summary......................................................................................................................................................... 648
Usage................................................................................................................................................................648
Examples......................................................................................................................................................... 651
DenseSVMPredictor..................................................................................................................................................657
Summary......................................................................................................................................................... 657
Usage................................................................................................................................................................657
Examples......................................................................................................................................................... 659
DenseSVMModelPrinter.......................................................................................................................................... 665
Summary......................................................................................................................................................... 665
Usage................................................................................................................................................................665
Example........................................................................................................................................................... 666
VectorDistance...........................................................................................................................................................667
Summary......................................................................................................................................................... 667
Background.....................................................................................................................................................668
Usage................................................................................................................................................................669
Examples......................................................................................................................................................... 673
VWAP......................................................................................................................................................................... 676
Summary......................................................................................................................................................... 676
Usage................................................................................................................................................................677
Example........................................................................................................................................................... 678
WMAVG.....................................................................................................................................................................680
Summary......................................................................................................................................................... 680
Background.....................................................................................................................................................681
Usage................................................................................................................................................................681
Example........................................................................................................................................................... 682
Chapter 6:
Text Analysis............................................................................................................................................ 685
Train Named Entity Finder....................................................................................................................................740
Summary......................................................................................................................................................... 740
Usage................................................................................................................................................................740
Example........................................................................................................................................................... 741
Evaluate Named Entity Finder.................................................................................................................................743
Summary......................................................................................................................................................... 743
Usage................................................................................................................................................................743
Example........................................................................................................................................................... 744
nGram..........................................................................................................................................................................745
Summary......................................................................................................................................................... 745
Background.....................................................................................................................................................745
Usage................................................................................................................................................................746
Examples......................................................................................................................................................... 748
POSTagger.................................................................................................................................................................. 751
Summary......................................................................................................................................................... 751
Background.....................................................................................................................................................751
Usage................................................................................................................................................................754
Example........................................................................................................................................................... 756
Sentenizer....................................................................................................................................................................759
Summary......................................................................................................................................................... 759
Background.....................................................................................................................................................759
Usage................................................................................................................................................................759
Example........................................................................................................................................................... 760
Sentiment Extraction Functions.............................................................................................................................. 763
Summary......................................................................................................................................................... 763
Background.....................................................................................................................................................763
TrainSentimentExtractor..........................................................................................................................................764
Summary......................................................................................................................................................... 764
Usage................................................................................................................................................................764
Example........................................................................................................................................................... 765
ExtractSentiment........................................................................................................................................................767
Summary......................................................................................................................................................... 767
Usage................................................................................................................................................................767
Examples......................................................................................................................................................... 770
EvaluateSentimentExtractor.....................................................................................................................................778
Summary......................................................................................................................................................... 778
Usage................................................................................................................................................................778
Example........................................................................................................................................................... 779
Text Classifier............................................................................................................................................................. 782
Summary......................................................................................................................................................... 782
Background.....................................................................................................................................................783
TextClassifierTrainer.................................................................................................................................................783
Summary......................................................................................................................................................... 783
Usage................................................................................................................................................................783
Example........................................................................................................................................................... 786
TextClassifier.............................................................................................................................................................. 788
Summary......................................................................................................................................................... 788
Usage................................................................................................................................................................788
Example........................................................................................................................................................... 789
TextClassifierEvaluator............................................................................................................................................. 790
Summary......................................................................................................................................................... 790
Usage................................................................................................................................................................790
Example........................................................................................................................................................... 791
Text_Parser................................................................................................................................................................. 792
Summary......................................................................................................................................................... 792
Background.....................................................................................................................................................792
Usage................................................................................................................................................................792
Examples......................................................................................................................................................... 796
TextChunker...............................................................................................................................................................799
Summary......................................................................................................................................................... 799
Background.....................................................................................................................................................799
Usage................................................................................................................................................................800
Example........................................................................................................................................................... 801
TextMorph..................................................................................................................................................................806
Summary......................................................................................................................................................... 806
Background.....................................................................................................................................................806
Usage................................................................................................................................................................806
Examples......................................................................................................................................................... 809
TextTagging................................................................................................................................................................ 818
Summary......................................................................................................................................................... 818
Usage................................................................................................................................................................819
Examples......................................................................................................................................................... 823
TextTokenizer............................................................................................................................................................ 827
Summary......................................................................................................................................................... 827
Usage................................................................................................................................................................828
Examples......................................................................................................................................................... 830
TF_IDF........................................................................................................................................................................ 834
Summary......................................................................................................................................................... 834
Background.....................................................................................................................................................835
Usage................................................................................................................................................................835
Examples......................................................................................................................................................... 838
Chapter 7:
Cluster Analysis.................................................................................................................................... 847
Cluster Analysis..........................................................................................................................................................847
Canopy.........................................................................................................................................................................847
Summary......................................................................................................................................................... 847
Background.....................................................................................................................................................847
Usage................................................................................................................................................................848
Example........................................................................................................................................................... 849
Gaussian Mixture Model Functions........................................................................................................................851
GMMFit.......................................................................................................................................................................851
Summary......................................................................................................................................................... 851
Usage................................................................................................................................................................851
Examples......................................................................................................................................................... 856
GMMPredict...............................................................................................................................................................863
Summary......................................................................................................................................................... 863
Usage................................................................................................................................................................863
Example........................................................................................................................................................... 865
GMMProfile................................................................................................................................................................868
Summary......................................................................................................................................................... 868
Usage................................................................................................................................................................868
Examples......................................................................................................................................................... 869
KMeans........................................................................................................................................................................872
Summary......................................................................................................................................................... 872
Background.....................................................................................................................................................872
Usage................................................................................................................................................................873
Examples......................................................................................................................................................... 877
KMeansPlot................................................................................................................................................................ 886
Summary......................................................................................................................................................... 886
Usage................................................................................................................................................................886
Example........................................................................................................................................................... 887
KModes....................................................................................................................................................................... 890
Summary......................................................................................................................................................... 890
Usage................................................................................................................................................................891
Examples......................................................................................................................................................... 894
KModesPredict...........................................................................................................................................................900
Summary......................................................................................................................................................... 900
Usage................................................................................................................................................................900
Example........................................................................................................................................................... 901
Minhash.......................................................................................................................................................................904
Summary......................................................................................................................................................... 904
Background.....................................................................................................................................................904
Usage................................................................................................................................................................905
Example........................................................................................................................................................... 906
Chapter 8:
Naive Bayes................................................................................................................................................ 909
Naive Bayes................................................................................................................................................................. 909
What is Naive Bayes?.................................................................................................................................................909
Naive Bayes Functions.............................................................................................................................................. 909
Summary......................................................................................................................................................... 909
NaiveBayesMap and NaiveBayesReduce................................................................................................................ 910
Summary......................................................................................................................................................... 910
Usage................................................................................................................................................................910
Naive Bayes Example.....................................................................................................................................912
NaiveBayesPredict..................................................................................................................................................... 917
Summary......................................................................................................................................................... 917
Usage................................................................................................................................................................918
Naive Bayes Example.....................................................................................................................................919
Naive Bayes Example.................................................................................................................................................925
NaiveBayesMap Input: Training Table.......................................................................................................925
Split Input into Training and Testing Data Sets........................................................................................925
SQL-MapReduce Call to Generate the Model........................................................................................... 927
NaiveBayesReduce and NaiveBayesMap Output: Model Table.............................................................. 927
NaiveBayesPredict Input...............................................................................................................................928
SQL-MapReduce Call to Predict Outcomes of Test Table Data............................................................. 928
NaiveBayesPredict Output: Predict Outcomes Table............................................................................... 928
Prediction Accuracy...................................................................................................................................... 930
Chapter 9:
Ensemble Methods............................................................................................................................931
Ensemble Methods.....................................................................................................................................................931
Random Forest Functions........................................................................................................................................ 931
Summary......................................................................................................................................................... 931
Background.....................................................................................................................................................932
Implementation Notes.................................................................................................................................. 933
Usage................................................................................................................................................................934
Forest_Drive............................................................................................................................................................... 934
Summary......................................................................................................................................................... 934
Usage................................................................................................................................................................935
Example........................................................................................................................................................... 938
Forest_Predict............................................................................................................................................................ 943
Summary......................................................................................................................................................... 943
Usage................................................................................................................................................................943
Example........................................................................................................................................................... 946
Forest_Analyze...........................................................................................................................................................949
Summary......................................................................................................................................................... 949
Usage................................................................................................................................................................950
Examples......................................................................................................................................................... 951
Single Decision Tree Functions............................................................................................................................... 954
Single_Tree_Drive..................................................................................................................................................... 955
Summary......................................................................................................................................................... 955
Background.....................................................................................................................................................955
Usage................................................................................................................................................................956
Examples......................................................................................................................................................... 962
Single_Tree_Predict...................................................................................................................................................973
Summary......................................................................................................................................................... 973
Usage................................................................................................................................................................973
Example........................................................................................................................................................... 974
AdaBoost Functions.................................................................................................................................................. 976
Background.....................................................................................................................................................976
AdaBoost_Drive.........................................................................................................................................................977
Summary......................................................................................................................................................... 977
Usage................................................................................................................................................................978
Example........................................................................................................................................................... 981
AdaBoost_Predict...................................................................................................................................................... 987
Summary......................................................................................................................................................... 987
Usage................................................................................................................................................................987
Example........................................................................................................................................................... 988
Chapter 10:
Association Analysis...................................................................................................................... 995
Association Analysis..................................................................................................................................................995
Basket_Generator.......................................................................................................................................................995
Summary......................................................................................................................................................... 995
Background.....................................................................................................................................................995
Usage................................................................................................................................................................995
Examples......................................................................................................................................................... 997
CFilter........................................................................................................................................................................1000
Summary....................................................................................................................................................... 1000
Background...................................................................................................................................................1000
Usage..............................................................................................................................................................1000
Examples....................................................................................................................................................... 1003
FPGrowth..................................................................................................................................................................1007
Summary....................................................................................................................................................... 1007
Background...................................................................................................................................................1007
Usage..............................................................................................................................................................1008
Example.........................................................................................................................................................1013
Recommender Functions........................................................................................................................................1016
WSRecommender....................................................................................................................................................1017
Summary....................................................................................................................................................... 1017
Usage..............................................................................................................................................................1017
Example.........................................................................................................................................................1020
KNNRecommenderTrain....................................................................................................................................... 1023
Summary....................................................................................................................................................... 1023
Usage..............................................................................................................................................................1023
Example.........................................................................................................................................................1026
KNNRecommenderPredict.................................................................................................................................... 1030
Summary....................................................................................................................................................... 1030
Usage..............................................................................................................................................................1030
Example.........................................................................................................................................................1031
Chapter 11:
Graph Analysis..................................................................................................................................... 1035
Graph Analysis......................................................................................................................................................... 1035
Overview of Graph Analysis...................................................................................................................................1035
Graph Functions.......................................................................................................................................... 1035
Iterations....................................................................................................................................................... 1036
What is a Graph?..........................................................................................................................................1036
Directed Graphs........................................................................................................................................... 1037
Graph Discovery.......................................................................................................................................... 1037
AllPairsShortestPath................................................................................................................................................1037
Summary....................................................................................................................................................... 1037
Usage..............................................................................................................................................................1038
Examples....................................................................................................................................................... 1041
Betweenness..............................................................................................................................................................1045
Summary....................................................................................................................................................... 1045
Background...................................................................................................................................................1045
Usage..............................................................................................................................................................1046
Example.........................................................................................................................................................1048
Closeness................................................................................................................................................................... 1050
Summary....................................................................................................................................................... 1050
Background...................................................................................................................................................1051
Usage..............................................................................................................................................................1051
Examples....................................................................................................................................................... 1054
EigenvectorCentrality..............................................................................................................................................1057
Summary....................................................................................................................................................... 1057
Background...................................................................................................................................................1057
Usage..............................................................................................................................................................1059
Examples....................................................................................................................................................... 1061
gTree.......................................................................................................................................................................... 1064
Summary....................................................................................................................................................... 1064
Background...................................................................................................................................................1064
Usage..............................................................................................................................................................1065
Examples....................................................................................................................................................... 1068
LocalClusteringCoefficient.....................................................................................................................................1072
Summary....................................................................................................................................................... 1072
Background...................................................................................................................................................1072
Usage..............................................................................................................................................................1075
Examples....................................................................................................................................................... 1079
LoopyBeliefPropagation......................................................................................................................................... 1082
Summary....................................................................................................................................................... 1082
Background...................................................................................................................................................1082
Usage..............................................................................................................................................................1083
Examples....................................................................................................................................................... 1086
Modularity................................................................................................................................................................ 1090
Summary....................................................................................................................................................... 1090
Background...................................................................................................................................................1090
Definitions.................................................................................................................................................... 1091
Usage..............................................................................................................................................................1093
Examples....................................................................................................................................................... 1097
Tips................................................................................................................................................................ 1100
Troubleshooting...........................................................................................................................................1101
nTree..........................................................................................................................................................................1102
Summary....................................................................................................................................................... 1102
Background...................................................................................................................................................1102
Usage..............................................................................................................................................................1103
Examples....................................................................................................................................................... 1107
PageRank...................................................................................................................................................................1110
Summary....................................................................................................................................................... 1110
Background...................................................................................................................................................1111
Usage..............................................................................................................................................................1111
Example.........................................................................................................................................................1113
pSALSA..................................................................................................................................................................... 1114
Summary....................................................................................................................................................... 1114
Background...................................................................................................................................................1115
Usage..............................................................................................................................................................1117
Examples....................................................................................................................................................... 1120
RandomWalkSample...............................................................................................................................................1128
Summary....................................................................................................................................................... 1128
Background...................................................................................................................................................1128
Usage..............................................................................................................................................................1129
Example.........................................................................................................................................................1131
Chapter 12:
Neural Networks............................................................................................................................... 1135
Chapter 13:
Data Transformation................................................................................................................... 1151
Data Transformation...............................................................................................................................................1151
Antiselect...................................................................................................................................................................1151
Summary....................................................................................................................................................... 1151
Usage..............................................................................................................................................................1152
Example.........................................................................................................................................................1152
Apache_Log_Parser.................................................................................................................................................1154
Summary....................................................................................................................................................... 1154
Background...................................................................................................................................................1154
Usage..............................................................................................................................................................1156
Examples....................................................................................................................................................... 1158
Categorize................................................................................................................................................................. 1161
Summary....................................................................................................................................................... 1161
Usage..............................................................................................................................................................1161
Example.........................................................................................................................................................1162
Fellegi-Sunter Functions.........................................................................................................................................1164
Summary....................................................................................................................................................... 1164
Background...................................................................................................................................................1165
FellegiSunterTrainer................................................................................................................................................1165
Summary....................................................................................................................................................... 1165
Usage..............................................................................................................................................................1165
Examples....................................................................................................................................................... 1168
FellegiSunterPredict................................................................................................................................................ 1173
Summary....................................................................................................................................................... 1173
Usage..............................................................................................................................................................1173
Examples....................................................................................................................................................... 1174
Geometry Functions................................................................................................................................................1179
GeometryLoader...................................................................................................................................................... 1180
Summary....................................................................................................................................................... 1180
Usage..............................................................................................................................................................1180
Example.........................................................................................................................................................1182
PointInPolygon........................................................................................................................................................ 1184
Summary....................................................................................................................................................... 1184
Background...................................................................................................................................................1184
Usage..............................................................................................................................................................1185
Examples....................................................................................................................................................... 1188
GeometryOverlay.....................................................................................................................................................1192
Summary....................................................................................................................................................... 1192
Usage..............................................................................................................................................................1192
Examples....................................................................................................................................................... 1195
IdentityMatch...........................................................................................................................................................1198
Summary....................................................................................................................................................... 1198
Background...................................................................................................................................................1199
Usage..............................................................................................................................................................1199
Example.........................................................................................................................................................1203
IPGeo......................................................................................................................................................................... 1205
Summary....................................................................................................................................................... 1205
Usage..............................................................................................................................................................1206
Examples....................................................................................................................................................... 1207
Extending IPGeo..........................................................................................................................................1209
JSONParser............................................................................................................................................................... 1214
Summary....................................................................................................................................................... 1214
Background...................................................................................................................................................1214
Usage..............................................................................................................................................................1215
Examples....................................................................................................................................................... 1217
Multi_Case................................................................................................................................................................1222
Summary....................................................................................................................................................... 1222
Usage..............................................................................................................................................................1222
Example.........................................................................................................................................................1223
MurmurHash............................................................................................................................................................1225
Summary....................................................................................................................................................... 1225
Background...................................................................................................................................................1225
Usage..............................................................................................................................................................1226
Example.........................................................................................................................................................1227
OutlierFilter.............................................................................................................................................................. 1230
Summary....................................................................................................................................................... 1230
Usage..............................................................................................................................................................1230
Examples....................................................................................................................................................... 1233
Pack............................................................................................................................................................................1237
Summary....................................................................................................................................................... 1237
Usage..............................................................................................................................................................1238
Examples....................................................................................................................................................... 1239
Pivot...........................................................................................................................................................................1241
Summary....................................................................................................................................................... 1241
Usage..............................................................................................................................................................1241
Examples....................................................................................................................................................... 1245
PSTParserAFS.......................................................................................................................................................... 1248
Summary....................................................................................................................................................... 1248
Usage..............................................................................................................................................................1249
Examples....................................................................................................................................................... 1254
Scale Functions.........................................................................................................................................................1258
Summary....................................................................................................................................................... 1258
Background...................................................................................................................................................1258
ScaleMap................................................................................................................................................................... 1259
Usage..............................................................................................................................................................1259
Scale........................................................................................................................................................................... 1261
Usage..............................................................................................................................................................1261
ScalePrinter...............................................................................................................................................................1263
Usage..............................................................................................................................................................1264
PartitionScale............................................................................................................................................................1265
Usage..............................................................................................................................................................1265
Scale Function Examples............................................................................................................................ 1267
StringSimilarity........................................................................................................................................................ 1276
Summary....................................................................................................................................................... 1276
Usage..............................................................................................................................................................1276
Examples....................................................................................................................................................... 1278
Unpack...................................................................................................................................................................... 1281
Summary....................................................................................................................................................... 1281
Usage..............................................................................................................................................................1282
Examples....................................................................................................................................................... 1284
Unpivot..................................................................................................................................................................... 1287
Summary....................................................................................................................................................... 1287
Usage..............................................................................................................................................................1287
Examples....................................................................................................................................................... 1289
URIPack.................................................................................................................................................................... 1293
Summary....................................................................................................................................................... 1293
Usage..............................................................................................................................................................1293
Example.........................................................................................................................................................1294
URIUnpack...............................................................................................................................................................1295
Summary....................................................................................................................................................... 1295
Background...................................................................................................................................................1295
Usage..............................................................................................................................................................1295
Example.........................................................................................................................................................1297
XMLParser................................................................................................................................................................1298
Summary....................................................................................................................................................... 1298
Background...................................................................................................................................................1298
Usage..............................................................................................................................................................1298
Examples....................................................................................................................................................... 1303
XMLRelation............................................................................................................................................................ 1309
Summary....................................................................................................................................................... 1309
Usage..............................................................................................................................................................1309
Examples....................................................................................................................................................... 1313
Chapter 14:
Aster Scoring SDK........................................................................................................................... 1319
Aster Scoring SDK................................................................................................................................................... 1319
Introduction to Aster Scoring SDK.......................................................................................................................1319
AMLGenerator.........................................................................................................................................................1320
Summary....................................................................................................................................................... 1320
Usage..............................................................................................................................................................1320
Example.........................................................................................................................................................1324
Scorer.........................................................................................................................................................................1326
Summary....................................................................................................................................................... 1326
Package.......................................................................................................................................................... 1327
Installation.................................................................................................................................................... 1328
Functional Support......................................................................................................................................1329
Input Formats...............................................................................................................................................1329
Chapter 15:
Visualization Functions............................................................................................................1351
Visualization Functions.......................................................................................................................................... 1351
Chapter 16:
Aster Database System Utility Functions...................................................... 1353
Aster Database System Utility Functions............................................................................................................. 1353
Appendix A:
List of Functions and Their Syntax......................................................................... 1355
About the List of Functions....................................................................................................................................1355
Time Series, Path, and Attribution Analysis........................................................................................................1355
Arima (version 1.1)......................................................................................................................................1355
Table 1120: FellegiSunterPredict Example Input Table fspredict_input (Columns 1-4).............................. 1175
Table 1121: FellegiSunterPredict Example Input Table fspredict_input (Columns 5-7).............................. 1176
Table 1122: FellegiSunterPredict Example 1 Output Table (Columns 1-4).................................................... 1176
Table 1123: FellegiSunterPredict Example 1 Output Table (Columns 5-9).................................................... 1177
Table 1124: FellegiSunterPredict Example 2 Output Table (Columns 1-4).................................................... 1178
Table 1125: FellegiSunterPredict Example 2 Output Table (Columns 5-9).................................................... 1178
Table 1126: Geospatial File Formats That GeometryLoader Accepts.............................................................. 1181
Table 1127: GeometryLoader Output Table Schema..........................................................................................1181
Table 1128: GeometryLoader Output Table........................................................................................................ 1183
Table 1129: PointInPolygon Source Table Schema............................................................................................ 1187
Table 1130: PointInPolygon Reference Table Schema....................................................................................... 1187
Table 1131: PointInPolygon Output Table Schema............................................................................................1188
Table 1132: PointInPolygon Example 1 Input Table source_passenger..........................................................1189
Table 1133: PointInPolygon Example 1 Input Table reference_terminal....................................................... 1189
Table 1134: PointInPolygon Example 1 Output Table (Columns 1-2)............................................................1189
Table 1135: PointInPolygon Example 1 Output Table (Columns 3-6)............................................................1190
Table 1136: PointInPolygon Example 2 Output Table (Columns 1-2)............................................................1190
Table 1137: PointInPolygon Example 2 Output Table (Columns 3-6)............................................................1190
Table 1138: PointInPolygon Example 2 Input Table source_passenger1........................................................1191
Table 1139: PointInPolygon Example 3 Output Table (Columns 1-3)............................................................1191
Table 1140: PointInPolygon Example 3 Output Table (Columns 4-7)............................................................1191
Table 1141: GeometryOverlay Boundary Operators.......................................................................................... 1193
Table 1142: GeometryOverlay Source Table Schema......................................................................................... 1194
Table 1143: GeometryOverlay Reference Table Schema....................................................................................1194
Table 1144: GeometryOverlay Output Table Schema........................................................................................ 1195
Table 1145: GeometryOverlay Input Table source_gatetype............................................................................ 1195
Table 1146: GeometryOverlay Input Table ref_terminal...................................................................................1195
Table 1147: GeometryOverlay Example 1 Output Table................................................................................... 1196
Table 1148: GeometryOverlay Example 2 Output Table................................................................................... 1197
Table 1149: GeometryOverlay Example 3 Output Table................................................................................... 1197
Table 1150: IdentityMatch Source Input Table Schema.................................................................................... 1202
Table 1151: IdentityMatch Reference Input Table Schema............................................................................... 1202
Table 1152: IdentityMatch Output Table Schema.............................................................................................. 1202
Table 1153: IdentityMatch Example Input Table applicant_reference............................................................1203
Table 1154: IdentityMatch Example Input Table applicant_external..............................................................1204
Table 1155: IdentityMatch Output Table (Columns 1-6)..................................................................................1205
Table 1156: IdentityMatch Output Table (Columns 7-13)................................................................................1205
Table 1157: IPGeo Input Table Schema............................................................................................................... 1207
Table 1158: IPGeo Output Table Schema............................................................................................................ 1207
Table 1159: IPGeo Example Input Table ipgeo_1...............................................................................................1208
Table 1160: IPGeo Example Output Table (Columns 1-7)................................................................. 1209
Table 1161: IPGeo Example Output Table (Columns 8-15)............................................................... 1209
Table 1162: JSONParser Example 1 Input Table.................................................................................................1217
Table 1163: JSONParser Example 1 Output Table..............................................................................................1218
Table 1164: JSONParser Example 2 Input Table.................................................................................................1218
Table 1165: JSONParser Example 2 Output Table (Columns 1-5)....................................................................1219
Table 1166: JSONParser Example 2 Output Table (Columns 6-8)....................................................................1219
Table 1214: PSTParserAFS Example 1 Input Table dum1.pst, Columns 5-8................................................. 1255
Table 1215: PSTParserAFS Example 1 Output Table dum1.pst, Columns 1-4.............................................. 1256
Table 1216: PSTParserAFS Example 1 Output Table dum1.pst, Columns 5-8.............................................. 1256
Table 1217: PSTParserAFS Example 2 Output Table......................................................................................... 1256
Table 1218: Input Data Example........................................................................................................................... 1259
Table 1219: Output Table Example.......................................................................................................................1259
Table 1220: ScaleMap, Scale, or PartitionScale Input Table Schema............................................................... 1260
Table 1221: ScaleMap Output Table Schema.......................................................................................................1260
Table 1222: Supported Statistical Data Types in ScaleMap Output Table.......................................................1260
Table 1223: Location and Scale for Statistical Methods..................................................................................... 1263
Table 1224: Scale and PartitionScale Output Table Schema..............................................................................1263
Table 1225: Supported Statistical Data Types in ScalePrinter Output Table.................................................. 1264
Table 1226: Scale Functions Examples Input Table scale_housing.................................................................. 1267
Table 1227: Scale and ScaleMap Example 1 Output Table................................................................................ 1268
Table 1228: Scale and ScaleMap Example 2 Output Table................................................................................ 1269
Table 1229: Scale and ScaleMap Example 3 Input Table scale_housing_test................................................. 1270
Table 1230: Scale and ScaleMap Example 3 Output Table................................................................................ 1270
Table 1231: ScalePrinter Example Output Table (Columns 1-3)......................................................................1271
Table 1232: ScalePrinter Example Output Table (Columns 4-6)......................................................................1271
Table 1233: Scale and ScaleMap Example 5 Output Table (Columns 1-3)..................................................... 1272
Table 1234: Scale and ScaleMap Example 5 Output Table (Columns 4-7)..................................................... 1273
Table 1235: Scale and ScaleMap Example 5 Output Table (Columns 1-3)..................................................... 1274
Table 1236: Scale and ScaleMap Example 5 Output Table (Columns 4-7)..................................................... 1274
Table 1237: Scale and KMeans Example Output Table......................................................................................1275
Table 1238: StringSimilarity Input Table Schema...............................................................................................1277
Table 1239: StringSimilarity Output Table Schema............................................................................................1278
Table 1240: StringSimilarity Example Input Table strsimilarity_input........................................................... 1278
Table 1241: StringSimilarity Example 1 Output Table (Columns 1-3)............................................................1279
Table 1242: StringSimilarity Example 1 Output Table (Columns 4-7)............................................................1279
Table 1243: StringSimilarity Example 2 Output Table (Columns 1-3)............................................................1280
Table 1244: StringSimilarity Example 2 Output Table (Columns 4-7)............................................................1281
Table 1245: Unpack Input Table Schema.............................................................................................................1284
Table 1246: Unpack Output Table Schema..........................................................................................................1284
Table 1247: Unpack Example 1 Input Table ville_tempdata.............................................................................1284
Table 1248: Unpack Example 1 Output Table.....................................................................................................1285
Table 1249: Unpack Example 2 Input Table ville_tempdata1...........................................................................1286
Table 1250: Unpack Example 2 Output Table.....................................................................................................1286
Table 1251: Pivot Input Table Schema................................................................................................................. 1288
Table 1252: Pivot Output Table Schema.............................................................................................................. 1289
Table 1253: Unpivot Examples Input Table unpivot_input.............................................................................. 1289
Table 1254: Unpivot Example 1 Output Table.................................................................................................... 1290
Table 1255: Unpivot Example 2 Output Table.................................................................................................... 1291
Table 1256: URIPack Output Table Schema........................................................................................................1294
Table 1257: URIPack Example Output Table...................................................................................................... 1294
Table 1258: Key Hierarchical URI Components................................................................................................. 1295
Table 1259: URIUnpack Input Table Schema..................................................................................................... 1296
Table 1260: URIUnpack Output Table Schema.................................................................................................. 1296
Overview
This guide provides instructions for users and administrators of Teradata Aster® Analytics 6.21. If you are
using a different version, download the edition of this guide that corresponds to your version.
The following additional resources are available:
• Aster Database upgrades, clients, and other packages:
http://downloads.teradata.com/download/tools
• Documentation for existing customers with a Teradata @ Your Service login:
http://tays.teradata.com/
• Documentation that is available to the public:
http://www.info.teradata.com/
Typefaces
Command line input and output, commands, program code, filenames, directory names, and system
variables are shown in a monospaced font. Words in italics indicate an example or placeholder value that
you must replace with a real value. Bold type draws your attention to important or changed items. Menu
navigation and user interface elements are shown in the User Interface Command font.
Notation Conventions
In the synopsis sections, we follow these conventions (an illustrative example follows the list):
• Square brackets ([ and ]) indicate one or more optional items.
• Curly braces ({ and }) indicate that you must choose an item from the list inside the braces. Choices are
separated by vertical lines (|).
• An ellipsis (...) means the preceding element can be repeated.
• A comma and an ellipsis (, ...) means the preceding element can be repeated in a comma-separated list.
• In command line instructions, SQL commands and shell commands are typically written with no
preceding prompt, but where needed the default SQL prompt is shown: beehive=>
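For example, a simplified, illustrative synopsis such as
SELECT { * | expression } [,...] FROM table_name [ LIMIT count ];
means that you must choose either * or an expression, that the chosen element can repeat in a
comma-separated list, and that the LIMIT clause is optional. (This synopsis is abridged for illustration
and is not the complete SELECT syntax.)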
Version History
Table 1: Version History Table
Introduction
• Analytics at Scale: Full Dataset Analysis
• Introduction to Teradata Aster SQL-MapReduce
• SQL-MapReduce Query Syntax
• SQL-MapReduce with Multiple Inputs
• Aster Analytics Function Product Bundles
• Aster Analytics Functions by Product Bundle
• Aster Analytics Functions by Category
What is MapReduce?
MapReduce is a framework for operating on large data sets using massively parallel processing (MPP)
systems. It enables complex analysis to be performed efficiently on extremely large data sets,
such as those obtained from weblogs and clickstreams. It has applications in areas such as machine learning,
scientific data analysis, and document classification.
The basic ideas behind MapReduce originated with the map and reduce functions common to many
programming languages, though the implementation and application are somewhat different on multi-node
systems.
In programming languages, a map function applies the same operation to every input tuple (for example,
every member of a list, element of an array, or row of a table) and produces one output tuple for each input
tuple. (A map function is sometimes called a transformation operation.)
On an MPP database such as Aster Database, the map step of a MapReduce function has special meaning.
The input data set is broken into smaller data sets, which are distributed to the worker nodes in a cluster,
where an instance of the function operates on them. If the data is already distributed as specified in the
function call, the distribution step does not occur, because the function can operate on the data where it is
already stored. The outputs from these smaller data sets may be fed back into the function for further
processing, passed as input to another function, or processed in some other way. Finally, all outputs are
consolidated on the queen to produce the final result, with one output tuple for each input tuple.
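For example, a hypothetical map-style function (called extract_domains here; it is not part of the shipped
function library) is invoked directly on its input table, and each vworker runs an instance of the function
on its locally stored rows. PARTITION BY ANY leaves the rows where they are already stored:
-- extract_domains is a hypothetical row (map) function.
SELECT *
FROM extract_domains(
    ON weblog PARTITION BY ANY
) AS d;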
In programming languages, a reduce function combines the input tuples to produce a single result by using a
mathematical operator (like sum, multiply, or average). Reduce functions consolidate data into smaller
groups of data. They can accept the output of a map or reduce function or operate recursively on their own
output.
In Aster Database, the reduce step of a MapReduce function follows this procedure:
1. The input data is partitioned by the given partitioning attribute.
2. If required by the function call, the input tuples are distributed to the worker nodes, with all the tuples
that share a partitioning key assigned to the same node for processing.
3. On each node, the function operates on the input tuples and returns the output tuples to the queen.
The number of tuples that the function outputs might differ from the number of input tuples that it
received.
4. The output from each node is consolidated on the queen.
5. If necessary, additional operations are performed on the queen.
For example, if the function averages its input, the average results from all the nodes must be averaged on
the queen to obtain the final output.
6. The SQL-MapReduce function returns the final output.
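For example, a hypothetical reduce-style function (called avg_sales here; it is not part of the shipped
function library) that averages sales within each region would be called with a PARTITION BY clause, so
that all rows with the same region value are processed on the same vworker:
-- avg_sales is a hypothetical partition (reduce) function.
SELECT *
FROM avg_sales(
    ON sales PARTITION BY region
) AS a;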
SELECT [ ALL | DISTINCT [ ON ( expression [,...] ) ] ]
{ * | expression [ [ AS ] output_name ] } [,...]
FROM sqlmr_function_name ( on_clause [ function_argument [...] ] ) [ [ AS ] alias ][,...]
[ WHERE condition ]
[ GROUP BY expression [,...] ]
[ HAVING condition [,...] ]
[ ORDER BY expression [ ASC | DESC ][ NULLS { FIRST | LAST } ][,...] ]
[ LIMIT { count | ALL } ]
[ OFFSET start ];
on_clause is:
{ partition_any_input | partition_attributes_input } [...] [ dimensional_input [...] ]
partition_any_input is:
ON table_input PARTITION BY ANY [ ORDER BY expression [,...] ]
partition_attributes_input is:
ON table_input PARTITION BY partitioning_attributes [ ORDER BY expression [,...] ]
dimensional_input is:
ON table_input DIMENSION [ ORDER BY expression [,...] ]
table_input is:
table_expression [ [ AS ] alias ]
table_expression is:
{ table_name | view_name | ( query ) }
The preceding syntax focuses on SQL-MapReduce. For the complete syntax of the SELECT statement,
including the WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, and OFFSET clauses, refer to the Aster
Database User Guide for Aster Appliances or Aster Database User Guide for Commodity Hardware.
Notes:
• sqlmr_function_name is the name of a SQL-MapReduce function you have installed in Aster Database. If
the sqlmr_function_name contains uppercase letters, you must enclose it in double quotation marks (").
• The on_clause provides the input data on which the function operates. This data is composed of one or
more partitioned inputs and zero or more dimensional inputs. The partitioned inputs can be a single
partition_any_input clause and/or one or more partition_attributes_input clauses. The dimensional
inputs can be zero or more dimensional_input clauses.
• partition_any_input and partition_attributes_input introduce expressions that partition the inputs before
the function operates on them.
• partitioning_attributes specifies the partition key(s) to use to partition the input data before the function
operates on it.
• dimensional_input introduces an expression that replicates the input to all nodes before the function
operates on it.
• order_by (optional) introduces an expression that sorts the input data after partitioning, but before the
function operates on it.
• table_input includes an alias for the table_expression. For rules about when an alias is required, see Rules
for Table Aliases. When declaring an alias, the AS keyword is optional.
• function_argument optionally introduces an argument clause that typically modifies the behavior of the
SQL-MapReduce function. Do not confuse argument clauses with input data: Input data is the data on
which the function operates; argument clauses provide runtime parameters. You pass an argument clause
in the form argument_name (literal[, ...]), where argument_name is the name of the
argument clause (as defined in the function) and literal is the value to be assigned to that argument. If an
argument clause is a multi-value argument, you can supply a comma-separated list of values. You can
pass multiple argument clause blocks, each consisting of an argument_name followed by its value(s)
encased in a single pair of parentheses, separated from the next argument clause block with whitespace
(not commas).
• AS provides an optional alias for the SQL-MapReduce function in the query.
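For example, a hypothetical function tokenize that takes two argument clauses, delimiter and
to_lower_case, could be called as follows. Note that the ON clause and the argument clause blocks are
separated by whitespace, not commas:
-- tokenize, delimiter, and to_lower_case are hypothetical names.
SELECT *
FROM tokenize(
    ON documents PARTITION BY doc_id ORDER BY line_no
    delimiter(' ')
    to_lower_case('true')
) AS t;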
A multiple-input SQL-MapReduce function accepts two kinds of inputs:
1. Partitioned inputs use the PARTITION BY keyword. All input rows that share a partitioning key are
assigned to the same vworker for processing.
Note:
All PARTITION BY partitioning_attributes clauses in a function must specify the same number of
attributes, and corresponding attributes must be equijoin compatible (that is, of the same datatype or of
datatypes that can be implicitly cast to match). This casting is “partition safe,” meaning that it does
not cause redistribution of data on the vworkers.
2. Dimensional inputs use the DIMENSION keyword, and the entire input is distributed to each vworker.
This is done because the entire set of data is required on each vworker for the SQL-MapReduce function
to run. The most common use cases for dimensional inputs are lookup tables and trained models.
Here's how it works. A multiple-input SQL-MapReduce function takes as input sets of rows from multiple
relations or queries. In addition to the input rows, the function can accept arguments passed by the calling
query. The function then effectively combines the partitioned and dimensional inputs into a single nested
relation for each unique set of partitioning attributes. The SQL-MapReduce function is then invoked once
on each record of the nested relation. It produces a single set of rows as output.
SELECT ...
ON store_locations DIMENSION,
ON purchases PARTITION BY purchase_date,
ON products DIMENSION ORDER BY prod_name
...
If the same table or view appears in more than one ON clause, you must give each reference a different alias.
Number of Inputs
A SQL-MapReduce function invocation triggers the following validations:
• A multiple-input SQL-MapReduce function expects more than one input.
Cogroup Example
This example uses a fictional SQL-MapReduce function named attribute_sales to show how cogroup
works. The function accepts two partitioned inputs, specified in two ON clauses, and two arguments.
The inputs to the SQL-MapReduce function are:
• weblog, which contains the store web logs, the source of purchase information
• adlog, which contains the logs from the ad server
Both inputs are partitioned on the user’s browser cookie.
The arguments to the attribute_sales function are clicks and impressions, which supply the
percentages of sales to attribute for ad clickthroughs and views (impressions) leading up to a purchase.
The call to the attribute_sales function takes the following general shape (a sketch; the attribution
percentages and the output alias are illustrative):
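-- A sketch: 0.8 and 0.2 are example attribution percentages.
SELECT *
FROM attribute_sales(
    ON weblog PARTITION BY cookie,
    ON adlog PARTITION BY cookie
    clicks(0.8)
    impressions(0.2)
) AS attributed;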
The two inputs are cogrouped before the function operates on them. Conceptually, the cogroup operation is
performed in two steps:
1. Each input data set is grouped according to the cookie attribute specified in the PARTITION BY clauses.
A cogroup tuple is formed for each unique resulting group. The tuple is composed of the cookie value
identifying the group and a nested relation that contains all values from both the weblog and adlog
inputs that belong to the group.
The middle box in the preceding figure shows the output of the cogroup operation.
2. The attribute_sales function is invoked once for each cogroup tuple.
This type of result cannot easily be computed using basic cogroup capabilities, because the data sets must be
related using a proximity join as opposed to an equijoin. The following figure shows how this relationship is
expressed and executed using cogroup extended with dimensional inputs.
Create a SQL-MapReduce function named closest_store, which accepts a partitioned input and a dimensional input, specified in two ON clauses. Use a query like the following sketch to call the closest_store function (any additional arguments that the function takes are omitted):
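SELECT *
FROM closest_store (
    ON phone_purchases PARTITION BY ANY,
    ON stores DIMENSION
);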
The closest_store function receives the result of a cogroup operation on the phone_purchases data
set and the dimensional input data set stores. The two boxes at the top of the diagram show sample
phone_purchases input and stores input, respectively. Conceptually, the operation is performed in
three steps:
1. The phone_purchases input remains grouped as it is stored in the database, as specified by the PARTITION BY ANY clause, and the stores input is grouped into a single group, as specified by the DIMENSION clause.
2. The groups are combined using what is essentially a Cartesian join. The result of the cogroup operation
is a nested relation. Conceptually, each tuple of the nested relation contains an arbitrary group of phone
purchases concatenated with the single group comprising all retail stores.
The middle box in the diagram shows the result of the cogroup operation.
The Aster Analytic Premium Portfolio 6.21 package combines the following packages:
• Aster MapReduce Analytic Foundation Portfolio 6.21
• Aster MapReduce Analytic Premium Portfolio - Path/Pattern Module 6.21
• Aster MapReduce Analytic Premium Portfolio - Relationship Module 6.21
• Aster Graph Analytic Premium Portfolio - Graph Module 6.21
Premium Path
Table 3: Premium Path Bundle Functions in Alphabetical Order
Function Name
Attribution
FrequentPaths
nPath
Path_Analyzer
Path_Generator
Path_Starter
Path_Summarizer
Premium Relationship
Table 4: Premium Relationship Bundle Functions in Alphabetical Order
Function Name
Basket_Generator
CFilter (Collaborative Filtering)
FPGrowth
nTree
WSRecommender
Function Name
AdaBoost_Drive
AdaBoost_Predict
AddOnePlayer
Antiselect
Apache Log Parser
Approximate Distinct Count
Approximate Percentile
Arima
ArimaPredictor
Burst
Canopy
Categorize
CCM
CCMPrepare
ChangePointDetection
CMAVG (Cumulative Moving Average)
ConfusionMatrix
Correlation
CoxPH
CoxPredict
CoxSurvFit
CrossValidation
DenseSVMModelPrinter
DenseSVMPredictor
DenseSVMTrainer
Distribution Matching
DTW
DWT
DWT2D
EMAVG (Exponential Moving Average)
EvaluateNamedEntityFinderPartition
EvaluateNamedEntityFinderRow
EvaluateSentimentExtractor
ExtractSentiment
FMeasure
FellegiSunterTrainer
FellegiSunterPredict
FindNamedEntity
Forest_Analyze
Forest_Drive
Forest_Predict
GenerateCombination
GeometryLoader
GeometryOverlay
GLM
GLMPredict
GMMFit
GMMPredict
GMMProfile
Histogram
HMMUnsupervisedLearner
HMMSupervisedLearner
HMMEvaluator
HMMDecoder
IdentityMatch
IDWT
IDWT2D
Interpolator
IPGeo
JSONParser
KMeans
KMeansPlot
KModes
KModesPredict
KNN
KNNRecommenderPredict
KNNRecommenderTrain
LARS
LARSPredict
LDAInference
LDATopicPrinter
LDATrainer
LDist (Levenshtein Distance)
LinReg
LinRegMatrix
LRTEST
Minhash
Multi_Case
MurmurHash
NaiveBayesTextClassifierPredict
NaiveBayesTextClassifierTrainer
NaiveBayesMap
NaiveBayesPredict
NaiveBayesReduce
NER
NEREvaluator
NERTrainer
NeuralNet
NeuralNetPredict
nGram
OutlierFilter
Pack
PartitionScale
PCA (Principal Component Analysis)
PCAPlot
Percentile
Pivot
PointInPolygon
POSTagger (Part-of-Speech Tagger)
PSTParserAFS
RandomSample
RtChangePointDetection
Sample
SAX2
Scale
ScaleMap
ScalePrinter
Sentenizer
SeriesSplitter
Sessionize
Single_Tree_Drive
Single_Tree_Predict
SMAVG (Simple Moving Average)
SortCombination
SparseSVMPredictor
SparseSVMTrainer
StringSimilarity
SupervisedShapeletClassifier
SupervisedShapeletTrainer
SVMModelPrinter
TextChunker
TextClassifier
TextClassifierTrainer
TextClassifierEvaluator
TextMorph
TextTagging
TextTokenizer
Text_Parser
TF_IDF (Term Frequency Inverse Document Frequency)
TrainNamedEntityFinder
TrainSentimentExtractor
Unpack
Unpivot
UnsupervisedShapelet
URIPack
URIUnpack
VARMAX
VectorDistance
VWAP (Volume-Weighted Average Price)
WMAVG (Weighted Moving Average)
XMLParser
XMLRelation
Premium Graph
Table 6: Premium Graph Bundle Functions in Alphabetical Order
Function Name
AllPairsShortestPath
Betweenness
Closeness
EigenvectorCentrality
gTree
LocalClusteringCoefficient
LoopyBeliefPropagation
Modularity
PageRank
pSALSA
RandomWalkSample
Function Name
Scorer
Function Description
Arima Calculates the coefficients for a sequence of parameters, producing an ARIMA
model.
ArimaPredictor Takes as input the ARIMA model produced by the Arima function and predicts a
specified number of future values (time point forecasts) for the modeled sequence.
Attribution Calculates attributions with a wide range of distribution models. Often used in web page analysis.
Burst Bursts (splits) a time interval into a series of shorter "burst" intervals that can be
analyzed independently.
Change-Point Detection Functions Detect the change points in a stochastic process or time series. The change-point detection functions are ChangePointDetection and RtChangePointDetection.
Convergent Cross-Mapping Includes the CCMPrepare function, which adds a new partition column and partitions the data to prepare it for use with the CCM function, which tests multiple causes and effects simultaneously, reporting an effect size for each cause-effect pair.
DTW Computes the dynamic time warping—the similarity between two sequences that
vary in time or speed.
DWT Implements Mallat’s algorithm, an iterative algorithm in the discrete wavelet
transform field that applies wavelet transform on multiple sequences simultaneously.
DWT2D Implements wavelet transforms on two-dimensional input, and simultaneously
applies the transforms on multiple sequences.
FrequentPaths Mines (finds) patterns that appear more than a specified number of times in the
sequence database. The difference between sequential pattern mining and frequent
pattern mining is that the former works on time sequences where the order of items
must be kept.
IDWT Applies inverse wavelet transformation on multiple sequences simultaneously. IDWT
is the inverse of DWT.
IDWT2D Simultaneously applies inverse wavelet transforms on multiple sequences. Inverse
function of DWT2D.
Interpolator Calculates missing values in a time series, using either interpolation or aggregation.
Interpolation estimates missing values between known values. Aggregation combines
known values to produce an aggregate value.
Path Analysis Functions Automate path analysis. These functions are useful for clickstream analysis of web site traffic and other sequence/path analysis tasks, such as advertisement or referral attribution. The path analysis functions are Path_Generator, Path_Summarizer, Path_Starter, and Path_Analyzer.
SAX2 Transforms original time series data into symbolic strings, which are more suitable
for many additional types of manipulation, because of their smaller size and the
relative ease with which patterns can be identified and compared. Input and output
formats allow it to supply data to the Shapelet functions.
SeriesSplitter Splits a partition into subpartitions (called splits) by creating an additional column
that contains the split identifier. Optionally, the function also copies a specified
number of boundary rows to each split.
Sessionize Maps each click in a clickstream to a unique session identifier.
Shapelet Functions Detect distinguishing features among ordered sequences (time series) and use them
to cluster or classify new data. The shapelet functions are UnsupervisedShapelet,
SupervisedShapeletTrainer, and SupervisedShapeletClassifier.
VARMAX Extends the ARMA/ARIMA model to work with time series with multiple response
variables (vector time series), as well as exogenous variables, or variables that are
independent of the other variables in the system.
Function Description
nPath Pattern-matching function that lets you specify a pattern in a row sequence, specify additional conditions on the rows matching the symbols, and extract useful information from the row sequence.
Function Description
Approximate Distinct Count Computes the approximate global distinct count of the values in one or more columns, scanning the table only once. Counts all children for a specified parent.
Approximate Percentile Computes approximate percentiles for one or more columns, with specified accuracy.
CMAVG Computes the cumulative moving average—the average of a value from the
beginning of a series.
ConfusionMatrix Shows how often a classification algorithm correctly classifies items.
Correlation Computes the global correlation between any pair of table columns.
CoxPH Estimates coefficients of a Cox proportional hazards model by learning a set of
explanatory variables. Generates coefficient and linear prediction tables.
CoxPredict Takes the coefficient table generated by the CoxPH function and outputs the hazard
ratios between predict features and either their corresponding reference features or
their unit differences.
CoxSurvFit Takes the coefficient and linear prediction tables generated by the CoxPH function
and outputs a table of survival probabilities.
CrossValidation Validates a model by assessing how the results of a statistical analysis will generalize
to an independent data set.
Distribution Matching Uses hypothesis testing to find the best matching distribution for data.
EMAVG Computes the average over a number of points in a time series while applying an
exponentially decaying damping (weighting) factor to older values so that more
recent values are given a heavier weight in the calculation.
FMeasure Calculates the accuracy of a test.
GLM Performs linear regression analysis for any of a number of distribution functions,
using a user-specified distribution family and link function.
GLMPredict Uses the model generated by the Stats GLM function to make predictions for new
data.
Hidden Markov Model Functions Describe the evolution of observable events that depend on factors that are not directly observable. The Hidden Markov Model functions are HMMUnsupervisedLearner, HMMSupervisedLearner, HMMEvaluator, and HMMDecoder.
Histogram Calculates the frequency distribution of a dataset using sophisticated binning
techniques that can automatically calculate the bin width and number of bins. The
function maps each input row to one bin and returns the frequency (row count) and
proportion (percentage of rows) of each bin.
KNN Uses the kNN algorithm to classify new objects based on their proximity to already-
classified objects.
LARS Functions Select the most important variables one by one and fit the coefficients dynamically.
The LARS functions are LARS and LARSPredict.
Linear Regression Output the coefficients of the linear regression model represented by the input
matrices.
LRTEST Performs the likelihood ratio test for two GLM models.
Percentile Finds percentiles on a per group basis.
Principal Component Analysis Common unsupervised learning technique that is useful for both exploratory data analysis and dimensionality reduction, often used as the core procedure for factor analysis. Implemented by the functions PCA_Map and PCA_Reduce. If the version of PCA_Reduce is AA 6.21 or later, you can input the PCA output to the function PCAPlot.
RandomSample Takes a data set and uses a specified sampling method to output one or more random
samples, each with a specified size.
Sample Draws rows randomly from input, using either of two sampling schemes.
Shapley Value Functions Compute the Shapley value, typically from nPath function output. The Shapley value is intended to reflect the importance of each player to the coalition in a cooperative game (a game between coalitions of players, rather than between individual players). The Shapley value functions are GenerateCombination, SortCombination, and AddOnePlayer.
SMAVG Computes the simple moving average for a number of points in a series.
Support Vector Machines Use a popular classification algorithm to build a predictive model according to a training set, give a prediction for each sample in the test set, and display the readable information of the model. Support Vector Machines include both SparseSVM and DenseSVM functions. The SparseSVM functions include SparseSVMTrainer, SparseSVMPredictor, and SVMModelPrinter, while the DenseSVM functions include DenseSVMTrainer, DenseSVMPredictor, and DenseSVMModelPrinter.
VectorDistance Measures the distance between sparse vectors (for example, TF-IDF vectors) in a
pairwise manner.
VWAP Computes the volume-weighted average price of a traded item (usually an equity
share) over a specified time interval.
WMAVG Computes the weighted moving average of a number of points in a time series, applying an arithmetically decreasing weighting to older values.
Text Analysis
Table 11: Text Analysis Functions
Function Description
LDA Functions Build a topic model based on the supplied training data and parameters, estimate the topic distribution for each document based on the generated model, and display information from the model. The LDA functions are LDATrainer, LDAInference, and LDATopicPrinter.
Levenshtein Distance (LDist) Computes the Levenshtein distance between two text values, that is, the number of edits needed to transform one string into the other, where edits include insertions, deletions, or substitutions of individual characters.
Naive Bayes Text Classifier Uses the Naive Bayes algorithm to classify data objects. The Naive Bayes Text Classifier is composed of the functions NaiveBayesTextClassifierTrainer and NaiveBayesTextClassifierPredict.
NER Functions (CRF Model Implementation) Use the Conditional Random Fields (CRF) model to specify how to extract features (for example, person, location, and organization) when training data models. These functions train, evaluate, and apply models. These NER functions are NERTrainer, NER, and NEREvaluator.
NER Functions (Max Entropy Model Implementation) Use the Max Entropy model to specify how to extract features (for example, person, location, and organization) when training data models. These functions train, evaluate, and apply models. These NER functions are FindNamedEntity, TrainNamedEntityFinder, EvaluateNamedEntityFinderPartition, and EvaluateNamedEntityFinderRow.
nGram Tokenizes (splits) an input stream and emits n multi-grams based on specified
delimiter and reset parameters. Useful for sentiment analysis, topic identification,
and document classification.
POSTagger Tags the parts-of-speech of input text.
Sentenizer Extracts the sentences in the input paragraphs.
Sentiment Extraction Functions Deduce user opinion (positive, negative, or neutral) from text. The sentiment extraction functions are TrainSentimentExtractor, ExtractSentiment, and EvaluateSentimentExtractor.
Text Classifier Chooses the correct class label for given text. Text Classifier is composed of the
functions TextClassifierTrainer, TextClassifier, and TextClassifierEvaluator.
Text_Parser Tokenizes a stream of words, optionally stems them, and outputs the individual
words and their counts.
TextChunker Divides text into phrases and assigns each phrase a tag identifying its type.
TextMorph Provides lemmatization, a basic tool in text analysis. Outputs a standard form of the
input words.
TextTagging Tags input tuples according to user-defined rules that use logical and text processing
operators.
TextTokenizer Extracts tokens (for example, words, punctuation marks, and numbers) from text.
TF_IDF Evaluates the importance of a word within a specific document, weighted by the
number of times the word appears in the entire document set.
Function Description
Canopy Simple, fast, accurate function for grouping objects into preliminary clusters. Often
used as an initial step in more rigorous clustering techniques, such as k-means.
Gaussian Mixture Model Functions Fit a Gaussian mixture model (GMM) to input data, using either a basic GMM algorithm with a fixed number of clusters or a Dirichlet Process GMM (DP-GMM) algorithm with a variable number of clusters. The GMM functions are GMMFit, GMMPredict, and GMMProfile.
KMeans Takes a data set and outputs the centroids of its clusters and, optionally, the clusters
themselves.
KMeansPlot Takes a model—a table of cluster centroids output by the KMeans function—and an
input table of test data, and uses the model to assign the test data points to the cluster
centroids.
KModes Extends KMeans to support categorical data. The core algorithm is an expectation-
maximization algorithm that finds a locally optimal solution.
KModesPredict Prediction function that corresponds to KModes.
Minhash Probabilistic clustering method that assigns a pair of users to the same cluster with
probability proportional to the overlap between the sets of items that these users have
bought.
Naive Bayes
Table 13: Naive Bayes Functions
Function Description
Naive Bayes Functions Train a Naive Bayes classification model and use the model to predict new outcomes. The Naive Bayes functions are NaiveBayesMap, NaiveBayesReduce, and NaiveBayesPredict.
Ensemble Methods
Table 14: Ensemble Methods Functions
Function Description
Random Forest Functions Create a predictive model based on a combination of the classification and regression trees (CART) algorithm for training decision trees and the ensemble learning method of bagging. The Random Forest functions are Forest_Drive, Forest_Predict, and Forest_Analyze.
Single Decision Tree Functions Create a predictive model that has a single decision tree. The Single Decision Tree functions are Single_Tree_Drive and Single_Tree_Predict.
AdaBoost Functions Create a predictive model based on the AdaBoost algorithm. The AdaBoost functions are AdaBoost_Drive and AdaBoost_Predict.
Association Analysis
Table 15: Association Analysis Functions
Function Description
Basket_Generator Generates baskets (sets) of items that occur together in data records (typically
transaction records or web page logs).
CFilter Helps discover which items or events are frequently paired with other items or
events.
FPGrowth Uses an FP-growth algorithm to generate association rules from patterns in a data set
and then determines their interestingness.
Recommender Functions The recommender functions include the following: WSRecommender is an item-based, collaborative filtering function that uses a weighted-sum algorithm to make recommendations (such as items for users to consider buying). KNNRecommenderTrain and KNNRecommenderPredict take a similar approach to WSRecommender, but attempt to increase prediction accuracy by adjusting for systematic biases and replacing heuristic calculations of similarity coefficients with a global optimization that simultaneously estimates all weights.
Graph Analysis
Table 16: Graph Analysis Functions
Function Description
AllPairsShortestPath Computes the shortest distances between all combinations of the specified
source and target vertices.
Betweenness Determines betweenness for every vertex in a graph. Betweenness is a type
of centrality (relative importance) measurement.
Closeness Computes closeness and k-degree scores for each specified source vertex
in a graph.
EigenvectorCentrality Calculates the centrality (relative importance) of each node in a graph.
gTree Follows all paths in a graph, starting from a given set of root vertices, and
calculates specified aggregate functions along those paths.
LocalClusteringCoefficient Analyzes the structure of a network.
LoopyBeliefPropagation Calculates the marginal distribution for each unobserved node,
conditional on any observed nodes.
Modularity Discovers communities (clusters) in input graphs without advance
information about the clusters. Detects communities by discovering the
strength of relationships among data points.
nTree Builds and traverses tree structures on all worker nodes in a graph.
PageRank Computes PageRank values for a directed graph.
pSALSA Evaluates the similarity of nodes in a bipartite graph according to their
proximity. Typically used for recommendation.
RandomWalkSample Outputs a sample graph that represents the input graph (which is typically
extremely large).
Function Description
AMLGenerator Transforms model data from Aster to an XML-based AML (Aster Model Language)
format that is compatible with the real-time functionality.
Scorer Provides a software framework to score input queries based on a given model and
predictor. The following real-time functions are currently supported by scorer: Aster
Scoring SDK CoxPH, Aster Scoring SDK Extract Sentiment, Aster Scoring SDK
Generalized Linear Model, Aster Scoring SDK LDAInference, Aster Scoring SDK
Naïve Bayes, Aster Scoring SDK Naïve Bayes Text Classifier, Aster Scoring SDK
Random Forest, Aster Scoring SDK Single Decision Tree, Aster Scoring SDK
SparseSVM, Aster Scoring SDK Text Parser, Aster Scoring SDK Text Tagging, and
Aster Scoring SDK Text Tokenizer.
NeuralNet
Table 18: Neural Net Functions
Function Description
NeuralNet Uses backpropagation to train neural networks. The user provides input data and
other argument settings for training the networks, and the fitted weights of the neural
network are created. The Neural Net function is optimized for performance on very
large datasets (millions of rows).
NeuralNetPredict Predicts the output for specific arbitrary covariate inputs, using a particular trained
neural network output weight table.
Function Description
Antiselect Returns all columns except those specified.
Apache_Log_Parser Parses Apache log file content and extracts multiple columns of structural
information, including search engines and search terms.
Categorize Converts specified columns from any numeric type to VARCHAR.
Fellegi-Sunter Functions FellegiSunterTrainer estimates the parameters of the Fellegi-Sunter model, using
either supervised or unsupervised learning. FellegiSunterPredict predicts whether a
pair of objects are duplicates.
Geometry Functions GeometryLoader retrieves file-based geospatial files from AFS, parses them, and
stores them in Aster Database. GeometryOverlay calculates the result of overlaying
two geometries as specified by the overlay operator. PointInPolygon takes as input a
list of location points and a list of polygons and returns a list of binary values for
every point-and-polygon combination, which indicates whether the point is
contained in the polygon.
IdentityMatch Tries to match enterprise customers with user records provided by external data sources.
IPGeo Maps IP addresses to information that you can use to identify the geographical
location of a visitor.
JSONParser Extracts the element name and text from JSON strings and outputs them in a
flattened relational table.
Multi_Case Extends the capability of the SQL CASE statement by supporting matches to
multiple options and iterating through the input data set only once, emitting
matches as they occur.
MurmurHash Computes the hash value of the input columns.
OutlierFilter Removes outliers from a data set.
Pack Compresses data in multiple columns into a single “packed” data column.
Pivot Converts rows into columns.
PSTParserAFS Parses Personal Storage Table (PST) files that store email in Microsoft software such
as Microsoft Outlook and Microsoft Exchange Client.
Scale Functions Normalize input data sets. The Scale functions are ScaleMap, Scale, ScalePrinter, and PartitionScale.
StringSimilarity Calculates the similarity between two strings, using the Jaro, Jaro-Winkler, N-Gram, or Levenshtein distance.
Unpack Expands data from a single “packed” column to multiple “unpacked” columns.
Unpivot Converts columns into rows.
URIPack Reconstructs encoded hierarchical uniform resource identifier (URI) strings that
were unpacked by the URIUnpack function.
URIUnpack Separates hierarchical URIs into constituent components and extracts the values of
specified parameters.
XMLParser Extracts data from XML documents and flattens it into a relational table.
XMLRelation Extracts element name, text and attribute values, and structural information from XML documents and outputs them in a relational table.
Sample output from the \dE command and the preceding query is in the following two tables. The version numbers of the Aster Analytics functions AllPairsShortestPath and Antiselect are 6.20_rel_1.5_r39242 and 6.20_rel_1.0_r39242, respectively.
Note:
Neither the \dE command nor the query displays version numbers for the Aster Database Utility
functions, which are installed as part of the Aster Database installation and always compatible with it.
cd /opt/teradata/AsterAnalytics_Foundation
3. If the ZIP file is not there, contact your Teradata account manager to get its location.
4. Create a directory for the package that you are installing. For example:
mkdir AA_6.21
5. Change your directory to the newly created directory. For example:
cd AA_6.21
Postrequisite
Getting Install and Uninstall Scripts
Note:
Driver functions look for the functions that they call internally in the PUBLIC schema. Therefore, driver
functions might work incorrectly if you install all the functions to a schema other than PUBLIC.
cp /opt/teradata/AsterAnalytics_Foundation/AsterAnalytics_Foundation__indep_indep.06.21.00.00.zip README postinstall /opt/teradata/AsterAnalytics_Foundation/AA_6.21/
2. Unzip the ZIP file. For example:
unzip AsterAnalytics_Foundation__indep_indep.06.21.00.00.zip
install_aster_analytics.sql
un_install_aster_analytics.sql
Next Step:
• If you are installing the package for the first time:
Installing an Aster Analytics Function Package
• If you are updating a package that is already installed:
Updating an Aster Analytics Function Package
To get the scripts that install a package in, and uninstall it from, a specified schema:
1. Go to https://downloads.teradata.com/download/aster
2. Download the Aster Analytics Custom Schema Installer package.
3. Unzip and copy these files to your current directory:
make_install_scripts.py analytics_packages.csv
4. Run this command, where SCHEMANAME is the name of the desired schema:
python make_install_scripts.py $SCHEMANAME
The Python script make_install_scripts.py reads the CSV file analytics_packages.csv, which tells which functions are in which packages, and generates install and uninstall SQL scripts for each package, which it saves to your current directory. The names of the generated SQL scripts have this format (shown here for the Aster Analytics package):
install_ASTER_ANALYTICS_to_SCHEMANAME.sql
un_install_ASTER_ANALYTICS_from_SCHEMANAME.sql
Postrequisite
• If you are installing the package for the first time:
Installing an Aster Analytics Function Package
• If you are updating a package that is already installed:
Updating an Aster Analytics Function Package
1. Ensure that you have the necessary access privileges for installing the Aster Analytics functions in the
desired schema.
Note:
Teradata recommends installing the Aster Analytics functions only in the schema PUBLIC. If you
must maintain a different version of a function in another schema, refer to Scripts for a Specified
Schema.
2. Go to the directory where the functions from the package are. For example:
cd /opt/teradata/AsterAnalytics_Foundation/AA_6.21
3. Run the SQL script that installs the package. For example:
install_aster_analytics.sql
or:
install_ASTER_ANALYTICS_to_SCHEMANAME.sql
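For example, assuming you started ACT in that directory, you can run the script with the psql-style \i meta-command:
beehive=> \i install_aster_analytics.sql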
4. Check the version numbers of the newly installed functions, using the instructions in Finding Function
Version Numbers.
Note:
Aster Database does not share database objects, including functions, across databases.
Note:
When you expand a cluster by adding worker nodes, all installed Aster Analytics functions on your
cluster are automatically added to the new nodes.
Postrequisite
Set Permissions to Allow Users to Run Functions
Postrequisite
Testing the Functions
The procedure for uninstalling the package that contains the older functions and then installing the package
that contains the newer versions is:
1. Find the version numbers and schemas of the installed functions, using the instructions in Finding
Function Version Numbers.
2. Ensure that you have the necessary privileges for uninstalling the Aster Analytics functions in their
schema.
3. In the list of version numbers, find the release number of the functions in the package.
For example, in Finding Function Version Numbers, the installed Aster Analytics functions have release
number 6.20.
4. Go to the directory where the functions from the older package are. For example:
cd /opt/teradata/AsterAnalytics_Foundation/AA_6.20
Note:
Uninstall scripts for older packages are also available in the Aster Analytics 6.21 packages.
5. Run the SQL script that uninstalls the older package. For example:
un_install_aster_analytics.sql
or:
un_install_ASTER_ANALYTICS_from_SCHEMANAME.sql
6. Go to the directory where the functions from the newer package are. For example:
cd /opt/teradata/AsterAnalytics_Foundation/AA_6.21
7. Run the SQL script that installs the newer package. For example:
install_aster_analytics.sql
or:
install_ASTER_ANALYTICS_to_SCHEMANAME.sql
Postrequisite
The alternative to the recommended procedure is to use the ACT commands \remove and \install on each
function:
1. beehive=> \remove function_filename
2. beehive=> \install function_filename
Command Meaning
\dF Lists all installed files and functions.
\install file [installed_filename] Installs the file or function file, which is the path name of the file relative to the directory where you are running ACT. By default, the installed local file or function has the same name as the corresponding remote file. The optional installed_filename is an alias. Aliases are useful for renaming helper files, but are not recommended for SQL-MapReduce functions, because they can cause confusion.
\download installed_filename [newfilename] Downloads the file or function installed_filename to the directory where you are running ACT. By default, the downloaded local file or function has the same name as the corresponding remote file. To give the local file or function a new name, specify the optional newfilename. If newfilename is a path, the destination directory must exist on the file system where you are running ACT.
\remove installed_filename Removes the file or function installed_filename.
Usage Notes
These usage notes apply to all functions, except as noted.
If such a database object name is a function argument that must be enclosed in single quotation marks, then you must put the double quotation marks inside the single quotation marks, as in the following sketch of a SQL-MapReduce query (the Antiselect function's Exclude argument and the mixed-case column name are illustrative assumptions):
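SELECT *
FROM Antiselect (
    ON input_table
    Exclude ('"Mixed Case Column"')
);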
argument ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
For such arguments, the values 't', 'yes', 'y', and '1' are equivalent to 'true', and the values 'f', 'no', 'n', and '0' are equivalent to 'false'.
'start_column:end_column' [, '-exclude_column' ]
DATE Columns
Input columns of the type DATE must have formats with four-digit years.
BC/BCE Timestamps
SQL-MapReduce functions do not support Before the Common Era (BCE) timestamps. BCE is an alternative to Before Christ (BC). Examples of BC/BCE timestamps are:
4713-01-01 11:07:11-07:52:58 BC
4713-01-01 11:07:11 BC
In this example, because a schema was not specified in the argument ModelFile, the function selects the first schema from your search path and tries to install the model file in it. By default, the first schema in the search path is the 'public' schema. If you do not have CREATE privileges on the 'public' schema, a privilege failure occurs.
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
Arguments
Examples
When you use authentication cascading, the function inherits the session credentials, so you can omit the explicit authentication arguments from a driver function call. In the following sketches, driver_function, other_arguments, and the ON (SELECT 1) PARTITION BY 1 clause are illustrative placeholders, not the syntax of a specific function. You can change the usage of the example driver function from:
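SELECT *
FROM driver_function (
    ON (SELECT 1) PARTITION BY 1
    Domain ('host:port')
    Database ('db_name')
    UserID ('user_id')
    Password ('password')
    other_arguments
);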
to:
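SELECT *
FROM driver_function (
    ON (SELECT 1) PARTITION BY 1
    other_arguments
);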
The following sketch (with the same placeholders, and the TrustStore path taken from the SSLSettings example later in this section) connects to the database over SSL using authentication cascading:
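SELECT *
FROM driver_function (
    ON (SELECT 1) PARTITION BY 1
    SSLSettings ('ENABLESSL=true&SSLTRUSTSTORE=/home/beehive/truststore/truststore.jks')
    SSLTrustStorePassword ('123456')
    other_arguments
);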
SSLSettings ('SSLsettings')
SSLTrustStorePassword ('SSLTrustStorePassword')
Arguments
Argument Category Description
SSLSettings Required The string that specifies the SSL connection information,
excluding the SSL TrustStore password. Use this argument if you
want the function to use a JDBC SSL connection to connect to
Aster Database instead of a normal JDBC connection. The
connection string specified by this argument is appended to the
end of the SSL JDBC connection string.
For example, if the domain is 192.168.1.2 and the database name is
beehive, specifying
SSLSettings('ENABLESSL=true&SSLTRUSTSTORE=/home/
beehive/truststore/truststore.jks') results in this
connection string:
jdbc:ncluster://192.168.1.2/beehive?
ENABLESSL=true&SSLTRUSTSTORE=/home/beehive/
truststore/truststore.jks
SSLTrustStorePassword Required The SSL TrustStore password. This password is required if you use
the SSLSettings argument. For example:
SSLTrustStorePassword ('123456')
Example
...
ON vertices_table AS vertices PARTITION BY ...
ON edges_table AS edges PARTITION BY ...
...
These aliases are not variables and must be used as specified in the function syntax. If you use different
aliases, the function throws error messages, as shown in the following example:
beehive=> ...
beehive=> ON cities AS vr PARTITION BY ...
beehive=> ON freeways AS edges PARTITION BY ...
beehive=> ...
ERROR: SQL-MR function ALLPAIRSSHORTESTPATH requires input table or query
with alias: vertices
Arima
Summary
The Arima function calculates the coefficients for a sequence of parameters, producing an ARIMA model
that is typically input to the function ArimaPredictor.
Background
An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving
average (ARMA) model. Typically, these models are fitted to time series data to predict future data points
(forecasting).
An ARIMA model adds to an ARMA model a degree of differencing, which makes the time series stationary,
if necessary. Sometimes the ARIMA model uses both differencing and some nonlinear transformations, such
as logging, to make the time series stationary.
A random variable that is a time series is stationary if its statistical properties are constant over time. An
ARIMA model acts as a filter that separates the signal from the noise, so that only the signal is used for
forecasting.
B^n yt = yt-n, where B is the backshift operator.
et is the residual error, the difference between the actual and predicted values of yt.
Usage
Arima Syntax
Version 1.1
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
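The following syntax sketch is assembled from the argument descriptions below; the ON (SELECT 1) PARTITION BY 1 clause is the customary form for driver functions, and it, the omitted authentication arguments, and the exact rendering of the list-valued arguments are assumptions:
SELECT * FROM Arima (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('input_table')
    ModelTable ('model_table')
    [ ResidualTable ('residual_table') ]
    TimestampColumns ('timestamp_column' [,...])
    ValueColumn ('value_column')
    Orders ('p, d, q')
    [ SeasonalOrders ('sp, sd, sq') ]
    [ Period ('period') ]
    [ IncludeMean ({'true'|'false'}) ]
    [ Fixed ('fixed_params') ]
    [ InitValues ('init_params') ]
    [ MaxIterNum ('max_iteration_number') ]
);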
Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the input
parameters.
ModelTable Required Specifies the name of the table where the function outputs the
coefficients of the input parameters; that is, the model.
ResidualTable Optional Specifies the name of the table where the function outputs the
residuals of the input parameters.
Note:
Specify this argument if you will input the model to the
ArimaPredictor function.
TimestampColumns Required Specifies the names of the input_table columns that specify the
sequence (time points) of the input parameters. The sequence
must have uniform intervals.
ValueColumn Required Specifies the name of the column that contains the time series data
in input_table.
Orders Required Specifies the values of the nonseasonal parameters p, d, and q for
the ARIMA model. Each value must be an INT between 0 and 20,
inclusive.
SeasonalOrders Optional Specifies the values of the seasonal parameters sp, sd, and sq for
the ARIMA model. Each value must be an INT between 0 and 20,
inclusive.
Period Optional Specifies the period of a season (m in the formula). This value
must be a positive integer value. If you specify SeasonalOrders,
then you must also specify Period.
IncludeMean Optional Specifies whether the function adds the mean value (c in the
formula) to the ARIMA model. The default value is 'false'.
Note:
If IncludeMean is 'true', then both d in Orders and sd in
SeasonalOrders must be 0.
Fixed Optional Specifies the values of the parameters. The numeric vector
fixed_params must have a value for each parameter (for the
correspondence between values and parameters, see the note that
follows this table). If you specify IncludeMean('true'), then you
must add the mean value to the end of fixed_params.
If a value in fixed_params is non-NaN, then the corresponding
parameter is fixed at that value; otherwise, the function optimizes
the value of that parameter.
InitValues Optional Specifies the initial values of the parameters. The numeric vector
init_params must have a value for each parameter (for the
correspondence between values and parameters, see the note that
follows this table). If you specify IncludeMean('true'), then you
must add the initial mean value to the end of init_params.
If a value is NaN, then the corresponding parameter has the initial
value 0.
MaxIterNum Optional Specifies the maximum iteration number for estimating the
parameters. This value must be a positive integer. The default
value is 100.
Note:
The values in the vectors fixed_params and init_params correspond to these parameters in this order:
φ1, φ2, … , φp , θ1, θ2, … , θq , seasonalφ1, seasonalφ2, … , seasonalφsp ,
seasonalθ1, seasonalθ2, … , seasonalθsq , [meanValue]
Input
The Arima function has one required input table, which must include the columns described in the
following table.
Table 25: Arima Input Table Schema
Note:
If the time points do not have uniform intervals, then run
the function Interpolator on them before running the
Arima function on the input table. Otherwise, the
intervals of the predictions of the ArimaPredictor
function might not be as expected.
Output
The Arima function has two output tables, the model table and (optionally) the residual table.
The model table contains the coefficients of the model. The function outputs the coefficients to both the
model table and the console.
The following table shows the schema of the model table.
Table 26: Arima Model Table Schema
The residual table contains the value and residual for each time point.
Table 28: Arima Residual Table Schema
Example
This example uses monthly milk consumption in the US between 1962 and 1974.
Input
Table 29: Arima Example Input Table milk_timeseries
id period milkpound
1 1962-01 578.3
2 1962-02 609.8
3 1962-03 628.4
4 1962-04 665.6
5 1962-05 713.8
6 1962-06 707.2
7 1962-07 628.4
8 1962-08 588.1
9 1962-09 576.3
10 1962-10 566.5
... ... ...
SQL-MapReduce Call
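The following call is a sketch consistent with the example tables (the model table output shows nonseasonal orders 3, 0, 0 and an included mean, and the ArimaPredictor example reuses a residual table named arimaresidual); the ON clause form is assumed:
SELECT * FROM Arima (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('milk_timeseries')
    ModelTable ('arimamodel')
    ResidualTable ('arimaresidual')
    TimestampColumns ('period')
    ValueColumn ('milkpound')
    Orders ('3, 0, 0')
    IncludeMean ('true')
);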
Output
Table 30: Arima Example Model Table: arimamodel
coef value
coef 3, 0, 0, 0, 0, 0
ar_params 1.831645480043797, -1.179667384141421, 0.3477840265302269
ar_params_sd 0.07527420134556974, 0.13530358092847117, 0.07512437361242541
ma_params
ma_params_sd
seasonal_ar_params
seasonal_ar_params_sd
seasonal_ma_params
seasonal_ma_params_sd
mean_param 1.0015614905052501
mean_param_sd NaN
period 0
sigma2 629.5936409862431
loglikelihood -724.0702297433396
iterations 18
converged true
ArimaPredictor
Summary
The ArimaPredictor function takes as input the ARIMA model produced by the function Arima and
predicts a specified number of future values (time point forecasts) for the modeled sequence.
ArimaPredictor Syntax
Version 1.1
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
ModelTable Required Specifies the name of the table that contains the model. This table
is the model table that is output by the Arima function.
ResidualTable Required Specifies the name of the table that contains the original input
parameters and their residuals. This table is the residual table that
is output by the Arima function.
TimestampColumns Required Specifies the names of the residual_table columns that specify the
sequence (time points) of the original input parameters. The
sequence must have uniform intervals.
ValueColumn Optional Specifies the name of the column that contains the time series data
in residual_table.
ResidualColumn Optional Specifies the name of the column in residual_table that contains
the residuals.
StepAhead Required Specifies the number of steps to forecast after the end of the time
series. This value must be a positive integer.
Output
Table 32: ArimaPredictor Output Table Schema
Example
Input
Use the following tables from the Output section of the Arima function Example section:
• Arima Example Model Table: arimamodel
• Arima Example Residual Table: arimaresidual
SQL-MapReduce Call
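The following call is a sketch consistent with the argument table and with the 15 forecast steps in the output; the ON clause form is assumed, and the optional ValueColumn and ResidualColumn arguments are omitted because the residual table's column names are not shown in this section:
SELECT * FROM ArimaPredictor (
    ON (SELECT 1) PARTITION BY 1
    ModelTable ('arimamodel')
    ResidualTable ('arimaresidual')
    TimestampColumns ('period')
    StepAhead ('15')
) ORDER BY stepahead;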
Output
Table 33: ArimaPredictor Example Output
stepahead predict
1 814.094735390403
2 822.289686640269
3 823.383629105417
4 821.247962247746
5 818.895762597647
6 817.487198991092
7 816.939272216913
8 816.77924369722
9 816.642623604132
10 816.39060427359
11 816.034505110816
12 815.632042387207
13 815.227303628928
14 814.836892259429
15 814.459284044822
Attribution
Summary
The Attribution function is used in web page analysis, where it lets companies assign weights to pages before
certain events, such as buying a product.
The function calculates attributions with a choice of distribution models and has two versions, multiple-input and single-input. The multiple-input version gets many parameters from input tables. The single-input version gets all parameters from arguments. The recommended version depends on the number of parameters.
With a large number of parameters, the multiple-input version is recommended. You must create the tables
of parameters, but whenever you call the function, you can use the tables instead of specifying each
parameter in an argument.
If the number of parameters is so small that you prefer to specify them in arguments rather than create tables
for them, then you can use the single-input version.
• Attribution (Multiple-Input Version)
• Attribution (Single-Input Version)
Background
Before buying a product online, a customer is usually exposed to typical events or interactions (such as
clicks, page visits, and page impressions) that are associated with different channels (such as email, social
network connections, paid search advertising, organic search, direct buy, and referral). The sequence of
Summary
The multiple-input version of the Attribution function takes data and parameters from multiple tables and
outputs attributions.
Note:
A query that runs longer than 3 seconds before displaying output might indicate that some of the arguments supplied to the function are incorrect.
Usage
Input
The required input tables are:
• input_table, which contains the clickstream data to use for computing attributions
• conversion_event_table (alias conversion), which contains conversion events
• model1_table (alias model1), which defines the type and distributions of the first model
The optional input tables are:
• excluding_event_table (alias excluding), which contains events to exclude from attribution
• optional_event_table (alias optional), which contains optional events
• model2_table (alias model2), which defines the type and distributions of the second model
• input_table_1, input_table_2, and so on, which contain additional clickstream data
The optional input tables have the same schema as input_table. Specifying these tables lets you co-group
attributes from all specified input tables (for example, ad_click, impressions, and conversions).
Table 34: Attribution Input Table Schema
Row 0: Model Type Row 1, ..., n: Distribution Model Specification Additional Information
SIMPLE MODEL:PARAMETERS Distribution model for all events. For MODEL
and PARAMETER definitions, refer to the
following table.
EVENT_REGULAR EVENT:WEIGHT:MODEL:PARAMETERS Distribution model for a regular event.
EVENT cannot be a conversion, excluded, or
optional event.
For MODEL and PARAMETER definitions,
refer to the following table.
The sum of the WEIGHT values must be 1.0.
For example, suppose that the model table has
these specifications:
The following table describes the MODEL values and their corresponding PARAMETER values. MODEL
values are case-sensitive. Attributable events are those whose types are not specified in the excluding events
table.
Table 40: Attribution Distribution Model Specification: Models and Parameters
Output
Table 42: Attribution Output Table Schema
Example
This example uses models to assign attribution weights to these events and channels:
Table 43: Attribution Example: Event Types and Channels
conversion_events PaidSearch, SocialNetwork
excluding_events Email
optional_events Direct, OrganicSearch, Referral
The following two model tables apply the distribution models by rows and by seconds, respectively.
Table 49: Multiple-Input Attribution Example Model Table model1_table
id model
0 SEGMENT_ROWS
1 3:0.5:EXPONENTIAL:0.5,SECOND
2 4:0.3:WEIGHTED:0.4,0.3,0.2,0.1
3 3:0.2:FIRST_CLICK:NA
Table 50: Multiple-Input Attribution Example Model Table model2_table
id model
0 SEGMENT_SECONDS
1 6:0.5:UNIFORM:NA
2 8:0.3:LAST_CLICK:NA
3 6:0.2:FIRST_CLICK:NA
SQL-MapReduce Call
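The following call is a sketch built from the input tables and aliases listed in the Usage section; the partitioning and ordering columns (user_id, time_stamp) are illustrative assumptions, and any column-name arguments are omitted:
SELECT * FROM Attribution (
    ON input_table PARTITION BY user_id ORDER BY time_stamp,
    ON conversion_event_table AS conversion DIMENSION,
    ON excluding_event_table AS excluding DIMENSION,
    ON optional_event_table AS optional DIMENSION,
    ON model1_table AS model1 DIMENSION,
    ON model2_table AS model2 DIMENSION
);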
Output
Table 51: Multiple-Input Attribution Example Output Table
Summary
The single-input version of the Attribution function takes data from a single table and outputs attributions.
Parameters come from arguments, not input tables.
Usage
Note:
In the Model1 and Model2 arguments, colons are parameter delimiters. If a parameter contains colons,
enclose it in double quotation marks. For example:
Arguments
Argument Category Description
EventColumn Required Specifies the name of the input column that contains the
clickstream events.
Model2 Optional Defines the type and distributions of the second model. For
example:
Model2 ('EVENT_OPTIONAL', 'OrganicSearch:
0.5:UNIFORM:NA', 'Direct:0.3:UNIFORM:NA', 'Referral:
0.2:UNIFORM:NA')
For more information see the following tables in the Input section
of the function: Attribution (Multiple-Input Version).
• For model type and specification definitions, see the table:
Attribution Model Types and Specification Definitions
Input
Use the following table from the Input section of the Attribution (Multiple-Input Version) function Usage
section.
• Attribution Input Table Schema
Output
Use the following table from the Output section of the Attribution (Multiple-Input Version) function Usage
section.
• Attribution Output Table Schema
Examples
These examples use the events and channels from the following table from the Example section of the
Attribution (Multiple-Input Version) function.
• Attribution Example Event Types and Channels
Input
SQL-MapReduce Call
SQL-MapReduce Call
Output
The only difference between this output and the output of Example 1 is the attribution of the optional
events.
Table 54: Single-Input Attribution Example 2 Output Table
Input
This example uses the same input table, Single-Input Attribution Example 1: Input Table
attribution_sample_table, as was used in Example 1.
SQL-MapReduce Call
Output
Input
Single-Input Attribution Example 1 Input Table attribution_sample_table
SQL-MapReduce Call
Output
Input
SQL-MapReduce Call
Input
This example uses the same input table, Single-Input Attribution Example 5: Input Table
attribution_sample_table3, as was used in Example 5.
SQL-MapReduce Call
Output
Burst
Summary
The Burst function bursts (splits) a time interval into a series of shorter "burst" intervals that can be analyzed
independently.
Each row of the input table contains the start and end times of a time interval. For each input row, the
function writes a series of rows to the output table. Each output row contains the start and end time of a
burst interval.
The burst intervals can have either the same length (specified by the TimeInterval argument), the same
number of data points (specified by the NumPoints argument), or specific start and end times (specified by
time_table).
Burst Syntax
Version 1.0
Arguments
Argument Category Description
TimeColumn Required Specifies the names of the input_table columns that contain the
start and end times of the time interval to be burst.
TimeInterval Optional Specifies the length of each burst time interval. This value must
be either INTEGER or DOUBLE PRECISION.
Note:
Specify exactly one of time_table, TimeInterval, or
NumPoints.
Input
The Burst function has two input tables: input_table (required) and time_table (optional). If you omit
time_table, then you must specify either the TimeInterval or NumPoints argument.
Each row of input_table contains a time interval to be burst. The following table describes the input_table
columns that you can specify in function arguments.
Each row of time_table contains the start and end times of a burst interval. The following table describes the
columns of time_table.
Table 61: Burst time_table Schema
Output
Each row of the output table contains a burst interval. The following table describes the columns of the
output table. Columns copied from input_table appear in the same order in the output table.
Table 62: Burst Output Table Schema
Examples
• Example 1: Time_Interval Argument
• Example 2: Time_Table Argument
Input
SQL-MapReduce Call
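The following call is a sketch for Example 1, which bursts by a fixed time interval; the finance_data column names (id, start_time, end_time) and the one-day interval value (86400 seconds) are illustrative assumptions:
SELECT * FROM Burst (
    ON finance_data PARTITION BY id
    TimeColumn ('start_time', 'end_time')
    TimeInterval ('86400')
);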
Output
Input
Use the table below and the following table from the Input section of Example 1:
• Burst Example 1 Input Table: finance_data
Table 66: Burst Example 2: time_table1
id burst_start burst_end
1 1967-06-30 1967-07-05
1 1967-07-05 1967-07-10
2 1967-06-30 1967-07-05
2 1967-07-05 1967-07-10
3 1967-06-30 1967-07-10
4 1967-06-30 1967-07-04
4 1967-07-04 1967-07-07
4 1967-07-07 1967-07-10
5 1967-06-30 1967-07-02
5 1967-07-02 1967-07-04
5 1967-07-04 1967-07-06
5 1967-07-06 1967-07-08
5 1967-07-08 1967-07-10
SQL-MapReduce Call
Output
Summary
Change-point detection functions detect the change points in a stochastic process or time series. These
functions take sorted time series data as input and output change points or data segments.
The change-point detection functions are:
• ChangePointDetection, for when the input data can be stored in memory
• RtChangePointDetection, for when the input data cannot be stored in memory or the application needs
real-time response
Background
In statistical analysis, change detection or change-point detection tries to identify the abrupt changes of a
stochastic process or time series.
Consider the following ordered time series data sequence:
y(t), t=1, 2, ..., n
where t is a time variable.
Change-point detection tries to find a segmented model M, given by the following equation:
Y = f1(t, w1) + e1(t), (1 < t <= τ1)
  = f2(t, w2) + e2(t), (τ1 < t <= τ2)
  ...
  = fk(t, wk) + ek(t), (τk-1 < t <= τk)
  = fk+1(t, wk+1) + ek+1(t), (τk < t <= n)
where:
• fi(t, wi) is the function (with its vector of parameters wi) that fits in segment i.
• Each τi is the change point between successive segments.
• Each ei(t) is an error term.
• n is the size of the data series and k is the number of change points.
Segmentation model selection aims to find the function fi(t, wi) that best approximates the data of each segment. Various model selection methods have been proposed. According to the literature, the most commonly used model selection method is the normal distribution.
Search method selection aims to find the change points from a global perspective.
If τ0 = 0 and τk+1 = n, one common method of identifying the change points is to minimize this value:
∑i=1…k+1 C(y(τi-1+1)…y(τi)) + βf(k)
where C is a cost function for a segment, measuring the difference between fi(t, wi) and the original data, and βf(k) is a penalty to guard against overfitting. The common choice is linear in the number of change points k; that is, βf(k) = βk. There are several information criteria to do the evaluation, such as the Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC).
For AIC, β=2p, where p is the number of additional parameters introduced by adding a change point.
For BIC (also called SBIC), β=plog(n).
Change-point detection methods are classified into two categories, based on speed of detection:
• Real-Time Change-Point Detection, for applications that require immediate responses (such as robot
control)
• Retrospective Change-Point Detection, for applications that require longer reaction periods
Taking normal distribution as an example, the change-point problem is to test the following null hypothesis:
H0: μ = μ1 = μ2 = … = μn and σ² = σ1² = σ2² = … = σn²
as opposed to the alternatives:
H1: μ1 = … = μk1 ≠ μk1+1 = … = μk2 ≠ … ≠ μkq+1 = … = μn
and
σ1² = … = σk1² ≠ σk1+1² = … = σk2² ≠ … ≠ σkq+1² = … = σn²
Binary segmentation performs the following test in each iteration:
H1: μ1 = … = μk1 ≠ μk1+1 = … = μn
From the preceding formulas, the binary segmentation algorithm computes max LogL1 by giving k different
values. Then, to check for a change point, the algorithm compares the difference between max LogL1 and
LogL0 to the penalty value.
If the algorithm detects a change point, it adds that change point to its list of candidate change points and
then splits the data into two parts. From the candidate change points that the algorithm finds in the two
parts, it selects the one with the minimum loss.
The algorithm repeats the preceding process until it finds all change points or reaches the maximum change-
point number.
ChangePointDetection
Summary
The ChangePointDetection function detects change points in a stochastic process or time series, using
retrospective change-point detection, implemented with these algorithms:
• Search algorithm: binary search
• Segmentation algorithm: normal distribution and linear regression
Use this function when the input data can be stored in memory and the application does not require a real-
time response. If the input data cannot be stored in memory, or the application requires a real-time
response, use the function RtChangePointDetection.
Usage
ChangePointDetection Syntax
Version 1.0
Arguments
Argument Category Description
ValueColumn Required Specifies the name of the input table column that contains the
time series data.
Accumulate Required Specifies the names of the input table columns to copy to the
output table.
Tip:
To identify change points in the output table, specify the
columns that appear in partition_exp and order_by_exp.
SearchMethod Optional Specifies the search method, binary segmentation. This is the
default and only possible value.
MaxChangeNum Optional Specifies the maximum number of change points to detect. The
default value is 10.
Penalty Optional Specifies the penalty function, which is used to avoid over-fitting.
Possible values are:
• 'BIC' (default)
• 'AIC'
• threshold, a DOUBLE PRECISION value
For BIC, the condition for the existence of a change point is:
ln(L1)−ln(L0) > (p1-p0)*ln(n)/2
For normal distribution and linear regression, the condition is:
(p1-p0)*ln(n)/2 = ln(n)
For AIC, the condition for the existence of a change point is:
ln(L1)−ln(L0) > p1-p0
For normal distribution and linear regression, the condition is:
p1-p0 = 2
For threshold, the specified value is compared to:
ln(L1)−ln(L0)
Input
The input table must contain the columns described in the following table. The function ignores any
additional columns, except those specified by the Accumulate argument, which it copies to the output table.
Table 69: Change-Point Detection Functions Input Table Schema
Output
The output table schema depends on the value of the OutputOption argument.
Table 70: Change-Point Detection Functions Output Table Schema for OutputOption ('CHANGEPOINT')
Table 71: Change-Point Detection Functions Output Table Schema for OutputOption ('VERBOSE')
Table 72: Change-Point Detection Functions Output Table Schema for OutputOption ('SEGMENT')
Examples
• Example 1: Two Series, Default Options
• Example 2: One Series, Default Options
• Example 3: One Series, VERBOSE Output
• Example 4: One Series, Penalty 10
• Example 5: One Series, SEGMENT Output, Penalty 10
• Example 6: One Series, Penalty 20, Linear Regression
Input
SQL-MapReduce Call
Output
Input
The input for ChangePointDetection examples 2 through 6 and all RtChangePointDetection examples is
represented by the following diagram. The input signal is like a clock signal whose values can represent a
cyclic recurrence of an event (for example, electric power consumption over certain periods, pulse rate,
and so on).
sid id val
1 1 10.8308
1 2 10.07182
1 3 10.30902
1 4 10.01128
1 5 10.83433
1 6 10.0189
1 7 10.8702
1 8 10.70688
1 9 10.72465
1 10 10.76334
1 11 100.9431
1 12 100.245
1 13 100.8667
1 14 100.0768
1 15 100.7646
1 16 100.0001
1 17 100.3316
1 18 100.8994
1 19 100.5965
1 20 100.1943
1 21 10.24228
1 22 10.78137
1 23 10.90752
1 24 10.02013
1 25 10.46117
1 26 10.08672
1 27 10.33539
1 28 10.0157
1 29 10.40867
1 30 10.17071
1 31 100.3789
1 32 100.2254
1 33 100.1049
1 34 100.9242
1 35 100.6543
1 36 100.5676
1 37 100.2341
1 38 100.9213
1 39 100.334
1 40 100.8727
SQL-MapReduce Call
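A sketch of the call, assuming the example input is stored in a table named cpt (a hypothetical name) with
the columns sid, id, and val shown above:
SELECT * FROM ChangePointDetection (
    ON cpt PARTITION BY sid ORDER BY id
    ValueColumn ('val')
    Accumulate ('sid', 'id')
) ORDER BY sid, id;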
Output
sid id cptid
1 8 1
1 11 2
1 21 3
1 31 4
1 34 5
Input
The input table for this example is the same input table that Example 2 uses (see the preceding Input
section).
SQL-MapReduce Call
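A sketch of the call, using the same hypothetical table name cpt and adding the OutputOption argument to
request VERBOSE output:
SELECT * FROM ChangePointDetection (
    ON cpt PARTITION BY sid ORDER BY id
    ValueColumn ('val')
    Accumulate ('sid', 'id')
    OutputOption ('VERBOSE')
) ORDER BY sid, id;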
Output
Input
The input table for this example is the same input table that Example 2 uses.
SQL-MapReduce Call
Output
sid id cptid
1 11 1
1 21 2
1 31 3
Input
The input table for this example is the same input table that Example 2 uses.
SQL-MapReduce Call
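A sketch of the call, using the hypothetical table name cpt, requesting SEGMENT output with a threshold
penalty of 10:
SELECT * FROM ChangePointDetection (
    ON cpt PARTITION BY sid ORDER BY id
    ValueColumn ('val')
    Accumulate ('sid', 'id')
    OutputOption ('SEGMENT')
    Penalty ('10')
) ORDER BY sid, id;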
Output
Input
The input table for this example is the same input table that Example 2 uses.
SQL-MapReduce Call
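A sketch of the call, using the hypothetical table name cpt. The argument name SegmentationMethod and
its value 'linear_regression' are assumptions; the argument table above does not list them:
SELECT * FROM ChangePointDetection (
    ON cpt PARTITION BY sid ORDER BY id
    ValueColumn ('val')
    Accumulate ('sid', 'id')
    Penalty ('20')
    SegmentationMethod ('linear_regression')
) ORDER BY sid, id;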
Output
sid id cptid
1 11 1
1 21 2
1 31 3
RtChangePointDetection
Summary
The RtChangePointDetection function detects change points in a stochastic process or time series, using
real-time change-point detection, implemented with these algorithms:
• Search algorithm: sliding window
• Segmentation algorithm: normal distribution
Use this function when the input data cannot be stored in memory, or when the application requires a real-
time response. If the input data can be stored in memory and the application does not require a real-time
response, use the function ChangePointDetection.
Usage
RtChangePointDetection Syntax
Version 1.0
Arguments
Argument Category Description
ValueColumn Required Specifies the name of the input table column that contains the time
series data.
Accumulate Required Specifies the names of the input table columns to copy to the output
table.
Tip:
To identify change points in the output table, specify the
columns that appear in partition_exp and order_by_exp.
SegmentationMethod Optional Specifies the segmentation method, normal distribution (in each
segment, the data is in a normal distribution). This is the default and
only possible value.
Input
Use the following table from the Input section of the ChangePointDetection function Usage section.
• Change-Point Detection Functions Input Table Schema
Output
Use the following table from the Output section of the ChangePointDetection function Usage section.
• Change-Point Detection Functions Output Table Schema for OutputOption ('CHANGEPOINT')
Examples
• Example 1: Threshold 10, Window Size 3, Default Output
• Example 2: Threshold 20, Window Size 3, VERBOSE Output
• Example 3: Threshold 100, Window Size 3, Default Output
Input
ChangePointDetection Example 2 Input Table
SQL-MapReduce Call
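A sketch of the call, assuming the hypothetical input table name cpt and assuming that the window size and
threshold of this example are passed through arguments named WindowSize and ChangePointThreshold
(both argument names are assumptions; they are not listed in the argument table above):
SELECT * FROM RtChangePointDetection (
    ON cpt PARTITION BY sid ORDER BY id
    ValueColumn ('val')
    Accumulate ('sid', 'id')
    WindowSize ('3')
    ChangePointThreshold ('10')
) ORDER BY sid, id;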
Output
sid id cptid
1 11 1
1 21 2
1 31 3
1 36 4
Input
ChangePointDetection Example 2 Input Table
SQL-MapReduce Call
Output
Input
ChangePointDetection Example 2 Input Table
SQL-MapReduce Call
Output
sid id cptid
1 11 1
1 21 2
1 31 3
Convergent Cross-Mapping
Convergent cross-mapping (CCM) is a method for evaluating whether one time series variable in a system
has a causal influence on another. Unlike the symmetric relationship of correlation, a causality relationship
detected by CCM can be unidirectional: while (A is correlated with B) always implies that (B is correlated
with A), the relationship found by CCM can simultaneously satisfy (A causes B) and (B does not cause A).
The intuition behind the CCM algorithm is that if variable A is a cause of variable B, then information about
time series A is reflected in time series B. Therefore, you can estimate A from B (this is the reverse of the
usual understanding of cause and effect). If the predictability of time series A improves with increasing
information from time series B, then A has a causal influence on B. This somewhat counter-intuitive
definition is described in more detail in the following references.
The mathematical justification for this approach depends on a result from dynamical systems theory,
Takens’ Theorem, which demonstrates that a complex dynamical system can be “embedded” into a low-
dimensional space. This approach is designed for short time series (less than 30 points) for which multiple
samples are available.
To test for causality, the CCM function:
1. Chooses a library of short time series from the effect variable.
2. Uses this library to predict values of the cause variable.
The function uses a k-nearest neighbors algorithm to predict the cause variable from the effect variable
and a bootstrapping process to estimate the uncertainty associated with the predicted values.
3. Uses this library to evaluate the goodness-of-fit of the predictions.
For numerical variables, the function determines goodness-of-fit using the correlation between the
predictions and the true values. For categorical variables, the function determines goodness-of-fit using
the Jaccard Index.
4. Repeats this procedure for libraries of different sizes.
CCMPrepare
The function CCMPrepare adds a new partition column, aster_ccm_id, and partitions the data for use with
the CCM function. CCM partitions the data automatically, but to ensure that the data is partitioned the
same way over multiple executions of the function, use CCMPrepare to create the input table for CCM.
Usage
CCMPrepare Syntax
Version 1.0
Arguments
Argument Category Description
InputTable Required Table containing the input data.
Input
The input table must contain id columns for the time series and the period within the time series (the
timepoints).
Table 84: CCMPrepare Input Table Schema
Output
Table 85: CCMPrepare Output Table Schema
Example
Input
The input table, ccmprepare_input, is a collection of nine time series consisting of 10 values for each of three
variables (expenditure, income, and investment).
Table 86: CCMPrepare Example Input Table ccmprepare_input
SQL-MapReduce Call
This call splits the input sequences into two partitions. The partition to which each sequence is assigned is
identified by the column aster_ccm_id in the output.
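One plausible form of the call, assuming the sequence id column of ccmprepare_input is named id (an
assumption), persists the partitioned result in a table for later CCM calls:
CREATE TABLE ccm_input DISTRIBUTE BY HASH(aster_ccm_id) AS
SELECT * FROM CCMPrepare (
    ON ccmprepare_input PARTITION BY id
);
Persisting the result guarantees that every subsequent CCM call sees the same aster_ccm_id assignments.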
Output
This query returns the following table:
CCM
Summary
The Teradata Aster CCM function allows the user to test multiple causes and effects simultaneously. The
function reports an effect size for each cause-effect pair.
Usage
CCM Syntax
Version 1.0
Arguments
Argument Category Description
InputTable Required Table containing the input data.
SequenceIdColumn Required Column containing the sequence ids. A sequence is a sample
of the time series.
TimeColumn Required Column containing the timestamps.
CauseColumns Required Columns to be evaluated as potential causes.
EffectColumns Required Columns to be evaluated as potential effects.
Input
The input table must contain id columns for the time series and the period within the time series (the
timepoints).
To ensure repeatability, the user has the option of creating a table using the CCMPrepare function. This
function adds a column, “aster_ccm_id”, to the input table so that the partitioning of data across workers is
guaranteed to be consistent over multiple function calls. If the InputTable contains an “aster_ccm_id”
column, the function assumes that data has been prepared using CCMPrepare. If it does not contain this
column, the function generates this partitioning column internally.
Note:
The Aster CCM function supports categorical variables as possible causes and effects. This feature is to be
considered experimental.
Output
Table 89: CCM Output Schema
Examples
• Example 1: Numeric Causes and Effects with Default Values
• Example 2: Mixed Categorical and Numeric Causes and Effects
Input
The table, CCMPrepare Example Output Table, from the output of CCMPrepare, is the input to the CCM
function. The example investigates income as a possible cause of expenditure and investment. Other than
EmbeddingDimension ('5'), the example uses default argument values.
SQL-MapReduce Call
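A sketch of the call, assuming the sequence id and time columns are named id and period (assumptions)
and that the CCMPrepare result is stored in a table named ccm_input. The ON (SELECT 1) PARTITION BY 1
clause is the usual placeholder for functions whose input is passed entirely through arguments:
SELECT * FROM CCM (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('ccm_input')
    SequenceIdColumn ('id')
    TimeColumn ('period')
    CauseColumns ('income')
    EffectColumns ('expenditure', 'investment')
    EmbeddingDimension ('5')
);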
Output
As the library_size increases, the correlation increases, which suggests a causal relationship. Intuition
confirms that expenditures and investment would be driven by income. Because the cause and effect
variables are numeric, there is no jaccard_index value.
Input
The input table, ccm_input2, contains data from three indices of stock market performance (COMP, DJIA,
NDX). The categorical variables (marketindex, indexdate) and the numerical variables (indexval,
indexchange) are time series values spread across two sequences (id). This example shows cause and effect
with one numerical and one categorical variable.
Table 92: CCM Example 2 Input Table ccm_input2
Output
For numeric variables, the correlation indicates the relationship between the values of the cause variable (as
predicted by the effect variable) and the true values of the cause variable. The example shows a steadily
increasing absolute value of the correlation between indexval and indexchange, and a high effect size (0.557).
There is no clear trend for the correlation between indexval and indexdate.
DTW
Summary
The DTW function performs dynamic time warping (DTW), which measures the similarity (warping
distance) between two time series that vary in time or speed. You can use DTW to analyze any data that can
be represented linearly—for example, video, audio, and graphics.
For example:
• In two videos, DTW can detect similarities in walking patterns, even if in one video the person is walking
slowly and in another, the same person is walking fast.
• In audio, DTW can detect similarities in different speech speeds (and is therefore very useful in speech
recognition applications).
Given an input table, a template table, and a mapping table, DTW compares each time series in the input
table to the corresponding time series in the template table. The correspondence is defined by the mapping
table.
For more information, see FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space.
Stan Salvador & Philip Chan. KDD Workshop on Mining Temporal and Sequential Data, pp. 70-80, 2004
(http://cs.fit.edu/~pkc/papers/tdm04.pdf)
Usage
DTW Syntax
Version 1.0
Arguments
Argument Category Description
InputColumns Required Specifies the names of the input_table columns that contain
the values and timestamps of the time series.
Note:
The InputColumns argument has the alternate name
Input_Table_Value_Column_Names.
Note:
If these columns contain NaN or infinity values, then use
a WHERE clause to remove them.
TemplateColumns Required Specifies the names of the template_table columns that
contain the values and timestamps of the time series.
Note:
The TemplateColumns argument has the alternate name
Template_Table_Value_Column_Names.
Note:
If these columns contain NaN or infinity values, then use
a WHERE clause to remove them.
TimeseriesID Required Specifies the names of the columns by which the input_table
is partitioned. These columns comprise the unique ID for a
time series in input_table.
TemplateID Required Specifies the names of the columns by which the
template_table is ordered. These columns comprise the
unique ID for a time series in template_table.
Radius Optional Specifies the integer radius that constrains the search for the
warp path around the path projected from the previous
(coarser) resolution. The default value is 10.
DistMethod Optional Specifies the metric for computing the warping distance.
The supported values of distance_metric, which are case-
sensitive, are
• 'EuclideanDistance' (default)
• 'ManhattanDistance'
• 'BinaryDistance'
These values are described further in the Background
section of the VectorDistance function.
Note that the DistMethod argument has the alternate name
Metric.
WarpPath Optional Determines whether to output the warping path. The default
value is 'false'.
Input
The DTW function requires three input tables: input_table, template_table, and mapping_table.
Note:
In mapping_table, DTW supports a single ID column in the input_table and template_table.
Note:
The names of the output table columns are case-sensitive. You must enclose them in double quotation
marks in SQL statements; for example:
Example
This example compares multiple time series to both a common template and each other. Each time series
represents stock prices and the template represents a series of stock index prices.
Input
Table 99: DTW Example Input Table timeseriesdata
timeseriesid templateid
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
4 1
4 2
4 3
SQL-MapReduce Call
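One plausible form of the multiple-input call. The input table name timeseriesdata comes from the table
caption above; the template and mapping table names and the value and timestamp column names are
assumptions:
SELECT * FROM DTW (
    ON timeseriesdata AS input_table PARTITION BY timeseriesid ORDER BY period
    ON templatedata AS template_table DIMENSION ORDER BY period
    ON mappingdata AS mapping_table PARTITION BY timeseriesid
    InputColumns ('stockprice', 'period')
    TemplateColumns ('indexprice', 'period')
    TimeseriesID ('timeseriesid')
    TemplateID ('templateid')
) ORDER BY timeseriesid, templateid;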
Output
Table 102: DTW Example Output Table
The warping distance is an unnormalized measure of how dissimilar two time series are. The warpDistance
column in the output table contains the warping distance for all pairs in the mapping table; that is, for every
timeseriesID and templateID number.
As the preceding figure shows, input 2 is more similar to templates 1 and 3 than to template 2. The warp
distances also show this: for templates 1 and 3, they are 131.588 and 106.131; for template 2, the warping
distance is about 540.
Because the dissimilarity of two time series is not based on whether they are temporally close (time is
stretched, so two time series that are offset by a constant time interval are effectively the same), input 3
is not very dissimilar to templates 1 and 3. However, input 4 has the largest warping distance from
templates 1 and 3, because the curvature of those two templates is far from that of input 4. Time stretching
brings input 4 closer to templates 1 and 3, but only with a longer warping path (not shown in the preceding
output) and therefore a larger warping distance.
DWT
Summary
The DWT function implements Mallat’s algorithm (an iterative algorithm in the Discrete Wavelet Transform
field) and applies the wavelet transform to multiple sequences simultaneously.
The input is typically a set of time series sequences. You specify the wavelet name or wavelet filter table,
transform level, and (optionally) extension mode. The function returns the transformed sequences in Hilbert
space with the corresponding component identifiers and indices. (The transformation is also called the
decomposition.)
Note:
The wavelet filter table does not appear in the preceding diagram because it is seldom used.
You can filter the result to reduce the lengths of the transformed sequences and then use the function IDWT
to reconstruct them; therefore, the DWT and IDWT functions are useful for compression and removing
noise.
Background
DWT is a time-frequency analysis tool for which the wavelets are discretely sampled. DWT is different from
the Fourier transform, which provides frequency information on the whole time domain. A key advantage of
DWT is that it provides frequency information at different time points.
Mallat’s algorithm can be described as a series of iterative steps. For example, for a 3-level wavelet transform:
1. Use the original time domain sequence S(n) as the input of level 1.
2. Convolve the input sequence with the high-pass filter h(n) and the low-pass filter g(n).
The two generated sequences are the detail coefficients Dk and the approximation coefficients Ak for
level k.
3. If the current level k is the maximum transform level n, then stop; otherwise, use Ak as the input sequence
for the next level (that is, increment k by 1 and go to step 2).
Usage
DWT Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the sequences
to be transformed.
OutputTable Required Specifies the name for the table that the function creates to store
the coefficients generated by the wavelet transform. This table
must not exist.
MetaTable Required Specifies the name for the table that the function creates to store
the meta information for the wavelet transform. This table must
not exist.
InputColumns Required Specifies the names of the columns in the input table or view that
contain the data to be transformed. These columns must contain
numeric values between -1e308 and 1e308. The function treats
NULL as 0.
SortColumn Required Specifies the name of the column that defines the order of samples
in the sequences to be transformed. In a time series sequence, the
column can consist of timestamp values.
Note:
If sort_column has duplicate elements in a sequence (that is, in
a partition), then sequence order can vary, and the function
can produce different transform results for the sequence.
PartitionColumns Optional Specifies the names of the partition columns, which identify the
sequences. Rows with the same partition column values belong to
the same sequence. If you specify multiple partition columns, then
the function treats the first one as the distribute key of the output
and meta tables.
By default, all rows belong to one sequence, and the function
generates a distribute key column named dwt_idrandom_name in
both the output table and the meta table. In both tables, every cell
of dwt_idrandom_name has the value 1.
Wavelet Optional Specifies a wavelet filter name from the following table.
ExtensionMode Optional Specifies the extension mode, which determines how the function
extends a sequence beyond its boundaries before convolution.
For the examples in the following table, assume that the sequence before the extension is 1 2 3 4 and that the
convolution kernel in the wavelet filter has length 6, which means that the sequence is to be extended by 5
positions before and after itself.
Table 104: Supported Extension Modes
The following rows describe the contents of the wavelet filter table, which you can provide instead of
specifying the Wavelet argument.
filtername filtervalue
lowpassfilter Decomposed low-pass filter, represented as a comma-separated sequence; the conjugated
scale coefficients for the orthogonal wavelet. For example:
-0.1294095225512604, 0.2241438680420134, 0.8365163037378081, 0.4829629131445342
highpassfilter Decomposed high-pass filter, represented as a comma-separated sequence; the
conjugated wavelet coefficients for the orthogonal wavelet. For example:
-0.4829629131445342, 0.8365163037378081, -0.2241438680420134,
-0.1294095225512604
ilowpassfilter Reconstructed low-pass filter, represented as a comma-separated sequence; the scale
coefficients for the orthogonal wavelet. For example:
0.4829629131445342, 0.8365163037378081, 0.2241438680420134, -0.1294095225512604
ihighpassfilter Reconstructed high-pass filter, represented as a comma-separated sequence; the wavelet
coefficients for the orthogonal wavelet. For example:
-0.1294095225512604, -0.2241438680420134, 0.8365163037378081,
-0.4829629131445342
Output
The DWT function outputs a message that indicates whether the function succeeded, an output table of
transformed (decomposed) sequences, and a meta table of wavelet-related information.
Table 108: DWT Output Message
The following table summarizes the information that the meta table contains for each sequence.
Table 111: DWT Meta Information for Each Sequence
meta content
blocklength Length of each component after transformation, from An to D1. For example: 8, 8, 13
length Length of the sequence before transformation. For example: 24
waveletname Name of the wavelet used in the transformation. For example: db2
lowpassfilter Low-pass filter used in the decomposition of the wavelet.
highpassfilter High-pass filter used in the decomposition of the wavelet.
ilowpassfilter Low-pass filter used in the reconstruction of the wavelet.
ihighpassfilter High-pass filter used in the reconstruction of the wavelet.
level Level of wavelet transform performed.
extensionmode Extension mode used in the wavelet transform.
Example
This example uses hourly climate data for five cities (Asheville, Greenville, Brownsville, Nashville and
Knoxville) on a given day. The data are temperature (in degrees Fahrenheit), pressure (in mbars), and
dewpoint (in degrees Fahrenheit). The function generates the coefficient model table and the meta table,
which are used as input to the function IDWT.
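SQL-MapReduce Call
A sketch of the call. The input table name climate_data and its column names are assumptions; the output
and meta table names match those referenced by the IDWT example, and the wavelet db2 matches the meta
table example below:
SELECT * FROM DWT (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('climate_data')
    OutputTable ('dwt_coef_table')
    MetaTable ('dwt_meta_table')
    InputColumns ('temp_f', 'pressure_mbar', 'dewpoint_f')
    SortColumn ('period')
    PartitionColumns ('city')
    Wavelet ('db2')
);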
Output
Table 113: DWT Example Output Message
messages
Dwt finished successfully!
The queries below return the contents of the coefficient (output) table and the meta table, respectively:
DWT2D
Summary
The DWT2D function implements Mallat’s algorithm (an iterative algorithm in the Discrete Wavelet
Transform field) on 2-dimensional matrices and applies the wavelet transform to multiple sequences
simultaneously.
The input is a set of sequences. Typically, each sequence is a matrix that contains a position in 2-dimensional
space (y and x indexes or coordinates) and its corresponding values. You specify the wavelet name or
wavelet filter table, transform level, and (optionally) extension mode. The function returns the transformed
sequences in Hilbert space with the corresponding component identifiers and indices. (The transformation
is also called the decomposition.)
Note:
The wavelet filter table does not appear in the preceding diagram because it is seldom used.
Background
DWT is a time-frequency analysis tool for which the wavelets are discretely sampled. DWT is different from
the Fourier transform, which provides frequency information on the whole time domain. A key advantage of
DWT is that it provides frequency information at different time points.
Mallat’s algorithm for 2-dimensional input can be described as a series of iterative steps:
1. Use the original time domain sequence (2-dimensional matrix) as the input of level 1.
2. Convolve each row of the input matrix with the high-pass filter h(n) and the low-pass filter g(n).
3. Downsample each convolved row by column, generating two matrices.
4. Convolve each column of each generated matrix with the high-pass filter h(n) and the low-pass filter g(n).
5. Downsample each convolved column by row, generating two more matrices.
The four generated matrices contain the approximation coefficients Ak, horizontal detail coefficients Hk,
vertical detail coefficients Vk, and diagonal detail coefficients Dk, respectively, for level k. The following
figure shows the process.
6. If the current level k is the maximum transform level n, then stop; otherwise, use Ak as the input matrix for
the next level (that is, increment k by 1 and go to step 2).
Usage
DWT2D Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the
sequences to be transformed.
OutputTable Required Specifies the name for the table that the function creates to
store the coefficients generated by the wavelet transform.
This table must not exist.
MetaTable Required Specifies the name for the table that the function creates to
store the meta information generated by the wavelet
transform. This table must not exist.
InputColumns Required Specifies the names of the columns in the input table or view
that contain the data to be transformed. These columns
must contain numeric values between -1e308 and 1e308. The
function treats NULL as 0.
PartitionColumns Optional Specifies the names of the partition columns, which identify
the sequences. Rows with the same partition column values
belong to the same sequence. If you specify multiple
partition columns, then the function treats the first one as
the distribute key of the output and meta tables.
By default, all rows belong to one sequence, and the function
generates a distribute key column named
dwt_idrandom_name in both the output table and the meta
table. In both tables, every cell of dwt_idrandom_name has
the value 1.
IndexColumns Required Specifies the columns that contain the indexes of the input
sequences. For a matrix, indexy_column contains the y
coordinates and indexx_column contains the x coordinates.
Range Optional Specifies the start and end indexes of the input data, all of
which must be integers. The default values for each sequence
are:
• starty: minimum y index
• startx: minimum x index
• endy: maximum y index
• endx: maximum x index
Input
The DWT2D function has a required input table (or view) and an optional wavelet filter table. If you omit
the wavelet filter table, then you must specify the Wavelet argument.
The input table can contain at most 1594 columns. The function assumes that each sequence can be fitted
into the memory of the worker. The following table describes the input table columns that you can or must
specify with arguments. The input table can have additional columns, but the function ignores them.
The table below shows the schema of the wavelet filter table.
Table 116: DWT2D Input Table Schema
Output
The DWT2D function outputs a message that indicates whether the function succeeded, an output table of
transformed (decomposed) sequences, and a meta table of wavelet-related information.
Table 117: DWT2D Output Message
The following table summarizes the information that the meta table contains for each sequence.
Table 120: DWT2D Meta Information for Each Sequence
meta content
blocklength Pairs that represent the length of each block of coefficients. The format is (row_number,
column_number). For example: (5, 5), (5, 5), (5, 6)
length Pair that represents the length of the original sequence in each dimension. The format is
(row_number, column_number). For example: (5, 8)
range Minimum and maximum indexes of the original sequence. The format is (min_y_index,
min_x_index), (max_y_index, max_x_index). For example: (1, 1), (5, 8)
lowpassfilter Low-pass filter coefficients used in the decomposition of the wavelet.
highpassfilter High-pass filter coefficients used in the decomposition of the wavelet.
ilowpassfilter Low-pass filter coefficients used in the reconstruction of the wavelet.
ihighpassfilter High-pass filter coefficients used in the reconstruction of the wavelet.
level Level of wavelet transform performed.
extensionmode Extension mode used in the wavelet transform.
Input
Table 121: DWT2D Example Input Table twod_climate_data
SQL-MapReduce Call
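A sketch of the call. The input table name twod_climate_data comes from the caption above; the value
column names, index column names, partition column, meta table name, and wavelet are assumptions, and
the output table name follows the IDWT2D example's reference to dwt_coef_table:
SELECT * FROM DWT2D (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('twod_climate_data')
    OutputTable ('dwt_coef_table')
    MetaTable ('dwt_meta_table')
    InputColumns ('temp_f', 'pressure_mbar', 'dewpoint_f')
    IndexColumns ('latitude', 'longitude')
    PartitionColumns ('state')
    Wavelet ('db2')
);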
Output
Table 122: DWT2D Example Output Message
messages
Dwt2D finished successfully!
The query below returns the output shown in the following table:
FrequentPaths
Summary
The FrequentPaths function takes a table of sequences and outputs a table of subsequences (patterns) that
frequently appear in the input table and, optionally, a table of sequence-pattern pairs.
The function is useful for analyzing customer purchase behavior, web access patterns, disease treatments,
and DNA sequences.
Background
In a sequential pattern mining application, each sequence is an ordered list of item sets, and each item set
contains at least one item. Items within a set are unordered.
Usage
FrequentPaths Syntax
Version 2.1
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Definition
InputTable Required Specifies the name of the table that contains the input sequences. Each
row is one item in a sequence. If input_table does not include a schema,
the function searches for it in the user’s search path. The function
ignores rows that contain any NULL values.
OutputTable Required Specifies the name of the table where the function outputs the
subsequences.
PartitionColumns Required Specifies the names of the columns that comprise the partition key of the
input sequences.
TimeColumn Optional* Specifies the name of the input table column that determines the order of
items in a sequence. Items in the same sequence that have the same time
stamp belong to the same set.
*Required when ItemColumn or ItemDefinition is specified.
PathFilters Optional Specifies the filters to use on the input table sequences. Only input table
sequences that satisfy all constraints of at least one filter are input to the
function.
Each filter has one or more constraints, which are separated by spaces.
Each constraint has this syntax:
constraint_type (item [, ...])
These are the constraint types:
• STW (starting_with_constraint)
The first item set of the sequence must contain at least one specified item. For example, STW(a,b)
requires the first item set of the sequence to contain a or b.
• EDW (ending_with_constraint)
The last item set of the sequence must contain at least one specified item. For example, EDW(f,g)
requires the last item set of the sequence to contain f or g. Sequence “(a, b), e, (f, d)” meets this
constraint because the last item set, (f,d), contains f.
• CTN (containing_constraint)
The sequence must contain at least one specified item. For example, CTN(a,b)
requires the sequence to contain a or b. The sequence “(a,c), d, (e,f)”
meets this constraint but the sequence “d, (e,f)” does not.
Constraints in the same filter must be different. For example:
• This is valid:
'STW(c,d) STW(e,h)'
GroupByColumns Optional Specifies the names of the input table columns by which to group the
input table sequences. If you specify this argument, then the function
operates on each group separately and copies each group_by_column to
the output table.
SeqPatternTable Optional Specifies the name of the table where the function outputs sequence-
pattern pairs. For example, if a sequence has a partition value of "1" and
contains 3 patterns with IDs 2, 9, and 10, then for that sequence the
function outputs the sequence-pattern pairs ("1", 2), ("1", 9), and ("1",
10).
If sequence_pattern_table does not include a schema, the function creates
it in the first schema in the user’s search path.
If the function finds no sequence-pattern pairs, then it does not create
sequence_pattern_table.
ItemColumn Optional* Specifies the names of the input table columns that contain the items.
*Required if you specify neither ItemDefinition nor PathColumn.
ItemDefinition Optional* Specifies the name of the item definition table and the names of its index,
definition, and item columns, using the syntax
ItemDefinition (id_def_table:[id:def:item]). If item_definition_table does
not include a schema, the function searches for it in the schemas in the
user’s search path.
*Required if you specify neither ItemColumn nor PathColumn.
PathColumn Optional* Specifies the name of the input table column that contains paths in the
form of sequence strings. A sequence string has this syntax:
'[item [, ...]]'
In the sequence string syntax, you must type the outer brackets.
The sequence strings in this column can be generated by the nPath
function.
If you specify this argument, then each item set can have only one item.
*Required if you specify neither ItemColumn nor ItemDefinition.
MinSupport Required Determines the threshold for whether a sequential pattern is frequent.
The minimum must be a positive real number.
If minimum is in the range (0,1], then it is a relative threshold: If N is the
total number of input sequences, then the threshold is T=N*minimum.
For example, if there are 1000 sequences in the input table and minimum
is 0.05, then the threshold is 50.
If minimum is in the range (1, +∞), then it is an absolute threshold:
regardless of N, T=minimum. For example, if minimum is 50, then the
threshold is 50, regardless of N.
A pattern is frequent if its support value is at least T.
Because the function outputs only frequent patterns, minimum controls
the number of output patterns. If minimum is small, processing time
increases exponentially; therefore, Teradata recommends starting the
trial with a larger value (for example, 5% of the total sequence number
if you know N, and 0.05 otherwise).
If you specify a relative minimum and GroupByColumns, then the
function calculates N and T for each group.
If you specify a relative minimum and PathFilters, then N is the number
of sequences that meet the constraints of the filters.
MaxLength Optional Specifies the maximum length of the output sequential patterns. The
length of a pattern is its number of sets. By default, there is no maximum
length.
MinLength Optional Specifies the minimum length of the output sequential patterns. The
default value is 1.
ClosedPattern Optional Specifies whether to output only closed patterns. The default value is
'false'.
Input
The FrequentPaths function requires an input table, which contains the sequence data to process. The input
can be in either of these formats:
• Item format: Each row contains one item of a sequence. Specify the item with the ItemColumn or
ItemDefinition argument and the ordering within a sequence with the TimeColumn argument.
• Sequence/path format: Each row contains an entire sequence, represented as a sequence string in the
column that the PathColumn argument specifies.
Output
The FrequentPaths function outputs an output message and output table and, optionally, a sequence pattern
table.
Examples
These examples apply the FrequentPaths function to browsing sequences of different users on a banking
website.
Input
The input table contains web clickstream data from a set of users with multiple sessions or sequences.
Table 130: FrequentPaths Example 1 Input Table bank_web_clicks1
SQL-MapReduce Call
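A sketch of the call, assuming the item and time columns of bank_web_clicks1 are named page and
datestamp and the sequences are partitioned by session_id (all column names are assumptions); the
MinSupport value and output table name are also assumptions:
SELECT * FROM FrequentPaths (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('bank_web_clicks1')
    OutputTable ('fp_output1')
    PartitionColumns ('session_id')
    TimeColumn ('datestamp')
    ItemColumn ('page')
    MinSupport ('2')
);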
Output
message
Finished. Totally 69 patterns were found.
Input
The input table, bank_web_url, contains the URL of each page browsed by the customer. The definitions of
the browser pages, which can be specified by the ItemDefinition argument, are in the table ref_url (which
follows bank_web_url).
Table 133: FrequentPaths Example 2 Input Table bank_web_url
SQL-MapReduce Call
Output
message
Finished. Totally 69 patterns were found.
Input
SQL-MapReduce Call
Output
message
Finished. Totally 213 patterns were found.
This query returns the contents of the following table (row order can vary):
Input
This example uses the same input table, FrequentPaths Example 1 Input Table bank_web_clicks1, as was
used in Example 1.
SQL-MapReduce Call
Output
message
Finished. Totally 69 patterns were found.
This query returns the contents of the following table (row order can vary):
session_id pattern
0 ACCOUNT SUMMARY;FAQ
0 ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FAQ;ACCOUNT HISTORY
0 ACCOUNT SUMMARY;ACCOUNT HISTORY
0 ACCOUNT SUMMARY;FAQ;ACCOUNT HISTORY;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;ACCOUNT HISTORY;FUNDS TRANSFER
0 ACCOUNT SUMMARY;FAQ;ACCOUNT HISTORY;VIEW DEPOSIT DETAILS
0 ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER
0 ACCOUNT SUMMARY;ACCOUNT HISTORY;VIEW DEPOSIT DETAILS
0 ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FUNDS TRANSFER
0 ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER;VIEW DEPOSIT DETAILS
0 ACCOUNT SUMMARY;FUNDS TRANSFER;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FAQ;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FUNDS TRANSFER;VIEW DEPOSIT DETAILS
0 ACCOUNT SUMMARY;FAQ;ACCOUNT SUMMARY;VIEW DEPOSIT DETAILS
... ...
Input
FrequentPaths Example 1 Input Table bank_web_clicks1 (see Input).
SQL-MapReduce Call
Output
message
Finished. Totally 15 patterns were found.
Input
FrequentPaths Example 1 Input Table bank_web_clicks1 (see Input).
SQL-MapReduce Call
Output
message
Finished. Totally 26 patterns were found.
Input
The following table is the input table for the nPath function, which the example uses to create the input table
for the FrequentPaths function.
Table 146: FrequentPaths Example 7 nPath Input Table sequence_table
id datestamp item
1 2004-03-17 16:35:00 A
1 2004-03-17 16:38:00 B
1 2004-03-17 16:42:00 C
2 2004-03-18 01:16:00 B
2 2004-03-18 01:18:00 C
2 2004-03-18 01:20:00 D
3 2004-03-19 08:33:00 A
3 2004-03-19 08:36:00 D
3 2004-03-19 08:38:00 C
The nPath function generates the following paths, which become the sequence/path-format input to
FrequentPaths:
id path
1 [A, B, C]
3 [A, D, C]
SQL-MapReduce Call
The FrequentPaths function outputs the sequences that start with “A” and end with “C”.
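A sketch of the call on the nPath result, assuming that result is stored in a table named npath_output with
the columns id and path shown above (the table and output names are assumptions, as is the MinSupport
value):
SELECT * FROM FrequentPaths (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('npath_output')
    OutputTable ('fp_output7')
    PartitionColumns ('id')
    PathColumn ('path')
    MinSupport ('2')
);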
Output
message
Finished. Totally 3 patterns were found.
IDWT
Summary
The IDWT function is the inverse of DWT; that is, IDWT applies inverse wavelet transforms on multiple
sequences simultaneously. IDWT takes as input the output table and meta table generated by DWT and
outputs the sequences in time domain. (Because the IDWT output is comparable to the DWT input, the
inverse transformation is also called the reconstruction.)
Usage
IDWT Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the input table or view that contains the coefficients
generated by DWT. Typically, this table is the output table of DWT.
MetaTable Required Specifies the name of the input table or view that contains the meta
information used in DWT. Typically, this table is the meta table output by
DWT.
OutputTable Required Specifies the name for the table that the function creates to store the
reconstructed result. This table must not exist.
InputColumns Required Specifies the names of the columns in the input table or view that contain
the data to be transformed. These columns must contain numeric values
between -1e308 and 1e308. The function treats NULL as 0.
SortColumn Required Specifies the name of the input column that represents the order of
coefficients in each sequence (the waveletid column in the DWT output
table). The column must contain a sequence of integer values that start
from 1 for each sequence. If a value is missing from the sequence, then the
function treats the corresponding data column as 0.
PartitionColumns Optional Specifies the names of the partition columns, which identify the sequences.
Rows with the same partition column values belong to the same sequence.
If you specify multiple partition columns, then the function treats the first
one as the distribute key of the output and meta tables.
By default, all rows belong to one sequence, and the function generates a
distribute key column named dwt_idrandom_name in both the output
table and the meta table. In both tables, every cell of dwt_idrandom_name
has the value 1.
Input
The IDWT function requires a data table and a meta table. The data table has the same schema as DWT
Output Table Schema, and the meta table has the same schema as DWT Meta Table Schema.
Output
The IDWT function outputs a message that indicates whether the function succeeded and an output table of
transformed (reconstructed) sequences.
Example
This example uses hourly climate data for five cities (Asheville, Greenville, Brownsville, Nashville, and
Knoxville) on a given day. The data are temperature (in degrees Fahrenheit), pressure (in mbars), and dew
point (in degrees Fahrenheit).
Input
The input tables for this example are the output tables from the DWT function example:
• DWT Example Output Table dwt_coef_table
• DWT Example Meta Table dwt_meta_table
This example reconstructs the input to the DWT function example.
Table 152: IDWT Example Input Table dwt_coef_table
SQL-MapReduce Call
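A sketch of the call. The input, meta, and output table names come from this example's description; the
value column names and the partition column are assumptions carried over from the DWT sketch:
SELECT * FROM IDWT (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('dwt_coef_table')
    MetaTable ('dwt_meta_table')
    OutputTable ('climate_reconstruct')
    InputColumns ('temp_f', 'pressure_mbar', 'dewpoint_f')
    SortColumn ('waveletid')
    PartitionColumns ('city')
);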
Output
Table 154: IDWT Example Output Message
messages
IDwt finished successfully!
The query below returns the output shown in the following table:
The output table is the same as the input table to the DWT function (Input). The original values for the
temperature, pressure, and dew point are reconstructed.
Table 155: IDWT Example Output Table climate_reconstruct
IDWT2D
Summary
The IDWT2D function is the inverse of DWT2D; that is, IDWT2D applies inverse wavelet transforms on
multiple sequences simultaneously. IDWT2D takes as input the output table and meta table generated by
DWT2D and outputs the sequences as 2-dimensional matrices. (Because the IDWT2D output is comparable
to the DWT2D input, the inverse transformation is also called the reconstruction.)
Usage
IDWT2D Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the input table or view that contains the
coefficients generated by DWT2D. Typically, this table is the output
table of DWT2D.
MetaTable Required Specifies the name of the input table or view that contains the meta
information used in DWT2D. Typically, this table is the meta table
output by DWT2D.
OutputTable Required Specifies the name for the table that the function creates to store the
reconstructed result. This table must not exist.
InputColumns Required Specifies the names of the columns in the input table or view that
contain the data to be transformed. These columns must contain
numeric values between -1e308 and 1e308. The function treats NULL
as 0.
SortColumn Required Specifies the name of the input column that represents the order of
coefficients in each sequence (the waveletid column in the DWT2D
output table). The column must contain a sequence of integer values
that start from 1 for each sequence. If a value is missing from the
sequence, then the function treats the corresponding data column as 0.
PartitionColumns Optional Specifies the names of the partition columns, which identify the
sequences. Rows with the same partition column values belong to the
same sequence. If you specify multiple partition columns, then the
function treats the first one as the distribute key of the output and
meta tables.
By default, all rows belong to one sequence, and the function generates
a distribute key column named dwt_idrandom_name in both the
output table and the meta table. In both tables, every cell of
dwt_idrandom_name has the value 1.
VerboseFlag Optional Specifies whether to ignore (not output) rows in which all coefficient
values are very small (having an absolute value less than 1e-12). The
default value is 'true'. For a sparse input matrix, ignoring such rows
reduces the output table size.
Output
The IDWT2D function outputs a message that indicates whether the function succeeded and an output table
of transformed (reconstructed) matrices.
Table 156: IDWT2D Output Message
Example
This example uses climate data for many cities in the states of California (CA), Texas (TX), and Washington
(WA). The cities are represented by two-dimensional coordinates (latitude and longitude). The data are
temperature (in degrees Fahrenheit), pressure (in mbars), and dew point (in degrees Fahrenheit).
Input
The input tables for this example are the output tables from the DWT2D function example:
• DWT2D Example Output Table dwt_coef_table
• DWT2D Example Meta Table dwt_meta_table
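SQL-MapReduce Call
A sketch of the call. The output table name climate2d_reconstruct comes from the note below; the value
column names and the partition column are assumptions carried over from the DWT2D sketch:
SELECT * FROM IDWT2D (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('dwt_coef_table')
    MetaTable ('dwt_meta_table')
    OutputTable ('climate2d_reconstruct')
    InputColumns ('temp_f', 'pressure_mbar', 'dewpoint_f')
    SortColumn ('waveletid')
    PartitionColumns ('state')
);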
Output
Table 160: IDWT2D Example Output Message
messages
IDwt2D finished successfully!
The query below returns the output shown in the following table:
Note:
VerboseFlag is 'true' by default; therefore, rows in which all coefficient values are very small do not
appear in climate2d_reconstruct.
Interpolator
Summary
The Interpolator function calculates missing values in a time series, using either interpolation or
aggregation. Interpolation estimates missing values between known values. Aggregation combines known
values to produce an aggregate value.
The time intervals between calculated values can either be the same length (specified by the TimeInterval
argument) or have specific start and end times (specified by time_table). The choice of TimeInterval or
time_table affects the behavior of interpolation, but not aggregation.
Usage
Interpolator Syntax
Version 1.0
Arguments
Argument Category Description
TimeColumn Required Specifies the name of the input_table column that contains the time
points of the time series whose missing values are to be calculated.
TimeInterval Optional Specifies the length of time, in seconds, between calculated values. This
value must be either INTEGER or DOUBLE PRECISION.
Note:
Specify exactly one of time_table or TimeInterval.
ValueColumns Required Specifies the names of input_table columns to interpolate to the output
table.
TimeDataType Optional Specifies the data type of the output column that corresponds to the
input table column that TimeColumn specifies (time_column).
If you omit this argument, then the function infers the data type of
time_column from the input table and uses the inferred data type for
the corresponding output table column.
If you specify this argument, then the function can transform the input
data to the specified output data type only if both the input column
data type and the specified output column data type are in this list:
• INTEGER
• BIGINT
• SMALLINT
• DOUBLE PRECISION
• DECIMAL(n,n)
• DECIMAL
• NUMERIC
• NUMERIC(n,n)
ValueDataType Optional Specifies the data types of the output columns that correspond to the
input table columns that ValueColumns specifies.
If you omit this argument, then the function infers the data type of
each value_column from the input table and uses the inferred data type
for the corresponding output table column.
If you specify ValueDataType, then it must be the same size as
ValueColumns. That is, if ValueColumns specifies n columns, then
ValueDataType must specify n data types. For i in [1, n],
value_column_i has value_data_type_i.
If you specify this argument, then the function can transform the input
data to the specified output data type only if both the input column
data type and the specified output column data type are in this list:
• INTEGER
• BIGINT
• SMALLINT
• DOUBLE PRECISION
• DECIMAL(n,n)
• DECIMAL
• NUMERIC
• NUMERIC(n,n)
InterpolationType Optional Specifies interpolation types for the columns that ValueColumns
specifies.
If you specify InterpolationType, then it must be the same size as
ValueColumns. That is, if ValueColumns specifies n columns, then
InterpolationType must specify n interpolation types. For i in [1, n],
value_column_i has interpolation_type_i. However,
interpolation_type_i can be empty; for example:
Note:
In interpolation_type syntax, brackets do not indicate optional
elements—you must include them.
Note:
Specify only one of InterpolationType or AggregationType. If you
omit both arguments, the function uses InterpolationType with its
default value, 'linear'.
AggregationType Optional Specifies the aggregation types of the columns that ValueColumns
specifies.
If you specify AggregationType, then it must be the same size as
ValueColumns. That is, if ValueColumns specifies n columns, then
AggregationType must specify n aggregation types. For i in [1, n],
value_column_i has aggregation_type_i. However, aggregation_type_i
can be empty; for example:
Note:
In aggregation_type syntax, brackets do not indicate optional
elements—you must include them.
Note:
Specify only one of AggregationType or InterpolationType. If you
omit both arguments, the function uses InterpolationType with its
default value, 'linear'.
StartTime Optional Specifies the start time for the time series. The default value is the start
time of the time series in input_table.
EndTime Optional Specifies the end time for the time series. The default value is the end
time of the time series in input_table.
ValuesBeforeFirst Optional Specifies the values to use if start_time is before the start time of the
time series in input_table. Each of these values must have the same
data type as its corresponding value_column. Values of data type
VARCHAR are case-insensitive.
If ValueColumns specifies n columns, then ValuesBeforeFirst must
specify n values. For i in [1, n], value_column_i has the value
before_first_value_i. However, before_first_value_i can be empty; for
example:
Accumulate Optional Specifies the names of input_table columns (other than those specified
by TimeColumn and ValueColumns) to copy to the output table. By
default, the function copies to the output table only the columns
specified by TimeColumn and ValueColumns.
Note:
For data types CHARACTER,
CHARACTER(n), CHARACTER
VARYING, CHARACTER
VARYING(n), and VARCHAR, the
only supported interpolation type is
'constant'.
The count_row_number table contains information about the shorter time intervals into which the original
time series (in input_table) has been split. Each row represents one shorter time interval. The following table
describes the count_row_number table.
Table 164: Interpolator count_row_number Table Schema
Output
The Interpolator function has one output table, which contains the time series with the values that the
function calculated. Each row contains one time point in the series and one or more values. The following
table describes the output table. Columns copied from input_table appear in the same order in the output
table.
Table 165: Interpolator Output Table Schema
Note:
For data types CHARACTER,
CHARACTER(n), CHARACTER
VARYING, CHARACTER
VARYING(n), and VARCHAR, the
only supported interpolation type is
'constant'.
Examples
• Example 1: Aggregation
• Example 2: Constant Interpolation
• Example 3: Linear Interpolation
• Example 4: Median Interpolation
• Example 5: Spline Interpolation
• Example 6: Loess Interpolation
Input
The input table contains the daily IBM stock prices from 1961 to 1962, excluding weekends and holidays.
The examples use the Interpolator function to calculate hypothetical stock prices for the excluded days.
Table 166: Interpolator Examples Input Table ibm_stock1
The examples use the TimeInterval argument, but in any example, you can substitute the following table
for the TimeInterval argument and get the same result. Example 1: Aggregation includes equivalent SQL-
MapReduce calls.
Table 167: Interpolate Example 1 (Aggregation) Input Table time_table1
id period
1 1961-05-17 00:00:00
2 1961-05-18 00:00:00
3 1961-05-19 00:00:00
4 1961-05-20 00:00:00
5 1961-05-21 00:00:00
6 1961-05-22 00:00:00
7 1961-05-23 00:00:00
8 1961-05-24 00:00:00
9 1961-05-25 00:00:00
10 1961-05-26 00:00:00
11 1961-05-27 00:00:00
12 1961-05-28 00:00:00
13 1961-05-29 00:00:00
14 1961-05-30 00:00:00
15 1961-05-31 00:00:00
16 1961-06-01 00:00:00
17 1961-06-02 00:00:00
18 1961-06-03 00:00:00
19 1961-06-04 00:00:00
20 1961-06-05 00:00:00
... ...
Note:
The examples use the time interval 86,400 seconds, which is equivalent to one day.
Example 1: Aggregation
SQL-MapReduce Call
These two calls produce the same result.
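Both sketches below assume the column names id, period, and stockprice shown in the output; the
aggregation type 'min' is an assumption. The first call specifies the time grid with TimeInterval; the second
supplies time_table1 as a dimension input instead:
SELECT * FROM Interpolator (
    ON ibm_stock1 AS input_table PARTITION BY id ORDER BY period
    TimeColumn ('period')
    TimeInterval ('86400')
    ValueColumns ('stockprice')
    AggregationType ('min')
    Accumulate ('id')
) ORDER BY id, period;

SELECT * FROM Interpolator (
    ON ibm_stock1 AS input_table PARTITION BY id ORDER BY period
    ON time_table1 AS time_table DIMENSION ORDER BY period
    TimeColumn ('period')
    ValueColumns ('stockprice')
    AggregationType ('min')
    Accumulate ('id')
) ORDER BY id, period;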
Output
id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 452
1 1961-05-21 00:00:00 452
1 1961-05-22 00:00:00 452
1 1961-05-23 00:00:00 459
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 459
1 1961-05-26 00:00:00 463
1 1961-05-27 00:00:00 479
1 1961-05-28 00:00:00 479
1 1961-05-29 00:00:00 479
1 1961-05-30 00:00:00 490
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 490
1 1961-06-02 00:00:00 492
1 1961-06-03 00:00:00 498
1 1961-06-04 00:00:00 498
1 1961-06-05 00:00:00 498
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 478
1 1961-06-11 00:00:00 478
1 1961-06-12 00:00:00 478
... ... ...
Example 2: Constant Interpolation
SQL-MapReduce Call
Output
id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 452
1 1961-05-21 00:00:00 459
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 479
1 1961-05-28 00:00:00 493
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 493
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 498
1 1961-06-04 00:00:00 499
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 489
1 1961-06-11 00:00:00 478
1 1961-06-12 00:00:00 478
... ... ...
Example 3: Linear Interpolation
SQL-MapReduce Call
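A sketch of the call, with the same assumed column names, specifying linear interpolation explicitly (it is
also the default):
SELECT * FROM Interpolator (
    ON ibm_stock1 AS input_table PARTITION BY id ORDER BY period
    TimeColumn ('period')
    TimeInterval ('86400')
    ValueColumns ('stockprice')
    InterpolationType ('linear')
    Accumulate ('id')
) ORDER BY id, period;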
Output
id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 454.333
1 1961-05-21 00:00:00 456.667
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 483.667
1 1961-05-28 00:00:00 488.333
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 491.5
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 498.333
1 1961-06-04 00:00:00 498.667
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 485.333
1 1961-06-11 00:00:00 481.667
1 1961-06-12 00:00:00 478
... ... ...
Example 4: Median Interpolation
SQL-MapReduce Call
Output
id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 458
1 1961-05-21 00:00:00 458
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 484.5
1 1961-05-28 00:00:00 484.5
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 491
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 497.5
1 1961-06-04 00:00:00 497.5
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 488
1 1961-06-11 00:00:00 488
1 1961-06-12 00:00:00 478
... ... ...
Example 5: Spline Interpolation
SQL-MapReduce Call
Output
The algorithm did not converge, so the missing values are reported as not a number (NaN).
Table 172: Interpolate Example 5 (Spline Interpolation) Output Table
id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 NaN
1 1961-05-21 00:00:00 NaN
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 NaN
1 1961-05-28 00:00:00 NaN
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 NaN
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 NaN
1 1961-06-04 00:00:00 NaN
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 NaN
1 1961-06-11 00:00:00 NaN
1 1961-06-12 00:00:00 478
... ... ...
Example 6: Loess Interpolation
SQL-MapReduce Call
Output
id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 457
1 1961-05-21 00:00:00 457.5
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 473.5
1 1961-05-28 00:00:00 481.25
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 481.25
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 496.5
1 1961-06-04 00:00:00 496.5
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 488.25
1 1961-06-11 00:00:00 486
1 1961-06-12 00:00:00 478
... ... ...
Path Analysis Functions
Summary
The path analysis functions automate path analysis. They are useful for clickstream analysis of web site
traffic and other sequence/path analysis tasks, such as advertisement or referral attribution.
The function descriptions use these terms:
• Path: An ordered, start-to-finish series of actions, such as the page views of a user from the start to the
end of a session. For example, if the user visits page a, page b, and page c, in that order, the path is: a,b,c
• Sequence: A path in the format ^,path. The carat (^) indicates that a path follows. For example: ^,a,b,c
• Subsequence or prefix: For a given sequence, a possible subset of steps that start with the initial step. For
example, the subsequences for the path a,b,c are:
^,a
^,a,b
^,a,b,c
• Exit subsequence or prefix: A subsequence or prefix that is the same as its sequence, indicated by a final
dollar sign ($). For example: ^,a,b,c$
• Depth: The number of steps in a sequence or subsequence. For example, the immediately preceding
subsequences have depths 1, 2, and 3, respectively.
• Node: A single step on a path. For example, one web page that the user visits during the session.
• Parent: The path the user traveled to a given node. For example, the parent of c is ^,a,b.
• Child: A path the user traveled from a given node. For example, the children of ^,a are:
^,a,b
^,a,b,c
Path_Generator
Summary
The Path_Generator function takes a set of paths and outputs the sequence and all possible subsequences,
which can be input to the function Path_Summarizer.
Note:
For the definitions of the terms that this section uses, refer to Path Analysis Functions.
Path_Generator Syntax
Version 1.3
Arguments
Argument Category Description
SeqColumn Required Specifies the name of the input table column that contains the
paths.
Delimiter Optional Specifies the single-character delimiter that separates symbols in
the path string. The default value is comma (',').
Note:
Do not use any of the following characters as delimiter (they
cause the function to fail):
• Asterisk (*)
• Plus (+)
• Left parenthesis (()
• Right parenthesis ())
• Single quotation mark (')
• Escaped single quotation mark (\')
• Backslash (\)
Input
Table 174: Path_Generator Input Table Schema
Output
The output table has a row for each subsequence. The column containing the subsequence is named "prefix".
Table 175: Path_Generator Output Table Schema
Example
This example uses clickstream data from an e-commerce web site. The following table lists and describes the
symbols of the web site pages.
Table 176: Path_Generator Example E-Commerce Website Page Symbols
SQL-MapReduce Call
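A sketch of the call, assuming the clickstream paths are stored in a table named clicks1 with a path column
named path (both names are assumptions):
SELECT * FROM Path_Generator (
    ON clicks1
    SeqColumn ('path')
    Delimiter (',')
);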
Output
Table 178: Path Generator Example Output Table
Path_Summarizer
Summary
The Path_Summarizer function takes output of the function Path_Generator and returns, for each prefix in
the input table, the parent and children and number of times each of its subsequences was traveled. This
output can be input to the function Path_Start.
Note:
For the definitions of the terms that this section uses, refer to Path Analysis Functions.
Usage
Path_Summarizer Syntax
Version 1.2
Arguments
Argument Category Description
CountColumn Optional Specifies the name of the input table column that contains
the number of times a path was traveled. The default value is
1.
Delimiter Optional Specifies the single-character delimiter that separates symbols in
the path string. The default value is comma (',').
Note:
Do not use any of the following characters as delimiter
(they cause the function to fail):
• Asterisk (*)
• Plus (+)
• Left parenthesis (()
• Right parenthesis ())
• Single quotation mark (')
• Escaped single quotation mark (\')
• Backslash (\)
SeqColumn Required Specifies the name of the input table column that contains
the paths.
PartitionNames Required Lists the names of the columns that the PARTITION BY
clause specifies. The function uses these names for output
table columns. This argument and the PARTITION BY
clause must specify the same names in the same order.
Hash Optional Specifies whether to include the hash code of the node in the
output table. The default value is 'false'.
PrefixColumn Required Specifies the name of the input column that contains the
node prefixes.
Input
The input table has the same schema as the Path_Generator output table.
Output
Table 179: Path_Summarizer Output Table Schema
Example
Input
Path_Generator Example Output Table
SQL-MapReduce Call
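A sketch of such a call that pipes the Path_Generator output directly into Path_Summarizer; the sequence and count column names in the Path_Generator output are assumptions:
SELECT * FROM Path_Summarizer (
  ON Path_Generator (
    ON clickstream1
    SeqColumn ('path')
    Delimiter (',')
  )
  PARTITION BY prefix
  SeqColumn ('sequence')    -- assumed name of the sequence column in the Path_Generator output
  PrefixColumn ('prefix')
  PartitionNames ('prefix')
  Hash ('false')
);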
Output
Table 180: Path_Summarizer Example Output Table
Path_Start
Summary
The Path_Start function takes output of the function Path_Summarizer and returns, for each parent in the
input table, the parent and children and the number of times that each of its subsequences was traveled.
Note:
For the definitions of the terms that this section uses, refer to Path Analysis Functions.
Usage
Path_Start Syntax
Version 1.2
Arguments
Argument Category Description
Delimiter Optional Specifies the single-character delimiter that separates symbols in
the path string. The default value is comma (',').
Note:
Do not use any of the following characters as delimiter (they
cause the function to fail):
• Asterisk (*)
• Plus (+)
• Left parenthesis (()
• Right parenthesis ())
• Single quotation mark (')
• Escaped single quotation mark (\')
• Backslash (\)
ParentColumn Required Specifies the name of the input table column that contains the
parent nodes. The PARTITION BY clause in the function call
must include this column.
PartitionNames Required Lists the names of the columns that the PARTITION BY clause
specifies. The function uses these names for output table columns.
This argument and the PARTITION BY clause must specify the
same names in the same order. One partition_column must be
parent_column.
NodeColumn Required Specifies the name of the input table column that contains the
nodes.
Input
The input table has the same schema as the Path_Summarizer output table (refer to Output in
Path_Summarizer).
Output
The output table has a row for each node.
Table 181: Path_Start Output Table Schema
Example
Input
The input table for this example is the output table, Path_Summarizer Example Output Table from the
Example of the function: Path_Summarizer.
SQL-MapReduce Call
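A sketch of such a call, assuming that the Path_Summarizer output was saved as a table path_summary with columns node, parent, and cnt (all assumed names):
SELECT * FROM Path_Start (
  ON path_summary
  PARTITION BY parent
  NodeColumn ('node')
  ParentColumn ('parent')
  CountColumn ('cnt')
  PartitionNames ('parent')
);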
Path_Analyzer
Summary
The Path_Analyzer function:
1. Inputs a set of paths to the function Path_Generator.
2. Inputs the Path_Generator output to the function Path_Summarizer.
3. Inputs the Path_Summarizer output to the function Path_Start, which outputs, for each parent, all
children and the number of times that the user traveled each child.
Note:
For the definitions of the terms that this section uses, refer to Path Analysis Functions.
Usage
Path_Analyzer Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies either the name of the input table or view or an nPath
query whose result is the input table. The input table contains the
paths to analyze. Each path is a string of alphanumeric symbols
that represents an ordered sequence of page views (or actions).
Typically each symbol is a code that represents a unique page
view.
If you specify an nPath query, it must select both the column that
contains the paths (sequence_column) and the column that
contains the number of times a path was traveled (count_column).
It must also specify sequence_column in the GROUP BY clause, so
that the input table has one row for each unique path traveled on a
web site.
OutputTable Required Specifies the name of the output table.
SeqColumn Required Specifies the name of the input table column that contains the
paths.
CountColumn Required Specifies the name of the input table column that contains the
number of times a path was traveled.
Hash Optional Specifies whether to include the hash code of the output column
node. The default value is 'false'.
Delimiter Optional Specifies the single-character delimiter that separates symbols in
the path string. The default value is comma (',').
Note:
Do not use any of the following characters as delimiter (they
cause the function to fail):
• Asterisk (*)
• Plus (+)
• Left parenthesis (()
• Right parenthesis ())
• Single quotation mark (')
• Escaped single quotation mark (\')
• Backslash (\)
Input
The input table has the same schema as the Path_Generator input table (refer to Input).
Output
The output table has the same schema as the Path_Start output table (refer to Output).
Example
This example uses clickstream data from an e-commerce website. The table, Path_Generator Example E-
Commerce Website Page Symbols, from the Example section of the function: Path_Generator describes the
pages of the website.
Input
Use the following table from the Input section of the Path_Generator function Example section.
• Path_Generator Example Input Table: clickstream1
SQL-MapReduce Call
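A sketch of such a call; because Path_Analyzer is a driver function, the ON (SELECT 1) PARTITION BY 1 clause and the count column name are assumptions:
SELECT * FROM Path_Analyzer (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('clickstream1')
  OutputTable ('path_analysis_output')
  SeqColumn ('path')
  CountColumn ('cnt')   -- column name assumed
  Delimiter (',')
  Hash ('false')
);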
SAX2
Summary
The SAX2 function transforms original time series data into symbolic strings, which are more suitable for
many additional types of manipulation, because of their smaller size and the relative ease with which
patterns can be identified and compared. Input and output formats allow it to supply data to the Shapelet
Functions.
Background
A time series is a collection of data observations made sequentially over time. Time series occur in virtually
every medical, scientific, entertainment, and business domain.
Symbolic Aggregate Approximation (SAX) uses a simple algorithm with low computational complexity to
create symbolic strings from time series data. Using a symbolic representation enables additional functions,
such as Teradata Aster nPath, to operate easily on the data.
The data can also be manipulated using common algorithms such as hashing or regular-expression pattern
matching. In classic data-mining tasks such as classification, clustering, and indexing, SAX is accepted as
being as good as some well-known, but storage-intensive methods like Discrete Wavelet Transform (DWT)
and Discrete Fourier Transform (DFT).
SAX transforms a time series X of length n into the string of arbitrary length w, where w < n, using an
alphabet A of size a > 2.
The SAX algorithm has two steps:
1. SAX transforms the original time series data into a piecewise aggregate approximation (PAA)
representation. This transformation splits the time series data into intervals and represents each interval
by its mean value. This is a simple way to reduce the dimensionality of the data.
2. SAX converts the PAA representation into a string of letters that represents the patterns occurring in the
data over time, assigning each interval mean to one of a limited set of alphabetical symbols (letters)
based on thresholds derived from the normal distribution curve.
The symbols created by SAX correspond to time series features with equal probability, allowing them to
be compared and used for further manipulation with reliable accuracy. Time series normalized to zero mean
and unit variance approximately follow the normal distribution. Using properties of the Gaussian
distribution, SAX divides the area under the normal curve into equal-probability regions, taking the
coordinates of the cut lines from lookup tables. In the SAX algorithm context, the x coordinates of these cut
lines are called breakpoints.
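For example, with an alphabet of size a = 4, the standard SAX lookup table splits the area under the normal curve into four equal-probability regions with breakpoints at approximately -0.67, 0, and 0.67. A PAA value v is then assigned a symbol as follows: 'a' if v < -0.67, 'b' if -0.67 <= v < 0, 'c' if 0 <= v < 0.67, and 'd' if v >= 0.67.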
Usage
Arguments
Argument Category Description
ValueColumns Required Specifies the names of the input table columns that
contain the time series data to be transformed.
TimeColumn Optional Specifies the name of the input table column that
contains the time axis of the data.
WindowType Optional Determines how much data the function processes at
one time:
• 'global' (default)
Mean Optional (single-input syntax only) Specifies the global mean values that the
function uses to calculate the SAX code for every partition. A
mean_value has the data type DOUBLE PRECISION.
If Mean specifies only one value and ValueColumns
specifies multiple columns, then the specified value
applies to every value_column.
If Mean specifies multiple values, then it must
specify a value for each value_column. The nth
mean_value corresponds to the nth value_column.
Tip:
To specify a different global mean value for each
partition, use the multiple-input syntax and put
the values in the meanstats table.
StDev Optional (single-input syntax only) Specifies the global standard deviation
values that the function uses to calculate the SAX code for every
partition. A stdev_value has the data type DOUBLE PRECISION.
Tip:
To specify a different global standard deviation
value for each partition, use the multiple-input
syntax and put the values in the stdevstats table.
WindowSize Required if WindowType is 'sliding', not allowed otherwise.
Specifies the size of the sliding window. The value
must be an integer greater than 0.
OutputFrequency Optional Specifies the number of data points that the window
slides between successive outputs. The value must be
an integer greater than 0. The default value is 1.
Note:
To specify OutputFrequency, the WindowType value must be
'sliding' and the Output value cannot be 'characters'. If
WindowType is 'sliding' and Output is 'characters', then
OutputFrequency is automatically set to the WindowSize value,
to ensure that a single character is assigned to each time point.
If the number of data points in the time series is not an integer
multiple of the window size, then the function ignores the
leftover parts.
Note:
WindowType value must be 'global'.
Note:
WindowType value must be 'sliding'.
Note:
Output value must be 'bitmap'.
PrintCodeStats Optional Specifies whether the function prints the mean and
standard deviation. The default value is 'false'.
Note:
Output value must be 'string'.
Accumulate Optional Specifies the names of the input table columns that are to
appear in the output table. For each sequence in the
input table, SAX2 chooses the value corresponding to the
first time point in the sequence to output as the
accumulate value.
Input
The single-input version of the SAX2 function requires one input table, input.
The multiple-input version of the SAX2 function requires three input tables—input, meanstats, and
stdevstats.
The input table must have one or more columns that contain time series data to be transformed, and you
must specify their names with the ValueColumns argument. The input table can have other columns, but the
function ignores them unless you specify them with the TimeColumn or Accumulate argument. The
following table gives the valid data types for input table columns that you can specify with the
ValueColumns, TimeColumn, and Accumulate arguments.
Table 183: SAX2 Input Table Schema
Both the meanstats and stdevstats tables must have every value_column and partitioning column (the
partition key) that the input table has.
The meanstats table contains the global means of each value_column of the input table. Each row of the
meanstats table specifies the global means for one input partition.
The stdevstats table contains the global standard deviations of each value_column of the input table. Each
row of the stdevstats table specifies the global standard deviations for one input partition.
Output
The output table format depends on the values of the Output and WindowType arguments.
For 'string' or 'bytes' output, the output table has only one row for a 'global' window, but multiple rows for a
'sliding' window. The following table describes the output table columns. In the column names, n varies
from 1 to N.
Table 184: SAX2 'string' or 'bytes' Output Table Schema
For 'bitmap' output, the output table has only one row, whose columns are described in the following table.
In column names, n varies from 1 to N.
Table 185: SAX2 'bitmap' Output Table Schema
For 'characters' output, each SAX symbol has its own row in the output table. You can input this output table
to the function HMMUnsupervisedLearner.
Table 186: SAX2 'characters' Output Table Schema
Examples
These examples use seasonally adjusted quarterly financial data from West Germany between 1960 and
1982.
SQL-MapReduce Call
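A sketch of such a call, assuming that the example table finance_data3 is partitioned by id and ordered by a column period, and that one of its value columns is named expenditure (the column names other than id are assumptions):
SELECT * FROM SAX2 (
  ON finance_data3
  PARTITION BY id
  ORDER BY period
  ValueColumns ('expenditure')
  TimeColumn ('period')
  WindowType ('global')
  Output ('string')
);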
Output
SQL-MapReduce Call
Output
SQL-MapReduce Call
Output
SQL-MapReduce Call
The statistics are created from a query on the input table finance_data3 and grouped by column id.
Output
id period_start period_end
1 1960Q1 1969Q4
2 1970Q1 1979Q4
3 1980Q1 1982Q4
SeriesSplitter
Summary
The SeriesSplitter function splits partitions into subpartitions (called splits) to balance the partitions for time
series manipulation. The function creates an additional column that contains split identifiers. Each row
contains the identifier of the split to which the row belongs. Optionally, the function also copies a specified
number of boundary rows to each split.
Background
In many real-world use cases, the data is greatly skewed across partitions (that is, some partitions contain
significantly more data than others). This is especially true in time series manipulation—a single partition in
the input table can contain a time series with billions of data points.
Sometimes the input table cannot be further partitioned with the PARTITION BY clause. The most
common reasons are:
• The table has no column or combination of columns that can be used to further partition the data.
• The table contains an ordered data set, and to analyze one row, a function must consider adjacent rows.
Simply slicing the table makes analysis of boundary data impossible. The boundary of each subpartition
must include duplicate rows from the neighboring partition.
One vworker must process an entire partition. Therefore, severe imbalance in the partitions causes severe
load imbalance across vworkers.
Usage
SeriesSplitter Syntax
Version 1.0
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the input table to be split.
PartitionByColumns Required Specifies the partitioning columns of input_table. These columns
determine the identity of a partition. For data type restrictions of
these columns, see the Aster Database documentation.
DuplicateRowsCount Optional Specifies the number of rows to duplicate across split boundaries.
By default, the function duplicates one row from the previous
partition and one row from the next partition. If you specify only
value1, then the function duplicates value1 rows from the previous
partition and value1 rows from the next partition. If you specify
both value1 and value2, then the function duplicates value1 rows
from the previous partition and value2 rows from the next
partition. Each argument value must be a nonnegative integer less
than or equal to 1000.
OrderByColumns Optional Specifies the ordering columns of input_table. These columns
establish the order of the rows and splits. Without this argument,
the function can split the rows in any order.
SplitCount Optional Specifies the desired number of splits in a partition of the output
table.
Note:
If input_table has multiple partitions, then you cannot specify
SplitCount. Instead, specify RowsPerSplit.
Input
The input table is the table to be split, which is specified by the InputTable argument. The following table
describes the input_table columns that you can specify in function arguments.
Table 198: SeriesSplitter input_table Schema
Output
The SeriesSplitter function outputs two tables, the output table and the stats table.
Table 199: SeriesSplitter Output Table Schema
Examples
• Input
• Example 1: Partition Splitter
• Example 2: Using SeriesSplitter with Interpolator
Input
The input table contains the daily IBM stock prices from 1961 to 1962.
SQL-MapReduce Call
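A sketch of such a call; because SeriesSplitter is a driver function, the ON (SELECT 1) PARTITION BY 1 clause, the OutputTable argument name, and the ibm_stock column names (id, period) are assumptions:
SELECT * FROM SeriesSplitter (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('ibm_stock')
  OutputTable ('ibm_stock_split')
  PartitionByColumns ('id')
  OrderByColumns ('period')
);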
Output
statistics value
input_table_row_count 369
input_partition_count 1
output_split_count 47
inserted_row_count 94
output_table_row_count 463
processing_time_in_seconds 5
The query below returns the output shown in the following table:
SQL-MapReduce Calls
There are two ways to use Interpolator with SeriesSplitter:
• Call SeriesSplitter as in Example 1: Partition Splitter to create ibm_stock_split, and then call Interpolator
on the split table in a separate SQL-MapReduce call.
• Call SeriesSplitter directly in the ON clause of Interpolator.
The first choice, using separate SQL-MapReduce calls for SeriesSplitter and the function that uses it,
provides better performance than the second choice.
Troubleshooting
Problem: Invoking a function using SeriesSplitter does not improve execution time.
Note:
Before trying workarounds, ensure that the data is skewed and that the function that uses SeriesSplitter
does not exploit full parallelism. If the data is not skewed and the function exploits full parallelism, then
SeriesSplitter cannot improve its execution time.
Workaround:
• Invoke SeriesSplitter and the subsequent function in separate SQL-MapReduce calls (as in the first choice
in Example 2: Using SeriesSplitter with Interpolator), rather than using SeriesSplitter in the ON clause of
the subsequent function (as in the second choice in Example 2: Using SeriesSplitter with Interpolator).
• Adjust these arguments as follows:
∘ DuplicateRowsCount: as low as possible
∘ SplitCount: a smaller multiple (for example, 1) of the number of vworkers in the cluster
∘ RowsPerSplit: as high as possible (you want the resulting number of splits to be a smaller multiple of
the number of vworkers in the cluster)
∘ Accumulate: specify as few columns as possible
∘ DuplicateColumn: omit this argument
Sessionize
Summary
The Sessionize function maps each click in a session to a unique session identifier. A session is defined as a
sequence of clicks by one user that are separated by at most n seconds.
The function is useful both for sessionization and for detecting web crawler (“bot”) activity. It is typically
used to understand user browsing behavior on a web site.
Background
Sessionize is a SQL-MapReduce function. Sample code is included with the Aster SQL-MapReduce Java API.
Usage
Sessionize Syntax
Version 1.3
Arguments
Argument Category Description
TimeColumn Required Specifies the name of the input column that contains the click times.
Note:
The timestamp_column must also be an order_column.
TimeOut Required Specifies the number of seconds at which the session times out. If
session_timeout seconds elapse after a click, then the next click starts a
new session.
Input
The input table must have a timestamp column and columns by which to partition and order the data. Input
data must be partitioned such that each partition contains all rows of an entity. No input column can have
the name 'sessionid' or 'clicklag', because these are output column names.
Table 204: Sessionize Input Table Schema
To create a single timestamp column from separate date and time columns:
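One possible statement, assuming hypothetical columns datecol (DATE) and timecol (TIME) in a table raw_clicks, and PostgreSQL-style date arithmetic (DATE + TIME yields TIMESTAMP):
CREATE TABLE clicks_with_ts AS
SELECT userid,
       (datecol + timecol) AS clicktime,  -- DATE + TIME -> TIMESTAMP
       pagetype
FROM raw_clicks;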
Output
Table 205: Sessionize Output Table Schema
Input
The input table is web clickstream data recorded as a user navigates through a web site. Events—view, click,
and so on—are recorded with a timestamp.
Table 206: Sessionize Example Input Table adweb_clickstream
SQL-MapReduce Call
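A sketch of such a call, assuming that adweb_clickstream identifies users by a userid column and records event times in a clicktime column (both names are assumptions); 600 is an example timeout of 10 minutes:
SELECT * FROM Sessionize (
  ON adweb_clickstream
  PARTITION BY userid
  ORDER BY clicktime
  TimeColumn ('clicktime')
  TimeOut (600)
) ORDER BY userid, clicktime;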
Shapelet Functions
The shapelet functions are:
• UnsupervisedShapelet, which takes a set of time series and assigns them to clusters, based on the
shapelets that it finds.
• SupervisedShapeletTrainer, which takes a set of classified time series and outputs a model for classifying
time series.
• SupervisedShapeletClassifier, which uses that model to classify a set of time series.
Overview
Any classification task that must preserve ordering can be characterized as time-series classification. Many
real-world use cases involve data that varies only slightly. Traditional classifiers may be unable to classify
such data with high precision.
Shapelets are contiguous subsequences of a time series that identify a class with high accuracy. Because
shapelets focus on local features of a time series, they can be more accurate and faster than other time-series
classification methods. In many applications, shapelets have also been found to produce interpretable results,
providing useful insights into differences between classes.
The most common use cases are long-term trends with small local pattern changes that distinguish trends
from each other. Almost any time-series classification problem can be mapped to a shapelets discovery
problem. For example:
• Clickstream analysis
• Scientific or health applications such as ECG analysis
• Imaging applications such as gesture recognition or motion analysis
• Manufacturing applications such as process anomaly detection
• Financial applications such as stock price analysis
Before a shapelets function classifies or clusters a set of time series, it normalizes and SAX-encodes them.
Normalization is required because shapelet classification depends on the distance between two time series.
SAX-encoding makes patterns in the data easier to identify and compare. For more information about SAX-
encoding, refer to SAX2.
The following references explain in detail how shapelets are identified. Aster Analytics’ implementation of
shapelets is based on the fast shapelet finder algorithm published by Rakthanmanon. The unsupervised
shapelet implementation is based on the scalable unsupervised-shapelet algorithm published by Ulanova.
• L. Ye and E. Keogh, "Time Series Shapelets: A New Primitive for Data Mining," KDD 2009.
• T. Rakthanmanon and E. Keogh, "Fast Shapelets: A Scalable Algorithm for Discovering Time Series
Shapelets," SIAM 2013.
• J. Zakaria, A. Mueen, and E. Keogh, "Clustering Time Series Using Unsupervised-Shapelets."
• L. Ulanova, N. Begum, and E. Keogh, "Scalable Clustering of Time Series with U-Shapelets."
UnsupervisedShapelet
Summary
The UnsupervisedShapelet function takes a set of time series and assigns them to clusters, based on the
u_shapelets that it finds. The function uses these steps:
1. Saxify the input data (as described in SAX2).
2. Apply random masking to the input data.
Usage
The following helper functions must be installed to run UnsupervisedShapelet. You can install these
functions with the command \install filename.ext.
• sax2.zip
• UshapeletMasker.zip
• UshapeletInTimeseries.zip
• UshapeletTSDistance.zip
• UshapeletFinderByScore.zip
UnsupervisedShapelet Syntax
Version 1.0
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Input
The function uses the input table columns time_column and value_column for saxification. They correspond
to the time_column and value_column in input for SAX2.
Input time series data must be correctly formatted; otherwise, function behavior is undefined. In a correctly
formatted time series, time intervals are evenly spaced and all time intervals have numeric values.
To calculate missing values in a time series, use the function Interpolator. If the input table’s time column is
text-based, create a new input table with an integer-based time column. For example, suppose that the table
time_series_text_time has the text-based columns idval, timeval, valueval, and catval. This statement creates
a table that the function accepts as input:
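One possible form of that statement, assuming ANSI window-function support; it replaces the text-based time column with the row rank within each series:
CREATE TABLE time_series_int_time AS
SELECT idval,
       ROW_NUMBER() OVER (PARTITION BY idval ORDER BY timeval) AS timeval,  -- integer time axis
       valueval,
       catval
FROM time_series_text_time;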
Output
The function outputs a message and a table.
Example
Input
The input table has 10 price observations for four stocks. The column period contains time values,
represented by integers. Because the function is unsupervised, it ignores the stock_category column;
however, you can use that column to verify that the generated clusters belong to the same category.
Table 211: UnsupervisedShapelet Example Input Table ushapelets_input
In the time period shown, technology stocks 1 and 4 have similar price trajectories, as do healthcare stocks 2
and 3.
SQL-MapReduce Call
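A sketch of such a call; the driver-style ON clause, the OutputTable argument name, and the value column name are assumptions, while the table names and the stockid, period, and stock_category columns come from the example:
SELECT * FROM UnsupervisedShapelet (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('ushapelets_input')
  OutputTable ('uss_output')
  IDColumn ('stockid')
  TimeColumn ('period')
  ValueColumn ('stockprice')  -- column name assumed
);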
Output
Table 212: UnsupervisedShapelet Example Output Message
statistics
Unsupervised shapelets table created: "uss_output"
number of clusters : 1
number of timeseries : 4
The function assigned technology stocks 1 and 4 to cluster 0 and healthcare stocks 2 and 3 to cluster 1.
stockid cluster_label
3 0
1 0
4 0
2 0
Troubleshooting
Problem: The function runs slowly for large input data sets.
For a large input data set, the function might run very slowly, spending a lot of time on one step, or it might
terminate with a failure message on the console. Consult the logs for error messages and troubleshooting
information.
Workarounds:
• Improve the execution time of the saxification step, in any of the following ways:
∘ Increase SaxWindowSize argument value.
∘ Increase the SaxOutputFrequency argument value.
∘ Decrease the SaxSymbolsPerWindow argument value.
• Decrease the number of masking operations by decreasing the RandomProjections argument value.
• Decrease the number of iterations by decreasing the MaxNumIter argument value.
• Decrease the number of u_shapelets for clustering by decreasing the ShapeletCutOff argument value.
• Increase the Threshold argument value.
Problem: The clusters that the function outputs are not accurate.
Workarounds:
• Improve the accuracy of the saxification step, in any of the following ways:
∘ Decrease the SaxWindowSize argument value.
∘ Decrease the SaxOutputFrequency argument value.
∘ Increase the SaxSymbolsPerWindow argument value.
• Increase the number of masking operations by increasing the RandomProjections argument value.
• Increase the number of iterations by increasing the MaxNumIter argument value.
• Increase the number of u_shapelets for clustering by increasing the ShapeletCutOff argument value.
• Decrease the Threshold argument value.
SupervisedShapeletTrainer
Summary
The SupervisedShapeletTrainer function takes a set of classified time series and outputs a model for
classifying time series, based on the shapelets that it finds. The model is input to the function
SupervisedShapeletClassifier.
Usage
The following helper functions must be installed to run SupervisedShapeletTrainer. You can install these
functions with the command \install filename.ext.
• sax2.zip
• ShapeletMasker2.zip
• ShapeletCollisionCounter2.zip
• ShapeletPowerFinder2.zip
• ShapeletCandidateFinder2.zip
• ShapeletCandidateScoring2.zip
SupervisedShapeletTrainer Syntax
Version 1.1
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input data.
CategoryTable Optional Specifies the name of the table that contains the categories
(classes) for the time series in input_data_table. The default
value is input_data_table.
If input_categories_table is different from input_data_table,
the function ignores any time series that is not in both
input_categories_table and input_data_table. If a time series
is represented by multiple rows in input_categories_table,
these rows must contain the same category; otherwise, the
function might not select the correct category.
IDColumn Required Specifies the name of the column in input_data_table and
input_categories_table that contains the unique identity of a
time series.
TimeColumn Required Specifies the name of the input_data_table column that
contains the time axis of the data.
ValueColumn Required Specifies the name of the input_data_table column that
contains the data points.
CategoryColumn Required Specifies the name of the input_categories_table column that
contains the category (class) of the time series.
SaxSymbolsPerWindow Optional Specifies the SAX2 argument SymbolsPerWindow, which
specifies the number of SAX code symbols to generate from
a window. The symbols_per_window must be an INTEGER in
the range [1, 1000000]. The default value is 10.
If the symbols_per_window is greater than the length of the
shortest time series in the input data set (d), then its value
becomes d.
SaxMinWindowSize Optional Specifies the SAX2 argument WindowSize, which specifies
the size of the sliding window. The min_window_size
defines the length (number of data points) of the shortest
shapelet; the minimum span (time series length) used to
distinguish two time series from each other. The
min_window_size must be an integer in the range
[1, 1000000]. The default value is 10.
If the min_window_size is greater than the length of the
shortest time series in the input data set (d), then its value
becomes d. If min_window_size is smaller than
Input
The input table input_data_table has the same schema as UnsupervisedShapelet Input.
Output
The function outputs a message and a model table.
Table 214: SupervisedShapeletTrainer Output Message Schema
Example
Input
The input table has 10 price observations for four stocks. The column period contains time values,
represented by integers.
Table 216: SupervisedShapeletTrainer Example Input Table shapelets_train
SQL-MapReduce Call
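A sketch of such a call, using argument names from the Arguments table above; the driver-style ON clause, the model-table argument name, and the id and value column names are assumptions:
SELECT * FROM SupervisedShapeletTrainer (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('shapelets_train')
  ModelTable ('shapelets_model')    -- argument name assumed
  IDColumn ('id')                   -- column name assumed
  TimeColumn ('period')
  ValueColumn ('stockprice')        -- column name assumed
  CategoryColumn ('stock_category')
);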
Output
statistics
shapelet model table created : "shapelets_model"
number of shapelets : 1
number of rows : 6
Troubleshooting
Problem: The function runs slowly for large input data sets.
For a large input data set, the function might run very slowly, spending a lot of time on one step, or it might
terminate with a failure message on the console. Consult the logs for error messages and troubleshooting
information.
Workarounds:
• Improve the execution time of the saxification step, in any of the following ways:
∘ Decrease the difference between the SaxMinWindowSize and SaxMaxWindowSize argument values.
∘ Increase the SaxOutputFrequency argument value.
∘ Decrease the SaxSymbolsPerWindow argument value.
• Decrease the number of masking operations by decreasing the RandomProjections argument value.
• Decrease the number of shapelets in the output table by decreasing the ShapeletCount argument value.
• Increase the number of data points to skip between consecutive time series windows when calculating the
distance of a shapelet from a time series by increasing the TimeInterval argument value.
Problem: The shapelets in the output model are not accurate.
Workarounds:
• Improve the accuracy of the saxification step, in any of the following ways:
∘ Increase the difference between the SaxMinWindowSize and SaxMaxWindowSize argument values.
∘ Decrease the SaxOutputFrequency argument value.
∘ Increase the SaxSymbolsPerWindow argument value.
• Increase the number of masking operations by increasing the RandomProjections argument value.
• Increase the number of shapelets in the output table by increasing the ShapeletCount argument value.
• Decrease the number of data points to skip between consecutive time series windows when calculating
the distance of a shapelet from a time series by decreasing the TimeInterval argument value.
SupervisedShapeletClassifier
Summary
The SupervisedShapeletClassifier function uses the model output by the function SupervisedShapeletTrainer
to classify a set of time series.
Usage
SupervisedShapeletClassifier Syntax
Version 1.1
Arguments
Argument Category Description
TimeInterval Optional Specifies the number of data points to skip between
consecutive time series windows when calculating the
distance of a shapelet from a time series.
Note:
This argument must specify the same value as the
SupervisedShapeletTrainer TimeInterval argument
specified when it generated the shapelets table.
Input
The function requires the following:
• time_series, which has the same schema as the UnsupervisedShapelet input table (refer to Input in
UnsupervisedShapelet).
• shapelets, a model table output by SupervisedShapeletTrainer (refer to Output in
SupervisedShapeletTrainer).
Output
Table 219: SupervisedShapeletClassifier Output Table Schema
Example
Input
• shapelets_test, which contains additional data from the data set used to train the model
• shapelets_model (Output), the model output by the SupervisedShapeletTrainer example
Table 220: SupervisedShapeletClassifier Example Input Table shapelets_test
SQL-MapReduce Call
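A sketch of such a call; the input aliases time_series and shapelets follow the Input section, while the partitioning and ordering columns are assumptions:
SELECT * FROM SupervisedShapeletClassifier (
  ON shapelets_test AS time_series PARTITION BY id ORDER BY period
  ON shapelets_model AS shapelets DIMENSION
) ORDER BY id;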
Output
id predicted_category stock_category
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
The prediction accuracy is 100% because the predicted and original categories are the same.
VARMAX
Summary
VARMAX (Vector Autoregressive Moving Average model with eXogenous variables) extends the ARMA/
ARIMA model in two ways:
• To work with time series with multiple response variables (vector time series).
• To work with exogenous variables, or variables that are independent of the other variables in the system.
The model includes both the dynamic relationship between the multiple response variables and the
relationship between the dependent and independent variables.
This formula represents a nonseasonal VARMAX model:
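Written out in the notation of this section (a reconstruction from the definitions that follow):
Yt = Φ1Yt-1 + … + ΦpYt-p + B1Xt + B2Xt-1 + … + BbXt-b+1 + Θ1Et-1 + … + ΘqEt-q + C + Et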
In the preceding equation, Yt is a stationarized time series. The first term is the autoregressive component,
the second term is the exogenous component, the third term is the moving average component, the fourth
(C) is a vector of constants, and the fifth (Et) is a vector of residual errors, and:
• Yt is a vector of n response variables
• Xt is a vector of m exogenous variables
• p is the number of previous periods of the endogenous variables included in the model
• q is the number of previous periods included in the moving average
• b is the number of previous periods of exogenous variables included
• Φi is an n * n matrix of autoregressive parameters
• Bi is an n * m matrix of exogenous variable parameters
• Θi is an n * n matrix of moving average parameters
• Et is the difference between the actual and the predicted value of Yt, (Yt - Ŷt).
This formula represents a seasonal VARMAX model:
(1 - Φ1Back - … - ΦpBack^p)(1 - Φs1Back^m - … - ΦspBack^(m·sp))(1 - Back)^d(1 - Back^m)^sd(Yt) = …
where Back is the backshift operator, Back^n(yt) = yt-n, and:
• m is the number of periods per season
• d is the number of differencing steps performed to stationarize the time series
• sp, sq, sd are the seasonal parameters corresponding to p, q, d
Usage
The VARMAX function expects that each time series is an ordered sequence in a partition with uniform
time intervals. The function assumes each partition can fit into memory. The function does not accept null
or non-numeric inputs, with the exception noted in the description of ResponseColumns (in the Arguments
table).
VARMAX Syntax
Version 1.0
Arguments
Argument Category Description
ResponseColumns Required Specifies the columns containing the response data. Null
values are acceptable at the end of the series. If StepAhead is
specified, the function reports predicted values for the
missing values, taking into account values from the
predictor columns for those time periods.
ExogenousColumns Optional Specifies the columns containing the independent
(exogenous) predictors. If not specified, the function
calculates the model without exogenous vectors.
PartitionColumns Optional Specifies the partition columns to pass to the output. If not
specified, the output contains no partition columns.
Orders Required Specifies the parameters p, d, q for the VARMA part of the
model.
This argument consists of 3 non-negative integers separated
by commas. Values must be between 0 and 20.
SeasonalOrders Optional Specifies seasonal parameters sp, sd, sq for the VARMA part
of the model. This argument consists of 3 non-negative
integers separated by commas. Values must be between 0
and 20. If not specified, the model is treated as nonseasonal.
If the SeasonalOrders argument is used, the Period
argument must also be present.
Period Optional Specifies the period of each season. Must be a positive
integer value. If the Period argument is used, the
SeasonalOrders argument must also be present. If not
specified, the model is treated as nonseasonal.
ExogenousOrder Optional Specifies the order of exogenous variables. If the current
time is t and ExogenousOrder is b, the following values of
the exogenous time series are used in calculating the
response: Xt Xt-1 ... Xt-b+1. If not specified, the model is
calculated without exogenous vectors.
Input
Table 222: VARMAX Input Schema
Note:
If the time points do not have uniform intervals, then
run the function Interpolator on them before running
the VARMAX function on the input table.
Output
The output of VARMAX is a model for each partition in the input table. The Output table schema and
additional details about the output are shown in the following two tables.
coef coef_value
coef The vector [p, d, q, sp, sd, sq, b].
ar_params The matrices Φi, shown as a vector of p matrices, each of which is an n * n
matrix. p is from the coef vector and n is the number of response variables
specified in the ResponseColumns argument.
ma_params The matrices Θi, shown as a vector of q matrices, each of which is an n * n
matrix. q is from the coef vector and n is the number of response variables
specified in the ResponseColumns argument.
exogenous_params The matrices Bi, shown as a vector of b matrices, each of which is an n * m
matrix. b is from the coef vector, n is the number of response variables specified
in the ResponseColumns argument, and m is the number of exogenous
variables specified in the ExogenousColumns argument.
seasonal_ar_params The matrices Φsi for the seasonal autoregressive parameters.
seasonal_ma_params The matrices Θsi for the seasonal moving average parameters.
mean_param The mean vector of the response series. This value is only displayed if the
argument IncludeMean is set to True.
period The cycle period for seasonal models (0 for non-seasonal models).
lag The lag value specified in the function call.
sigma The variance matrix.
aic The Akaike information criterion.
bic The Bayesian information criterion.
iterations The number of iterations performed.
converged Whether the algorithm converged.
Examples
• Input
• Example 1: VARMAX without Exogenous Model
• Example 2: VARMAX with Exogenous Model
• Example 3: VARMAX with Seasonal Model and without Exogenous Model
• Example 4: VARMAX with All Models
Input
All examples use the following input table, which is seasonally adjusted quarterly financial data from West
Germany between 1960 and 1982. Three time series are included: consumer expenditures, disposable
income, and fixed investment. Values are shown in billions of DM. The time series is partitioned by “id”
column, which indicates the decade.
Table 225: VARMAX Example Input Table finance_data3
SQL-MapReduce Call
Three values are predicted (StepAhead (3)) for each time series, with Orders (1, 1, 1).
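A sketch of such a call; the response column names (expenditure, income, investment) and the ordering column (period) are assumptions based on the input description:
SELECT * FROM VARMAX (
  ON finance_data3
  PARTITION BY id
  ORDER BY period
  ResponseColumns ('expenditure', 'income', 'investment')
  PartitionColumns ('id')
  Orders ('1, 1, 1')
  StepAhead (3)
) ORDER BY id;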
Output
Note:
Note that series id = 2 does not converge. Convergence could possibly be improved by adding more
orders or more models.
SQL-MapReduce Call
Output
Note:
The model converges for all three decades.
SQL-MapReduce Call
Seasonal modeling is specified by the Period and SeasonalOrders arguments.
Output
Note:
series id = 3 does not converge. Convergence could possibly be improved by adding either more orders or
more models.
SQL-MapReduce Call
Output
Note:
The model converges across all partitions.
nPath
Summary
The nPath function matches specified patterns in a sequence of rows from one or more input tables and
extracts information from the matched rows.
Usage
nPath Syntax
Version 1.0
The nPath function is not tied to any Aster Database schema and must not be qualified with a schema name.
Arguments
Argument Category Description
Mode Required Specifies the pattern-matching mode:
OVERLAPPING: The function finds every occurrence of the pattern in the
partition, regardless of whether it is part of a previously found match.
Therefore, one row can match multiple symbols in a given matched pattern.
NONOVERLAPPING: The function begins the next pattern search at the
row that follows the last pattern match. This is the default behavior of many
commonly used pattern matching utilities, including the UNIX grep utility.
Pattern Required Specifies the pattern for which the function searches. You compose pattern
with the symbols that you define in the Symbols argument, operators, and
parentheses.
Symbols Required Defines the symbols that appear in the Pattern and Result argument
values. For example:
Symbols (
pagetype = 'homepage' AS H,
pagetype <> 'homepage' AND pagetype <> 'checkout' AS PP,
pagetype = 'checkout' AS CO
)
When the function has multiple input tables, you can qualify the column
name with the table name. For example:
Symbols (
weblog.pagetype = 'homepage' AS H,
weblog.pagetype = 'thankyou' AS T,
ads.adname = 'xmaspromo' AS X,
ads.adname = 'realtorpromo' AS R
)
For more information about symbols that appear in the Pattern argument
value, refer to Symbols. For more information about symbols that appear in
the Result argument value, refer to Result: Applying Aggregate Functions.
Filter Optional Specifies filters to impose on the matched rows. The function combines the
filter expressions using the AND operator.
The filter_expression syntax is:
In the following table, A and B are symbols defined in the Symbols argument.
Table 232: Simple nPath Patterns and Operator Precedence
(subpattern){n[,[m]]}
In the preceding syntax, you must type the braces ({ and }).
(subpattern){n} specifies that subpattern must appear exactly n times. For example, the following pattern
specifies that subpattern (A.B|C) must appear exactly 3 times:
'X.(Y.Z).(A.B|C){3}'
This pattern is equivalent to:
'X.(Y.Z).(A.B|C).(A.B|C).(A.B|C)'
(subpattern){n,} specifies that subpattern must appear at least n times. For example, the following pattern
specifies that subpattern (A.B|C) must appear at least 4 times:
'X.(Y.Z).(A.B|C){4,}'
This pattern is equivalent to:
'X.(Y.Z).(A.B|C).(A.B|C).(A.B|C).(A.B|C)+'
(subpattern){n,m} specifies that subpattern must appear at least n times and at most m times. For example,
the following pattern specifies that subpattern (A.B|C) must appear at least 2 times and at most 4 times:
'X.(Y.Z).(A.B|C){2,4}'
This pattern is equivalent to:
'X.(Y.Z).(A.B|C).(A.B|C).(A.B|C)?.(A.B|C)?'
Input
The function requires at least one partitioned input table, and can have additional input tables that are either
partitioned or DIMENSION tables.
Table 233: nPath Input Table Schema
Note:
If the input to nPath is nondeterministic, then the results are nondeterministic.
Output
Table 234: nPath Output Table Schema
Pattern Matching
Conceptually, nPath pattern matching proceeds like this: Starting from a row in a partition, the function
tries to match the given pattern along the row sequence in the partition (ordered as specified in the ORDER
BY clause).
If the function cannot match the pattern, it outputs nothing; otherwise, it continues to the next row. When
the function finds a sequence of rows that match the pattern, it selects the largest set of rows that constitute
the match and outputs a row based on this match.
For example, suppose that the pattern is 'A.B+' and the rows that constitute the match start at a row t1 and
end at row t4. Suppose that t1 matches A and each of t2,t3, and t4 matches B. When the matching is
complete, A represents t1 and B represents t2, t3, and t4. Using the rows represented by A and B, the
function evaluates the Result argument (typically applying an aggregate function to each symbol in the
pattern), outputs one row with the result values, and proceeds to search for the next pattern match.
Before running nPath on a large data set, create a small data set that includes the pattern that you want to
find. Test your pattern on the small data set and refine it until nPath gives the desired output, and then use
the refined pattern on the large data set.
job_transition_path count
[Chief Exec Officer, Software Engineer, Software Engineer, Chief Exec Officer, Chief Exec Officer] 1
In the pattern, CEO matches the first row, ENGR matches the second row, and OTHER* matches the
remaining rows:
job_transition_path count
[Chief Exec Officer, Software Engineer, Software Engineer, Chief Exec Officer, Chief Exec Officer] 1
In the pattern, CEO matches the first row, ENGR matches the second row, OTHER* matches the next two
rows, and CEO matches the last row:
Symbols
This section applies only to symbols that appear in the Pattern argument, described in Arguments. For
information about symbols that appear in the Result argument, refer to Result: Applying Aggregate
Functions.
For each symbol definition, col_expr = symbol_predicate AS symbol, the function returns the rows for which
col_expr equals symbol_predicate. For example, for pagetype = 'home' AS H, the function returns the
first and fourth rows of the following table.
Table 238: nPath Sample Input Table
The function does not return any row that contains a NULL value. For example, for pagetype =
'checkout' AS C, the function returns the second row of the preceding table, but not the third.
The predicate TRUE matches every row.
If symbols have overlapping predicates, multiple symbols might match the same row.
A LAG expression compares a column value in the current row with a column value in a previous row; its
general form, inferred from the components listed below, is: current_expr operator LAG (previous_expr,
lag_rows [, default])
where:
• current_expr is the name of a column from the current row (or an expression operating on this column).
• operator is either >, >=, <, <=, =, or !=
• previous_expr is the name of a column from a previous row (or an expression operating on this column).
• lag_rows is the number of rows to count backward from the current row to reach the previous row. For
example, if lag_rows is 1, the previous row is the immediately preceding row.
• default is the value to use for previous_expr when there is no previous row (that is, when the current row
is the first row or there is no row that is lag_rows before the current row).
Input
SQL-MapReduce Call
Output
Table 240: nPath LAG Expression Example 1 Output Table (Columns 1-4)
Table 241: nPath LAG Expression Example 1 Output Table (Columns 5-6)
page_path dup_path
[ACCOUNT SUMMARY, FAQ, ACCOUNT HISTORY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT, PROFILE UPDATE, ACCOUNT SUMMARY, CUSTOMER SUPPORT, VIEW DEPOSIT DETAILS] []
[ACCOUNT SUMMARY, FAQ, ACCOUNT SUMMARY, FUNDS TRANSFER, ACCOUNT HISTORY, VIEW DEPOSIT DETAILS, ACCOUNT SUMMARY, ACCOUNT HISTORY] [ACCOUNT SUMMARY]
[ACCOUNT SUMMARY, ACCOUNT HISTORY, FUNDS TRANSFER, ACCOUNT SUMMARY, FAQ] [ACCOUNT SUMMARY, ACCOUNT SUMMARY, FAQ]
[ACCOUNT SUMMARY, ACCOUNT HISTORY, ACCOUNT SUMMARY, ACCOUNT HISTORY, FAQ, ACCOUNT SUMMARY] [ACCOUNT SUMMARY]
[ACCOUNT SUMMARY, FAQ, VIEW DEPOSIT DETAILS, FAQ] []
[ACCOUNT SUMMARY, FUNDS TRANSFER, VIEW DEPOSIT DETAILS, ACCOUNT HISTORY] [VIEW DEPOSIT DETAILS]
... ...
Input
Output
Filters
The Filter argument, which specifies filters to impose on the matched rows, can improve or degrade nPath
performance, depending on several factors. Filtering out most matches can improve performance, but
memory fragmentation can degrade it. Memory fragmentation can occur in these cases:
• The mode is NONOVERLAPPING and the pattern includes the endanchor operator ($) but not the
startanchor operator (^).
• The mode is OVERLAPPING and the pattern does not include the startanchor operator.
• The first symbol in the pattern can match an infinite number of input rows.
• The data partition is huge.
• The Java Virtual Machine (JVM) is too small.
If nPath runs much slower with the Filter argument, increase the size of the JVM. If the problem persists,
alter the pattern.
Input
Table 244: nPath Filter Example Input Table clickstream
SQL-MapReduce Call
Function Description
COUNT ( { * | [DISTINCT] col_expr } OF symbol_list ) Returns either the total number of matched rows
(*) or the number (or distinct number) of col_expr values in the matched rows.
FIRST ( col_expr OF symbol_list ) Returns the col_expr value of the first matched row. For the example in
Pattern Matching, FIRST (pageid OF B) returns the pageid of row t2.
LAST ( col_expr OF symbol_list ) Returns the col_expr value of the last matched row. For the example in
Pattern Matching, LAST (pageid OF B) returns the pageid of row t4.
FIRST_NOTNULL ( col_expr OF symbol_list ) Returns the first non-null col_expr value in the matched
rows.
ACCUMULATE ( [ DISTINCT | CDISTINCT ] col_expr OF symbol_list [ DELIMITER 'delimiter' ] )
Concatenates the col_expr values of the matched rows. DISTINCT limits the concatenated values to
distinct values; CDISTINCT limits the concatenated values to consecutive distinct values.
You can compute an aggregate over more than one symbol. For example, SUM (val OF ANY (A,B))
computes the sum of the values of the attribute val across all rows in the matched segment that map to A or
B.
More examples:
• Example 1 uses FIRST, LAST_NOTNULL, MAX_CHOOSE, and MIN_CHOOSE
• Example 2 uses FIRST and three forms of ACCUMULATE
• Example 3 uses FIRST, three forms of ACCUMULATE, COUNT, and NTH
Example 1
Input
Table 246: nPath Aggregate Functions Example 1 Input Table trans1
SQL-MapReduce Call
Output
Table 247: nPath Aggregate Functions Example 1 Output Table
Example 2
Input
Table 248: Aggregate Functions Example 2 Input Table: clicks
SQL-MapReduce Call
Output
Table 249: nPath Aggregate Functions Example 2 Output Table (Columns 1-4)
Table 250: nPath Aggregate Functions Example 2 Output Table (Columns 5-6)
cde_dup_products de_dup_products
[null$$television$$envelopes$$null] [null, television, envelopes]
Example 3
Input
This example uses the same input table, Aggregate Functions Example 2 Input Table: clicks, as was used in
Example 2.
SQL-MapReduce Call
Table 252: nPath Aggregate Functions Example 3 Output Table (Columns 6-8)
nPath Examples
The following table summarizes the symbols and symbol predicates that the examples use.
Table 253: nPath Clickstream Data Examples Symbols and Symbol Predicates
SELECT ...
FROM nPath (...
SYMBOLS (pageid IN (10, 25) AS A,
category = 10 OR (category = 20 AND pageid <> 33) AS B,
category IN (SELECT pageid
FROM clicks1
GROUP BY userid
HAVING COUNT(*) > 10
) AS C,
referrer LIKE '%Amazon%' AS D,
true AS X
) ...
) ...
Range-Matching Examples
Input
The examples in this section use the input table, nPath LAG Expression Example 2 Input Table:
aggregate_clicks, from LAG Expression Example 2 in the Symbols section. The table is a collection of
clickstream data for different products with price information. Columns userid and sessionid identify the
users.
SQL-MapReduce Call
Output
sessionid path
1 [home, home1, page1, home, home1, page1, home, home, home, home1, page1, checkout,
home, home, home, home, home, home, home, home, home]
2 [home, home, home, home, home, home, home, home, home, home1, page1, checkout,
checkout, home, home]
3 [home, home, home, home, home, home, home, home, home1, page1, home, home1, page1,
home]
4 [home, home, home, home, home, home, home1, home1, home1, page1, page1, page1]
5 [home, home, home, home, home1, home1, home1, page1, page1, page1, page2, page2, page2,
checkout, checkout, checkout, page2, page2, page2]
Example 2: Find Sessions That Start at Home Page and Visit Page1
SQL-MapReduce Call
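A sketch of the kind of call involved, assuming that aggregate_clicks has columns sessionid, clicktime, and pagetype (the last two names are assumptions):
SELECT * FROM nPath (
  ON aggregate_clicks
  PARTITION BY sessionid
  ORDER BY clicktime
  MODE (NONOVERLAPPING)
  PATTERN ('^H.A*.P.A*')
  SYMBOLS (
    pagetype = 'home' AS H,
    pagetype = 'page1' AS P,
    TRUE AS A
  )
  RESULT (
    FIRST (sessionid OF H) AS sessionid,
    ACCUMULATE (pagetype OF ANY (H, P, A)) AS path
  )
) ORDER BY sessionid;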
Output
sessionid path
1 [home, home1, page1, home, home1, page1, home, home, home, home1, page1, checkout,
home, home, home, home, home, home, home, home, home]
2 [home, home, home, home, home, home, home, home, home, home1, page1, checkout,
checkout, home, home]
3 [home, home, home, home, home, home, home, home, home1, page1, home, home1, page1,
home]
4 [home, home, home, home, home, home, home1, home1, home1, page1, page1, page1]
5 [home, home, home, home, home1, home1, home1, page1, page1, page1, page2, page2, page2,
checkout, checkout, checkout, page2, page2, page2]
SQL-MapReduce Call
Output
SQL-MapReduce Call
sessionid path
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [checkout, home]
1 [page1, checkout]
1 [home1, page1]
1 [home, home1]
1 [home, home]
1 [home, home]
1 [page1, home]
1 [home1, page1]
1 [home, home1]
1 [page1, home]
1 [home1, page1]
1 [home, home1]
2 [home, home]
2 [checkout, home]
2 [checkout, checkout]
... ...
SQL-MapReduce Call
Output
sessionid product
1 envelopes
2 tables
3 bookcases
4 tables
5 Appliances
Example 6: Find Data for Sessions That Checked Out 3-6 Products
For sessions where the user checked out between three and six products (exclusive), return the names of the
most and least expensive products, the maximum price of the most expensive product, and the minimum
price of the least expensive product.
SQL-MapReduce Call
Example 7: Find Data for Sessions That Checked Out at Least 3 Products
Modify the SQL-MapReduce call in Example 6 to find sessions where the user checked out at least three
products by changing the Pattern argument to:
PATTERN('H+.D*.C{3,}.D')
SQL-MapReduce Call
Output
SQL-MapReduce Call
Input
userid ts imp
1 2012-01-01 ad1
1 2012-01-02 ad1
1 2012-01-03 ad1
1 2012-01-04 ad1
1 2012-01-05 ad1
1 2012-01-06 ad1
1 2012-01-07 ad1
2 2012-01-08 ad2
2 2012-01-09 ad2
2 2012-01-10 ad2
2 2012-01-11 ad2
... ... ...
userid ts click
1 2012-01-01 ad1
2 2012-01-08 ad2
3 2012-01-16 ad3
4 2012-01-23 ad4
5 2012-02-01 ad5
6 2012-02-08 ad6
7 2012-02-14 ad7
8 2012-02-24 ad8
9 2012-03-02 ad9
10 2012-03-10 ad10
11 2012-03-18 ad11
12 2012-03-25 ad12
13 2012-03-30 ad13
14 2012-04-02 ad14
15 2012-04-06 ad15
ts tv_imp
2012-01-01 ad1
2012-01-02 ad2
2012-01-03 ad3
2012-01-04 ad4
2012-01-05 ad5
2012-01-06 ad6
2012-01-07 ad7
2012-01-08 ad8
2012-01-09 ad9
2012-01-10 ad10
2012-01-11 ad11
2012-01-12 ad12
2012-01-13 ad13
2012-01-14 ad14
2012-01-15 ad15
Output
imp_cnt tv_imp_cnt
18 0
19 0
19 0
20 0
21 0
22 0
22 0
22 0
22 0
22 0
23 0
23 0
23 0
24 0
25 0
Statistical Analysis
• Approximate Distinct Count
• Approximate Percentile
• CMAVG
• ConfusionMatrix
• Correlation
• CoxPH
• CoxPredict
• CoxSurvFit
• CrossValidation
• Distribution Matching
• EMAVG
• FMeasure
• GLM
• GLMPredict
• Hidden Markov Model Functions
• Histogram
• KNN
• LARS Functions
• Linear Regression
• LRTEST
• Percentile
• Principal Component Analysis
• PCAPlot
• RandomSample
• Sample
• Shapley Value Functions
• SMAVG
• Support Vector Machines
• VectorDistance
• VWAP
• WMAVG
Approximate Distinct Count
Summary
The Approximate Distinct Count function, which is composed of the ApproxDCountReduce and
ApproxDCountMap functions, can estimate the number of distinct values (cardinality) in a column or
combination of columns, scanning the table only once.
This function is recommended when the column or combination of columns has a large cardinality. The
function can estimate the number of distinct values much faster than the SQL SELECT DISTINCT
command can return the precise number of distinct values.
When the cardinality is small, the SQL SELECT DISTINCT command is recommended.
Background
When the column or combination of columns has a large cardinality, the function uses the Flajolet-Martin
algorithm, which approximates the number of distinct elements in a large set of numbers in a single pass,
using bitmap summaries of the hashed values of the large set of numbers.
The value (nmap/φ) * 2^(S/nmap) asymptotically converges to the number of distinct values in the set, where:
• S is the calculated sum of the bitmap function.
• nmap is the number of hash functions used, determined by the specified error tolerance.
• φ is a constant with approximate value 0.77.
When the number of distinct values in the set is small, the function counts them, rather than using the
Flajolet-Martin algorithm. To understand why, consider the case where the distinct count is 5: the value
(nmap/φ) * 2^(S/nmap) is approximately 85 when the error is 10% and approximately 10590 when the error is 1%.
For more information about probabilistic counting algorithms, see Probabilistic Counting Algorithms for
Data Base Applications, by Philippe Flajolet and G. Nigel Martin (http://portal.acm.org/citation.cfm?
id=5215).
Usage
Input
The input table requires only the columns specified by the InputColumns argument. The table can have
additional columns, but the function ignores them.
Table 265: Approximate Distinct Count Input Table Schema
Output
Table 266: Approximate Distinct Count Output Table Schema
Example
This example calculates the number of distinct values for each specified column with a 1% error rate
(accuracy).
Input
The input table has more than 3000 rows of price and advertisement information for U.S. cracker brands
Sunshine, Keebler, Nabisco, and a private label (such as a store brand). In the input column names:
• dispbrand means that the seller displayed the brand prominently.
• featbrand means that the seller featured the brand.
Table 268: Approximate Distinct Count Example Input Table crackers (Columns 9-15)
SQL-MapReduce Call
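A sketch of such a call; the map/reduce nesting follows the Summary above, but the error-rate argument name, the column list, and the reduce-side partitioning are assumptions:
SELECT * FROM ApproxDCountReduce (
  ON ApproxDCountMap (
    ON crackers
    InputColumns ('dispsunshine', 'dispkeebler')  -- example columns, assumed
    ErrorRate (1)                                 -- 1% error; argument name assumed
  )
  PARTITION BY column_name                        -- grouping key emitted by the map step (assumed)
);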
Output
Table 269: Approximate Distinct Count Example Output Table
Approximate Percentile
Summary
The Approximate Percentile function, composed of ApproxPercentileReduce and ApproxPercentileMap,
computes approximate percentiles for one or more columns of data. The nth percentile is the smallest value
in a data set that is greater than n% of the values. The larger the data set, the more accurate the approximate
percentile.
Background
The Approximate Percentile function is based on an algorithm developed by Greenwald and Khanna. The
function gives e-approximate quantile summaries of a set of N elements, where e is the error (the desired
accuracy of the approximation). Given any rank r, an e-approximate summary returns a value whose rank r'
is in the interval [r - eN, r + eN]. The algorithm has a worst-case space requirement of O((1/e) * log(eN)).
When running the Approximate Percentile function, you specify e with the Error parameter.
Usage
Input
The following table describes the required columns of the input table. The input table can have additional
columns, but the function ignores them.
Table 270: ApproxPercentileMap Input Table Schema
Output
Table 271: ApproxPercentileReduce Output Table Schema
Example
This example calculates the approximate percentiles 0, 25, 50, 75, and 100 within a 2% error rate for four
brands of crackers.
Input
The input table has more than 3000 rows of price and advertisement information for the U.S. cracker brands
Sunshine, Keebler, Nabisco, and a private label (such as a store brand). In the input column names:
• dispbrand means that the seller displayed the brand prominently.
• featbrand means that the seller featured the brand.
• pricebrand is the price of the brand.
Table 272: Approximate Percentile Example Input Table cracker (Columns 1-8)
Table 273: Approximate Percentile Example Input Table cracker (Columns 9-15)
SQL-MapReduce Call
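A minimal sketch of the call under the same nested style; the Error argument is documented in Background, while the target-column and percentile argument names, the brand price column names, and the partitioning clause are assumptions:

SELECT * FROM ApproxPercentileReduce (
    ON ApproxPercentileMap (
        ON cracker
        TargetColumns ('pricesunshine', 'pricekeebler', 'pricenabisco', 'priceprivate')  -- hypothetical column names
        Error (2)    -- e = 2%, as described in Background
    ) PARTITION BY column_name         -- assumed: one group per target column
    Percentile (0, 25, 50, 75, 100)    -- hypothetical argument name for the requested percentiles
);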
Output
Table 274: Approximate Percentile Example Output Table
CMAVG
Summary
The CMAVG (cumulative moving average) function computes the cumulative moving average of a value
from the beginning of a series.
Background
In a cumulative moving average, the data are added to the data set in an ordered data stream over time. The
objective is to compute the average of all the data at each point in time when new data arrives. For example,
an investor may want to find the average price of all of the stock transactions for a particular stock over time,
up to the current time.
The cumulative moving average computes the arithmetic average of all the rows from the beginning of the
time series, using this formula:
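cmavg(n) = (x1 + x2 + … + xn) / n

where x1, …, xn are the values of the target column in the first n rows of the series. Equivalently, the average can be updated as each new row arrives: cmavg(n) = cmavg(n-1) + (xn - cmavg(n-1)) / n. (This is the standard cumulative moving average.)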
Usage
CMAVG Syntax
Version 1.2
Arguments
Argument Category Description
TargetColumns Optional Specifies the names of the input columns for which the cumulative
moving average is to be computed. If you omit this argument, then the
function only copies all input columns to the output table.
Input
The following table describes the required columns of the input table. The input table can have additional
columns, but the function ignores them.
Output
Table 276: CMAVG Output Table Schema
Example
This example computes a cumulative moving average for the price of IBM stock. The input data is a series of
IBM common stock closing prices from 17 May 1961 to 2 November 1962.
SQL-MapReduce Call
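A minimal sketch of the call; only TargetColumns is documented above, so the partitioning column, ordering column, and price column names are assumptions:

SELECT * FROM CMAVG (
    ON ibm_stock
    PARTITION BY name         -- hypothetical series-identifier column
    ORDER BY period           -- hypothetical date column; the ordering defines the series
    TargetColumns ('stockprice')   -- hypothetical closing-price column
);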
ConfusionMatrix
Summary
The ConfusionMatrix function shows how often a classification algorithm correctly classifies items. The
function takes an input table that includes two columns—one containing the observed class of an item and
the other containing the class predicted by the algorithm—and outputs three tables:
• A confusion matrix, which shows the performance of the algorithm
• A table of overall statistics
• A table of statistics for each class
Background
In the field of artificial intelligence (AI), a confusion matrix typically shows the performance of a supervised
learning algorithm. The analogous table for an unsupervised learning algorithm is usually called a matching
matrix. Outside AI, a confusion matrix is often called a contingency table or error matrix.
Usage
ConfusionMatrix Syntax
Version 2.0
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
ObsColumn Required Specifies the name of the input column that contains the observed class.
PredictColumn Required Specifies the name of the input column that contains the predicted class.
OutputTable Required Specifies the string with which to start the output table names (which are
output_table_1, output_table_2, and output_table_3).
Classes Optional Specifies the classes to output in output_table_3.
Prevalence Optional Specifies the prevalences for the classes to output in output_table_3.
Therefore, if you specify Prevalence, then you must also specify Classes,
and for every class, you must specify a prevalence.
Output
The ConfusionMatrix function returns a success message and creates 3 output tables:
• output_table_1, a confusion matrix (also called a contingency table)
• output_table_2, which contains overall statistics
• output_table_3, which contains statistics for each class
Table 280: ConfusionMatrix Output Table 1 (Confusion Matrix) Schema
• Specificity
• Pos Pred Value
• Neg Pred Value
• Prevalence
• Detection Rate
• Detection Prevalence
• Balanced Accuracy
Table 283: ConfusionMatrix Output Table 3 (Class Statistics) Schema for More Than Two Classes
Example
Input
The input table, iris_category_expect_predict, contains 30 rows of expected and predicted values for
different species of the flower iris. The predicted values can be derived from any of the classification
functions, such as SparseSVMPredict. The raw iris dataset has four prediction attributes (sepal_length,
sepal_width, petal_length, petal_width), grouped into three species (setosa, versicolor, virginica).
Table 284: ConfusionMatrix Example Input Table iris_category_expect_predict
id expected_value predicted_value
5 setosa setosa
10 setosa setosa
15 setosa setosa
id expected_value predicted_value
20 setosa setosa
25 setosa setosa
30 setosa setosa
35 setosa setosa
40 setosa setosa
45 setosa setosa
50 setosa setosa
55 versicolor versicolor
60 versicolor versicolor
65 versicolor versicolor
70 versicolor versicolor
75 versicolor versicolor
80 versicolor versicolor
85 virginica versicolor
90 versicolor versicolor
95 versicolor versicolor
100 versicolor versicolor
105 virginica virginica
110 virginica virginica
115 virginica virginica
120 versicolor virginica
125 virginica virginica
130 versicolor virginica
135 versicolor virginica
140 virginica virginica
145 virginica virginica
150 virginica virginica
SQL-MapReduce Call
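A minimal sketch of the call, using the documented arguments and the input columns shown above; the PARTITION BY 1 clause and the output-table prefix (taken from the message below) are assumptions:

SELECT * FROM ConfusionMatrix (
    ON iris_category_expect_predict PARTITION BY 1
    ObsColumn ('expected_value')
    PredictColumn ('predicted_value')
    OutputTable ('confusionmatrix_output')
);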
Output
The function returns a success message and creates 3 output tables.
Table 285: ConfusionMatrix Example Output Message
message
Success !
The result has been outputted to tables: "confusionmatrix_output_1",
"confusionmatrix_output_2" and "confusionmatrix_output_3"
The query below returns the output shown in the following table:
Three output tables are created by the function query. The following output table provides the confusion
matrix (also known as contingency table):
Table 286: ConfusionMatrix Example Output Table confusionmatrix_output_1
The query below returns the output shown in the following table:
key value
95% CI (0.6928, 0.9624)
Accuracy 0.8667
Kappa 0.8
Mcnemar Test P-Value NA
Null Error Rate 0.4
P-Value [Acc > NIR] 0
The following table contains accuracy/error measures like sensitivity and specificity for each class.
Table 288: ConfusionMatrix Example Output Table confusionmatrix_output_3
Correlation
Summary
The Correlation function, which is composed of the Corr_Reduce and Corr_Map functions, computes
global correlations between specified pairs of table columns. Measuring correlation lets you determine if the
value of one variable is useful in predicting the value of another.
Usage
Correlation Syntax
Version 1.4
Arguments
Argument Category Description
TargetColumns Required Specifies pairs of columns for which to calculate correlations. For
each column pair, 'col_name1:col_name2', the function calculates
the correlation between col_name1 and col_name2. For each column
range, '[col_index1:col_index2]', the function calculates the
correlation between every pair of columns in the range, including
each column with itself. For example, if you specify '[1:3]', the
function calculates the correlation for the pairs (1,1), (1,2), (1,3),
(2,2), (2,3), and (3,3). The minimum value of
col_index1 is 0, and col_index1 must be less than col_index2.
KeyName Required Specifies the name for the Corr_Map output table column that
contains the correlations, and by which the Corr_Map output table
is partitioned.
GroupByColumns Optional Specifies the names of the input columns that define the group for
correlation calculation. By default, all input columns belong to a
single group, for which the function calculates correlation.
Input
The Corr_Map input table must have at least two columns (one column pair). The table can have additional
columns, but the function ignores them. The Corr_Reduce input table is the Corr_Map output table.
Table 289: Correlation (Corr_Map) Input Table Schema
Output
The Corr_Map output table is input to the Corr_Reduce function, whose output table is described in the
following table.
Examples
Input
The input table, corr_input, is sample macroeconomic data for the states of California and Texas over a
period of 16 years (1947-1962). The GDP (gross domestic product) numbers are in millions of dollars ($M).
GDPdeflator is GDP data normalized to the year 1954 (that is, GDPdeflator is 100 for 1954). The other columns
represent the number of people (in thousands) who were employed, unemployed, or in the armed forces.
Table 291: Correlation Example Input Table corr_input
SQL-MapRequest Call
The function calculates the correlation between each pair of columns in the TargetColumns argument. This
example compares GDP to GDPdeflator, the employed population to GDP, the number of people
unemployed, and the number of people in the armed forces.
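A minimal sketch of the nested call, using the documented TargetColumns and KeyName arguments; the exact pair list and the partitioning clause are assumptions consistent with the surrounding description:

SELECT * FROM Corr_Reduce (
    ON Corr_Map (
        ON corr_input
        TargetColumns ('gdpdeflator:gdp', 'employed:gdp',
                       'employed:unemployed', 'employed:armedforces')
        KeyName ('corr')   -- name of the key column in the Corr_Map output
    ) PARTITION BY corr    -- partition the map output by the key column
);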
Output
Because GDP and GDPdeflator represent the same data but with different scaling, their correlation is 1. The
correlation coefficients for all column pairs are shown below.
Table 292: Correlation Example 1 Output Table
SQL-MapRequest Call
Output
corr value
gdp:gdp 1
gdpdeflator:gdp 1
gdpdeflator:gdpdeflator 1
employed:gdp 0.983552
employed:unemployed 0.502498
employed:armedforces 0.457307
CoxPH
Summary
The CoxPH function is named for the Cox proportional hazards model, a statistical survival model. The
function estimates coefficients by learning a set of explanatory variables. The output of the CoxPH function
is input to the function CoxPredict and CoxSurvFit.
Note:
The CoxPH and CoxPredict functions do not support interaction terms (for example, using AGE*AGE as
an item in the Cox proportional hazards model).
Usage
CoxPH Syntax
Version 1.2
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input parameters.
FeatureColumns Required Specifies the names of the input table columns that contain the
features of the input parameters.
CategoricalColumns Optional Specifies the names of the input table columns that contain
categorical predictors. Each categorical_column must also be a
feature_column. By default, the function detects the categorical
columns by their SQL data types.
TimeIntervalColumn Required Specifies the name of the column in input_table that contains the
time intervals of the input parameters; that is, end_time - start_time,
in any unit of time (for example, years, months, or days).
EventColumn Required Specifies the name of the column in input_table that contains 1 if
the event occurred by end_time and 0 if it did not. (0 represents
survival or right-censorship.) The function ignores values other
than 1 and 0.
CoefficientTable Required Specifies the name of the table where the function outputs the
estimated coefficients of the input parameters.
LinearPredictorTable Required Specifies the name of the table where the function outputs the
product βX.
Threshold Optional Specifies the convergence threshold. The default value is
0.000000001.
MaxIterNum Optional Specifies the maximum number of iterations that the function runs
before finishing, if the convergence threshold has not been met. The
default value is 10.
Accumulate Optional Specifies the names of the columns in input_table that the function
copies to linear_predictor_table.
Output
The CoxPH function outputs information to the summary table, coefficient table, and linear predictor table.
The following table describes the information that the function outputs to the summary table.
Table 295: CoxPH Summary Table Schema
Following the summary table, the function displays the values of the following:
• Iteration#
• Convergence
• Likelihood ratio test
• Wald test
• Score test
• Degree of freedom
The following table describes the schema of the coefficient table.
Table 297: CoxPH Coefficient Table Schema
The following table describes the schema of the linear predictor table.
Table 298: CoxPH Linear Predictor Table Schema
Example
Input
The input table, lungcancer, is data from a randomized trial of two treatment regimens for lung cancer used
to model survival analysis. The variables are defined below. There are three categorical predictors: treatment
(trt), type of cancer (celltype) and whether the patient has been treated previously (prior), and three
numerical predictors: the patient's self-rating on the Karnofsky scale (karno), time between diagnosis and
study start (diagtime), and the patient's age (age). The censoring status (the survival event) is in the
column 'status' and the survival time is in the column 'time_int'.
• trt: Treatment plan has two values ‘standard’ or ‘test’
• celltype: Type of cancerous cell. Has four values: ‘squamous’, ‘smallcell’, ‘adeno’ and ‘large’
• time: survival time
• status: censoring status. ‘0’ means survival/right censorship. ‘1’ indicates otherwise.
• karno: Karnofsky performance score (100=good)
• diagtime: months from diagnosis to randomization
• age: Age in years
• prior: Whether the patient has undergone prior therapy (‘yes’ or ‘no’)
Table 299: CoxPH Example Input Table lungcancer
SQL-MapRequest Call
The three categorical variables are specified in the CategoricalColumns argument. The function creates two
models, a coefficient table and a linear predictor table, which are output with the names specified in the
corresponding arguments.
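A minimal sketch of the call, using the documented arguments; the dummy ON clause is an assumed invocation style (the input is named through the InputTable argument):

SELECT * FROM CoxPH (
    ON (SELECT 1) PARTITION BY 1    -- assumed dummy input
    InputTable ('lungcancer')
    FeatureColumns ('trt', 'celltype', 'karno', 'diagtime', 'age', 'prior')
    CategoricalColumns ('trt', 'celltype', 'prior')
    TimeIntervalColumn ('time_int')
    EventColumn ('status')
    CoefficientTable ('lungcancer_coef')
    LinearPredictorTable ('lungcancer_lp')
);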
Output
The coefficients are estimated with 95% confidence intervals. The table below shows that the coefficients of
'karno' and of the 'squamous' and 'large' cell types are significant.
Table 300: CoxPH Example Output Table (Columns 1-5)
The coefficients are output in the table 'lungcancer_coef', which is later used for prediction. Because celltype,
trt, and prior are categorical variables, one of their categories serves as the reference for the other categories;
thus the reference categories 'trt' = standard, 'celltype' = adeno, and 'prior' = no do not appear with coefficient
values.
The query below returns the output shown in the following two tables:
CoxPredict
Summary
The CoxPredict function takes as input the coefficient table generated by the function CoxPH and outputs
the hazard ratios between predictive features and either their corresponding reference features or their unit
differences.
This function can be used with real-time applications. Refer to AMLGenerator.
Note:
The CoxPH and CoxPredict functions do not support interaction terms (for example, using AGE*AGE as
an item in the Cox proportional hazard model). The CoxPredict function supports only relative hazard
ratio calculation. It does not calculate or output confidence intervals.
Background
In survival analysis, the hazard ratio (HR) is the ratio of the hazard rates corresponding to the conditions
described by two levels of an explanatory variable. For example, in a drug study, if the treated population
might die at twice the rate as the control population, then the hazard ratio is 2, indicating a higher hazard of
death from the treatment.
The definition of the Cox proportional hazard model is:
h(t) = h0(t)exp(β1X1 + … + βnXn)
The definition of HR is:
HR = h1(t) / h2(t)
= h0(t)exp(β1X1 + … + βnXn) / h0(t)exp(β1X'1 + … + βnX'n)
= exp(β1(X1 - X'1) + … + βn(Xn - X'n))
The natural logarithm of HR is:
ln(HR) = β1(X1 - X'1) + … + βn(Xn - X'n)
For two groups that differ only in treatment condition, the ratio of the hazard functions is given by eβ, where
β is the estimated treatment effect derived from the regression model. This hazard ratio (the ratio of the
predicted hazard for a member of one group to the predicted hazard for a member of the other group) is
given by holding everything else constant (that is, assuming proportionality of the hazard functions).
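For example, if two patients differ only in a binary treatment indicator whose estimated coefficient is β = 0.693 (≈ ln 2), then HR = e^0.693 ≈ 2: the treated patient is predicted to experience the event at twice the rate of the untreated patient, all else held constant.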
Usage
CoxPredict Syntax
Version 1.1
Arguments
Argument Category Description
Predict_Feature_Names Required Specifies the names of the features in the Cox
coefficient model (the coefficient table
generated by the CoxPH function).
Predict_Feature_Columns Required if you omit Specifies the names of the columns that
Predict_Feature_Units_Columns contain the values of the features in the Cox
coefficient model—one column name for each
feature name.
Note:
The function ignores this argument if you
specify Predict_Feature_Units_Columns.
Input
The CoxPredict function has two required and one optional input tables:
• Required: Cox coefficient model table cox_coef_model_table, output by the CoxPH function, whose
schema is described by the table, CoxPH Coefficient Table Schema, described in the Output section of the
function: CoxPH.
• Required: Predict feature table predict_feature_table, whose schema follows
• Optional: Reference feature table ref_feature_table, whose schema follows
The predictive feature table and reference feature table can have additional columns, but the function
ignores them.
Table 305: CoxPredict Predict Feature Table Schema
Output
The CoxPredict function has one output table, whose schema depends on whether you specify the
Predict_Feature_Columns argument or the Predict_Feature_Units_Columns argument.
Table 307: CoxPredict Predict Output Table Schema (Predict_Feature_Columns Specified)
Examples
These examples use different arguments and options for the CoxPredict function.
Input
All examples use the inputs below. Input table lc_new_predictors is a list of four patients who have been
diagnosed with lung cancer. The examples use the model table lungcancer_coef.
Table 309: CoxPredict Example Input Table: lc_new_predictors
The preceding table includes all of the attributes that were used in the input to the CoxPH function.
The following table, used in Examples 3 and 4, contains alternate sets of reference values for each attribute.
Table 310: CoxPredict Example Input Table: lc_new_reference
Output
SQL-MapRequest Call
Output
SQL-MapRequest Call
SELECT * FROM CoxPredict (
ON lungcancer_coef AS cox_coef_model DIMENSION
ON lc_new_predictors AS predicts PARTITION BY 1
ON lc_new_reference AS refs PARTITION BY 1
Predict_Feature_Names ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Predict_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime',
'age', 'prior')
Ref_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Accumulate ('id', 'name')
) ORDER BY 1, 2, 3, 4, 5, 6, 7, 8;
Output
This example uses the default PARTITION BY 1, so each patient is compared with each row in the reference
table. There are 8 reference rows, and thus a total of 32 comparison rows.
Output
There are only 4 comparison rows, because the input is partitioned. The attributes of the patient Steffi are
similar to the attributes of reference id 4, so her hazard ratio is very close to 1.0 (0.97).
Table 317: CoxPredict Example 4 Output Table (Columns 1-8)
SQL-MapReduce Call
Output
Numerical attributes are scaled by the unit values for comparison.
Table 319: CoxPredict Example 5 Output Table
Hypothesis-Test Mode
Summary
In hypothesis-test mode, the function tests the hypothesis that the sample data comes from the specified
reference distribution. In this mode, the function simultaneously performs whichever tests you specify and
reports a p-value for each test. The null hypothesis is that the data are consistent with the specified
distribution. Therefore, a low p-value suggests that the distribution is not a very good fit for the data.
Usage
Recommended syntax depends on whether the reference distribution is continuous or discrete and on the
sample data set. For both continuous and discrete distributions, there are two syntax options. Option 1
usually works better for large data sets that might be stored across multiple nodes, and option 2 usually
works better for small data sets that are stored on a single node. However, performance ultimately depends
on the data itself.
Note:
To run the CvM test on discrete distributions, you must use option 2; otherwise the results might be
incorrect.
Arguments
Argument Category Description
ValueColumn Required Specifies the name of the input table column that contains
the values of the sample data set.
Tests Optional Specifies one to four tests to perform. A test can be:
• 'KS' (Kolmogorov-Smirnov test)
• 'CvM' (Cramér-von Mises criterion)
• 'AD' (Anderson-Darling test)
• 'CHISQ' (Pearson's Chi-squared test)
By default, the function runs all of the preceding tests.
Distributions Required Specifies the reference distributions and their parameters.
All distributions must be continuous or all must be discrete.
The possible distribution and parameters values for
continuous distributions are in the first of the two following
tables. The possible distribution and parameters values for
discrete distributions are in the second of the two following tables.
For discrete distributions:
• BINOMIAL, GEOMETRIC, NEGATIVEBINOMIAL,
and POISSON distributions are on N={0,1,2,...}.
• UNIFORMDISCRETE distribution is on events, which
are represented by integers.
GroupByColumns Optional Specifies the names of the input table columns that contain
the group identifications over which to run the test. The
function can run multiple tests for different partitions of the
data in parallel. If you omit this argument, then specify
PARTITION BY 1 and omit the GROUP BY clause in the
second ON clause.
MinGroupSize Optional Specifies the minimum group size. The function ignores
groups smaller than the minimum size when calculating
statistics. The default value is 50.
NumCell Optional Specifies the number of cells that you want to make discrete
in a continuous distribution. The cell_size must be greater
than 3 if distribution is NORMAL; otherwise, it must be
greater than 1. The quotient min_group_size/cell_size cannot
be less than 5. The default value is 10.
Input
The input table consists of an arbitrary number of grouping columns and a single value column that
contains the dataset to be matched to the specified distribution(s). The syntax shown includes clauses that
create two tables from the input table. One table ranks the data and the other table counts the number of
points in each group.
For continuous distributions, if your input table already includes a rank column, replace the clause ON
(SELECT RANK()... with the clause ON SELECT * FROM input_table.
Table 322: Distribution Matching Input Table Schema
Output
The output table contains the columns described in the following table for each group defined by the
PARTITION BY clause.
Table 323: Distribution Matching Output Table Schema
Examples
Before running the examples in this section, switch the output mode in Act to expanded output by entering
\x at the Act command prompt. With expanded output mode turned on, each record is split into rows,
with one row for each value, and each new record is introduced with a text label in a form like:
---[ RECORD 37 ]---. This mode helps make wide tables readable on a small screen.
Input
Here is a snapshot of the input data:
Table 324: distnmatch (Hypothesis Test Mode) Example 1 Input Table raw_normal_50_2
price
48.0701
52.6426
48.6372
50.9832
50.523
52.1773
50.3103
48.4424
50.1352
50.1382
...
SQL-MapReduce Call
The function call uses the sample mean (49.97225) and standard deviation (2.009698). See the Arguments
table for more information.
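A sketch of the call under the two-ON-clause syntax described in Input; the alias names, the DIMENSION keyword, and the exact 'type:parameters' distribution string (modeled on the form shown in the best-match output later in this chapter) are assumptions:

SELECT * FROM DistnmatchReduce (
    ON (SELECT price, RANK() OVER (ORDER BY price) AS rank
        FROM raw_normal_50_2) AS ranked PARTITION BY 1       -- ranked sample data
    ON (SELECT COUNT(*) AS group_size
        FROM raw_normal_50_2) AS counts DIMENSION            -- group-size counts
    ValueColumn ('price')
    Tests ('KS', 'CvM', 'AD', 'CHISQ')
    Distributions ('NORMAL:49.97225,2.009698')   -- sample mean and standard deviation
    MinGroupSize (50)
);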
Output
The reported p-value for each of the four tests is around 0.4, which does not rule out the null hypothesis that
the data are consistent with a normal distribution with the specified mean and standard deviation.
Input
The input represents hypothetical mean-time-to-failure data for four products manufactured in two
different factories. Only a subset of rows is shown.
Table 326: distnmatch (Hypothesis Test Mode) Example 2 Input Table: factory7
Output
The reported p-values support these conclusions:
• For product A from factory F1, all 4 tests fail to reject the null hypothesis that the data fit a normal
distribution with the specified parameters. All 4 tests reject the null hypothesis that the data fit the
specified gamma, Weibull, or uniform distributions.
• For product C from factory F1, all 4 tests fail to reject the null hypothesis that the data fit a Weibull
distribution with the specified parameters. All 4 tests reject the null hypothesis that the data fit the
specified gamma or uniform distributions.
• For product B from factory F2, all 4 tests reject the null hypothesis for each of the specified distributions.
• For product D from factory F2, all 4 tests fail to reject the null hypothesis that the data fit a uniform
distribution with the specified parameters.
Table 327: distnmatch (Hypothesis Test Mode) Example 2 Output Table
CoxSurvFit
Summary
The CoxSurvFit function takes as input the coefficient and linear prediction tables generated by the function
CoxPH and outputs a table of survival probabilities.
Note:
The CoxSurvFit function supports only the Nelson-Aalen-Breslow estimator with Efron ties modification
for baseline survival function estimation. It does not calculate or output confidence intervals, variance, or
standard error estimates.
Background
The definition of the Cox proportional hazard model is:
h(t) = h0(t)exp(βX)
Given an estimated time t and all values of conditional variables (x1, x2, ..., xn), the survival function is:
S(t) = S0(t)^exp(βx)
S0(t), the baseline survival function, is composed of the survival probabilities at times ti. Three estimators
often used to estimate these survival probabilities are:
• Breslow estimator
• Nelson-Aalen-Breslow estimator
• Kalbfleisch and Prentice estimator
The first two estimators can be used with Efron ties modification.
The CoxSurvFit function uses the Nelson-Aalen-Breslow estimator with Efron ties modification for baseline
function estimation.
The Nelson-Aalen estimator of the integrated hazard is:
H(t) = Σ(ti ≤ t) di / ni
where di is the number of events at time ti and ni is the number of subjects at risk just before ti.
In 1984, Cox and Oakes described a simpler estimator that extends the Nelson-Aalen estimate of the
cumulative hazard to the case of covariates:
H0(t) = Σ(ti ≤ t) di / Σ(j ∈ Ri) exp(βxj)
where the inner sum is over the risk set Ri. The cumulative hazard and survival functions are then estimated as:
H(t | x) = H0(t) exp(βx)
and:
S(t | x) = exp(-H(t | x))
Usage
CoxSurvFit Syntax
Version 1.1
Arguments
Argument Category Description
Cox_Linear_Predictor_Model_Table Required Specifies the name of the Cox linear predictor model
table, which was output by the CoxPH function.
Cox_Coef_Model_Table Required Specifies the name of the Cox coefficient model table,
which was output by the CoxPH function.
Predict_Table Required Specifies the name of the predict table, which contains
new prediction feature values for survival calculation.
Predict_Feature_Names Required Specifies the names of features in the Cox model.
Predict_Feature_Columns Required Specifies the names of the columns that contain the
values for the features in the Cox model—one column
name for each feature name. The ith feature name
corresponds to the ith column name. For example,
consider this pair of
arguments:Predict_Feature_Names('name',
'age')Predict_Feature_Columns('c1',
'c2')
The predictive values of the feature 'name' are in
column 'c1', and the predictive values of the feature 'age'
are in column 'c2'.
Output_Table Required Specifies the name of the output table that contains
survival probabilities. The table must not exist.
Accumulate Optional Specifies the names of the columns in predict_table that
the function copies to the output table.
Input
The CoxSurvFit function has three required input tables:
• The first two tables are output by the CoxPH function and are described in the Output section of the
function: CoxPH:
∘ CoxPH Linear Predictor Table Schema
∘ CoxPH Coefficient Table Schema
• Predict table, whose schema is described by the following table:
id x1 x2 x3 x4
1 a b c d
For the row in the preceding table, the function computes this survival probability:
S(t) = S0(t)^exp(βx1*a + βx2*b + βx3*c + βx4*d)
Output
The CoxSurvFit function outputs a message table (usually to the screen) and a table of survival probabilities
(output_table).
Table 330: CoxSurvFit Message Table Schema
Input
The input table lc_new_predictors is used with the linear model predictor table lungcancer_lp and the
coefficient table lungcancer_coef that are generated from the CoxPH function to determine the survival
probabilities of the new patients.
SQL-MapReduce Call
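A minimal sketch of the call using the documented arguments; the dummy ON clause is an assumed invocation style and the Output_Table name is hypothetical:

SELECT * FROM CoxSurvFit (
    ON (SELECT 1) PARTITION BY 1    -- assumed dummy input
    Cox_Linear_Predictor_Model_Table ('lungcancer_lp')
    Cox_Coef_Model_Table ('lungcancer_coef')
    Predict_Table ('lc_new_predictors')
    Predict_Feature_Names ('trt', 'celltype', 'karno', 'diagtime', 'age', 'prior')
    Predict_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime', 'age', 'prior')
    Output_Table ('lungcancer_survival')   -- hypothetical output table name
);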
Output
The query below returns the output shown in the following table:
CrossValidation
Summary
Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how
the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings
where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in
practice. In a prediction problem, a model is usually given a dataset of known data on which training is run
(training dataset), and a dataset of previously unseen data against which the model is tested (testing dataset).
The goal of cross validation is to define a dataset to “test” the model in the training phase (the validation
dataset) to provide insight into how the model will generalize to an independent dataset. Cross-validation
can be useful to identify and avoid overfitting problems.
Cross-validation works as follows: the data are randomly partitioned into k equal-sized subsamples. One
group is kept aside as a validation set, and the model is trained on the rest of the data. The trained model is
then applied to the validation set and its prediction error is recorded. This process is repeated k times, with
each subsample serving as the validation set exactly once, and the k error values are averaged to produce the
cross-validation error.
Usage
CrossValidation Syntax
Version 1.0
Arguments
Argument Category Description
Function Required The name of the function to be cross-validated. Only GLM
(‘glm’) is supported.
[Arguments used by the training Required Required and optional arguments used by the function to
function] and be cross-validated. The argument names and descriptions
Optional are the same as those used when the function is run
normally.
CVParams Required The list of the arguments to use in cross validation.
FoldNum Optional The value of k in k-fold cross validation. Default is 10.
CVTable Optional The name of the output table that contains the cross-
validation errors for all models. Default is ‘cvtable’.
Metric Optional Error function used to calculate the cross-validation error.
Possible values are ‘AUROC’ (area under the ROC curve)
and ‘MSE’ (mean squared error). Default is 'AUROC'.
Input
The input table is the same as the input table used by the function to be cross-validated.
Output
When the function completes, it displays a message. The output cross-validation table is created with the
name specified in the argument “CVTable”. The output table contains the training variable values specified
in the SQL-MapReduce call and the cross-validation error for each model analyzed.
Table 333: Cross-Validation Output Table schema
Example
This example performs the cross validation comparison between the equi-weighted logit and probit link
models of the logistic regression GLM function.
Input
The input table, admissions_train, has one numerical predictor (gpa) and three categorical predictors
(masters, stats, programming) that together determine the binary outcome of whether a student is admitted
(1) or not (0).
Table 334: Cross-Validation Example Input Table admissions_train
SQL-MapReduce Call
Choose the same weight and number of iterations (the MaxIterNum argument) for the logit and probit
models, so that the cverror result reflects the true strength of the models.
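A sketch of the call, passing the GLM training arguments through and cross-validating the link function; the dummy ON clause, the OutputTable name, and the exact CVParams/Link usage are assumptions (the CVTable name matches the message below):

SELECT * FROM CrossValidation (
    ON (SELECT 1) PARTITION BY 1    -- assumed dummy input
    Function ('glm')
    InputTable ('admissions_train')
    OutputTable ('glm_cv_model')    -- hypothetical
    InputColumns ('admitted', 'masters', 'gpa', 'stats', 'programming')
    Family ('LOGISTIC')
    Link ('logit', 'probit')        -- assumed form: the two link models to compare
    MaxIterNum (10)
    CVParams ('link')               -- assumed: cross-validate over the Link values
    FoldNum (10)
    CVTable ('glmcvtable')
    Metric ('MSE')
);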
message
Finished. Results can be found in "glmcvtable"
The query below returns the output shown in the following table:
The reported result for cverror is the same for both models because the models are equi-weighted.
Distribution Matching
Summary
Given sample data and reference distributions, the function tests the hypothesis that the sample data comes
from the distributions (Hypothesis-Test Mode). Given the test results, the function finds the distribution
that best matches the sample data (Best-Match Mode).
The Distribution Matching function is composed of the functions DistnmatchReduce and
DistnmatchMultipleInput. DistnmatchReduce supports these distributions:
• For continuous variables:
∘ Beta
∘ Cauchy
∘ ChiSq
∘ Exponential
∘ F
∘ Gamma
∘ Lognormal
∘ Normal
∘ T
∘ Triangular
∘ Uniform
∘ Weibull
• For discrete variables:
∘ Binomial
∘ Geometric
Best-Match Mode
Summary
In best-match mode, the function uses the result of hypothesis-test mode to find the distribution that best
matches the sample data. For each specified test, the function reports the best match, identifying the
distribution type and parameters.
Usage
Arguments
Argument Category Description
ValueColumn Required Specifies the name of the input table column that contains
the values of the sample data set.
Tests Optional Specifies one to four tests to perform. A test can be:
• 'KS' (Kolmogorov-Smirnov test)
• 'AD' (Anderson-Darling test)
• 'CHISQ' (Pearson's Chi-squared test)
By default, the function runs all of the preceding tests.
Distributions Optional Specifies the reference distributions (which must be
continuous) and their parameters. The possible distribution
and parameters values for continuous distributions are in
the table, Continuous Distributions and Parameters, of the
Arguments section of the function: Hypothesis-Test Mode.
By default, the function uses these distributions:
• Beta
• Cauchy
• CHISQ
• Exponential
• F
• Gamma
• Lognormal
• Normal
• T
• Triangular
• Uniformcontinuous
• Weibull
GroupByColumns Optional Specifies the names of the input table columns that contain
the group identifications over which to run the test. The
function can run multiple tests for different partitions of the
data in parallel. If you omit this argument, then specify
PARTITION BY 1 and omit the GROUP BY clause in the
second ON clause.
MinGroupSize Optional Specifies the minimum group size. The function ignores
groups smaller than the minimum size when calculating
statistics. The default value is 50.
NumCell Optional Specifies the number of cells that you want to make discrete
in a continuous distribution. The cell_size must be greater
than 3 if distribution is NORMAL; otherwise, it must be
greater than 1. The quotient min_group_size/cell_size cannot
be less than 5. The default value is 10.
Input
The input table consists of an arbitrary number of grouping columns and a single value column that
contains the dataset to be matched to the specified distribution(s). The syntax shown includes clauses that
create two tables from the input table. One table ranks the data and the other table counts the number of
points in each group.
For continuous distributions, if your input table already includes a rank column, replace the clause ON
(SELECT RANK()... with the clause ON SELECT * FROM input_table.
Table 337: Distribution Matching Input Table Schema
Output
Table 338: Distribution Matching Output Table Schema
Examples
Before running the examples in this section, switch the output mode in Act to expanded output by entering
\x at the Act command prompt. With expanded output mode turned on, each record is split into rows,
with one row for each value, and each new record is introduced with a text label in a form like:
---[ RECORD 37 ]---. This mode helps make wide tables readable on a small screen.
Input
The input table, distnmatch (Hypothesis Test Mode) Example 2 Input Table: factory7, is the same as in
Example 2: Normality Tests with 'groupingColumns' of the function: Hypothesis-Test Mode.
SQL-MapReduce Call
Input
The input is hypothetical and represents the ages of children visiting three amusement parks during a one-
week period in spring and another one-week period in summer. Only a subset of rows is shown.
Table 340: distnmatch (Best Match Mode) Example 2 Input Table age_distribution
SQL-MapReduce Call
Output
The function has attempted to identify the best matching distribution for each partition of the data, based on
each test specified in the SQL-MapReduce call. For each partition, the output shows the distribution and
parameters identified by each test with the associated p-value.
Table 341: distnmatch (Best Match Mode) Example 2 Output Table (Columns 1-4)
Table 342: distnmatch (Best Match Mode) Example 2 Output Table (Columns 5-7)
Table 343: distnmatch (Best Match Mode) Example 2 Output Table (Columns 8-9)
best_match_CHISQ_top1 p-value_CHISQ_top1
BINOMIAL:100,0.5 0
best_match_CHISQ_top1 p-value_CHISQ_top1
BINOMIAL:100,0.5 0
UNIFORMDISCRETE:1,12 8.9484e-13
BINOMIAL:13,0.507505492077339 0
BINOMIAL:16,0.5091542759742608 0
BINOMIAL:110,0.06332936686652757 0
EMAVG
Summary
The EMAVG (exponential moving average) function computes the average over a number of points in a
time series, exponentially decreasing the weights of older values.
Background
Exponential moving average (EMA), or exponentially weighted moving average (EWMA), applies a damping
factor, alpha, that exponentially decreases the weights of older values. This technique gives much more
weight to recent observations, while retaining older observations.
The EMAVG function computes the arithmetic average of the first n rows and then, for each subsequent
row, computes the new value with this formula:
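new_emavg = alpha * new_value + (1 - alpha) * old_emavg

(This is the standard exponentially weighted update; the names old_emavg and alpha follow the description below.)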
The initial value of old_emavg is the arithmetic average of the first n rows. The values n and alpha are
specified by the function arguments Start_Rows and Alpha, respectively.
Usage
EMAVG Syntax
Version 1.2
Arguments
Argument Category Description
TargetColumns Optional Specifies the input column names for which the exponential moving average is
to be computed. If you omit this argument, then the function copies every
input column to the output table but does not compute any exponential
moving averages.
Alpha Optional Specifies the damping factor, a value in the range [0, 1], which represents a
percentage in the range [0, 100]. For example, if alpha is 0.2, then the damping
factor is 20%. A higher alpha discounts older observations faster. The default
value is 0.1.
StartRows Optional Specifies the number of rows at the beginning of the time series that the
function skips before it begins the calculation of the exponential moving
average. The function uses the arithmetic average of these rows as the initial
value of the exponential moving average. The value n must be an integer. The
default value of n is 2.
IncludeFirst Optional Specifies whether to include the starting rows in the output table. The default
value is 'false'. If you specify 'true', the output columns for the starting rows
contain NULL, because their exponential moving average is undefined.
Input
The input table must have the columns described in the following table. The table can have additional columns,
but the function ignores them.
Table 344: EMAVG Input Table Schema
Example
This example computes an exponential moving average for the price of IBM stock. The input data is a series
of IBM common stock closing prices from 17 May 1961 to 2 November 1962.
Input
Table 346: EMAVG Example Input Table ibm_stock
SQL-MapReduce Call
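A minimal sketch of the call, using the documented arguments; the partitioning, ordering, and price column names are assumptions, as in the CMAVG example:

SELECT * FROM EMAVG (
    ON ibm_stock
    PARTITION BY name        -- hypothetical series-identifier column
    ORDER BY period          -- hypothetical date column
    TargetColumns ('stockprice')   -- hypothetical closing-price column
    Alpha (0.1)              -- documented default damping factor
    StartRows (10)           -- seed the average with the first 10 rows (illustrative value)
);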
Output
Table 347: EMAVG Example Output Table
FMeasure
Summary
The FMeasure function calculates the accuracy of a test (usually the output of a classifier).
Background
In statistics, the F1 score (or F-score or F-measure) is a measure of a test’s accuracy that is based on both
precision and recall, which are defined as follows:
• Precision, p, is the number of correct results divided by the number of returned results.
• Recall, r, is the number of correct results divided by the number of expected results.
The F1 score can be interpreted as a weighted average of precision and recall, whose best value is 1 and worst
value is 0.
The traditional F1 score is the harmonic mean of precision and recall:
F1 = 2*p*r / (p+r)
The general formula for a positive real β is:
Fβ = (1+β*β)*p*r / (β*β*p+r)
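For example, a classifier with precision p = 0.6 and recall r = 0.9 has F1 = 2*0.6*0.9 / (0.6+0.9) = 0.72, while β = 0.5 (which weights precision more heavily) gives F0.5 = 1.25*0.6*0.9 / (0.25*0.6 + 0.9) ≈ 0.643.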
Usage
FMeasure Syntax
Version 1.4
Arguments
Argument Category Description
ObsColumn Required Specifies the name of the input table column that contains
the observed class.
PredictColumn Required Specifies the name of the input table column that contains
the predicted class.
Classes Optional Specifies the class or classes to output in the result. The
default is all classes.
Beta Optional Specifies the value of β in the general formula in
Background. The beta_value must be a positive DOUBLE
PRECISION value. The default value is 1.0.
Input
The FMeasure function has one input table, view, or query that contains the test data. The following table
describes the input columns that function arguments must specify. The function ignores any additional
columns.
Table 348: FMeasure Input Table Schema
Note:
The function is intended for general, multiclass input data. To submit a binary classification problem to
the function in the expected format, input a query that includes WHERE clauses.
Output
Table 349: FMeasure Output Table Schema
Examples
• Input
• Example 1: Output All Classes
• Example 2: Output Specified Classes
Input
The input table has five attributes of personal computers—price, speed, hard disk size, RAM, and screen size.
The table has 500 rows, categorized into five price groups—SPECIAL, SUPER, HYPER, MEGA and UBER.
The predicted_compcategory values can be generated by a classification function, such as KNN.
Table 350: FMeasure Examples Input Table computers_category
SQL-MapReduce Call
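A minimal sketch of the Example 1 call, using the documented arguments; the PARTITION BY 1 clause and the observed-class column name are assumptions. Example 2 would add a Classes argument, for example Classes ('SPECIAL', 'SUPER'), to restrict the output to the specified classes:

SELECT * FROM FMeasure (
    ON computers_category PARTITION BY 1
    ObsColumn ('compcategory')      -- hypothetical observed-class column name
    PredictColumn ('predicted_compcategory')
    Beta (1.0)
);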
Output
SQL-MapReduce Call
Output
GLM
Summary
The generalized linear model (GLM) is an extension of the linear regression model that enables the linear
equation to be related to the dependent variables by a link function. GLM performs linear regression analysis
for any of a number of distribution functions using a user-specified distribution family and link function.
GLM selects the link function based upon the distribution family and the assumed nonlinear distribution of
expected outcomes. The table in Background describes the supported link function combinations.
A GLM has three parts:
1. A random component—the probability distribution of Y from the exponential family
2. A fixed linear component—the linear expression of the predictor values (X1,X2,...,Xp), expressed as η or
Xβ
3. A link function that describes the relationship of the distribution function to the expected value of Y
(described in the table in Background)
GLM also supports categorical variables. For example, in the following table, size and color are independent
(predictive) variables and outcome is the dependent (response) variable. Size is a quantitative variable and
color is a qualitative variable (with the values yellow, blue, and red). In regression analysis, a qualitative
variable is called a categorical (or dummy) variable.
Table 353: Categorical Variables
Note:
The Aster Analytics GLM function implementation uses the Fisher Scoring Algorithm, which is highly
scalable compared to the least-squares algorithm used in the glm() function in the R package stats. The
results of the two algorithms usually match closely. However, when the input data is highly skewed or has
a large variance, the Fisher Scoring Algorithm might diverge, and you might need to use knowledge of
the dataset and trial and error to select the optimal family and link functions.
Background
Table 354: Supported Family/Link Function Combinations
1/μ2 INVERSE_MU_SQUARED D
sqrt SQUARE_ROOT *
cauchit CAUCHIT *
Usage
GLM Syntax
Version 1.7
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the columns
described in the table in Input.
OutputTable Required Specifies the name for the output table of coefficients. This table
must not exist. For GLM, the output is written to the screen, and
the output table is the table where the coefficients are stored.
InputColumns Optional Specifies the name of the column that contains the dependent
variables (Y) followed by the names of the columns that contain
the predictor variables (Xi), in this format: 'Y,X1,X2,...,Xp'.
By default, the first column of the input table is Y and the
remaining input table columns are Xi, except for the column
specified by the Weight argument.
CategoricalColumns Optional Specifies columnname-value pairs, each of which contains the
name of a categorical input column and the category values in that
column that the function is to include in the model that it
generates.
Each columnname-value pair has one of these forms:
• 'columnname:max_cardinality'
Limits the categories in the column to the max_cardinality
most common ones and groups the others together as 'others'.
For example, 'column_a:3' specifies that for column_a, the
function uses the 3 most common categories and sets the
category of the rows that do not belong to those 3 categories to
'others'.
• 'columnname:(category [, ...])'
Limits the categories in the column to those that you specify
and groups the others together as 'others'. For example,
'column_a : (red, yellow, blue)' specifies that for column_a, the
function uses the categories red, yellow, and blue, and sets the
category of the rows that do not belong to those categories to
'others'.
• 'columnname'
All category values appear in the model.
Link Optional Specifies the link function. The default value is 'CANONICAL'.
The canonical link functions (default link functions) and the link
functions that are allowed for each exponential family are listed in
the table in Background.
Weight Optional Specifies the name of an input table column that contains the
weights to assign to responses. The default value is 1.
You can use non-NULL weights to indicate that different
observations have different dispersions (with the weights being
inversely proportional to the dispersions). Equivalently, when the
weights are positive integers wi, each response yi is the mean of wi
unit-weight observations. A binomial GLM uses prior weights to
give the number of trials when the response is the proportion of
successes. A Poisson GLM rarely uses weights.
If the weight is less than the response value, then the function
throws an exception. Therefore, if the response value is greater
than 1 (the default weight), then you must specify a weight that is
greater than or equal to the response value.
Threshold Optional Specifies the convergence threshold. The default value is 0.01.
MaxIterNum Optional Specifies the maximum number of iterations that the algorithm
runs before quitting if the convergence threshold has not been
met. The parameter max_iterations must be a positive INTEGER
value. The default value is 25.
Intercept Optional Specifies whether the function uses an intercept. For example, in
β0+β1*X1+β2*X2+...+βp*Xp, the intercept is β0. The default
value is 'true'.
Step Optional Specifies whether the function uses a step. The default value is
false. If the function uses a step, then it runs with the GLM model
that has the lowest Akaike information criterion (AIC) score,
drops one predictor from the current predictor group, and repeats
this process until no predictor remains.
Onscreen Output
The onscreen output of the GLM function is a regression analysis of the data, using the family and link
functions specified.
Columns
When a particular column is not used for its corresponding row, the column contains a value of zero (0).
Table 357: GLM Onscreen Output Columns
Column Description
predictor Contains the column name for each predictor that was input to the function and the
labels of the rows whose values appear in the second table in Rows.
estimate Contains the mean of the supplied values for each predictor and each value in the second
table in Rows.
std_error Contains the standard deviation of the mean (standard error) for each predictor.
Column Description
z_score Contains the likelihood that the null hypothesis is true, given this sample. The likelihood
is the difference between the observed sample mean and the hypothesized mean, divided
by the standard error.
p_value Contains the significance level for each predictor.
significance Contains the likelihood that the predictor is significant (refer to CoxPH Output).
Rows
The onscreen output includes a row for each parameter in the following table, with values for the estimate,
standard error, z-score, p-value, and significance:
Table 358: GLM Onscreen Output Row Parameters
Parameter Description
Intercept The value of the logit (Y) when all predictors are 0.
Predictors A row for each predictor value (X1,X2,...,Xp).
The following values are also output in the second column (estimate).
Table 359: GLM Onscreen Output Values in the Estimate Column
Note:
With Step('true'), the function reports this number for each step.
Note:
For the Gamma distribution density, AIC and BIC might have the value NaN when the dispersion
parameter is very small (for example, 0.00170243) and goodness-of-fit is poor (for example, 0.011).
Output Table
The output table specified by the OutputTable argument stores the estimated coefficients and statistics,
which are used by the functions GLMPredict and LRTEST.
When a particular column is not used for its corresponding row, the column contains a value of zero (0).
This is a description of the columns that appear in the output table:
Table 360: GLM Output Table Columns
Column Description
attribute The index of each predictor, starting from 0.
predictor The column name for each predictor that was supplied as input to the function.
category The category names of each predictor. Numeric predictors have NULL values in this
column.
estimate The mean of the supplied values for each predictor.
std_error Standard deviation of the mean for each predictor (standard error).
z_score or If the Family argument specifies the BINOMIAL, LOGISTIC, POISSON, GAMMA,
t_score INVERSE_GAUSSIAN, or INVERSE_BINOMIAL family, then the name of the column
is z_score.
The z-score is a measure of the likelihood that the NULL hypothesis is true, given this
sample. It is derived by taking the difference between the observed sample mean and the
hypothesized mean, divided by the standard error. The z-score statistic follows the
N(0,1) distribution.
If the Family argument specifies the GAUSSIAN family, then the name of the column is
t_score. The t_score statistic follows a t(N-p-1) distribution.
p_value The significance level (p-value) for each predictor.
significance The likelihood that the predictor is significant (refer to Output in the function: CoxPH).
The output includes a row for each of the following with a value for estimated value, standard error, z-score,
p-value, and significance:
Table 361: GLM Output Table Parameters
Parameter Description
Loglik The log likelihood of the model.
Intercept The value of the logit (Y) when all predictors are 0.
Predictors A row for each predictor value (X1,X2,...,Xp). Each numeric input column corresponds
to one predictor.
Goodness-of-Fit Tests
• Deviance
• Wald’s Test
• Rao's Score Test
• Pearson’s Chi-squared Statistic
Deviance
The deviance for a model M0, based on a dataset y, is defined as:
D(y) = 2*(loglik(θ̂s; y) - loglik(θ̂0; y))
where θ̂0 denotes the fitted parameters for model M0 and θ̂s denotes the fitted parameters for the "full model" (or "saturated model").
Both sets of fitted values are implicitly functions of the observations y. In this case, the full model is a model
with a parameter for every observation so that the data are fitted exactly. This expression is -2 times the log-
likelihood ratio of the reduced model compared to the full model.
The deviance is used to compare two models—in particular in the case of generalized linear models where it
has a similar role to residual variance from ANOVA in linear models (RSS).
Suppose in the framework of the GLM that there are two nested models, M1 and M2. In particular, suppose
that M1 contains the parameters in M2, and k additional parameters. Then, under the null hypothesis that
M2 is the true model, the difference between the deviances for the two models follows an approximate chi-
squared distribution with k-degrees of freedom. This provides us an alternative way for computing the log-
likelihood ratio of two models.
Wald’s Test
Significance tests can be performed for individual regression coefficients (that is, H0 : βj = 0) by computing
the Wald statistics, which are similar to the partial t-statistics from classical regression:
wj = β̂j / SE(β̂j)
where SE(β̂j) is the estimated standard error of β̂j. Under the null hypothesis that βj = 0, the Wald test
statistic wj follows approximately a standard normal distribution (and its square is approximately a
chi-square on one degree of freedom).
This quantity is computed by the GLM function as the Wald Test, as well as the corresponding 'p_value'. It is
in the output table, and also displayed on the screen.
Rao's Score Test
Suppose that θ0 is the maximum likelihood estimate of θ under the null hypothesis H0 : θ = θ0. Then the
score statistic
U(θ0)' I(θ0)^-1 U(θ0)
where U is the score function and I is the Fisher information, is asymptotically distributed as chi-square
with k degrees of freedom under H0, where k is the number of constraints imposed by the null hypothesis.
Pearson's Chi-squared Statistic
The generalized Pearson statistic is:
X2 = Σ (yi - μ̂i)^2 / V(μ̂i)
where V(μ) is the variance function of the assumed distribution. If the fitted model is correct and the
observations yi are approximately normal, then X2 is approximately distributed as chi-square on the residual
degrees of freedom for the model. Both the deviance and the generalized Pearson X2 have exact chi-square
distributions for normal-theory linear models (assuming, of course, that the model is true), and asymptotic
results are available for the other distributions. The deviance has a general advantage as a measure of
discrepancy in that it is additive for nested sets of models if maximum-likelihood estimates are used, whereas
X2 in general is not. However, X2 may sometimes be preferred because of its more direct interpretation.
The GLM function computes Pearson's goodness-of-fit statistic.
Examples
• Example 1: Logistic Regression Analysis with Intercept
• Example 2: Logistic Regression Analysis with Step Argument
• Example 3: Gaussian Distribution Analysis with Default Options
Input
The input table, admissions_train, contains data about applicants to an academic program. For each
applicant, attributes in the table include a Masters Degree indicator, a grade point average (on a 4.0 scale), a
statistical skills indicator, a programming skills indicator, and an indicator of whether the applicant was
admitted. The Masters Degree, statistical skills, and programming skills indicators are categorical variables.
Masters degree has two categories (yes or no), while the other two have three categories (Novice, Beginner
and Advanced). For admitted status, "1" indicates that the student was admitted and "0" indicates otherwise.
Table 362: GLM Example 1 Input Table admissions_train
Output
The output table shows the model statistics.
Table 363: GLM Example 1 Model Statistics
For categorical variables, the model selects a reference category. In this example, the Advanced category was
used as a reference for the stats variable.
The query below returns the output shown in the following two tables:
Input
GLM Example 1 input table admissions_train
SQL-MapReduce Call
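A sketch of the Example 2 call, reusing the Example 1 input and model arguments with the Step argument enabled; the dummy ON clause and the output table name are assumptions:

SELECT * FROM GLM (
    ON (SELECT 1) PARTITION BY 1    -- assumed dummy input
    InputTable ('admissions_train')
    OutputTable ('glm_admissions_model_step')   -- hypothetical
    InputColumns ('admitted', 'masters', 'gpa', 'stats', 'programming')
    CategoricalColumns ('masters', 'stats', 'programming')
    Family ('LOGISTIC')
    Step ('true')
);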
Output
Note that the model starts with 33 degrees of freedom and then consecutively increases the degrees of
freedom to 39, at which point the response is modeled with only the intercept. The model parameters are
obtained progressively by dropping one predictor variable.
The query below returns the output shown in the following table:
Input
The input table, housing_train, is real estate data on homes, used to model the home price with 12 predictors
(six numerical and six categorical variables). The variable definitions are:
• Response variable:
∘ price - sale price of a house in $
• Predictors:
SQL-MapReduce Call
In this example, the family is GAUSSIAN and the default family link is IDENTITY.
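A sketch of the call, assuming price is the first column of housing_train so that InputColumns can be omitted (by default, the first column is the response and the rest are predictors); the dummy ON clause is an assumed invocation style:

SELECT * FROM GLM (
    ON (SELECT 1) PARTITION BY 1    -- assumed dummy input
    InputTable ('housing_train')
    OutputTable ('glm_housing_model')
    Family ('GAUSSIAN')
    Link ('CANONICAL')    -- resolves to IDENTITY for the Gaussian family
);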
Output
Many predictors are significant at 95% confidence level (p-value < 0.05).
This query returns the following table:
GLMPredict
Summary
The GLMPredict function uses the model generated by the function GLM to perform generalized linear
model prediction on new input data.
This function can be used with real-time applications. Refer to AMLGenerator.
GLMPredict Syntax
Version 1.5
Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
ModelTable Required Specifies the name of the model table generated by the GLM function.
Accumulate Optional Specifies the names of input table columns to copy to the output table.
Family Optional Specifies the distribution exponential family. The default value is 'BINOMIAL'. If
you specify this argument, you must give it the same value that you used for the
Family argument of the function GLM when you generated the model table.
Link Optional Specifies the link function. The default value is 'CANONICAL'. The canonical link
functions (default link functions) and the link functions that are allowed for each
exponential family are listed in Background.
Note:
Use the same value that you used for the Link argument of the function GLM
when you generated the model table.
Output
Table 373: GLMPredict Output Table Schema
Examples
• Example 1: Logistic Distribution Prediction
• Example 2: Gaussian Distribution Prediction
Input
The input test table, admissions_test, has admissions information for 20 students. The example uses the
glm_admissions_model (see the table GLM Example 1 Model Statistics in the Output section of
Example 1 of the function GLM) to evaluate the prediction on the admission status of these students.
Table 374: GLMPredict Example 1 Input Table admissions_test
SQL-MapReduce Call
Output
The query below returns the output shown in the following table:
prediction_accuracy
1.00000000000000000000
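A query along the following lines computes this accuracy (a minimal sketch; the table name glmpredict_admissions and the column names admitted and fitted_value are assumptions based on this example):
-- Compare the rounded predicted probability to the actual admitted flag (0/1).
SELECT CAST(SUM(CASE WHEN ROUND(fitted_value) = admitted THEN 1 ELSE 0 END)
            AS DOUBLE PRECISION) / COUNT(*) AS prediction_accuracy
FROM glmpredict_admissions;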
Input
The input test table, housing_test, contains test data for a sample of 54 homes. The example uses the Gaussian
model glm_housing_model from GLM Example 3 to evaluate the prediction for these new homes,
comparing the predictions with the original prices by computing the root mean square error
(RMSE).
Table 378: GLMPredict Example 2 Input Table housing_test (Columns 1-7)
SQL-MapReduce Call
The “canonical” link specifies the default family link, which is “identity” for the Gaussian distribution.
Output
The following query returns the output shown in the following table:
sn price fitted_value
13 27000 37345.844
16 37900 43687.13175
25 42000 40902.028
38 67000 72487.6705
53 68000 79238.6937
104 132000 111528.007
111 43000 39102.8812
117 93000 66936.951
132 44500 41819.8865
140 43000 41611.7915
... ... ...
RMSE
SELECT SQRT(AVG(POWER(glmpredict_housing.price -
glmpredict_housing.fitted_value, 2))) AS RMSE FROM glmpredict_housing;
rmse
10246.7521984348
Overview
The Hidden Markov model is a statistical model that describes the evolution of observable events that
depend on internal factors that are not directly observable. The following graph shows the key elements of
the model.
The graph has two parts, divided by a dashed line. Below the dashed line is the observed sequence; above the
dashed line are the hidden states (so called because they are not directly observable). The hidden states have
state transitions that introduce the state sequences.
In the following graph, the observed sequence is the weather, the hidden states are the seasons, and the state
sequence is summer, fall, winter, spring. The states have outgoing edges to the observations, where the edge
represents the emission. For example, summer emits good weather and winter emits bad weather.
The HMM model addresses three problems: Learning, Decoding, and Evaluating. The following graph
assumes that historical weather data from years 2011 to 2013 is available to train the model. If the hidden
states are labeled in the training data, this type of training process is called “supervised learning.” Otherwise,
it is called “unsupervised learning.” To use unsupervised learning, you must specify the number of hidden
states. After the model is trained, you can make predictions. Given the observed sequence, inferring the
internal state is called “decoding.” Given the sequence, measuring the probability of the sequence is called
“evaluation.”
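In the standard formulation (general HMM notation, not specific to this guide), a model is the triple λ = (A, B, π), where A holds the state-transition probabilities, B the emission probabilities, and π the initial-state probabilities. The three problems can then be stated as follows:
• Learning: estimate λ from one or more observed sequences O.
• Decoding: given O and λ, find the hidden state sequence Q that maximizes P(Q | O, λ).
• Evaluation: given O and λ, compute P(O | λ).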
HMMUnsupervisedLearner
Summary
The HMMUnsupervisedLearner function is available on the SQL-Graph platform. The function can produce
multiple HMM models simultaneously, where each model is learned from a set of sequences and where each
sequence represents a vertex.
Usage
HMMUnsupervisedLearner Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
ModelColumn Optional The name of the column that contains the model attribute. If you
specify this argument, then model_attribute must match a
model_key in the PARTITION BY clause. The values in the column
can be integers or strings.
SeqColumn Required The name of the column that contains the sequence attribute. The
sequence_attribute must be a sequence attribute in the PARTITION
BY clause. A sequence must contain more than two observation
symbols.
ObsColumn Required The name of the column that contains the observed symbols. The
function scans the input table to find all possible observed symbols.
Note:
Observed symbols are case-sensitive.
Note:
The number of hidden states can influence model quality and
performance, so choose the number appropriately.
MaxIterNum Optional The number of iterations that the training process runs before the
function completes. The default is 10.
Epsilon Optional The threshold value in determining the convergence of HMM
training. If the parameter value difference is less than the threshold,
the training process converges. There is no default value; if you do
not specify Epsilon, only MaxIterNum determines when training stops.
The sum of the probabilities in each row for the initial state
probabilities, state transition probabilities, or emission probabilities
parameters must be rounded to 1.0. The observed symbols are case-sensitive.
Input
The HMMUnsupervisedLearner function takes a vertices table as the input fact table. Each sequence
represents a vertex.
The PARTITION BY clause specifies attributes that represent the unique sequence across the table. For
example, in the following table, a valid PARTITION BY clause is PARTITION BY model_id, seq_id.
Table 384: HMMUnsupervisedLearner Example Vertices Table (Sequences)
The ORDER BY clause ensures that the observations in each sequence are sorted chronologically in
ascending order. For example, in the preceding table, a valid ORDER BY clause is ORDER BY model_id,
seq_id, time. When seq_id is 1, the observed sequence is MMMLLMMML.
Output
The HMMUnsupervisedLearner function outputs console messages and generates the following three tables
through JDBC:
• Initial-state probability table
• State-transition probability table
• Emission probability table
Table 385: HMMUnsupervisedLearner Console Message Table Schema
Example
In this example, loan statuses are mapped to the observation symbols shown in the following table:
status symbols
current 1
late 2
one month late 3
two months late 4
three months late 5
four months late 6
defaulted 7
paid 8
Input
The input data used to train the two models is shown in the following table.
Table 390: HMMUnsupervisedLearner Example Input Table loan_prediction
The status of the loan is shown in the model_id column, where a value of “1” denotes a defaulted loan and a
value of “2” denotes a paid loan. Rows with the same model id are used to train a single model. The use of
two model ids ensures that two different models are trained. Also notice that the defaulted loans end with
observed_id=7 and paid loans end with observed_id=8. The seq_vertex_id column provides the ordering of
the symbols in the sequences.
SQL-MapReduce Call
Assume that there are three hidden states, and use the default initialization method, random, to train the
models. The query outputs three state tables: pi_loan (initial state probabilities), A_loan (state transition
probabilities), and B_loan (emission, or observation, probabilities).
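A call of this shape might look like the following sketch (the argument name HiddenStateNum and the elided arguments are assumptions; the column arguments follow the Arguments table above):
SELECT * FROM HMMUnsupervisedLearner (
    ON loan_prediction AS "vertices"
    PARTITION BY model_id, seq_id
    ORDER BY model_id, seq_id, seq_vertex_id
    ModelColumn ('model_id')
    SeqColumn ('seq_id')
    ObsColumn ('observed_id')
    HiddenStateNum (3)
    ...
);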
Output
message
HMM models will be saved to the tables pi_loan, A_loan, and B_loan once the training process is
successfully completed.
The query below returns the output shown in the following table:
The following query returns the output shown in the following table:
HMMSupervisedLearner
Summary
The HMMSupervisedLearner function is available on the SQL-Graph platform. The function can produce
multiple HMM models simultaneously, where each model is learned from a set of sequences and where each
sequence represents a vertex.
HMMSupervisedLearner Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
ModelColumn Optional The name of the column that contains the model attribute. If you specify
this argument, then its value must match a model_key in the PARTITION
BY clause.
SeqColumn Required The name of the column that contains the sequence attribute. The
sequence_attribute must be a sequence attribute in the PARTITION BY
clause. A sequence must contain more than two observation symbols.
ObsColumn Required The name of the column that contains the observed symbols. The function
scans the input table to find all possible observed symbols.
Note:
Observed symbols are case-sensitive.
StateColumn Required The name of the column that contains the state attributes. You can specify
multiple states. The states are case-sensitive.
Input
The HMMSupervisedLearner function takes a vertices table as the input fact table. Each sequence represents
a vertex.
The PARTITION BY clause consists of list attributes representing the unique sequence across the entire
table. For example, in the following table, a valid PARTITION BY clause is PARTITION BY model_id,
seq_id.
Table 395: HMMSupervisedLearner Example vertices table (sequences)
The ORDER BY clause ensures that the observations in each sequence are sorted chronologically in
ascending order. For example, in the preceding table, a valid ORDER BY clause is ORDER BY model_id,
seq_id, time. When seq_id is 1, the observed sequence is MMMLLMMML.
The training function can train either one HMM or multiple HMMs. Each model id corresponds to an
output HMM model.
Output
The HMMSupervisedLearner function outputs console messages and generates the following three tables
through JDBC:
• Initial-state probability table
• State-transition probability table
• Emission probability table
Table 396: HMMSupervisedLearner Console Message Table Schema
Example
To determine the level that a purchase belongs to, you can use K-Means clustering with K=3 to generate
clusters based on either time difference between purchases or difference in amount spent. Purchases
clustered using either metric are classified into one of the three purchase levels. The observation associated
with a purchase is the combination of the levels from both metrics.
From the different levels of time difference and spending amount difference between purchases, nine
combinations of spending profiles are possible: SL, SS, SM, ML, MS, MM, LL, LS, LM. These nine spending
profiles serve as the observation symbols for the HMM.
If you assume a customer belongs to one of three loyalty levels—low (L), normal (N), or high (H)—then
the number of hidden states is three. The definition of these loyalty levels is at the discretion of the
business. This example uses supervised learning, so each purchase in the input dataset is labeled with the
customer's loyalty level. You can use these labeled observations to train an HMM which can later be used to
assign loyalty levels to new, unlabeled purchase data.
Input
Table 401: HMMSupervisedLearner Example Input Table customer_loyalty
SQL-MapReduce Call
The following SQL query generates the probabilities with state information:
Output
The function outputs three state tables: pi_loyalty (initial), A_loyalty (transition), and B_loyalty
(emission).
message
HMM models will be saved to the tables pi_loyalty, A_loyalty, and B_loyalty once the training process is
successfully completed.
The following query returns the output shown in the following table:
The following query returns the output shown in the following table:
The following query returns the output shown in the following table:
HMMEvaluator
Summary
The HMMEvaluator function measures the probabilities of sequences, with respect to each trained HMM.
Usage
HMMEvaluator Syntax
Version 1.3
Arguments
Argument Category Description
InitStateModelColumn Required The name of the model attribute column in the
InitStateProb table.
InitStateColumn Required The name of the state attribute column in the
InitStateProb table.
InitStateProbColumn Required The name of the initial probability column in the
InitStateProb table.
TransAttributeColumn Required The name of the model attribute column in the
TransProb table.
TransFromStateColumn Required The name of the source of the state transition column in
the TransProb table.
TransToStateColumn Required The name of the target of the state transition column in
the TransProb table.
TransProbColumn Required The name of the state transition probability column in
the TransProb table.
EmitModelColumn Required The name of the model attribute column in the
EmissionProb table.
EmitStateColumn Required The name of the state attribute in the EmissionProb
table.
EmitObsColumn Required The name of the observation attribute column in the
EmissionProb table.
EmitProbColumn Required The name of the emission probability in the
EmissionProb table.
ModelColumn Required The name of the column that contains the model
attribute. If you specify this argument, then
model_attribute must match a model_key in the
PARTITION BY clause.
SeqColumn Required The name of the column that contains the sequence
attribute. The sequence_attribute must be a sequence
attribute in the PARTITION BY clause.
ObsColumn Required The name of the column that contains the observed
symbols.
Note:
Observed symbols are case-sensitive.
Note:
If the SeqProbColumn argument is omitted, the
function cannot determine whether the observed
sequence is new; therefore, it treats all model
sequences in the input tables as new.
ShowChangeRate Optional If 'true' (the default), the function outputs the percentage
change of the predicted sequence probability, relative to the
previous predicted probability, for the applied model.
SeqProbColumn Optional Specifies the column that contains the previously predicted
sequence probability. The function uses the previous value in
this column to calculate the change rate.
SkipColumn Optional The name of the column whose values determine
whether the function skips the row. The function skips
the row if the value is “true”, “yes”, “y”, or “1”. The
function does not skip the row if the value is “false”, “f”,
“no”, “n”, “0”, or NULL.
Accumulate Optional Specifies the names of the columns in input_table that
the function copies to the output table.
Input
HMMEvaluator accepts four input tables. Three of them are the HMM parameter tables output by the
HMMUnsupervisedLearner or HMMSupervisedLearner function. The fourth table contains the newly
observed sequences, and has a schema similar to that of the input table or view for HMMUnsupervisedLearner.
Table 406: HMMEvaluator Initial-State Probability Table Schema
Output
Table 409: HMMEvaluator Output table
Input
The input, test_loan_prediction, is a test loan sequence, which HMMEvaluator uses to predict whether this
loan is more likely to be paid in full or to default. The input table does not include observations of 7 (default)
or 8 (paid). The sequence of observations in this table is the same for evaluation of both models (that is,
model_id = 1 and model_id = 2).
Table 410: HMMEvaluator Example Input Table: test_loan_prediction
Output
For the sequence used in this example (id=17), the output table shows the final sequence_probability given
by each model. For model 1, based on defaulted loans, the final sequence probability is 1.74E-09. For model
2, based on paid loans, the final sequence probability is 3.13E-20. Because the sequence probability is higher
for the model based on defaulted loans (by a factor of roughly 5.6 x 10^10), this sequence is considered a
potential loan default.
Table 411: HMMEvaluator Example Output Table
HMMDecoder
Summary
The HMMDecoder function finds the state sequence with the highest probability, given the learned model
and observed sequences.
Usage
HMMDecoder Syntax
Version 1.3
Arguments
Argument Category Description
SequenceMaxSize Optional The maximum length, in rows, of a sequence in the observation table.
Input
The HMMDecoder function accepts four input tables. Three tables are dimensional tables that are generated
by the HMMUnsupervisedLearner or HMMSupervisedLearner function. The fourth table contains the
newly observed sequences. The schema of the fourth table is similar to the schema of the input table of the
HMMUnsupervisedLearner or HMMSupervisedLearner function.
Table 412: HMMDecoder Initial-State Probability Table Schema
The model attribute, sequence attributes, state attributes, observed key attributes, and percent change key
attributes specified in the arguments are used for output column names.
Output
Table 415: HMMDecoder Output Table Schema
Examples
• Example 1: Loan Default Prediction (from Unsupervised Learner)
• Example 2: Customer Loyalty Prediction (from Supervised Learner)
• Example 3: Part-of-Speech Tagging
• Example 4: Bank Customer Churn
Input
The input consists of the trained model tables from the Output section of the Example in
HMMUnsupervisedLearner (pi_loan, A_loan, and B_loan). The function predicts, or decodes, the hidden
state information for the new test sequence.
Output
For the same sequence, the hidden states are different for each model.
Table 416: HMMDecoder Example 1 Output Table
Input
The input table, customer_loyalty_newseq, is a collection of three new test sequences (seq_id 4, 5, 6) for
user_id 1. HMMSupervisedLearner was used to train models over multiple users, producing the trained
model tables (pi_loyalty, A_loyalty, and B_loyalty). This example uses these model tables with the input to
determine the loyalty levels of customers from the new sequence of purchases. The loyalty levels are low (L),
normal (N), and high (H).
Table 417: HMMDecoder Example 2 Input Table customer_loyalty_newseq
SQL-MapReduce Call
Output
The output table shows the decoded loyalty levels for the new sequences. For seq_id 5, the loyalty level
increased towards the end of the sequence (from L to H); for the other sequences (seq_id 4 and seq_id 6), the
loyalty level did not change.
Input
The HMMDecoder function can be used to decode the parts of speech (adjective, noun, verb, etc.) for a word
set, if the set of phrases or words has been trained using the HMMSupervisedLearner or
HMMUnsupervisedLearner function. Assume that you have a set of phrases (shown in the following table)
whose parts of speech are unknown, and that the three trained state tables (initial, state_transition, and
emission) are readily available.
In this example, the parts of speech correspond to the hidden states of the HMM function. There are two
hidden states in this example: A (adjective) and N (noun). HMMDecoder can be used to find these parts of
speech.
Table 419: HMMDecoder Example 3 Input Table phrases
SQL-MapReduce Call
Output
Input
HMMDecoder can also be used to find the propensity of customer churn, given the actions or transactions
of a customer in a bank. The input table, churn_data, contains different transactions (column action) of a
customer (column id). The order of transactions is shown in the column path_id. Assume that the trained
tables with their state probabilities are readily available, as shown in the four tables that follow churn_data,
and that the states correspond to T (True – customer is likely to churn) or F (False – customer is unlikely to
churn).
Table 424: HMMDecoder Example 4 Input Table churn_data
SQL-MapReduce Call
Output
Histogram
Summary
Histograms are useful for assessing the shape of a data distribution. The Histogram function calculates the
frequency distribution of a dataset using sophisticated binning techniques that can automatically calculate
the bin width and number of bins. The function maps each input row to one bin and returns the frequency
(row count) and proportion (percentage of rows) of each bin.
The Aster Analytics histogram implementation, redesigned for release 6.21, includes the following
capabilities:
• User-selected or automatic bin determination
• User-selected left-inclusive or right-inclusive binning
• Multiple histograms for distinct groups
Background
The Histogram function uses either Sturges' or Scott's algorithm to compute binning (bin width and number
of bins). The bin width is the range for each group of values. Binning algorithms make strong assumptions
about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis,
different bin widths may be appropriate.
Sturges’ Algorithm
Sturges' algorithm for calculating bin width can be written as:
w = r / (1 + log2(n))
Scott’s Algorithm
Scott's algorithm for calculating bin width can be written as:
w = 3.49s / n^(1/3)
where w is the bin width, s is the standard deviation of the data values, and n is the number of elements in the
data set. The number of bins is r/w, where r is the range of the data values.
This algorithm performs best on normally distributed data.
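As an illustration, Scott's bin width for the hp column of the example input table used below (cars_hist) can be computed directly with standard SQL aggregates (a sketch for intuition, not part of the Histogram function):
SELECT 3.49 * STDDEV(hp) / POWER(COUNT(*), 1.0/3.0) AS bin_width,
       (MAX(hp) - MIN(hp)) /
         (3.49 * STDDEV(hp) / POWER(COUNT(*), 1.0/3.0)) AS bin_count
FROM cars_hist;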
Usage
Histogram Syntax
Version 1.0
The function provides several options for how bins are defined:
• The user can specify a target for the number of bins.
• The user can specify the bin boundaries.
• The function can determine the bin boundaries automatically, using one of two built-in algorithms.
• The user can provide the minimum and maximum values for the histogram, and an optional bin size. If a
bin size is provided, the bins are equally sized; if not, they might not be.
Arguments
Argument Category Description
InputTable Required Table containing the input data.
OutputTable Required Name for the table that the function generates, containing the output.
Input
The input table must include a column with the data to be sorted into bins, and may include one or more
GroupBy columns. Other columns are ignored.
Table 429: Histogram Input Table Schema
Output
The output table contains any user-specified GroupBy columns and the bin information (lower boundary,
upper boundary, and number and percentage of input rows falling into that bin).
Table 430: Histogram Output Table Schema
Examples
• Example 1: Bins with Sturges' Algorithm
• Example 2: Bins with Scott's Algorithm
• Example 3: You Specify Bins
Input
All examples use the same input table, cars_hist, which has the cylinder (cyl) and horsepower (hp) data for
different car models. The examples compute histograms on the hp column.
Table 431: Histogram Example Input Table cars_hist
id name cyl hp
1 Mazda RX4 6 110
2 Mazda RX4 Wag 6 110
3 Datsun 710 4 93
4 Hornet 4 Drive 6 110
5 Hornet Sportabout 8 175
6 Valiant 6 105
7 Duster 360 8 245
8 Merc 240D 4 62
9 Merc 230 4 95
10 Merc 280 6 123
11 Merc 280C 6 123
12 Merc 450SE 8 180
13 Merc 450SL 8 180
14 Merc 450SLC 8 180
15 Cadillac Fleetwood 8 205
16 Lincoln Continental 8 215
17 Chrysler Imperial 8 230
18 Fiat 128 4 66
19 Honda Civic 4 52
20 Toyota Corolla 4 65
21 Toyota Corona 4 97
22 Dodge Challenger 8 150
23 AMC Javelin 8 150
24 Camaro Z28 8 245
25 Pontiac Firebird 8 175
26 Fiat X1-9 4 66
27 Porsche 914-2 4 91
28 Lotus Europa 4 113
29 Ford Pantera L 8 264
30 Ferrari Dino 6 175
31 Maserati Bora 8 335
32 Volvo 142E 4 109
SQL-MapReduce Call
Output
The following query returns the output shown in the following table:
SQL-MapReduce Call
The following query returns the output shown in the following table:
SELECT * FROM cars_scott_out ORDER BY 1;
Table 435: Histogram Example 2 Output Table cars_scott_out
SQL-MapReduce Call
Output
The following query returns the output shown in the following table:
SELECT * FROM cars_hist_out ORDER BY 1;
KNN
Summary
The KNN function uses training data objects to map test data objects to categories. The function is
optimized for both small and large training sets. The function supports user-defined distance metrics and
distance-weighted voting.
Background
At the IEEE International Conference on Data Mining (ICDM) in December 2006, the K-Nearest Neighbor
(kNN) classification algorithm was presented as one of the top 10 data-mining algorithms.
The kNN algorithm classifies data objects based on proximity to other data objects with known
classification. The objects with known classification serve as training data.
kNN classifies data based on the following parameters:
• Training data
• A metric that measures distance between objects
• The number of nearest neighbors (k)
The following figure shows an example of data classification using kNN. The red and blue dots represent
training data objects—the red dots are classified as cancerous tissue and the blue dots are classified as
normal tissue. The gray dot represents a test data object.
The inner circle represents k=4 and the outer circle represents k=10. When k=4, most of the nearest
neighbors of the gray dot are red, so the algorithm classifies the gray dot as cancerous tissue. When k=10,
most of the nearest neighbors of the gray dot are blue, so the algorithm classifies the gray dot as normal
tissue.
Figure 12: KNN Example
KNN Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
TrainingTable Required Specifies the name of the table that contains the training
data. Each row represents a classified data object.
TestTable Required Specifies the name of the table that contains the test data to
be classified by the kNN algorithm. Each row represents a
test data object.
K Required Specifies the number of nearest neighbors to use for
classifying the test data.
ResponseColumn Required Specifies the name of the training table column that contains
the class label or classification of the classified data objects.
IDColumn Required Specifies the name of the testing table column that uniquely
identifies a data object.
For distance-weighted voting, the weight of each neighbor's vote is w = 1/POWER(distance, voting_weight).
For example, with voting_weight = 2, a neighbor at distance 2 votes with weight 1/4, while a neighbor at
distance 1 votes with weight 1.
Input
The KNN function has two input tables, a training table and a test table.
The following table describes the required training table column. The training table can have additional
columns, but the function ignores them.
The following table describes the required test table column. The test table can have additional columns, but
the function ignores them.
Table 439: KNN Test Table Schema
Output
The KNN function outputs a message and a table. By default, the function outputs only the table. If you
specify an output table name, then the function outputs the message to the console and creates an output
table with the specified name.
Table 440: KNN Output Table Schema
Example
Input
The training input has five attributes of personal computers as dimensions—price, speed, hard disk size,
RAM, and screen size. The training table has 5008 rows, categorized into eight price groups, which the
following table describes.
clusterid category
0 SPECIAL
1 UBER
2 MEGA
3 ULTRASUPER
4 SUPER
5 EXTREME
6 HYPER
7 ULTRA
SQL-MapReduce Call
Output
The following query returns the output shown in the following table:
id computer_category
10 MEGA
11 SUPER
15 MEGA
29 HYPER
30 HYPER
38 UBER
45 UBER
46 MEGA
48 SPECIAL
51 MEGA
52 MEGA
59 HYPER
65 SUPER
66 SPECIAL
70 HYPER
86 SUPER
91 HYPER
92 SUPER
93 MEGA
94 MEGA
104 HYPER
... ...
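The KNN function supports user-defined distance metrics (see Summary). The following Java class is an example of such a metric: it implements the Distance interface and measures distance on a single integer column. The package and class names here are illustrative.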
package com.example;

import com.asterdata.ncluster.sqlmr.data.RowView;
import com.asterdata.sqlmr.analytics.classification.knn.distance.Distance;

public class MyDistance implements Distance {
    /**
     * Calculates the distance between a test row and a training row.
     * Notes:
     *   1. Do not reverse the order of the parameters.
     *   2. The columns of trainingRowView are 'responseColumn, f1, f2, ..., fn'.
     *   3. The columns of testRowView are the same as those of TEST_TABLE.
     *   4. Column indexes in trainingRowView and testRowView are zero-based
     *      (0 <= index && index < getColumnCount()).
     *
     * @param testRowView a point in the test data set
     * @param trainingRowView a point in the training data set; its columns are
     *        the columns in the distanceFeatures argument
     * @return the distance as a double value
     */
    @Override
    public double calculate(RowView testRowView, RowView trainingRowView) {
        // Absolute difference of the integer values in column index 1.
        return Math.abs(testRowView.getIntAt(1) - trainingRowView.getIntAt(1));
    }
}
LARS Functions
Summary
Least angle regression (LARS) and its most important modification, least absolute shrinkage and selection
operator (LASSO), are variants of linear regression that select the most important variables, one by one, and
fit the coefficients dynamically. Aster Database provides two LARS functions:
• LARS
• LARSPredict
The output of the LARS function is input to the LARSPredict function.
LARS
Summary
The LARS function generates a model that the function LARSPredict uses to make predictions for the
response variables.
Usage
LARS Syntax
Version 1.1
Arguments
Argument Category Description
InputTable Required Specifies the name of the input table.
OutputTable Required Specifies the name of the output table.
InputColumns Required Specifies the names of the columns of the input table that contain
the response and predictors.
The syntax of predictor_columns is:
{col[,...] | [start_column:end_column]}[,...]
where col is a column name and start_column and end_column are
the column indexes of the first and last columns in a range of
columns. The range includes start_column and end_column.
The leftmost column has column index 0, the column to its
immediate right has column index 1, and so on.
Note:
In a column range, brackets do not indicate optional elements.
You must include the bracket characters (for example, '[2:6]').
Note:
This function can take at most 799 response and predictor
variables.
Method Optional Specifies either 'lar' (least angle regression) or 'lasso'. The default
value is 'lasso'.
Intercept Optional Specifies whether an intercept is included in the model (and not
penalized). The default value is 'true'.
Normalize Optional Specifies whether each predictor is standardized to have unit L2
norm. The default value is 'true'.
MaxIterNum Optional Specifies the maximum number of steps the function executes. The
default value is 8*min(number_of_predictors, sample_size -
intercept). For example, if the number of predictors is 11, the sample
size (number of rows in the input table) is 1532, and the intercept is
1, then the default value is 8*min(11, 1532 - 1) = 88.
Input
The LARS function has one required input table, which contains the response column and predictor
columns. The input table can have additional columns, but the function ignores them.
Note:
The LARS function skips input rows that contain NULL values.
Output
Table 446: LARS Output Table Schema
Input
This input is diabetes data from “Least Angle Regression,” by Bradley Efron and others.
The input table has one response (vector y) and ten baseline predictors measured on 442 diabetes patients.
The baseline predictors are age, sex, body mass index (bmi), mean arterial pressure (map) and six blood
serum measurements (tc, ldl, hdl, tch, ltg, glu).
Table 447: LARS Examples Input Table: diabetes, Columns 1-6
The column id is the row identifier, y is the response, and the other columns are predictors.
This data set is atypical in that each predictor has mean 0 and norm 1, which means that:
• The value of the Normalize argument is irrelevant.
• If the value of the Intercept argument is 'true', then the intercept is considered to be constant along the
entire path (which is typically not true).
SQL-MapReduce Call
Output
message
Successful.
Result has been stored in table: '"diabetes_lars"'.
(2 rows)
The following query returns the output shown in the following table:
The following figure represents the results and shows how the standardized coefficients evolved during the
model-building process. The x-axis represents the ratio of the norm of the current beta to the full beta. The
y-axis represents the standardized coefficients, which are estimated when standardized predictors are used.
The numbers on the top of the graph represent the steps of the model-building process. The numbers on the
right represent the predictor IDs.
Figure 13: LAR Results
SQL-MapReduce Call
Output
message
Successful.
Result has been stored in table: '"diabetes_lasso"'.
(2 rows)
The following query returns the output shown in the following table:
Table 454: LARS Example 2 (LASSO) Output Table diabetes_lasso, Columns 8–16
The following figure represents the results and shows how the standardized coefficients evolved during the
model-building process. The x-axis represents the ratio of the norm of the current beta to the full beta. The
y-axis represents the standardized coefficients, which are estimated when standardized predictors are used.
The numbers on the top of the graph represent the steps of the model-building process. The numbers on the
right represent the predictor IDs.
LARSPredict
Summary
The LARSPredict function takes new data and the model generated by the function LARS and uses the
predictors in the model to output predictions for the new data.
Usage
LARSPredict Syntax
Version 1.1
Arguments
Argument Category Description
Mode Optional Specifies the mode for the S argument:
• 'STEP' (default)
The S argument indicates the steps corresponding to the steps in the
model generated by the LARS function. The S argument can include any
real values in [1, k], where k is the maximum step in the model.
• 'FRACTION'
The S argument indicates the fractions of the L1 norm of the coefficients
against the maximum L1 norm. The maximum L1 norm is that of the full
OLS solution, which is the coefficients at the last step. The S argument
can include any real values in [0, 1].
• 'NORM'
The S argument indicates the L1 norm of the coefficients. The S
argument can include any real values in [0, max L1 norm]. For
maximum L1 norm, see above.
• 'LAMBDA'
The S argument indicates the maximum absolute correlations. For
definition, see the description of max_abs_corr in the Output section of
the function LARS. The S argument can include any real values.
Input
The LARSPredict function has two required input tables, the table that contains the new data (described by
the following table) and the model table generated by the LARS function (described in the Output section).
The data table can have columns that are not predictors, but the function ignores them.
Table 455: LARSPredict Data Table Schema
Examples
• Example 1: Model ('diabetes_lars')
• Example 2: Model ('diabetes_lasso')
Input
• Data table diabetes_test (below), obtained by a 20% sampling rate (88 rows) from the table diabetes (from
the Input section of the Examples for the function LARS)
• Model file, diabetes_lars, found in the Output section of the LARS function, Example 1
Table 457: LarsPredict Example Data Table: diabetes_test, Columns 1-6
SQL-MapReduce Call
Output
Input
• Data table diabetes_test, obtained by a 20% sampling rate (88 rows) from the table diabetes
• Model file diabetes_lasso, output by LARS Example 2
SQL-MapReduce Call
Output
Linear Regression
Summary
The LinRegMatrix function takes a data set and outputs a linear regression model. The LinReg function
takes the linear regression model and outputs its coefficients. The 0th coefficient corresponds to the
intercept.
Background
The linear regression model is probably the easiest predictive technique to use. This model can be as simple
as having one input variable and one output variable or as complex as having dozens of input variables. All
linear regression models fit this pattern: Independent variables are used first to model and then to predict
the result—the dependent variable. In matrix notation, a linear regression model is given by the formula Y =
Xβ + ε, where:
• X is the independent (predictor) variable or vector.
• β is the vector of parameters.
• ε is the error vector.
• Y is the dependent (response) vector.
That is, for each observation i: yi = β0 + β1xi1 + β2xi2 + ... + βpxip + εi.
The input table contains all the predictor columns and, in its last column, the response vector. The output
table contains the beta coefficients (indexed by the coefficient_index column). The 0th coefficient corresponds
to the intercept and the ith coefficient corresponds to the ith predictor variable. The LinReg function is
limited to outputting the coefficients; it does not give the significance of predictor variables by a p-value or
the goodness of fit by an R2 value.
Note:
PARTITION BY 1 is required because all input data must be submitted to one worker.
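A call therefore nests LinRegMatrix inside LinReg, with PARTITION BY 1 applied to the inner result (a sketch based on the note above and the example input table below):
SELECT * FROM LinReg (
    ON LinRegMatrix (ON housing_data)
    PARTITION BY 1
);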
Input
The input table for the LinRegMatrix function has one row for each data point and one column for each data
point component. A data point can have multiple x components but only one y component. The column
that represents the y component must be the last column in the input table.
Table 463: LinRegMatrix Input Table Schema
Note:
If an input table row contains a NULL value, then the LinRegMatrix function skips that row.
Output
Table 464: LinReg Output Table Schema
Input
Table 465: LinRegMatrix Example Input Table housing_data
Note:
The LinRegMatrix function skips the last row of housing_data because it contains a NULL value.
SQL-MapReduce Call
Output
Table 466: LinReg Example Output Table
CoefficientName value
Intercept -21739.2966650368
housesize -26.9307835091457
lotsize 6.33452410459345
granite 7140.67629349537
upgradedbathroom 43179.1998888263
bedrooms 44293.7605841832
The 0th coefficient index is the intercept. The coefficient indices 1, 2, 3, 4, and 5 correspond to
HouseSize, LotSize, Bedrooms, Granite, and UpgradedBathroom, respectively.
LRTEST
Summary
The LRTEST function performs the likelihood ratio test for two GLM models, generated by the function
GLM.
Background
A likelihood ratio test is useful for comparing the fit of a null model and an alternative model. The null model
is a special case of the alternative model. The likelihood ratio expresses how many times more likely the data
are under one model than the other. You can use the likelihood ratio or its logarithm to compute a p-value,
or compare it to a critical value to decide whether to reject the null model in favor of the alternative model.
When you use the logarithm of the likelihood ratio, the statistic is known as the log-likelihood ratio statistic.
You can use Wilks’s theorem to approximate the probability distribution of this statistic (assuming that the
null model is true).
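In the standard notation (a textbook formulation, not specific to this function), the statistic is:
Lambda = -2 * (loglik_null - loglik_alternative)
which, by Wilks's theorem, is asymptotically chi-squared distributed with degrees of freedom equal to the difference in the number of parameters between the two models, assuming the null model is true.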
LRTEST Syntax
Version 1.1
Arguments
Argument Category Description
Statistic Required Specifies the name of the input column that contains the name of the
statistic. This column corresponds to the GLM output column
'predictor'.
LogLik Required Specifies the name of the input column that contains the log-likelihood
of the GLM model. This column corresponds to the GLM output
column 'estimate'.
ObsNum Required Specifies the name of the input column that contains the number of
observations. This column corresponds to the GLM output column
'std_err'.
ParamNum Required Specifies the name of the input column that contains the number of
parameters (excluding the intercept). This column corresponds to the
GLM output column 'z_score'.
Input
The LRTEST has two required input tables, both of which are model tables generated by the function GLM.
For the GLM model table schema, refer to the Output Table section.
Output
Table 467: LRTEST Output Table Schema
Example
This example compares two models generated by the GLM function.
Input
The input table, glm_tempdamage, has 22 observations with one numerical predictor variable (temp) and
one response variable (damage). The value of the response variable shows whether there is damage due to
temperature (1 means yes, 0 means no).
Table 468: LRTEST Example Input Table glm_tempdamage
id temp damage
1 53 1
2 57 1
3 58 1
4 63 1
5 66 0
6 67 0
7 67 0
8 67 0
9 68 0
10 69 0
11 70 1
12 70 0
13 70 1
14 70 0
15 72 0
16 73 0
17 75 0
18 75 1
19 76 0
20 76 0
21 78 0
22 79 0
Because this is a binary outcome, the two models use GLM with logistic regression. SQL-MapReduce Call 1
generates a model using the predictor variable (the second table in its Output section). SQL-MapReduce Call
2 generates the null model (the second table in its Output section). The null model is produced with only the
intercept. SQL-MapReduce Call 3 uses the LRTEST function to compare the two GLM models.
Output
Table 469: LRTEST Example Output Table
Output
Table 471: LRTEST Example Output Table
Output
Table 473: LRTEST Example Output Table
The final output compares the two GLM models and displays a chi-squared statistic and a p-value. The
chi-squared value suggests that the data was more likely to be generated by model 1, based on all predictors,
than by the null model. The result is statistically significant at the 95% confidence level (p-value < 0.05).
Percentile
Summary
The Percentile function generates percentiles for groups of numbers. The nth percentile is the smallest value
in a data set that is greater than n% of the values.
Use this function when the input data is partitioned into a large number of groups and you want to find the
percentile for each group. Each group must fit on a single worker node. The maximum number of input
rows in each group that the function can process depends on the cluster configuration. To find percentiles
for groups that do not fit on a single worker node, use an approximate percentile method instead.
Usage
Percentile Syntax
Version 1.0
Arguments
Argument Category Description
Percentile Required Specifies the percentiles for the function to generate.
Target_Columns Required Specifies the names of the columns that contain the groups of numbers
whose percentiles are to be generated.
Group_Columns Optional Specifies the names of the columns to copy to the output table. Typically,
the list of group columns is the same as the list of partition columns.
Input
Table 474: Percentile Input Table Schema
Output
Table 475: Percentile Output Table Schema
Example
Input
This example uses data from participants in the 2012 London Olympic Games. The input data consists of the
age, height, weight, sex, sport and country for a subset of the participants in the 2012 Summer Olympics.
Table 476: Percentile Example Input Table london_olympics
SQL-MapReduce Call
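A representative call over this input might look like the following sketch (the percentile values and the PARTITION BY column are assumptions consistent with the output description below):
SELECT * FROM Percentile (
    ON london_olympics
    PARTITION BY country
    Percentile (25, 50, 75)
    Target_Columns ('age', 'height', 'weight')
    Group_Columns ('country')
);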
Output
The output table displays the values for each target column, partitioned by country, corresponding to the
requested percentiles.
Summary
Principal component analysis (PCA) is a common unsupervised learning technique that is useful for both
exploratory data analysis and dimension reduction. PCA is often used as the core procedure for factor
analysis.
The PCA function is composed of two functions, PCA_Map and PCA_Reduce.
If the version of PCA_Reduce is AA 6.21 or later, you can input the PCA output to the function PCAPlot.
Usage
Arguments
Argument Category Description
Target_Columns Optional Specifies the target_table columns that contain the data, which
must be numeric. The default value is every target_table column.
Components Optional Specifies the number of principal components to return (an
integer). If num_components is k, then the function returns the
top k components. By default, the function returns every
principal component.
Note:
The function ignores input rows that have missing values.
Output
The PCA_Reduce function outputs a table in which each row represents a principal component, or
eigenvector. The first row represents the component with the largest eigenvalue in the matrix.
Table 479: PCA_Reduce Output Table Schema
Input
The input is medical data for 25 patients, identified by patient ID (pid). The data has these attributes (also
called variables or dimensions):
• Age (years)
• Body mass index (BMI) (kg/m2)
• Blood pressure (mm Hg)
• Blood glucose level (mg/dL)
• Strokes (number experienced)
• Cigarettes (number smoked/month)
• Insulin (mg/dL)
• High-density lipoproteins (HDL) (mg/dL)
Table 480: PCA Example Input Table patient_pca_input
In the following table, which is derived from the output table, the cumulative variance calculation shows that
the three top-ranked eigenvectors account for ~98% of the total variance.
To compute the correlations between the principal components and the original values for each attribute,
use the preceding view and the functions Corr_Map and Corr_Reduce (described in Correlation).
corr value
pca_1:bmi 0.111364
pca_1:glucose 0.478427
pca_1:hdl 0.00949362
pca_1:insulin 0.999914
pca_1:age 0.409505
pca_1:bloodpressure 0.0123209
pca_1:cigarettes 0.459847
pca_1:strokes -0.471619
corr value
pca_2:bmi 0.212095
pca_2:glucose 0.294333
pca_2:hdl 0.990415
pca_2:insulin -0.00536637
pca_2:age 0.167691
pca_2:bloodpressure -0.0412905
pca_2:cigarettes 0.0454973
pca_2:strokes -0.0450002
corr value
pca_3:age -0.48665
pca_3:bloodpressure -0.605769
pca_3:cigarettes 0.331247
pca_3:strokes -0.48879
pca_3:bmi 0.23217
pca_3:glucose -0.769423
pca_3:hdl 0.123469
pca_3:insulin 0.0108517
Summary
If the three principal components are used to determine patient health condition, then the most important
attributes are higher insulin and HDL levels and lower blood pressure and glucose levels.
PCAPlot
Summary
The PCAPlot function takes the principal components output by the PCA function (Principal Component
Analysis) and input data, centers the input data, changes the basis of the input data to the principal
components, and outputs the result.
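In matrix terms (standard PCA notation, not specific to this function), if X is the input data, x̄ holds the column means, and Wk is the matrix whose columns are the top k eigenvectors from PCA_Reduce, then the output scores are T = (X − x̄)Wk.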
Note:
The version of PCA_Reduce, a component of the PCA function, must be AA 6.21 or later.
PCAPlot Syntax
Version 1.0
Arguments
Argument Category Description
Components Required Specifies the number of principal components to return (an
integer). If num_components is k, then the function returns the
top k components.
Accumulate Optional Specifies the names of the input table columns to copy to the
output table.
Input
The function requires an input table and a PCA table. The PCA table is output by the PCA_Reduce function.
Table 491: PCAPlot Input Table Schema
Output
Table 492: PCAPlot Output Table Schema
Input
• Input table patient_pca_input
• PCA table pca_health_ev
SQL-MapReduce Call
Output
Table 493: PCAPlot Example Output Table
RandomSample
Summary
The RandomSample function takes a data set and uses a specified sampling method to output one or more
random samples. Each sample has exactly the number of rows specified.
The RandomSample function is useful for generating test sets, training sets, and initial centers for clustering
algorithms.
In addition to the default basic sampling, in which each input table row has a probability of being selected
that is proportional to its weight, this function provides two alternate methods, KMeans++ and KMeans||,
which are designed for generating a set of initial seeds for the function KMeans.
Usage
RandomSample Syntax
Version 1.0
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the data set from which to
take samples.
NumSample Required Specifies both the number of samples and their sizes. For each
sample_size (an INTEGER value), the function selects a sample that
has sample_size rows.
WeightColumn Optional Specifies the name of the input_table column that contains weights for
weighted sampling. The weight_column must have a numeric SQL data
type. By default, rows have equal weight.
SamplingMode Optional Specifies the sampling mode:
• 'basic' (default)
Each input_table row has a probability of being selected that is
proportional to its weight. The weight of each row is in
weight_column.
• 'kmeans++'
One row is selected in each of k iterations, where k is the number of
desired output rows. The first row is selected randomly. In
subsequent iterations, the probability of a row being selected is
proportional to the value in the WeightColumn multiplied by the
distance from the nearest row in the set of selected rows. The
distance is calculated using the methods specified by the Distance
and CategoricalDistance arguments.
• 'kmeans||'
Enhanced version of KMeans++ that exploits parallel architecture
to accelerate the sampling process. The algorithm is described in
the paper Scalable K-Means++ by Bahmani et al
(http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). Briefly, at
each iteration, the probability that a row is selected is proportional
to the value in the WeightColumn multiplied by the distance from
the nearest row in the set of selected rows (as in KMeans++).
However, the KMeans|| algorithm oversamples at each iteration,
significantly reducing the required number of iterations; therefore,
the resulting set of rows might have more than k data points.
Tip:
For optimal performance, use 'kmeans++' when the desired sample
size is less than 15 and 'kmeans||' otherwise.
Distance Optional For KMeans++ and KMeans|| sampling, specifies the function for
computing the distance between numerical variables:
• 'euclidean' (default): The distance between two variables is defined
in Euclidean Distance (found in the Background section of the
function VectorDistance).
• 'manhattan': The distance between two variables is defined in
Manhattan Distance (found in the Background section of the
function VectorDistance).
InputColumns Optional For KMeans++ and KMeans|| sampling, specifies the names of the
input_table columns to use to calculate the distance between numerical
variables.
AsCategories Optional For KMeans++ and KMeans|| sampling, specifies the names of the
input_table columns that contain numerical variables to treat as
categorical variables.
CategoryWeights Optional For KMeans++ and KMeans|| sampling, specifies the weights
(DOUBLE PRECISION values) of the categorical variables, including
those that the AsCategories argument specifies. Specify the weights in
the order (from left to right) that the variables appear in the input
table. When calculating the distance between two rows, distances
between categorical values are scaled by these weights.
CategoricalDistance Optional For KMeans++ and KMeans|| sampling, specifies the function for
computing the distance between categorical variables:
• 'overlap' (default): The distance between two variables is 0 if they
are the same and 1 if they are different.
• 'hamming': The distance between two variables is the Hamming
distance between the strings that represent them. The strings must
have equal length.
Seed Optional Specifies the random seed with which to initialize the algorithm (a
LONG value). If you specify Seed, you must also specify SeedColumn.
SeedColumn Optional Specifies the names of the input_table columns by which to partition
the input. Function calls that use the same input data, seed, and
seed_column output the same result. If you specify SeedColumn, you
must also specify Seed.
Note:
Ideally, the number of distinct values in the seed_column is the
same as the number of workers in the cluster. A very large number
of distinct values in the seed_column degrades function
performance.
OverSamplingRate Optional For KMeans|| sampling, specifies the oversampling rate (a DOUBLE
PRECISION value greater than 0.0). The function multiplies rate by
sample_size (for each sample_size). The default rate is 1.0.
IterationNum Optional For KMeans|| sampling, specifies the number of iterations (an
INTEGER value greater than 0). The default number_of_iterations is 5.
Input
Table 494: RandomSample Input Table Schema
Output
Table 495: RandomSample Output Table Schema
Examples
• Input
Input
The input table has 32 observations of 11 variables for different models of cars:
• mpg: miles per U. S. gallon
• cyl: number of cylinders
• disp: displacement (cubic inches)
• hp: gross horsepower
• drat: drive ratio
• wt: weight (lbs/1000)
• qsec: quarter-mile time (seconds)
• vs: engine configuration (V or S (straight))
• am: transmission type (automatic or manual)
• gear: number of forward gears
• carb: number of carburetors
The variables vs and am are categorical; the others are numerical.
Table 496: RandomSample Examples Input Table fs_input
SQL-MapReduce Call
SQL-MapReduce Call
SQL-MapReduce Call
Output
Sample
Summary
The Sample function draws rows randomly from the input table.
Note:
The Sample function does not guarantee the exact sizes of samples. If each sample must have an exact
number of rows, use the RandomSample function.
Usage
Sample Syntax
Arguments
Argument Category Description
SampleFraction Required Specifies one or more fractions to use in sampling the data. (Syntax
options that do not use SampleFraction require ApproxSampleSize.)
If you specify only one fraction, then the function uses fraction for all
strata defined by the sample conditions.
If you specify more than one fraction, then the function uses each fraction
for sampling a particular stratum defined by the condition arguments.
Note:
For conditional sampling with variable sample sizes, specify one
fraction for each condition that you specify with the Strata argument.
Seed Optional Specifies an integer to add to each task ID to create a real random seed for
the task. The default value is 0.
ApproxSampleSize Optional Specifies one or more approximate sample sizes to use in sampling the
data. (Syntax options that do not use ApproxSampleSize require
SampleFraction.) Each sample size is approximate because the function
maps the size to the sample fractions and then generates the sample data.
If you specify only one size, then it represents the total sample size for the
entire population. If you also specify the Strata argument, then the
function proportionally generates sample units for each stratum.
If you specify more than one size, then each size corresponds to a stratum,
and the function uses each size to generate sample units for the
corresponding stratum.
Note:
For conditional sampling with variable approximate sample sizes,
specify one size for each condition that you specify with the Strata
argument.
StratumColumn Optional Specifies the name of the column that contains the sample conditions. If
the function has only one input table (the data table), then
condition_column is in the data table. If the function has two input tables,
data and summary, then condition_column is in the summary table.
Strata Optional Specifies the sample conditions that appear in the condition_column
specified by StratumColumn. If Strata specifies a condition that does not
appear in condition_column, then the function issues an error message.
Input
The Sample function always requires a data table. Some syntax options also require a summary table.
Table 500: Sample Data Table Schema
Note:
The summary input must summarize the population statistics faithfully. That is, the sum over stratum_count
with a non-null stratum value must equal the total population size. Otherwise, the final sample output
might not approximate the target sample fractions well.
Output
Table 502: Sample Output Table Schema
Examples
• Input
• Example 1: Unconditional Sampling with Single Sample Rate
• Example 2: Conditional Sampling with Variable Sample Rate
• Example 3: Unconditional Sampling with Total (Single) Approximate Sample Size
• Example 4: Conditional Sampling with Variable Approximate Sample Size
Input
The input table (score_category) is obtained by categorizing the students (in the table students) based on
their score in a given subject. There are 100 students grouped into three categories - excellent (score > 90),
very good (80 < score < 90), and fair (score < 80) - as shown in the SQL CASE statement below.
Table 503: Sample Example Input Table students
id score
1 5
2 83
3 95
4 95
5 90
6 55
7 40
id score
8 57
9 65
10 27
... ...
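A statement along these lines produces the categorized table (a sketch; the handling of boundary scores such as 90 is an assumption based on the sample rows shown):
CREATE TABLE score_category AS
SELECT id, score,
       CASE WHEN score >= 90 THEN 'excellent'
            WHEN score >= 80 THEN 'very good'
            ELSE 'fair'
       END AS stratum
FROM students;
The first rows of the resulting score_category table follow: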
id score stratum
1 5 fair
2 83 very good
3 95 excellent
4 95 excellent
5 90 excellent
6 55 fair
7 40 fair
8 57 fair
9 65 fair
10 27 fair
SQL-MapReduce Call
Output
id score
4 95
6 55
15 22
16 19
21 5
22 25
27 44
32 3
36 15
39 50
44 70
47 79
65 53
67 29
68 30
69 18
74 7
79 71
81 13
83 81
85 79
92 32
SQL-MapReduce Call
Output
id score stratum
12 93 excellent
28 90 excellent
60 97 excellent
78 91 excellent
90 100 excellent
8 57 fair
10 27 fair
21 5 fair
24 11 fair
27 44 fair
32 3 fair
37 14 fair
42 39 fair
46 19 fair
49 8 fair
54 43 fair
61 6 fair
79 71 fair
81 13 fair
85 79 fair
94 76 fair
99 44 fair
100 18 fair
20 85 verygood
95 84 verygood
stratum stratum_count
very good 9
fair 77
excellent 14
SQL-MapReduce Call
Output
id score stratum
4 95 excellent
6 55 fair
15 22 fair
16 19 fair
21 5 fair
27 44 fair
36 15 fair
39 50 fair
69 18 fair
81 13 fair
85 79 fair
92 32 fair
SQL-MapReduce Call
Output
id score stratum
12 93 excellent
28 90 excellent
60 97 excellent
78 91 excellent
90 100 excellent
8 57 fair
10 27 fair
21 5 fair
24 11 fair
27 44 fair
37 14 fair
46 19 fair
49 8 fair
79 71 fair
81 13 fair
85 79 fair
94 76 fair
20 85 very good
53 87 very good
id score stratum
95 84 very good
Note:
The summary input must summarize the population statistics faithfully. That is, the sum over
stratum_count with a non-null stratum value must be equal to the total population size. If this condition
does not hold, the final sample output might not approximate the target sample fractions well.
Shapley Value Functions
Summary
The Shapley value is intended to reflect the importance of each player to the coalition in a cooperative game
(a game between coalitions of players, rather than between individual players).
The Shapley value functions are:
• GenerateCombination, a function that takes combinations of players (coalitions) and generates input for
AddOnePlayer
• SortCombination, a function that sorts combinations of players
• AddOnePlayer, a function that takes sorted combinations and outputs a table
• SQL Statements to Compute the Shapley Value, which query the AddOnePlayer input and output tables
The input to GenerateCombination can be either unsorted user data or sorted output from the function
nPath. If the input is unsorted, GenerateCombination inputs it to SortCombination.
The input to SortCombination can come from either GenerateCombination or the user.
The input to AddOnePlayer can come from either GenerateCombination or SortCombination.
Figure 15: Computing a Shapley Value
Background
The Shapley value of a player is the difference between the average coalition payoff if the player is a member
and the average coalition payoff if the player is not a member.
If N is the set of players, S is an arbitrary coalition of players that does not include player i, and v is the payoff
function, then this formula computes the Shapley value of player i:
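In standard form, with |S| denoting the number of players in coalition S:

φi(v) = Σ over S ⊆ N\{i} of [ |S|! (|N| - |S| - 1)! / |N|! ] × ( v(S ∪ {i}) - v(S) )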
Instead of computing the sum over all 2^N possible coalitions, the Aster Analytics Shapley Value feature
computes an approximate Shapley value by sampling over coalitions whose observed payoff values are
included in the user data.
GenerateCombination
The GenerateCombination function takes combinations of players and generates input for AddOnePlayer.
Usage
GenerateCombination Syntax
Version 1.0
Input
The first two columns of the input table must be index and payoff, respectively. The table can have other
columns after these two, but the function ignores them.
Table 510: GenerateCombination Input Table Schema
Examples
Refer to the AddOnePlayer Examples.
SortCombination
The SortCombination function takes a table of combinations, generated by either the GenerateCombination
function or a SQL statement, and outputs a table of sorted combinations that can be input to AddOnePlayer.
Usage
SortCombination Syntax
Version 1.1
Arguments
Argument Category Description
CombinationColumn Required Specifies the name of the input table column that
contains the combinations.
ValueColumn Required Specifies the name of the input table column that
contains the assigned value of each combination.
Input
The following table describes the required columns of the input table. The table can have additional
columns, but the function ignores them.
Table 512: SortCombination Input Table Schema
Output
Table 513: SortCombination Output Table Schema
Examples
Refer to the AddOnePlayer Examples.
AddOnePlayer
The AddOnePlayer function takes a table of sorted combinations, generated by either GenerateCombination
or SortCombination, and outputs a table. The AddOnePlayer input and output tables are queried by the SQL
Statements to Compute the Shapley Value.
AddOnePlayer Syntax
Version 1.0
Arguments
Argument Category Description
CombinationColumn Required Specifies the name of the input table column that contains the
combinations.
SizeColumn Required Specifies the name of the input table column that contains the size
of each combination.
ValueColumn Required Specifies the name of the input table column that contains the
characteristic value of each combination.
NumPlayers Required Specifies the number of players in the game, a positive integer.
Delimiter Optional Specifies the character that separates player numbers in
combinations—either ' ' (space, the default), '#', '$', '%', or '&'.
Input
The AddOnePlayer input table has the same schema as the SortCombination output table.
Note:
The comb column values must be unique; otherwise, the Shapley values will be incorrect.
Output
Table 514: AddOnePlayer Output Table Schema
Examples
• Example 1: Use GenerateCombination and AddOnePlayer
• Example 2: Use nPath to Create Input to GenerateCombination
Input
Assume that you implemented all three projects and want to know the average contribution of each project
to the total cost (capital_cost + operating_cost). Shapley value calculates that average cost contribution over
all possible orderings of the players.
Given a cost-sharing game, let the players join the game one at a time in a predetermined order. As each
player joins, the number of players to be served increases. A player's cost contribution is its net addition to
the total cost when it joins (that is, the incremental cost of adding it to the group of players who have
already joined).
The last line in the preceding table that includes all the projects (p_123) is analyzed as follows to derive the
average contribution.
Sharing the Capital Cost of 16000 for p_123:
SQL-MapReduce Call
Output
Table 516: AddOnePlayer Example 1 Output Table (Generate Payoff Tables for Each Combination)
SQL-MapReduce Call
Output
Table 517: AddOnePlayer Example 1 Output Table (Add One Player to Each Combination)
SQL-MapReduce Call
Output
player shapley_value
1 2000
2 10000
3 8000
Input
The input data is a simulated click stream.
Table 519: AddOnePlayer Example 2 Input Table Schema
#!/usr/bin/python
import sys
import getopt
import csv
import random

def usage():
    print '\nUsage:-'
    print 'generate_table_data.py -l length -p partition'
    print ' '
    sys.exit(2)

def main(argv):
    length = 0
    partition = 0
    try:
        opts, args = getopt.getopt(argv, "hl:p:", ["length=", "partition="])
    except getopt.GetoptError:
        usage()
    for opt, arg in opts:
        if opt == '-h':
            usage()
        elif opt in ("-l", "--length"):
            length = long(arg)
        elif opt in ("-p", "--partition"):
            partition = long(arg)
        else:
            usage()
    # The code that generates and writes the simulated click-stream rows
    # is elided in the source.

if __name__ == "__main__":
    main(sys.argv[1:])
Output
ind num_conv
00001 13223
00010 13124
00011 2693
00100 13278
00101 2753
00110 2677
00111 920
01000 13167
01001 2664
01010 2607
01011 843
01100 2628
01101 874
01110 931
01111 440
10000 13204
10001 2599
10010 2716
10011 882
10100 2651
10101 901
10110 847
10111 460
11000 2595
11001 850
11010 841
11011 460
11100 859
11101 463
11110 444
11111 351
ind num_tot
00001 00001
00010 15345
00011 3128
00100 15512
00101 3169
00110 3128
00111 1090
01000 15402
01001 3114
01010 3030
01011 981
01100 3038
01101 1008
01110 1087
01111 520
10000 15418
10001 3024
10010 3144
10011 1038
10100 3078
10101 1046
10110 998
10111 536
11000 3003
11001 985
11010 997
11011 535
11100 993
11101 533
11110 514
11111 404
In this simulated case, for five players (impact events), all 32 possible combinations appear except the
empty set. Because the data is generated randomly, each combination is very likely to be observed at least
once. In real applications with many more players (dozens), this is very unlikely.
Output
player shapley_value
1 0.177030985554059
2 0.175257593393326
3 0.174638753135999
4 0.168013821045558
5 0.173870752255122
SMAVG
Summary
The SMAVG (simple moving average) function computes the simple moving average over a number of
points in a series.
Usage
SMAVG Syntax
Version 1.2
Arguments
Argument Category Description
TargetColumns Optional Specifies the names of the input columns for which the simple moving
average is to be computed. If you omit this argument, then the function
copies all numeric input columns to the output table.
IncludeFirst Optional Specifies whether to output the first window_size rows. Because the simple
moving average for the first window_size rows is undefined, the function
returns NULL values for those columns. The default value is 'false'.
WindowSize Optional Specifies the number of previous values to include in the computation of
the simple moving average. The default value is 10.
Note:
The SMAVG function treats the schema names, table names, and column names as case-insensitive
arguments. If any of these arguments contain capital letters, you must surround each one of them with
double quotation marks. For example:
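A call of this form quotes a mixed-case column name (the table and column names here are hypothetical):

SELECT * FROM SMAVG (
  ON "StockData"
  PARTITION BY id
  ORDER BY period
  TargetColumns ('"StockPrice"')
);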
Output
Table 527: SMAVG Output Table Schema
Example
This example computes a simple moving average for the price of IBM stock. The input data is a series of
IBM common stock closing prices from 17 May 1961 to 2 November 1962.
Input
Table 528: SMAVG Example Input Table ibm_stock
SQL-MapReduce Call
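The call itself is not reproduced here; a sketch using the documented arguments follows (the column names
name, period, and stockprice are assumptions):

SELECT * FROM SMAVG (
  ON ibm_stock
  PARTITION BY name
  ORDER BY period
  TargetColumns ('stockprice')
  WindowSize (10)
  IncludeFirst ('true')
);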
Output
Table 529: SMAVG Example Output Table
SparseSVM Functions
The SparseSVM functions are:
• SparseSVMTrainer, which takes training data and builds a predictive model in binary format
• SparseSVMPredictor, which uses the model to predict the class of each sample in a test data set
• SVMModelPrinter, which displays readable information about the model
The SparseSVMTrainer and SparseSVMPredictor functions are designed for input that is in sparse format;
that is, each table row represents an attribute and each sample (observation) often consists of many
attributes. These functions are suitable for tasks like text classification, where the number of attributes
(many unique words) might exceed the maximum number of columns allowed in a table.
This implementation of SparseSVM functions solves the primal form of a linear kernel support vector
machine, using gradient descent on the objective function. The implementation is based primarily on
Pegasos: Primal Estimated Sub-Gradient Solver for SVM (by S. Shalev-Shwartz, Y. Singer, and N. Srebro;
presented at ICML 2007).
SparseSVMTrainer
Summary
The SparseSVMTrainer function takes training data (in sparse format) and outputs a predictive model in
binary format, which is input to the functions SparseSVMPredictor and SVMModelPrinter.
SparseSVMTrainer Syntax
Version 1.1
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the training
samples.
ModelTable Required Specifies the name for the model table that the function creates (which
must not exist).
SampleIDColumn Required Specifies the name of the input_table column that contains the
identifiers of the training samples.
AttributeColumn Required Specifies the name of the input_table column that contains the
attributes of the samples.
ValueColumn Optional Specifies the name of the input_table column that contains the
attribute values. By default, each attribute has the value 1.
Note:
You must use hash projection if the dataset has more features than
fit into memory.
HashBuckets Optional Valid only if Hash is 'true'. Specifies the number of buckets for hash
projection. In most cases, the function can determine the appropriate
number of buckets from the scale of the input data set. However, if the
dataset has a very large number of features, you might have to specify
buckets_number to accelerate the function.
ClassWeights Optional Specifies the weights for different classes. The format is: “classlabel
m:weight m, classlabel n:weight n”. If a weight is given for a class, the
cost parameter for that class is weight * cost. A weight larger than 1
often increases the accuracy of the corresponding class; however, it
may decrease global accuracy. Classes not assigned a weight in this
argument are assigned a weight of 1.0.
MaxStep Optional A positive integer value that specifies the maximum number of
iterations of the training process. One step means that each sample is
seen once by the trainer. The input value must be in the range (0,
10000]. The default value is 100.
Epsilon Optional Termination criterion. When the difference between the values of the
loss function in two sequential iterations is less than this number, the
function stops. Must be greater than 0.0. The default value is 0.01.
Seed Optional A long integer value used to order the training set randomly and
consistently. You can use this value to ensure that the same model
is generated if the function is run multiple times in a given database
with the same arguments. The input value must be in the range [0,
9223372036854775807]. The default value is 0.
Output
The SparseSVMTrainer function outputs console messages and a model table.
Table 531: SparseSVMTrainer Console Message Table Schema
The model table, which is input to the function SparseSVMPredictor, is in binary format. To display its
readable content, use the function SVMModelPrinter.
Example
Input
The input data is a table of four iris attributes (sepal length, sepal width, petal length, and petal width),
grouped into three categories (setosa, versicolor, and virginica):
Table 532: SparseSVMTrainer Example Input Table svm_iris
The testing table, which is input to the SparseSVMPredictor function, appears in the SparseSVMPredictor
example Input.
Table 534: SparseSVMTrainer Example Input Table svm_iris_input_train
SQL-MapReduce Call
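A sketch of the training call, using the documented arguments (the column names id, attribute, and value1
and the driver clause ON (SELECT 1) PARTITION BY 1 are assumptions; MaxStep (150) is inferred from the
output message below):

SELECT * FROM SparseSVMTrainer (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('svm_iris_input_train')
  ModelTable ('svm_iris_model')
  SampleIDColumn ('id')
  AttributeColumn ('attribute')
  ValueColumn ('value1')
  MaxStep (150)
);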
Output
Table 535: SparseSVMTrainer Example Output Message
message
Model table "svm_iris_model" is created successfully
The model is trained with 120 samples and 4 unique attributes
There are 3 different classes in the training set
The model is not converged after 150 steps with epsilon 0.01, the average value of the loss function for the
training set is 35.46831950524983
The corresponding training parameters are cost:1.0 bias:0.0
SparseSVMPredictor
Summary
The SparseSVMPredictor function takes the model generated by the function SparseSVMTrainer and a set
of test samples (in sparse format) and outputs a prediction for each sample.
This function can be used with real-time applications. Refer to AMLGenerator.
Usage
SparseSVMPredictor Syntax
Version 1.1
Input
The SparseSVMPredictor function has two required input tables:
• The sample table, which contains the test data
The following table describes the required sample table columns. The function ignores any additional
columns, except those specified by the AccumulateLabel argument, which it copies to the output table.
• The model table, which is output by the SparseSVMTrainer function
The model table is in binary format. To display its readable content, use the function SVMModelPrinter.
Table 536: SparseSVMPredictor Sample Table Schema
Output
The SparseSVMPredictor function outputs a table that contains the predicted class of each test sample.
The predicted value for each sample is based on the score Σi (wi × xi), where i is the attribute id, xi is the
value of attribute i in the sample, and wi is the weight of attribute i.
Example
Input
This example takes two tables as input:
• The binary-format model svm_iris_model (produced by SparseSVMTrainer)
• The test data svm_iris_input_test
Table 538: SparseSVMPredictor Example Input Table svm_iris_input_test
SQL-MapReduce Call
Output
The query below returns the output shown in the following table:
prediction_accuracy
0.86666666666666666667
SVMModelPrinter
Summary
The SVMModelPrinter function takes the training data and the model generated by the function
SparseSVMTrainer and displays specified information.
Usage
SVMModelPrinter Syntax
Version 1.1
Arguments
Argument Category Description
AttributeColumn Required Specifies the name of the input table column that contains the
attribute names.
Summary Optional Specifies whether the output is a summary of the model. If 'false', the
output is the weight of each attribute in the model. The default value
is 'false'.
Input
The SVMModelPrinter function has two required input tables:
• The input table that contains the training data, described by Input
• The model table, which is output by the SparseSVMTrainer function in binary format
Output
The SVMModelPrinter function outputs either a summary of the model (if Summary is 'true') or a table that
contains the weight of each attribute in the model.
Table 541: SVMModelPrinter Console Message Table Schema (Summary('true'))
Examples
• Example 1: ShowSummary('true')
• Example 2: ShowSummary('false')
Example 1: ShowSummary('true')
SQL-MapReduce Call
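A sketch of the call, reusing the names from the SparseSVMTrainer example (the attribute column name and
the multiple-input ON clauses are assumptions):

SELECT * FROM SVMModelPrinter (
  ON svm_iris_input_train AS input PARTITION BY ANY
  ON svm_iris_model AS model DIMENSION
  AttributeColumn ('attribute')
  ShowSummary ('true')
);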
Output
message
The corresponding training parameters are cost:1.0 bias:0.0
The model is not converged after 150 steps with epsilon 0.01, the average value of the loss function for the
training set is 35.46831950524983
The model is trained with 120 samples and 4 unique attributes
There are 3 different classes in the training set
Example 2: ShowSummary('false')
SQL-MapReduce Call
Output
DenseSVM Functions
The DenseSVM functions are:
• DenseSVMTrainer, which takes training data and builds a predictive model in binary format
• DenseSVMPredictor, which uses the model to predict the class of each sample in a test data set
• DenseSVMModelPrinter, which displays readable information about the model
The DenseSVMTrainer and DenseSVMPredictor functions are designed for input in dense format; that is,
each table column contains values of a single attribute and there is a single row for each sample
(observation).
This implementation of the DenseSVM functions includes a linear SVM based on the Pegasos algorithm and a
non-linear SVM based on the Hash-SVM model described in the paper “Hash-SVM: Scalable Kernel
Machines for Large-Scale Visual Classification,” by Yadong Mu, Gang Hua, Wei Fan, and Shih-Fu Chang
(http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6909525).
DenseSVMTrainer
Summary
The DenseSVMTrainer function takes training data in dense format and outputs a predictive model in
binary format, which is input to the functions DenseSVMPredictor and DenseSVMModelPrinter.
Usage
DenseSVMTrainer Syntax
Version 1.1
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Input
Table 545: DenseSVMTrainer Input Table Schema
Output
The output is a binary table. To display its readable content, use the function DenseSVMModelPrinter.
The resulting model may be different even if the function is run with the same arguments, unless the
function is run on the same database with the seed value set.
Examples
• Input
• Train and Test Set
• Example 1: Linear Model
• Example 2: Polynomial Model
• Example 3: Radial Basis Model (RBF) Model
• Example 4: Sigmoid Model
In all of these examples, the DenseSVMTrainer function creates the model and the DenseSVMPredictor
function uses that model on a test set to make a prediction. The Polynomial, RBF, and Sigmoid models
generally obtain better prediction accuracy with higher values of the hashbits and subspacedimension
arguments. The value of the subspacedimension argument cannot be greater than the number of input rows.
You can tune the model using the cost and bias arguments. For details on model-specific tuning parameters,
refer to the arguments section.
The following query returns the output shown in the following table:
SQL-MapReduce Call
Output
message
Model table "densesvm_iris_linear_model" is created successfully
The model is trained with 120 samples and 4 unique attributes
There are 3 different classes in the training set
The model is not converged after 100 steps with epsilon 0.01, the average value of the loss function for the
training set is 38.035578853191694
The corresponding training parameters are cost:1.0 bias:0.0
SQL-MapReduce Call
Output
message
Model table "densesvm_iris_polynomial_model" is created successfully
The model is trained with 120 samples and 512 unique attributes with hash projection
There are 3 different classes in the training set
The model is not converged after 100 steps with epsilon 0.01, the average value of the loss function for the
training set is 12981.195818669565
The corresponding training parameters are cost:1.0 bias:0.0
SQL-MapReduce Call
Output
message
Model table "densesvm_iris_rbf_model" is created successfully
The model is trained with 120 samples and 512 unique attributes with hash projection
There are 3 different classes in the training set
The model is converged after 39 steps with epsilon 0.01, the average value of the loss function for the
training set is 16.640287770464468
The corresponding training parameters are cost:1.0 bias:0.0
SQL-MapReduce Call
message
Model table "densesvm_iris_sigmoid_model" is created successfully
The model is trained with 120 samples and 512 unique attributes with hash projection
There are 3 different classes in the training set
The model is converged after 18 steps with epsilon 0.01, the average value of the loss function for the
training set is 55.1275114033879
The corresponding training parameters are cost:1.0 bias:0.0
Only models with RBF and Sigmoid kernels have converged. This may mean that the true boundaries in the
data set are hard to capture with a linear or polynomial model.
DenseSVMPredictor
Summary
The DenseSVMPredictor function takes the model generated by the function DenseSVMTrainer and a set of
test samples in dense format and outputs a prediction for each sample.
Usage
DenseSVMPredictor Syntax
Version 1.1
Input
DenseSVMPredictor takes two input tables, a sample table containing data whose class is to be predicted,
and the model table produced by DenseSVMTrainer.
• The model table is in binary format. To display its readable content, use the function
DenseSVMModelPrinter.
• The schema of the sample table is shown in the following table. The function ignores any additional
columns, except those specified by the AccumulateLabel argument, which it copies to the output table.
Table 553: DenseSVMPredictor Input Sample Table Schema
Output
DenseSVMPredictor outputs a table containing the predicted class of each test sample. The schema is shown
in the following table.
Note:
The predict_confidence values may be different if the function is run on a different cluster.
Examples
• Input
• Example 1: Linear Model
• Example 2: Polynomial Model
• Example 3: Radial Basis Model (RBF) Model
• Example 4: Sigmoid Model
Input
These examples use the test dataset as input to the DenseSVMPredictor function.
SQL-MapReduce Call
SQL-MapReduce Call
Output
SQL-MapReduce Call
Output
SQL-MapReduce Call
DenseSVMModelPrinter
Summary
DenseSVMModelPrinter extracts readable information from the model produced by DenseSVMTrainer.
The function can display either a summary of the model training results or a table containing the weights for
each attribute.
Usage
DenseSVMModelPrinter Syntax
Version 1.1
Arguments
Argument Category Description
AttributeColumns Required Input table columns that contain the attributes of the test samples.
Attribute columns must be numeric (INT, REAL, BIGINT,
SMALLINT, or FLOAT).
Summary Optional If true, the output contains only summary information of the model.
If false, the output contains the weight of each attribute in the
model. The default value is false.
Input
The function takes two input tables. One is the model table produced by DenseSVMTrainer. The other is the
input table to DenseSVMTrainer that was used to produce the model.
Output
The DenseSVMModelPrinter function outputs either a summary of the model (if Summary is 'true') or a
table that contains the weight of each attribute in the model.
Example
Input
The RBF model is used as an example. Set the ShowSummary argument to 'false' to output the model
parameters (weights, attributes, and so on). The inputs for this function are the tables from the Train and
Test Set section and densesvm_iris_rbf_model from the function DenseSVMTrainer.
Output
Table 561: DenseSVMModelPrinter Example Output Table
Output
Table 562: DenseSVMModelPrinter Example Output Table
message
The model is trained with 120 samples and 512 unique attributes with hash projection
There are 3 different classes in the training set
The model is converged after 28 steps with epsilon 0.01, the average value of the loss function for the
training set is 15.81343373822361
The corresponding training parameters are cost:1.0 bias:0.0
VectorDistance
Summary
The VectorDistance function takes a table of target vectors and a table of reference vectors and returns a
table that contains the distance between each target-reference pair.
Information retrieval and text mining applications use the vector distance between the Term Frequency
Inverse Document Frequency (TF-IDF) representations of two documents to measure the similarity of their
subject matter.
Background
The VectorDistance function computes the distance between each vector in the target table and each vector
in the reference table:
Cosine Similarity
The cosine similarity between two vectors of an inner product space is the cosine of the angle between them.
The cosine of 0° is 1 and the cosine of any other angle is less than 1. Therefore, the cosine similarity
measures orientation and not magnitude. Regardless of their magnitude, two vectors with the same
orientation have a cosine similarity of 1, two vectors at 90° have a cosine similarity of 0, and two
diametrically opposed vectors have a cosine similarity of -1.
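For n-dimensional vectors A and B, the standard definition is:

S cos(A, B) = (A · B) / (||A|| ||B||)
            = (A1B1 + A2B2 + ... + AnBn) / ( sqrt(A1^2 + ... + An^2) × sqrt(B1^2 + ... + Bn^2) )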
Note:
Cosine similarity is not a proper distance metric, because it does not have the triangle inequality property
and it violates the coincidence axiom (which says that two things separated by zero distance must be
identical).
Cosine similarity is most commonly used in high-dimensional positive spaces. In positive space, cosine
similarity is often used for the complement, that is:
D cos(A, B) = 1 - S cos(A, B)
Euclidean Distance
The Euclidean distance between vectors p and q is the length of the line segment connecting them. If p=(p1,
p2,…, pn) and q=(q1, q2,…,qn) are vectors in Euclidean n-space, then the Euclidean distance between them
is:
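d(p, q) = sqrt( (p1-q1)^2 + (p2-q2)^2 + ... + (pn-qn)^2 )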
Manhattan Distance
The Manhattan distance (or taxicab distance) between vectors p and q is the sum of the absolute differences
of their Cartesian coordinates. If p=(p1, p2,…, pn) and q=(q1, q2,…,qn) are vectors in an n-dimensional real
vector space with a fixed Cartesian coordinate system, then the Manhattan distance between them is:
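d(p, q) = |p1-q1| + |p2-q2| + ... + |pn-qn|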
For example, in the plane, the Manhattan distance between (p1, p2) and (q1, q2) is |p1-q1 |+|p2-q2|.
Binary Distance
The Binary distance between vectors p and q is 1 if the vectors are identical (that is, if they have the same
length and value) and 0 otherwise.
Usage
VectorDistance Syntax
Version 1.1
Arguments
Argument Category Description
TargetIDColumns Required Specifies the names of the columns that comprise the target
vector identifier. You must partition the target input table by
these columns and specify them with this argument.
TargetFeatureColumn Required Specifies the name of the column that contains the target
vector feature name (for example, the axis of a 3-D vector).
Note:
An entry with a NULL value in a feature_column is
dropped.
TargetValueColumn Optional Specifies the name of the column that contains the value for
the target vector feature.
Note:
An entry with a NULL value in a value_column is
dropped.
RefIDColumns Optional Specifies the names of the columns that comprise the
reference vector identifier. The default value is the
TargetIDColumns argument value.
RefFeatureColumn Optional Specifies the name of the column that contains the reference
vector feature name. The default value is the
TargetFeatureColumn argument value.
Note:
An entry with a NULL value in a feature_column is
dropped.
RefValueColumn Optional Specifies the name of the column that contains the value for
the reference vector feature. The default value is the
TargetValueColumn argument value.
Note:
An entry with a NULL value in a value_column is
dropped.
RefTableSize Optional Specifies the size of the reference table. Specify 'LARGE' only
if the reference table does not fit in memory. The default
value, 'SMALL', allows faster processing.
DistanceMeasure Optional Specifies the distance measures that the function uses. The
default value is 'cosine'.
IgnoreMismatch Optional Specifies whether to drop mismatched dimensions. The
default value is 'true'. If DistanceMeasure is 'cosine', then
the function treats this argument as 'false'.
If you specify 'true', then two vectors with no common
features become two empty vectors when only their
common features are considered, and the function cannot
measure the distance between them.
ReplaceInvalid Optional Specifies the value to return when the function encounters
an infinite value or empty vectors. For custom, you can
supply any DOUBLE PRECISION value. The default value is
'PositiveInfinity'.
TopK Optional Specifies, for each target vector and for each measure, the
maximum number of closest reference vectors to include in
the output table. For k, you can supply any INTEGER value.
The default value is the maximum INTEGER value
(2,147,483,647).
MaxDistance Optional Specifies the maximum distance between a pair of target and
reference vectors. If the distance exceeds the threshold, the
pair does not appear in the output table.
If the DistanceMeasure argument specifies multiple
measures, then the MaxDistance argument must specify a
threshold for each measure. The ith threshold corresponds to
the ith measure. Each threshold can be any DOUBLE
PRECISION value.
If you omit this argument, then the function returns all
results.
Input
The VectorDistance function requires two input tables: target, which contains the target vectors, and ref,
which contains the reference vectors.
Output
Table 565: VectorDistance Output Table Schema
Input
The raw input is mobile telephone user data where each user (who is identified with UserID) has these
attributes (for a specific time period):
• CallDuration—total time spent on telephone calls (in minutes)
• SMS—number of Short Message Service (SMS) messages sent and received
• DataCounter—data consumed (in megabytes)
Table 566: VectorDistance Examples Raw Input Data
The CallDuration values are so much higher than the values of the other attributes that they skew the
distribution. Normalizing the raw data to the range [0, 1] solves this problem.
In the following table, the raw input data in the preceding table has been normalized to the range [0, 1] using
the Min-Max normalization technique.
Table 567: VectorDistance Examples Normalized Input Data
This technique transforms the value 'a' (in column A) to the value 'b' in the range [C, D], using this formula:
b=((a-minimum_value_in_A)/(maximum_value_in_A-minimum_value_in_A))*(D-C)+C
The following table shows the minimum and maximum values that the formula uses for each input table
column.
From the normalized input data, you choose one or more users to be the reference vector; the remaining
users are the target vectors. The choice of reference vector depends on the application. For example, if the
mobile telephone service is expanding its range to include a new area with similar users, then one or more
typical users (with average or median attribute values) can be the reference vector. When the company has
identified similar users in the new area, it can send them promotional offers.
For these examples, the reference vector is UserID 5. The following two tables are the reference and target
tables for the VectorDistance function.
Table 569: VectorDistance Examples Reference Table ref_mobile_data
SQL-MapReduce Call
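A sketch of the call (the target table name, the feature and value column names, and the two-input ON
clauses are assumptions; ref_mobile_data is the reference table from Table 569):

SELECT * FROM VectorDistance (
  ON target_mobile_data AS target PARTITION BY userid
  ON ref_mobile_data AS ref DIMENSION
  TargetIDColumns ('userid')
  TargetFeatureColumn ('feature')
  TargetValueColumn ('value1')
  DistanceMeasure ('cosine', 'euclidean', 'manhattan')
);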
Output
The following table (which is not output by the VectorDistance function) shows the distances of the target
vectors from the reference vector (UserID 5) and their similarity ranks. The shorter the distance, the higher
the similarity rank. Similarity rank is independent of measure—if relative distances are shorter in one
measure, they are shorter in all measures. UserID 3 is most similar to UserID 5.
Table 572: VectorDistance Example 1 Target Distances from Reference and Similarity Ranks
SQL-MapReduce Call
Output
As the following table shows, only UserID 2 and UserID 3 meet the threshold criteria.
Table 573: VectorDistance Example 2 Output Table
VWAP
Summary
The VWAP function computes the volume-weighted average price of a traded item (usually an equity share)
for each time interval in a series of equal-length time intervals.
VWAP = sum(volume*price)/sum(volume)
Usage
VWAP Syntax
Version 1.2
Arguments
Argument Category Description
Price Optional Specifies the name of the input table column that contains the price
at which the item traded. The default value is 'price'.
Volume Optional Specifies the name of the input table column that contains the
number of units traded in the transaction. The default value is
'volume'.
DT Optional Specifies the name of the input table column that contains the date
and time of the trade.
TimeInterval Optional Specifies the number of seconds in each time interval. The default
value is 0, which makes each row an interval, causing the function to
calculate no averages.
Input
You must partition the input table such that each partition contains all rows of the entity whose volume-
weighted average is to be calculated. For example, if the entity is a particular equity share, then all
transactions of that share must be in the same partition.
You must sort the input data on the DT column in ascending order.
Table 574: VWAP Input Table Schema
Output
Table 575: VWAP Output Table Schema
Example
Input
Table 576: VWAP Example Input Table stock_vol
SQL-MapReduce Call
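A sketch of the call, using the documented arguments (the partition column, the dt column name, and the
interval length are assumptions; Price and Volume show their documented default column names):

SELECT * FROM VWAP (
  ON stock_vol
  PARTITION BY id
  ORDER BY dt
  Price ('price')
  Volume ('volume')
  DT ('dt')
  TimeInterval ('600')
);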
WMAVG
Summary
The WMAVG (weighted moving average) function computes the average over a number of points in a time
series, applying weights to older values. The weights for the older values decrease arithmetically.
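For window size n and current value x(t), the standard arithmetically weighted moving average (the source
does not show the formula) is:

WMAVG(t) = ( n×x(t) + (n-1)×x(t-1) + ... + 1×x(t-n+1) ) / ( n + (n-1) + ... + 1 )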
Usage
WMAVG Syntax
Version 1.2
Arguments
Argument Category Description
TargetColumns Optional Specifies the names of the input columns for which the weighted
moving average is to be computed. The function copies these
columns to the output table. If you omit this argument, then the
function copies all input columns to the output table but does not
compute their weighted moving averages.
WindowSize Optional Specifies the number of old values to consider for computing the
new weighted moving average. The default value is '10'.
IncludeFirst Optional Specifies whether to output the first window_size rows. Because the
weighted moving average for the first window_size rows is
undefined, the function returns NULL values for those columns.
The default value is 'false'.
Output
Table 579: WMAVG Output Table Schema
Example
Input
The input is hypothetical stock price and volume data for three companies, IBM, General Electric (GE), and
Procter & Gamble (PG), for the period from '05/17/1961' to '06/21/1961'. The WMAVG function in
the example calculates the weighted moving average of stockprice and volume for each company.
Data is assumed to be partitioned such that each partition contains all the rows of an entity. For example, if
the weighted moving average of a particular equity share price is required, then all transactions of that equity
share must be part of one partition. It is assumed that the input rows are provided in the correct order.
Table 580: WMAVG Example Input Table stock_data
Output
Because the window size is 5, the values in the stockprice_mavg and volume_mavg columns show the
weighted average value of the previous five rows (or days in this example). Because the IncludeFirst
argument is set to 'true', the first five rows are shown in the output even though they contain null values for
those columns.
Table 581: WMAVG Example Output Table
Text Analysis
• LDA Functions
• Levenshtein Distance (LDist)
• Naive Bayes Text Classifier
• NER Functions (CRF Model Implementation)
• NER Functions (Max Entropy Model Implementation)
• nGram
• POSTagger
• Sentenizer
• Sentiment Extraction Functions
• Text Classifier
• Text_Parser
• TextChunker
• TextMorph
• TextTagging
• TextTokenizer
• TF_IDF
LDA Functions
Summary
The Latent Dirichlet Allocation (LDA) functions are:
• LDATrainer, which uses training data and parameters to build a topic model
• LDAInference, which uses the topic model to estimate the topic distribution in a set of documents
• LDATopicPrinter, which displays the readable information from the topic model
LDATrainer
Summary
The LDATrainer function uses training data and parameters to build a topic model, using an unsupervised
method to estimate the correlation between the topics and words according to the topic number and other
parameters. Optionally, the function generates the topic distributions for each training document.
The function uses an iterative algorithm; therefore, applying it to large data sets with a large number of
topics can be time-consuming.
The function assumes that the model table can fit into the memory of the vworkers.
Usage
LDATrainer Syntax
Version 1.1
Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the training
documents.
ModelTable Required Specifies the name for the model table that the function creates in the
database. This table must not already exist.
OutputTable Optional Specifies the name of the output table that contains the topic distribution
of each document in the input table, which the function creates in the
database. This table must not already exist. If you omit this argument, the
function does not generate this table.
TopicNum Required Specifies the number of topics for all the documents in the input table, an
INTEGER value in the range [2, 1000].
Alpha Optional Specifies a hyperparameter of the model, the prior smooth parameter for
the topic distribution over documents. As alpha decreases, fewer topics are
associated with each document. The default value is 0.1.
Eta Optional Specifies a hyperparameter of the model, the prior smooth parameter for
the word distribution over topics. As eta decreases, fewer words are
associated with each topic. The default value is 0.1.
DocIDColumn Required Specifies the name of the input column that contains the document
identifiers.
WordColumn Required Specifies the name of the input column that contains the words (one word
in each row).
Note:
The function might produce different results with different Seed settings and cluster configurations.
Input
Table 582: LDATrainer Training Table Schema
Note:
You can use the output of the function TextTokenizer with the argument OutputByWord('true') as
input to the LDATrainer function. Teradata recommends that you filter out words with very low and very
high frequency; otherwise, the topics may consist of common words that are not meaningful in the topic
model.
Output
The LDATrainer function outputs a message, a model table, and (optionally) an output table.
Table 583: LDATrainer Output Message Schema
Note:
Because the model table contents are in BYTEA format, they are not readable. To display the readable
content, use the function LDATopicPrinter.
Example
Input
The training table is a log of vehicle complaints. The category column indicates whether the car has been in a
crash.
Table 586: LDATrainer Example Training Table complaints
a
an
in
is
to
into
was
the
and
this
with
they
To generate a tokenized, filtered input file for the LDATrainer function, apply the function Text_Parser to
the training table:
The following query returns the output shown in the following table:
Table 587: LDATrainer Example Tokenized and Filtered Input Table complaints_traintoken
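The training call itself is not reproduced here; the sketch below is consistent with the output message and
the LDATopicPrinter summary (TopicNum 5, model table ldamodel, output table ldaout1). The column
names doc_id and token and the driver clause are assumptions:

SELECT * FROM LDATrainer (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('complaints_traintoken')
  ModelTable ('ldamodel')
  OutputTable ('ldaout1')
  TopicNum (5)
  DocIDColumn ('doc_id')
  WordColumn ('token')
);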
Output
Table 588: LDATrainer Example Message
message
Outputtable "ldaout1" is created successfully.
Training converged after 7 iterate steps with delta 4.2582574160277766E-5
There are 20 documents with 520 words in the training set, the perplexity is 92.139160
The following query returns the output shown in the following table:
LDAInference
Summary
The LDAInference function uses the model table generated by the function LDATrainer to infer the topic
distribution in a set of new documents. You can use the distribution for tasks such as classification and
clustering.
This function can be used with real-time applications. Refer to AMLGenerator.
Usage
LDAInference Syntax
Version 1.1
Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the new documents.
ModelTable Required Specifies the name of the model table generated by the function
LDATrainer.
Input
The LDAInference function requires an input table and a model table. Their schemas are the same as those
of the training table and model table of the function LDATrainer.
Output
The LDAInference function output table has the same schema as the output table of the LDATrainer
function.
Example
Input
The input table is a log of vehicle complaints.
Table 590: LDAInference Example Input Table complaints_test
doc_id text_data
1 ELECTRICAL CONTROL MODULE IS SHORTENING OUT, CAUSING THE VEHICLE
TO STALL. ENGINE WILL BECOME TOTALLY INOPERATIVE. CONSUMER HAD TO
CHANGE ALTERNATOR/ BATTERY AND STARTER, AND MODULE REPLACED 4
TIMES, BUT DEFECT STILL OCCURRING CANNOT DETERMINE WHAT IS CAUSING
THE PROBLEM.
2 ABS BRAKES FAIL TO OPERATE PROPERLY, AND AIR BAGS FAILED TO DEPLOY
DURING A CRASH AT APPROX. 28 MPH IMPACT. MANUFACTURER NOTIFIED.
3 WHILE DRIVING AT 60 MPH GAS PEDAL GOT STUCK DUE TO THE RUBBER THAT
IS AROUND THE GAS PEDAL.
4 THERE IS A KNOCKING NOISE COMING FROM THE CATALYITC
CONVERTER ,AND THE VEHICLE IS STALLING. ALSO, HAS PROBLEM WITH THE
STEERING.
5 CONSUMER WAS MAKING A TURN ,DRIVING AT APPROX 5- 10 MPH WHEN
CONSUMER HIT ANOTHER VEHICLE. UPON IMPACT, DUAL AIRBAGS DID NOT
DEPLOY . ALL DAMAGE WAS DONE FROM ENGINE TO TRANSMISSION,TO THE
FRONT OF VEHICLE, AND THE VEHICLE CONSIDERED A TOTAL LOSS.
6 WHEEL BEARING AND HUBS CRACKED, CAUSING THE METAL TO GRIND WHEN
MAKING A RIGHT TURN. ALSO WHEN APPLYING THE BRAKES, PEDAL GOES TO
THE FLOOR, CAUSE UNKNOWN. WAS ADVISED BY MIDAS NOT TO DRIVE
VEHICLE- WHEELE COULD COME OFF.
7 DRIVING ABOUT 5-10 MPH, THE VEHICLE HAD A LOW FRONTAL IMPACT IN
WHICH THE OTHER VEHICLE HAD NO DAMAGES. UPON IMPACT, DRIVER'S AND
THE PASSENGER'S AIR BAGS DID NOT DEPLOY, RESULTING IN INJURIES. PLEASE
PROVIDE FURTHER INFORMATION AND VIN#.
8 THE AIR BAG WARNING LIGHT HAS COME ON. INDICATING AIRBAGS ARE
INOPERATIVE.THEY WERE FIXED ONE AT THE TIME, BUT PROBLEM HAS
REOCCURRED.
9 CONSUMER WAS DRIVING WEST WHEN THE OTHER CAR WAS GOING EAST. THE
OTHER CAR TURNED IN FRONT OF CONSUMER'S VEHICLE, CONSUMER HIT
OTHER VEHICLE AND STARTED TO SPIN AROUND ,COULDN'T STOP, RESULTING
IN A CRASH. UPON IMPACT, AIRBAGS DIDN'T DEPLOY.
10 WHILE DRIVING ABOUT 65 MPH AND THE TRANSMISISON MADE A STRANGE
NOISE, AND THE LEFT FRONT AXLE LOCKED UP. THE DEALER HAS REPAIRED
THE VEHICLE.
a
an
in
is
to
into
was
the
and
this
with
they
but
will
The following query returns the output shown in the following table:
Table 591: LDAInference Example Tokenized and Filtered Input Table complaints_testtoken
SQL-MapReduce Call
The table ldamodel was generated by the function LDATrainer.
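A sketch of the call (the output table name ldaout2 is taken from the message below; the column names and
driver clause are assumptions):

SELECT * FROM LDAInference (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('complaints_testtoken')
  ModelTable ('ldamodel')
  OutputTable ('ldaout2')
  DocIDColumn ('doc_id')
  WordColumn ('token')
);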
Output
Table 592: LDAInference Example Output Message
message
There are 10 valid documents with 153 recognized words in the input, the perplexity is 145.758867
Outputtable "ldaout2" is created successfully.
The following query returns the output shown in the following table:
LDATopicPrinter
Summary
The LDATopicPrinter function displays in readable form information from the binary model table
generated by the function LDATrainer.
LDATopicPrinter Syntax
Version 1.1
Arguments
Argument Category Description
Summary Optional Specifies whether to display only a summary of the information in
the model table. The default value is 'false'.
OutputTopicWordNum Optional Specifies the number of top topic words and their topic identifiers to
include in the output table for each training document. The value
topic_words must be a positive INTEGER. The default value, 'all',
specifies all topic words and their topic identifiers.
WordWeight Optional Specifies whether to display the weight (probability of occurrence) of
each unique word in each topic. The weights for the unique words in
each topic are normalized to 1. The default value is 'false'.
WordCount Optional Specifies whether to display the count (number of occurrences) of
each unique word in each topic. Topic distribution is factored into
word counts. The default value is 'false'.
OutputByWord Optional Specifies whether to display each topic-word pair in its own row. The
default value is 'true'. If you specify 'false', each row contains a unique
topic and all words that occur in that topic, separated by commas.
Input
The input to the LDATopicPrinter function is the model table generated by the function LDATrainer.
Output
The LDATopicPrinter function outputs a message and an output table.
The schema of the output table depends on the values of the arguments ShowSummary and OutputByWord.
If ShowSummary is true, the function outputs only the preceding table.
Table 595: LDATopicPrinter Output Table (showsummary=false and outputbyword=true)
Examples
• Input
• Example 1: ShowSummary ('true')
• Example 2: OutputByWord ('false')
• Example 3: ShowWordWeight('true') and ShowWordCount('true')
Input
Model table ldamodel, generated by the LDATrainer Example.
SQL-MapReduce Call
Output
message
The model table is trained with the parameters: topicNumber:5, vocabularySize:309, alpha:0.100000, eta:
0.100000
There are 20 documents with 520 words in the training set, the perplexity is 92.139160
SQL-MapReduce Call
Output
topicid wordsequence
0 wipers,would,switch,when,on,recall,windshield,notified,manufacturer,dealer
1 vehicle,causing,consumer,replaced,which,module,control,out,has,at
2 vehicle,manufacturer,would,transmission,when,problem,at,has,also,dealer
3 did,not,deploy,hit,vehicle,side,air,passenger's,bags,head-on
4 vehicle,side,car,engine,while,fire,for,from,left,wheel
SQL-MapReduce Call
Output
Levenshtein Distance (LDist)
Summary
The Levenshtein Distance (LDist) function computes the Levenshtein distance between two text values. The
Levenshtein distance (or edit distance) is the number of edits needed to transform one string into the other.
An edit is an insertion, deletion, or substitution of a single character.
The Levenshtein distance is useful for fuzzy matching of sequences and strings. The LDist function is often
used to resolve a user-entered value to a standard value. For example, a user might enter "Jon Dow" when
searching for "John Doe".
A typical application of the LDist function is genome sequencing.
Usage
Arguments
Argument Category Description
TargetColumn Required Specifies the name of the input column that contains the target
text.
Source Required Specifies the names of the input columns that contain the source
text.
Threshold Optional Specifies the value that determines whether to return the
Levenshtein distance for a source-target pair. The threshold must be
a positive integer. The function returns the Levenshtein distance for
a pair if it is less than or equal to threshold; otherwise, the function
returns -1. By default, the function returns the Levenshtein
distance of every pair.
Input
Table 600: Levenshtein Distance (LDist) Input Table Schema
Output
Table 601: Levenshtein Distance (LDist) Output Table Schema
Example
Input
A typical application of this function is genome sequencing, to find differences in base pairs (Adenine(A),
Thymine(T), Cytosine(C), Guanine(G)).
SQL-MapReduce Call
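A sketch of the call, using the documented arguments (the input table and column names are hypothetical):

SELECT * FROM LDist (
  ON ldist_input
  TargetColumn ('tar_text')
  Source ('src_text')
);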
Output
Table 603: Levenshtein Distance (LDist) Example Output Table
Naive Bayes Text Classifier
Summary
The Naive Bayes Text Classifier is a variant of the Naive Bayes classification algorithm that is designed
specifically for document classification.
Note:
For information about the Naive Bayes classification algorithm and functions, refer to the chapter Naive
Bayes.
NaiveBayesTextClassifierTrainer
Summary
The NaiveBayesTextClassifierTrainer function takes training data as input and outputs a model table.
Usage
NaiveBayesTextClassifierTrainer Syntax
Version 1.1
Note:
Specify either the Categories argument or the categories_table, but not both.
Note:
Specify either the Stop_Words argument or the stop_words_table, but not both.
Input
The NaiveBayesTextClassifierTrainer function has one required input table, token, and two optional input
tables, categories and stop_words.
The token table, which contains the classified training tokens, is usually generated by a tokenizing function,
such as TextTokenizer or Text_Parser. The following table describes its schema.
Table 604: NaiveBayesTextClassifierTrainer Token Table Schema
The categories table contains all possible prediction categories. If you omit this table, then you must specify
all possible prediction categories with the Categories argument.
The stop_words table contains all possible stop words (a, an, the, and so on). If you omit this table, then you
must specify all possible stop words with the Stop_Words argument.
Table 606: NaiveBayesTextClassifierTrainer Stop_Words Table Schema
Output
The NaiveBayesTextClassifierTrainer function outputs a model table, described in the following table.
Table 607: NaiveBayesTextClassifierTrainer Model Table Schema
Examples
• English Example
• Chinese Example
English Example
Input
The training table is a log of vehicle complaints. The category column identifies whether the car has been in a
crash.
Table 609: NaiveBayesTextClassifierTrainer English Example Training Table complaints
Output
The following query returns the output shown in the following table:
Chinese Example
This example uses two files, news.data and stop_words.data. You must install these files onto the database
with the command \install filename.ext.
Input
The training table is a collection of categorized news articles in Simplified Chinese, from news.data.
To create the training table, use this statement:
To load the stop words table with data from stop_words.data, use this command:
Output
The following query returns the output shown in the following table:
NaiveBayesTextClassifierPredict
Summary
The NaiveBayesTextClassifierPredict function uses the model table generated by the
NaiveBayesTextClassifierTrainer function to predict outcomes for test data.
This function can be used with real-time applications. Refer to AMLGenerator.
Usage
NaiveBayesTextClassifierPredict Syntax
Version 1.1
Arguments
Argument Category Description
InputTokenColumn Required Specifies the name of the input_table column that contains the
tokens.
ModelType Optional Specifies the model type of the text classifier. The default value is
'Multinomial'.
DocIDColumn Required Specifies the names of the input_table columns that contain the
document identifier.
ModelTokenColumn Optional Specifies the name of the model_table column that contains the
tokens. The default value is the first column of model_table.
ModelCategoryColumn Optional Specifies the name of the model_table column that contains the
prediction categories. The default value is the second column of
model_table.
ModelProbColumn Optional Specifies the name of the model_table column that contains the
token counts. The default value is the third column of
model_table.
Note:
Specify either all or none of the arguments ModelTokenColumn, ModelCategoryColumn, and
ModelProbColumn.
Input
The NaiveBayesTextClassifierPredict function has two required input tables, the model_table output by the
function NaiveBayesTextClassifierTrainer, and input_table, which contains the test data for which to predict
outcomes.
The test data must be in the form of document-token pairs (as in the following table). To transform the
input documents into this form, input them to the function TextTokenizer or Text_Parser.
Table 611: NaiveBayesTextClassifierPredict Input Table Schema
Output
The NaiveBayesTextClassifierPredict function outputs a table of predictions for the test data.
Table 612: NaiveBayesTextClassifierPredict Output Table Schema
English Example
Input
The input table is a log of vehicle complaints. The example applies TextTokenizer to the complaints_test log
to generate test data, and uses the model complaints_tokens_model, generated by
NaiveBayesTextClassifierTrainer.
Table 613: NaiveBayesTextClassifierPredict English Example Input Table complaints
doc_id text_data
1 ELECTRICAL CONTROL MODULE IS SHORTENING OUT, CAUSING THE VEHICLE TO
STALL. ENGINE WILL BECOME TOTALLY INOPERATIVE. CONSUMER HAD TO
CHANGE ALTERNATOR/ BATTERY AND STARTER, AND MODULE REPLACED 4
TIMES, BUT DEFECT STILL OCCURRING CANNOT DETERMINE WHAT IS CAUSING
THE PROBLEM.
2 ABS BRAKES FAIL TO OPERATE PROPERLY, AND AIR BAGS FAILED TO DEPLOY
DURING A CRASH AT APPROX. 28 MPH IMPACT. MANUFACTURER NOTIFIED.
3 WHILE DRIVING AT 60 MPH GAS PEDAL GOT STUCK DUE TO THE RUBBER THAT IS
AROUND THE GAS PEDAL.
4 THERE IS A KNOCKING NOISE COMING FROM THE CATALYITC CONVERTER ,AND
THE VEHICLE IS STALLING. ALSO, HAS PROBLEM WITH THE STEERING.
5 CONSUMER WAS MAKING A TURN ,DRIVING AT APPROX 5- 10 MPH WHEN
CONSUMER HIT ANOTHER VEHICLE. UPON IMPACT, DUAL AIRBAGS DID NOT
DEPLOY . ALL DAMAGE WAS DONE FROM ENGINE TO TRANSMISSION,TO THE
FRONT OF VEHICLE, AND THE VEHICLE CONSIDERED A TOTAL LOSS.
6 WHEEL BEARING AND HUBS CRACKED, CAUSING THE METAL TO GRIND WHEN
MAKING A RIGHT TURN. ALSO WHEN APPLYING THE BRAKES, PEDAL GOES TO
THE FLOOR, CAUSE UNKNOWN. WAS ADVISED BY MIDAS NOT TO DRIVE
VEHICLE- WHEELE COULD COME OFF.
7 DRIVING ABOUT 5-10 MPH, THE VEHICLE HAD A LOW FRONTAL IMPACT IN
WHICH THE OTHER VEHICLE HAD NO DAMAGES. UPON IMPACT, DRIVER'S AND
THE PASSENGER'S AIR BAGS DID NOT DEPLOY, RESULTING IN INJURIES. PLEASE
PROVIDE FURTHER INFORMATION AND VIN#.
8 THE AIR BAG WARNING LIGHT HAS COME ON. INDICATING AIRBAGS ARE
INOPERATIVE.THEY WERE FIXED ONE AT THE TIME, BUT PROBLEM HAS
REOCCURRED.
9 CONSUMER WAS DRIVING WEST WHEN THE OTHER CAR WAS GOING EAST. THE
OTHER CAR TURNED IN FRONT OF CONSUMER'S VEHICLE, CONSUMER HIT OTHER
VEHICLE AND STARTED TO SPIN AROUND ,COULDN'T STOP, RESULTING IN A
CRASH. UPON IMPACT, AIRBAGS DIDN'T DEPLOY.
10 WHILE DRIVING ABOUT 65 MPH AND THE TRANSMISISON MADE A STRANGE
NOISE, AND THE LEFT FRONT AXLE LOCKED UP. THE DEALER HAS REPAIRED THE
VEHICLE.
SQL-MapReduce Call
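A sketch of the call (the tokenized input table name, its column names, and the two-input ON clauses are
assumptions; complaints_tokens_model is the model named above):

SELECT * FROM NaiveBayesTextClassifierPredict (
  ON complaints_test_tokenized AS predicts PARTITION BY doc_id
  ON complaints_tokens_model AS model DIMENSION
  InputTokenColumn ('token')
  ModelType ('Multinomial')
  DocIDColumn ('doc_id')
);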
Output
Chinese Example
This example uses two files, news.data and stop_words.data. You must install these files onto the database
with the command \install filename.ext.
SQL-MapReduce Call
NER Functions (CRF Model Implementation)
Summary
Named entity recognition (NER) is a process for finding specified entities in text. For example, a simple news
named-entity recognizer for English might find the person “John J. Smith” and the location “Seattle” in the
text string “John J. Smith lives in Seattle.”
NER functions let you specify how to extract named entities when training the data models. The Aster
Analytics Foundation provides two sets of NER functions.
The NER functions that use the Conditional Random Fields (CRF) model are:
• NERTrainer
• NER
• NEREvaluator
NERTrainer
Summary
The NERTrainer function takes training data and outputs a CRF model (a binary file) that can be specified
in the function NER and NEREvaluator.
Usage
NERTrainer Syntax
Version 1.1
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text
to analyze.
ExtractorJAR Optional Specifies the name of the JAR file that contains the Java classes that
extract features. You must install this JAR file in Aster Database
under the user search path before calling the function.
Note:
The name jar_file is case-sensitive.
FeatureTemplate Required Specifies the name of the file that specifies how to generate features
when training the model. You must install this feature template file
in Aster Database under the user search path before calling the
function. For more information about template_file, refer to Feature
Template.
ModelFile Required Specifies the name of the model file that the function generates and
installs in Aster Database.
Language Optional Specifies the language of the input text:
• 'en' (English, the default)
• 'zh_CN' (Simplified Chinese)
• 'zh_TW' (Traditional Chinese)
MaxIterNum Optional Specifies the maximum number of iterations. The default value is
1000.
Eta Optional Specifies the tolerance of the termination criterion. Defines the
differences of the values of the loss function between two sequential
epochs. The default value is 0.0001.
When training a model, the function performs n-times iterations. At
the end of each epoch, the function calculates the loss or cost
function on the training samples. If the loss function value change is
very small between two sequential epochs, the function considers
the training process to have converged.
Eta=(f(n)-f(n-1))/f(n-1)
Feature Template
The template_file has two parts. Part 1 declares the classes used to extract features and Part 2 specifies the
features to use to train the model. For example:
Part 1
Part 1 of the example template file declares three extractor classes—Defaul_Token, Begin_with_Uppercase,
and com.asterdata.ner.SuffixExtractor, with serial numbers 0, 1, and 2, respectively. (Serial numbers must
start with 0 and be incremented by 1.)
Defaul_Token and Begin_with_Uppercase are default extractor classes, defined by the function. The
following table lists the default extractor classes and describes the features that they extract.
Table 616: NERTrainer Default Extractor Classes and Features
package com.asterdata.sqlmr.text_analysis.ner;
import java.io.Serializable;
import java.util.List;
/*
* To define a function that generates features from a sequence,
* you must implement this interface.
*/
public interface Extractor extends Serializable
{
/**
* extract the feature of a token
* @param sequence
* @param i, the current token index
* @return the feature flag
*/
String extract(List<String> sequence, int i);
}
Suppose that the function applies the extractor classes in the example template file to the input text "More
restaurants open in San Diego." For the token "More":
• Defaul_Token extracts the feature "More".
• Begin_with_Uppercase extracts the feature "T" (true).
• com.asterdata.ner.SuffixExtractor extracts the feature "e".
Applying the same three extractor classes to the entire input text generates this matrix:
More T e
restaurants F s
open F n
in F n
San T n
Diego T o
. F .
Input
The input table must have a column of text to be analyzed. The table can have other columns, but the
function ignores them.
Table 618: NERTrainer Input Table Schema
Output
The function outputs a message to the console and a CRF model (a binary file installed in the database).
Table 619: NERTrainer Output Message Schema
Input
The input train table, ner_sports_train, is a collection of sports news items in XML format (with tags like
<START:PER> Roger <END>). There are 500 rows of training data.
Table 620: NERTrainer Example Input Table ner_sports_train
id content
2 CRICKET - <START:ORG> LEICESTERSHIRE <END> TAKE OVER AT TOP AFTER INNINGS
VICTORY .
3 <START:LOC> LONDON <END> 1996-08-30
4 West Indian all-rounder <START:PER> Phil Simmons <END> took four for 38 on Friday as
<START:ORG> Leicestershire <END> beat <START:ORG> Somerset <END> by an innings and
39 runs in two days to take over at the head of the county championship .
5 Their stay on top
6 After bowling <START:ORG> Somerset <END> out for 83 on the opening morning at
<START:LOC> Grace Road <END>
7 Trailing by 213
8 <START:ORG> Essex <END>
9 <START:PER> Hussain <END>
10 By the close <START:ORG> Yorkshire <END> had turned that into a 37-run advantage but off-
spinner <START:PER> Such <END> had scuttled their hopes
... ...
The function generates a model file (ner_model.bin) by training on the input data with a template file
(template_1.txt) that specifies how to extract features from the text. The example template file has the
two-part structure described in Feature Template.
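The training call has the following general form. This is a sketch based on the arguments documented above; the PARTITION BY clause placement is an assumption and may vary by release:

SELECT * FROM NERTrainer(
    ON ner_sports_train PARTITION BY 1
    TextColumn ('content')
    FeatureTemplate ('template_1.txt')
    ModelFile ('ner_model.bin')
);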
Output
Table 621: NERTrainer Example Output Table
train_result
Model generated.
Training time(s): 7.468
File name: ner_model.bin
File size(KB): 373
Model successfully installed.
NER
Summary
The NER function takes input documents and extracts specified entities, using one or more CRF models
(generated by the function NERTrainer) and, if appropriate, rules (regular expressions) or a dictionary.
The function uses models to extract the names of persons, locations, and organizations; rules to extract
entities that conform to rules (such as phone numbers, times, and dates); and a dictionary to extract known
entities.
Usage
NER Syntax
Version 1.1
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text
to analyze.
Models Optional Specifies the CRF models (binary files) to use, generated by
NERTrainer. If you specified the ExtractorJAR argument in the
NERTrainer call that generated model_file, then you must specify
the same jar_file in this argument. You must install model_file and
jar_file in Aster Database under the user search path before calling
the NER function.
Note:
The names model_file and jar_file are case-sensitive.
Input
The NER function has a required input table, an optional rules table, and an optional dictionary table.
Note:
Use the function TextTokenizer to tokenize the input text before inputting it to the NER function.
The following table describes the required columns of the input table. The table can have other columns, but
the function ignores them.
Table 622: NER Input Table Schema
Output
Table 625: NER Output Table Schema
Example
Input
This example uses two input tables:
• Input test table ner_sports_test contains the text to be analyzed.
• Rules table rule_table contains regular expressions used to parse emails. This table must be given with the
alias “rules”.
Model file ner_model.bin, generated by the NERTrainer Example, is also used.
id content
528 email [email protected] to contact for all sport info
529 email [email protected] to contact for all cricket info
530 email [email protected] to contact for all tennis info
531 1= <START:PER> Igor Trandenkov <END> ( <START:LOC> Russia <END> ) 5.86
532 3. <START:PER> Maksim Tarasov <END> ( <START:LOC> Russia <END> ) 5.86
533 4. <START:PER> Tim Lobinger <END> ( <START:LOC> Germany <END> ) 5.80
534 5. <START:PER> Igor Potapovich <END> ( <START:LOC> Kazakstan <END> ) 5.80
535 6. <START:PER> Jean Galfione <END> ( <START:LOC> France <END> ) 5.65
536 7. <START:PER> Pyotr Bochkary <END> ( <START:LOC> Russia <END> ) 5.65
537 8. <START:PER> Dmitri Markov <END> ( <START:LOC> Belarus <END> ) 5.65
583 <START:LOC> GENEVA <END> 1996-08-30
584 <START:ORG> UEFA <END> came down heavily on Belgian club <START:ORG> Standard
Liege <END> on Friday for disgraceful behaviour in an Intertoto final match against
<START:ORG> Karlsruhe <END> of <START:LOC> Germany <END> .
... ...
type regex
email [\w\-]([\.\w])+[\w]+@([\w\-]+\.)+[a-zA-Z]{2,4}
SQL-MapReduce Call
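A sketch of the call for this example, using the documented TextColumn and Models arguments; the dimensional ON clause that supplies the rules table under the required alias "rules" is written in the usual SQL-MapReduce style, which may vary by release:

SELECT * FROM NER(
    ON ner_sports_test
    ON rule_table AS rules DIMENSION
    TextColumn ('content')
    Models ('ner_model.bin')
);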
Output
Depending on the text content and the model, the function outputs the entity, its type, and the extraction
approach (CRF model or rule-based).
NEREvaluator
Summary
The NEREvaluator function evaluates a CRF model (generated by the function NERTrainer).
Usage
NEREvaluator Syntax
Version 1.1
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text
to analyze.
Model Required Specifies the CRF model file to evaluate, generated by NERTrainer.
If you specified the ExtractorJAR argument in the NERTrainer call
that generated model_file, then you must specify the same jar_file in
this argument. You must install model_file and jar_file in Aster
Database under the user search path before calling the NER
function.
Note:
The names model_file and jar_file are case-sensitive.
Input
The input is a CRF model (a binary file) generated by the function NERTrainer.
Output
Table 629: NEREvaluator Output Table Schema
Input
Model file ner_model.bin, generated by the NERTrainer Example.
SQL-MapReduce Call
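A sketch of the call; the use of ner_sports_train as the evaluation data and the PARTITION BY clause are assumptions, while TextColumn and Model are the documented arguments:

SELECT * FROM NEREvaluator(
    ON ner_sports_train PARTITION BY 1
    TextColumn ('content')
    Model ('ner_model.bin')
);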
Output
Table 630: NEREvaluator Example Output Table
Summary
Named entity recognition (NER) is a process of finding instances of specified entities in text. For example, a
simple news named-entity recognizer for the English language might find the person “John J. Smith” and the
location “Seattle” in the text “John J. Smith lives in Seattle”.
NER functions let you specify how to extract entities when training the data models. The Aster Analytics
Foundation provides two sets of NER functions.
The NER functions that use the Maximum Entropy model are:
• TrainNamedEntityFinder, which takes training data and outputs a Max Entropy data model
• FindNamedEntity, which takes input documents (in XML format) and extracts specified entities, using a
Max Entropy model and, if appropriate, rules (regular expressions) or a dictionary
FindNamedEntity
Summary
The FindNamedEntity function evaluates the input text, identifies tokens based on the specified model, and
outputs the tokens with detailed information. The function does not identify sentences; it simply tokenizes.
Token identification is not case-sensitive.
Usage
FindNamedEntity Syntax
Version 1.2
Note:
If the input is a query, you must map it to an alias.
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text
to analyze.
Model Optional Specifies the model items to load. Optional if you specify
configuration_table; required otherwise (and you cannot specify
'all').
If you specify both configuration_table and this argument, then the
function loads the specified model items from configuration_table. If
you specify configuration_table but omit this argument, its default
value is 'all' (every model item from configuration_table).
The entity_type is the name of an entity type (for example, PERSON,
LOCATION, or EMAIL), which appears in the output table.
The model_type is one of these model types:
• 'max entropy': maximum entropy language model generated by
training
• 'rule': rule-based model, a plain text file with one regular
expression on each line
• 'dictionary': dictionary-based model, a plain text file with one
word on each line
• 'reg exp': regular expression that describes entity_type
If model_type is 'reg exp', specify regular_expression (a regular
expression that describes entity_type); otherwise, specify model_file
(the name of the model file). Before calling the function, add the
location of every specified model_file to the user/session default
search path.
If you specify configuration_table, you can use entity_type as a
shortcut. For example, if the configure_table has the row
'organization, max entropy, en-ner-organization.bin', you can
specify Model('organization') as a shortcut for
Model('organization:max entropy:en-ner-organization.bin').
Note:
For model_type 'max entropy', if you specify configuration_table
and omit this argument, then the JVM of the worker node needs
more than 2GB of memory.
Default English-language models are provided with the SQL-MapReduce functions. Before using these
models, you must install them (using the \install command in ACT) and create a default configure_table, as
follows:
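A minimal sketch of this setup follows. The configuration table name nameFind_configure, its column names, and the DISTRIBUTE BY REPLICATION clause are illustrative assumptions; the row values follow the entity_type, model_type, model_file pattern described for the Model argument:

\install en-ner-organization.bin

CREATE TABLE nameFind_configure (
    entity_type varchar,
    model_type varchar,
    model_file varchar
) DISTRIBUTE BY REPLICATION;

INSERT INTO nameFind_configure
VALUES ('organization', 'max entropy', 'en-ner-organization.bin');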
Input
The following table describes the required input table columns. The table can have additional columns, but
the function ignores them.
Table 632: FindNamedEntity Input Table Schema
Output
Table 634: FindNamedEntity Output Table Schema
Example
Input
Table 635: FindNamedEntity Example Input Table assortedtext_input
id source content
1001 misc contact Alan by email at [email protected] for all sport info
1002 misc contact Mark at [email protected] for all cricket info
1003 misc contact Roger at [email protected] for all tennis info
1004 wiki The contiguous United States consists of the 48 adjoining U.S. states plus
Washington, D.C., on the continent of North America
1005 wiki California's economy is centered onTechnology,Finance,real estate services,
Government, and professional, Scientific and Technical business Services; together
comprising 58% of the State Government economy
1006 wiki Houston is the largest city in Texas and the fourth-largest in the United States, while
San Antonio is the second largest and seventh largest in the state.
1007 wiki Thomas is a photographer whose natural landscapes of the West are also a statement
about the importance of the preservation of the wildness
SQL-MapReduce Call
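A sketch of the call. The Model value uses the entity_type:model_type:specification pattern described in the Arguments section, with the email regular expression taken from the NER example rules table; exact syntax may vary by release:

SELECT * FROM FindNamedEntity(
    ON assortedtext_input
    TextColumn ('content')
    Model ('email:reg exp:[\w\-]([\.\w])+[\w]+@([\w\-]+\.)+[a-zA-Z]{2,4}')
);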
Output
Table 636: FindNamedEntity Example Output Table
TrainNamedEntityFinder
Summary
The TrainNamedEntityFinder function takes training data and outputs a Max Entropy data model. The
function is based on OpenNLP, and follows its annotation. For more information about OpenNLP, see
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html.
The trainer supports only the English language.
Usage
TrainNamedEntityFinder Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text to
analyze.
EntityType Required Specifies the entity type to be trained (for example, PERSON). The input
training documents must contain the same tag.
Model Required Specifies the name of the data model file to be generated.
Input
Table 637: TrainNamedEntityFinder Input Table Schema
<START:entity_type>entity<END>
For example:
Output
The function outputs a message to the console and a Max Entropy model (a binary file). The file is
automatically installed on the Aster Database cluster.
Table 638: TrainNamedEntityFinder Output Message Schema
Example
Input
The input table, nermem_sports_train, is a collection of sports news items in XML format (with tags like
<START:PER> Roger <END>). There are 50 rows of training data with an id column and a content column
(containing text information). The function generates a model file, location.sports, and accepts only one tag
('LOCATION') in the EntityType argument.
Table 639: TrainNamedEntityFinder Example Input Table nermem_sports_train
id content
2 CRICKET - <START:ORG> LEICESTERSHIRE <END> TAKE OVER AT TOP AFTER INNINGS
VICTORY .
3 <START:LOCATION> LONDON <END> 1996-08-30
4 West Indian all-rounder <START:PER> Phil Simmons <END> took four for 38 on Friday as
<START:ORG> Leicestershire <END> beat <START:ORG> Somerset <END> by an innings and
39 runs in two days to take over at the head of the county championship .
5 Their stay on top
6 After bowling <START:ORG> Somerset <END> out for 83 on the opening morning at
<START:LOCATION> Grace Road <END>
7 Trailing by 213
8 <START:ORG> Essex <END>
9 <START:PER> Hussain <END>
10 By the close <START:ORG> Yorkshire <END> had turned that into a 37-run advantage but off-
spinner <START:PER> Such <END> had scuttled their hopes
11 At the <START:LOCATION> Oval <END>
12 He was well backed by <START:LOCATION> England <END> hopeful <START:PER> Mark
Butcher <END> who made 70 as <START:ORG> Surrey <END> closed on 429 for seven
... ...
SQL-MapReduce Call
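A sketch of the training call, using the documented TextColumn, EntityType, and Model arguments; the PARTITION BY 1 clause is an assumption typical of trainer functions:

SELECT * FROM TrainNamedEntityFinder(
    ON nermem_sports_train PARTITION BY 1
    TextColumn ('content')
    EntityType ('LOCATION')
    Model ('location.sports')
);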
Output
Table 640: TrainNamedEntityFinder Example Output Table
train_result
model installed
Summary
The EvaluateNamedEntityFinderRow and EvaluateNamedEntityFinderPartition functions operate as a row
and a partition function, respectively. Each function takes a set of evaluating data and generates the
precision, recall, and F-measure values of a specified maximum entropy data model. Neither function
supports regular-expression-based or dictionary-based models.
Usage
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text to
analyze.
Model Required Specifies name of the model file to evaluate.
Input
Table 641: EvaluateNamedEntityFinderRow Input Table Schema
Example
The EvaluateNamedEntityFinderPartition function invokes the EvaluateNamedEntityFinderRow function,
which takes as input the test data (nermem_sports_test). The test set is a collection of sports news items in
XML format, similar to the training data. The function evaluates the efficacy of the maximum entropy data
model location.sports (from TrainNamedEntityFinder) in terms of its precision, recall, and F-measure values.
Input
Table 643: EvaluateNamedEntityFinderRow Example Input Table nermem_sports_test
id content
3 <START:LOCATION> LONDON <END> 1996-08-30
4 West Indian all-rounder <START:PER> Phil Simmons <END> took four for 38 on Friday as
<START:ORG> Leicestershire <END> beat <START:ORG> Somerset <END> by an innings and
39 runs in two days to take over at the head of the county championship .
6 After bowling <START:ORG> Somerset <END> out for 83 on the opening morning at
<START:LOCATION> Grace Road <END>
9 <START:PER> Hussain <END>
10 By the close <START:ORG> Yorkshire <END> had turned that into a 37-run advantage but off-
spinner <START:PER> Such <END> had scuttled their hopes
11 At the <START:LOCATION> Oval <END>
12 He was well backed by <START:LOCATION> England <END> hopeful <START:PER> Mark
Butcher <END> who made 70 as <START:ORG> Surrey <END> closed on 429 for seven
14 Australian <START:PER> Tom Moody <END> took six for 82 but <START:PER> Chris Adams
<END>
16 They were held up by a gritty 84 from <START:PER> Paul Johnson <END> but ex-England fast
bowler <START:PER> Martin McCague <END> took four for 55 .
20 <START:LOCATION> LONDON <END> 1996-08-30
22 <START:LOCATION> Leicester <END> : <START:ORG> Leicestershire <END> beat
<START:ORG> Somerset <END> by an innings and 39 runs .
... ...
SQL-MapReduce Call
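Because the partition function invokes the row function, a call of the following nested form fits this example (a sketch; clause placement may vary by release):

SELECT * FROM EvaluateNamedEntityFinderPartition(
    ON (
        SELECT * FROM EvaluateNamedEntityFinderRow(
            ON nermem_sports_test
            TextColumn ('content')
            Model ('location.sports')
        )
    ) PARTITION BY 1
);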
Output
Table 644: EvaluateNamedEntityFinderPartition Example Output Table
nGram
Summary
The nGram function tokenizes (splits) an input stream of text and outputs multi-word tokens (called
n-grams), based on the specified delimiter and reset parameters. nGram provides more flexibility than
standard tokenization when performing text analysis. Many two-word phrases carry important meaning
(for example, "machine learning") that unigrams (single-word tokens) do not capture. N-gram output,
combined with additional analytical techniques, can be useful for performing sentiment analysis, topic
identification, and document classification.
nGram considers each input row to be one document, and it returns a row for each unique n-gram in each
document. nGram also returns, for each document, the counts of each n-gram and the total number of n-
grams.
Background
For general information about tokenization, see http://en.wikipedia.org/wiki/Lexical_analysis#Tokenizer.
nGram Syntax
Version 1.5
Arguments
Argument Category Description
TextColumn Required The name of the column that contains the input text. Input columns
must contain string SQL types.
Delimiter Optional A regular expression that specifies the character or string that
separates words in the input text. The default value is the space
character (' ').
Grams Required A list of integers or ranges of integers that specify the length, in
words, of each n-gram (that is, the value of n). A range_of_values
has the syntax integer1-integer2, where integer1 <= integer2. The
values of n, integer1, and integer2 must be positive.
OverLapping Optional A Boolean value that specifies whether the function allows
overlapping n-grams. When this value is 'true' (the default), each
word in each sentence starts an n-gram, if enough words follow it
(in the same sentence) to form a whole n-gram of the specified size.
For information on sentences, see the description of the Reset
argument.
ToLowerCase Optional A Boolean value that specifies whether the function converts all
letters in the input text to lowercase. The default value is 'true'.
Note:
The total number of n-grams is not necessarily the number of
unique n-grams.
TotalCountColumn Optional The name of the column to return if the value of the Total argument
is 'true'. The default value is 'totalcnt'.
Accumulate Optional The names of the columns to return for each n-gram. These
columns cannot have the same names as those specified by the
arguments NGramColumn, NumGramsColumn, and
TotalCountColumn. By default, the function returns all input
columns for each n-gram.
NGramColumn Optional The name of the column that is to contain the generated n-grams.
The default value is 'ngram'.
NumGramsColumn Optional The name of the column that is to contain the length of the n-gram (in
words). The default value is 'n'.
FrequencyColumn Optional The name of the column that is to contain the count of each unique
n-gram (that is, the number of times that each unique n-gram
appears in the document). The default value is 'frequency'.
Input
Each row of the input table contains a document to be tokenized. The input table can have additional
columns, some or all of which the function returns in the output table.
Table 645: Input Table Schema
Output
The output table has a row for each unique n-gram in each input document.
Table 646: Output Table Schema
Examples
The nGram function tokenizes a given document based on the length specified by the Grams argument. It
also provides additional control of tokenization by allowing the user to specify punctuation delimiters with
the Punctuation argument.
These examples show the use of the Total and Overlapping arguments:
• Input
• Example 1: Overlapping ('true') and TotalGramCount ('true')
• Example 2: Overlapping ('false') and TotalGramCount ('false')
Input
The input table contains paragraphs about common analytics topics (regression, decision trees, and so on).
Table 647: nGram Example Input Table paragraphs_input
Example 1: Overlapping ('true') and TotalGramCount ('true')
SQL-MapReduce Call
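A sketch of the call, assuming the input table's text column is named paratext and its identifier column paraid (the names shown for paragraphs_input in the TextChunker example); the TotalGramCount argument name is taken from the example title:

SELECT * FROM nGram(
    ON paragraphs_input
    TextColumn ('paratext')
    Grams ('2')
    OverLapping ('true')
    TotalGramCount ('true')
    Accumulate ('paraid')
);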
Output
Example 2: Overlapping ('false') and TotalGramCount ('false')
SQL-MapReduce Call
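Under the same assumptions, the second call differs only in the two Boolean arguments:

SELECT * FROM nGram(
    ON paragraphs_input
    TextColumn ('paratext')
    Grams ('2')
    OverLapping ('false')
    TotalGramCount ('false')
    Accumulate ('paraid')
);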
Output
POSTagger
Summary
The POSTagger function generates part-of-speech (POS) tags for the words contained in the input text
(typically sentences). POS tagging is the first step in the syntactic analysis of a language, and an important
preprocessing step in many natural language processing applications.
Background
The POSTagger function was developed on the Penn Treebank Project and Chinese Penn Treebank Project
datasets. Its POS tags comply with the tags defined by those two projects.
Usage
POSTagger Syntax
Version 2.1
Note:
If you intend to use the POSTagger output table as input to the
function TextChunker, then this argument must specify the input
table columns that comprise the partition key.
Input
The POSTagger function requires a model file and an input table.
Two model files are provided with this function:
• pos_model_2.0_en_141008.bin for English
• pos_model_2.0_zh_cn_141008.bin for Simplified Chinese
Note:
Before running POSTagger, add the model file locations to the default search path for the user or session.
The following table describes the input table columns that you can specify with function arguments. The
input table can have additional columns, but the function ignores them.
Table 661: POSTagger Input Table Schema
Output
Table 662: POSTagger Output Table Schema
Example
Input
The input table is the output of the Sentenizer function example.
SQL-MapReduce Call
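A sketch of the call, assuming the Sentenizer output was stored in a table named sentenizer_output whose sentence text is in a column named sentence (both names are illustrative):

SELECT * FROM POSTagger(
    ON sentenizer_output
    TextColumn ('sentence')
);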
Output
Table 663: POSTagger Example Output Table
Sentenizer
Summary
The Sentenizer function extracts sentences from English input text. A sentence ends with a punctuation
mark such as period (.), question mark (?), or exclamation mark (!).
Background
Many natural language processing (NLP) tasks (such as part-of-speech tagging and chunking) begin by
identifying sentences.
Usage
Sentenizer Syntax
Version 1.1
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input column that contains the text from which
to extract sentences.
Accumulate Optional Specifies the names of the input columns to copy to the output table.
Note:
Before running this function, add the schema of the model file that the function uses to the user/session
default search path.
Input
Table 664: Sentenizer Input Table Schema
Output
Table 665: Sentenizer Output Table Schema
Example
Input
The input table contains paragraphs about common analytics topics (regression, decision trees, and so on).
Table 666: Sentenizer Example Input Table paragraphs_input
SQL-MapReduce Call
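A sketch of the call, assuming the text column of paragraphs_input is named paratext and its identifier column paraid (both names are illustrative):

SELECT * FROM Sentenizer(
    ON paragraphs_input
    TextColumn ('paratext')
    Accumulate ('paraid')
);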
Output
Table 667: Sentenizer Example Output Table
Summary
Sentiment extraction is the process of inferring user sentiment (positive, negative, or neutral) from text
(typically call center logs, forums, and social media).
The sentiment extraction functions are:
• TrainSentimentExtractor, which trains a model—takes training documents and outputs a maximum
entropy classification model
• ExtractSentiment, which uses either the classification model or a dictionary model to extract the
sentiment of each input document or sentence; that is, to output predictions
• EvaluateSentimentExtractor, which uses test data to evaluate the precision and recall of the predictions
Background
As user-generated content has increased, sentiment extraction has become more important. Typical use
cases are:
• Support Forum
A software company has an online forum where users can share knowledge and ask each other questions
about how to use its products. If a user post shows appreciation or shares information, the company
support staff need not respond. However, if a user post shows frustration at an unanswered question, or
anger at a product, then the support staff can react as soon as possible.
• Mining User-Generated Reviews
A retailer has a web site where customers can submit reviews of its products. The retailer wants to get
feedback about the products by analyzing these reviews, rather than by sending customers
questionnaires.
• Online Reputation Management
A company wants to protect its brand and reputation by monitoring negative news, blog entries, reviews,
and comments on the Internet.
TrainSentimentExtractor
Summary
The TrainSentimentExtractor function trains a model; that is, takes training documents and outputs a
maximum entropy classification model, which it installs on Aster Database. For information about
maximum entropy, see http://en.wikipedia.org/wiki/Maximum_entropy_method.
Usage
TrainSentimentExtractor Syntax
Version 2.1
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the training data.
TextColumn Required Specifies the name of the input table column that contains the training
data.
SentimentColumn Required Specifies the name of the input table column that contains the
sentiment values, which are 'POS' (positive), 'NEG' (negative), and
'NEU' (neutral).
ModelFile Required Specifies the name of the file to which the function outputs the model.
Language Optional Specifies the language of the training data:
Input
Table 668: TrainSentimentExtractor Input Table Schema
Output
The function outputs a binary file that contains a maximum entropy classification model, which it installs on
Aster Database, and a message.
Table 669: TrainSentimentExtractor Output Message Schema
Example
Input
The input table is a collection of user reviews for different products.
Table 670: TrainSentimentExtractor Example Input Table sentiment_train
SQL-MapReduce Call
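A sketch of the training call. The InputTable, TextColumn, SentimentColumn, and ModelFile arguments are documented above; the dummy ON (SELECT 1) PARTITION BY 1 clause is a common idiom for functions that name their input with InputTable but is an assumption here, as are the column names content and category:

SELECT * FROM TrainSentimentExtractor(
    ON (SELECT 1) PARTITION BY 1
    InputTable ('sentiment_train')
    TextColumn ('content')
    SentimentColumn ('category')
    ModelFile ('sentimentmodel1.bin')
);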
Output
Table 671: TrainSentimentExtractor Example Output Table
train_result
Model generated.
Training time(s): 0.167
File name: sentimentmodel1.bin
File size(KB): 3
Model successfully installed
ExtractSentiment
Summary
The ExtractSentiment function extracts the sentiment (positive, negative, or neutral) of each input
document or sentence, using either a classification model output by the function TrainSentimentExtractor
or a dictionary model.
The dictionary model consists of WordNet, a lexical database of the English language, and the following
negation words:
• no
• not
• neither
• never
• scarcely
• hardly
• nor
• little
• nothing
• seldom
• few
The function handles negated sentiments as follows:
• -1 if the sentiment is negated (for example, “I am not happy”)
• -1 if the sentiment and a negation word are separated by one or two words (for example, “I am not very
happy” or “I am not at all happy”)
• +1 if the sentiment and a negation word are separated by three words (for example, “I am not saying I am
happy”)
This function can be used with real-time applications. Refer to AMLGenerator.
Usage
ExtractSentiment Syntax
Version 3.1
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input column that contains text from which to
extract sentiments.
Language Optional Specifies the language of the input text:
• 'en' (English, the default)
• 'zh_CN' (Simplified Chinese)
• 'zh_TW' (Traditional Chinese)
Model Optional Specifies the model type and file. The default model type is dictionary. If
you omit this argument or specify dictionary without dict_file, then you
must specify a dictionary table with alias 'dict'. If you specify both dict
and dict_file, then whenever their words conflict, dict has higher priority.
The dict_file must be a text file in which each line contains only a
sentiment word, a space, and the opinion score of the sentiment word.
If you specify classification model_file, then model_file must be the name
of a model file generated and installed on the database by the function
TrainSentimentExtractor.
Note:
Before running the function, add the location of dict_file or
model_file to the user/session default search path.
Accumulate Optional Specifies the names of the input columns to copy to the output table.
Level Optional Specifies the level of analysis—whether to analyze each document (the
default) or each sentence.
HighPriority Optional Specifies the highest priority when returning results:
• NEGATIVE_RECALL
Give highest priority to negative results, including those with lower-
confidence sentiment classifications (maximizes the number of
negative results returned).
• NEGATIVE_PRECISION
Give highest priority to negative results with high-confidence
sentiment classifications.
• POSITIVE_RECALL
Give highest priority to positive results, including those with lower-
confidence sentiment classifications (maximizes the number of
positive results returned).
• POSITIVE_PRECISION
Give highest priority to positive results with high-confidence
sentiment classifications.
• NONE
Input
The function has a required input table and an optional dictionary table.
The following table describes the required input table columns. The table can have additional columns, but
the function ignores them.
Table 672: ExtractSentiment Input Table Schema
The following table describes the required first and second columns of the dictionary table. The table can
have additional columns, but the function ignores them.
Table 673: ExtractSentiment Dictionary Table Schema
Output
Table 674: ExtractSentiment Output Table Schema
Examples
• Prerequisites
• Input
• Example 1: Model ('dictionary'), Level ('document')
• Example 2: Model ('dictionary'), Level ('sentence')
• Example 3: Model ('classification:default_sentiment_classification_model.bin')
• Example 4: Model ('classification:sentimentmodel1.bin')
• Example 5: Dictionary Table Instead of Model File
Prerequisites
These files must be installed in the directory sentimentAnalysisModel:
• default_sentiment_classification_model.bin
• For English input text: default_sentiment_lexicon.txt
• For Simplified Chinese input text: default_sentiment_lexicon_zh_cn.txt
• For Traditional Chinese input text: default_sentiment_lexicon_zh_tw.txt
Input
Table 675: ExtractSentiment Examples Input Table sentiment_extract_input
Example 1: Model ('dictionary'), Level ('document')
SQL-MapReduce Call
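A sketch of the call; the text column name content is an assumption, while the Model and Level values come from the example title:

SELECT * FROM ExtractSentiment(
    ON sentiment_extract_input
    TextColumn ('content')
    Model ('dictionary')
    Level ('document')
    Accumulate ('id')
);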
Output
Example 2: Model ('dictionary'), Level ('sentence')
SQL-MapReduce Call
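Under the same assumptions, this call differs only in the Level argument:

SELECT * FROM ExtractSentiment(
    ON sentiment_extract_input
    TextColumn ('content')
    Model ('dictionary')
    Level ('sentence')
    Accumulate ('id')
);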
Output
Example 3: Model ('classification:default_sentiment_classification_model.bin')
This example uses the maximum entropy classification model file
default_sentiment_classification_model.bin.
SQL-MapReduce Call
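A sketch of the call (the text column name content is an assumption):

SELECT * FROM ExtractSentiment(
    ON sentiment_extract_input
    TextColumn ('content')
    Model ('classification:default_sentiment_classification_model.bin')
    Accumulate ('id')
);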
Output
id out_polarity out_strength
1 NEG 2
2 POS 2
3 NEG 2
4 POS 2
5 POS 1
6 NEG 2
7 NEG 2
8 NEG 2
9 NEG 2
10 NEG 2
Example 4: Model ('classification:sentimentmodel1.bin')
SQL-MapReduce Call
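This call substitutes the model trained in the TrainSentimentExtractor example (a sketch, with the same assumptions as in Example 3):

SELECT * FROM ExtractSentiment(
    ON sentiment_extract_input
    TextColumn ('content')
    Model ('classification:sentimentmodel1.bin')
    Accumulate ('id')
);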
Output
id out_polarity out_strength
1 POS 2
2 POS 2
3 POS 2
4 POS 2
5 POS 2
6 NEG 2
7 NEG 2
8 NEG 2
9 NEG 2
10 NEG 2
Example 5: Dictionary Table Instead of Model File
This example uses the following dictionary table, given with the alias 'dict', in place of a dictionary file:
word opinion
screwed 2
excellent 2
incredible 2
terrific 2
outstanding 2
fun 1
love 1
nice 1
big 0
update 0
constant 0
small 0
mistake -1
difficulty -1
disappointed -1
not tolerate -1
stuck -1
terrible -2
crap -2
SQL-MapReduce Call
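A sketch of the call; the dictionary table name sentiment_word is an illustrative assumption, and the dimensional ON clause supplies it under the required alias dict:

SELECT * FROM ExtractSentiment(
    ON sentiment_extract_input
    ON sentiment_word AS dict DIMENSION
    TextColumn ('content')
    Model ('dictionary')
    Accumulate ('id')
);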
EvaluateSentimentExtractor
Summary
The EvaluateSentimentExtractor function uses test data to evaluate the precision and recall of the
predictions output by the function ExtractSentiment. The precision and recall are affected by the model that
ExtractSentiment uses; therefore, if you change the model, you must rerun EvaluateSentimentExtractor on
the new predictions.
For basic information about precision and recall calculations, see
http://en.wikipedia.org/wiki/Precision_and_recall.
Usage
EvaluateSentimentExtractor Syntax
Version 1.1
Arguments
Argument Category Description
ObsColumn Required Specifies the name of the input column with the observed sentiment
(POS, NEG or NEU).
SentimentColumn Required Specifies the name of the input column with the predicted sentiment
(POS, NEG or NEU).
Input
The input table, which contains the test data, must have the columns described in the following table. The
table can have additional columns, but the function ignores them.
Table 682: EvaluateSentimentExtractor Input Table Schema
Example
• Input
• Example 1: Model ('dictionary')
Input
The input to the function EvaluateSentimentExtractor is the output from the function ExtractSentiment;
therefore, these examples have the same Prerequisites as ExtractSentiment.
Example 1: Model ('dictionary')
SQL-MapReduce Call
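A sketch of the nested call; ObsColumn and SentimentColumn are the documented arguments, the out_polarity column name comes from the ExtractSentiment output tables, and the category column is an assumption. The remaining examples follow the same pattern, varying only the Model argument of the inner ExtractSentiment call:

SELECT * FROM EvaluateSentimentExtractor(
    ON (
        SELECT * FROM ExtractSentiment(
            ON sentiment_extract_input
            TextColumn ('content')
            Model ('dictionary')
            Accumulate ('category')
        )
    ) PARTITION BY 1
    ObsColumn ('category')
    SentimentColumn ('out_polarity')
);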
Output
evaluation_result
positive record (total relevant, relevant, total retrieved): 5 5 5
recall and precision: 1.00 1.00
negative record (total relevant, relevant, total retrieved): 5 5 5
recall and precision: 1.00 1.00
positive and negative record (total relevant, relevant, total retrieved): 10 10 10
recall and precision: 1.00 1.00
Example 2: Model ('classification:default_sentiment_classification_model.bin')
This example uses the classification model file default_sentiment_classification_model.bin.
SQL-MapReduce Call
Output
evaluation_result
positive record (total relevant, relevant, total retrieved): 5 3 3
recall and precision: 0.60 1.00
negative record (total relevant, relevant, total retrieved): 5 5 7
recall and precision: 1.00 0.71
positive and negative record (total relevant, relevant, total retrieved): 10 8 10
recall and precision: 0.80 0.80
SQL-MapReduce Call
Output
evaluation_result
positive record (total relevant, relevant, total retrieved): 5 5 5
recall and precision: 1.00 1.00
negative record (total relevant, relevant, total retrieved): 5 5 5
recall and precision: 1.00 1.00
positive and negative record (total relevant, relevant, total retrieved): 10 10 10
recall and precision: 1.00 1.00
SQL-MapReduce Call
Output
evaluation_result
positive record (total relevant, relevant, total retrieved): 5 5 6
recall and precision: 1.00 0.83
negative record (total relevant, relevant, total retrieved): 5 3 3
recall and precision: 0.60 1.00
positive and negative record (total relevant, relevant, total retrieved): 10 8 9
recall and precision: 0.80 0.89
Text Classifier
Summary
Text Classifier is composed of these functions:
• TextClassifierTrainer, which trains the text classifier and creates a model
• TextClassifier, which classifies the text
• TextClassifierEvaluator, which evaluates the trained classifier model
Background
Text classification is the task of choosing the correct class label for a given text input. In basic text
classification tasks, each input is considered in isolation from all other inputs, and the set of class labels is
defined in advance.
Text classification is a two-stage process:
1. Train the model:
Preprocess the text data and produce tokens.
Use natural language processing (NLP) functionality such as tokenization, stemming, and stop words.
From the tokens, use statistical measures to select a subset.
Generate the feature for each word in the subset.
Use machine learning algorithms to train a classifier.
2. Classify the text.
TextClassifierTrainer
Summary
The TextClassifierTrainer function trains a machine learning classifier for text classification and creates a
model file. After installing the model file in Aster Database, you can input it to the function TextClassifier.
Usage
TextClassifierTrainer Syntax
Version 1.4
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the documents to
use to train the model.
TextColumn Required Specifies the name of the column that contains the text of the
training documents.
CategoryColumn Required Specifies the name of the column that contains the category of
the training documents.
ModelFile Required Specifies the name for the model file to be generated.
ClassifierType Required Specifies the classifier type of the model, KNN algorithm or
maximum entropy model.
ClassifierParameters Optional Applies only if the classifier type of the model is KNN. Specifies
parameters for the classifier. The name must be 'compress' and
value must be in the range (0, 1). The n training documents are
clustered into value*n groups (for example, if there are 100
training documents, then ClassifierParameters('compress:0.6')
clusters them into 60 groups), and the model uses the center of
each group as the feature vector.
NLPParameters Optional Specifies natural language processing (NLP) parameters for
preprocessing the text data and producing tokens. Each name:value
pair must be one of the following:
• tokenDictFile:token_file
where token_file is the name of an Aster Database file in
which each line contains a phrase, followed by a space,
followed by the token for the phrase (and nothing else).
• stopwordsFile:stopword_file
where stopword_file is the name of an Aster Database file in
which each line contains exactly one stop word (a word to
ignore during tokenization, such as a, an, or the).
• useStem:{'true' | 'false'}
which specifies whether the function stems the tokens. The
default value is 'false'.
• stemIgnoreFile:stem_ignore_file
where stem_ignore_file is the name of an Aster Database file
in which each line contains exactly one word to ignore during
stemming. Specifying this parameter with 'useStem:false'
causes an exception.
• useBgram:{'true' | 'false'}
which specifies whether the function uses Bigram, which
considers the proximity of adjacent tokens when analyzing
them. The default value is 'false'.
• language:{ 'en' | 'zh_CN' | 'zh_TW' }
which specifies the language of the input text—English (en),
Simplified Chinese (zh_CN), or Traditional Chinese
(zh_TW). The default value is en. For the values zh_CN and
zh_TW, the function ignores the parameters useStem and
stemIgnoreFile.
Example:
NLPParameters ('tokenDictFile:token_dict.txt',
'stopwordsFile:fileName',
'useStem:true',
'stemIgnoreFile:fileName',
'useBgram:true',
'language:zh_CN')
Input
The input table must have the columns described in the following table. The input table can have additional
columns, but the function ignores them.
Output
The function outputs a binary file with the name specified by ModelFile argument, installs the binary file on
Aster Database, and prints a message about the model generation.
Table 689: TextClassifierTrainer Output Message Schema
Example
Input
Table 690: TextClassifierTrainer Example Input Table texttrainer_input
id content category
1 Tennis star Roger Federer was born on August 8, 1981, in Basel, Switzerland, to sports
Swiss father Robert Federer and South African mother Lynette Du Rand
2 Federer took an interest in sports at an early age, playing tennis and soccer at sports
the age of 8.
3 At age 14, Federer became the national junior champion in Switzerland sports
4 Federer won the Wimbledon boys singles and doubles titles in 1998, and turned sports
professional later that year.
5 In 2003, following a successful season on grass, Federer became the first Swiss sports
man to win a Grand Slam title when he emerged victorious at Wimbledon.
6 A natural disaster is a major adverse event resulting from natural processes of natural disaster
the Earth. Examples include floods, volcanic eruptions, earthquakes, tsunamis,
and other geologic processes.
7 In a vulnerable area, however, such as San Francisco in 1906, an earthquake can natural disaster
have disastrous consequences and leave lasting damage, requiring years to
repair.
8 An earthquake is the result of a sudden release of energy in the Earth crust that natural disaster
creates seismic waves.
9 Volcanoes can cause widespread destruction and consequent disaster in several natural disaster
ways.
10 A flood is an overflow of water that submerges land natural disaster
The stop words file for this example contains these words:
a
an
in
is
to
into
was
the
and
this
with
they
but
will
SQL-MapReduce Call
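A sketch of the training call. The argument names are documented above; the ClassifierType value 'knn', the stop words file name stopwords.txt, and the dummy ON clause idiom are assumptions:

SELECT * FROM TextClassifierTrainer(
    ON (SELECT 1) PARTITION BY 1
    InputTable ('texttrainer_input')
    TextColumn ('content')
    CategoryColumn ('category')
    ModelFile ('knn.bin')
    ClassifierType ('knn')
    NLPParameters ('useStem:true', 'stopwordsFile:stopwords.txt')
);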
Output
Table 691: TextClassifierTrainer Example Output Table
train_result
Model generated.
Training time(s): 0.216
File name: knn.bin
File size(KB): 1
Model successfully installed
TextClassifier
Summary
The TextClassifier function classifies input text, using a model output by the function TextClassifierTrainer.
Usage
TextClassifier Syntax
Version 1.2
Arguments
Argument Category Description
TextColumn Required Specifies the column of the input table that contains the text to be used
for predicting classification.
Model Required Specifies the model (which you must install in the database before calling
the function).
Accumulate Optional Specifies the names of the input columns to copy to the output table.
Input
Table 692: TextClassifier Input Table Schema
Example
Input
• The following table, textclassifier_input
• Model file knn.bin, output by the function TextClassifierTrainer (refer to its Example section)
Table 694: TextClassifier Example Input Table: textclassifier_input
id content category
16 At the beginning of 2004, Federer had a world ranking of No. 2, and that same sports
year, he won the Australian Open, the U.S. Open, the ATP Masters and
retained the Wimbledon singles title.
17 Federer held on to his No. 1 ranking from 2004 into 2008. In 2006 and 2007, he sports
won the singles championships at the Australian Open, Wimbledon and the
U.S. Open.
18 A paragon of graceful athleticism, Federer was named the Laureus World sports
Sportsman of the Year from 2005-08.
19 Cyclone, tropical cyclone, hurricane, and typhoon are different names for the natural disaster
same phenomenon, which is a cyclonic storm system that forms over the
oceans.
20 Drought is the unusual dryness of soil, resulting in crop failure and shortage of natural disaster
water and for other uses which is caused by significant low rainfall than
average over a prolonged period.
21 A tornado is a violent, dangerous, rotating column of air that is in contact with natural disaster
both the surface of the earth and a cumulonimbus cloud or, in rare cases, the
base of a cumulus cloud.
SQL-MapReduce Call
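A sketch of the call, using the documented TextColumn, Model, and Accumulate arguments and the tables from this example:

SELECT * FROM TextClassifier(
    ON textclassifier_input
    TextColumn ('content')
    Model ('knn.bin')
    Accumulate ('id', 'category')
);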
Output
Table 695: TextClassifier Example Output Table
id category out_category
16 sports sports
17 sports sports
18 sports natural disaster
19 natural disaster natural disaster
20 natural disaster natural disaster
21 natural disaster natural disaster
TextClassifierEvaluator
Summary
The TextClassifierEvaluator function evaluates the precision, recall, and F-measure of the trained model
output by the function TextClassifierTrainer.
Usage
TextClassifierEvaluator Syntax
Version 1.2
Arguments
Argument Category Description
ObsColumn Required Specifies the name of the input column that contains the expected
(correct) category.
Input
Table 696: TextClassifierEvaluator Input Table Schema
Output
Table 697: TextClassifierEvaluator Output Table Schema
Example
Input
• The table, TextClassifier Example Output Table, in the Output section of the Example for the function
TextClassifier
• Model file knn.bin, output by the function TextClassifierTrainer (refer to its Example section)
SQL-MapReduce Call
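A sketch of the nested evaluation call; ObsColumn is documented above, while the PredictColumn argument name for the predicted category is an assumption (its argument row is not shown in this section):

SELECT * FROM TextClassifierEvaluator(
    ON (
        SELECT * FROM TextClassifier(
            ON textclassifier_input
            TextColumn ('content')
            Model ('knn.bin')
            Accumulate ('id', 'category')
        )
    ) PARTITION BY 1
    ObsColumn ('category')
    PredictColumn ('out_category')
);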
Text_Parser
Summary
The Text_Parser function tokenizes an input stream of words, optionally stems them (reduces them to their
root forms), and then outputs them. The function can either output all words in one row or output each
word in its own row with (optionally) the number of times that the word appears.
This function can be used with real-time applications. Refer to AMLGenerator.
Background
Parsing English language text includes:
• Punctuating sentences
• Breaking a sentence into words (tokenizing it)
• Removing stop words
• Stemming words (reducing them to their root forms)
The Text_Parser function reads a document into a memory buffer and creates a hash table. The dictionary
for the document must not exceed available memory; however, a million-word dictionary with an average
word length of ten bytes requires only 10 MB of memory.
The Text_Parser function uses Porter2 as the stemming algorithm.
For general information about tokenization, see:
http://en.wikipedia.org/wiki/Lexical_analysis#Tokenizer
For general information about stemming, see:
http://en.wikipedia.org/wiki/Stemming
Usage
Text_Parser Syntax
Version 1.3
Note:
If you include the PARTITION BY clause, the function treats all rows in the same partition as a single
document. If you omit the PARTITION BY clause, the function treats each row as a single document.
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input column whose contents are to be tokenized.
ToLowerCase Optional Specifies whether to convert input text to lowercase. The default value is 'true'.
Note:
The function ignores this argument if the Stemming argument has the value 'true'.
Stemming Optional Specifies whether to stem the tokens—that is, whether to apply the Porter2
stemming algorithm to each token to reduce it to its root form. Before
stemming, the function converts the input text to lowercase and applies the
RemoveStopWords argument. The default value is 'false'.
Delimiter Optional Specifies a regular expression that represents the word delimiter. The default
value is '[\t\b\f\r]+'.
TotalWordsNum Optional Specifies whether to output a column that contains the total number of words
in the input document. The default value is 'false'.
Punctuation Optional Specifies a regular expression that represents the punctuation characters to
remove from the input text. With Stemming ('true'), the recommended value
is '[\\\[.,?\!:;~()\\\]]+'.
The default value is '[.,!?]'.
Note:
No accumulate_column can be the same as token_column or total_column.
TokenColumn Optional Specifies the name of the output column that contains the tokens. The default
value is 'token'.
FrequencyColumn Optional Specifies the name of the output column that contains the frequency of each
token. The default value is 'frequency'.
Note:
The function ignores this argument if the OutputByWord argument has
the value 'false'.
TotalColumn Optional Specifies the name of the output column that contains the total number of
words in the input document. The default value is 'total_count'.
RemoveStopWords Optional Specifies whether to remove stop words from the input text before parsing.
The default value is 'false'.
PositionColumn Optional Specifies the name of the output column that contains the position of a word
within a document. The default value is 'position'.
ListPositions Optional Specifies whether to output the position of a word in list form. The default
value is 'false', which causes the function to output a row for each occurrence
of the word.
Note:
The function ignores this argument if the OutputByWord argument has
the value 'false'.
OutputByWord Optional Specifies whether to output each token of each input document in its own row
in the output table. The default value is 'true'. If you specify 'false', then the
function outputs each tokenized input document in one row of the output
table.
StemmingExceptions Optional Specifies the location of the file that contains the stemming exceptions. A
stemming exception is a word followed by its stemmed form. The word and its
stemmed form are separated by white space. Each stemming exception is on
its own line in the file. For example:
bias bias
news news
goods goods
lying lie
ugly ugli
sky sky
early earli
Input
The Text_Parser function has one input table. If you include the PARTITION BY clause, then the function
treats all rows in the same partition as a single document. If you omit the PARTITION BY clause, then the
function treats each row as a single document.
Table 699: Text_Parser Input Table Schema
Output
The Text_Parser function has one output table, whose schema depends on the value of the OutputByWord
argument.
Table 700: Text_Parser Output Table Schema, Output_By_Word ('true')
Examples
• Example 1: With StopWords and without StemmingExceptions
• Example 2: With StemmingExceptions and without StopWords
Input
The input table is a log of vehicle complaints. The column category indicates whether the car was involved
in a crash.
Table 702: Text_Parser Examples Input Table complaints
The stop words file for this example contains these words:
a
an
in
is
to
into
was
the
and
Example 1: With StopWords and without StemmingExceptions
SQL-MapReduce Call
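A sketch of a call that would produce such a table. The column names doc_id, text_data, and category come from the complaints table; the StopWords file argument name and the CREATE TABLE distribution clause are assumptions:

CREATE TABLE complaints_traintoken DISTRIBUTE BY HASH(doc_id) AS
SELECT * FROM Text_Parser(
    ON complaints
    TextColumn ('text_data')
    ToLowerCase ('true')
    RemoveStopWords ('true')
    StopWords ('stopwords.txt')
    Accumulate ('doc_id', 'category')
);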
Output
This query returns the table complaints_traintoken:
Example 2: With StemmingExceptions and without StopWords
Input
The input table is the first two rows of Text_Parser Examples Input Table complaints.
Table 704: Text_Parser Example 2 Input Table complaints_mini
The stemming exceptions file for this example contains:
consumer customer
enbankment embankment
SQL-MapReduce Call
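A sketch of the call; the stemming exceptions file name stemmingexception.txt is illustrative, and the Punctuation value is the one recommended with Stemming ('true'):

SELECT * FROM Text_Parser(
    ON complaints_mini
    TextColumn ('text_data')
    Stemming ('true')
    Punctuation ('[\\\[.,?\!:;~()\\\]]+')
    StemmingExceptions ('stemmingexception.txt')
    Accumulate ('doc_id')
);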
Output
TextChunker
Summary
The TextChunker function divides text into phrases and assigns each phrase a tag that identifies its type.
Background
Text chunking (also called shallow parsing) divides text into phrases in such a way that syntactically related
words become members of the same phrase. Phrases do not overlap; that is, a word is a member of only one
chunk.
For example, the sentence “He reckons the current account deficit will narrow to only # 1.8 billion in
September .” can be divided as follows, with brackets delimiting phrases:
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to]
[NP only # 1.8 billion] [PP in] [NP September]
After each opening bracket is a tag that identifies the chunk type (NP, VP, and so on). For information about
chunk types, refer to Output.
For more information about text chunking, see:
• Erik F. Tjong Kim Sang and Sabine Buchholz, Introduction to the CoNLL-2000 Shared Task: Chunking.
In: Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
• Fei Sha and Fernando Pereira, Shallow Parsing with Conditional Random Fields. [2003]
TextChunker Syntax
Version 1.2
Note:
The input_table is the output table of the POSTagger function, which contains the columns partition_key
and word_sn.
Arguments
Argument Category Description
WordColumn Required Specifies the name of the input table column that contains the words to
chunk into phrases. Typically, this is the word column of the output table
of the POSTagger function (described in the Output section of Usage).
POSColumn Required Specifies the name of the input table column that contains the
part-of-speech (POS) tags of the words. Typically, this is the pos_tag
column of the output table of the POSTagger function (described in the
Output section of Usage).
Input
The TextChunker function requires:
• An input table generated by the POSTagger function (for its schema, refer to POSTagger Output Table
Schema)
When running POSTagger to generate this table, specify in the Accumulate argument the name of the
input column that contains the unique row identifiers.
• The model file, chunker_default_model.bin, which is provided with the function
Note:
Before running TextChunker, add the model file location to the default search path for the user or
session.
Output
Table 706: TextChunker Output Table Schema
Example
• Example 1: Using Output from POSTagger
• Example 2: Using Output from Sentenizer and POSTagger
Example 1: Using Output from POSTagger
Input
paraid paratext
1 I live in Los Angeles.
2 New York is a great city.
3 Chicago is a lot of fun, but the winters are very cold and windy.
4 Philadelphia and Boston have many historical sites.
Output
Example 2: Using Output from Sentenizer and POSTagger
Input
SQL-MapReduce Call
TextChunker requires each sentence to have a unique identifier, and the input to TextChunker must be
partitioned by that identifier.
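A sketch of such a call, assuming the POSTagger output was stored as pos_tagger_output and that each sentence is identified by a column named sentence_id (both names are illustrative; word, pos_tag, and word_sn are the column names described above):

SELECT * FROM TextChunker(
    ON pos_tagger_output PARTITION BY sentence_id ORDER BY word_sn
    WordColumn ('word')
    POSColumn ('pos_tag')
);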
Output
TextMorph
Summary
The TextMorph function outputs each input word in its standard forms (called morphs) with their
corresponding parts of speech. The following table shows examples of words and their standard forms.
Table 712: Examples of Words and Their Standard Forms
Background
Lemmatization is a basic text analysis tool that determines the lemmas (standard forms) of words, so that all
forms of a word can be grouped together, improving the accuracy of text analysis.
The TextMorph function implements a lemmatization algorithm based on the WordNet 3.0 dictionary,
which is packaged with the function. If an input word is in the dictionary, the function outputs its morphs
with their parts of speech; otherwise, the function outputs the input word itself and sets its part of speech to
NULL.
When an input word has multiple morphs, the function outputs them in the order of precedence of their
parts of speech: noun, verb, adj, and adv. That is, if an input word has a noun form, it is listed first; if the
same word has a verb form, it is listed next; and so on.
Usage
TextMorph Syntax
Version 1.2
Note:
The function does not determine the part of speech of the word from
its context; it uses all possible parts of speech for the word in the
dictionary.
Accumulate Optional Specifies the names of the input columns to copy to the output table.
Input
Table 713: TextMorph Input Table Schema
Output
Table 715: TextMorph Output Table Schema
Examples
• Input
• Example 1: SingleOutput ('true')
• Example 2: SingleOutput ('false')
• Example 3: POS ('noun', 'verb') and SingleOutput ('false')
• Example 4: POS ('noun', 'verb') and SingleOutput ('true')
• Example 5: Using TextMorph with POSTagger and TextTagging
Input
Table 716: TextMorph Examples 1-4 Input Table words_input
id word
1 regression
2 Roger
3 better
4 datum
5 quickly
6 proud
7 father
8 juniors
9 doing
10 being
11 negating
12 yearly
Example 1: SingleOutput ('true')
SQL-MapReduce Call
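A sketch of the call; the WordColumn argument name is an assumption (its argument row is not shown in this section), while SingleOutput comes from the example title:

SELECT * FROM TextMorph(
    ON words_input
    WordColumn ('word')
    SingleOutput ('true')
    Accumulate ('id')
);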
Output
Example 2: SingleOutput ('false')
SQL-MapReduce Call
Output
Example 3: POS ('noun', 'verb') and SingleOutput ('false')
SQL-MapReduce Call
Output
For the input word better, the function does not find noun or verb morphs. However, the function finds
better itself in the dictionary as both a noun and a verb, so it outputs those entries.
With SingleOutput ('false'), the words better and father appear in the output table as both nouns and verbs.
Table 719: TextMorph Example 3 Output Table
Example 4: POS ('noun', 'verb') and SingleOutput ('true')
SQL-MapReduce Call
Output
With SingleOutput ('true'), the words better and father appear in the output table only as nouns.
Table 720: TextMorph Example 4 Output Table
POSTagger Input
id txt
s1 Roger Federer born on 8 August 1981, is a greatest tennis player, who has been continuously
ranked inside the top 10 since October 2002 and has won Wimbledon, USOpen, Australian and
FrenchOpen titles mutiple times
TextTagging Output
TextTagging
Summary
The TextTagging function tags text documents according to user-defined rules that use text-processing and
logical operators.
TextTagging Syntax
Version 1.3
Arguments
Argument Category Description
Language Optional Specifies the language of the input text:
• 'en': English (default)
• 'zh_cn': Simplified Chinese
• 'zh_tw': Traditional Chinese
If UseTokenizer specifies 'true', then the function uses the value of
Language to create the word tokenizer.
Rules Optional Specifies the tag names and tagging rules. Use this argument if and only
if you do not specify a rules table. For information about defining
tagging rules, refer to Defining Tagging Rules.
Tokenize Optional Specifies whether the function tokenizes the input text before evaluating
the rules and tokenizes the text string parameter in the rule definition
when parsing a rule. If you specify 'true', then you must also specify the
Language argument. The default value is 'false'.
OutputByTag Optional Specifies whether the function outputs a tuple when a text document
matches multiple tags. The default value is 'false', which means that one
tuple in the output stands for one document and the matched tags are
listed in the output column tag.
TagDelimiter Optional Specifies the delimiter that separates multiple tags in the output column
tag if OutputByTag has the value 'false' (the default). The default value is
the comma (,). If OutputByTag has the value 'true', specifying this
argument causes an error.
Accumulate Optional Specifies the names of text table columns to copy to the output table.
Note:
Do not use the name 'tag' for an accumulate_column, because the
function uses that name for the output table column that contains the
tags.
Defining Tagging Rules
If x is the number of times that op1 appears in col, then the operations that take lower and upper bounds
have the following meanings, respectively:
lower <= x <= upper
lower <= x
x <= upper
The meanings of lower, x, and upper depend on the operation.
For simplicity, the following table shows only the syntax that specifies both lower and upper.
Syntax Description
equal(col, op1)
Returns 'true' if the text in column col and the value of op1 are equal; 'false' otherwise.
contain(col, op1, lower, upper)
Returns 'true' if, in column col, the number of times that the value of op1 appears is in the range [lower, upper]; 'false' otherwise.
dist(col, op1, op2, lower, upper)
Returns 'true' if, in column col, the distance between the values of op1 and op2 (that is, the number of words between them) is in the range [lower, upper]; 'false' otherwise.
The distance computation depends on the Language and Tokenize arguments.
By default, Language is 'en' (English) and Tokenize is 'false', and words are delimited by whitespace characters.
If Language is 'zh_cn' (Simplified Chinese) or 'zh_tw' (Traditional Chinese) and Tokenize is 'true', then the function performs word segmentation before computing the distance between words.
superdist(col, op1, op2, con1, op3, con2)
Returns 'true' if, in column col, the values of op1, op2, and op3 satisfy the context rules con1 and con2; 'false' otherwise.
The rule con1 specifies the context for inclusion. The possible values of con1 and their meanings are:
nwn: op2 appears n or fewer words before or after op1.
nrn: op2 appears n or fewer words after op1.
para: op2 appears in the same paragraph as op1.
sent: op2 appears in the same sentence as op1.
The rule con2 specifies the context for exclusion. The possible values of con2
and their meanings are:
nwn: op3 does not appear n or fewer words before or after op1.
nrn: op3 does not appear n or fewer words after op1.
para: op3 does not appear in the same paragraph as op1.
sent: op3 does not appear in the same sentence as op1.
The distance computation depends on the Language and Tokenize
arguments (for details, refer to the description of the dist operation).
A paragraph ends with either "\n" or "\r\n". A sentence ends with a
period (.), question mark (?), or exclamation mark (!). The function
fragments the input into paragraphs or sentences and then checks the
context rule on each piece of text. If one piece satisfies the rule, then the
function tags the whole input.
opn (where n is 1, 2, or 3) can be a list of words. Enclose the list in double
quotation marks and separate the words with semicolons. For example:
"good;bad;neutral"
If opn is a Java regular expression, then it too can be a list. Separate the items
with semicolons. For example: regex"invest[\w]*;volatil[\w]*;risk"
When a list appears in an inclusion context, the rule is satisfied if at least
one item appears in the context. When a list appears in an exclusion
context, the rule is satisfied if no item appears in the context.
The operand-context pairs after op1 are optional; that is, the following are
valid syntax:
superdist(col, op1,,,,)
superdist(col, op1, op2, con1,,)
superdist(col, op1,,, op3, con2)
superdist(col, op1, op2, con1, op3, con2)
The first syntax in the preceding list returns 'true' if op1 appears in col.
dict(col, "[schema/]dictionary", lower, upper)
Returns 'true' if, in column col, the number of items (lines in the dictionary file) is in the range [lower, upper]; 'false' otherwise.
Note:
This operation requires that the dictionary file [schema.]dictionary is installed on your Aster Database cluster. The dictionary name, dictionary, is case-sensitive. If the dictionary is in the public schema, then you can omit the schema name, schema.
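As a concrete illustration of the dist and superdist operations, the following is a hypothetical rules table in the tagname/definition format used by the examples later in this section (the table name, tag names, and rule values are invented for illustration):
CREATE TABLE rule_sketch (tagname VARCHAR, definition VARCHAR)
DISTRIBUTE BY REPLICATION;
INSERT INTO rule_sketch VALUES
    ('Flood-Warning', 'dist(content, "flood", "warning", , 5)');
INSERT INTO rule_sketch VALUES
    ('Federer-Not-Nadal', 'superdist(content, "Federer", , , "Nadal", sent)');
The first rule tags documents in which "warning" appears within five words of "flood"; the second tags documents in which "Federer" appears but "Nadal" does not appear in the same sentence.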
Input
The TextTagging function has a required text table and an optional rules table. If you omit the rules table,
then you must specify the tagging rules with the Rules argument.
The following table describes the columns of the text table. The table can have additional columns, but the
function ignores them unless you specify them in rules.
Table 726: TextTagging Text Table Schema
Output
Table 728: TextTagging Output Table Schema
Examples
• Input
• Example 1: Specify Rules Argument
• Example 2: Specify Rules Table
• Example 3: Specify Dictionary File in Rules Argument
• Example 4: Specify Superdist in Rules Argument
Input
Table 729: TextTagging Examples 1-4 Input Table: text_inputs
SQL-MapReduce Call
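A minimal sketch of the call, assuming the rule definitions shown in the Example 2 rules table and an 'AS tagname' form for the Rules argument (both the ON clause and that form are assumptions):
SELECT * FROM TextTagging (
    ON text_inputs PARTITION BY ANY
    Accumulate ('id')
    Rules ('contain(content, "floods", 1,) or contain(content, "tsunamis", 1,) AS Natural-Disaster',
           'contain(title, "Tennis", 1,) and contain(content, "Roger", 1,) AS Tennis-Greats',
           'contain(content, "Roger", 1,) and contain(content, "Nadal", 1,) AS Tennis-Rivalry',
           'contain(content, "India", 1,) and contain(content, "Pakistan", 1,) AS Cricket-Rivalry',
           'contain(content, "Australia", 1,) and contain(content, "England", 1,) AS The-Ashes')
) ORDER BY id;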
Output
id tag
1 Natural-Disaster
2 Tennis-Greats
3 Tennis-Rivalry
4 Cricket-Rivalry
5 The-Ashes
tagname definition
Cricket-Rivalry contain(content,"India",1,) and contain(content,"Pakistan",1,)
Natural-Disaster contain(content, "floods",1,) or contain(content,"tsunamis",1,)
Tennis-Greats contain(title,"Tennis",1,) and contain(content,"Roger",1,)
Tennis-Rivalry contain(content,"Roger",1,) and contain(content,"Nadal",1,)
The-Ashes contain(content,"Australia",1,) and contain(content,"England",1,)
SQL-MapReduce Call
Output
id tag
1 Natural-Disaster
2 Tennis-Greats
3 Tennis-Rivalry
4 Cricket-Rivalry
5 The-Ashes
The dictionary file used in Example 3 contains these entries:
floods
tsunamis
Roger
Nadal
SQL-MapReduce Call
Output
id tag
1 Natural-Disaster
2
3 Great-Sports-Rivalry
4 Great-Sports-Rivalry
5 Great-Sports-Rivalry
SQL-MapReduce Call
Output
id tag
1 Chennai-Flood-Disaster
2 Roger-Champion
3 Tennis-Rivalry
4
5 Aus-Eng-Cricket,Aus-victory
TextTokenizer
Summary
The TextTokenizer function extracts English, Chinese, or Japanese tokens from text. Examples of tokens are
words, punctuation marks, and numbers. Tokenization is the first step of many types of text analysis.
TextTokenizer Syntax
Version 3.2
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text to
tokenize.
Language Optional Specifies the language of the text in text_column:
• en (English, the default)
• zh_CN (Simplified Chinese)
• zh_TW (Traditional Chinese)
• jp (Japanese)
Model Optional Specifies the name of model file that the function uses for tokenizing.
The model must be a conditional random-fields model and model_file
must already be installed on the database. If you omit this argument,
or if model_file is not installed on the database, then the function uses
white spaces to separate English words and an embedded dictionary
to tokenize Chinese text.
Note:
If you specify Language('jp'), the function ignores this argument.
OutputDelimiter Optional Specifies the delimiter for separating tokens in the output. The default
value is slash (/).
OutputByWord Optional Specifies whether to output one token in each row. The default value
is 'false' (output one line of text in each row).
Accumulate Optional Specifies the names of the input table columns to copy to the output
table.
Note:
If the function finds more than one matching term, it selects the
longest term for the first match.
Input
The function has a required input table and an optional dictionary table.
Table 735: TextTokenizer Input Table Schema
The following table describes the format of both the dictionary table (dict) and the user dictionary file
(specified by the UserDictionaryFile argument).
Table 737: TextTokenizer Dictionary Table and User Dictionary File Format
Language Format
Chinese and English One dictionary word on each line.
Japanese A dictionary entry consists of the following comma-separated words:
word—The original word.
tokenized_word—The tokenized form of the word.
reading—The reading of word in Katakana.
pos—The part-of-speech of the word.
For example:
成田空港,成田空港,ナリタクウコウ,カスタム名詞
Output
The schema of the output table depends on the value of the OutputByWord argument.
Examples
• Example 1: Chinese Tokenization
• Example 2: Japanese Tokenization
• Example 3: English Tokenization
Input
id txt
t1 我从小就不由自主地认为自己长大以后一定得成为一个象我父亲一样的画家, 可能是父母
潜移默化的影响。
t2 中华人民共和国 辽宁省 铁岭市 靠山屯 村支书 赵本山。
The dictionary table (dict) used in this example contains these entries:
txt
辽宁省铁岭市靠山屯村
赵本山
SQL-MapReduce Call 1
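A minimal sketch of the call, assuming the input and dictionary tables are named cn_input and cn_dict (the two-input form with a dict alias is also an assumption):
SELECT * FROM TextTokenizer (
    ON cn_input PARTITION BY ANY
    ON cn_dict AS dict DIMENSION
    TextColumn ('txt')
    Language ('zh_CN')
    OutputDelimiter (' ')
    Accumulate ('id')
);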
Output
id token
t1 我 从小 就 不由自主 地 认为 自己 长大 以后 一定 得 成为 一个 象 我 父亲 一样 的 画家 , 可
能 是 父母 潜移默化 的 影响 。
t2 中华人民共和国 辽宁省 铁岭市 靠山屯 村支书 赵本山 。
SQL-MapReduce Call 2
Output
id sn token
t1 1 我
t1 2 从小
t1 3 就
t1 4 不由自主
… ... ...
t2 1 中华人民共和国
t2 2 辽宁省
... ... ...
Input
id txt
t1 総務省は 28 日、全国の主要 51 市を対象に 2013 年の物価水準を比較した消費者物価
地域差指数を発表した。
t2 ソチ五輪6位の浅田真央(23)=中京大=はSP女子世界最高の78・66点で首
位に立った。
The user dictionary file used in this example contains this entry:
word
地域差指数,地域差指数,チイキサシスウ,カスタム名詞
SQL-MapReduce Call 1
Output
id token
t1 総務省/は/28 日/、/全国/の/主要/51 市/を/対象/に/2013 年/の/物価水準/を/比較/し/た/
消費者/物価/地域差指数/を/発表/し/た/。
t2 ソチ五輪/6位/の/浅田真央/(/23/)/=/中京大/=/は/SP女子/世界最高/の/78・
66点/で/首位/に/立っ/た/。
SQL-MapReduce Call 2
Output
id sn token
t1 1 総務省
t1 2 は
t1 3 28 日
t1 4 、
... ... ...
t2 12 SP女子
t2 13 世界最高
... ... ...
Input
The input table is a log of vehicle complaints. The category column indicates whether the car has been
involved in a crash.
Table 748: TextTokenizer Example 3 Input Table complaints
Output
doc_id sn token
1 1 consumer
1 2 was
1 3 driving
1 4 approximately
1 5 45
1 6 mph
1 7 hit
1 8 a
1 9 deer
1 10 with
1 11 the
1 12 front
... ... ...
TF_IDF
Summary
The TF_IDF function can do either of the following:
• Take any document set and output the inverse document frequency (IDF) and term frequency-inverse
document frequency (TF-IDF) scores for each term.
• Use the output of a previous run of the TF_IDF function on a training document set to predict TF_IDF
scores of an input (test) document set.
Background
TF-IDF stands for "term frequency-inverse document frequency," a technique for evaluating the
importance of a specific term in a specific document in a document set. Term frequency (tf) measures how
often the term appears in the document, and inverse document frequency (idf) measures how rare the term
is across the document set; the fewer documents that contain the term, the higher its idf. The TF-IDF score
for a term is tf * idf. A term with a high TF-IDF score is especially relevant to the specific document.
The TF_IDF function represents each document as an N-dimensional vector, where N is the number of
terms in the document set (therefore, the document vector is usually very sparse). Each entry in the
document vector is the TF-IDF score of a term.
Usage
TF_IDF Syntax
TF_IDF version 2.1, TF version 1.1
Arguments
Argument Category Description
Formula Optional Specifies the formula for calculating the term frequency (tf) of term t in
document d:
• 'normal' (normalized frequency, default)
tf(t,d) = log(f(t,d) + 1)
where f(t,d) is the number of times t occurs in d (that is, the raw
frequency, rf).
• 'augment' (augmented frequency, which prevents bias towards
longer documents)
tf(t,d) = 0.5 + 0.5 * (f(t,d) / max{f(w,d) : w in d})
Note:
When using the output of a previous run of the TF_IDF function on a
training document set to predict TF_IDF scores on an input
document set, use the same Formula value for the input document set
that you used for the training document set.
Input
The TF_IDF function always requires as input the output of the TF function. The input for the TF function
is the document set. The other TF_IDF input tables depend on your reason for running the function:
• If you are running TF_IDF to output the IDF and TF-IDF values for each term in the document set, then
TF_IDF also requires the input table doccount and has optional input table docperterm.
• If you are running the function to predict TF_IDF values, then TF_IDF also requires the input table idf.
The table idf is the output of an earlier call to TF_IDF, using the training document set as input to the TF
function, the doccount table, and optionally, the docperterm table.
If you omit the docperterm table, the function creates it by processing the entire document set, which can
require a large amount of memory. If there is not enough memory to process the entire document set, then
the docperterm table is required.
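As an illustration of this two-stage flow, the following is a minimal sketch of a TF_IDF call over a tokenized document table; the table name tfidf_tokens, its docid and term columns, and the exact ON-clause aliases are assumptions:
SELECT * FROM TF_IDF (
    ON TF (
        ON tfidf_tokens PARTITION BY docid
    ) AS tf PARTITION BY term
    ON (SELECT COUNT (DISTINCT docid) FROM tfidf_tokens) AS doccount DIMENSION
);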
Table 750: TF Input Table (Document Set) Schema
Output
Table 754: TF_IDF Output Schema
Examples
• Example 1: TF_IDF on Tokenized Training Document Set
• Example 2: TF_IDF on Tokenized Test Set
Note:
The examples tokenize the document sets with the function nGram, but alternatively, you can use
the function TextTokenizer.
Input
docid content
1 Chennai floods have battered the capital city of Tamil Nadu and its adjoining areas. Normal life
came to a standstill when roads were submerged in water and all modes of transport were
severely affected. In the past, Chennai has had tsunamis and earthquakes
2 Roger Federer born on 8 August 1981, is a greatest tennis player, who has been continuously
ranked inside the top 10 since October 2002 and has won Wimbledon, USOpen, Australian and
FrenchOpen titles mutiple times
3 The Federer Nadal rivalry, known by many as Fedal, is between two professional tennis players,
Roger Federer of Switzerland and Rafael Nadal of Spain. They are currently engaged in a storied
rivalry, which many consider to be the greatest in tennis history. They have played 34 times, most
recently in the 2015 Swiss Indoors final, and Nadal leads their eleven-year-old rivalry with an
overall record of 23–11
4 The India Pakistan cricket rivalry is one of the most intense sports rivalries in the world. An
India-Pakistan cricket match has been estimated to attract up to one billion viewers, according to
TV ratings firms and various other reports. The 2011 World Cup semifinal between the two
teams attracted around 988 million television viewers
5 An Ashes series is traditionally of five Tests, hosted in turn by England and Australia at least once
every four years. As of August 2015, England hold the ashes, having won three of the five Tests in
the 2015 Ashes series. Overall, Australia has won 32 series, England 32 and five series have been
drawn.
SQL-MapReduce Call
Input
docid content
6 In Chennai, India, floods have closed roads and factories, turned off power, shut down the
airport and forced thousands of people out of their homes.
7 Spanish tennis star Rafael Nadal said he was happy with the improvement in his game after a
below-par year, and looked forward to reigniting his long-time rivalry with Roger Federer in
India.
8 Nadal, the world number five, said he has always enjoyed playing against Federer and hoped they
would do so for years to come.
SQL-MapReduce Call
The bold clause references the IDF values from Output.
Output
The query below returns the following table:
Cluster Analysis
• Canopy
• Gaussian Mixture Model Functions
• KMeans
• KMeansPlot
• KModes
• KModesPredict
• Minhash
• Modularity
Note:
The Modularity function, which discovers clusters in input graphs, is in Graph Analysis_part2.
Canopy
Summary
The Canopy function takes a set of data points and identifies each point with one or more canopies.
Canopies are groups of points that are interrelated, close, or similar. Canopy clustering is often performed in
preparation for more rigorous clustering techniques, such as k-means clustering.
Note:
The canopy clustering algorithm is nondeterministic, and the randomness of the canopy assignments
cannot be controlled by a seed argument.
Background
Canopy clustering is a very simple, fast, and surprisingly accurate method for grouping objects into
preliminary clusters. Each object is represented as a point in a multidimensional feature space.
The canopy clustering algorithm uses a fast approximate distance metric and two distance thresholds, T1
and T2 (T1 > T2), for processing. A point can belong to a canopy if the distance from the point to the canopy
center is less than T1.
Judicious selection of canopy centers (with no canopy center within T2 of another) and of the points in a
canopy enables more efficient execution of clustering algorithms, which are often run within canopies.
Canopy clustering is often an initial step in more rigorous clustering techniques, because after the data
points are clustered into canopies:
• More expensive distance measurements can be restricted to points inside the canopies, which
significantly reduces the number of distance computations.
• The more rigorous clustering technique need perform only intra-canopy clustering, which can be
parallelized.
Points that belong to different canopies do not have to be considered at the same time in this clustering
process.
Canopy clustering is done in three map-reduce steps:
1. Each mapper performs canopy clustering on the points in its input set and outputs its canopies' centers
(which are local to the mapper).
2. The reducer takes all the points in each (local) canopy and calculates centroids to produce the final
canopy centers.
3. Final canopy centers that are too close to each other are deleted (to eliminate the effects of earlier
localization).
A driver extracts information from the initial canopy-generation step and uses it to make another SQL-
MapReduce call that finishes the clustering process.
Usage
Canopy Syntax
Version 2.0
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Input
The Canopy function has one required input table, which contains the data to be clustered. The input table
cannot have any columns not described in the following table.
Table 761: Canopy Input Table Schema
Output
The Canopy function outputs a table of canopies and their centers.
Table 762: Canopy Output Table Schema
Example
Input
The input table has more than 6000 rows of computer specifications.
Table 763: Canopy Example Input Table computers_train1
SQL-MapReduce Call
Output
Table 764: Canopy Example Output Table
GMMFit
Summary
GMMFit is a driver function that fits a Gaussian Mixture Model (GMM) to data supplied in an input table.
You specify whether GMMFit uses a basic GMM algorithm with a fixed number of clusters or a Dirichlet
Process GMM (DP-GMM) algorithm with a variable number of clusters.
The output table of the GMMFit function can be input to the function GMMPredict.
Usage
GMMFit Syntax
Version 1.0
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input data to be
clustered.
OutputTable Required Specifies the name of the output table to which the function
outputs cluster information. The table must not already exist.
MaxClusterNum Required if ClusterNum is omitted, otherwise not allowed
Specifies the maximum number of clusters in a Dirichlet Process model and causes the function to use the DP-GMM algorithm. This value must have the data type INTEGER. The default value is 20.
ClusterNum Required if MaxClusterNum is omitted, otherwise not allowed
Specifies the number of clusters in a model and causes the function to use the basic GMM algorithm. This value must have the data type INTEGER and be greater than 0. The default value is 10.
CovarianceType Optional Specifies the type of the covariance matrices, thereby determining
how many parameters the function estimates for each cluster:
'spherical': Each covariance matrix is of the form σI. The function
estimates one parameter for each cluster.
'diagonal' (default): Each covariance matrix has zeros on the
nondiagonal. The function estimates D parameters for each cluster,
where D is the number of dimensions in the matrix.
'tied': Each cluster has the same covariance matrix. The function
estimates (1/2)D(D-1) parameters.
'full': Each cluster has an arbitrary covariance matrix. The function
estimates (1/2)D(D-1) parameters for each cluster.
Input
The GMMFit function has two input tables, input_table and (optionally) init_params.
Table 765: GMMFit input_table Schema
In the init_params table, you can specify initial values for the cluster weights, means, and covariances of each
cluster. You can specify one, two, or all three of these initial values. If you do not want to specify any of these
values, then omit init_params and specify (SELECT 1) instead of a reference to a table, view, or query.
The init_params table must have the same schema as the GMMFit output table. The following table
describes the init_params table and explains how the function assigns initial values that you do not specify.
Table 766: GMMFit init_params Table Schema
Output
The GMMFit function outputs a message and output_table. The message describes these properties:
Table 767: GMMFit Output Message Properties
Property Value
Output Table Name of the output table to which the function outputs cluster
information (output_table).
Algorithm Used Algorithm that the function used—Basic GMM or DP-GMM.
Stopping Criterion Why the function stopped—maximum iterations reached or
convergence reached.
Delta Log Likelihood Change in the mean log-likelihood for each data point between the
next-to-last and the final iterations.
Number of Iterations Number of iterations that the function performed before stopping.
Number of Clusters Number of clusters in the GMM.
Covariance Type Spherical, diagonal, tied, or full.
Number of Data Points Number of data points in the data set.
Global Mean Mean of the data set.
Global Covariance Covariance of the data set.
Log Likelihood Log-likelihood of the data, given the GMM.
Akaike Information Criterion Akaike Information Criterion.
Bayesian Information Criterion Bayesian Information Criterion.
The output_table format depends on the PackOutput argument. For PackOutput('false'), the default, the
following table describes output_table.
Table 768: GMMFit output_table Schema for PackOutput('false')
Examples
• Input
• Example 1: Basic GMM, Spherical Covariance, Packed Output
• Example 2: Basic GMM, Diagonal Covariance, Unpacked Output
• Example 3: DP-GMM, Full Covariance, Unpacked Output
Input
This example uses the well-known 'iris' dataset (gmm_iris_input). The data has values for four attributes—
sepal_length, sepal_width, petal_length, and petal_width—which are the data dimensions. The input does
not include the species column, because the goal is data clustering, not classification. Each example outputs
three clusters.
From the raw data, a train set and a test set are created.
The function GMMFit uses the train set to generate the model. The GMMPredict function uses the model
information to predict clusters for the test data.
Table 770: GMMFit Example ‘Iris’ Dataset gmm_iris_input
SQL-MapReduce Call
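A minimal sketch of a call consistent with Example 1's packed, spherical, three-cluster output (the training-table name gmm_iris_train is an assumption):
SELECT * FROM GMMFit (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('gmm_iris_train')
    OutputTable ('gmm_output_ex1')
    ClusterNum ('3')
    CovarianceType ('spherical')
    PackOutput ('true')
);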
Output
Because the SQL-MapReduce call set PackOutput to 1, a single mean column displays a vector containing
the mean value for each dimension. Refer to the schema for argument definitions.
An output message table (immediately following) and an output table are shown below.
Table 772: GMMFit Example 1 Output Message Table
property value
Output Table gmm_output_ex1
Algorithm Used Basic GMM
... ...
Stopping Criterion Iteration limit reached
Delta Log Likelihood 0.013310
Number of Iterations 10
Number of Clusters 3
Covariance Type spherical
... ...
Number of Data Points 120
Global Mean [5.866, 3.055, 3.770, 1.205]
Global Covariance [[0.7197, -0.04204, 1.326, 0.5265], [-0.04204, 0.1916, -0.3241, -0.1213], [1.326,
-0.3241, 3.167, 1.298], [0.5265, -0.1213, 1.298, 0.5708]]
... ...
Log Likelihood -364.450
Akaike Information Criterion 762.899 on 17 parameters
Bayesian Information Criterion 810.287 on 17 parameters
SQL-MapReduce Call
property value
Output Table gmm_output_ex2
Algorithm Used Basic GMM
... ...
Stopping Criterion Iteration limit reached
Delta Log Likelihood 0.018931
Number of Iterations 10
Number of Clusters 3
Covariance Type diagonal
... ...
Number of Data Points 120
Global Mean [5.866, 3.055, 3.770, 1.205]
Global Covariance [[0.7197, -0.04204, 1.326, 0.5265], [-0.04204, 0.1916, -0.3241, -0.1213],
[1.326, -0.3241, 3.167, 1.298], [0.5265, -0.1213, 1.298, 0.5708]]
... ...
Log Likelihood -305.091
Akaike Information Criterion 662.182 on 26 parameters
Bayesian Information Criterion 734.657 on 26 parameters
The following query returns the output shown in the table gmm_output_ex2:
SQL-MapReduce Call
Output
property value
Output Table dpgmm_output_ex3
Algorithm Used Dirichlet Process GMM
Stopping Criterion Algorithm converged with tolerance 0.001
Delta Log Likelihood 0.000494
Number of Iterations 9
Number of Clusters Found 1
Covariance Type full
Number of Data Points 120
Global Mean [5.866, 3.055, 3.770, 1.205]
Global Covariance [[0.7197, -0.04204, 1.326, 0.5265], [-0.04204, 0.1916, -0.3241, -0.1213],
[1.326, -0.3241, 3.167, 1.298], [0.5265, -0.1213, 1.298, 0.5708]]
Log Likelihood 1550.435
Akaike Information Criterion -3012.870 on 44 parameters
Bayesian Information Criterion -2890.220 on 44 parameters
The following query returns the output shown in the table dpgmm_output_ex3:
GMMPredict
Summary
The GMMPredict function takes the output from the function GMMFit and predicts the cluster assignments
for each point in a specified data set. Because GMM functions do soft assignments of data points to clusters
(that is, GMM functions give probabilities that each data point is in each cluster), you can specify the top N
most likely clusters for a given point and the probability that the point is a member of each of those clusters.
The output table of the GMMPredict function can be input to the function GMMProfile.
Usage
GMMPredict Syntax
Version 1.0
Arguments
Argument Category Description
OutputFormat Optional Specifies how the function outputs the weights that it assigns to each
of the top N clusters:
'sparse' (default): The function outputs each weight to a separate
row of the output table.
'dense': The function outputs the weights to a single row of the
output table.
TopNClusters Optional Specifies the number of cluster weights that the function outputs.
This value must be an INTEGER. For the value n, the function
outputs, for each data point, the cluster with the greatest weight, the
cluster with the second-greatest weight, and so on, ending with the
cluster with the nth-greatest weight. The default value is 1.
PrintLogLikelihood Optional Specifies whether to output the log likelihood of an observation,
given the data. The default value is 'false'.
Accumulate Optional Specifies the names of testdata columns to copy to the output table.
Attributes Optional Specifies the names of testdata columns that correspond to the
attributes in the modeldata table. By default, these columns are all
testdata columns except the first.
IDColumn Optional Specifies the input table column that defines the row identifier. The
default value is the first input table column.
Input
The GMMPredict function has two input tables, testdata and modeldata. For the schema of testdata, refer to
GMMFit input_table Schema. For the schema of modeldata, refer to GMMFit input_table Schema and
GMMFit output_table Schema.
Output
The GMMPredict function has one output table, whose format depends on the OutputFormat argument.
The following table describes the output table for OutputFormat('sparse'), the default. The table has D+3
columns, where D is the number of dimensions of the input data.
Table 784: GMMPredict Output Table Schema for OutputFormat('sparse')
The following table describes the output table for OutputFormat('dense'). The table has D+2n columns,
where D is the number of dimensions of the input data and n is the number of cluster weights that the
function outputs (the value of the TopNClusters argument).
Table 785: GMMPredict Output Table Schema for OutputFormat('dense')
Example
The GMMPredict function applies the model created by GMMFit to the test input to cluster the test data.
Input
Table 786: GMMPredict Example Input Table gmm_iris_test
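A minimal sketch of the call, patterned after the two-input form used by KMeansPlot (the aliases, ID column name, and Accumulate list are assumptions):
SELECT * FROM GMMPredict (
    ON gmm_iris_test PARTITION BY ANY
    ON gmm_output_ex1 AS modeldata DIMENSION
    IDColumn ('id')
    Accumulate ('sepal_length', 'sepal_width', 'petal_length', 'petal_width')
) ORDER BY id;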
Output
The output table shows the dimensions and id of each sample, and the probability that it belongs to each of
the three clusters.
Table 787: GMMPredict Example Output Table (Columns 1-4)
GMMProfile
Summary
The GMMProfile function takes the output of the function GMMFit and outputs information about how
each cluster diverges from the global data statistics.
Usage
GMMProfile Syntax
Version 1.0
Input
The GMMProfile function takes as input the modeltable that the GMMFit function outputs.
Examples
The examples in this section show the delta mean and divergence for each of the models created with
GMMFit.
Example 1
Input
Use the following tables from the Output section of Example 1: Basic GMM, Spherical Covariance, Packed
Output of the function GMMFit:
• GMMFit Example 1 Output Table: gmm_output_ex1 (Columns 1-4)
• GMMFit Example 1 Output Table: gmm_output_ex1 (Columns 5-8)
SQL-MapReduce Call
Output
Example 2
Input
Use the following tables from the Output section of Example 2: Basic GMM, Diagonal Covariance,
Unpacked Output of the function GMMFit:
• GMMFit Example 2 Output Table gmm_output_ex2 (Columns 1-6)
• GMMFit Example 2 Output Table gmm_output_ex2 (Columns 7-11)
• GMMFit Example 2 Output Table gmm_output_ex2 (Columns 12-14)
SQL-MapReduce Call
Output
Example 3
Input
Use the following tables from the Output section of Example 3: DP-GMM, Full Covariance, Unpacked
Output of the function GMMFit:
• GMMFit Example 3 Output Table dpgmm_output_ex3 (Columns 1-6)
• GMMFit Example 3 Output Table dpgmm_output_ex3 (Columns 7-11)
• GMMFit Example 3 Output Table dpgmm_output_ex3 (Columns 12-16)
• GMMFit Example 3 Output Table dpgmm_output_ex3 (Columns 17-20)
SQL-MapReduce Call
Output
KMeans
Summary
The KMeans function takes a data set and outputs the centroids of its clusters and, optionally, the clusters
themselves.
Background
K-means clustering is a simple unsupervised learning algorithm that is popular for cluster analysis in data
mining. The algorithm aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean—the centroid for the cluster.
The algorithm aims to minimize an objective function (in this case, a squared error function). The objective
function, which is a chosen distance measure between a data point and the cluster center, indicates the
distance of the n data points from their respective centroids.
The algorithm has these steps:
1. Place k points into the space represented by the objects that are being clustered.
These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. Recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
Now the objects are in groups from which the metric to be minimized can be calculated.
Although the procedure always terminates, the k-means algorithm does not necessarily find the optimal
configuration, corresponding to the global objective function minimum. The algorithm is significantly
sensitive to the initial randomly selected cluster centers. To reduce the effect of these limitations, the k-
means algorithm can be run multiple times.
The k-means algorithm in map-reduce consists of an iteration (until convergence) of a map and a reduce
step. The map step assigns each point to a cluster. The reduce step takes all the points in each cluster and
calculates the new centroid of the cluster.
KMeans Syntax
Version 1.6
Note:
You must specify only one of the arguments NumClusters, InitialSeeds, and CentroidsTable. If you
specify more than one, the function gives top priority to InitialSeeds, then to NumClusters, and then to
CentroidsTable.
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the features by which to cluster the
data.
OutputTable Required Specifies the name of the table in which to output the centroids of the clusters.
ClusteredOutput Optional Specifies the name of the table in which to store the clustered output. If you omit this argument, the function does not generate a table of clustered output.
UnpackColumns Optional Specifies whether the means for each centroid appear unpacked (that is, in separate columns) in output_table. By default, the function concatenates the means for the centroids and outputs the result in a single VARCHAR column.
InitialSeeds Optional Specifies the initial seed means for the clusters.
Note:
With InitialSeeds, the function uses a deterministic algorithm and the
function supports up to 1596 dimensions.
NumClusters Optional Specifies the number of clusters to generate from the data.
Note:
With NumClusters, the function uses a nondeterministic algorithm and
the function supports up to 1543 dimensions.
CentroidsTable Optional Specifies the table that contains the initial seed means for the clusters. The schema of the centroids table depends on the value of the UnpackColumns argument.
Note:
With CentroidsTable, the function uses a deterministic algorithm and the
function supports up to 1596 dimensions.
Threshold Optional Specifies the convergence threshold. When the centroids move by less than this
amount, the algorithm has converged. The default value is 0.0395.
MaxIterNum Optional Specifies the maximum number of iterations that the algorithm runs before
quitting if the convergence threshold has not been met. The default value is 10.
Input
The KMeans function has one required input table (specified by the InputTable argument) and one optional
input table (specified by the CentroidsTable argument).
The required input table contains the features by which to cluster the data.
Table 793: KMeans Input Table Schema
The optional input table contains the initial seed means for the clusters. This table has the same
schema as the table of cluster centroids (specified by the OutputTable argument), which is affected by the
UnpackColumns argument and is described by KMeans Results Messages and KMeans Output Table
Schema for UnpackColumns('true').
Output
The KMeans function has two required outputs and one optional output. The required outputs are the result
messages (output to the screen) and the table of cluster centroids (specified by the OutputTable argument).
The optional output is a table of the clusters themselves (specified by the ClusteredOutput argument).
The results messages table starts with information about each cluster, described by the following two tables.
Table 794: KMeans Results Messages Table Schema
Note:
The UnpackColumns argument does not affect this column.
Label Value
Converged : 'True' if the algorithm converged, 'False' otherwise.
Number of iterations : Number of iterations that the algorithm performed.
Number of clusters : Number of clusters.
Output table : Name of the output table specified by the OutputTable argument.
Total_WithinSS : Sum of withinss values in the preceding table.
Between_SS : Between sum of squares—the sum of squared distances of centroids to the
global mean, where the squared distance of each mean to the global mean is
multiplied by the number of data points it represents.
The schema of the table of cluster centroids is affected by the UnpackColumns argument.
Table 796: KMeans Output Table Schema for UnpackColumns('false') (Default)
The following table describes the optional table of the clusters themselves.
Table 798: KMeans Clustered Output Table Schema
Input
Table 799: KMeans Examples Input Table computers_train1
SQL-MapReduce Call
This call tries to group the 5-dimensional data points into 8 clusters.
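A minimal sketch of such a call, using the driver-function invocation pattern shown elsewhere in this guide (the clustered-output table name is an assumption; the centroids table name matches the one queried below):
SELECT * FROM KMeans (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('computers_train1')
    OutputTable ('kmeanssample_centroid')
    NumClusters ('8')
    ClusteredOutput ('kmeanssample_clusteredoutput')
);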
Output
The following query returns the output shown in the following table:
SQL-MapReduce Call
Output
The following query returns the output shown in the table kmeanssample_centroid:
SQL-MapReduce Call
Output
The following query returns the output shown in the following table:
id clusterid
1 1
2 3
3 1
4 3
5 5
6 4
7 3
8 3
9 7
12 6
13 3
14 7
16 7
17 1
18 7
19 7
20 4
... ...
SQL-MapReduce Call
Output
The following query returns the output shown in the following table:
id clusterid
1 0
2 4
3 0
4 4
5 6
6 1
7 0
8 4
9 4
12 2
13 4
14 2
16 4
17 0
18 2
19 4
20 1
... ...
KMeansPlot
Summary
The KMeansPlot function takes a model—a table of cluster centroids output by the KMeans function—and
an input table of test data, and uses the model to assign the test data points to the cluster centroids.
Usage
KMeansPlot Syntax
Version 1.1
Note:
When calling KMeansPlot on a view, you must provide aliases (a requirement of multi-input SQL-
MapReduce). For example:
SELECT *
FROM KMeansPlot (
ON pa_prdwk.seg_data_v AS input_data PARTITION BY ANY
ON pa_prdwk.seg_data_output AS segmentation_data_output DIMENSION
CentroidsTable ('segmentation_data_output')
);
Input
The KMeansPlot function has two required input tables:
• An input table of test data (the input with the PARTITION BY ANY clause), which has the same schema
as the KMeans input table
• The table of cluster centroids output by the KMeans function (the input with the DIMENSION clause)
Output
Table 811: KMeansPlot Output Table Schema
Example
This example uses the table of cluster centroids output by a KMeans function example.
Input
The input table of test data, computers_test1, contains attributes of personal computers (price, speed, hard
disk size, RAM, and screen size). This table has over 1000 rows. If a row contains a null value, KMeansPlot
assigns the cluster ID -1 to that row.
The table of cluster centroids, kmeanssample_centroid, was output by the Kmeans function.
Table 812: KMeansPlot Example Input Table computers_test1
SQL-MapReduce Call
SELECT *
FROM KMeansPlot (
ON computers_test1 PARTITION BY ANY
ON kmeanssample_centroid DIMENSION
CentroidsTable ('kmeanssample_centroid')
) ORDER BY id, clusterid;
Output
Table 814: KMeansPlot Example Output Table
KModes
Summary
KModes is an extension of KMeans that supports categorical data. KModes models are fit similarly to
KMeans models. The core algorithm is an expectation-maximization algorithm that finds a locally optimal
solution. The main steps to fitting the model are:
• Initialization - A set of K initial cluster centers is selected. This set can be generated using the
RandomSample function, which allows the user to sample rows from an input table using the kmeans++
and kmeans|| algorithms. These initialization algorithms generate initial cluster centers that are more
likely to lead to better local optima.
• E step - Performed by a mapper. Each point in the input table is assigned to one of the K clusters, and the
sums of the numerical attributes and counts of the categorical attributes are stored.
• M step - Performed by a reducer. The statistics generated by each worker in the E step are aggregated and
new cluster centers are generated. For numerical attributes, the new center is the mean of the value of the
attribute for the points assigned to the cluster. For categorical attributes, the new center is the mode of
the attribute value for the points assigned to the cluster.
The algorithm runs for either a set number of iterations or until the change in movement of the cluster
centers drops below a user-specified threshold.
When assigning points to a cluster, a hybrid distance function that combines a numeric distance and a
categorical distance is required. The default distance between two data points x and y in a KModes model is
the squared Euclidean distance with a simple-matching penalty for categorical attributes:
d(x, y) = Σj∈N (xj - yj)² + Σj∈C wj δ(xj, yj)
where N denotes the indices of numerical attributes, C denotes the indices of categorical attributes, wj
denotes the weight to be assigned to a category, and δ(a, b) is 0 if a = b and 1 otherwise.
The Manhattan distance can also be used:
d(x, y) = Σj∈N |xj - yj| + Σj∈C wj δ(xj, yj)
KModes Syntax
Version 1.0
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Input table is the table containing the list of features by which
to cluster the data.
OutputTable Required Output table is the table where output is stored. The output
table contains the centroids of the clusters.
InitialSeedTable Optional An input table containing the points that serve as initial cluster
centers. InitialSeedTable cannot be used if NumClusters is used
and is required if NumClusters is not used.
ModelIdColumn Optional If this argument is present, it indicates that the table specified
in InitialSeedTable contains more than one set of seed values
(that is, it contains seed values for more than one model).
This argument specifies the column in InitialSeedTable that
identifies which rows are associated with each model.
Input
The function has one required input table and one optional input table. The required input table contains
the data points to be clustered with one dimension in each column.
Table 815: KModes Input Table Schema
Output
The output displayed on the screen is a summary table containing statistics about the KModes run. There
are four columns if a single model is trained, or five if multiple models are trained simultaneously. The three
right-most columns are separated from the summary so that users can sort by them and quickly find the best
model.
Table 816: KMode Output Summary Table
Examples
• Input
• Example 1: Using InitialSeedTable
• Example 2: Using NumClusters
Input
Both examples use the input table kmodes_input, which has 32 observations on 11 variables, describing
different models of cars. The table has two categorical variables (vs, am), three numerical variables ('cyl',
'gear', 'carb') that are treated as categories in the SQL-MapReduce call, and six normalized purely numeric
variables ('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec'), listed below:
The kmodes_init table is an additional input that contains three points that serve as initial cluster centers.
Table 820: KModes Example Input Table kmodes_init (Columns 1-5)
SQL-MapReduce Call
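A minimal sketch of the call, using the arguments documented above (any column-selection arguments the actual example needs are omitted; the output table name matches the model table that KModesPredict references later):
SELECT * FROM KModes (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('kmodes_input')
    OutputTable ('kmodes_clusters')
    InitialSeedTable ('kmodes_init')
);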
Output
With the InitialSeedTable argument, the cluster centers and assignments are the same every time, given the
same distance metric (in this case, the default, Euclidean).
Table 822: KModes Example 1 Output Table
SQL-MapReduce Call
Output
The following query returns the output shown in the table kmodes_clusters1:
KModesPredict
Summary
KModesPredict is the prediction function corresponding to KModes.
Usage
KModesPredict Syntax
Version 1.0
Arguments
Argument Category Description
TestModels Optional Specifies the model IDs to use for prediction. The default behavior is to
use all models.
PrintDistance Optional Specifies whether to output the distance from each observation to its
closest centroid. The default value is false.
Accumulate Optional Columns from the input table to be passed through to the output table.
Output
The output schema is:
Table 830: KmodesPredict Output Table Schema
Example
Input
The KModes example input table kmodes_input and the model table kmodes_clusters produced by the
KModes function are used for cluster prediction.
SQL-MapReduce Call
Output
Each input row is assigned one of the three clusters as shown below.
Minhash
Summary
The Minhash function uses transaction history to cluster similar items or users together. For example, the
function can cluster items that are frequently bought together or users that bought the same items.
Background
Data analysis frequently requires the detection of similarity between items in large transactional data sets.
Canopy and k-means clustering perform well with physical data, but transactional data often requires less
restrictive forms of analysis. Locality-sensitive hashing, or minhash, is a particularly effective way of
clustering items based on the Jaccard metric of similarity.
Minhash assigns a pair of users to the same cluster with probability proportional to the overlap between the
set of items that they have bought. Each user u is represented by a set of items that he or she has bought. The
similarity between users ui and uj is defined as the overlap between their item sets, given by the intersection
of the item sets divided by the union of the item sets. This quotient is called the Jaccard coefficient or Jaccard
metric.
Minhash calculates one or more cluster identifiers for each user as the hash value(s) of a randomly chosen
item from a permutation of the set of items that the user has bought. With a universal class of hash
functions, the probability that two users are hashed to the same cluster identifier equals their Jaccard
coefficient, S.
If cluster identifiers are formed by concatenating p hash values, each generated by hashing a random item
from the item set with multiple hash functions, then the probability that any two users have the same hash
key is S^p.
If each user is assigned to multiple clusters, the probability that two users have the same hash key increases,
causing more effective clustering. Therefore, minhash computes several cluster identifiers for each user.
Minhash produces each cluster identifier by selecting an item from the user’s item set, hashing it with each
of several hash functions, and concatenating p hash values. Therefore, p (the number of key groups) must be
a divisor of the number of hash functions. (The item that minhash selects from the item set is the one that
produces the minimum hash value for a particular hash function, hence the name of the algorithm.)
Minhash Syntax
Version 2.2
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the input table.
OutputTable Required Specifies the name of the output table.
IDColumn Required Specifies the name of the input table column that contains the values to
be hashed into the same cluster. Typically these values are customer
identifiers.
ItemsColumn Required Specifies the name of the input column that contains the values to use for
hashing.
SeedTable Optional Specifies the name of the table that contains the seeds to use for hashing.
Typically, this table was created by an earlier Minhash call that specified
its name in the SaveSeedTo argument.
SaveSeedTo Optional Specifies the name of the table where seeds are to be saved.
Example
Input
The input table (salesdata) consists of 341 distinct users and the items they have purchased in an office
supplies store. For ease of use, the items are assigned itemids (shown in the following table) which are then
used in the input table.
Table 833: Minhash Example Items and Itemids
item itemid
Storage 1
Appliances 2
Binders 3
Telephones 4
Paper 5
Rubber Bands 6
Computer Peripherals 7
Office Furnishings 8
Office Machines 9
Envelopes 10
Bookcases 11
Tables 12
Pens & Art Supplies 13
Chairs & Chairmats 14
Scissors 15
Rulers & Trimmers 16
Copiers & Fax Storage 17
Labels 18
userid itemid
1 1
2 23
3 4
4 2
5 31
6 1
7 5
8 56
9 21
10 3
11 8
12 10
13 11 4
... ...
SQL-MapReduce Call
SELECT *
FROM minhash(
ON (SELECT 1)
PARTITION BY 1
InputTable ('salesdata')
OutputTable ('minhashoutput')
IDColumn ('userid')
ItemsColumn ('itemid')
HashNum ('1002')
KeyGroups ('3')
InputType ('integer')
MinClusterSize ('3')
);
The number of hash functions must be an integer multiple of the number of key groups, because each
clusterid is generated by concatenating KeyGroups hash codes together. The larger the number of key
groups, the fewer clusters are obtained.
Output
The following query returns the output shown in the following table:
clusterid userid
1002732123681872942919652130 142 153 22 229 273
10191305779223184216324476 106 65 94
102623915513963258275858860 15 154 200 219 227
10521510524181490254808958 106 162 41 76
1057328301636481327290076924 145 336 64 73
111640426347546462487275395 159 199 329
111640426379300784959427683 172 201 8
1145291930783954549119382258 116 16 255
11574213171254045121408249132 116 126 264
1174195802405410071547744710 220 323 336 64 73
1178104602478564384799399977 233 336 64 73
12111042574047172271448914 105 233 336 64 73
... ...
Naive Bayes
• What is Naive Bayes?
• Naive Bayes Functions
• Naive Bayes Example
Note:
For the Naive Bayes functions designed specifically for text classification, refer to Naive Bayes Text
Classifier.
Summary
The Naive Bayes classifier executes these functions:
• NaiveBayesMap and NaiveBayesReduce, which generate a model from training data
• NaiveBayesPredict, which uses the model to make predictions about testing data
Note:
You must grant the EXECUTE privilege on the NaiveBayesMap, NaiveBayesReduce, and
NaiveBayesPredict functions to the database user who will run them. For more information, refer to Set
Permissions to Allow Users to Run Functions.
NaiveBayesMap and NaiveBayesReduce
Summary
The NaiveBayesMap and NaiveBayesReduce functions generate a model from training data. A table of
training data is input to the NaiveBayesMap function, whose output is input to the NaiveBayesReduce
function, which outputs the model.
Usage
For example:
'input1','[4:21]','[25:53]','input73,
input80', '[25:53]'
Input
The NaiveBayesMap function has one input table, which contains the training data. Each row represents one
observation. The following table describes the input table columns that function arguments can specify.
Table 836: NaiveBayesMap Input (Training) Table Schema
Output
The NaiveBayesMap function output is input to the NaiveBayesReduce function. The NaiveBayesReduce
function outputs a model table. The following table describes the model table.
NaiveBayesPredict Input
The input for the SQL-MapReduce call shown below is as follows:
• Model File - NaiveBayesReduce and NaiveBayesMap Output: Model Table
• Test Dataset - Split Input into Training and Testing Data Sets
Prediction Accuracy
The prediction accuracy (proportion of correct predictions) of 93.33% is obtained using the following SQL
statements:
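The statements are not reproduced here. A minimal sketch of an accuracy query of this kind, assuming the NaiveBayesPredict output was stored in a table nb_predictions whose prediction column can be compared to the observed class column category (all names are hypothetical):
SELECT SUM(CASE WHEN prediction = category THEN 1 ELSE 0 END)::NUMERIC
       / COUNT(*) AS prediction_accuracy
FROM nb_predictions;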
prediction_accuracy
0.93333333333333333333
NaiveBayesPredict
Summary
The NaiveBayesPredict function uses the model output by the NaiveBayesReduce function to predict the
outcomes for a test set of data.
This function can be used with real-time applications. Refer to AMLGenerator.
NaiveBayesPredict Syntax
Version 1.4
Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
Model Required Specify the name of the model table generated by the NaiveBayesReduce
function.
IDCol Required Specify the name of the column that contains the ID that uniquely
identifies the test input data.
NumericInputs Either NumericInputs or CategoricalInputs is required
Specify the same numeric_input_columns that you specified when you used the NaiveBayesMap and NaiveBayesReduce functions to generate the model table from the training data.
CategoricalInputs Either NumericInputs or CategoricalInputs is required
Specify the same categorical_input_columns that you specified when you used the NaiveBayesMap and NaiveBayesReduce functions to generate the model table from the training data.
Output
The NaiveBayesPredict function outputs a table of predictions for the observations in the test table. Each
row represents one observation.
Table 844: NaiveBayesPredict Output Table Schema
Ensemble Methods
• Random Forest Functions
• Single Decision Tree Functions
• AdaBoost Functions
Summary
SQL-MapReduce provides a suite of functions to create a predictive model based on a combination of the
Classification and Regression Trees (CART) algorithm for training decision trees, and the ensemble learning
method of bagging.
In this example, the x1-x2 plane has four regions, R1, R2, R3 and R4. The predicted value of y for any test
observation in R1 is the average value of y for all training observations in R1.
The algorithm starts at the Root node. If the x1 value for a data point is greater than 5, then the algorithm
travels down the right path; if the value of x1 is less than 5, then the algorithm travels down the left path. At
each subsequent node, the algorithm determines which branch to follow, until it reaches a leaf node, to
which it assigns a prediction value.
Implementation Notes
In the original Random Forest algorithm developed by Leo Breiman, each tree grows as follows:
• If the number of cases in the training set is N, sample N cases at random, but with replacement from the
original data. This sample becomes the training set for growing the tree.
• If there are M input variables, a number m<<M is specified such that at each node, m variables are
selected at random from M and the best split on those m variables is used to split the node. The value of
m is held constant during the forest growing.
• Each tree is grown to the largest extent possible. There is no pruning.
Usage
The SQL-MapReduce decision tree functions create a decision model that predicts an outcome based on a
set of input variables. When constructing the tree, the splitting of branches stops when any of the stopping
criteria is met.
The SQL-MapReduce decision tree functions support these predictive models:
Model Description
Regression problems (continuous This model is used when the predicted outcome from the data is a real
response variable) number. For example, the dollar amount of insurance claims for a year
or the GPA expected for a college student.
Multiclass classification This model is used to classify data by predicting to which provided
(classification tree analysis) classes the data belongs. For example, whether the input data is
political news, economic news, or sports news.
Binary classification (binary This model is used to make predictions when the outcome can be
response variable) represented as a binary value (true/false, yes/no, 0/1). For example,
whether the input insurance claim description data represents an
accident.
Forest_Drive
Summary
The Forest_Drive function takes as input a training set of data and uses it to generate a predictive model.
You can input the model to the Forest_Predict function, which uses it to make predictions.
The query results include a row_count column. The average value of this column is the recommended
maximum value for the NumTrees argument.
Usage
Forest_Drive Syntax
Version 1.5
Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input data set.
OutputTable Required Specifies the name of the output table in which the function stores
the predictive model that it generates.
If a table with this name exists in the database, the function drops
the existing table and creates a new table with the same name.
ResponseColumn Required Specifies the name of the column that contains the response
variable (that is, the quantity that you want to predict).
NumericInputs Either NumericInputs or CategoricalInputs is required
Specifies the names of the columns that contain the numeric predictor variables (which must be numeric values).
MaxNumCategoricalValues Optional
Specifies the maximum number of distinct values for a single categorical variable. The max_cat_values must be a positive INTEGER. The default value is 20. A max_cat_values greater than 20 is not recommended.
CategoricalInputs Either NumericInputs or CategoricalInputs is required
Specifies the names of the columns that contain the categorical predictor variables (which can be either numeric or VARCHAR values). Each categorical input column can have at most max_cat_values distinct categorical values. If max_cat_values exceeds 20, the function might run out of memory, because classification trees grow rapidly as max_cat_values increases.
NumTrees Optional Specifies the number of trees to grow in the forest model. When
specified, number_of_trees must be greater than or equal to the
number of vworkers.
When not specified, the function builds the minimum number of
trees that provides the input dataset with full coverage.
TreeType Optional Specifies whether the analysis is a regression (continuous
response variable) or a multiclass classification (predicting result
from the number of classes). The default value is 'regression' if the
response variable is numeric, 'classification' otherwise.
Input
Table 857: Forest_Drive Input Table Schema
Note:
Forest_Drive skips input rows that contain NULL values.
Output
The Forest_Drive function populates the table specified by the OutputTable argument with the decision tree
that it creates. The following table shows the output table schema.
Table 858: Forest_Drive Output Table Schema
Example
This example uses home sales data to create a model that predicts home style, which can be input to the
Forest_Predict Example.
Input
The following table describes the home sales data contained in the input table. There are six numerical
predictors and six categorical predictors. The response variable is homestyle.
The table of raw training data, housing_train, is described by the following two tables.
Table 860: Forest_Drive Example Input Table housing_train (Columns 1-7)
SQL-MapReduce Call
We use default values for the MaxDepth, MinNodeSize, Variance, and NumSurrogates arguments, and build 50 trees on
two worker nodes. Both seed values are set to 100 for repeatability, and mtry is assigned a value of 3
(sqrt(12) ≈ 3.46), because this is a classification tree.
A good starting point for mtry is sqrt(p) for classification and p/3 for regression, where p is number of
variables used for prediction.
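A call of the following general form produces this model. It is a sketch, not the verbatim call from the example: the six numeric and six categorical column names are taken from the input description, while the two seed argument names (Seed and MtrySeed) and the driver ON clause are assumptions.

SELECT * FROM Forest_Drive (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('housing_train')
  OutputTable ('rft_model')
  TreeType ('classification')
  ResponseColumn ('homestyle')
  NumericInputs ('price', 'lotsize', 'bedrooms', 'bathrms', 'stories', 'garagepl')
  CategoricalInputs ('driveway', 'recroom', 'fullbase', 'gashw', 'airco', 'prefarea')
  NumTrees (50)
  Mtry (3)
  MtrySeed (100)
  Seed (100)
);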
Output
The summary table and the model table are shown below.
Table 862: Forest_Drive Example Output Summary Table
message
Computing 50 classification trees.
Each worker is computing 25 trees.
Each tree will contain approximately 246 points.
Poisson sampling parameter: 1.00
Query finished in 8.962 seconds.
Decision forest created in table "rft_model".
The following query returns the output shown in the following table:
Forest_Predict
Summary
The Forest_Predict function uses the model generated by the Forest_Drive function to generate predictions
on a response variable for a test set of data. The model can be stored in either a table or a file.
This function can be used with real-time applications. Refer to AMLGenerator.
Usage
Forest_Predict Syntax
Version 1.5
Arguments
Argument Category Description
ModelFile Either ModelFile, ModelTable, or Forest is required. Specifies the name of the text file or ZIP file that contains the trained model generated by the Forest_Drive function. You must have installed this model previously using the ACT \install command. If you specify ModelTable, then the function uses it and ignores ModelFile and Forest. If you specify both ModelFile and Forest, then the function uses Forest.
Forest Either ModelFile, ModelTable, or Forest is required. Specifies the name of the table that contains the decision forest generated by the Forest_Drive function.
NumericInputs Optional Specifies the names of the columns that contain the numeric predictor
variables. By default, the function gets these variables from the model
generated by Forest_Drive. If you specify this argument, you must
specify it exactly as you specified it in the Forest_Drive call that
generated the model.
CategoricalInputs Optional Specifies the names of the columns that contain the categorical
predictor variables. By default, the function gets these variables from
the model generated by Forest_Drive. If you specify this argument,
you must specify it exactly as you specified it in the Forest_Drive call
that generated the model.
IDColumn Required Specifies the column that contains a unique identifier for each test
point in the test set.
Detailed Optional Specifies whether to output detailed information about the forest
trees; that is, the decision tree and the specific tree information,
including task index and tree index for each tree. The default value is
'false'.
Accumulate Optional Specifies the names of the input columns to copy to the output table.
Input
The Forest_Predict function has a required input table and an optional model table.
If you do not specify the optional model table, then you must specify the model with either the ModelFile or
Forest argument. Teradata recommends that you specify the optional model table.
The input table for the Forest_Predict function must contain an ID column (for example, user_id or
transaction_id), so that each test point can be associated with a prediction. It must also contain all columns
that were used to generate the model.
Output
The output table is a set of predictions for each test point. The following table describes the output table
schema.
Table 865: Forest_Predict Output Table Schema
By design, for the classification tree, the columns confidence_lower and confidence_upper contain the same
value.
Input
The input test data (housing_test) has 54 observations of 14 variables. The example uses the model
rft_model (see the Output section), created by the Forest_Drive function, to predict the homestyle of the test
dataset.
Table 866: Forest_Predict Example Input Table housing_test (Columns 1-7)
SQL-MapReduce Call
Use the Accumulate argument to pass the homestyle variable, to easily compare the actual and predicted
response for each observation.
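A call of the following general form makes the predictions. This is a sketch: the ON clause and the ORDER BY are assumptions, and the ID column is assumed to be sn, as in the AdaBoost example that uses the same data; the Forest and Accumulate values follow the example description.

SELECT * FROM Forest_Predict (
  ON housing_test
  Forest ('rft_model')
  IDColumn ('sn')
  Accumulate ('homestyle')
) ORDER BY sn;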
Output
The function’s predicted response is in the prediction column and the original classification values are in the
homestyle column. The upper and lower confidence intervals are also shown in the output table.
The following query returns the output shown in the following table:
Prediction Accuracy
The prediction accuracy is 77.78% as calculated by the following SQL statement:
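Assuming the Forest_Predict output was first saved to a table, here given the hypothetical name fp_output, a statement of this form computes the value:

SELECT SUM(CASE WHEN prediction = homestyle THEN 1 ELSE 0 END)::NUMERIC
       / COUNT(*) AS pa
FROM fp_output;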
pa
0.77777777777777777778
Forest_Analyze
Summary
The Forest_Analyze function analyzes the model generated by the Forest_Drive function and gives weights
to the variables in the model. This function shows variable/attribute counts in each tree level, helping you to
understand the importance of different variables in the decision-making process.
Forest_Analyze Syntax
Version 1.1
Arguments
Argument Category Description
NumLevels Optional Specifies the number of levels to analyze. The default value is 5.
Input
The input to the Forest_Analyze function is the model generated by the Forest_Drive function. Forest_Drive
Output Table Schema shows its schema.
Output
The output of the Forest_Analyze function is a table of model analysis data. The following table shows its
schema.
Table 870: Forest_Analyze Output Table Schema
Examples
Input
The following examples show how to use the Forest_Analyze function to analyze the sample model
generated by Forest_Drive. The rft_model table, generated by the Forest_Drive function, is used as input.
Example 1
SQL-MapReduce Call
Output
There are two worker nodes that construct 25 trees each. The level and count of variables for each tree and
worker at each node are output as shown in the following table.
Table 871: Forest_Analyze Example 1 Output Table
SQL-MapReduce Call
The overall variable importance is calculated by averaging the importance over 50 trees, as shown below.
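A call of the following general form produces this result. It is a sketch assuming the rft_model table and the importance output column; GROUP BY variable matches the output table, while the ON clause and ORDER BY are assumptions.

SELECT variable, SUM(importance) / 50
FROM Forest_Analyze (
  ON rft_model
)
GROUP BY variable
ORDER BY 2 DESC;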
Output
The variable importance is shown in descending order. The top three variables for modeling and prediction
are price, lotsize and bedrooms.
Table 872: Forest_Analyze Example 2 Output Table
variable sum(importance) / 50
price 0.530036819315194
lotsize 0.40869314472933
bedrooms 0.216136248043658
stories 0.176956469036925
bathrms 0.171395287455378
garagepl 0.16108831869553
fullbase 0.0853787807623518
airco 0.0720778853448971
recroom 0.0607107804514478
driveway 0.0336033805550212
gashw 0.0161230714649009
prefarea 0.00464901131486607
Single_Tree_Drive
Summary
The Single_Tree_Drive function creates a single decision tree in a distributed fashion, either weighted or
unweighted. The model table that this function outputs can be input to the function Single_Tree_Predict.
Background
Tree Building
The Single_Tree_Drive function takes the entire data set as training input and builds a single decision tree
from it.
Usage
Single_Tree_Drive Syntax
Version 1.3
Arguments
Argument Category Description
InputTable Optional* Specifies the name of the table that contains the input data set.
*Required if you omit AttributeTableName and ResponseTableName.
AttributeTableName Optional* Specifies the name of the table that contains the attribute names and the values.
*Required if you omit InputTable.
ResponseTableName Optional* Specifies the name of the table that contains the response values.
*Required if you omit InputTable.
OutputTable Required Specifies the name for the output table that is to contain the
final decision tree (the model table). The name must not exceed
64 characters.
AttributeNameColumns Required Specifies the names of the attribute table columns that define
the attribute.
AttributeValueColumn Required Specifies the name of the attribute table column that defines the value.
ResponseColumn Required Specifies the name of the response table column that contains
the response variable.
IDColumns Required Specifies the names of the columns in the response and
attribute tables that specify the ID of the instance.
CategoricalAttributeTableName Optional Specifies the name of the input table that contains the categorical attributes.
SaveFinalResponseTableTo Optional Specifies the name for the output table that is to contain the
final PID and response pair from the response table and the
node_id from the final single drive tree.
SplitsTable Optional Specifies the name of the input table that contains the user-
specified splits. By default, the function creates new splits.
SplitsValueColumn Optional If you specify SplitsTableName, this argument specifies the
name of the column that contains the split value. If
UseApproximateSplits is 'true', then the default value is
splits_valcol; if not, then the default value is the
AttributeValueColumn argument, node_column.
Input
Single decision trees support millions of attributes. Because the database cannot have millions of columns,
you must spread the attributes across rows in the form of key-value pairs, where key is the name of the
attribute and value is the value of the attribute.
To convert an input table used in Forest_Drive into an input table for Single_Tree_Drive, use the Unpivot
function.
The Single_Tree_Drive function requires either an input table or both an attribute table and a response
table. The function has two optional input tables, the splits table and the categorical splits table.
Table 873: Single_Tree_Drive Input Table Schema
Note:
The response table must not have a column named node_id.
Output
The Single_Tree_Drive function outputs console messages, a model table, and (optionally) an intermediate
splits table and final response table. The following table shows the schema of the message table.
Table 878: Single_Tree_Drive Console Message Table Schema
The model table has a row for each node in the model (the single decision tree that the function creates). The
name of the model table is specified by the OutputTableName argument. The following table shows the
schema of the model table.
Table 879: Single_Tree_Drive Model Table Schema
The following table describes the intermediate splits table. The name of the intermediate splits table is
specified by the MaterializedSplitsTableWithName argument.
Table 880: Single_Tree_Drive Intermediate Splits Table Schema
The following table describes the output response table. The name of the output response table is specified
by the SaveFinalResponseTableTo argument.
Table 881: Single_Tree_Drive Output Response Table Schema
Examples
• Example 1
• Example 2
Input
The well-known 'iris' data set (iris_input) is used in this example. The data has values for four attributes
(sepal_length, sepal_width, petal_length, and petal_width) and is grouped into three categories (setosa (1),
versicolor (2), virginica (3)). From the raw data, a training set and a test set are created.
The Single_Tree_Drive function acts on the training set to generate the model. The Single_Tree_Predict
function uses that model and a test set to predict the output. The prediction accuracy is determined by
comparing the original and predicted results.
Table 882: Single_Tree_Drive Example 1 Iris Table iris_input
Attribute Tables
Attribute tables, created from the raw train and test data, are used as inputs.
The following query returns the output shown in the following table:
The following query returns the output shown in the following table:
Response Tables
Response tables, created from the raw train and test data, are used as inputs.
The following query returns the output shown in the following table:
pid response
1 1
2 1
3 1
4 1
6 1
7 1
8 1
9 1
11 1
12 1
13 1
14 1
16 1
... ...
pid response
5 1
10 1
15 1
20 1
25 1
30 1
35 1
40 1
45 1
50 1
55 2
60 2
65 2
70 2
75 2
80 2
85 2
90 2
95 2
100 2
105 3
110 3
115 3
120 3
125 3
130 3
135 3
140 3
pid response
145 3
150 3
SQL-MapReduce Call
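A call of the following general form generates the model table and splits table named in the output message. It is a sketch: the attribute table column names attribute and attrvalue are assumptions, while pid and response come from the tables shown above, and the input and output table names come from the output message.

SELECT * FROM Single_Tree_Drive (
  ON (SELECT 1) PARTITION BY 1
  AttributeTableName ('iris_attribute_train')
  ResponseTableName ('iris_response_train')
  OutputTable ('iris_attribute_output')
  MaterializedSplitsTableWithName ('splits_small')
  AttributeNameColumns ('attribute')
  AttributeValueColumn ('attrvalue')
  ResponseColumn ('response')
  IDColumns ('pid')
);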
Output
The function call creates two tables, a model table 'iris_attribute_output' and an intermediate splits table
'splits_small'.
Table 889: Single_Tree_Drive Example 1 Output Message
message
Input tables:"iris_attribute_train", "iris_response_train"
Output model table: "iris_attribute_output"
Depth of the tree is:6
The following query returns the output shown in the table iris_attribute_output:
The following query returns the output shown in the following table:
Example 2
Input
This example illustrates an alternate input format. The attribute table and response table of Example 1 are
combined into a single table, iris_altinput, which is specified by the InputTable argument.
Table 895: Single_Tree_Drive Example 2 Input Table iris_altinput
SQL-MapReduce Call
Output
message
Input tables:"iris_altinput",
Output model table: "iris_attribute_output_2"
Depth of the tree is:6
Single_Tree_Predict
Summary
The Single_Tree_Predict function applies a tree model to a data input, outputting predicted labels for each
data point.
This function can be used with real-time applications. Refer to AMLGenerator.
Usage
Single_Tree_Predict Syntax
Version 1.2
Arguments
Argument Category Description
AttrTableGroupByColumns Required Specifies the names of the columns on which attribute_table is partitioned. Each partition contains one attribute of the input data.
AttrTablePIDColumns Required Specifies the names of the columns that define the data point
identifiers.
AttrTableValColumn Required Specifies the name of the column that contains the input values.
Input
The Single_Tree_Predict function has two input tables, the attribute table that is also input to the
Single_Tree_Drive function (described in the Input section Single_Tree_Drive) and the model table that is
output by the Single_Tree_Drive function (described in the Output section of Single_Tree_Drive).
Example
Input
The Single_Tree_Predict function acts on the following tables (taken from the Single_Tree_Drive Examples)
and produces the prediction on the test set.
• Test Input:
∘ Single_Tree_Drive Example 1 Attribute Table iris_attribute_test
• Model Table Output:
∘ Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 1-6)
∘ Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 7-11)
SQL-MapReduce Call
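A call of the following general form produces the predictions. It is a sketch: the partitioning, the DIMENSION clause, and the attribute-table column names (attribute, pid, attrvalue) are assumptions.

SELECT * FROM Single_Tree_Predict (
  ON iris_attribute_test AS attribute_table PARTITION BY attribute
  ON iris_attribute_output AS model_table DIMENSION
  AttrTableGroupByColumns ('attribute')
  AttrTablePIDColumns ('pid')
  AttrTableValColumn ('attrvalue')
) ORDER BY pid;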
Output
The predicted labels “1”, “2”, and “3” correspond to the species 'setosa', 'versicolor', and 'virginica'.
The following query returns the output shown in the following table:
pid pred_label
5 1
10 1
pid pred_label
15 1
20 1
25 1
30 1
35 1
40 1
45 1
50 1
55 2
60 2
65 2
70 2
75 2
80 2
85 2
90 2
95 2
100 2
105 3
110 3
115 3
120 2
125 3
130 2
135 2
140 3
145 3
150 3
prediction_accuracy
0.90000000000000000000
AdaBoost Functions
• AdaBoost_Drive, which takes a training data set and a single decision tree and uses adaptive boosting to
produce a strong classifying model
• AdaBoost_Predict, which applies the strong classifying model to a new data set
Background
Boosting is a technique that develops a strong classifying algorithm from a collection of weak classifying
algorithms. A classifying algorithm is weak if its correct classification rate is slightly better than random
guessing (which is 50% for binary classification). The intuition behind boosting is that combining a set of
predictions, each of which has more than 50% probability of being correct, can produce an arbitrarily
accurate predictor function.
The AdaBoost algorithm (described by J. Zhu, H. Zou, S. Rosset and T. Hastie 2009 in https://
web.stanford.edu/~hastie/Papers/samme.pdf) is iterative. It starts with a weak classifying algorithm, and
each iteration gives higher weights to the data points that the previous iteration classified incorrectly—a
technique called Adaptive Boosting, for which the AdaBoost algorithm is named. AdaBoost constructs a
strong classifier as a linear combination of weak classifiers.
The AdaBoost_Drive function uses a single decision tree as the initial weak classifying algorithm.
Boosting can be very sensitive to noise in the data. Because weak classifiers are likely to incorrectly classify
outliers, the algorithm weights outliers more heavily with each iteration, thereby increasing their influence
on the final result.
The boosting process is:
1. Train on a data set, using a weak classifier. (For the first iteration, all data points have equal weight.)
2. Calculate the weighted training error.
3. Calculate the weight of the current classifier to use in the final calculation (step 6).
4. Update the weights for the next iteration by decreasing the weights of the correctly classified data points
and increasing the weights of the incorrectly classified data points.
5. Repeat steps 1 through 4 for each weak classifier.
6. Compute the final strong classifier as the linear combination of the weak classifiers, each weighted by the classifier weight calculated in step 3.
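For reference, the cited paper (Zhu et al., the SAMME algorithm) expresses steps 2 through 4 with the following formulas, where K is the number of classes, T^{(m)} is the weak classifier at iteration m, w_i is the weight of data point i, and I(.) is the indicator function. This is the paper's notation, not necessarily the exact internal formulation used by AdaBoost_Drive:

err^{(m)} = \sum_i w_i I(c_i \neq T^{(m)}(x_i)) / \sum_i w_i

\alpha^{(m)} = \log((1 - err^{(m)}) / err^{(m)}) + \log(K - 1)

w_i \leftarrow w_i \exp(\alpha^{(m)} I(c_i \neq T^{(m)}(x_i)))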
AdaBoost_Drive
Summary
The AdaBoost_Drive function takes a training data set and a single decision tree and uses adaptive boosting
to produce a strong classifying model that can be input to the function AdaBoost_Predict.
AdaBoost_Drive Syntax
Version 1.5
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
AttributeTable Required Specifies the name of the table that contains the
attributes and values of the data.
AttributeNameColumns Required Specifies the names of attribute table columns that
contain the data attributes.
AttributeValueColumn Required Specifies the name of the attribute table column that contains the data values.
CategoricalAttributeTable Optional Specifies the name of the table that contains the
names of the categorical attributes.
Input
The function requires an attribute table and a response table, and has an optional categorical attribute table.
Table 900: AdaBoost_Drive Attribute Table Schema
Example
This example uses home sales data to create a model that predicts home style, which can be input to the
AdaBoostPredict Example.
Input
Forest_Drive Example Input Data Descriptions describes the real estate sales data contained in the input
table. There are six numerical predictors and six categorical predictors. The response variable is homestyle.
The table of raw training data, housing_train, is described by the following two tables.
Create the input table for the AdaBoostDrive function, housing_train_att, by using the Unpivot function on
the table of raw data, housing_train:
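The exact Unpivot call is not reproduced here; a plain-SQL sketch of the same transformation follows, with the predictor column names taken from the attribute list below (the DISTRIBUTE BY clause is an assumption about the table definition):

-- Key-value unpivot of housing_train into (sn, attribute, value) rows.
-- One SELECT per predictor column; three are shown, and the remaining
-- columns (bedrooms, driveway, fullbase, garagepl, gashw, prefarea,
-- recroom, stories, airco) follow the same pattern.
CREATE TABLE housing_train_att
DISTRIBUTE BY HASH (sn) AS
SELECT sn, 'price' AS attribute, CAST(price AS VARCHAR) AS value FROM housing_train
UNION ALL
SELECT sn, 'lotsize' AS attribute, CAST(lotsize AS VARCHAR) AS value FROM housing_train
UNION ALL
SELECT sn, 'bathrms' AS attribute, CAST(bathrms AS VARCHAR) AS value FROM housing_train;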
sn attribute value
1 airco no
1 bathrms 1
1 bedrooms 3
1 driveway yes
1 fullbase yes
1 garagepl 1
1 gashw no
1 lotsize 5850.0
1 prefarea no
1 price 42000.0
1 recroom no
1 stories 2
2 airco no
2 bathrms 1
2 bedrooms 2
2 fullbase no
2 garagepl 0
sn attribute value
2 gashw no
2 lotsize 4000.0
2 prefarea no
2 price 38500.0
2 recroom no
2 stories 1
... ... ...
Create the response table for the AdaBoostDrive function, housing_train_response, by selecting the columns
sn and homestyle from the table of raw data, housing_train:
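A statement of the following form creates it (the DISTRIBUTE BY clause is an assumption about the table definition):

CREATE TABLE housing_train_response
DISTRIBUTE BY HASH (sn) AS
SELECT sn, homestyle AS response
FROM housing_train;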
sn response
1 Classic
2 Classic
3 Classic
4 Eclectic
5 Eclectic
6 Eclectic
7 Eclectic
8 Eclectic
9 Eclectic
10 Eclectic
... ...
Create and populate the categorical attribute table, housing_cat, for the AdaBoostDrive function:
attribute
airco
driveway
fullbase
gashw
prefarea
recroom
SQL-MapReduce Call
Create the model, abd_model, using the default values for the optional arguments:
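A call of the following general form creates the model. It is a sketch: the ResponseTable and OutputTable argument names and the driver ON clause are assumptions, while the other argument names appear in the Arguments table above.

SELECT * FROM AdaBoost_Drive (
  ON (SELECT 1) PARTITION BY 1
  AttributeTable ('housing_train_att')
  ResponseTable ('housing_train_response')
  CategoricalAttributeTable ('housing_cat')
  OutputTable ('abd_model')
  AttributeNameColumns ('attribute')
  AttributeValueColumn ('value')
);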
Output
Because the argument IterNum has the value 20, the function builds 20 classification trees.
Table 910: AdaBoost_Drive Example Output Message
message
Input tables:"housing_train_att", "housing_cat", "housing_train_response"
message
Running 20 round AdaBoost, computing 20 classification trees.
AdaBoost model created in table "abd_model"
This query returns the model table, described by the following two tables:
AdaBoost_Predict
Summary
The AdaBoost_Predict function applies the model output by the AdaBoost_Drive function to a new data set.
Usage
AdaBoost_Predict Syntax
Version 1.5
Arguments
Argument Category Description
AttrTableGroupbyColumns Required Specifies the names of the columns on which the attribute table is partitioned.
AttrTablePidColumns Required Specifies the names of the attribute table columns that contain the data point identifiers.
AttrTableValColumns Required Specifies the name of the attribute table column that contains the data point values.
Input
The function requires the attribute table that is also input to the AdaBoost_Drive function and the model
table output by the AdaBoost_Drive function.
Example
This example uses test data and the model output by the AdaBoostDrive Example to use real estate sales data
to predict home style.
Input
The table of raw test data, housing_test, is described by the following two tables.
Table 914: AdaBoostPredict Example Raw Input Table housing_test, Columns 1-9
Create the input table for the AdaBoostPredict function, housing_test_att, by using the Unpivot function on
the table of raw data, housing_test:
sn attribute value
13 airco no
13 bathrms 1
sn attribute value
13 bedrooms 3
13 driveway yes
13 fullbase no
13 garagepl 0
13 gashw no
13 lotsize 1700.0
13 prefarea no
13 price 27000.0
13 recroom no
13 stories 2
16 airco yes
16 bathrms 1
16 bedrooms 2
16 driveway yes
16 fullbase no
16 garagepl 0
16 gashw no
16 lotsize 3185.0
16 prefarea no
16 price 37900.0
16 recroom no
16 stories 1
... ... ...
SQL-MapReduce Call
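A call of the following general form produces the predictions. It is a sketch: the ON clauses are assumptions, while the attribute-table column names (sn, attribute, value) come from housing_test_att above.

SELECT * FROM AdaBoost_Predict (
  ON housing_test_att AS attribute_table PARTITION BY attribute
  ON abd_model AS model_table DIMENSION
  AttrTableGroupbyColumns ('attribute')
  AttrTablePidColumns ('sn')
  AttrTableValColumns ('value')
) ORDER BY sn;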
Output
The pred_label column contains the predicted response.
Table 917: AdaBoost_Predict Example Output Table
sn pred_label
13 Classic
16 Classic
25 Classic
38 Eclectic
53 Eclectic
104 Bungalow
111 Classic
117 Eclectic
132 Classic
140 Classic
142 Classic
157 Eclectic
161 Eclectic
162 Bungalow
176 Eclectic
177 Eclectic
195 Classic
198 Classic
224 Eclectic
234 Classic
237 Classic
239 Classic
249 Classic
251 Classic
254 Eclectic
sn pred_label
255 Eclectic
260 Classic
274 Eclectic
294 Classic
301 Eclectic
306 Eclectic
317 Eclectic
329 Bungalow
339 Bungalow
340 Eclectic
353 Eclectic
355 Eclectic
364 Eclectic
367 Bungalow
377 Bungalow
401 Eclectic
403 Eclectic
408 Eclectic
411 Eclectic
440 Eclectic
441 Eclectic
443 Eclectic
459 Classic
463 Classic
469 Eclectic
472 Eclectic
527 Bungalow
530 Eclectic
540 Eclectic
PA
0.98148148148148148148
The prediction accuracy is 98.1%, a large improvement over the Forest_Predict function, whose prediction
accuracy is 77.8% on the same input.
Association Analysis
• Basket_Generator
• CFilter
• FPGrowth
• Recommender Functions
Basket_Generator
Summary
The Basket_Generator function generates baskets (sets) of items. The input is typically a set of purchase
transaction records or web page view logs. Each basket is a unique combination or permutation of items.
You can use the baskets as part of a collaborative filtering algorithm, which is useful for analyzing purchase
behavior of users in a store or on a web site. You can also use this function on activity data (for example,
“users who viewed this page also viewed this page”).
Background
Retailers mine transaction data to find combinations (baskets) of items that customers purchase together or
shop for at the same time. Retailers frequently must automatically identify such baskets, look for trends over
time, and compare other attributes (such as stores).
The Basket_Generator function is intended to facilitate market basket analysis by operating on data that is
structured in a form typical of retail transaction history databases.
Usage
Basket_Generator Syntax
Version 1.3
Arguments
Argument Category Description
BasketItem Required Specifies the names of the input columns that contain the items to be
collected into baskets. If you specify multiple columns, the function
treats every unique combination of column values as one item.
For example, you could specify only the column that contains the stock
keeping unit (SKU) that identifies an item that was sold. Alternatively,
you could specify the SKU column and the columns that contain the
month manufactured, color and size.
BasketSize Optional Specifies the number of items to be included in a basket (an INTEGER
value). The default value is 2.
Accumulate Optional Specifies the names of the input columns to copy to the output table.
Each accumulate_column must be a partition_column; otherwise, the
function is nondeterministic. However, not every partition_column must
be an accumulate_column.
Combination Optional Specifies whether the function returns a basket for each unique
combination of items. The default value is 'true'. If you specify 'false',
then the function returns a basket for each unique permutation of items.
In a combination, item order is irrelevant. For example, the baskets
"tomatoes and basil" and "basil and tomatoes" are equivalent.
In a permutation, item order is relevant. For example, the baskets
"tomatoes and basil" and "basil and tomatoes" are not equivalent.
The function returns combinations and permutations in lexicographical
order.
MaxItems Optional Specifies the maximum number of items in a partition (an INTEGER
value). If the number of items in a partition exceeds item_set_max, then
the function ignores that partition. The default value is 100.
Input
The following table describes the input table columns that you can specify in function arguments. The input
table can have additional columns, but the function ignores them.
Table 919: Basket_Generator Input Table Schema
Output
In the output table, each row represents a basket.
Table 920: Basket_Generator Output Table Schema
If the number of combinations or permutations exceeds one million, then the function outputs no rows.
If n is the number of distinct items that can appear in a basket and r is basket_size, then:
• The maximum possible number of combinations is nCr = n!/(r!(n-r)!)
• The maximum possible number of permutations is nPr = n!/(n-r)!
Examples
• Input
• Example 1: Partition by tranid
• Example 2: Increase BasketSize
Input
These examples both use the same grocery data (grocery_transaction) for a sample of five transactions
(customers). The function outputs the different combinations of two items (basket size of 2), grouped by the
transaction ID (tranid).
Table 921: Basket_Generator Example Input Table grocery_transaction
SQL-MapReduce Call
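A call of the following general form produces the Example 1 output. It is a sketch; the item column name (here item) is an assumption, while the tranid partitioning follows the example description and the Accumulate column is a partition column, as the Arguments table requires.

SELECT * FROM Basket_Generator (
  ON grocery_transaction PARTITION BY tranid
  BasketItem ('item')
  BasketSize (2)
  Accumulate ('tranid')
);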
Output
SQL-MapReduce Call
CFilter
Summary
The CFilter function performs collaborative filtering by using a series of SQL commands and SQL-
MapReduce functions. You run this function by using an internal JDBC wrapper function.
Background
Analysts use collaborative filtering to find items or events that are frequently paired with other items or
events. For example, an online store that tells a shopper, “Other shoppers who bought this item also bought
these items” uses a collaborative filtering algorithm. A networking site that tells a user, “Those who viewed
this profile also viewed these profiles” also uses a collaborative filtering algorithm. CFilter is a general-
purpose collaborative filter that can provide answers in many similar use cases.
Usage
CFilter Syntax
Version 1.7
Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the data to filter.
OutputTable Required Specifies the name of the output table that the function creates. The table
must not exist.
InputColumns Required Specifies the names of the input table columns that contain the data to
filter.
JoinColumns Required Specifies the names of the input table columns to join.
AddColumns Optional Specifies the names of the input columns to copy to the output table. The
function partitions the input data and the output table on these columns.
By default, the function treats the input data as belonging to one
partition.
Note:
Specifying a column as both an add_column and a join_column
causes incorrect counts in partitions.
PartitionKey Optional Specifies the name of the output column to use as the partition key. The default value is 'col1_item1'.
MaxItemSet Optional Specifies the maximum size of the item set. The default value is 100.
DropTable Optional Specifies whether to drop the output table if it exists. The default value is
'false'.
Input
The CFilter function has one input table. The following table describes the columns that appear in function
arguments. The table can have additional columns, but the function ignores them.
Output
Table 925: CFilter Output Table Schema
Examples
• Input
• Example 1: Collaborative Filtering by Product
• Example 2: Collaborative Filtering by Customer Segment
SQL-MapReduce Call
Output
This query returns the output shown in the table cfilter_output:
SQL-MapReduce Call
Output
This query returns the output shown in the table cfilter_output1:
FPGrowth
Summary
The FPGrowth (frequent pattern growth) function uses an FP-growth algorithm to generate association
rules from patterns in a data set, and then determines their interestingness.
Background
Association rule mining is intended to identify strong rules in databases, using different measures of
interestingness, and then discover regularities between products in large-scale transaction data recorded by
point-of-sale (POS) systems in supermarkets. For example, the association rule {onions, potatoes} =>
{hamburger} indicates that a customer who buys onions and potatoes together is also likely to buy hamburger.
Note:
The FPGrowth function automatically truncates long transactions by removing low-frequency items,
guaranteeing that a single transaction generates at most 1 million patterns. Automatic truncation
depends only on the value of the MaxPatternLength argument.
Usage
FPGrowth Syntax
Version 1.2
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the data set.
OutputPatternTable See Required if PatternsOrRules is 'patterns' or 'both';
Description otherwise, not allowed. Specifies the name of the table
where the function outputs the patterns.
OutputRuleTable See Required if PatternsOrRules is 'rules' or 'both';
Description otherwise, not allowed. Specifies the name of the table
where the function outputs the rules.
TranItemColumns Required Specifies the names of the columns that contain
transaction items to analyze.
TranIDColumns Required Specifies the names of the columns that contain
identifiers for the transaction items.
PatternsOrRules Optional Specifies whether the function outputs patterns, rules, or
both. An example of a pattern is {onions, potatoes,
hamburger}. The default value is 'both'.
GroupByColumns Optional Specifies the names of columns that define the partitions
into which the function groups the input data and
calculates output for it. At least one column must be
usable as a distribution key.
If you omit this argument, then the function considers
all input data to be in a single partition.
Note:
Do not specify the same column in both this
argument and the TRANIDCOLUMN argument,
because this causes incorrect counting in the
partitions.
Input
The FPGrowth function has one required input table. The following table describes its columns.
Table 932: FPGrowth Input Table Schema
Output
The FPGrowth function outputs either a pattern table, a rule table, or both (depending on the value of the
PatternsOrRules argument).
The following table describes the columns of the pattern table.
Table 933: FPGrowth Pattern Table Schema
The output has one row for each rule. The following table describes its columns.
Table 934: FPGrowth Rule Table Schema
Example
Input
The input (sales_transaction) is sales transaction data of an office supply chain store by different geographic
regions and customer segments. The column product specifies the items that are purchased by a customer in
a given transaction (column orderid).
Table 935: FPGrowth Example Input Table sales_transaction
SQL-MapReduce Call
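A call of the following general form produces these tables. It is a sketch: the driver ON clause is an assumption, the MinSupport argument name is taken from the note after the output, and the table and column names come from the example description and output message.

SELECT * FROM FPGrowth (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('sales_transaction')
  OutputPatternTable ('fpgrowth_out_pattern')
  OutputRuleTable ('fpgrowth_out_rule')
  TranItemColumns ('product')
  TranIDColumns ('orderid')
  MinSupport (0.01)
);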
Output
Table 936: FPGrowth Example Output Message
Output Information
Patterns are kept in table "fpgrowth_out_pattern"
Rules are kept in table "fpgrowth_out_rule"
The query below returns the output shown in the table fpgrowth_out_rule:
The output tables contain only the rows that satisfy the MinSupport threshold of 0.01; rows that violate the
condition are discarded.
Recommender Functions
A recommender system is an information filtering system that predicts the ratings or preferences that users
assign to entities like books, songs, movies or other products. Recommender systems are widely used among
online retailers and other businesses.
The goal of a recommender system is to generate accurate recommendations to users of items or products
that might interest them. The typical recommendation task is to predict the rating a user would give to an
item. The Teradata Aster recommender functions are based on Collaborative Filtering (CF), which relies
only on historical rankings of products by users to identify similarities between users and between products,
and thus to identify products that are new to a particular user that the user would rate highly.
WSRecommender
Summary
The WSRecommender function is an item-based, collaborative filtering function that uses a weighted-sum
algorithm to make recommendations (for example, items or products for users to consider purchasing).
Usage
WSRecommender Syntax
Version 1.0
SELECT * FROM WSRecommender (
  ON (SELECT * FROM WSRecommenderReduce (
    ON item_table_name AS item_table PARTITION BY item1_column
    ON user_table_name AS user_table PARTITION BY item_column
    [ Item1 ('item1_column') ]
    [ Item2 ('item2_column') ]
    [ ItemSimilarity ('similarity_column') ]
    [ UserItem ('item_column') ]
    [ UserID ('user_column') ]
    [ UserPref ('preference_column') ]
    [ AccumulateItem ('item_column' [,...]) ]
    [ AccumulateUser ('user_column' [,...]) ]
  )) PARTITION BY usr, item
);
1 "Item-Based Collaborative Filtering recommendation Algorithms." Badrul Sarwar, George Karypis, Joseph
Konstan and John Riedl.
2 “Improved Neighborhood-based Collaborative Filtering.” Robert M. Bell and Yehuda Koren.
Arguments
Argument Category Description
Item1 Optional Specifies the name of the item_table column that contains the first item
(item1). The default value is 'col1_item1'.
Item2 Optional Specifies the name of the item_table column that contains the second
item (item2). The default value is 'col1_item2'.
ItemSimilarity Optional Specifies the name of the item_table column that contains the similarity
score for item1 and item2. The default value is 'cntb'.
UserItem Optional Specifies the name of the user_table column that contains the names of
the items that the user viewed or purchased. The default value is 'item'.
UserID Optional Specifies the name of the user_table column that contains the unique
user identifiers. The default value is 'usr'.
UserPref Optional Specifies the name of the user_table column that contains user
preferences for an item, expressed as numeric values. The value 0
indicates no preference. The default value is 'preference'.
AccumulateItem Optional Specifies the names of item_table columns to copy to the output table.
AccumulateUser Optional Specifies the names of user_table columns to copy to the output table.
Input
The WSRecommender function requires an item table and a user table.
The item table must be symmetric with respect to item1_column and item2_column. That is, if a row has
'apple' in item1_column and 'bread' in item2_column, then another row must have 'bread' in item1_column
and 'apple' in item2_column, and these two rows must have the same value in similarity_column.
Table 941: WSRecommender Item Table Schema
Note:
The database handles NULL values in
partitioning columns. You need not exclude
them with a WHERE clause.
The function gives the best results when the items in item1_column and item2_column satisfy the triangular
inequality.
Table 942: WSRecommender User Table Schema
Note:
The database handles NULL values in
partitioning columns. You need not exclude
them with a WHERE clause.
Output
Table 943: WSRecommender Output Table Schema
Example
Input
The item table, recommender_product, contains product categories and their similarity scores. The
similarity scores are from column cntb in CFilter Example 2 Output Table cfilter_output1, which contains
the number of co-occurrences of the items in product_category_a and product_category_b.
Table 944: WSRecommender Example Item Table recommender_product
The user table, recommender_user, shows the product preference (business presence) of four companies in
four product categories, on a scale of 0 to 10 (10 is highest). For example, the table shows that in the
Consumer product category, Walmart has a high business presence, while Staples has none. The
prod_preference 0 means that the company has never viewed or bought a product in that category.
Table 945: WSRecommender Example User Table recommender_user
SQL-MapReduce Call
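A call of the following general form produces the recommendations, following the syntax above. It is a sketch: the outer PARTITION BY clause is an assumption, while the column names (product_category_a, product_category_b, cntb, item, usr, prod_preference) come from the example tables and output description.

SELECT * FROM WSRecommender (
  ON (SELECT * FROM WSRecommenderReduce (
    ON recommender_product AS item_table PARTITION BY product_category_a
    ON recommender_user AS user_table PARTITION BY item
    Item1 ('product_category_a')
    Item2 ('product_category_b')
    ItemSimilarity ('cntb')
    UserItem ('item')
    UserID ('usr')
    UserPref ('prod_preference')
  )) PARTITION BY usr, item
);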
Output
If the company (usr) has ever viewed or bought items in the product category (item), then the
recommendation column contains the value in the prod_preference column of the user table; otherwise, the
column contains the recommendation score calculated by the function.
If the recommendation value is greater than 0 and the company has never viewed or bought items in the
product category (that is, the value in the prod_preference column of the user table is 0), then the
new_reco_flag is 1, meaning that the product category is to be recommended to the company.
KNNRecommenderTrain
Summary
The KNNRecommenderTrain function trains a k-nearest-neighbor (KNN) collaborative filtering model on a table of user ratings. It outputs the interpolation weights and bias statistics that the KNNRecommenderPredict function uses to predict ratings.
Usage
KNNRecommenderTrain Syntax
Version 1.0
Arguments
Argument Category Description
RatingTable Required The user rating table.
UserIdColumn Optional The user id column in the rating table. The default is the
first column in the rating table.
ItemIdColumn Optional The item id column in the rating table. The default is the
second column in the rating table.
RatingColumn Optional The rating column in the rating table. The default is the
third column in the rating table.
WeightModelTable Required Name for the output table containing the interpolation
weights.
BiasModelTable Required Name for the output table containing the global, user, and
item bias statistics.
NearestItemsTable Optional Name for the output table containing the nearest neighbors
for each item.
If this argument is not present, the nearest-items table is not produced.
If the argument is used, and a table with the specified name
exists, the function uses the existing table to train the model.
If the argument is used and no table with the specified name
exists, the function creates a table with the specified name.
K Optional The number of nearest neighbors used in the calculation of
the interpolation weights.
Default is 20.
LearningRate Optional Initial learning rate. The learning rate adjusts automatically
during training based on changes in the rmse.
Default is 0.001.
MaxIterNum Optional Maximum number of iterations. Default is 10.
Threshold Optional The function stops when the rmse drops below this level.
Default is 0.0002.
ItemSimilarity Optional The method used to calculate item similarity. The default is
the Pearson correlation coefficient.
Options include:
• Pearson (Pearson correlation coefficient)
Input
KNNRecommender takes a single input table that contains ratings of various items by a set of users. The
schema is shown in the following table:
Table 947: KNNRecommender Input Table Schema
Output
When the function completes, KnnRecommenderTrain displays a table of the root mean square error (rmse)
at each iteration (schema shown in the following table).
Table 948: KNNRecommenderTrain Output Table Schema
The function also creates the following three output tables: a table of the interpolation weights, a table of the
bias values calculated by the function, and an optional table of nearest (item) neighbors.
Table 949: KNNRecommenderTrain Interpolation Weights Table Schema
Example
Input
The input table, ml_ratings, is a collection of movie ratings from 50 users on approximately 2900 movies,
with an average of about 150 ratings per user. There are 10 possible ratings, ranging from 0.5 to 5 in steps of
0.5. A higher number indicates a better rating.
Table 952: KNNRecommenderTrain Input Table ml_ratings
SQL-MapReduce Call
KnnRecommenderTrain uses the input data to generate three model tables: the weights model
('ml_weights'), the bias model table ('ml_bias') and the optional nearest items or neighbors table
('ml_itemngbrs').
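A call of the following general form trains the model. It is a sketch: the driver ON clause is an assumption, MaxIterNum (20) matches the 20 iterations shown in the output, and the remaining arguments use their defaults.

SELECT * FROM KNNRecommenderTrain (
  ON (SELECT 1) PARTITION BY 1
  RatingTable ('ml_ratings')
  WeightModelTable ('ml_weights')
  BiasModelTable ('ml_bias')
  NearestItemsTable ('ml_itemngbrs')
  MaxIterNum (20)
);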
Output
The rmse value is output for each of the 20 iterations. The first row of the table, which has a null iternum
value, shows the rmse of the default initialized model.
Table 953: KNNRecommenderTrain Output Table
iternum rmse
0.4825
0 0.4803
1 0.4780
2 0.4757
3 0.4734
4 0.4710
5 0.4686
6 0.4661
7 0.4636
iternum rmse
8 0.4611
9 0.4585
10 0.4560
11 0.4534
12 0.4508
13 0.4482
14 0.4455
15 0.4429
16 0.4403
17 0.4376
18 0.4350
19 0.4323
The s_ij value (the similarity between item i and item j) in the following table is, by default, the Pearson
correlation coefficient.
The following query returns the bias table, which shows the Global (G), User (U), and Item (I) statistics.
label id value
G 3.53538298436258
I 1 3.78125
I 2 3
I 3 2
I 5 3.16666666666667
I 6 3.65
I 7 3
I 9 3
I 10 3.65
... ... ...
KNNRecommenderPredict
Summary
The KNNRecommenderPredict function uses the rating table and the model tables generated by the KNNRecommenderTrain function to predict the ratings that users assign to items.
Usage
KNNRecommenderPredict Syntax
Version 1.0
Arguments
Argument Category Description
UserIdColumn Optional The user id column in the rating table. The default is the first column in
the rating table.
ItemIdColumn Optional The item id column in the rating table. The default is the second column
in the rating table.
Input
The input to the KnnRecommenderPredict function is the rating table and the output tables from
KnnRecommenderTrain (the interpolation weights table and the bias values table).
Output
Table 956: KNNRecommenderPredict Output Table Schema
Example
Input
The function uses the model tables ml_weights and ml_bias (Output) from the KnnRecommenderTrain
function and recommends five movies for ten users from the ratings table.
SQL-MapReduce Call
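A call of the following general form produces the predictions. It is a sketch: the ON clauses and the rating-table column names (userid, itemid) are assumptions.

SELECT * FROM KNNRecommenderPredict (
  ON ml_ratings AS ratings PARTITION BY userid
  ON ml_weights AS weights DIMENSION
  ON ml_bias AS bias DIMENSION
  UserIdColumn ('userid')
  ItemIdColumn ('itemid')
) ORDER BY userid;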
Output
Some predicted ratings are higher than 5, even though the maximum rating is 5. The weighted KNN
recommendation algorithm does not limit its results to the range of the input data. The outcome of interest
is the set of items with the highest recommendation scores; if the resulting ratings must be limited to a
specific range, the output data can be normalized as needed.
Graph Analysis
• Overview of Graph Analysis
• AllPairsShortestPath
• Betweenness
• Closeness
• EigenvectorCentrality
• gTree
• LocalClusteringCoefficient
• LoopyBeliefPropagation
• Modularity
• nTree
• PageRank
• pSALSA
• RandomWalkSample
Important:
Before rerunning the examples in this chapter, remove the existing output tables.
Graph Functions
The functions in this chapter were developed using the Teradata Aster SQL-GR™ framework, which allows
large-scale graph analysis in Aster Database. For more information about SQL-GR™, see Aster Developer
Guide.
What is a Graph?
A graph is a representation of interconnected objects. An object is represented as a vertex (also called a node)
—for example, cities, computers, and people. A link connecting two vertices is called an edge. Edges can
represent roads that connect cities, computer network cables, interpersonal connections (such as co-worker
relationships), and so on.
Figure 16: Graph Example
In Aster Database, to process graphs using SQL-GR™, it is recommended that you represent a graph using
two tables:
• Vertices table
• Edges table
The graph in the preceding figure is represented by the following two tables.
In the following table, each row represents a vertex.
Table 958: Vertex Table Example
In the following table, each row represents an edge.
Table 959: Edge Table Example
Source Destination
A B
A C
A E
B D
C D
C F
C G
E C
Directed Graphs
SQL-GR™ is based on a simple directed graph data model where each directed edge can be represented as an
ordered pair of vertices.
Graph Discovery
Graphs can form complex structures as in social, fraud, or communication networks. Graph discovery refers
to the application of algorithms that analyze the structure of these networks. Graph discovery has
applicability in diverse business areas such as marketing, human resources, security, and operations.
AllPairsShortestPath
Summary
The AllPairsShortestPath function computes the shortest distances between all combinations of the specified
source and target vertices. The function works on directed, undirected, weighted, and unweighted graphs.
The function is useful in social network analysis. The resulting pairs and distances can be aggregated to
determine a closeness metric or the k-degree for each vertex in a graph.
Usage
AllPairsShortestPath Syntax
Version 1.2
Arguments
Argument Category Description
TargetKey Required Specifies the target key (the names of the edges table columns that
identify the target vertex). If you specify targets_table, then the function
uses only the vertices in targets_table as targets (which must be a subset
of those that this argument specifies).
EdgeWeight Optional Specifies the name of the edges table column that contains edge weights.
Each edge_weight is a positive value. By default, each edge_weight is 1;
that is, the graph is unweighted.
Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a
vertex of the graph, and each row of the edges table represents an edge of the graph.
The function has two optional input tables, sources and targets, which specify the vertices that are sources
and targets, respectively. For a directed graph, these tables are required. By default, all vertices are sources
and targets; that is, the graph is undirected.
For an undirected graph, the graph table might have duplicate rows. Remove them, using the code in
Deleting Duplicate Edges Table Rows.
Table 960: AllPairsShortestPath Vertices Table Schema
Output
For source and target vertices connected by a path, the function outputs their corresponding source and
target vertex keys and the distance of the shortest path between them. The function does not output cycle
information.
Table 965: AllPairsShortestPath Output Table Schema
Examples
• Input
• Example 1: Unweighted, Unbounded Graph
• Example 2: Weighted, Unbounded Graph
• Example 3: Weighted, Bounded Graph with Sources
Input
In the graph in the following figure, the nodes represent persons—light blue for males and dark blue for
females. The directed edges represent phone calls from one person to another. Node size represents number
of connections (degree centrality).
Figure 17: Graph of Phone Calls Between Persons
The graph in the preceding figure is represented by the vertices and edges tables callers and calls,
respectively.
respectively.
Table 966: AllPairsShortestPath Examples Vertices Table callers
callerid callername
1 John
2 Carla
3 Simon
4 Celine
5 Winston
6 Diana
SQL-MapReduce Call
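A call of the following general form computes the Example 1 (unweighted, unbounded) result. It is a sketch: the edges table column names (here callerid_from and callerid_to), the MaxDistance argument, and the ORDER BY are assumptions; a negative MaxDistance is used to leave the graph unbounded.

SELECT * FROM AllPairsShortestPath (
  ON callers AS vertices PARTITION BY callerid
  ON calls AS edges PARTITION BY callerid_from
  TargetKey ('callerid_to')
  MaxDistance (-1)
) ORDER BY source, target;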
Output
SQL-MapReduce Call
SQL-MapReduce Call
Output
Betweenness
Summary
The Betweenness function returns the betweenness score, a centrality measurement, for every vertex (node)
in the input graph.
Background
Betweenness, a measure of the centrality of a node in a network, is the fraction of shortest paths between
node pairs that pass through the node of interest.
In a sense, betweenness is a measure of the influence that a node has over the spread of information through
the network. However, by counting only shortest paths, the conventional definition of betweenness
implicitly assumes that information spreads only along those shortest paths. The conventional definition of
betweenness is:

C_B(\nu) = \sum_{s \neq \nu \neq t} \sigma_{s,t}(\nu) / \sigma_{s,t}

where \sigma_{s,t} is the number of shortest paths between nodes s and t, and \sigma_{s,t}(\nu) is the number of shortest paths
between s and t that pass through node \nu.
The Betweenness function uses a hybrid distributed AllPairShortestPath (APSP) algorithm, which executes a
single-node shortest path (SNSP) algorithm for each vertex in the graph. By restricting the number of
parallel SNSP executions to groups of K vertices, the APSP algorithm enables a trade-off between time and
memory usage. (For more information on APSP, see AllPairsShortestPath.)
Assume that N is the number of vertices in the graph, K is the number of vertices that start their SNSP
algorithms in parallel, and D is the number of iterations required to complete a single SNSP algorithm. The
APSP algorithm completes when N/K of these SNSP algorithms have completed. The hybrid distributed
APSP algorithm requires O(D*N/K) iterations and O(N*K) space. In the worst case, D is bounded by the
number of vertices, N.
Usage
Betweenness Syntax
Version 1.2
Arguments
Argument Category Description
TargetKey Required Specifies the target key (the names of the edges table columns that identify the
target vertex). If you specify targets_table, then the function uses only the
vertices in targets_table as targets (which must be a subset of those that this
argument specifies).
Directed Optional Specifies whether the graph is directed. The default value is 'true'.
EdgeWeight Optional Specifies the name of the edges table column that contains edge weights. The
weights are positive values. By default, the weight of each edge is 1 (that is, the
graph is unweighted).
MaxDistance Optional Specifies the maximum distance (an integer) between the source and target
vertices. A negative max_distance specifies an infinite distance. If vertices are
separated by more than max_distance, the function does not output them. The
default value is 10.
GroupSize Optional Specifies the number of source vertices that execute a SNSP algorithm in
parallel. If group_size exceeds the number of source vertices in each partition, s,
then s is the group size. By default, the function calculates the optimal group
size based on various cluster and query characteristics.
Running a group of vertices on each vworker, in parallel, uses less memory than
running all vertices on each vworker.
Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a
vertex of the graph, and each row of the edges table represents an edge of the graph.
The function has two optional input tables, sources and targets, which specify the vertices that are sources
and targets, respectively. For a directed graph, these tables are required. By default, all vertices are sources
and targets; that is, the graph is undirected.
For an undirected graph, the graph table might have duplicate rows. Remove them, using the code in the
section Deleting Duplicate Edges Table Rows of the function AllPairsShortestPath.
For a large graph, specifying the optional sources and targets tables can improve function performance time.
Table 971: Betweenness Vertices Table Schema
Output
Table 975: Betweenness Output Table Schema
Example
This example computes the betweenness score for each person in the social network shown in the following
figure.
Figure 18: Betweenness Example Social Network
vertexid
TED
RICKY
ETHEL
FRED
JOE
RANDY
LUCY
source target
TED ETHEL
RICKY FRED
ETHEL LUCY
ETHEL RANDY
FRED ETHEL
ETHEL FRED
JOE ETHEL
RANDY RICKY
RICKY RANDY
FRED LUCY
SQL-MapReduce Call
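A call of the following general form produces the scores. It is a sketch: the names of the vertices and edges tables (here soc_nodes and soc_edges) are hypothetical, while the vertexid, source, and target columns appear in the tables above.

SELECT * FROM Betweenness (
  ON soc_nodes AS vertices PARTITION BY vertexid
  ON soc_edges AS edges PARTITION BY source
  TargetKey ('target')
) ORDER BY betweenness DESC;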
Output
Ethel has the highest betweenness score.
vertexid betweenness
ETHEL 10
FRED 4
JOE 0
LUCY 0
RANDY 4
RICKY 3
TED 0
Closeness
Summary
The Closeness function returns closeness and k-degree scores for each specified source vertex in a graph.
The closeness scores are the inverse of the sum, the inverse of the average, and the sum of inverses for the
shortest distances to all reachable target vertices (excluding the source vertex itself). The graph can be
directed or undirected, weighted or unweighted.
For a large graph, you can apply the function to a random sample of the specified target vertices to get an
efficient approximation of the closeness and k-degree scores.
Usage
Closeness Syntax
Version 1.2
Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a
vertex of the graph, and each row of the edges table represents an edge of the graph.
The function has two optional input tables, sources and targets, which specify the vertices that are sources
and targets, respectively. For a directed graph, these tables are required. By default, all vertices are sources
and targets; that is, the graph is undirected.
For an undirected graph, the graph table might have duplicate rows. Remove them, using the code in the
section Deleting Duplicate Edges Table Rows of the function AllPairsShortestPath.
Table 979: Closeness Vertices Table Schema
Output
The output table has a row for each source vertex v. The reachable target vertices exclude v itself; that is, the
function does not calculate the closeness and k-degree scores for loops.
Table 983: Closeness Output Table Schema
Examples
• Input
• Example 1: Unweighted, Unbounded Graph
• Example 2: Weighted, Bounded Graph, MaxDistance=12
• Example 3: Weighted, Bounded Graph, MaxDistance=8
Input
In the graph in the following figure, the nodes represent persons—light blue for males and dark blue for
females. The directed edges represent phone calls from one person to another. Node size represents number
of connections (degree centrality).
Figure 19: Graph of Phone Calls Between Persons
The graph in the preceding figure is represented by the vertices and edges tables callers and calls,
respectively.
Table 984: Closeness Examples Vertices Table callers
callerid callername
1 John
2 Carla
callerid callername
3 Simon
4 Celine
5 Winston
6 Diana
SQL-MapReduce Call
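A call of the following general form produces the Example 1 result. It is a sketch; as with AllPairsShortestPath above, the edges table column names callerid_from and callerid_to and the ORDER BY are assumptions.

SELECT * FROM Closeness (
  ON callers AS vertices PARTITION BY callerid
  ON calls AS edges PARTITION BY callerid_from
  TargetKey ('callerid_to')
) ORDER BY callerid;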
Output
Because callerid 6 (Diana) has no outbound calls, the k-degree is 0.
Table 986: Closeness Example 1 Output Table
SQL-MapReduce Call
Output
SQL-MapReduce Call
EigenvectorCentrality
Summary
The EigenvectorCentrality function calculates the centrality (relative importance) of each node in a graph.
Background
Centrality Formulas
In the centrality formulas:
• G represents a graph.
• V represents a vertex.
• N represents the total number of vertices.
• A represents the adjacency matrix of vertices.
• aij represents the element in the matrix that represents the relationship between vertex i and vertex j.
• ci represents the centrality value of vertex i.
Katz Centrality
Katz (1953) gives a measure of centrality as:
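In the notation of Centrality Formulas, the standard Katz measure is:

c_i = \alpha \sum_j a_{ij} c_j + \beta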
Bonacich Centrality
Bonacich (1987) writes a more generic centrality measure as:
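In the same notation, with alpha and beta exchanged relative to Bonacich's original paper (see the note below), the measure is:

c_i = \sum_j (\beta + \alpha c_j) a_{ij}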
Note:
Alpha and beta are exchanged in the above formula (compared to the original one) to be consistent with Katz
centrality.
Power Iteration
Power Iteration is an eigenvalue algorithm to find the largest eigenvalue and corresponding eigenvector.
This algorithm does not compute a matrix decomposition; therefore, it can be used when A is a very large
sparse matrix.
Centrality Calculation
To calculate centrality using the formulas described in Centrality Formulas, the EigenvectorCentrality
function uses an in-neighbors relation matrix of the input source key and target key. In this matrix, aij has
the value 1 if there is an edge from j to i.
Figure 20: In-Neighbors Relation Matrix
If you need an out-neighbors adjacent matrix—for example, to calculate the contribution of a vertex to other
vertices—exchange the source key and target key columns and then invoke this function.
Usage
EigenvectorCentrality Syntax
Version 1.1
Arguments
Argument Category Description
TargetKey Required Specifies the names of the target key columns in the edges table. The
number and data types of columns must correspond to those of
vertex_key.
EdgeWeight Optional Specifies the name of the edges table column that contains the edge
weights. The edge weights must be positive values. If you omit this
argument, then the graph is unweighted.
Family Optional Specifies the centrality formula. The default value is 'eigenvector'. For
descriptions of the centrality formulas, refer to Centrality Formulas.
Alpha Optional Specifies the alpha value for the Katz or Bonacich centrality formula. The
default value is 0.85.
Beta Optional Specifies the beta value for the Katz or Bonacich centrality formula. The
default value is 1 for Katz and 0 for Bonacich.
Directed Optional Specifies whether the graph is directed. The default value is 'true'.
MaxIterNum Optional Specifies the maximum number of iterations for the function. The
default value is 20.
Threshold Optional Specifies the threshold for convergence (the difference between b_{k+1} and b_k). The default value is 0.001.
Accumulate Optional Specifies the names of the input columns to copy to the output table.
Input
The EigenvectorCentrality function has two required input tables, vertices and edges.
The vertices table defines the set of vertices in the graph. Each row represents a vertex. The following table
describes the vertices table columns that the function uses. The table can have additional columns, but the
function ignores them.
Table 989: EigenvectorCentrality Vertices Table Schema
Output
Table 991: EigenvectorCentrality Output Table Schema
Examples
• Input
• Example 1: Eigenvector Centrality (by Default)
• Example 2: Katz Centrality
• Example 3: Bonacich Centrality
Input
In the graph in the following figure, the nodes represent college sophomores and the edges represent the
number of elective subjects that both sophomores have taken.
The graph in the preceding figure is represented by the vertices and edges tables sophomores and
common_classes, respectively.
Table 992: EigenvectorCentrality Example Vertices Table sophomores
id name
A Allen
B Becky
C Cathy
D Darren
Example 1: Eigenvector Centrality (by Default)
SQL-MapReduce Call
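A representative call has this shape (the edges table column names src and dest, and the exact ON clause forms, are assumptions):

SELECT * FROM EigenvectorCentrality(
    ON sophomores AS vertices PARTITION BY id
    ON common_classes AS edges PARTITION BY src
    TargetKey('dest')
    Accumulate('id')
) ORDER BY centrality DESC;
Output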
id centrality
C 0.649450550239096
D 0.528366549347061
A 0.418290184899757
B 0.352244366231374
Example 2: Katz Centrality
SQL-MapReduce Call
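A representative call for Katz centrality (the same assumed column names as in Example 1; Accumulate includes the name column, which appears in the output):

SELECT * FROM EigenvectorCentrality(
    ON sophomores AS vertices PARTITION BY id
    ON common_classes AS edges PARTITION BY src
    TargetKey('dest')
    Family('katz')
    Accumulate('id', 'name')
) ORDER BY centrality DESC;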
Output
id name centrality
C Cathy 0.632103166675334
D Darren 0.50712609393646
A Allen 0.441313029999005
B Becky 0.385371925652169
Example 3: Bonacich Centrality
SQL-MapReduce Call
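A representative call for Bonacich centrality (the same assumed column names as in Example 1; Beta defaults to 0 for Bonacich):

SELECT * FROM EigenvectorCentrality(
    ON sophomores AS vertices PARTITION BY id
    ON common_classes AS edges PARTITION BY src
    TargetKey('dest')
    Family('bonacich')
    Accumulate('id')
) ORDER BY centrality DESC;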
Output
id centrality
C 0.632123195866825
D 0.529800961949689
A 0.445129026468483
B 0.348699520733132
gTree
Summary
The gTree function follows all paths in a graph, starting from a given set of root vertices, and calculates
specified aggregate functions along those paths.
Background
The gTree function is similar to the function nTree, but gTree is implemented using the SQL-GR™ engine.
The SQL-GR™ engine allows the gTree function to traverse arbitrary graphs.
Some information in nTree arguments is input to the gTree function differently, as the following table
shows.
Table 997: gTree Analogs of nTree Arguments
Usage
gTree Syntax
Version 1.0
Arguments
Argument Category Description
TargetKey Required Specifies the names of the columns in the edges table that identify the
target vertex of an edge.
AllowCycles Optional Specifies whether the input graph can include cycles. The default value is
'false'.
MaxIterNum Optional Specifies the maximum depth to which the function traverses the graph
(a nonnegative integer). The default value is 1000.
Output Optional Specifies whether the function outputs all paths ('all') or only paths that
end by reaching a leaf vertex, a cycle, or the maximum number of
iterations ('end'). The default value is 'end'.
Results Either Results or EdgeResults is required Specifies the aggregate functions that the function calculates along each vertex in each path (refer to the following table). The function outputs one column of results for each aggregate function. The column name is alias, if specified; otherwise it is func(expr).
EdgeResults Either Results or EdgeResults is required Specifies the aggregate functions that the function calculates along each edge in each path (refer to the following table). The function outputs one column of results for each aggregate function.
The following table describes the aggregate functions that the Results and EdgeResults arguments support.
In function syntax, expr, expr1, and expr2 are values from the vertices or edges table.
Table 998: Aggregate Functions Supported by gTree Function
current(expr) Returns the value of expr at the final vertex in the path. Return type: same as input.
Note:
Specify this function only in the EdgeResults argument.
sumproduct(expr1, expr2) Calculates the product of expr1 and expr2 at each vertex on the path and then returns the sum of the products. Return type: same as input if expr1 and expr2 have the same type, otherwise numeric.
Input
The gTree function has three required input tables:
• vertices, which defines the set of vertices in the graph
• edges, which defines the set of edges in the graph
• root, which defines the set of root vertices from which the function starts traversing the graph
Table 999: gTree Vertices Table Schema
Output
Table 1002: gTree Output Table Schema
Examples
• Input
• Example 1: Show All Paths from Root Nodes
• Example 2: Show Only Paths That Cycle or End at Leaves
Input
The vertices (nodes) are bus stops in a small town. The vertices table lists each bus stop and the boarding
fare at that stop.
Table 1003: gTree Example Vertices Table gtree_vertices
The edges table represents the bus route. The columns nodeid and nodestring identify the source vertices
(where the bus starts) and the columns endnodeid and endnodestring identify the target vertices (where the
bus stops).
Table 1004: gTree Example Edges Table gtree_edges
The root table defines the set of root vertices from which the function starts traversing the graph.
Table 1005: gTree Example Root Table gtree_root
Example 1: Show All Paths from Root Nodes
SQL-GR™ Call
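A representative call (the fare column name and the exact Results expression are assumptions based on the output description that follows):

SELECT * FROM gTree(
    ON gtree_vertices AS vertices PARTITION BY nodeid
    ON gtree_edges AS edges PARTITION BY nodeid
    ON gtree_root AS root PARTITION BY nodeid
    TargetKey('endnodeid')
    AllowCycles('true')
    Output('all')
    Results('sum(fare)')
);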
Output
The output table has one column for each function that the Results or EdgeResults argument specifies. The
edgepath column shows the links that comprise the path, the cycle column shows whether the path is a cycle,
and the sum column shows the total fare for the path (the sum of the boarding fares at each node in the
path).
Table 1006: gTree Example 1 Output Table
Example 2: Show Only Paths That Cycle or End at Leaves
SQL-GR™ Call
Output
LocalClusteringCoefficient
Summary
The LocalClusteringCoefficient function extends the clustering coefficient to directed and weighted graphs.
The clustering coefficient, which was introduced in the context of binary undirected graphs, is a frequently
used tool for analyzing the structure of a network.
The LocalClusteringCoefficient function is based on the paper "Clustering in complex directed networks" by
Giorgio Fagiolo.
Background
The definition of the local clustering coefficient depends on the graph type:
• Unweighted, Undirected Network (BUN)
• Unweighted, Directed Network (BDN)
• Weighted, Directed Network (WDN)
• Weighted, Undirected Network (WUN)
where
aij = 1 if there is an edge from i to j; otherwise aij = 0.
A triple τ at a node i is a path of length two for which i is the center node. The maximum number of triples
of node i is then defined as:
This occurs when every neighbor of node i is connected to every other neighbor of node i.
The clustering coefficient was introduced by Watts and Strogatz (D. J. Watts and S. H. Strogatz. Collective
dynamics of “small-world” networks. Nature, 393:440–442, 1998) in the context of social network analysis.
Given three actors i, j, and h, with mutual relations between i and j, as well as between i and h, it is supposed
to represent the likeliness that j and h are also related.
For each pattern, its clustering coefficient (CC) can be defined as:
where d_i^↔ is the number of bilateral edges between i and its neighbors.
τ_i^mid = d_i^in d_i^out − d_i^↔
τ_i^in = d_i^in (d_i^in − 1)
τ_i^out = d_i^out (d_i^out − 1)
Usage
LocalClusteringCoefficient Syntax
Version 1.1
Note:
In the DegreeRange argument, you must type the brackets. They do not indicate that their contents are
optional.
Arguments
Argument Category Description
TargetKey Required Specifies the key of the target vertex of an edge. The key consists of one
or more edges table column names.
Directed Optional Specifies whether the graph is directed. The default value is 'false'.
EdgeWeight Optional Specifies the name of the edges table column that contains the edge
weights. Each edge weight is a positive value in the range (0, 1]. By
default, the function treats the input graph as unweighted.
DegreeRange Optional Specifies the edge degree range—at least min and at most max
([min:max]), at least min ([min:]), or at most max ([:max]). The min and
max must be positive integers. The function outputs only nodes with
degrees in the specified range. By default, the function outputs all nodes.
Accumulate Optional Specifies the names of the vertices table columns to copy to the output
table.
The following table describes the edges table columns that you must or can specify in the function call. The
table can have additional columns, but the function ignores them.
The data-checking rules for the edges table are, given nodes A and B:
• No graph can have multiple A→B edges.
• An undirected graph cannot have edges A→B and B→A.
Table 1010: LocalClusteringCoefficient Edges Table Schema
Output
The output table schema depends on the graph type. The graph types are:
• BUN: Unweighted, undirected network
• BDN: Unweighted, directed network
• WUN: Weighted, undirected network
• WDN: Weighted, directed network
The output table schemas for BDN and WDN graphs (described by the following two tables) refer to cycle,
middleman, in, and out triangles, which are explained in Unweighted, Directed Network (BDN).
Examples
• Input
• Example 1: WUN
• Example 2: WUN with DegreeRange 3 or Greater
• Example 3: WDN
Input
In the graph in the following figure, the nodes represent countries, edges connect countries that trade with
each other, and the numbers on the edges represent trade propensity.
Figure 22: Graph of Trading Partners
The graph in the preceding figure is represented by the vertices and edges tables country and trade,
respectively.
Table 1015: LocalClusteringCoefficient Examples Vertices Table country
countryid name
1 USA
2 China
3 UK
4 Japan
5 France
Example 1: WUN
This example treats the input graph as a weighted, undirected network (WUN).
SQL-MapReduce Call
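A representative call (the trade table's source, target, and weight column names are assumptions):

SELECT * FROM LocalClusteringCoefficient(
    ON country AS vertices PARTITION BY countryid
    ON trade AS edges PARTITION BY from_id
    TargetKey('to_id')
    EdgeWeight('weight')
    Accumulate('countryid', 'name')
) ORDER BY countryid;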
Output
Example 2: WUN with DegreeRange 3 or Greater
This example treats the input graph as a WUN and outputs only nodes with degree 3 or greater.
SQL-MapReduce Call
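A representative call (the same assumed column names as in Example 1; DegreeRange uses the documented bracket syntax):

SELECT * FROM LocalClusteringCoefficient(
    ON country AS vertices PARTITION BY countryid
    ON trade AS edges PARTITION BY from_id
    TargetKey('to_id')
    EdgeWeight('weight')
    DegreeRange('[3:]')
    Accumulate('countryid', 'name')
) ORDER BY countryid;
Output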
Example 3: WDN
This example treats the input graph as a weighted, directed network (WDN).
SQL-MapReduce Call
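A representative call (the same assumed column names as in Example 1; Directed('true') makes the network a WDN):

SELECT * FROM LocalClusteringCoefficient(
    ON country AS vertices PARTITION BY countryid
    ON trade AS edges PARTITION BY from_id
    TargetKey('to_id')
    Directed('true')
    EdgeWeight('weight')
    Accumulate('countryid', 'name')
) ORDER BY countryid;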
Output
LoopyBeliefPropagation
Summary
Belief propagation, or sum-product message passing, is an algorithm for inferring probabilities from graphical
models, such as Bayesian networks and Markov random fields.
The LoopyBeliefPropagation function calculates, for a Bayesian network of binary variables, the marginal
distribution for each unobserved variable, conditional on any observed variables.
Background
A Bayesian network is a probabilistic graphical model that represents a set of random variables and their
conditional dependencies with a directed acyclic graph (DAG). For example, a Bayesian network can
represent the probabilistic relationships between symptoms and diseases. Given symptoms, belief
propagation can use the graph to compute the probabilities of the presence of various diseases.
Formally, Bayesian networks are DAGs whose vertices (or nodes) represent random variables in the
Bayesian sense: they may be observable quantities, latent variables, unknown parameters, or hypotheses.
Each vertex is associated with a probability function that takes as input the values of the vertex's parent
variables and returns the probability of the variable represented by the vertex. For example, if the parents are
m binary variables, then the probability function could be represented by a table of 2^m entries, one entry for
each of the 2^m possible combinations of its parents' values. If variables are conditionally dependent on each
other, then the vertices that represent them are connected by edges.
To use the LoopyBeliefPropagation function, you must specify only the conditional dependence between
variables (directed edges, possibly weighted) and the values for observed variables. The function computes
the potential tables, using the following functions at factor nodes:
Usage
LoopyBeliefPropagation Syntax
Version 1.0
Arguments
Argument Category Description
TargetKey Optional Specifies the names of the edges table columns that comprise
the key of the target vertices.
ObservationColumn Required with observation table, otherwise optional Specifies the name of the
observation table column that contains the observations.
EdgeWeight Optional Specifies the name of the edges table column that contains
the edge weights. The function uses only positive edge
weights. The sum of the edge weights that the function uses
must be 1.
Accumulate Optional Specifies the names of the vertices table columns to copy to
the output table.
MaxIterNum Optional Specifies the maximum number of iterations that the
algorithm can run. The default value is 20.
Threshold Optional Specifies the threshold value for convergence. The default
value is 0.0001.
Input
The LoopyBeliefPropagation function has two required input tables, vertices and edges, and one optional
input table, observation.
The following table describes the vertices table columns that you must or can specify in the function call.
The table can have additional columns, but the function ignores them.
Table 1022: LoopyBeliefPropagation Vertices Table Schema
Note:
The per-vertex computational cost is exponential in the in-degree. Therefore, avoid vertices with an
in-degree higher than 20; otherwise, the function could take an unexpectedly long time to finish.
If variables are conditionally dependent on each other, then the vertices that represent them are connected
by edges. The edges table contains the columns that comprise the keys of the source and target vertices of the
edges, and optionally, a column that contains the weights of the edges.
Table 1023: LoopyBeliefPropagation Edges Table Schema
The observations table contains the vertices (which represent variables) and the observations for observed
variables.
Table 1024: LoopyBeliefPropagation Observation Table Schema
Output
Table 1025: LoopyBeliefPropagation Output Table Schema
Examples
Input
Table 1026: LoopyBeliefPropagation Examples Vertices Table lbp_vertices
id vertex
1 Jaundice
2 Internal bleeding
3 Loss of appetite
4 Fatigue
5 Fever
6 Dark urine
7 Stupor
8 Nausea/vomiting
9 Hepatitis
Table 1027: LoopyBeliefPropagation Examples Edges Table lbp_edges
id source target
1 Jaundice Hepatitis
2 Internal bleeding Hepatitis
3 Loss of appetite Hepatitis
4 Fatigue Hepatitis
5 Fever Hepatitis
6 Dark urine Hepatitis
7 Stupor Hepatitis
8 Nausea/vomiting Hepatitis
In the observation table, 't' means that the symptom is present and 'f' means that it is absent.
Table 1028: LoopyBeliefPropagation Examples Observation Table lbp_observation
id vertex obs
1 Jaundice t
2 Internal bleeding t
3 Loss of appetite t
4 Fatigue t
5 Fever f
6 Dark urine t
7 Stupor f
id vertex obs
8 Nausea/vomiting f
SQL-MapReduce Call
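A representative call (the ON clause forms are assumptions; the argument names are those documented under Usage):

SELECT * FROM LoopyBeliefPropagation(
    ON lbp_vertices AS vertices PARTITION BY vertex
    ON lbp_edges AS edges PARTITION BY source
    ON lbp_observation AS observation PARTITION BY vertex
    TargetKey('target')
    ObservationColumn('obs')
    Accumulate('vertex')
) ORDER BY vertex;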
Output
In the output table, 1 means that the symptom is present and 0 means that it is absent. Five of the eight
symptoms are present, so the conditional probability of hepatitis is 5/8 (0.625).
Table 1029: LoopyBeliefPropagation Example 1 Output Table
vertex prob_true
Dark urine 1
Fatigue 1
Fever 0
Hepatitis 0.625
Internal bleeding 1
Jaundice 1
Loss of appetite 1
Nausea/vomiting 0
Stupor 0
Input
Use the table below and the following tables from the Input section of Example 1:
• LoopyBeliefPropagation Examples Vertices Table lbp_vertices
• LoopyBeliefPropagation Examples Observation Table lbp_observation
SQL-MapReduce Call
Output
In the output table, 1 means that the symptom is present and 0 means that it is absent. The conditional
probability of hepatitis is the sum of the weights of the symptoms that are present (0.25 + 0.1 + 0.05 + 0.15
+ 0.2 = 0.75).
Table 1031: LoopyBeliefPropagation Example 2 Output Table
vertex prob_true
Dark urine 1
Fatigue 1
Fever 0
Hepatitis 0.75
Internal bleeding 1
Jaundice 1
Loss of appetite 1
Nausea/vomiting 0
vertex prob_true
Stupor 0
Modularity
Summary
The Modularity function uses a clustering algorithm to detect communities in networks (graphs). The
function needs no prior knowledge or estimation of starting cluster centers and assumes no particular
distribution of the input data set. The graph can be directed or undirected, weighted or unweighted.
Background
Many real-world, large data sets can be represented as networks (graphs). Most of these networks have a
community structure—groups of densely interconnected nodes that are only sparsely connected with the
rest of the network. Community detection and identification is very important for understanding network
dynamics, because community properties—node degree, clustering coefficient, betweenness, centrality, and
so on—can be quite different from those of the whole network. For example, a closely connected social
community tends to have a faster information transmission rate than a loosely connected community.
Modularity measures the strength of division of a network into modules (also called communities, clusters,
or groups). Maximizing modularity leads to the identification of communities in a given network. For
detailed information on modularity, see:
• http://en.wikipedia.org/wiki/Community_structure
• http://en.wikipedia.org/wiki/Modularity_(networks)
• M. E. J. Newman, Modularity and community structure in networks, Proceedings of the National Academy
of Sciences of the United States of America, 2006.
• V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks. J.
Stat. Mech., 2008.
Real-world use cases for modularity include:
• Social network: Identifying communities of acquaintances or circles of influences.
• Telecom:
Definitions
Quality Metric
If e_ij represents the number of edges between clusters i and j, C represents the set of clusters of nodes, and
m represents the total number of edges in the graph, then the modularity of the graph is given by Q:
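One standard formulation consistent with these symbols (a sketch; here d_i denotes the total degree of the vertices in cluster i):

Q = \sum_{i \in C} \left[ \frac{e_{ii}}{m} - \left( \frac{d_i}{2m} \right)^{2} \right]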
Resolution
Resolution controls the hierarchical level of information about community formation. It represents the level
in a dendrogram at which to converge on the number of communities to be detected. Think of different
resolution points as different hierarchical levels in a tree of nodes interconnected through an edge table.
Two ways to think of resolution are:
• Higher resolution (> 1.0) provides a visualization of the graph nearer the root of the tree and lower
resolution (in the range (0.0, 1.0]) provides a visualization of the graph nearer the leaves of the tree.
• Higher resolution “zooms out,” providing fewer, larger communities; lower resolution “zooms in,”
providing more, smaller communities.
Modularity Syntax
Version 1.1
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
TargetKey Required Specifies the key of the target vertex of an edge. The key
consists of the names of one or more edges table columns.
Directed Optional Legacy argument that determined whether the graph was
directed. The default value was 'true'. The function now
ignores this argument, treating all graphs as undirected.
EdgeWeight Optional Specifies the name of the edges table column that contains
edge weights. The weights are positive values. By default, the
weight of each edge is 1 (that is, the graph is unweighted).
This argument determines how the function treats duplicate
edges (that is, edges with the same source and destination,
which might have different weights). For a weighted graph,
the function treats duplicate edges as a single edge whose
weight is the sum of the duplicate edge weights.
Output
The function outputs a community vertex table and, optionally, a community edges table.
The community vertex table has a row of modularity results for each specified resolution level.
Table 1035: Community Vertex Table Schema (Default Resolution)
The community edges table contains the edge weights (strength) between different communities at specified
resolutions. The table is created implicitly on the database and is not displayed at the output of function
execution. To display the contents of the community edges table, use the command SELECT *
FROM community_edge_table (where community_edge_table is the name that you specified in the
CommunityEdgeTable argument).
Table 1036: Community Edges Table Schema
Examples
• Input
• Example 1: Unweighted Edges
• Example 2: Weighted Edges and Community Edge Table
Input
In the graph in the following figure, the nodes represent persons who are geographically distributed across
the United States and are connected on an online social network, on which they follow each other. The
directed edges start at the follower and end at the leader. For example, Alex follows Bob and Casey.
Figure 25: Graph of Social Network
The graph in the preceding figure is represented by the vertices and edges tables friends and
followers_leaders, respectively. The edges table column intensity represents the fervor with which the
follower follows the leader, on a scale from 1 (lowest) to 10 (highest).
Table 1037: Modularity Examples Vertices Table friends
Example 1: Unweighted Edges
SQL-MapReduce Call
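A representative call (the friends and followers_leaders column names are assumptions):

SELECT * FROM Modularity(
    ON friends AS vertices PARTITION BY id
    ON followers_leaders AS edges PARTITION BY follower_id
    TargetKey('leader_id')
    Accumulate('id')
);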
Output
Example 2: Weighted Edges and Community Edge Table
SQL-MapReduce Call
Output
To verify that the modularity scores for the community_edges table and the community vertex table match:
1. Remove all bidirectional edges from the community_edges table.
(For this example, remove rows 5 and 7 from the preceding table.)
2. Run the Modularity function using the community_edges table as edges and the unique community ids
as vertices, and specify the weight column with the EdgeWeight argument.
(For this example, the modularity score is 0.426, which matches the modularity score in the community
vertex table.)
Tips
• To organize a data set into communities or clusters, the data set first must be transformed into a graph.
In graph terminology, the objects of interest represent vertices and the relations among objects are
represented by edges.
• The edges and their weights can be based on heuristics that are typically specific to the use case at hand.
∘ For simple data points, edges can be established by connecting any two objects whose distance is less
than some threshold d, by connecting an object to its k closest objects regardless of distance, or by a
combination of the two techniques.
∘ For complex data points, the edge weight represents the similarity or strength of the relation between two
objects. It can be represented by closeness among vertices, a similarity metric score such as cosine
similarity, or the inverse of the distance between objects.
∘ To compute edge weights for a general data set in which each object is represented by a combination of
attributes, other statistical techniques can be employed to establish the similarity metric among
objects. Techniques such as principal component analysis, data normalization, and feature extraction
can be used to produce meaningful graphs from input data.
• In some cases, when you specify multiple resolution points in the Resolution argument, the modularity
that the function reports for the resolution points varies slightly from the modularity that the function
reports for the same resolution points when you specify them individually. In such cases, specifying the
resolution points individually is recommended.
• The function runs faster on graphs with vertices of data type INTEGER or BIGINT, because they use
much less memory than graphs with vertices of other data types.
Suppose that you have vertices table string_nodes with column id of data type VARCHAR and edges
table string_edges with columns src_id, dest_id, and weight of data types VARCHAR, VARCHAR, and
INTEGER, respectively. You can generate equivalent INTEGER-based vertices and edges tables with
statements such as these:
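A sketch of one way to do this (the ROW_NUMBER-based mapping is an assumption, not prescribed by this guide; the DISTRIBUTE BY clauses follow the usual Aster Database convention for fact tables):

-- Assign a surrogate INTEGER id to each VARCHAR vertex id.
CREATE TABLE int_nodes DISTRIBUTE BY HASH(id) AS
SELECT ROW_NUMBER() OVER (ORDER BY id) AS int_id, id
FROM string_nodes;

-- Rewrite the edges table in terms of the surrogate ids.
CREATE TABLE int_edges DISTRIBUTE BY HASH(src_id) AS
SELECT s.int_id AS src_id, t.int_id AS dest_id, e.weight
FROM string_edges e
JOIN int_nodes s ON e.src_id = s.id
JOIN int_nodes t ON e.dest_id = t.id;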
Troubleshooting
• Problem: The function runs slowly for large graphs. The function continues to execute while spending
hours in a graph iteration or terminates unsuccessfully, displaying a failure message on the console.
• Workarounds:
∘ Consult the logs for error message details and troubleshooting.
∘ The logs can help determine the time taken for each iteration. If you know the number of iterations
that the function takes (usually 25-50), you can estimate the total execution time.
∘ All graph functions operate much faster on INTEGER and BIGINT single-column vertex ids. To
generate an INTEGER-based vertex graph, refer to the preceding Tips section.
∘ Compute the modularity in incremental steps by choosing a subset of nodes using the Sources
argument.
∘ If you already know some of the groupings in the graph, then specify them with the
CommunityAssociation argument.
∘ If you specified the CommunityEdgeTable argument, the reason for the slow execution might be that
the function is writing the community table to the database through JDBC. Run the function without
the CommunityEdgeTable argument first, to obtain the modularity of the resultant graph.
• Problem: The function does not accept the edges table. The function terminates with errors on the
vertices table or edges table.
• Workarounds:
∘ Consult the logs for error message details and troubleshooting.
∘ Ensure that all source and target vertices in the edges table are listed in the vertices table. Ensure that
the columns representing source and target vertices are not NULL.
• Problem: The function completes successfully; however, the results do not show a good modularity score
or community detection.
• Workarounds:
∘ Change the value of seed.
∘ Multiply the edge weights by a constant.
∘ Change the values in the Resolution argument.
∘ If you already know some of the groupings in the graph, then specify them with the
CommunityAssociation argument.
Note:
The modularity score might be poor only because the graph has no inherent community structure.
nTree
Summary
The nTree function is a hierarchical analysis SQL-MapReduce function that can build and traverse tree
structures on all worker machines. The function reads the data only once from the disk and creates the trees
in memory.
The input data must be partitionable, and each partition must fit in memory. Each partition can consist of
multiple trees of any size. The function has different ways of handling cycles.
Background
Two use cases for nTree are equity trading and social networking.
Equity Trading
A large stock buy or sell order with multiple counterparties is typically divided into child orders, which can
be further divided. An order to sell a specific stock can cause a cascade of transactions, each of which
descends from the original order. For example, an order to sell 100 shares of a specific stock can trigger
orders to sell 70 and 30 shares of that stock, and the order to sell 70 shares can trigger orders to sell 50 and 20
shares, and so on.
All stock transactions are stored in a single table. Each row represents one transaction, which is identified by
its order_id and linked to its parent by its parent_id.
A stock broker must be able to identify the root order for each transaction. To do so with SQL requires an
unknown number of self-joins, but with Aster Database SQL-MapReduce, you can partition the data by
stock symbol and then by date and use the nTree function to create a tree from each root order.
Social Networking
Social networks use multiple data sources to identify people and their relationships. For example, a user-user
connection graph shows connections that users have created on the network, a user-person invitation graph
shows a mixture of user-user connections and user-email connections, and address book data provides a
user-email graph.
Suppose that you are the administrator of a social network, and you want to know who has multiple
accounts on the network. You can use the nTree function to generate a tree for every account and then
compare these trees to find those that are very likely to have the same person as the root node.
nTree Syntax
Version 1.1
Arguments
Argument Category Description
Root_Node Required Specifies the BOOLEAN SQL expression that defines the root nodes of
the trees (for example, parent_id IS NULL).
Node_ID Required Specifies the SQL expression whose value uniquely identifies a node in
the input table (for example, order_id).
Note:
A node can appear multiple times in the data set, with different
parents.
Parent_ID Required Specifies the SQL expression whose value identifies the parent node.
Allow_Cycles Required Specifies whether trees can contain cycles. If not, a cycle in the data set
causes the function to throw an exception. The default value is 'false'. For
information about cycles, refer to Cycles in nTree.
Starts_With Required Specifies the node from which to start tree traversal—must be 'root',
'leaf', or a SQL expression that identifies a node.
Mode Required Specifies the direction of tree traversal from the start node—up to the
root node or down to the leaf nodes.
Output Required Specifies when to output a tuple—at every node along the traversal path
('all') or only at the end of the traversal path ('end'). The default value is
'end'.
Max_Distance Optional Specifies the maximum tree depth. The default value is 5.
Note:
The function ignores alias if it is the same as an input table column
name.
For the path from the Starts_With node to the last traversed node, the
operations do the following:
• PATH
Outputs the value of expression for each node, separating values
with '->'.
• SUM
Computes the value of expression for each node and outputs the sum
of these values.
• LEVEL
Outputs the number of hops.
• MAX
Computes the value of expression for each node and outputs the
highest of these values.
• MIN
Computes the value of expression for each node and outputs the
lowest of these values.
• IS_CYCLE
Outputs the cycle (if any).
• AVG
Computes the value of expression for each node and outputs the
average of these values.
• PROPAGATE
Evaluates expression with the value of the Starts_With node and
propagates the result to every node.
If you specify Mode('down') in the preceding query, it outputs the following table.
Table 1044: Cycle in nTree with Mode ('down')
If traversing very deep trees causes a stack overflow, increase the JVM stack size; for example:
export JAVA_TOOL_OPTIONS='-Xss24M'
Input
The nTree function has one required input table, which contains the data from which to build trees. The
only restriction on its schema is that the data must be partitionable and each partition must fit in memory.
Output
The nTree function outputs a table of data collected and computed by traversing the trees that it builds.
Examples
• Example 1: Find an Employee’s Reports
• Example 2: Find an Employee’s Management Chain
• Example 3: Show Reporting Structure by Department
Input
The input table contains the data to build a tree of employees. Each row represents one employee,
identifying both the employee and his or her manager by identifier and name. The employee with no
manager, Don, becomes the root of the tree. The other employees become children of their managers.
Table 1046: nTree Examples 1 and 2 Input Table employee_table
Example 1: Find an Employee’s Reports
SQL-MapReduce Call
This call finds the employees who report to employee 100 (either directly or indirectly) by traversing the tree
of employees from employee 100 downward.
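A representative call (the column names id, manager_id, and name, the Result argument name, the PARTITION BY clause, and the form of the Starts_With expression are assumptions):

SELECT * FROM nTree(
    ON employee_table PARTITION BY 1
    Root_Node(manager_id IS NULL)
    Node_ID(id)
    Parent_ID(manager_id)
    Starts_With('id = 100')
    Mode('down')
    Output('end')
    Result(PATH(name) AS path)
);
Output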
id path
300 Don->Donna
500 Don->Pat->Kim->Fred
Example 2: Find an Employee’s Management Chain
Input
The input table is the same as for Example 1.
SQL-MapReduce Call
This call finds the management chain of employee 500 by traversing the tree of employees from employee
500 upward.
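A representative call (the same assumed column names as in Example 1):

SELECT * FROM nTree(
    ON employee_table PARTITION BY 1
    Root_Node(manager_id IS NULL)
    Node_ID(id)
    Parent_ID(manager_id)
    Starts_With('id = 500')
    Mode('up')
    Output('end')
    Result(PATH(name) AS path)
);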
Output
The output table shows that employee 500, Fred, reports to Kim, who reports to Pat, who reports to Don.
Table 1048: nTree Example 2 Output Table
id path
100 Fred->Kim->Pat->Don
Example 3: Show Reporting Structure by Department
Input
The input table contains the data to build trees of departments, and is partitioned by department. In each
partition, each row represents one employee in the department, identifying the employee by identifier and
name.
SQL-MapReduce Call
Output
The output table shows two department trees, whose roots are Dave and Peter.
id path path2
1 Dave 1
3 Dave->Donna 1->3
7 Dave->Donna->Richard 1->3->7
2 Dave->Kim 1->2
5 Dave->Kim->Fran 1->2->5
6 Dave->Kim->Mark 1->2->6
4 Dave->Rob 1->4
9 Dave->Rob->Don 1->4->9
8 Dave->Rob->Pat 1->4->8
10 Peter 10
12 Peter->Dale 10->12
16 Peter->Dale->Gary 10->12->16
15 Peter->Dale->Jessy 10->12->15
14 Peter->Dale->Jessy->Sophia 10->12->15->14
13 Peter->John 10->13
17 Peter->John->Elizabeth 10->13->17
18 Peter->John->Richard 10->13->18
11 Peter->Sarah 10->11
PageRank
Summary
The PageRank function computes the PageRank values for a directed graph, weighted or unweighted.
Usage
PageRank Syntax
Version 1.1
Arguments
Argument Category Description
TargetKey Required Specifies the target key columns in the edges table.
EdgeWeight Optional Specifies the column in the edges table that contains the edge weight,
which must be a positive value. By default, all edges have the same weight
(that is, the graph is unweighted).
DampFactor Optional Specifies the value to use in the PageRank formula. The damp_factor
must be a DOUBLE PRECISION value between 0 and 1. The default
value is 0.85.
MaxIterNum Optional Specifies the maximum number of iterations for which the algorithm
runs before the function completes. The max_iterations must be a
positive INTEGER value. The default value is 20.
Note:
max_iterations is the number of SQL-GR™ iterations (that is, the
algorithm iterations shown in AMC minus 3).
Threshold Optional Specifies the convergence criteria value. The threshold must be a
DOUBLE PRECISION value. The default value is 0.0001.
Accumulate Optional Specifies the vertices table columns to copy to the output table.
Input
The PageRank function requires two input tables, vertices and edges.
The vertices table must contain the unique identifier (vertex key attributes) of each vertex. The unique
identifier of a vertex can consist of multiple columns. The vertices table can also have columns that are not
vertex key attributes. If these additional columns are specified by the Accumulate argument, then the
function copies them to the output table; otherwise, it ignores them.
Table 1051: PageRank Vertices Table
The edges table must contain columns for the source and target vertices of each edge. The source and target
vertex columns must have the same schema as the vertex key columns.
Table 1052: PageRank Edges Table
Output
PageRank outputs each vertex’s pagerank, a double value. In addition, the function outputs the accumulated
columns from the vertices table.
Table 1053: PageRank Output Table
Example
Input
The graph in the preceding figure is represented by the vertices and edges tables callers and calls,
respectively.
Table 1054: PageRank Examples Vertices Table callers
callerid callername
1 John
2 Carla
3 Simon
4 Celine
5 Winston
6 Diana
SQL-MapReduce Call
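A representative call (the calls table column names callerfrom and callerto are assumptions):

SELECT * FROM PageRank(
    ON callers AS vertices PARTITION BY callerid
    ON calls AS edges PARTITION BY callerfrom
    TargetKey('callerto')
    Accumulate('callerid', 'callername')
) ORDER BY callerid;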
Output
Table 1056: PageRank Example Output Table
pSALSA
Summary
The pSALSA (personalized SALSA) function is a SQL-GR™ function that evaluates the similarity of nodes in
a bipartite graph according to their proximity. The typical application of pSALSA is for recommendation.
The pSALSA function assigns numerical scores (between 0 and 1) to the vertices on both sides of the
bipartite graph for each seed (hub) vertex. Also, for each seed vertex, the function outputs the K hub/authority
vertices with the highest scores.
The score that pSALSA assigns to a node is the probability that a random walk with restart from a seed node
visits that node.
Background
SALSA
Stochastic Approach for Link-Structure Analysis (SALSA) is a link analysis algorithm originally developed for
evaluating the importance of web pages (similar to the PageRank algorithm).
However, unlike PageRank, in SALSA, a collection of web pages is transformed into a bipartite graph:
G = (V, X, E)
where
{v ∈ V} are hub vertices located on the left side of the graph.
{x ∈ X} are authority vertices located on the right side of the graph.
{(v, x) ∈ E} are edges linking the hub vertices and authority vertices.
The following figure shows an example of transforming (a) a collection of web pages into (b) a bipartite
graph G.
For each hub vertex v in the graph, SALSA computes a hub score (hv), and with each authority vertex x, an
authority score (ax) is associated. The hub/authority score is defined by analyzing a random walk on the
bipartite graph, wherein the steps from a hub vertex to an authority vertex (from the left side of the bipartite
graph to the right side) are called forward steps, and the steps from an authority vertex to a hub vertex are
called backward steps.
The hub score (hv) is defined as the probability of visiting the hub vertex v, and the authority score (ax) as
the probability of visiting the authority vertex x, in a random walk on the bipartite graph.
The hub score (hv) and authority score (ax) can be computed by applying the following update rules until
convergence:
pSALSA
You can use the personalized version of the SALSA algorithm to evaluate the similarity between vertices on
each side. Unlike the standard SALSA algorithm, which generates a global score for each vertex, in the
personalized SALSA algorithm, for each hub vertex vi, there is a set of hub scores, hvj (i), and a set of
authority scores, axk (i).
A higher hub score indicates that vertex vj shares more connections with (or is closer to) vi. A higher
authority score indicates that vertex xk is more important in building the closeness relationship with vi.
The updating rule for the hub/authority score personalized on vj is as follows:
This allows random jumps, with probability ε, to the seed vertex at forward steps.
Usage
pSALSA Syntax
Version 1.1
Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a
vertex of the graph, and each row of the edges table represents an edge of the graph.
The function has two optional input tables, sources and targets, which specify the vertices that are sources
and targets, respectively. For a directed graph, these tables are required. By default, all vertices are sources
and targets; that is, the graph is undirected.
Table 1057: pSALSA Vertices Table Schema
Output
Table 1061: pSALSA Output Table Schema
Examples
• Example 1: User Similarity in a Social Network Without Edge Weight
• Example 2: User Similarity in a Social Network with Edge Weight
• Example 3: User Similarity and Product Recommendation
• Example 4: Using the Sources and Targets Tables as Inputs
Examples 1 and 2 analyze a social network of users (for example, in an application such as Twitter) as shown
in the following figure and their relationships as followers and leaders based on the 'likes' each user gets from
others.
Figure 27: pSALSA Example Diagram (Network Of Users)
The pSALSA algorithm assigns scores to both sides. In the output table (Output), the hub_followers column
shows similar users for each follower based on the hub_score. Likewise based on the authority_score, leaders
who are close to followers are output in the authority_leader column. A higher score indicates greater
similarity. Typically the authority_score is interpreted as user recommendations (in this case the closer
leader) and the hub_score is interpreted as user similarity.
Note:
Because the function is stochastic, the output can vary with each run.
Example 1: User Similarity in a Social Network Without Edge Weight
Input
The input consists of two tables: a user vertex table with 6 users (users_vertex) and a user edges table
(users_edges), showing relationships between users.
userid username
1 John
2 Carla
3 Simon
4 Celine
5 Winston
6 Diana
SQL-MapReduce Call
This example uses the arguments MaxHubNum and MaxAuthorityNum to output a maximum of two hub
and two authority users.
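A representative call (the users_edges column names and the ON clause forms are assumptions; MaxHubNum and MaxAuthorityNum are the arguments named above):

SELECT * FROM pSALSA(
    ON users_vertex AS vertices PARTITION BY userid
    ON users_edges AS edges PARTITION BY followerid
    TargetKey('leaderid')
    MaxHubNum('2')
    MaxAuthorityNum('2')
);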
Example 2: User Similarity in a Social Network with Edge Weight
Input
The input consists of two tables: a user vertex table with six users (Input from Example 1) and a user edges
table (Input from Example 1), showing relationships between users.
The ‘likes’ column is used as the edge weight in this example.
Output
The output shows that John and Winston are similar to Carla. John is more similar, as he has a higher
hub_score. The output varies with every run.
Table 1065: pSALSA Example 2 Output Table
Example 3: User Similarity and Product Recommendation
Input
While CFilter is good for stable product lines, pSALSA is very powerful for recommending products to
similar users when the product line has limited pairwise history or changes frequently. Consider clothing
retailers for women's apparel that change their stock based on seasons and trends. The input vertices table
(user_product_nodes) gives the list of women users and the products they buy.
Table 1066: pSALSA Example 3 Input Table user_product_nodes
nodeid nodename
1 Sandra
2 Susan
3 Stacie
4 Stephanie
5 Sally
6 coats
7 sweaters
8 jackets
9 blazers
10 pants
11 pajamas
The edges table (women_apparel_log) reflects the shopping pattern of the users. The goal is to find users
who are similar and thus determine product recommendations.
Table 1067: pSALSA Example 3 Input Table women_apparel_log
SQL-MapReduce Call
Output a maximum of two similar users (hub) and recommend two products (authority) for each user. Use
frequency of purchase as a weight factor.
Output
The output shows possible recommendations, based on hub_score and authority_score. For example, the
seller might recommend pajamas to Sandra and Susan because they and Sally have similar scores. The
hub_score and authority_score values vary with every run.
Table 1068: pSALSA Example 3 Output Table
Example 4: Using the Sources and Targets Tables as Inputs
Input
This example uses the same input as Example 3 and shows how to limit the vertices and edges used. This can
be useful if the original vertices and edges tables are very large, but only a subset of the information is of
interest. The hubs and authorities are calculated for the nodes specified in the sources and targets tables
(user_source_nodes and product_target_nodes).
Table 1069: pSALSA Example 4 Input Table user_source_nodes
nodeid username
1 Sandra
2 Susan
Table 1070: pSALSA Example 4 Input Table product_target_nodes
nodeid product
8 jackets
11 pajamas
SQL-MapReduce Call
Output
Pajamas are recommended to Sandra, as she has no purchase history for pajamas. No recommendations are
made for Susan, because she has bought all of the items in the past (refer to the Input section of Example 3).
The *_score results vary with every run.
Table 1071: pSALSA Example 4 Output Table
RandomWalkSample
Summary
The RandomWalkSample function takes an input graph (which is typically large) and outputs a sample
graph.
Note:
The sample graph is not deterministic. The function can produce different results when run on different
clusters.
Background
Graph sampling is a process of inducing a subset graph of the original graph, while preserving the original
graph properties. In some cases, if the subset graph is a good representation of the original, substituting it for
the original graph in analytic functions greatly decreases execution time without significantly decreasing
accuracy.
Random walk sampling is a graph sampling technique that randomly selects a starting vertex and then either
explores a neighboring vertex or returns ("flies back") to the starting vertex. If the sampling process reaches a
sink vertex (an isolated component or a loop), it randomly selects another vertex and continues until it
reaches the desired sample size (the desired number of vertices).
Note:
For more information about sampling from large graphs, see: http://cs.stanford.edu/people/jure/pubs/
sampling-kdd06.pdf
Usage
RandomWalkSample Syntax
Version 1.2
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
TargetKey Required The names of the columns in the edges table that identify the target
vertex of an edge. This set of columns must have the same schema as the
vertex_attributes and source_vertex_attributes.
SampleRate Optional The sampling rate. This value must be in the range (0, 1.0). The default
value is 0.15 (15%).
Input
The RandomWalkSample function has two required input tables:
• vertices, which defines the set of vertices in the input graph
• edges, which defines the set of edges in the input graph
Neither input table can contain NULL values; otherwise, the function displays an error message.
Output
The RandomWalkSample function has three output tables: the summary table (which is usually displayed on
the screen) and the vertex and edges tables (which are saved to the database).
The summary table displays statistics:
Table 1072: RandomWalkSample Summary Table
name count
vertices (number of vertices in original graph)
edges (number of edges in original graph)
sampled vertices (number of sampled vertices)
sampled edges (number of sampled edges)
The vertex and edges tables (whose names are specified in the OutputTables argument) have the same
schemas as the input tables, vertices and edges. However, the output table column names are different from
the input table column names.
If the input table vertices has only one vertex attribute, then the output table vertex has only one column,
named id. If vertices has n vertex attributes, then vertex has n columns, named id_1, ..., id_n.
If the input table edges has only one source vertex attribute, then the output table edges has two columns,
named source and target. If the input table edges has n source vertex attributes, then the output table edges
has n*2 columns, named source_1, ..., source_n, target_1, ..., target_n
Input
The input table is a set of 34,546 vertices or nodes.
Table 1073: RandomWalkSample Example Input Table citvertices
id
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1013
1014
1015
1016
1017
1018
1019
1020
...
The input table (citedges) is a set of 421,578 node pairs that specify edges between a source (column
from_id) and a target (column to_id). This example uses a 15% sampling rate on this large collection of
vertices and edges.
Table 1074: RandomWalkSample Example Input Table citedges
from_id to_id
1001 9212308
1001 9305239
from_id to_id
1001 9306240
1001 9312276
1001 9312333
1001 9401294
1001 9403226
1001 9409265
1001 9511336
1001 9601359
1001 9602280
1001 9610553
1001 9701390
1001 9702424
1001 9708239
1001 9709423
1001 9710255
... ....
SQL-GR™ Call
Specifying the Seed value guarantees that the result is repeatable on the same cluster, yet it can differ
between clusters as the sample graph is not deterministic.
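A representative call (the OutputTables names and the Seed value are illustrative assumptions):

SELECT * FROM RandomWalkSample(
    ON citvertices AS vertices PARTITION BY id
    ON citedges AS edges PARTITION BY from_id
    TargetKey('to_id')
    SampleRate('0.15')
    OutputTables('rw_vertices', 'rw_edges')
    Seed('1000')
);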
Output
Table 1075: RandomWalkSample Example Output Summary
name count
vertices 34546
edges 421578
name count
sampled vertices 5181
sampled edges 33650
id
1005
1013
1042
1046
1051
1073
1075
1086
1092
1113
1121
1123
1162
1166
...
source target
4043 1005
111294 1005
10244 1005
11359 1005
7113 1005
102181 1005
211064 1005
103074 1005
102277 1005
104307 1013
source target
210311 1013
109216 1042
110097 1046
211046 1046
Neural Networks
• Introduction to Neural Networks
• NeuralNet
• NeuralNetPredict
In the preceding figure, the weights are shown as w_ijk, where i is the network layer of the origin node, j is the
number of the origin node in layer i, and k is the number of the destination node in layer i+1. The input to
the nodes in Layer 2 is:
where g(x) is the activation function. The output of the network shown in the preceding figure is:
A neural network is a supervised learning model. In the Teradata Aster implementation, the weights applied
to each input are trained based on a training dataset using a backpropagation algorithm. The initial weights
can be supplied by the user; if none are supplied, a random set of weights is used. For more information
about the backpropagation algorithm, see https://en.wikipedia.org/wiki/Backpropagation.
NeuralNet
Summary
The NeuralNet function uses backpropagation to train neural networks. You must provide input data and
argument settings for training the networks; the function creates the fitted weights of the neural network.
The NeuralNet function is optimized for performance on very large datasets (millions of rows).
Usage
NeuralNet Syntax
Version 1.0
Arguments
Argument Category Description
InputTable Required Specifies the table containing the input data to be trained.
OutputTable Required Specifies the table to output the trained network weight data to.
WeightTable Optional Specifies the table that lists the starting values for the neural network
weights. If you do not specify a weight table, the function assigns the initial
weights for the neural network randomly.
InputColumns Required Specifies the names of the InputTable columns that contain the
numerical predictor variables x1, x2, x3, and so on.
ResponseColumns Required Specifies the names of the InputTable columns that contain the
numerical dependent variables y1, y2, y3, and so on.
GroupByColumns Optional Specifies the weight table columns to use to output different neural
networks for different groups.
HiddenLayers Optional Specifies the number of hidden neurons in each layer, from left to right, by
list of integers. Default value is 1 layer, 1 neuron. For example,
HiddenLayers('5','5') would produce a 3-layer network with 5 neurons in
each hidden layer, while HiddenLayers('3') would produce the network
shown in Introduction to Neural Networks.
Threshold Optional Specifies the threshold for the partial derivatives of the error function as
stopping criteria. Default value is 0.01.
MaxIterNum Optional Specifies the maximum number of steps for the training of the neural
network. Default value is 1.
LearningRate Optional Specifies the learning rate used by traditional backpropagation. Default
value is 0.001.
ActivationFunction Optional Specifies the name of the differentiable function that is applied to the
result of the cross-product of the neurons and the weights. Available
choices are ‘logistic’ (default) and hyperbolic tangent (‘tanh’).
ErrorFunction Optional Specifies the name of the differentiable function that is used for the
calculation of the error. Available choices are ‘sse’ (sum of squared errors,
the default) and cross-entropy (‘ce’).
Algorithms Optional This string contains the algorithm type that is used to calculate the neural
network. Currently, only ‘backprop’ is supported.
LinearOutput Optional Specifies whether the ActivationFunction is not to be applied to the output
neurons.
OverwriteOutput Optional This logical value defines whether (TRUE) or not (FALSE) to overwrite
the output table.
Input
The NeuralNet function has a required input table and an optional weight table.
Table 1078: NeuralNetwork Input Table Schema
Weights Table
The Weights Table is an optional table. If the table is not provided, the initial weights for the neural network
are randomly assigned. The schema for the weights table is shown in the following table. Grouping columns
(groupcoln) are optional.
Table 1080: NeuralNetwork Weights Table
Output
The NeuralNet function displays a message after it finishes.
Table 1081: NeuralNet Output Table Schema
Property Value
Reached Threshold The error threshold reached when training stops.
Steps Max iterations reached or convergence reached.
The query below displays the OutputTable containing the fitted weights:
This output table has the same schema as the weight table and has a row for each neural network with one
column for each weight. As shown in the following table, the weight columns are labeled 0, 1, 2, etc., and are
ordered by layer, origin node, and destination node. For example, the weight in column 0 corresponds to
w101 in Introduction to Neural Networks, column 1 corresponds to w102, and so on, up to column 12
corresponding to w231.
Example
Input
The input table, breast_cancer_data, is assessment data from biopsies of 699 breast tumors. Each tumor is
rated on 9 predictor variables, and is classified as either benign (class = 2) or malignant (class = 4). The nine
attributes (clumpthickness, uniformityofcellsize, uniformityofcellshape, marginaladhesion,
singleepithelialcell, barenuclei, blandchromatin, normalnucleoli, mitoses) are scored on a scale of 1 to 10.
Table 1083: NeuralNet Example Input Table breast_cancer_data (Columns 1-5)
SQL-MapReduce Call
To ensure that the model has good prediction accuracy, the function must converge at a low threshold value.
If the function does not converge, the model prediction is not reliable. The function typically requires several
thousand iterations to converge at a low threshold, a process that may take several hours.
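A representative call (the dummy ON clause and the MaxIterNum value are assumptions; the nine predictor columns are those listed above):

SELECT * FROM NeuralNet(
    ON (SELECT 1) PARTITION BY 1
    InputTable('breast_cancer_data')
    OutputTable('cancer_output')
    InputColumns('clumpthickness', 'uniformityofcellsize',
        'uniformityofcellshape', 'marginaladhesion',
        'singleepithelialcell', 'barenuclei', 'blandchromatin',
        'normalnucleoli', 'mitoses')
    ResponseColumns('class')
    HiddenLayers('3')
    MaxIterNum('5000')
    OverwriteOutput('true')
);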
Output
Table 1089: NeuralNet Example Output Table
message
Run 1591 Iterations.
neural net converge
Reach Threshold: 0.9993143624713949.
NeuralNetPredict
Summary
The NeuralNetPredict function predicts the output for specific arbitrary covariate inputs, using a particular
trained neural network output weight table.
Usage
NeuralNetPredict Syntax
Version 1.0
Arguments
Argument Category Description
InputColumns Required Specifies the names of the columns of testdata that contain the numerical
input predictor variables x1, x2, x3, etc.
GroupByColumns Optional Specifies the columns that are used to output different neural networks for
different groups (must be in WeightsTable, if specified).
HiddenLayers Required Specifies the number of hidden neurons in each layer, from left to right, by
list of integers. Default value is 1 layer, 1 neuron.
ActivationFunction Optional Specifies the name of the differentiable function that is applied to the
result of the cross-product of the neurons and the weights. Available
choices are ‘logistic’ (default) and hyperbolic tangent (‘tanh’).
LinearOutput Optional Specifies whether the ActivationFunction is not to be applied to the output
neurons. The default value is 'true'.
NumOutputs Optional Specifies the number of outputs from the neural net. Default value is 1.
Maximum value is 1000.
Accumulate Optional Specifies the names of the columns in the input_table that the function
copies to linear_predictor_table.
Input
Table 1090: NeuralNetPredict Input Table schema
Output
The NeuralNetPredict function output schema is shown in the following table.
Table 1091: NeuralNetPredict Output Table schema
Example
Input
This example uses the model created by the NeuralNet function (cancer_output) and the test dataset as input
to the NeuralNetPredict function.
SQL-MapReduce Call
The NeuralNetPredict call must use the same argument values that were used in the NeuralNet call that
created the model for the following arguments: InputColumns, HiddenLayers, WeightTable, NumOutputs,
and GroupByColumns. In this example, only InputColumns and HiddenLayers are applicable.
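A representative call (the test table name breast_cancer_test and the ON clause forms are assumptions; InputColumns and HiddenLayers match the NeuralNet call above):

SELECT * FROM NeuralNetPredict(
    ON breast_cancer_test PARTITION BY ANY
    ON cancer_output AS weights DIMENSION
    InputColumns('clumpthickness', 'uniformityofcellsize',
        'uniformityofcellshape', 'marginaladhesion',
        'singleepithelialcell', 'barenuclei', 'blandchromatin',
        'normalnucleoli', 'mitoses')
    HiddenLayers('3')
    NumOutputs('1')
    Accumulate('class')
);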
Output
Table 1092: NeuralNetPredict Example Output Table (Columns 1-4)
rmse
0.27583997522107218006
Data Transformation
• Antiselect
• Apache_Log_Parser
• Categorize
• Fellegi-Sunter Functions
• Geometry Functions
• IdentityMatch
• IPGeo
• JSONParser
• Multi_Case
• MurmurHash
• OutlierFilter
• Pack
• Pivot
• PSTParserAFS
• Scale Functions
• StringSimilarity
• Unpack
• Unpivot
• URIPack
• URIUnpack
• XMLParser
• XMLRelation
Antiselect
Summary
Antiselect returns all columns except those specified in the Exclude argument.
Antiselect Syntax
Version 1.0
Arguments
Argument Category Description
Exclude Required Specifies the names of the columns not to return.
Input
The input table can have any schema.
Output
The output table has all input table columns except those specified by the Exclude argument.
Example
Input
The input table, antiselect_input, is a sample set of sales data containing 13 columns.
Table 1096: Antiselect Example Input Table antiselect_input (Columns 1-8)
SQL-MapReduce Call
This query excludes the columns rowid, orderdate, discount, province and custsegment:
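The call has this shape (using the documented Exclude argument):

SELECT * FROM Antiselect(
    ON antiselect_input
    Exclude('rowid', 'orderdate', 'discount', 'province', 'custsegment')
);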
Output
The output table excludes the specified columns and outputs the remaining eight columns.
Table 1098: Antiselect Example Output Table
Apache_Log_Parser
Summary
The Apache_Log_Parser function parses Apache log file content from a given server access log and outputs
specified columns of information, which can include search engines and search terms.
Background
The function parses Apache log files with the following constraints:
• Log files are loaded into a table
• One line of the Apache log file is loaded to one row in the table
• If you specify a custom log format, then the input must conform to that format.
• If you do not specify a custom format string with the Log_Format argument, the function parses the logs
with the NCSA extended/combined format, which is defined as:
#
# The following directives define some format nicknames for use with
# a CustomLog directive.
#
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
Usage
Apache_Log_Parser Syntax
Version 2.2
Arguments
Argument Category Description
ExcludeFiles Optional Specifies the types of files to exclude, by suffix. The default value is '.png',
'.xml', '.js'. If an input row contains a file of an excluded type, then the
function does not generate an output file row for that input row.
SearchInfoFlag Optional Specifies whether to return search information. The default value is
'false'. If you specify 'true', the function extracts the search engine and
search terms (if they exist) into two output columns. The supported
search engines are Google, Bing, and Yahoo. The function provides more
complete parsing capabilities for Google.
Input
The input table must have a column that contains the information to be parsed. The table can have
additional columns, but the function ignores them.
Table 1100: Apache_Log_Parser Input Table Schema
Output
The output table has one row for each input row that the function parses, except those that contain
requested files of excluded types. Its schema depends on the log format and function arguments. The
following table is a possible output table schema.
Table 1101: Apache_Log_Parser Output Table Schema
The possible output column names are listed in Apache Log Parser Item-Name Mapping and the following
table. The following table describes the output table columns that appear only if the SearchInfoFlag
argument has the value 'true' and the log file contains referrer information.
Table 1102: Apache_Log_Parser Output Table Columns extracted when RETURN_SEARCH_INFO = 'true'
Examples
The examples in this section show the results of two different values of the LogFormat argument.
Input
The input table for both examples, apache_logs, has a sample of five records of apache web user logs.
Table 1103: Apache_Log_Parser Example Input Table apache_logs
id logdata
1 69.236.77.51 - Frank [26/Mar/2011:09:17:31 -0700] "GET /about/careers.php HTTP/1.1" 200
5976 "http://www.bing.com/search?q=Aster+data&src=ie9tr" "Mozilla/5.0 (compatible; MSIE
9.0; Windows NT 6.1; Trident/5.0)"
2 168.187.7.114 - Lewis [27/Mar/2011:00:16:49 -0700] "GET / HTTP/1.0" 200 7203 "http://
search.yahoo.com/search;_ylt=AtMGk4Fg.FlhWyX_ro.u0VybvZx4?
p=ASTER&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-383-1" "Mozilla/4.0 (compatible; MSIE
8.0; Windows NT 6.1; Trident/4.0; SLCC2;.NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET
CLR 3.0.30729; Media Center PC 6.0;InfoPath.2)"
3 75.36.209.106 - Patrick [20/May/2008:15:43:57 -0400] "GET / HTTP/1.1" 200 15251 "http://
www.google.com/search?hl=en&q=%22Aster+Data+Systems%22" "Mozilla/4.0 (compatible;
MSIE 6.0; Windows NT 5.1; SV1; YPC 3.2.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; MS-
RTC LM 8)"
4 159.41.1.23 - - [06/Jul/2010:07:19:45 -0400] "GET /public/js/common.js HTTP/1.1" 200 16711
"http://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=aster
%20data&rsv_pq=d31bd31c000dd71c&rsv_t=982dONZ4XBYXizw4wA
%2BQD411WcEyn1YoJu4QSpNTQwwoTE7hgPFD9OBTObk&rsv_enter=1&rsv_sug3=11&r
sv_sug1=1&rsv_sug2=0&rsv_sug7=100&inputT=3572&rsv_sug4=6596" "Mozilla/5.0
(Windows; U; Windows NT 5.1; it; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"
5 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://
www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
SQL-MapReduce Call
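A sketch of a call that parses the NCSA extended/combined format and returns search information; the name of the argument that identifies the column to parse (Log_Column) is an assumption, and the other arguments follow the descriptions above:

SELECT * FROM Apache_Log_Parser (
    ON apache_logs
    Log_Column ('logdata')  -- assumed argument name
    Log_Format ('%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"')
    SearchInfoFlag ('true')
);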
Output
The output has 11 columns. One row is output for each input row except input id=4, because .js pages are
omitted by default. This behavior can be controlled by the ExcludeFiles argument
(refer to Arguments). Also, the first row in the output, which corresponds to input id=5, is empty in the
search_engine and search_term columns. This is because the referrer for that input row, “http://
www.example.com/start.html”, is not a search engine. The only supported search engines are Google, Bing,
and Yahoo.
Table 1104: Apache_Log_Parser Example 1 Output Table (Columns 1-5)
SQL-MapReduce Call
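A sketch of the corresponding call with the common log format (Log_Column is again an assumed argument name):

SELECT * FROM Apache_Log_Parser (
    ON apache_logs
    Log_Column ('logdata')  -- assumed argument name
    Log_Format ('%h %l %u %t "%r" %>s %b')
);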
Output
The Referer, User-Agent name, search engine, and search terms are absent from the common log format output.
Categorize
Summary
The Categorize function converts specified columns from any numeric type to VARCHAR. This operation is
necessary when columns contain numbers that represent codes or categories (for example, billing codes or
zip codes) and you want to input them to another function as categorical data.
Note:
The Categorize function is similar to the R function factor.
Usage
Categorize Syntax
Version 1.0
Input
Table 1109: Categorize Input Table Schema
Output
The output table has the same column names, in the same order, as the input table.
Table 1110: Categorize Output Table Schema
Example
Input
The input table contains information about houses. All columns have numeric data types. The first three
columns represent numeric values, but the other columns represent codes or categories. The following two
tables are the input table itself and its schema description, respectively.
Table 1111: Categorize Example Input Table categorize_input
To check the input data variable types, use this query: \d categorize_input
The query returns:
Table "public"."categorize_input"
Column | Type | Modifiers
-------------------------------
sn | integer |
price | real |
lotsize | real |
driveway | integer |
recroom | integer |
fullbase | integer |
gashw | integer |
airco | integer |
prefarea | integer |
homestyle | integer |
Table Type:
fact
Distribution Key:
sn
Compression Level:
none
Storage Type:
row
Persistence:
permanent
SQL-MapReduce Call
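A minimal sketch of the call; the Columns argument name is an assumption, and the column list matches the schema change shown below (driveway remains INTEGER):

CREATE TABLE categorize_output DISTRIBUTE BY HASH(sn) AS
SELECT * FROM Categorize (
    ON categorize_input
    -- Columns is an assumed name for the argument that lists the columns to convert
    Columns ('recroom', 'fullbase', 'gashw', 'airco', 'prefarea', 'homestyle')
);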
Output
The query returns the output table, which looks like the input table but has a different schema. To check the
output data variable types, use this query: \d categorize_output
The query returns:
Table "public"."categorize_output"
Column | Type | Modifiers
-----------------------------------------
sn | integer |
price | real |
lotsize | real |
driveway | integer |
recroom | character varying |
fullbase | character varying |
gashw | character varying |
airco | character varying |
prefarea | character varying |
homestyle | character varying |
Table Type:
fact
Distribution Key:
sn
Compression Level:
none
Storage Type:
row
Persistence:
permanent
Fellegi-Sunter Functions
Summary
Aster Analytics Foundation provides two Fellegi-Sunter functions:
• FellegiSunterTrainer, which estimates the parameters of the Fellegi-Sunter model
• FellegiSunterPredict, which uses the trained model to predict whether a pair of objects are duplicates
Background
The Fellegi-Sunter model is a tool in the field of record linkage (RL), the task of finding records in a data set
that refer to the same entity across different data sources (for example, data files, websites, and databases).
The data sources might or might not have a common identifier (such as a database key, URI, or National
identification number). A data set that has undergone RL-oriented reconciliation is cross-linked.
RL was introduced by Halbert L. Dunn in 1946. In 1959, Howard Borden Newcombe laid the probabilistic
foundations of modern record linkage theory, which were formalized by Ivan Fellegi and Alan Sunter.
Fellegi and Sunter proved that the probabilistic decision rule that they described was optimal when the
comparison attributes were conditionally independent. Their article, "A Theory For Record Linkage,"
published in the Journal of the American Statistical Association in December 1969, remains the
mathematical foundation for many record linkage applications.
Since the late 1990s, various machine learning techniques have been developed that can estimate the
conditional probabilities required by the Fellegi-Sunter model. Although several researchers have reported
that the conditional independence assumption of the Fellegi-Sunter model is often violated in practice,
published efforts to explicitly model the conditional dependencies among the comparison attributes have
not improved record-linkage quality.
FellegiSunterTrainer
Summary
The FellegiSunterTrainer function estimates the parameters of the Fellegi-Sunter model.
The function can use either supervised or unsupervised learning. For supervised learning, specify the
TagColumn argument. For unsupervised learning, omit the TagColumn argument and specify the
arguments InitialM, InitialU, InitialP, and MaxIteration.
Usage
FellegiSunterTrainer Syntax
Version 1.1
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the object pairs and their
field-pair similarity values.
ComparisonFields Required Specifies the columns of input_table to use in the field-pair similarity in
the training process. If the value in the column is less than
threshold_value, then the field pair does not agree; otherwise, the field pair
agrees. The default value of threshold_value is 1.
TagColumn Optional If you specify this argument, then the function uses supervised learning; if
you omit it, then the function uses unsupervised learning.
This argument specifies the name of the column that indicates whether
two objects match. The column must contain only the values 'M'
(matched) and 'U' (unmatched).
InitialM Optional For unsupervised learning, this argument specifies the initial value of m,
which is the probability that a field agrees, given that the object-pair
belongs to the same object. The default value is 0.9.
For supervised learning, the function ignores this argument.
InitialU Optional For unsupervised learning, this argument specifies the initial value of u,
which is the probability that a field agrees, given that the object-pair
belongs to a different object. The default value is 0.1.
For supervised learning, the function ignores this argument.
InitialP Optional For unsupervised learning, this argument specifies the initial value of p,
which is the percentage of all possible object-pairs that contain the same
object. The default value is 0.1.
For supervised learning, the function ignores this argument.
MaxIteration Optional For unsupervised learning, this argument specifies the maximum number
of iterations. The default value is 100.
For supervised learning, the function ignores this argument.
Eta Optional For unsupervised learning, this argument specifies the tolerance of the
termination criterion. At the end of each iteration, the function computes
the difference between the current value of p and the value of p at the end
of the previous iteration. If the difference is less than eta_value, then the
function terminates. The default value is 1*10^-5.
Lambda Optional Specifies the Type I (false positive) error, which occurs if an unmatched
comparison is erroneously linked. The default value is 0.9.
Mu Optional Specifies the Type II (false negative) error, which occurs if a matched
comparison is erroneously not linked. The default value is 0.9.
Input
The FellegiSunterTrainer function has one required input table, which contains the object pairs, their field-
pair similarity values, and (for supervised learning) a tag column. The following table shows the schema of
the input table.
Table 1112: FellegiSunterTrainer input_table Schema
Note:
To create the input table for the FellegiSunterTrainer function, you can use the function StringSimilarity.
Output
The FellegiSunterTrainer function has one output table, which is a model table that is input to the function
FellegiSunterPredict. The following table shows the schema of the output table.
Table 1113: FellegiSunterTrainer Output (Model) Table Schema
Examples
Input
Both examples use the same input. The input table is generated from the output of the StringSimilarity
function by a SQL query that adds the match_tag column (which is used for the supervised
FellegiSunter function).
The input table compares the source column (src_txt1) with the reference column (tar_text) and gives
the different similarity scores based on the 'jaro', Levenshtein Distance (LD), ngram, and jaro-winkler metrics, as
described in StringSimilarity.
SQL-MapReduce Call
The unsupervised model is generated by specifying the user defined threshold values for the different
metrics in the ComparisonFields argument. The initialization parameters InitialP, InitialM, InitialU,
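A sketch of the unsupervised call; the OutputTable argument name and input table name are assumptions, and the thresholds match those recorded in the model below:

SELECT * FROM FellegiSunterTrainer (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('fstrainer_input')          -- assumed input table name
    ComparisonFields ('jaro1_sim: 0.8', 'ld1_sim: 0.8', 'ngram1_sim: 0.5', 'jw1_sim: 0.8')
    OutputTable ('fg_unsupervised_model')   -- assumed argument name
);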
Output
The unsupervised model “fg_unsupervised_model” is shown below. The time used to generate the model
may vary in each run.
This query returns the output shown in the following table.
_key _value
comparison_filed_cnt 4
comparison_filed_name_0 jaro1_sim
comparison_filed_name_1 ld1_sim
comparison_filed_name_2 ngram1_sim
comparison_filed_name_3 jw1_sim
comparison_filed_threshold_0 0.8
comparison_filed_threshold_1 0.8
comparison_filed_threshold_2 0.5
comparison_filed_threshold_3 0.8
is_supervised false
lambda 0.9
lower_bound -13.7991041364018
m_0 0.9999999
m_1 0.333315282254649
m_2 0.999945434344193
m_3 0.9999999
mu 0.9
p 0.250013539042196
time_used 271.020000 seconds
u_0 0.777773766137301
u_1 1.0E-7
u_2 1.37483178043644E-7
u_3 0.88888688306865
upper_bound 22.2091654088378
SQL-MapReduce Call
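A sketch of the supervised call, which adds the TagColumn argument (table and argument names as in the previous sketch):

SELECT * FROM FellegiSunterTrainer (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('fstrainer_input')          -- assumed input table name
    ComparisonFields ('jaro1_sim: 0.8', 'ld1_sim: 0.8', 'ngram1_sim: 0.5', 'jw1_sim: 0.8')
    TagColumn ('match_tag')
    OutputTable ('fg_supervised_model')     -- assumed argument name
);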
Output
This query returns the output shown in the following table:
_key _value
comparison_filed_cnt 4
comparison_filed_name_0 jaro1_sim
comparison_filed_name_1 ld1_sim
comparison_filed_name_2 ngram1_sim
comparison_filed_name_3 jw1_sim
comparison_filed_threshold_0 0.8
comparison_filed_threshold_1 0.8
comparison_filed_threshold_2 0.5
comparison_filed_threshold_3 0.8
is_supervised true
lambda 0.9
lower_bound -0.415037499278844
m_0 0.9999999
m_1 0.166666666666667
m_2 0.5
m_3 0.9999999
mu 0.9
time_used 35.413000 seconds
u_0 0.666666666666667
u_1 1.0E-7
u_2 1.0E-7
u_3 0.833333333333333
upper_bound -0.415037499278844
FellegiSunterPredict
Summary
The FellegiSunterPredict function predicts whether a pair of objects are duplicates.
Usage
FellegiSunterPredict Syntax
Version 1.1
Arguments
Argument Category Description
Accumulate Optional Specifies the names of input table columns to be copied to the output
table.
Input
The FellegiSunterPredict function has two required input tables:
• The table, view, or query ("input table") that contains the object pairs whose duplicity the function is to
predict
• The model table that the FellegiSunterTrainer function output
The input table must include the comparison_field_name_i columns in the model table.
Output
Table 1119: FellegiSunterPredict Output Table Schema
Examples
Input
Both examples use the same input table, which is generated from the output of the StringSimilarity function
by a SQL query.
The input table compares the source column (src_txt2) with the reference column (tar_text) and gives
the different similarity scores based on the 'jaro', Levenshtein Distance, ngram, and jaro-winkler metrics, as
described in StringSimilarity. The tar_text column is the same as the input to the FellegiSunterTrainer function,
while 'src_txt2' is the new column required for prediction.
SQL-MapReduce Call
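A sketch of the Example 1 call; the input table name, the accumulated columns, and the convention of passing the model table as a DIMENSION input are assumptions:

SELECT * FROM FellegiSunterPredict (
    ON fspredict_input PARTITION BY ANY        -- assumed input table name
    ON fg_unsupervised_model AS model DIMENSION
    Accumulate ('id', 'src_txt2', 'tar_text')  -- assumed column names
);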
Output
The model prediction is shown in the final column ('M' for match, 'U' for no match). The weight of the
object pair is shown in the 'weight' column.
Table 1122: FellegiSunterPredict Example 1 Output Table (Columns 1-4)
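SQL-MapReduce Call
A sketch of the Example 2 call, which uses the supervised model (same assumptions as in Example 1):

SELECT * FROM FellegiSunterPredict (
    ON fspredict_input PARTITION BY ANY
    ON fg_supervised_model AS model DIMENSION
    Accumulate ('id', 'src_txt2', 'tar_text')
);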
Output
The model prediction is shown in the final column ('M' for match, 'U' for no match). The weight of the
object pair is shown in the 'weight' column.
Table 1124: FellegiSunterPredict Example 2 Output Table (Columns 1-4)
In these examples, the predictions are the same for both supervised and unsupervised models.
Geometry Functions
Aster Analytics Foundation provides three geometry functions:
• GeometryLoader loads data from different providers into the database and converts the data to the
format used by the other geometry functions.
• PointInPolygon computes a list of binary values for every point and polygon combination, which indicate
whether the point is contained in the polygon.
• GeometryOverlay finds the result of overlaying two geometries.
Note:
PointInPolygon and GeometryOverlay work only on 2D spatial objects.
GeometryLoader
Summary
The GeometryLoader function fetches file-based geospatial data from AFS, parses it, and stores it in
Aster Database. The function only loads input formats from AFS and converts them to WKT or other
formats in the database.
Usage
GeometryLoader Syntax
Version 1.1
For the ON clause, create a table mr_driver once, with no rows. For example:
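A sketch, assuming Aster's DISTRIBUTE BY clause and a hypothetical single-column schema:

CREATE TABLE mr_driver (id INTEGER) DISTRIBUTE BY HASH(id);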
Arguments
Argument Category Description
Path Required The AFS directory or file name to fetch the geometry files from (for
example, /test or /test/testfile.xml or /test/*.xml).
Regular expressions are allowed in the argument.
Before calling this function, ensure that the directories and files that
you want to specify are available on the AFS system.
Note:
A zip file is treated as a directory.
Note:
Only WKT is supported by the PointInPolygon and
GeometryOverlay functions.
OutputAttributes Optional The output column names and types. The supported column types are
VARCHAR, INT, and DOUBLE PRECISION. The default column type
is VARCHAR.
Input
Table 1126: Geospatial File Formats That GeometryLoader Accepts
Output
Table 1127: GeometryLoader Output Table Schema
Note:
ZIP files are treated as directories.
Example
Input
The input files, stored in AFS, are sample ArcGIS shapefiles. You can get these files from http://
www.arcgis.com/home/item.html?id=b07a9393ecbd430795a6f6218443dccc.
• states.dbf
• states.prj
• states.sbn
• states.sbx
• states.shp
• states.shp.xml
• states.shx
Install these files into AFS (Aster File Server) as follows. From the beehive=> prompt:
The following command moves the files from your local directory to the AFS. This example assumes that the
files have been unzipped into a local directory /home/states.
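A sketch, assuming the \afs command accepts Hadoop-style file-system options:

\afs -mkdir /states
\afs -put /home/states/states.shp /states/
(repeat the -put command for each of the remaining states.* files)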
Confirm that all files have been uploaded. You should see the following list of files:
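A sketch of the verification command (same assumption about \afs options):

\afs -ls /states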
SQL-MapReduce Call
Note:
The STATE_NAME, SUB_REGION, and STATE_ABBR columns represent attributes defined in the
shapefiles.
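A sketch of the call; the OutputAttributes name/type format is an assumption, and the attribute names match those mentioned in the note:

SELECT * FROM GeometryLoader (
    ON mr_driver
    Path ('/states/states.shp')
    -- 'name type' format for OutputAttributes is an assumption
    OutputAttributes ('STATE_NAME varchar', 'SUB_REGION varchar', 'STATE_ABBR varchar')
);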
Output
Table 1128: GeometryLoader Output Table
PointInPolygon
Summary
The PointInPolygon function takes a list of location points and a list of polygons and returns a list of binary
values for every point and polygon combination, which indicates whether the point is contained in the
polygon.
Note:
The function works only on 2D spatial objects.
Background
The PointInPolygon function determines whether a given point in the plane lies inside or outside of a
polygon. It has various applications in many fields, such as computer graphics, geographical information
systems (GIS), and CAD.
In the following example, point A is in the polygon and point B is outside of the polygon.
A use case for this function is to determine in which “drive-time polygon” surrounding a store a customer
resides. This information helps in mailer targeting.
Another use case is to determine which cell phones are frequently within a polygon surrounding an airport.
This information helps in identifying frequent fliers.
Usage
PointInPolygon Syntax
Arguments
Argument Category Description
SourceLocationColumn Required The names of the columns that contain the point
coordinate values from the source input table.
If only one column is specified, the coordinates of the
point must be expressed using well-known text (WKT)
syntax. For example, the string “POINT (30 10)” is the
WKT markup syntax that describes a point whose x
coordinate is 30 and whose y coordinate is 10.
By supporting WKT, this function can process the output
of the GeometryLoader function, which is expressed in
WKT. For example, you can use the GeometryLoader
function to convert GIS data formats (for example,
shapefile (.shp), MapInfo TAB (.tab), Keyhole Markup
Language (KML), and GeoJSON) to WKT and use the
PointInPolygon function to process the resulting WKT
data.
For more information about WKT, refer to the following
URL:
http://www.geoapi.org/3.0/javadoc/org/opengis/
referencing/doc-files/WKT.html
If two columns are specified, the function assumes that
they represent the two coordinates (for example, latitude
and longitude) of the input points. The two-column
format for representing points is useful when the input
data consists of raw latitude and longitude pairs.
Input
The PointInPolygon function requires two input tables, source and reference. These tables must have the
same coordinate reference system.
Table 1129: PointInPolygon Source Table Schema
Examples
These examples use the PointInPolygon function in three modes:
• With outputall ('true')
• With outputall ('false')
• Using passenger coordinates as separate columns
Input
This example assumes that the parsed location data is grouped into the following relation, shown
in the table source_passenger, which is the input to the function. There are four passengers whose x, y coordinates
are known, and the goal is to determine in which of the two airport terminals (A or B) each passenger is located.
The layout, or geographical location, of the terminals is specified in the table reference_terminal as polygon
coordinates. In this table, the coordinates of the points are specified using WKT syntax.
SQL-MapReduce Call
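A sketch of the Example 1 call; the coordinate and name column names in the two tables are assumptions:

SELECT * FROM PointInPolygon (
    ON source_passenger AS source PARTITION BY ANY
    ON reference_terminal AS reference DIMENSION
    SourceLocationColumn ('location_point')       -- assumed column name
    ReferenceLocationColumn ('location_polygon')  -- assumed column name
    ReferenceNameColumns ('terminal_name')        -- assumed column name
    OutputAll ('true')
);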
Output
Because the OutputAll argument is set to true, the output table shows all passengers regardless of whether
they are in a terminal or not.
Table 1134: PointInPolygon Example 1 Output Table (Columns 1-2)
source_location_point ref_reference_location_polygon
POINT (30 10) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (30 10) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
POINT (300 10) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (300 10) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
POINT (300 20) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (300 20) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
POINT (400 20) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (400 20) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
SQL-MapReduce Call
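A sketch of the Example 2 call, identical to Example 1 except for the OutputAll argument:

SELECT * FROM PointInPolygon (
    ON source_passenger AS source PARTITION BY ANY
    ON reference_terminal AS reference DIMENSION
    SourceLocationColumn ('location_point')
    ReferenceLocationColumn ('location_polygon')
    ReferenceNameColumns ('terminal_name')
    OutputAll ('false')
);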
Output
In this example, which uses the same input as Example 1 but has the OutputAll argument set to false, the
output includes only passengers inside a terminal. Macy is not in any terminal and does not appear in the
output table.
Table 1136: PointInPolygon Example 2 Output Table (Columns 1-2)
source_location_point ref_reference_location_polygon
POINT (30 10) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
POINT (300 10) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (300 20) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
Input
customer_id x y customer_name
1 30 10 Jeff
1 300 10 John
1 300 20 Maria
1 400 20 Macy
SQL-MapReduce Call
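A sketch of the Example 3 call, with the point coordinates supplied as two separate columns (the source table name is an assumption):

SELECT * FROM PointInPolygon (
    ON source_passenger_xy AS source PARTITION BY ANY  -- assumed table name
    ON reference_terminal AS reference DIMENSION
    SourceLocationColumn ('x', 'y')
    ReferenceLocationColumn ('location_polygon')
    ReferenceNameColumns ('terminal_name')
    OutputAll ('false')
);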
Output
x y ref_reference_location_polygon
30 10 POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
300 10 POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
300 20 POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
GeometryOverlay
Summary
The GeometryOverlay function takes two geometries described by the well-known text (WKT) markup
language and outputs the result of overlaying them, as specified by the boundary operator.
You can use this function to prepare sets of geometries for input to the PointInPolygon function. For
example, you can use this function to prepare a geometry that contains all cellular phone reception polygons
near an airport to create a geometry that is useful for identifying frequent fliers.
Note:
The function works only on 2D spatial objects.
Usage
GeometryOverlay Syntax
Syntax for the CONVEXHULL Operator
Version 1.1
Syntax for the BUFFER Operator
Version 1.1
Arguments
Argument Category Description
SourceLocationColumn Required Specifies the name of the source table column that contains the
polygon description in WKT format.
ReferenceLocationColumn Required Specifies the name of the reference table column that contains
the location of the polygon description in WKT format.
ReferenceNameColumns Required Specifies the name of the reference table column that contains
the names of the polygons.
BoundaryOperator Required Specifies the boundary (geometry overlay) operator. For
descriptions of these operators, refer to the following table.
Distance Required by BUFFER operator Specifies the distance by which to extend or decrease the
polygon.
OutputAll Optional Specifies whether to include the results for non-intersecting
geometries in the output. The default value is 'false'.
Accumulate Optional Specifies the names of the source table columns to copy to the
output table.
DIFFERENCE Result contains the area covered only by the source polygon (the blue
area in the figure).
SYMDIFFERENCE Result contains the area covered by only the source polygon or only
the reference polygon (the blue area in the figure).
CONVEXHULL Result contains the smallest convex set that contains the source
polygon in the Euclidean plane (the blue area in the figure).
Note:
For the buffer operation, eight segments are used in the curve
approximation for the corners.
Input
For the boundary operators CONVEXHULL and BUFFER, the function requires only one input table,
source. For the other boundary operators, the function requires two input tables, source and reference,
which must use the same coordinate reference system.
Table 1142: GeometryOverlay Source Table Schema
Examples
The following three examples use the same Input.
• Example 1: Intersection
• Example 2: Union
• Example 3: Buffer (Single Input)
Input
The source input table, source_gatetype, provides the geometrical coordinates of the domestic and
international gates that are spread over three terminals (A, B and C) of an airport.
Table 1145: GeometryOverlay Input Table source_gatetype
The terminal coordinates are given in the table ref_terminal. All coordinates are in WKT syntax.
Table 1146: GeometryOverlay Input Table ref_terminal
SQL-MapReduce Call
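A sketch of the Example 1 call; the WKT column names and the accumulated id column are assumptions:

SELECT * FROM GeometryOverlay (
    ON source_gatetype AS source PARTITION BY ANY
    ON ref_terminal AS reference DIMENSION
    SourceLocationColumn ('gate_polygon')        -- assumed column name
    ReferenceLocationColumn ('terminal_polygon') -- assumed column name
    ReferenceNameColumns ('terminal_name')       -- assumed column name
    BoundaryOperator ('INTERSECTION')
    Accumulate ('id')
);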
Output
The output shows that all domestic gates are contained within Terminal A. International gates are spread
over all three terminals.
Table 1147: GeometryOverlay Example 1 Output Table
Example 2: Union
This example computes the union of the gate areas and the terminal areas.
SQL-MapReduce Call
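A sketch of the Example 2 call, with the same assumptions as Example 1:

SELECT * FROM GeometryOverlay (
    ON source_gatetype AS source PARTITION BY ANY
    ON ref_terminal AS reference DIMENSION
    SourceLocationColumn ('gate_polygon')
    ReferenceLocationColumn ('terminal_polygon')
    ReferenceNameColumns ('terminal_name')
    BoundaryOperator ('UNION')
    Accumulate ('id')
);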
Output
Because all domestic gates are contained within Terminal A, the union of the area specified by the domestic
gates and the area of Terminal A is just Terminal A, as shown in the first row of the output. The second row
shows the union of the coordinates of the international gates and all three terminals.
Table 1148: GeometryOverlay Example 2 Output Table
SQL-MapReduce Call
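A sketch of the single-input Example 3 call; the distance value 2 is inferred from the buffered coordinates in the output below:

SELECT * FROM GeometryOverlay (
    ON source_gatetype AS source PARTITION BY ANY
    SourceLocationColumn ('gate_polygon')  -- assumed column name
    BoundaryOperator ('BUFFER')
    Distance ('2')
    Accumulate ('id')
);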
Output
overlay_boundary boundary_name id
POLYGON ((10 8, 9.609819355967742 8.03842943919354, Domestic Gates 1
9.234633135269819 8.152240934977428, 8.888859533960796
8.33706077539491, 8.585786437626904 8.585786437626904,
8.337060775394908 8.888859533960796, 8.152240934977426
9.23463313526982, 8.03842943919354 9.609819355967744, 8 10, 8 20,
8.03842943919354 20.390180644032256, 8.152240934977426
20.76536686473018, 8.33706077539491 21.111140466039203,
8.585786437626904 21.414213562373096, 8.888859533960796
21.66293922460509, 9.23463313526982 21.847759065022572,
9.609819355967744 21.96157056080646, 10 22, 20 22,
20.390180644032256 21.96157056080646, 20.76536686473018
21.847759065022572, 21.111140466039206 21.66293922460509,
21.414213562373096 21.414213562373096, 21.66293922460509
21.111140466039203, 21.847759065022572 20.76536686473018,
21.96157056080646 20.390180644032256, 22 20, 22 10,
21.96157056080646 9.609819355967744, 21.847759065022572
9.23463313526982, 21.66293922460509 8.888859533960796,
21.414213562373096 8.585786437626904, 21.111140466039206
8.33706077539491, 20.76536686473018 8.152240934977426,
20.390180644032256 8.03842943919354, 20 8, 10 8))
POLYGON ((50 48, 49.609819355967744 48.03842943919354, International Gates 2
49.23463313526982 48.15224093497743, 48.8888595339608
48.33706077539491, 48.58578643762691 48.58578643762691,
48.33706077539491 48.8888595339608, 48.15224093497743
49.23463313526982, 48.038429439193536 49.609819355967744, 48 50, 48
150, 48.038429439193536 150.39018064403226, 48.15224093497743
150.76536686473017, 48.33706077539491 151.11114046603922,
48.58578643762691 151.4142135623731, 48.8888595339608
151.66293922460508, 49.23463313526982 151.8477590650226,
49.609819355967744 151.96157056080645, 50 152, 150 152,
150.39018064403226 151.96157056080645, 150.76536686473017
151.8477590650226, 151.11114046603922 151.66293922460508,
151.4142135623731 151.4142135623731, 151.66293922460508
151.11114046603922, 151.8477590650226 150.76536686473017,
151.96157056080645 150.39018064403226, 152 150, 152 50,
151.96157056080645 49.609819355967744, 151.8477590650226
49.23463313526982, 151.66293922460508 48.8888595339608,
151.4142135623731 48.58578643762691, 151.11114046603922
48.33706077539491, 150.76536686473017 48.15224093497743,
150.39018064403226 48.038429439193536, 150 48, 50 48))
IdentityMatch
Summary
The IdentityMatch function tries to match source data with reference data, using specified attributes to
calculate the similarity score of each source-reference pair, and then computes the final similarity score.
Typically, the source data is about business customers and the reference data is from external sources, such
as online forums and social networking services.
Background
Businesses can easily gather customer sentiments from external data sources such as online forums and
social networking services. However, businesses cannot easily tell if the customer whose database identifier
(ID) is John Q. Public is the person with the online forum ID JQPublic or the social networking service ID
JohnP. The IdentityMatch function is intended to make this job easier.
The IdentityMatch function supports both nominal (exact) matching and fuzzy matching. You specify the
nominal-match attributes and the fuzzy-match attributes. First, the function compares the nominal-match
attributes. If they match exactly, the function does not compare the fuzzy-match attributes; if not, the
function compares the fuzzy-match attributes and uses only their similarity score.
For example, suppose that the nominal-match attribute is user ID and the fuzzy-match attribute is email or
mobile phone number. Two user IDs might not match exactly, but if both are associated with the same email
or mobile phone number, then they are considered to identify the same user.
However, if the fuzzy-match attributes do not represent users (as location and many other profile attributes
do not), then the function uses weighted matching. For example, for customer 1 and external user 2, the
matching formula could be:
score(1,2) = w1*f1(a1,b1) + w2*f2(a2,b2) + ... + wn*fn(an,bn)
where ai and bi are the values of the ith fuzzy-match attribute for customer 1 and external user 2, fx is a
function that calculates the similarity of two strings and returns a value between 0 and 1, and
w1+w2+...+wn = 1.
Usage
IdentityMatch Syntax
Arguments
Argument Category Description
IDColumn Required Specifies the names of the columns in the source and
reference input tables that contain row identifiers. The
function copies these columns to the output table.
NominalMatchColumns Optional* Specifies pairs of columns (attributes) to check for exact
matching (a.columnX and b.columnY are column names). If
any pair matches exactly, then their records are considered
to be exact matches.
*Required if you omit FuzzyMatchColumns.
FuzzyMatchColumns Optional* Specifies pairs of columns (attributes) to check for fuzzy
matching (a.columnX and b.columnY are column names)
and the fuzzy matching parameters match_metric,
match_weight, and synonym_file (whose descriptions
follow). If any pair is a fuzzy match, then their records are
considered to be fuzzy matches.
*Required if you omit NominalMatchColumns.
The parameter match_metric specifies the similarity metric,
which is a function that returns the similarity score of two strings (a value between 0 and 1).
Note:
The function calculates IDF only on the input relation
stored in memory.
Note:
You must install the dictionary before running the
function.
Accumulate Optional Specifies input table columns to copy to the output table.
Threshold Optional Specifies the threshold similarity score, a DOUBLE
PRECISION value between 0 and 1. The default value is 0.5.
The function outputs only the records whose similarity
score exceeds threshold.
Input
The IdentityMatch function requires a source input table and a reference input table. The following two
tables describe the input table columns that appear in the function syntax. The tables can have additional
columns, but the function ignores them.
Table 1150: IdentityMatch Source Input Table Schema
Output
Table 1152: IdentityMatch Output Table Schema
Example
Input
The input table, applicant_reference, is hypothetical information from people applying for employment at a
particular company. This table is a reference table against which external data from various sources can be
compared for identity matching.
Table 1153: IdentityMatch Example Input Table applicant_reference
The example compares this table with information (including credit scores) from the external source shown
in the following table. This table has missing and incomplete information, as expected with data from
different sources.
SQL-MapReduce Call
The objective is to correctly match the information from the external source with the applicants in the
reference table and thus accurately identify each applicant's credit score. Assume a default threshold of 0.5. A
higher threshold means that the matching accuracy is higher. Look for exact matches
(NominalMatchColumns) to the email address and allow approximate matches (FuzzyMatchColumns) for the
lastname, firstname, zipcode, city, and department columns, with different match metrics and match weights.
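A sketch of the call; the external-source table name and the exact fuzzy-match parameter format are assumptions based on the argument descriptions above:

SELECT * FROM IdentityMatch (
    ON applicant_reference AS a PARTITION BY ANY
    ON applicant_external AS b DIMENSION  -- assumed table name
    IDColumn ('a.id: b.id')
    NominalMatchColumns ('a.email: b.email')
    -- 'column pair, match_metric, match_weight' format is an assumption
    FuzzyMatchColumns ('a.lastname: b.lastname, jaro, 0.2',
                       'a.firstname: b.firstname, jaro, 0.2',
                       'a.zipcode: b.zipcode, exact, 0.2',
                       'a.city: b.city, jaro, 0.2',
                       'a.department: b.department, jaro, 0.2')
    Accumulate ('b.creditscore')
    Threshold ('0.5')
);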
Output
The output table shows the matching information from both input tables and shows the matching score in
the last column. If multiple row entries exist for the same id, which is typically the case, then the higher score
entry gives the best match. For instance, in the output below, the first row is chosen over the second row as it
is a perfect match (score 1) over score 0.6036. The creditscore column gives the credit score for the
applicants.
IPGeo
Summary
IPGeo lets you map IP addresses to location information (country, region, city, latitude, longitude, ZIP code,
and ISP).
You can use the locations of web site visitors to improve the effectiveness of online applications. For
example:
• Targeted online advertising
• Content localization
• Geographic rights management
• Enhanced analytics
• Online security and fraud prevention
Usage
IPGeo Syntax
Version 2.1
Arguments
Argument Category Description
IPAddressColumn Required Specifies the name of the input table column that contains the IP
addresses.
Converter Optional The JAR filename and the name of the class that converts the IP
address to location information. The JAR file must be installed on
the Aster Database and the class name must be the full name, which
includes the package information. The file and class parameters are
case-sensitive.
The IPGeo function is a special case and needs a user-defined class.
This is why you must use the Converter argument. Only the JAR file
declared by this argument can be used by the function.
The JAR file must contain all the classes needed by the user-defined
converter. In Aster Database, all of the installed files are stored in
the database. When a function is invoked, only a ZIP/JAR file
consistent with the SQL-MapReduce function name is temporarily
downloaded to the file system to be executed.
To create a new class, refer to Extending IPGeo.
IPDatabaseLocation Optional The location of the IP database that matches IP addresses to
locations. The IP databases can be stored in the file system or in
Aster Database. If the data is stored in a file system, each worker
must have the same path, and the absolute path must be set in this
parameter. If the data is installed in Aster Database, this argument is
ignored.
Accumulate Optional Specifies the names of input table columns to copy to the output
table.
Output
Table 1158: IPGeo Output Table Schema
Examples
The examples have the same Input and Output, but different SQL-MapReduce calls.
Examples 1 and 2 require that the file MaxMindLite.jar be installed in Aster Database. This JAR file
contains the converter interface used in these examples,
'com.asterdata.sqlmr.analytics.location.ipgeo.MaxMindLite'.
Example 3 uses the default Converter class that ships with the IPGeo function (GeoLite 1.2.8).
The function requires two database files that map IP addresses to geographic locations (GeoLiteCity.dat and
GeoLiteCityv6.dat). In Example 1, the files are referenced by pathname using the IPDatabaseLocation
argument, and need not be installed in Aster Database. Examples 2 and 3 assume that the two
database files have been installed in Aster Database.
Table 1159: IPGeo Examples Input Table
id ip
1 159.41.1.23
2 153.65.16.10
3 75.36.209.106
4 202.106.0.20
5 69.236.77.51
6 168.187.7.114
SQL-MapReduce Call
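A sketch of the Example 1 call; the input table name (ipgeo_input) and the database-file path are assumptions:

SELECT * FROM IPGeo (
    ON ipgeo_input
    IPAddressColumn ('ip')
    Converter ('MaxMindLite.jar', 'com.asterdata.sqlmr.analytics.location.ipgeo.MaxMindLite')
    IPDatabaseLocation ('/home/beehive/geoip/')  -- assumed path
    Accumulate ('id')
);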
SQL-MapReduce Call
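A sketch of the Example 2 call, which omits IPDatabaseLocation because the database files are installed in Aster Database:

SELECT * FROM IPGeo (
    ON ipgeo_input
    IPAddressColumn ('ip')
    Converter ('MaxMindLite.jar', 'com.asterdata.sqlmr.analytics.location.ipgeo.MaxMindLite')
    Accumulate ('id')
);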
SQL-MapReduce Call
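A sketch of the Example 3 call, which uses the default converter:

SELECT * FROM IPGeo (
    ON ipgeo_input
    IPAddressColumn ('ip')
    Accumulate ('id')
);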
Output
Table 1160: IPGeo Example Output Table (Column 1-7)
Extending IPGeo
Because IPGeo cannot cover all the IP database providers for technical and license reasons, you can extend
this function to support new database providers.
Note:
Only Maxmind GeoLite 1.2.8 ships with this function.
To support a new provider, implement the following Converter interface:
package com.asterdata.sqlmr.analytics.location.ipgeo;
public interface Converter
{
/**
* initialize a Converter instance with corresponding resource
* @param ipDatabasePath
*/
void initialize(String ipDatabasePath);
/**
* release resources used by this instance before the SQL-MR function close
*/
void finalize();
/**
* Look up location information for the input ipv4 address and write the
* result to an IpLocation instance
* @param ip
* input, IP address in ipv4 format
* @param ipLocation
* output, to hold the location information
*/
void findIpv4(String ip, IpLocation ipLocation);
/**
*
* Look up location information for the input ipv6 address and write the
* result to an IpLocation instance
* @param ip
* input, IP address in ipv6 format
* @param ipLocation
* output, to hold the location information
*/
void findIpv6(String ip, IpLocation ipLocation);
}
Class IpLocation is designed to hold the location information and emit it (you can also find this
class in the ipgeo.jar file). The class has get and set functions for the member variables
corresponding to the SQL-MapReduce function output columns (country, region, city, latitude,
longitude, ZIP code, and ISP). The following example implements the Converter interface for the
MaxMind GeoLite2 database:
package com.asterdata.sqlmr.analytics.location.ipgeo;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import com.asterdata.ncluster.sqlmr.IllegalUsageException;
import com.asterdata.ncluster.sqlmr.data.InstalledFile;
import com.asterdata.sqlmr.analytics.location.ipgeo.Converter;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;
import com.maxmind.geoip2.model.CityResponse;
/**
* A Converter implementation for MaxMind GeoLite2 version
*/
public class MaxMindLite2 implements Converter
{
private DatabaseReader reader = null;
private static final String CITY_DATABASE = "GeoLite2-City.mmdb";
private String tmpCityDatabase_ = null;
//initialize the databaseReader
public void initialize(String ipDatabasePath)
{
if(ipDatabasePath == null)
{
tmpCityDatabase_ = downloadFile(CITY_DATABASE);
initializeDatabaseReader(tmpCityDatabase_);
}
else
{
String path = ipDatabasePath.endsWith("/") ?
ipDatabasePath + CITY_DATABASE : ipDatabasePath + "/" + CITY_DATABASE ;
initializeDatabaseReader(path);
}
}
//find address according to ipv4 address
public void findIpv4(String ip, IpLocation ipLocation)
{
CityResponse response = null;
try
{
response = reader.city(InetAddress.getByName(ip));
}
catch (UnknownHostException e)
{
// do nothing
}
    catch (IOException e)
    {
      // do nothing
    }
    catch (GeoIp2Exception e)
    {
      // do nothing
    }
    // The remainder of the example, which copies the CityResponse fields
    // into ipLocation and implements findIpv6, finalize, and the helper
    // methods downloadFile and initializeDatabaseReader, is omitted here.
  }
}
JSONParser
Summary
The JSONParser function extracts the element name and text from JSON strings and outputs
them in a flattened relational table.
Background
On the Internet, most data is exchanged and processed using JSON or XML, and then displayed using
HTML. These languages are based on named structures that can be nested, using a Unicode-based
representation that is both human-readable and machine-readable.
Each language is optimized for its main function:
• HTML for representing web pages
• XML for representing document data
• JSON for representing programming language structures
In many applications, programmers work with only one of these formats. In others, programmers work with
all three. For traditional programming structures, programming with JSON is significantly easier than
programming with XML.
XQuery is the standard query language for XML, and has been implemented in databases, streaming
processors, data integration platforms, application integration platforms, XML message routing software,
and other products.
Usage
JSONParser Syntax
Version 1.5
Arguments
Argument Category Description
TextColumn Required Specifies the column name from the input table which
contains the JSON string.
Nodes Required Specifies the parent/children pair. Must contain at least one
parent/child pair, and all pairs specified must be in the same
format. Multiple children can be specified as parent/
{child1,child2,...}.
SearchPath Optional Specifies the path to find the direct value of the child. To
reach the parent of the parent, include the parent of the
parent in this path. When a path to the parent of the parent
is supplied, all the siblings of the parent can be printed by
including them in the NODES argument. If anything from
root is to be parsed, then supply this argument as '/' (or leave
it as an empty string).
Delimiter Optional Specifies the delimiter used to separate multiple child values
with the same name and which have the same parent node
in the JSON String. If not defined, defaults to comma ','.
Note:
The delimiter cannot include '#'.
MaxItemNum Optional The maximum number of nodes with the same name to
display in the output. The default value is 10.
NodeIDOutputColumn Optional The name of the column to use in the result schema to
contain the identifier (from the input table) of each node
extracted. If not defined, defaults to 'out_nodeid'.
ParentNodeOutputColumn Optional The name of the column to use in the result schema to contain
the tag name of the parent node extracted. If not defined,
defaults to 'out_parent_node'.
Accumulate Optional Specifies the input table columns to copy to the output table.
ErrorHandler Optional Specifies how the function acts when it encounters a data
problem. If not specified, the function aborts if the input
table contains bad data (for example, invalid UTF-8
characters).
ErrorHandler lets you specify an “additional” column to
hold any rows that were rejected as having bad data, also
referred to as the output column, in the output table. The
log information in the additional column lets you easily
identify which input table row contains unexpected data.
There are two parameters you can pass to ErrorHandler:
The first parameter tells the function whether to continue
processing if bad data is encountered. 'true' means continue
the processing without aborting. 'false' means abort the
process when an error occurs.
The second group of parameters designates the output and
input columns. The parameters in this group,
output_col_name: input_col_name1, input_col_name2,
input_col_name3,... are optional. If you specify an output
column, it is added to the output, and bad rows are logged
there. If you do not specify output_col_name, the function
uses “ERROR_HANDLER” as the name of the output
column. The error output column includes the data from the
input columns specified using input_col_namex, when an
error occurs. The data inserted into the output column is
merged from input columns and delimited by column using
a semicolon.
Using ErrorHandler('true') without specifying input
columns does not add any data to the output column.
Input
The table used as input must contain a column with JSON data.
Examples
• Example 1: With Nondefault Options
• Example 2: With Default Argument Values
• Example 3: Parsing with Ancestor (Search Path Argument Specified)
• Example 4: Specifying ERROR_HANDLER When Calling JSONParser
Input
The input table below is a single JSON record with multiple fields.
Table 1162: JSONParser Example 1 Input Table
id data
1
{"menu": {
"id": "1",
"value": "File",
"popup": {
"menuitem": [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}}
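SQL-MapReduce Call
A sketch of a call with nondefault output column names; the input table name (json_parser_data) is an assumption:

SELECT * FROM JSONParser (
    ON json_parser_data
    TextColumn ('data')
    Nodes ('menu/{id,value}', 'menuitem/{value,onclick}')
    NodeIDOutputColumn ('node_id')          -- nondefault name
    ParentNodeOutputColumn ('parent_node')  -- nondefault name
);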
Output
The node values are output as shown below:
Table 1163: JSONParser Example 1 Output Table
Input
This example uses the default values for the arguments NodeIDOutputColumn and
ParentNodeOutputColumn. The input table shown below is a single JSON record with multiple fields.
Table 1164: JSONParser Example 2 Input Table
id data
1
{
"email":"[email protected]",
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup
Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
id data
SQL-MapReduce Call
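A sketch of the call with default argument values (the input table name is an assumption):

SELECT * FROM JSONParser (
    ON json_parser_data2
    TextColumn ('data')
    Nodes ('GlossEntry/{ID,SortAs,GlossTerm,Acronym,Abbrev}')
);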
Output
Input
This example uses the same input as Example 2 (Input).
When a specific path is specified in the SearchPath argument, the function only looks for the fields (key-value
pairs) within the search path. This example specifies SearchPath('/glossary/GlossDiv/GlossList').
SQL-MapReduce Call
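A sketch of the call; the Nodes list is an assumption consistent with the output description below:

SELECT * FROM JSONParser (
    ON json_parser_data2
    TextColumn ('data')
    Nodes ('glossary/email', 'GlossEntry/{ID,GlossTerm}')
    SearchPath ('/glossary/GlossDiv/GlossList')
);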
Output
Because the email field is not included in the specified search path, it returns an empty column.
Table 1167: JSONParser Example 3 Output Table (Columns 1-5)
Input
The input table below is the same as the input for Example 1 (Input), except that this version has a
formatting error. In this example, the data column is missing a closing quotation mark and a colon after the
“menuitem” field.
id data
1
{"menu": {
"id": "1",
"value": "File",
"popup": {
"menuitem [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}}
SQL-MapReduce Call
The ErrorHandler argument, using the default output column name, is applied to the data column:
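A sketch of the call; the exact ErrorHandler parameter string format is an assumption based on the argument description, and the input table name is hypothetical:

SELECT * FROM JSONParser (
    ON json_parser_data3  -- assumed table name
    TextColumn ('data')
    Nodes ('menuitem/{value,onclick}')
    -- continue on bad rows; log the data column in the default ERROR_HANDLER column
    ErrorHandler ('true; ERROR_HANDLER: data')
);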
Output
The ERROR_HANDLER column outputs the error row as shown below.
Table 1170: JSONParser Example 4 Output Table (Columns 1-3)
ERROR_HANDLER
{"menu": {
"id": "1",
"value": "File",
"popup": { "menuitem [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}};
Multi_Case
Summary
The Multi_Case function extends the capability of the SQL CASE statement by supporting matches to
multiple criteria in a single row.
When SQL CASE finds a match, it outputs the result and immediately proceeds to the next row without
searching for more matches in the current row.
The Multi_Case function iterates through the input data set only once and outputs matches whenever a
match occurs. If multiple matches occur for a given input row, the function outputs one output row for each
match.
Use the Multi_Case function when the conditions in your CASE statement do not form a mutually exclusive
set.
Usage
Multi_Case Syntax
Version 1.1
Arguments
Argument Category Description
Labels Required Specifies a label for each case. Each case corresponds to a condition, which is
a SQL predicate that includes input column names. When an input value
satisfies condition, that is a match, and the function outputs the input row
and the corresponding label.
Output
Table 1173: Multi_Case Output Table Schema
Example
This example labels people with the age groups to which they belong, which overlap:
• infant (younger than 1 year)
• toddler (1-2 years, inclusive)
• kid (2-12 years, inclusive)
• teenager (13-19 years, inclusive)
• young adult (16-25 years, inclusive)
• adult (21-40 years, inclusive)
• middle-aged person (35-60 years, inclusive)
• senior citizen (60 years or older)
Input
The input table contains the identifiers, names, and ages of people. The ages range from 0.5 years (6 months)
to 65 years.
Table 1174: Multi_Case Example Input Table people_age
id name age
1 John 0.5
2 Freddy 2
3 Marie 6
4 Tom Sawyer 17
5 Becky Thatcher 16
6 Philip 22
7 Joseph 25
8 Roger 35
9 Natalie 30
10 Henry 40
11 George 50
12 Sir William 65
SQL-MapReduce Call
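A sketch of the call; each Labels entry pairs a SQL predicate with a label, as described in the Labels argument:

SELECT * FROM Multi_Case (
    ON (SELECT id, name, age FROM people_age)
    Labels (
        'age < 1 AS "infant"',
        'age >= 1 AND age <= 2 AS "toddler"',
        'age >= 2 AND age <= 12 AS "kid"',
        'age >= 13 AND age <= 19 AS "teenager"',
        'age >= 16 AND age <= 25 AS "young adult"',
        'age >= 21 AND age <= 40 AS "adult"',
        'age >= 35 AND age <= 60 AS "middle-aged person"',
        'age >= 60 AS "senior citizen"'
    )
);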
Output
Several people have two labels. For example, Freddy is both a toddler and a kid, and Tom Sawyer and Becky
Thatcher are both teenagers and young adults.
Table 1175: Multi_Case Example Output Table
MurmurHash
Summary
The MurmurHash function computes the hash values of the input columns.
Background
MurmurHash is a noncryptographic hash function suitable for hash-based searching. The function
computes the MurmurHash value of each column value in each row of the input table.
Usage
MurmurHash Syntax
Version 1.1
Arguments
Argument Category Description
InputColumns Required Specifies the names of the input table columns for which to calculate
hash values.
Note:
NULL values in the input columns are output as NULL.
HashBit Optional Specifies whether the function generates 32-bit hash values (the default)
or 64-bit hash values.
Accumulate Optional Specifies the names of the input table columns to copy to the output
table.
Input
Table 1176: MurmurHash Input Table Schema
Output
Table 1177: MurmurHash Output Table Schema
Input
The input table is a log of midnight temperatures (in degrees Fahrenheit) for five consecutive nights in five
cities. The rows of all columns except id are to be converted to hash values. Because the hash value depends
on data type, the input table has three columns each, with different data types, for the city name, time
period, and temperature:
• Columns 2-4 contain the city name in data types BYTEA, VARCHAR, and TEXT.
• Columns 5-7 contain the time period in data types TIMESTAMP, TEXT, and DATE.
• Columns 8-10 contain the temperature in data types DOUBLE PRECISION, INTEGER, and TEXT.
Table 1178: MurmurHash Examples Input Table murmurhash_input, Columns 1-6
SQL-MapReduce Call
Note:
For the InputColumns argument, columns are numbered 0, 1, 2, and so on (not 1, 2, 3, and so on, as in
the input table description above).
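A sketch of the Example 1 call; the 0-based column-range syntax for InputColumns is an assumption:

SELECT * FROM MurmurHash (
    ON murmurhash_input
    InputColumns ('[1:9]')  -- columns 2-10 of the table, numbered 0-based
    Accumulate ('id')
);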
Output
The hash values for each city name in the output table are the same, regardless of data type, but the hash
values for each time period are different for each data type.
When the temperature value in the input table is an integer (as in rows 3 and 5 of Input), the hash values are
the same for INTEGER and TEXT, but different for REAL. When the temperature value in the input table is
real, the hash values are the same for REAL and TEXT, but different for INTEGER.
Table 1180: MurmurHash Example 1 Output Table, Columns 1-4
SQL-MapReduce Call
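A sketch of the Example 2 call, which requests 64-bit hash values (same range-syntax assumption as Example 1):

SELECT * FROM MurmurHash (
    ON murmurhash_input
    InputColumns ('[1:9]')
    HashBit ('64')
    Accumulate ('id')
);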
Output
OutlierFilter
Summary
The OutlierFilter function filters outliers from a numeric data set, either deleting them or replacing them
with a specified value. Optionally, the function stores the outliers in their own table. The methods that the
function provides for filtering outliers are:
• Percentile
• Tukey’s method
• Carling’s modification
• Median absolute deviation (MAD)
The input data set can contain millions of attribute-value pairs.
Usage
OutlierFilter Syntax
Version 1.3
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the numeric data to be filtered
and (optionally) the columns by which to group the data.
OutputTable Required Specifies the name of the table where the function stores the copy of the
input table (including the PARTITION BY column) with the outliers
either deleted (by default) or replaced (as specified by the
ReplacementValue argument).
TargetColumn Required Specifies the names of the input table columns to be filtered.
OutlierTable Optional Specifies the name of the table where the function outputs copies of the
rows of the input table that contain outliers.
GroupByColumns Optional Specifies the names of the input table columns by which to group the data.
If the data schema format is name:value, then this list must include name.
Method Optional Specifies the method or methods of filtering outliers:
• 'percentile' (default)
• 'tukey' (Tukey’s method)
• 'carling' (Carling’s modification)
• 'MAD-median' (Median absolute deviation (MAD))
MAD is defined as the median of the absolute values of the residuals.
For example, if there are i datapoints and the median value of the data
is M, then MAD=mediani(|xi-M|).
Specify either one method, which the function uses for all columns
specified by TargetColumn, or specify a method for each column specified
by TargetColumn.
ApproxPercentile Optional Specifies whether the function calculates approximate percentiles to use as
filter limits. The default value is 'false' (calculate exact percentiles).
Approximate percentiles are typically faster, but might fail when the
number of groups exceeds one million.
MadScaleConstant Optional Specifies the scale constant used with 'MAD-median' filtering; a DOUBLE
PRECISION value. The default value is 1.4826, which means MAD =
1.4826 * median(|x - median(x)|).
MadThreshold Optional Specifies the threshold used with 'MAD-median' filtering; a DOUBLE
PRECISION value. The default value is 3, which means that |x-
median(x)|/MAD > 3 is flagged as an outlier.
Input
The input table must have at least one column that contains numeric data to be filtered for outliers; you
specify the column names with the TargetColumn argument. The following table describes the input table columns
that you can specify with the TargetColumn and GroupByColumns arguments. The input table can have
other columns, but the function ignores them.
Table 1186: OutlierFilter Input Table Schema
Examples
• Input
• Example 1: Method ('percentile'), ReplacementValue ('null')
• Example 2: Method ('MAD-median'), ReplacementValue ('median')
Input
The input table contains a time series of atmospheric pressure readings (in mbar) for five cities.
Table 1188: OutlierFilter Examples Input Table ville_pressuredata
SQL-MapReduce Call
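A sketch of the Example 1 call; the target and group-by column names are assumptions:

SELECT * FROM OutlierFilter (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('ville_pressuredata')
    OutputTable ('of_output1')
    TargetColumn ('pressure_mbar')  -- assumed column name
    GroupByColumns ('city')         -- assumed column name
    Method ('percentile')
    ReplacementValue ('null')
);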
Output
message
Created table "of_output1"
SQL-MapReduce Call
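A sketch of the Example 2 call, which also names an outlier table (same column-name assumptions as Example 1):

SELECT * FROM OutlierFilter (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('ville_pressuredata')
    OutputTable ('of_output2')
    OutlierTable ('of_outlier2')
    TargetColumn ('pressure_mbar')
    GroupByColumns ('city')
    Method ('MAD-median')
    ReplacementValue ('median')
);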
Output
message
Created tables "of_output2","of_outlier2"
The outlying values have been replaced with the median value for the group.
Table 1192: OutlierFilter Example 2 Output Table of_output2
Pack
Summary
The Pack function takes data from multiple input columns and packs it into a single column. The packed
column has a virtual column for each input column. By default, virtual columns are separated by commas
and each virtual column value is labeled with its column name.
Pack is complementary to the function Unpack, but you can use it on any input columns that meet the input
requirements.
Before packing columns, note their data types; you need them if you want to unpack the packed column.
Usage
Pack Syntax
Version 1.2
Arguments
Argument Category Description
InputColumns Optional Specifies the names of the input columns to pack into a
single output column. These names become the column
names of the virtual columns. By default, all input table
columns are packed into a single output column. If you
specify this argument, but do not specify all input table
columns, the function copies the unspecified input table
columns to the output table.
Delimiter Optional Specifies the delimiter (a string) that separates the virtual
columns in the packed data. The default delimiter is comma
(,).
IncludeColumnName Optional Specifies whether to label each virtual column value with its
column name (making the virtual column
'input_column:value'). The default value is 'true'.
OutputColumn Required Specifies the name to give to the packed output column.
Input
Table 1194: Pack Input Table Schema
Examples
• Input
• Example 1: Default Options
• Example 2: Nondefault Options
Input
The input table contains temperature readings for the cities Nashville and Knoxville, in the state of
Tennessee.
Table 1196: Pack Examples Input Table ville_temperature
SQL-MapReduce Call
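A sketch of the Example 1 call with the default delimiter and labeling; the input column names match the virtual column labels in the output below:

SELECT * FROM Pack (
    ON ville_temperature
    InputColumns ('city', 'state', 'period', 'temp_f')
    OutputColumn ('packed_data')
);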
Output
The columns specified by InputColumns are packed in the column packed_data. Virtual columns are
separated by commas, and each virtual column value is labeled with its column name. The input column sn,
which was not specified by InputColumns, is unchanged in the output table.
Table 1197: Pack Example 1 Output
packed_data sn
city:Nashville,state:Tennessee,period:2010-01-01 00:00:00,temp_f:35.1 1
city:Nashville,state:Tennessee,period:2010-01-01 01:00:00,temp_f:36.2 2
city:Nashville,state:Tennessee,period:2010-01-01 02:00:00,temp_f:34.5 3
city:Nashville,state:Tennessee,period:2010-01-01 03:00:00,temp_f:33.6 4
city:Nashville,state:Tennessee,period:2010-01-01 04:00:00,temp_f:33.1 5
city:Knoxville,state:Tennessee,period:2010-01-01 03:00:00,temp_f:33.2 6
city:Knoxville,state:Tennessee,period:2010-01-01 04:00:00,temp_f:32.8 7
city:Knoxville,state:Tennessee,period:2010-01-01 05:00:00,temp_f:32.4 8
city:Knoxville,state:Tennessee,period:2010-01-01 06:00:00,temp_f:32.2 9
city:Knoxville,state:Tennessee,period:2010-01-01 07:00:00,temp_f:32.4 10
SQL-MapReduce Call
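A sketch of the Example 2 call with a pipe delimiter and no column-name labels:

SELECT * FROM Pack (
    ON ville_temperature
    InputColumns ('city', 'state', 'period', 'temp_f')
    Delimiter ('|')
    IncludeColumnName ('false')
    OutputColumn ('packed_data')
);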
Output
Virtual columns are separated by pipe characters and not labeled with their column names.
Table 1198: Pack Example 2 Output
packed_data sn
Nashville|Tennessee|2010-01-01 00:00:00|35.1 1
Nashville|Tennessee|2010-01-01 01:00:00|36.2 2
Nashville|Tennessee|2010-01-01 02:00:00|34.5 3
Nashville|Tennessee|2010-01-01 03:00:00|33.6 4
Nashville|Tennessee|2010-01-01 04:00:00|33.1 5
Knoxville|Tennessee|2010-01-01 03:00:00|33.2 6
Knoxville|Tennessee|2010-01-01 04:00:00|32.8 7
Knoxville|Tennessee|2010-01-01 05:00:00|32.4 8
Knoxville|Tennessee|2010-01-01 06:00:00|32.2 9
Knoxville|Tennessee|2010-01-01 07:00:00|32.4 10
Pivot
Summary
The Pivot function pivots data that is stored in rows into columns. The function takes as input a table of data
to be pivoted and constructs the output schema based on the values of its arguments. The function handles
NULL values automatically.
The reverse of this function is Unpivot.
Usage
Pivot Syntax
Version 1.5
Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections
Arguments
Argument Category Description
PartitionColumns Required Specifies the same columns as the PARTITION BY clause (in any
order).
NumberOfRows Optional Specifies the maximum number of rows in any partition. If a partition
has fewer than number_of_rows rows, then the function adds NULL
values; if a partition has more than number_of_rows rows, then the
function omits the extra rows.
If you omit this argument, then you must specify the PivotColumn
argument.
Note:
With this argument, the ORDER BY clause is optional. If omitted,
the order of values can vary. The function adds NULL values at the
end.
PivotColumn Optional Specifies the name of the column that contains the pivot keys. If the
pivot column contains numeric values, then the function casts them to
VARCHAR.
If you omit the NumberOfRows argument, then you must specify this
argument.
Note:
If you specify the PivotColumn argument, then you must order the
input data; otherwise, the output table column content is
nondeterministic. For details, see Ordering Input Data.
PivotKeys Optional If you specify the PivotColumn argument, then this argument specifies
the names of the pivot keys. Do not use this argument without the
PivotColumn argument.
If pivot_column contains a value that is not specified as a pivot_key,
then the function ignores the row containing that value (see Example
2: Specify Pivot Keys).
By default, every unique value in pivot_column is a pivot key (see
Example 3: Use Default Pivot Keys).
TargetColumns Required Specifies the names of the input columns that contain the values to
pivot.
Ordering Input Data
If the input data is not ordered, the content of the output table columns is nondeterministic. For example,
consider this input table:
Id val
A x
A y
B w
B z
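A sketch of a call without an ORDER BY clause (the table name pivot_demo is hypothetical; NumberOfRows (2) matches the two output columns val_0 and val_1):

SELECT * FROM Pivot (
  ON pivot_demo PARTITION BY id
  PartitionColumns ('id')
  NumberOfRows (2)
  TargetColumns ('val')
);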
Each time you make the preceding call, the output table can be any of the following:
Table 1201: Possible Pivot Output Table 1
Id val_0 val_1
A x y
B w z
Id val_0 val_1
A y x
B w z
Id val_0 val_1
A x y
B z w
Id val_0 val_1
A y x
B z w
If the input table has a column that determines row order, such as sequencenum in the following table, you
can order the input data to make the output deterministic:
Id val sequencenum
A x 4
A y 2
B w 9
B z 3
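A sketch of the same call with an ORDER BY clause on sequencenum (the table name pivot_demo is hypothetical):

SELECT * FROM Pivot (
  ON pivot_demo PARTITION BY id ORDER BY sequencenum
  PartitionColumns ('id')
  NumberOfRows (2)
  TargetColumns ('val')
);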
Every time you use the preceding call, you get this result:
Table 1206: Pivot Output Table for Ordered Input Data
Id val_0 val_1
A y x
B z w
Examples
• Input
• Example 1: Specify Maximum Number of Rows in Any Partition
• Example 2: Specify Pivot Keys
• Example 3: Use Default Pivot Keys
Input
The input table contains temperature, pressure, and dewpoint data for three cities, in sparse format.
Table 1208: Pivot Examples Input Table pivot_input
SQL-MapReduce Call
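A sketch of the call; the column names follow the Output description below, and NumberOfRows (3) matches the three output columns value_0, value_1, and value_2:

SELECT * FROM Pivot (
  ON pivot_input PARTITION BY sn, city, week ORDER BY week
  PartitionColumns ('sn', 'city', 'week')
  NumberOfRows (3)
  TargetColumns ('value')
) ORDER BY sn;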
Note:
The ORDER BY clause is optional. If omitted, the order of values can vary. The function always adds any
NULL values at the end.
Output
To create the output table, the function pivots the input table on the partition columns (sn, city, and week)
and outputs the contents of the target column (value) in dense format in the output columns value_0,
value_1, and value_2, which contain the temperature, pressure, and dewpoint, respectively.
Table 1209: Pivot Example 1 Output Table
SQL-MapReduce Call
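A sketch of the call; the pivot column name key is an assumption (the input table schema is not reproduced here), and the pivot keys match the output columns value_temp and value_pressure:

SELECT * FROM Pivot (
  ON pivot_input PARTITION BY sn, city, week ORDER BY key
  PartitionColumns ('sn', 'city', 'week')
  PivotColumn ('key')
  PivotKeys ('temp', 'pressure')
  TargetColumns ('value')
) ORDER BY sn;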
Note:
The ORDER BY clause is required. If omitted, the output table column content is nondeterministic.
Output
The function outputs the contents of the input column value in dense format in the output columns
value_temp and value_pressure, which contain the temperature and pressure, respectively. Because these
values are numeric, the function casts them to VARCHAR.
Table 1210: Pivot Example 2 Output Table
SQL-MapReduce Call
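A sketch of the call; omitting PivotKeys makes every unique value in the pivot column a pivot key (the pivot column name key is an assumption):

SELECT * FROM Pivot (
  ON pivot_input PARTITION BY sn, city, week ORDER BY key
  PartitionColumns ('sn', 'city', 'week')
  PivotColumn ('key')
  TargetColumns ('value')
) ORDER BY sn;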
Output
PSTParserAFS
Summary
The PSTParserAFS function parses Personal Storage Table (PST) files (which store email in Microsoft
software such as Microsoft Outlook and Microsoft Exchange Client) directly from Aster File Store (AFS).
You can use the PSTParserAFS function to extract email content for customer attrition analysis, customer
service analysis, and spam detection. You can also input PSTParserAFS output to other Aster Analytics
functions, such as Text_Parser, the sentiment extraction functions, and the text classification functions.
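Verifying that AFS is Working
To verify that AFS is working, you can run a pair of commands like the following in ACT (a sketch; the \afs subcommands shown mirror Hadoop file-system shell commands and are assumptions):

\afs -mkdir /test
\afs -ls /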
If the second command lists the /test directory in its output, then AFS is working. (For more information
about the \afs command, see Aster Database User Guide for Aster Appliances 6.20.)
PSTParserAFS Syntax
Version 1.1
Note:
This function does not need an input table, but because SQL-MapReduce functions require at least one
input table, you can create an empty table and pass it to the function.
Arguments
Argument Category Description
Path Required Specifies the path to the PST files on AFS. The input_path represents
either a directory or a file name, and can use regular expressions. For
example:
/test
/test/testfile.pst
/test/*.pst
The PST files must be available on AFS before you call the function.
If input_path represents a directory, the function parses all PST files in
the directory.
If a file represented by an input_path is not a PST file, the function does
not parse that file and logs an error.
A single vworker processes each PST file.
Host Optional Specifies the IP address of the AFS server. The default value is the IP
address of the Queen node.
Input
The input PST files must be available on AFS before you call the function. To upload a PST file to AFS, use
the \afs -put command in ACT. For example, to copy all PST files in the current directory on your
Queen to the /test/ directory on AFS, use this command:
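A sketch of the command (the exact \afs -put usage shown is an assumption based on the description above):

\afs -put *.pst /test/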
If the specified directory does not exist in AFS, the command creates it before copying the files to that
directory.
Here are more examples for uploading PST files to AFS:
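For example, to upload a single file or the contents of another directory (these command forms are assumptions):

\afs -put dum1.pst /test/
\afs -put /tmp/pstfiles/*.pst /test/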
Output
The output table has the default columns and any custom columns specified by the OutputColumns
argument. The following table lists and describes the default and custom columns. In the following table,
column name aliases are in parentheses (for example, the alias of the message_id column name is id). The
function treats aliases as unique column names and processes them independently. Column names are case-
insensitive.
The function creates the output table in memory, but you can direct it to a database table on disk or operate
on it directly within the SQL-MapReduce framework.
Date and time columns use the format yyyy-mm-dd hh:mm:ss. Qualified message classes include
IPM.Contact, IPM.Appointment, IPM.Activity, IPM.Report, IPM.Task, and IPM.Recall.Report. For a list of
all qualified message classes, and more information, see the Microsoft website.
In the Input_Format argument, you can specify the output column names and folders to exclude.
For more information about the table_from_afs function, see the Aster Database User Guide for Aster
Appliances 6.20.
Examples
These examples assume that the PST files are stored in AFS in the directory /test/. You can find the dum1.pst
input file for these examples in the directory "pstParserFiles" (provided with the function) and copy it to the
AFS directory /test/. (For instructions for setting up AFS, refer to Verifying that AFS is Working.)
Examples 1 and 2 show input and output; the others show only SQL-MapReduce calls.
Input
The input file is a PST file that contains information about an email. The following figure shows how the
email looks in Outlook. (The sender and recipient are the same.)
SQL-MapReduce Call
Input
The input file is dum1.pst.
SQL-MapReduce Call
Output
input_path sender to
/test/dum1.pst Microsoft Outlook dumfirst dumlast
Example 5: Multiple PST Files, Specified Host and AFS Server Port Attributes
Scale Functions
Summary
The scale functions are:
• ScaleMap, which takes a data set and outputs its statistical information (assembled at the vworker level)
• Scale, which takes ScaleMap output and outputs scaled (normalized) values for the input data set
You can use Scale output as input to distance-based analysis functions, such as KMeans.
• ScalePrinter, which takes ScaleMap output and outputs global statistical information for the entire input
data set
• PartitionScale, which scales the sequences in each partition independently, using the same formula as
Scale
Scale Function Examples has examples of all scale functions.
Background
The main purpose of the Scale function is to normalize the input data set. The function shifts the input data
and scales it to generate the normalized values.
Scaling a data set allows the comparison of normalized values from different columns without one column
unduly influencing the result. Some normalization methods require only a shift or a scaling step, based on
statistical information such as the mean or maximum; other methods require a combination of shifting and
scaling steps.
Data-set scaling is a necessary step in many data preprocessing flows. For some analytics functions, such as
KMeans and Principal Component Analysis (PCA), the input data set consists of different variables, many of
which are measured on different scales. Without scaling the data set, the influence of different columns varies
considerably and can produce unexpected results.
For example, an insurance company uses the KMeans function to cluster customers according to data about
their houses. The input variables are room area, number of rooms, house height, and house price. These
variables vary considerably in scale. For example, as shown in the following table, the scale for the room area
variable ranges from 50 through 150, while that for the house price variable ranges from $150,000 through
$300,000. If you use the input data set with KMeans without scaling it, the effect of the house price on
customer clustering is much greater than that of room area.
To normalize the data so that each variable has the same effect on customer clustering, you can use the Scale
function and apply the MAXABS method to transform the variable values into a common range. The
following table shows the normalization results.
Table 1219: Output Table Example
ScaleMap
The ScaleMap function outputs statistical information for a data set. The statistical information is assembled
at the vworker level.
ScaleMap output can be input to the functions Scale (which outputs scaled values for the data set) and
ScalePrinter (which outputs global statistics for the data set).
Note:
The statistical data generated by this function is local to each vworker and is intended for use by the Scale
and ScalePrinter functions; the data is not meaningful until it is combined.
Usage
ScaleMap Syntax
Version 1.2
Input
Table 1220: ScaleMap, Scale, or PartitionScale Input Table Schema
Output
Table 1221: ScaleMap Output Table Schema
Scale
The Scale function takes ScaleMap output as input and outputs scaled (normalized) values for the input data
set.
Note:
To scale the sequences in each partition independently, use the function PartitionScale.
Usage
Scale Syntax
Version 1.2
where min, mean, and max are the global minimum, mean, and maximum
values in the corresponding columns.
The function scales the values of min, mean, and max. For example, if
intercept is '- min' and multiplier is 1, the scaled result is transformed
to a nonnegative sequence according to this formula, where scaledmin is
the scaled value:
X' = -scaledmin + 1 * (X - location)/scale
The default intercept is 0.
Input
The function has two input tables: the original input table (whose schema is described in Input) and the
ScaleMap output table (whose schema is described in Output).
Output
The output table contains the normalized results of the input data set. If you specify multiple methods, the
output table has a row for each input-row/method combination.
Table 1224: Scale and PartitionScale Output Table Schema
ScalePrinter
The ScalePrinter function takes as input ScaleMap output (statistics assembled at the vworker level) and
outputs global statistical information for the entire input data set.
ScalePrinter Syntax
Version 1.2
Input
The ScalePrinter input table is the ScaleMap output table; for its schema, refer to ScaleMap Output Table
Schema.
Output
The ScalePrinter output table displays the statistics for the entire data set. The table has the same schema as
the ScaleMap output table (Output); however, its stattype column values are different, as the following table
describes.
Table 1225: Supported Statistical Data Types in ScalePrinter Output Table
PartitionScale
The PartitionScale function scales the sequences in each partition independently, using the same formula as
the function Scale.
Usage
PartitionScale Syntax
Version 1.2
Arguments
Argument Category Description
Method Required Specifies one or more statistical methods to use to scale the data set. For
method values and descriptions, refer to the table Location and Scale for
Statistical Methods.
If you specify multiple methods, the output table includes the column
scalemethod (which contains the method name) and a row for each
input-row/method combination.
MissValue Optional Specifies how the PartitionScale function is to process NULL values in
input:
• KEEP (default):
Keep NULL values.
• OMIT:
Ignore any row that has a NULL value.
• ZERO:
Replace each NULL value with zero.
• LOCATION:
Replace each NULL value with its location value.
InputColumns Required Specifies the input table columns that contain the attribute values of the
samples. The attribute values must be numeric values between -1e308
and 1e308. If a value is outside this range, the function treats it as
infinity.
Tip:
To identify the sequences in the output, specify the partition columns
in this argument.
Multiplier Optional Specifies one or more multiplying factors to apply to the input variables
(multiplier in the following formula):
X' = intercept + multiplier * (X - location)/scale
If you specify only one multiplier, it applies to all columns specified by
the InputColumns argument.
If you specify multiple multiplying factors, each multiplier applies to the
corresponding input column. For example, the first multiplier applies to
the first column specified by the InputColumns argument, the second
multiplier applies to the second input column, and so on. The default
multiplier is 1.
Intercept Optional Specifies one or more addition factors that increment the scaled results
(intercept in the following formula):
X' = intercept + multiplier * (X - location)/scale
If you specify only one intercept, it applies to all columns specified by the
InputColumns argument. If you specify multiple addition factors, each
intercept applies to the corresponding input column.
The syntax of intercept is:
where min, mean, and max are the global minimum, mean, and maximum
values in the corresponding columns.
The function scales the values of min, mean, and max. For example, if
intercept is '- min' and multiplier is 1, the scaled result is transformed to a
nonnegative sequence according to this formula, where scaledmin is the
scaled value:
X' = -scaledmin + 1 * (X - location)/scale
The default intercept is 0.
Input
The PartitionScale input table has the same schema as the Scale input table described in ScaleMap, Scale, or
PartitionScale Input Table Schema.
Input
The input table contains data about houses—categorical data in the column type and numerical data in the
columns price, lotsize, bedrooms, bathrooms, and stories. The column id identifies the rows. The table has
some NULL values.
Table 1226: Scale Functions Examples Input Table scale_housing
SQL-MapReduce Call
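A sketch of a call that computes vworker-level statistics with ScaleMap and scales the data with Scale; the two-input ON syntax (input and statistic aliases) and the Method and MissValue values are assumptions based on the function descriptions above:

SELECT * FROM Scale (
  ON scale_housing AS input PARTITION BY ANY
  ON (SELECT * FROM ScaleMap (
    ON scale_housing
    InputColumns ('price', 'lotsize', 'bedrooms', 'bathrooms', 'stories')
    MissValue ('OMIT')
  )) AS statistic DIMENSION
  Method ('midrange')
  Accumulate ('id')
);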
Input
As in Example 1, the input is scale_housing (Input).
Output
As explained in the description of the Intercept argument, the formula for computing the scaled value X'
from the input value X when intercept is -min is:
X' = -scaledmin + 1 * (X - location)/scale
The formula for computing scaledmin when intercept is -min is:
scaledmin = (minX - location)/scale
For example, consider row 1 of the price column in the input table (Input) and the following output table:
• Input value X = 42000
• Minimum input price value minX = 42000
• Maximum input price value maxX = 88500
• location = (88500+42000)/2 = 65250
• scale = (88500-42000)/2 = 23250
• scaledmin = (42000 - 65250)/23250 = -1
• Scaled output value X' = -(-1) + 1 * (42000 - 65250)/23250 = 0
Table 1228: Scale and ScaleMap Example 2 Output Table
Input
• Training data: scale_housing
• Test data: scale_housing_test
Table 1229: Scale and ScaleMap Example 3 Input Table scale_housing_test
Output
Example 4: ScalePrinter
This example uses ScalePrinter to output the contents of the table scale_stat, created in Example 3, step 1.
SQL-MapReduce Call
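A sketch of the call (ScalePrinter takes only the ScaleMap output table as input):

SELECT * FROM ScalePrinter (
  ON scale_stat
);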
Output
Input
The input table is scale_housing.
SQL-MapReduce Call
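A sketch of a call with multiple methods; the four method names are inferred from the four rows per id in the output below (they appear consistent with maxabs, mean, midrange, and range, in that order), and the two-input ON syntax is an assumption:

SELECT * FROM Scale (
  ON scale_housing AS input PARTITION BY ANY
  ON (SELECT * FROM ScaleMap (
    ON scale_housing
    InputColumns ('price', 'lotsize', 'bedrooms', 'bathrooms', 'stories')
  )) AS statistic DIMENSION
  Method ('maxabs', 'mean', 'midrange', 'range')
  Accumulate ('id')
) ORDER BY id;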
Output
Table 1233: Scale and ScaleMap Example 5 Output Table (Columns 1-3)
id price lotsize
1 0.474576271186441 0.879699248120301
1 -23144.4444444444 1008
1 -1 0.554317548746518
1 0 0.777158774373259
2 0.601503759398496
2 -842
2 -0.476323119777159
2 0.261838440111421
3 0.559322033898305 0.46015037593985
3 -15644.4444444444 -1782
3 -0.67741935483871 -1
3 0.161290322580645 0
... ... ...
Table 1234: Scale and ScaleMap Example 5 Output Table (Columns 4-7)
Example 6: PartitionScale
This example scales the sequences in each partition independently.
Input
The input table is scale_housing, which is partitioned by the column type.
SQL-MapReduce Call
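A sketch of the call; Method ('maxabs') is inferred from the per-partition ratios in the output below, and the Accumulate argument is an assumption:

SELECT * FROM PartitionScale (
  ON scale_housing PARTITION BY type
  Method ('maxabs')
  InputColumns ('price', 'lotsize', 'bedrooms', 'bathrooms', 'stories')
  Accumulate ('type', 'id')
) ORDER BY type, id;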
Output
Table 1235: PartitionScale Example 6 Output Table (Columns 1-3)
type id price
classic 1 0.688524590163934
classic 2
classic 3 0.811475409836066
classic 4 0.991803278688525
classic 5 1
bungalow 6 0.745762711864407
bungalow 7 0.745762711864407
bungalow 8 0.779661016949153
bungalow 9 0.946892655367232
bungalow 10 1
Table 1236: PartitionScale Example 6 Output Table (Columns 4-7)
Note:
The result of this query varies with each run. To ensure repeatability, use the InitialSeeds argument
instead of the NumClusters argument.
Output
StringSimilarity
Summary
The StringSimilarity function calculates the similarity between two strings, using the Jaro, Jaro-Winkler,
N-Gram, or Levenshtein distance. The similarity is a value in the range [0, 1].
Note:
You can use the output of the StringSimilarity function as input to the function FellegiSunterTrainer.
Usage
StringSimilarity Syntax
Version 1.1
Input
The StringSimilarity function has one required input table, which must contain pairs of columns of strings
to be compared. The input table can contain additional columns, but the function ignores them unless you
specify them with the Accumulate argument. The following table describes the required columns of the
input table.
Table 1238: StringSimilarity Input Table Schema
Examples
The following examples both use the same Input:
• Example 1: Comparison of src_text1 with tar_text
• Example 2: Comparison of src_text2 with tar_text
Input
The input table, strsimilarity_input, has two source columns (src_text1 and src_text2) against which the
function compares the target column (tar_text). The function calculates the similarity scores with the
methods specified by the ComparisonColumnPairs argument (Jaro, Jaro-Winkler, N-Gram, and Levenshtein
distance). For
clarity, separate examples show the comparison of each source column with the target column. With some
modifications, you can use the output of this function as input to the FellegiSunter functions.
Table 1240: StringSimilarity Example Input Table strsimilarity_input
SQL-MapReduce Call
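A sketch of the call; the comparison-type tokens (jaro, LD, n_gram, jaro_winkler) and the output column aliases are assumptions:

SELECT * FROM StringSimilarity (
  ON strsimilarity_input PARTITION BY ANY
  ComparisonColumnPairs (
    'jaro (src_text1, tar_text) AS jaro1_sim',
    'LD (src_text1, tar_text) AS ld1_sim',
    'n_gram (src_text1, tar_text, 2) AS ngram1_sim',
    'jaro_winkler (src_text1, tar_text) AS jw1_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id', 'src_text1', 'tar_text')
) ORDER BY id;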
Output
id src_text1 tar_text
1 astre aster
2 hone phone
3 acqiese acquiesce
4 AAAACCCCCGGGGA CCAGGGAAACCCAC
5 alice allies
6 angela angels
7 senter centre
8 chef chief
9 circus circuit
10 debt debris
11 deal lead
12 bare bear
SQL-MapReduce Call
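A sketch of the call, identical to the Example 1 sketch except that it compares src_text2 with tar_text:

SELECT * FROM StringSimilarity (
  ON strsimilarity_input PARTITION BY ANY
  ComparisonColumnPairs (
    'jaro (src_text2, tar_text) AS jaro2_sim',
    'LD (src_text2, tar_text) AS ld2_sim',
    'n_gram (src_text2, tar_text, 2) AS ngram2_sim',
    'jaro_winkler (src_text2, tar_text) AS jw2_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id', 'src_text2', 'tar_text')
) ORDER BY id;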
Output
id src_text2 tar_text
1 astter aster
2 fone phone
3 acquire acquiesce
4 CCCGGGAACCAACC CCAGGGAAACCCAC
5 allen allies
6 angle angels
7 center centre
8 cheap chief
9 circle circuit
10 debut debris
11 dell lead
12 bear bear
Unpack
Summary
The Unpack function takes data from a single packed column and unpacks it into multiple columns. The
packed column is composed of multiple virtual columns, which become the output columns. To determine
the virtual columns, the function must have either the delimiter that separates them in the packed column or
their lengths.
Unpack is complementary to the function Pack, but you can use it on any packed column that meets the
input requirements.
Unpack Syntax
Version 1.2
Arguments
Argument Category Description
InputColumn Required Specifies the name of the input column that contains the packed data.
OutputColumns Required Specifies the names to give to the output columns, in the order in
which the corresponding virtual columns appear in input_column.
OutputDataTypes Required Specifies the data types of the unpacked output columns.
If OutputDataTypes specifies only one value and OutputColumns
specifies multiple columns, then the specified value applies to every
output_column.
If OutputDataTypes specifies multiple values, then it must specify a
value for each output_column. The nth data type corresponds to the nth
output_column.
Delimiter Optional Specifies the delimiter (a string) that separates the virtual columns in
the packed data. If delimiter contains a character that is a symbol in a
regular expression—such as an asterisk (*) or pipe character (|)—
precede it with two escape characters. For example, if the delimiter is
the pipe character, specify '\\|'. The default delimiter is comma (,).
If the virtual columns are separated by a delimiter, then specify the
delimiter with this argument; otherwise, specify the ColumnLength
argument. Do not specify both this argument and the ColumnLength
argument.
ColumnLength Optional Specifies the lengths of the virtual columns; therefore, to use this
argument, you must know the length of each virtual column.
If ColumnLength specifies only one value and OutputColumns
specifies multiple columns, then the specified value applies to every
output_column.
Output
Table 1246: Unpack Output Table Schema
Examples
• Example 1: Delimiter Separates Virtual Columns
• Example 2: No Delimiter Separates Virtual Columns
Input
The input table is a collection of temperature readings for two cities, Nashville and Knoxville, in the state of
Tennessee. In the column of packed data, the delimiter comma (,) separates the virtual columns. The last
row contains invalid data.
Table 1247: Unpack Example 1 Input Table ville_tempdata
sn packed_temp_data
10 Nashville,Tennessee,35.1
11 Nashville,Tennessee,36.2
12 Nashville,Tennessee,34.5
13 Nashville,Tennessee,33.6
14 Nashville,Tennessee,33.1
15 Nashville,Tennessee,33.2
16 Nashville,Tennessee,32.8
17 Nashville,Tennessee,32.4
18 Nashville,Tennessee,32.2
19 Nashville,Tennessee,32.4
20 Thisisbaddata
SQL-MapReduce Call
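A sketch of the call; the Regex, RegexSet, and Exception argument names are assumptions (they do not appear in the truncated Arguments table above), and Exception ('true') matches the Output description below:

SELECT * FROM Unpack (
  ON ville_tempdata
  InputColumn ('packed_temp_data')
  OutputColumns ('city', 'state', 'temp_f')
  OutputDataTypes ('varchar', 'varchar', 'real')
  Delimiter (',')
  Regex ('(.*)')
  RegexSet (1)
  Exception ('true')
);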
Note:
Because comma is the default delimiter, the Delimiter argument in the preceding call is optional.
Output
Because of Exception ('true'), the function did not fail when it encountered the row with invalid data, but it
did not output that row.
Table 1248: Unpack Example 1 Output Table
Input
The input table for this example is like the input table for the previous example, except that no delimiter
separates the virtual columns in the packed data. To enable the function to determine the virtual columns,
the function call specifies the column lengths.
Table 1249: Unpack Example 2 Input Table ville_tempdata1
sn packed_temp_data
10 NashvilleTennessee35.1
11 NashvilleTennessee36.2
12 NashvilleTennessee34.5
13 NashvilleTennessee33.6
14 NashvilleTennessee33.1
15 NashvilleTennessee33.2
16 NashvilleTennessee32.8
17 NashvilleTennessee32.4
18 NashvilleTennessee32.2
19 NashvilleTennessee32.4
20 Thisisbaddata
SQL-MapReduce Call
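A sketch of the call; the column lengths (9, 9, 4) are inferred from the packed values (for example, Nashville and Tennessee each have nine characters), and the Regex, RegexSet, and Exception argument names are assumptions:

SELECT * FROM Unpack (
  ON ville_tempdata1
  InputColumn ('packed_temp_data')
  OutputColumns ('city', 'state', 'temp_f')
  OutputDataTypes ('varchar', 'varchar', 'real')
  ColumnLength ('9', '9', '4')
  Regex ('(.*)')
  RegexSet (1)
  Exception ('true')
);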
Output
Unpivot
Summary
The Unpivot function pivots data that is stored in columns into rows—the reverse of the function Pivot.
Usage
Unpivot Syntax
Version 1.2
Arguments
Argument Category Description
Unpivot Optional Specifies the names of the unpivot columns—the input columns to unpivot
(convert to rows).
Note:
If you do not specify this argument, you must specify the UnpivotRange
argument.
UnpivotRange Optional Specifies ranges of unpivot columns, where each range has the form
'[start_column:end_column]'. You must type the brackets; they are range
delimiters, not indicators that the syntax element is optional. The
start_column and end_column are nonnegative integers that represent the
positions of columns in the input table. The first column is in position 0.
No start_column can be greater than its corresponding end_column. The
range includes its endpoints.
Note:
If you do not specify this argument, you must specify the Unpivot
argument.
Accumulate Required Specifies the names of input columns—other than unpivot columns—to copy
to the output table. You must specify these columns in the same order that
they appear in the input table. No accumulate_column can be an unpivot
column.
InputTypes Optional Specifies whether the unpivoted value column, in the output table, has the
same data type as its corresponding unpivot column (if possible). The default
value is 'false'—for each unpivoted column, the function outputs the values in
a single VARCHAR column.
If you specify 'true', the function outputs each unpivoted value column in a
separate column. If the unpivot column has a real data type, the unpivoted
value column has the data type DOUBLE PRECISION; if the unpivot column
has an integer data type, the unpivoted value column has the data type
LONG; if the unpivot column has any other data type, the unpivoted value
column has the data type VARCHAR.
AttributeColumn Optional Specifies the name of the unpivoted attribute column in the output table.
The default value is 'attribute'.
ValueColumn Optional Specifies the name of the unpivoted value column in the output table. The
default value is 'value'.
Input
Table 1251: Unpivot Input Table Schema
Examples
• Input
• Example 1: Specified Unpivot Columns, Default Optional Values
• Example 2: Specified Unpivot Columns, Specified Optional Values
• Example 3: Specified Unpivot Range, Default Optional Values
Input
The input table contains temperature, pressure, and dewpoint data for three cities, in dense (pivoted)
format. The data types of the input table columns are:
• sn: INTEGER
• city: VARCHAR
• week: INTEGER
• temp: INTEGER
• pressure: DOUBLE PRECISION
• dewpoint: VARCHAR
Table 1253: Unpivot Examples Input Table unpivot_input
SQL-MapReduce Call
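A sketch of the call with default optional values (the unpivot and accumulate columns follow the input table description above):

SELECT * FROM Unpivot (
  ON unpivot_input
  Unpivot ('temp', 'pressure', 'dewpoint')
  Accumulate ('sn', 'city', 'week')
);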
Output
Because InputTypes has the value 'false', the value column has the data type VARCHAR.
Table 1254: Unpivot Example 1 Output Table
SQL-MapReduce Call
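A sketch of the call; the AttributeColumn and ValueColumn values are hypothetical names:

SELECT * FROM Unpivot (
  ON unpivot_input
  Unpivot ('temp', 'pressure', 'dewpoint')
  Accumulate ('sn', 'city', 'week')
  InputTypes ('true')
  AttributeColumn ('climate_attributes')
  ValueColumn ('climate_values')
);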
Output
Because InputTypes has the value 'true', the output table has a separate value column for each unpivot
column. The unpivot columns temp, pressure, and dewpoint have the data types INTEGER, DOUBLE
PRECISION, and VARCHAR (respectively); therefore, their corresponding unpivoted columns have the
data types LONG, DOUBLE PRECISION, and VARCHAR.
Table 1255: Unpivot Example 2 Output Table
SQL-MapReduce Call
This call is equivalent to the call in Example 1.
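A sketch of the call; the range [3:5] covers the columns temp, pressure, and dewpoint (positions 3 through 5, with sn in position 0):

SELECT * FROM Unpivot (
  ON unpivot_input
  UnpivotRange ('[3:5]')
  Accumulate ('sn', 'city', 'week')
);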
Output
The output is the same as in Example 1 (Output).
URIPack
Summary
The URIPack function reconstructs hierarchical URI strings that were unpacked by the function
URIUnpack.
Usage
URIPack Syntax
Version 1.1
Arguments
Argument Category Description
Queries Optional Specifies the names of the query parameters whose values are to be
included in the URIs.
Accumulate Optional Specifies names of the input table columns to copy to the output
table.
Scheme_Column Optional Specifies the name of the input table column that contains the URI
scheme.
Host_Column Optional Specifies the name of the input table column that contains the URI
host.
Path_Column Optional Specifies the name of the input table column that contains the URI
path.
Fragment_Column Optional Specifies the name of the input table column that contains the URI
fragment.
IgnoreValues Optional Specifies a list of (case-insensitive) strings for the function to treat as
null values. If you omit this argument, the function treats only the
string 'null' as a null value. If you specify this argument, you must
specify the string 'null' to have the function treat it as a null value.
Output
Table 1256: URIPack Output Table Schema
Example
Input
The input table is the output table from the URIUnpack example.
SQL-MapReduce Call
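A sketch of the call; the input table name and the component column names (scheme, host, path, fragment) are assumptions based on the URIUnpack example output:

SELECT * FROM URIPack (
  ON uris_unpacked
  Queries ('p1', 'p2', 'p3')
  Accumulate ('id')
  Scheme_Column ('scheme')
  Host_Column ('host')
  Path_Column ('path')
  Fragment_Column ('fragment')
) ORDER BY id;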
Output
Table 1257: URIPack Example Output Table
id URI
1 https://www.google.com/webhp?p1=chrome&p2=hello+world&p3=UTF-8#fragment1
2 http://www.ietf.org/rfc/rfc2396.txt
3 ldap://[2001:db8::7]/c=GB
4 telnet:///
5 http://www.bar.com/baz/foo?p1=netscape&p2=%7Bhello+world%7D&p3=UTF#This+is
+fragment+too
URIUnpack
Summary
The URIUnpack function unpacks hierarchical uniform resource identifiers (URIs); that is, it outputs their
constituent components and the values of specified query parameters.
To repack the unpacked URIs, input the URIUnpack output to the function URIPack.
Background
A URI is a structured sequence of characters that identifies a resource (such as a file) on the Internet. URI
lists generated by web server logs and hypertext transfer protocol (HTTP) form submissions are a common
input for text analysis functions.
URI syntax is defined by the Internet Engineering Task Force (IETF). The following table describes the key
components of a hierarchical URI. The examples in the table are from this URI: https://www.google.com/
webhp?p1=chrome&p2=hello%20world&p3=UTF-8#fragment1
Table 1258: Key Hierarchical URI Components
Component Example
scheme https
host www.google.com
path /webhp
query ?p1=chrome&p2=hello%20world&p3=UTF-8
A query starts with a question mark (?). An ampersand (&) precedes each query
parameter. Here, the query parameters are p1, p2, and p3. Their values are chrome,
hello%20world, and UTF-8, respectively. %20 represents a space character.
fragment #fragment1
A URI can contain the US-ASCII characters for the lowercase and uppercase letters of the English alphabet
and the Arabic numerals. Any character outside this character set is percent-encoded; that is, converted to a
sequence of the form %hh, where h is a hexadecimal digit. In a query, the space character is encoded as %20.
For example, "San José" is encoded as "San%20Jos%C3%A9". Outside a query, the space character is
encoded as the plus character (+). For example, "San José" is encoded as "San+Jos%C3%A9".
Usage
URIUnpack Syntax
Version 1.0
Arguments
Argument Category Description
URI_Column Required Specifies the name of the input table column that contains the URIs
to unpack. Malformed URIs are ignored.
Queries Optional Specifies the names of the query parameters whose values are to be
extracted from the URIs.
Output Optional Specifies the URI components (outside the query) to output. By
default, the function outputs all four components. If you specify
'path', the function outputs the URI path in normalized form (for
example, it reduces /./bar/baz to /bar/baz).
Accumulate Optional Specifies the names of the input table columns to copy to the output
table.
Print_Null_Queries Optional Specifies whether to output URIs that contain none of the
parameters specified by the Queries argument. The default value is
'true'.
Input
Table 1259: URIUnpack Input Table Schema
Output
Table 1260: URIUnpack Output Table Schema
Example
Input
The input table has five URIs, some of which include characters that are percent-encoded.
Table 1261: URIUnpack Input table uris_input
id uri_column
1 'https://www.google.com/webhp?p1=chrome&p2=hello%20world&p3=UTF-8#fragment1'
2 'http://www.ietf.org/rfc/rfc2396.txt'
3 'ldap://[2001:db8::7]/c=GB?objectClass?one'
4 'telnet://192.0.2.16:80/'
5 'http://www.bar.com/./baz/foo?p1=netscape&p2=%7bhello%20world%7d&p3=UTF#This
%2Bis%2Bfragment%2Btoo'
SQL-MapReduce Call
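A sketch of the call (the query parameter names p1, p2, and p3 appear in the input URIs):

SELECT * FROM URIUnpack (
  ON uris_input
  URI_Column ('uri_column')
  Queries ('p1', 'p2', 'p3')
  Output ('scheme', 'host', 'path', 'fragment')
  Accumulate ('id')
) ORDER BY id;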
XMLParser
Summary
The XMLParser function takes XML documents and outputs their element names, attribute values, and text
in a relational table, which you can search with SQL queries.
Background
XML data is semistructured and hierarchical, unlike the data in relational database tables. Therefore, you
cannot search XML data with SQL queries unless you first relationalize the XML data (that is, put it in a
relational database table).
Not all XML data can be relationalized; therefore, the XMLParser function constrains the relationships in
the extracted data to grandparent/parent/child, parent/child, ancestor, and sibling relationships. The
function lets you specify these constraints and the output table schema.
Usage
XMLParser Syntax
Version 1.7
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the XML
documents. The function skips malformed XML documents.
Nodes Required Specifies the node-pair strings from which the function extracts data.
The simplest syntax for node_pair_string is:
[grandparent/]parent/child[,...]
{grandparent|parent|child}[:attribute[,...]]
Note:
Node and attribute names are case-sensitive.
Sibling Optional Specifies the sibling node strings. The syntax of sibling_node_string is:
sibling_node_name[:attribute[,...]]
The function includes the values from the sibling nodes in every output
row and adds a column to the output table for every sibling node and
every specified attribute.
If no sibling_node_string contains a sibling node, the function outputs
NULL sibling node values. If the argument specifies no attributes, the
function outputs NULL attribute values.
Delimiter Optional Specifies the delimiter that separates multiple child node values in the
output. The default value is comma (,).
SiblingDelimiter Optional Specifies the delimiter that separates multiple sibling node values in the
output. The default value is comma (,).
MaxItemNum Optional Specifies the maximum number of sibling nodes with the same name to
be returned. This value must be a positive integer. The default value is
10.
Ancestor Optional Specifies the ancestor paths for all parent nodes specified in the Nodes
argument. The simplest syntax for nodes_path is:
node[/node]...
node[:attribute[,...]]
ErrorHandler Optional Specifies whether and how the function handles errors (refer to Example
4: Handle Errors). For example:
ErrorHandler ('true;error_info:col1,col2')
Input
Table 1263: XMLParser Input Table Schema
Output
In these cases, the function outputs nothing:
• No nodes_path specified by the Ancestor argument is an ancestor path.
• No node_pair_path specified by the Nodes argument contains a parent node.
• No node_pair_path specified by the Nodes argument contains a grandparent node.
Otherwise, the output table has a row for each node specified in the Nodes argument and for each
descendant of each ancestor path specified in the Ancestor argument.
Table 1264: XMLParser Output Table Schema
Examples
• Example 1: Specify Sibling and Sibling_Delimiter
• Example 2: Specify Ancestor
• Example 3: Use Regular Expressions in Nodes and Ancestor
• Example 4: Handle Errors
• Example 5: Show Grandparent, Parent, and Child Nodes
Input
xid xmldocument
1 <bookstore>
: <owner>Billy</owner>
: <book category="ASTRONOMY">
: <title lang="en">Cosmos</title>
: <author>Carl Sagan</author>
: <author>Ann Druyan</author>
: <year edition="1">1980</year>
: <year edition="2">1981</year>
: <price>
: <member>49.99</member>
: <public>60.00</public>
: </price>
: <reference>
: <title>Comet</title>
: </reference>
: <position value="1" locate="east"></position>
: </book>
: <book category="CHILDREN">
: <author>Judy Blume</author>
:
: <price>
: <member>99.99</member>
: <public>108.00</public>
: </price>
: </book>
: </bookstore>
2 <setTopRpt xsi:noNamespaceSchemaLocation="Set%20Top%2020Report%20.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchemainstance">
: <settopid type="string" length="5">ST789</settopid>
: <accountid type="string">8728</accountid>
: <zipcode type="string">94025</zipcode>
: <reportstamp type="dateTime">2009-10-03T12:52:06</reportstamp>
: <temperature>
: <read type="bigDecimal">46</read>
: </temperature>
: <storage>
: <used type="bigDecimal">98</used>
: <used type="bigDecimal">199</used>
: <used type="bigDecimal">247</used>
: <total type="bigDecimal">300</total>
: </storage>
: <feed>
: <feedstamp type="dateTime">2009-10-03T12:52:06</feedstamp>
: </feed>
: </setTopRpt>
SQL-MapReduce Call
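A sketch of the call; the input table name xml_inputs is an assumption, and the Nodes, Sibling, and SiblingDelimiter values follow the Output description below:

SELECT * FROM XMLParser (
  ON xml_inputs
  TextColumn ('xmldocument')
  Nodes ('price/member')
  Sibling ('title', 'author', 'year')
  SiblingDelimiter (';')
  Accumulate ('xid')
);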
Output
The parent node, price, has two child nodes, member and public. However, the Nodes argument specifies
only member; therefore, only its value is output. Title, author, and year are siblings of price. The first
document has multiple author and year siblings, so the values of those siblings are separated by the specified
delimiter, semicolon (;).
Table 1266: XMLParser Example 1 Output Table
Input
The input is the same as in Example 1 (Input).
SQL-MapReduce Call
Output
The output table contains the node and sibling values of the specified ancestor, setTopRpt.
Table 1267: XMLParser Example 2 Output Table (Columns 1-5)
Input
xid xmldocument
1 <bookstore>
: <owner>Billy</owner><items>
: <bookitem category="ASTRONOMY">
: <title lang="en">Cosmos</title>
: <author>Carl Sagan</author>
: <author>Ann Druyan</author>
: <year edition="1">1980</year>
: <price>
: <member>49.99</member>
: <public>60.00</public>
: </price>
: </bookitem>
: </items>
: </bookstore>
2 <cdstore>
: <owner> Amy </owner>
: <items>
: <cditem category="pop">
: <title lang="en">Breathe</title>
: <author>Yu Quan</author>
: <year>2003</year>
: <price>
: <member>29</member>
: <public>35</public>
: </price>
: <position value="1" locate="east"/>
: </cditem>
: </items>
: </cdstore>
The Ancestor argument specifies that any node whose value ends with 'store' is an ancestor. The Nodes
argument specifies that the function is to output the owner of each store and the title, author, and year of
each node that starts with a string of lowercase alphabetic characters and ends with 'item'.
Output
For bookstore and cdstore, the output table contains the value of owner; for bookitem and cditem, the
output table contains the values of title, author, and year. Multiple values are separated by the default
delimiter, comma (,).
Table 1270: XMLParser Example 3 Output Table
Input
The second XML document is missing the closing tag </bookstore>.
Table 1271: XMLParser Example 4 Input Table xml_inputs_error
xid xmldocument
1 <bookstore owner="Judy">
: <owner>Billy</owner><items>
: <bookitem category="ASTRONOMY">
: <title lang="en">Cosmos</title>
: <author>Carl Sagan</author>
: <author>Ann Druyan</author>
: <year edition="1">1980</year>
: <price>
: <member>49.99</member>
: <public>60.00</public>
: </price>
: </bookitem>
: </items>
</bookstore>
2 <bookstore>
SQL-MapReduce Call
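A sketch of the call; the ErrorHandler string follows the 'true;output_column:input_columns' pattern shown in the Arguments table and is an assumption, as are the Nodes values:

SELECT * FROM XMLParser (
  ON xml_inputs_error
  TextColumn ('xmldocument')
  Nodes ('bookitem/title')
  ErrorHandler ('true;ERROR_HANDLER:xmldocument')
  Accumulate ('xid')
);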
Output
The output table has the column ERROR_HANDLER, which contains the value of the input column
xmldocument followed by a semicolon.
Table 1272: XMLParser Example 4 Output Table
Input
xid xml
1 <School name="UCBerkeley">
: <Dept ID="CS" name="Computer Science">
: <Class A="sophomore" B="Senior">
: <Year>
: <Student>Harry</Student>
: <Grade>A+</Grade>
: </Year>
: </Class>
: </Dept>
: </School>
SQL-MapReduce Call
Output
XMLRelation
Summary
The XMLRelation function takes XML documents and outputs their element names, attribute values, text,
and structural information in a relational table, which you can search with SQL queries. The function
maintains multilevel paths from the input XML documents to the XML elements.
Usage
XMLRelation Syntax
Version 1.3
Note:
In the ExcludeElements argument, you must type the braces ({ and }). For example:
'root/book/{author,chapter}'
Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the XML
documents. The function skips malformed XML documents.
DocIDColumns Required Specifies the names of the input table columns that contain the
identifiers of the XML documents. No docid_column can have the
same name as an output table column. For output column names,
refer to Output.
MaxDepth Optional Specifies the maximum depth in the XML tree at which to process
XML documents. The MaxDepth and Output arguments determine
the schema of the output table, and the number of columns in the
output table must not exceed 1600. The default value is 5.
ExcludeElements Optional Specifies the paths to the nodes to exclude from processing. The
function excludes each specified node and its child nodes. Examples
of paths to nodes are:
'chapter'
'root/book'
'root/book/{author,chapter}'
AttributeAsNode Optional Specifies whether to treat the attributes of a node as its child nodes.
The default value is 'false' (attributes of a node are stored in one
element of the output tuple).
AttributeDelimiter Optional Specifies the delimiter used to separate multiple attributes of one
node in XML documents. The default value is a comma ','.
Output Optional Specifies the output table schema (refer to Example 1: Output Three
Different Output Table Schemas). The MaxDepth and Output
arguments determine the schema of the output table, and the
number of columns in the output table must not exceed 1600. The
default value is 'fullpath'.
ErrorHandler Optional Specifies whether and how the function handles errors (refer to
Example 3: Enable Error Handling). For example:
ErrorHandler ('true;error_info:col1,col2')
Accumulate Optional Specifies the names of input column names to copy to the output
table. No accumulate_column can have the same name as an output
table column. For output column names, refer to Output.
Input
Table 1275: XMLRelation Input Table Schema
Output
The output table schema depends on the Output argument.
Output ('fulldata')
The columns DnElement, DnAttributes, DnValue, and DnID contain information about the node at depth
n, and the columns DiElement, DiAttributes, DiValue, and DiID, where i is in the range [0, n), contain
information about its ancestors.
Output ('parentchild')
Output ('fullpath')
Examples
• Example 1: Output Three Different Output Table Schemas
• Example 2: Output Attributes as Nodes
• Example 3: Enable Error Handling
Input
The input table contains an xml document that has these hierarchical nodes: School at level 1, Dept at level
2, Class at level 3, and Student and Grade at level 4.
Table 1279: XMLRelation Examples 1 & 2 Input Table xmlrelation_input
xid xmldocument
1 <School name="UCLA">
: <Dept name="EE">
: <Class A="grad" B="undergrad">
: <Student>Harry</Student>
: <Grade>A+</Grade>
: </Class>
: </Dept>
: </School>
SQL-MapReduce Call 1
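A sketch of the call (Calls 2 and 3 differ only in the Output argument value, 'parentchild' and 'fullpath', respectively):

SELECT * FROM XMLRelation (
  ON xmlrelation_input
  TextColumn ('xmldocument')
  DocIDColumns ('xid')
  Output ('fulldata')
);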
Output
The output table shows the elements, attributes, values, and ids for each node.
Table 1280: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 1-5
Table 1281: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 6-10
Table 1282: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 11-14
Class A=grad,B=undergrad 3
Class A=grad,B=undergrad 3
Table 1283: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 15-18
Student Harry 4
Grade A+ 5
SQL-MapReduce Call 2
Output
The output table shows the attribute, value, and parent node for each node.
Table 1284: XMLRelation Example 1 Output Table for Output ('parentchild')
SQL-MapReduce Call 3
This call specifies the default Output argument value.
Output
The output table shows the ID for each node in a separate column.
Table 1285: XMLRelation Example 1 Output Table for Output ('fullpath')
Input
The input is the same as in Example 1 (Input).
SQL-MapReduce Call
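A sketch of the call; AttributeAsNode ('true') matches the example title, and the Output value is an assumption:

SELECT * FROM XMLRelation (
  ON xmlrelation_input
  TextColumn ('xmldocument')
  DocIDColumns ('xid')
  AttributeAsNode ('true')
  Output ('fullpath')
);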
Output
The Elements column contains both actual nodes and attributes that are output as nodes. For the latter, the
Attributes column contains a tilde (~).
Table 1286: XMLRelation Example 2 Output Table
Input
The second XML document is malformed.
Table 1287: XMLRelation Example 3 Input Table xmlrelation_error
xid xmldocument
1 <School name="UCLA">
: <Dept name="EE">
: </Dept>
: </School>
2 <School /School> name="UTA">
SQL-MapReduce Call
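A sketch of the call; the ErrorHandler string is an assumption following the pattern shown in the Arguments table:

SELECT * FROM XMLRelation (
  ON xmlrelation_error
  TextColumn ('xmldocument')
  DocIDColumns ('xid')
  ErrorHandler ('true;error_info:xmldocument')
);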
Output
AMLGenerator
Summary
The AMLGenerator function translates an Aster model into an XML-based Aster Model Language (AML)
format, which is accepted by the Aster Scoring SDK functionality.
Usage
AMLGenerator Syntax
Version 1.0
Arguments
Argument Category Description
ModelType Required Specifies the name of the function for which this .aml file is to be
used. For predictors, this represents the type of the trained model.
The model types of the functions are listed in Aster Scoring SDK
Functions.
ModelTable Optional Specifies the input model tables from which the .aml file is to be
generated. The argument clause accepts multiple input model
tables. These tables can be either database tables or files installed
on the database.
ModelTag Optional Specifies a tag for each input model table specified in the
ModelTable clause. Tags are supported for functions that accept
multiple model tables and are used to distinguish the role of each
model table in the context of the Aster Scoring SDK function.
The supported tags are listed in Aster Scoring SDK Functions.
The number of entries must match the number of entries in the
ModelTable clause.
InstalledFile Optional Specifies whether the corresponding value in ModelTable is a file
installed on the database or a database table. If ModelTable is a
database table, set this to 'false' (the default); if it is a file, set it to
'true'. The number of entries must match the number of entries
in the ModelTable clause.
RequestColNames Required Specifies the column names of the request to be scored. For a
predictor, these column names typically match the column
names of the training data used to train the model.
RequestColTypes Required Specifies the column types of the request to be scored. For a
predictor, these column types typically match the column types
of the training data used to train the model.
AMLPrefix Optional Specifies the name of the generated AML model file. Default
value is “model”. The output file is stored with suffix .aml.
OverwriteOutput Optional Specifies whether the output AML model file is to be overwritten
if it already exists.
Domain Optional Specifies the IP address of the queen node. The default is the
queen of the current cluster.
UserId Optional Specifies the Aster Database user name of the user. The default is
beehive.
Password Optional Specifies the Aster Database password of the user.
Input
This is a driver function, and the ON clause does not operate on any table. The ON clause must be
(SELECT 1) PARTITION BY 1 for the function to execute.
Output
The function generates an AML file, installed on Aster Database, that conforms to a specific XSD (XML
schema definition) format. Statistics for the generated AML file are printed to the console. You can
download the AML file from ACT with the command “\download [AMLFILE]”, where AMLFILE is the
name of the AML file specified in the AMLPrefix clause.
The following are some details about the generated AML file:
Header
The header consists of ModelType, AMLGenerator build version, and Teradata Copyright information.
Request Parameters
The parameters specified by the user in the clauses RequestArgName and RequestArgVal are appended in
the block “params”. These parameters are not validated by AMLGenerator (parameter parsing takes place
inside Scorer) and are appended as specified by the user.
Model Columns
Model columns are appended in the block “columns” with entity = “model”. They specify the names and
data types of columns in Model data. Each block contains ModelTag information, if provided. Each block
also contains InstalledFile information, if provided.
The behavior of Model columns varies based on whether the model is database table or installed file.
• Database Table:
∘ Model column names and types are retrieved from Aster database and converted to Scoring data
types.
∘ Database table name is appended in the AML file model column header.
• Installed File:
∘ Model columns are empty.
∘ Installed file name is appended in the AML file model column header.
Model Data
Model data contains the actual data for the model provided in ModelTable clause. The data is stored
differently depending upon whether it is a database table or installed file. This block also contains checksum
of the data stored for data integrity.
• Database Table:
∘ For time and timestamp sql types, data is converted to acceptable scoring format.
∘ Special characters are transformed as follows:
Backslash (\) is preceded by another backslash (\).
Comma (,) is preceded by backslash (\).
Ampersand (&) is replaced with “&amp;”.
Less than (<) is replaced with “&lt;”.
Greater than (>) is replaced with “&gt;”.
∘ For binary data, data is stored using Base64 encoding.
• Installed File:
∘ Data is captured in binary format (bytes) and stored using Base64 encoding in the AML file.
∘ Because of binary encoded format, special character handling is not needed.
Example
Input
The input table to the AMLGenerator, glass_attribute_table_output, is an output of the single decision tree
function (single_tree_drive) in this example.
Table 1289: AMLGenerator Example Input Table glass_attribute_table_output
SQL-MapReduce Call
The RequestColNames clause includes all the attributes on which the single decision tree function was
trained.
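A sketch of the call; the request column names and types are assumptions based on the request schema in Table 1293, and the AMLPrefix value is illustrative:

SELECT * FROM AMLGenerator (
  ON (SELECT 1) PARTITION BY 1
  ModelType ('sdt')
  ModelTable ('glass_attribute_table_output')
  RequestColNames ('pid', 'ri', 'na', 'mg', 'al', 'si', 'k', 'ca', 'ba', 'fe')
  RequestColTypes ('int', 'double', 'double', 'double', 'double',
    'double', 'double', 'double', 'double', 'double')
  AMLPrefix ('glass_sdt')
  OverwriteOutput ('true')
);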
Scorer
Summary
The Scorer function provides a software framework to score input queries based on a given model and
predictor. Scorer is a set of Java classes packed into a .jar file that resides in the user's framework (a
real-time Java virtual machine environment). This figure shows how Scorer interacts with the rest of the
system:
Cloud Integration
Scorer can be configured for use in any framework, including web servers, cloud platforms, and distributed
computing environments. The workflow in such environments is shown below:
Package
The Scorer package contains these files:
Installation
The only library that Aster Scoring SDK needs is scoring.jar. You must load this jar file into the user
environment; then you can invoke Scorer like any jar library.
Some ways to load the library are:
• Use classpath
While compiling a package that contains Scoring classes, add scoring.jar to the classpath.
For example, assume that the application MyApp.java uses Scoring classes. The commands to compile and
run it from the command line are shown in the sketch after this list. (You can also add scoring.jar to
the classpath when you compile such a package from a builder such as ant or maven.)
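For example, a sketch of the compile and run commands, assuming scoring.jar is in the current directory (the class name MyApp is illustrative):

javac -classpath scoring.jar MyApp.java
java -classpath scoring.jar:. MyApp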
The procedure to install and run in a cloud environment is similar. Follow the instructions in the
documentation for your cloud environment about running a third-party jar library.
Functional Support
While most of the usage details for these functions are the same for Scorer as they are for the SQL-
MapReduce functions, the formats for scoring data types, argument clauses, and output generation are
slightly modified to isolate Scorer from the Aster framework. For these differences, refer to the description of
the Aster Scoring SDK version of the function in Aster Scoring SDK Functions. For information about the
function itself, refer to the description of its counterpart SQL-MapReduce function.
Input Formats
Scorer uses two input formats, CSVInputFormat and AMLInputFormat.
CSVInputFormat
CSVInputFormat supports comma separated values (CSV) file format to read input data (requests) and
request parameters from file system (input stream). The request can also be populated using the API, as
mentioned in Scoring API.
AMLInputFormat
AMLInputFormat supports AML (Aster Model Language) file format. AML is an XML-based file with a
predefined XSD schema, and is the data exchange format between Aster Database and Scorer, as mentioned
in AMLGenerator.
Data Types
The following data types are supported by the Scorer. The corresponding data types on Aster Database are
also listed.
Table 1290: Scoring Data Types
For date, time, and timestamp data types, format conversions are required in some cases when transforming
data from Aster database to Aster Scoring SDK for compatibility of SQL data types with the Java library
(java.sql package).
Output Formats
The output can be configured in the following ways:
API
The API for output format is described in Scoring API.
Logging
The output can be redirected to the console using Java's library methods such as System.out and System.err,
or piped to some file. The output can also be configured to go to an event log, the details of which are
discussed in Logging Support.
Scoring API
The scoring APIs are documented in the javadoc. After installation, you can invoke Scorer in the following
ways (the code blocks show high-level method calls to invoke scorer).
// initialize
Scorer scorer = new Scorer ();
// configure
// make sure that AML file modelFile is available on file system
scorer.configure (modelFile);
// run scorer (multiple calls)
// make sure that CSV file requestFile is available on file system
scorer.score (requestFile);
// initialize
Scorer scorer = new Scorer ();
// configure
// make sure that AML file modelFile is available on file system
Request request = scorer.configure (modelFile);
// populate data structure request
// run scorer (multiple calls)
scorer.score (request);
Javadoc
The Javadoc is in the scoring package in the file scoring-doc.zip. The following figure shows a snapshot
of Javadoc for the Scorer class.
Examples
The Scoring examples and their code are in the Scoring package in the file scoring-examples.zip.
Each example is in its own directory. To test the examples, run the script run_examples.sh.
The following figure shows an example scoring application with the GLM real-time predictor.
Logging Support
Scorer logs information that is helpful for monitoring progress and for debugging. Optionally, Scorer also
logs events in a separate event log file. Every day, a log file is stored with the date appended to the log file
name. The log files are configured to log to a file on the local file system where Scorer runs.
The following logging variables are supported as System Properties. An example that sets the system
properties is in the run_examples.sh script.
Compatibility
Scorer can run on any platform, and is compatible with Java Development Kit (JDK) 1.6 and later.
Note:
Scorer expects to use the AML file generated by the AMLGenerator function. Using an AML file
generated or manipulated by a third-party might cause Scorer failure or incorrect scoring results.
Tips
Use only AML files generated by the AMLGenerator function. Manually generating or manipulating an
AML file, or using a third-party tool to do so, can cause Scorer failure or incorrect scoring results, so this
practice is discouraged.
Model Format
Table 1291: Aster Scoring SDK Single Decision Tree Model Format
Argument Description
ModelType sdt, single decision tree
ModelTable Database table
ModelTag No tags supported
Request Definition
The Aster Scoring SDK function uses a different request schema from the SQL-MapReduce function
Single_Tree_Predict. As an example, the request table for the SQL-MapReduce function looks like this:
Table 1292: Single_Tree_Predict Request Schema
For the Aster Scoring SDK function, the request must be a flat data structure, as shown below for the same
example:
Table 1293: Aster Scoring SDK Single Decision Tree Request Schema
PID RI Na Mg ...
9 1.51545 14.14 0.00 ...
Parameters
Table 1294: Aster Scoring SDK Single Decision Tree Parameters
Model Format
Table 1295: Aster Scoring SDK Generalized Linear Model - Model Format
Argument Description
ModelType glm, generalized linear model
ModelTable Database table
ModelTag No tags supported
Request Definition
Same as SQL-MapReduce function GLMPredict.
Parameters
Table 1296: Aster Scoring SDK Generalized Linear Model Parameters
Model Format
Table 1297: Aster Scoring SDK Random Forest Model Format
Argument Description
ModelType rf, random forest
ModelTable Database table
ModelTag No tags supported
Request Definition
Same as SQL-MapReduce function Forest_Predict.
Parameters
Table 1298: Aster Scoring SDK Random Forest Parameters
Model Format
Table 1299: Aster Scoring SDK Naïve Bayes Model Format
Argument Description
ModelType nb, naïve bayes
ModelTable Database table
ModelTag No tags supported
Request Definition
Same as SQL-MapReduce function NaiveBayesPredict.
Parameters
Table 1300: Aster Scoring SDK Naïve Bayes Parameters
Model Format
Table 1301: Aster Scoring SDK Naïve Bayes Text Classifier Model Format
Argument Description
ModelType nbtc, naïve bayes text classifier
ModelTable Database table
ModelTag No tags supported
Parameters
Table 1302: Aster Scoring SDK Naïve Bayes Text Classifier Parameters
Model Format
Table 1303: Aster Scoring SDK Text Tagging Model Format
Argument Description
ModelType ttag, text tagging
ModelTable Database table for rules, installed file for dictionary.
ModelTag PREDICT (for rules), DICT (for dictionary).
Request Definition
Same as SQL-MapReduce function TextTagging.
Additional Notes
Rules can be provided using either Request parameters or a database table in the ModelTable argument.
A dictionary can be provided using the ModelTable, ModelTag, and InstalledFile arguments.
When a rule uses a dictionary file 'file1', you must install the file on the database first and then specify it in
the ModelTable clause when invoking AMLGenerator with ModelTag ('DICT').
The following examples highlight different variations in usage of rules and the dictionary:
Example 1
No model table is specified; rules are provided using Request parameters:
RequestArgName1 ('Rules')
RequestArgVal1 ('contain(content, "floods", 1, ) OR
contain(content, "tsunamis", 1,) AS Natural-Disaster')
Example 2
Rules table is a database table. No dictionary file is specified.
ModelTable ('rules')
or
ModelTable ('rules')
ModelTag ('PREDICT')
InstalledFile ('false')
Example 4
The rules table and the dictionary are both provided using the ModelTable argument. The rules in the rules
table ‘rules’ may contain references to the dictionary files ‘file1’ and ‘file2’.
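A sketch of the corresponding clauses (the tag and flag values follow the argument descriptions above):
ModelTable ('rules', 'file1', 'file2')
ModelTag ('PREDICT', 'DICT', 'DICT')
InstalledFile ('false', 'true', 'true')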
Model Format
Table 1305: Aster Scoring SDK Extract Sentiment Model Format
Argument Description
ModelType sent, extract sentiment
ModelTable Database table for dictionary, installed file for dictionary file or classification file.
ModelTag DICT (for dictionary database table and installed dictionary file), CLASS (for
installed classification file).
Request Definition
Same as SQL-MapReduce function ExtractSentiment.
Parameters
Table 1306: Aster Scoring SDK Extract Sentiment Parameters
Additional Notes
The supporting files (dictionary model or file, classification model) are input using the ModelTable
argument in AMLGenerator.
The following examples highlight different variations in usage of the dictionary and classification model:
Example 1
Use the default model (determined by Language argument).
ModelTable ('default_sentiment_lexicon.txt')
ModelTag ('DICT')
InstalledFile ('true')
For Chinese language text, the default models are default_sentiment_lexicon_zh_cn.txt (for Simplified
Chinese) and default_sentiment_lexicon_zh_tw.txt (for Traditional Chinese).
Example 2
Model is a dictionary table.
ModelTable ('dict_table')
ModelTag ('DICT')
InstalledFile ('false')
Example 3
A dictionary table and an installed dictionary file (other than the default) are used. In this case, the
sentiment words from the dictionary table have a higher priority than those in the dictionary file.
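A plausible set of clauses, assuming comma-separated multi-model input; 'dict_table' and 'dict_file' are illustrative names:
ModelTable ('dict_table', 'dict_file')
ModelTag ('DICT', 'DICT')
InstalledFile ('false', 'true')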
Example 4
In this example, a classification file is used.
ModelTable ('sentiment_classification_model.bin')
InstalledFile ('true')
ModelTag ('CLASS')
RequestArgName1 ('Model')
RequestArgVal1 ('classification:sentiment_classification_model.bin')
Model Format
Table 1307: Aster Scoring SDK Text Parser Model Format
Argument Description
ModelType tparser, text parser
ModelTable Installed file
ModelTag STOPWORDS, STEMEXCEPTIONWORDS
Request Definition
Same as SQL-MapReduce function Text_Parser.
Parameters
Table 1308: Aster Scoring SDK Text Parser Parameters
Additional Notes
Model files for stop words and stemming exception words are input using the ModelTable, ModelTag, and
InstalledFile arguments of AMLGenerator. If neither file is provided, no stop words or stemming
exceptions are used. Either file (or both files) can be provided, as shown below:
Example 1
ModelTable ('stop_words_file')
ModelTag ('STOPWORDS')
InstalledFile ('true')
Example 2
ModelTable ('stem_exception_words_file')
ModelTag ('STEMEXCEPTIONWORDS')
InstalledFile ('true')
Example 3
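A plausible form supplying both files together, assuming comma-separated multi-model input (the comma-separated form is an assumption):
ModelTable ('stop_words_file', 'stem_exception_words_file')
ModelTag ('STOPWORDS', 'STEMEXCEPTIONWORDS')
InstalledFile ('true', 'true')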
Model Format
Table 1309: Aster Scoring SDK Text Tokenizer Model Format
Argument Description
ModelType ttoken, text tokenizer
ModelTable Database table, installed file
ModelTag DICT, CRF
Request Definition
Same as SQL-MapReduce function TextTokenizer.
Parameters
Table 1310: Aster Scoring SDK Text Tokenizer Parameters
Additional Notes
If the function uses a dictionary table, a dictionary model, and/or a CRF model file, provide them
using the ModelTable, ModelTag, and InstalledFile arguments of AMLGenerator. If none is provided, the
function uses the default embedded dictionaries for English or Chinese text.
Example 1
Chinese language:
ModelTable ('crf_model_file')
ModelTag ('CRF')
InstalledFile ('true')
Example 3
English or Japanese (CRF file is not supported):
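A plausible form, assuming a dictionary table is used ('dict_table' is an illustrative name):
ModelTable ('dict_table')
ModelTag ('DICT')
InstalledFile ('false')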
Model Format
Table 1311: Aster Scoring SDK SparseSVM Model Format
Argument Description
ModelType svm, sparse svm
ModelTable Database table
ModelTag No tags supported
Request Definition
Same as SQL-MapReduce function SparseSVMPredictor.
Parameters
Table 1312: Aster Scoring SDK SparseSVM Parameters
Model Format
Table 1313: Aster Scoring SDK CoxPH Model Format
Argument Description
ModelType cox, coxph
ModelTable Database table
ModelTag No tags supported
Request Definition
Same as SQL-MapReduce function CoxPredict.
Parameters
Table 1314: Aster Scoring SDK CoxPH Parameters
Model Format
Table 1315: Aster Scoring SDK LDAInference Model Format
Argument Description
ModelType lda, lda inference
ModelTable Database table
ModelTag No tags supported
Request Definition
Same as SQL-MapReduce function LDAInference.
Parameters
Table 1316: Aster Scoring SDK LDAInference Parameters
FAQ
Visualization Functions
Visualization functions are used with the AppCenter product. For information about these functions, refer
to the AppCenter User Guide.
SAX2
Multiple-Input Version
Version 1.0
Single-Input Version
Version 1.0
Statistical Analysis
Continuous Distributions
Version 1.0
• Option 1: For Multiple-Node Data Sets
• Option 2: For Single-Node Data Sets
Discrete Distributions
Version 1.0
• Option 1: For Multiple-Node Data Sets
• Option 2: For Single-Node Data Sets and Any CvM Test
Integer Input
Version 1.0
Text Analysis
Note:
If the input is a query, you must map it to an alias.
Note:
In the FeatureSelection argument, you must type the brackets. They do not indicate that their contents
are optional.
Cluster Analysis
The Modularity function, which discovers clusters in input graphs, is in the Graph Analysis chapter.
KMeansPlot Syntax
Version 1.1
Note:
When calling KMeansPlot on a view, you must provide aliases (a requirement of multi-input SQL-
MapReduce). For example:
SELECT *
FROM KMeansPlot (
  ON pa_prdwk.seg_data_v AS input_data PARTITION BY ANY
  ON pa_prdwk.seg_data_output AS segmentation_data_output DIMENSION
  CentroidsTable ('segmentation_data_output')
);
Naive Bayes
Ensemble Methods
Association Analysis
Graph Analysis
Note:
In the DegreeRange argument, you must type the brackets. They do not indicate that their contents are
optional.
Neural Networks
Data Transformation
CONVEXHULL
BUFFER
PointInPolygon
Scorer