What would you do if you knew?

Teradata Aster Analytics Foundation


User Guide
Release 6.21
B700-1021-621K
November 2016
The product or products described in this book are licensed products of Teradata Corporation or its affiliates.

Teradata, Applications-Within, Aster, BYNET, Claraview, DecisionCast, Gridscale, MyCommerce, QueryGrid, SQL-MapReduce, Teradata
Decision Experts, "Teradata Labs" logo, Teradata ServiceConnect, Teradata Source Experts, WebAnalyst, and Xkoto are trademarks or registered
trademarks of Teradata Corporation or its affiliates in the United States and other countries.
Adaptec and SCSISelect are trademarks or registered trademarks of Adaptec, Inc.
Amazon Web Services, AWS, [any other AWS Marks used in such materials] are trademarks of Amazon.com, Inc. or its affiliates in the United
States and/or other countries.
AMD Opteron and Opteron are trademarks of Advanced Micro Devices, Inc.
Apache, Apache Avro, Apache Hadoop, Apache Hive, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the
Apache Software Foundation in the United States and/or other countries.
Apple, Mac, and OS X are all registered trademarks of Apple Inc.
Axeda is a registered trademark of Axeda Corporation. Axeda Agents, Axeda Applications, Axeda Policy Manager, Axeda Enterprise, Axeda Access,
Axeda Software Management, Axeda Service, Axeda ServiceLink, and Firewall-Friendly are trademarks and Maximum Results and Maximum
Support are servicemarks of Axeda Corporation.
CENTOS is a trademark of Red Hat, Inc., registered in the U.S. and other countries.
Cloudera, CDH, [any other Cloudera Marks used in such materials] are trademarks or registered trademarks of Cloudera Inc. in the United States,
and in jurisdictions throughout the world.
Data Domain, EMC, PowerPath, SRDF, and Symmetrix are registered trademarks of EMC Corporation.
GoldenGate is a trademark of Oracle.
Hewlett-Packard and HP are registered trademarks of Hewlett-Packard Company.
Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other
countries.
Intel, Pentium, and XEON are registered trademarks of Intel Corporation.
IBM, CICS, RACF, Tivoli, and z/OS are registered trademarks of International Business Machines Corporation.
Linux is a registered trademark of Linus Torvalds.
LSI is a registered trademark of LSI Corporation.
Microsoft, Active Directory, Windows, Windows NT, and Windows Server are registered trademarks of Microsoft Corporation in the United States
and other countries.
NetVault is a trademark or registered trademark of Dell Inc. in the United States and/or other countries.
Novell and SUSE are registered trademarks of Novell, Inc., in the United States and other countries.
Oracle, Java, and Solaris are registered trademarks of Oracle and/or its affiliates.
QLogic and SANbox are trademarks or registered trademarks of QLogic Corporation.
Quantum and the Quantum logo are trademarks of Quantum Corporation, registered in the U.S.A. and other countries.
Red Hat is a trademark of Red Hat, Inc., registered in the U.S. and other countries. Used under license.
SAP is the trademark or registered trademark of SAP AG in Germany and in several other countries.
SAS and SAS/C are trademarks or registered trademarks of SAS Institute Inc.
Simba, the Simba logo, SimbaEngine, SimbaEngine C/S, SimbaExpress and SimbaLib are registered trademarks of Simba Technologies Inc.
SPARC is a registered trademark of SPARC International, Inc.
Symantec, NetBackup, and VERITAS are trademarks or registered trademarks of Symantec Corporation or its affiliates in the United States and
other countries.
Unicode is a registered trademark of Unicode, Inc. in the United States and other countries.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Other product and company names mentioned herein may be the trademarks of their respective owners.
The information contained in this document is provided on an "as-is" basis, without warranty of any kind, either express
or implied, including the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
Some jurisdictions do not allow the exclusion of implied warranties, so the above exclusion may not apply to you. In no
event will Teradata Corporation be liable for any indirect, direct, special, incidental, or consequential damages,
including lost profits or lost savings, even if expressly advised of the possibility of such damages.
The information contained in this document may contain references or cross-references to features, functions, products, or services that are not
announced or available in your country. Such references do not imply that Teradata Corporation intends to announce such features, functions,
products, or services in your country. Please consult your local Teradata Corporation representative for those features, functions, products, or
services available in your country.
Information contained in this document may contain technical inaccuracies or typographical errors. Information may be changed or updated
without notice. Teradata Corporation may also make improvements or changes in the products or services described in this information at any time
without notice.
To maintain the quality of our products and services, we would like your comments on the accuracy, clarity, organization, and value of this
document. Please e-mail: [email protected]
Any comments or materials (collectively referred to as "Feedback") sent to Teradata Corporation will be deemed non-confidential. Teradata
Corporation will have no obligation of any kind with respect to Feedback and will be free to use, reproduce, disclose, exhibit, display, transform,
create derivative works of, and distribute the Feedback and derivative works thereof without limitation on a royalty-free basis. Further, Teradata
Corporation will be free to use any ideas, concepts, know-how, or techniques contained in such Feedback for any purpose whatsoever, including
developing, manufacturing, or marketing products or services incorporating Feedback.
Copyright © 2015 - 2016 by Teradata. All Rights Reserved.
Table of Contents

Preface.................................................................................................................................................................59
Overview........................................................................................................................................................................59
Conventions Used in This Guide...............................................................................................................................59
Typefaces........................................................................................................................................................... 59
Notation Conventions.....................................................................................................................................59
Command Shell Text Conventions............................................................................................................... 60
Contact Teradata Global Technical Support (GTS)................................................................................................60
About Teradata Aster.................................................................................................................................................. 60
About This Document.................................................................................................................................................60
Version History................................................................................................................................................ 60

Chapter 1:
Introduction.................................................................................................................................................61
Introduction..................................................................................................................................................................61
Analytics at Scale: Full Dataset Analysis...................................................................................................................61
Introduction to Teradata Aster SQL-MapReduce®..................................................................................................62
What is MapReduce?....................................................................................................................................... 62
Aster Database SQL-MapReduce...................................................................................................................63
SQL-MapReduce Query Syntax................................................................................................................................. 64
SQL-MapReduce with Multiple Inputs.....................................................................................................................66
Benefits of Multiple Inputs............................................................................................................................. 66
How Multiple Inputs Are Processed.............................................................................. 66
Types of SQL-MapReduce Inputs.................................................................67
Semantic Requirements for SQL-MapReduce Functions.......................................................................... 67
Use Cases and Examples for Multiple Inputs.............................................................................................. 69
SQL-MapReduce Multiple Input FAQ........................................................................ 75
Aster Analytics Function Product Bundles.............................................................................................................. 76
Aster Analytics Functions by Product Bundle.........................................................................................................77
Premium Path...................................................................................................................................................77
Premium Relationship.....................................................................................................................................77
Analytics Foundation...................................................................................................................................... 78
Premium Graph............................................................................................................................................... 82
Aster Analytics..................................................................................................................................................83
Aster Scoring SDK........................................................................................................................................... 83
Aster Analytics Functions by Category.....................................................................................................................83
Time Series, Path, and Attribution Analysis................................................................................................ 83
Pattern Matching with Teradata Aster nPath.............................................................................................. 84
Statistical Analysis............................................................................................................................................85
Text Analysis.....................................................................................................................................................86
Cluster Analysis................................................................................................................................................88
Naive Bayes....................................................................................................... 88
Ensemble Methods...........................................................................................................................................88
Association Analysis........................................................................................................................................ 89
Graph Analysis................................................................................................................................................. 89
Aster Scoring SDK........................................................................................................................................... 90
NeuralNet..........................................................................................................................................................90
Data Transformation.......................................................................................................................................91
Aster Database Utilities...................................................................................................................................92

Chapter 2:
Installing Aster Analytics Functions.........................................................................93
Installing Aster Analytics Functions......................................................................................................................... 93
Aster Analytics Function Version Numbers............................................................................................................93
Finding Function Version Numbers......................................................................................................................... 93
Aster Analytics Compatibility Matrix.......................................................................................................................94
Aster Analytics Function Packages............................................................................................................................95
Downloading an Aster Analytics Function Package...............................................................................................96
Getting Install and Uninstall Scripts......................................................................................................................... 96
Scripts for the Schema PUBLIC..................................................................................................................... 96
Scripts for a Specified Schema........................................................................................................................97
Installing an Aster Analytics Function Package...................................................................................................... 98
Set Default Schema for Function Users........................................................................................................ 99
Set Permissions to Allow Users to Run Functions...................................................................................... 99
Testing the Functions...................................................................................................................................... 99
Updating an Aster Analytics Function Package...................................................................................................... 99
Installing a Function in a Specific Schema............................................................................................................. 101
Managing Files with ACT Commands................................................................................................................... 101
Usage Notes................................................................................................................................................................ 102
Enclosing Database Object Names in Double Quotation Marks............................................................ 102
Boolean Argument Values............................................................................................................................103
Column Specification Arguments............................................................................................................... 103
DATE Columns..............................................................................................................................................103
BC/BCE Timestamps.....................................................................................................................................103
Creating a Timestamp Column................................................................................................................... 104
Granting CREATE Privileges....................................................................................................................... 104
Adding Model File Locations to the Default Search Path........................................................................ 104
Connecting to Aster Database Using Authentication Cascading........................................................... 104
Connecting to Aster Database Using SSL JDBC Connections................................................................ 107
Error Message Delays.................................................................................................................................... 108
Sparse Tables and Dense Tables.................................................................................................................. 108
Permanent Tables As Output of Driver-Based Functions....................................................................... 108
Input Table Aliases........................................................................................................................................ 108

Chapter 3:
Time Series, Path, and Attribution Analysis................................................109
Time Series, Path, and Attribution Analysis.......................................................................................................... 109
Arima........................................................................................................................................................................... 109
Summary......................................................................................................................................................... 109
Background.....................................................................................................109
Usage................................................................................................................................................................110
Example........................................................................................................................................................... 114
ArimaPredictor.......................................................................................................................................................... 116
Summary......................................................................................................................................................... 116
Usage................................................................................................................................................................117
Example........................................................................................................................................................... 118
Attribution.................................................................................................................................................................. 119
Summary......................................................................................................................................................... 119
Background.....................................................................................................................................................119
Attribution (Multiple-Input Version).....................................................................................................................120
Summary......................................................................................................................................................... 120
Usage................................................................................................................................................................121
Example........................................................................................................................................................... 126
Attribution (Single-Input Version)......................................................................................................................... 129
Summary......................................................................................................................................................... 129
Usage................................................................................................................................................................129
Examples......................................................................................................................................................... 131
Burst.............................................................................................................................................................................140
Summary......................................................................................................................................................... 140
Usage................................................................................................................................................................141
Examples......................................................................................................................................................... 146
Change-Point Detection Functions.........................................................................................................................151
Summary......................................................................................................................................................... 151
Background.....................................................................................................................................................151
ChangePointDetection..............................................................................................................................................154
Summary......................................................................................................................................................... 154
Usage................................................................................................................................................................154
Examples......................................................................................................................................................... 157
RtChangePointDetection..........................................................................................................................................164
Summary......................................................................................................................................................... 164
Usage................................................................................................................................................................164
Examples......................................................................................................................................................... 165
Convergent Cross-Mapping..................................................................................................................................... 167
CCMPrepare...............................................................................................................................................................168
Usage................................................................................................................................................................168
Example........................................................................................................................................................... 169
CCM.............................................................................................................................................................................172
Summary......................................................................................................................................................... 172
Usage................................................................................................................................................................172
Examples......................................................................................................................................................... 175
DTW............................................................................................................................................................................ 178
Summary......................................................................................................................................................... 178
Usage................................................................................................................................................................179
Example........................................................................................................................................................... 182
DWT............................................................................................................................................................................ 185
Summary......................................................................................................................................................... 185
Background.....................................................................................................................................................186
Usage................................................................................................................................................................186
Example........................................................................................................................................................... 191
DWT2D.......................................................................................................................................................................195
Summary......................................................................................................... 195
Background.....................................................................................................................................................196
Usage................................................................................................................................................................197
Example........................................................................................................................................................... 202
FrequentPaths.............................................................................................................................................................205
Summary......................................................................................................................................................... 205
Background.....................................................................................................................................................205
Usage................................................................................................................................................................206
Examples......................................................................................................................................................... 211
IDWT...........................................................................................................................................................................225
Summary......................................................................................................................................................... 225
Usage................................................................................................................................................................225
Example........................................................................................................................................................... 227
IDWT2D..................................................................................................................................................................... 232
Summary......................................................................................................................................................... 232
Usage................................................................................................................................................................232
Example........................................................................................................................................................... 234
Interpolator.................................................................................................................................................................238
Summary......................................................................................................................................................... 238
Usage................................................................................................................................................................238
Examples......................................................................................................................................................... 246
Path Analysis Functions............................................................................................................................................256
Summary......................................................................................................................................................... 256
Path_Generator.......................................................................................................................................................... 257
Summary......................................................................................................................................................... 257
Usage................................................................................................................................................................258
Example........................................................................................................................................................... 259
Path_Summarizer...................................................................................................................................................... 262
Summary......................................................................................................................................................... 262
Usage................................................................................................................................................................262
Example........................................................................................................................................................... 264
Path_Start....................................................................................................................................................................266
Summary......................................................................................................................................................... 266
Usage................................................................................................................................................................266
Example........................................................................................................................................................... 268
Path_Analyzer............................................................................................................................................................ 270
Summary......................................................................................................................................................... 270
Usage................................................................................................................................................................270
Example........................................................................................................................................................... 272
SAX2............................................................................................................................................................................ 273
Summary......................................................................................................................................................... 273
Background.....................................................................................................................................................273
Usage................................................................................................................................................................274
Examples......................................................................................................................................................... 280
SeriesSplitter............................................................................................................................................................... 286
Summary......................................................................................................................................................... 286
Background.....................................................................................................................................................286
Usage................................................................................................................................................................286
Examples......................................................................................................................................................... 291
Sessionize.................................................................................................................................................................... 296
Summary......................................................................................................................................................... 296
Background.....................................................................................................296
Usage................................................................................................................................................................296
Example........................................................................................................................................................... 298
Shapelet Functions.....................................................................................................................................................299
Overview......................................................................................................................................................... 300
UnsupervisedShapelet............................................................................................................................................... 300
Summary......................................................................................................................................................... 300
Usage................................................................................................................................................................301
Example........................................................................................................................................................... 304
Troubleshooting.............................................................................................................................................307
SupervisedShapeletTrainer....................................................................................................................................... 308
Summary......................................................................................................................................................... 308
Usage................................................................................................................................................................308
Example........................................................................................................................................................... 312
Troubleshooting.............................................................................................................................................314
SupervisedShapeletClassifier.................................................................................................................................... 315
Summary......................................................................................................................................................... 315
Usage................................................................................................................................................................315
Example........................................................................................................................................................... 317
VARMAX....................................................................................................................................................................319
Summary......................................................................................................................................................... 319
Usage................................................................................................................................................................320
Examples......................................................................................................................................................... 324

Chapter 4:
Pattern Matching with Teradata Aster nPath............................................ 343
Pattern Matching with Teradata Aster nPath........................................................................................................343
nPath............................................................................................................................................................................343
Summary......................................................................................................................................................... 343
Usage................................................................................................................................................................344
Pattern Matching....................................................................................................................................................... 348
Greedy Pattern Matching..............................................................................................................................348
Symbols....................................................................................................................................................................... 350
LAG Expressions in Symbol Predicates...................................................................................................... 350
Filters........................................................................................................................................................................... 355
Example........................................................................................................................................................... 356
Result: Applying Aggregate Functions................................................................................................................... 357
Example 1........................................................................................................................................................359
Example 2........................................................................................................................................................360
Example 3........................................................................................................................................................361
nPath Examples..........................................................................................................................................................362
Clickstream Data Examples..........................................................................................................................362
Range-Matching Examples...........................................................................................................................364

Chapter 5:
Statistical Analysis...........................................................................................................................375
Statistical Analysis......................................................................................................................................................375
Approximate Distinct Count....................................................................................................................................376
Summary......................................................................................................... 376
Background.....................................................................................................................................................376
Usage................................................................................................................................................................376
Example........................................................................................................................................................... 377
Approximate Percentile............................................................................................................................................ 380
Summary......................................................................................................................................................... 380
Background.....................................................................................................................................................380
Usage................................................................................................................................................................380
Example........................................................................................................................................................... 382
CMAVG...................................................................................................................................................................... 385
Summary......................................................................................................................................................... 385
Background.....................................................................................................................................................385
Usage................................................................................................................................................................385
Example........................................................................................................................................................... 386
ConfusionMatrix........................................................................................................................................................388
Summary......................................................................................................................................................... 388
Background.....................................................................................................................................................389
Usage................................................................................................................................................................389
Example........................................................................................................................................................... 391
Correlation..................................................................................................................................................................394
Summary......................................................................................................................................................... 394
Usage................................................................................................................................................................394
Examples......................................................................................................................................................... 396
CoxPH......................................................................................................................................................................... 398
Summary......................................................................................................................................................... 398
Background.....................................................................................................................................................399
Usage................................................................................................................................................................399
Example........................................................................................................................................................... 403
CoxPredict.................................................................................................................................................................. 407
Summary......................................................................................................................................................... 407
Background.....................................................................................................................................................407
Usage................................................................................................................................................................408
Examples......................................................................................................................................................... 411
Hypothesis-Test Mode.............................................................................................................................................. 418
Summary......................................................................................................................................................... 418
Usage................................................................................................................................................................418
Examples......................................................................................................................................................... 424
CoxSurvFit.................................................................................................................................................................. 433
Summary......................................................................................................................................................... 433
Background.....................................................................................................................................................433
Usage................................................................................................................................................................434
Example........................................................................................................................................................... 437
CrossValidation..........................................................................................................................................................438
Summary......................................................................................................................................................... 438
Usage................................................................................................................................................................439
Example........................................................................................................................................................... 440
Distribution Matching.............................................................................................................................................. 443
Summary......................................................................................................................................................... 443
Best-Match Mode.......................................................................................................................................................444
Summary......................................................................................................................................................... 444
Usage................................................................................................................................................................444
Examples......................................................................................................... 448
EMAVG.......................................................................................................................................................................454
Summary......................................................................................................................................................... 454
Background.....................................................................................................................................................454
Usage................................................................................................................................................................454
Example........................................................................................................................................................... 456
FMeasure.....................................................................................................................................................................458
Summary......................................................................................................................................................... 458
Background.....................................................................................................................................................458
Usage................................................................................................................................................................458
Examples......................................................................................................................................................... 460
GLM.............................................................................................................................................................................462
Summary......................................................................................................................................................... 462
Background.....................................................................................................................................................462
Usage................................................................................................................................................................464
Examples......................................................................................................................................................... 472
GLMPredict................................................................................................................................................................ 483
Summary......................................................................................................................................................... 483
Usage................................................................................................................................................................484
Examples......................................................................................................................................................... 485
Hidden Markov Model Functions...........................................................................................................................491
Overview......................................................................................................................................................... 491
Models and Descriptions.............................................................................................................................. 492
Aster Distributed Platforms......................................................................................................................... 493
HMMUnsupervisedLearner..................................................................................................................................... 493
Summary......................................................................................................................................................... 493
Usage................................................................................................................................................................493
Example........................................................................................................................................................... 498
HMMSupervisedLearner.......................................................................................................................................... 503
Summary......................................................................................................................................................... 503
Usage................................................................................................................................................................504
Example........................................................................................................................................................... 507
HMMEvaluator.......................................................................................................................................................... 514
Summary......................................................................................................................................................... 514
Usage................................................................................................................................................................514
Example........................................................................................................................................................... 518
HMMDecoder............................................................................................................................................................ 522
Summary......................................................................................................................................................... 522
Usage................................................................................................................................................................522
Examples......................................................................................................................................................... 524
Histogram................................................................................................................................................................... 535
Summary......................................................................................................................................................... 535
Background.....................................................................................................................................................535
Usage................................................................................................................................................................536
Examples......................................................................................................................................................... 538
KNN.............................................................................................................................................................................542
Summary......................................................................................................................................................... 542
Background.....................................................................................................................................................543
Usage................................................................................................................................................................544
Example........................................................................................................................................................... 546
LARS Functions......................................................................................................................................................... 550
Summary......................................................................................................................................................... 550
Background.....................................................................................................................................................551
LARS............................................................................................................................................................................ 551
Summary......................................................................................................................................................... 551
Usage................................................................................................................................................................551
Examples......................................................................................................................................................... 554
LARSPredict............................................................................................................................................................... 559
Summary......................................................................................................................................................... 559
Usage................................................................................................................................................................559
Examples......................................................................................................................................................... 561
Linear Regression.......................................................................................................................................................564
Summary......................................................................................................................................................... 564
Background.....................................................................................................................................................564
Usage................................................................................................................................................................565
Example........................................................................................................................................................... 566
LRTEST....................................................................................................................................................................... 567
Summary......................................................................................................................................................... 567
Background.....................................................................................................................................................567
Usage................................................................................................................................................................568
Example........................................................................................................................................................... 569
Percentile.....................................................................................................................................................................572
Summary......................................................................................................................................................... 572
Usage................................................................................................................................................................573
Example........................................................................................................................................................... 574
Principal Component Analysis................................................................................................................................ 575
Summary......................................................................................................................................................... 575
Background.....................................................................................................................................................576
Usage................................................................................................................................................................576
Example........................................................................................................................................................... 578
PCAPlot.......................................................................................................................................................................588
Summary......................................................................................................................................................... 588
Usage................................................................................................................................................................589
Example........................................................................................................................................................... 590
RandomSample.......................................................................................................................................................... 591
Summary......................................................................................................................................................... 591
Usage................................................................................................................................................................591
Examples......................................................................................................................................................... 594
Sample......................................................................................................................................................................... 599
Summary......................................................................................................................................................... 599
Usage................................................................................................................................................................600
Examples......................................................................................................................................................... 603
Shapley Value Functions...........................................................................................................................................609
Summary......................................................................................................................................................... 609
Background.....................................................................................................................................................609
GenerateCombination...............................................................................................................................................610
Usage................................................................................................................................................................610
Examples......................................................................................................................................................... 611
SortCombination....................................................................................................................................................... 611
Usage................................................................................................................................................................611
Examples......................................................................................................................................................... 612
AddOnePlayer............................................................................................................................................................ 612
Usage................................................................................................................................................................613
Examples......................................................................................................................................................... 615
SMAVG....................................................................................................................................................................... 629
Summary......................................................................................................................................................... 629
Background.....................................................................................................................................................630
Usage................................................................................................................................................................630
Example........................................................................................................................................................... 631
Support Vector Machines......................................................................................................................................... 633
SparseSVM Functions............................................................................................................................................... 634
SparseSVMTrainer.................................................................................................................................................... 634
Summary......................................................................................................................................................... 634
Usage................................................................................................................................................................635
Example........................................................................................................................................................... 637
SparseSVMPredictor................................................................................................................................................. 640
Summary......................................................................................................................................................... 640
Usage................................................................................................................................................................640
Example........................................................................................................................................................... 642
SVMModelPrinter..................................................................................................................................................... 645
Summary......................................................................................................................................................... 645
Usage................................................................................................................................................................645
Examples......................................................................................................................................................... 646
DenseSVM Functions................................................................................................................................................647
DenseSVMTrainer.....................................................................................................................................................648
Summary......................................................................................................................................................... 648
Usage................................................................................................................................................................648
Examples......................................................................................................................................................... 651
DenseSVMPredictor..................................................................................................................................................657
Summary......................................................................................................................................................... 657
Usage................................................................................................................................................................657
Examples......................................................................................................................................................... 659
DenseSVMModelPrinter.......................................................................................................................................... 665
Summary......................................................................................................................................................... 665
Usage................................................................................................................................................................665
Example........................................................................................................................................................... 666
VectorDistance...........................................................................................................................................................667
Summary......................................................................................................................................................... 667
Background.....................................................................................................................................................668
Usage................................................................................................................................................................669
Examples......................................................................................................................................................... 673
VWAP......................................................................................................................................................................... 676
Summary......................................................................................................................................................... 676
Usage................................................................................................................................................................677
Example........................................................................................................................................................... 678
WMAVG.....................................................................................................................................................................680
Summary......................................................................................................................................................... 680
Background.....................................................................................................................................................681
Usage................................................................................................................................................................681
Example........................................................................................................................................................... 682

Chapter 6:
Text Analysis............................................................................................................................................ 685
Text Analysis.............................................................................................................................................................. 685
LDA Functions........................................................................................................................................................... 685
Summary......................................................................................................................................................... 685
Background.....................................................................................................................................................686
LDATrainer................................................................................................................................................................ 686
Summary......................................................................................................................................................... 686
Usage................................................................................................................................................................686
Example........................................................................................................................................................... 690
LDAInference............................................................................................................................................................. 693
Summary......................................................................................................................................................... 693
Usage................................................................................................................................................................693
Example........................................................................................................................................................... 694
LDATopicPrinter.......................................................................................................................................................697
Summary......................................................................................................................................................... 697
Usage................................................................................................................................................................698
Examples......................................................................................................................................................... 699
Levenshtein Distance (LDist)...................................................................................................................................702
Summary......................................................................................................................................................... 702
Usage................................................................................................................................................................702
Example........................................................................................................................................................... 703
Naive Bayes Text Classifier.......................................................................................................................................705
Summary......................................................................................................................................................... 705
NaiveBayesTextClassifierTrainer............................................................................................................................ 706
Summary......................................................................................................................................................... 706
Usage................................................................................................................................................................706
Examples......................................................................................................................................................... 710
NaiveBayesTextClassifierPredict............................................................................................................................. 714
Summary......................................................................................................................................................... 714
Usage................................................................................................................................................................714
Examples......................................................................................................................................................... 716
NER Functions (CRF Model Implementation)..................................................................................................... 719
Summary......................................................................................................................................................... 719
NERTrainer................................................................................................................................................................ 720
Summary......................................................................................................................................................... 720
Usage................................................................................................................................................................720
Example........................................................................................................................................................... 725
NER..............................................................................................................................................................................726
Summary......................................................................................................................................................... 726
Usage................................................................................................................................................................726
Example........................................................................................................................................................... 728
NEREvaluator.............................................................................................................................................................731
Summary......................................................................................................................................................... 731
Usage................................................................................................................................................................731
Example........................................................................................................................................................... 733
NER Functions (Max Entropy Model Implementation)......................................................................................733
Summary......................................................................................................................................................... 733
FindNamedEntity...................................................................................................................................................... 734
Summary......................................................................................................................................................... 734
Usage................................................................................................................................................................734
Example........................................................................................................................................................... 738
TrainNamedEntityFinder......................................................................................................................................... 740
Summary......................................................................................................................................................... 740
Usage................................................................................................................................................................740
Example........................................................................................................................................................... 741
Evaluate Named Entity Finder.................................................................................................................................743
Summary......................................................................................................................................................... 743
Usage................................................................................................................................................................743
Example........................................................................................................................................................... 744
nGram..........................................................................................................................................................................745
Summary......................................................................................................................................................... 745
Background.....................................................................................................................................................745
Usage................................................................................................................................................................746
Examples......................................................................................................................................................... 748
POSTagger.................................................................................................................................................................. 751
Summary......................................................................................................................................................... 751
Background.....................................................................................................................................................751
Usage................................................................................................................................................................754
Example........................................................................................................................................................... 756
Sentenizer....................................................................................................................................................................759
Summary......................................................................................................................................................... 759
Background.....................................................................................................................................................759
Usage................................................................................................................................................................759
Example........................................................................................................................................................... 760
Sentiment Extraction Functions.............................................................................................................................. 763
Summary......................................................................................................................................................... 763
Background.....................................................................................................................................................763
TrainSentimentExtractor..........................................................................................................................................764
Summary......................................................................................................................................................... 764
Usage................................................................................................................................................................764
Example........................................................................................................................................................... 765
ExtractSentiment........................................................................................................................................................767
Summary......................................................................................................................................................... 767
Usage................................................................................................................................................................767
Examples......................................................................................................................................................... 770
EvaluateSentimentExtractor.....................................................................................................................................778
Summary......................................................................................................................................................... 778
Usage................................................................................................................................................................778
Example........................................................................................................................................................... 779
Text Classifier............................................................................................................................................................. 782
Summary......................................................................................................................................................... 782
Background.....................................................................................................................................................783
TextClassifierTrainer.................................................................................................................................................783
Summary......................................................................................................................................................... 783
Usage................................................................................................................................................................783
Example........................................................................................................................................................... 786
TextClassifier.............................................................................................................................................................. 788
Summary......................................................................................................................................................... 788
Usage................................................................................................................................................................788
Example........................................................................................................................................................... 789
TextClassifierEvaluator............................................................................................................................................. 790
Summary......................................................................................................................................................... 790
Usage................................................................................................................................................................790
Example........................................................................................................................................................... 791
Text_Parser................................................................................................................................................................. 792
Summary......................................................................................................................................................... 792
Background.....................................................................................................................................................792
Usage................................................................................................................................................................792
Examples......................................................................................................................................................... 796
TextChunker...............................................................................................................................................................799
Summary......................................................................................................................................................... 799
Background.....................................................................................................................................................799
Usage................................................................................................................................................................800
Example........................................................................................................................................................... 801
TextMorph..................................................................................................................................................................806
Summary......................................................................................................................................................... 806
Background.....................................................................................................................................................806
Usage................................................................................................................................................................806
Examples......................................................................................................................................................... 809
TextTagging................................................................................................................................................................ 818
Summary......................................................................................................................................................... 818
Usage................................................................................................................................................................819
Examples......................................................................................................................................................... 823
TextTokenizer............................................................................................................................................................ 827
Summary......................................................................................................................................................... 827
Usage................................................................................................................................................................828
Examples......................................................................................................................................................... 830
TF_IDF........................................................................................................................................................................ 834
Summary......................................................................................................................................................... 834
Background.....................................................................................................................................................835
Usage................................................................................................................................................................835
Examples......................................................................................................................................................... 838

Chapter 7:
Cluster Analysis.................................................................................................................................... 847
Cluster Analysis..........................................................................................................................................................847
Canopy.........................................................................................................................................................................847
Summary......................................................................................................................................................... 847
Background.....................................................................................................................................................847
Usage................................................................................................................................................................848
Example........................................................................................................................................................... 849
Gaussian Mixture Model Functions........................................................................................................................851
GMMFit.......................................................................................................................................................................851
Summary......................................................................................................................................................... 851
Usage................................................................................................................................................................851
Examples......................................................................................................................................................... 856
GMMPredict...............................................................................................................................................................863
Summary......................................................................................................................................................... 863
Usage................................................................................................................................................................863
Example........................................................................................................................................................... 865
GMMProfile................................................................................................................................................................868
Summary......................................................................................................................................................... 868
Usage................................................................................................................................................................868
Examples......................................................................................................................................................... 869
KMeans........................................................................................................................................................................872
Summary......................................................................................................................................................... 872
Background.....................................................................................................................................................872
Usage................................................................................................................................................................873
Examples......................................................................................................................................................... 877
KMeansPlot................................................................................................................................................................ 886
Summary......................................................................................................................................................... 886
Usage................................................................................................................................................................886
Example........................................................................................................................................................... 887
KModes....................................................................................................................................................................... 890
Summary......................................................................................................................................................... 890
Usage................................................................................................................................................................891
Examples......................................................................................................................................................... 894
KModesPredict...........................................................................................................................................................900
Summary......................................................................................................................................................... 900
Usage................................................................................................................................................................900
Example........................................................................................................................................................... 901
Minhash.......................................................................................................................................................................904
Summary......................................................................................................................................................... 904
Background.....................................................................................................................................................904
Usage................................................................................................................................................................905
Example........................................................................................................................................................... 906

Chapter 8:
Naive Bayes................................................................................................................................................ 909
Naive Bayes................................................................................................................................................................. 909
What is Naive Bayes?.................................................................................................................................................909
Naive Bayes Functions.............................................................................................................................................. 909
Summary......................................................................................................................................................... 909
NaiveBayesMap and NaiveBayesReduce................................................................................................................ 910
Summary......................................................................................................................................................... 910
Usage................................................................................................................................................................910
Naive Bayes Example.....................................................................................................................................912
NaiveBayesPredict..................................................................................................................................................... 917
Summary......................................................................................................................................................... 917
Usage................................................................................................................................................................918
Naive Bayes Example.....................................................................................................................................919
Naive Bayes Example.................................................................................................................................................925
NaiveBayesMap Input: Training Table.......................................................................................................925
Split Input into Training and Testing Data Sets........................................................................................925
SQL-MapReduce Call to Generate the Model........................................................................................... 927
NaiveBayesReduce and NaiveBayesMap Output: Model Table.............................................................. 927
NaiveBayesPredict Input...............................................................................................................................928
SQL-MapReduce Call to Predict Outcomes of Test Table Data............................................................. 928
NaiveBayesPredict Output: Predict Outcomes Table............................................................................... 928
Prediction Accuracy...................................................................................................................................... 930

Chapter 9:
Ensemble Methods............................................................................................................................931
Ensemble Methods.....................................................................................................................................................931
Random Forest Functions........................................................................................................................................ 931
Summary......................................................................................................................................................... 931
Background.....................................................................................................................................................932
Implementation Notes.................................................................................................................................. 933
Usage................................................................................................................................................................934
Forest_Drive............................................................................................................................................................... 934
Summary......................................................................................................................................................... 934
Usage................................................................................................................................................................935
Example........................................................................................................................................................... 938
Forest_Predict............................................................................................................................................................ 943
Summary......................................................................................................................................................... 943
Usage................................................................................................................................................................943
Example........................................................................................................................................................... 946
Forest_Analyze...........................................................................................................................................................949
Summary......................................................................................................................................................... 949
Usage................................................................................................................................................................950
Examples......................................................................................................................................................... 951
Single Decision Tree Functions............................................................................................................................... 954
Single_Tree_Drive..................................................................................................................................................... 955
Summary......................................................................................................................................................... 955
Background.....................................................................................................................................................955
Usage................................................................................................................................................................956
Examples......................................................................................................................................................... 962
Single_Tree_Predict...................................................................................................................................................973
Summary......................................................................................................................................................... 973
Usage................................................................................................................................................................973
Example........................................................................................................................................................... 974
AdaBoost Functions.................................................................................................................................................. 976
Background.....................................................................................................................................................976
AdaBoost_Drive.........................................................................................................................................................977
Summary......................................................................................................................................................... 977
Usage................................................................................................................................................................978
Example........................................................................................................................................................... 981
AdaBoost_Predict...................................................................................................................................................... 987
Summary......................................................................................................................................................... 987
Usage................................................................................................................................................................987
Example........................................................................................................................................................... 988

Chapter 10:
Association Analysis...................................................................................................................... 995
Association Analysis..................................................................................................................................................995
Basket_Generator.......................................................................................................................................................995
Summary......................................................................................................................................................... 995
Background.....................................................................................................................................................995
Usage................................................................................................................................995
Examples......................................................................................................................... 997
CFilter........................................................................................................................................................................1000
Summary....................................................................................................................................................... 1000
Background...................................................................................................................................................1000
Usage..............................................................................................................................................................1000
Examples....................................................................................................................................................... 1003
FPGrowth..................................................................................................................................................................1007
Summary....................................................................................................................................................... 1007
Background...................................................................................................................................................1007
Usage..............................................................................................................................................................1008
Example.........................................................................................................................................................1013
Recommender Functions........................................................................................................................................1016
WSRecommender....................................................................................................................................................1017
Summary....................................................................................................................................................... 1017
Usage..............................................................................................................................................................1017
Example.........................................................................................................................................................1020
KNNRecommenderTrain....................................................................................................................................... 1023
Summary....................................................................................................................................................... 1023
Usage..............................................................................................................................................................1023
Example.........................................................................................................................................................1026
KNNRecommenderPredict.................................................................................................................................... 1030
Summary....................................................................................................................................................... 1030
Usage..............................................................................................................................................................1030
Example.........................................................................................................................................................1031

Chapter 11:
Graph Analysis..................................................................................................................................... 1035
Graph Analysis......................................................................................................................................................... 1035
Overview of Graph Analysis...................................................................................................................................1035
Graph Functions.......................................................................................................................................... 1035
Iterations....................................................................................................................................................... 1036
What is a Graph?..........................................................................................................................................1036
Directed Graphs........................................................................................................................................... 1037
Graph Discovery.......................................................................................................................................... 1037
AllPairsShortestPath................................................................................................................................................1037
Summary....................................................................................................................................................... 1037
Usage..............................................................................................................................................................1038
Examples....................................................................................................................................................... 1041
Betweenness..............................................................................................................................................................1045
Summary....................................................................................................................................................... 1045
Background...................................................................................................................................................1045
Usage..............................................................................................................................................................1046
Example.........................................................................................................................................................1048
Closeness................................................................................................................................................................... 1050
Summary....................................................................................................................................................... 1050
Background...................................................................................................................................................1051
Usage..............................................................................................................................................................1051
Examples....................................................................................................................................................... 1054
EigenvectorCentrality..............................................................................................................................1057
Summary....................................................................................................................... 1057
Background...................................................................................................................................................1057
Usage..............................................................................................................................................................1059
Examples....................................................................................................................................................... 1061
gTree.......................................................................................................................................................................... 1064
Summary....................................................................................................................................................... 1064
Background...................................................................................................................................................1064
Usage..............................................................................................................................................................1065
Examples....................................................................................................................................................... 1068
LocalClusteringCoefficient.....................................................................................................................................1072
Summary....................................................................................................................................................... 1072
Background...................................................................................................................................................1072
Usage..............................................................................................................................................................1075
Examples....................................................................................................................................................... 1079
LoopyBeliefPropagation......................................................................................................................................... 1082
Summary....................................................................................................................................................... 1082
Background...................................................................................................................................................1082
Usage..............................................................................................................................................................1083
Examples....................................................................................................................................................... 1086
Modularity................................................................................................................................................................ 1090
Summary....................................................................................................................................................... 1090
Background...................................................................................................................................................1090
Definitions.................................................................................................................................................... 1091
Usage..............................................................................................................................................................1093
Examples....................................................................................................................................................... 1097
Tips................................................................................................................................................................ 1100
Troubleshooting...........................................................................................................................................1101
nTree..........................................................................................................................................................................1102
Summary....................................................................................................................................................... 1102
Background...................................................................................................................................................1102
Usage..............................................................................................................................................................1103
Examples....................................................................................................................................................... 1107
PageRank...................................................................................................................................................................1110
Summary....................................................................................................................................................... 1110
Background...................................................................................................................................................1111
Usage..............................................................................................................................................................1111
Example.........................................................................................................................................................1113
pSALSA..................................................................................................................................................................... 1114
Summary....................................................................................................................................................... 1114
Background...................................................................................................................................................1115
Usage..............................................................................................................................................................1117
Examples....................................................................................................................................................... 1120
RandomWalkSample...............................................................................................................................................1128
Summary....................................................................................................................................................... 1128
Background...................................................................................................................................................1128
Usage..............................................................................................................................................................1129
Example.........................................................................................................................................................1131

Chapter 12:
Neural Networks............................................................................................................................... 1135
Neural Networks...................................................................................................................................... 1135
Introduction to Neural Networks..........................................................................................................................1135
NeuralNet..................................................................................................................................................................1137
Summary....................................................................................................................................................... 1137
Usage..............................................................................................................................................................1137
Example.........................................................................................................................................................1140
NeuralNetPredict..................................................................................................................................................... 1144
Summary....................................................................................................................................................... 1144
Usage..............................................................................................................................................................1144
Example.........................................................................................................................................................1146

Chapter 13:
Data Transformation................................................................................................................... 1151
Data Transformation...............................................................................................................................................1151
Antiselect...................................................................................................................................................................1151
Summary....................................................................................................................................................... 1151
Usage..............................................................................................................................................................1152
Example.........................................................................................................................................................1152
Apache_Log_Parser.................................................................................................................................................1154
Summary....................................................................................................................................................... 1154
Background...................................................................................................................................................1154
Usage..............................................................................................................................................................1156
Examples....................................................................................................................................................... 1158
Categorize................................................................................................................................................................. 1161
Summary....................................................................................................................................................... 1161
Usage..............................................................................................................................................................1161
Example.........................................................................................................................................................1162
Fellegi-Sunter Functions.........................................................................................................................................1164
Summary....................................................................................................................................................... 1164
Background...................................................................................................................................................1165
FellegiSunterTrainer................................................................................................................................................1165
Summary....................................................................................................................................................... 1165
Usage..............................................................................................................................................................1165
Examples....................................................................................................................................................... 1168
FellegiSunterPredict................................................................................................................................................ 1173
Summary....................................................................................................................................................... 1173
Usage..............................................................................................................................................................1173
Examples....................................................................................................................................................... 1174
Geometry Functions................................................................................................................................................1179
GeometryLoader...................................................................................................................................................... 1180
Summary....................................................................................................................................................... 1180
Usage..............................................................................................................................................................1180
Example.........................................................................................................................................................1182
PointInPolygon........................................................................................................................................................ 1184
Summary....................................................................................................................................................... 1184
Background...................................................................................................................................................1184
Usage..............................................................................................................................................................1185
Examples....................................................................................................................................................... 1188
GeometryOverlay.....................................................................................................................1192
Summary....................................................................................................................... 1192
Usage..............................................................................................................................................................1192
Examples....................................................................................................................................................... 1195
IdentityMatch...........................................................................................................................................................1198
Summary....................................................................................................................................................... 1198
Background...................................................................................................................................................1199
Usage..............................................................................................................................................................1199
Example.........................................................................................................................................................1203
IPGeo......................................................................................................................................................................... 1205
Summary....................................................................................................................................................... 1205
Usage..............................................................................................................................................................1206
Examples....................................................................................................................................................... 1207
Extending IPGeo..........................................................................................................................................1209
JSONParser............................................................................................................................................................... 1214
Summary....................................................................................................................................................... 1214
Background...................................................................................................................................................1214
Usage..............................................................................................................................................................1215
Examples....................................................................................................................................................... 1217
Multi_Case................................................................................................................................................................1222
Summary....................................................................................................................................................... 1222
Usage..............................................................................................................................................................1222
Example.........................................................................................................................................................1223
MurmurHash............................................................................................................................................................1225
Summary....................................................................................................................................................... 1225
Background...................................................................................................................................................1225
Usage..............................................................................................................................................................1226
Example.........................................................................................................................................................1227
OutlierFilter.............................................................................................................................................................. 1230
Summary....................................................................................................................................................... 1230
Usage..............................................................................................................................................................1230
Examples....................................................................................................................................................... 1233
Pack............................................................................................................................................................................1237
Summary....................................................................................................................................................... 1237
Usage..............................................................................................................................................................1238
Examples....................................................................................................................................................... 1239
Pivot...........................................................................................................................................................................1241
Summary....................................................................................................................................................... 1241
Usage..............................................................................................................................................................1241
Examples....................................................................................................................................................... 1245
PSTParserAFS.......................................................................................................................................................... 1248
Summary....................................................................................................................................................... 1248
Usage..............................................................................................................................................................1249
Examples....................................................................................................................................................... 1254
Scale Functions.........................................................................................................................................................1258
Summary....................................................................................................................................................... 1258
Background...................................................................................................................................................1258
ScaleMap................................................................................................................................................................... 1259
Usage..............................................................................................................................................................1259
Scale........................................................................................................................................................................... 1261
Usage..............................................................................................................................................................1261
ScalePrinter...............................................................................................................................1263
Usage..............................................................................................................................1264
PartitionScale............................................................................................................................................................1265
Usage..............................................................................................................................................................1265
Scale Function Examples............................................................................................................................ 1267
StringSimilarity........................................................................................................................................................ 1276
Summary....................................................................................................................................................... 1276
Usage..............................................................................................................................................................1276
Examples....................................................................................................................................................... 1278
Unpack...................................................................................................................................................................... 1281
Summary....................................................................................................................................................... 1281
Usage..............................................................................................................................................................1282
Examples....................................................................................................................................................... 1284
Unpivot..................................................................................................................................................................... 1287
Summary....................................................................................................................................................... 1287
Usage..............................................................................................................................................................1287
Examples....................................................................................................................................................... 1289
URIPack.................................................................................................................................................................... 1293
Summary....................................................................................................................................................... 1293
Usage..............................................................................................................................................................1293
Example.........................................................................................................................................................1294
URIUnpack...............................................................................................................................................................1295
Summary....................................................................................................................................................... 1295
Background...................................................................................................................................................1295
Usage..............................................................................................................................................................1295
Example.........................................................................................................................................................1297
XMLParser................................................................................................................................................................1298
Summary....................................................................................................................................................... 1298
Background...................................................................................................................................................1298
Usage..............................................................................................................................................................1298
Examples....................................................................................................................................................... 1303
XMLRelation............................................................................................................................................................ 1309
Summary....................................................................................................................................................... 1309
Usage..............................................................................................................................................................1309
Examples....................................................................................................................................................... 1313

Chapter 14:
Aster Scoring SDK........................................................................................................................... 1319
Aster Scoring SDK................................................................................................................................................... 1319
Introduction to Aster Scoring SDK.......................................................................................................................1319
AMLGenerator.........................................................................................................................................................1320
Summary....................................................................................................................................................... 1320
Usage..............................................................................................................................................................1320
Example.........................................................................................................................................................1324
Scorer.........................................................................................................................................................................1326
Summary....................................................................................................................................................... 1326
Package.......................................................................................................................................................... 1327
Installation.................................................................................................................................................... 1328
Functional Support......................................................................................................................................1329
Input Formats...............................................................................................................1329
Data Types.................................................................................................................................... 1329
Output Formats............................................................................................................................................1330
Scoring API...................................................................................................................................................1330
Javadoc.......................................................................................................................................................... 1331
Examples....................................................................................................................................................... 1331
Logging Support...........................................................................................................................................1332
Compatibility................................................................................................................................................1332
Performance................................................................................................................................................. 1333
Tips................................................................................................................................................................ 1333
Aster Scoring SDK Functions................................................................................................................................ 1333
Aster Scoring SDK Single Decision Tree..................................................................................................1334
Aster Scoring SDK Generalized Linear Model........................................................................................ 1335
Aster Scoring SDK Random Forest...........................................................................................................1336
Aster Scoring SDK Naïve Bayes.................................................................................................................1337
Aster Scoring SDK Naïve Bayes Text Classifier.......................................................................................1337
Aster Scoring SDK Text Tagging...............................................................................................................1338
Aster Scoring SDK Extract Sentiment...................................................................................................... 1340
Aster Scoring SDK Text Parser.................................................................................................................. 1342
Aster Scoring SDK Text Tokenizer........................................................................................................... 1343
Aster Scoring SDK SparseSVM..................................................................................................................1345
Aster Scoring SDK CoxPH......................................................................................................................... 1346
Aster Scoring SDK LDAInference.............................................................................................................1347
FAQ............................................................................................................................................................................1347
How is Aster Scoring SDK different from functions in the Aster Analytics suite?............................ 1347
Does Aster Scoring SDK include a real-time streaming engine or a listening framework?..............1348
Does Aster Scoring SDK Need Aster Database and Aster Analytics Suite?........................................ 1348
Can Aster Scoring SDK be invoked in a cloud environment such as Amazon Web Services (AWS)?.........1348
Is Aster Scoring SDK thread-safe? Can it be deployed in a multithreaded parallel system?.............1348
What is the recommended way to incorporate Aster Scoring SDK in a multithreaded system?..... 1348
Does Aster Scoring SDK work on Predictive Model Markup Language (PMML) based models?.. 1349
How fast is the response time of Aster Scoring SDK?............................................................................ 1349

Chapter 15:
Visualization Functions............................................................................................................1351
Visualization Functions.......................................................................................................................................... 1351

Chapter 16:
Aster Database System Utility Functions...................................................... 1353
Aster Database System Utility Functions............................................................................................................. 1353

Appendix A:
List of Functions and Their Syntax......................................................................... 1355
About the List of Functions....................................................................................................................................1355
Time Series, Path, and Attribution Analysis........................................................................................................1355
Arima (version 1.1)......................................................................................................................1355
ArimaPredictor (version 1.1)..................................................................................................................... 1355
Attribution (Multiple-Input Version) (version 2.3)............................................................................... 1356
Attribution (Single-Input Version) (version 2.3)....................................................................................1356
Burst (version 1.0)........................................................................................................................................1356
CCM (version 1.0)....................................................................................................................................... 1357
CCMPrepare (version 1.0)..........................................................................................................................1357
ChangePointDetection (version 1.0).........................................................................................................1357
DTW (version 1.0).......................................................................................................................................1358
DWT (version 1.3).......................................................................................................................................1358
DWT2D (version 1.3)..................................................................................................................................1359
FrequentPaths (version 2.1)....................................................................................................................... 1359
IDWT (version 1.3)..................................................................................................................................... 1360
IDWT2D (version 1.3)................................................................................................................................ 1360
Interpolator (version 1.0)............................................................................................................................1360
Path_Analyzer (version 1.3)....................................................................................................................... 1361
Path_Generator (version 1.3).....................................................................................................................1361
Path_Start (version 1.2).............................................................................................................................. 1361
Path_Summarizer (version 1.2)................................................................................................................. 1362
RtChangePointDetection (version 1.0).....................................................................................................1362
SAX2.............................................................................................................................................................. 1362
SeriesSplitter (version 1.0).......................................................................................................................... 1363
Sessionize (version 1.3)............................................................................................................................... 1364
SupervisedShapeletClassifier (version 1.1)...............................................................................................1364
SupervisedShapeletTrainer (version 1.1)..................................................................................................1365
UnsupervisedShapelet (version 1.0)..........................................................................................................1365
VARMAX (version 1.0).............................................................................................................................. 1366
Pattern Matching with Teradata Aster nPath......................................................................................................1366
nPath (version 1.0).......................................................................................................................................1366
Statistical Analysis................................................................................................................................................... 1367
AddOnePlayer (version 1.0).......................................................................................................................1367
Approximate Distinct Count (version 1.0).............................................................................................. 1367
Approximate Percentile (version 1.1)....................................................................................................... 1367
CMAVG (version 1.2)................................................................................................................................. 1367
ConfusionMatrix (version 2.0).................................................................................................................. 1368
Correlation (version 1.4).............................................................................................................................1368
CoxPH (version 1.2).................................................................................................................................... 1368
CoxPredict (version 1.1)............................................................................................................................. 1369
CoxSurvFit (version 1.1).............................................................................................................................1369
CrossValidation (version 1.0).................................................................................................................... 1370
Distribution Matching, Hypothesis-Test Mode...................................................................................... 1370
Distribution Matching, Best-Match Mode...............................................................................................1372
EMAVG (version 1.2)................................................................................................................................. 1374
FMeasure (version 1.4)................................................................................................................................1374
GenerateCombination (version 1.0)......................................................................................................... 1374
GLM (version 1.7)........................................................................................................................................1375
GLMPredict (version 1.5)........................................................................................................................... 1375
Histogram (version 1.0).............................................................................................................................. 1375
HMMDecoder (version 1.3).......................................................................................................................1376
HMMEvaluator (version 1.3).....................................................................................................................1376
HMMSupervisedLearner (version 1.3)..................................................................................................... 1377
HMMUnsupervisedLearner (version 1.3)................................................................................................1377
KNN (version 1.3)........................................................................................................................................1378
LARS (version 1.1).......................................................................................................................................1378
LARSPredict (version 1.1).......................................................................................................................... 1379
Linear Regression (versions 1.1 and 1.0)..................................................................................... 1379
LRTEST (version 1.1).................................................................................................................................. 1379
Percentile (version 1.0)............................................................................................................................... 1379
Principal Component Analysis (PCA_Reduce version 1.2, PCA_Map version 1.1)........................ 1380
PCAPlot (version 1.0)................................................................................................................................. 1380
RandomSample (version 1.0)..................................................................................................................... 1380
Sample (version 1.2).................................................................................................................................... 1381
SMAVG (version 1.2)..................................................................................................................................1382
SortCombination (version 1.1).................................................................................................................. 1382
Support Vector Machines........................................................................................................................... 1383
VectorDistance (version 1.1)......................................................................................................................1385
VWAP (version 1.2).................................................................................................................................... 1385
WMAVG (version 1.2)................................................................................................................................1385
Text Analysis............................................................................................................................................................ 1386
Evaluate Named Entity Finder (version 1.1)............................................................................................1386
EvaluateSentimentExtractor (version 1.1)............................................................................................... 1386
ExtractSentiment (version 3.1).................................................................................................................. 1386
FindNamedEntity (version 1.2)................................................................................................................. 1386
LDAInference (version 1.1)........................................................................................................................1387
LDATopicPrinter (version 1.1)..................................................................................................................1387
LDATrainer (version 1.1)........................................................................................................................... 1387
Levenshtein Distance (LDist) (version 1.1)..............................................................................................1388
NaiveBayesTextClassifierPredict (version 1.1)........................................................................................1388
NaiveBayesTextClassifierTrainer (version 1.1)....................................................................................... 1389
NER (version 1.1).........................................................................................................................................1389
NEREvaluator (version 1.1)........................................................................................................................1389
NERTrainer (version 1.1)........................................................................................................................... 1389
nGram (version 1.5).................................................................................................................................... 1390
POSTagger (version 2.1)............................................................................................................................. 1390
Sentenizer (version 1.1)...............................................................................................................................1391
TextChunker (version 1.2)......................................................................................................................... 1391
TextClassifier (version 1.2).........................................................................................................................1391
TextClassifierEvaluator (version 1.2)........................................................................................................1391
TextClassifierTrainer (version 1.4)........................................................................................................... 1391
TextMorph (version 1.2).............................................................................................................................1392
TextTagging (version 1.3)...........................................................................................................................1392
TextTokenizer (version 3.2)....................................................................................................................... 1392
Text_Parser (version 1.3)............................................................................................................................1393
TF_IDF (TF_IDF version 2.1, TF version 1.1)........................................................................................ 1393
TrainNamedEntityFinder (version 1.3)....................................................................................................1394
TrainSentimentExtractor (version 2.1).....................................................................................................1394
Cluster Analysis........................................................................................................................................................1394
Canopy (version 2.0)................................................................................................................................... 1394
GMMFit (version 1.0)................................................................................................................................. 1395
GMMPredict (version 1.0)......................................................................................................................... 1395
GMMProfile (version 1.0).......................................................................................................................... 1395
KMeans (version 1.6).................................................................................................................................. 1396
KMeansPlot Syntax......................................................................................................................................1396
KModes (version 1.0).................................................................................................................................. 1397
KModesPredict (version 1.0)......................................................................................................................1397
Minhash (version 2.2)................................................................................................................................. 1397
Naive Bayes............................................................................................................................................................... 1398
NaiveBayesMap and NaiveBayesReduce (version 1.3)...........................................................................1398
NaiveBayesPredict (version 1.4)................................................................................................................ 1398
Ensemble Methods.................................................................................................................................................. 1399
AdaBoost_Drive (version 1.5)....................................................................................................................1399
AdaBoost_Predict (version 1.5).................................................................................................................1399
Forest_Analyze (version 1.1)......................................................................................................................1399
Forest_Drive (version 1.5)..........................................................................................................................1400
Forest_Predict (version 1.5)....................................................................................................................... 1400
Single_Tree_Drive (version 1.3)................................................................................................................ 1401
Single_Tree_Predict (version 1.2)............................................................................................................. 1401
Association Analysis................................................................................................................................................1402
Basket_Generator (version 1.3)................................................................................................................. 1402
CFilter (version 1.7).....................................................................................................................................1402
FPGrowth (version 1.2)...............................................................................................................................1402
KNNRecommenderPredict (version 1.0)................................................................................................. 1403
KNNRecommenderTrain (version 1.0)....................................................................................................1403
WSRecommender (version 1.0).................................................................................................................1404
Graph Analysis......................................................................................................................................................... 1404
AllPairsShortestPath (version 1.2)............................................................................................................ 1404
Betweenness (version 1.2)...........................................................................................................................1404
Closeness (version 1.2)................................................................................................................................1405
EigenvectorCentrality (version 1.1).......................................................................................................... 1405
gTree (version 1.0).......................................................................................................................................1406
LocalClusteringCoefficient (version 1.1)..................................................................................................1406
LoopyBeliefPropagation (version 1.0)...................................................................................................... 1406
Modularity (version 1.1)............................................................................................................................. 1407
nTree (version 1.1).......................................................................................................................................1407
PageRank (version 1.1)................................................................................................................................1407
pSALSA (version 1.1).................................................................................................................................. 1408
RandomWalkSample (version 1.2)............................................................................................................1408
Neural Networks...................................................................................................................................................... 1409
NeuralNet (version 1.0).............................................................................................................................. 1409
NeuralNetPredict (version 1.0)..................................................................................................................1409
Data Transformation...............................................................................................................................................1410
Antiselect (version 1.0)............................................................................................................................... 1410
Apache_Log_Parser (version 2.2)..............................................................................................................1410
Categorize (version 1.0).............................................................................................................................. 1410
FellegiSunterPredict (version 1.1)............................................................................................................. 1410
FellegiSunterTrainer (version 1.1).............................................................................................................1410
GeometryLoader (version 1.1)................................................................................................................... 1411
GeometryOverlay (version 1.1)..................................................................................................................1411
IdentityMatch (version 1.1)........................................................................................................................1412
IPGeo (version 2.1)......................................................................................................................................1412
JSONParser (version 1.5)............................................................................................................................1413
Multi_Case (version 1.1).............................................................................................................................1413
MurmurHash (version 1.1)........................................................................................................................ 1413
OutlierFilter (version 1.3)...........................................................................................................................1413
Pack (version 1.2).........................................................................................................................................1414
PartitionScale (version 1.2).........................................................................................................................1414
Pivot (version 1.5)........................................................................................................................................1415
PointInPolygon............................................................................................................................................ 1415
PSTParserAFS (version 1.1)....................................................................................................................... 1416
Scale (version 1.2)........................................................................................................................................ 1416
ScaleMap (version 1.2)................................................................................................................................ 1417
ScalePrinter (version 1.2)............................................................................................................................1417
StringSimilarity (version 1.1)..................................................................................................................... 1417
Unpack (version 1.2)................................................................................................................................... 1417
Unpivot (version 1.2).................................................................................................................................. 1417
URIPack (version 1.1)................................................................................................................................. 1418
URIUnpack (version 1.0)............................................................................................................................1418
XMLParser (version 1.7).............................................................................................................................1418
XMLRelation (version 1.3)......................................................................................................................... 1419
Aster Scoring SDK................................................................................................................................................... 1419
AMLGenerator (version 1.0)......................................................................................................................1419
Scorer.............................................................................................................................................................1420
Visualization Functions.......................................................................................................................................... 1421
List of Figures

Figure 1: Cogroup Example Tables........................................................................................................................... 70
Figure 2: How a SQL-MapReduce function performs a cogroup......................................................................... 71
Figure 3: Dimensional Example Tables.................................................................................................................... 73
Figure 4: How dimensional inputs work in SQL-MapReduce.............................................................................. 74
Figure 5: Conversion Path........................................................................................................................................ 120
Figure 6: DTW Example Results Plot......................................................................................................................185
Figure 7: Single-Level Application of DWT2D..................................................................................................... 197
Figure 8: Saxification Process...................................................................................................................................274
Figure 9: UnsupervisedShapelet Example Input Data.......................................................................................... 306
Figure 10: Example of a Sankey Diagram of Teradata Aster nPath Output.............................................. 344
Figure 11: K-fold cross-validation........................................................................................................................... 439
Figure 12: KNN Example..........................................................................................................................................543
Figure 13: LAR Results.............................................................................................................................................. 556
Figure 14: LASSO Results......................................................................................................................................... 559
Figure 15: Computing a Shapley Value...................................................................................................................609
Figure 16: Graph Example......................................................................................................................................1036
Figure 17: Graph of Phone Calls Between Persons............................................................................................. 1042
Figure 18: Betweenness Example Social Network............................................................................................... 1048
Figure 19: Graph of Phone Calls Between Persons............................................................................................. 1054
Figure 20: In-Neighbors Relation Matrix............................................................................................................. 1059
Figure 21: EigenvectorCentrality Example Input Graph................................................................................... 1062
Figure 22: Graph of Trading Partners...................................................................................................................1079
Figure 23: Relationship Between Hepatitis and Symptoms............................................................................... 1086
Figure 24: Resolution Levels...................................................................................................................................1092
Figure 25: Graph of Social Network......................................................................................................................1097
Figure 26: Graph of Phone Calls Between Persons............................................................................................. 1113
Figure 27: pSALSA Example Diagram (Network Of Users)..............................................................................1120
Figure 28: pSALSA Example Diagram (Bipartite Representation)...................................................................1121
Figure 29: A Neural Network................................................................................................................................. 1136
Figure 30: Apache Server Configuration File Sample Lines.............................................................................. 1154
Figure 31: PSTParserAFS Input File Email in Outlook......................................................................................1255
List of Tables

Table 1: Version History Table.................................................................................................................................. 60
Table 2: Aster Analytics Function Product Bundles and Package Names........................................................... 76
Table 3: Premium Path Bundle Functions in Alphabetical Order........................................................................ 77
Table 4: Premium Relationship Bundle Functions in Alphabetical Order..........................................................77
Table 5: Analytics Foundation Bundle Functions in Alphabetical Order............................................................78
Table 6: Premium Graph Bundle Functions in Alphabetical Order.....................................................................82
Table 7: Aster Scoring SDK Functions in Alphabetical Order.............................................................................. 83
Table 8: Time Series, Path, and Attribution Analysis Functions.......................................................................... 83
Table 9: Pattern Matching with Teradata Aster nPath Function.......................................................................... 84
Table 10: Statistical Analysis Functions.................................................................................................................... 85
Table 11: Text Analysis Functions.............................................................................................................................86
Table 12: Cluster Analysis Functions........................................................................................................................ 88
Table 13: Naive Bayes Functions............................................................................................................................... 88
Table 14: Ensemble Methods Functions................................................................................................................... 88
Table 15: Association Analysis Functions................................................................................................................ 89
Table 16: Graph Analysis Functions..........................................................................................................................89
Table 17: Aster Scoring SDK Functions....................................................................................................................90
Table 18: Neural Net Functions................................................................................................................................. 90
Table 19: Data Transformation Functions............................................................................................................... 91
Table 20: ACT \dE Command Output Sample........................................................................................................94
Table 21: Query Output Sample................................................................................................................................ 94
Table 22: Aster Analytics Foundation Compatibility Matrix................................................................................ 94
Table 23: Aster Analytics Function Product Bundles, Packages, and ZIP File Names...................................... 95
Table 24: ACT Commands for Managing Files and Functions...........................................................................101
Table 25: Arima Input Table Schema......................................................................................................................112
Table 26: Arima Model Table Schema.................................................................................................................... 113
Table 27: Arima Model Coefficients....................................................................................................................... 113
Table 28: Arima Residual Table Schema................................................................................................................ 114
Table 29: Arima Example Input Table milk_timeseries.......................................................................................114
Table 30: Arima Example Model Table: arimamodel........................................................................................... 115
Table 31: Arima Example Residual Table: arimaresidual.................................................................................... 116
Table 32: ArimaPredictor Output Table Schema.................................................................................................. 118
Table 33: ArimaPredictor Example Output........................................................................................................... 118
Table 34: Attribution Input Table Schema.............................................................................................................122
Table 35: Attribution Conversion Event Table Schema....................................................................................... 123
Table 36: Attribution Excluding Event Table Schema..........................................................................................123
Table 37: Attribution Optional Event Table Schema............................................................................................123
Table 38: Attribution Model Table Schema........................................................................................................... 123
Table 39: Attribution Model Types and Specification Definitions.....................................................................123
Table 40: Attribution Distribution Model Specification: Models and Parameters...........................................125
Table 41: Attribution: Allowed Model1/Model2 Combinations.........................................................................126
Table 42: Attribution Output Table Schema..........................................................................................................126
Table 43: Attribution Example: Event Types and Channels..................................................................126
Table 44: Multiple-Input Attribution Example Input Table attribution_sample_table1................................ 127
Table 45: Multiple-Input Attribution Example Input Table attribution_sample_table2................................ 127
Table 46: Multiple-Input Attribution Example Conversion Event Table conversion_event_table...............127
Table 47: Multiple-Input Attribution Example Excluding Event Table excluding_event_table....................127
Table 48: Multiple-Input Attribution Example Dimension Table optional_event_table................................127
Table 49: Multiple-Input Attribution Example Model Table model1_table..................................................... 128
Table 50: Multiple-Input Attribution Example Model Table model2_table..................................................... 128
Table 51: Multiple-Input Attribution Example Output Table............................................................................ 128
Table 52: Single-Input Attribution Example 1: Input Table attribution_sample_table.................................. 131
Table 53: Single-Input Attribution Example 1 Output Table..............................................................................133
Table 54: Single-Input Attribution Example 2 Output Table..............................................................................134
Table 55: Single-Input Attribution Example 3 Output Table..............................................................................135
Table 56: Single-Input Attribution Example 4 Output Table..............................................................................137
Table 57: Single-Input Attribution Example 5: Input Table attribution_sample_table3................................ 138
Table 58: Single-Input Attribution Example 5 Output Table..............................................................................139
Table 59: Single-Input Attribution Example 6 Output Table..............................................................................140
Table 60: Burst input_table Schema........................................................................................................................144
Table 61: Burst time_table Schema......................................................................................................................... 144
Table 62: Burst Output Table Schema.................................................................................................................... 145
Table 63: Burst Example 1 Input Table: finance_data.......................................................................................... 146
Table 64: Burst Example 1 Output Table (Columns 1-6).....................................................................................147
Table 65: Burst Example 1 Output Table (Columns 7-9).....................................................................................148
Table 66: Burst Example 2: time_table1..................................................................................................................149
Table 67: Burst Example 2 Output Table (Columns 1-6).....................................................................................150
Table 68: Burst Example 2 Output Table (Columns 7-9).....................................................................................150
Table 69: Change-Point Detection Functions Input Table Schema................................................................... 156
Table 70: Change-Point Detection Functions Output Table Schema for OutputOption ('CHANGEPOINT')...... 156
Table 71: Change-Point Detection Functions Output Table Schema for OutputOption ('VERBOSE')....... 156
Table 72: Change-Point Detection Functions Output Table Schema for OutputOption ('SEGMENT')......157
Table 73: ChangePointDetection Example 1 Input Table finance_data2.......................................................... 157
Table 74: ChangePointDetection Example 1 Output Table.................................................................................158
Table 75: ChangePointDetection Examples 2-6 Input Table cpt........................................................................ 159
Table 76: ChangePointDetection Example 2: Output Table................................................................................161
Table 77: ChangePointDetection Example 3 Output Table.................................................................................161
Table 78: ChangePointDetection Example 4 Output Table.................................................................................162
Table 79: ChangePointDetection Example 5 Output Table.................................................................................163
Table 80: ChangePointDetection Example 6 Output Table.................................................................................163
Table 81: RtChangePointDetection Example 1 Output Table.............................................................................166
Table 82: RtChangePointDetection Example 2 Output Table.............................................................................166
Table 83: RtChangePointDetection Example 3 Output Table.............................................................................167
Table 84: CCMPrepare Input Table Schema..........................................................................................................168
Table 85: CCMPrepare Output Table Schema...................................................................................................... 169
Table 86: CCMPrepare Example Input Table ccmprepare_input...................................................................... 169
Table 87: CCMPrepare Example Output Table.....................................................................................................170
Table 88: CCM Input Table Schema....................................................................................................................... 174
Table 89: CCM Output Schema............................................................................................................................... 174
Table 90: CCM Example 1 Output Table (Columns 1-5).................................................................................... 176
Table 91: CCM Example 1 Output Table (Columns 6-9).................................................................................... 176
Table 92: CCM Example 2 Input Table ccm_input2............................................................................................ 176
Table 93: CCM Example 2 Output Table (Columns 1-5).................................................................................... 177
Table 94: CCM Example 2 Output Table (Columns 6-9).................................................................................... 177
Table 95: DTW input_table Schema....................................................................................................................... 181
Table 96: DTW template_table Schema................................................................................................................. 181
Table 97: DTW mapping_table Schema................................................................................................................. 181
Table 98: DTW Output Table Schema....................................................................................................................182
Table 99: DTW Example Input Table timeseriesdata...........................................................................................182
Table 100: DTW Example Template Table templatedata.................................................................................... 183
Table 101: DTW Example Mapping Table mappingdata.................................................................................... 183
Table 102: DTW Example Output Table................................................................................................................184
Table 103: Supported Wavelet Filter Names..........................................................................................................188
Table 104: Supported Extension Modes................................................................................................................. 188
Table 105: DWT Input Table Schema.....................................................................................................................189
Table 106: DWT Wavelet Filter Table Schema......................................................................................................189
Table 107: Wavelet Filter Table Names and Values..............................................................................................189
Table 108: DWT Output Message........................................................................................................................... 190
Table 109: DWT Output Table Schema..................................................................................................................190
Table 110: DWT Meta Table Schema..................................................................................................................... 191
Table 111: DWT Meta Information for Each Sequence....................................................................................... 191
Table 112: DWT Example Input Table ville_climatedata.................................................................................... 192
Table 113: DWT Example Output Message...........................................................................................................193
Table 114: DWT Example Output Table dwt_coef_table.................................................................................... 193
Table 115: DWT Example Meta Table dwt_meta_table.......................................................................................194
Table 116: DWT2D Input Table Schema............................................................................................................... 199
Table 117: DWT2D Output Message......................................................................................................................200
Table 118: DWT2D Output Table Schema............................................................................................................ 200
Table 119: DWT2D Meta Table Schema................................................................................................................ 201
Table 120: DWT2D Meta Information for Each Sequence..................................................................................201
Table 121: DWT2D Example Input Table twod_climate_data........................................................................... 202
Table 122: DWT2D Example Output Message......................................................................................................203
Table 123: DWT2D Example Output Table dwt2d_coeftable.............................................................................203
Table 124: DWT2D Example Meta Table dwt2d_metatable............................................................................... 204
Table 125: FrequentPaths Input Table Schema..................................................................................................... 210
Table 126: FrequentPaths Item Definition Table Schema................................................................................... 210
Table 127: FrequentPaths Output Message............................................................................................................211
Table 128: FrequentPaths Output Table Schema.................................................................................................. 211
Table 129: FrequentPaths Sequence Pattern Table Schema.................................................................................211
Table 130: FrequentPaths Example 1 Input Table bank_web_clicks1............................................................... 211
Table 131: FrequentPaths Example 1 Output Message........................................................................................ 213
Table 132: FrequentPaths Example 1 Output Table............................................................................................. 213
Table 133: FrequentPaths Example 2 Input Table bank_web_url...................................................................... 214
Table 134: FrequentPaths Example 2 Definition Table ref_url...........................................................................215
Table 135: FrequentPaths Example 2 Output Message........................................................................................ 216
Table 136: FrequentPaths Example 2 Output Table............................................................................................. 216
Table 137: FrequentPaths Example 3 Input Table bank_web_clicks2............................................................... 217
Table 138: FrequentPaths Example 3 Output Message........................................................................................ 218
Table 139: FrequentPaths Example 3 Output Table............................................................................................. 218
Table 140: FrequentPaths Example 4 Output Message........................................................................................ 219
Table 141: FrequentPaths Example 4 Output Table............................................................................................. 219
Table 142: FrequentPaths Example 5 Output Message........................................................................................ 220
Table 143: FrequentPaths Example 5 Output Table............................................................................................. 221
Table 144: FrequentPaths Example 6 Output Message........................................................................................ 222
Table 145: FrequentPaths Example 6 Output Table............................................................................................. 222
Table 146: FrequentPaths Example 7 nPath Input Table sequence_table......................................................... 223
Table 147: FrequentPaths Example 7 nPath Output Table..................................................................................224
Table 148: FrequentPaths Example 7 Output Message........................................................................................ 224
Table 149: FrequentPaths Example 7 Output Table............................................................................................. 224
Table 150: IDWT Output Message..........................................................................................................................227
Table 151: IDWT Output Table Schema................................................................................................................ 227
Table 152: IDWT Example Input Table dwt_coef_table......................................................................................227
Table 153: IDWT Example Input Table dwt_meta_table.................................................................................... 229
Table 154: IDWT Example Output Message......................................................................................................... 230
Table 155: IDWT Example Output Table climate_reconstruct.......................................................................... 230
Table 156: IDWT2D Output Message.................................................................................................................... 234
Table 157: IDWT2D Output Table Schema...........................................................................................................234
Table 158: IDWT2D Example Input Table dwt2d_coeftable.............................................................................. 235
Table 159: IDWT2D Example Input Table dwt2d_metatable.............................................................................235
Table 160: IDWT2D Example Output Message.................................................................................................... 237
Table 161: IDWT2D Example Output Table......................................................................................................... 237
Table 162: Interpolator input_table Schema..........................................................................................................244
Table 163: Interpolator time_table Schema........................................................................................................... 245
Table 164: Interpolator count_row_number Table Schema................................................................................245
Table 165: Interpolator Output Table Schema...................................................................................................... 245
Table 166: Interpolator Examples Input Table ibm_stock1.................................................................................246
Table 167: Interpolate Example 1 (Aggregation) Input Table time_table1....................................................... 247
Table 168: Interpolate Example 1 (Aggregation) Output Table..........................................................................248
Table 169: Interpolate Example 2 (Constant Interpolation) Output Table.......................................................250
Table 170: Interpolate Example 3 (Linear Interpolation) Output Table............................................................251
Table 171: Interpolate Example 4 (Median Interpolation) Output Table......................................................... 253
Table 172: Interpolate Example 5 (Spline Interpolation) Output Table............................................................ 254
Table 173: Interpolate Example 6 (Loess Interpolation) Output Table............................................................. 255
Table 174: Path_Generator Input Table Schema...................................................................................................258
Table 175: Path_Generator Output Table Schema............................................................................................... 259
Table 176: Path_Generator Example E-Commerce Website Page Symbols..................................................... 259
Table 177: Path_Generator Example Input Table: clickstream1.........................................................................260
Table 178: Path_Generator Example Output Table............................................................................... 260
Table 179: Path_Summarizer Output Table Schema............................................................................................263
Table 180: Path_Summarizer Example Output Table.......................................................................................... 264
Table 181: Path_Start Output Table Schema......................................................................................................... 267
Table 182: Path_Start Example Output Table....................................................................................................... 269
Table 183: SAX2 Input Table Schema.....................................................................................................................278
Table 184: SAX2 'string' or 'bytes' Output Table Schema.....................................................................................279
Table 185: SAX2 'bitmap' Output Table Schema...................................................................................................279
Table 186: SAX2 'characters' Output Table Schema............................................................................................. 280
Table 187: SAX2 Examples Input Table sax_example..........................................................................................281
Table 188: SAX2 Example 1 Output Table, Columns 1-5....................................................................................282
Table 189: SAX2 Example 1 Output Table, Columns 6-9....................................................................................282
Table 190: SAX2 Example 1 Output Table, Columns 10-12................................................................................282
Table 191: SAX2 Example 2 Output Table.............................................................................................................282
Table 192: SAX2 Example 3 Output Table.............................................................................................................283
Table 193: SAX2 Example 4 Output Table.............................................................................................................284
Table 194: SAX2 Example 5 Output Table (Columns 1-3)..................................................................................285
Table 195: SAX2 Example 5 Output Table (Columns 4-6)..................................................................................285
Table 196: SAX2 Example 5 Output Table (Columns 7-9)..................................................................................285
Table 197: SAX2 Example 5 Output Table (Columns 10-12)..............................................................................285
Table 198: SeriesSplitter input_table Schema........................................................................................................ 290
Table 199: SeriesSplitter Output Table Schema.....................................................................................................291
Table 200: SeriesSplitter Stats Table Schema......................................................................................................... 291
Table 201: SeriesSplitter Example Input Table ibm_stock1.................................................................................292
Table 202: SeriesSplitter Example 1 Stats Table.....................................................................................................293
Table 203: SeriesSplitter Example 1 Output Table ibm_stock1_split.................................................................293
Table 204: Sessionize Input Table Schema............................................................................................................. 297
Table 205: Sessionize Output Table Schema..........................................................................................................297
Table 206: Sessionize Example Input Table adweb_clickstream.........................................................................298
Table 207: Sessionize Example Output Table........................................................................................................ 299
Table 208: UnsupervisedShapelet Input Table Schema........................................................................................303
Table 209: UnsupervisedShapelet Output Message Schema................................................................................304
Table 210: UnsupervisedShapelet Output Table Schema.....................................................................................304
Table 211: UnsupervisedShapelet Example Input Table ushapelets_input.......................................................304
Table 212: UnsupervisedShapelet Example Output Message..............................................................................306
Table 213: UnsupervisedShapelet Example Output Table uss_output.............................................................. 307
Table 214: SupervisedShapeletTrainer Output Message Schema....................................................................... 311
Table 215: SupervisedShapeletTrainer Model Table Schema..............................................................................311
Table 216: SupervisedShapeletTrainer Example Input Table shapelets_train.................................................. 312
Table 217: SupervisedShapeletTrainer Example Output Message......................................................................314
Table 218: SupervisedShapeletTrainer Example Model Table shapelets_model.............................................. 314
Table 219: SupervisedShapeletClassifier Output Table Schema......................................................................... 316
Table 220: SupervisedShapeletClassifier Example Input Table shapelets_test................................................. 317
Table 221: SupervisedShapeletClassifier Example Output Table shapelets_predict........................................ 318
Table 222: VARMAX Input Schema....................................................................................................................... 322
Table 223: VARMAX Output Schema.................................................................................................................... 323
Table 224: VARMAX Model Coefficients.............................................................................................................. 323
Table 225: VARMAX Example Input Table finance_data3.................................................................................324
Table 226: VARMAX Example 1 Output Table (Columns 1-4)......................................................................... 325
Table 227: VARMAX Example 1 Output Table (Columns 5-7)......................................................................... 327
Table 228: VARMAX Example 2 Output Table.................................................................................... 329
Table 229: VARMAX Example 3 Output Table (Columns 1-4)......................................................................... 332
Table 230: VARMAX Example 3 Output Table (Columns 5-7)......................................................................... 338
Table 231: VARMAX Example 4 Output Table.................................................................................................... 340
Table 232: Simple nPath Patterns and Operator Precedence.............................................................................. 346
Table 233: nPath Input Table Schema.................................................................................................................... 347
Table 234: nPath Output Table Schema...................................................................48
Table 235: nPath Greedy Pattern Matching Example Input Table link2...........................................................349
Table 236: nPath Greedy Pattern Matching Example 1 Output Table...............................................................349
Table 237: nPath Greedy Pattern Matching Example 2 Output Table...............................................................350
Table 238: nPath Sample Input Table..................................................................................................................... 350
Table 239: nPath LAG Expression Example Input Table bank_web_clicks......................................................351
Table 240: nPath LAG Expression Example 1 Output Table (Columns 1-4)....................................................352
Table 241: nPath LAG Expression Example 1 Output Table (Columns 5-6)....................................................352
Table 242: nPath LAG Expression Example 2 Input Table aggregate_clicks................................................... 353
Table 243: nPath LAG Expression Example 2 Output Table...............................................................................355
Table 244: nPath Filter Example Input Table clickstream................................................................................... 356
Table 245: nPath Filter Example Output Table..................................................................................................... 357
Table 246: nPath Aggregate Functions Example 1 Input Table trans1.............................................................. 359
Table 247: nPath Aggregate Functions Example 1 Output Table....................................................................... 360
Table 248: nPath Aggregate Functions Example 2 Input Table clicks..........................................360
Table 249: nPath Aggregate Functions Example 2 Output Table (Columns 1-4)............................................ 361
Table 250: nPath Aggregate Functions Example 2 Output Table (Columns 5-6)............................................ 361
Table 251: nPath Aggregate Functions Example 3 Output Table (Columns 1-5)............................................ 362
Table 252: nPath Aggregate Functions Example 3 Output Table (Columns 6-8)............................................ 362
Table 253: nPath Clickstream Data Examples Symbols and Symbol Predicates.............................................. 362
Table 254: nPath Range-Matching Example 1 Output Table.............................................................................. 365
Table 255: nPath Range-Matching Example 2 Output Table.............................................................................. 365
Table 256: nPath Range-Matching Example 3 Output Table.............................................................................. 366
Table 257: nPath Range-Matching Example 4 Output Table.............................................................................. 367
Table 258: nPath Range-Matching Example 5 Output Table.............................................................................. 368
Table 259: nPath Range-Matching Example 6 Output Table.............................................................................. 369
Table 260: nPath Range-Matching Example 7 Output Table.............................................................................. 369
Table 261: nPath Multiple-Input Example 2 Input Table impressions..............................................................371
Table 262: nPath Multiple-Input Example 2 Input Table clicks2....................................................................... 371
Table 263: nPath Multiple-Input Example 2 Input Table tv_spots.......................................................................... 372
Table 264: nPath Multiple-Input Example 2 Output Table................................................................................. 373
Table 265: Approximate Distinct Count Input Table Schema............................................................................ 377
Table 266: Approximate Distinct Count Output Table Schema......................................................................... 377
Table 267: Approximate Distinct Count Example Input Table crackers (Columns 1-8)................................378
Table 268: Approximate Distinct Count Example Input Table crackers (Columns 9-15)..............................378
Table 269: Approximate Distinct Count Example Output Table....................................................................... 379
Table 270: ApproxPercentileMap Input Table Schema........................................................................................381
Table 271: ApproxPercentileReduce Output Table Schema................................................................................382
Table 272: Approximate Percentile Example Input Table cracker (Columns 1-8).......................................... 382
Table 273: Approximate Percentile Example Input Table cracker (Columns 9-15)........................................ 383
Table 274: Approximate Percentile Example Output Table................................................................................ 384
Table 275: CMAVG Input Table Schema...............................................................................................386
Table 276: CMAVG Output Table Schema............................................................................................................386
Table 277: Input Table ibm_stock for Moving Average Function Examples........................................................ 387
Table 278: CMAVG Example Output Table.......................................................................................................... 388
Table 279: ConfusionMatrix Input Table Schema................................................................................................ 390
Table 280: ConfusionMatrix Output Table 1 (Confusion Matrix) Schema...................................................... 390
Table 281: ConfusionMatrix Output Table 2 (Overall Statistics) Schema.........................................................390
Table 282: ConfusionMatrix Output Table 3 (Class Statistics) Schema for Two Classes................................390
Table 283: ConfusionMatrix Output Table 3 (Class Statistics) Schema for More Than Two Classes...........391
Table 284: ConfusionMatrix Example Input Table iris_category_expect_predict...........................................391
Table 285: ConfusionMatrix Example Output Message...................................................................................... 393
Table 286: ConfusionMatrix Example Output Table confusionmatrix_output_1...........................................393
Table 287: ConfusionMatrix Example Output Table confusionmatrix_output_2...........................................393
Table 288: ConfusionMatrix Example Output Table confusionmatrix_output_3...........................................394
Table 289: Correlation (Corr_Map) Input Table Schema....................................................................................395
Table 290: Correlation (Corr_Reduce) Output Table Schema............................................................................396
Table 291: Correlation Example Input Table corr_input.....................................................................................396
Table 292: Correlation Example 1 Output Table...................................................................................................397
Table 293: Correlation Example 2 Output Table...................................................................................................398
Table 294: CoxPH Input Table Schema..................................................................................................................401
Table 295: CoxPH Coefficient Table Schema........................................................................................................ 401
Table 296: CoxPH Significance Codes....................................................................................................................402
Table 297: CoxPH Coefficient Table Schema........................................................................................................ 402
Table 298: CoxPH Linear Predictor Table Schema...............................................................................................402
Table 299: CoxPH Example Input Table lungcancer............................................................................................403
Table 300: CoxPH Example Output Table (Columns 1-5).................................................................................. 404
Table 301: CoxPH Example Output Table (Columns 6-9).................................................................................. 405
Table 302: CoxPH Example Output Table lungcancer_coef (Columns 1-5).................................................... 405
Table 303: CoxPH Example Output Table lungcancer_coef (Columns 6-9).................................................... 406
Table 304: CoxPH Example Output Table lungcancer_lp................................................................................... 406
Table 305: CoxPredict Predict Feature Table Schema..........................................................................................409
Table 306: CoxPredict Reference Feature Table Schema..................................................................................... 410
Table 307: CoxPredict Predict Output Table Schema (Predict_Feature_Columns Specified).......................410
Table 308: CoxPredict Predict Output Table Schema (Predict_Feature_Units_Columns Specified)...........410
Table 309: CoxPredict Example Input Table lc_new_predictors.......................................................411
Table 310: CoxPredict Example Input Table lc_new_reference........................................................ 411
Table 311: CoxPredict Example 1 Output Table (Columns 1-8)........................................................................ 412
Table 312: CoxPredict Example 1 Output Table (Columns 9-15)...................................................................... 412
Table 313: CoxPredict Example 2 Output Table (Columns 1-8)........................................................................ 413
Table 314: CoxPredict Example 2 Output Table (Columns 9-15)...................................................................... 414
Table 315: CoxPredict Example 3 Output Table (Columns 1-8)........................................................................ 414
Table 316: CoxPredict Example 3 Output Table (Columns 9-15)...................................................................... 415
Table 317: CoxPredict Example 4 Output Table (Columns 1-8)........................................................................ 417
Table 318: CoxPredict Example 4 Output Table (Columns 9-15)...................................................................... 417
Table 319: CoxPredict Example 5 Output Table................................................................................................... 418
Table 320: Continuous Distributions and Parameters......................................................................................... 422
Table 321: Discrete Distributions and Parameters................................................................................422
Table 322: Distribution Matching Input Table Schema....................................................................... 423
Table 323: Distribution Matching Output Table Schema....................................................................................423
Table 324: distnmatch (Hypothesis Test Mode) Example 1 Input Table raw_normal_50_2......................... 425
Table 325: distnmatch (Hypothesis Test Mode) Example 1 Output Table....................................................... 426
Table 326: distnmatch (Hypothesis Test Mode) Example 2 Input Table factory7.......................................... 426
Table 327: distnmatch (Hypothesis Test Mode) Example 2 Output Table....................................................... 428
Table 328: CoxSurvFit Predict Table Schema........................................................................................................436
Table 329: CoxSurvFit Predict Table Example...................................................................................................... 436
Table 330: CoxSurvFit Message Table Schema......................................................................................................436
Table 331: CoxSurvFit Output Table Schema........................................................................................................436
Table 332: CoxSurvFit Output Table...................................................................................................................... 437
Table 333: Cross-Validation Output Table Schema.............................................................................. 440
Table 334: Cross-Validation Example Input Table admissions_train................................................................441
Table 335: Cross-Validation Example Output Table............................................................................................443
Table 336: Cross-Validation Example Output Table glmcvtable........................................................................443
Table 337: Distribution Matching Input Table Schema....................................................................................... 447
Table 338: Distribution Matching Output Table Schema....................................................................................447
Table 339: distnmatch (Best Match Mode) Example 1 Output Table................................................................ 449
Table 340: distnmatch (Best Match Mode) Example 2 Input Table age_distribution..................................... 450
Table 341: distnmatch (Best Match Mode) Example 2 Output Table (Columns 1-4)..................................... 453
Table 342: distnmatch (Best Match Mode) Example 2 Output Table (Columns 5-7)..................................... 453
Table 343: distnmatch (Best Match Mode) Example 2 Output Table (Columns 8-9)..................................... 453
Table 344: EMAVG Input Table Schema............................................................................................................... 455
Table 345: EMAVG Output Table Schema............................................................................................................ 456
Table 346: EMAVG Example Input Table ibm_stock.......................................................................................... 456
Table 347: EMAVG Example Output Table.......................................................................................................... 457
Table 348: FMeasure Input Table Schema............................................................................................................. 459
Table 349: FMeasure Output Table Schema.......................................................................................................... 459
Table 350: FMeasure Examples Input Table computers_category..................................................................... 460
Table 351: FMeasure Example 1 Output Table......................................................................................................461
Table 352: FMeasure Example 2 Output Table......................................................................................................461
Table 353: Categorical Variables..............................................................................................................................462
Table 354: Supported Family/Link Function Combinations...............................................................................462
Table 355: Common Link Functions for Distribution Exponential Families................................................... 464
Table 356: GLM Input Table Schema..................................................................................................................... 467
Table 357: GLM Onscreen Output Columns.........................................................................................................467
Table 358: GLM Onscreen Output Row Parameters............................................................................................ 468
Table 359: GLM Onscreen Output Values in the Estimate Column.................................................................. 468
Table 360: GLM Output Table Columns................................................................................................................469
Table 361: GLM Output Table Parameters............................................................................................................ 469
Table 362: GLM Example 1 Input Table admissions_train................................................................................. 472
Table 363: GLM Example 1 Model Statistics......................................................................................................... 474
Table 364: GLM Example 1 Output Table (Columns 1-4)...................................................................................475
Table 365: GLM Example 1 Output Table (Columns 5-8)...................................................................................475
Table 366: GLM Example 2 Model Statistics......................................................................................................... 477
Table 367: GLM Example 2 Output Table..............................................................................................................479
Table 368: GLM Example 3 Input Table housing_train (Columns 1-7)............................................................480
Table 369: GLM Example 3 Input Table housing_train (Columns 8-14)..........................................................480
Table 370: GLM Example 3 Model Statistics......................................................................................................... 481
Table 371: GLM Example 3 Output Table glm_housing_model........................................................................ 482
Table 372: GLMPredict Input Table Schema.........................................................................................................485
Table 373: GLMPredict Output Table Schema......................................................................................................485
Table 374: GLMPredict Example 1 Input Table admissions_test....................................................................... 485
Table 375: GLMPredict Example 1 Output Table glmpredict_admissions.......................................................487
Table 376: GLMPredict Example 1 Output Table glmpredict_admissions.......................................................488
Table 377: GLMPredict Example 1 Prediction Accuracy.....................................................................................489
Table 378: GLMPredict Example 2 Input Table housing_test (Columns 1-7)................................................. 489
Table 379: GLMPredict Example 2 Input Table housing_test (Columns 8-14)............................................... 490
Table 380: GLMPredict Example 2 Output Table.................................................................................................490
Table 381: GLMPredict Example 2 RMSE............................................................................................................. 491
Table 382: HMM Models and Descriptions...........................................................................................................492
Table 383: Functions and Aster Distributed Platforms........................................................................................ 493
Table 384: HMMUnsupervisedLearner Example Vertices Table (Sequences)................................................. 496
Table 385: HMMUnsupervisedLearner Console Message Table Schema......................................................... 497
Table 386: HMMUnsupervisedLearner Initial-State Probability Table.............................................................497
Table 387: HMMUnsupervisedLearner State-Transition Probability Table.....................................................497
Table 388: HMMUnsupervisedLearner Observation Probability Table............................................................498
Table 389: HMMUnsupervisedLearner Example Observation Symbols........................................................... 498
Table 390: HMMUnsupervisedLearner Example Input Table loan_prediction...............................................499
Table 391: HMMUnsupervisedLearner Example Output Message....................................................................500
Table 392: HMMUnsupervisedLearner Example Output Table pi_loan......................................................501
Table 393: HMMUnsupervisedLearner Example Output Table A_loan........................................................... 501
Table 394: HMMUnsupervisedLearner Example Output Table B_loan........................................................... 502
Table 395: HMMSupervisedLearner Example Vertices Table (Sequences)......................................................... 505
Table 396: HMMSupervisedLearner Console Message Table Schema.............................................................. 506
Table 397: HMMSupervisedLearner Initial-State Probability Table.................................................................. 506
Table 398: HMMSupervisedLearner State-Transition Probability Table.......................................................... 506
Table 399: HMMSupervisedLearner Observation Probability Table................................................................. 507
Table 400: HMMSupervisedLearner Example Purchase Levels..........................................................................507
Table 401: HMMSupervisedLearner Example Input Table customer_loyalty..................................................508
Table 402: HMMSupervisedLearner Output Message......................................................................................... 510
Table 403: HMMSupervisedLearner Output Table pi_loyalty............................................................................ 510
Table 404: HMMSupervisedLearner Output Table A_loyalty............................................................................ 510
Table 405: HMMSupervisedLearner Output Table B_loyalty.............................................................................511
Table 406: HMMEvaluator Initial-State Probability Table Schema................................................................... 516
Table 407: HMMEvaluator State-Transition Probability Table Schema........................................................... 517
Table 408: HMMEvaluator Emission Probability Table Schema........................................................................517
Table 409: HMMEvaluator Output Table Schema................................................................................517
Table 410: HMMEvaluator Example Input Table test_loan_prediction.......................................................... 518
Table 411: HMMEvaluator Example Output Table..............................................................................................520
Table 412: HMMDecoder Initial-State Probability Table Schema..................................................................... 523
Table 413: HMMDecoder State Transition Probability Table Schema..............................................................523
Table 414: HMMDecoder Emission Probability Table Schema..........................................................................524
Table 415: HMMDecoder Output Table Schema..................................................................................524
Table 416: HMMDecoder Example 1 Output Table.............................................................................525
Table 417: HMMDecoder Example 2 Input Table customer_loyalty_newseq................................................. 527
Table 418: HMMDecoder Example 2 Output Table.............................................................................................529
Table 419: HMMDecoder Example 3 Input Table phrases..................................................................................530
Table 420: HMMDecoder Example 3 Input Table initial.....................................................................................530
Table 421: HMMDecoder Example 3 Input Table state_transition................................................................... 530
Table 422: HMMDecoder Example 3 Input Table emission............................................................................... 531
Table 423: HMMDecoder Example 3 Output Table.............................................................................................531
Table 424: HMMDecoder Example 4 Input Table churn_data...........................................................................532
Table 425: HMMDecoder Example 4 Input Table churn_initial........................................................................532
Table 426: HMMDecoder Example 4 Input Table churn_state_transition.......................................................533
Table 427: HMMDecoder Example 4 Input Table churn_emission.................................................................. 533
Table 428: HMMDecoder Example 4 Output Table.............................................................................................534
Table 429: Histogram Input Table Schema............................................................................................................537
Table 430: Histogram Output Table Schema.........................................................................................................538
Table 431: Histogram Example Input Table cars_hist..........................................................................................538
Table 432: Histogram Example 1 Output Table.................................................................................................... 540
Table 433: Histogram Example 1 Output Table cars_sturges_out..................................................................... 540
Table 434: Histogram Example 2 Output Table.................................................................................................... 541
Table 435: Histogram Example 2 Output Table cars_scott_out......................................................................... 541
Table 436: Histogram Example 3 Output Table.................................................................................................... 541
Table 437: Histogram Example 3 Output Table cars_hist_out........................................................................... 542
Table 438: KNN Training Table Schema................................................................................................................546
Table 439: KNN Test Table Schema........................................................................................................................546
Table 440: KNN Output Table Schema.................................................................................................................. 546
Table 441: KNN Example Training Table computers_train1_clustered........................................................... 547
Table 442: KNN Example Price Categories............................................................................................................547
Table 443: KNN Example computers_test1........................................................................................................... 548
Table 444: KNN Example Output Table knn_output...........................................................................................549
Table 445: LARS Input Table Schema.....................................................................................................................553
Table 446: LARS Output Table Schema..................................................................................................................553
Table 447: LARS Examples Input Table diabetes, Columns 1-6........................................................ 554
Table 448: LARS Examples Input Table diabetes, Columns 7-12...................................................... 554
Table 449: LARS Example 1 Output Message........................................................................................................555
Table 450: LARS Example 1 Output Table diabetes_lars, Columns 1–7........................................................... 555
Table 451: LARS Example 1 Output Table diabetes_lars, Columns 8–16......................................................... 556
Table 452: LARS Example 2 Output Message........................................................................................................557
Table 453: LARS Example 2 Output Table diabetes_lasso, Columns 1–7......................................................... 557
Table 454: LARS Example 2 (LASSO) Output Table diabetes_lasso, Columns 8–16...................................... 558
Table 455: LARSPredict Data Table Schema......................................................................................................... 560
Table 456: LARSPredict Output Table Schema.....................................................................................................561
Table 457: LARSPredict Example Data Table diabetes_test, Columns 1-6.........................................561
Table 458: LARSPredict Example Data Table diabetes_test, Columns 7-12.......................................562
Table 459: LARSPredict Example 1 Output Table, Columns 1–7...................................................................... 562
Table 460: LARSPredict Example 1 Output Table, Columns 8–13.................................................................... 562
Table 461: LARSPredict Example 2 Output Table, Columns 1–7...................................................................... 563
Table 462: LARSPredict Example 2 Output Table, Columns 8–13.................................................................... 563
Table 463: LinRegMatrix Input Table Schema...................................................................................... 565
Table 464: LinReg Output Table Schema............................................................................................................... 565
Table 465: LinRegMatrix Example Input Table housing_data............................................................................566
Table 466: LinReg Example Output Table............................................................................................................. 566
Table 467: LRTEST Output Table Schema.............................................................................................................568
Table 468: LRTEST Example Input Table glm_tempdamage............................................................................. 569
Table 469: LRTEST Example Output Table........................................................................................................... 570
Table 470: LRTEST Example 1 Model Table damage_glm1................................................................................571
Table 471: LRTEST Example Output Table........................................................................................................... 571
Table 472: LRTEST Example Model Table damage_glm2...................................................................................572
Table 473: LRTEST Example Output Table........................................................................................................... 572
Table 474: Percentile Input Table Schema............................................................................................................. 573
Table 475: Percentile Output Table Schema.......................................................................................................... 573
Table 476: Percentile Example Input Table london_olympics............................................................................574
Table 477: Percentile Example Output Table.........................................................................................................575
Table 478: PCA_Map Input Table Schema............................................................................................................ 577
Table 479: PCA_Reduce Output Table Schema.................................................................................................... 577
Table 480: PCA Example Input Table patient_pca_input................................................................................... 578
Table 481: PCA Example Output Table pca_health_ev (Columns 1-5)............................................................ 579
Table 482: PCA Example Output Table pca_health_ev (Columns 6-9)............................................................ 580
Table 483: PCA Example Output Table pca_health_ev (Columns 10-13)........................................................ 580
Table 484: Values Derived from PCA Example Output Table pca_health_ev..................................................581
Table 485: PCA Example Output Table pca_health_pc....................................................................................... 583
Table 486: PCA Example View v_pca_health_corr_input, Columns 1-8..........................................................584
Table 487: PCA Example View v_pca_health_corr_input, Columns 9-12........................................................585
Table 488: PCA Example pca_1 Correlation Coefficients....................................................................................587
Table 489: PCA Example pca_2 Correlation Coefficients....................................................................................587
Table 490: PCA Example pca_3 Correlation Coefficients....................................................................................588
Table 491: PCAPlot Input Table Schema............................................................................................................... 589
Table 492: PCAPlot Output Table Schema............................................................................................................ 589
Table 493: PCAPlot Example Output Table...........................................................................................................590
Table 494: RandomSample Input Table Schema...................................................................................................594
Table 495: RandomSample Output Table Schema................................................................................................594
Table 496: RandomSample Examples Input Table fs_input................................................................................595
Table 497: RandomSample Example 1 Output Table...........................................................................................597
Table 498: RandomSample Example 2 Output Table...........................................................................................598
Table 499: RandomSample Example 3 Output Table...........................................................................................599
Table 500: Sample Data Table Schema................................................................................................................... 602
Table 501: Sample Summary Table Schema...........................................................................................................603
Table 502: Sample Output Table Schema...............................................................................................................603
Table 503: Sample Example Input Table students................................................................................................ 603
Table 504: Sample Example Input Table score_category..................................................................................... 604
Table 505: Sample Example 1 Output Table.......................................................................................................... 605
Table 506: Sample Example 2 Output Table.......................................................................................................... 606
Table 507: Sample Example 3 Summary Table......................................................................................................607
Table 508: Sample Example 3 Output Table.......................................................................................................... 607
Table 509: Sample Example 4 Output Table.......................................................................................... 608
Table 510: GenerateCombination Input Table Schema....................................................................... 610
Table 511: GenerateCombination Output Table Schema.................................................................................... 611
Table 512: SortCombination Input Table Schema................................................................................................612
Table 513: SortCombination Output Table Schema.............................................................................................612
Table 514: AddOnePlayer Output Table Schema..................................................................................................613
Table 515: AddOnePlayer Example 1 Input Table project_cost......................................................................... 615
Table 516: AddOnePlayer Example 1 Output Table (Generate Payoff Tables for Each Combination)........616
Table 517: AddOnePlayer Example 1 Output Table (Add One Player to Each Combination)......................617
Table 518: AddOnePlayer Example 1 Output Table (Shapley Values Computation)..................................... 618
Table 519: AddOnePlayer Example 2 Input Table Schema................................................................................. 618
Table 520: AddOnePlayer Example 2 Output Table atrbtn_table_old_direct_noprsnt..................................620
Table 521: AddOnePlayer Example 2 Output Table atrbtn_old_dr_pth_noprsnt_conv................................622
Table 522: AddOnePlayer Example 2 Output Table atrbtn_old_dr_pth_noprsnt_tot....................................624
Table 523: AddOnePlayer Example 2 Output Table.............................................................................................625
Table 524: AddOnePlayer Example 2 Output Table.............................................................................................626
Table 525: AddOnePlayer Example 2 Output Table.............................................................................................629
Table 526: SMAVG Input Table Schema................................................................................................................631
Table 527: SMAVG Output Table Schema.............................................................................................................631
Table 528: SMAVG Example Input Table ibm_stock.......................................................................................... 631
Table 529: SMAVG Example Output Table...........................................................................................................632
Table 530: SparseSVMTrainer Input Table Schema............................................................................................. 637
Table 531: SparseSVMTrainer Console Message Table Schema.........................................................................637
Table 532: SparseSVMTrainer Example Input Table svm_iris........................................................................... 637
Table 533: SparseSVMTrainer Example Input Table svm_iris_input................................................................638
Table 534: SparseSVMTrainer Example Input Table svm_iris_input_train.....................................................639
Table 535: SparseSVMTrainer Example Output Message................................................................................... 640
Table 536: SparseSVMPredictor Sample Table Schema.......................................................................................641
Table 537: SparseSVMPredictor Output Table Schema.......................................................................................642
Table 538: SparseSVMPredictor Example Input Table svm_iris_input_test.................................................... 642
Table 539: SparseSVMPredictor Example Output Table..................................................................................... 643
Table 540: SparseSVMPredictor Example 2 Prediction Accuracy......................................................................644
Table 541: SVMModelPrinter Console Message Table Schema (Summary('true')).........................................645
Table 542: SVMModelPrinter Output Table Schema (Summary('false'))......................................................... 646
Table 543: SVMModelPrinter Example 1 Output Message................................................................................. 646
Table 544: SVMModelPrinter Example 2 Output Table...................................................................................... 647
Table 545: DenseSVMTrainer Input Table Schema............................................................................................. 651
Table 546: DenseSVMTrainer Example Input Table svm_iris............................................................................652
Table 547: DenseSVMTrainer Example Train Set Table svm_iris_train...........................................................653
Table 548: DenseSVMTrainer Example Test Set Table svm_iris_test............................................................... 653
Table 549: DenseSVMTrainer Example 1 Output Table......................................................................................654
Table 550: DenseSVMTrainer Example 2 Output Table......................................................................................655
Table 551: DenseSVMTrainer Example 3 Output Table......................................................................................656
Table 552: DenseSVMTrainer Example 4 Output Table......................................................................................657
Table 553: DenseSVMPredictor Input Sample Table Schema............................................................................ 658
Table 554: DenseSVMPredictor Output Sample Table Schema......................................................................... 659
Table 555: DenseSVMPredictor Example 1 Output Table.................................................................................. 660
Table 556: DenseSVMPredictor Example 2 Output Table.................................................................. 661
Table 557: DenseSVMPredictor Example 3 Output Table.................................................................. 662
Table 558: DenseSVMPredictor Example 4 Output Table.................................................................................. 664
Table 559: DenseSVMModelPrinter Console Message Table Schema (Summary('true'))..............................666
Table 560: DenseSVMModelPrinter Output Table Schema (Summary('false')).............................................. 666
Table 561: DenseSVMModelPrinter Example Output Table.............................................................................. 666
Table 562: DenseSVMModelPrinter Example Output Table.............................................................................. 667
Table 563: VectorDistance Target Input Table Schema.......................................................................................672
Table 564: VectorDistance Reference Input Table Schema................................................................................. 672
Table 565: VectorDistance Output Table Schema................................................................................................ 672
Table 566: VectorDistance Examples Raw Input Data.........................................................................................673
Table 567: VectorDistance Examples Normalized Input Data............................................................................673
Table 568: VectorDistance Examples Minimum and Maximum Values.......................................................... 674
Table 569: VectorDistance Examples Reference Table ref_mobile_data...........................................................674
Table 570: VectorDistance Examples Target Table target_mobile_data........................................................... 674
Table 571: VectorDistance Example 1 Output Table............................................................................................675
Table 572: VectorDistance Example 1 Target Distances from Reference and Similarity Ranks.................... 675
Table 573: VectorDistance Example 2 Output Table............................................................................................676
Table 574: VWAP Input Table Schema.................................................................................................................. 677
Table 575: VWAP Output Table Schema...............................................................................................................678
Table 576: VWAP Example Input Table stock_vol.............................................................................................. 678
Table 577: VWAP Example Output Table............................................................................................................. 680
Table 578: WMAVG Input Table Schema............................................................................................................. 682
Table 579: WMAVG Output Table Schema.......................................................................................................... 682
Table 580: WMAVG Example Input Table stock_data........................................................................................682
Table 581: WMAVG Example Output Table.........................................................................................................684
Table 582: LDATrainer Training Table Schema................................................................................................... 688
Table 583: LDATrainer Output Message Schema.................................................................................................689
Table 584: LDATrainer Model Table Schema....................................................................................................... 689
Table 585: LDATrainer Output Table Schema......................................................................................................689
Table 586: LDATrainer Example Training Table complaints............................................................................. 690
Table 587: LDATrainer Example Tokenized and Filtered Input Table complaints_traintoken.................... 691
Table 588: LDATrainer Example Message............................................................................................................. 692
Table 589: LDATrainer Example Output Table ldaout1......................................................................................692
Table 590: LDAInference Example Input Table complaints_test....................................................................... 694
Table 591: LDAInference Example Tokenized and Filtered Input Table complaints_testtoken................... 696
Table 592: LDAInference Example Output Message............................................................................................697
Table 593: LDAInference Example Output Table ldaout2.................................................................................. 697
Table 594: LDATopicPrinter Output Message Schema....................................................................................... 699
Table 595: LDATopicPrinter Output Table (showsummary=false and outputbyword=true)....................... 699
Table 596: LDATopicPrinter Output Table (showsummary=false and outputbyword=false)...................... 699
Table 597: LDATopicPrinter Example 1 Output Message...................................................................................700
Table 598: LDATopicPrinter Example 2 Output Table........................................................................................700
Table 599: LDATopicPrinter Example 3 Output Table........................................................................................701
Table 600: Levenshtein Distance (LDist) Input Table Schema........................................................................... 703
Table 601: Levenshtein Distance (LDist) Output Table Schema........................................................................ 703
Table 602: Levenshtein Distance (LDist) Example Input Table levendist_input............................................. 704
Table 603: Levenshtein Distance (LDist) Example Output Table.......................................................704
Table 604: NaiveBayesTextClassifierTrainer Token Table Schema................................................................... 708
Table 605: NaiveBayesTextClassifierTrainer Categories Table Schema............................................................ 709
Table 606: NaiveBayesTextClassifierTrainer Stop_Words Table Schema.........................................................709
Table 607: NaiveBayesTextClassifierTrainer Model Table Schema................................................................... 709
Table 608: NaiveBayesTextClassifierTrainer Model Table Example..................................................................709
Table 609: NaiveBayesTextClassifierTrainer English Example Training Table complaints........................... 710
Table 610: NaiveBayesTextClassifierTrainer English Example Model Table complaints_tokens_model....711
Table 611: NaiveBayesTextClassifierPredict Input Table Schema......................................................................715
Table 612: NaiveBayesTextClassifierPredict Output Table Schema...................................................................715
Table 613: NaiveBayesTextClassifierPredict English Example Input Table complaints................................. 716
Table 614: NaiveBayesTextClassifierPredict English Example Output Table complaints_tokens_model... 717
Table 615: NaiveBayesTextClassifierPredict Chinese Example Output Table..................................................719
Table 616: NERTrainer Default Extractor Classes and Features.........................................................................722
Table 617: Selected Features for Templates in NERTrainer Example Template File.......................................724
Table 618: NERTrainer Input Table Schema......................................................................................................... 724
Table 619: NERTrainer Output Message Schema................................................................................................. 724
Table 620: NERTrainer Example Input Table ner_sports_train.........................................................................725
Table 621: NERTrainer Example Output Table.................................................................................................... 726
Table 622: NER Input Table Schema...................................................................................................................... 727
Table 623: NER Rules Table Schema.......................................................................................................................728
Table 624: NER Dictionary Table Schema............................................................................................................. 728
Table 625: NER Output Table Schema................................................................................................................... 728
Table 626: NER Example Input Table ner_sports_test.........................................................................................729
Table 627: NER Example Rules Table rule_table.................................................................................................. 729
Table 628: NER Example Output Table..................................................................................................................730
Table 629: NEREvaluator Output Table Schema.................................................................................................. 732
Table 630: NEREvaluator Example Output Table.................................................................................................733
Table 631: Default English-Language Models in Table nameFind_configure.................................................. 737
Table 632: FindNamedEntity Input Table Schema...............................................................................................737
Table 633: FindNamedEntity Configuration Table Schema................................................................................737
Table 634: FindNamedEntity Output Table Schema............................................................738
Table 635: FindNamedEntity Example Input Table assortedtext_input........................................................... 738
Table 636: FindNamedEntity Example Output Table.......................................................................................... 739
Table 637: TrainNamedEntityFinder Input Table Schema..................................................................................741
Table 638: TrainNamedEntityFinder Output Message Schema..........................................................................741
Table 639: TrainNamedEntityFinder Example Input Table nermem_sports_train........................................ 741
Table 640: TrainNamedEntityFinder Example Output Table.............................................................................742
Table 641: EvaluateNamedEntityFinderRow Input Table Schema.................................................................... 743
Table 642: EvaluateNamedEntityFinderPartition Output Table Schema..........................................................744
Table 643: EvaluateNamedEntityFinderRow Example Input Table nermem_sports_test..............................744
Table 644: EvaluateNamedEntityFinderPartition Example Output Table........................................................ 745
Table 645: nGram Input Table Schema.......................................................................................747
Table 646: nGram Output Table Schema....................................................................................748
Table 647: nGram Example Input Table paragraphs_input................................................................................ 748
Table 648: nGram Example 1 Output Table.......................................................................................................... 749
Table 649: nGram Example 2 Output Table.......................................................................................................... 750
Table 650: Chinese POS Tags: Verb, adjective.......................................................................................................752
Table 651: Chinese POS Tags: Noun.......................................................................................................................753
Table 652: Chinese POS Tags: Localizer.................................................................................................................753
Table 653: Chinese POS Tags: Pronoun................................................................................................................. 753
Table 654: Chinese POS Tags: Determiner and number......................................................................................753
Table 655: Chinese POS Tags: Measure word........................................................................................................753
Table 656: Chinese POS Tags: Adverb....................................................................................................................753
Table 657: Chinese POS Tags: Preposition............................................................................................................ 753
Table 658: Chinese POS Tags: Conjunction.......................................................................................................... 753
Table 659: Chinese POS Tags: Particle....................................................................................................................754
Table 660: Chinese POS Tags: Others.....................................................................................................................754
Table 661: POSTagger Input Table Schema...........................................................................................................755
Table 662: POSTagger Output Table Schema........................................................................................................755
Table 663: POSTagger Example Output Table...................................................................................................... 756
Table 664: Sentenizer Input Table Schema............................................................................................................ 760
Table 665: Sentenizer Output Table Schema......................................................................................................... 760
Table 666: Sentenizer Example Input Table paragraphs_input.......................................................................... 760
Table 667: Sentenizer Example Output Table........................................................................................................761
Table 668: TrainSentimentExtractor Input Table Schema.................................................................................. 765
Table 669: TrainSentimentExtractor Output Message Schema.......................................................................... 765
Table 670: TrainSentimentExtractor Example Input Table sentiment_train................................................... 765
Table 671: TrainSentimentExtractor Example Output Table..............................................................................766
Table 672: ExtractSentiment Input Table Schema................................................................................................ 769
Table 673: ExtractSentiment Dictionary Table Schema....................................................................................... 769
Table 674: ExtractSentiment Output Table Schema............................................................................................. 769
Table 675: ExtractSentiment Examples Input Table sentiment_extract_input................................................ 770
Table 676: ExtractSentiment Example 1 Output Table........................................................................................ 772
Table 677: ExtractSentiment Example 2 Output Table........................................................................................ 773
Table 678: ExtractSentiment Example 3 Output Table........................................................................................ 774
Table 679: ExtractSentiment Example 4 Output Table........................................................................................ 775
Table 680: ExtractSentiment Example 5 Dictionary Table sentiment_word.................................................... 776
Table 681: ExtractSentiment Example 5 Output Table........................................................................................ 777
Table 682: EvaluateSentimentExtractor Input Table Schema............................................................................. 778
Table 683: EvaluateSentimentExtractor Output Table Schema.......................................................................... 779
Table 684: EvaluateSentimentExtractor Example 1 Output Table..................................................................... 780
Table 685: EvaluateSentimentExtractor Example 2 Output Table..................................................................... 781
Table 686: EvaluateSentimentExtractor Example 3 Output Table..................................................................... 781
Table 687: EvaluateSentimentExtractor Example 4 Output Table..................................................................... 782
Table 688: TextClassifierTrainer Input Table Schema......................................................................................... 786
Table 689: TextClassifierTrainer Output Message Schema................................................................................. 786
Table 690: TextClassifierTrainer Example Input Table texttrainer_input........................................................ 786
Table 691: TextClassifierTrainer Example Output Table.................................................................................... 787
Table 692: TextClassifier Input Table Schema.......................................................................................................788
Table 693: TextClassifier Output Table Schema................................................................................................... 789
Table 694: TextClassifier Example Input Table: textclassifier_input................................................................. 789
Table 695: TextClassifier Example Output Table..................................................................................................790
Table 696: TextClassifierEvaluator Input Table Schema......................................................................................791
Table 697: TextClassifierEvaluator Output Table Schema.................................................................................. 791
Table 698: TextClassifierEvaluator Example Output Table.................................................................................792
Table 699: Text_Parser Input Table Schema..........................................................................................................795
Table 700: Text_Parser Output Table Schema, Output_By_Word ('true')....................................................... 795
Table 701: Text_Parser Output Table Schema, Output_By_Word ('false').......................................................796
Table 702: Text_Parser Examples Input Table complaints..................................................................................796
Table 703: Text_Parser Example 1 Output Table complaints_traintoken........................................................ 797
Table 704: Text_Parser Example 2 Input Table complaints_mini......................................................................798
Table 705: Text_Parser Example 2 Output Table....................................................................798
Table 706: TextChunker Output Table Schema.................................................................................................... 800
Table 707: TextChunker Phrase Type Tags............................................................................................................801
Table 708: TextChunker Example 1 Input Table cities........................................................................................ 801
Table 709: TextChunker Example 1 Output Table............................................................................................... 802
Table 710: TextChunker Example 2 Input Table paragraphs_input.................................................................. 803
Table 711: TextChunker Example 2 Output Table............................................................................................... 804
Table 712: Examples of Words and Their Standard Forms................................................................................. 806
Table 713: TextMorph Input Table Schema.......................................................................................................... 807
Table 714: English POSTagger Tags and Corresponding TextMorph Tags......................................................807
Table 715: TextMorph Output Table Schema.......................................................................... 809
Table 716: TextMorph Examples 1-4 Input Table words_input.........................................................................809
Table 717: TextMorph Example 1 Output Table...................................................................................................810
Table 718: TextMorph Example 2 Output Table...................................................................................................811
Table 719: TextMorph Example 3 Output Table...................................................................................................812
Table 720: TextMorph Example 4 Output Table...................................................................................................813
Table 721: TextMorph Example 5 POSTagger Input Table pos_input..............................................................813
Table 722: TextMorph Example 5 Table postagger_output................................................................................ 814
Table 723: TextMorph Example 5 Table textmorph_output...............................................................................815
Table 724: TextMorph Example 5 TextTagging Output Table........................................................................... 817
Table 725: Rule Operations...................................................................................................................................... 821
Table 726: TextTagging Text Table Schema.......................................................................................................... 822
Table 727: TextTagging Rules Table Schema.........................................................................................................823
Table 728: TextTagging Output Table Schema..................................................................................................... 823
Table 729: TextTagging Examples 1-4 Input Table: text_inputs........................................................................ 823
Table 730: TextTagging Example 1 Output Table.................................................................................................824
Table 731: TextTagging Example 2 Rule Inputs Table rule_inputs....................................................................825
Table 732: TextTagging Example 2 Output Table.................................................................................................825
Table 733: TextTagging Example 3 Output Table.................................................................................................826
Table 734: TextTagging Example 4 Output Table.................................................................................................827
Table 735: TextTokenizer Input Table Schema..................................................................................................... 829
Table 736: TextTokenizer Dictionary Table Schema............................................................................................829
Table 737: TextTokenizer Dictionary Table and User Dictionary File Format................................................ 829
Table 738: TextTokenizer Output Table Schema for OutputByWord ('true').................................................. 830
Table 739: TextTokenizer Output Table Schema for OutputByWord ('false')................................................. 830
Table 740: TextTokenizer Example 1 Input Table cn_input............................................................................... 830
Table 741: TextTokenizer Example 1 Dictionary Table cn_dict.........................................................................830
Table 742: TextTokenizer Example 1 Output Table 1.......................................................................................... 831
Table 743: TextTokenizer Example 1 Output Table 2.......................................................................................... 831
Table 744: TextTokenizer Example 2 Input Table jp_input................................................................................ 832
Table 745: TextTokenizer Example 2 Japanese Dictionary jp_dict.................................................................... 832
Table 746: TextTokenizer Example 2 Output Table 1.......................................................................................... 832
Table 747: TextTokenizer Example 2 Output Table 2.......................................................................................... 833
Table 748: TextTokenizer Example 3 Input Table complaints............................................................................833
Table 749: TextTokenizer Example 3 Output Table............................................................................................. 834
Table 750: TF Input Table (Document Set) Schema.............................................................................................837
Table 751: TF Output and TF_IDF Input Table Schema..................................................................................... 837
Table 752: TF_IDF doccount Table Schema..........................................................................................................837
Table 753: TF_IDF docperterm Table Schema......................................................................................................838
Table 754: TF_IDF Output Schema........................................................................................................................ 838
Table 755: TF_IDF Example 1 Input Table tfidf_train.........................................................................................838
Table 756: TF_IDF Example 1 Output Table tfidf_input1.................................................................................. 840
Table 757: TF_IDF Example 1 Output Table.........................................................................................................841
Table 758: TF_IDF Example 2 Input Table tfidf_test........................................................................................... 842
Table 759: TF_IDF Example 2 Output Table tfidf_input2.................................................................................. 843
Table 760: TF_IDF Example 2 Output Table........................................................................844
Table 761: Canopy Input Table Schema................................................................................................................. 849
Table 762: Canopy Output Table Schema.............................................................................................................. 849
Table 763: Canopy Example Input Table computers_train1...............................................................................849
Table 764: Canopy Example Output Table............................................................................................................ 850
Table 765: GMMFit input_table Schema................................................................................................................853
Table 766: GMMFit init_params Table Schema....................................................................................................853
Table 767: GMMFit Output Message Properties...................................................................................................854
Table 768: GMMFit output_table Schema for PackOutput('false').................................................................... 855
Table 769: GMMFit output_table Schema for PackOutput('true')..................................................................... 855
Table 770: GMMFit Example ‘Iris’ Dataset gmm_iris_input.............................................................................. 856
Table 771: GMMFit Example 1 Input Table gmm_iris_train..............................................................................857
Table 772: GMMFit Example 1 Output Message Table....................................................................................... 858
Table 773: GMMFit Example 1 Output Table gmm_output_ex1 (Columns 1-4)............................................859
Table 774: GMMFit Example 1 Output Table gmm_output_ex1 (Columns 5-8)............................................859
Table 775: GMMFit Example 2 Output Message Table....................................................................................... 860
Table 776: GMMFit Example 2 Output Table gmm_output_ex2 (Columns 1-6)............................................860
Table 777: GMMFit Example 2 Output Table gmm_output_ex2 (Columns 7-11)..........................................860
Table 778: GMMFit Example 2 Output Table gmm_output_ex2 (Columns 12-14)....................................... 861
Table 779: GMMFit Example 3 Output Message Table....................................................................................... 861
Table 780: GMMFit Example 3 Output Table: dpgmm_output_ex3 (Columns 1-6)...................................... 862
Table 781: GMMFit Example 3 Output Table: dpgmm_output_ex3 (Columns 7-11).................................... 862
Table 782: GMMFit Example 3 Output Table: dpgmm_output_ex3 (Columns 12-16)..................................862
Table 783: GMMFit Example 3 Output Table: dpgmm_output_ex3 (Columns 17-20)..................................862
Table 784: GMMPredict Output Table Schema for OutputFormat('sparse')....................................................864
Table 785: GMMPredict Output Table Schema for OutputFormat('dense').................................................... 865
Table 786: GMMPredict Example Input Table gmm_iris_test........................................................................... 865
Table 787: GMMPredict Example Output Table (Columns 1-4)....................................................................... 867
Table 788: GMMPredict Example Output Table (Columns 5-8)....................................................................... 867
Table 789: GMMProfile Output Table Schema..................................................................................................... 869
Table 790: GMMProfile Example 1 Output Table................................................................................................ 869
Table 791: GMMProfile Example 2 Output Table................................................................................................ 870
Table 792: GMMProfile Example 3 Output Table................................................................................................ 871
Table 793: KMeans Input Table Schema................................................................................................................ 874
Table 794: KMeans Results Messages Table Schema............................................................................................ 875
Table 795: KMeans Results Messages..................................................................................................................... 875
Table 796: KMeans Output Table Schema for UnpackColumns('false') (Default).......................................... 876
Table 797: KMeans Output Table Schema for UnpackColumns('true')............................................................ 876
Table 798: KMeans Clustered Output Table Schema........................................................................................... 876
Table 799: KMeans Examples Input Table computers_train1............................................................................ 877
Table 800: KMeans Example 1 Results Message Table.........................................................................................878
Table 801: KMeans Example 1 Output Table kmeanssample_centroid............................................................ 878
Table 802: KMeans Example 2 Results Message Table.........................................................................................879
Table 803: KMeans Example 2 Output Table kmeanssample_centroid (Columns 1-4)................................. 880
Table 804: KMeans Example 2 Output Table kmeanssample_centroid (Columns 5-8)................................. 880
Table 805: KMeans Example 3 Results Message Table.........................................................................................881
Table 806: KMeans Example 3 Output Table kmeanssample_output............................................................... 882
Table 807: KMeans Example 3 Output Table kmeanssample_clusteredoutput............................................... 882
Table 808: KMeans Example 4 Results Message Table.........................................................................................883
Table 809: KMeans Example 4 Output Table kmeanssample_output............................................................... 884
Table 810: KMeans Example 4 Output Table kmeanssample_clusteredoutput............................................... 885
Table 811: KMeansPlot Output Table Schema...................................................................................................... 887
Table 812: KMeansPlot Example Input Table computers_test1......................................................................... 887
Table 813: KMeansPlot Example Input Table kmeanssample_centroid........................................................... 888
Table 814: KMeansPlot Example Output Table.................................................................................................... 889
Table 815: KModes Input Table Schema................................................................................................................892
Table 816: KModes Output Summary Table........................................................................... 893
Table 817: KModes Output Model Table.................................................................................894
Table 818: KModes Example Input Table kmodes_input (Columns 1-5).........................................................895
Table 819: KModes Example Input Table kmodes_input (Columns 6-12).......................................................896
Table 820: KModes Example Input Table kmodes_init (Columns 1-5)............................................................ 897
Table 821: KModes Example Input Table kmodes_init (Columns 6-12).......................................................... 897
Table 822: KModes Example 1 Output Table........................................................................................................ 898
Table 823: KModes Example 1 Output Table kmodes_clusters (Columns 1-5)...............................................898
Table 824: KModes Example 1 Output Table kmodes_clusters (Columns 6-12).............................................898
Table 825: KModes Example 1 Output Table kmodes_clusters (Columns 13-16)...........................................898
Table 826: KModes Example 2 Output Table........................................................................................................ 899
Table 827: KModes Example 2 Output Table kmodes_clusters1 (Columns 1-6).............................................899
Table 828: KModes Example 2 Output Table kmodes_clusters1 (Columns 7-13)...........................................899
Table 829: KModes Example 2 Output Table kmodes_clusters1 (Columns 14-17)........................................ 900
Table 830: KModesPredict Output Table Schema.................................................................901
Table 831: KModesPredict Example Output Table (Columns 1-6)....................................................902
Table 832: KModesPredict Example Output Table (Columns 7-14)..................................................903
Table 833: Minhash Example Items and Itemids.................................................................................................. 906
Table 834: Minhash Example Input Table salesdata.............................................................................................907
Table 835: Minhash Example Output Table.......................................................................................................... 908
Table 836: NaiveBayesMap Input (Training) Table Schema............................................................................... 911
Table 837: NaiveBayesReduce Output (Model) Table Schema........................................................................... 912
Table 838: Naive Bayes Example Iris Table nb_input_iris...................................................................................912
Table 839: Naive Bayes Example Train Table nb_iris_input_train.................................................................... 913
Table 840: Naive Bayes Example Test Table nb_iris_input_test.........................................................................914
Table 841: Naive Bayes Example Model Table nb_iris_model............................................................................915
Table 842: Naive Bayes Example Output Table.....................................................................................................916
Table 843: Naive Bayes Example Prediction Accuracy.........................................................................................917
Table 844: NaiveBayesPredict Output Table Schema...........................................................................................919
Table 845: Naive Bayes Example Iris Table nb_input_iris...................................................................................919
Table 846: Naive Bayes Example Train Table nb_iris_input_train.................................................................... 920
Table 847: Naive Bayes Example Test Table nb_iris_input_test.........................................................................921
Table 848: Naive Bayes Example Model Table nb_iris_model............................................................................922
Table 849: Naive Bayes Example Output Table.....................................................................................................923
Table 850: Naive Bayes Example Prediction Accuracy.........................................................................................924
Table 851: Naive Bayes Example Iris Table nb_input_iris...................................................................................925
Table 852: Naive Bayes Example Train Table nb_iris_input_train.................................................................... 926
Table 853: Naive Bayes Example Test Table nb_iris_input_test.........................................................................926
Table 854: Naive Bayes Example Model Table nb_iris_model............................................................................928
Table 855: Naive Bayes Example Output Table.....................................................................................................929
Table 856: Naive Bayes Example Prediction Accuracy.........................................................................................930
Table 857: Forest_Drive Input Table Schema........................................................................................................937
Table 858: Forest_Drive Output Table Schema.....................................................................................................938
Table 859: Forest_Drive Example Input Data Descriptions................................................................................ 939
Table 860: Forest_Drive Example Input Table housing_train (Columns 1-7)................................................. 939
Table 861: Forest_Drive Example Input Table housing_train (Columns 8-14)............................................... 940
Table 862: Forest_Drive Example Output Summary Table.................................................................................941
Table 863: Forest_Drive Example Output Model Table rft_model.................................................................... 941
Table 864: Forest_Predict Input Table Schema..................................................................................................... 945
Table 865: Forest_Predict Output Table Schema..................................................................................................945
Table 866: Forest_Predict Example Input Table housing_test (Columns 1-7).................................................946
Table 867: Forest_Predict Example Input Table housing_test (Columns 8-14)...............................................946
Table 868: Forest_Predict Example Output Table................................................................................................ 947
Table 869: Forest_Predict Accuracy........................................................................................................................949
Table 870: Forest_Analyze Output Table Schema................................................................................................ 950
Table 871: Forest_Analyze Example 1 Output Table............................................................................................951
Table 872: Forest_Analyze Example 2 Output Table............................................................................................953
Table 873: Single_Tree_Drive Input Table Schema..............................................................................................958
Table 874: Single_Tree_Drive Attribute Table Schema....................................................................................... 959
Table 875: Single_Tree_Drive Response Table Schema....................................................................................... 959
Table 876: Single_Tree_Drive Splits Table Schema.............................................................................................. 960
Table 877: Single_Tree_Drive Categorical Splits Table Schema......................................................................... 960
Table 878: Single_Tree_Drive Console Message Table Schema......................................................................... 960
Table 879: Single_Tree_Drive Model Table Schema............................................................................................ 960
Table 880: Single_Tree_Drive Intermediate Splits Table Schema...................................................................... 962
Table 881: Single_Tree_Drive Output Response Table Schema......................................................................... 962
Table 882: Single_Tree_Drive Example 1 Iris Table iris_input...........................................................................963
Table 883: Single_Tree_Drive Example 1 Train Table iris_train........................................................................963
Table 884: Single_Tree_Drive Example 1 Test Table iris_test............................................................................ 964
Table 885: Single_Tree_Drive Example 1 Attribute Table iris_attribute_train................................................ 965
Table 886: Single_Tree_Drive Example 1 Attribute Table iris_attribute_test.................................................. 966
Table 887: Single_Tree_Drive Example 1 Response Table iris_response_train............................................... 967
Table 888: Single_Tree_Drive Example 1 Response Table iris_response_test..................................................968
Table 889: Single_Tree_Drive Example 1 Output Message.................................................................................969
Table 890: Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 1-6).....................969
Table 891: Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 7-11)...................970
Table 892: Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 12-19).................970
Table 893: Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 20-22).................970
Table 894: Single_Tree_Drive Example 1 Output Table splits_small................................................................ 971
Table 895: Single_Tree_Drive Example 2 Input Table iris_altinput.................................................................. 971
Table 896: Single_Tree_Drive Example 2 Output Table......................................................................................972
Table 897: Single_Tree_Predict Output Table Schema........................................................................................ 974
Table 898: Single_Tree_Predict Example Output Table...................................................................................... 974
Table 899: Single_Tree_Predict Example Prediction Accuracy.......................................................................... 976
Table 900: AdaBoost_Drive Attribute Table Schema........................................................................................... 979
Table 901: AdaBoost_Drive Response Table Schema...........................................................................................980
Table 902: AdaBoost_Drive Categorical Attribute Table Schema...................................................................... 980
Table 903: AdaBoost_Drive Output Message Schema......................................................................................... 981
Table 904: AdaBoost_Drive Model Table Schema................................................................................................981
Table 905: AdaBoostDrive Example Raw Input Table housing_train, Columns 1-9.......................................982
Table 906: AdaBoostDrive Example Input Table housing_train, Columns 10-14........................................... 982
Table 907: AdaBoost Functions Example Input Table housing_train_att.........................................................983
Table 908: AdaBoost Functions Example Input Table housing_train_response..............................................984
Table 909: AdaBoost Functions Example Input Table housing_cat...................................................................985
Table 910: AdaBoost_Drive Example Output Message........................................................................................985
Table 911: AdaBoost_Drive Example Model Table, Columns 1-7..................................................................... 986
Table 912: AdaBoost_Drive Example Model Table, Columns 8-13................................................................... 986
Table 913: AdaBoost_Predict Output Table Schema............................................................................................988
Table 914: AdaBoostPredict Example Raw Input Table housing_test, Columns 1-9...................................... 988
Table 915: AdaBoostPredict Example Input Table housing_test, Columns 10-14...........................................989
Table 916: AdaBoost Functions Example Input Table housing_test_att...........................................................989
Table 917: AdaBoost_Predict Example Output Table.......................................................................................... 991
Table 918: AdaBoost_Predict Example Prediction Accuracy.......................................................... 993
Table 919: Basket_Generator Input Table Schema............................................................................................... 996
Table 920: Basket_Generator Output Table Schema............................................................................................ 997
Table 921: Basket_Generator Example Input Table grocery_transaction......................................................... 997
Table 922: Basket_Generator Example 1 Output Table....................................................................................... 998
Table 923: Basket_Generator Example 2 Output Table..................................................................................... 1000
Table 924: CFilter Input Table Schema................................................................................................................ 1002
Table 925: CFilter Output Table Schema............................................................................................................. 1002
Table 926: CFilter DuplicatesRemoved Table Schema.......................................................................................1003
Table 927: CFilter Examples Input Table sales_transaction..............................................................................1004
Table 928: CFilter Example 1 Output Table cfilter_output (Columns 1-6).................................................... 1005
Table 929: CFilter Example 1 Output Table cfilter_output (Columns 7-11).................................................. 1006
Table 930: CFilter Example 2 Output Table cfilter_output1 (Columns 1-7).................................................. 1006
Table 931: CFilter Example 2 Output Table cfilter_output1 (Columns 8-10)................................................ 1007
Table 932: FPGrowth Input Table Schema.......................................................................................................... 1011
Table 933: FPGrowth Pattern Table Schema....................................................................................................... 1011
Table 934: FPGrowth Rule Table Schema............................................................................................................ 1012
Table 935: FPGrowth Example Input Table sales_transaction......................................................................... 1013
Table 936: FPGrowth Example Output Message..................................................................1014
Table 937: FPGrowth Example Output Table fpgrowth_out_pattern............................................................. 1015
Table 938: FPGrowth Example Output Table fpgrowth_out_rule (Columns 1-6)........................................ 1016
Table 939: FPGrowth Example Output Table fpgrowth_out_rule (Columns 7-12)...................................... 1016
Table 940: FPGrowth Example Output Table fpgrowth_out_rule (Columns 13-17)....................................1016
Table 941: WSRecommender Item Table Schema.............................................................................................. 1018
Table 942: WSRecommender User Table Schema.............................................................................................. 1019
Table 943: WSRecommender Output Table Schema......................................................................................... 1019
Table 944: WSRecommender Example Item Table recommender_product.................................................. 1020
Table 945: WSRecommender Example User Table recommender_user.........................................................1020
Table 946: WSRecommender Example Output Table........................................................................................1022
Table 947: KNNRecommender Input Table Schema......................................................................................... 1025
Table 948: KNNRecommenderTrain Output Table Schema............................................................................ 1025
Table 949: KNNRecommenderTrain Interpolation Weights Table Schema.................................................. 1025
Table 950: KNNRecommenderTrain Bias Values Table Schema..................................................................... 1025
Table 951: KNNRecommenderTrain Nearest Neighbors Table Schema........................................................ 1026
Table 952: KNNRecommenderTrain Input Table ml_ratings.......................................................................... 1026
Table 953: KNNRecommenderTrain Output Table........................................................................................... 1027
Table 954: KNNRecommenderTrain Default Pearson Correlation Coefficient.............................................1028
Table 955: KNNRecommenderTrain Bias Table.................................................................................................1029
Table 956: KNNRecommenderPredict Output Table Schema..........................................................................1031
Table 957: KNNRecommenderPredict Example Output Table........................................................................1032
Table 958: Vertex Table Example.......................................................................................................................... 1036
Table 959: Edges Table Example............................................................................................................................1037
Table 960: AllPairsShortestPath Vertices Table Schema....................................................................................1039
Table 961: AllPairsShortestPath Edges Table Schema........................................................................................1039
Table 962: AllPairsShortestPath Sources Table Schema.................................................................................... 1040
Table 963: AllPairsShortestPath Targets Table Schema.....................................................................................1040
Table 964: AllPairsShortestPath DuplicatesRemoved Table Schema...............................................................1041
Table 965: AllPairsShortestPath Output Table Schema..................................................................................... 1041
Table 966: AllPairsShortestPath Examples Vertices Table callers.................................................................... 1042
Table 967: AllPairsShortestPath Examples Edges Table calls............................................................................1042
Table 968: AllPairsShortestPath Example 1 Output Table................................................................................ 1043
Table 969: AllPairsShortestPath Example 2 Output Table................................................................................ 1044
Table 970: AllPairsShortestPath Example 3 Output Table................................................................................ 1044
Table 971: Betweenness Vertices Table Schema..................................................................................................1047
Table 972: Betweenness Edges Table Schema......................................................................................................1047
Table 973: Betweenness Sources Table Schema...................................................................................................1048
Table 974: Betweenness Targets Table Schema................................................................................................... 1048
Table 975: Betweenness Output Table Schema................................................................................................... 1048
Table 976: Betweenness Example Vertices Table soc_nw_vertices.................................................................. 1049
Table 977: Betweenness Example Edges Table soc_nw_edges.......................................................................... 1049
Table 978: Betweenness Example Output Table..................................................................................................1050
Table 979: Closeness Vertices Table Schema.......................................................................................................1052
Table 980: Closeness Edges Table Schema........................................................................................................... 1053
Table 981: Closeness Sources Table Schema........................................................................................................1053
Table 982: Closeness Targets Table Schema........................................................................................................ 1053
Table 983: Closeness Output Table Schema........................................................................................................ 1053
Table 984: Closeness Examples Vertices Table callers........................................................................................1054
Table 985: Closeness Examples Edges Table calls............................................................................................... 1055
Table 986: Closeness Example 1 Output Table....................................................................................................1055
Table 987: Closeness Example 2 Output Table....................................................................................................1056
Table 988: Closeness Example 3 Output Table....................................................................................................1057
Table 989: EigenvectorCentrality Vertices Table Schema..................................................................................1060
Table 990: EigenvectorCentrality Edges Table Schema......................................................................................1061
Table 991: EigenvectorCentrality Output Table Schema................................................................................... 1061
Table 992: EigenvectorCentrality Example Vertices Table sophomores......................................................... 1062
Table 993: EigenvectorCentrality Example Edges Table common_classes..................................................... 1062
Table 994: EigenvectorCentrality Example 1 Output Table.............................................................................. 1063
Table 995: EigenvectorCentrality Example 2 Output Table.............................................................................. 1063
Table 996: EigenvectorCentrality Example 3 Output Table.............................................................................. 1064
Table 997: gTree Analogs of nTree Arguments...................................................................................................1064
Table 998: Aggregate Functions Supported by gTree Function........................................................................1066
Table 999: gTree Vertices Table Schema.............................................................................................................. 1067
Table 1000: gTree Edges Table Schema................................................................................................................ 1068
Table 1001: gTree Root Table Schema..................................................................................................................1068
Table 1002: gTree Output Table Schema............................................................................................................. 1068
Table 1003: gTree Example Vertices Table gtree_vertices................................................................................. 1068
Table 1004: gTree Example Edges Table gtree_edges.........................................................................................1069
Table 1005: gTree Example Root Table gtree_root.............................................................................................1069
Table 1006: gTree Example 1 Output Table.........................................................................................................1070
Table 1007: gTree Example 2 Output Table.........................................................................................................1071
Table 1008: Triangle Type Patterns.......................................................................................................................1073
Table 1009: LocalClusteringCoefficient Vertices Table Schema........................................................1076
Table 1010: LocalClusteringCoefficient Edges Table Schema...........................................................................1076
Table 1011: LocalClusteringCoefficient Output Table Schema for BUN Graph............................................1077
Table 1012: LocalClusteringCoefficient Output Table Schema for BDN Graph............................................1077
Table 1013: LocalClusteringCoefficient Output Table Schema for WUN Graph.......................................... 1077
Table 1014: LocalClusteringCoefficient Output Table Schema for WDN Graph.......................................... 1078
Table 1015: LocalClusteringCoefficient Examples Vertices Table country.....................................................1079
Table 1016: LocalClusteringCoefficient Examples Edges Table trade............................................................. 1080
Table 1017: LocalClusteringCoefficient Example 1 Output Table................................................................... 1080
Table 1018: LocalClusteringCoefficient Example 2 Output Table................................................................... 1081
Table 1019: LocalClusteringCoefficient Example 3 Output Table (Columns 1-6)........................................ 1081
Table 1020: LocalClusteringCoefficient Example 3 Output Table (Columns 7-12)...................................... 1082
Table 1021: LocalClusteringCoefficient Example 3 Output Table (Columns 13-19).................................... 1082
Table 1022: LoopyBeliefPropagation Vertices Table Schema............................................................ 1084
Table 1023: LoopyBeliefPropagation Edges Table Schema............................................................................... 1085
Table 1024: LoopyBeliefPropagation Observation Table Schema.................................................................... 1085
Table 1025: LoopyBeliefPropagation Output Table Schema.............................................................................1085
Table 1026: LoopyBeliefPropagation Examples Vertices Table lbp_vertices..................................................1087
Table 1027: LoopyBeliefPropagation Example 1 Edges Table lbp_edges........................................................ 1087
Table 1028: LoopyBeliefPropagation Examples Observation Table lbp_observation................................... 1087
Table 1029: LoopyBeliefPropagation Example 1 Output Table........................................................................ 1088
Table 1030: LoopyBeliefPropagation Example 2 Edges Table lbp_weighted_edges......................................1089
Table 1031: LoopyBeliefPropagation Example 2 Output Table........................................................................ 1089
Table 1032: Modularity Vertices Table Schema.................................................................................................. 1095
Table 1033: Modularity Edges Table Schema...................................................................................................... 1095
Table 1034: Modularity Sources Table Schema...................................................................................................1096
Table 1035: Community Vertex Table Schema (Default Resolution)..............................................................1096
Table 1036: Community Edges Table Schema.....................................................................................................1096
Table 1037: Modularity Examples Vertices Table friends................................................................................. 1097
Table 1038: Modularity Examples Edges Table followers_leaders................................................................... 1098
Table 1039: Modularity Example 1 Community Vertex Table......................................................................... 1098
Table 1040: Modularity Example 2 Community Vertex Table......................................................................... 1099
Table 1041: Modularity Example 2 Community Edge Table community_edges........................................... 1099
Table 1042: emp_table_dept...................................................................................................................................1105
Table 1043: Cycle in nTree with Mode ('up').......................................................................................................1106
Table 1044: Cycle in nTree with Mode ('down')..................................................................................................1106
Table 1045: nTree Output Table Schema............................................................................................................. 1107
Table 1046: nTree Examples 1 and 2 Input Table employee_table...................................................................1107
Table 1047: nTree Example 1 Output Table........................................................................................................ 1108
Table 1048: nTree Example 2 Output Table........................................................................................................ 1108
Table 1049: nTree Example 3 Input Table emp_table_by_dept....................................................................... 1109
Table 1050: nTree Example 3 Output Table........................................................................................................ 1110
Table 1051: PageRank Vertices Table................................................................................................................... 1112
Table 1052: PageRank Edges Table....................................................................................................................... 1112
Table 1053: PageRank Output Table.....................................................................................................................1112
Table 1054: PageRank Examples Vertices Table callers..................................................................................... 1113
Table 1055: PageRank Examples Edges Table calls.............................................................................................1113
Table 1056: PageRank Example Output Table.................................................................................................... 1114
Table 1057: pSALSA Vertices Table Schema....................................................................................................... 1118
Table 1058: pSALSA Edges Table Schema........................................................................................................... 1119
Table 1059: pSALSA Sources Table Schema........................................................................................................ 1119
Table 1060: pSALSA Targets Table Schema........................................................................................................ 1119
Table 1061: pSALSA Output Table Schema.........................................................................................................1120
Table 1062: pSALSA Example 1 Input Table users_vertex................................................................................1122
Table 1063: pSALSA Example 1 Input Table users_edges................................................................................. 1122
Table 1064: pSALSA Example 1 Output Table....................................................................................................1123
Table 1065: pSALSA Example 2 Output Table....................................................................................................1124
Table 1066: pSALSA Example 3 Input Table user_product_nodes..................................................................1125
Table 1067: pSALSA Example 3 Input Table women_apparel_log.................................................................. 1125
Table 1068: pSALSA Example 3 Output Table....................................................................................................1126
Table 1069: pSALSA Example 4 Input Table user_source_nodes.................................................................... 1127
Table 1070: pSALSA Example 4 Input Table product_target_nodes............................................................... 1127
Table 1071: pSALSA Example 4 Output Table....................................................................................................1128
Table 1072: RandomWalkSample Summary Table............................................................................................ 1130
Table 1073: RandomWalkSample Example Input Table citvertices.................................................................1131
Table 1074: RandomWalkSample Example Input Table citedges.................................................................... 1131
Table 1075: RandomWalkSample Example Output Summary......................................................................... 1132
Table 1076: RandomWalkSample Example Output Table (sampled vertices)............................................... 1133
Table 1077: RandomWalkSample Example Output Table (edges)...................................................................1133
Table 1078: NeuralNetwork Input Table Schema............................................................................................... 1139
Table 1079: NeuralNetwork Weight Table Schema............................................................................................1139
Table 1080: NeuralNetwork Weights Table.........................................................................................................1139
Table 1081: NeuralNet Output Table Schema..................................................................................................... 1139
Table 1082: NeuralNet Output Table....................................................................................................................1140
Table 1083: NeuralNet Example Input Table breast_cancer_data (Columns 1-5)........................................ 1140
Table 1084: NeuralNet Example Input Table breast_cancer_data (Columns 6-11)...................................... 1141
Table 1085: NeuralNet Example Train Table breast_cancer_train (Columns 1-5)....................................... 1142
Table 1086: NeuralNet Example Train Table breast_cancer_train (Columns 6-11)..................................... 1142
Table 1087: NeuralNet Example Test Table breast_cancer_test (Columns 1-5)...........................................1143
Table 1088: NeuralNet Example Test Table breast_cancer_test (Columns 6-11)........................................1143
Table 1089: NeuralNet Example Output Table................................................................................................... 1144
Table 1090: NeuralNetPredict Input Table schema............................................................................................1145
Table 1091: NeuralNetPredict Output Table schema.........................................................................................1145
Table 1092: NeuralNetPredict Example Output Table (Columns 1-4)............................................................1146
Table 1093: NeuralNetPredict Example Output Table (Columns 5-8)............................................................1147
Table 1094: NeuralNetPredict Example Output Table (Columns 9-12)..........................................................1148
Table 1095: NeuralNetPredict Example Prediction Accuracy.......................................................................... 1149
Table 1096: Antiselect Example Input Table antiselect_input (Columns 1-8)............................................... 1152
Table 1097: Antiselect Example Input Table antiselect_input (Columns 9-13)............................................. 1153
Table 1098: Antiselect Example Output Table.................................................................................................... 1153
Table 1099: Apache Log Parser Item-Name Mapping....................................................................................... 1155
Table 1100: Apache_Log_Parser Input Table Schema....................................................................................... 1157
Table 1101: Apache_Log_Parser Output Table Schema.................................................................................... 1157
Table 1102: Apache_Log_Parser Output Table Columns extracted when RETURN_SEARCH_INFO = 'true'......1158
Table 1103: Apache_Log_Parser Example Input Table apache_logs............................................................... 1158
Table 1104: Apache_Log_Parser Example 1 Output Table (Columns 1-5).................................................... 1159
Table 1105: Apache_Log_Parser Example 1 Output Table (Columns 6-8).................................................... 1160
Table 1106: Apache_Log_Parser Example 1 Output Table (Columns 9-11).................................................. 1160
Table 1107: Apache_Log_Parser Example 2 Output Table (Columns 1-4).................................................... 1161
Table 1108: Apache_Log_Parser Example 2 Output Table (Columns 5-7).................................................... 1161
Table 1109: Categorize Input Table Schema........................................................................................................1162
Table 1110: Categorize Output Table Schema.....................................................................................................1162
Table 1111: Categorize Example Input Table categorize_input........................................................................1162
Table 1112: FellegiSunterTrainer input_table Schema.......................................................................................1167
Table 1113: FellegiSunterTrainer Output (Model) Table Schema....................................................................1167
Table 1114: FellegiSunterTrainer Model Properties........................................................................................... 1167
Table 1115: FellegiSunterTrainer Example Input Table fstrainer_input (Columns 1-4)..............................1169
Table 1116: FellegiSunterTrainer Example Input Table fstrainer_input (Columns 5-8)..............................1170
Table 1117: FellegiSunterTrainer Example 1 Output (Model) Table fg_unsupervised_model....................1171
Table 1118: FellegiSunterTrainer Example 2 Output (Model) Table fg_supervised_model........................ 1172
Table 1119: FellegiSunterTrainer Output (Model) Table Schema....................................................................1174
Table 1120: FellegiSunterPredict Example Input Table fspredict_input (Columns 1-4).............................. 1175
Table 1121: FellegiSunterPredict Example Input Table fspredict_input (Columns 5-7).............................. 1176
Table 1122: FellegiSunterPredict Example 1 Output Table (Columns 1-4).................................................... 1176
Table 1123: FellegiSunterPredict Example 1 Output Table (Columns 5-9).................................................... 1177
Table 1124: FellegiSunterPredict Example 2 Output Table (Columns 1-4).................................................... 1178
Table 1125: FellegiSunterPredict Example 2 Output Table (Columns 5-9).................................................... 1178
Table 1126: Geospatial File Formats That GeometryLoader Accepts.............................................................. 1181
Table 1127: GeometryLoader Output Table Schema..........................................................................................1181
Table 1128: GeometryLoader Output Table........................................................................................................ 1183
Table 1129: PointInPolygon Source Table Schema............................................................................................ 1187
Table 1130: PointInPolygon Reference Table Schema....................................................................................... 1187
Table 1131: PointInPolygon Output Table Schema............................................................................................1188
Table 1132: PointInPolygon Example 1 Input Table source_passenger..........................................................1189
Table 1133: PointInPolygon Example 1 Input Table reference_terminal....................................................... 1189
Table 1134: PointInPolygon Example 1 Output Table (Columns 1-2)............................................................1189
Table 1135: PointInPolygon Example 1 Output Table (Columns 3-6)............................................................1190
Table 1136: PointInPolygon Example 2 Output Table (Columns 1-2)............................................................1190
Table 1137: PointInPolygon Example 2 Output Table (Columns 3-6)............................................................1190
Table 1138: PointInPolygon Example 2 Input Table source_passenger1........................................................1191
Table 1139: PointInPolygon Example 3 Output Table (Columns 1-3)............................................................1191
Table 1140: PointInPolygon Example 3 Output Table (Columns 4-7)............................................................1191
Table 1141: GeometryOverlay Boundary Operators.......................................................................................... 1193
Table 1142: GeometryOverlay Source Table Schema......................................................................................... 1194
Table 1143: GeometryOverlay Reference Table Schema....................................................................................1194
Table 1144: GeometryOverlay Output Table Schema........................................................................................ 1195
Table 1145: GeometryOverlay Input Table source_gatetype............................................................................ 1195
Table 1146: GeometryOverlay Input Table ref_terminal...................................................................................1195
Table 1147: GeometryOverlay Example 1 Output Table................................................................................... 1196
Table 1148: GeometryOverlay Example 2 Output Table................................................................................... 1197
Table 1149: GeometryOverlay Example 3 Output Table................................................................................... 1197
Table 1150: IdentityMatch Source Input Table Schema.................................................................................... 1202
Table 1151: IdentityMatch Reference Input Table Schema............................................................................... 1202
Table 1152: IdentityMatch Output Table Schema.............................................................................................. 1202
Table 1153: IdentityMatch Example Input Table applicant_reference............................................................1203
Table 1154: IdentityMatch Example Input Table applicant_external..............................................................1204
Table 1155: IdentityMatch Output Table (Columns 1-6)..................................................................................1205
Table 1156: IdentityMatch Output Table (Columns 7-13)................................................................................1205
Table 1157: IPGeo Input Table Schema............................................................................................................... 1207
Table 1158: IPGeo Output Table Schema............................................................................................................ 1207
Table 1159: IPGeo Example Input Table ipgeo_1...............................................................................................1208
Table 1160: IPGeo Example Output Table (Column 1-7)................................................................................. 1209
Table 1161: IPGeo Example Output Table (Column 8-15)............................................................................... 1209
Table 1162: JSONParser Example 1 Input Table.................................................................................................1217
Table 1163: JSONParser Example 1 Output Table..............................................................................................1218
Table 1164: JSONParser Example 2 Input Table.................................................................................................1218
Table 1165: JSONParser Example 2 OutputTable (Columns 1-5)....................................................................1219
Table 1166: JSONParser Example 2 OutputTable (Columns 6-8)....................................................................1219
Table 1167: JSONParser Example 3 OutputTable (Columns 1-5)....................................................................1220
Table 1168: JSONParser Example 3 OutputTable (Columns 6-8)....................................................................1220
Table 1169: JSONParser Example 4 Input Table.................................................................................................1221
Table 1170: JSONParser Example 4 Output Table (Columns 1-3)...................................................................1221
Table 1171: JSONParser Example 4 Output Table (Column 4)........................................................................1221
Table 1172: Multi_Case Input Table Schema...................................................................................................... 1223
Table 1173: Multi_Case Output Table Schema................................................................................................... 1223
Table 1174: Multi_Case Example InputTable people_age.................................................................................1223
Table 1175: Multi_Case Example OutputTable.................................................................................................. 1224
Table 1176: MurmurHash Input Table Schema.................................................................................................. 1226
Table 1177: MurmurHash Output Table Schema...............................................................................................1226
Table 1178: MurmurHash Examples Input Table murmurhash_input, Columns 1-6..................................1227
Table 1179: MurmurHash Examples Input Table murmurhash_input, Columns 7-10................................1227
Table 1180: MurmurHash Example 1 Output Table, Columns 1-4................................................................. 1228
Table 1181: MurmurHash Example 1 Output Table, Columns 5-7................................................................. 1228
Table 1182: MurmurHash Example 1 Output Table, Columns 8-10............................................................... 1228
Table 1183: MurmurHash Example 2 Output Table, Columns 1-4................................................................. 1229
Table 1184: MurmurHash Example 2 Output Table, Columns 5-7................................................................. 1229
Table 1185: MurmurHash Example 2 Output Table, Columns 8-10............................................................... 1230
Table 1186: OutlierFilter Input Table Schema.................................................................................................... 1232
Table 1187: OutlierFilter Output Message Schema............................................................................................ 1233
Table 1188: OutlierFilter Examples Input Table ville_pressuredata................................................................ 1233
Table 1189: OutlierFilter Example 1 Output Message........................................................................................1235
Table 1190: OutlierFilter Example 1 Output Table of_output1........................................................................1235
Table 1191: OutlierFilter Example 2 Output Message........................................................................................1236
Table 1192: OutlierFilter Example 2 Output Table of_output2........................................................................1236
Table 1193: OutlierFilter Example 2 Output Table of_outlier2........................................................................ 1237
Table 1194: Pack Input Table Schema.................................................................................................................. 1238
Table 1195: Pack Output Table Schema............................................................................................................... 1239
Table 1196: Pack Examples Input Table ville_temperature...............................................................................1239
Table 1197: Pack Example 1 Output..................................................................................................................... 1240
Table 1198: Pack Example 2 Output..................................................................................................................... 1241
Table 1199: Pivot Input Table Schema................................................................................................................. 1243
Table 1200: Pivot Input Table input_table_1.......................................................................................................1243
Table 1201: Possible Pivot Output Table 1...........................................................................................................1243
Table 1202: Possible Pivot Output Table 2...........................................................................................................1243
Table 1203: Possible Pivot Output Table 3...........................................................................................................1244
Table 1204: Possible Pivot Output Table 4...........................................................................................................1244
Table 1205: Pivot Input Table input_table_2.......................................................................................................1244
Table 1206: Pivot Output Table for Ordered Input Data...................................................................................1244
Table 1207: Pivot Output Table Schema.............................................................................................................. 1245
Table 1208: Pivot Examples Input Table pivot_input........................................................................................ 1245
Table 1209: Pivot Example 1 Output Table..........................................................................................................1246
Table 1210: Pivot Example 2 Output Table..........................................................................................................1247
Table 1211: Pivot Example 2 Output Table..........................................................................................................1248
Table 1212: PSTParserAFS Output Table Schema..............................................................................................1251
Table 1213: PSTParserAFS Example 1 Input Table dum1.pst, Columns 1-4................................................. 1255
Table 1214: PSTParserAFS Example 1 Input Table dum1.pst, Columns 5-8................................................. 1255
Table 1215: PSTParserAFS Example 1 Output Table dum1.pst, Columns 1-4.............................................. 1256
Table 1216: PSTParserAFS Example 1 Output Table dum1.pst, Columns 5-8.............................................. 1256
Table 1217: PSTParserAFS Example 2 Output Table......................................................................................... 1256
Table 1218: Input Data Example........................................................................................................................... 1259
Table 1219: Output Table Example.......................................................................................................................1259
Table 1220: ScaleMap, Scale, or PartitionScale Input Table Schema............................................................... 1260
Table 1221: ScaleMap Output Table Schema.......................................................................................................1260
Table 1222: Supported Statistical Data Types in ScaleMap Output Table.......................................................1260
Table 1223: Location and Scale for Statistical Methods..................................................................................... 1263
Table 1224: Scale and PartitionScale Output Table Schema..............................................................................1263
Table 1225: Supported Statistical Data Types in ScalePrinter Output Table.................................................. 1264
Table 1226: Scale Functions Examples Input Table scale_housing.................................................................. 1267
Table 1227: Scale and ScaleMap Example 1 Output Table................................................................................ 1268
Table 1228: Scale and ScaleMap Example 2 Output Table................................................................................ 1269
Table 1229: Scale and ScaleMap Example 3 Input Table scale_housing_test................................................. 1270
Table 1230: Scale and ScaleMap Example 3 Output Table................................................................................ 1270
Table 1231: ScalePrinter Example Output Table (Columns 1-3)......................................................................1271
Table 1232: ScalePrinter Example Output Table (Columns 4-6)......................................................................1271
Table 1233: Scale and ScaleMap Example 5 Output Table (Columns 1-3)..................................................... 1272
Table 1234: Scale and ScaleMap Example 5 Output Table (Columns 4-7)..................................................... 1273
Table 1235: Scale and ScaleMap Example 5 Output Table (Columns 1-3)..................................................... 1274
Table 1236: Scale and ScaleMap Example 5 Output Table (Columns 4-7)..................................................... 1274
Table 1237: Scale and KMeans Example Output Table......................................................................................1275
Table 1238: StringSimilarity Input Table Schema...............................................................................................1277
Table 1239: StringSimilarity Output Table Schema............................................................................................1278
Table 1240: StringSimilarity Example Input Table strsimilarity_input........................................................... 1278
Table 1241: StringSimilarity Example 1 Output Table (Columns 1-3)............................................................1279
Table 1242: StringSimilarity Example 1 Output Table (Columns 4-7)............................................................1279
Table 1243: StringSimilarity Example 2 Output Table (Columns 1-3)............................................................1280
Table 1244: StringSimilarity Example 2 Output Table (Columns 4-7)............................................................1281
Table 1245: Unpack Input Table Schema.............................................................................................................1284
Table 1246: Unpack Output Table Schema..........................................................................................................1284
Table 1247: Unpack Example 1 Input Table ville_tempdata.............................................................................1284
Table 1248: Unpack Example 1 Output Table.....................................................................................................1285
Table 1249: Unpack Example 2 Input Table ville_tempdata1...........................................................................1286
Table 1250: Unpack Example 2 Output Table.....................................................................................................1286
Table 1251: Unpivot Input Table Schema................................................................................................. 1288
Table 1252: Unpivot Output Table Schema.............................................................................................. 1289
Table 1253: Unpivot Examples Input Table unpivot_input.............................................................................. 1289
Table 1254: Unpivot Example 1 Output Table.................................................................................................... 1290
Table 1255: Unpivot Example 2 Output Table.................................................................................................... 1291
Table 1256: URIPack Output Table Schema........................................................................................................1294
Table 1257: URIPack Example Output Table...................................................................................................... 1294
Table 1258: Key Hierarchical URI Components................................................................................................. 1295
Table 1259: URIUnpack Input Table Schema..................................................................................................... 1296
Table 1260: URIUnpack Output Table Schema.................................................................................................. 1296
Table 1261: URIUnpack Input table uris_input..................................................................................................1297
Table 1262: URIUnpack Example Output Table.................................................................................................1298
Table 1263: XMLParser Input Table Schema...................................................................................................... 1302
Table 1264: XMLParser Output Table Schema................................................................................................... 1302
Table 1265: XMLParser Example 1 & 2 Input Table xml_input1.....................................................................1303
Table 1266: XMLParser Example 1 Output Table...............................................................................................1305
Table 1267: XMLParser Example 2 Output Table (Columns 1-5)....................................................................1305
Table 1268: XMLParser Example 2 Output Table (Columns 6-11)................................................................. 1306
Table 1269: XMLParser Example 3 Input Table xml_inputs_fuzzy.................................................................1306
Table 1270: XMLParser Example 3 Output Table...............................................................................................1307
Table 1271: XMLParser Example 4 Input Table xml_inputs_error................................................................. 1307
Table 1272: XMLParser Example 4 Output Table...............................................................................................1308
Table 1273: XMLParser Example 5 Input table xml_input2............................................................................. 1308
Table 1274: XMLParser Example 5 Output Table...............................................................................................1309
Table 1275: XMLRelation Input Table Schema...................................................................................................1311
Table 1276: XMLRelation Output Table Schema, Output ('fulldata')..............................................................1311
Table 1277: XMLRelation Output Table Schema, Output ('parentchild')....................................................... 1312
Table 1278: XMLRelation Output Table Schema, Output ('fullpath')............................................................. 1312
Table 1279: XMLRelation Examples 1 & 2 Input Table xmlrelation_input....................................................1313
Table 1280: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 1-5........................... 1314
Table 1281: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 6-10.........................1314
Table 1282: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 11-14.......................1314
Table 1283: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 15-18.......................1315
Table 1284: XMLRelation Example 1 Output Table for Output ('parentchild')............................................. 1315
Table 1285: XMLRelation Example 1 Output Table for Output ('fullpath')....................................................1316
Table 1286: XMLRelation Example 2 Output Table...........................................................................................1316
Table 1287: XMLRelation Example 3 Input Table xmlrelation_error............................................................. 1317
Table 1288: XMLRelation Example 3 Output Table...........................................................................................1317
Table 1289: AMLGenerator Example Input Table glass_attribute_table_output.......................................... 1324
Table 1290: Scoring Data Types.............................................................................................................................1329
Table 1291: Aster Scoring SDK Single Decision Tree Model Format.............................................................. 1334
Table 1292: Single_Tree_Predict Request Schema..............................................................................................1334
Table 1293: Aster Scoring SDK Single Decision Tree Request Schema...........................................................1334
Table 1294: Aster Scoring SDK Single Decision Tree Parameters....................................................................1334
Table 1295: Aster Scoring SDK Generalized Linear Model - Model Format..................................................1335
Table 1296: Aster Scoring SDK Generalized Linear Model Parameters.......................................................... 1335
Table 1297: Aster Scoring SDK Random Forest Model Format....................................................................... 1336
Table 1298: Aster Scoring SDK Random Forest Parameters.............................................................................1336
Table 1299: Aster Scoring SDK Naïve Bayes Model Format............................................................................. 1337
Table 1300: Aster Scoring SDK Naïve Bayes Parameters...................................................................................1337
Table 1301: Aster Scoring SDK Naïve Bayes Text Classifier Model Format...................................................1337
Table 1302: Aster Scoring SDK Naïve Bayes Text Classifier Parameters........................................................ 1338
Table 1303: Aster Scoring SDK Text Tagging Model Format........................................................................... 1338
Table 1304: Aster Scoring SDK Text Tagging Parameters.................................................................................1339
Table 1305: Aster Scoring SDK Extract Sentiment Model Format...................................................................1340
Table 1306: Aster Scoring SDK Extract Sentiment Parameters........................................................................ 1340
Table 1307: Aster Scoring SDK Text Parser Model Format.............................................................................. 1342
Table 1308: Aster Scoring SDK Text Parser Parameters....................................................................................1342
Table 1309: Aster Scoring SDK Text Tokenizer Model Format....................................................................... 1344
Table 1310: Aster Scoring SDK Text Tokenizer Parameters............................................................................. 1344
Table 1311: Aster Scoring SDK SparseSVM Model Format..............................................................................1345
Table 1312: Aster Scoring SDK SparseSVM Parameters................................................................................... 1345
Table 1313: Aster Scoring SDK CoxPH Model Format..................................................................................... 1346
Table 1314: Aster Scoring SDK CoxPH Parameters...........................................................................................1346
Table 1315: Aster Scoring SDK LDAInferenceModel Format.......................................................................... 1347
Table 1316: Parameters........................................................................................................................................... 1347

Preface

Overview
This guide provides instructions for users and administrators of Teradata Aster® Analytics 6.21. If you are
using a different version, you must download a different edition of this guide.
The following additional resources are available:
• Aster Database upgrades, clients and other packages:
http://downloads.teradata.com/download/tools
• Documentation for existing customers with a Teradata @ Your Service login:
http://tays.teradata.com/
• Documentation that is available to the public:
http://www.info.teradata.com/

Conventions Used in This Guide


This document assumes that the reader is comfortable working in Windows and Linux/UNIX
environments. Many sections assume you are familiar with SQL.
This document uses the following typographical conventions.

Typefaces
Command line input and output, commands, program code, filenames, directory names, and system
variables are shown in a monospaced font. Words in italics indicate an example or placeholder value that
you must replace with a real value. Bold type is intended to draw your attention to important or changed
items. Menu navigation and user interface elements are shown using the User Interface Command font.

Notation Conventions
In the synopsis sections, we follow these conventions:
• Square brackets ([ and ]) indicate one or more optional items.
• Curly braces ({ and }) indicate that you must choose an item from the list inside the braces. Choices are
separated by vertical lines (|).
• An ellipsis (...) means the preceding element can be repeated.
• A comma and an ellipsis (, ...) means the preceding element can be repeated in a comma-separated list.
• In command line instructions, SQL commands and shell commands are typically written with no
preceding prompt, but where needed the default SQL prompt is shown: beehive=>

Command Shell Text Conventions
For shell commands, the prompt is usually shown. The $ sign introduces a command that’s being run by a
non-root user:
$ ls
The # sign introduces a command that’s being run as root:
# ls

Contact Teradata Global Technical Support (GTS)


For assistance and updated documentation, contact Teradata Global Technical Support (GTS):
• Support Portal: https://tays.teradata.com
• International: 212-444-0443
• US Customers: 877-MyT-Data (877-698-3282)

About Teradata Aster


Teradata Aster provides data management and advanced analytics for diverse and big data, enabling the
powerful combination of cost-effective storage and ultra-fast analysis of relational and non-relational data.
Teradata Aster is a division of Teradata and is headquartered in San Carlos, California.
For more information, go to: http://www.teradata.com/products-and-services/analytics-from-aster-overview

About This Document


This is the Aster Analytics Foundation User Guide. This edition covers Aster Analytics version 6.21.

Version History
Table 1: Version History Table

Release Product ID Date
AA 6.21 B700-1020-621K April 2016
AA 6.20 B700-1017-620K October 2015
AA 6.20 B700-1015-620K September 2015
AA 6.10 B700-1020-610K March 2015
AA 6.00 B700-1014-600K August 2014

CHAPTER 1
Introduction

Introduction
• Analytics at Scale: Full Dataset Analysis
• Introduction to Teradata Aster SQL-MapReduce
• SQL-MapReduce Query Syntax
• SQL-MapReduce with Multiple Inputs
• Aster Analytics Function Product Bundles
• Aster Analytics Functions by Product Bundle
• Aster Analytics Functions by Category

Analytics at Scale: Full Dataset Analysis


Aster Database lets you efficiently perform analytical tasks on your full dataset, in place, rather than using
samples or bulk-exporting data to a dedicated computing cluster.
Why perform analytical tasks on your full dataset? While applying analytics to a small sample of the data
outside the database might work for some problems, it cannot provide the accurate, reproducible results that
you can get by analyzing a complete dataset.
One important application of in-database analytics is to speed up iterations of analysis. Because iteration
time is so critical, many organizations want a solution that is faster than repeatedly exporting a data sample,
analyzing it, exporting another sample, and so on. In such cases, it makes sense to push those analytics down
into an MPP system to shorten the iteration cycle.
Teradata is working with partners, including the SAS Institute, Inc., to make this process straightforward,
and is developing functions where appropriate (for example, functionality that takes advantage of the
MapReduce paradigm).
However, there are stronger reasons for analyzing your entire data set in Aster Database:
• “Needle in a haystack,” “false negative,” and “exceptional cases” searches
Very rare events can only be found (and defined) against the background of the entire data set (consider
defining 'elite baseball player' by looking at the 2008 SF Giants, as opposed to every player in MLB
history).
• Statistical significance
Reliable analytics may require using a large portion of the data, more than fits on a typical single
database machine.
• Model tuning
The parameters to predictive models depend on aggregate statistics of the entire data set (for example,
residual away from the mean).
• No meaningful way to sample
Sampling a graph is not straightforward, especially if one is interested in critical behavior that only
appears when a certain threshold of connections is reached.
• Larger data sets are just different
The resulting analytics are applied to the entire data set in the cluster. Algorithms developed on smaller
data sets may not scale appropriately to the full data set, requiring redevelopment.

Introduction to Teradata Aster SQL-MapReduce®

What is MapReduce?
MapReduce is a framework for operating on large sets of data using massively parallel processing (MPP)
systems. MapReduce enables complex analysis to be performed efficiently on extremely large sets of data,
such as those obtained from weblogs and clickstreams. It has applications in areas such as machine learning,
scientific data analysis, and document classification.
The basic ideas behind MapReduce originated with the map and reduce functions common to many
programming languages, though the implementation and application are somewhat different on multi-node
systems.
In programming languages, a map function applies the same operation to every input tuple (for example,
every member of a list, element of an array, or row of a table) and produces one output tuple for each input
tuple. (A map function is sometimes called a transformation operation.)
On an MPP database such as Aster Database, the map step of a MapReduce function has special meaning.
The input data set is broken into smaller data sets, which are distributed to the worker nodes in a cluster,
where an instance of the function operates on them. If the data is already distributed as specified in the
function call, the distribution step does not occur, because the function can operate on the data where it is
already stored. The outputs from these smaller data sets may be fed back into the function for further
processing, passed as input to another function, or processed in some other way. Finally, all outputs are
consolidated on the queen to produce the final result, with one output tuple for each input tuple.
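In Aster Database SQL terms (the invocation syntax is covered fully in the sections that follow), a map-style call might look like the following sketch. The function parse_events, the table raw_events, and the Delimiter argument clause are hypothetical placeholders, not objects shipped with the product:

SELECT *
FROM parse_events(
    ON raw_events       -- no PARTITION BY: each row is processed independently
    Delimiter(' ')      -- hypothetical argument clause
);

Because no partitioning is requested, each vworker applies the function to whatever rows it already stores.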
In programming languages, a reduce function combines the input tuples to produce a single result by using a
mathematical operator (like sum, multiply, or average). Reduce functions consolidate data into smaller
groups of data. They can accept the output of a map or reduce function or operate recursively on their own
output.
In Aster Database, the reduce step of a MapReduce follows this procedure:
1. The input data is partitioned by the given partitioning attribute.
2. If required by the function call, the input tuples are distributed to the worker nodes, with all the tuples
that share a partitioning key assigned to the same node for processing.
3. On each node, the function operates on the input tuples and returns the output tuples to the queen.
The number of tuples that the function outputs might differ from the number of input tuples that it
received.
4. The output from each node is consolidated on the queen.
5. If necessary, additional operations are performed on the queen.
For example, if the function averages its input, the average results from all the nodes must be averaged on
the queen to obtain the final output.
6. The SQL-MapReduce function returns the final output.
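For illustration, a reduce-style (partition) invocation might look like the following sketch; avg_by_site is a hypothetical installed partition function, and sensor_readings is an assumed input table:

SELECT *
FROM avg_by_site(
    ON sensor_readings
    PARTITION BY site_id    -- all rows sharing a site_id reach the same vworker
    ORDER BY obs_time       -- optional: sort rows within each partition
);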

Aster Database SQL-MapReduce
The Aster Database In-Database MapReduce framework, known as SQL-MapReduce, lets you write
functions in Java or C, save these functions in the cluster, and let analysts run them in a parallel fashion
on Aster Database for efficient data analysis. Analysts invoke a SQL-MapReduce function in a SELECT
query and receive the function’s output as if the function were a table. A SQL-MapReduce function takes as
input one or more sets of rows from tables or views (for example, the contents of a table in the database, the
output of a SQL SELECT statement, or the output of another SQL-MapReduce function) and produces a set
of rows as output. Beginning in Aster Database 5.0, SQL-MapReduce functions can accept multiple
inputs. For more on this, see SQL-MapReduce with Multiple Inputs.
Because a call to a SQL-MapReduce function results in a set of parallel tasks being run across the cluster, the
input data provided to a SQL-MapReduce function must be divided across the parallel tasks. SQL-
MapReduce supports three kinds of inputs:
• Single-input row function
This function takes input one row at a time, in any order.
The SQL statement that calls the function includes ON input_table.
The function operates at the granularity of individual rows of the input table. An Aster Database
single-input row function corresponds to a map function in traditional map-reduce systems. The Aster
Database SQL-MapReduce API for a function that accepts row-wise input is the RowFunction interface.
• Single-input partition function
This function takes input one partition at a time. In a partition, rows are grouped by a specified key of
one or more columns.
The SQL statement that calls the function includes ON my_input PARTITION BY
partitioning_attributes.
The function operates on rows that share a partition. The function has access to all such rows at once,
enabling more complex processing than possible with row-wise inputs. Within each partition, you can
sort rows using an ORDER BY clause. An Aster Database single-input partition function corresponds to
a reduce function in traditional map-reduce systems. The Aster Database SQL-MapReduce API for a
function that accepts partition-wise input is the PartitionFunction interface.
• Multiple-input function
This function accepts multiple inputs. The inputs can include a cogroup operation in which inputs from
multiple sources are partitioned and combined before being processed, a dimension operation where all
rows of one or more inputs are replicated to each vworker, or a combination of both.
The SQL statement that calls the function includes a combination of the following, to specify each input
and how to distribute its rows:
∘ ON my_input PARTITION BY partitioning_attributes for each input where rows are to be
partitioned among vworkers using the specified columns
∘ ON my_input PARTITION BY ANY for an input where rows can be processed wherever they were
stored when the function was called
∘ ON my_input DIMENSION for each input where all rows are to be replicated to all vworkers
For rules governing which types of inputs and how many of each type can be specified in the same
multiple input function call, refer to Rules for Number of Inputs by Type.
The Aster Database SQL-MapReduce API for a function that accepts multiple inputs is the
MultipleInputFunction interface.
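The following sketch shows how these input types can combine in a single call; score_events is a hypothetical multiple-input function, and the table names are placeholders:

SELECT *
FROM score_events(
    ON transactions PARTITION BY ANY    -- rows processed where they are already stored
    ON model_table DIMENSION            -- every vworker receives a full copy of the model
);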

In summary, a SQL-MapReduce function:
• Uses the Aster Database API (Java and C are the supported languages)
• Is compiled outside the database, installed (uploaded to the cluster) using Aster Database ACT, and
invoked in SQL
• Receives as input (from the ON clauses) some rows of one or more database tables or views, pre-existing
trained models and/or the results of another SQL-MapReduce function
• Receives as arguments zero or more argument clauses (parameters), which can modify function behavior
• Returns output rows to the database
• Is polymorphic
During initialization, a function receives the schema of its input (for example, (key, value)) and
determines the schema of its output.
• Is designed to run on a massively parallel system by allowing the user to specify the slice of the data
(partition) that a particular instance of the function can access.

SQL-MapReduce Query Syntax


Beginning in Aster Database version 5.0, the SQL-MapReduce function syntax is extended
to allow one or more partitioned inputs and zero or more dimensional inputs. This has introduced some
important changes to the syntax for SQL-MapReduce functions. For more information, see SQL-
MapReduce with Multiple Inputs.
Invoking a SQL-MapReduce function has this SQL syntax:

SELECT [ ALL | DISTINCT [ ON ( expression [,...] ) ] ]
{ * | expression [ [ AS ] output_name ] } [,...]
FROM sqlmr_function_name (on_clause function_argument) [[ AS ] alias ][,...]
[ WHERE condition ]
[ GROUP BY expression [,...] ]
[ HAVING condition [,...] ]
[ ORDER BY expression [ ASC | DESC ][ NULLS { FIRST | LAST } ][,...] ]
[ LIMIT { count | ALL } ]
[ OFFSET start ];

on_clause is:

{ partition_any_input | partition_attributes_input | dimensional_input }

partition_any_input is:

{ table_input PARTITION BY ANY [ ORDER BY expression ] | table_input [ order_by ] }

partition_attributes_input is:

table_input PARTITION BY partitioning_attributes [order_by]

dimensional_input is:

table_input DIMENSION [ order_by ]

table_input is:

ON table_expression [AS alias]

table_expression is:

{ table_name | view_name | (query) }

The preceding syntax focuses on SQL-MapReduce. For the complete syntax of the SELECT statement,
including the WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, and OFFSET clauses, refer to the Aster
Database User Guide for Aster Appliances or Aster Database User Guide for Commodity Hardware.
Notes:
• sqlmr_function_name is the name of a SQL-MapReduce function you have installed in Aster Database. If
the sqlmr_function_name contains uppercase letters, you must enclose it in double quotation marks (").
• The on_clause provides the input data on which the function operates. This data is composed of one or
more partitioned inputs and zero or more dimensional inputs. The partitioned inputs can be a single
partition_any_input clause and/or one or more partition_attributes_input clauses. The dimensional
inputs can be zero or more dimensional_input clauses.
• partition_any_input and partition_attributes_input introduce expressions that partition the inputs before
the function operates on them.
• partitioning_attributes specifies the partition key(s) to use to partition the input data before the function
operates on it.
• dimensional_input introduces an expression that replicates the input to all nodes before the function
operates on them.
• order_by (optional) introduces an expression that sorts the input data after partitioning, but before the
function operates on it.
• table_input includes an alias for the table_expression. For rules about when an alias is required, see Rules
for Table Aliases. When declaring an alias, the AS keyword is optional.
• function_argument optionally introduces an argument clause that typically modifies the behavior of the
SQL-MapReduce function. Do not confuse argument clauses with input data: Input data is the data on
which the function operates; argument clauses provide runtime parameters. You pass an argument clause
in the form argument_name (literal[, ...]), where argument_name is the name of the
argument clause (as defined in the function) and literal is the value to be assigned to that argument. If an
argument clause is a multi-value argument, you can supply a comma-separated list of values. You can
pass multiple argument clause blocks, each consisting of an argument_name followed by its value(s)
encased in a single pair of parentheses, separated from the next argument clause block with whitespace
(not commas).
• AS provides an optional alias for the SQL-MapReduce function in the query.
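Putting the pieces together, one concrete reading of this grammar is the following hypothetical query. The function sessionize_clicks, its argument clauses, and the table web_clicks are placeholders chosen for illustration, not objects defined by this guide:

SELECT user_id, session_id
FROM sessionize_clicks(
    ON web_clicks AS clicks     -- table_input with an alias
    PARTITION BY user_id        -- partitioning_attributes
    ORDER BY click_time         -- order_by, applied within each partition
    TimeColumn('click_time')    -- argument clauses, separated by whitespace
    TimeOut('600')
) AS s
WHERE user_id IS NOT NULL
ORDER BY user_id;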


SQL-MapReduce with Multiple Inputs


Beginning in Aster Database version 5.0, Teradata has extended the capabilities of SQL-MapReduce with
support for multiple inputs. This allows SQL-MapReduce functions to be applied to related groups of
information derived from different data sets.
Data from these multiple data sources can be processed within a single SQL-MapReduce function. This
changes SQL-MapReduce in two important ways:
1. Extending the SQL-MapReduce API to accept multiple inputs allows users to write or modify their own
SQL-MapReduce function to do more advanced analysis within the function itself. It essentially mimics
the JOIN in SQL, but with better performance.
2. The Aster Database Analytics Foundation includes functions that take advantage of the new multiple
input capabilities.
Additional changes made to SQL-MapReduce to support this feature are aliasing for table expressions and
the addition of syntax to allow specifying explicit partitioning requirements when calling SQL-MapReduce
functions.

Benefits of Multiple Inputs


Some benefits of extending SQL-MapReduce to accept multiple inputs are:
• Prediction SQL-MapReduce functions that use a trained model now have better performance and
security. The function takes the model itself as a dimensional input, and one or more data inputs to
which it may be applied.
• There is no requirement that all inputs to a SQL-MapReduce function share a common schema.
• The new capabilities avoid JOINs, UNIONs, and the creation of temporary tables, which were often used
to work around the older ability to support only a single input.
• New types of analytic functions may now be created more easily (for example, multichannel attribution).
• Memory is better utilized, because the partitioning and grouping of tuples that occurs before the function
operates on them means that less data is actually processed by the function. In addition, the ability to
hold one copy of a dimensional input in memory and use it to operate on all tuples from other inputs
uses memory more efficiently.

How Multiple Inputs are Processed


When multiple inputs are supplied, SQL-MapReduce performs a grouping on the partitioned inputs, with
optional support for dimensional inputs. For functions containing partitioned inputs (PARTITION BY
partitioning_attributes), the following steps occur:
1. The partitioning_attributes in all partitioned inputs are examined. A new cogroup tuple is formed for
every distinct partitioning_attributes that is found. The cogroup tuple’s first attribute will be this
partitioning_attributes.
2. For each partitioned input, a new attribute is added to the cogroup tuple. This attribute will hold all the
attributes of each tuple in that input whose partitioning_attributes match the cogroup tuple’s
partitioning_attributes.
3. For each dimensional input, a new attribute is added to the cogroup tuple. This attribute contains all of
that dimensional input's tuples.
4. After the above steps occur, there is one cogroup tuple for each distinct partitioning_attributes with:
• one attribute being the partitioning_attributes,

• plus one attribute for each partitioned input that contains a nested array of all of that input's
matching tuples,
• plus one attribute for each dimensional input that includes an array of all of that input's tuples.
5. The SQL-MapReduce function then gets invoked on each cogroup tuple.
Comparison semantics are used in this grouping operation, so NULL values are treated as equivalent.
Grouped tuples that have empty groups for certain attributes (that is, inputs with no tuples for a particular
group) are included in the grouped output by default. The sketch below illustrates this grouping.
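
To make these steps concrete, here is a hedged sketch. The tables clicks and views and the function
my_cogroup_func are hypothetical; the trailing comments show the conceptual cogroup tuples that the
function would receive.

-- Hypothetical inputs:
--   clicks: rows with cookie values 'A' and 'B'
--   views:  rows with cookie values 'A' and 'C'
SELECT *
FROM my_cogroup_func (
    ON clicks PARTITION BY cookie
    ON views PARTITION BY cookie
);
-- Conceptual cogroup tuples, one per distinct cookie value:
--   ('A', [clicks rows for A], [views rows for A])
--   ('B', [clicks rows for B], [])                 -- empty group included
--   ('C', [], [views rows for C])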

Types of SQL-MapReduce Inputs


From a semantic perspective, there are two possible types of inputs in SQL-MapReduce:
1. Partitioned inputs are split up (partitioned) among vworkers as specified in the PARTITION BY clause.
These inputs can specify one of:
• PARTITION BY ANY - random. In practice, PARTITION BY ANY simply preserves any existing
partitioning of the data for that input. There may only be one PARTITION BY ANY input in a
function.
• PARTITION BY partitioning_attributes - sorted and partitioned on the specified columns.

Note:
All PARTITION BY partitioning_attributes clauses in a function must specify the same number of
attributes and corresponding attributes must be equijoin compatible (that is, of the same datatype or
datatypes that can be implicitly cast to match). This casting is “partition safe,” meaning that it does
not cause redistribution of data on the vworkers.
2. Dimensional inputs use the DIMENSION keyword, and the entire input is distributed to each vworker.
This is done because the entire set of data is required on each vworker for the SQL-MapReduce function
to run. The most common use cases for dimensional inputs are lookup tables and trained models.
Here’s how it works. A multiple-input SQL-MapReduce function takes as input sets of rows from multiple
relations or queries. In addition to the input rows, the function can accept arguments passed by the calling
query. The function then effectively combines the partitioned and dimensional inputs into a single nested
relation for each unique set of partitioning attributes. The SQL-MapReduce function is then invoked once
on each record of the nested relation. It produces a single set of rows as output.
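
As an illustration of the two input types in one call (the function score_purchases and both tables are
hypothetical):

SELECT *
FROM score_purchases (
    ON purchases PARTITION BY ANY   -- rows processed wherever they already reside
    ON price_list DIMENSION         -- entire table replicated to every vworker
);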

Semantic Requirements for SQL-MapReduce Functions


Keep in mind the following semantic requirements when designing, writing, and calling SQL-MapReduce
functions. Before applying these rules, Aster Database assumes PARTITION BY ANY for legacy SQL-
MapReduce functions whose referencing queries omit the PARTITION BY clause.

Allowed Multiple Input Structures


A multiple-input function always operates on at least two input sets. There are two alternatives for
organizing the input data sets:
• You provide the first input set in a row-wise manner and make the other input set(s) available in their
entirety. In Aster terminology, the first set is a PARTITION BY ANY set and the other sets are
DIMENSION sets.

• You provide the first input set partitioned on a key you have chosen, and you provide the other input
set(s) partitioned on key(s) and/or as DIMENSION set(s).
The following section lists the specific rules that govern the number and types of inputs that you can use.

Rules for Number of Inputs by Type


In general, any number of input data sets as specified by ON clause constructs can be provided to a SQL-
MapReduce function, but the allowed combinations are governed by the rules listed below.
1. A function may have at most one PARTITION BY ANY input.
2. If a function specifies PARTITION BY ANY, all other inputs must be DIMENSION.
3. Any number of inputs partitioned with a PARTITION BY partitioning_attributes clause can be
provided, as long as there is no PARTITION BY ANY input.
4. All PARTITION BY partitioning_attributes clauses must specify the same number of attributes and
corresponding attributes must be equijoin compatible (that is, of the same datatype or of datatypes that
can be implicitly cast to match).
5. The function is not invoked if all of the PARTITION BY ANY and PARTITION BY
partitioning_attributes inputs are empty. Having data in the DIMENSION inputs is not, by itself,
sufficient to invoke the function. DIMENSION inputs are not first-class inputs in this sense; they
simply come along for the ride, as function arguments might.
6. The order of the inputs does not matter. For example, your function could have:

SELECT ...
ON store_locations DIMENSION,
ON purchases PARTITION BY purchase_date,
ON products DIMENSION ORDER BY prod_name
...

Rules for Table Aliases


1. An alias is required for subselects or views.
2. If you are referring to a base table, an alias is not required. Aster Database uses the table name as the
default alias.
3. If your SQL-MapReduce function refers to a table or view more than once, then a conflict is reported, as
in this example:

SELECT * FROM union_inputs (
    ON t PARTITION BY ANY
    ON t DIMENSION mode ('roundrobin')
);
ERROR: input alias T in SQL-MR function UNION_INPUTS appears more than once

You must give a different alias to each reference to the table or view.
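
For example, one way to resolve the conflict is to give each reference its own alias (the aliases t1 and t2 are
arbitrary, and union_inputs with its mode argument clause is the hypothetical function from the example
above):

SELECT * FROM union_inputs (
    ON t AS t1 PARTITION BY ANY
    ON t AS t2 DIMENSION mode ('roundrobin')
);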

Number of Inputs
A SQL-MapReduce function invocation triggers the following validations:
• A multiple-input SQL-MapReduce function expects more than one input.

• A single-input SQL-MapReduce function expects exactly one input.

Use Cases and Examples for Multiple Inputs


There are many types of SQL-MapReduce functions that can benefit from having multiple data inputs.
These generally fall into two classifications: cogroup functions and dimensional input functions. These are
not mutually exclusive; a single function could include both cogroup and dimensional operations.

Cogroup Use Case


Cogroup allows SQL-MapReduce functions to be applied to related groups of information derived from
different data sets. The different data inputs are grouped into a nested structure using a cogroup function
before the SQL-MapReduce function operates on them.
The following use case is a simplified sales attribution example for purchases made from a web store. This
case shows how to find out how much sales revenue to attribute to advertisements, based on impressions
(views) and clicks leading to a purchase. The inputs are the logs from the web store and the logs from the ad
server.
This type of result cannot easily be computed using SQL or SQL-MapReduce capabilities without multiple
data inputs.

Cogroup Example
This example uses a fictional SQL-MapReduce function named attribute_sales to show how cogroup
works. The function accepts two partitioned inputs, specified in two ON clauses, and two arguments.
The inputs to the SQL-MapReduce function are:
• weblog, which contains the store web logs, the source of purchase information
• adlog, which contains the logs from the ad server
Both inputs are partitioned on the user’s browser cookie.

Figure 1: Cogroup Example Tables

The arguments to the attribute_sales function are clicks and impressions, which supply the
percentages of sales to attribute for ad clickthroughs and views (impressions) leading up to a purchase.
Use the following SQL-MapReduce query to call the attribute_sales function:

SELECT adname, attr_revenue
FROM attribute_sales (
    ON (SELECT cookie, cart_amt, adname, action
        FROM weblog
        WHERE page = 'thankyou') AS W PARTITION BY cookie
    ON adlog AS S PARTITION BY cookie
    clicks(.8) impressions(.2)
);

The following figure shows how SQL-MapReduce executes this function.

Figure 2: How a SQL-MapReduce function performs a cogroup

The two inputs are cogrouped before the function operates on them. Conceptually, the cogroup operation is
performed in two steps:
1. Each input data set is grouped according to the cookie attribute specified in the PARTITION BY clauses.
A cogroup tuple is formed for each unique resulting group. The tuple is composed of the cookie value
identifying the group and a nested relation that contains all values from both the weblog and adlog
inputs that belong to the group.
The middle box in the preceding figure shows the output of the cogroup operation.
2. The attribute_sales function is invoked once for each cogroup tuple.

Each time it is invoked, the function processes the nested relation, treating it as a single row, and then
attributes the sales revenue to the appropriate advertisements as previously described.
The bottom box in the diagram shows the output of the SQL-MapReduce function.
The cogroup result includes a tuple for the DDDD cookie, although there is no corresponding group in the
adlog data set. The reason is that Aster Database grouping performs an OUTER JOIN, meaning that
cogroup tuples that have empty groups for certain attributes are included in the cogroup output.
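
As a rough relational analogy only (the real operation nests each group rather than joining individual
rows), the NULL-equal outer-join semantics resemble the following query, assuming the Postgres-style
IS NOT DISTINCT FROM comparison, which treats NULLs as equal:

SELECT *
FROM weblog w
FULL OUTER JOIN adlog a
    ON w.cookie IS NOT DISTINCT FROM a.cookie;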

Dimensional Input Use Case: Lookup Tables


A typical scenario for multiple inputs with dimensional data is when a function must access an entire lookup
table of values for all rows of a given input. To see how this was accomplished prior to multiple input SQL-
MapReduce, consider the following example. Suppose a query needed to reference a lookup table of values.
Prior to the introduction of multiple inputs, if the lookup table was small enough, one of the following
strategies could be used:
• Join the lookup table to the main query table during the query.
• Hold the lookup table in memory for the duration of the query.
There are many scenarios like the following, however, for which these solutions do not work because of
the size, complexity, or format of the data:
• The data is too big to fit in memory and/or would make the main query table too large and unwieldy if
added to it.
• The data input consists of multi-dimensional data, such as geospatial data.
• The data consists of a model, usually in JSON format, and a set of data to be analyzed against the model.
For a discussion of this scenario, see Dimensional Input Use Case: Machine Learning.
In these cases, analysts will find it helpful to use SQL-MapReduce with multiple inputs.
The SQL-MapReduce function can loop over the input data, holding one of the inputs in memory and
repeatedly performing the same function on each row of another input. Only a single instance of the
dimensional input is held in memory on each worker node, and it is used to process each incoming row of
partitioned data. Before this functionality existed in SQL-MapReduce, this type of data could not easily be
processed, because an instance of one of the data inputs had to be held in memory for each row of data
from any additional input. This could cause slow performance if one or both datasets were very large or
had a structure not easily represented in relational form.
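
A minimal sketch of the pattern, assuming a hypothetical enrich_with_country function and hypothetical
tables; each vworker holds one in-memory copy of the lookup table while the partitioned rows stream
past it:

SELECT *
FROM enrich_with_country (
    ON transactions PARTITION BY ANY
    ON country_codes DIMENSION
);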

Dimensional Input Example


This example creates a SQL-MapReduce function for a retailer of mobile phone accessories to show how
dimensional inputs are processed. The function takes data from accessory purchases made by mobile
phone and finds the closest retail store at the time of purchase. Its two data sets are:
• phone_purchases, which contains entries for mobile phone accessory purchases and normalized
spatial coordinates of the mobile phone at the time of each online purchase
• stores, which contains the location of all the retail stores and their associated normalized spatial
coordinates
The following figure shows the two data sets.

Figure 3: Dimensional Example Tables

This type of result cannot easily be computed using basic cogroup capabilities, because the data sets must be
related using a proximity join as opposed to an equijoin. The following figure shows how this relationship is
expressed and executed using cogroup extended with dimensional inputs.
Create a SQL-MapReduce function named closest_store, which accepts two inputs as specified in two
ON clauses. Use this query to call the closest_store function:

SELECT pid, sid
FROM closest_store (
    ON phone_purchases PARTITION BY ANY,
    ON stores DIMENSION
);

The following figure shows how SQL-MapReduce executes this function.

Figure 4: How dimensional inputs work in SQL-MapReduce

The closest_store function receives the result of a cogroup operation on the phone_purchases data
set and the dimensional input data set stores. The two boxes at the top of the diagram show sample
phone_purchases input and stores input, respectively. Conceptually, the operation is performed in
four steps:
1. The phone_purchases input remains grouped as it is stored in the database, as specified by the
PARTITION BY ANY clause, and the stores input is grouped into a single group as specified by the
DIMENSION clause.
2. The groups are combined using what is essentially a Cartesian join. The result of the cogroup operation
is a nested relation. Conceptually, each tuple of the nested relation contains an arbitrary group of phone
purchases concatenated with the single group comprising all retail stores.
The middle box in the diagram shows the result of the cogroup operation.

3. The closest_store function is invoked once for each cogroup tuple. At each invocation, it receives a
cursor over an arbitrary group of purchases and a cursor over the entire set of stores.
4. The function performs a proximity join, using the normalized spatial coordinates to find the closest store
to each purchase.
The bottom box in the diagram shows the output of the SQL-MapReduce function.

Dimensional Input Use Case: Machine Learning


Machine learning is another common use case for SQL-MapReduce with multiple inputs. In machine
learning, you create or choose a model that predicts some outcome given a set of data. You typically test the
model to determine its accuracy, fine-tuning it until its predictions fall within the desired margin of error.
The model itself consists of mathematical and statistical algorithms created through observations of
patterns found within a given dataset.
Assume that you want to generate a trained predictive model, fine-tune it, and apply it to a large set of data.
Imagine that you have ten million emails, and you must bucket them by subject matter. You would use a
function (such as a decision tree function) to generate an algorithm that parses emails and places them in the
appropriate bucket. The function might do some statistical analysis to determine where clusters of data
appear, and create subject matter “buckets” based on these. The emails might be placed into buckets based
on frequency of occurrence of certain words, word proximity, and/or grammatical analysis.
To test the accuracy of your model, you might have a human do this same classification work for a subset of
the emails (called the sample dataset). The sample dataset might be one thousand emails. The person would
read and classify each email according to the desired criteria. The model is then applied to the sample data
set and the results compared to the known outcome (the results generated by the human being). You can
then gauge the reliability of the predictive model and fine tune it. Finally, you can do the analysis on future
emails using the predictive model, with a known margin of error.
Some functions that can generate these models include Naive Bayes, k-nearest neighbor, decision trees,
and logistic regression. The model generated by this type of function does not generally follow a
relational structure. It is more likely to be in a JSON format. It can be stored in the file system, but it is more
commonly stored in a database.
So in the machine learning scenario, the data inputs to the SQL-MapReduce function consist of 1) a model,
usually in JSON format, and 2) one or more sets of data to be analyzed against the model. Similar to the
lookup table example, the predictive model must be applied to each row of input from the new data set.
Thus, the predictive model is input to the function using the DIMENSION keyword. The data to be
analyzed could use either PARTITION BY ANY or PARTITION BY partitioning_attributes.
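
A minimal sketch of this pattern, with a hypothetical classify_emails function, a new_emails input table,
and a model table produced by an earlier training step:

SELECT *
FROM classify_emails (
    ON new_emails PARTITION BY ANY   -- data to score
    ON email_model DIMENSION         -- trained model, copied to every vworker
);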

SQL-MapReduce Multiple Input FAQ

How Multiple Inputs Are Combined


Multiple inputs are combined using what is essentially a cogroup operation, with the addition of support for
dimensional inputs. Grouping is done using an OUTER JOIN where NULLs compare equal. The SQL-
MapReduce function is effectively invoked once for each unique partition of all the partitioned inputs. All
dimensional inputs are provided at each invocation. The function can output one or more tuples at each
invocation.
There are two mutually exclusive cases to consider in determining what the unique partitions of the
partitioned inputs will be:

• One or more partition_attributes_input inputs are combined into partitions using a cogroup
operation. The cogroup operation forms one partition for each unique combination of partitioning
attributes present in any of the inputs. Each partition provides the values of the partitioning attributes
and the tuples from each input that agree on those values. If a given input has no tuples for a particular
combination of partitioning attributes, then an empty set of tuples is provided for that input.
• A single partition_any_input is processed wherever its data is stored. Each invocation provides the
input tuples to the vworker where they currently reside in Aster Database.
The function is not invoked if all of the input tables to partitioned inputs are empty, even if a dimensional
input has been provided. Thus the dimensional inputs are not first-class inputs, in that they do not drive
invocation of the function. They are simply provided as additional input to the function.

Can dimensional inputs include non-deterministic expressions?


No. Dimensional inputs must not include non-deterministic expressions. These are expressions that are not
guaranteed to evaluate to the same result every time (expressions with a volatility other than IMMUTABLE).
Whenever changing Global User Configuration (GUC) settings can change the result of an expression, the
volatility of that expression may be classified as STABLE, but it cannot be classified as IMMUTABLE. An
example of this would be changing Locale and Formatting settings, such as datestyle or time zone.

Where will the output of the function be located?


The output location for row and cogroup functions is determined by the location of the input:
• If the input is vworker-specific, the output goes to that vworker.
• If the input is partitioned, the output is partitioned in the same way.
• If the input is replicated, the output is replicated.
If the input is located on the queen:
• If the inputs are row or dimension inputs, the output is replicated to the vworkers.
• If the input is partitioned, the output is also partitioned.

Aster Analytics Function Product Bundles


The Aster Analytics functions belong to product bundles, which are delivered in packages. The following
table lists the product bundles and their package names.
Table 2: Aster Analytics Function Product Bundles and Package Names

Product Bundle        Package Name
Analytics Foundation  Aster MapReduce Analytic Foundation Portfolio 6.21
Premium Path          Aster MapReduce Analytic Premium Portfolio - Path/Pattern Module 6.21
Premium Relationship  Aster MapReduce Analytic Premium Portfolio - Relationship Module 6.21
Premium Graph         Aster Graph Analytic Premium Portfolio - Graph Module 6.21
Aster Analytics       Aster Analytic Premium Portfolio 6.21
Aster Scoring SDK     Aster Scoring SDK 6.21

The Aster Analytic Premium Portfolio 6.21 package combines the following packages:
• Aster MapReduce Analytic Foundation Portfolio 6.21
• Aster MapReduce Analytic Premium Portfolio - Path/Pattern Module 6.21
• Aster MapReduce Analytic Premium Portfolio - Relationship Module 6.21
• Aster Graph Analytic Premium Portfolio - Graph Module 6.21

Aster Analytics Functions by Product Bundle


This section lists the Aster Analytics functions by product bundle. For an alphabetical list of the functions
and their syntax, see List of Functions and Their Syntax.

Premium Path
Table 3: Premium Path Bundle Functions in Alphabetical Order

Function Name
Attribution
FrequentPaths
nPath
Path_Analyzer
Path_Generator
Path_Starter
Path_Summarizer

Premium Relationship
Table 4: Premium Relationship Bundle Functions in Alphabetical Order

Function Name
Basket_Generator
CFilter (Collaborative Filtering)
FPGrowth
nTree
WSRecommender

Analytics Foundation
Table 5: Analytics Foundation Bundle Functions in Alphabetical Order

Function Name
AdaBoost_Drive
AdaBoost_Predict
AddOnePlayer
Antiselect
Apache Log Parser
Approximate Distinct Count
Approximate Percentile
Arima
ArimaPredictor
Burst
Canopy
Categorize
CCM
CCMPrepare
ChangePointDetection
CMAVG (Cumulative Moving Average)
ConfusionMatrix
Correlation
CoxPH
CoxPredict
CoxSurvFit
CrossValidation
DenseSVMModelPrinter
DenseSVMPredictor
DenseSVMTrainer
Distribution Matching
DTW
DWT
DWT2D
EMAVG (Exponential Moving Average)

EvaluateNamedEntityFinderPartition
EvaluateNamedEntityFinderRow
EvaluateSentimentExtractor
ExtractSentiment
FMeasure
FellegiSunterTrainer
FellegiSunterPredict
FindNamedEntity
Forest_Analyze
Forest_Drive
Forest_Predict
GenerateCombination
GeometryLoader
GeometryOverlay
GLM
GLMPredict
GMMFit
GMMPredict
GMMProfile
Histogram
HMMUnsupervisedLearner
HMMSupervisedLearner
HMMEvaluator
HMMDecoder
IdentityMatch
IDWT
IDWT2D
Interpolator
IPGeo
JSONParser
KMeans
KMeansPlot

KModes
KModesPredict
KNN
KNNRecommenderPredict
KNNRecommenderTrain
LARS
LARSPredict
LDAInference
LDATopicPrinter
LDATrainer
LDist (Levenshtein Distance)
LinReg
LinRegMatrix
LRTEST
Minhash
Multi_Case
MurmurHash
NaiveBayesTextClassifierPredict
NaiveBayesTextClassifierTrainer
NaiveBayesMap
NaiveBayesPredict
NaiveBayesReduce
NER
NEREvaluator
NERTrainer
NeuralNet
NeuralNetPredict
nGram
OutlierFilter
Pack
PartitionScale
PCA (Principal Component Analysis)

PCAPlot
Percentile
Pivot
PointInPolygon
POSTagger (Part-of-Speech Tagger)
PSTParserAFS
RandomSample
RtChangePointDetection
Sample
SAX2
Scale
ScaleMap
ScalePrinter
Sentenizer
SeriesSplitter
Sessionize
Single_Tree_Drive
Single_Tree_Predict
SMAVG (Simple Moving Average)
SortCombination
SparseSVMPredictor
SparseSVMTrainer
StringSimilarity
SupervisedShapeletClassifier
SupervisedShapeletTrainer
SVMModelPrinter
TextChunker
TextClassifier
TextClassifierTrainer
TextClassifierEvaluator
TextMorph
TextTagging

TextTokenizer
Text_Parser
TF_IDF (Term Frequency Inverse Document Frequency)
TrainNamedEntityFinder
TrainSentimentExtractor
Unpack
Unpivot
UnsupervisedShapelet
URIPack
URIUnpack
VARMAX
VectorDistance
VWAP (Volume-Weighted Average Price)
WMAVG (Weighted Moving Average)
XMLParser
XMLRelation

Premium Graph
Table 6: Premium Graph Bundle Functions in Alphabetical Order

Function Name
AllPairsShortestPath
Betweenness
Closeness
EigenvectorCentrality
gTree
LocalClusteringCoefficient
LoopyBeliefPropagation
Modularity
PageRank
pSALSA
RandomWalkSample

Aster Analytics
The Aster Analytics bundle contains:
• Functions specified in Premium Path, Premium Relationship, Analytics Foundation, and Premium
Graph
• AMLGenerator, which is used with the Aster Scoring SDK package

Aster Scoring SDK


Table 7: Aster Scoring SDK Functions in Alphabetical Order

Function Name
Scorer

Aster Analytics Functions by Category


This section lists the Aster Analytics functions by category. For an alphabetical list of the functions and their
syntax, see List of Functions and Their Syntax.

Time Series, Path, and Attribution Analysis


Table 8: Time Series, Path, and Attribution Analysis Functions

Function Description
Arima Calculates the coefficients for a sequence of parameters, producing an ARIMA
model.
ArimaPredictor Takes as input the ARIMA model produced by the Arima function and predicts a
specified number of future values (time point forecasts) for the modeled sequence.
Attribution Calculates attributions with a wide range of distribution models. Often used in web-
page analysis.
Burst Bursts (splits) a time interval into a series of shorter "burst" intervals that can be
analyzed independently.
Change-Point Detection Functions Detect the change points in a stochastic process or time series. The
change-point detection functions are ChangePointDetection and RtChangePointDetection.
Convergent Cross-Mapping Includes the CCMPrepare function, which adds a new partition column and
partitions the data to prepare it for use with the CCM function, which tests multiple
causes and effects simultaneously, reporting an effect size for each cause-effect pair.
DTW Computes the dynamic time warping—the similarity between two sequences that
vary in time or speed.
DWT Implements Mallat’s algorithm, an iterative algorithm in the discrete wavelet
transform field that applies wavelet transform on multiple sequences simultaneously.

DWT2D Implements wavelet transforms on two-dimensional input, and simultaneously
applies the transforms on multiple sequences.
FrequentPaths Mines (finds) patterns that appear more than a specified number of times in the
sequence database. The difference between sequential pattern mining and frequent
pattern mining is that the former works on time sequences where the order of items
must be kept.
IDWT Applies inverse wavelet transformation on multiple sequences simultaneously. IDWT
is the inverse of DWT.
IDWT2D Simultaneously applies inverse wavelet transforms on multiple sequences. Inverse
function of DWT2D.
Interpolator Calculates missing values in a time series, using either interpolation or aggregation.
Interpolation estimates missing values between known values. Aggregation combines
known values to produce an aggregate value.
Path Analysis Functions Automate path analysis. These functions are useful for clickstream analysis of web
site traffic and other sequence/path analysis tasks, such as advertisement or referral
attribution. The path analysis functions are Path_Generator, Path_Summarizer,
Path_Start, and Path_Analyzer.
SAX2 Transforms original time series data into symbolic strings, which are more suitable
for many additional types of manipulation, because of their smaller size and the
relative ease with which patterns can be identified and compared. Input and output
formats allow it to supply data to the Shapelet functions.
SeriesSplitter Splits a partition into subpartitions (called splits) by creating an additional column
that contains the split identifier. Optionally, the function also copies a specified
number of boundary rows to each split.
Sessionize Maps each click in a clickstream to a unique session identifier.
Shapelet Functions Detect distinguishing features among ordered sequences (time series) and use them
to cluster or classify new data. The shapelet functions are UnsupervisedShapelet,
SupervisedShapeletTrainer, and SupervisedShapeletClassifier.
VARMAX Extends the ARMA/ARIMA model to work with time series with multiple response
variables (vector time series), as well as exogenous variables, or variables that are
independent of the other variables in the system.

Pattern Matching with Teradata Aster nPath


Table 9: Pattern Matching with Teradata Aster nPath Function

Function Description
nPath Pattern-matching function that lets you specify a pattern in a row sequence,
specify additional conditions on the rows matching the symbols, and extract useful
information from the row sequence.

Statistical Analysis
Table 10: Statistical Analysis Functions

Function Description
Approximate Distinct Count Computes the approximate global distinct count of the values in one or more
columns, scanning the table only once. Counts all children for a specified parent.
Approximate Percentile Computes approximate percentiles for one or more columns, with specified
accuracy.
CMAVG Computes the cumulative moving average—the average of a value from the
beginning of a series.
ConfusionMatrix Shows how often a classification algorithm correctly classifies items.
Correlation Computes the global correlation between any pair of table columns.
CoxPH Estimates coefficients of a Cox proportional hazards model by learning a set of
explanatory variables. Generates coefficient and linear prediction tables.
CoxPredict Takes the coefficient table generated by the CoxPH function and outputs the hazard
ratios between predict features and either their corresponding reference features or
their unit differences.
CoxSurvFit Takes the coefficient and linear prediction tables generated by the CoxPH function
and outputs a table of survival probabilities.
CrossValidation Validates a model by assessing how the results of a statistical analysis will generalize
to an independent data set.
Distribution Matching Uses hypothesis testing to find the best matching distribution for data.
EMAVG Computes the average over a number of points in a time series while applying an
exponentially decaying damping (weighting) factor to older values so that more
recent values are given a heavier weight in the calculation.
FMeasure Calculates the accuracy of a test.
GLM Performs linear regression analysis for any of a number of distribution functions,
using a user-specified distribution family and link function.
GLMPredict Uses the model generated by the Stats GLM function to make predictions for new
data.
Hidden Markov Model Functions Describe the evolution of observable events that depend on factors that
are not directly observable. The Hidden Markov Model functions are
HMMUnsupervisedLearner, HMMSupervisedLearner, HMMEvaluator, and
HMMDecoder.
Histogram Calculates the frequency distribution of a dataset using sophisticated binning
techniques that can automatically calculate the bin width and number of bins. The
function maps each input row to one bin and returns the frequency (row count) and
proportion (percentage of rows) of each bin.
KNN Uses the kNN algorithm to classify new objects based on their proximity to already-
classified objects.

LARS Functions Select the most important variables one by one and fit the coefficients dynamically.
The LARS functions are LARS and LARSPredict.
Linear Regression Output the coefficients of the linear regression model represented by the input
matrices.
LRTEST Performs the likelihood ratio test for two GLM models.
Percentile Finds percentiles on a per group basis.
Principal Component Analysis Common unsupervised learning technique that is useful for both
exploratory data analysis and dimensionality reduction, often used as the core
procedure for factor analysis. Implemented by the functions PCA_Map and
PCA_Reduce. If the version of PCA_Reduce is AA 6.21 or later, you can input the
PCA output to the function PCAPlot.
RandomSample Takes a data set and uses a specified sampling method to output one or more random
samples, each with a specified size.
Sample Draws rows randomly from input, using either of two sampling schemes.
Shapley Value Functions Compute the Shapley value, typically from nPath function output. The Shapley
value is intended to reflect the importance of each player to the coalition in a
cooperative game (a game between coalitions of players, rather than between
individual players). The Shapley value functions are GenerateCombination,
SortCombination, and AddOnePlayer.
SMAVG Computes the simple moving average for a number of points in a series.
Support Vector Machines Use a popular classification algorithm to build a predictive model according to a
training set, give a prediction for each sample in the test set, and display the readable
information of the model. Support Vector Machines include both SparseSVM and
DenseSVM functions. The SparseSVM functions are SparseSVMTrainer,
SparseSVMPredictor, and SVMModelPrinter, while the DenseSVM functions are
DenseSVMTrainer, DenseSVMPredictor, and DenseSVMModelPrinter.
VectorDistance Measures the distance between sparse vectors (for example, TF-IDF vectors) in a
pairwise manner.
VWAP Computes the volume-weighted average price of a traded item (usually an equity
share) over a specified time interval.
WMAVG Computes the weighted moving average of a number of points in a time series,
applying an arithmetically-decreasing weighting to older values.

Text Analysis
Table 11: Text Analysis Functions

Function Description
LDA Functions Build a topic model based on the supplied training data and parameters, estimate the
topic distribution for each document based on the generated model, and display
information from the model. The LDA functions are LDATrainer, LDAInference,
and LDATopicPrinter.
Levenshtein Distance (LDist) Computes the Levenshtein distance between two text values, that is, the
number of edits needed to transform one string into the other, where edits include
insertions, deletions, or substitutions of individual characters.
Naive Bayes Text Classifier Uses the Naive Bayes algorithm to classify data objects. The Naive Bayes Text
Classifier is composed of the functions NaiveBayesTextClassifierTrainer and
NaiveBayesTextClassifierPredict.
NER Functions (CRF Model Implementation) Use the Conditional Random Fields (CRF) model to specify
how to extract features (for example, person, location, and organization) when
training data models. Train, evaluate, and apply models. These NER functions are
NERTrainer, NER, and NEREvaluator.
NER Functions (Max Entropy Model Implementation) Use the Max Entropy model to specify how to
extract features (for example, person, location, and organization) when training data
models. Train, evaluate, and apply models. These NER functions are
FindNamedEntity, TrainNamedEntityFinder, and EvaluateNamedEntityFinder.
nGram Tokenizes (splits) an input stream and emits n multi-grams based on specified
delimiter and reset parameters. Useful for sentiment analysis, topic identification,
and document classification.
POSTagger Tags the parts-of-speech of input text.
Sentenizer Extracts the sentences in the input paragraphs.
Sentiment Extraction Functions Deduce user opinion (positive, negative, or neutral) from text. The
sentiment extraction functions are TrainSentimentExtractor, ExtractSentiment, and
EvaluateSentimentExtractor.
Text Classifier Chooses the correct class label for given text. Text Classifier is composed of the
functions TextClassifierTrainer, TextClassifier, and TextClassifierEvaluator.
Text_Parser Tokenizes a stream of words, optionally stems them, and outputs the individual
words and their counts.
TextChunker Divides text into phrases and assigns each phrase a tag identifying its type.
TextMorph Provides lemmatization, a basic tool in text analysis. Outputs a standard form of the
input words.
TextTagging Tags input tuples according to user-defined rules that use logical and text processing
operators.
TextTokenizer Extracts tokens (for example, words, punctuation marks, and numbers) from text.
TF_IDF Evaluates the importance of a word within a specific document, weighted by the
number of times the word appears in the entire document set.

Cluster Analysis
Table 12: Cluster Analysis Functions

Function Description
Canopy Simple, fast, accurate function for grouping objects into preliminary clusters. Often
used as an initial step in more rigorous clustering techniques, such as k-means.
Gaussian Mixture Model Functions Fit a Gaussian mixture model (GMM) to input data, using either a
basic GMM algorithm with a fixed number of clusters or a Dirichlet Process GMM
(DP-GMM) algorithm with a variable number of clusters. The GMM functions are
GMMFit, GMMPredict, and GMMProfile.
KMeans Takes a data set and outputs the centroids of its clusters and, optionally, the clusters
themselves.
KMeansPlot Takes a model—a table of cluster centroids output by the KMeans function—and an
input table of test data, and uses the model to assign the test data points to the cluster
centroids.
KModes Extends KMeans to support categorical data. The core algorithm is an expectation-
maximization algorithm that finds a locally optimal solution.
KModesPredict Prediction function that corresponds to KModes.
Minhash Probabilistic clustering method that assigns a pair of users to the same cluster with
probability proportional to the overlap between the sets of items that these users have
bought.

Naive Bayes
Table 13: Naive Bayes Functions

Function Description
Naive Bayes Functions Train a Naive Bayes classification model and use the model to predict new outcomes.
The Naive Bayes functions are NaiveBayesMap, NaiveBayesReduce, and
NaiveBayesPredict.

Ensemble Methods
Table 14: Ensemble Methods Functions

Function Description
Random Forest Functions Create a predictive model based on a combination of the classification and
regression trees (CART) algorithm for training decision trees and the ensemble
learning method of bagging. The Random Forest functions are Forest_Drive,
Forest_Predict, and Forest_Analyze.
Single Decision Tree Functions Create a predictive model that has a single decision tree. The Single Decision
Tree functions are Single_Tree_Drive and Single_Tree_Predict.

AdaBoost Functions Create a predictive model based on the AdaBoost algorithm. The AdaBoost functions
are AdaBoost_Drive and AdaBoost_Predict.

Association Analysis
Table 15: Association Analysis Functions

Function Description
Basket_Generator Generates baskets (sets) of items that occur together in data records (typically
transaction records or web page logs).
CFilter Helps discover which items or events are frequently paired with other items or
events.
FPGrowth Uses an FP-growth algorithm to generate association rules from patterns in a data set
and then determines their interestingness.
Recommender Functions The recommender functions include the following:
WSRecommender is an item-based, collaborative filtering function that uses a
weighted-sum algorithm to make recommendations (such as items for users to
consider buying).
KNNRecommenderTrain and KNNRecommenderPredict take a similar approach to
WSRecommender, but attempt to increase prediction accuracy by adjusting for
systematic biases and replacing heuristic calculations of similarity coefficients with a
global optimization that simultaneously estimates all weights.

Graph Analysis
Table 16: Graph Analysis Functions

Function Description
AllPairsShortestPath Computes the shortest distances between all combinations of the specified
source and target vertices.
Betweenness Determines betweenness for every vertex in a graph. Betweenness is a type
of centrality (relative importance) measurement.
Closeness Computes closeness and k-degree scores for each specified source vertex
in a graph.
EigenvectorCentrality Calculates the centrality (relative importance) of each node in a graph.
gTree Follows all paths in a graph, starting from a given set of root vertices, and
calculates specified aggregate functions along those paths.
LocalClusteringCoefficient Analyzes the structure of a network.
LoopyBeliefPropagation Calculates the marginal distribution for each unobserved node,
conditional on any observed nodes.

Modularity Discovers communities (clusters) in input graphs without advance
information about the clusters. Detects communities by discovering the
strength of relationships among data points.
nTree Builds and traverses tree structures on all worker nodes in a graph.
PageRank Computes PageRank values for a directed graph.
pSALSA Evaluates the similarity of nodes in a bipartite graph according to their
proximity. Typically used for recommendation.
RandomWalkSample Outputs a sample graph that represents the input graph (which is typically
extremely large).

Aster Scoring SDK


Table 17: Aster Scoring SDK Functions

Function Description
AMLGenerator Transforms model data from Aster to an XML-based AML (Aster Model Language)
format that is compatible with the real-time functionality.
Scorer Provides a software framework to score input queries based on a given model and
predictor. The following real-time functions are currently supported by scorer: Aster
Scoring SDK CoxPH, Aster Scoring SDK Extract Sentiment, Aster Scoring SDK
Generalized Linear Model, Aster Scoring SDK LDAInference, Aster Scoring SDK
Naïve Bayes, Aster Scoring SDK Naïve Bayes Text Classifier, Aster Scoring SDK
Random Forest, Aster Scoring SDK Single Decision Tree, Aster Scoring SDK
SparseSVM, Aster Scoring SDK Text Parser, Aster Scoring SDK Text Tagging, and
Aster Scoring SDK Text Tokenizer.

NeuralNet
Table 18: Neural Net Functions

Function Description
NeuralNet Uses backpropagation to train neural networks. The user provides input data and
other argument settings for training the networks, and the fitted weights of the neural
network are created. The Neural Net function is optimized for performance on very
large datasets (millions of rows).
NeuralNetPredict Predicts the output for specific arbitrary covariate inputs, using a particular trained
neural network output weight table.

Data Transformation
Table 19: Data Transformation Functions

Function Description
Antiselect Returns all columns except those specified.
Apache_Log_Parser Parses Apache log file content and extracts multiple columns of structural
information, including search engines and search terms.
Categorize Converts specified columns from any numeric type to VARCHAR.
Fellegi-Sunter Functions FellegiSunterTrainer estimates the parameters of the Fellegi-Sunter model, using
either supervised or unsupervised learning. FellegiSunterPredict predicts whether a
pair of objects are duplicates.
Geometry Functions GeometryLoader retrieves file-based geospatial files from AFS, parses them, and
stores them in Aster Database. GeometryOverlay calculates the result of overlaying
two geometries as specified by the overlay operator. PointInPolygon takes as input a
list of location points and a list of polygons and returns a list of binary values for
every point-and-polygon combination, which indicates whether the point is
contained in the polygon.
IdentityMatch Tries to match enterprise customers with user records provided by external data
sources.
IPGeo Maps IP addresses to information that you can use to identify the geographical
location of a visitor.
JSONParser Extracts the element name and text from JSON strings and outputs them in a
flattened relational table.
Multi_Case Extends the capability of the SQL CASE statement by supporting matches to
multiple options and iterating through the input data set only once, emitting
matches as they occur.
MurmurHash Computes the hash value of the input columns.
OutlierFilter Removes outliers from a data set.
Pack Compresses data in multiple columns into a single “packed” data column.
Pivot Converts rows into columns.
PSTParserAFS Parses Personal Storage Table (PST) files that store email in Microsoft software such
as Microsoft Outlook and Microsoft Exchange Client.
Scale Functions Normalize input data sets. The Scale functions are ScaleMap, Scale, ScalePrinter,
and PartitionScale.
StringSimilarity Calculates the similarity between two strings, using either the Jaro, Jaro-Winkler, N-
Gram, or Levenshtein distance.
Unpack Expands data from a single “packed” column to multiple “unpacked” columns.
Unpivot Converts columns into rows.
URIPack Reconstructs encoded hierarchical uniform resource identifier (URI) strings that
were unpacked by the URIUnpack function.

URIUnpack Separates hierarchical URIs into constituent components and extracts the values of
specified parameters.
XMLParser Extracts data from XML documents and flattens it into a relational table.
XMLRelation Extracts element name, text and attribute values, and structural information from
XML documents and output them in a relational table.

Aster Database Utilities


For information on the functions used for querying data from local vworkers, refer to Aster Database System
Utility Functions. These functions are useful for querying local catalog tables to support database
administration activities.



CHAPTER 2
Installing Aster Analytics Functions

Installing Aster Analytics Functions


• Aster Analytics Function Version Numbers
• Finding Function Version Numbers
• Aster Analytics Compatibility Matrix
• Aster Analytics Function Packages
• Downloading an Aster Analytics Function Package
• Getting Install and Uninstall Scripts
• Installing an Aster Analytics Function Package
• Updating an Aster Analytics Function Package
• Installing a Function in a Specific Schema
• Managing Files with ACT Commands
• Usage Notes

Aster Analytics Function Version Numbers


Aster Analytics functions have version numbers assigned by Teradata. The version number lets you:
1. Check whether the installed function is compatible with your version of Aster Database.
2. Ensure that the documentation version matches the version of the installed function.
3. Verify that function updates succeeded.
For every new release of Aster Analytics Foundation, Teradata recommends that you update all functions.
Functions whose syntax or behavior changed are listed in the Aster Analytics Release Notes.

Finding Function Version Numbers


To find the version numbers of installed functions, you can use either the Aster Database Cluster Terminal
(ACT) command \dE or this query:

SELECT [schemaid,]funcname,funcversion FROM
{ nc_user_sqlmr_funcs | nc_user_owned_sqlmr_funcs };

Sample output from the \dE command and the preceding query appears in the following two tables. The
version numbers of the Aster Analytics functions AllPairsShortestPath and Antiselect are
6.20_rel_1.5_r39242 and 6.20_rel_1.0_r39242, respectively.

Table 20: ACT \dE Command Output Sample

schemaname  funcname              funcowner  funcversion          creationtime
nc_system   load_from_hcatalog    db_admin   6.20-r39242          2014-12-04 07:02:48.4148
...         ...                   ...        ...                  ...
public      allpairsshortestpath  beehive    6.20_rel_1.5_r39242  2014-12-04 07:21:38.969565
public      antiselect            beehive    6.20_rel_1.0_r39242  2014-12-04 07:21:41.781971
...         ...                   ...        ...                  ...

Table 21: Query Output Sample

schemaid  funcname              funcversion
          stream                6.20-r39242
...       ...                   ...
16379     load_from_hcatalog    6.20-r39242
...       ...                   ...
2200      allpairsshortestpath  6.20_rel_1.5_r39242
2200      antiselect            6.20_rel_1.0_r39242
...       ...                   ...

In the function version release_number_rel_function_version_rbuild_number:


• release_number is the Aster Analytics Foundation release number (for example, 6.20)
• function_version is the function version number (for example, 1.5 or 1.0)
• rbuild_number is the build number of the Aster Analytics Foundation release (for example, r39242)
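
To check a single function, you can filter the catalog view used above; for example (using the antiselect
function from the sample output):

SELECT funcname, funcversion
FROM nc_user_sqlmr_funcs
WHERE funcname = 'antiselect';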

Note:
Neither the \dE command nor the query displays version numbers for the Aster Database Utility
functions, which are installed as part of the Aster Database installation and always compatible with it.

Aster Analytics Compatibility Matrix


The following table shows which versions of the SQL-MapReduce API and Aster Database are compatible
with which versions of Aster Analytics Foundation.
Table 22: Aster Analytics Foundation Compatibility Matrix

SQL-MR API  Aster Database  Aster Analytics Version
Version     Version         6.21  6.20  6.10  6.0   5.11  5.10  5.02  5.01
4.0         AD 6.20         Y     Y     Y     Y
4.0         AD 6.10         Y     Y     Y     Y
4.0         AD 6.0.1        N     Y     Y     Y     Y     Y
3.0         AD 5.10         N     N     N     N     Y     Y     Y     Y
3.0         AD 5.02         N     N     N     N     Y     Y     Y     Y
3.0         AD 5.0          N     N     N     N     Y     Y     Y     Y

Aster Analytics Function Packages


The Aster Analytics functions belong to product bundles, which are delivered in packages. To download a
package, you must know the name of the ZIP file that contains its functions. You can find this information
in the following table. For a list of the functions in a product bundle, click the product bundle.
The Aster Analytics bundle combines the bundles Premium Path, Premium Relationship, Premium Graph,
and Aster Foundation.
Table 23: Aster Analytics Function Product Bundles, Packages, and ZIP File Names

Product Bundle        Package Name                                                            ZIP File Name
Analytics Foundation  Aster MapReduce Analytic Foundation Portfolio 6.21                      AsterAnalytics_Foundation__indep_indep.06.21.00.00.zip
Premium Path          Aster MapReduce Analytic Premium Portfolio - Path/Pattern Module 6.21   AsterAnalytics_PremiumPath__indep_indep.06.21.00.00.zip
Premium Relationship  Aster MapReduce Analytic Premium Portfolio - Relationship Module 6.21   AsterAnalytics_PremiumRelationship__indep_indep.06.21.00.00.zip
Premium Graph         Aster Graph Analytic Premium Portfolio - Graph Module 6.21              AsterAnalytics_PremiumGraph__indep_indep.06.21.00.00.zip
Aster Analytics       Aster Analytic Premium Portfolio 6.21                                   AsterAnalytics_AsterAnalytics__indep_indep.06.21.00.00.zip
Aster Scoring SDK     Aster Scoring SDK 6.21                                                  AsterAnalytics_ScoringSDK_indep_indep.06.21.00.00.zip

Downloading an Aster Analytics Function Package
To download an Aster Analytics function package:
1. In Aster Analytics Function Packages, find the name of the ZIP file that contains the package.
2. Change your directory to /opt/teradata/AsterAnalytics_Foundation:

cd /opt/teradata/AsterAnalytics_Foundation
3. If the ZIP file is not there, contact your Teradata account manager to get its location.
4. Create a directory for the package that you are installing. For example:

mkdir AA_6.21
5. Change your directory to the newly created directory. For example:

cd AA_6.21

Postrequisite
Getting Install and Uninstall Scripts

Getting Install and Uninstall Scripts


Each Aster Analytics function package has SQL scripts that install and uninstall the package, for the PUBLIC
schema and a specified schema. Teradata recommends installing the Aster Analytics functions only in the
PUBLIC schema, but you can install them in other schemas.

Note:
Driver functions look for the functions that they call internally in the PUBLIC schema. Therefore, driver
functions might work incorrectly if you install all the functions to a schema other than PUBLIC.

Scripts for the Schema PUBLIC


To get the scripts that install a package in, and uninstall it from, PUBLIC:
1. Copy the ZIP, README, and postinstall files to your current directory. For example:

cp /opt/teradata/AsterAnalytics_Foundation/AsterAnalytics_Foundation__indep_indep.06.21.00.00.zip \
   README postinstall /opt/teradata/AsterAnalytics_Foundation/AA_6.21/
2. Unzip the ZIP file. For example:

unzip AsterAnalytics_Foundation__indep_indep.06.21.00.00.zip

The files contained in the ZIP file are now in your current directory, including SQL scripts to install and
uninstall the functions in the package. For example:

install_aster_analytics.sql
un_install_aster_analytics.sql

Next Step:
• If you are installing the package for the first time:
Installing an Aster Analytics Function Package
• If you are updating a package that is already installed:
Updating an Aster Analytics Function Package

Alternate Access Location for AA 6.0 and Later Uninstall Scripts


Another way to access uninstall scripts for releases AA 6.0 and later is to download them from the Teradata
Developer Exchange, as follows:
1. If this is the first time you have downloaded scripts from the Teradata Developer Exchange, create an
account (login name and password):
a) Go to http://downloads.teradata.com/.
b) Click Register.
c) Follow the instructions to create a login name and password.
2. Go to http://downloads.teradata.com/download/aster/aster-analytics-uninstall
The page contains downloadable .zip files of uninstall scripts, with their release dates. The file for the
latest release is at the top of the page. Files for earlier releases are in the OTHER RELEASES table.

Scripts for a Specified Schema


Note:
You can create these scripts only if Python is installed on your system. If Python is not installed on your system, use the Aster Database Cluster Terminal (ACT) commands \install and \remove instead of the install and uninstall scripts. For more information about ACT commands, refer to Managing Files with ACT Commands.

To get the scripts that install a package in, and uninstall it from, a specified schema:
1. Go to https://downloads.teradata.com/download/aster
2. Download the Aster Analytics Custom Schema Installer package.
3. Unzip and copy these files to your current directory:
make_install_scripts.py analytics_packages.csv
4. Run this command, where SCHEMANAME is the name of the desired schema:
python make_install_scripts.py SCHEMANAME
The Python script make_install_scripts.py reads the CSV file analytics_packages.csv,
which tells which functions are in which packages, and generates install and uninstall SQL scripts for
each package, which it saves to your current directory. The names of the generated SQL scripts have this
format:

install_PACKAGENAME_to_SCHEMANAME.sql
un_install_PACKAGENAME_from_SCHEMANAME.sql
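For example, a run against a hypothetical schema named myschema (the schema name is illustrative) generates files such as:

python make_install_scripts.py myschema

install_aster_analytics_to_myschema.sql
un_install_aster_analytics_from_myschema.sql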

Postrequisite
• If you are installing the package for the first time:
Installing an Aster Analytics Function Package
• If you are updating a package that is already installed:
Updating an Aster Analytics Function Package

Installing an Aster Analytics Function Package


Note:
If you are updating installed functions to a newer version, follow the instructions in Updating an Aster
Analytics Function Package.

1. Ensure that you have the necessary access privileges for installing the Aster Analytics functions in the
desired schema.

Note:
Teradata recommends installing the Aster Analytics functions only in the schema PUBLIC. If you
must maintain a different version of a function in another schema, refer to Scripts for a Specified
Schema.
2. Go to the directory where the functions from the package are. For example:

cd /opt/teradata/AsterAnalytics_Foundation/AA_6.21
3. Run the SQL script that installs the package. For example:

install_aster_analytics.sql

or:

install_ASTER_ANALYTICS_to_SCHEMANAME.sql
4. Check the version numbers of the newly installed functions, using the instructions in Finding Function
Version Numbers.

Note:
Aster Database does not share database objects, including functions, across databases.

Note:
When you expand a cluster by adding worker nodes, all installed Aster Analytics functions on your
cluster are automatically added to the new nodes.

Next Step: Set Default Schema for Function Users

Set Default Schema for Function Users
For each user of the installed or updated functions, ensure that the schema in which the functions are
installed (typically PUBLIC) is the default schema in the search path.
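For example, assuming Postgres-style search-path handling (a reasonable assumption for Aster Database, but confirm the exact statements for your release), a session-level check and change might look like this:

beehive=> SHOW search_path;
beehive=> SET search_path TO PUBLIC;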

Postrequisite
Set Permissions to Allow Users to Run Functions

Set Permissions to Allow Users to Run Functions


In the following procedure, schema is the schema where the functions are installed (typically PUBLIC).
1. Ensure that you have the privilege to grant the EXECUTE privilege on the installed functions.
2. Grant the EXECUTE privilege in either of these ways:
• On each function in schema, to all users. For example:
GRANT EXECUTE ON FUNCTION PUBLIC.Path_Start TO PUBLIC;
• To each user, on each function that the user must run. For example:
GRANT EXECUTE ON FUNCTION PUBLIC.Path_Start TO beehive;

Set Additional Permissions If Functions Are Not on PUBLIC


If schema is not PUBLIC, then you must grant each function user these additional privileges on schema, as in the sketch after this list:
• READ
• INSTALL
• CREATE TABLE
• TRUNCATE TABLE
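A minimal sketch, assuming a non-PUBLIC schema named analytics and a user named beehive (both names, and the exact GRANT ... ON SCHEMA syntax, are illustrative assumptions; verify against your Aster Database release):

GRANT READ ON SCHEMA analytics TO beehive;
GRANT INSTALL ON SCHEMA analytics TO beehive;
GRANT CREATE TABLE ON SCHEMA analytics TO beehive;
GRANT TRUNCATE TABLE ON SCHEMA analytics TO beehive;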

Postrequisite
Testing the Functions

Testing the Functions


To test an installed function:
1. Access ACT as a user who has the EXECUTE privilege on the function.
2. In a SELECT statement, invoke the function, schema-qualifying its name unless its schema is in your
schema search path.
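For example, the following smoke test invokes the Arima function (documented in a later chapter of this guide) with a schema-qualified name; the table and column names are taken from that function's example and are illustrative only:

beehive=> SELECT * FROM PUBLIC.Arima (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('milk_timeseries')
  ModelTable ('arimamodel')
  TimestampColumns ('period')
  ValueColumn ('milkpound')
  Orders ('3,0,0')
);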

Updating an Aster Analytics Function Package


To update Aster Analytics functions that are already installed, you must first uninstall them and then install
their newer versions. Teradata recommends uninstalling the package that contains the older functions and
then installing the package that contains the newer versions.

Note:
During this process, the functions are unavailable to users. If you want to maintain the older versions of
the functions, install the new versions on a different schema.

To uninstall the package that contains the older functions and then install the package that contains the newer versions:
1. Find the version numbers and schemas of the installed functions, using the instructions in Finding
Function Version Numbers.
2. Ensure that you have the necessary privileges for uninstalling the Aster Analytics functions in their
schema.
3. In the list of version numbers, find the release number of the functions in the package.
For example, in Finding Function Version Numbers, the installed Aster Analytics functions have release
number 6.20.
4. Go to the directory where the functions from the older package are. For example:

cd /opt/teradata/AsterAnalytics_Foundation/AA_6.20

Note:
Uninstall scripts for older packages are also available in the Aster Analytics 6.21 packages.
5. Run the SQL script that uninstalls the older package. For example:

un_install_aster_analytics.sql

or:

un_install_ASTER_ANALYTICS_from_SCHEMANAME.sql
6. Go to the directory where the functions from the newer package are. For example:

cd /opt/teradata/AsterAnalytics_Foundation/AA_6.21
7. Run the SQL script that installs the newer package. For example:

install_aster_analytics.sql

or:

install_ASTER_ANALYTICS_to_SCHEMANAME.sql

Postrequisite
The alternative to the recommended procedure is to use the ACT commands \remove and \install on each
function:
1. beehive=> \remove function_filename
2. beehive=> \install function_filename

For more information about the \remove and \install commands, refer to Managing Files with ACT
Commands.

Installing a Function in a Specific Schema


Teradata recommends installing Aster Analytics functions only in the PUBLIC schema. However, if you
must maintain a separate instance of a function (for example, an older version for compatibility with
existing scripts), you can install that function in another schema.
To install a function in a schema other than public, qualify the function file name with the schema name. For
example:
\install posTagger.zip textanalysis/posTagger.zip
\install textChunker.zip textanalysis/textChunker.zip
Then, qualify the function name with the schema name when you invoke the function. For example:
SELECT * FROM textanalysis.TextChunker ( ON (SELECT * FROM textanalysis.POSTagger ( ON ...

Managing Files with ACT Commands


You can use Aster Database Cluster Terminal (ACT) commands to install and manage individual files,
including:
• Aster Analytics functions
• SQL-MapReduce functions (compiled Java and C executables that can be invoked by name in the FROM
clause of a SELECT statement)
• Script files for stream()
• Files that provide settings to SQL-MapReduce functions or stream() script files
If a function is not a standalone function—that is, if it is composed of multiple functions—then you must
install all of its components. You can tell if a function is composed of multiple functions by checking its
syntax in this guide.
The following table describes the ACT commands for installing, downloading, and removing files and
functions. A file or function is “local” if it resides on your local file system and “remote” if it resides in Aster
Database.
You can put the \install and \remove commands in BEGIN / COMMIT blocks, like transactional SQL
commands.
Table 24: ACT Commands for Managing Files and Functions

\dF
Lists all installed files and functions.

\install file [installed_filename]
Installs the file or function file, where file is the path name of the file relative to the directory where you are running ACT. By default, the installed file or function keeps the name of the local file. The optional installed_filename is an alias. Aliases are useful for renaming helper files, but are not recommended for SQL-MapReduce functions, because they can cause confusion.

\download installed_filename [newfilename]
Downloads the file or function installed_filename to the directory where you are running ACT. By default, the downloaded local file or function has the same name as the corresponding remote file. To give the local file or function a new name, specify the optional newfilename. If newfilename is a path, the destination directory must exist on the file system where you are running ACT.

\remove installed_filename
Removes the file or function installed_filename.
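For example, a typical ACT session that exercises these commands on the posTagger.zip file (which also appears in an earlier example; the backup file name below is illustrative) might look like this sketch:

beehive=> \install posTagger.zip
beehive=> \dF
beehive=> \download posTagger.zip posTagger_backup.zip
beehive=> \remove posTagger.zip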

Usage Notes
These usage notes apply to all functions, except as noted.

Enclosing Database Object Names in Double Quotation Marks


For analytic functions, if the name of a database object contains spaces or special characters, or is case-
sensitive, then you must enclose the name in double quotation marks, as in the following examples.
Create a schema and a table:

beehive=> create schema "Schema TxtClssfr";
CREATE SCHEMA
beehive=> create fact table "Schema TxtClssfr"."text_Classifier Case fact" (id int, "ConTent" text, "[Category] coLumn" text) distribute by hash(id);
CREATE TABLE

If such a database object name is a function argument that must be enclosed in single quotation marks, then
you must put the double quotation marks inside the single quotation marks, as in this SQL-MapReduce
query:

SELECT * FROM TextClassifierTrainer (
ON (SELECT 1) PARTITION BY 1
INPUTTABLE ('"Schema TxtClssfr"."text_Classifier Case fact"')
TextColumn ('"ConTent"')
CategoryColumn ('"[Category] coLumn" ')
ModelFile ('KNN.bin')
ClassifierType ('knn')
ClassifierParameters ('Compress:0.5')
FeatureSelection ('DF:[0.1:]')
Database ('beehive')
UserID ('beehive')
Password ('beehive')
);

Boolean Argument Values
Some analytic functions have Boolean arguments with this syntax:

argument ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})

For such arguments, the values 't', 'yes', 'y', and '1' are equivalent to 'true', and the values 'f', 'no', 'n', and '0' are equivalent to 'false'.
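For example, in the Arima function described later in this guide, IncludeMean ('y') and IncludeMean ('1') are both equivalent to IncludeMean ('true').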

Column Specification Arguments


Some analytic functions have column specification arguments with this syntax:

argument ( {column_name | column_range }[,...] )

The syntax of column_range is:

'start_column:end_column' [, '-exclude_column' ]

The range includes its endpoints.


The start_column and end_column can be:
• Column names (for example, '[column1:column2]')
• Nonnegative integers that represent the indexes of columns in the table (for example, '[0:4]')
The first column has index 0; therefore, '[0:4]' specifies the first five columns in the table.
• Empty. For example:
∘ '[:4]' specifies all columns up to and including the column with index 4.
∘ '[4:]' specifies the column with index 4 and all columns after it.
∘ '[:]' specifies all columns in the table.
The exclude_column is a column in the specified range, represented by either its name or its index. For example, '[0:99]', '-[50]', '-column10' specifies the columns with indices 0 through 99, except the column with index 50 and the column named column10.
Column ranges cannot overlap and cannot include any column_name that is specified separately in the argument.
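For example, assuming a hypothetical input table whose first six columns are named c0 through c5 (indices 0 through 5), these three specifications all select columns c0 through c4:

argument ('c0', 'c1', 'c2', 'c3', 'c4')
argument ('[0:4]')
argument ('[:5]', '-c5')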

DATE Columns
Input columns of the type DATE must have formats with four-digit years.

BC/BCE Timestamps
SQL-MapReduce functions do not support Before the Common Era (BCE) timestamps. BCE is an
alternative to Before Christ (BC). Examples of BC/BCE timestamps are:
4713-01-01 11:07:11-07:52:58 BC
4713-01-01 11:07:11 BC

Creating a Timestamp Column
If there are separate date and time columns in the input table, you can create a timestamp column using this
query:
SELECT (datecolumn || ' ' || timecolumn)::timestamp as mytimestamp FROM input_table;

Granting CREATE Privileges


Before you run SQL-MapReduce functions, ensure that you have the appropriate CREATE privileges on the public schema. Several functions create objects, such as model files, in the first schema of your search path, which is the public schema by default; without CREATE privileges on that schema, those functions fail with permission errors.
For example, if you run this example code without having CREATE privileges on the public schema, you get
an error message:

SELECT * FROM TextClassifierTrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('s_mktg.bdulay_search_train')
TextColumn ('s_term')
CategoryColumn ('s_cat')
ModelFile ('knn.bin')
ClassifierType ('knn')
ClassifierParameters ('compress:0.8')
NLPParameters ('useStem:true')
Database ('beehive')
UserID ('loadusr')
Password ('loadusr')
);
ERROR: SQL-MR function TEXTCLASSIFIERTRAINER failed: Error occurred when
install model file: Fail to install file: /tmp/1379088204401/knn.bin
Message:[AsterData][NClusterJDBCDSII](34) ERROR: permission denied for schema
"public" for user "loadusr

In this example, because a schema was not specified in the ModelFile argument, the function selects the first schema from your search path and tries to install the model file in it. By default, the first schema in the search path is the public schema. Because the user does not have CREATE privileges on the public schema, a privilege failure occurs.
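One way to avoid the failure, assuming (as the explanation above implies) that the ModelFile argument accepts a schema-qualified file name, is to name a schema on which you do have CREATE privileges; the schema name below is illustrative:

ModelFile ('s_mktg.knn.bin')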

Adding Model File Locations to the Default Search Path


For the functions that use model files, set the user/session default search path to include the locations of the model files.

Connecting to Aster Database Using Authentication Cascading


For Aster Database 6.0 and later, authentication cascading lets you omit the UserID and Password arguments, which helps prevent credential misuse.

Using Authentication Cascading
If you want a driver-based SQL-MapReduce function to use the authentication information of the current
connection to connect to Aster Database instead of specifying a new JDBC connection, then do not provide
the UserID and Password arguments.

Authentication Argument Syntax

[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]

Arguments

Argument Category Description

Domain Optional The address of the Queen, specified as host:port. The host parameter is the Aster Database Queen's IP address or hostname. To specify an IPv6 address, enclose the host parameter in square brackets (for example, [::1]:2406). The port parameter is the port number that the Queen is listening on. The default is the Aster Database standard port number (2406). For example: Domain ('10.51.23.100:2406')

Database Optional The name of the database where the input table is located. In Aster Database 5.11 and earlier, the default database is beehive. In Aster Database 6.0 and later, the default value of this argument is the database name of the current connection.

UserID Optional The Aster Database user name of the user running this function. The default value is 'beehive'.

Password Optional The Aster Database password of the user. If the UserID and Password arguments are not specified, authentication cascading is automatically adopted, and the newly created JDBC connection inherits the privileges of the current connection.

SSLSettings Optional A string that includes all the SSL settings, except the password for the truststore. For example: 'ENABLESSL=true&SSLTRUSTSTORE=/home/beehive/truststore/truststore.jks'. Use this argument only when the Aster Database cluster requires an SSL connection.

SSLTrustStorePassword Optional The password for the SSL truststore. For example: SSLTrustStorePassword ('123456'). If you use an SSL JDBC connection, you must specify both the SSLSettings and SSLTrustStorePassword arguments. If you use a normal JDBC connection, you must not specify either argument. Use these arguments only when the Aster Database cluster requires an SSL connection.

Examples
When you use authentication cascading, you can change the usage of the following example driver function
from:

SELECT * FROM kmeans (
ON (SELECT 1)
PARTITION BY 1
Database ('databasename')
UserID ('userid')
Password ('password')
InputTable ('kmeanssample')
OutputTable ('kmeanssample_centroid')
NumberK (3)
Threshold ('0.01')
MaxIterNum ('10')
);

to:

SELECT * FROM kmeans (
ON (SELECT 1)
PARTITION BY 1
InputTable ('kmeanssample')
OutputTable ('kmeanssample_centroid')
NumberK (3)
Threshold ('0.01')
MaxIterNum ('10')
);

The following example connects to the database over SSL using authentication cascading:

SELECT * FROM kmeans (
ON (SELECT 1)
PARTITION BY 1
SSLSettings ('ENABLESSL=true&SSLTRUSTSTORE=/home/beehive/truststore/truststore.jks')
SSLTrustStorePassword ('******')
InputTable ('kmeanssample')
OutputTable ('kmeanssample_centroid')
NumberK (3)
Threshold ('0.01')
MaxIterNum ('10')
);

Connecting to Aster Database Using SSL JDBC Connections
If you want a driver SQL-MapReduce function to use a SSL JDBC connection to connect to Aster Database
instead of a normal JDBC connection, you must specify these two arguments:

SSLSettings ('SSLsettings')
SSLTrustStorePassword ('SSLTrustStorePassword')

Arguments

Argument Category Description

SSLSettings Required The string that specifies the SSL connection information, excluding the SSL truststore password. Use this argument if you want the function to use an SSL JDBC connection to connect to Aster Database instead of a normal JDBC connection. The connection string specified by this argument is appended to the end of the SSL JDBC connection string. For example, if the domain is 192.168.1.2 and the database name is beehive, specifying SSLSettings ('ENABLESSL=true&SSLTRUSTSTORE=/home/beehive/truststore/truststore.jks') results in this connection string:

jdbc:ncluster://192.168.1.2/beehive?ENABLESSL=true&SSLTRUSTSTORE=/home/beehive/truststore/truststore.jks

SSLTrustStorePassword Required The SSL truststore password, which is required if you use the SSLSettings argument. For example: SSLTrustStorePassword ('123456'). If SSLSettings is not specified, do not specify this argument.

Example

SELECT * FROM EIGEN_CENTRALITY (
ON (SELECT 1)
PARTITION BY 1
SSLSettings ('ENABLESSL=true&SSLTRUSTSTORE=/home/beehive/truststore/truststore.jks')
SSLTrustStorePassword ('123456')
InputTable ('raw_edges')
OutputTable ('eigen_centrality_string')
Threshold ('0.01')
);

Error Message Delays
In some instances, an error that occurs while a SQL-MapReduce function runs is not reported immediately; the error message appears only after a delay.

Sparse Tables and Dense Tables


Some functions (for example, SVM) use sparse tables to allow processing of more attributes than the database can support as columns; other functions use dense tables. Algorithms often perform more efficiently when the input is a dense table.
The Pivot and Unpivot functions convert between the two representations.
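For illustration (the column names here are hypothetical), the same two attributes can be stored either way:

Dense, one row per entity and one column per attribute: (id, height, weight) = (1, 180, 75)
Sparse, one row per entity/attribute pair: (id, attribute, value) = (1, 'height', 180) and (1, 'weight', 75)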

Permanent Tables As Output of Driver-Based Functions


All non-driver-based functions return their results as result sets, but all Aster Analytics Foundation driver-based functions store their results in permanent tables.
A driver function has to read all the rows from the Queen through JDBC (or ODBC), feed the rows to the SQL-MapReduce API to send them back to the Queen, and finally return them to the end client. For this reason, it is more efficient for driver-based functions to store their results in permanent tables than to return result sets.

Input Table Aliases


The syntax of some of the functions listed in this guide uses aliases to represent input tables. For example, in
the following ON clauses of a graph function, “vertices” is the alias of the input table containing the vertices
of the input graph and “edges” is the alias of the table containing the edges connecting the vertices.

...
ON vertices_table AS vertices PARTITION BY ...
ON edges_table AS edges PARTITION BY ...
...

These aliases are not variables and must be used exactly as specified in the function syntax. If you use different aliases, the function returns an error, as shown in the following example:

beehive=> ...
beehive=> ON cities AS vr PARTITION BY ...
beehive=> ON freeways AS edges PARTITION BY ...
beehive=> ...
ERROR: SQL-MR function ALLPAIRSSHORTESTPATH requires input table or query
with alias: vertices
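The fix is to use the aliases exactly as the AllPairsShortestPath syntax specifies:

beehive=> ON cities AS vertices PARTITION BY ...
beehive=> ON freeways AS edges PARTITION BY ...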

CHAPTER 3
Time Series, Path, and Attribution Analysis

Time Series, Path, and Attribution Analysis


See also Pattern Matching with Teradata Aster nPath.

Arima

Summary
The Arima function calculates the coefficients for a sequence of parameters, producing an ARIMA model
that is typically input to the function ArimaPredictor.

Background
An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving
average (ARMA) model. Typically, these models are fitted to time series data to predict future data points
(forecasting).
An ARIMA model adds to an ARMA model a degree of differencing, which makes the time series stationary,
if necessary. Sometimes the ARIMA model uses both differencing and some nonlinear transformations, such
as logging, to make the time series stationary.
A random variable that is a time series is stationary if its statistical properties are constant over time. An
ARIMA model acts as a filter that separates the signal from the noise, so that only the signal is used for
forecasting.

Nonseasonal ARIMA Model


In a nonseasonal ARIMA model, ARIMA(p, d, q):
• p is the order of the autoregressive part
• d is the degree of first differencing
• q is the order of the moving average part
To calculate the coefficients of the input parameters, the Arima function uses this formula:

AR(p) d_differences = MA(q)

where:
• AR(p) = 1 - φ_1 B - … - φ_p B^p
• d_differences = (1 - B)^d y_t
• MA(q) = c + (1 + θ_1 B + … + θ_q B^q) e_t

The φ values are nonseasonal autoregressive parameters and the θ values are nonseasonal moving average parameters.
The value B is the backshift operator, which is defined as follows:

B y_t = y_{t-1}
B^n y_t = y_{t-n}

e_t is the residual error, the difference between the actual and predicted values of y_t.
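As an illustration, an ARIMA(1, 1, 1) model expands to:

(1 - φ_1 B)(1 - B) y_t = c + (1 + θ_1 B) e_t

which rearranges to:

y_t = y_{t-1} + φ_1 (y_{t-1} - y_{t-2}) + c + e_t + θ_1 e_{t-1}

Each fitted value therefore combines the previous observation, the previous first difference, and the previous residual.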

Seasonal ARIMA Model


In a seasonal ARIMA model, ARIMA(p, d, q)(sp, sd, sq)_m:
• p, d, and q are as in the nonseasonal model
• sp, sd, and sq are the seasonal analogs of p, d, and q
• m is the number of data points in each season
To calculate the coefficients of the input parameters, the Arima function uses this formula:

(1 - φ_1 B - … - φ_p B^p)(1 - Φ_1 B^m - … - Φ_sp B^{m*sp})(1 - B)^d (1 - B^m)^{sd} y_t =
c + (1 + θ_1 B + … + θ_q B^q)(1 + Θ_1 B^m + … + Θ_sq B^{m*sq}) e_t

The Φ values are seasonal autoregressive parameters and the Θ values are seasonal moving average parameters.
The values B and e_t are defined as in the nonseasonal model.

Usage

Arima Syntax
Version 1.1

SELECT * FROM Arima (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
ModelTable ('output_table')
[ ResidualTable ('residual_table') ]
TimestampColumns
({ 'timestamp_column' | 'timestamp_column_range' }[,...])
ValueColumn ('value_column')
Orders ('p, d, q')
[ SeasonalOrders ('sp, sd, sq') ]

110 Teradata Aster Analytics Foundation User Guide


Chapter 3: Time Series, Path, and Attribution Analysis
Arima
[ Period ('period')]
[ IncludeMean ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Fixed ('fixed_params') ]
[ InitValues ('init_params') ]
[ MaxIterNum ('max_iteration_number') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the input
parameters.
ModelTable Required Specifies the name of the table where the function outputs the
coefficients of the input parameters; that is, the model.
ResidualTable Optional Specifies the name of the table where the function outputs the
residuals of the input parameters.

Note:
Specify this argument if you will input the model to the
ArimaPredictor function.

TimestampColumns Required Specifies the names of the input_table columns that specify the
sequence (time points) of the input parameters. The sequence
must have uniform intervals.
ValueColumn Required Specifies the name of the column that contains the time series data
in input_table.
Orders Required Specifies the values of the nonseasonal parameters p, d, and q for
the ARIMA model. Each value must be an INT between 0 and 20,
inclusive.
SeasonalOrders Optional Specifies the values of the seasonal parameters sp, sd, and sq for
the ARIMA model. Each value must be an INT between 0 and 20,
inclusive.
Period Optional Specifies the period of a season (m in the formula). This value
must be a positive integer value. If you specify SeasonalOrders,
then you must also specify Period.
IncludeMean Optional Specifies whether the function adds the mean value (c in the
formula) to the ARIMA model. The default value is 'false'.


Note:
If IncludeMean is 'true', then both d in Orders and sd in
SeasonalOrders must be 0.

Fixed Optional Specifies the values of the parameters. The numeric vector
fixed_params must have a value for each parameter (for the
correspondence between values and parameters, see the note that
follows this table). If you specify IncludeMean('true'), then you
must add the mean value to the end of fixed_params.
If a value in fixed_params is non-NaN, then the corresponding
parameter is fixed at that value; otherwise, the function optimizes
the value of that parameter.
InitValues Optional Specifies the initial values of the parameters. The numeric vector
init_params must have a value for each parameter (for the
correspondence between values and parameters, see the note that
follows this table). If you specify IncludeMean('true'), then you
must add the initial mean value to the end of init_params.
If a value is NaN, then the corresponding parameter has the initial
value 0.
MaxIterNum Optional Specifies the maximum iteration number for estimating the
parameters. This value must be a positive integer. The default
value is 100.

Note:
The values in the vectors fixed_params and init_params correspond to these parameters, in this order:
φ_1, φ_2, …, φ_p, θ_1, θ_2, …, θ_q, seasonal Φ_1, seasonal Φ_2, …, seasonal Φ_sp,
seasonal Θ_1, seasonal Θ_2, …, seasonal Θ_sq, [meanValue]

Input
The Arima function has one required input table, which must include the columns described in the
following table.
Table 25: Arima Input Table Schema

Column Name Data Type Description


timestamp_column Any The table can have more than one such column. These
columns contain the sequence (time points) of the input
parameters.


Note:
If the time points do not have uniform intervals, then run
the function Interpolator on them before running the
Arima function on the input table. Otherwise, the
intervals of the predictions of the ArimaPredictor
function might not be as expected.

value_column SMALLINT, INT, BIGINT, NUMERIC, or DOUBLE PRECISION Contains the time series data in input_table.

Output
The Arima function has two output tables, the model table and (optionally) the residual table.
The model table contains the coefficients of the model. The function outputs the coefficients to both the
model table and the console.
The following table shows the schema of the model table.
Table 26: Arima Model Table Schema

Column Name Data Type Description
coef VARCHAR Contains the coefficient names in the left column of the following table.
value VARCHAR Contains the coefficient values in the right column of the following table.

Table 27: Arima Model Coefficients

Coefficient Name Value
coef Vector of the coefficients p, d, q, sp, sd, and sq.
ar_params Vector of the autoregression parameters that correspond to the coefficients.
ar_params_sd Vector of the standard deviations that correspond to the autoregression parameters.
ma_params Vector of the moving average parameters that correspond to the coefficients.
ma_params_sd Vector of the standard deviations that correspond to the moving average parameters.
seasonal_ar_params Vector of the seasonal autoregressive parameters that correspond to the coefficients.

seasonal_ar_params_sd Vector of the standard deviations that correspond to the seasonal autoregressive parameters.
seasonal_ma_params Vector of the seasonal moving average parameters that correspond to the coefficients.
seasonal_ma_params_sd Vector of the standard deviations that correspond to the seasonal moving average parameters.
mean_param Mean value. (This row appears if you specify IncludeMean('true').)
mean_param_sd Standard deviation for the mean value. (This row appears if you specify IncludeMean('true').)
period Period of a season. (For the nonseasonal model, the period is 0.)
sigma2 Variance.
loglikelihood Partial log-likelihood.
iterations Number of iterations that the function executed.
converged 'true' if the training converged, otherwise 'false'.

The residual table contains the value and residual for each time point.
Table 28: Arima Residual Table Schema

Column Name Data Type Description
timestamp_column Any The table can have more than one such column. These columns contain the sequence (time points) of the input parameters.
value Same type as value_column in input_table Contains the values of the time points.
residual DOUBLE PRECISION Contains the residuals of the time points.

Example
This example uses monthly milk consumption in the US between 1962 and 1974.

Input
Table 29: Arima Example Input Table milk_timeseries

id period milkpound
1 1962-01 578.3
2 1962-02 609.8
3 1962-03 628.4

4 1962-04 665.6
5 1962-05 713.8
6 1962-06 707.2
7 1962-07 628.4
8 1962-08 588.1
9 1962-09 576.3
10 1962-10 566.5
... ... ...

SQL-MapReduce Call

SELECT * FROM Arima (
ON (SELECT 1) PARTITION BY 1
InputTable ('milk_timeseries')
ModelTable ('arimamodel')
ResidualTable ('arimaresidual')
TimestampColumns ('period')
ValueColumn ('milkpound')
Orders ('3,0,0')
IncludeMean ('true')
);

Output
Table 30: Arima Example Model Table: arimamodel

coef value
coef 3, 0, 0, 0, 0, 0
ar_params 1.831645480043797, -1.179667384141421, 0.3477840265302269
ar_params_sd 0.07527420134556974, 0.13530358092847117, 0.07512437361242541
ma_params
ma_params_sd
seasonal_ar_params
seasonal_ar_params_sd
seasonal_ma_params
seasonal_ma_params_sd
mean_param 1.0015614905052501
mean_param_sd NaN
period 0

sigma2 629.5936409862431
loglikelihood -724.0702297433396
iterations 18
converged true

This query returns the output shown in the following table:

SELECT * FROM arimaresidual ORDER BY period;

Table 31: Arima Example Residual Table: arimaresidual

arimapartionid ts value residual


1 1962-01 578.3 0
1 1962-02 609.8 0
1 1962-03 628.4 0
1 1962-04 665.6 32.8313311116795
1 1962-05 713.8 23.880880606934
1 1962-06 707.2 -33.5896560166473
1 1962-07 628.4 -56.3783948486636
1 1962-08 588.1 23.1062275326814
1 1962-09 576.3 -5.54076736453666
1 1962-10 566.5 -13.8626369886166
... ... ... ...

ArimaPredictor

Summary
The ArimaPredictor function takes as input the ARIMA model produced by the function Arima and
predicts a specified number of future values (time point forecasts) for the modeled sequence.

Usage

ArimaPredictor Syntax
Version 1.1

SELECT * FROM ArimaPredictor (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
ModelTable ('input_table')
ResidualTable ('residual_table')
TimestampColumns
({ 'timestamp_column' | 'timestamp_column_range' }[,...])
[ ValueColumn ('value')]
[ ResidualColumn ('residual_column')]
StepAhead ('steps')
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
ModelTable Required Specifies the name of the table that contains the model. This table
is the model table that is output by the Arima function.
ResidualTable Required Specifies the name of the table that contains the original input
parameters and their residuals. This table is the residual table that
is output by the Arima function.
TimestampColumns Required Specifies the names of the residual_table columns that specify the
sequence (time points) of the original input parameters. The
sequence must have uniform intervals.
ValueColumn Optional Specifies the name of the column that contains the time series data
in residual_table.
ResidualColumn Optional Specifies the name of the column in residual_table that contains
the residuals.
StepAhead Required Specifies the number of steps to forecast after the end of the time
series. This value must be a positive integer.

Teradata Aster Analytics Foundation User Guide 117


Chapter 3: Time Series, Path, and Attribution Analysis
ArimaPredictor
Input
The ArimaPredictor function has two required input tables, the model table and the residual table. These
tables are output by the Arima function. For the schemas of the model and residual tables, see the following
tables in the Output section of the function Arima: Arima Model Table Schema and Arima Residual Table
Schema.

Output
Table 32: ArimaPredictor Output Table Schema

Column Name Data Type Description
stepahead INTEGER Number of the step (the future time point in the series) that was forecast after the end of the input time series.
value DOUBLE PRECISION Value of the prediction.

Example

Input
Use the following tables from the Output section of the Arima function Example section:
• Arima Example Model Table: arimamodel
• Arima Example Residual Table: arimaresidual

SQL-MapReduce Call

SELECT * FROM ArimaPredictor (
ON (SELECT 1) PARTITION BY 1
ModelTable ('arimamodel')
ResidualTable ('arimaresidual')
TimestampColumns ('period')
ValueColumn ('value')
ResidualColumn ('residual')
StepAhead ('15')
);

Output
Table 33: ArimaPredictor Example Output

stepahead predict
1 814.094735390403
2 822.289686640269
3 823.383629105417

4 821.247962247746
5 818.895762597647
6 817.487198991092
7 816.939272216913
8 816.77924369722
9 816.642623604132
10 816.39060427359
11 816.034505110816
12 815.632042387207
13 815.227303628928
14 814.836892259429
15 814.459284044822

Attribution

Summary
The Attribution function is used in web page analysis, where it lets companies assign weights to pages before
certain events, such as buying a product.
The function calculates attributions with a choice of distribution models and has two versions, multiple-input and single-input. The multiple-input version gets many parameters from input tables; the single-input version gets all parameters from arguments. The recommended version depends on the number of parameters.
With a large number of parameters, the multiple-input version is recommended. You must create the tables of parameters, but whenever you call the function, you can use the tables instead of specifying each parameter in an argument.
If the number of parameters is small enough that you prefer to specify them in arguments rather than create tables for them, you can use the single-input version.
• Attribution (Multiple-Input Version)
• Attribution (Single-Input Version)

Background
Before buying a product online, a customer is usually exposed to typical events or interactions (such as
clicks, page visits, and page impressions) that are associated with different channels (such as email, social
network connections, paid search advertising, organic search, direct buy, and referral). The sequence of

interactions with digital marketing channels during a specified period that lead to buying the product (the
conversion event) is called the conversion path.
The following figure shows a conversion path in which the user is exposed to six different marketing
channels before making a purchase.
Attribution is typically used to identify the relative contributions of the different channels to the conversion
event.
Figure 5: Conversion Path

Attribution (Multiple-Input Version)

Summary
The multiple-input version of the Attribution function takes data and parameters from multiple tables and
outputs attributions.


Note:
A query that runs longer than 3 seconds before displaying output indicates that some of the arguments
supplied to the function are incorrect.

Usage

Attribution Syntax (Multiple Inputs)


Version 2.3

SELECT * FROM Attribution (
ON { input_table | view | (query) }
PARTITION BY user_id
ORDER BY timestamp_column
[ ON { input_table_n | view_n | (query_n) }
PARTITION BY user_id
ORDER BY timestamp_column [,...] ]
ON conversion_event_table AS conversion DIMENSION
[ ON excluding_event_table AS excluding DIMENSION ]
[ ON optional_event_table AS optional DIMENSION ]
ON model1_table AS model1 DIMENSION
[ ON model2_table AS model2 DIMENSION ]
EventColumn ('event_column')
TimestampColumn ('timestamp_column')
WindowSize ({ 'rows:K' | 'seconds:K' | 'rows:K&seconds:K2' })
) ORDER BY user_id, time_stamp;

Arguments
Argument Category Description
EventColumn Required Specifies the name of the input column that contains the
clickstream events.
TimestampColumn Required Specifies the name of the input column that contains the
timestamps of the clickstream events.
WindowSize Required Specifies how to determine the maximum window size for the attribution calculation:
• 'rows:K' assigns attributions to at most K events before the conversion event, excluding events of types specified in excluding_event_table.
• 'seconds:K' assigns attributions only to rows not more than K seconds before the conversion event.
• 'rows:K&seconds:K2' applies both constraints and complies with the stricter one.

Input
The required input tables are:
• input_table, which contains the clickstream data to use for computing attributions
• conversion_event_table (alias conversion), which contains conversion events
• model1_table (alias model1), which defines the type and distributions of the first model
The optional input tables are:
• excluding_event_table (alias excluding), which contains events to exclude from attribution
• optional_event_table (alias optional), which contains optional events
• model2_table (alias model2), which defines the type and distributions of the second model
• input_table_1, input_table_2, and so on, which contain additional clickstream data
The optional input tables have the same schema as input_table. Specifying these tables lets you co-group
attributes from all specified input tables (for example, ad_click, impressions, and conversions).
Table 34: Attribution Input Table Schema

Column Name Data Type Description
userid_column INTEGER or VARCHAR User identifier.
event_column Any Event from clickstream.
timestamp_column INTEGER, SMALLINT, BIGINT, TIMESTAMP, or TIME Event timestamp.

Table 35: Attribution Conversion Event Table Schema

Column Name Data Type Description
conversion_events VARCHAR Conversion event value (string or integer).

Table 36: Attribution Excluding Event Table Schema

Column Name Data Type Description
excluding_events VARCHAR Excluded event (string or integer). Cannot be a conversion event.

Table 37: Attribution Optional Event Table Schema

Column Name Data Type Description
optional_events VARCHAR Optional event (string or integer). Cannot be a conversion or excluded event. The function attributes a conversion event to an optional event only if it cannot attribute it to a regular event.

Tables model1 and model2 have the same schema:

Table 38: Attribution Model Table Schema

Column Name Data Type Description
id INTEGER Row identifier. Rows are numbered 0, 1, 2, and so on.
model VARCHAR Row 0: Model type. Rows 1 through n: Distribution model definitions. For model types other than SIMPLE, n is the number of rows or events included in the model. For SIMPLE models, the model table has a single row that specifies the model type and parameters. For model type and specification definitions, refer to the following two tables.

Table 39: Attribution Model Types and Specification Definitions

Row 0: Model Type Row 1, ..., n: Distribution Model Specification Additional Information
SIMPLE MODEL:PARAMETERS Distribution model for all events. For MODEL
and PARAMETER definitions, refer to the
following table.
EVENT_REGULAR EVENT:WEIGHT:MODEL:PARAMETERS Distribution model for a regular event.
EVENT cannot be a conversion, excluded, or
optional event.
For MODEL and PARAMETER definitions,
refer to the following table.
The sum of the WEIGHT values must be 1.0.
For example, suppose that the model table has
these specifications:

email:0.19:LAST_CLICK:NA
impression:0.81:UNIFORM:NA
Within the WindowSize of a conversion event,
19% of the conversion event is attributed to the
last email event and 81% is attributed uniformly
to all impression events.
EVENT_OPTIONAL EVENT:WEIGHT:MODEL:PARAMETERS Distribution model for an optional event.
EVENT must be in the optional event table.
For MODEL and PARAMETER definitions,
refer to the following table.
The sum of the WEIGHT values must be 1.0.
SEGMENT_ROWS Ki:WEIGHT:MODEL:PARAMETERS Distribution model by row. The sum of the Ki
values must be the value K specified by 'rows:K'
in the WindowSize argument.
The function considers the rows from most to
least recent. For example, suppose that the
function call has these arguments:
WindowSize ('rows:10')
Model1 ('SEGMENT_ROWS',
'3:0.5:UNIFORM:NA',
'4:0.3:LAST_CLICK:NA',
'3:0.2:FIRST_CLICK:NA')
Attribution for a conversion event is divided
among the attributable events in the 10 rows
immediately preceding the conversion event. If
the conversion event is in row 11, then the first
model specification applies to rows 10, 9, and 8;
the second applies to rows 7, 6, 5, and 4; and the
third applies to rows 3, 2, and 1.
Half the attribution (5/10) is uniformly divided
among rows 10, 9, and 8; 3/10 to the last click in
rows 7, 6, 5, and 4 (that is, in row 7), and 2/10 to
the first click in rows 3, 2, and 1 (that is, in row
1).
SEGMENT_SECONDS Ki:WEIGHT:MODEL:PARAMETERS Distribution model by time in seconds. The sum
of the Ki values must be the value K specified by
'seconds:K' in the WindowSize argument.
The function considers the rows from most to
least recent. For example, suppose that the
function call has these arguments:
WindowSize ('seconds:20')
Model1 ('SEGMENT_SECONDS',
'6:0.5:UNIFORM:NA',
'8:0.3:LAST_CLICK:NA',
'6:0.2:FIRST_CLICK:NA')
Attribution for a conversion event is divided
among the attributable events in the 20 seconds
immediately preceding the conversion event. If
the conversion event is at second 21, then the

first model specification applies to seconds 20-15
(counting backward); the second applies to
seconds 14-7; and the third applies to seconds
6-1.
Half the attribution (5/10) is uniformly divided
among seconds 20-15; 3/10 to the last click in
seconds 14-7, and 2/10 to the first click in
seconds 6-1.

The following table describes the MODEL values and their corresponding PARAMETER values. MODEL
values are case-sensitive. Attributable events are those whose types are not specified in the excluding events
table.
Table 40: Attribution Distribution Model Specification: Models and Parameters

MODEL: 'LAST_CLICK'
Description: The conversion event is attributed entirely to the most recent attributable event.
PARAMETERS: 'NA'

MODEL: 'FIRST_CLICK'
Description: The conversion event is attributed entirely to the first attributable event.
PARAMETERS: 'NA'

MODEL: 'UNIFORM'
Description: The conversion event is attributed uniformly to the preceding attributable events.
PARAMETERS: 'NA'

MODEL: 'EXPONENTIAL'
Description: The conversion event is attributed exponentially to the preceding attributable events (the more recent the event, the higher the attribution).
PARAMETERS: 'alpha,type', where alpha is a decay factor in the range (0, 1) and type is ROW, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, or YEAR. When alpha is in the range (0, 1), the sum of the series w_i = (1 - alpha)*alpha^i is 1. The function uses the w_i as exponential weights.

MODEL: 'WEIGHTED'
Description: The conversion event is attributed to the preceding attributable events with the weights specified by PARAMETERS.
PARAMETERS: You can specify any number of weights. If there are more attributable events than weights, then the extra (least recent) events are assigned zero weight. If there are more weights than attributable events, then the function renormalizes the weights. Refer to Example 3: Dynamic Weighted Distribution Models in the function Attribution (Single-Input Version).

The allowed Model1/Model2 combinations are:

Table 41: Attribution: Allowed Model1/Model2 Combinations

• Model1 type SIMPLE: Model2 is not allowed.
• Model1 type EVENT_REGULAR: Model2 type can be EVENT_REGULAR, or EVENT_OPTIONAL (when you specify the optional events table).
• Model1 type SEGMENT_ROWS: Model2 type can be SEGMENT_ROWS, or SEGMENT_SECONDS (when you specify 'rows:K&seconds:K' in the WindowSize argument).
• Model1 type SEGMENT_SECONDS: Model2 is not allowed.

Output
Table 42: Attribution Output Table Schema

Column Name Data Type Description
user_id INTEGER or VARCHAR User identifier from input table.
event VARCHAR Clickstream event from input table.
time_stamp TIMESTAMP Event timestamp from input table.
attribution DOUBLE PRECISION Fraction of attribution for the conversion event that is attributed to this event.
time_to_conversion INTEGER Elapsed time between the attributable event and the conversion event.

Example
This example uses models to assign attribution weights to these events and channels:
Table 43: Attribution Example: Event Types and Channels

Event Type Channels


conversion SocialNetwork, PaidSearch
excluding Email
optional Direct, Referral, OrganicSearch

Input
Table 44: Multiple-Input Attribution Example Input Table attribution_sample_table1

user_id event time_stamp


1 impression 2001-09-27 23:00:01
1 impression 2001-09-27 23:00:05
1 Email 2001-09-27 23:00:15
2 impression 2001-09-27 23:00:31
2 impression 2001-09-27 23:00:51

Table 45: Multiple-Input Attribution Example Input Table attribution_sample_table2

user_id event time_stamp


1 impression 2001-09-27 23:00:19
1 SocialNetwork 2001-09-27 23:00:20
1 Direct 2001-09-27 23:00:21
1 Referral 2001-09-27 23:00:22
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29
2 impression 2001-09-27 23:00:31
2 impression 2001-09-27 23:00:33
2 impression 2001-09-27 23:00:36
2 impression 2001-09-27 23:00:38

Table 46: Multiple-Input Attribution Example Conversion Event Table conversion_event_table

conversion_events
PaidSearch
SocialNetwork

Table 47: Multiple-Input Attribution Example Excluding Event Table excluding_event_table

excluding_events
Email

Table 48: Multiple-Input Attribution Example Dimension Table optional_event_table

optional_events
Direct
OrganicSearch

Referral

The following two model tables apply the distribution models by rows and by seconds, respectively.
Table 49: Multiple-Input Attribution Example Model Table model1_table

id model
0 SEGMENT_ROWS
1 3:0.5:EXPONENTIAL:0.5,SECOND
2 4:0.3:WEIGHTED:0.4,0.3,0.2,0.1
3 3:0.2:FIRST_CLICK:NA

Table 50: Multiple-Input Attribution Example Model Table model2_table

id model
0 SEGMENT_SECONDS
1 6:0.5:UNIFORM:NA
2 8:0.3:LAST_CLICK:NA
3 6:0.2:FIRST_CLICK:NA

SQL-MapReduce Call

SELECT * FROM attribution (
ON attribution_sample_table1 AS input1
PARTITION BY user_id ORDER BY time_stamp
ON attribution_sample_table2 AS input2
PARTITION BY user_id ORDER BY time_stamp
ON conversion_event_table AS conversion DIMENSION
ON excluding_event_table AS excluding DIMENSION
ON optional_event_table AS optional DIMENSION
ON model1_table AS model1 DIMENSION
ON model2_table AS model2 DIMENSION
EventColumn ('event')
TimestampColumn ('time_stamp')
WindowSize ('rows:10&seconds:20')
) ORDER BY user_id, time_stamp;

Output
Table 51: Multiple-Input Attribution Example Output Table

user_id event time_stamp attribution time_to_conversion


1 impression 2001-09-27 23:00:01 0.285714 -19
1 impression 2001-09-27 23:00:05 0

1 impression 2001-09-27 23:00:19 0.714286 -1
1 SocialNetwork 2001-09-27 23:00:20
1 Direct 2001-09-27 23:00:21 0.5 -2
1 Referral 2001-09-27 23:00:22 0.5 -1
1 PaidSearch 2001-09-27 23:00:23

Attribution (Single-Input Version)

Summary
The single-input version of the Attribution function takes data from a single table and outputs attributions.
Parameters come from arguments, not input tables.

Usage

Attribution Syntax (Single Input)


Version 2.3

SELECT * FROM attribution (
ON { input_table | view | (query) }
PARTITION BY expression [,...]
ORDER BY order_by_columns
EventColumn ('event_column')
ConversionEvents ('conversion_event' [,...])
[ ExcludeEvents ('exclude_event') ]
[ OptionalEvents ('optional_event' [,...]) ]
TimestampColumn ('timestamp_column')
WindowSize ('rows:K | seconds:K | rows:K&seconds:K')
Model1 ('type', { 'K' | 'EVENT:WEIGHT:MODEL:PARAMETERS' } [,...])
[ Model2 ('type', { 'K' | 'EVENT:WEIGHT:MODEL:PARAMETERS' } [,...]) ]
);

Note:
In the Model1 and Model2 arguments, colons are parameter delimiters. If a parameter contains colons,
enclose it in double quotation marks. For example:

Arguments
Argument Category Description
EventColumn Required Specifies the name of the input column that contains the
clickstream events.

ConversionEvents Required Specifies the conversion events. Each conversion_event is a string
or integer.
ExcludeEvents Optional Specifies the events to exclude from the attribution calculation.
Each exclude_event is a string or integer. An exclude_event cannot
be a conversion_event.
OptionalEvents Optional Specifies the optional events. Each optional_event is a string or
integer. An optional_event cannot be a conversion_event or
exclude_event. The function attributes a conversion event to an
optional event only if it cannot attribute it to a regular event.
TimestampColumn Required Specifies the name of the input column that contains the
timestamps of the clickstream events.
WindowSize Required Specifies how to determine the maximum window size for the attribution calculation:
• 'rows:K' considers the maximum number of events to be attributed, excluding events of types specified in the ExcludeEvents argument; that is, the function assigns attributions to at most K effective events before the current impact event.
• 'seconds:K' considers the maximum time difference between the current impact event and the earliest effective event to be attributed.
• 'rows:K&seconds:K2' considers both constraints and complies with the stricter one.
Model1 Required Defines the type and specification of the first model. For example:
Model1 ('EVENT_REGULAR', 'email:0.19:LAST_CLICK:NA',
'impression:0.81:WEIGHTED:0.4,0.3,0.2,0.1')
For more information see the following tables in the Input section
of the function: Attribution (Multiple-Input Version).
• For model type and specification definitions, see the table:
Attribution Model Types and Specification Definitions
• For MODEL values and their corresponding PARAMETER
values, see the table: Attribution Distribution Model
Specification: Models and Parameters

Model2 Optional Defines the type and distributions of the second model. For
example:
Model2 ('EVENT_OPTIONAL', 'OrganicSearch:
0.5:UNIFORM:NA', 'Direct:0.3:UNIFORM:NA', 'Referral:
0.2:UNIFORM:NA')
For more information see the following tables in the Input section
of the function: Attribution (Multiple-Input Version).
• For model type and specification definitions, see the table:
Attribution Model Types and Specification Definitions

• For MODEL values and their corresponding PARAMETER values, see the table: Attribution Distribution Model Specification: Models and Parameters
• For allowed Model1/Model2 combinations, see the table: Attribution: Allowed Model1/Model2 Combinations

Input
Use the following table from the Input section of the Attribution (Multiple-Input Version) function Usage
section.
• Attribution Input Table Schema

Output
Use the following table from the Output section of the Attribution (Multiple-Input Version) function Usage
section.
• Attribution Output Table Schema

Examples
These examples use the events and channels from the following table from the Example section of the
Attribution (Multiple-Input Version) function.
• Attribution Example Event Types and Channels

Example 1: One Regular Model, Multiple Optional Models


This example specifies one distribution model for regular events and one distribution model for each type of
optional event.

Input

Table 52: Single-Input Attribution Example 1: Input Table attribution_sample_table

user_id event time_stamp


1 impression 2001-09-27 23:00:01
1 impression 2001-09-27 23:00:03
1 impression 2001-09-27 23:00:05
1 impression 2001-09-27 23:00:07
1 impression 2001-09-27 23:00:09
1 impression 2001-09-27 23:00:11
1 impression 2001-09-27 23:00:13

1 Email 2001-09-27 23:00:15
1 impression 2001-09-27 23:00:17
1 impression 2001-09-27 23:00:19
1 SocialNetwork 2001-09-27 23:00:20
1 Direct 2001-09-27 23:00:21
1 Referral 2001-09-27 23:00:22
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29
2 impression 2001-09-27 23:00:31
2 impression 2001-09-27 23:00:33
2 impression 2001-09-27 23:00:36
2 impression 2001-09-27 23:00:38
2 impression 2001-09-27 23:00:43
2 impression 2001-09-27 23:00:47
2 OrganicSearch 2001-09-27 23:00:49
2 impression 2001-09-27 23:00:51
2 impression 2001-09-27 23:00:53
2 impression 2001-09-27 23:00:55
2 SocialNetwork 2001-09-27 23:00:59

SQL-MapReduce Call

SELECT * FROM attribution (


ON attribution_sample_table
PARTITION BY user_id
ORDER BY time_stamp
EventColumn ('event')
ConversionEvents ('SocialNetwork', 'PaidSearch')
OptionalEvents ('OrganicSearch', 'Direct', 'Referral')
TimestampColumn ('time_stamp')
WindowSize ('rows:10&seconds:20')
Model1 ('EVENT_REGULAR', 'Email:0.19:LAST_CLICK:NA',
'impression:0.81:UNIFORM:NA')
Model2 ('EVENT_OPTIONAL', 'OrganicSearch:0.5:UNIFORM:NA',
'Direct:0.3:UNIFORM:NA', 'Referral:0.2:UNIFORM:NA')
) ORDER BY user_id, time_stamp;

Output

Table 53: Single-Input Attribution Example 1 Output Table

user_id event time_stamp attribution time_to_conversion


1 impression 2001-09-27 23:00:01 0.09 -19
1 impression 2001-09-27 23:00:03 0.09 -17
1 impression 2001-09-27 23:00:05 0.09 -15
1 impression 2001-09-27 23:00:07 0.09 -13
1 impression 2001-09-27 23:00:09 0.09 -11
1 impression 2001-09-27 23:00:11 0.09 -9
1 impression 2001-09-27 23:00:13 0.09 -7
1 Email 2001-09-27 23:00:15 0.19 -5
1 impression 2001-09-27 23:00:17 0.09 -3
1 impression 2001-09-27 23:00:19 0.09 -1
1 SocialNetwork 2001-09-27 23:00:20
1 Direct 2001-09-27 23:00:21 0.6 -2
1 Referral 2001-09-27 23:00:22 0.4 -1
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29 0
2 impression 2001-09-27 23:00:31 0
2 impression 2001-09-27 23:00:33 0
2 impression 2001-09-27 23:00:36 0
2 impression 2001-09-27 23:00:38 0
2 impression 2001-09-27 23:00:43 0.2 -16
2 impression 2001-09-27 23:00:47 0.2 -12
2 impression 2001-09-27 23:00:51 0.2 -8
2 impression 2001-09-27 23:00:53 0.2 -6
2 impression 2001-09-27 23:00:55 0.2 -4
2 SocialNetwork 2001-09-27 23:00:59

Example 2: Multiple Regular Models, One Optional Model


This example specifies one distribution model for each type of regular event and one distribution model for
optional events.

Input
This example uses the same input table, Single-Input Attribution Example 1: Input Table
attribution_sample_table, as was used in Example 1.

SQL-MapReduce Call

SELECT * FROM attribution (


ON attribution_sample_table
PARTITION BY user_id
ORDER BY time_stamp
EventColumn ('event')
ConversionEvents ('SocialNetwork', 'PaidSearch')
OptionalEvents ('OrganicSearch', 'Direct', 'Referral')
TimestampColumn ('time_stamp') WindowSize ('rows:10&seconds:20')
Model1 ('EVENT_REGULAR', 'Email:0.19:LAST_CLICK:NA',
'impression:0.81:UNIFORM:NA')
Model2 ('EVENT_OPTIONAL', 'ALL:1:EXPONENTIAL:0.5,ROW')
) ORDER BY user_id, time_stamp;

Output
The only difference between this output and the output of Example 1 is the attribution of the optional
events.
Table 54: Single-Input Attribution Example 2 Output Table

user_id event time_stamp attribution time_to_conversion


1 impression 2001-09-27 23:00:01 0.09 -19
1 impression 2001-09-27 23:00:03 0.09 -17
1 impression 2001-09-27 23:00:05 0.09 -15
1 impression 2001-09-27 23:00:07 0.09 -13
1 impression 2001-09-27 23:00:09 0.09 -11
1 impression 2001-09-27 23:00:11 0.09 -9
1 impression 2001-09-27 23:00:13 0.09 -7
1 Email 2001-09-27 23:00:15 0.19 -5
1 impression 2001-09-27 23:00:17 0.09 -3
1 impression 2001-09-27 23:00:19 0.09 -1
1 SocialNetwork 2001-09-27 23:00:20
1 Direct 2001-09-27 23:00:21 0.333333 -2
1 Referral 2001-09-27 23:00:22 0.666667 -1
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29 0

2 impression 2001-09-27 23:00:31 0
2 impression 2001-09-27 23:00:33 0
2 impression 2001-09-27 23:00:36 0
2 impression 2001-09-27 23:00:38 0
2 impression 2001-09-27 23:00:43 0.2 -16
2 impression 2001-09-27 23:00:47 0.2 -12
2 impression 2001-09-27 23:00:51 0.2 -8
2 impression 2001-09-27 23:00:53 0.2 -6
2 impression 2001-09-27 23:00:55 0.2 -4
2 SocialNetwork 2001-09-27 23:00:59

Example 3: Dynamic Weighted Distribution Models

Input
This example uses the same input table, Single-Input Attribution Example 1: Input Table
attribution_sample_table, as was used in Example 1.

SQL-MapReduce Call

SELECT * FROM attribution (


ON attribution_sample_table
PARTITION BY user_id
ORDER BY time_stamp
EventColumn ('event')
ConversionEvents ('SocialNetwork', 'PaidSearch')
OptionalEvents ('OrganicSearch', 'Direct', 'Referral')
TimestampColumn ('time_stamp') Window ('rows:10&seconds:20')
Model1 ('EVENT_REGULAR', 'Email:0.19:LAST_CLICK:NA',
'impression:0.81:WEIGHTED:0.4,0.3,0.2,0.1')
Model2 ('EVENT_OPTIONAL', 'ALL:1:WEIGHTED:0.4,0.3,0.2,0.1')
) ORDER BY user_id, time_stamp;

Output

Table 55: Single-Input Attribution Example 3 Output Table

user_id event time_stamp attribution time_to_conversion


1 impression 2001-09-27 23:00:01 0
1 impression 2001-09-27 23:00:03 0
1 impression 2001-09-27 23:00:05 0

1 impression 2001-09-27 23:00:07 0
1 impression 2001-09-27 23:00:09 0
1 impression 2001-09-27 23:00:11 0.081 -9
1 impression 2001-09-27 23:00:13 0.162 -7
1 Email 2001-09-27 23:00:15 0.19 -5
1 impression 2001-09-27 23:00:17 0.243 -3
1 impression 2001-09-27 23:00:19 0.324 -1
1 SocialNetwork 2001-09-27 23:00:20
1 Direct 2001-09-27 23:00:21 0.428571 -2
1 Referral 2001-09-27 23:00:22 0.571429 -1
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29 0
2 impression 2001-09-27 23:00:31 0
2 impression 2001-09-27 23:00:33 0
2 impression 2001-09-27 23:00:36 0
2 impression 2001-09-27 23:00:38 0
2 impression 2001-09-27 23:00:43 0
2 impression 2001-09-27 23:00:47 0.1 -12
2 impression 2001-09-27 23:00:51 0.2 -8
2 impression 2001-09-27 23:00:53 0.3 -6
2 impression 2001-09-27 23:00:55 0.4 -4
2 SocialNetwork 2001-09-27 23:00:59

Example 4: Window Models

Input
Single-Input Attribution Example 1 Input Table attribution_sample_table

SQL-MapReduce Call

SELECT * FROM attribution (


ON attribution_sample_table
PARTITION BY user_id
ORDER BY time_stamp
EventColumn ('event')
ConversionEvents ('SocialNetwork', 'PaidSearch')

ExcludeEvents ('Email')
OptionalEvents ('OrganicSearch', 'Direct', 'Referral')
TimestampColumn ('time_stamp')
WindowSize ('rows:10&seconds:20')
Model1 ('SEGMENT_ROWS', '3:0.5:EXPONENTIAL:0.5,ROW',
'4:0.3:WEIGHTED:0.4,0.3,0.2,0.1', '3:0.2:FIRST_CLICK:NA')
Model2 ('SEGMENT_SECONDS', '6:0.5:UNIFORM:NA', '8:0.3:LAST_CLICK:NA',
'6:0.2:FIRST_CLICK:NA')
) ORDER BY user_id, time_stamp;

Output

Table 56: Single-Input Attribution Example 4 Output Table

user_id event time_stamp attribution time_to_conversion


1 impression 2001-09-27 23:00:01 0.2 -19
1 impression 2001-09-27 23:00:03 0
1 impression 2001-09-27 23:00:05 0
1 impression 2001-09-27 23:00:07 0
1 impression 2001-09-27 23:00:09 0
1 impression 2001-09-27 23:00:11 0
1 impression 2001-09-27 23:00:13 0.3 -7
1 impression 2001-09-27 23:00:17 0.25 -3
1 impression 2001-09-27 23:00:19 0.25 -1
1 SocialNetwork 2001-09-27 23:00:20
1 Direct 2001-09-27 23:00:21 0.5 -2
1 Referral 2001-09-27 23:00:22 0.5 -1
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29 0
2 impression 2001-09-27 23:00:31 0
2 impression 2001-09-27 23:00:33 0
2 impression 2001-09-27 23:00:36 0
2 impression 2001-09-27 23:00:38 0
2 impression 2001-09-27 23:00:43 0.2 -16
2 impression 2001-09-27 23:00:47 0
2 impression 2001-09-27 23:00:51 0.3 -8
2 impression 2001-09-27 23:00:53 0.25 -6
2 impression 2001-09-27 23:00:55 0.25 -4
2 SocialNetwork 2001-09-27 23:00:59

Example 5: Single-Window Model

Input

Table 57: Single-Input Attribution Example 5: Input Table attribution_sample_table3

user_id event time_stamp


1 impression 2001-09-27 23:00:07
1 impression 2001-09-27 23:00:09
1 impression 2001-09-27 23:00:11
1 impression 2001-09-27 23:00:13
1 Email 2001-09-27 23:00:15
1 impression 2001-09-27 23:00:17
1 impression 2001-09-27 23:00:19
1 SocialNetwork 2001-09-27 23:00:21
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29
2 impression 2001-09-27 23:00:31
2 impression 2001-09-27 23:00:33
2 impression 2001-09-27 23:00:47
2 impression 2001-09-27 23:00:51
2 impression 2001-09-27 23:00:53
2 impression 2001-09-27 23:00:55
2 SocialNetwork 2001-09-27 23:00:59

SQL-MapReduce Call

SELECT * FROM attribution (


ON attribution_sample_table3
PARTITION BY user_id
ORDER BY time_stamp
EventColumn ('event')
ConversionEvents ('SocialNetwork', 'PaidSearch')
ExcludeEvents ('Email')
TimestampColumn ('time_stamp')
WindowSize ('rows:10&seconds:20')
Model1 ('SIMPLE', 'UNIFORM:NA')
) ORDER BY user_id, time_stamp;

Output

Table 58: Single-Input Attribution Example 5 Output Table

user_id event time_stamp attribution time_to_conversion


1 impression 2001-09-27 23:00:07 0.166667 -14
1 impression 2001-09-27 23:00:09 0.166667 -12
1 impression 2001-09-27 23:00:11 0.166667 -10
1 impression 2001-09-27 23:00:13 0.166667 -8
1 impression 2001-09-27 23:00:17 0.166667 -4
1 impression 2001-09-27 23:00:19 0.166667 -2
1 SocialNetwork 2001-09-27 23:00:21
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29 0
2 impression 2001-09-27 23:00:31 0
2 impression 2001-09-27 23:00:33 0
2 impression 2001-09-27 23:00:47 0.25 -12
2 impression 2001-09-27 23:00:51 0.25 -8
2 impression 2001-09-27 23:00:53 0.25 -6
2 impression 2001-09-27 23:00:55 0.25 -4
2 SocialNetwork 2001-09-27 23:00:59

Example 6: Unused Segment Windows

Input
This example uses the same input table, Single-Input Attribution Example 5: Input Table
attribution_sample_table3, as was used in Example 5.

SQL-MapReduce Call

SELECT * FROM attribution (


ON attribution_sample_table3
PARTITION BY user_id
ORDER BY time_stamp
EventColumn ('event')
ConversionEvents ('SocialNetwork', 'PaidSearch')
TimestampColumn ('time_stamp')
WindowSize ('rows:10&seconds:20')
Model1 ('SEGMENT_ROWS', '3:0.5:EXPONENTIAL:0.5,ROW',
'4:0.3:WEIGHTED:0.4,0.3,0.2,0.1', '3:0.2:FIRST_CLICK:NA')
Model2 ('SEGMENT_SECONDS', '6:0.5:UNIFORM:NA',

'8:0.3:LAST_CLICK:NA', '6:0.2:FIRST_CLICK:NA')
) ORDER BY user_id, time_stamp;

Output

Table 59: Single-Input Attribution Example 6 Output Table

user_id event time_stamp attribution time_to_conversion


1 impression 2001-09-27 23:00:07 0
1 impression 2001-09-27 23:00:09 0
1 impression 2001-09-27 23:00:11 0
1 impression 2001-09-27 23:00:13 0.375 -8
1 Email 2001-09-27 23:00:15 0.208333 -6
1 impression 2001-09-27 23:00:17 0.208333 -4
1 impression 2001-09-27 23:00:19 0.208333 -2
1 SocialNetwork 2001-09-27 23:00:21
1 PaidSearch 2001-09-27 23:00:23
2 impression 2001-09-27 23:00:29 0
2 impression 2001-09-27 23:00:31 0
2 impression 2001-09-27 23:00:33 0
2 impression 2001-09-27 23:00:47 0
2 impression 2001-09-27 23:00:51 0.375 -8
2 impression 2001-09-27 23:00:53 0.3125 -6
2 impression 2001-09-27 23:00:55 0.3125 -4
2 SocialNetwork 2001-09-27 23:00:59

Burst

Summary
The Burst function bursts (splits) a time interval into a series of shorter "burst" intervals that can be analyzed
independently.
Each row of the input table contains the start and end times of a time interval. For each input row, the
function writes a series of rows to the output table. Each output row contains the start and end time of a
burst interval.
The burst intervals can have either the same length (specified by the TimeInterval argument), the same
number of data points (specified by the NumPoints argument), or specific start and end times (specified by
time_table).
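To make the TimeInterval option concrete, here is a minimal sketch, in Python, of the fixed-length split that
the function performs for each input row. It mirrors Example 1 below, which bursts a 10-day interval into
2-day (172800-second) intervals; the snippet is only an illustration of the splitting logic, not the function's
implementation.

from datetime import date, timedelta

def burst(start, end, interval_days):
    # Split [start, end) into consecutive burst intervals of equal
    # length; the last interval is truncated at 'end' if necessary.
    t = start
    while t < end:
        nxt = min(t + timedelta(days=interval_days), end)
        yield t, nxt, int((nxt - t).total_seconds())  # burst_duration
        t = nxt

for burst_start, burst_end, duration in burst(
        date(1967, 6, 30), date(1967, 7, 10), 2):
    print(burst_start, burst_end, duration)  # 5 rows, 172800 seconds each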

Usage

Burst Syntax
Version 1.0

SELECT * FROM Burst (


ON { table | view | (query) } AS input_table
PARTITION BY id
[ ORDER BY ordering_column ]
[ ON { table | view | (query) } AS time_table
PARTITION BY id
[ ORDER BY ordering_column ] ]
TimeColumn ('start_time_column', 'end_time_column')
[ TimeInterval (numeric_value) ]
ValueColumns ({ 'value_column' | 'value_column_range' }[,...])
[ TimeDataType (data_type) ]
[ ValueDataType (value_type [,...]) ]
[ StartTime (start_time) ]
[ EndTime (end_time) ]
[ NumPoints (data_points) ]
[ ValuesBeforeFirst ('before_first_value' [,...]) ]
[ ValuesAfterLast ('after_last_value' [,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TimeColumn Required Specifies the names of the input_table columns that contain the
start and end times of the time interval to be burst.
TimeInterval Optional Specifies the length of each burst time interval. This value must
be either INTEGER or DOUBLE PRECISION.

Note:
Specify exactly one of time_table, TimeInterval, or
NumPoints.

ValueColumns Required Specifies the names of input_table columns to copy to the
output table.
TimeDataType Optional Specifies the data type of the output columns that correspond
to the input table columns that TimeColumn specifies
(start_time_column and end_time_column).
If you omit this argument, then the function infers the data
type of start_time_column and end_time_column from the
input table and uses the inferred data type for the
corresponding output table columns.

If you specify this argument, then the function can transform
the input data to the specified output data type only if both the
input column data type and the specified output column data
type are in this list:
INTEGER
BIGINT
SMALLINT
DOUBLE PRECISION
DECIMAL(n,n)
DECIMAL
NUMERIC
NUMERIC(n,n)
ValueDataType Optional Specifies the data types of the output columns that correspond
to the input table columns that ValueColumns specifies.
If you omit this argument, then the function infers the data
type of each value_column from the input table and uses the
inferred data type for the corresponding output table column.
If you specify ValueDataType, then it must be the same size as
ValueColumns. That is, if ValueColumns specifies n columns,
then ValueDataType must specify n data types. For i in [1, n],
value_column_i has value_type_i. However, value_type_i can be
empty; for example:
ValueColumns (c1, c2, c3)
ValueDataType (INTEGER, ,VARCHAR)
If you specify this argument, then the function can transform
the input data to the specified output data type only if both the
input column data type and the specified output column data
type are in this list:
INTEGER
BIGINT
SMALLINT
DOUBLE PRECISION
DECIMAL(n,n)
DECIMAL
NUMERIC
NUMERIC(n,n)
StartTime Optional Specifies the start time for the time interval to be burst. The
default value is the value in start_time_column.
EndTime Optional Specifies the end time for the time interval to be burst. The
default value is the value in end_time_column.
NumPoints Optional Specifies the number of data points in each burst time interval.
This value must be an INTEGER.

Note:
Specify exactly one of time_table, TimeInterval, or
NumPoints.

ValuesBeforeFirst Optional Specifies the values to use if start_time is before
start_time_column. Each of these values must have the same
data type as its corresponding value_column. Values of data
type VARCHAR are case-insensitive.
If you specify ValuesBeforeFirst, then it must be the same size
as ValueColumns. That is, if ValueColumns specifies n
columns, then ValuesBeforeFirst must specify n values. For i in
[1, n], value_column_i has the value before_first_value_i.
However, before_first_value_i can be empty; for example:
ValueColumns (c1, c2, c3)
ValuesBeforeFirst (1, ,'abc')
If before_first_value_i is empty, then value_column_i has the
value NULL. If you do not specify ValuesBeforeFirst, then
value_column_i has the value NULL for i in [1, n].
ValuesAfterLast Optional Specifies the values to use if end_time is after end_time_column.
Each of these values must have the same data type as its
corresponding value_column. Values of data type VARCHAR
are case-insensitive.
If you specify ValuesAfterLast, then it must be the same size as
ValueColumns. That is, if ValueColumns specifies n columns,
then ValuesAfterLast must specify n values. For i in [1, n],
value_column_i has the value after_last_value_i. However,
after_last_value_i can be empty; for example:
ValueColumns (c1, c2, c3)
ValuesAfterLast (1, ,'abc')
If after_last_value_i is empty, then value_column_i has the
value NULL. If you do not specify ValuesAfterLast, then
value_column_i has the value NULL for i in [1, n].
Accumulate Optional Specifies the names of input_table columns (other than those
specified by TimeColumn and ValueColumns) to copy to the
output table. By default, the function copies to the output table
only the columns specified by TimeColumn and
ValueColumns.

Input
The Burst function has two input tables: input_table (required) and time_table (optional). If you omit
time_table, then you must specify either the TimeInterval or NumPoints argument.
Each row of input_table contains a time interval to be burst. The following table describes the input_table
columns that you can specify in function arguments.

Table 60: Burst input_table Schema

Column Data Type Description


start_time_column INTEGER, BIGINT, SMALLINT, Contains the start time of the time
DOUBLE PRECISION, DECIMAL(n,n), interval to be burst, specified by the
DECIMAL, NUMERIC, NUMERIC(n,n), TimeColumn argument.
DATE, TIME, TIME(n), TIME WITH
TIME ZONE, TIME WITH TIME
ZONE(n), TIMESTAMP,
TIMESTAMP(n), TIMESTAMP WITH
TIME ZONE, or TIMESTAMP WITH
TIME ZONE(n)
end_time_column INTEGER, BIGINT, SMALLINT, Contains the end time of the time interval
DOUBLE PRECISION, DECIMAL(n,n), to be burst, specified by the TimeColumn
DECIMAL, NUMERIC, NUMERIC(n,n), argument.
DATE, TIME, TIME(n), TIME WITH
TIME ZONE, TIME WITH TIME
ZONE(n), TIMESTAMP,
TIMESTAMP(n), TIMESTAMP WITH
TIME ZONE, or TIMESTAMP WITH
TIME ZONE(n)
value_column INTEGER, BIGINT, SMALLINT, Column to be copied to the output table,
DOUBLE PRECISION, DECIMAL(n,n), specified by the ValueColumns argument.
DECIMAL, NUMERIC, NUMERIC(n,n),
CHARACTER, CHARACTER(n),
CHARACTER VARYING,
CHARACTER VARYING(n), or
VARCHAR
accumulate_column Any Optional. Column to be copied to the
output table, specified by the Accumulate
argument.
Typically, one accumulate_column is a
row identifier, such as 'id'.

Each row of time_table contains the start and end times of a burst interval. The following table describes the
columns of time_table.
Table 61: Burst time_table Schema

Column Data Type Description


id Any Contains the identifier of the input_table
row with the time interval to be burst.
If input_table row i is to be burst into n
burst intervals, then time_table has n
rows with the id value i.
burst_start_time INTEGER, BIGINT, SMALLINT, Contains the start time of a burst interval.
DOUBLE PRECISION, DECIMAL(n,n),
DECIMAL, NUMERIC, NUMERIC(n,n),

DATE, TIME, TIME(n), TIME WITH
TIME ZONE, TIME WITH TIME
ZONE(n), TIMESTAMP,
TIMESTAMP(n), TIMESTAMP WITH
TIME ZONE, or TIMESTAMP WITH
TIME ZONE(n)
burst_end_time INTEGER, BIGINT, SMALLINT, Contains the end time of a burst interval.
DOUBLE PRECISION, DECIMAL(n,n),
DECIMAL, NUMERIC, NUMERIC(n,n),
DATE, TIME, TIME(n), TIME WITH
TIME ZONE, TIME WITH TIME
ZONE(n), TIMESTAMP,
TIMESTAMP(n), TIMESTAMP WITH
TIME ZONE, or TIMESTAMP WITH
TIME ZONE(n)

Output
Each row of the output table contains a burst interval. The following table describes the columns of the
output table. Columns copied from input_table appear in the same order in the output table.
Table 62: Burst Output Table Schema

Column Data Type Description


start_time_column INTEGER, BIGINT, SMALLINT, DOUBLE Copied from input_table.
PRECISION, DECIMAL(n,n), DECIMAL,
NUMERIC, NUMERIC(n,n), DATE, TIME,
TIME(n), TIME WITH TIME ZONE, TIME WITH
TIME ZONE(n), TIMESTAMP, TIMESTAMP(n),
TIMESTAMP WITH TIME ZONE, or
TIMESTAMP WITH TIME ZONE(n)
end_time_column INTEGER, BIGINT, SMALLINT, DOUBLE Copied from input_table.
PRECISION, DECIMAL(n,n), DECIMAL,
NUMERIC, NUMERIC(n,n), DATE, TIME,
TIME(n), TIME WITH TIME ZONE, TIME WITH
TIME ZONE(n), TIMESTAMP, TIMESTAMP(n),
TIMESTAMP WITH TIME ZONE, or
TIMESTAMP WITH TIME ZONE(n)
value_column INTEGER, BIGINT, SMALLINT, DOUBLE Copied from input_table.
PRECISION, DECIMAL(n,n), DECIMAL,
NUMERIC, NUMERIC(n,n), CHARACTER,
CHARACTER(n), CHARACTER VARYING,
CHARACTER VARYING(n), or VARCHAR
burst_start INTEGER, BIGINT, SMALLINT, DOUBLE Contains the start time of
PRECISION, DECIMAL(n,n), DECIMAL, the burst interval.
NUMERIC, NUMERIC(n,n), DATE, TIME,

TIME(n), TIME WITH TIME ZONE, TIME WITH
TIME ZONE(n), TIMESTAMP, TIMESTAMP(n),
TIMESTAMP WITH TIME ZONE, or
TIMESTAMP WITH TIME ZONE(n)
burst_end INTEGER, BIGINT, SMALLINT, DOUBLE Contains the end time of
PRECISION, DECIMAL(n,n), DECIMAL, the burst interval.
NUMERIC, NUMERIC(n,n), DATE, TIME,
TIME(n), TIME WITH TIME ZONE, TIME WITH
TIME ZONE(n), TIMESTAMP, TIMESTAMP(n),
TIMESTAMP WITH TIME ZONE, or
TIMESTAMP WITH TIME ZONE(n)
burst_duration DOUBLE PRECISION Contains the duration of
the burst interval:
burst_end - burst_start
accumulate_column Any Copied from input_table
only if specified by the
Accumulate argument.
Typically, one
accumulate_column is a
row identifier, such as
'id'.

Examples
• Example 1: Time_Interval Argument
• Example 2: Time_Table Argument

Example 1: Time_Interval Argument


This example uses the TimeInterval argument (instead of the NumPoints argument or a time_table). The
input data set is derived from West German fixed investment, disposable income, and consumption
expenditures, in billions of DM. The SQL-MapReduce call bursts the data into intervals with a duration of
2 days (172800 seconds).

Input

Table 63: Burst Example 1 Input Table: finance_data

id start_time_column end_time_column expenditure income investment


1 1967-06-30 2007-07-10 415 451 180
2 1967-06-30 2007-07-10 421 465 179
3 1967-06-30 2007-07-10 434 485 185
4 1967-06-30 2007-07-10 448 493 192

5 1967-06-30 2007-07-10 459 509 211

SQL-MapReduce Call

SELECT * FROM Burst (


ON finance_data AS input_table PARTITION BY id ORDER BY id
TimeColumn ('start_time_column', 'end_time_column')
TimeInterval (172800)
ValueColumns ('expenditure', 'income', 'investment')
StartTime ('06/30/1967')
EndTime ('07/10/1967')
Accumulate ('id')
) ORDER BY id;

Output

Table 64: Burst Example 1 Output Table (Columns 1-6)

id start_time_column end_time_column expenditure income investment


1 1967-06-30 2007-07-10 415 451 180
1 1967-06-30 2007-07-10 415 451 180
1 1967-06-30 2007-07-10 415 451 180
1 1967-06-30 2007-07-10 415 451 180
1 1967-06-30 2007-07-10 415 451 180
2 1967-06-30 2007-07-10 421 465 179
2 1967-06-30 2007-07-10 421 465 179
2 1967-06-30 2007-07-10 421 465 179
2 1967-06-30 2007-07-10 421 465 179
2 1967-06-30 2007-07-10 421 465 179
3 1967-06-30 2007-07-10 434 485 185
3 1967-06-30 2007-07-10 434 485 185
3 1967-06-30 2007-07-10 434 485 185
3 1967-06-30 2007-07-10 434 485 185
3 1967-06-30 2007-07-10 434 485 185
4 1967-06-30 2007-07-10 448 493 192
4 1967-06-30 2007-07-10 448 493 192
4 1967-06-30 2007-07-10 448 493 192
4 1967-06-30 2007-07-10 448 493 192

4 1967-06-30 2007-07-10 448 493 192
5 1967-06-30 2007-07-10 459 509 211
5 1967-06-30 2007-07-10 459 509 211
5 1967-06-30 2007-07-10 459 509 211
5 1967-06-30 2007-07-10 459 509 211
5 1967-06-30 2007-07-10 459 509 211

Table 65: Burst Example 1 Output Table (Columns 7-9)

burst_start burst_end burst_duration


1967-06-30 1967-07-02 172800
1967-07-02 1967-07-04 172800
1967-07-04 1967-07-06 172800
1967-07-06 1967-07-08 172800
1967-07-08 1967-07-10 172800
1967-06-30 1967-07-02 172800
1967-07-02 1967-07-04 172800
1967-07-04 1967-07-06 172800
1967-07-06 1967-07-08 172800
1967-07-08 1967-07-10 172800
1967-06-30 1967-07-02 172800
1967-07-02 1967-07-04 172800
1967-07-04 1967-07-06 172800
1967-07-06 1967-07-08 172800
1967-07-08 1967-07-10 172800
1967-06-30 1967-07-02 172800
1967-07-02 1967-07-04 172800
1967-07-04 1967-07-06 172800
1967-07-06 1967-07-08 172800
1967-07-08 1967-07-10 172800
1967-06-30 1967-07-02 172800
1967-07-02 1967-07-04 172800
1967-07-04 1967-07-06 172800
1967-07-06 1967-07-08 172800

1967-07-08 1967-07-10 172800

Example 2: Time_Table Argument


This example uses a time_table (instead of the TimeInterval or NumPoints arguments) and specifies the
arguments ValuesBeforeFirst and ValuesAfterLast. The time_table option allows the use of different time
intervals and partitions the data accordingly.

Input
Use the time_table below and the following table from the Input section of Example 1:
• Burst Example 1 Input Table: finance_data
Table 66: Burst Example 2: time_table1

id burst_start burst_end
1 1967-06-30 1967-07-05
1 1967-07-05 1967-07-10
2 1967-06-30 1967-07-05
2 1967-07-05 1967-07-10
3 1967-06-30 1967-07-10
4 1967-06-30 1967-07-04
4 1967-07-04 1967-07-07
4 1967-07-07 1967-07-10
5 1967-06-30 1967-07-02
5 1967-07-02 1967-07-04
5 1967-07-04 1967-07-06
5 1967-07-06 1967-07-08
5 1967-07-08 1967-07-10

SQL-MapReduce Call

SELECT * FROM Burst (


ON finance_data AS input_table PARTITION BY id ORDER BY id
ON time_table1 AS time_table PARTITION BY id ORDER BY burst_start
TimeColumn ('start_time_column', 'end_time_column')
ValueColumns ('expenditure', 'income', 'investment')
StartTime ('06/30/1967')
EndTime ('07/10/1967')
ValuesBeforeFirst ('NULL','NULL','NULL')
ValuesAfterLast ('NULL','NULL','NULL')

Accumulate ('id')
) ORDER BY id;

Output

Table 67: Burst Example 2 Output Table (Columns 1-6)

id start_time_column end_time_column expenditure income investment


1 1967-06-30 2007-07-10 415 451 180
1 1967-06-30 2007-07-10 415 451 180
2 1967-06-30 2007-07-10 421 465 179
2 1967-06-30 2007-07-10 421 465 179
3 1967-06-30 2007-07-10 434 485 185
4 1967-06-30 2007-07-10 448 493 192
4 1967-06-30 2007-07-10 448 493 192
4 1967-06-30 2007-07-10 448 493 192
5 1967-06-30 2007-07-10 459 509 211
5 1967-06-30 2007-07-10 459 509 211
5 1967-06-30 2007-07-10 459 509 211
5 1967-06-30 2007-07-10 459 509 211
5 1967-06-30 2007-07-10 459 509 211

Table 68: Burst Example 2 Output Table (Columns 7-9)

burst_start burst_end burst_duration


1967-06-30 1967-07-05 432000
1967-07-05 1967-07-10 432000
1967-06-30 1967-07-05 432000
1967-07-05 1967-07-10 432000
1967-06-30 1967-07-10 864000
1967-06-30 1967-07-04 345600
1967-07-04 1967-07-07 259200
1967-07-07 1967-07-10 259200
1967-06-30 1967-07-02 172800
1967-07-02 1967-07-04 172800
1967-07-04 1967-07-06 172800
1967-07-06 1967-07-08 172800

1967-07-08 1967-07-10 172800

Change-Point Detection Functions

Summary
Change-point detection functions detect the change points in a stochastic process or time series. These
functions take sorted time series data as input and output change points or data segments.
The change-point detection functions are:
• ChangePointDetection, for when the input data can be stored in memory
• RtChangePointDetection, for when the input data cannot be stored in memory or the application needs
real-time response

Background
In statistical analysis, change detection or change-point detection tries to identify the abrupt changes of a
stochastic process or time series.
Consider the following ordered time series data sequence:
y(t), t=1, 2, ..., n
where t is a time variable.
Change-point detection tries to find a segmented model M, given by the following equation:
Y = f1(t, w1) + e1(t), (1 < t ≤ τ1)
  = f2(t, w2) + e2(t), (τ1 < t ≤ τ2)
...
  = fk(t, wk) + ek(t), (τk-1 < t ≤ τk)
  = fk+1(t, wk+1) + ek+1(t), (τk < t ≤ n)
where:
• fi(t, wi) is the function (with its vector of parameters wi) that fits segment i.
• Each τi is the change point between successive segments.
• Each ei(t) is an error term.
• n is the size of the data series and k is the number of change points.
Segmentation model selection aims to find the function fi(t, wi) that best approximates the data of each
segment. Various model selection methods have been proposed; according to the literature, the most
commonly used model selection method is the normal distribution.
Search method selection aims to find the change points from a global perspective.
If τ0 = 0 and τk+1 = n, one common method of identifying the change points is to minimize this value:

Σ (i=1 to k+1) C(y(τi-1+1 : τi)) + βf(k)

where y(a : b) denotes the data points from index a through b.
C is a cost function that measures, for each segment, the difference between fi(t, wi) and the original data.
βf(k) is a penalty to guard against overfitting. The common choice is linear in the number of change points
k; that is, βf(k) = βk. Several information criteria can be used for the evaluation, such as the Akaike
Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
For AIC, β = 2p, where p is the number of additional parameters introduced by adding a change point.
For BIC (also called SBIC), β = p·log(n).
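As a worked example: with normal-distribution segmentation, a change point adds p = 2 parameters (a
segment mean and a segment variance), so the BIC condition for accepting a change point,
ln(L1) − ln(L0) > (p/2)·ln(n), reduces to ln(L1) − ln(L0) > ln(n). For a series of n = 40 points, as in the
ChangePointDetection examples later in this chapter, that threshold is ln(40) ≈ 3.69.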
Change-point detection methods are classified into two categories, based on speed of detection:
• Real-Time Change-Point Detection, for applications that require immediate responses (such as robot
control)
• Retrospective Change-Point Detection, for applications that require longer reaction periods

Retrospective Change-Point Detection


The most widely used change-point search method for retrospective change-point detection is probably
binary segmentation, which uses this procedure:
1. Search the data for the first change point.
2. At that change point, split the data into two parts.
3. In each part, select the change point with the minimum loss.
4. Repeat this procedure until there are no new change points or until the maximum number of change
points is reached.
Binary segmentation is an approximation method, because the change point is decided with only part of the
data. However, this method is efficient and has an O(n log n) computational cost, where n is the number of
data points.

Taking normal distribution as an example, the change-point problem is to test the following null hypothesis:
H0: μ = μ1 = μ2 = … = μn and σ² = σ1² = σ2² = … = σn²
as opposed to the alternatives,
H1: μ1 = … = μk1 ≠ μk1+1 = … = μk2 ≠ … ≠ μkq+1 = … = μn
and
σ1² = … = σk1² ≠ σk1+1² = … = σk2² ≠ … ≠ σkq+1² = … = σn²
Binary segmentation performs the following tests in each iteration:
H1: μ1 = … = μk1 ≠ μk1+1 = … = μn
and
σ1² = … = σk1² ≠ σk1+1² = … = σn²
The log likelihood functions under H0 and H1 are:
log L0 = -(n/2) log(2π σ̂²) - n/2
log L1 = -(k/2) log(2π σ̂1²) - ((n-k)/2) log(2π σ̂2²) - n/2
The maximum likelihood estimates of μ and σ² are:
μ̂ = (1/n) Σt y(t) and σ̂² = (1/n) Σt (y(t) - μ̂)²
computed over all n points under H0, and computed separately over the first k points and the remaining
n-k points under H1 (giving μ̂1, σ̂1² and μ̂2, σ̂2²).
From the preceding formulas, the binary segmentation algorithm computes max LogL1 by trying different
values of k. Then, to check for a change point, the algorithm compares the difference between max LogL1
and LogL0 to the penalty value.
If the algorithm detects a change point, it adds that change point to its list of candidate change points and
then splits the data into two parts. From the candidate change points that the algorithm finds in the two
parts, it selects the one with the minimum loss.
The algorithm repeats the preceding process until it finds all change points or reaches the maximum
number of change points.
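The following Python sketch illustrates binary segmentation under the normal-distribution model described
above. It is a simplified illustration with assumed function names, not the Aster implementation; in
particular, it splits greedily rather than selecting the minimum-loss candidate across both parts.

import numpy as np

def gaussian_loglik(x):
    # Maximized log likelihood of one normal segment (MLE parameters).
    n = len(x)
    var = max(x.var(), 1e-12)  # variance floor avoids log(0)
    return -0.5 * n * (np.log(2 * np.pi * var) + 1)

def best_split(x, penalty):
    # Best single change point in x, or None if no split beats the penalty.
    n, log_l0 = len(x), gaussian_loglik(x)
    best = None
    for k in range(2, n - 1):  # keep at least two points per side
        gain = gaussian_loglik(x[:k]) + gaussian_loglik(x[k:]) - log_l0
        if gain > penalty and (best is None or gain > best[0]):
            best = (gain, k)
    return best

def binary_segmentation(x, penalty, max_change_num=10):
    # Split segments recursively until no split beats the penalty
    # or the maximum number of change points is reached.
    segments, changes = [(0, len(x))], []
    while segments and len(changes) < max_change_num:
        lo, hi = segments.pop()
        hit = best_split(x[lo:hi], penalty)
        if hit is not None:
            k = lo + hit[1]
            changes.append(k)
            segments += [(lo, k), (k, hi)]
    return sorted(changes)

# BIC-style penalty, matching the Penalty argument's default:
# binary_segmentation(np.asarray(series), penalty=np.log(len(series)))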

Real-Time Change-Point Detection


Usually, real-time change-point detection uses the sliding window algorithm.

Assume that the data follows some distribution, but has different parameters θ in different segments. In the
two following hypotheses about the parameter,
H0:θ = θ0
H1:θ = θ1
the reference part has parameter θ0 and the testing part in the sliding window has parameter θ1. The test
statistic is the log likelihood ratio over the window spanning observations j through k:
Sjk = Σ (i=j to k) ln [ pθ1(yi) / pθ0(yi) ]
where pθ(yi) is the likelihood of observation yi under parameter θ.
The decision rule for testing the sliding window, given a decision threshold h0, is:
If Sjk < h0, then H0 is chosen.
If Sjk ≥ h0, then H1 is chosen.
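A minimal Python sketch of this sliding-window test, assuming a normal model in each part, follows. The
function name and the reset-after-detection policy are assumptions made for illustration; the Aster
implementation may differ.

import numpy as np

def normal_loglik(x, mu, var):
    # Pointwise log density of x under Normal(mu, var).
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def sliding_window_detect(x, window_size=10, threshold=10.0):
    # Compare the window's own normal fit (H1) against the fit implied
    # by the reference part that precedes it (H0); declare a change
    # when the log likelihood ratio reaches the threshold.
    changes, ref_start, j = [], 0, window_size
    while j + window_size <= len(x):
        ref, win = x[ref_start:j], x[j:j + window_size]
        s = np.sum(
            normal_loglik(win, win.mean(), max(win.var(), 1e-12))
            - normal_loglik(win, ref.mean(), max(ref.var(), 1e-12)))
        if s >= threshold:
            changes.append(j)  # change detected at the window start
            ref_start = j      # restart the reference after the change
            j += window_size
        else:
            j += 1
    return changes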

ChangePointDetection

Summary
The ChangePointDetection function detects change points in a stochastic process or time series, using
retrospective change-point detection, implemented with these algorithms:
• Search algorithm: binary search
• Segmentation algorithm: normal distribution and linear regression
Use this function when the input data can be stored in memory and the application does not require a real-
time response. If the input data cannot be stored in memory, or the application requires a real-time
response, use the function RtChangePointDetection.

Usage

ChangePointDetection Syntax
Version 1.0

SELECT * FROM ChangePointDetection (


ON { table | view | query }
PARTITION BY partition_expr ORDER BY order_by_expr
ValueColumn ('value_column')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]

[ SegmentationMethod ('segmentation_method') ]
[ SearchMethod ('binary') ]
[ MaxChangeNum ('maximum_change_point_count') ]
[ Penalty ({ 'BIC' | 'AIC' | 'threshold' }) ]
[ OutputOption ({ 'CHANGEPOINT' | 'VERBOSE' | 'SEGMENT' }) ]
);

Arguments
Argument Category Description
ValueColumn Required Specifies the name of the input table column that contains the
time series data.
Accumulate Required Specifies the names of the input table columns to copy to the
output table.

Tip:
To identify change points in the output table, specify the
columns that appear in partition_exp and order_by_exp.

SegmentationMethod Optional Specifies one of these segmentation methods:


• 'normal_distribution' (default): In each segment, the
data is in a normal distribution.
• 'linear_regression': In each segment, the data is in
linear regression.

SearchMethod Optional Specifies the search method, binary segmentation. This is the
default and only possible value.
MaxChangeNum Optional Specifies the maximum number of change points to detect. The
default value is 10.
Penalty Optional Specifies the penalty function, which is used to avoid overfitting.
Possible values are:
• 'BIC' (default)
• 'AIC'
• threshold, a DOUBLE PRECISION value
For BIC, the condition for the existence of a change point is:
ln(L1)−ln(L0) > (p1-p0)*ln(n)/2
For normal distribution and linear regression, p1-p0 = 2, so the
right side reduces to ln(n).
For AIC, the condition for the existence of a change point is:
ln(L1)−ln(L0) > p1-p0
For normal distribution and linear regression, p1-p0 = 2, so the
right side reduces to 2.
For threshold, the specified value is compared to:
ln(L1)−ln(L0)

L1 and L0 are the maximum likelihood estimates under hypotheses
H1 and H0, respectively. For normal distribution, the definitions of
Log(L1) and Log(L0) are in Background.
p is the number of additional parameters introduced by adding a
change point; it is used in the information criterion BIC or AIC. p1
and p0 represent this parameter under hypotheses H1 and H0,
respectively.
OutputOption Optional Specifies the output table columns. Refer to the Output section.
The default value is 'CHANGEPOINT'.

Input
The input table must contain the columns described in the following table. The function ignores any
additional columns, except those specified by the Accumulate argument, which it copies to the output table.
Table 69: Change-Point Detection Functions Input Table Schema

Column Name Data Type Description


partition_column Any Column on which the input table is partitioned. Must
appear in partition_expr. The input table has one or more
such columns.
sort_column Any Column on which the input table is sorted. Must appear in
order_by_expr. The input table has one or more such
columns.
value_column DOUBLE Column that contains the time series data.
PRECISION

Output
The output table schema depends on the value of the OutputOption argument.
Table 70: Change-Point Detection Functions Output Table Schema for OutputOption ('CHANGEPOINT')

Column Name Data Type Description


accumulate_column Any Column copied from the input table.
cptid INTEGER Sequential change point identifier. For each partition, the
identifiers are from 1 to n, where n is the number of change
points for that partition.
The table has one row for each change point.

Table 71: Change-Point Detection Functions Output Table Schema for OutputOption ('VERBOSE')

Column Name Data Type Description


accumulate_column Any Column copied from the input table.

cptid INTEGER Sequential change point identifier. For each partition, the
identifiers are from 1 to n, where n is the number of change
points for that partition.
The table has one row for each change point.
difference DOUBLE Difference H1-H0.
PRECISION

Table 72: Change-Point Detection Functions Output Table Schema for OutputOption ('SEGMENT')

Column Name Data Type Description


accumulate_column# Any Starting point of the segment. Output table has one
accumulate_column# column for each accumulate_column.
accumulate_column Any Column copied from the input table.
segid INTEGER Segment identifier.
The table has one row for each segment. For k change
points, there are k+1 segments.

Examples
• Example 1: Two Series, Default Options
• Example 2: One Series, Default Options
• Example 3: One Series, VERBOSE Output
• Example 4: One Series, Penalty 10
• Example 5: One Series, SEGMENT Output, Penalty 10
• Example 6: One Series, Penalty 20, Linear Regression

Example 1: Two Series, Default Options


This example includes two time series of finance data.

Input

Table 73: ChangePointDetection Example 1 Input Table finance_data2

sid id start_time_column end_time_column expenditure income investment


1 1 1967-06-30 2007-03-31 415 451 180
1 2 1967-06-30 2007-03-31 421 465 179
1 3 1967-06-30 2007-03-31 434 485 185
1 4 1967-06-30 2007-03-31 448 493 192
1 5 1967-06-30 2007-03-31 459 509 211
1 6 1967-06-30 2007-03-31 458 520 202
1 7 1967-06-30 2007-03-31 479 521 207

1 8 1967-06-30 2007-03-31 487 540 214
1 9 1967-06-30 2007-03-31 497 548 231
1 10 1967-06-30 2007-03-31 510 558 229
1 11 1967-06-30 2007-03-31 516 574 234
1 12 1967-06-30 2007-03-31 525 583 237
1 13 1967-06-30 2007-03-31 529 591 206
1 14 1967-06-30 2007-03-31 538 599 250
1 15 1967-06-30 2007-03-31 546 610 259
... ... ... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM ChangePointDetection (


ON finance_data2 PARTITION BY sid ORDER BY id
ValueColumn ('expenditure')
Accumulate ('sid', 'id','expenditure')
) ORDER BY sid, id;

Output

Table 74: ChangePointDetection Example 1 Output Table

sid id expenditure cptid


1 3 434 1
1 5 459 2
1 7 479 3
1 10 510 4
1 12 525 5
1 14 538 6
1 17 574 7
1 19 586 8
1 22 639 9
1 25 679 10
2 34 746 1
2 37 1774 2
2 42 1958 3
2 44 1994 4

2 47 2102 5
2 49 798 6
2 52 858 7
2 55 934 8
2 58 1013 9

Example 2: One Series, Default Options

Input
The input for ChangePointDetection examples 2 through 6 and all RtChangePointDetection examples is
the table cpt, below. The input signal is like a clock signal whose values can represent a cyclic recurrence of
an event (for example, electric power consumption at certain periods, or pulse rate).

Table 75: ChangePointDetection Examples 2-6 Input Table cpt

sid id val
1 1 10.8308
1 2 10.07182
1 3 10.30902
1 4 10.01128
1 5 10.83433
1 6 10.0189
1 7 10.8702

1 8 10.70688
1 9 10.72465
1 10 10.76334
1 11 100.9431
1 12 100.245
1 13 100.8667
1 14 100.0768
1 15 100.7646
1 16 100.0001
1 17 100.3316
1 18 100.8994
1 19 100.5965
1 20 100.1943
1 21 10.24228
1 22 10.78137
1 23 10.90752
1 24 10.02013
1 25 10.46117
1 26 10.08672
1 27 10.33539
1 28 10.0157
1 29 10.40867
1 30 10.17071
1 31 100.3789
1 32 100.2254
1 33 100.1049
1 34 100.9242
1 35 100.6543
1 36 100.5676
1 37 100.2341
1 38 100.9213
1 39 100.334

1 40 100.8727

SQL-MapReduce Call

SELECT * FROM ChangePointDetection (


ON cpt PARTITION BY sid ORDER BY id
ValueColumn ('val')
Accumulate ('sid', 'id')
) ORDER BY sid, id;

Output

Table 76: ChangePointDetection Example 2: Output Table

sid id cptid
1 8 1
1 11 2
1 21 3
1 31 4
1 34 5

Example 3: One Series, VERBOSE Output

Input
The input table for this example is the same table, cpt, that Example 2 uses (ChangePointDetection
Examples 2-6 Input Table cpt).

SQL-MapReduce Call

SELECT * FROM ChangePointDetection (


ON cpt PARTITION BY sid ORDER BY id
ValueColumn ('val')
Accumulate ('sid', 'id')
OutputOption ('VERBOSE')
) ORDER BY sid, id;

Output

Table 77: ChangePointDetection Example 3 Output Table

sid id cptid difference


1 8 1 7.47347903247601

1 11 2 97.3574610687669
1 21 3 48.6750178281853
1 31 4 52.088392576252
1 34 5 3.71888507066399

Example 4: One Series, Penalty 10

Input
The input table for this example is the same table, cpt, that Example 2 uses.

SQL-MapReduce Call

SELECT * FROM ChangePointDetection (


ON cpt PARTITION BY sid ORDER BY id
ValueColumn ('val')
Accumulate ('sid', 'id')
Penalty ('10')
) ORDER BY sid, id;

Output

Table 78: ChangePointDetection Example 4 Output Table

sid id cptid
1 11 1
1 21 2
1 31 3

Example 5: One Series, SEGMENT Output, Penalty 10

Input
The input table for this example is the same table, cpt, that Example 2 uses.

SQL-MapReduce Call

SELECT * FROM ChangePointDetection (


ON cpt PARTITION BY sid ORDER BY id
ValueColumn ('val')
Accumulate ('sid', 'id')
Penalty ('10')

OutputOption ('SEGMENT')
) ORDER BY sid, id;

Output

Table 79: ChangePointDetection Example 5 Output Table

sid#s id#s sid id segid


1 1 1 10 1
1 11 1 20 2
1 21 1 30 3
1 31 1 40 4

Example 6: One Series, Penalty 20, Linear Regression

Input
The input table for this example is the same table, cpt, that Example 2 uses.

SQL-MapReduce Call

SELECT * FROM ChangePointDetection (


ON cpt PARTITION BY sid ORDER BY id
ValueColumn ('val')
Accumulate ('sid', 'id')
Penalty ('20')
SegmentationMethod ('linear_regression')
);

Output

Table 80: ChangePointDetection Example 6 Output Table

sid id cptid
1 11 1
1 21 2
1 31 3


RtChangePointDetection

Summary
The RtChangePointDetection function detects change points in a stochastic process or time series, using
real-time change-point detection, implemented with these algorithms:
• Search algorithm: sliding window
• Segmentation algorithm: normal distribution
Use this function when the input data cannot be stored in memory, or when the application requires a real-
time response. If the input data can be stored in memory and the application does not require a real-time
response, use the function ChangePointDetection.

Usage

RtChangePointDetection Syntax
Version 1.0

SELECT * FROM RtChangePointDetection (


ON { table | view | query }
PARTITION BY partition_expr ORDER BY order_by_expr
ValueColumn ('value_column')
Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...])
[ SegmentationMethod ('normal_distribution') ]
[ WindowSize ('window_size') ]
[ Threshold ('change_point_threshold') ]
[ OutputOption ({ 'CHANGEPOINT' | 'VERBOSE' | 'SEGMENT' }) ]
);

Arguments
Argument Category Description
ValueColumn Required Specifies the name of the input table column that contains the time
series data.
Accumulate Required Specifies the names of the input table columns to copy to the output
table.

Tip:
To identify change points in the output table, specify the
columns that appear in partition_expr and order_by_expr.

SegmentationMethod Optional Specifies the segmentation method, normal distribution (in each
segment, the data is in a normal distribution). This is the default and
only possible value.

WindowSize Optional Specifies the size of the sliding window. The ideal window size
depends heavily on the data. You might need to experiment with
this value. The default value is 10.
Threshold Optional Specifies a DOUBLE PRECISION value that the function compares
to ln(L1)−ln(L0). The definitions of Log(L1) and Log(L0), the
logarithms of L1 and L0, are in Background. The default value is
10.
OutputOption Optional Specifies the output table columns. Refer to the Output section. The
default value is 'CHANGEPOINT'.

Input
Use the following table from the Input section of the ChangePointDetection function Usage section.
• Change-Point Detection Functions Input Table Schema

Output
Use the following table from the Output section of the ChangePointDetection function Usage section.
• Change-Point Detection Functions Output Table Schema for OutputOption ('CHANGEPOINT')

Examples
• Example 1: Threshold 10, Window Size 3, Default Output
• Example 2: Threshold 20, Window Size 3, VERBOSE Output
• Example 3: Threshold 100, Window Size 3, Default Output

Example 1: Threshold 10, Window Size 3, Default Output

Input
ChangePointDetection Examples 2-6 Input Table cpt

SQL-MapReduce Call

SELECT * FROM RtChangePointDetection (


ON cpt PARTITION BY sid ORDER BY id
ValueColumn ('val')
Threshold ('10')
WindowSize ('3')
Accumulate ('sid', 'id')
);

Output

Table 81: RtChangePointDetection Example 1 Output Table

sid id cptid
1 11 1
1 21 2
1 31 3
1 36 4

Example 2: Threshold 20, Window Size 3, VERBOSE Output

Input
ChangePointDetection Examples 2-6 Input Table cpt

SQL-MapReduce Call

SELECT * FROM RtChangePointDetection (


ON cpt PARTITION BY sid ORDER BY id
ValueColumn ('val')
Threshold ('20')
WindowSize ('3')
OutputOption ('VERBOSE')
Accumulate ('sid','id')
);

Output

Table 82: RtChangePointDetection Example 2 Output Table

sid id cptid difference


1 11 1 40764.171204234
1 21 2 41833.8977039489
1 31 3 48358.164361325
1 36 4 27.7125810367931

Example 3: Threshold 100, Window Size 3, Default Output

Input
ChangePointDetection Examples 2-6 Input Table cpt

SQL-MapReduce Call

SELECT * FROM RtChangePointDetection (


ON cpt PARTITION BY sid ORDER BY id
ValueColumn ('val')
Threshold ('100')
WindowSize ('3')
Accumulate ('sid','id')
);

Output

Table 83: RtChangePointDetection Example 3 Output Table

sid id cptid
1 11 1
1 21 2
1 31 3

Convergent Cross-Mapping
Convergent cross-mapping (CCM) is a method for evaluating whether one time series variable in a system
has a causal influence on another. Unlike the symmetric relationship of correlation, a causality relationship
detected by CCM can be unidirectional: while (A is correlated with B) always implies that (B is correlated
with A), the relationship found by CCM can simultaneously satisfy (A causes B) and (B does not cause A).
The intuition behind the CCM algorithm is that if variable A is a cause of variable B, then information about
time series A is reflected in time series B. Therefore, you can estimate A from B (this is the reverse of the
usual understanding of cause and effect). If the predictability of time series A improves with increasing
information from time series B, then A has a causal influence on B. This somewhat counter-intuitive
definition is described in more detail in the following references.
The mathematical justification for this approach depends on a result from dynamical systems theory,
Takens' Theorem, which demonstrates that a complex dynamical system can be "embedded" into a low-
dimensional space. This approach is designed for short time series (fewer than 30 points) for which
multiple samples are available.
To test for causality, the CCM function:
1. Chooses a library of short time series from the effect variable.
2. Uses this library to predict values of the cause variable.
The function uses a k-nearest neighbors algorithm to predict the cause variable from the effect variable
and a bootstrapping process to estimate the uncertainty associated with the predicted values.
3. Evaluates the goodness-of-fit of the predictions.
For numerical variables, the function determines goodness-of-fit using the correlation between the
predictions and the true values. For categorical variables, the function determines goodness-of-fit using
the Jaccard Index.
4. Repeats this procedure for libraries of different sizes.

If increasing the library size results in a significant improvement of the goodness-of-fit, and the
correlation is significantly greater than zero, then there is a causal relationship. You can be sure that you
have considered large enough libraries if the goodness-of-fit converges (that is, stops improving) as you
continue to increase library size.
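As an illustration of steps 1 through 3, the following simplified Python sketch predicts a cause series from
the shadow manifold of an effect series (simplex, or k-nearest-neighbor, projection) and scores the
predictions by correlation. All names are illustrative assumptions, and this is only a sketch of the published
algorithm, not the Aster implementation: the bootstrap repetitions, the Jaccard index for categorical
variables, and the convergence test across library sizes described above are omitted.

import numpy as np

def shadow_manifold(x, E=2, tau=1):
    # Lagged-coordinate (Takens-style) embedding of series x.
    start = (E - 1) * tau
    return np.array([x[t - start:t + 1:tau] for t in range(start, len(x))])

def ccm_skill(effect, cause, E=2, tau=1, library_size=None, seed=0):
    # Predict the cause series from the effect series' shadow manifold
    # and return the correlation between predictions and true values.
    rng = np.random.default_rng(seed)
    M = shadow_manifold(np.asarray(effect, float), E, tau)
    target = np.asarray(cause, float)[(E - 1) * tau:]
    n = len(M)
    lib = rng.choice(n, size=min(library_size or n, n), replace=False)
    preds = np.empty(n)
    for i in range(n):
        cand = lib[lib != i]                  # never use the point itself
        d = np.linalg.norm(M[cand] - M[i], axis=1)
        order = np.argsort(d)[:E + 1]         # E+1 nearest neighbors
        w = np.exp(-d[order] / max(d[order][0], 1e-12))
        preds[i] = np.dot(w, target[cand[order]]) / w.sum()
    return np.corrcoef(preds, target)[0, 1]   # goodness-of-fit

# Causality from A to B is suggested if ccm_skill(B, A, ...) increases
# and converges as library_size grows, with a final correlation above zero.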
The following references provide more detail:
• Sugihara et al. Detecting Causality in Complex Systems, Science, 2012.
• Clark et al. Spatial convergent cross mapping to detect causal relationships from short time series, Ecology,
2015.

CCMPrepare
The function CCMPrepare adds a new partition column, aster_ccm_id, and partitions the data for use with
the CCM function. CCM partitions the data automatically, but to ensure that the data is partitioned the
same way over multiple executions of the function, use CCMPrepare to create the input table for CCM.

Usage

CCMPrepare Syntax
Version 1.0

SELECT * FROM CCMPrepare (


ON { table | view | query } PARTITION BY key
);

Arguments
Argument Category Description
InputTable Required Table containing the input data.

Input
The input table must contain id columns for the time series and the period within the time series (the
timepoints).
Table 84: CCMPrepare Input Table Schema

Column Name Data Type Description


sequence_id Any except FLOAT or DOUBLE ID of the time sequence.
timeperiod_id Any Time associated with each value in the timeseries value
columns.

timeseries_values Any Any number of columns. Values from the time series to be
tested for potential causal relationships.

Output
Table 85: CCMPrepare Output Table Schema

Column Name Data Type Description


aster_ccm_id INTEGER Partition column used internally by the function to ensure
repeatability across multiple function calls.
sequence_id Any except FLOAT or DOUBLE ID of the time sequence.
timeperiod_id Any Time associated with each value in the timeseries values
columns.
timeseries_values Any Any number of columns. Values from the time series to be
tested for potential causal relationships.

Example

Input
The input table, ccmprepare_input, is a collection of nine time series consisting of 10 values for each of three
variables (expenditure, income, and investment).
Table 86: CCMPrepare Example Input Table ccmprepare_input

id period expenditure income investment


1 1 415 451 180
1 2 421 465 179
1 3 434 485 185
1 4 448 493 192
1 5 459 509 211
1 6 458 520 202
1 7 497 521 207
1 8 487 540 214
1 9 497 548 231
1 10 510 558 229
2 1 516 574 234

2 2 525 583 237
2 3 529 591 206
2 4 538 599 250
2 5 546 610 259
2 6 555 627 263
2 7 574 642 264
2 8 574 653 280
2 9 586 660 282
2 10 602 694 292
... ... ... ... ...

SQL-MapReduce Call
This call splits the input sequences (column id) into two partitions. The partition to which each sequence is
assigned is identified by the column aster_ccm_id in the output.

DROP TABLE IF EXISTS ccm_input;

CREATE TABLE ccm_input DISTRIBUTE BY hash(aster_ccm_id) AS


SELECT * FROM CCMPrepare (
ON ccmprepare_input PARTITION BY id
);

Output
This query returns the following table:

SELECT * FROM ccm_input ORDER BY 1, 2, 3;

Table 87: CCMPrepare Example Output Table

aster_ccm_id id period expenditure income investment


0 1 1 415 451 180
0 1 2 421 465 179
0 1 3 434 485 185
0 1 4 448 493 192
0 1 5 459 509 211
0 1 6 458 520 202
0 1 7 479 521 207

0 1 8 487 540 214
0 1 9 497 548 231
0 1 10 510 558 229
... ... ... ... ... ...
0 9 1 1650 1910 611
0 9 2 1685 1943 597
0 9 3 1722 1976 603
0 9 4 1752 2018 619
0 9 5 2145 2521 833
0 9 6 2164 2545 860
0 9 7 2206 2580 870
0 9 8 2225 2620 830
0 9 9 2235 2639 801
0 9 10 2237 2618 824
... ... ... ... ... ...
1 2 2 525 583 237
1 2 3 529 591 206
1 2 4 538 599 250
1 2 5 546 610 259
1 2 6 555 627 263
1 2 7 574 642 264
1 2 8 574 653 280
1 2 9 586 660 282
1 2 10 602 694 292
... ... ... ... ... ...
1 8 1 1355 1613 525
1 8 2 1371 1642 519
1 8 3 1402 1690 526
1 8 4 1452 1759 510
1 8 5 1485 1756 519
1 8 6 1516 1780 538
1 8 7 1549 1807 549

1 8 8 1567 1831 570
1 8 9 1588 1873 559
1 8 10 1631 1897 584
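
Because the purpose of CCMPrepare is repeatable partitioning, a quick sanity check (a suggested query, not
part of the original example) is to confirm that each sequence maps to exactly one partition:

SELECT id, COUNT(DISTINCT aster_ccm_id) AS partition_count
FROM ccm_input
GROUP BY id
ORDER BY id;

Every row should report partition_count = 1, and rerunning CCMPrepare on the same input should
reproduce the same id-to-partition assignment.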

CCM

Summary
The Teradata Aster CCM (convergent cross mapping) function allows the user to test multiple causes and
effects simultaneously. The function reports an effect size for each cause-effect pair.

Usage

CCM Syntax
Version 1.0

SELECT * FROM CCM (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('input_table')
  SequenceIdColumn ('sequence_column')
  TimeColumn ('time_column')
  CauseColumns ({ 'cause_column' | 'cause_column_range' }[,...])
  EffectColumns ({ 'effect_column' | 'effect_column_range' }[,...])
  [ LibrarySize ('library_size' [,...]) ]
  [ EmbeddingDimension ('integer') ]
  [ TimeStep ('integer') ]
  [ BootstrapIterations ('integer') ]
  [ PointsToTest ('integer') ]
  [ SelfPredict ({'true'|'false'}) ]
  [ Seed ('long') ]
);

Arguments
Argument Category Description
InputTable Required Table containing the input data.
SequenceIdColumn Required Column containing the sequence ids. A sequence is a sample
of the time series.
TimeColumn Required Column containing the timestamps.
CauseColumns Required Columns to be evaluated as potential causes.
EffectColumns Required Columns to be evaluated as potential effects.

LibrarySize Optional The CCM algorithm works by using “libraries” of randomly
selected points along the potential effect time series to
predict values of the cause time series. A causal relationship
is said to exist if the correlation between the predicted values
of the cause time series and the actual values increases as the
size of the library increases.
Each input value must be greater than 0. The default value is
100.
EmbeddingDimension Optional The embedding dimension is an estimate of the number of
past values to use when predicting a given value of the time
series. The input value must be greater than 0. The default
value is 2.
TimeStep Optional The TimeStep parameter indicates the number of time steps
between past values to use when predicting a given value of
the time series. The input value must be greater than 0. The
default value is 1.
BootstrapIterations Optional The number of bootstrap iterations used in prediction. The
bootstrap process estimates the uncertainty associated with
the predicted values.
The input value must be greater than 0. The default value is
100.
PointsToTest Optional The number of data points to predict using the library. The
default is to predict all valid points along all time series.
SelfPredict Optional If SelfPredict has the value 'true', the function tries to
predict each attribute using the attribute itself. If an attribute
can predict its own time series well, the signal-to-noise ratio
is too low for the CCM algorithm to work effectively. The
default value is 'false'.
Seed Optional Specifies the random seed used to initialize the algorithm.

Input
The input table must contain id columns for the time series and the period within the time series (the
timepoints).
To ensure repeatability, the user has the option of creating a table using the CCMPrepare function. This
function adds a column, “aster_ccm_id”, to the input table so that the partitioning of data across workers is
guaranteed to be consistent over multiple function calls. If the InputTable contains an “aster_ccm_id”
column, the function assumes that data has been prepared using CCMPrepare. If it does not contain this
column, the function generates this partitioning column internally.

Note:
The Aster CCM function supports categorical variables as possible causes and effects. This feature is to be
considered experimental.

Table 88: CCM Input Table Schema

Column Name        Data Type                   Description
aster_ccm_id       INTEGER                     Optional. Partition column used internally by the function to ensure repeatability across multiple function calls. If this column is not present, the function generates it automatically.
sequence_id        Any except FLOAT or DOUBLE  ID of the time sequence.
timeperiod_id      Any                         Time associated with each value in the timeseries values columns.
timeseries_values  Any                         Any number of columns. Values from the time series to be tested for potential causal relationships.

Output
Table 89: CCM Output Schema

Column Name     Data Type         Description
cause           VARCHAR           The input attribute (column) being evaluated as a potential causal variable.
effect          VARCHAR           The input attribute (column) being evaluated as a potential effect variable.
library_size    INTEGER           The size of the library evaluated.
correlation     DOUBLE PRECISION  For numerical cause variables only, the correlation between the values predicted by the effect attribute and the true value of the cause attribute. For categorical cause variables, this column is null.
jaccard_index   DOUBLE PRECISION  For categorical cause variables only, the Jaccard similarity index between the values predicted by the effect attribute and the true value of the cause attribute. For numerical cause variables, this column is null.
lower_bound     DOUBLE PRECISION  The lower bound of the 95% confidence interval of the prediction contained in the correlation or jaccard_index column.
upper_bound     DOUBLE PRECISION  The upper bound of the 95% confidence interval of the prediction contained in the correlation or jaccard_index column.
effect_size     DOUBLE PRECISION  The estimated effect size of increasing the library size from the smallest value to the largest value. A value appears in this column for the smallest library size used with each cause-effect pair; for other library sizes, this column is null. For numerical effects, the effect size is Cohen's q statistic. For categorical effects, the effect size is the difference of the two similarity measures.
effect_size_sd  DOUBLE PRECISION  The standard deviation of the effect size.

Examples
• Example 1: Numeric Causes and Effects with Default Values
• Example 2: Mixed Categorical and Numeric Causes and Effects

Example 1: Numeric Causes and Effects with Default Values

Input
The input to the CCM function is the table ccm_input, created by CCMPrepare and shown in CCMPrepare
Example Output Table. The example investigates income as a possible cause of expenditure and investment.
Other than EmbeddingDimension ('5'), the example uses default argument values.

SQL-MapReduce Call

SELECT cause, effect, library_size,
  round(cast("correlation" AS numeric), 5) AS "correlation",
  jaccard_index, round(cast("lower_bound" AS numeric), 5) AS "lower_bound",
  round(cast("upper_bound" AS numeric), 5) AS "upper_bound",
  round(cast("effect_size" AS numeric), 5) AS "effect_size",
  round(cast("effect_size_sd" AS numeric), 5) AS "effect_size_sd"
FROM CCM (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('ccm_input')
  SequenceIdColumn ('id')
  TimeColumn ('period')
  CauseColumns ('income')
  EffectColumns ('expenditure', 'investment')
  EmbeddingDimension ('5')
  Seed ('0')
) ORDER BY cause, effect, library_size;

Output
As the library_size increases, the correlation increases, which suggests a causal relationship. Intuition
confirms that expenditures and investment would be driven by income. Because the cause and effect
variables are numeric, there is no jaccard_index value.

Table 90: CCM Example 1 Output Table (Columns 1-5)

cause effect library_size correlation jaccard_index


income expenditure 6 0.87132
income expenditure 100 0.97057
income investment 6 0.82730
income investment 100 0.93925

Table 91: CCM Example 1 Output Table (Columns 6-9)

lower_bound upper_bound effect_size effect_size_sd


0.85363 0.88701 0.76356 0.07426
0.96215 0.97715
0.80477 0.84745 0.55214 0.06759
0.92429 0.95133

Example 2: Mixed Categorical and Numeric Causes and Effects

Input
The input table, ccm_input2, contains data from three indices of stock market performance (COMP, DJIA,
NDX). The categorical variables (marketindex, indexdate) and the numerical variables (indexval,
indexchange) are time series values spread across two sequences (id). This example shows cause and effect
with one numerical and one categorical variable.
Table 92: CCM Example 2 Input Table ccm_input2

aster_ccm_id id period marketindex indexdate indexval indexchange


1 1 1 COMP 2005-01-01 4275 -10
1 1 2 DJIA 2005-01-01 15600 -250
1 1 3 NDX 2005-01-01 3900 -10
1 1 4 COMP 2005-01-02 4280 5
1 1 5 DJIA 2005-01-02 15800 200
1 1 6 NDX 2005-01-02 3910 10
1 2 1 COMP 2005-01-03 4290 10
1 2 2 DJIA 2005-01-03 15700 -100
1 2 3 NDX 2005-01-03 3920 10
1 2 4 COMP 2005-01-04 4280 -10
1 2 5 DJIA 2005-01-04 15600 -100
1 2 6 NDX 2005-01-04 3910 -10

SQL-MapReduce Call

SELECT cause, effect, library_size,
  round(cast("correlation" AS numeric), 5) AS "correlation",
  jaccard_index, round(cast("lower_bound" AS numeric), 5) AS "lower_bound",
  round(cast("upper_bound" AS numeric), 5) AS "upper_bound",
  round(cast("effect_size" AS numeric), 5) AS "effect_size",
  round(cast("effect_size_sd" AS numeric), 5) AS "effect_size_sd"
FROM CCM (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('ccm_input2')
  SequenceIdColumn ('id')
  TimeColumn ('period')
  CauseColumns ('marketindex', 'indexval')
  EffectColumns ('indexdate', 'indexchange')
  LibrarySize ('10', '25', '50')
  Seed ('0')
) ORDER BY cause, effect, library_size;

Output

Table 93: CCM Example 2 Output Table (Columns 1-5)

cause effect library_size correlation jaccard_index


indexval indexchange 10 0.07415
indexval indexchange 25 0.51966
indexval indexchange 50 0.55927
indexval indexdate 10 -0.52809
indexval indexdate 25 -0.43179
indexval indexdate 50 -0.42503
marketindex indexchange 10 0.51125
marketindex indexchange 25 0.6
marketindex indexchange 50 0.6425
marketindex indexdate 10 0.0175
marketindex indexdate 25 0
marketindex indexdate 50 0

Table 94: CCM Example 2 Output Table (Columns 6-9)

lower_bound upper_bound effect_size effect_size_sd


-0.12229 0.26500 0.55748 0.12117
0.42471 0.60330
0.46165 0.64350
-0.64236 -0.39085 0.13368 0.11398

-0.46852 -0.39358
-0.53209 -0.30460
0.47383 0.54867 0.13125 0.02542
0.56526 0.63474
0.60961 0.67539
0.00400 0.03100 -0.01750 0.00689
0.00000 0.00000
0.00000 0.00000

For numeric variables, the correlation indicates the relationship between the values of the cause variable (as
predicted by the effect variable) and the true values of the cause variable. The example shows a steadily
increasing absolute value of the correlation between indexval and indexchange, and a high effect size (0.557).
There is no clear trend for the correlation between indexval and indexdate.

DTW

Summary
The DTW function performs dynamic time warping (DTW), which measures the similarity (warping
distance) between two time series that vary in time or speed. You can use DTW to analyze any data that can
be represented linearly—for example, video, audio, and graphics.
For example:
• In two videos, DTW can detect similarities in walking patterns, even if in one video the person is walking
slowly and in another, the same person is walking fast.
• In audio, DTW can detect similarities in different speech speeds (and is therefore very useful in speech
recognition applications).
Given an input table, a template table, and a mapping table, DTW compares each time series in the input
table to the corresponding time series in the template table. The correspondence is defined by the mapping
table.

For more information, see FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space.
Stan Salvador & Philip Chan. KDD Workshop on Mining Temporal and Sequential Data, pp. 70-80, 2004
(http://cs.fit.edu/~pkc/papers/tdm04.pdf).
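
As background (the standard DTW recurrence, not specific to the Aster implementation), the warping
distance between series x1, ..., xn and y1, ..., ym is defined by

DTW(i, j) = d(xi, yj) + min{DTW(i-1, j), DTW(i, j-1), DTW(i-1, j-1)}

with DTW(0, 0) = 0 and DTW(i, 0) = DTW(0, j) = infinity for i, j > 0, where d is the pointwise distance
selected by the DistMethod argument. The reported warping distance is DTW(n, m), and the warping path
is the sequence of (i, j) cells that achieves the minimum.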

Usage

DTW Syntax
Version 1.0

SELECT * FROM DTW (
  ON input_table AS input_table
    PARTITION BY i_partition_column [,...]
    ORDER BY i_ordering_column [,...]
  ON template_table AS template_table DIMENSION
    ORDER BY t_ordering_column [,...]
  ON mapping_table AS mapping_table
    PARTITION BY m_partition_column [,...]
  InputColumns ('i_value', 'i_timestamp')
  TemplateColumns ('t_value', 't_timestamp')
  TimeseriesID ('timeseriesid' [,...])
  TemplateID ('templateid' [,...])
  [ Radius ('radius') ]
  [ DistMethod ('distance_metric') ]
  [ WarpPath ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Arguments
Argument Category Description
InputColumns Required Specifies the names of the input_table columns that contain
the values and timestamps of the time series.

Note:
The InputColumns argument has the alternate name
Input_Table_Value_Column_Names.

Note:
If these columns contain NaN or infinity values, then use
a WHERE clause to remove them.

TemplateColumns Required Specifies the names of the template_table columns that
contain the values and timestamps of the time series.

Note:
The TemplateColumns argument has the alternate name
Template_Table_Value_Column_Names.

Note:
If these columns contain NaN or infinity values, then use
a WHERE clause to remove them.

TimeseriesID Required Specifies the names of the columns by which the input_table
is partitioned. These columns comprise the unique ID for a
time series in input_table.
TemplateID Required Specifies the names of the columns by which the
template_table is ordered. These columns comprise the
unique ID for a time series in template_table.
Radius Optional Specifies the integer value that determines the projected
warp path from a previous resolution. The default value is
10.
DistMethod Optional Specifies the metric for computing the warping distance.
The supported values of distance_metric, which are
case-sensitive, are:
• 'EuclideanDistance' (default)
• 'ManhattanDistance'
• 'BinaryDistance'
These values are further described in the Background
section of the VectorDistance function.
The DistMethod argument has the alternate name Metric.
WarpPath Optional Determines whether to output the warping path. The default
value is 'false'.

Input
The DTW function requires three input tables:

• input_table, in which each row contains information for one time series
• template_table, in which each row contains information for one time series
• mapping_table, which defines the mapping between input_table rows and template_table rows
The columns by which input_table and mapping_table are partitioned must agree in number and data type.
That is, each i_partition_column must have a corresponding m_partition_column with the same data type.
However, corresponding partition columns can have different names.
The columns by which input_table and template_table are ordered must agree in number and data type.
That is, each i_ordering_column must have a corresponding t_ordering_column with the same data type.
However, corresponding ordering columns can have different names.
Table 95: DTW input_table Schema

Column Name   Data Type                                                Description
timeseriesid  INTEGER                                                  Unique identifier for a time series.
timestamp     INTEGER, SMALLINT, BIGINT, NUMERIC, or DOUBLE PRECISION  Timestamp for the time series.
value         DOUBLE PRECISION                                         Value for the timestamp.

Table 96: DTW template_table Schema

Column Name  Data Type                                                Description
templateid   INTEGER                                                  Unique identifier for a time series.
timestamp    INTEGER, SMALLINT, BIGINT, NUMERIC, or DOUBLE PRECISION  Timestamp for the time series.
value        DOUBLE PRECISION                                         Value for the timestamp.

Table 97: DTW mapping_table Schema

Column Name   Data Type  Description
timeseriesid  INTEGER    Unique identifier for a time series in input_table.
templateid    INTEGER    Unique identifier for the time series in template_table that corresponds to the time series specified by timeseriesid.

Note:
In mapping_table, DTW supports a single ID column in the input_table and template_table.

Output
Table 98: DTW Output Table Schema

Column Name   Data Type         Description
timeseriesID  INTEGER           Time series ID.
templateID    INTEGER           Template ID.
warpDistance  DOUBLE PRECISION  Warp distance.

Note:
By definition, DTW(0, 0)=0 and DTW(n, 0)=DTW(0, n)=infinity.

Note:
The names of the output table columns are case-sensitive. You must enclose them in double quotation
marks in SQL statements; for example: ORDER BY "timeseriesID".

Example
This example compares multiple time series to both a common template and each other. Each time series
represents stock prices and the template represents a series of stock index prices.

Input
Table 99: DTW Example Input Table timeseriesdata

timeseriesid timestamp1 stockprice


1 0 24.2019
1 0.025063 27.8701
1 0.050125 31.4969
1 0.075188 35.083
1 0.100251 38.6286
1 0.125313 42.1343
1 0.150376 45.6005
1 0.175439 49.0276
1 0.200501 52.4162
1 0.225564 55.7666
1 0.250627 59.0792
... ... ...

Table 100: DTW Example Template Table templatedata

templateid timestamp2 index_price


1 0 0
1 0.025063 0
1 0.050125 0
1 0.075188 0
1 0.100251 0
1 0.125313 0
1 0.150376 0
1 0.175439 0
1 0.200501 0
1 0.225564 0
1 0.250627 0
... ... ...

Table 101: DTW Example Mapping Table mappingdata

timeseriesid templateid
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
4 1
4 2
4 3

SQL-MapReduce Call

SELECT * FROM DTW (
  ON timeseriesdata AS input_table
    PARTITION BY timeseriesid
    ORDER BY timestamp1
  ON templatedata AS template_table DIMENSION
    ORDER BY timestamp2
  ON mappingdata AS mapping_table
    PARTITION BY timeseriesid
  InputColumns ('stockprice', 'timestamp1')
  TemplateColumns ('index_price', 'timestamp2')
  TimeSeriesID ('timeseriesid')
  TemplateID ('templateid')
) ORDER BY "timeseriesID";

Output
Table 102: DTW Example Output Table

timeseriesID templateID warpDistance


1 1 25163.9
1 2 7547.69
1 3 19577.6
2 1 132.669
2 2 1904.08
2 3 71.7805
3 1 351.676
3 2 3614.2
3 3 75.7767
4 1 4927.61
4 2 914.257
4 3 16641.6

The following figure shows a plot of the results.

Figure 6: DTW Example Results Plot

The warping distance is an unnormalized measure of how dissimilar two time series are. The warpDistance
column in the output table contains the warping distance for every pair in the mapping table; that is, for
every combination of timeseriesID and templateID.
As the preceding figure shows, input 2 is more similar to templates 1 and 3 than to template 2. The warp
distances also show this: for templates 1 and 3, they are 132.669 and 71.7805, while for template 2 the
warping distance is about 1904.
Because the dissimilarity of two time series does not depend on whether they are temporally close (time is
stretched, so two time series offset by a constant time interval are effectively the same), input 3 is not very
dissimilar to templates 1 and 3. However, input 4 has the largest warping distance from templates 1 and 3,
because the curvature of those two templates is far from that of input 4. Time stretching brings input 4
closer to templates 1 and 3, but only along a longer warping path (not shown in the output above) and
therefore with a larger warping distance.

DWT

Summary
The DWT function implements Mallat’s algorithm (an iterative algorithm in the Discrete Wavelet
Transform field) and applies the wavelet transform to multiple sequences simultaneously.
The input is typically a set of time series sequences. You specify the wavelet name or wavelet filter table,
transform level, and (optionally) extension mode. The function returns the transformed sequences in Hilbert
space with the corresponding component identifiers and indices. (The transformation is also called the
decomposition.)


Note:
The wavelet filter table does not appear in the preceding diagram because it is seldom used.

You can filter the result to reduce the lengths of the transformed sequences and then use the function IDWT
to reconstruct them; therefore, the DWT and IDWT functions are useful for compression and removing
noise.

Background
DWT is a time-frequency analysis tool for which the wavelets are discretely sampled. DWT is different from
the Fourier transform, which provides frequency information on the whole time domain. A key advantage of
DWT is that it provides frequency information at different time points.
Mallat’s algorithm can be described as a series of iterative steps. For example, for a 3-level wavelet transform:
1. Use the original time domain sequence S(n) as the input of level 1.
2. Convolve the input sequence with the high-pass filter h(n) and the low-pass filter g(n), and downsample
the results.
The two generated sequences are the detail coefficients Dk and the approximation coefficients Ak at level
k; the equations after this list state this step precisely.
3. If the current level k is the maximum transform level n, then stop; otherwise, use Ak as the input sequence
for the next level (that is, increment k by 1 and go to step 2).
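
In equation form (one standard statement of the filter-and-downsample step; indexing conventions vary by
reference), with A0 = S, each level k computes

Ak[m] = (Ak-1 * g)[2m] = Σn Ak-1[n] g[2m - n]
Dk[m] = (Ak-1 * h)[2m] = Σn Ak-1[n] h[2m - n]

that is, convolution with the low-pass or high-pass filter, keeping every second output sample
(downsampling by a factor of 2).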

Usage

DWT Syntax
Version 1.3

SELECT * FROM DWT (
  ON (SELECT 1) PARTITION BY 1
  [ Domain ('host:port') ]
  [ Database ('db_name') ]
  [ UserID ('user_id') ]
  [ Password ('password') ]
  [ SSLSettings ('SSLsettings') ]
  [ SSLTrustStorePassword ('SSLtruststorepassword') ]
  InputTable ('input_table')
  OutputTable ('output_table')
  MetaTable ('meta_table')
  InputColumns ({ 'input_column' | 'input_column_range' }[,...])
  SortColumn ('sort_column')
  [ PartitionColumns
    ({ 'partition_column' | 'partition_column_range' }[,...]) ]
  { Wavelet ('wavelet') |
    WaveletFilterTable ('wavelet_filter_table') }
  Level (level)
  [ ExtensionMode ('extension_mode') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the sequences
to be transformed.
OutputTable Required Specifies the name for the table that the function creates to store
the coefficients generated by the wavelet transform. This table
must not exist.
MetaTable Required Specifies the name for the table that the function creates to store
the meta information for the wavelet transform. This table must
not exist.
InputColumns Required Specifies the names of the columns in the input table or view that
contain the data to be transformed. These columns must contain
numeric values between -1e308 and 1e308. The function treats
NULL as 0.
SortColumn Required Specifies the name of the column that defines the order of samples
in the sequences to be transformed. In a time series sequence, the
column can consist of timestamp values.

Note:
If sort_column has duplicate elements in a sequence (that is, in
a partition), then sequence order can vary, and the function
can produce different transform results for the sequence.

PartitionColumns Optional Specifies the names of the partition columns, which identify the
sequences. Rows with the same partition column values belong to
the same sequence. If you specify multiple partition columns, then
the function treats the first one as the distribute key of the output
and meta tables.
By default, all rows belong to one sequence, and the function
generates a distribute key column named dwt_idrandom_name in
both the output table and the meta table. In both tables, every cell
of dwt_idrandom_name has the value 1.
Wavelet Optional Specifies a wavelet filter name from the following table.

WaveletFilterTable Optional Specifies the name of the table that contains the coefficients of the
wave filters.
Level Required Specifies the wavelet transform level. The value level must be an
integer in the range [1, 1000].
ExtensionMode Optional Specifies the method for handling border distortion, an
extension_mode from the second of the two following tables. The
default value is 'sym'.

Table 103: Supported Wavelet Filter Names

Wavelet Family Supported Wavelet Names (wavelet values)


Daubechies 'db1' (or 'haar'), 'db2', ..., 'db10'
Coiflets 'coif1', ..., 'coif5'
Symlets 'sym1', ..., 'sym10'
Discrete Meyer 'dmey'
Biorthogonal 'bior1.1', 'bior1.3', 'bior1.5',
'bior2.2', 'bior2.4', 'bior2.6', 'bior2.8',
'bior3.1', 'bior3.3', 'bior3.5', 'bior3.7', 'bior3.9',
'bior4.4', 'bior5.5'
Reverse 'rbio1.1', 'rbio1.3', 'rbio1.5'
Biorthogonal 'rbio2.2', 'rbio2.4', 'rbio2.6', 'rbio2.8',
'rbio3.1', 'rbio3.3', 'rbio3.5', 'rbio3.7','rbio3.9',
'rbio4.4', 'rbio5.5'

For the examples in the following table, assume that the sequence before the extension is 1 2 3 4 and the
convolution kernel in the wavelet filter has the length 6, which means that the length of the sequence is to be
extended by 5 positions before and after the sequence.
Table 104: Supported Extension Modes

Supported Extension Mode Description


(extension_mode value)
sym (Default) Symmetrically replicate boundary values, mirroring the
points near the boundaries. For example:
44321|1234|43211
zpd Zero-pad the boundary values. For example:
00000|1234|00000
ppd Periodic extension; fills boundary values as if the input sequence
were periodic. For example:
41234|1234|12341

Input
The DWT function has a required input table (or view) and an optional wavelet filter table. If you omit the
wavelet filter table, then you must specify the Wavelet argument.
The input table can contain at most 1594 columns. The function assumes that each sequence fits into the
memory of a worker. The following table describes the input table columns that you can or must specify
with arguments. The input table can have additional columns, but the function ignores them.
Table 105: DWT Input Table Schema

Column Name       Data Type                                                                                               Description
partition_column  INTEGER, SMALLINT, BIGINT, NUMERIC [(p, s)], TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA                  Optional. Sequence identifier of the sequence to which the data belongs. Rows with the same partition column values belong to the same sequence. If the input table has multiple partition columns, then the function treats the first one as the distribute key of the output and meta tables.
sort_column       INTEGER, SMALLINT, BIGINT, BIGSERIAL, DOUBLE PRECISION, NUMERIC [(p, s)], SERIAL, TIME, or TIMESTAMP    Position of the data in the sequence to which it belongs.
input_column      INTEGER, SMALLINT, BIGINT, BIGSERIAL, DOUBLE PRECISION, NUMERIC [(p, s)], or SERIAL                     Data to be transformed—numeric values between -1e308 and 1e308 (or NULL, which the function treats as 0).

Table 106: DWT Wavelet Filter Table Schema

Column Name  Data Type  Description
filtername   VARCHAR    Wave filter name from the following table.
filtervalue  VARCHAR    Decomposed or reconstructed wavelet filter, represented as a comma-separated sequence (for example, the conjugated scale coefficients for the orthogonal wavelet: -0.1294095225512604, 0.2241438680420134, 0.8365163037378081, 0.4829629131445342). For details, refer to the following table.

Table 107: Wavelet Filter Table Names and Values

filtername filtervalue
lowpassfilter Decomposed low-pass filter, represented as a comma-separated sequence; the conjugated
scale coefficients for the orthogonal wavelet. For example:
-0.1294095225512604, 0.2241438680420134, 0.8365163037378081, 0.4829629131445342
highpassfilter Decomposed high-pass filter, represented as a comma-separated sequence; the
conjugated wavelet coefficients for the orthogonal wavelet. For example:

-0.4829629131445342, 0.8365163037378081, -0.2241438680420134,
-0.1294095225512604
ilowpassfilter Reconstructed low-pass filter, represented as a comma-separated sequence; the scale
coefficients for the orthogonal wavelet. For example:
0.4829629131445342, 0.8365163037378081, 0.2241438680420134, -0.1294095225512604
ihighpassfilter Reconstructed high-pass filter, represented as a comma-separated sequence; the wavelet
coefficients for the orthogonal wavelet. For example:
-0.1294095225512604, -0.2241438680420134, 0.8365163037378081,
-0.4829629131445342
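
If you prefer the WaveletFilterTable argument to the Wavelet argument, you can build the filter table
yourself. The following is a minimal sketch using the db2 coefficient values shown above; the table name
my_db2_filters and the DISTRIBUTE BY REPLICATION clause are illustrative assumptions, not part of the
original examples:

CREATE TABLE my_db2_filters (
  filtername VARCHAR,
  filtervalue VARCHAR
) DISTRIBUTE BY REPLICATION;

-- Decomposition filters for db2
INSERT INTO my_db2_filters VALUES ('lowpassfilter',
  '-0.1294095225512604, 0.2241438680420134, 0.8365163037378081, 0.4829629131445342');
INSERT INTO my_db2_filters VALUES ('highpassfilter',
  '-0.4829629131445342, 0.8365163037378081, -0.2241438680420134, -0.1294095225512604');
-- Reconstruction filters for db2
INSERT INTO my_db2_filters VALUES ('ilowpassfilter',
  '0.4829629131445342, 0.8365163037378081, 0.2241438680420134, -0.1294095225512604');
INSERT INTO my_db2_filters VALUES ('ihighpassfilter',
  '-0.1294095225512604, -0.2241438680420134, 0.8365163037378081, -0.4829629131445342');

In the DWT call, WaveletFilterTable ('my_db2_filters') then takes the place of Wavelet ('db2').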

Output
The DWT function outputs a message that indicates whether the function succeeded, an output table of
transformed (decomposed) sequences, and a meta table of wavelet-related information.
Table 108: DWT Output Message

Column Name Data Type Description


messages VARCHAR Message that indicates whether the function was successful.

Table 109: DWT Output Table Schema

Column Name       Data Type                                                 Description
partition_column  Inherited from the input table column with the same name  Sequence identifier of the sequence to which the data belongs. Rows with the same partition column values belong to the same sequence. The output table has a partition_column for every partition_column in the input table. If the input table has multiple partition columns, then the first one is the distribute key in both the input and output tables. If you do not specify PartitionColumns, then the output table has as its distribute key a function-generated column named dwt_idrandom_name. Every cell of dwt_idrandom_name has the value 1.
waveletid         INTEGER                                                   Index of each wavelet coefficient (starting from 1 for each sequence).
waveletcomponent  VARCHAR                                                   Component to which the coefficient belongs. Possible values are An, Dn, Dn-1, ..., D1, where n is the wavelet transform level.
input_column      DOUBLE PRECISION                                          Coefficient of the corresponding input column after the wavelet transform. The output table has an input_column for every input_column in the input table.

Table 110: DWT Meta Table Schema

Column Name       Data Type                                                 Description
partition_column  Inherited from the input table column with the same name  Sequence identifier of the sequence to which the data belongs. Rows with the same partition column values belong to the same sequence. The meta table has a partition_column for every partition_column in the input table. If the input table has multiple partition columns, then the first one is the distribute key in both the input and meta tables. If you do not specify PartitionColumns, then the meta table has as its distribute key a function-generated column named dwt_idrandom_name. Every cell of dwt_idrandom_name has the value 1.
meta              VARCHAR                                                   Meta information name from the following table.
content           VARCHAR                                                   Meta information that corresponds to the meta name (refer to the following table).

The following table summarizes the information that the meta table contains for each sequence.
Table 111: DWT Meta Information for Each Sequence

meta content
blocklength Length of each component after transformation, from An to D1. For example: 8, 8, 13
length Length of the sequence before transformation. For example: 24
waveletname Name of the wavelet used in the transformation. For example: db2
lowpassfilter Low-pass filter used in the decomposition of the wavelet.
highpassfilter High-pass filter used in the decomposition of the wavelet.
ilowpassfilter Low-pass filter used in the reconstruction of the wavelet.
ihighpassfilter High-pass filter used in the reconstruction of the wavelet.
level Level of wavelet transform performed.
extensionmode Extension mode used in the wavelet transform.

Example
This example uses hourly climate data for five cities (Asheville, Greenville, Brownsville, Nashville and
Knoxville) on a given day. The data are temperature (in degrees Fahrenheit), pressure (in mbars), and
dewpoint (in degrees Fahrenheit). The function generates the coefficient model table and the meta table,
which are used as input to the function IDWT.

Input
Table 112: DWT Example Input Table ville_climatedata

city period temp_f pressure_mbar dewpoint_f


Asheville 2010-01-01 00:00:00 34.9 1020.5 28.9
Asheville 2010-01-01 01:00:00 34.4 1020.2 28.7
Asheville 2010-01-01 02:00:00 33.9 1020 28.4
Asheville 2010-01-01 03:00:00 33.4 1020.2 28.3
Asheville 2010-01-01 04:00:00 33.1 1020.2 28
Asheville 2010-01-01 05:00:00 32.7 1020 27.9
Asheville 2010-01-01 06:00:00 32.5 1020.3 27.7
Asheville 2010-01-01 07:00:00 32.3 1020.8 27.6
Asheville 2010-01-01 08:00:00 32.1 1021.3 27.4
Asheville 2010-01-01 09:00:00 33.8 1021.7 28.2
Asheville 2010-01-01 10:00:00 36.4 1022.1 28.9
Asheville 2010-01-01 11:00:00 39.4 1022 29.3
Asheville 2010-01-01 12:00:00 42.1 1021.1 29.2
Asheville 2010-01-01 13:00:00 44.2 1020 29.1
Asheville 2010-01-01 14:00:00 45.6 1019.3 28.9
Asheville 2010-01-01 15:00:00 46.2 1019 28.5
Asheville 2010-01-01 16:00:00 45.8 1019.2 28.5
Asheville 2010-01-01 17:00:00 44.1 1019.6 28.6
Asheville 2010-01-01 18:00:00 41.2 1020.1 28.5
Asheville 2010-01-01 19:00:00 39.6 1020.6 28.8
Asheville 2010-01-01 20:00:00 38.2 1020.9 29
Asheville 2010-01-01 21:00:00 37.2 1021.1 29
Asheville 2010-01-01 22:00:00 36.3 1021 29
Asheville 2010-01-01 23:00:00 35.5 1020.9 29
Brownsville 2010-01-01 00:00:00 35.1 1020.5 28.9
Brownsville 2010-01-01 01:00:00 34.6 1020.2 28.8
Brownsville 2010-01-01 02:00:00 34.1 1020 28.5
Brownsville 2010-01-01 03:00:00 33.7 1020.1 28.4
... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM DWT (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('ville_climatedata')
  OutputTable ('dwt_coef_table')
  MetaTable ('dwt_meta_table')
  InputColumns ('temp_f', 'pressure_mbar', 'dewpoint_f')
  SortColumn ('period')
  PartitionColumns ('city')
  Wavelet ('db2')
  Level (2)
);

Output
Table 113: DWT Example Output Message

messages
Dwt finished successfully!

The query below returns the output shown in the following table:

SELECT * FROM dwt_coef_table ORDER BY city, waveletid, waveletcomponent;

Table 114: DWT Example Output Table dwt_coef_table

city waveletid waveletcomponent temp_f pressure_mbar dewpoint_f


Asheville 1 A2 69.4540094027039 2040.77288421713 57.6536778800559
Asheville 2 A2 69.3002350407891 2040.65034008677 57.5918028045736
Asheville 3 A2 65.9380138856981 2040.05868411008 56.0059210207892
Asheville 4 A2 65.3687921302244 2042.91209425196 55.7742876107938
Asheville 5 A2 84.1422926636037 2041.64689638287 58.2670431589452
Asheville 6 A2 90.1564867623684 2038.66915729816 57.0999120742162
Asheville 7 A2 76.6292062444649 2041.64172802492 57.8025718338419
Asheville 8 A2 71.1744954076401 2041.9750345768 58.0529007176853
Asheville 9 D2 0.166265877365284 0.13357902559369 0.0802319348951208
Asheville 10 D2 -0.0236532579869611 -0.0152743495649474 0.0345987263105307
Asheville 11 D2 -1.08576122940834 -0.392631330035556 -0.529214267460297
Asheville 12 D2 -0.73178478327581 1.58931398112088 0.691473086723903
Asheville 13 D2 3.26344510251065 -1.20694916477225 -0.201106157323093
Asheville 14 D2 -0.748381926835563 0.14607141087663 -0.125329090319905
Asheville 15 D2 -1.40116169247464 0.055816120120312 0.0529007176852296

Asheville 16 D2 -0.841593172362995 0.466498721551716 0.197428166158137
Asheville 17 D1 0.306186217847898 0.183704255459304 0.122473786334524
Asheville 18 D1 0 -0.206134749187356 -0.08365170265035
Asheville 19 D1 0.0224122024304982 0.12248196238869 -0.070710273508908
Asheville 20 D1 -0.0258815095837281 -0.161303239447761 -0.0353547995796646
Asheville 21 D1 -0.917630271917533 0.0482845002277372 -0.470022832702245
Asheville 22 D1 -0.309652827602825 0.241459877385864 0.15782975392192
Asheville 23 D1 0.328598420278396 0.200102937320878 0.064705435625207
Asheville 24 D1 0.476955377862531 -0.244956026246825 0.109533031542659
Asheville 25 D1 0.757261434825846 -0.161273761730968 -0.10006023520487
Asheville 26 D1 -0.472558544124134 -0.0129456913733748 -0.167302977780198
Asheville 27 D1 -0.219066556743464 0.0742190413474191 0.109533705892233
Asheville 28 D1 -0.0612381515206977 0.0388133785289142 0
Asheville 29 D1 -0.489897481353545 -0.0612222930706707 0
Brownsville 1 A2 69.8494278769203 2040.77746121021 57.7241015517973
Brownsville 2 A2 69.7081531839714 2040.64241250221 57.6839970736215
Brownsville 3 A2 66.3771886064812 2040.03445096574 56.3260179778376
Brownsville 4 A2 65.7888893814753 2042.77748885299 56.2960582953676
Brownsville 5 A2 84.8067964112814 2041.43991970379 59.0415953806383
Brownsville 6 A2 90.648563090627 2038.63824306649 57.7977881374318
Brownsville 7 A2 77.1037579029709 2041.37803566704 58.4437490787264
Brownsville 8 A2 71.5761715904181 2041.75457522267 58.5050244475344
Brownsville 9 D2 0.174190741199549 0.125651441035643 0.0448557763120334
Brownsville 10 D2 0.0492226945782548 -0.0881776255051818 -0.00122619573711802
... ... ... ... ... ...

The query below returns the output shown in the following table:

SELECT * FROM dwt_meta_table ORDER BY city;

Table 115: DWT Example Meta Table dwt_meta_table

city meta content


Asheville blocklength 8,8,13
Asheville length 24
Asheville waveletname db2

Asheville lowpassfilter -0.1294095225512604, 0.2241438680420134, 0.836516303737808,
0.4829629131445342
Asheville highpassfilter -0.4829629131445342, 0.836516303737808,
-0.2241438680420134, -0.1294095225512604
Asheville ilowpassfilter 0.4829629131445342, 0.836516303737808, 0.2241438680420134,
-0.1294095225512604
Asheville ihighpassfilter -0.1294095225512604, -0.2241438680420134,
0.836516303737808, -0.4829629131445342
Asheville level 2
Asheville extensionmode sym
Brownsville blocklength 8,8,13
Brownsville length 24
Brownsville waveletname db2
Brownsville lowpassfilter -0.1294095225512604, 0.2241438680420134, 0.836516303737808,
0.4829629131445342
Brownsville highpassfilter -0.4829629131445342, 0.836516303737808,
-0.2241438680420134, -0.1294095225512604
Brownsville ilowpassfilter 0.4829629131445342, 0.836516303737808, 0.2241438680420134,
-0.1294095225512604
Brownsville ihighpassfilter -0.1294095225512604, -0.2241438680420134,
0.836516303737808, -0.4829629131445342
Brownsville level 2
... ... ...

DWT2D

Summary
The DWT2D function implements Mallat’s algorithm (an iterative algorithm in the Discrete Wavelet
Transform field) on 2-dimensional matrices and applies the wavelet transform to multiple sequences
simultaneously.
The input is a set of sequences. Typically, each sequence is a matrix that contains a position in 2-dimensional
space (y and x indexes or coordinates) and its corresponding values. You specify the wavelet name or
wavelet filter table, transform level, and (optionally) extension mode. The function returns the transformed
sequences in Hilbert space with the corresponding component identifiers and indices. (The transformation
is also called the decomposition.)


Note:
The wavelet filter table does not appear in the preceding diagram because it is seldom used.

A typical DWT2D use case is:


1. Apply DWT2D to the original data to generate the approximate coefficients of the matrices and the
corresponding metadata.
2. If desired, filter the coefficients by methods appropriate for the objects (for example, minimum threshold
or top n coefficients); see the sketch after this list.
3. From the approximate or filtered coefficients, reconstruct the matrices and compare them with their
original counterparts.
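
Step 2 of this workflow is ordinary SQL over the coefficient table. The following is a minimal sketch of a
minimum-threshold filter, written against the dwt2d_coeftable produced in the example later in this
section; the output table name dwt2d_coef_filtered and the threshold 1.0 are illustrative assumptions:

-- Zero out coefficients whose absolute value is below the threshold
CREATE TABLE dwt2d_coef_filtered DISTRIBUTE BY hash(state) AS
SELECT state, waveletid, waveletcomponent,
  CASE WHEN abs(temp_f) >= 1.0 THEN temp_f ELSE 0 END AS temp_f,
  CASE WHEN abs(pressure_mbar) >= 1.0 THEN pressure_mbar ELSE 0 END AS pressure_mbar,
  CASE WHEN abs(dewpoint_f) >= 1.0 THEN dewpoint_f ELSE 0 END AS dewpoint_f
FROM dwt2d_coeftable;

The filtered table can then take the place of the original coefficient table in step 3.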

Background
DWT is a time-frequency analysis tool for which the wavelets are discretely sampled. DWT is different from
the Fourier transform, which provides frequency information on the whole time domain. A key advantage of
DWT is that it provides frequency information at different time points.
Mallat’s algorithm for 2-dimensional input can be described as a series of iterative steps:
1. Use the original time domain sequence (a 2-dimensional matrix) as the input of level 1.
2. Convolve each row of the input matrix with high-pass filter h(n) and low-pass filter g(n).
3. Downsample each convolved row by columns (keep every other column), generating two matrices.
4. Convolve each column of each generated matrix with high-pass filter h(n) and low-pass filter g(n).
5. Downsample each convolved column by rows, generating two matrices from each input matrix (four in
total).
The four generated matrices contain the approximation coefficients Ak, horizontal detail coefficients Hk,
vertical detail coefficients Vk, and diagonal detail coefficients Dk, respectively, for level k. The equations
after this list summarize one level in operator form, and the following figure shows the process.
6. If the current level k is the maximum transform level n, then stop; otherwise, use Ak as the input matrix
for the next level (that is, increment k by 1 and go to step 2).
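
In operator notation (a sketch of one common convention; references differ on which subband is labeled
horizontal versus vertical, so treat this as illustrative rather than the exact Aster implementation), let Lrow
and Hrow denote low-pass and high-pass filtering of the rows followed by column downsampling, and Lcol
and Hcol the same operations on the columns followed by row downsampling. Then, with A0 the original
matrix:

Ak = Lcol(Lrow(Ak-1))    Hk = Hcol(Lrow(Ak-1))
Vk = Lcol(Hrow(Ak-1))    Dk = Hcol(Hrow(Ak-1))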

Figure 7: Single-Level Application of DWT2D

Usage

DWT2D Syntax
Version 1.3

SELECT * FROM DWT2D (
  ON (SELECT 1) PARTITION BY 1
  [ Domain ('host:port') ]
  [ Database ('db_name') ]
  [ UserID ('user_id') ]
  [ Password ('password') ]
  [ SSLSettings ('SSLsettings') ]
  [ SSLTrustStorePassword ('SSLtruststorepassword') ]
  InputTable ('input_table')
  OutputTable ('output_table')
  MetaTable ('meta_table')
  InputColumns ({ 'input_column' | 'input_column_range' }[,...])
  [ PartitionColumns
    ({ 'partition_column' | 'partition_column_range' }[,...]) ]
  IndexColumns ('indexy_column', 'indexx_column')
  [ Range ('(starty, startx), (endy, endx)') ]
  { Wavelet ('wavelet') |
    WaveletFilterTable ('wavelet_filter_table') }
  Level (level)
  [ ExtensionMode ('extension_mode') ]
  [ CompactOutput ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the
sequences to be transformed.
OutputTable Required Specifies the name for the table that the function creates to
store the coefficients generated by the wavelet transform.
This table must not exist.
MetaTable Required Specifies the name for the table that the function creates to
store the meta information generated by the wavelet
transform. This table must not exist.
InputColumns Required Specifies the names of the columns in the input table or view
that contain the data to be transformed. These columns
must contain numeric values between -1e308 and 1e308. The
function treats NULL as 0.
PartitionColumns Optional Specifies the names of the partition columns, which identify
the sequences. Rows with the same partition column values
belong to the same sequence. If you specify multiple
partition columns, then the function treats the first one as
the distribute key of the output and meta tables.
By default, all rows belong to one sequence, and the function
generates a distribute key column named
dwt_idrandom_name in both the output table and the meta
table. In both tables, every cell of dwt_idrandom_name has
the value 1.
IndexColumns Required Specifies the columns that contain the indexes of the input
sequences. For a matrix, indexy_column contains the y
coordinates and indexx_column contains the x coordinates.
Range Optional Specifies the start and end indexes of the input data, all of
which must be integers. The default values for each sequence
are:
• starty: minimum y index
• startx: minimum x index
• endy: maximum y index
• endx: maximum x index
The function treats any NULL value as 0.
The range can specify a maximum of 1,000,000 cells.
Wavelet Optional Specifies a wavelet filter name from the table, Supported
Wavelet Filter Names, in the Arguments section of the DWT
function.
WaveletFilterTable Optional Specifies the name of the table that contains the coefficients
of the wave filters.
Level Required Specifies the wavelet transform level. The value level must be
an integer in the range [1, 1000].
ExtensionMode Optional Specifies the method for handling border distortion, an
extension_mode from the table, Supported Extension Modes,
in the Arguments section of the DWT function. The default
value is 'sym'.
CompactOutput Optional Specifies whether to ignore (not output) rows in which all
coefficient values are very small (having an absolute value
less than 1e-12). The default value is 'true'. For a sparse input
matrix, ignoring such rows reduces the output table size.

Input
The DWT2D function has a required input table (or view) and an optional wavelet filter table. If you omit
the wavelet filter table, then you must specify the Wavelet argument.
The input table can contain at most 1594 columns. The function assumes that each sequence can be fitted
into the memory of the worker. The following table describes the input table columns that you can or must
specify with arguments. The input table can have additional columns, but the function ignores them.
The table below shows the schema of the wavelet filter table.
Table 116: DWT2D Input Table Schema

Column Name       Data Type                                                                                 Description
partition_column  INTEGER, SMALLINT, BIGINT, NUMERIC [(p, s)], TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA    Optional. Sequence identifier of the sequence to which the data belongs. Rows with the same partition column values belong to the same sequence. If the input table has multiple partition columns, then the function treats the first one as the distribute key of the output and meta tables. If all the values in a sequence are invalid, then the function ignores the sequence.
indexy            INTEGER                                                                                   Y index of the sequence to which the data belongs. If this column contains a NULL value, then the function ignores the row.
indexx            INTEGER                                                                                   X index of the sequence to which the data belongs. If this column contains a NULL value, then the function ignores the row.
input_column      INTEGER, SMALLINT, BIGINT, BIGSERIAL, DOUBLE PRECISION, NUMERIC [(p, s)], or SERIAL       Data to be transformed—numeric values between -1e308 and 1e308 (or NULL, which the function treats as 0).

Output
The DWT2D function outputs a message that indicates whether the function succeeded, an output table of
transformed (decomposed) sequences, and a meta table of wavelet-related information.
Table 117: DWT2D Output Message

Column Name Data Type Description


messages VARCHAR Message that indicates whether the function was successful.

Table 118: DWT2D Output Table Schema

Column Name       Data Type                                                 Description
partition_column  Inherited from the input table column with the same name  Sequence identifier of the sequence to which the data belongs. Rows with the same partition column values belong to the same sequence. The output table has a partition_column for every partition_column in the input table. If the input table has multiple partition columns, then the first one is the distribute key in both the input and output tables. If you do not specify PartitionColumns, then the output table has as its distribute key a function-generated column named dwt_idrandom_name. Every cell of dwt_idrandom_name has the value 1.
waveletid         INTEGER                                                   Index of each wavelet coefficient (starting from 1 for each sequence).
waveletcomponent  VARCHAR                                                   Component to which the coefficient belongs. Possible values are An, Hn, Vn, Dn, Hn-1, ..., H1, V1, D1, where n is the wavelet transform level.
input_column      DOUBLE PRECISION                                          Coefficient of the corresponding input column after the wavelet transform. The output table has an input_column for every input_column in the input table.

Table 119: DWT2D Meta Table Schema

Column Name       Data Type                                                 Description
partition_column  Inherited from the input table column with the same name  Sequence identifier of the sequence to which the data belongs. Rows with the same partition column values belong to the same sequence. The meta table has a partition_column for every partition_column in the input table. If the input table has multiple partition columns, then the first one is the distribute key in both the input and meta tables. If you do not specify PartitionColumns, then the meta table has as its distribute key a function-generated column named dwt_idrandom_name. Every cell of dwt_idrandom_name has the value 1.
meta              VARCHAR                                                   Meta information name from the following table.
content           VARCHAR                                                   Meta information that corresponds to the meta name (refer to the following table).

The following table summarizes the information that the meta table contains for each sequence.
Table 120: DWT2D Meta Information for Each Sequence

meta content
blocklength Pairs that represent the length of each block of coefficients. The format is (row_number,
column_number). For example: (5, 5), (5, 5), (5, 6)
length Pair that represents the length of the original sequence in each dimension. The format is
(row_number, column_number). For example: (5, 8)
range Minimum and maximum indexes of the original sequence. The format is (min_y_index,
min_x_index), (max_y_index, max_x_index). For example: (1, 1), (5, 8)
lowpassfilter Low-pass filter coefficients used in the decomposition of the wavelet.
highpassfilter High-pass filter coefficients used in the decomposition of the wavelet.
ilowpassfilter Low-pass filter coefficients used in the reconstruction of the wavelet.
ihighpassfilter High-pass filter coefficients used in the reconstruction of the wavelet.
level Level of wavelet transform performed.
extensionmode Extension mode used in the wavelet transform.

Example
This example uses climate data in many cities in the states of California (CA), Texas (TX), and Washington
(WA). The cities are represented by two-dimensional coordinates (latitude and longitude). The data are
temperature (in degrees Fahrenheit), pressure (in Mbars), and dew point (in degrees Fahrenheit). The
function generates a coefficient model table and a meta table, which are used as input to the function
IDWT2D.

Input
Table 121: DWT2D Example Input Table twod_climate_data

state city longitude latitude temp_f pressure_mbar dewpoint_f


CA ALPINE -117 32 34.9 1020.5 28.9
CA ALTURAS -121 41 36.6 1022 29.1
CA ANAHEIM -118 33 33.9 1020 28.4
CA AUBURN -122 38 34.9 1020.6 28.8
CA BAKER -117 35 39.4 1022 29.3
CA BARSTOW -118 34 32.3 1020.8 27.6
CA BRIDGEPORT -120 38 33.9 1020.1 28.4
CA BURNEY -122 40 32.3 1020.8 27.6
CA BUTTONWILLOW -120 35 32.1 1021.3 27.4
CA CALISTOGA -123 38 33.8 1021.7 28.2
CA CALLAHAN -123 41 34.6 1020.2 28.8
CA CECILVILLE -124 41 33.8 1021.7 28.3
CA CLOVERDALE -124 38 32.1 1021.3 27.4
CA COVELO -124 39 33.5 1020.2 28.3
CA GLENNVILLE -119 35 33.8 1021.7 28.2
CA HAIWEE -118 36 33.1 1020.2 28
CA HEMET -117 33 33.4 1020.2 28.3
CA IMPERIAL -116 32 34.4 1020.2 28.7
CA KENTFIELD -123 37 32.7 1020 27.9
CA KLAMATH -125 41 32.1 1021.3 27.5
... ... ... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM DWT2D (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('twod_climate_data')
  OutputTable ('dwt2d_coeftable')
  MetaTable ('dwt2d_metatable')
  InputColumns ('temp_f', 'pressure_mbar', 'dewpoint_f')
  PartitionColumns ('state')
  IndexColumns ('latitude', 'longitude')
  Wavelet ('db2')
  CompactOutput ('true')
  Level (2)
);

Output
Table 122: DWT2D Example Output Message

messages
Dwt2D finished successfully!

This query returns the output shown in the following table:

SELECT * FROM dwt2d_coeftable ORDER BY state, waveletid, waveletcomponent;

Table 123: DWT2D Example Output Table dwt2d_coeftable

state waveletid waveletcomponent temp_f pressure_mbar dewpoint_f


CA 2 A2 0.17966669835468 5.36757777911124 0.148645986178623
CA 3 A2 -2.87944346786742 -80.8379456109261 -2.34340157214
CA 4 A2 85.1709979626302 2535.2091172814 71.2034084720052
CA 5 A2 -0.436988155749943 -14.3013671884212 -0.376569632995601
CA 6 A2 -3.55180132008231 -112.247317693092 -3.0373591242726
CA 7 A2 7.60713133087065 243.037739802494 6.52714999027827
CA 8 A2 77.4836280596588 2309.16778018649 64.6775876908445
CA 9 A2 -2.46718542424233 -72.80570147524 -2.05597029843014
CA 10 A2 4.30723732587148 138.964984443667 3.69857066310935
CA 11 A2 96.5053215632833 2957.68800764973 81.4818568951578
CA 12 A2 25.3705204940491 735.022235657688 20.500350016632
CA 13 A2 99.1650690684055 3095.43307260088 84.2697548740767
CA 14 A2 107.004206852775 3304.81413538727 90.5346298352776
CA 15 A2 68.9704953989846 2042.42993680917 57.0258488992799
CA 18 V2 0.180465820688191 5.49632155551756 0.154427777499678
CA 19 V2 -7.00925368266406 -214.817509066481 -5.94270114993037
CA 20 V2 11.9457843720894 336.444066140318 9.65974360626963
... ... ... ... ... ...

Note:
CompactOutput('true') prevents rows in which all coefficient values are very small from appearing in
dwt2d_coeftable.

The query below returns the output shown in the following table:

SELECT * FROM dwt2d_metatable ORDER BY state;

Table 124: DWT2D Example Meta Table dwt2d_metatable

state meta content


CA blocklength (4, 4),(4, 4),(4, 4),(4, 4),(6, 6),(6, 6),(6, 6)
CA length (10, 10)
CA wavelet db2
CA lowpassfilter -0.1294095225512604, 0.2241438680420134, 0.836516303737808,
0.4829629131445342
CA highpassfilter -0.4829629131445342, 0.836516303737808, -0.2241438680420134,
-0.1294095225512604
CA ilowpassfilter 0.4829629131445342, 0.836516303737808, 0.2241438680420134,
-0.1294095225512604
CA ihighpassfilter -0.1294095225512604, -0.2241438680420134, 0.836516303737808,
-0.4829629131445342
CA level 2
CA extensionmode sym
CA range (32,-125),(41,-116)
TX blocklength (5, 5),(5, 5),(5, 5),(5, 5),(7, 7),(7, 7),(7, 7)
TX length (11, 11)
TX wavelet db2
TX lowpassfilter -0.1294095225512604, 0.2241438680420134, 0.836516303737808,
0.4829629131445342
TX highpassfilter -0.4829629131445342, 0.836516303737808, -0.2241438680420134,
-0.1294095225512604
TX ilowpassfilter 0.4829629131445342, 0.836516303737808, 0.2241438680420134,
-0.1294095225512604
TX ihighpassfilter -0.1294095225512604, -0.2241438680420134, 0.836516303737808,
-0.4829629131445342
TX level 2
TX extensionmode sym
TX range (26,-105),(36,-95)

WA blocklength (3, 4),(3, 4),(3, 4),(3, 4),(3, 5),(3, 5),(3, 5)
WA length (4, 8)
WA wavelet db2
WA lowpassfilter -0.1294095225512604, 0.2241438680420134, 0.836516303737808,
0.4829629131445342
WA highpassfilter -0.4829629131445342, 0.836516303737808, -0.2241438680420134,
-0.1294095225512604
WA ilowpassfilter 0.4829629131445342, 0.836516303737808, 0.2241438680420134,
-0.1294095225512604
WA ihighpassfilter -0.1294095225512604, -0.2241438680420134, 0.836516303737808,
-0.4829629131445342
WA level 2
WA extensionmode sym
WA range (45,-125),(48,-118)

FrequentPaths

Summary
The FrequentPaths function takes a table of sequences and outputs a table of subsequences (patterns) that frequently appear in the input table and, optionally, a table of sequence-pattern pairs.

The function is useful for analyzing customer purchase behavior, web access patterns, disease treatments,
and DNA sequences.

Background
In a sequential pattern mining application, each sequence is an ordered list of item sets, and each item set
contains at least one item. Items within a set are unordered.

In web clickstream analysis, each set has only one item. In purchase behavior analysis, a set might have
multiple items, because a customer might buy more than one item in one shopping session.
In sequential pattern mining, sequence α is a subsequence of sequence β if both of the following are true:
• Each item set ai in α is a subset of an item set bj in β.
• The item sets ai in α appear in the same order as their corresponding item sets bj in β.
More formally: sequence α=〈a1 a2 ... an〉 is a subsequence of sequence β=〈b1 b2 ... bm〉, and β is a super sequence of α, if there exist integers 1 ≤ j1 < j2 < ... < jn ≤ m such that a1⊆bj1, a2⊆bj2, ..., an⊆bjn.
The support of sequence α in a sequence data set SDB is defined as the number of sequences in SDB that contain α (that is, the number of sequences in SDB that are super sequences of α).
Given sequence data set SDB and threshold T, sequence α is called a frequent sequential pattern of SDB if support(α) ≥ T. The problem of sequential pattern mining is to find all frequent sequential patterns, given a sequence data set SDB and a threshold T.
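For example (an illustrative data set, not one of the example tables later in this section): let SDB contain the three sequences s1=〈(a,c) e (f,d)〉, s2=〈a (e,f)〉, and s3=〈b f〉, and let T=2. The sequence α=〈a f〉 is a subsequence of s1 (because a⊆(a,c) and f⊆(f,d)) and of s2 (because a⊆(a) and f⊆(e,f)), but not of s3, so support(α)=2 ≥ T and α is a frequent sequential pattern of SDB.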

Usage

FrequentPaths Syntax
Version 2.1

SELECT * FROM FrequentPaths (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
[ TimeColumn ('time_column') ]
[ PathFilters ([Separator (symbol),] 'filter' [,...]) ]
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ SeqPatternTable ('sequence_pattern_table') ]
{ ItemColumn ('sequence_column') |
ItemDefinition ('item_definition_table:
[ index_column:definition_column:item_column ]') |
PathColumn ('path_column') }
MinSupport ('minimum')
[ MaxLength ('maximum_length') ]
[ MinLength ('minimum_length') ]
[ ClosedPattern ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Note:
For the ItemDefinition argument, the brackets inside the parentheses are required. For example:

ItemDefinition ('id_def_table:[id:def:item]')

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Definition
InputTable Required Specifies the name of the table that contains the input sequences. Each
row is one item in a sequence. If input_table does not include a schema,
the function searches for it in the user’s search path. The function
ignores rows that contain any NULL values.
OutputTable Required Specifies the name of the table where the function outputs the
subsequences.
PartitionColumns Required Specifies the names of the columns that comprise the partition key of the
input sequences.
TimeColumn Optional* Specifies the name of the input table column that determines the order of
items in a sequence. Items in the same sequence that have the same time
stamp belong to the same set.
*Required when ItemColumn or ItemDefinition is specified.
PathFilters Optional Specifies the filters to use on the input table sequences. Only input table
sequences that satisfy all constraints of at least one filter are input to the
function.
Each filter has one or more constraints, which are separated by spaces.
Each constraint has this syntax:

constraint (item [symbol ...])

By default, symbol is comma (,). If you specify symbol, it applies to all filters. The constraint is one of the following:
• STW (start-with_constraint)
The first item set of the sequence must contain at least one of the specified items. For example, STW(c,d) requires the first item set of the sequence to contain c or d. Sequence “(a, c), e, (f, d)” meets this constraint because the first item set, (a, c), contains c.
• EDW (end-with_constraint)
The last item set of the sequence must contain at least one of the specified items. For example, EDW(f,g) requires the last item set of the sequence to contain f or g. Sequence “(a, b), e, (f, d)” meets this constraint because the last item set, (f, d), contains f.
• CTN (containing_constraint)
The sequence must contain at least one of the specified items. For example, CTN(a,b) requires the sequence to contain a or b. The sequence “(a,c), d, (e,f)” meets this constraint but the sequence “d, (e,f)” does not.
Constraints in the same filter must be of different types. For example:
• This is valid:

'STW(c,d) EDW(g,k) CTN(e)'


• This is invalid:

'STW(c,d) STW(e,h)'

For example, this call specifies a separator and uses it in two filters:

PathFilters ('Separator(#)', 'STW(c#d) EDW(g#k) CTN(e)', 'CTN(h#k)')

GroupByColumns Optional Specifies the names of the input table columns by which to group the
input table sequences. If you specify this argument, then the function
operates on each group separately and copies each group_by_column to
the output table.
SeqPatternTable Optional Specifies the name of the table where the function outputs sequence-pattern pairs. For example, if a sequence has a partition value of "1" and contains 3 patterns with IDs 2, 9, and 10, then for that sequence the function outputs the sequence-pattern pairs ("1", 2), ("1", 9), and ("1", 10).
If sequence_pattern_table does not include a schema, the function creates
it in the first schema in the user’s search path.
If the function finds no sequence-pattern pairs, then it does not create
sequence_pattern_table.
ItemColumn Optional* Specifies the name of the input table column that contains the items.
*Required if you specify neither ItemDefinition nor PathColumn.
ItemDefinition Optional* Specifies the name of the item definition table and the names of its index, definition, and item columns. If item_definition_table does not include a schema, the function searches for it in the user’s search path.
*Required if you specify neither ItemColumn nor PathColumn.

PathColumn Optional* Specifies the name of the input table column that contains paths in the
form of sequence strings. A sequence string has this syntax:

'[item [, ...]]'

In the sequence string syntax, the outer brackets are part of the string; you must include them.
The sequence strings in this column can be generated by the nPath
function.
If you specify this argument, then each item set can have only one item.
* Required if you specify neither ItemColumn nor ItemDefinition.
MinSupport Required Determines the threshold for whether a sequential pattern is frequent.
The minimum must be a positive real number.
If minimum is in the range (0,1], then it is a relative threshold: If N is the
total number of input sequences, then the threshold is T=N*minimum.
For example, if there are 1000 sequences in the input table and minimum
is 0.05, then the threshold is 50.
If minimum is in the range (1, +∞), then it is an absolute threshold: Regardless of N, T=minimum. For example, if minimum is 50, then the threshold is 50, regardless of N.
A pattern is frequent if its support value is at least T.
Because the function outputs only frequent patterns, minimum controls the number of output patterns. If minimum is small, processing time increases exponentially; therefore, Teradata recommends starting with a larger value (for example, 5% of the total number of sequences if you know N, and 0.05 otherwise).
If you specify a relative minimum and GroupByColumns, then the
function calculates N and T for each group.
If you specify a relative minimum and PathFilters, then N is the number
of sequences that meet the constraints of the filters.
MaxLength Optional Specifies the maximum length of the output sequential patterns. The
length of a pattern is its number of sets. By default, there is no maximum
length.
MinLength Optional Specifies the minimum length of the output sequential patterns. The
default value is 1.
ClosedPattern Optional Specifies whether to output only closed patterns. A pattern is closed if no super sequence of it has the same support; for example, if 'A;B' and 'A;B;C' both have support 2, then only 'A;B;C' is closed. The default value is 'false'.

Input
The FrequentPaths function requires an input table, which contains the sequence data to process. The input
can be in either of these formats:
• Sequence/path format:

Each row contains a string in the format '[item[, ...]]', where the outer brackets belong to the string (for
example, '[A, B, C, D]'). To output strings in this format, you can use the function nPath with its
Accumulate argument.
• Item format:
Each row represents one item in a sequence. With this format, you must specify either the ItemColumn
or ItemDefinition argument.
If the input table does not have an item column (specified by the ItemColumn argument), then the function
also requires an item definition table (specified by the ItemDefinition argument).
Table 125: FrequentPaths Input Table Schema

Column Name Data Type Description


partition_column INTEGER, SHORT, or LONG Sequence index. Rows with the same index belong to the same sequence.
item_column CHAR, VARCHAR, or TEXT Sequence item. The input table has this column only if you do not specify an item definition table.
time_column Any except DOUBLE PRECISION Optional. Time stamp of sequence item. Items in the same sequence with the same time stamp belong to the same item set.
path_column CHAR, VARCHAR, or TEXT Optional. Paths in the form of sequence strings.

Table 126: FrequentPaths Item Definition Table Schema

Column Name Data Type Description


index_column INTEGER, SHORT, or LONG Predicate index. Determines which predicate applies when more than one predicate in definition_column is satisfied.
definition_column CHAR, VARCHAR, or TEXT Predicate definition.
item_column CHAR, VARCHAR, or TEXT Sequence item for which the predicate is true. The function applies the predicates to the input table in index order. If more than one predicate is true for a row, the function assigns the row the value that corresponds to the predicate with the smallest index. If an input table row has no corresponding definition in the item definition table, then the function skips that row.

Output
The FrequentPaths function outputs an output message and an output table and, optionally, a sequence-pattern table.

Table 127: FrequentPaths Output Message

Column Name Data Type Description


message VARCHAR Message that reports the number of patterns found.

Table 128: FrequentPaths Output Table Schema

Column Name Data Type Description


pattern VARCHAR Patterns (subsequences) in string format; in the output, items are separated by semicolons (for example, 'FAQ;ACCOUNT HISTORY').
support INTEGER Support value of pattern.
length INTEGER Length of pattern.
group_by_column VARCHAR Column copied from the input table. This column appears only if you specify the GroupByColumns argument.

Table 129: FrequentPaths Sequence Pattern Table Schema

Column Name Data Type Description


partition_column Same as in input table Column that is, or is part of, the partition key of the input table sequences.
pattern VARCHAR Pattern found in the sequence specified by partition_column.

Examples
These examples apply the FrequentPaths function to browsing sequences of different users on a banking
website.

Example 1: ItemColumn Argument Specified

Input
The input table contains web clickstream data from a set of users, each with one or more sessions (sequences).
Table 130: FrequentPaths Example 1 Input Table bank_web_clicks1

session_id page datestamp


0 ACCOUNT SUMMARY 2004-03-17 16:35:00
0 FAQ 2004-03-17 16:38:00
0 ACCOUNT HISTORY 2004-03-17 16:42:00
0 FUNDS TRANSFER 2004-03-17 16:45:00
0 ONLINE STATEMENT ENROLLMENT 2004-03-17 16:49:00
0 PROFILE UPDATE 2004-03-17 16:50:00
0 ACCOUNT SUMMARY 2004-03-17 16:51:00

0 CUSTOMER SUPPORT 2004-03-17 16:53:00
0 VIEW DEPOSIT DETAILS 2004-03-17 16:57:00
1 ACCOUNT SUMMARY 2004-03-18 01:16:00
1 ACCOUNT SUMMARY 2004-03-18 01:18:00
1 FAQ 2004-03-18 01:20:00
1 ACCOUNT SUMMARY 2004-03-18 01:21:00
1 FUNDS TRANSFER 2004-03-18 01:24:00
1 ACCOUNT HISTORY 2004-03-18 01:25:00
1 VIEW DEPOSIT DETAILS 2004-03-18 01:27:00
1 ACCOUNT SUMMARY 2004-03-18 01:27:00
1 ACCOUNT HISTORY 2004-03-18 01:28:00
2 ACCOUNT SUMMARY 2004-03-18 09:22:00
2 ACCOUNT SUMMARY 2004-03-18 09:23:00
2 ACCOUNT SUMMARY 2004-03-18 09:25:00
2 ACCOUNT HISTORY 2004-03-18 09:27:00
2 FUNDS TRANSFER 2004-03-18 09:31:00
2 ACCOUNT SUMMARY 2004-03-18 09:31:00
2 FAQ 2004-03-18 09:33:00
2 FAQ 2004-03-18 09:36:00
3 ACCOUNT SUMMARY 2004-03-18 22:41:00
3 ACCOUNT HISTORY 2004-03-18 22:45:00
3 ACCOUNT SUMMARY 2004-03-18 22:47:00
3 ACCOUNT HISTORY 2004-03-18 22:49:00
3 FAQ 2004-03-18 22:50:00
3 ACCOUNT SUMMARY 2004-03-18 22:53:00
3 ACCOUNT SUMMARY 2004-03-18 22:55:00
4 ACCOUNT SUMMARY 2004-03-19 08:33:00
4 FAQ 2004-03-19 08:36:00
4 VIEW DEPOSIT DETAILS 2004-03-19 08:38:00
4 FAQ 2004-03-19 08:41:00
5 ACCOUNT SUMMARY 2004-03-19 10:06:00
5 FUNDS TRANSFER 2004-03-19 10:09:00

5 VIEW DEPOSIT DETAILS 2004-03-19 10:11:00
5 VIEW DEPOSIT DETAILS 2004-03-19 10:13:00
5 ACCOUNT HISTORY 2004-03-19 10:14:00

SQL-MapReduce Call

SELECT * FROM FrequentPaths (


ON (SELECT 1)
PARTITION BY 1
InputTable ('bank_web_clicks1')
OutputTable ('output1')
PartitionColumns ('session_id')
TimeColumn ('datestamp')
ItemColumn ('page')
MinSupport (2)
);

Output

Table 131: FrequentPaths Example 1 Output Message

message
Finished. Totally 69 patterns were found.

This query returns the following table:

SELECT * FROM output1 ORDER BY 2, 3, 1;

Table 132: FrequentPaths Example 1 Output Table

pattern support length


ACCOUNT HISTORY;ACCOUNT HISTORY 2 2
ACCOUNT HISTORY;FAQ 2 2
ACCOUNT HISTORY;FUNDS TRANSFER 2 2
ACCOUNT HISTORY;VIEW DEPOSIT DETAILS 2 2
FAQ;ACCOUNT HISTORY 2 2
FAQ;FAQ 2 2
FAQ;FUNDS TRANSFER 2 2
FUNDS TRANSFER;ACCOUNT HISTORY 2 2
FUNDS TRANSFER;ACCOUNT SUMMARY 2 2
VIEW DEPOSIT DETAILS;ACCOUNT HISTORY 2 2

ACCOUNT HISTORY;ACCOUNT SUMMARY;ACCOUNT HISTORY 2 3
ACCOUNT HISTORY;ACCOUNT SUMMARY;FAQ 2 3
ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT HISTORY 2 3
... ... ...

Example 2: ItemDefinition Argument Specified

Input
The input table, bank_web_url, contains the URL of each page browsed by the customer. The definitions of the browsed pages, which can be specified by the ItemDefinition argument, are in the table ref_url (which follows bank_web_url).
Table 133: FrequentPaths Example 2 Input Table bank_web_url

session_id page_url datestamp


0 www.bank.com/acsum 2004-03-17 16:35:00
0 www.bank.com/faq 2004-03-17 16:38:00
0 www.bank.com/achist 2004-03-17 16:42:00
0 www.bank.com/fundsxfer 2004-03-17 16:45:00
0 www.bank.com/onlinestat 2004-03-17 16:49:00
0 www.bank.com/profile 2004-03-17 16:50:00
0 www.bank.com/acsum 2004-03-17 16:51:00
0 www.bank.com/customer 2004-03-17 16:53:00
0 www.bank.com/deposit 2004-03-17 16:57:00
1 www.bank.com/acsum 2004-03-18 01:16:00
1 www.bank.com/acsum 2004-03-18 01:18:00
1 www.bank.com/faq 2004-03-18 01:20:00
1 www.bank.com/acsum 2004-03-18 01:21:00
1 www.bank.com/fundsxfer 2004-03-18 01:24:00
1 www.bank.com/achist 2004-03-18 01:25:00
1 www.bank.com/deposit 2004-03-18 01:27:00
1 www.bank.com/acsum 2004-03-18 01:27:00
1 www.bank.com/achist 2004-03-18 01:28:00
2 www.bank.com/acsum 2004-03-18 09:22:00

2 www.bank.com/acsum 2004-03-18 09:23:00
2 www.bank.com/acsum 2004-03-18 09:25:00
2 www.bank.com/achist 2004-03-18 09:27:00
2 www.bank.com/fundsxfer 2004-03-18 09:31:00
2 www.bank.com/acsum 2004-03-18 09:31:00
2 www.bank.com/faq 2004-03-18 09:33:00
2 www.bank.com/faq 2004-03-18 09:36:00
3 www.bank.com/acsum 2004-03-18 22:41:00
3 www.bank.com/achist 2004-03-18 22:45:00
3 www.bank.com/acsum 2004-03-18 22:47:00
3 www.bank.com/achist 2004-03-18 22:49:00
3 www.bank.com/faq 2004-03-18 22:50:00
3 www.bank.com/acsum 2004-03-18 22:53:00
3 www.bank.com/acsum 2004-03-18 22:55:00
4 www.bank.com/acsum 2004-03-19 08:33:00
4 www.bank.com/faq 2004-03-19 08:36:00
4 www.bank.com/deposit 2004-03-19 08:38:00
4 www.bank.com/faq 2004-03-19 08:41:00
5 www.bank.com/acsum 2004-03-19 10:06:00
5 www.bank.com/fundsxfer 2004-03-19 10:09:00
5 www.bank.com/deposit 2004-03-19 10:11:00
5 www.bank.com/deposit 2004-03-19 10:13:00
5 www.bank.com/achist 2004-03-19 10:14:00

Table 134: FrequentPaths Example 2 Definition Table ref_url

page_id pagedef page


1 page_url like '%acsum%' ACCOUNT SUMMARY
2 page_url like '%faq%' FAQ
3 page_url like '%achist%' ACCOUNT HISTORY
4 page_url like '%fundsxfer%' FUNDS TRANSFER
5 page_url like '%onlinestat%' ONLINE STATEMENT ENROLLMENT
6 page_url like '%profile%' PROFILE UPDATE
8 page_url like '%customer%' CUSTOMER SUPPORT

9 page_url like '%deposit%' VIEW DEPOSIT DETAILS

SQL-MapReduce Call

SELECT * FROM FrequentPaths (


ON (SELECT 1)
PARTITION BY 1
InputTable ('bank_web_url')
OutputTable ('output2')
PartitionColumns ('session_id')
TimeColumn ('datestamp')
ItemDefinition ('ref_url:[page_id:pagedef:page]')
MinSupport (2)
);

Output

Table 135: FrequentPaths Example 2 Output Message

message
Finished. Totally 69 patterns were found.

This query returns the following table:

SELECT * FROM output2 ORDER BY 2, 3, 1;

Table 136: FrequentPaths Example 2 Output Table

pattern support length


ACCOUNT HISTORY;ACCOUNT HISTORY 2 2
ACCOUNT HISTORY;FAQ 2 2
ACCOUNT HISTORY;FUNDS TRANSFER 2 2
ACCOUNT HISTORY;VIEW DEPOSIT DETAILS 2 2
FAQ;ACCOUNT HISTORY 2 2
FAQ;FAQ 2 2
FAQ;FUNDS TRANSFER 2 2
FUNDS TRANSFER;ACCOUNT HISTORY 2 2
FUNDS TRANSFER;ACCOUNT SUMMARY 2 2
VIEW DEPOSIT DETAILS;ACCOUNT HISTORY 2 2
ACCOUNT HISTORY;ACCOUNT SUMMARY;ACCOUNT HISTORY 2 3
ACCOUNT HISTORY;ACCOUNT SUMMARY;FAQ 2 3

ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT HISTORY 2 3
ACCOUNT SUMMARY;ACCOUNT HISTORY;FAQ 2 3
... ... ...

Example 3: GroupByColumns Argument Specified

Input

Table 137: FrequentPaths Example 3 Input Table bank_web_clicks2

customer_id session_id page datestamp


529 0 ACCOUNT SUMMARY 2004-03-17 16:35:00
529 0 FAQ 2004-03-17 16:38:00
529 0 ACCOUNT HISTORY 2004-03-17 16:42:00
529 0 FUNDS TRANSFER 2004-03-17 16:45:00
529 0 ONLINE STATEMENT ENROLLMENT 2004-03-17 16:49:00
529 0 PROFILE UPDATE 2004-03-17 16:50:00
529 0 ACCOUNT SUMMARY 2004-03-17 16:51:00
529 0 CUSTOMER SUPPORT 2004-03-17 16:53:00
529 0 VIEW DEPOSIT DETAILS 2004-03-17 16:57:00
529 1 ACCOUNT SUMMARY 2004-03-18 01:16:00
529 1 ACCOUNT SUMMARY 2004-03-18 01:18:00
529 1 FAQ 2004-03-18 01:20:00
529 1 ACCOUNT SUMMARY 2004-03-18 01:21:00
529 1 FUNDS TRANSFER 2004-03-18 01:24:00
529 1 ACCOUNT HISTORY 2004-03-18 01:25:00
529 1 VIEW DEPOSIT DETAILS 2004-03-18 01:27:00
529 1 ACCOUNT SUMMARY 2004-03-18 01:27:00
529 1 ACCOUNT HISTORY 2004-03-18 01:28:00
... ... ... ...

SQL-MapReduce Call

SELECT * FROM FrequentPaths (


ON (SELECT 1)

PARTITION BY 1
InputTable ('bank_web_clicks2')
OutputTable ('output3')
PartitionColumns ('session_id')
GroupByColumns ('customer_id')
TimeColumn ('datestamp')
ItemColumn ('page')
MinSupport (2)
);

Output

Table 138: FrequentPaths Example 3 Output Message

message
Finished. Totally 213 patterns were found.

This query returns the contents of the following table (row order can vary):

SELECT * FROM output3 ORDER BY 4, 2, 3;

Table 139: FrequentPaths Example 3 Output Table

pattern support length customer_id


FAQ;ACCOUNT HISTORY 2 2 529
FUNDS TRANSFER;ACCOUNT HISTORY 2 2 529
FAQ;FUNDS TRANSFER 2 2 529
VIEW DEPOSIT DETAILS;ACCOUNT HISTORY 2 2 529
ACCOUNT HISTORY;FAQ 2 2 529
ACCOUNT HISTORY;VIEW DEPOSIT DETAILS 2 2 529
FUNDS TRANSFER;ACCOUNT SUMMARY 2 2 529
FAQ;FAQ 2 2 529
ACCOUNT HISTORY;FUNDS TRANSFER 2 2 529
ACCOUNT HISTORY;ACCOUNT HISTORY 2 2 529
FAQ;FUNDS TRANSFER;ACCOUNT SUMMARY 2 3 529
ACCOUNT SUMMARY;FAQ;FAQ 2 3 529
ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER 2 3 529
ACCOUNT SUMMARY;ACCOUNT HISTORY;FUNDS TRANSFER 2 3 529
FAQ;ACCOUNT SUMMARY;ACCOUNT SUMMARY 2 3 529
ACCOUNT SUMMARY;ACCOUNT HISTORY;VIEW DEPOSIT DETAILS 2 3 529

... ... ... ...

Example 4: SeqPatternTable Argument Specified

Input
This example uses the same input table, FrequentPaths Example 1 Input Table bank_web_clicks1, as was
used in Example 1.

SQL-MapReduce Call

SELECT * FROM FrequentPaths (


ON (SELECT 1)
PARTITION BY 1
InputTable ('bank_web_clicks1')
OutputTable ('output4')
PartitionColumns ('session_id')
TimeColumn ('datestamp')
ItemColumn ('page')
MinSupport (2)
SeqPatternTable ('sp_table')
);

Output

Table 140: FrequentPaths Example 4 Output Message

message
Finished. Totally 69 patterns were found.

This query returns the contents of the following table (row order can vary):

SELECT * FROM sp_table ORDER BY 1;

Table 141: FrequentPaths Example 4 Output Table

session_id pattern
0 ACCOUNT SUMMARY;FAQ
0 ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FAQ;ACCOUNT HISTORY
0 ACCOUNT SUMMARY;ACCOUNT HISTORY
0 ACCOUNT SUMMARY;FAQ;ACCOUNT HISTORY;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;ACCOUNT HISTORY;FUNDS TRANSFER

0 ACCOUNT SUMMARY;FAQ;ACCOUNT HISTORY;VIEW DEPOSIT DETAILS
0 ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER
0 ACCOUNT SUMMARY;ACCOUNT HISTORY;VIEW DEPOSIT DETAILS
0 ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FUNDS TRANSFER
0 ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER;VIEW DEPOSIT DETAILS
0 ACCOUNT SUMMARY;FUNDS TRANSFER;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FAQ;ACCOUNT SUMMARY
0 ACCOUNT SUMMARY;FUNDS TRANSFER;VIEW DEPOSIT DETAILS
0 ACCOUNT SUMMARY;FAQ;ACCOUNT SUMMARY;VIEW DEPOSIT DETAILS
... ...

Example 5: PathFilters Argument Specified

Input
FrequentPaths Example 1 Input Table bank_web_clicks1 (see Input).

SQL-MapReduce Call

SELECT * FROM FrequentPaths(


ON (SELECT 1)
PARTITION BY 1
InputTable ('bank_web_clicks1')
OutputTable ('output5')
PartitionColumns ('session_id')
TimeColumn ('datestamp')
ItemColumn('page')
PathFilters ('STW(ACCOUNT SUMMARY) EDW(ACCOUNT HISTORY)')
MinSupport (2)
);

Output

Table 142: FrequentPaths Example 5 Output Message

message
Finished. Totally 15 patterns were found.

This query returns the following table:

SELECT * FROM output5 ORDER BY 2, 3, 1;

Table 143: FrequentPaths Example 5 Output Table

pattern support length


ACCOUNT HISTORY 2 1
ACCOUNT SUMMARY 2 1
FUNDS TRANSFER 2 1
VIEW DEPOSIT DETAILS 2 1
ACCOUNT SUMMARY;ACCOUNT HISTORY 2 2
ACCOUNT SUMMARY;FUNDS TRANSFER 2 2
ACCOUNT SUMMARY;VIEW DEPOSIT DETAILS 2 2
FUNDS TRANSFER;ACCOUNT HISTORY 2 2
FUNDS TRANSFER;VIEW DEPOSIT DETAILS 2 2
VIEW DEPOSIT DETAILS;ACCOUNT HISTORY 2 2
ACCOUNT SUMMARY;FUNDS TRANSFER;ACCOUNT HISTORY 2 3
ACCOUNT SUMMARY;FUNDS TRANSFER;VIEW DEPOSIT DETAILS 2 3
ACCOUNT SUMMARY;VIEW DEPOSIT DETAILS;ACCOUNT HISTORY 2 3
FUNDS TRANSFER;VIEW DEPOSIT DETAILS;ACCOUNT HISTORY 2 3
ACCOUNT SUMMARY;FUNDS TRANSFER;VIEW DEPOSIT DETAILS;ACCOUNT HISTORY 2 4

Example 6: Output Only Closed Patterns

Input
FrequentPaths Example 1 Input Table bank_web_clicks1 (see Input).

SQL-MapReduce Call

SELECT * FROM FrequentPaths (


ON (SELECT 1)
PARTITION BY 1
InputTable ('bank_web_clicks1')
OutputTable ('output6')
PartitionColumns ('session_id')
TimeColumn ('datestamp')
ItemColumn ('page')
MinSupport (2)

ClosedPattern ('true')
);

Output

Table 144: FrequentPaths Example 6 Output Message

message
Finished. Totally 26 patterns were found.

This query returns the following table:

SELECT * FROM output6 ORDER BY 2, 3, 1;

Table 145: FrequentPaths Example 6 Output Table

pattern support length


ACCOUNT SUMMARY;ACCOUNT HISTORY;FUNDS TRANSFER 2 3
ACCOUNT SUMMARY;FAQ;FAQ 2 3
ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT SUMMARY;ACCOUNT HISTORY 2 4
ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT SUMMARY;FAQ 2 4
ACCOUNT SUMMARY;ACCOUNT SUMMARY;ACCOUNT HISTORY;FAQ 2 4
ACCOUNT SUMMARY;ACCOUNT SUMMARY;ACCOUNT SUMMARY;FUNDS TRANSFER 2 4
ACCOUNT SUMMARY;FAQ;ACCOUNT HISTORY;ACCOUNT SUMMARY 2 4
ACCOUNT SUMMARY;FAQ;ACCOUNT HISTORY;VIEW DEPOSIT DETAILS 2 4
ACCOUNT SUMMARY;FAQ;ACCOUNT SUMMARY;VIEW DEPOSIT DETAILS 2 4
ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER;ACCOUNT SUMMARY 2 4
ACCOUNT SUMMARY;FAQ;FUNDS TRANSFER;VIEW DEPOSIT DETAILS 2 4
ACCOUNT SUMMARY;FUNDS TRANSFER;VIEW DEPOSIT DETAILS;ACCOUNT HISTORY 2 4
ACCOUNT SUMMARY;ACCOUNT SUMMARY;ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT SUMMARY 2 5
ACCOUNT SUMMARY;ACCOUNT SUMMARY;FAQ;ACCOUNT SUMMARY;ACCOUNT SUMMARY 2 5
ACCOUNT SUMMARY;ACCOUNT SUMMARY;FAQ 3 3
ACCOUNT SUMMARY;FAQ;ACCOUNT SUMMARY 3 3
ACCOUNT SUMMARY;FAQ;VIEW DEPOSIT DETAILS 3 3
ACCOUNT SUMMARY;FUNDS TRANSFER;VIEW DEPOSIT DETAILS 3 3

ACCOUNT SUMMARY;ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT SUMMARY 3 4
ACCOUNT SUMMARY;ACCOUNT SUMMARY;ACCOUNT SUMMARY;ACCOUNT SUMMARY 3 4
ACCOUNT SUMMARY;FUNDS TRANSFER 4 2
ACCOUNT SUMMARY;VIEW DEPOSIT DETAILS 4 2
ACCOUNT SUMMARY;ACCOUNT HISTORY;ACCOUNT SUMMARY 4 3
ACCOUNT SUMMARY;ACCOUNT HISTORY 5 2
ACCOUNT SUMMARY;FAQ 5 2
ACCOUNT SUMMARY 6 1

Example 7: Using nPath and FrequentPaths to Select Sequences

Input
The following table is the input table for the nPath function, which the example uses to create the input table
for the FrequentPaths function.
Table 146: FrequentPaths Example 7 nPath Input Table sequence_table

id datestamp item
1 2004-03-17 16:35:00 A
1 2004-03-17 16:38:00 B
1 2004-03-17 16:42:00 C
2 2004-03-18 01:16:00 B
2 2004-03-18 01:18:00 C
2 2004-03-18 01:20:00 D
3 2004-03-19 08:33:00 A
3 2004-03-19 08:36:00 D
3 2004-03-19 08:38:00 C

Create the FrequentPaths Input Table


The nPath function populates the FrequentPaths input table with sequences that start with “A” and end with
“C”, using the Accumulate argument to output the full sequence.

CREATE VIEW nPath_output AS (


SELECT * FROM npath (
ON sequence_table PARTITION BY id ORDER BY datestamp
Pattern ('itemA.itemAny*.itemC')

Symbols (item='A' AS itemA, item='C' AS itemC, TRUE AS itemAny)
Result (FIRST(id OF itemA) AS id,
Accumulate (item OF ANY(itemA, itemAny, itemC)) AS path)
Mode (NONOVERLAPPING)
)
);

This query returns the following table:

SELECT * FROM nPath_output ORDER BY id;

Table 147: FrequentPaths Example 7 nPath Output Table

id path
1 [A, B, C]
3 [A, D, C]

SQL-MapReduce Call
The FrequentPaths function finds the frequent patterns among the sequences that start with “A” and end with “C”.

SELECT * FROM frequentPaths (


ON (select 1) PARTITION BY 1
InputTable ('nPath_output')
OutputTable ('output7')
PartitionColumns ('id')
PathColumn ('path')
MinSupport ('2')
);

Output

Table 148: FrequentPaths Example 7 Output Message

message
Finished. Totally 3 patterns were found.

This query returns the following table:

SELECT * FROM output7 ORDER BY length, pattern;

Table 149: FrequentPaths Example 7 Output Table

pattern support length


A 2 1
C 2 1
A;C 2 2


IDWT

Summary
The IDWT function is the inverse of DWT; that is, IDWT applies inverse wavelet transforms on multiple
sequences simultaneously. IDWT takes as input the output table and meta table generated by DWT and
outputs the sequences in the time domain. (Because the IDWT output is comparable to the DWT input, the
inverse transformation is also called the reconstruction.)

A typical IDWT use case is:


1. Apply DWT to sequences to generate their coefficients and corresponding metadata.
2. Filter the coefficients by methods appropriate for the objects (for example, minimum threshold or top-n coefficients), as sketched after this list.
3. Apply IDWT to the filtered coefficients to reconstruct the sequences.
4. Compare the reconstructed sequences to their original counterparts.
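
The following is a minimal sketch of step 2, assuming the dwt_coef_table produced by the DWT example in this chapter; the view name dwt_coef_filtered and the threshold 1.0 are illustrative assumptions:

CREATE VIEW dwt_coef_filtered AS (
  -- Zero out coefficients whose absolute value is below the threshold;
  -- IDWT then reconstructs a smoothed approximation of the original sequences.
  SELECT city, waveletid, waveletcomponent,
    CASE WHEN abs(temp_f) >= 1.0 THEN temp_f ELSE 0 END AS temp_f,
    CASE WHEN abs(pressure_mbar) >= 1.0 THEN pressure_mbar ELSE 0 END AS pressure_mbar,
    CASE WHEN abs(dewpoint_f) >= 1.0 THEN dewpoint_f ELSE 0 END AS dewpoint_f
  FROM dwt_coef_table
);

In step 3, specify dwt_coef_filtered in the InputTable argument of IDWT in place of dwt_coef_table.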

Usage

IDWT Syntax
Version 1.3

SELECT * FROM IDWT (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
MetaTable ('meta_table')
OutputTable ('output_table')
InputColumns ({ 'input_column' | 'input_column_range' }[,...])
SortColumn ('sort_column')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the input table or view that contains the coefficients
generated by DWT. Typically, this table is the output table of DWT.
MetaTable Required Specifies the name of the input table or view that contains the meta
information used in DWT. Typically, this table is the meta table output by
DWT.
OutputTable Required Specifies the name for the table that the function creates to store the
reconstructed result. This table must not exist.
InputColumns Required Specifies the names of the columns in the input table or view that contain
the data to be transformed. These columns must contain numeric values
between -1e308 and 1e308. The function treats NULL as 0.
SortColumn Required Specifies the name of the input column that represents the order of
coefficients in each sequence (the waveletid column in the DWT output
table). The column must contain a sequence of integer values that start
from 1 for each sequence. If a value is missing from the sequence, then the
function treats the corresponding data column as 0.
PartitionColumns Optional Specifies the names of the partition columns, which identify the sequences.
Rows with the same partition column values belong to the same sequence.
If you specify multiple partition columns, then the function treats the first
one as the distribute key of the output and meta tables.
By default, all rows belong to one sequence, and the function generates a
distribute key column named dwt_idrandom_name in both the output
table and the meta table. In both tables, every cell of dwt_idrandom_name
has the value 1.

Input
The IDWT function requires a data table and a meta table. The data table has the same schema as DWT
Output Table Schema, and the meta table has the same schema as DWT Meta Table Schema.

Output
The IDWT function outputs a message that indicates whether the function succeeded and an output table of
transformed (reconstructed) sequences.

Table 150: IDWT Output Message

Column Name Data Type Description


messages VARCHAR Message that indicates whether the function was successful.

Table 151: IDWT Output Table Schema

Column Name Data Type Description


partition_column Inherited from input table column with same name Sequence identifier of the sequence to which the data belongs. Rows with the same partition column values belong to the same sequence. The output table has a partition_column for every partition_column in the data table. If the data table has multiple partition columns, then the first one is the distribute key in the output table. If the data table has only one partition_column, then the output table has as its distribute key a function-generated column named dwt_idrandom_name. Every cell of dwt_idrandom_name has the value 1.
indexid INTEGER Index of each value within the reconstructed sequence (starting from 1 for each sequence).
input_column INTEGER, SMALLINT, BIGINT, BIGSERIAL, DOUBLE PRECISION, NUMERIC [(p, s)], or SERIAL Reconstructed sequence. The output table has an input_column for every input_column in the input table.

Example
This example uses hourly climate data for five cities (Asheville, Greenville, Brownsville, Nashville, and Knoxville) on a given day. The data are temperature (in degrees Fahrenheit), pressure (in Mbars), and dew point (in degrees Fahrenheit).

Input
The input tables for this example are the output tables from the DWT function example:
• DWT Example Output Table dwt_coef_table
• DWT Example Meta Table dwt_meta_table
This example reconstructs the input to the DWT function example.
Table 152: IDWT Example Input Table dwt_coef_table

city waveletid waveletcomponent temp_f pressure_mbar dewpoint_f


Asheville 1 A2 69.4540094027039 2040.77288421713 57.6536778800559
Asheville 2 A2 69.3002350407891 2040.65034008677 57.5918028045736

Asheville 3 A2 65.9380138856981 2040.05868411008 56.0059210207892
Asheville 4 A2 65.3687921302244 2042.91209425196 55.7742876107938
Asheville 5 A2 84.1422926636037 2041.64689638287 58.2670431589452
Asheville 6 A2 90.1564867623684 2038.66915729816 57.0999120742162
Asheville 7 A2 76.6292062444649 2041.64172802492 57.8025718338419
Asheville 8 A2 71.1744954076401 2041.9750345768 58.0529007176853
Asheville 9 D2 0.166265877365284 0.13357902559369 0.0802319348951208
Asheville 10 D2 -0.0236532579869611 -0.0152743495649474 0.0345987263105307
Asheville 11 D2 -1.08576122940834 -0.392631330035556 -0.529214267460297
Asheville 12 D2 -0.73178478327581 1.58931398112088 0.691473086723903
Asheville 13 D2 3.26344510251065 -1.20694916477225 -0.201106157323093
Asheville 14 D2 -0.748381926835563 0.14607141087663 -0.125329090319905
Asheville 15 D2 -1.40116169247464 0.055816120120312 0.0529007176852296
Asheville 16 D2 -0.841593172362995 0.466498721551716 0.197428166158137
Asheville 17 D1 0.306186217847898 0.183704255459304 0.122473786334524
Asheville 18 D1 0 -0.206134749187356 -0.08365170265035
Asheville 19 D1 0.0224122024304982 0.12248196238869 -0.070710273508908
Asheville 20 D1 -0.0258815095837281 -0.161303239447761 -0.0353547995796646
Asheville 21 D1 -0.917630271917533 0.0482845002277372 -0.470022832702245
Asheville 22 D1 -0.309652827602825 0.241459877385864 0.15782975392192
Asheville 23 D1 0.328598420278396 0.200102937320878 0.064705435625207
Asheville 24 D1 0.476955377862531 -0.244956026246825 0.109533031542659
Asheville 25 D1 0.757261434825846 -0.161273761730968 -0.10006023520487
Asheville 26 D1 -0.472558544124134 -0.0129456913733748 -0.167302977780198
Asheville 27 D1 -0.219066556743464 0.0742190413474191 0.109533705892233
Asheville 28 D1 -0.0612381515206977 0.0388133785289142 0
Asheville 29 D1 -0.489897481353545 -0.0612222930706707 0
Brownsville 1 A2 69.8494278769203 2040.77746121021 57.7241015517973
Brownsville 2 A2 69.7081531839714 2040.64241250221 57.6839970736215
Brownsville 3 A2 66.3771886064812 2040.03445096574 56.3260179778376
Brownsville 4 A2 65.7888893814753 2042.77748885299 56.2960582953676
Brownsville 5 A2 84.8067964112814 2041.43991970379 59.0415953806383
Brownsville 6 A2 90.648563090627 2038.63824306649 57.7977881374318
Brownsville 7 A2 77.1037579029709 2041.37803566704 58.4437490787264

Brownsville 8 A2 71.5761715904181 2041.75457522267 58.5050244475344
Brownsville 9 D2 0.174190741199549 0.125651441035643 0.0448557763120334
Brownsville 10 D2 0.0492226945782548 -0.0881776255051818 -0.00122619573711802
... ... ... ... ... ...

Table 153: IDWT Example Input Table dwt_meta_table

city meta content


Asheville blocklength 8, 8, 13
Asheville length 24
Asheville waveletname db2
Asheville lowpassfilter -0.1294095225512604,
0.2241438680420134,
0.836516303737808,
0.4829629131445342
Asheville highpassfilter -0.4829629131445342,
0.836516303737808,
-0.2241438680420134,
-0.1294095225512604
Asheville ilowpassfilter 0.4829629131445342,
0.836516303737808,
0.2241438680420134,
-0.1294095225512604
Asheville ihighpassfilter -0.1294095225512604,
-0.2241438680420134,
0.836516303737808,
-0.4829629131445342
Asheville level 2
Asheville extensionmode sym
Brownsville blocklength 8, 8, 13
Brownsville length 24
Brownsville waveletname db2
Brownsville lowpassfilter -0.1294095225512604,
0.2241438680420134,
0.836516303737808,
0.4829629131445342
Brownsville highpassfilter -0.4829629131445342,
0.836516303737808,
-0.2241438680420134,
-0.1294095225512604

Brownsville ilowpassfilter 0.4829629131445342,
0.836516303737808,
0.2241438680420134,
-0.1294095225512604
Brownsville ihighpassfilter -0.1294095225512604,
-0.2241438680420134,
0.836516303737808,
-0.4829629131445342
Brownsville level 2
... ... ...

SQL-MapReduce Call

DROP TABLE IF EXISTS climate_reconstruct;


SELECT * FROM IDWT (
ON (SELECT 1) PARTITION BY 1
InputTable ('dwt_coef_table')
MetaTable ('dwt_meta_table')
OutputTable ('climate_reconstruct')
InputColumns ('temp_f', 'pressure_mbar', 'dewpoint_f')
SortColumn ('waveletid')
PartitionColumns ('city')
);

Output
Table 154: IDWT Example Output Message

messages
IDwt finished successfully!

The query below returns the output shown in the following table:

SELECT * FROM climate_reconstruct ORDER BY city, indexid;

The output table is the same as the input table to the DWT function (Input). The original values for temperature, pressure, and dew point are reconstructed.
Table 155: IDWT Example Output Table climate_reconstruct

city indexid temp_f pressure_mbar dewpoint_f


Asheville 1 34.9000015258789 1020.5 28.8999996185303
Asheville 2 34.4000015258789 1020.20001220703 28.7000007629395
Asheville 3 33.9000015258789 1020 28.3999996185303

Asheville 4 33.4000015258789 1020.20001220703 28.2999992370606
Asheville 5 33.0999984741211 1020.20001220703 28
Asheville 6 32.7000007629395 1020 27.8999996185303
Asheville 7 32.5 1020.29998779297 27.7000007629395
Asheville 8 32.2999992370606 1020.79998779297 27.6000003814697
Asheville 9 32.0999984741211 1021.29998779297 27.3999996185303
Asheville 10 33.7999992370606 1021.70001220703 28.2000007629395
Asheville 11 36.4000015258789 1022.09997558594 28.8999996185303
Asheville 12 39.4000015258789 1022 29.2999992370606
Asheville 13 42.0999984741211 1021.09997558594 29.2000007629395
Asheville 14 44.2000007629395 1020 29.1000003814698
Asheville 15 45.5999984741211 1019.29998779297 28.8999996185303
Asheville 16 46.2000007629395 1019 28.5
Asheville 17 45.7999992370606 1019.20001220703 28.5
Asheville 18 44.0999984741211 1019.59997558594 28.6000003814698
Asheville 19 41.2000007629395 1020.09997558594 28.5
Asheville 20 39.5999984741211 1020.59997558594 28.7999992370606
Asheville 21 38.2000007629395 1020.90002441406 29
Asheville 22 37.2000007629395 1021.09997558594 29
Asheville 23 36.2999992370606 1021 29
Asheville 24 35.5 1020.90002441406 29
Brownsville 1 35.0999984741211 1020.5 28.8999996185303
Brownsville 2 34.5999984741211 1020.20001220703 28.7999992370606
Brownsville 3 34.0999984741211 1020 28.5
Brownsville 4 33.7000007629395 1020.09997558594 28.3999996185303
Brownsville 5 33.2999992370606 1020.20001220703 28.2000007629395
... ... ... ... ...


IDWT2D

Summary
The IDWT2D function is the inverse of DWT2D; that is, IDWT2D applies inverse wavelet transforms on
multiple sequences simultaneously. IDWT2D takes as input the output table and meta table generated by
DWT2D and outputs the sequences as 2-dimensional matrices. (Because the IDWT2D output is comparable
to the DWT2D input, the inverse transformation is also called the reconstruction.)

A typical IDWT2D use case is:


1. Apply DWT2D to 2-dimensional sequences to generate the coefficients of the matrices and
corresponding metadata.
2. Filter the coefficients by methods appropriate for the objects (for example, minimum threshold or top-n coefficients), compressing the original matrices, as sketched after this list.
3. Apply IDWT2D to the filtered coefficients to reconstruct the matrices.
4. Compare the reconstructed matrices to their original counterparts.
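
The following is a minimal sketch of step 2 using a top-n filter, assuming the dwt2d_coeftable produced by the DWT2D example in this chapter and SQL window-function support; the view name dwt2d_coef_filtered and the value n=10 are illustrative assumptions:

CREATE VIEW dwt2d_coef_filtered AS (
  -- Keep, for each state, only the 10 coefficient rows with the largest
  -- absolute temp_f value. Dropping the other rows is equivalent to zeroing
  -- them, because IDWT2D treats waveletid values missing from a sequence as 0.
  SELECT state, waveletid, waveletcomponent, temp_f, pressure_mbar, dewpoint_f
  FROM (
    SELECT state, waveletid, waveletcomponent, temp_f, pressure_mbar, dewpoint_f,
      ROW_NUMBER() OVER (PARTITION BY state ORDER BY abs(temp_f) DESC) AS rn
    FROM dwt2d_coeftable
  ) ranked
  WHERE rn <= 10
);

In step 3, specify dwt2d_coef_filtered in the InputTable argument of IDWT2D in place of dwt2d_coeftable.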

Usage

IDWT2D Syntax
Version 1.3

SELECT * FROM IDWT2D (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
MetaTable ('meta_table')
OutputTable ('output_table')
InputColumns ({ 'column_name' | 'column_range' }[,...])
SortColumn ('sort_column')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]

[ VerboseFlag ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the input table or view that contains the
coefficients generated by DWT2D. Typically, this table is the output
table of DWT2D.
MetaTable Required Specifies the name of the input table or view that contains the meta
information used in DWT2D. Typically, this table is the meta table
output by DWT2D.
OutputTable Required Specifies the name for the table that the function creates to store the
reconstructed result. This table must not exist.
InputColumns Required Specifies the names of the columns in the input table or view that
contain the data to be transformed. These columns must contain
numeric values between -1e308 and 1e308. The function treats NULL
as 0.
SortColumn Required Specifies the name of the input column that represents the order of
coefficients in each sequence (the waveletid column in the DWT2D
output table). The column must contain a sequence of integer values
that start from 1 for each sequence. If a value is missing from the
sequence, then the function treats the corresponding data column as 0.
PartitionColumns Optional Specifies the names of the partition columns, which identify the
sequences. Rows with the same partition column values belong to the
same sequence. If you specify multiple partition columns, then the
function treats the first one as the distribute key of the output and
meta tables.
By default, all rows belong to one sequence, and the function generates
a distribute key column named dwt_idrandom_name in both the
output table and the meta table. In both tables, every cell of
dwt_idrandom_name has the value 1.
VerboseFlag Optional Specifies whether to ignore (not output) rows in which all coefficient
values are very small (having an absolute value less than 1e-12). The
default value is 'true'. For a sparse input matrix, ignoring such rows
reduces the output table size.

Input
The IDWT2D function requires a data table and a meta table. The data table has the same schema as the DWT2D Output Table Schema, and the meta table has the same schema as the DWT2D Meta Table Schema.

Output
The IDWT2D function outputs a message that indicates whether the function succeeded and an output table
of transformed (reconstructed) matrices.
Table 156: IDWT2D Output Message

Column Name Data Type Description


messages VARCHAR Message that indicates whether the function was successful.

Table 157: IDWT2D Output Table Schema

Column Name Data Type Description


partition_column Inherited from input table column with same name Sequence identifier of the sequence to which the data belongs. Rows with the same partition column values belong to the same sequence. The output table has a partition_column for every partition_column in the data table. If the data table has multiple partition columns, then the first one is the distribute key in the output table. If the data table has only one partition_column, then the output table has as its distribute key a function-generated column named dwt_idrandom_name. Every cell of dwt_idrandom_name has the value 1.
indexy INTEGER Y index of the reconstructed matrix.
indexx INTEGER X index of the reconstructed matrix.
input_column INTEGER, SMALLINT, BIGINT, BIGSERIAL, DOUBLE PRECISION, NUMERIC [(p, s)], or SERIAL Reconstructed matrix. The output table has an input_column for every input_column in the data table.

Example
This example uses climate data for many cities in the states of California (CA), Texas (TX), and Washington (WA). The cities are represented by two-dimensional coordinates (latitude and longitude). The data are temperature (in degrees Fahrenheit), pressure (in Mbars), and dew point (in degrees Fahrenheit).

Input
The input tables for this example are the output tables from the DWT2D function example:
• DWT2D Example Output Table dwt2d_coeftable
• DWT2D Example Meta Table dwt2d_metatable
This example reconstructs the input to the DWT2D function example.
Table 158: IDWT2D Example Input Table dwt2d_coeftable

state waveletid waveletcomponent temp_f pressure_mbar dewpoint_f


CA 2 A2 0.17966669835468 5.36757777911124 0.148645986178623
CA 3 A2 -2.87944346786742 -80.8379456109261 -2.34340157214
CA 4 A2 85.1709979626302 2535.2091172814 71.2034084720052
CA 5 A2 -0.436988155749943 -14.3013671884212 -0.376569632995601
CA 6 A2 -3.55180132008231 -112.247317693092 -3.0373591242726
CA 7 A2 7.60713133087065 243.037739802494 6.52714999027827
CA 8 A2 77.4836280596588 2309.16778018649 64.6775876908445
CA 9 A2 -2.46718542424233 -72.80570147524 -2.05597029843014
CA 10 A2 4.30723732587148 138.964984443667 3.69857066310935
CA 11 A2 96.5053215632833 2957.68800764973 81.4818568951578
CA 12 A2 25.3705204940491 735.022235657688 20.500350016632
CA 13 A2 99.1650690684055 3095.43307260088 84.2697548740767
CA 14 A2 107.004206852775 3304.81413538727 90.5346298352776
CA 15 A2 68.9704953989846 2042.42993680917 57.0258488992799
CA 18 V2 0.180465820688191 5.49632155551756 0.154427777499678
CA 19 V2 -7.00925368266406 -214.817509066481 -5.94270114993037
CA 20 V2 11.9457843720894 336.444066140318 9.65974360626963
... ... ... ... ... ...

Table 159: IDWT2D Example Input Table dwt2d_metatable

state meta content


CA blocklength (4, 4), (4, 4), (4, 4), (4, 4), (6, 6), (6, 6), (6, 6)
CA length (10, 10)
CA waveletname db2
CA lowpassfilter -0.1294095225512604, 0.2241438680420134, 0.836516303737808,
0.4829629131445342
CA highpassfilter -0.4829629131445342, 0.836516303737808, -0.2241438680420134,
-0.1294095225512604
CA ilowpassfilter 0.4829629131445342, 0.836516303737808, 0.2241438680420134,
-0.1294095225512604
CA ihighpassfilter -0.1294095225512604, -0.2241438680420134, 0.836516303737808,
-0.4829629131445342

CA level 2
CA extensionmode sym
CA range (32, -125), (41, -116)
TX blocklength (5, 5), (5, 5), (5, 5), (5, 5), (7, 7), (7, 7), (7, 7)
TX length (11, 11)
TX waveletname db2
TX lowpassfilter -0.1294095225512604, 0.2241438680420134, 0.836516303737808,
0.4829629131445342
TX highpassfilter -0.4829629131445342, 0.836516303737808, -0.2241438680420134,
-0.1294095225512604
TX ilowpassfilter 0.4829629131445342, 0.836516303737808, 0.2241438680420134,
-0.1294095225512604
TX ihighpassfilter -0.1294095225512604, -0.2241438680420134, 0.836516303737808,
-0.4829629131445342
TX level 2
TX extensionmode sym
TX range (26, -105), (36, -95)
WA blocklength (3, 4), (3, 4), (3, 4), (3, 4), (3, 5), (3, 5), (3, 5)
WA length (4, 8)
WA waveletname db2
WA lowpassfilter -0.1294095225512604, 0.2241438680420134, 0.836516303737808,
0.4829629131445342
WA highpassfilter -0.4829629131445342, 0.836516303737808, -0.2241438680420134,
-0.1294095225512604
WA ilowpassfilter 0.4829629131445342, 0.836516303737808, 0.2241438680420134,
-0.1294095225512604
WA ihighpassfilter -0.1294095225512604, -0.2241438680420134, 0.836516303737808,
-0.4829629131445342
WA level 2
WA extensionmode sym
WA range (45, -125), (48, -118)
... ... ...

SQL-MapReduce Call

SELECT * FROM IDWT2D (


ON (SELECT 1) PARTITION BY 1
InputTable ('dwt2d_coeftable')
MetaTable ('dwt2d_metatable')
OutputTable ('climate2d_reconstruct')
InputColumns ('temp_f', 'pressure_mbar', 'dewpoint_f')
SortColumn ('waveletid')
PartitionColumns ('state')
);

Output
Table 160: IDWT2D Example Output Message

messages
IDwt2D finished successfully!

The query below returns the output shown in the following table:

SELECT * FROM climate2d_reconstruct ORDER BY state;

Table 161: IDWT2D Example Output Table

state indexy indexx temp_f pressure_mbar dewpoint_f


CA 32 -117 34.9000015258789 1020.5 28.8999996185303
CA 32 -116 34.4000015258789 1020.20001220703 28.7000007629395
CA 33 -118 33.9000015258789 1020 28.3999996185303
CA 33 -117 33.4000015258789 1020.20001220703 28.2999992370606
CA 34 -121 33.0999984741211 1020.20001220703 28
CA 34 -120 32.7000007629395 1020 27.8999996185303
CA 34 -119 32.5 1020.29998779297 27.7000007629395
CA 34 -118 32.2999992370606 1020.79998779297 27.6000003814698
CA 35 -120 32.0999984741211 1021.29998779297 27.3999996185303
CA 35 -119 33.7999992370606 1021.70001220703 28.2000007629395
CA 35 -118 36.4000015258789 1022.09997558594 28.8999996185303
CA 35 -117 39.4000015258789 1022 29.2999992370606
CA 36 -122 34.9000015258789 1020.5 28.8999996185303
CA 36 -121 34.4000015258789 1020.20001220703 28.7000007629395
CA 36 -120 33.9000015258789 1020 28.3999996185303

CA 36 -119 33.4000015258789 1020.20001220703 28.2999992370606
CA 36 -118 33.0999984741211 1020.20001220703 28
CA 37 -123 32.7000007629395 1020 27.8999996185303
... ... ... ... ... ...

Note:
VerboseFlag is 'true' by default; therefore, rows in which all coefficient values are very small do not
appear in climate2d_reconstruct.

Interpolator

Summary
The Interpolator function calculates missing values in a time series, using either interpolation or
aggregation. Interpolation estimates missing values between known values. Aggregation combines known
values to produce an aggregate value.
The time intervals between calculated values can either be the same length (specified by the TimeInterval
argument) or have specific start and end times (specified by time_table). The choice of TimeInterval or
time_table affects the behavior of interpolation, but not aggregation.

Usage

Interpolator Syntax
Version 1.0

SELECT * FROM Interpolator (


ON { table | view | (query) } AS input_table
PARTITION BY id
ORDER BY ordering_column
[ ON { table | view | (query) } AS time_table
DIMENSION ORDER BY ordering_column ]
[ ON { table | view | (query) } AS count_row_number
PARTITION BY id ]
TimeColumn ('time_column')
[ TimeInterval (time_interval) ]
ValueColumns ({ 'value_column' | 'value_column_range' }[,...])
[ InterpolationType (interpolation_type [,...] ) ]
[ AggregationType (aggregation_type [,...] ) ]
[ TimeDataType (time_data_type) ]
[ ValueDataType (value_type [,...])]
[ StartTime (start_time) ]
[ EndTime (end_time) ]
[ ValuesBeforeFirst ('value' [,...]) ]

[ ValuesAfterLast ('value' [,...]) ]
[ DuplicateRowsCount ('value1' [,'value2']) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);
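
For example, the following minimal call sketch linearly interpolates one value column at a fixed one-hour interval; the table and column names (input_timeseries, id, ts, temperature) are hypothetical:

SELECT * FROM Interpolator (
  -- Each partition (id) is one time series, ordered by its time column (ts).
  ON input_timeseries AS input_table PARTITION BY id ORDER BY ts
  TimeColumn ('ts')
  TimeInterval (3600)
  ValueColumns ('temperature')
  InterpolationType ('linear')
);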

Arguments
Argument Category Description
TimeColumn Required Specifies the name of the input_table column that contains the time
points of the time series whose missing values are to be calculated.
TimeInterval Optional Specifies the length of time, in seconds, between calculated values. This
value must be either INTEGER or DOUBLE PRECISION.

Note:
Specify exactly one of time_table or TimeInterval.

ValueColumns Required Specifies the names of input_table columns to interpolate to the output
table.
TimeDataType Optional Specifies the data type of the output column that corresponds to the
input table column that TimeColumn specifies (time_column).
If you omit this argument, then the function infers the data type of
time_column from the input table and uses the inferred data type for
the corresponding output table column.
If you specify this argument, then the function can transform the input
data to the specified output data type only if both the input column
data type and the specified output column data type are in this list:
• INTEGER
• BIGINT
• SMALLINT
• DOUBLE PRECISION
• DECIMAL(n,n)
• DECIMAL
• NUMERIC
• NUMERIC(n,n)

ValueDataType Optional Specifies the data types of the output columns that correspond to the
input table columns that ValueColumns specifies.
If you omit this argument, then the function infers the data type of
each time_column from the input table and uses the inferred data type
for the corresponding output table column.
If you specify ValueDataType, then it must be the same size as
ValueColumns. That is, if ValueColumns specifies n columns, then
ValueDataType must specify n data types. For i in [1, n],

value_column_i has value_type_i. However, value_type_i can be
empty; for example:

ValueColumns (c1, c2, c3)
ValueDataType (INTEGER, ,VARCHAR)

If you specify this argument, then the function can transform the input
data to the specified output data type only if both the input column
data type and the specified output column data type are in this list:
• INTEGER
• BIGINT
• SMALLINT
• DOUBLE PRECISION
• DECIMAL(n,n)
• DECIMAL
• NUMERIC
• NUMERIC(n,n)

InterpolationType Optional Specifies interpolation types for the columns that ValueColumns
specifies.
If you specify InterpolationType, then it must be the same size as
ValueColumns. That is, if ValueColumns specifies n columns, then
InterpolationType must specify n interpolation types. For i in [1, n],
value_column_i has interpolation_type_i. However,
interpolation_type_i can be empty; for example:

ValueColumns (c1, c2, c3)
InterpolationType ('linear', ,'constant')

An empty interpolation_type has the default value.


The possible values of interpolation_type are as follows:

Note:
In interpolation_type syntax, brackets do not indicate optional
elements—you must include them.

• 'linear' (default): The value for each missing data point is
determined using linear interpolation between the two nearest
points.
• 'constant': The only interpolation type supported if
value_column has data type CHARACTER, CHARACTER(n),
CHARACTER VARYING, CHARACTER VARYING(n), or
VARCHAR. The value for each missing data point is set to the
nearest value.
• 'spline[(type(cubic))]': The value for each missing data
point is determined by fitting a cubic spline to the nearest three
points.

• 'median[(window(n))]': The value for each missing data point is
set to the median of the nearest n points. n must be greater than or
equal to 2. The default value of n is 5.
• 'loess[(weights({constant | tricube}), degree ({0
|1 |2}), span(m))]':
∘ weights: The default value is constant.
∘ degree: The default value is 1.
∘ m is either an integer greater than 1 (which specifies the number
of neighboring points) or a real number between (λ+1)/n and 1
(λ is the degree of the local polynomial and n is the number of
data points). The default value of m is 5.
The value for each missing data point is computed by fitting a
low-degree polynomial to a set of nearest neighbors. The fitting can be
weighted so that points closer to the missing data have more influence
than points farther away.
Your choice of TimeInterval or time_table affects interpolation:
• If you specify TimeInterval, then the function calculates the value
for the time point only if the value is missing; otherwise, the
function copies the original value.
• If you specify time_table, then the function always calculates the
value of the time point.

Note:
Specify only one of InterpolationType or AggregationType. If you
omit both arguments, the function uses InterpolationType with its
default value, 'linear'.

AggregationType Optional Specifies the aggregation types of the columns that ValueColumns
specifies.
If you specify AggregationType, then it must be the same size as
ValueColumns. That is, if ValueColumns specifies n columns, then
AggregationType must specify n aggregation types. For i in [1, n],
value_column_i has aggregation_type_i. However, aggregation_type_i
can be empty; for example:

ValueColumns (c1, c2, c3)
AggregationType (min, ,max)

An empty aggregation_type has the default value.


The syntax of aggregation_type is:

{ min | max | mean | mode | sum } [(window(n))]

Note:
In aggregation_type syntax, brackets do not indicate optional
elements—you must include them.

The function calculates the aggregate value as the minimum,
maximum, mean, mode, or sum within a sliding window of length n. n
must be greater than or equal to 2. The default value of n is 5. The
default aggregation method is min.
The Interpolator function can calculate the aggregates of values of
these data types:
• INTEGER
• BIGINT
• SMALLINT
• DOUBLE PRECISION
• DECIMAL(n,n)
• DECIMAL
• NUMERIC
• NUMERIC(n,n)
Your choice of TimeInterval or time_table does not affect aggregation.
The function always calculates the aggregated value.

Note:
Specify only one of AggregationType or InterpolationType. If you
omit both arguments, the function uses InterpolationType with its
default value, 'linear'.

StartTime Optional Specifies the start time for the time series. The default value is the start
time of the time series in input_table.
EndTime Optional Specifies the end time for the time series. The default value is the end
time of the time series in input_table.
ValuesBeforeFirst Optional Specifies the values to use if start_time is before the start time of the
time series in input_table. Each of these values must have the same
data type as its corresponding value_column. Values of data type
VARCHAR are case-insensitive.
If ValueColumns specifies n columns, then ValuesBeforeFirst must
specify n values. For i in [1, n], value_column_i has the value
before_first_value_i. However, before_first_value_i can be empty; for
example:

ValueColumns (c1, c2, c3)
ValuesBeforeFirst (1, ,'abc')

If before_first_value_i is empty, then value_column_i has the value
NULL. If you do not specify ValuesBeforeFirst, then value_column_i
has the value NULL for i in [1, n].

ValuesAfterLast Optional Specifies the values to use if end_time is after the end time of the time
series in input_table. Each of these values must have the same data
type as its corresponding value_column. Values of data type
VARCHAR are case-insensitive.
If ValueColumns specifies n columns, then ValuesAfterLast must
specify n values. For i in [1, n], value_column_i has the value
after_last_value_i. However, after_last_value_i can be empty; for
example:

ValueColumns (c1, c2, c3)


ValuesAfterLast (1, ,'abc')

If after_last_value_i is empty, then value_column_i has the value


NULL. If you do not specify ValuesAfterLast, then value_column_i has
the value NULL for i in [1, n].
DuplicateRowsCount Optional Specifies the number of rows to duplicate across split boundaries if you
use the function SeriesSplitter (Example 2: Using SeriesSplitter with
Interpolator shows how to use Interpolator with SeriesSplitter).
If you specify this argument but do not use SeriesSplitter, or do not
conform to the conditions that apply for the value for each
interpolation or aggregation type, then the function either issues an
error message or produces incorrect results.
If you specify only value1, then the function duplicates value1 rows
from the previous partition and value1 rows from the next partition. If
you specify both value1 and value2, then the function duplicates value1
rows from the previous partition and value2 rows from the next
partition. Each argument value must be nonnegative INTEGER.
Both value1 and value2 must exceed the number of data points that the
function needs for every specified interpolation or aggregation
method. The interpolation methods and the number of data points
that the function needs for them are:
• 'linear' and 'constant': 1
• 'spline': 2
• 'median [(window(n))]': n/2
• 'loess [(weights ({constant | tricube}), degree
({0 |1 |2}), span(m))]':
If m > 1: m-1
If m < 1: (m * n)-1, where n is the total number of data rows, found in
column n of the count_row_number table

Accumulate Optional Specifies the names of input_table columns (other than those specified
by TimeColumn and ValueColumns) to copy to the output table. By
default, the function copies to the output table only the columns
specified by TimeColumn and ValueColumns.
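
For example, the following call is a minimal sketch (the table sensor_readings and its columns are hypothetical) that interpolates three value columns with different methods; the empty second entry accepts the default interpolation type, 'linear', and 'status' is assumed to be a VARCHAR column, so it uses 'constant', the only interpolation type supported for character data:

SELECT * FROM Interpolator (
  ON sensor_readings AS input_table
  PARTITION BY id
  ORDER BY ts
  TimeColumn ('ts')
  TimeInterval (3600)
  ValueColumns ('temperature', 'pressure', 'status')
  InterpolationType ('spline[(type(cubic))]', , 'constant')
  Accumulate ('id')
);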

Input
The Interpolator function has three input tables:
• input_table (required)
• time_table (optional)
If you omit time_table, then you must specify the TimeInterval argument. The choice of TimeInterval or
time_table affects the behavior of interpolation, but not aggregation. For details, see the descriptions of
the InterpolationType and AggregationType arguments.
• count_row_number (optional)
Use count_row_number only with InterpolationType ('loess[(weights({constant | tricube}), degree({0 |1 |2}), span(m))]'), where m is between (λ+1)/n and 1.
The input_table contains the time series whose missing values are to be calculated. Each row contains one
time point in the series and one or more values. The following table describes the input_table columns that
you can specify in function arguments.
Table 162: Interpolator input_table Schema

Column | Data Type | Description
time_column | INTEGER, BIGINT, SMALLINT, DOUBLE PRECISION, DECIMAL(n,n), DECIMAL, NUMERIC, NUMERIC(n,n), DATE, TIME, TIME(n), TIME WITH TIME ZONE, TIME WITH TIME ZONE(n), TIMESTAMP, TIMESTAMP(n), TIMESTAMP WITH TIME ZONE, or TIMESTAMP WITH TIME ZONE(n) | Contains the time points in the time series.
value_column | INTEGER, BIGINT, SMALLINT, DOUBLE PRECISION, DECIMAL(n,n), DECIMAL, NUMERIC, NUMERIC(n,n), CHARACTER, CHARACTER(n), CHARACTER VARYING, CHARACTER VARYING(n), or VARCHAR | Contains values for the time points in time_column. The table can have more than one such column. Note: For data types CHARACTER, CHARACTER(n), CHARACTER VARYING, CHARACTER VARYING(n), and VARCHAR, the only supported interpolation type is 'constant'.
accumulate_column | Any | Column to be copied to the output table, specified by the Accumulate argument. Typically, one accumulate_column is a row identifier, such as 'id'.

The time_table consists of a single column, which contains the time points whose values are to be calculated.
The following table describes time_table.
Table 163: Interpolator time_table Schema

Column | Data Type | Description
time_column | Data type of time_column in input_table | Contains the time points in the time series whose missing values are to be calculated.

The count_row_number table contains information about the shorter time intervals into which the original
time series (in input_table) has been split. Each row represents one shorter time interval. The following table
describes count_row_number.
Table 164: Interpolator count_row_number Table Schema

Column | Data Type | Description
id | INTEGER or VARCHAR if id is from the SeriesSplitter output table; otherwise, any data type | Contains the identification number of the shorter time interval. If the shorter time intervals are produced by the function SeriesSplitter, then this column has the value of split_id_column in the SeriesSplitter output table; otherwise, this column has the value of the id column in input_table.
n | INTEGER or BIGINT | Contains the number of data points in the individual time series.

Output
The Interpolator function has one output table, which contains the time series with the values that the
function calculated. Each row contains one time point in the series and one or more values. The following
table describes the output table. Columns copied from input_table appear in the same order in the output
table.
Table 165: Interpolator Output Table Schema

Column | Data Type | Description
time_column | INTEGER, BIGINT, SMALLINT, DOUBLE PRECISION, DECIMAL(n,n), DECIMAL, NUMERIC, NUMERIC(n,n), DATE, TIME, TIME(n), TIME WITH TIME ZONE, TIME WITH TIME ZONE(n), TIMESTAMP, TIMESTAMP(n), TIMESTAMP WITH TIME ZONE, or TIMESTAMP WITH TIME ZONE(n) | Contains the time points in the time series. This column corresponds to time_column in input_table.
value_column | INTEGER, BIGINT, SMALLINT, DOUBLE PRECISION, DECIMAL(n,n), DECIMAL, NUMERIC, NUMERIC(n,n), CHARACTER, CHARACTER(n), CHARACTER VARYING, CHARACTER VARYING(n), or VARCHAR | Contains values for the time points in time_column. This column corresponds to a value_column in input_table. The table can have more than one such column. Note: For data types CHARACTER, CHARACTER(n), CHARACTER VARYING, CHARACTER VARYING(n), and VARCHAR, the only supported interpolation type is 'constant'.
accumulate_column | Any | Column copied from input_table, specified by the Accumulate argument. Typically, one accumulate_column is a row identifier, such as 'id'.

Examples
• Example 1: Aggregation
• Example 2: Constant Interpolation
• Example 3: Linear Interpolation
• Example 4: Median Interpolation
• Example 5: Spline Interpolation
• Example 6: Loess Interpolation

Input
The input table contains the daily IBM stock prices from 1961 to 1962, excluding weekends and holidays.
The examples use the Interpolator function to calculate hypothetical stock prices for the excluded days.
Table 166: Interpolator Examples Input Table ibm_stock1

id name period stockprice


1 IBM 1961-05-17 00:00:00 460
1 IBM 1961-05-18 00:00:00 457
1 IBM 1961-05-19 00:00:00 452
1 IBM 1961-05-22 00:00:00 459
1 IBM 1961-05-23 00:00:00 462
1 IBM 1961-05-24 00:00:00 459
1 IBM 1961-05-25 00:00:00 463
1 IBM 1961-05-26 00:00:00 479
1 IBM 1961-05-29 00:00:00 493

1 IBM 1961-05-31 00:00:00 490
1 IBM 1961-06-01 00:00:00 492
1 IBM 1961-06-02 00:00:00 498
1 IBM 1961-06-05 00:00:00 499
1 IBM 1961-06-06 00:00:00 497
1 IBM 1961-06-07 00:00:00 496
1 IBM 1961-06-08 00:00:00 490
1 IBM 1961-06-09 00:00:00 489
1 IBM 1961-06-12 00:00:00 478
1 IBM 1961-06-13 00:00:00 487
1 IBM 1961-06-14 00:00:00 491
... ... ... ...

The examples use the TimeInterval argument, but in any example, you can substitute the following
time_table for the TimeInterval argument and get the same result. Example 1: Aggregation includes
equivalent SQL-MapReduce calls.
Table 167: Interpolate Example 1 (Aggregation) Input Table time_table1

id period
1 1961-05-17 00:00:00
2 1961-05-18 00:00:00
3 1961-05-19 00:00:00
4 1961-05-20 00:00:00
5 1961-05-21 00:00:00
6 1961-05-22 00:00:00
7 1961-05-23 00:00:00
8 1961-05-24 00:00:00
9 1961-05-25 00:00:00
10 1961-05-26 00:00:00
11 1961-05-27 00:00:00
12 1961-05-28 00:00:00
13 1961-05-29 00:00:00
14 1961-05-30 00:00:00
15 1961-05-31 00:00:00

16 1961-06-01 00:00:00
17 1961-06-02 00:00:00
18 1961-06-03 00:00:00
19 1961-06-04 00:00:00
20 1961-06-05 00:00:00
... ...

Note:
The examples use the time interval 86,400 seconds, which is equivalent to one day.

Example 1: Aggregation

SQL-MapReduce Call
These two calls produce the same result.

SELECT * FROM Interpolator (


ON ibm_stock1 AS input_table
PARTITION BY id ORDER BY "period"
TimeColumn ('period')
TimeInterval (86400)
ValueColumns ('stockprice')
AggregationType ('min[(window(2))]')
ValuesBeforeFirst ('0')
ValuesAfterLast ('0')
Accumulate ('id')
) ORDER BY period;
SELECT * FROM Interpolator (
ON ibm_stock1 AS input_table
PARTITION BY id
ORDER BY "period"
ON time_table1 AS time_table
DIMENSION ORDER BY "period"
TimeColumn ('period')
ValueColumns ('stockprice')
AggregationType ('min[(window(2))]')
ValuesBeforeFirst ('0')
ValuesAfterLast ('0')
Accumulate ('id')
) ORDER BY period;

Output

Table 168: Interpolate Example 1 (Aggregation) Output Table

id period stockprice
1 1961-05-17 00:00:00 460

1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 452
1 1961-05-21 00:00:00 452
1 1961-05-22 00:00:00 452
1 1961-05-23 00:00:00 459
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 459
1 1961-05-26 00:00:00 463
1 1961-05-27 00:00:00 479
1 1961-05-28 00:00:00 479
1 1961-05-29 00:00:00 479
1 1961-05-30 00:00:00 490
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 490
1 1961-06-02 00:00:00 492
1 1961-06-03 00:00:00 498
1 1961-06-04 00:00:00 498
1 1961-06-05 00:00:00 498
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 478
1 1961-06-11 00:00:00 478
1 1961-06-12 00:00:00 478
... ... ...
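
Because aggregation always recalculates the value at every time point, an aggregated value can replace an original one. For example, the output value for 1961-05-23 is 459 rather than the original 462: with AggregationType ('min[(window(2))]'), the function reports the minimum of the values in the window, here 459 (the 1961-05-22 value) and 462.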

Example 2: Constant Interpolation

SQL-MapReduce Call

SELECT * FROM Interpolator (


ON ibm_stock1 AS input_table

PARTITION BY id
ORDER BY "period"
TimeColumn ('period')
TimeInterval (86400)
ValueColumns ('stockprice')
InterpolationType ('constant')
ValuesBeforeFirst ('0')
ValuesAfterLast ('0')
Accumulate ('id')
);

Output

Table 169: Interpolate Example 2 (Constant Interpolation) Output Table

id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 452
1 1961-05-21 00:00:00 459
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 479
1 1961-05-28 00:00:00 493
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 493
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 498
1 1961-06-04 00:00:00 499
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490

1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 489
1 1961-06-11 00:00:00 478
1 1961-06-12 00:00:00 478
... ... ...
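
With constant interpolation, each missing value takes the value of the nearest known time point. For example, 1961-05-20 is assigned 452 (the 1961-05-19 value, one day away) and 1961-05-21 is assigned 459 (the 1961-05-22 value, one day away).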

Example 3: Linear Interpolation

SQL-MapReduce Call

SELECT * FROM Interpolator (


ON ibm_stock1 AS input_table
PARTITION BY id
ORDER BY "period"
TimeColumn ('period')
TimeInterval (86400)
ValueColumns ('stockprice')
InterpolationType ('linear')
ValuesBeforeFirst ('0')
ValuesAfterLast ('0')
Accumulate ('id')
);

Output

Table 170: Interpolate Example 3 (Linear Interpolation) Output Table

id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 454.333
1 1961-05-21 00:00:00 456.667
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 483.667

1 1961-05-28 00:00:00 488.333
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 491.5
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 498.333
1 1961-06-04 00:00:00 498.667
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 485.333
1 1961-06-11 00:00:00 481.667
1 1961-06-12 00:00:00 478
... ... ...
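
With linear interpolation, each missing value lies on the straight line between the two nearest known values. For example, 1961-05-20 falls one third of the way from 1961-05-19 (452) to 1961-05-22 (459), so its value is 452 + (459 - 452)/3 ≈ 454.333.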

Example 4: Median Interpolation

SQL-MapReduce Call

SELECT * FROM Interpolator (


ON ibm_stock1 AS input_table
PARTITION BY id
ORDER BY "period"
TimeColumn ('period')
TimeInterval (86400)
ValueColumns ('stockprice')
InterpolationType ('median[(window(4))]')
ValuesBeforeFirst ('0')
ValuesAfterLast ('0')
Accumulate ('id')
);

Output

Table 171: Interpolate Example 4 (Median Interpolation) Output Table

id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 458
1 1961-05-21 00:00:00 458
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 484.5
1 1961-05-28 00:00:00 484.5
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 491
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 497.5
1 1961-06-04 00:00:00 497.5
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 488
1 1961-06-11 00:00:00 488
1 1961-06-12 00:00:00 478
... ... ...
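
With median interpolation, each missing value is the median of the values in a window of nearest known points. For example, the value for 1961-05-20 is 458, the average of the two middle values among its nearest known points, 457 (1961-05-18) and 459 (1961-05-22): (457 + 459)/2 = 458.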

Example 5: Spline Interpolation

SQL-MapReduce Call

SELECT * FROM Interpolator (


ON ibm_stock1 AS input_table
PARTITION BY id
ORDER BY "period"
TimeColumn ('period')
TimeInterval (86400)
ValueColumns ('stockprice')
InterpolationType ('spline[(type(cubic))]')
ValuesBeforeFirst ('0')
ValuesAfterLast ('0')
Accumulate ('id')
);

Output
The algorithm did not converge, so the missing values are reported as not a number (NaN).
Table 172: Interpolate Example 5 (Spline Interpolation) Output Table

id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 NaN
1 1961-05-21 00:00:00 NaN
1 1961-05-22 00:00:00 459
1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 NaN
1 1961-05-28 00:00:00 NaN
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 NaN
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 NaN

1 1961-06-04 00:00:00 NaN
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 NaN
1 1961-06-11 00:00:00 NaN
1 1961-06-12 00:00:00 478
... ... ...

Example 6: Loess Interpolation

SQL-MapReduce Call

SELECT * FROM Interpolator (


ON ibm_stock1 AS input_table
PARTITION BY id
ORDER BY "period"
TimeColumn ('period')
TimeInterval (86400)
ValueColumns ('stockprice')
InterpolationType ('loess[(weights(constant),degree(2),span(4))]')
ValuesBeforeFirst ('0')
ValuesAfterLast ('0')
Accumulate ('id')
);

Output

Table 173: Interpolate Example 6 (Loess Interpolation) Output Table

id period stockprice
1 1961-05-17 00:00:00 460
1 1961-05-18 00:00:00 457
1 1961-05-19 00:00:00 452
1 1961-05-20 00:00:00 457
1 1961-05-21 00:00:00 457.5
1 1961-05-22 00:00:00 459

1 1961-05-23 00:00:00 462
1 1961-05-24 00:00:00 459
1 1961-05-25 00:00:00 463
1 1961-05-26 00:00:00 479
1 1961-05-27 00:00:00 473.5
1 1961-05-28 00:00:00 481.25
1 1961-05-29 00:00:00 493
1 1961-05-30 00:00:00 481.25
1 1961-05-31 00:00:00 490
1 1961-06-01 00:00:00 492
1 1961-06-02 00:00:00 498
1 1961-06-03 00:00:00 496.5
1 1961-06-04 00:00:00 496.5
1 1961-06-05 00:00:00 499
1 1961-06-06 00:00:00 497
1 1961-06-07 00:00:00 496
1 1961-06-08 00:00:00 490
1 1961-06-09 00:00:00 489
1 1961-06-10 00:00:00 488.25
1 1961-06-11 00:00:00 486
1 1961-06-12 00:00:00 478
... ... ...

Path Analysis Functions

Summary
The path analysis functions automate path analysis. They are useful for clickstream analysis of web site
traffic and other sequence/path analysis tasks, such as advertisement or referral attribution.
The function descriptions use these terms:
• Path: An ordered, start-to-finish series of actions, such as the page views of a user from the start to the
end of a session. For example, if the user visits page a, page b, and page c, in that order, the path is: a,b,c

• Sequence: A path in this format:

^,path

The caret (^) indicates that a path follows. For example: ^,a,b,c
• Subsequence or prefix: For a given sequence, a possible subset of steps that start with the initial step. For
example, the subsequences for the path a,b,c are:

^,a
^,a,b
^,a,b,c
• Exit subsequence or prefix: A subsequence or prefix that is the same as its sequence, indicated by a final
dollar sign ($). For example: ^,a,b,c$
• Depth: The number of steps in a sequence or subsequence. For example, the immediately preceding
subsequences have depths 1, 2, and 3, respectively.
• Node: A single step on a path. For example, one web page that the user visits during the session.
• Parent: The path the user traveled to a given node. For example, the parent of c is ^,a,b.
• Child: A path the user traveled from a given node. For example, the children of ^,a are:

^,a,b
^,a,b,c

The functions are:


• Path_Generator, which takes a set of paths and outputs the sequence and all possible subsequences.
• Path_Summarizer, which takes Path_Generator output and returns, for each prefix in the input table, the
parent and children, and number of times each of its subsequences was traveled.
• Path_Start, which takes Path_Summarizer output and returns, for each parent in the input table, the
parent and children and the number of times that each of its subsequences was traveled.
• Path_Analyzer, which runs the preceding path analysis functions in the order shown, using the output of
Path_Generator as input to Path_Summarizer and the output of Path_Summarizer as the input to
Path_Start.

Path_Generator

Summary
The Path_Generator function takes a set of paths and outputs the sequence and all possible subsequences,
which can be input to the function Path_Summarizer.

Note:
For the definitions of the terms that this section uses, refer to Path Analysis Functions.

Usage

Path_Generator Syntax
Version 1.3

SELECT * FROM Path_Generator (


ON { table | view | (query) }
SeqColumn ('sequence_column')
[ Delimiter ('delimiter') ]
);

Arguments
Argument Category Description
SeqColumn Required Specifies the name of the input table column that contains the
paths.
Delimiter Optional Specifies the single-character delimiter that separates symbols in
the path string. The default value is comma (',').

Note:
Do not use any of the following characters as delimiter (they
cause the function to fail):
• Asterisk (*)
• Plus (+)
• Left parenthesis (()
• Right parenthesis ())
• Single quotation mark (')
• Escaped single quotation mark (\')
• Backslash (\)

Input
Table 174: Path_Generator Input Table Schema

Column Name | Data Type | Description
input_column | Any | Optional. Column other than sequence_column or count_column. Typically, one input_column is a user identifier.
sequence_column | VARCHAR | Path to analyze, which has this syntax: symbol [delimiter symbol ...]. Each symbol is an alphanumeric character, typically a code that represents a unique web page view, as in Example. If the input table is a query result, the query must specify sequence_column in the GROUP BY clause, so that the input table has one row for each unique path.
count_column | INTEGER | Number of times the path was traveled.

Output
The output table has a row for each subsequence. The column containing the subsequence is named "prefix".
Table 175: Path_Generator Output Table Schema

Column Name | Data Type | Description
input_column | Same as in input table | Value of input_column in the input table for this path; for example, the identifier of the user who took this path.
sequence_column | VARCHAR | Path, copied from sequence_column in the input table.
count_column | INTEGER | Number of times the path was traveled, copied from the input table.
prefix | VARCHAR | Subsequence of the path.
sequence | VARCHAR | Sequence (path in the format that the Path_Summarizer function requires).

Example
This example uses clickstream data from an e-commerce web site. The following table lists and describes the
symbols of the web site pages.
Table 176: Path_Generator Example E-Commerce Website Page Symbols

Symbol Meaning Description


H Home Home page
A Account User account pages
C Category Page with list of products
P Product Product information pages
I Information Shipping, order status, etc.
S ShoppingCart Pre-order pages
O Order Confirmation/purchase page
E Enter/Exit Noncommercial vendor pages

Input
The input table identifies each customer by a userid. The path column lists the symbols for the pages that the
customer clicked, ordered from first to last. For symbol meanings, refer to the preceding table.
Table 177: Path_Generator Example Input Table: clickstream1

userid usertype path cnt


1 browser H,E 2
2 browser H,C,P,E 5
3 browser H,C,H,P 7
4 browser A,H,C 3
5 browser A,H,P 2
6 buyer H,A,C,P,I,S,O,E 4
7 buyer A,P,S,O, 6
8 buyer P,C,P,S,O,E 2
9 buyer H,A,P,I,S,O 2
10 buyer C,S,O,E 1

SQL-MapReduce Call

SELECT * FROM Path_Generator (


ON clickstream1
SeqColumn ('path')
Delimiter (',')
) ORDER BY userid;

Output
Table 178: Path Generator Example Output Table

userid usertype path cnt prefix sequence


1 browser H,E 2 ^,H ^,H,E
1 browser H,E 2 ^,H,E ^,H,E
2 browser H,C,P,E 5 ^,H ^,H,C,P,E
2 browser H,C,P,E 5 ^,H,C ^,H,C,P,E
2 browser H,C,P,E 5 ^,H,C,P ^,H,C,P,E
2 browser H,C,P,E 5 ^,H,C,P,E ^,H,C,P,E
3 browser H,C,H,P 7 ^,H ^,H,C,H,P
3 browser H,C,H,P 7 ^,H,C ^,H,C,H,P

3 browser H,C,H,P 7 ^,H,C,H ^,H,C,H,P
3 browser H,C,H,P 7 ^,H,C,H,P ^,H,C,H,P
4 browser A,H,C 3 ^,A ^,A,H,C
4 browser A,H,C 3 ^,A,H ^,A,H,C
4 browser A,H,C 3 ^,A,H,C ^,A,H,C
5 browser A,H,P 2 ^,A ^,A,H,P
5 browser A,H,P 2 ^,A,H ^,A,H,P
5 browser A,H,P 2 ^,A,H,P ^,A,H,P
6 buyer H,A,C,P,I,S,O,E 4 ^,H ^,H,A,C,P,I,S,O,E
6 buyer H,A,C,P,I,S,O,E 4 ^,H,A ^,H,A,C,P,I,S,O,E
6 buyer H,A,C,P,I,S,O,E 4 ^,H,A,C ^,H,A,C,P,I,S,O,E
6 buyer H,A,C,P,I,S,O,E 4 ^,H,A,C,P ^,H,A,C,P,I,S,O,E
6 buyer H,A,C,P,I,S,O,E 4 ^,H,A,C,P,I ^,H,A,C,P,I,S,O,E
6 buyer H,A,C,P,I,S,O,E 4 ^,H,A,C,P,I,S ^,H,A,C,P,I,S,O,E
6 buyer H,A,C,P,I,S,O,E 4 ^,H,A,C,P,I,S,O ^,H,A,C,P,I,S,O,E
6 buyer H,A,C,P,I,S,O,E 4 ^,H,A,C,P,I,S,O,E ^,H,A,C,P,I,S,O,E
7 buyer A,P,S,O, 6 ^,A ^,A,P,S,O,
7 buyer A,P,S,O, 6 ^,A,P ^,A,P,S,O,
7 buyer A,P,S,O, 6 ^,A,P,S ^,A,P,S,O,
7 buyer A,P,S,O, 6 ^,A,P,S,O ^,A,P,S,O,
8 buyer P,C,P,S,O,E 2 ^,P ^,P,C,P,S,O,E
8 buyer P,C,P,S,O,E 2 ^,P,C ^,P,C,P,S,O,E
8 buyer P,C,P,S,O,E 2 ^,P,C,P ^,P,C,P,S,O,E
8 buyer P,C,P,S,O,E 2 ^,P,C,P,S ^,P,C,P,S,O,E
8 buyer P,C,P,S,O,E 2 ^,P,C,P,S,O ^,P,C,P,S,O,E
8 buyer P,C,P,S,O,E 2 ^,P,C,P,S,O,E ^,P,C,P,S,O,E
9 buyer H,A,P,I,S,O 2 ^,H ^,H,A,P,I,S,O
9 buyer H,A,P,I,S,O 2 ^,H,A ^,H,A,P,I,S,O
9 buyer H,A,P,I,S,O 2 ^,H,A,P ^,H,A,P,I,S,O
9 buyer H,A,P,I,S,O 2 ^,H,A,P,I ^,H,A,P,I,S,O
9 buyer H,A,P,I,S,O 2 ^,H,A,P,I,S ^,H,A,P,I,S,O
9 buyer H,A,P,I,S,O 2 ^,H,A,P,I,S,O ^,H,A,P,I,S,O

10 buyer C,S,O,E 1 ^,C ^,C,S,O,E
10 buyer C,S,O,E 1 ^,C,S ^,C,S,O,E
10 buyer C,S,O,E 1 ^,C,S,O ^,C,S,O,E
10 buyer C,S,O,E 1 ^,C,S,O,E ^,C,S,O,E

Path_Summarizer

Summary
The Path_Summarizer function takes output of the function Path_Generator and returns, for each prefix in
the input table, the parent and children and number of times each of its subsequences was traveled. This
output can be input to the function Path_Start.

Note:
For the definitions of the terms that this section uses, refer to Path Analysis Functions.

Usage

Path_Summarizer Syntax
Version 1.2

SELECT * FROM Path_Summarizer (


ON { table_name | view_name | (query)}
PARTITION BY partition_column [,...]
[ CountColumn ('count_column') ]
[ Delimiter ('delimiter') ]
SeqColumn ('sequence_column')
PartitionNames ('partition_column' [,...])
[ Hash ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
PrefixColumn ('prefix_column')
);

Arguments
Argument Category Description
CountColumn Optional Specifies the name of the input table column that contains
the number of times a path was traveled. If you omit this
argument, each path is assumed to have been traveled once.

Delimiter Optional Specifies the single-character delimiter that separates
symbols in the path string. The default value is comma (',').

Note:
Do not use any of the following characters as delimiter
(they cause the function to fail):
• Asterisk (*)
• Plus (+)
• Left parenthesis (()
• Right parenthesis ())
• Single quotation mark (')
• Escaped single quotation mark (\')
• Backslash (\)

SeqColumn Required Specifies the name of the input table column that contains
the paths.
PartitionNames Required Lists the names of the columns that the PARTITION BY
clause specifies. The function uses these names for output
table columns. This argument and the PARTITION BY
clause must specify the same names in the same order.
Hash Optional Specifies whether to include the hash code of the node in the
output table. The default value is 'false'.
PrefixColumn Required Specifies the name of the input column that contains the
node prefixes.

Input
The input table has the same schema as the Path_Generator output table.

Output
Table 179: Path_Summarizer Output Table Schema

Column Name | Data Type | Description
node | VARCHAR | Column containing the path to the node, including the node itself.
parent | VARCHAR | Parent of the node (path the users traveled to the node).
children | VARCHAR | Children of the node (paths the users traveled from the node), a list of subsequences with this syntax: [(subsequence)[,...]]. The outer brackets appear in the table.
cnt | INTEGER or BIGINT | Count (sum) of the values in the input column count_column.
depth | INTEGER or BIGINT | Number of steps on the path to the node.
prefix | VARCHAR | Subsequence of the node, copied from the prefix_column of the input table.

Example

Input
Path Generator Example Output Table

SQL-MapReduce Call

SELECT * FROM Path_Summarizer (


ON (SELECT * FROM
Path_Generator (
ON clickstream1
SeqColumn ('path')
Delimiter (',')
)
)
PARTITION BY prefix
SeqColumn ('sequence')
PrefixColumn ('prefix')
PartitionNames ('prefix')
Delimiter (',')
CountColumn ('cnt')
Hash ('false')
) ORDER BY node;

Output
Table 180: Path_Summarizer Example Output Table

node parent children cnt depth prefix


^,A ^ [(^,A,H),(^,A,P)] 11 1 ^,A
^,A,H ^,A [(^,A,H,C),(^,A,H,P)] 5 2 ^,A,H
^,A,H,C ^,A,H [(^,A,H,C,$)] 3 3 ^,A,H,C
^,A,H,C,$ ^,A,H,C 3 4 ^,A,H,C
^,A,H,P ^,A,H [(^,A,H,P,$)] 2 3 ^,A,H,P
^,A,H,P,$ ^,A,H,P 2 4 ^,A,H,P

^,A,P ^,A [(^,A,P,S)] 6 2 ^,A,P
^,A,P,S ^,A,P [(^,A,P,S,O)] 6 3 ^,A,P,S
^,A,P,S,O ^,A,P,S [(^,A,P,S,O,)] 6 4 ^,A,P,S,O
^,C ^ [(^,C,S)] 1 1 ^,C
^,C,S ^,C [(^,C,S,O)] 1 2 ^,C,S
^,C,S,O ^,C,S [(^,C,S,O,E)] 1 3 ^,C,S,O
^,C,S,O,E ^,C,S,O [(^,C,S,O,E,$)] 1 4 ^,C,S,O,E
^,C,S,O,E,$ ^,C,S,O,E 1 5 ^,C,S,O,E
^,H ^ [(^,H,A),(^,H,C),(^,H,E)] 20 1 ^,H
^,H,A ^,H [(^,H,A,C),(^,H,A,P)] 6 2 ^,H,A
^,H,A,C ^,H,A [(^,H,A,C,P)] 4 3 ^,H,A,C
^,H,A,C,P ^,H,A,C [(^,H,A,C,P,I)] 4 4 ^,H,A,C,P
^,H,A,C,P,I ^,H,A,C,P [(^,H,A,C,P,I,S)] 4 5 ^,H,A,C,P,I
^,H,A,C,P,I,S ^,H,A,C,P,I [(^,H,A,C,P,I,S,O)] 4 6 ^,H,A,C,P,I,S
^,H,A,C,P,I,S,O ^,H,A,C,P,I,S [(^,H,A,C,P,I,S,O,E)] 4 7 ^,H,A,C,P,I,S,O
^,H,A,C,P,I,S,O,E ^,H,A,C,P,I,S,O [(^,H,A,C,P,I,S,O,E,$)] 4 8 ^,H,A,C,P,I,S,O,E
^,H,A,C,P,I,S,O,E,$ ^,H,A,C,P,I,S,O,E 4 9 ^,H,A,C,P,I,S,O,E
^,H,A,P ^,H,A [(^,H,A,P,I)] 2 3 ^,H,A,P
^,H,A,P,I ^,H,A,P [(^,H,A,P,I,S)] 2 4 ^,H,A,P,I
^,H,A,P,I,S ^,H,A,P,I [(^,H,A,P,I,S,O)] 2 5 ^,H,A,P,I,S
^,H,A,P,I,S,O ^,H,A,P,I,S [(^,H,A,P,I,S,O,$)] 2 6 ^,H,A,P,I,S,O
^,H,A,P,I,S,O,$ ^,H,A,P,I,S,O 2 7 ^,H,A,P,I,S,O
^,H,C ^,H [(^,H,C,H),(^,H,C,P)] 12 2 ^,H,C
^,H,C,H ^,H,C [(^,H,C,H,P)] 7 3 ^,H,C,H
^,H,C,H,P ^,H,C,H [(^,H,C,H,P,$)] 7 4 ^,H,C,H,P
^,H,C,H,P,$ ^,H,C,H,P 7 5 ^,H,C,H,P
^,H,C,P ^,H,C [(^,H,C,P,E)] 5 3 ^,H,C,P
^,H,C,P,E ^,H,C,P [(^,H,C,P,E,$)] 5 4 ^,H,C,P,E
^,H,C,P,E,$ ^,H,C,P,E 5 5 ^,H,C,P,E

^,H,E ^,H [(^,H,E,$)] 2 2 ^,H,E
^,H,E,$ ^,H,E 2 3 ^,H,E
^,P ^ [(^,P,C)] 2 1 ^,P
^,P,C ^,P [(^,P,C,P)] 2 2 ^,P,C
^,P,C,P ^,P,C [(^,P,C,P,S)] 2 3 ^,P,C,P
^,P,C,P,S ^,P,C,P [(^,P,C,P,S,O)] 2 4 ^,P,C,P,S
^,P,C,P,S,O ^,P,C,P,S [(^,P,C,P,S,O,E)] 2 5 ^,P,C,P,S,O
^,P,C,P,S,O,E ^,P,C,P,S,O [(^,P,C,P,S,O,E,$)] 2 6 ^,P,C,P,S,O,E
^,P,C,P,S,O,E,$ ^,P,C,P,S,O,E 2 7 ^,P,C,P,S,O,E
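
In this output, cnt sums the cnt values of every input path that contains the prefix. For example, node ^,A has cnt 11 because the prefix ^,A occurs in the paths of users 4, 5, and 7, whose cnt values (3, 2, and 6) sum to 11.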

Path_Start

Summary
The Path_Start function takes output of the function Path_Summarizer and returns, for each parent in the
input table, the parent and children and the number of times that each of its subsequences was traveled.

Note:
For the definitions of the terms that this section uses, refer to Path Analysis Functions.

Usage

Path_Start Syntax
Version 1.2

SELECT * FROM Path_Start (


ON table_name
PARTITION BY partition_column [,...]
CountColumn ('count_column')
[ Delimiter ('delimiter') ]
ParentColumn ('parent_column')
PartitionNames ('partition_column' [,...])
NodeColumn ('node_column')
);

Arguments
Argument Category Description
CountColumn Required Specifies the name of the input table column that contains the
number of times a path was traveled.
Delimiter Optional Specifies the single-character delimiter that separates symbols in
the path string. The default value is comma (',').

Note:
Do not use any of the following characters as delimiter (they
cause the function to fail):
• Asterisk (*)
• Plus (+)
• Left parenthesis (()
• Right parenthesis ())
• Single quotation mark (')
• Escaped single quotation mark (\')
• Backslash (\)

ParentColumn Required Specifies the name of the input table column that contains the
parent nodes. The PARTITION BY clause in the function call
must include this column.
PartitionNames Required Lists the names of the columns that the PARTITION BY clause
specifies. The function uses these names for output table columns.
This argument and the PARTITION BY clause must specify the
same names in the same order. One partition_column must be
parent_column.
NodeColumn Required Specifies the name of the input table column that contains the
nodes.

Input
The input table has the same schema as the Path_Summarizer output table (refer to Output in
Path_Summarizer).

Output
The output table has a row for each node.
Table 181: Path_Start Output Table Schema

Column Name | Data Type | Description
node_column | VARCHAR | Column containing the path to the node, including the node itself.
parent_column | VARCHAR | Parent of the node (path the users traveled to the node).
children | VARCHAR | Children of the node (paths the users traveled from the node), a list of subsequences with this syntax: [(subsequence)[,...]]. The outer brackets appear in the table.
subpath_cnt | Same as in input table | Number of times the subsequence was traveled.
depth | Same as in input table | Number of steps in the subsequence.
partition_column | Same as in input table | Column specified by the PartitionNames argument, copied from the input table.

Example

Input
The input table for this example is Path_Summarizer Example Output Table, from the Example section of
the Path_Summarizer function.

SQL-MapReduce Call

SELECT * FROM Path_Start (


ON (SELECT * FROM Path_Summarizer (
ON (SELECT * FROM Path_Generator (
ON clickstream1
SeqColumn ('path')
Delimiter (',')
)
)
PARTITION BY prefix
SeqColumn ('sequence')
PrefixColumn ('prefix')
PartitionNames ('prefix')
Delimiter (',')
CountColumn ('cnt')
Hash ('false')
)
)
PARTITION BY (parent)
CountColumn ('cnt')
Delimiter (',')
ParentColumn ('parent')
PartitionNames ('partitioned')
NodeColumn ('node')
) ORDER BY node;

Output
Table 182: Path_Start Example Output Table

node parent children subpath_cnt depth partitioned


^ [(^,A),(^,C),(^,H),(^,P)] 34 0 ^
^,A ^ [(^,A,H),(^,A,P)] 11 1 ^,A
^,A,H ^,A [(^,A,H,C),(^,A,H,P)] 5 2 ^,A,H
^,A,H,C ^,A,H [(^,A,H,C,$)] 3 3 ^,A,H,C
^,A,H,P ^,A,H [(^,A,H,P,$)] 2 3 ^,A,H,P
^,A,P ^,A [(^,A,P,S)] 6 2 ^,A,P
^,A,P,S ^,A,P [(^,A,P,S,O)] 6 3 ^,A,P,S
^,C ^ [(^,C,S)] 1 1 ^,C
^,C,S ^,C [(^,C,S,O)] 1 2 ^,C,S
^,C,S,O ^,C,S [(^,C,S,O,E)] 1 3 ^,C,S,O
^,C,S,O,E ^,C,S,O [(^,C,S,O,E,$)] 1 4 ^,C,S,O,E
^,H ^ [(^,H,A),(^,H,C),(^,H,E)] 20 1 ^,H
^,H,A ^,H [(^,H,A,C),(^,H,A,P)] 6 2 ^,H,A
^,H,A,C ^,H,A [(^,H,A,C,P)] 4 3 ^,H,A,C
^,H,A,C,P ^,H,A,C [(^,H,A,C,P,I)] 4 4 ^,H,A,C,P
^,H,A,C,P,I ^,H,A,C,P [(^,H,A,C,P,I,S)] 4 5 ^,H,A,C,P,I
^,H,A,C,P,I,S ^,H,A,C,P,I [(^,H,A,C,P,I,S,O)] 4 6 ^,H,A,C,P,I,S
^,H,A,C,P,I,S,O ^,H,A,C,P,I,S [(^,H,A,C,P,I,S,O,E)] 4 7 ^,H,A,C,P,I,S,O
^,H,A,C,P,I,S,O,E ^,H,A,C,P,I,S,O [(^,H,A,C,P,I,S,O,E,$)] 4 8 ^,H,A,C,P,I,S,O,E
^,H,A,P ^,H,A [(^,H,A,P,I)] 2 3 ^,H,A,P
^,H,A,P,I ^,H,A,P [(^,H,A,P,I,S)] 2 4 ^,H,A,P,I
^,H,A,P,I,S ^,H,A,P,I [(^,H,A,P,I,S,O)] 2 5 ^,H,A,P,I,S
^,H,A,P,I,S,O ^,H,A,P,I,S [(^,H,A,P,I,S,O,$)] 2 6 ^,H,A,P,I,S,O
^,H,C ^,H [(^,H,C,H),(^,H,C,P)] 12 2 ^,H,C

^,H,C,H ^,H,C [(^,H,C,H,P)] 7 3 ^,H,C,H
^,H,C,H,P ^,H,C,H [(^,H,C,H,P,$)] 7 4 ^,H,C,H,P
^,H,C,P ^,H,C [(^,H,C,P,E)] 5 3 ^,H,C,P
^,H,C,P,E ^,H,C,P [(^,H,C,P,E,$)] 5 4 ^,H,C,P,E
^,H,E ^,H [(^,H,E,$)] 2 2 ^,H,E
^,P ^ [(^,P,C)] 2 1 ^,P
^,P,C ^,P [(^,P,C,P)] 2 2 ^,P,C
^,P,C,P ^,P,C [(^,P,C,P,S)] 2 3 ^,P,C,P
^,P,C,P,S ^,P,C,P [(^,P,C,P,S,O)] 2 4 ^,P,C,P,S
^,P,C,P,S,O ^,P,C,P,S [(^,P,C,P,S,O,E)] 2 5 ^,P,C,P,S,O
^,P,C,P,S,O,E ^,P,C,P,S,O [(^,P,C,P,S,O,E,$)] 2 6 ^,P,C,P,S,O,E
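
In this output, for example, the root node ^ has subpath_cnt 34, the sum of the counts of its children ^,A (11), ^,C (1), ^,H (20), and ^,P (2).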

Path_Analyzer

Summary
The Path_Analyzer function:
1. Inputs a set of paths to the function Path_Generator.
2. Inputs the Path_Generator output to the function Path_Summarizer.
3. Inputs the Path_Summarizer output to the function Path_Start, which outputs, for each parent, all
children and the number of times that the user traveled each child.

Note:
For the definitions of the terms that this section uses, refer to Path Analysis Functions.

Usage

Path_Analyzer Syntax
Version 1.3

SELECT * FROM Path_Analyzer (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]

[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ({ table | view | (query) })
OutputTable ('output_table')
SeqColumn ('sequence_column')
CountColumn ('count_column')
[ Hash ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
Delimiter ('delimiter')
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies either the name of the input table or view or an nPath
query whose result is the input table. The input table contains the
paths to analyze. Each path is a string of alphanumeric symbols
that represents an ordered sequence of page views (or actions).
Typically each symbol is a code that represents a unique page
view.
If you specify an nPath query, it must select both the column that
contains the paths (sequence_column) and the column that
contains the number of times a path was traveled (count_column).
It must also specify sequence_column in the GROUP BY clause, so
that the input table has one row for each unique path traveled on a
web site.
OutputTable Required Specifies the name of the output table.
SeqColumn Required Specifies the name of the input table column that contains the
paths.
CountColumn Required Specifies the name of the input table column that contains the
number of times a path was traveled.
Hash Optional Specifies whether to include the hash code of the output column
node. The default value is 'false'.
Delimiter Required Specifies the single-character delimiter that separates symbols in
the path string; typically a comma (',').

Note:
Do not use any of the following characters as delimiter (they
cause the function to fail):
• Asterisk (*)
• Plus (+)
• Left parenthesis (()
• Right parenthesis ())
• Single quotation mark (')
• Escaped single quotation mark (\')
• Backslash (\)

Input
The input table has the same schema as the Path_Generator input table (refer to Input).

Output
The output table has the same schema as the Path_Start output table (refer to Output).

Example
This example uses clickstream data from an e-commerce website. The table Path_Generator Example E-
Commerce Website Page Symbols, in the Example section of the Path_Generator function, describes the
pages of the website.

Input
Use the following table from the Input section of the Path_Generator example:
• Path_Generator Example Input Table: clickstream1

SQL-MapReduce Call

SELECT * FROM Path_Analyzer (


ON (SELECT 1)
PARTITION BY 1
InputTable ('clickstream1')
OutputTable ('path_output')
SeqColumn ('path')
CountColumn ('cnt')
Hash ('false')
Delimiter (',')
);

Output
Path_Start Example Output Table

SAX2

Summary
The SAX2 function transforms original time series data into symbolic strings, which are more suitable for
many additional types of manipulation because of their smaller size and the relative ease with which
patterns can be identified and compared. Its input and output formats allow it to supply data to the Shapelet
Functions.

Background
A time series is a collection of data observations made sequentially over time. Time series occur in virtually
every medical, scientific, entertainment, and business domain.
Symbolic Aggregate Approximation (SAX) uses a simple algorithm with low computational complexity to
create symbolic strings from time series data. Using a symbolic representation enables additional functions,
such as Teradata Aster nPath, to easily operate on the data.
The data can also be manipulated using common algorithms such as hashing or regular-expression pattern
matching. In classic data-mining tasks such as classification, clustering, and indexing, SAX is accepted as
being as good as some well-known, but storage-intensive methods like Discrete Wavelet Transform (DWT)
and Discrete Fourier Transform (DFT).
SAX transforms a time series X of length n into a string of arbitrary length w, where w < n, using an
alphabet A of size a > 2.
The SAX algorithm has two steps:
1. SAX transforms the original time series data into a piecewise aggregate approximation (PAA)
representation. This transformation effectively splits the time series data into intervals and then assigns
each interval to one of a limited set of alphabetical symbols (letters) based on the data being examined.
The set of symbols used is based on dividing all observed data into chunks (or thresholds), using the
normal distribution curve. Each of these chunks is represented by a symbol (a letter). This is a simple
way to reduce the dimensionality of the data.
2. SAX converts the PAA into a string of letters that represents the patterns occurring in the data over time.
The symbols created by SAX correspond to time series features with equal probability, allowing them to
be compared and used for further manipulation with reliable accuracy. Time series that are normalized
to zero mean and unit energy follow the normal distribution law. By using Gaussian distribution
properties, SAX can easily select equal-sized areas under the normal curve, using lookup tables for the
coordinates of the cut lines that slice the area under the Gaussian curve. In the SAX algorithm context, the x
coordinates of these lines are called breakpoints.
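For example, with an alphabet of size 4, the breakpoints are approximately -0.67, 0, and 0.67 (the quartiles of the standard normal distribution, stated approximately), dividing the area under the normal curve into four equal-probability regions; a normalized PAA value below -0.67 maps to 'a', a value between -0.67 and 0 maps to 'b', and so on.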

Figure 8: Saxification Process

Usage

SAX2 Syntax (Single Input)


Version 1.0

SELECT * FROM SAX2 (


ON { table | view | (query) } AS input
PARTITION BY key
ORDER BY order_columns
ValueColumns ({ 'value_column' | 'value_column_range' }[,...])
[ TimeColumn ('time_column') ]
[ WindowType ({ 'global' | 'sliding' }) ]
[ Output ({ 'string' | 'bytes' | 'bitmap' | 'characters' }) ]
[ Mean ('mean_value' [,...]) ]
[ Stdev ('stdev_value' [,...]) ]

[ WindowSize ('window_size') ]
[ OutputFrequency ('output_frequency') ]
[ PointsPerSymbol ('points_per_symbol' [,...]) ]
[ SymbolsPerWindow ('symbols_per_window' [,...]) ]
[ AlphabetSize ('alphabet_size' [,...]) ]
[ BitmapLevel ('bitmap_level' [,...]) ]
[ PrintCodeStats
({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

SAX2 Syntax (Multiple Inputs)


Version 1.0

SELECT * FROM SAX2 (


ON { table | view | (query) } AS input
PARTITION BY key
ORDER BY order_columns
ON { table | view | (query) } AS meanstats PARTITION BY key
ON { table | view | (query) } AS stdevstats PARTITION BY key
ValueColumns ({ 'value_column' | 'value_column_range' }[,...])
[ TimeColumn ('time_column') ]
[ WindowType ( { 'global' | 'sliding' } ) ]
[ Output ({ 'string' | 'bytes' | 'bitmap' | 'characters' }) ]
[ WindowSize ('window_size') ]
[ OutputFrequency ('output_frequency') ]
[ PointsPerSymbol ('points_per_symbol' [,...]) ]
[ SymbolsPerWindow ('symbols_per_window' [,...]) ]
[ AlphabetSize ('alphabet_size' [,...]) ]
[ BitmapLevel ('bitmap_level' [,...]) ]
[ PrintCodeStats
({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
ValueColumns Required Specifies the names of the input table columns that
contain the time series data to be transformed.
TimeColumn Optional Specifies the name of the input table column that
contains the time axis of the data.
WindowType Optional Determines how much data the function processes at
one time:
• 'global' (default)

The function computes the SAX code using a single mean and standard deviation for the entire data set.
• 'sliding'
The function recomputes the mean and standard
deviation for a sliding window of the data set.

Output Optional Determines how the function outputs the results:


• 'string' (default)
The function outputs a list of SAX codes for each
window.
• 'bytes'
The function outputs the list of SAX codes as
compact byte arrays (which are not “human-
readable”).
• 'bitmap'
The function outputs a JSON representation of a
SAX bitmap.
• 'characters'
The function outputs one character for each line.

Mean Optional (single-input syntax only) Specifies the global mean values that the
function uses to calculate the SAX code for every partition. A
mean_value has the data type DOUBLE PRECISION.
If Mean specifies only one value and ValueColumns
specifies multiple columns, then the specified value
applies to every value_column.
If Mean specifies multiple values, then it must
specify a value for each value_column. The nth
mean_value corresponds to the nth value_column.

Tip:
To specify a different global mean value for each
partition, use the multiple-input syntax and put
the values in the meanstats table.

Stdev Optional (single-input syntax only) Specifies the global standard deviation
values that the function uses to calculate the SAX code for every
partition. A stdev_value has the data type DOUBLE PRECISION and
its value must be greater than 0.
If Stdev specifies only one value and ValueColumns
specifies multiple columns, then the specified value
applies to every value_column.
If Stdev specifies multiple values, then it must
specify a value for each value_column. The nth
stdev_value corresponds to the nth value_column.


Tip:
To specify a different global standard deviation
value for each partition, use the multiple-input
syntax and put the values in the stdevstats table.

WindowSize Required if WindowType is 'sliding'; not allowed otherwise.
Specifies the size of the sliding window. The value must be an
integer greater than 0.
OutputFrequency Optional Specifies the number of data points that the window
slides between successive outputs. The value must be
an integer greater than 0. The default value is 1.

Note:
This argument applies only when the WindowType value is 'sliding'
and the Output value is not 'characters'. If WindowType is 'sliding'
and Output is 'characters', then OutputFrequency is automatically
set to the value of WindowSize, to ensure that a single character is
assigned to each time point. If the number of data points in the time
series is not an integer multiple of the window size, then the
function ignores the leftover parts.

PointsPerSymbol Optional Specifies the number of data points to be converted
into one SAX symbol. Each value must be an integer
greater than 0. The default value is 1.

Note:
This argument applies only when the WindowType value is 'global'.

SymbolsPerWindow Optional Specifies the number of SAX symbols to be
generated for each window. Each value must be an
integer greater than 0. The default value is the value
of WindowSize.

Note:
This argument applies only when the WindowType value is 'sliding'.

AlphabetSize Optional Specifies the number of symbols in the SAX
alphabet. The value must be an integer in the range
[2, 20]. The default value is 4.
BitmapLevel Optional Specifies the number of consecutive symbols to be
converted to one symbol on a bitmap. For bitmap
level 1, the bitmap contains the symbols 'a', 'b', 'c',
and so on; for bitmap level 2, the bitmap contains

the symbols 'aa', 'ab', 'ac', and so on. The input value
must be an integer in the range [1, 4]. The default
value is 2.

Note:
Output value must be 'bitmap'.

PrintCodeStats Optional Specifies whether the function prints the mean and
standard deviation. The default value is 'false'.

Note:
Output value must be 'string'.

Accumulate Optional Specifies the names of the input table columns that
are to appear in the output table. For each sequence
in the input table, SAX2 chooses the value
corresponding to the first time point in the sequence
to output as the accumulate value.

Input
The single-input version of the SAX2 function requires one input table, input.
The multiple-input version of the SAX2 function requires three input tables—input, meanstats, and
stdevstats.
The input table must have one or more columns that contain time series data to be transformed, and you
must specify their names with the ValueColumns argument. The input table can have other columns, but the
function ignores them unless you specify them with the TimeColumn or Accumulate argument. The
following table gives the valid data types for input table columns that you can specify with the
ValueColumns, TimeColumn, and Accumulate arguments.
Table 183: SAX2 Input Table Schema

Column Name Data Type Description


value_column BYTEINT, SMALLINT, INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION Contains time series data to be transformed, specified with the ValueColumns argument. The input table must have at least one value_column.
time_column BYTEINT, SMALLINT, INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION Contains the time axis of the data, specified with the TimeColumn argument.
accumulate_column Any Column to copy to the output table.

Both the meanstats and stdevstats tables must have every value_column and partition column (in key) that the input table has.
The meanstats table contains the global means of each value_column of the input table. Each row of the meanstats table specifies the global means for one input partition.
The stdevstats table contains the global standard deviations of each value_column of the input table. Each row of the stdevstats table specifies the global standard deviations for one input partition.
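For example, here is a minimal sketch of building the two statistics tables for an input table with one value_column, expenditure, partitioned by id (the table names my_input, my_meanstats, and my_stdevstats are illustrative; Example 5 computes the same statistics inline in the ON clauses):

-- One row per partition (id value); the statistic columns must have
-- the same names as the value_columns of the input table.
CREATE TABLE my_meanstats DISTRIBUTE BY HASH(id) AS
SELECT id, AVG (expenditure) AS expenditure
FROM my_input GROUP BY id;

CREATE TABLE my_stdevstats DISTRIBUTE BY HASH(id) AS
SELECT id, STDDEV (expenditure) AS expenditure
FROM my_input GROUP BY id;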

Output
The output table format depends on the values of the Output and WindowType arguments.
For 'string' or 'bytes' output, the output table has only one row for a 'global' window, but multiple rows for a
'sliding' window. The following table describes the output table columns. In the column names, n varies
from 1 to N.
Table 184: SAX2 'string' or 'bytes' Output Table Schema

Column Name Data Type Description


accumulate Any A column to be included in the output table, specified with the Accumulate argument.
start_time Numeric or time SQL data type The point on the time axis where the window starts.
end_time Numeric or time SQL data type The point on the time axis where the window ends.
sax_value_column_n VARCHAR for 'string', BYTE for 'bytes' The SAX code for the nth window; a string for 'string', a byte array for 'bytes'.
mean_n DOUBLE PRECISION The mean value used to calculate the SAX value for the nth window. This column appears only if the value of the PrintCodeStats argument is 'true', 'yes', 't', 'y', or '1'.
sd_n DOUBLE PRECISION The standard deviation value used to calculate the SAX value for the nth window. This column appears only if the value of the PrintCodeStats argument is 'true', 'yes', 't', 'y', or '1'.

For 'bitmap' output, the output table has only one row, whose columns are described in the following table.
In column names, n varies from 1 to N.
Table 185: SAX2 'bitmap' Output Table Schema

Column Name Data Type Description


accumulate Any A column to be included in the output table, specified with the Accumulate argument.
start_time Numeric or time SQL data type The point on the time axis where the window starts.
end_time Numeric or time SQL data type The point on the time axis where the window ends.
bitmap_value_column_n VARCHAR The bitmap value for the nth window; a JSON string containing key-value pairs of the form "xx":d, where "xx" is a SAX code and d is a value between 0.00 and 1.00.

For 'characters' output, each SAX symbol has its own row in the output table. You can input this output table
to the function HMMUnsupervisedLearner.
Table 186: SAX2 'characters' Output Table Schema

Column Name Data Type Description


accumulate Any A column to be included in the output table, specified with the Accumulate argument.
key VARCHAR The column on which the input table is partitioned.
start_time Numeric or time SQL data type The point on the time axis where the window starts.
end_time Numeric or time SQL data type The point on the time axis where the window ends.
char_value_column_n CHAR(1) The character value for the nth window.
mean_n DOUBLE PRECISION The mean value used to calculate the SAX value for the nth window. This column appears only if the value of the PrintCodeStats argument is 'true', 'yes', 't', 'y', or '1'.
sd_n DOUBLE PRECISION The standard deviation value used to calculate the SAX value for the nth window. This column appears only if the value of the PrintCodeStats argument is 'true', 'yes', 't', 'y', or '1'.
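For example, here is a minimal sketch (the target table name sax_characters is illustrative) that materializes the 'characters' output of Example 4 so that it can later be passed to HMMUnsupervisedLearner or queried directly:

-- sax_characters is a hypothetical name; the SAX2 call is the one
-- shown in Example 4.
CREATE TABLE sax_characters DISTRIBUTE BY HASH(id) AS
SELECT * FROM SAX2 (
ON sax_example AS INPUT
PARTITION BY id
ORDER BY id
ValueColumns ('expenditure', 'income', 'investment')
TimeColumn ('period')
Output ('characters')
WindowType ('global')
Accumulate ('id')
);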

Examples
These examples use seasonally adjusted quarterly financial data from West Germany between 1960 and
1982.

Input
The input table columns contain the decade (id), quarter (period), consumer expenditures (expenditure),
disposable income (income), and fixed investments (investment). Values are in billions of DM.
Table 187: SAX2 Examples Input Table sax_example

id period expenditure income investment


1 1960Q1 415 451 180
1 1960Q2 421 465 179
1 1960Q3 434 485 185
1 1960Q4 448 493 192
1 1961Q1 459 509 211
1 1961Q2 458 520 202
1 1961Q3 479 521 207
1 1961Q4 487 540 214
1 1962Q1 497 548 231
1 1962Q2 510 558 229
1 1962Q3 516 574 234
1 1962Q4 525 583 237
1 1963Q1 529 591 206
1 1963Q2 538 599 250
... ... ... ... ...

Example 1: Global Window, Default Output

SQL-MapReduce Call

SELECT * FROM SAX2 (
ON sax_example AS INPUT
PARTITION BY id
ORDER BY id
ValueColumns ('expenditure', 'income', 'investment')
TimeColumn ('period')
WindowType ('global')
PrintCodeStats ('true')
Accumulate ('id')
) ORDER BY id;

Output

Table 188: SAX2 Example 1 Output Table, Columns 1-5

id period_start period_end expenditure_saxcode expenditure_mean


1 1960Q1 1969Q4 aaaabbbbbbbbbbbbbbbbbbbbbbbbbbbcccccdddd 714.2
2 1970Q1 1979Q4 ddddddddaaaaaaaaaaaabbbbbbbbbbbccccccccd 1355.6
3 1980Q1 1982Q4 aaaacccddddd 2045.16666666667

Table 189: SAX2 Example 1 Output Table, Columns 6-9

expenditure_stdev income_saxcode income_mean income_stdev


385.0403375472 aaaabbbbbbbbbbbbbbbbbbbbbbbbbbbcccccdddd 811.2 449.682520486142
404.842280147309 ddddddddaaaaaaaaaaaabbbbbbbbbbcccccccccc 1589.275 470.779059045639
256.612489362793 aaaacccddddd 2387.41666666667 317.496587908175

Table 190: SAX2 Example 1 Output Table, Columns 10-12

investment_saxcode investment_mean investment_stdev


aaaaaaabbbbbabbbbbbbbcccccccbbbcbbccdddd 301 130.70871903117
ddddddddaaaaaaaabbbbbbbbcbbbcbbbbbbbbccc 556.675 142.199821035626
aaaacddccccc 759.083333333333 113.352594174375

Example 2: Sliding Window, Default Output

SQL-MapReduce Call

SELECT * FROM SAX2 (
ON sax_example AS INPUT
PARTITION BY id
ORDER BY id
ValueColumns ('expenditure')
TimeColumn ('period')
WindowType ('sliding')
WindowSize (20)
PrintCodeStats ('true')
Accumulate ('id')
) ORDER BY id;

Output

Table 191: SAX2 Example 2 Output Table

id period_start period_end expenditure_saxcode expenditure_mean expenditure_stdev


1 1960Q1 1964Q4 aaaaaabbbcccccdddddd 507.65 56.4654295728117

1 1960Q2 1965Q1 aaaaaabbbbcccccddddd 517.75 57.0833877576897
1 1960Q3 1965Q2 aaaaaabbbbccccdddddd 528.65 58.4341960084844
1 1960Q4 1965Q3 aaaaaabbbbbccccddddd 539.6 60.255071854662
1 1961Q1 1965Q4 aaaaabbbbbbccccddddd 550.6 62.6850103798008
1 1961Q2 1966Q1 aaaaaabbbbbccccddddd 561.6 65.0242060191514
1 1961Q3 1966Q2 aaaaaabbbbcccccddddd 573 65.8858662265366
1 1961Q4 1966Q3 aaaaaaabbbbcccdddddd 583.9 67.5284735266697
1 1962Q1 1966Q4 aaaaaaabbbbcccdddddd 593.95 67.3048641395409
... ... ... ... ... ...

Example 3: Sliding Window, Bitmap Output

SQL-MapReduce Call

SELECT * FROM SAX2 (
ON sax_example AS INPUT
PARTITION BY id
ORDER BY id
ValueColumns ('expenditure', 'income', 'investment')
TimeColumn ('period')
Output ('bitmap')
BitmapLevel (1)
WindowType ('sliding')
WindowSize (20)
Accumulate ('id')
) ORDER BY id;

Output

Table 192: SAX2 Example 3 Output Table

id period_start period_end expenditure_bitmap income_bitmap investment_bitmap


1 1960Q1 1969Q4 {"a":0.81,"b":1.0,"c":0.55,"d":0.87} {"a":0.79,"b":1.0,"c":0.54,"d":0.87} {"a":0.72,"b":1.0,"c":0.52,"d":0.9}
2 1970Q1 1979Q4 {"a":1.0,"b":0.98,"c":0.53,"d":0.93} {"a":1.0,"b":0.95,"c":0.52,"d":0.91} {"a":0.83,"b":1.0,"c":0.83,"d":0.79}
3 1980Q1 1982Q4 {"a":0.0,"b":0.0,"c":0.0,"d":0.0} {"a":0.0,"b":0.0,"c":0.0,"d":0.0} {"a":0.0,"b":0.0,"c":0.0,"d":0.0}

Example 4: Sliding Window, Character Output

SQL-MapReduce Call

SELECT * FROM SAX2 (
ON sax_example AS INPUT
PARTITION BY id
ORDER BY id
ValueColumns ('expenditure', 'income', 'investment')
TimeColumn ('period')
Output ('characters')
WindowType ('global')
Accumulate ('id')
) ORDER BY id;

Output

Table 193: SAX2 Example 4 Output Table

id value_column period_start period_end saxchar


1 expenditure 1960Q1 1960Q1 a
1 expenditure 1960Q2 1960Q2 a
1 expenditure 1960Q3 1960Q3 a
1 expenditure 1960Q4 1960Q4 a
1 expenditure 1961Q1 1961Q1 b
1 expenditure 1961Q2 1961Q2 b
1 expenditure 1961Q3 1961Q3 b
... ... ... ... ...

Example 5: Multiple-Input Version

In the multiple-input version, the global mean and standard deviation statistics are supplied through the meanstats and stdevstats tables, respectively.

SQL-MapReduce Call
The statistics are created from a query on the input table finance_data3 and grouped by column id.

SELECT * FROM SAX2 (
ON finance_data3 AS INPUT
PARTITION BY id
ORDER BY id
ON (SELECT id, AVG (expenditure) AS expenditure,
AVG (income) AS income,
AVG (investment) AS investment
FROM finance_data3 GROUP BY id) AS meanstats PARTITION BY id
ON (SELECT id, STDDEV (expenditure) AS expenditure,
STDDEV (income) AS income,
STDDEV (investment) AS investment
FROM finance_data3 GROUP BY id) AS stdevstats PARTITION BY id
ValueColumns ('expenditure', 'income', 'investment')
TimeColumn ('period')
WindowType ('global')
PrintCodeStats ('true')
Accumulate ('id')
) ORDER BY id;

Output

Table 194: SAX2 Example 5 Output Table (Columns 1-3)

id period_start period_end
1 1960Q1 1969Q4
2 1970Q1 1979Q4
3 1980Q1 1982Q4

Table 195: SAX2 Example 5 Output Table (Columns 4-6)

expenditure_saxcode expenditure_mean expenditure_stdev


aaaabbbbbbbbbbbbbbbbbbbbbbbbbbbcccccdddd 714.2 385.0403375472
ddddddddaaaaaaaaaaaabbbbbbbbbbbccccccccd 1355.6 404.842280147308
aaaacccddddd 2045.16666666667 256.612489362793

Table 196: SAX2 Example 5 Output Table (Columns 7-9)

income_saxcode income_mean income_stdev


aaaabbbbbbbbbbbbbbbbbbbbbbbbbbbcccccdddd 811.2 449.682520486142
ddddddddaaaaaaaaaaaabbbbbbbbbbcccccccccc 1589.275 470.779059045639
aaaacccddddd 2387.41666666667 317.496587908175

Table 197: SAX2 Example 5 Output Table (Columns 10-12)

investment_saxcode investment_mean investment_stdev


aaaaaaabbbbbabbbbbbbbcccccccbbbcbbccdddd 301 130.70871903117
ddddddddaaaaaaaabbbbbbbbcbbbcbbbbbbbbccc 556.675 142.199821035626
aaaacddccccc 759.083333333333 113.352594174375


SeriesSplitter

Summary
The SeriesSplitter function splits partitions into subpartitions (called splits) to balance the partitions for time
series manipulation. The function creates an additional column that contains split identifiers. Each row
contains the identifier of the split to which the row belongs. Optionally, the function also copies a specified
number of boundary rows to each split.

Background
In many real-world use cases, the data is greatly skewed across partitions (that is, some partitions contain
significantly more data than others). This is especially true in time series manipulation—a single partition in
the input table can contain a time series with billions of data points.
Sometimes the input table cannot be further partitioned with the PARTITION BY clause. The most
common reasons are:
• The table has no column or combination of columns that can be used to further partition the data.
• The table contains an ordered data set, and to analyze one row, a function must consider adjacent rows.
Simply slicing the table makes analysis of boundary data impossible. The boundary of each subpartition
must include duplicate rows from the neighboring partition.
One vworker must process an entire partition. Therefore, severe imbalance in the partitions causes severe
load imbalance across vworkers.

Usage

SeriesSplitter Syntax
Version 1.0

SELECT * FROM SeriesSplitter (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ({ table | view | (query) })
PartitionByColumns
({ 'partition_column' | 'partition_column_range' }[,...])
[ DuplicateRowsCount ('value' [,...]) ]
[ OrderByColumns
({ 'ordering_column' | 'ordering_column_range'}[,...]) ]
[ SplitCount ('split_count') ]
[ RowsPerSplit ('rows_per_split') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]

[ OutputTable ('output_table') ]
[ SplitIDColumn ('split_id_column') ]
[ StatsTable ('stats_table') ]
[ ReturnStatsTable
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OverwriteOutput
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ ValuesBeforeFirst ('value' [,...]) ]
[ ValuesAfterLast ('value' [,...]) ]
[ DuplicateColumn ('duplicate_column') ]
[ PartialSplitID
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the input table to be split.
PartitionByColumns Required Specifies the partitioning columns of input_table. These columns
determine the identity of a partition. For data type restrictions of
these columns, see the Aster Database documentation.
DuplicateRowsCount Optional Specifies the number of rows to duplicate across split boundaries.
By default, the function duplicates one row from the previous
partition and one row from the next partition. If you specify only
value1, then the function duplicates value1 rows from the previous
partition and value1 rows from the next partition. If you specify
both value1 and value2, then the function duplicates value1 rows
from the previous partition and value2 rows from the next
partition. Each argument value must be a nonnegative integer less
than or equal to 1000.
OrderByColumns Optional Specifies the ordering columns of input_table. These columns
establish the order of the rows and splits. Without this argument,
the function can split the rows in any order.
SplitCount Optional Specifies the desired number of splits in a partition of the output table. The value of split_count must be a positive BIGINT, and its upper bound is the number of rows in the partition. The default value is 4.
Base the value of split_count on the desired amount of parallelism. For example, for a cluster with 10 vworkers, make split_count a multiple of 10.
If the number of rows in input_table (n) is not exactly divisible by split_count, then the function estimates the number of splits in the partition, using this formula:
ceiling (n / ceiling (n / split_count))
For example, with n = 369 and split_count = 50, this formula gives ceiling (369 / 8) = 47 splits, as in Example 1.

Note:
If input_table has multiple partitions, then you cannot specify SplitCount. Instead, specify RowsPerSplit.
RowsPerSplit Optional Specifies the desired maximum number of rows in each split in the output table. If the number of rows in input_table is not exactly divisible by rows_per_split, then the last split contains fewer than rows_per_split rows, but no split contains more than rows_per_split rows. The value of rows_per_split must be a positive BIGINT. If input_table has multiple partitions and you do not specify RowsPerSplit, then the function uses the value 1000.

Note:
If input_table has multiple partitions, then specify RowsPerSplit instead of SplitCount.
Accumulate Optional Specifies the names of input_table columns (other than those
specified by PartitionByColumns and OrderByColumns) to copy
to the output table. By default, only the columns specified by
PartitionByColumns and OrderByColumns are copied to the
output table.
OutputTable Optional Specifies the name of table that the function creates to store the
data splits for all partitions. The default value is
'partitioned_input_table'. For example, if input_table is
'time_series', then output_table is 'partitioned_time_series'.
SplitIDColumn Optional Specifies the name for the output table column that is to contain
the split identifiers. The default value is 'split_id'. If the output
table has another column named split_id_column, then the
function returns an error. Therefore, if the output table has a
column named 'split_id' (specified by Accumulate,
PartitionByColumns, or OrderByColumns), then you must use
SplitIDColumn to specify a different split_id_column.
StatsTable Optional Specifies the name of table that the function creates to store the
statistics for the splitting operation that it performs. The default
value is 'stats_input_table'. For example, if input_table is
'time_series', then stats_table is 'stats_time_series'.
ReturnStatsTable Optional Specifies whether the function returns the data in stats_table in
response to the command SELECT * FROM SeriesSplitter.
The default value is 'true'. When this value is 'false', the function
returns only the data in output_table.

288 Teradata Aster Analytics Foundation User Guide


Chapter 3: Time Series, Path, and Attribution Analysis
SeriesSplitter

Argument Category Description


OverwriteOutput Optional Specifies whether the function overwrites stats_table and
output_table if they exist. The default value is 'false'.
ValuesBeforeFirst Optional If DuplicateRowsCount is nonzero and OrderByColumns is
specified, then ValuesBeforeFirst specifies the values to be stored
in the ordering columns that precede the first row of the first split
in a partition as a result of duplicating rows across split
boundaries.
If ValuesBeforeFirst specifies only one value and
OrderByColumns specifies multiple ordering columns, then the
specified value is stored in every ordering column.
If ValuesBeforeFirst specifies multiple values, then it must specify
a value for each ordering column. The value and the ordering
column must have the same data type. For the data type
VARCHAR, the values are case-insensitive.
The default values for different data types are:
• Numeric: -1
• CHAR(n) or VARCHAR: '-1'
• Date- or time-based: 1900-01-01 0:00:00
• CHARACTER: '0'
• Bit: 0
• Boolean: 'false'
• IP4: 0.0.0.0
• UUID: 0000-0000-0000-0000-0000-0000-0000-0000

ValuesAfterLast Optional If DuplicateRowsCount is nonzero and OrderByColumns is specified, then ValuesAfterLast specifies the values to be stored in the ordering columns that follow the last row of the last split in a partition as a result of duplicating rows across split boundaries.
If ValuesAfterLast specifies only one value and OrderByColumns
specifies multiple ordering columns, then the specified value is
stored in every ordering column.
If ValuesAfterLast specifies multiple values, then it must specify a
value for each ordering column. The value and the ordering
column must have the same data type. For the data type
VARCHAR, the values are case-insensitive.
The default value is NULL.
DuplicateColumn Optional Specifies the name of the column that indicates whether a row is
duplicated from the neighboring split. If the row is duplicated, this
column contains 1; otherwise it contains 0.
PartialSplitID Optional Specifies whether split_id_column contains only the numeric split identifier. The default value is 'false'.
If the value is 'true', then split_id_column contains a numeric representation of the split identifier that is unique within each partition. To distribute the output table by split, use a combination of all partitioning columns and split_id_column (see the sketch after this table).
If the value is 'false', then split_id_column contains a string representation of the split that is unique across all partitions. The function generates the string representation by concatenating the partitioning columns with the order of the split inside the partition (the numeric representation). In the string representation, hyphens separate partitioning column names from each other and from the order. For example, 'pcol1-pcol2-3'.
Domain Optional Specifies the IP address of the queen node. The default value is the
IP address of the queen node of the current cluster.
Database Optional Specifies the name of the database where input_table resides. The
default value is 'beehive'.
UserID Optional Specifies the Aster Database user name of the user. The default
value is 'beehive'.
Password Optional Specifies the Aster Database password of the user.
SSLSettings Optional Specifies the SSL connection information in a string, excluding the
SSL TrustStore password. Use this argument if you want the
function to connect to Aster Database with a JDBC SSL
connection instead of a normal JDBC connection. The function
appends SSLsettings to the SSL JDBC connection string.
SSLTrustStorePassword Optional Specifies the SSL TrustStore password. Specify this password if
and only if you specify the SSLSettings argument.
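For example, here is a minimal sketch (table names are illustrative) that redistributes the output table by split. With the default PartialSplitID ('false'), split_id is unique across all partitions and can serve as the distribution key by itself; with PartialSplitID ('true'), you would instead distribute by a combination of the partitioning columns and split_id_column:

-- partitioned_time_series is the default output_table name for an
-- input table named time_series (see the OutputTable argument).
CREATE TABLE time_series_by_split DISTRIBUTE BY HASH(split_id) AS
SELECT * FROM partitioned_time_series;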

Input
The input table is the table to be split, which is specified by the InputTable argument. The following table
describes the input_table columns that you can specify in function arguments.
Table 198: SeriesSplitter input_table Schema

Column Name Data Type Description


partition_column For data type restrictions of these columns, see the Aster Database documentation. A partitioning column of input_table, specified by the PartitionByColumns argument. The table can have more than one such column.
ordering_column Any An ordering column of input_table, specified by the OrderByColumns argument. The table can have more than one such column.
accumulate_column Any A column copied from input_table to the output table, specified by the Accumulate argument. The table can have more than one such column.

If the name of a table or column is case-sensitive, then enclose the name in either double quotation marks or
brackets. For example, "InputTable" or [InputTable].

Output
The SeriesSplitter function outputs two tables, the output table and the stats table.
Table 199: SeriesSplitter Output Table Schema

Column Name Data Type Description


partition_column Same as input column Partitioning column specified by the
PartitionByColumns argument. The output table has
a column for each partitioning column.
split_id_column INTEGER or VARCHAR Contains the split identifier, in the form specified by
the PartialSplitID argument.
ordering_column Same as input column Ordering column specified by the OrderByColumns
argument. The output table has a column for each
ordering column.
duplicate_column SMALLINT Duplicate column specified by the DuplicateColumn
argument. The output table has a column for each
duplicate column.

Table 200: SeriesSplitter Stats Table Schema

Column Name Data Type Description


statistics VARCHAR Contains these statistics for the splitting operation that the
function performed:
input_table_row_count
input_partition_count
output_split_count
inserted_row_count
output_table_row_count
processing_time_in_seconds
value BIGINT Contains the values for the statistics.
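For example, after a run whose input table is named time_series and that uses the default StatsTable name, a query such as the following (a sketch; the table name follows the default naming described for the StatsTable argument) displays the statistics:

SELECT * FROM stats_time_series ORDER BY statistics;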

Examples
• Input
• Example 1: Partition Splitter
• Example 2: Using SeriesSplitter with Interpolator

Input
The input table contains the daily IBM stock prices from 1961 to 1962.

Table 201: SeriesSplitter Example Input Table ibm_stock1

id name period stockprice


1 IBM 1961-05-17 00:00:00 460
1 IBM 1961-05-18 00:00:00 457
1 IBM 1961-05-19 00:00:00 452
1 IBM 1961-05-22 00:00:00 459
1 IBM 1961-05-23 00:00:00 462
1 IBM 1961-05-24 00:00:00 459
1 IBM 1961-05-25 00:00:00 463
1 IBM 1961-05-26 00:00:00 479
1 IBM 1961-05-29 00:00:00 493
1 IBM 1961-05-31 00:00:00 490
1 IBM 1961-06-01 00:00:00 492
1 IBM 1961-06-02 00:00:00 498
1 IBM 1961-06-05 00:00:00 499
1 IBM 1961-06-06 00:00:00 497
1 IBM 1961-06-07 00:00:00 496
1 IBM 1961-06-08 00:00:00 490
1 IBM 1961-06-09 00:00:00 489
1 IBM 1961-06-12 00:00:00 478
1 IBM 1961-06-13 00:00:00 487
1 IBM 1961-06-14 00:00:00 491
... ... ... ...

Example 1: Partition Splitter

SQL-MapReduce Call

SELECT * FROM SeriesSplitter (
ON (SELECT 1) PARTITION BY 1
InputTable ('ibm_stock1')
OutputTable ('ibm_stock1_split')
PartitionByColumns ('id')
OrderByColumns ('period')
SplitCount (50)
Accumulate ('stockprice')
) ORDER BY statistics;

Output

Table 202: SeriesSplitter Example 1 Stats Table

statistics value
input_table_row_count 369
input_partition_count 1
output_split_count 47
inserted_row_count 94
output_table_row_count 463
processing_time_in_seconds 5

The query below returns the output shown in the following table:

SELECT * FROM ibm_stock1_split ORDER BY id, split_id;

Table 203: SeriesSplitter Example 1 Output Table ibm_stock1_split

split_id id period stockprice


1-0 1 1961-05-18 00:00:00 457
1-0 1 1961-05-17 00:00:00 460
1-0 1 1961-05-22 00:00:00 459
1-0 1 1961-05-23 00:00:00 462
1-0 1 1961-05-24 00:00:00 459
1-0 1 1961-05-19 00:00:00 452
1-0 1 1961-05-25 00:00:00 463
1-0 1 1961-05-26 00:00:00 479
1-0 1 1900-01-01 00:00:00
1-0 1 1961-05-29 00:00:00 493
1-1 1 1961-05-26 00:00:00 479
1-1 1 1961-05-31 00:00:00 490
1-1 1 1961-06-02 00:00:00 498
1-1 1 1961-06-05 00:00:00 499
1-1 1 1961-06-06 00:00:00 497
1-1 1 1961-06-07 00:00:00 496
1-1 1 1961-06-08 00:00:00 490
1-1 1 1961-06-01 00:00:00 492
1-1 1 1961-06-09 00:00:00 489
1-1 1 1961-05-29 00:00:00 493
1-10 1 1961-09-18 00:00:00 543
1-10 1 1961-09-08 00:00:00 541
1-10 1 1961-09-15 00:00:00 547
1-10 1 1961-09-14 00:00:00 549
1-10 1 1961-09-13 00:00:00 545
1-10 1 1961-09-12 00:00:00 549
1-10 1 1961-09-11 00:00:00 545
1-10 1 1961-09-21 00:00:00 532
1-10 1 1961-09-20 00:00:00 539
1-10 1 1961-09-19 00:00:00 540
1-11 1 1961-09-20 00:00:00 539
1-11 1 1961-10-02 00:00:00 541
1-11 1 1961-09-29 00:00:00 541
1-11 1 1961-09-28 00:00:00 538
1-11 1 1961-09-27 00:00:00 542
1-11 1 1961-09-26 00:00:00 540
1-11 1 1961-09-25 00:00:00 527
1-11 1 1961-09-22 00:00:00 517
1-11 1 1961-09-21 00:00:00 532
1-11 1 1961-10-03 00:00:00 547
... ... ... ...

Example 2: Using SeriesSplitter with Interpolator


This example shows how to use the SeriesSplitter function with the time series manipulation function
Interpolator.

SQL-MapReduce Calls
There are two ways to use Interpolation with SeriesSplitter:
• Call SeriesSplitter as in Example 1: Partition Splitter to create ibm_stock_split and then call Interpolator:

SELECT * FROM Interpolator (
ON ibm_stock1_split AS input_table
PARTITION BY id ORDER BY "period"
TimeColumn ('period')
TimeInterval (86400)
InterpolationType ('linear')
ValueColumns ('stockprice')
Accumulate ('id')
DuplicateRowsCount (2)
);
• Combine the calls to SeriesSplitter and Interpolator:

SELECT * FROM Interpolator (
ON (SELECT * FROM SeriesSplitter (
ON (SELECT 1) PARTITION BY 1
InputTable ('ibm_stock1')
OutputTable ('ibm_stock1_split')
PartitionByColumns ('id')
OrderByColumns ('period')
SplitCount (50)
Accumulate ('stockprice')
ReturnStatsTable ('false')
)
) AS input_table PARTITION BY id ORDER BY period
TimeColumn ('period')
TimeInterval (86400)
InterpolationType ('linear')
ValueColumns ('stockprice')
Accumulate ('id')
DuplicateRowsCount (2)
);

The first choice, using separate SQL-MapReduce calls for SeriesSplitter and the function that uses it,
provides better performance than the second choice.

Troubleshooting
Problem: Invoking a function using SeriesSplitter does not improve execution time.

Note:
Before trying workarounds, ensure that the data is skewed and that the function that uses SeriesSplitter
does not exploit full parallelism. If the data is not skewed and the function exploits full parallelism, then
SeriesSplitter cannot improve its execution time.

Workaround:
• Invoke SeriesSplitter and the subsequent function in separate SQL-MapReduce calls (as in the first choice
in Example 2: Using SeriesSplitter with Interpolator), rather than using SeriesSplitter in the ON clause of
the subsequent function (as in the second choice in Example 2: Using SeriesSplitter with Interpolator).
• Adjust these arguments as follows:
∘ DuplicateRowsCount: as low as possible
∘ SplitCount: a smaller multiple (for example, 1) of the number of vworkers in the cluster
∘ RowsPerSplit: as high as possible (you want the resulting number of splits to be a smaller multiple of
the number of vworkers in the cluster)
∘ Accumulate: specify as few columns as possible
∘ DuplicateColumn: omit this argument

∘ PartialSplitID: 'true'
∘ ReturnStatsTable: 'true'

Sessionize

Summary
The Sessionize function maps each click in a session to a unique session identifier. A session is defined as a
sequence of clicks by one user that are separated by at most n seconds.
The function is useful both for sessionization and for detecting web crawler (“bot”) activity. It is typically
used to understand user browsing behavior on a web site.

Background
Sessionize is a SQL-MapReduce function. Sample code is included with the Aster SQL-MapReduce Java API.

Usage

Sessionize Syntax
Version 1.3

SELECT * FROM Sessionize (
ON { table_name | view_name | (query) }
PARTITION BY expression [,...]
ORDER BY order_column [,...]
TimeColumn ('timestamp_column')
TimeOut (session_timeout)
[ ClickLag (min_human_click_lag) ]
[ EmitNull ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]);

Arguments
Argument Category Description
TimeColumn Required Specifies the name of the input column that contains the click times.

Note:
The timestamp_column must also be an order_column.

TimeOut Required Specifies the number of seconds at which the session times out. If session_timeout seconds elapse after a click, then the next click starts a new session. The data type of session_timeout is DOUBLE PRECISION.
ClickLag Optional Specifies the minimum number of seconds between clicks for the
session user to be considered human. If clicks are more frequent,
indicating that the user is a “bot,” the function ignores the session. The
min_human_click_lag must be less than session_timeout. The data type
of min_human_click_lag is DOUBLE PRECISION.
EmitNull Optional Specifies whether to output rows that have NULL values in their
session id and rapid fire columns, even if their timestamp_column has a
NULL value. The default value is 'false'.

Input
The input table must have a timestamp column and columns by which to partition and order the data. Input
data must be partitioned such that each partition contains all rows of an entity. No input column can have
the name 'sessionid' or 'clicklag', because these are output column names.
Table 204: Sessionize Input Table Schema

Column Name Data Type Description


timestamp_column TIME, TIMESTAMP, INTEGER, BIGINT, or SMALLINT Contains the click times. If the data type is INTEGER, BIGINT, or SMALLINT, then the function treats the values as milliseconds.

To create a single timestamp column from separate date and time columns:

SELECT (datecolumn || ' ' || timecolumn)::timestamp AS mytimestamp
FROM table;

Output
Table 205: Sessionize Output Table Schema

Column Name Data Type Description


input_column Same as in input table Column copied from input table. The function copies every input table column to the output table.
sessionid INTEGER or BIGINT Contains the identifiers that the function assigned to the sessions.
clicklag BOOLEAN Contains 't' if the interval between the click and the previous click is less than min_human_click_lag (rapid-fire clicking), 'f' otherwise.
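For example, here is a minimal follow-on query (assuming the Sessionize output has been stored in a hypothetical table sessionized_clicks and that, as in the example below, the input has a userid column) that counts the sessions of each user:

SELECT userid, COUNT (DISTINCT sessionid) AS session_count
FROM sessionized_clicks
GROUP BY userid;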

Example

Input
The input table is web clickstream data recorded as a user navigates through a web site. Events—view, click,
and so on—are recorded with a timestamp.
Table 206: Sessionize Example Input Table adweb_clickstream

userid adid productid event clicktime


1039 2 1001 view 2009-04-21 13:17:59
1039 2 1001 view 2009-04-21 13:17:59
1039 2 1001 view 2009-04-21 13:17:59
1039 3 1001 view 2009-05-23 13:17:59
1039 3 1001 view 2009-05-23 13:17:59
1039 3 1001 view 2009-05-23 13:17:59
1039 4 1001 view 2009-07-16 11:17:59
1039 4 1001 view 2009-07-16 11:17:59
1039 4 1001 view 2009-07-16 11:17:59
1039 4 1001 click 2009-07-16 11:18:16
1039 4 1001 click 2009-07-16 11:18:16
1039 4 1001 click 2009-07-16 11:18:16
1039 4 1001 landing_page 2009-07-16 11:18:18
1039 4 1001 landing_page 2009-07-16 11:18:18
1039 4 1001 landing_page 2009-07-16 11:18:18
... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM SESSIONIZE (
ON adweb_clickstream
PARTITION BY userid
ORDER BY clicktime
TimeColumn ('clicktime')
TimeOut ('60')
ClickLag ('0.2')
) ORDER BY userid, clicktime;

Output
Table 207: Sessionize Example Output Table

userid adid productid event clicktime sessionid clicklag


1039 2 1001 view 2009-04-21 13:17:59 0 f
1039 2 1001 view 2009-04-21 13:17:59 0 t
1039 2 1001 view 2009-04-21 13:17:59 0 t
1039 3 1001 view 2009-05-23 13:17:59 1 f
1039 3 1001 view 2009-05-23 13:17:59 1 t
1039 3 1001 view 2009-05-23 13:17:59 1 t
1039 4 1001 view 2009-07-16 11:17:59 2 f
1039 4 1001 view 2009-07-16 11:17:59 2 t
1039 4 1001 view 2009-07-16 11:17:59 2 t
1039 4 1001 click 2009-07-16 11:18:16 2 f
1039 4 1001 click 2009-07-16 11:18:16 2 t
1039 4 1001 click 2009-07-16 11:18:16 2 t
1039 4 1001 landing_page 2009-07-16 11:18:18 2 f
1039 4 1001 landing_page 2009-07-16 11:18:18 2 t
1039 4 1001 landing_page 2009-07-16 11:18:18 2 t
1039 1 1001 view 2009-07-29 20:17:59 3 f
1039 1 1001 view 2009-07-29 20:17:59 3 t
1039 1 1001 view 2009-07-29 20:17:59 3 t
1039 5 1001 view 2009-08-19 22:17:59 4 f
1039 5 1001 view 2009-08-19 22:17:59 4 t
1039 5 1001 view 2009-08-19 22:17:59 4 t
1039 5 1001 click 2009-08-19 22:18:02 4 f
1039 5 1001 click 2009-08-19 22:18:02 4 t
1039 5 1001 click 2009-08-19 22:18:02 4 t
... ... ... ... ... ... ...

Shapelet Functions
• UnsupervisedShapelet, which takes a set of time series and assigns them to clusters, based on the
shapelets that it finds.

• SupervisedShapeletTrainer, which takes a set of classified time series and outputs a model for classifying
time series, based on the shapelets that it finds.
• SupervisedShapeletClassifier, which takes a set of time series and assigns them to clusters, based on the
model output by SupervisedShapeletTrainer.

Overview
Any classification task that must preserve ordering can be characterized as time-series classification. Many
real-world use cases involve data that varies only slightly. Traditional classifiers may be unable to classify
such data with high precision.
Shapelets are contiguous subsequences of a time series that identify a class with high accuracy. Because
shapelets focus on local features of a time series, they can be more accurate and faster than other time-series
classification methods. In many applications, shapelets also have been found to identify interpretable results,
thus providing useful insights into differences between classes.
The most common use cases are long-term trends with small local pattern changes that distinguish trends
from each other. Almost any time-series classification problem can be mapped to a shapelets discovery
problem. For example:
• Clickstream analysis
• Scientific or health applications such as ECG analysis
• Imaging applications such as gesture recognition or motion analysis
• Manufacturing applications such as process anomaly detection
• Financial applications such as stock price analysis
Before a shapelets function classifies or clusters a set of time series, it normalizes and SAX-encodes them.
Normalization is required because shapelet classification depends on the distance between two time series.
SAX-encoding makes patterns in the data easier to identify and compare. For more information about SAX-
encoding, refer to SAX2.
The following references explain in detail how shapelets are identified. Aster Analytics’ implementation of
shapelets is based on the fast shapelet finder algorithm published by Rakthanmanon. The unsupervised
shapelet implementation is based on the scalable unsupervised-shapelet algorithm published by Ulanova.
• L. Ye, E. Keogh. Time Series Shapelets: A New primitive for Data Mining, KDD 2009
• T. Rakthanmanon, E. Keogh. Fast Shapelets: A scalable algorithm for discovering time series shapelets,
SIAM 2013.
• J. Zakaria, A. Mueen, E. Keogh. Clustering Time Series using Unsupervised-Shapelets.
• L. Ulanova, N. Begum, E. Keogh. Scalable Clustering of Time Series with U-Shapelets

UnsupervisedShapelet

Summary
The UnsupervisedShapelet function takes a set of time series and assigns them to clusters, based on the
u_shapelets that it finds. The function uses these steps:
1. Saxify the input data (as described in SAX2).
2. Apply random masking to the input data.

3. Use statistics to find u_shapelets.
4. Use scores to find u_shapelets.
5. Use the u_shapelets to cluster the times series.

Usage
The following helper functions must be installed to run UnsupervisedShapelet. You can install these
functions with the command \install filename.ext.
• sax2.zip
• UshapeletMasker.zip
• UshapeletInTimeseries.zip
• UshapeletTSDistance.zip
• UshapeletFinderByScore.zip
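For example, issuing one command per file:

\install sax2.zip
\install UshapeletMasker.zip
\install UshapeletInTimeseries.zip
\install UshapeletTSDistance.zip
\install UshapeletFinderByScore.zip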

UnsupervisedShapelet Syntax
Version 1.0

SELECT * FROM UnsupervisedShapelet (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
[ OutputTable ('output_table') ]
[ OverwriteOutput
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
TimeColumn ('time_column')
ValueColumn ('value_column')
SaxWindowSize ('window_size')
[ SaxSymbolsPerWindow ('symbols_per_window') ]
[ SaxOutputFrequency ('gap_between_windows') ]
ID ('id_column')
[ RandomProjections ('projections') ]
[ Threshold ('threshold') ]
[ MaxNumIter ('max_iterations') ]
[ ShapeletCutOff ('cut_off') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input data.
OutputTable Optional Specifies the name for the output table that the function creates.
The default name is output_table.
OverwriteOutput Optional Specifies whether to overwrite output_table, if it exists. The default
value is 'false'.
TimeColumn Required Specifies the name of the input table column that contains the
time series.
ValueColumn Required Specifies the name of the input table column that contains the data
point values.
SaxWindowSize Required Specifies the SAX2 argument WindowSize, which specifies the size
of the sliding window. The window_size must be an INTEGER in
the range [1, 1000000].
SaxSymbolsPerWindow Optional Specifies the SAX2 argument SymbolsPerWindow, which specifies
the number of SAX code symbols to generate from a window. The
symbols_per_window must be an INTEGER in the range
[1, 1000000]. If symbols_per_window is greater than window_size,
the function changes symbols_per_window to window_size. By
default, symbols_per_window is the same as window_size.
SaxOutputFrequency Optional Specifies the SAX2 argument OutputFrequency, which specifies
the number of data points to skip between successive sliding
windows. The gap_between_windows must be an integer in the
range [1, 1000]. The default value is 1. A smaller value increases
accuracy (the chance of distinguishing time series from each
other) at the cost of higher execution time.
ID Required Specifies the name of the input table column that contains the
unique identity of a time series.
RandomProjections Optional Specifies the number of iterations required for random masking of
SAX words during u_shapelets selection. The projections must be
an INTEGER in the range [1, 30]. The default value is 10.
Specifying a greater projections value for a longer input time series
increases the probability of identifying better u_shapelets at the
cost of higher execution time.
Threshold Optional Specifies the value at which the function stops iterating. Iteration i ends when score_i / score_(i-1) is less than threshold. The threshold must be a DOUBLE PRECISION value in the range (0, 1). The default value is 0.5.
MaxNumIter Optional Specifies the number of iterations at which the function stops. The
max_iterations must be an INTEGER in the range [1, 50]. The
default value is 10.



ShapeletCutOff Optional Specifies the percentage of u-shapelets to use. The cut_off must be
a DOUBLE PRECISION value in the range (0, 1]. The default
value is 0.1 (10%).

Input
The function uses the input table columns time_column and value_column for saxification. They correspond
to the time_column and value_column in input for SAX2.
Input time series data must be correctly formatted; otherwise, function behavior is undefined. In a correctly
formatted time series, time intervals are evenly spaced and all time intervals have numeric values.
To calculate missing values in a time series, use the function Interpolator. If the input table’s time column is
text-based, create a new input table with an integer-based time column. For example, suppose that the table
time_series_text_time has the text-based columns idval, timeval, valueval, and catval. This statement creates
a table that the function accepts as input:

CREATE TABLE time_series_numeric_time AS SELECT
idval,
RANK() OVER (PARTITION BY idval ORDER BY timeval) AS timeval,
valueval,
catval
FROM time_series_text_time;

Table 208: UnsupervisedShapelet Input Table Schema

Column Name Data Type Description


id Any Time series identifier.
time_column BYTEINT, SMALLINT, INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION Contains the time axis of the data.
value_column BYTEINT, SMALLINT, INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION Contains time series data to be transformed.

Output
The function outputs a message and a table.

Table 209: UnsupervisedShapelet Output Message Schema

Column Name Data Type Description


statistics VARCHAR Name of the output table, number of clusters, and number of time
series clustered.

Table 210: UnsupervisedShapelet Output Table Schema

Column Name Data Type Description


id Same as in input table Time series identifier from input table.
cluster_label Same as in input table Cluster label.

Example

Input
The input table has 10 price observations for four stocks. The column period contains time values,
represented by integers. Because the function is unsupervised, it ignores the stock_category column;
however, you can use that column to verify that the generated clusters belong to the same category.
Table 211: UnsupervisedShapelet Example Input Table ushapelets_input

stockid period stockprice stock_category


1 22418 460 Technology
1 22419 457 Technology
1 22420 452 Technology
1 22421 459 Technology
1 22422 462 Technology
1 22423 459 Technology
1 22424 463 Technology
1 22425 479 Technology
1 22426 493 Technology
1 22427 490 Technology
2 22418 66.62 Healthcare
2 22419 66.87 Healthcare
2 22420 67 Healthcare
2 22421 67.25 Healthcare
2 22422 65.88 Healthcare



2 22423 66.12 Healthcare
2 22424 66.5 Healthcare
2 22425 67.75 Healthcare
2 22426 67.5 Healthcare
2 22427 67.25 Healthcare
3 22418 66.42 Healthcare
3 22419 66.87 Healthcare
3 22420 66.6 Healthcare
3 22421 67.15 Healthcare
3 22422 65.68 Healthcare
3 22423 65.92 Healthcare
3 22424 66.7 Healthcare
3 22425 67.75 Healthcare
3 22426 68 Healthcare
3 22427 66.95 Healthcare
4 22418 489 Technology
4 22419 487 Technology
4 22420 485 Technology
4 22421 489 Technology
4 22422 490 Technology
4 22423 496 Technology
4 22424 497 Technology
4 22425 499 Technology
4 22426 498 Technology
4 22427 497 Technology

The following figure is a graphic representation of the input data.

Figure 9: UnsupervisedShapelet Example Input Data

In the time period shown, technology stocks 1 and 4 have similar price trajectories, as do healthcare stocks 2
and 3.

SQL-MapReduce Call

SELECT * FROM UnsupervisedShapelet (
ON (SELECT 1) PARTITION BY 1
InputTable ('ushapelets_input')
OutputTable ('uss_output')
ID ('stockid')
TimeColumn ('period')
ValueColumn ('stockprice')
SAXWindowSize ('5')
OverwriteOutput ('true')
) ORDER BY 1;

Output
Table 212: UnsupervisedShapelet Example Output Message

statistics
Unsupervised shapelets table created: "uss_output"
number of clusters : 1
number of timeseries : 4

The function assigned technology stocks 1 and 4 to cluster 0 and healthcare stocks 2 and 3 to cluster 1.

Table 213: UnsupervisedShapelet Example Output Table uss_output

stockid cluster_label
3 0
1 0
4 0
2 0

Troubleshooting

Problem: The function runs slowly for large input data sets.
For a large input data set, the function might run very slowly, spending a lot of time on one step, or it might
terminate with a failure message on the console. Consult the logs for error messages and troubleshooting
information.

Workarounds:
• Improve the execution time of the saxification step, in any of the following ways:
∘ Increase SaxWindowSize argument value.
∘ Increase the SaxOutputFrequency argument value.
∘ Decrease the SaxSymbolsPerWindow argument value.
• Decrease the number of masking operations by decreasing the RandomProjections argument value.
• Decrease the number of iterations by decreasing the MaxNumIter argument value.
• Decrease the number of u_shapelets for clustering by decreasing the ShapeletCutOff argument value.
• Increase the Threshold argument value.

Problem: Clustering accuracy is not good enough.


The function might complete successfully, but the clustering accuracy might be low.

Workarounds:
• Improve the accuracy of the saxification step, in any of the following ways:
∘ Decrease the SaxWindowSize argument value.
∘ Decrease the SaxOutputFrequency argument value.
∘ Increase the SaxSymbolsPerWindow argument value.
• Increase the number of masking operations by increasing the RandomProjections argument value.
• Increase the number of iterations by increasing the MaxNumIter argument value.
• Increase the number of u_shapelets for clustering by increasing the ShapeletCutOff argument value.
• Decrease the Threshold argument value.


SupervisedShapeletTrainer

Summary
The SupervisedShapeletTrainer function takes a set of classified time series and outputs a model for
classifying time series, based on the shapelets that it finds. The model is input to the function
SupervisedShapeletClassifier.

Usage
The following helper functions must be installed to run SupervisedShapeletTrainer. You can install these
functions with the command \install filename.ext.
• sax2.zip
• ShapeletMasker2.zip
• ShapeletCollisionCounter2.zip
• ShapeletPowerFinder2.zip
• ShapeletCandidateFinder2.zip
• ShapeletCandidateScoring2.zip

SupervisedShapeletTrainer Syntax
Version 1.1

SELECT * FROM SupervisedShapeletTrainer (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_data_table')
[ CategoryTable ('input_categories_table') ]
IdColumn ('id_column')
TimeColumn ('time_column')
ValueColumn ('value_column')
CategoryColumn ('category_column')
[ SaxSymbolsPerWindow ('symbols_per_window') ]
[ SaxMinWindowSize ('min_window_size') ]
[ SaxMaxWindowSize ('max_window_size') ]
[ SaxOutputFrequency ('gap_between_windows') ]
[ ModelTable ('output_model_table') ]
[ OverwriteOutput
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ RandomProjections ('projections') ]
[ ShapeletCount ('num_shapelets') ]
[ TimeInterval ('num_data_points') ]
[ Seed ('seed') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input data.
CategoryTable Optional Specifies the name of the table that contains the categories
(classes) for the time series in input_data_table. The default
value is input_data_table.
If input_categories_table is different from input_data_table,
the function ignores any time series that is not in both
input_categories_table and input_data_table. If a time series
is represented by multiple rows in input_categories_table,
these rows must contain the same category; otherwise, the
function might not select the correct category.
IDColumn Required Specifies the name of the column in input_data_table and
input_categories_table that contains the unique identity of a
time series.
TimeColumn Required Specifies the name of the input_data_table column that
contains the time axis of the data.
ValueColumn Required Specifies the name of the input_data_table column that
contains the data points.
CategoryColumn Required Specifies the name of the input_categories_table column that
contains the category (class) of the time series.
SaxSymbolsPerWindow Optional Specifies the SAX2 argument SymbolsPerWindow, which specifies the number of SAX code symbols to generate from a window. The symbols_per_window must be an INTEGER in the range [1, 1000000]. The default value is 10. If symbols_per_window is greater than the length of the shortest time series in the input data set (d), then its value becomes d.
SaxMinWindowSize Optional Specifies the SAX2 argument WindowSize, which specifies the size of the sliding window. The min_window_size defines the length (number of data points) of the shortest shapelet; the minimum span (time series length) used to distinguish two time series from each other. The min_window_size must be an integer in the range [1, 1000000]. The default value is 10. If min_window_size is greater than the length of the shortest time series in the input data set (d), then its value becomes d. If min_window_size is smaller than symbols_per_window, then its value becomes symbols_per_window.
SaxMaxWindowSize Optional Specifies the SAX2 argument WindowSize, which specifies the size of the sliding window. The max_window_size defines the length of the longest shapelet; the maximum span used to distinguish two time series from each other. The max_window_size must be an integer in the range [1, 1000000] that is greater than or equal to min_window_size. The default value is 70. If max_window_size is greater than the length of the shortest time series in the input data set (d), then its value becomes d.
A greater difference between min_window_size and max_window_size increases the probability of identifying better shapelets at the cost of higher execution time. The function uses this formula to compute the number of sliding windows, n:
n = ((max_window_size - min_window_size) / symbols_per_window) + 1
The maximum value of n is 20.
SaxOutputFrequency Optional Specifies the SAX2 argument OutputFrequency, which specifies
the number of data points to skip between successive sliding
windows. The gap_between_windows must be an integer in
the range [1, 1000]. The default value is 10. A smaller value
increases accuracy (the chance of distinguishing time series
from each other) at the cost of higher execution time.
ModelTable Optional Specifies the name of the output model table that contains
trained shapelets. The default output_model_table is
"shapelet_model".
OverwriteOutput Optional Specifies whether to overwrite output_model_table, if it
exists. The default value is 'false'.
RandomProjections Optional Specifies the number of iterations required for random
masking of SAX words during shapelet training. The
projections must be an INTEGER in the range [1, 40]. The
default value is 10.
Specifying a greater projections value for a longer input time series
increases the probability of identifying better shapelets at the
cost of higher execution time.
ShapeletCount Optional Specifies the maximum number of shapelets in the output
model table. The num_shapelets must be an INTEGER in
the range [1, 100000]. The default value is 20.
TimeInterval Optional Specifies the number of data points in a time series to skip between consecutive time series windows when calculating the distance of a shapelet from a time series (see the formula sketch after this table).
The function builds a shapelet classification tree based on the distance of a shapelet from the time series data. Because a shapelet is typically much smaller than a complete time series, the function calculates the distance of a shapelet from a time series by sliding the shapelet across time series windows of shapelet length, calculating the distance between the shapelet and each window, and then selecting the smallest distance.
The num_data_points is the number of data points to skip when sliding from one time series window to the next. The num_data_points must be an INTEGER in the range [1, 1000000]. The value 1 gives optimal results at the cost of higher execution time. The default value is 10.
Seed Optional Specifies the seed value that the function uses internally to generate random numbers. The seed must be an INTEGER in the range [1, 100000]. The default value is 23.

Input
The input table input_data_table has the same schema as UnsupervisedShapelet Input.

Output
The function outputs a message and a model table.
Table 214: SupervisedShapeletTrainer Output Message Schema

Column Name Data Type Description


statistics VARCHAR Name of the model table, number of shapelets, and number of
rows.

Table 215: SupervisedShapeletTrainer Model Table Schema

Column Name Data Type Description


shapelet_id INTEGER Unique identifier of a shapelet.
time_instant INTEGER Column that represents the time axis of the shapelet. Time
values are represented as integers, regardless of the data type of
time_column in input_data_table.
value Same as value_column in input_data_table Column that represents the data point values of the shapelet.
split_value DOUBLE PRECISION Column that represents the split value of the shapelet.
left_child_id INTEGER shapelet_id of the left child of the shapelet in the shapelet tree.

right_child_id INTEGER shapelet_id of the right child of the shapelet in the shapelet tree.
left_class VARCHAR Class (category) identifier of the left child of the shapelet.
right_class VARCHAR Class (category) identifier of the right child of the shapelet.

Example

Input
The input table has 10 price observations for four stocks. The column period contains time values,
represented by integers.
Table 216: SupervisedShapeletTrainer Example Input Table shapelets_train

stockid period stockprice stock_category


1 22418 460 Technology
1 22419 457 Technology
1 22420 452 Technology
1 22421 459 Technology
1 22422 462 Technology
1 22423 459 Technology
1 22424 463 Technology
1 22425 479 Technology
1 22426 493 Technology
1 22427 490 Technology
2 22418 492 Technology
2 22419 498 Technology
2 22420 499 Technology
2 22421 497 Technology
2 22422 496 Technology
2 22423 490 Technology
2 22424 489 Technology
2 22425 478 Technology
2 22426 487 Technology
2 22427 491 Technology
3 22418 68.2502 Healthcare

3 22419 67.7501 Healthcare
3 22420 68.375 Healthcare
3 22421 67.1251 Healthcare
3 22422 67.1251 Healthcare
3 22423 66 Healthcare
3 22424 66.5002 Healthcare
3 22425 64.1251 Healthcare
3 22426 64.7501 Healthcare
3 22427 65.375 Healthcare
4 22418 66.625 Healthcare
4 22419 66.8746 Healthcare
4 22420 67.0003 Healthcare
4 22421 67.2499 Healthcare
4 22422 65.8752 Healthcare
4 22423 66.1248 Healthcare
4 22424 66.5002 Healthcare
4 22425 67.7501 Healthcare
4 22426 67.4995 Healthcare
4 22427 67.2499 Healthcare

SQL-MapReduce Call

SELECT * FROM SupervisedShapeletTrainer (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('shapelets_train')
  IdColumn ('stockid')
  TimeColumn ('period')
  ValueColumn ('stockprice')
  CategoryColumn ('stock_category')
  SaxMinWindowSize (3)
  SaxMaxWindowSize (6)
  SaxOutputFrequency (1)
  SaxSymbolsPerWindow (3)
  RandomProjections (10)
  TimeInterval (1)
  ModelTable ('shapelets_model')
  OverwriteOutput ('true')
);

Output
Table 217: SupervisedShapeletTrainer Example Output Message

statistics
shapelet model table created : "shapelets_model"
number of shapelets : 1
number of rows : 6

This query returns the following table:

SELECT * FROM shapelets_model ORDER BY 1, 2;

Table 218: SupervisedShapeletTrainer Example Model Table shapelets_model

shapelet_id time_instant value split_value left_child_id right_child_id left_class right_class


1 1 67.1251 1.21809 0 0 Healthcare Technology
1 2 66 1.21809 0 0 Healthcare Technology
1 3 66.5002 1.21809 0 0 Healthcare Technology
1 4 64.1251 1.21809 0 0 Healthcare Technology
1 5 64.7501 1.21809 0 0 Healthcare Technology
1 6 65.375 1.21809 0 0 Healthcare Technology

Troubleshooting

Problem: The function runs slowly for large input data sets.
For a large input data set, the function might run very slowly, spending a lot of time on one step, or it might
terminate with a failure message on the console. Consult the logs for error messages and troubleshooting
information.

Workarounds:
• Improve the execution time of the saxification step, in any of the following ways:
∘ Decrease the difference between the SaxMinWindowSize and SaxMaxWindowSize argument values.
∘ Increase the SaxOutputFrequency argument value.
∘ Decrease the SaxSymbolsPerWindow argument value.
• Decrease the number of masking operations by decreasing the RandomProjections argument value.
• Decrease the number of shapelets in the output table by decreasing the ShapeletCount argument value.
• Increase the number of data points to skip between consecutive time series windows when calculating the
distance of a shapelet from a time series by increasing the TimeInterval argument value.
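For example, starting from the example call earlier in this section, a faster (and typically less accurate) configuration might look like the following sketch. The adjusted values are illustrative only and must be tuned for your data:

SELECT * FROM SupervisedShapeletTrainer (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('shapelets_train')
  IdColumn ('stockid')
  TimeColumn ('period')
  ValueColumn ('stockprice')
  CategoryColumn ('stock_category')
  SaxMinWindowSize (3)
  SaxMaxWindowSize (4)      -- smaller window-size range
  SaxOutputFrequency (2)    -- skip more points between sliding windows
  SaxSymbolsPerWindow (3)
  RandomProjections (5)     -- fewer masking iterations
  ShapeletCount (10)        -- fewer shapelets in the model
  TimeInterval (2)          -- larger stride in the distance calculation
  ModelTable ('shapelets_model_fast')
  OverwriteOutput ('true')
);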

Problem: Classification accuracy is not good enough.
The function might complete successfully, but the classification accuracy might be low.

Workarounds:
• Improve the accuracy of the saxification step, in any of the following ways:
∘ Increase the difference between the SaxMinWindowSize and SaxMaxWindowSize argument values.
∘ Decrease the SaxOutputFrequency argument value.
∘ Increase the SaxSymbolsPerWindow argument value.
• Increase the number of masking operations by increasing the RandomProjections argument value.
• Increase the number of shapelets in the output table by increasing the ShapeletCount argument value.
• Decrease the number of data points to skip between consecutive time series windows when calculating
the distance of a shapelet from a time series by decreasing the TimeInterval argument value.

SupervisedShapeletClassifier

Summary
The SupervisedShapeletClassifier function uses the model output by the function SupervisedShapeletTrainer
to classify a set of time series.

Usage

SupervisedShapeletClassifier Syntax
Version 1.1

SELECT * FROM SupervisedShapeletClassifier (


ON { table | view | (query) } AS time_series
PARTITION BY id
ORDER BY time_instant
ON { table | view | (query) } AS shapelets DIMENSION
ORDER BY shapelet_id, time_instant
[ ValueColumn ('value_column') ]
[ TimeInterval ('num_data_points') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
ValueColumn Optional Specifies the name of the time_series column that contains
the data points in the time series. The default value_column
is "value".
TimeInterval Optional Specifies the number of data points in a time series to skip
between consecutive time series windows when calculating
the distance of a shapelet from a time series.
Because a shapelet is typically much smaller than a complete
time series, the function calculates the distance of a shapelet
from a time series by sliding the shapelet across time series
windows of shapelet length, calculating the distance between
the shapelet and each window, and then selecting the
smallest distance.
The num_data_points is the number of data points to skip
when sliding from one time series window to the next. The
num_data_points must be an INTEGER in the range
[1, 1000000]. The value 1 gives optimal results at the cost of
higher execution time. The default value is 10.

Note:
This argument must specify the same value that the SupervisedShapeletTrainer TimeInterval argument specified when it generated the shapelets table.

Accumulate Optional Specifies the names of the time_series columns to copy to the output table. By default, only the id and predicted_category columns are copied to the output table. Columns specified by this argument appear after the other output table columns.

Input
The function requires the following:
• time_series, which has the same schema as the UnsupervisedShapelet input table (see the Input section of UnsupervisedShapelet).
• shapelets, a model table with the schema described in the Output section of SupervisedShapeletTrainer (SupervisedShapeletTrainer Model Table Schema).

Output
Table 219: SupervisedShapeletClassifier Output Table Schema

Column Name Data Type Description


id_column Any Unique identifier of the time series.
predicted_category VARCHAR Predicted category of the time series.

accumulate_column Same as in time_series table Column copied from the time_series table.

Example

Input
• shapelets_test, which contains additional data from the data set used to train the model
• shapelets_model, the model table output by the SupervisedShapeletTrainer example (see its Output section)
Table 220: SupervisedShapeletClassifier Example Input Table shapelets_test

id period stockprice stock_category


5 22418 460 Technology
5 22419 457 Technology
5 22420 452 Technology
5 22421 459 Technology
5 22422 462 Technology
5 22423 459 Technology
5 22424 463 Technology
5 22425 479 Technology
5 22426 493 Technology
5 22427 490 Technology
... ... ... ...

SQL-MapReduce Call

CREATE TABLE shapelets_predict DISTRIBUTE BY HASH(id) AS
SELECT * FROM SupervisedShapeletClassifier (
  ON shapelets_test AS time_series
  PARTITION BY id
  ORDER BY period
  ON shapelets_model AS shapelets DIMENSION
  ORDER BY shapelet_id, time_instant
  TimeInterval (1)
  ValueColumn ('stockprice')
  Accumulate ('stock_category')
) ORDER BY 1;

Output
This query returns the following table:

SELECT * FROM shapelets_predict ORDER BY 1;

The column stock_category contains the original category.


Table 221: SupervisedShapeletClassifier Example Output Table shapelets_predict

id predicted_category stock_category
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
5 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
6 Technology Technology
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare

7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
7 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare
8 Healthcare Healthcare

This query returns the prediction accuracy:

SELECT (SELECT COUNT(id) FROM shapelets_predict
        WHERE predicted_category = stock_category) /
       (SELECT COUNT(id) FROM shapelets_predict) AS prediction_accuracy;

prediction_accuracy
-------------------
1.00000000000000000000
(1 row)

The prediction accuracy is 100% because the predicted and original categories are the same.

VARMAX

Summary
VARMAX (Vector Autoregressive Moving Average model with eXogenous variables) extends the ARMA/
ARIMA model in two ways:
• To work with time series with multiple response variables (vector time series).
• To work with exogenous variables, or variables that are independent of the other variables in the system.
The model includes both the dynamic relationship between the multiple response variables and the
relationship between the dependent and independent variables.
This formula represents a nonseasonal VARMAX model:

Yt = Φ1Yt-1 + … + ΦpYt-p + B1Xt + B2Xt-1 + … + BbXt-b+1 + Θ1Et-1 + … + ΘqEt-q + C + Et
In the preceding equation, Yt is a stationarized time series. The first term is the autoregressive component,
the second term is the exogenous component, the third term is the moving average component, the fourth
(C) is a vector of constants, and the fifth (Et) is a vector of residual errors, and:
• Yt is a vector of n response variables
• Xt is a vector of m exogenous variables
• p is the number of previous periods of the endogenous variables included in the model
• q is the number of previous periods included in the moving average
• b is the number of previous periods of exogenous variables included
• Φi is an n * n matrix of autoregressive parameters
• Bi is an n * m matrix of exogenous variable parameters
• Θi is an n * n matrix of moving average parameters
• Et is the difference between the actual and the predicted value of Yt, (Yt - Ŷt).
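As a concrete instance: with Orders ('1, 1, 1') and no exogenous columns (the setting of Example 1 below), p = q = 1, so after one differencing step the model reduces to Yt = Φ1Yt-1 + Θ1Et-1 + C + Et, where Φ1 and Θ1 are the n * n matrices reported as ar_params and ma_params in the output.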
This formula represents a seasonal VARMAX model:
(1 - Φ1Back - … - ΦpBack^p)(1 - Φs1Back^m - … - ΦspBack^(m*sp))(1 - Back)^d (1 - Back^m)^sd Yt =

C + (1 + Θ1Back + … + ΘqBack^q)(1 + Θs1Back^m + … + ΘsqBack^(m*sq)) Et +

(B1 + B2Back + … + BbBack^(b-1))(1 - Back)^d (1 - Back^m)^sd Xt


Where the variables are as previously described and:
• Back is the backshift operator, that is:
Back(yt) = yt-1

Backn(yt) = yt-n
• m is the number of periods per season
• d is the number of differencing steps performed to stationarize the time series
• sp, sq, sd are the seasonal parameters corresponding to p, q, d
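For example, with SeasonalOrders ('1,0,0') and Period ('4'), the quarterly setting of Example 3 below, m = 4, sp = 1, and sd = sq = 0, so the only seasonal factor is the autoregressive term (1 - Φs1Back^4).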

Usage
The VARMAX function expects that each time series is an ordered sequence in a partition with uniform
time intervals. The function assumes each partition can fit into memory. The function does not accept null
or non-numeric inputs, with the exception noted in the description of ResponseColumns (in the Arguments
table).

VARMAX Syntax
Version 1.0

SELECT * FROM VARMAX (
  ON inputtable
  PARTITION BY partitionColumns
  ORDER BY timestampColumns
  ResponseColumns ('columns')
  [ ExogenousColumns ('columns') ]
  [ PartitionColumns ('columns') ]
  Orders ('p,d,q')
  [ SeasonalOrders ('sp,sd,sq') ]
  [ Period ('period') ]
  [ ExogenousOrder ('b') ]
  [ Lag ('lag') ]
  [ IncludeMean ({ 'true' | 't' | 'yes' | 'y' | '1' | 'false' | 'f' | 'no' | 'n' | '0' }) ]
  [ MaxIterNum ('max_iteration_number') ]
  [ StepAhead ('predict_steps') ]
);

Arguments
Argument Category Description
ResponseColumns Required Specifies the columns containing the response data. Null
values are acceptable at the end of the series. If StepAhead is
specified, the function reports predicted values for the
missing values, taking into account values from the
predictor columns for those time periods.
ExogenousColumns Optional Specifies the columns containing the independent
(exogenous) predictors. If not specified, the function
calculates the model without exogenous vectors.
PartitionColumns Optional Specifies the partition columns to pass to the output. If not
specified, the output contains no partition columns.
Orders Required Specifies the parameters p, d, q for the VARMA part of the
model.
This argument consists of 3 non-negative integers separated
by commas. Values must be between 0 and 20.
SeasonalOrders Optional Specifies seasonal parameters sp, sd, sq for the VARMA part
of the model. This argument consists of 3 non-negative
integers separated by commas. Values must be between 0
and 20. If not specified, the model is treated as nonseasonal.
If the SeasonalOrders argument is used, the Period
argument must also be present.
Period Optional Specifies the period of each season. Must be a positive
integer value. If the Period argument is used, the
SeasonalOrders argument must also be present. If not
specified, the model is treated as nonseasonal.
ExogenousOrder Optional Specifies the order of exogenous variables. If the current
time is t and ExogenousOrder is b, the following values of
the exogenous time series are used in calculating the
response: Xt Xt-1 ... Xt-b+1. If not specified, the model is
calculated without exogenous vectors.

Lag Optional Specifies the lag in the effect of the exogenous variables on
the response variables. For example, if Lag = 3, and
ExogenousOrder is b, Yi is predicted based on Xi-3 to Xi-b-2.
Default value is 0.
IncludeMean Optional Specifies whether the mean vector of the response series is added to the VARMAX model. Default value is False.
Note that if this argument is True, the difference parameters
d (in the Orders argument) and sd (in the SeasonalOrders
argument) must be 0.
MaxIterNum Optional A positive integer value. The maximum number of iterations
performed. Default value is 100.
StepAhead Optional A positive integer value. The number of steps to forecast
after the end of the time series. If not provided, no forecast
values are calculated.

Input
Table 222: VARMAX Input Schema

Column Name Data Type Description


timestamp_column Any The table can have more than one such column. These
columns contain the sequence (time points) of the input
values.

Note:
If the time points do not have uniform intervals, then
run the function Interpolator on them before running
the VARMAX function on the input table.

partition_column Any Optional. Column on which the input data is partitioned. The table can have more than one such column.
response_column Numeric The table can have more than one such column. These
columns contain the values of the response variables.
exogenous_column Numeric Optional. The table can have more than one such column.
These columns contain the values of the exogenous
variables.

Output
The output of VARMAX is a model for each partition in the input table. The Output table schema and
additional details about the output are shown in the following two tables.

Table 223: VARMAX Output Schema

Column Name Data Type Description


partition_columns Any Identifies the partition_column from the input table.
coef VARCHAR See the following table for the values that appear in this
column.
coef_value VARCHAR See the following table for explanations of the values that
appear in this column.
[stepahead] Integer Included only if the StepAhead argument has a positive
value.
Identifier of the future period. For example, if the
argument StepAhead is 3, this column shows 1, 2, 3 to
indicate the future period for which a predicted value is
shown in the next column.
[predict_columnname] Double Included only if the StepAhead argument has a positive
value. One column appears for each column specified in
the ResponseColumns argument.
The predicted future values of the response variable.

Table 224: VARMAX Model Coefficients

coef coef_value
coef The vector [p, d, q, sp, sd, sq, b].
ar_params The matrices Φi, shown as a vector of p matrices, each of which is an n * n
matrix. p is from the coef vector and n is the number of response variables
specified in the ResponseColumns argument.
ma_params The matrices Θi, shown as a vector of q matrices, each of which is an n * n matrix. q is from the coef vector and n is the number of response variables specified in the ResponseColumns argument.
exogenous_params The matrices Bi, shown as a vector of b matrices, each of which is an n * m
matrix. b is from the coef vector, n is the number of response variables specified
in the ResponseColumns argument, and m is the number of exogenous
variables specified in the ExogenousColumns argument.
seasonal_ar_params The matrix Φsi for the seasonal parameters.
seasonal_ma_params The matrix Θsi for the seasonal parameters.
mean_param The mean vector of the response series. This value is only displayed if the
argument IncludeMean is set to True.
period The cycle period for seasonal models (0 for non-seasonal models).
lag The lag value specified in the function call.
sigma The variance matrix.
aic The Akaike information criterion.

bic The Bayesian information criterion.
iterations The number of iterations performed.
converged Whether the algorithm converged.
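For example, to keep a model for later inspection, you can store the function output in a table and filter on the coef column. This is a minimal sketch using the example input table finance_data3 (described below); varmax_model is a hypothetical table name:

CREATE TABLE varmax_model DISTRIBUTE BY HASH(id) AS
SELECT * FROM Varmax (
  ON finance_data3 PARTITION BY id ORDER BY period
  ResponseColumns ('expenditure', 'income', 'investment')
  PartitionColumns ('id')
  Orders ('1, 1, 1')
);

SELECT id, coef, coef_value FROM varmax_model
WHERE coef IN ('aic', 'bic', 'converged')
ORDER BY id, coef;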

Examples
• Input
• Example 1: VARMAX without Exogenous Model
• Example 2: VARMAX with Exogenous Model
• Example 3: VARMAX with Seasonal Model and without Exogenous Model
• Example 4: VARMAX with All Models

Input
All examples use the following input table, which is seasonally adjusted quarterly financial data from West
Germany between 1960 and 1982. Three time series are included: consumer expenditures, disposable
income, and fixed investment. Values are shown in billions of DM. The time series is partitioned by the “id” column, which indicates the decade.
Table 225: VARMAX Example Input Table finance_data3

id period expenditure income investment


1 1960Q1 415 451 180
1 1960Q2 421 465 179
1 1960Q3 434 485 185
1 1960Q4 448 493 192
1 1961Q1 459 509 211
1 1961Q2 458 520 202
1 1961Q3 479 521 207
1 1961Q4 487 540 214
1 1962Q1 497 548 231
1 1962Q2 510 558 229
1 1962Q3 516 574 234
1 1962Q4 525 583 237
1 1963Q1 529 591 206
1 1963Q2 538 599 250
... ... ... ... ...

Example 1: VARMAX without Exogenous Model
This example uses all three time series as response columns modeled by the autoregression (AR) and
moving average (MA) parameters. Because it uses no exogenous variables, it is equivalent to the VARMA
model.

SQL-MapReduce Call
Three values are predicted (StepAhead (3)) for each time series with Orders ('1, 1, 1').

SELECT * FROM Varmax (
  ON finance_data3 PARTITION BY id ORDER BY period
  ResponseColumns ('expenditure', 'income', 'investment')
  PartitionColumns ('id')
  Orders ('1, 1, 1')
  IncludeMean ('false')
  StepAhead (3)
) ORDER BY id;

Output

Table 226: VARMAX Example 1 Output Table (Columns 1-4)

id coef coef_value stepahead


1 coef 1, 1, 1, 0, 0, 0, 0
1 ar_params [[0.5401142443401908, -0.22144696209085052,
-0.8658134069625404], [0.46112845524514406,
-0.3141400396767603, 0.5401312499458767],
[-0.5436170410196307, 0.3550487442187296,
0.11840800979913536]]
1 ma_params [[0.9418274466713232, 0.47136752532237564,
-3.5912652252354267], [0.6921921561542808,
1.1978201256969616, -3.1537918114389636],
[-0.4072848688634932, 0.5456117174031574,
-1.0037912602351198]]
1 exogenous_params
1 seasonal_ar_params
1 seasonal_ma_params
1 period 0
1 lag 0
1 sigma [[9303.34637558586, 8911.919784323425,
2489.9006735815274], [8911.919784323425,
11603.406279896304, 2608.8269629048896],
[2489.9006735815274, 2608.8269629048896,
1873.3459372621583]]
1 aic 25.17173031402279

1 bic 25.939527996851854
1 iterations 90
1 converged true
1 1
1 2
1 3
2 coef 1, 1, 1, 0, 0, 0, 0
2 ar_params [[-2.068628487877914, 1.4352819442850229,
0.5040312926651062], [-1.7033511912858643,
1.4975506746058223, 0.5074485754476421],
[0.02390658218391805, 0.09069981493533767,
0.6323254473495666]]
2 ma_params [[1.1099224166373396, -1.4699745679694927,
-0.13597198353294757], [2.0484118200773467,
-1.461011739772329, -0.966987021587989],
[1.215189995141357, -0.48167519587064467,
-0.7805184850752649]]
2 exogenous_params
2 seasonal_ar_params
2 seasonal_ma_params
2 period 0
2 lag 0
2 sigma [[20710.85970757356, 19488.49092866746,
3238.838942282427], [19488.49092866746,
23099.4579527159, 7096.235865937958],
[3238.838942282427, 7096.235865937958,
4997.15664809357]]
2 aic 26.28445013189386
2 bic 27.052247814722925
2 iterations 100
2 converged false
2 1
2 2
2 3
3 coef 1, 1, 1, 0, 0, 0, 0
3 ar_params [[2.356733508164794, -0.5621044042754211,
-1.4440215700897763], [1.3690594672816452,

-0.374171671178873, 0.145908702245906],
[-0.46084660401307126, 0.5119208015206022,
0.5487744352179722]]
3 ma_params [[1.4346653216242653, 1.2117554668497033,
-1.3437736962000293], [2.4823004009827208,
0.7810702841064198, 0.9249544231941549],
[1.2614528273336505, 0.46230687456647834,
0.8258407892382512]]
3 exogenous_params
3 seasonal_ar_params
3 seasonal_ma_params
3 period 0
3 lag 0
3 sigma [[3350.0872240469225, 2079.903284457588,
-57.80956553330401], [2079.903284457588,
1813.642959968383, 53.094810167970195],
[-57.80956553330401, 53.094810167970195,
545.3242237694398]]
3 aic 23.919079900514713
3 bic 24.570181256002957
3 iterations 96
3 converged true
3 1
3 2
3 3

Table 227: VARMAX Example 1 Output Table (Columns 5-7)

id predict_expenditure predict_income predict_investment


1
1
1
1
1
1
1
1

1
1
1
1
1
1 1938.76079858415 2252.24766189307 744.997964522574
1 1925.43436379877 2283.39702101694 740.419046284062
1 1915.30312439615 2264.99334498778 758.180883573926
2
2
2
2
2
2
2
2
2
2
2
2
2
2 1647.53805974095 1992.4468446864 677.020182217684
2 1797.20517373109 2154.41616652847 744.891590184598
2 1754.28047469318 2176.47902609351 806.077025266635
3
3
3
3
3
3
3
3

3
3
3
3
3
3 2349.23362243522 2735.7962811932 863.755002444903
3 2437.20210713965 2816.0995450051 889.634165862864
3 2562.01149424731 2910.26242054004 904.404922890561

Note:
Series id = 2 does not converge. Convergence could possibly be improved by adding more orders or more models.

Example 2: VARMAX with Exogenous Model


This example models the expenditure time series as a function of the exogenous variables (income and investment) and the AR and MA parameters. The order and lag of the exogenous variables are chosen according to the desired degree of complexity, typically a trade-off between accuracy and overfitting.

SQL-MapReduce Call

SELECT * FROM Varmax (
  ON finance_data3 PARTITION BY id ORDER BY period
  ResponseColumns ('expenditure')
  ExogenousColumns ('income', 'investment')
  PartitionColumns ('id')
  Orders ('1, 1, 1')
  Lag ('3')
  ExogenousOrder ('3')
  IncludeMean ('false')
  StepAhead (3)
) ORDER BY id;

Output

Table 228: VARMAX Example 2 Output Table

id coef coef_value stepahead predict_expenditure


1 coef 1, 1, 1, 0, 0, 0, 3
1 ar_params [[-0.11048447321369022]]
1 ma_params [[-0.1925561123625604]]
1 exogenous_params [[0.15184265006071107,
-0.6548579365873168]],

[[3.4356161960447644,
-3.8815455917944566]],
[[1.352629223976695,
1.6692441091760393]]
1 seasonal_ar_params
1 seasonal_ma_params
1 period 0
1 lag 3
1 sigma [[21815.169519750183]]
1 aic 10.400617266572384
1 bic 10.74186068116308
1 iterations 18
1 converged true
1 1 4601.67249271766
1 2 6375.70446779608
1 3 6353.20194246941
2 coef 1, 1, 1, 0, 0, 0, 3
2 ar_params [[0.7135109382825823]]
2 ma_params [[0.8116168306474483]]
2 exogenous_params [[-1.3199094913183353,
3.7663680262105013]],
[[0.8573542123886598,
-2.361564228588058]],
[[0.7742458737629424,
-2.39596174873262]]
2 seasonal_ar_params
2 seasonal_ma_params
2 period 0
2 lag 3
2 sigma [[41347.85268103083]]
2 aic 11.040032179034085
2 bic 11.381275593624782
2 iterations 18
2 converged true
2 1 1738.29815872784
2 2 1683.52303336657
2 3 1737.17454282084

3 coef 1, 1, 1, 0, 0, 0, 3
3 ar_params [[0.7034293980424465]]
3 ma_params [[-0.5320720491694714]]
3 exogenous_params [[0.21057556006931047,
-0.5831916151232963]],
[[-0.09419582311809366,
0.2501649697569839]],
[[0.4552695851877213,
-1.121807842101149]]
3 seasonal_ar_params
3 seasonal_ma_params
3 period 0
3 lag 3
3 sigma [[1.2480831335136702E-25]]
3 aic -55.88847298918466
3 bic -55.59909460896767
3 iterations 25
3 converged true
3 1 2321.97511577896
3 2 2404.77038121072
3 3 2433.88438933071

Note:
The model converges for all three decades.

Example 3: VARMAX with Seasonal Model and without Exogenous Model


This example models all three time series, assuming seasonality in the trends and without any exogenous
variables.

SQL-MapReduce Call
Seasonal modeling is specified by the Period and SeasonalOrders arguments.

SELECT * FROM Varmax (
  ON finance_data3 PARTITION BY id ORDER BY period
  ResponseColumns ('expenditure', 'income', 'investment')
  PartitionColumns ('id')
  Orders ('1,1,1')
  SeasonalOrders ('1,0,0')
  Period ('4')
  IncludeMean ('false')
  StepAhead (3)
) ORDER BY id;

Output

Table 229: VARMAX Example 3 Output Table (Columns 1-4)

id coef coef_value stepahead


1 coef 1, 1, 1, 1, 0, 0, 0
1 ar_params [[1.2280127869768733, 0.10369097626637279, 0.7141124344630049], [1.6892503311514915, -0.7866365569885521, 1.2398648486516826], [2.5905232539842906, -1.9806470096597304, 0.848961995714201]]
1 ma_params [[-1.3578087849315306, 3.1929513989417986, -0.507843441200345], [-0.27161298036695314, 2.0434556838776285, -0.6532994061098848], [-0.3314681626075737, 0.9709146220990982, 0.4986214533613408]]
1 exogenous_params
1 seasonal_ar_params [[-0.8516241752486389, 1.4543285878588237, -1.717424059774754], [0.27802014575224626, 1.1028023698734997, -1.981710680445198], [3.6436081834691665, -0.6515504279942187, 0.2101671051863758]]
1 seasonal_ma_params
1 period 4
1 lag 0
1 sigma [[8687.149620951508, 7948.571854972013, 2273.0461867165814], [7948.571854972013, 12790.637540035925, 1116.4386244141892], [2273.0461867165814, 1116.4386244141892, 3484.41990598517]]
1 aic 26.97888751820004
1 bic 28.130584042443644
1 iterations 60
1 converged true
1 1
1 2
1 3
2 coef 1, 1, 1, 1, 0, 0, 0
2 ar_params [[-2.8588217461202174, 2.132413356693889, 0.048031825091634464], [-1.6669033637617956, 1.0810782826131897, 0.2430782867637148], [-0.4730750055653358, 0.024891878448287257, 0.2376339870230307]]
2 ma_params [[1.7104434579564591, -2.139631490133681, -0.4000963233539877], [2.995949635314298, -2.2697549832800177, -2.5616082579956614], [0.30188472672917305, -0.7646261744793058, -0.21255497859171532]]
2 exogenous_params
2 seasonal_ar_params [[-2.8260118262809155, 2.7154238951166594, -0.8509533086123783], [-2.9395646286325388, 3.1247969541738883, -1.7883773157208929], [-1.3822487137932813, 1.1513048428957071, 0.29884692571288657]]
2 seasonal_ma_params
2 period 4
2 lag 0
2 sigma [[8931.17883016103, 9690.562795737687, 4251.384014300442], [9690.562795737687, 17621.68165475026, 4888.406402279781], [4251.384014300442, 4888.406402279781, 3825.021513435783]]
2 aic 26.841090807077666
2 bic 27.99278733132127
2 iterations 89
2 converged true
2 1
2 2
2 3
3 coef 1, 1, 1, 1, 0, 0, 0
3 ar_params [[-1.4296101261525473, 2.5315319963982015, 1.541536836147999], [0.5068978405697062, 0.34804164987971653, 0.27906793587515316], [-1.0150487527027767, -0.2032855566913086, 0.07866915491037613]]
3 ma_params [[1.684265978492897, -1.0116180666097978, -5.533178263753073], [-1.0550984768730498, 1.5756063090213763, 4.70570346190122], [1.660449272138234, 0.6884049529205533, -0.26326075799465776]]
3 exogenous_params
3 seasonal_ar_params [[0.3038135044562906, -0.055637121723792274, -0.45848481298992416], [0.8796218186498189, -0.5098716183542696, -0.3679846957572214], [0.7588190526613598, -0.49764695771025447, -0.23842915354370808]]
3 seasonal_ma_params
3 period 4
3 lag 0
3 sigma [[0.04925768206601586, -0.06743723547426139, -0.061586897494505584], [-0.06743723547426139, 0.22449141784354268, 0.28364011938502687], [-0.061586897494505584, 0.28364011938502687, 0.41035204036308964]]
3 aic -3.544378720752083
3 bic -2.567726687519719
3 iterations 100
3 converged false
3 1
3 2
3 3

Table 230: VARMAX Example 3 Output Table (Columns 5-7)

id predict_expenditure predict_income predict_investment


1
1
1
1
1
1
1
1
1
1
1
1
1
1 2142.78715878754 3092.86303423376 3673.84547642858
1 4144.01202399602 7019.73780801767 2413.26665396173
1 6182.11119125155 5848.79519783584 -1227.61804348869
2
2
2
2
2
2
2
2
2
2
2
2
2
2 1879.13079317076 2309.21928969857 667.801515797637
2 2029.37000308685 2277.98815737177 568.158627956374

2 1590.12072329947 2093.70018052511 491.215852756901
3
3
3
3
3
3
3
3
3
3
3
3
3
3 2255.14441988896 2673.97520019463 845.302048427789
3 2310.40934463343 2679.05646283002 888.074446679834
3 2333.56533607469 2727.31044779104 820.4016838113

Note:
Series id = 3 does not converge. Convergence could possibly be improved by adding either more orders or more models.

Example 4: VARMAX with All Models


This is a comprehensive model and includes all the possible model types (Autoregressive, Exogenous,
Seasonal and Moving Average) for prediction.

SQL-MapReduce Call

SELECT * FROM Varmax (
  ON finance_data3 PARTITION BY id ORDER BY period
  ResponseColumns ('expenditure')
  ExogenousColumns ('income', 'investment')
  PartitionColumns ('id')
  Orders ('1, 1, 1')
  SeasonalOrders ('1, 0, 0')
  Period ('4')
  Lag ('3')
  ExogenousOrder ('3')
  IncludeMean ('false')
  StepAhead (3)
) ORDER BY id;

Output

Table 231: VARMAX Example 4 Output Table

id coef coef_value stepahead predict_expenditure


1 coef 1, 1, 1, 1, 0, 0, 3
1 ar_params [[-0.08614401327519393]]
1 ma_params [[-0.1680269481962965]]
1 exogenous_params [[0.14203710879699824,
-0.5781265904163024]],
[[4.346857849907079,
-3.6811357032482723]],
[[1.5837616273557507,
1.6939228988751796]]
1 seasonal_ar_params [[-1.774477167876499]]
1 seasonal_ma_params
1 period 4
1 lag 3
1 sigma [[21676.58531469003]]
1 aic 10.445526400752035
1 bic 10.829425242166568
1 iterations 20
1 converged true
1 1 3943.92757311105
1 2 5935.79783027663
1 3 5949.27104339776
2 coef 1, 1, 1, 1, 0, 0, 3
2 ar_params [[0.7116543185119842]]
2 ma_params [[0.7409842225080137]]
2 exogenous_params [[-1.3036705963984643,
3.539286250241028]],
[[4.598852419341966,
0.24549009988380754]],
[[-1.665252032620963,
-4.712566452061846]]
2 seasonal_ar_params [[-5.409614969849707]]
2 seasonal_ma_params
2 period 4

2 lag 3
2 sigma [[36921.803548309304]]
2 aic 10.97809599913301
2 bic 11.361994840547544
2 iterations 53
2 converged true
2 1 1684.98113900396
2 2 1678.10798395172
2 3 1737.6262526761
3 coef 1, 1, 1, 1, 0, 0, 3
3 ar_params [[0.254361642305569]]
3 ma_params [[0.1001575083196238]]
3 exogenous_params [[0.29320016884259,
-0.7088197800884475]],
[[0.029695938455021133,
0.3298987004589782]],
[[0.552628701381863,
-1.4286270925970472]]
3 seasonal_ar_params [[-0.1780611965952543]]
3 seasonal_ma_params
3 period 4
3 lag 3
3 sigma [[2.333553926057712E-18]]
3 aic -38.962775641871566
3 bic -38.63722496412744
3 iterations 20
3 converged true
3 1 2323.20886314076
3 2 2393.45004194531
3 3 2374.68753995705

Note:
The model converges across all partitions.



CHAPTER 4
Pattern Matching with Teradata Aster nPath


• nPath
• Pattern Matching
• Symbols
• Filters
• Result: Applying Aggregate Functions
• nPath Examples

nPath

Summary
The nPath function matches specified patterns in a sequence of rows from one or more input tables and
extracts information from the matched rows.

Typical nPath uses are:


• Categorizing entities based on observed patterns; for example, distinguishing “loyal customers” from
“price-sensitive shoppers.”
• Selecting relevant data from a data set and then inputting it to another function or a third-party data
graph generator, such as the application that produced the following figure.

Figure 10: Sankey Diagram of Teradata Aster nPath Output

Usage

nPath Syntax
Version 1.0

SELECT * FROM nPath (
  ON { table | view | (query) }
  PARTITION BY partition_column
  ORDER BY order_column [ ASC | DESC ]
  [ ON { table | view | (query) }
    [ PARTITION BY partition_column | DIMENSION ]
    ORDER BY order_column [ ASC | DESC ]
  ][...]
  Mode ({ OVERLAPPING | NONOVERLAPPING })
  Pattern ('pattern')
  Symbols ({ col_expr = symbol_predicate AS symbol }[,...])
  [ Filter (filter_expression[,...]) ]
  Result ({ aggregate_function (col_expr OF symbol) AS alias }[,...])
);

The nPath function is not tied to any Aster Database schema and must not be qualified with a schema name.

Arguments
Argument Category Description
Mode Required Specifies the pattern-matching mode:
OVERLAPPING: The function finds every occurrence of the pattern in the
partition, regardless of whether it is part of a previously found match.
Therefore, one row can match multiple symbols in a given matched pattern.
NONOVERLAPPING: The function begins the next pattern search at the
row that follows the last pattern match. This is the default behavior of many
commonly used pattern matching utilities, including the UNIX grep utility.
Pattern Required Specifies the pattern for which the function searches. You compose pattern
with the symbols that you define in the Symbols argument, operators, and

parentheses. The following table describes the simplest patterns, which you
can combine to form more complex patterns. When patterns have multiple
operators, the function applies them in order of precedence, and applies
operators of equal precedence from left to right. The following table also
shows operator precedence. To force the function to evaluate a subpattern
first, enclose it in parentheses. To specify that a subpattern must appear a
specific number of times, use the Range-Matching Feature.
For pattern matching details, refer to Pattern Matching.
Symbols Required Defines the symbols that appear in the values of the Pattern and Result
arguments. The col_expr is an expression whose value is a column name,
symbol is any valid identifier, and symbol_predicate is a SQL predicate
(often a column name).
For example, the Symbols argument for analyzing website visits might look
like this:

Symbols (
  pagetype = 'homepage' AS H,
  pagetype <> 'homepage' AND pagetype <> 'checkout' AS PP,
  pagetype = 'checkout' AS CO
)

The symbol is case-insensitive; however, a symbol of one or two uppercase letters is easy to identify in patterns.
If col_expr represents a column that appears in multiple input tables, then
you must qualify the ambiguous column name with its table name. For
example:

Symbols (
weblog.pagetype = 'homepage' AS H,
weblog.pagetype = 'thankyou' AS T,
ads.adname = 'xmaspromo' AS X,
ads.adname = 'realtorpromo' AS R
)

For more information about symbols that appear in the Pattern argument
value, refer to Symbols. For more information about symbols that appear in
the Result argument value, refer to Result: Applying Aggregate Functions.
Filter Optional Specifies filters to impose on the matched rows. The function combines the
filter expressions using the AND operator.
The filter_expression syntax is:

symbol_expression comparison_operator symbol_expression

The two symbol expressions must be type-compatible. The
symbol_expression syntax is:

{ FIRST | LAST } (column_with_expression OF [ANY] (symbol[,...]))

The column_with_expression cannot contain the operator AND or OR, and all its columns must come from the same input. If the function has multiple inputs, then column_with_expression and symbol must come from the same input.
The comparison_operator is either <, >, <=, >=, =, or !=.
This argument can improve or degrade nPath performance, depending on
several factors. For details, refer to Filters.
Result Required Defines the output columns. The col_expr is an expression whose value is a
column name; it specifies the values to retrieve from the matched rows. The
function applies aggregate_function to these values. For details, see Result:
Applying Aggregate Functions.
The function evaluates this argument once for every matched pattern in the
partition (that is, it outputs one row for each pattern match).

In the following table, A and B are symbols defined in the Symbols argument.
Table 232: Simple nPath Patterns and Operator Precedence

pattern Description Operator Precedence
A The function returns the rows that contain exactly one occurrence of A. 1 (highest)
A. The function returns the rows that contain exactly one occurrence of A. 1
A? The function returns the rows that contain at most one occurrence of A. The ? 1
operator is nongreedy.
A* The function returns the rows that contain zero or more occurrences of A. The * 1
operator is nongreedy.
A+ The function returns the rows that contain at least one occurrence of A. The + 1
operator is nongreedy.
A.B Cascade operator. The function returns the rows that contain A followed 2
immediately by B.
A|B Alternative (or) operator. The function returns the rows that contain either A or 3
B.
^A Startanchor operator. The function returns the rows that start with A.
A$ Endanchor operator. The function returns the rows that end with A.
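For example, combining these operators with the symbols H, PP, and CO defined in the Symbols argument example above, the pattern '^H.PP*.CO$' matches sessions that start on the home page, pass through zero or more product pages, and end at checkout (an illustrative pattern, not one of the reference examples).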

Range-Matching Feature
The range-matching feature lets you specify the number of times that a subpattern must appear in a match.
You can specify an exact number, a minimum number, or both a minimum and maximum number:

(subpattern){n[,[m]]}

In the preceding syntax, you must type the braces ({ and }).
(subpattern){n} specifies that subpattern must appear exactly n times. For example, the following pattern
specifies that subpattern (A.B|C) must appear exactly 3 times:

'X.(Y.Z).(A.B|C){3}'

The preceding pattern is equivalent to the following pattern:

'X.(Y.Z).(A.B|C).(A.B|C).(A.B|C)'

(subpattern){n,} specifies that subpattern must appear at least n times. For example, the following pattern
specifies that subpattern (A.B|C) must appear at least 4 times:

'X.(Y.Z).(A.B|C){4,}'

The preceding pattern is equivalent to the following pattern:

'X.(Y.Z).(A.B|C).(A.B|C).(A.B|C).(A.B|C).(A.B|C)*'

(subpattern){n,m} specifies that subpattern must appear at least n times and at most m times. For example,
the following pattern specifies that subpattern (A.B|C) must appear at least 2 times and at most 4 times:

'X.(Y.Z).(A.B|C){2,4}'

The preceding pattern is equivalent to the following pattern:

'X.(Y.Z).(A.B|C).(A.B|C).(A.B|C)?.(A.B|C)?'

Input
The function requires at least one partitioned input table, and can have additional input tables that are either
partitioned or DIMENSION tables.
Table 233: nPath Input Table Schema

Column Name Data Type Description


partition_column INTEGER or VARCHAR Column by which every partitioned input table is partitioned.
order_column INTEGER or VARCHAR Column by which every input table is ordered.
input_column INTEGER or VARCHAR Contains data to search for patterns.

Note:
If the input to nPath is nondeterministic, then the results are nondeterministic.

Output
Table 234: nPath Output Table Schema

Column Name Data Type Description


partition_column Same as in input table Column by which partitioned input tables are partitioned.
order_column Same as in input table Column by which input tables are ordered.
result_column Same as result of aggregate_function Determined by the Result argument. For details, refer to Result: Applying Aggregate Functions.

Pattern Matching
Conceptually, nPath pattern matching proceeds like this: Starting from a row in a partition, the function
tries to match the given pattern along the row sequence in the partition (ordered as specified in the ORDER
BY clause).
If the function cannot match the pattern, it outputs nothing; otherwise, it continues to the next row. When
the function finds a sequence of rows that match the pattern, it selects the largest set of rows that constitute
the match and outputs a row based on this match.
For example, suppose that the pattern is 'A.B+' and the rows that constitute the match start at a row t1 and
end at row t4. Suppose that t1 matches A and each of t2,t3, and t4 matches B. When the matching is
complete, A represents t1 and B represents t2, t3, and t4. Using the rows represented by A and B, the
function evaluates the Result argument (typically applying an aggregate function to each symbol in the
pattern), outputs one row with the result values, and proceeds to search for the next pattern match.
Before running nPath on a large data set, create a small data set that includes the pattern that you want to find. Test your pattern on the small data set, refine the pattern until nPath gives the desired output, and then use the refined pattern on the large data set.

Greedy Pattern Matching


The nPath function uses greedy pattern matching, finding the longest available match despite any nongreedy
operators in the pattern.

For example, consider the input table link2:
Table 235: nPath Greedy Pattern Matching Example Input Table link2

userid title startdate enddate


21 Chief Exec Officer 1994-10-01 2005-02-28
21 Software Engineer 1996-10-01 2001-06-30
21 Software Engineer 1998-10-01 2001-06-30
21 Chief Exec Officer 2005-03-01 2007-03-31
21 Chief Exec Officer 2007-06-01 null

The following query returns the following table:

SELECT job_transition_path, count(*) AS count FROM nPath (
  ON link2 PARTITION BY userid ORDER BY startdate
  Mode (NONOVERLAPPING)
  Pattern ('CEO.ENGR.OTHER*')
  Symbols (title ilike 'software eng%' AS ENGR,
    true AS OTHER,
    title ilike 'Chief Exec Officer' AS CEO)
  Result (accumulate(title OF ANY(ENGR,OTHER,CEO))
    AS job_transition_path)
) GROUP BY 1 ORDER BY 2 DESC;

Table 236: nPath Greedy Pattern Matching Example 1 Output Table

job_transition_path count
[Chief Exec Officer, Software Engineer, Software Engineer, Chief Exec Officer, Chief Exec 1
Officer]

In the pattern, CEO matches the first row, ENGR matches the second row, and OTHER* matches the remaining rows.

The following query returns the following table:

SELECT job_transition_path, count(*) AS count FROM nPath (
  ON link2 PARTITION BY userid ORDER BY startdate
  Mode (NONOVERLAPPING)
  Pattern ('CEO.ENGR.OTHER*.CEO')
  Symbols (title ilike 'software eng%' AS ENGR,
    true AS OTHER,
    title ilike 'Chief Exec Officer' AS CEO)
  Result (accumulate(title of ANY(ENGR,OTHER,CEO))
    AS job_transition_path)
) GROUP BY 1 ORDER BY 2 DESC;

Table 237: nPath Greedy Pattern Matching Example 2 Output Table

job_transition_path count
[Chief Exec Officer, Software Engineer, Software Engineer, Chief Exec Officer, Chief Exec 1
Officer]

In the pattern, CEO matches the first row, ENGR matches the second row, OTHER* matches the next two rows, and CEO matches the last row.

Symbols
This section applies only to symbols that appear in the Pattern argument, described in Arguments. For
information about symbols that appear in the Result argument, refer to Result: Applying Aggregate
Functions.
For each symbol definition, col_expr = symbol_predicate AS symbol, the function returns the rows for which
col_expr equals symbol_predicate. For example, for pagetype = 'home' AS H, the function returns the
first and fourth rows of the following table.
Table 238: nPath Sample Input Table

sessionid clicktime userid productname pagetype referrer productprice


1 07:00:10 333 home www.company2.com
1 07:00:12 333 product1 checkout www.company2.com 200.2
1 07:01:00 333 product2 checkout 340
13 15:35:08 67403 home www.company1.com

The function does not return any row that contains a NULL value. For example, for pagetype =
'checkout' AS C, the function returns the second row of the preceding table, but not the third.
The predicate TRUE matches every row.
If symbols have overlapping predicates, multiple symbols might match the same row.
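For example, in the greedy pattern matching queries above, a row whose title is 'Chief Exec Officer' satisfies both the CEO predicate and the OTHER predicate (true), so it can match either symbol, depending on the pattern.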

LAG Expressions in Symbol Predicates


When a symbol predicate contains a LAG expression, the function compares the current row to a previous
row to determine whether to return the current row. You must know the number of rows to look ahead or

350 Teradata Aster Analytics Foundation User Guide


Chapter 4: Pattern Matching with Teradata Aster nPath
Symbols
back. If you cannot determine this number (because the pattern includes symbols that can match a variable
number of rows, for example), use the Filter argument.

LAG Expression Syntax

{ current_expr operator LAG (previous_expr, lag_rows [, default]) |
  LAG (previous_expr, lag_rows [, default]) operator current_expr }

where:
• current_expr is the name of a column from the current row (or an expression operating on this column).
• operator is either >, >=, <, <=, =, or !=
• previous_expr is the name of a column from a previous row (or an expression operating on this column).
• lag_rows is the number of rows to count backward from the current row to reach the previous row. For
example, if lag_rows is 1, the previous row is the immediately preceding row.
• default is the value to use for previous_expr when there is no previous row (that is, when the current row
is the first row or there is no row that is lag_rows before the current row).
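For example, the symbol definition productprice > LAG (productprice, 1, 100::REAL) AS P2, used in LAG Expression Example 2 below, matches a row whose product price is greater than the price in the immediately preceding row, using 100 as the previous price when no preceding row exists.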

LAG Expression Rules


• A symbol definition can have multiple LAG expressions.
• A symbol definition that has a LAG expression cannot have an OR operator.
• If a symbol definition has a LAG expression and the input is not a table, then you must create an alias of
the input query, as in Lag Expression Example 1.

Lag Expression Example 1

Input

Table 239: nPath LAG Expression Example Input Table bank_web_clicks

customer_id session_id page datestamp


529 0 ACCOUNT SUMMARY 2004-03-17 16:35:00
529 0 FAQ 2004-03-17 16:38:00
529 0 ACCOUNT HISTORY 2004-03-17 16:42:00
529 0 FUNDS TRANSFER 2004-03-17 16:45:00
529 0 ONLINE STATEMENT ENROLLMENT 2004-03-17 16:49:00
529 0 PROFILE UPDATE 2004-03-17 16:50:00
529 0 ACCOUNT SUMMARY 2004-03-17 16:51:00
529 0 CUSTOMER SUPPORT 2004-03-17 16:53:00
529 0 VIEW DEPOSIT DETAILS 2004-03-17 16:57:00
529 1 ACCOUNT SUMMARY 2004-03-18 01:16:00

529 1 ACCOUNT SUMMARY 2004-03-18 01:18:00
529 1 FAQ 2004-03-18 01:20:00
... ... ... ...

SQL-MapReduce Call

SELECT * FROM nPath (
  ON (SELECT customer_id, session_id, datestamp, page
      FROM bank_web_clicks) AS alias
  PARTITION BY customer_id, session_id
  ORDER BY datestamp
  MODE (NONOVERLAPPING)
  PATTERN ('(DUP|A)*')
  SYMBOLS (
    'true' AS A,
    page = LAG (page,1) AS DUP
  )
  RESULT (
    FIRST (customer_id OF any (A)) AS customer_id,
    FIRST (session_id OF A) AS session_id,
    FIRST (datestamp OF A) AS first_date,
    LAST (datestamp OF ANY(A,DUP)) AS last_date,
    ACCUMULATE (page OF A) AS page_path,
    ACCUMULATE (page of DUP) AS dup_path)
);

Output

Table 240: nPath LAG Expression Example 1 Output Table (Columns 1-4)

customer_id session_id first_date last_date


529 0 2004-03-17 16:35:00 2004-03-17 16:57:00
529 1 2004-03-18 01:16:00 2004-03-18 01:28:00
529 2 2004-03-18 09:22:00 2004-03-18 09:36:00
529 3 2004-03-18 22:41:00 2004-03-18 22:55:00
529 4 2004-03-19 08:33:00 2004-03-19 08:41:00
529 5 2004-03-19 10:06:00 2004-03-19 10:14:00
... ... ... ...

Table 241: nPath LAG Expression Example 1 Output Table (Columns 5-6)

page_path dup_path
[ACCOUNT SUMMARY, FAQ, ACCOUNT HISTORY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT, PROFILE UPDATE, ACCOUNT SUMMARY, CUSTOMER SUPPORT, VIEW DEPOSIT DETAILS] []
[ACCOUNT SUMMARY, FAQ, ACCOUNT SUMMARY, FUNDS TRANSFER, ACCOUNT HISTORY, VIEW DEPOSIT DETAILS, ACCOUNT SUMMARY, ACCOUNT HISTORY] [ACCOUNT SUMMARY]
[ACCOUNT SUMMARY, ACCOUNT HISTORY, FUNDS TRANSFER, ACCOUNT SUMMARY, FAQ] [ACCOUNT SUMMARY, ACCOUNT SUMMARY, FAQ]
[ACCOUNT SUMMARY, ACCOUNT HISTORY, ACCOUNT SUMMARY, ACCOUNT HISTORY, FAQ, ACCOUNT SUMMARY] [ACCOUNT SUMMARY]
[ACCOUNT SUMMARY, FAQ, VIEW DEPOSIT DETAILS, FAQ] []
[ACCOUNT SUMMARY, FUNDS TRANSFER, VIEW DEPOSIT DETAILS, ACCOUNT HISTORY] [VIEW DEPOSIT DETAILS]
... ...

LAG Expression Example 2


Whenever a user visits the home page and then visits checkout pages and buys increasingly expensive
products, the nPath query returns the first purchase and the most expensive purchase.

Input

Table 242: nPath LAG Expression Example 2 Input Table: aggregate_clicks

userid sessionid productname pagetype clicktime referrer productprice


1039 1 sneakers home 2009-07-29 20:17:59 Nike 100
1039 2 books home 2009-04-21 13:17:59 BarnesNoble 300
1039 3 television home 2009-05-23 13:17:59 Bestbuy 500
1039 4 envelopes home 2009-07-16 11:17:59 Staples 10
1039 4 envelopes home1 2009-07-16 11:18:16 Staples 10
1039 4 envelopes page1 2009-07-16 11:18:18 Staples 10
1039 5 bookcases home 2009-08-19 22:17:59 Ikea 150
1039 5 bookcases home1 2009-08-19 22:18:02 Ikea 150
1039 5 bookcases page1 2009-08-19 22:18:05 Ikea 150
1039 5 bookcases page2 2009-08-22 04:20:05 Ikea 150
1039 5 bookcases checkout 2009-08-24 14:30:05 Ikea 150
1039 5 bookcases page2 2009-08-27 23:03:05 Ikea 150
1040 1 tables home 2009-07-29 20:17:59 Ikea 250
1040 2 Appliances home 2009-04-21 13:17:59 GE 1500
1040 3 laptops home 2009-05-23 13:17:59 Dell 800
1040 4 chairs home 2009-07-16 11:17:59 Staples 400
1040 4 chairs home1 2009-07-16 11:18:16 Staples 400
1040 4 chairs page1 2009-07-16 11:18:18 Staples 400
1040 5 cellphones home 2009-08-19 22:17:59 Samsung 600
1040 5 cellphones home1 2009-08-19 22:18:02 Samsung 600
1040 5 cellphones page1 2009-08-19 22:18:05 Samsung 600
1040 5 cellphones page2 2009-08-22 04:20:05 Samsung 600
1040 5 cellphones checkout 2009-08-24 14:30:05 Samsung 600
1040 5 cellphones page2 2009-08-27 23:03:05 Samsung 600
... ... ... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM nPath (


ON aggregate_clicks
PARTITION BY sessionid
ORDER BY clicktime ASC
MODE (NONOVERLAPPING)
PATTERN ('H+.D*.X*.P1.P2+')
SYMBOLS (
'true' AS X,
pagetype = 'home' AS H,
pagetype <> 'home' AND pagetype <> 'checkout' AS D,
pagetype = 'checkout' AS P1,
pagetype = 'checkout' AND
productprice > 100 AND
productprice > LAG (productprice, 1, 100::REAL) AS P2
)
RESULT (
FIRST (productname OF P1) AS first_product,
MAX_CHOOSE (productprice, productname OF P2) AS max_product,
FIRST (sessionid OF P2) AS sessionid
)
) ORDER BY sessionid;

Output

Table 243: nPath LAG Expression Example 2 Output Table

first_product max_product sessionid


bookcases cellphones 5

Filters
The Filter argument, which specifies filters to impose on the matched rows, can improve or degrade nPath
performance, depending on several factors. Filtering out most matches can improve performance, but
memory fragmentation can degrade it. Memory fragmentation can occur in these cases:
• The mode is NONOVERLAPPING and the pattern includes the endanchor operator ($) but not the
startanchor operator (^).
• The mode is OVERLAPPING and the pattern does not include the startanchor operator.
• The first symbol in the pattern can match an infinite number of input rows.
• The data partition is huge.
• The Java Virtual Machine (JVM) is too small.
If nPath runs much slower with the Filter argument, increase the size of the JVM. If the problem persists,
alter the pattern.

Example
This example takes as input the clickstream data from an online retail store and finds all sessions where users
visited the checkout page within 10 minutes of visiting the home page. The example cannot impose a time
window using a LAG expression, because the pattern includes an expression that can consume zero or more rows
(view*); therefore, it uses the Filter argument.

Input
Table 244: nPath Filter Example Input Table clickstream

userid sessionid clicktime pagetype


1 1 10-10-2012 10:15 home
1 1 10-10-2012 10:16 view
1 1 10-10-2012 10:17 view
1 1 10-10-2012 10:20 checkout
1 1 10-10-2012 10:30 checkout
1 1 10-10-2012 10:35 view
1 1 10-10-2012 10:45 view
2 2 10-10-2012 13:15 home
2 2 10-10-2012 13:16 view
2 2 10-10-2012 13:43 checkout
2 2 10-10-2012 13:35 view
2 2 10-10-2012 13:45 view

SQL-MapReduce Call

SELECT * FROM npath (


ON clickstream PARTITION BY userid ORDER BY clicktime
Symbols (pagetype='home' AS home,
pagetype!='home' AND pagetype!='checkout' AS view,
pagetype='checkout' AS checkout)
Pattern ('home.view*.checkout')
Result (FIRST(userid of ANY(home, checkout, view)) AS userid,
FIRST(sessionid of ANY(home, checkout, view)) AS sessionid,
COUNT(* of any(home, checkout, view)) AS cnt,
FIRST(clicktime of ANY(home)) AS firsthome,
LAST(clicktime of ANY(checkout)) AS lastcheckout)
Filter (FIRST(clicktime + '10 minutes' ::interval OF ANY (home)) >
FIRST(clicktime of any(checkout)))
Mode (NONOVERLAPPING)
);

Output
Table 245: nPath Filter Example Output Table

userid sessionid cnt firsthome lastcheckout


1 1 4 2012-10-10 10:15:00 2012-10-10 10:20:00

Result: Applying Aggregate Functions


The Result argument defines the output columns, specifying the values to retrieve from the matched rows
and the aggregate function to apply to these values.
For each pattern, the nPath function can apply one or more specified aggregate functions to the matched
rows and output the aggregate results. The supported aggregate functions are:
• SQL aggregate functions AVG, COUNT, MAX, MIN, and SUM
• Teradata Aster nPath sequence aggregate functions described in the following table
In the following table, col_expr is an expression whose value is a column name, symbol is defined by the
Symbols argument, and symbol_list has this syntax:

{ symbol | ANY (symbol[,...]) }

Function Description

COUNT ( { * | [DISTINCT] col_expr } OF symbol_list )
Returns either the total number of matched rows (*) or the number (or distinct number) of col_expr values in the matched rows.

FIRST ( col_expr OF symbol_list )
Returns the col_expr value of the first matched row. For the example in Pattern Matching, FIRST (pageid OF B) returns the pageid of row t2.

LAST ( col_expr OF symbol_list )
Returns the col_expr value of the last matched row. For the example in Pattern Matching, LAST (pageid OF B) returns the pageid of row t4.

NTH ( col_expr, n OF symbol_list )
Returns the col_expr value of the nth matched row, where n is a nonzero value of the data type SMALLINT, INTEGER, or BIGINT. The sign of n determines whether the nth matched row is counted from the first or the last matched row. For example, if n is 1, the nth matched row is the first matched row; if n is -1, the nth matched row is the last matched row. If n is greater than the number of matched rows, the NTH function returns NULL.

FIRST_NOTNULL ( col_expr OF symbol_list )
Returns the first non-null col_expr value in the matched rows.

LAST_NOTNULL ( col_expr OF symbol_list )
Returns the last non-null col_expr value in the matched rows.

MAX_CHOOSE ( quantifying_col_expr, descriptive_col_expr OF symbol_list )
Returns the descriptive_col_expr value of the matched row with the highest-sorted quantifying_col_expr value. For example, MAX_CHOOSE (product_price, product_name OF B) returns the product_name of the most expensive product in the rows that map to B. The descriptive_col_expr can have any data type. The quantifying_col_expr must have a sortable data type (SMALLINT, INTEGER, BIGINT, DOUBLE PRECISION, DATE, TIME, TIMESTAMP, VARCHAR, or CHARACTER).

MIN_CHOOSE ( quantifying_col_expr, descriptive_col_expr OF symbol_list )
Returns the descriptive_col_expr value of the matched row with the lowest-sorted quantifying_col_expr value. For example, MIN_CHOOSE (product_price, product_name OF B) returns the product_name of the least expensive product in the rows that map to B. The descriptive_col_expr can have any data type. The quantifying_col_expr must have a sortable data type (SMALLINT, INTEGER, BIGINT, DOUBLE PRECISION, DATE, TIME, TIMESTAMP, VARCHAR, or CHARACTER).

DUPCOUNT ( col_expr OF symbol_list )
Returns the duplicate count for col_expr in the matched rows. That is, for each matched row, the function returns the number of occurrences of the current value of col_expr in the immediately preceding matched row. When col_expr is also the ORDER BY col_expr, this function returns the equivalent of ROW_NUMBER()-RANK().

DUPCOUNTCUM ( col_expr OF symbol_list )
Returns the cumulative duplicate count for col_expr in the matched rows. That is, for each matched row, the function returns the number of occurrences of the current value of col_expr in all preceding matched rows. When col_expr is also the ORDER BY col_expr, this function returns the equivalent of ROW_NUMBER()-DENSE_RANK().

ACCUMULATE ( [ DISTINCT | CDISTINCT ] col_expr OF symbol_list [ DELIMITER 'delimiter' ] )
Returns, for each matched row, the concatenated values in col_expr, separated by delimiter. The default delimiter is ', ' (a comma followed by a space). DISTINCT limits the concatenated values to distinct values. CDISTINCT limits the concatenated values to consecutive distinct values.

You can compute an aggregate over more than one symbol. For example, SUM (val OF ANY (A,B))
computes the sum of the values of the attribute val across all rows in the matched segment that map to A or
B.
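For instance, the following minimal sketch applies this form (the table web_clicks and its columns sessionid, ts, pagetype, and val are hypothetical names used only for illustration):

SELECT * FROM nPath (
  ON web_clicks
  PARTITION BY sessionid
  ORDER BY ts
  MODE (NONOVERLAPPING)
  PATTERN ('A.B*')
  SYMBOLS (pagetype = 'home' AS A, pagetype <> 'home' AS B)
  RESULT (FIRST (sessionid OF A) AS sessionid,
          SUM (val OF ANY (A, B)) AS total_val)
);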
More examples:
• Example 1 uses FIRST, LAST_NOTNULL, MAX_CHOOSE, and MIN_CHOOSE
• Example 2 uses FIRST and three forms of ACCUMULATE
• Example 3 uses FIRST, three forms of ACCUMULATE, COUNT, and NTH

Example 1

Input
Table 246: nPath Aggregate Functions Example 1 Input Table trans1

userid gender ts productname productamt


1 M 2012-01-01 00:00:00 shoes 100
1 M 2012-02-01 00:00:00 books 300
1 M 2012-03-01 00:00:00 television 500
1 M 2012-04-01 00:00:00 envelopes 10
2 2012-01-01 00:00:00 bookcases 150
2 2012-02-01 00:00:00 tables 250
2 F 2012-03-01 00:00:00 appliances 1500
3 F 2012-01-01 00:00:00 chairs 400
3 F 2012-02-01 00:00:00 cellphones 600
3 F 2012-03-01 00:00:00 dvds 50

SQL-MapReduce Call

SELECT * FROM NPATH (


ON trans1
PARTITION BY userid ORDER BY ts
MODE (nonoverlapping)
PATTERN ('A+')
SYMBOLS(TRUE AS A)
RESULT (FIRST(userid OF A) AS Userid,
LAST_NOTNULL (gender OF A) AS Gender,

MAX_CHOOSE (productamt, productname OF A) AS Max_prod,
MIN_CHOOSE (productamt, productname OF A) AS Min_prod)
) ORDER BY 1;

Output
Table 247: nPath Aggregate Functions Example 1 Output Table

userid gender max_prod min_prod


1 M television envelopes
2 F appliances bookcases
3 F cellphones dvds

Example 2

Input
Table 248: Aggregate Functions Example 2 Input Table: clicks

userid sessionid productname pagetype clicktime referrer productprice


1039 1 null home 06:59:13 Nike 100
1039 1 null home 07:00:10 Bestbuy 300
1039 1 television checkout 07:00:12 Bestbuy 500
1039 1 television checkout 07:00:18 Bestbuy 10
1039 1 envelopes checkout 07:01:00 Staples 10
1039 1 null checkout 07:01:10 Staples 10

SQL-MapReduce Call

SELECT * FROM npath (


ON clicks PARTITION BY sessionid ORDER BY clicktime
Mode ('nonoverlapping')
Symbols (pagetype='home' AS H, pagetype='checkout' AS C,
pagetype!='home' AND pagetype!='checkout' AS A)
Pattern ('^H+.A*.C+$')
Result (
FIRST (sessionid OF ANY (H, A, C)) AS sessionid,
FIRST (clicktime OF H) AS firsthome,
FIRST (clicktime OF C) AS firstcheckout,
ACCUMULATE (productname OF ANY (H,A,C) DELIMITER '*')
AS products_accumulate,
ACCUMULATE (CDISTINCT productname OF ANY (H,A,C) DELIMITER '$$')
AS cde_dup_products,
ACCUMULATE (DISTINCT productname OF ANY (H,A,C))

AS de_dup_products
)
) ORDER BY sessionid;

Output
Table 249: nPath Aggregate Functions Example 2 Output Table (Columns 1-4)

sessionid firsthome firstcheckout products_accumulate


1 06:59:13 07:00:12 [null*null*television*television*envelopes*null]

Table 250: nPath Aggregate Functions Example 2 Output Table (Columns 5-6)

cde_dup_products de_dup_products
[null$$television$$envelopes$$null] [null, television, envelopes]

Example 3

Input
This example uses the same input table, Aggregate Functions Example 2 Input Table: clicks, as was used in
Example 2.

SQL-MapReduce Call

SELECT * FROM npath (


ON clicks PARTITION BY sessionid ORDER BY clicktime
Mode ('nonoverlapping')
Symbols (pagetype='home' AS H, pagetype='checkout' AS C,
pagetype!='home' AND pagetype!='checkout' AS A)
Pattern ('^H+.A*.C+$')
Result (
FIRST (sessionid OF ANY (H, A, C)) AS sessionid,
FIRST (clicktime OF H) AS firsthome,
FIRST (clicktime OF C) AS firstcheckout,
ACCUMULATE (productname OF ANY (H,A,C))
AS products_accumulate,
COUNT (DISTINCT productname OF ANY(H,A,C))
AS count_distinct_products,
ACCUMULATE (CDISTINCT productname OF ANY (H,A,C))
AS consecutive_distinct_products,
ACCUMULATE (DISTINCT productname OF ANY (H,A,C))
AS distinct_products,
NTH (productname, -1 OF ANY(H,A,C)) AS nth
)
) ORDER BY sessionid;

Output
Table 251: nPath Aggregate Functions Example 3 Output Table (Columns 1-5)

sessionid firsthome firstcheckout products_accumulate count_distinct_products


1 06:59:13 07:00:12 [null, null, television, 3
television, envelopes,
null]

Table 252: nPath Aggregate Functions Example 3 Output Table (Columns 6-8)

consecutive_distinct_products distinct_products nth


[null, television, envelopes, null] [null, television, envelopes] null

nPath Examples

Clickstream Data Examples

Input, Symbols, and Symbol Predicates


This statement creates the input table of clickstream data that the examples use:

CREATE TABLE clicks1 (


ts TIME,
userid INTEGER,
sessionid INTEGER,
pageid INTEGER,
category INTEGER,
val FLOAT,
referrer VARCHAR (256)
) DISTRIBUTE BY HASH (category);

The following table summarizes the symbols and symbol predicates that the examples use.
Table 253: nPath Clickstream Data Examples Symbols and Symbol Predicates

Symbol Symbol Predicate


A pageid IN (10, 25)
B category = 10 OR (category = 20 AND pageid <> 33)
C category IN (SELECT pageid FROM clicks1 GROUP BY userid HAVING COUNT(*) > 10)
D referrer LIKE '%Amazon%'
X true

Basic SQL-MapReduce Call Structure

SELECT ...
FROM nPath (...
SYMBOLS (pageid IN (10, 25) AS A,
category = 10 OR (category = 20 AND pageid <> 33) AS B,
category IN (SELECT pageid
FROM clicks1
GROUP BY userid
HAVING COUNT(*) > 10
) AS C,
referrer LIKE '%Amazon%' AS D,
true AS X
) ...
) ...

Combining Values from Adjacent Rows


This SQL-MapReduce call gets the pageid for each row and the pageid for the next row in sequence:

SELECT sessionid, pageid, next_pageid FROM nPath (


ON clicks1
PARTITION BY sessionid
ORDER BY ts
MODE (OVERLAPPING)
PATTERN ('A.B')
SYMBOLS (true AS A, true AS B)
RESULT (FIRST(sessionid OF A) AS sessionid,
FIRST (pageid OF A) AS pageid,
FIRST (pageid OF B) AS next_pageid
)
);

Counting Preceding Rows in a Sequence


For each row, this SQL-MapReduce call counts the number of preceding rows in a given sequence (including
the current row). The ORDER BY clause specifies DESC because the pattern must be matched over the rows
preceding the start row, while the semantics dictate that the pattern be matched over the rows following the
start row.

SELECT sessionid, pageid, rank FROM nPath (


ON clicks1
PARTITION BY sessionid
ORDER BY ts DESC
MODE (OVERLAPPING)
PATTERN ('A*')
SYMBOLS (true AS A)
RESULT (FIRST(sessionid OF A) AS sessionid,
FIRST (pageid OF A) AS pageid,
COUNT (* OF A) AS rank)
);

Complex Path Query
This SQL-MapReduce call finds the user click-paths that start at pageid 50 and proceed either to pageid 80 or
to pages in category 9 or category 10, finds the pageid of the last page in the path, counts the visits to page
80, and returns the maximum count for each last page, by which it sorts the output. The query ignores paths
of fewer than five pages and pages for which category is less than zero.

SELECT last_pageid, MAX(count_page80) FROM nPath (


ON (SELECT * FROM clicks1 WHERE category >= 0)
PARTITION BY sessionid ORDER BY ts
PATTERN ('A.(B|C)*')
MODE (OVERLAPPING)
SYMBOLS (pageid = 50 AS A,
pageid = 80 AS B,
pageid <> 80 AND category IN (9,10) AS C)
RESULT (LAST(pageid OF ANY (A,B,C)) AS last_pageid,
COUNT (* OF B) AS count_page80,
COUNT (* OF ANY (A,B,C)) AS count_any)
) WHERE count_any >= 5
GROUP BY last_pageid
ORDER BY MAX(count_page80);

Range-Matching Examples

Input
The examples in this section use the input table aggregate_clicks (nPath LAG Expression Example 2 Input Table) from LAG Expression Example 2 in Symbols. The table is a collection of clickstream data for different products, with price information. The userid and sessionid columns identify the users.

Example 1: Accumulate Pages Visited in Each Session

SQL-MapReduce Call

SELECT * FROM npath (


ON aggregate_clicks
PARTITION BY sessionid
ORDER BY clicktime
MODE (nonoverlapping)
PATTERN ('A*')
SYMBOLS (TRUE AS A)
RESULT (FIRST (sessionid OF A) AS sessionid,
ACCUMULATE (pagetype OF A) AS path)
) ORDER BY sessionid;

Output

Table 254: nPath Range-Matching Example 1 Output Table

sessionid path
1 [home, home1, page1, home, home1, page1, home, home, home, home1, page1, checkout,
home, home, home, home, home, home, home, home, home]
2 [home, home, home, home, home, home, home, home, home, home1, page1, checkout,
checkout, home, home]
3 [home, home, home, home, home, home, home, home, home1, page1, home, home1, page1,
home]
4 [home, home, home, home, home, home, home1, home1, home1, page1, page1, page1]
5 [home, home, home, home, home1, home1, home1, page1, page1, page1, page2, page2, page2,
checkout, checkout, checkout, page2, page2, page2]

Example 2: Find Sessions That Start at Home Page and Visit Page1

SQL-MapReduce Call

SELECT * FROM npath (


ON aggregate_clicks
PARTITION BY sessionid
ORDER BY clicktime
MODE (nonoverlapping)
PATTERN ('^H.A*.P1.A*')
SYMBOLS (pagetype='home' AS H, pagetype='page1' AS P1, TRUE AS A)
RESULT (FIRST(sessionid OF A) AS sessionid,
ACCUMULATE (pagetype OF ANY(H,P1,A)) AS path)
) ORDER BY sessionid;

Output

Table 255: nPath Range-Matching Example 2 Output Table

sessionid path
1 [home, home1, page1, home, home1, page1, home, home, home, home1, page1, checkout,
home, home, home, home, home, home, home, home, home]
2 [home, home, home, home, home, home, home, home, home, home1, page1, checkout,
checkout, home, home]
3 [home, home, home, home, home, home, home, home, home1, page1, home, home1, page1,
home]
4 [home, home, home, home, home, home, home1, home1, home1, page1, page1, page1]
5 [home, home, home, home, home1, home1, home1, page1, page1, page1, page2, page2, page2,
checkout, checkout, checkout, page2, page2, page2]

Example 3: Find Paths to Checkout Page for Purchases Over $200

SQL-MapReduce Call

SELECT * FROM npath (


ON aggregate_clicks
PARTITION BY sessionid
ORDER BY clicktime
MODE (nonoverlapping)
PATTERN ('A*.C+.A*')
SYMBOLS (productprice > 200 AND
pagetype='checkout' AS C, true AS A)
RESULT (FIRST(sessionid OF A) AS sessionid,
ACCUMULATE (pagetype OF ANY(A,C)) AS path,
AVG (productprice OF ANY(A,C)) AS sum)
) ORDER BY sessionid;

Output

Table 256: nPath Range-Matching Example 3 Output Table

sessionid path sum


1 [home, home1, page1, home, home1, page1, home, home, 602.857142857143
home, home1, page1, checkout, home, home, home, home,
home, home, home, home, home]
5 [home, home, home, home, home1, home1, home1, page1, 363.157894736842
page1, page1, page2, page2, page2, checkout, checkout,
checkout, page2, page2, page2]

Example 4: Use OVERLAPPING Mode

SQL-MapReduce Call

SELECT * FROM npath (


ON aggregate_clicks
PARTITION BY sessionid
ORDER BY clicktime
MODE (overlapping)
PATTERN ('A.A')
SYMBOLS (TRUE AS A)
RESULT (FIRST(sessionid OF A) AS sessionid,
ACCUMULATE (pagetype OF A) AS path)
) ORDER BY sessionid;

Output

Table 257: nPath Range-Matching Example 4 Output Table

sessionid path
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [checkout, home]
1 [page1, checkout]
1 [home1, page1]
1 [home, home1]
1 [home, home]
1 [home, home]
1 [page1, home]
1 [home1, page1]
1 [home, home1]
1 [page1, home]
1 [home1, page1]
1 [home, home1]
2 [home, home]
2 [checkout, home]
2 [checkout, checkout]
... ...

Example 5: Find First Product with Multiple Referrers in Any Session

SQL-MapReduce Call

SELECT * FROM npath (


ON aggregate_clicks
PARTITION BY sessionid

ORDER BY clicktime
MODE (nonoverlapping)
PATTERN ('REFERRER{2,}')
SYMBOLS (referrer IS NOT NULL AS REFERRER)
RESULT (FIRST(sessionid OF REFERRER) AS sessionid,
FIRST(productname OF REFERRER) AS product)
) ORDER BY sessionid;

Output

Table 258: nPath Range-Matching Example 5 Output Table

sessionid product
1 envelopes
2 tables
3 bookcases
4 tables
5 Appliances

Example 6: Find Data for Sessions That Checked Out 3-6 Products
For sessions in which the user checked out at least three and at most six products, return the names of the
most and least expensive products, the maximum price of the most expensive product, and the minimum
price of the least expensive product.

SQL-MapReduce Call

SELECT * FROM npath (


ON aggregate_clicks
PARTITION BY sessionid
ORDER BY clicktime
MODE (nonoverlapping)
PATTERN ('H+.D*.C{3,6}.D')
SYMBOLS (pagetype = 'home' AS H, pagetype='checkout' AS C,
pagetype<>'home' AND pagetype<>'checkout' AS D)
RESULT (FIRST(sessionid OF C) AS sessionid,
max_choose(productprice, productname OF C) AS
most_expensive_product,
MAX(productprice OF C) AS max_price,
min_choose(productprice, productname of C) AS
least_expensive_product,
MIN(productprice OF C) AS min_price)
) ORDER BY sessionid;

Output

Table 259: nPath Range-Matching Example 6 Output Table

sessionid most_expensive_product max_price least_expensive_product min_price


5 cellphones 600 bookcases 150

Example 7: Find Data for Sessions That Checked Out at Least 3 Products
Modify the SQL-MapReduce call in Example 6 to find sessions where the user checked out at least three
products by changing the Pattern argument to:

PATTERN('H+.D*.C{3,}.D')

SQL-MapReduce Call

SELECT * FROM npath (


ON aggregate_clicks
PARTITION BY sessionid
ORDER BY clicktime
MODE (nonoverlapping)
PATTERN('H+.D*.C{3,}.D')
SYMBOLS(pagetype = 'home' AS H, pagetype='checkout' AS C,
pagetype<>'home' AND pagetype<>'checkout' AS D)
RESULT (FIRST(sessionid OF C) AS sessionid,
max_choose(productprice, productname OF C) AS
most_expensive_product,
MAX (productprice OF C) AS max_price,
min_choose (productprice, productname OF C) AS
least_expensive_product,
MIN (productprice OF C) AS min_price)
) ORDER BY sessionid;

Output

Table 260: nPath Range-Matching Example 7 Output Table

sessionid most_expensive_product max_price least_expensive_product min_price


5 cellphones 600 bookcases 150

Multiple Partitioned Input Tables


An e-commerce store wants to know if people add and then remove items from their shopping carts more
often when the price is more than $1000 than when the price is less than $100.

Input
The following statements create the input tables:

CREATE TABLE purchases_table (
    userid VARCHAR,
    purchaseDate TIMESTAMP,
    purchaseId INTEGER,
    itemid INTEGER,
    numitem INTEGER,
    pricePerItem DOUBLE,
    sessionid INTEGER,
    PARTITION KEY (userid)
);
CREATE TABLE addToCart_table (
    userid VARCHAR,
    addtocartDate TIMESTAMP,
    cartId INTEGER,
    itemid INTEGER,
    numitem INTEGER,
    pricePerItem DOUBLE,
    sessionid INTEGER,
    PARTITION KEY (userid)
);
CREATE TABLE removeFromCart_table (
    userid VARCHAR,
    removefromcartDate TIMESTAMP,
    cartId INTEGER,
    itemid INTEGER,
    numitem INTEGER,
    pricePerItem DOUBLE,
    sessionid INTEGER,
    PARTITION KEY (userid)
);
CREATE TABLE ItemViews_table (
    userid VARCHAR,
    viewDate TIMESTAMP,
    pricedisplayPerItem DOUBLE,
    sessionid INTEGER,
    PARTITION KEY (userid)
);

SQL-MapReduce Call

SELECT * from npath (


ON purchases_table PARTITION BY userid ORDER BY purchaseDate
ON AddToCart_table PARTITION BY userid ORDER BY addToCartDate
ON RemoveFromCart_table PARTITION BY userid
ORDER BY removeFromCartDate
ON ItemViews_table PARTITION BY userid ORDER BY viewDate
Mode ('NONOVERLAPPING')
Symbols (true as PURCHASE,
AddToCart_table.pricePerItem >= 1000 AS expensiveAdd,
AddToCart_table.pricePerItem <= 100 AS cheapAdd,
RemoveFromCart_table.pricePerItem >= 1000 AS expensiveRemove,
RemoveFromCart_table.pricePerItem <= 100 AS cheapRemove, true AS View)
Pattern ('(View*).((expensiveAdd.(View*).expensiveRemove) |
(cheapAdd.(View*).cheapRemove)).Purchase+')
Result (
FIRST (AddToCart_table.itemId OF ANY (expensiveAdd, cheapAdd))
AS added_item,
FIRST (RemoveFromCart_table.itemId
OF ANY (expensiveRemove, cheapRemove)) AS removed_item,
FIRST ((CASE WHEN AddToCart_table.pricePerItem >= 1000 THEN
'expensive' ELSE 'cheap' END) OF ANY (expensiveAdd, cheapAdd))
AS price_bracket
)
);

Multiple Partitioned Input Tables and Dimension Input Table


An e-commerce store wants to count the advertising impressions that lead to a user clicking an online
advertisement. The example counts the online advertisements that the user viewed and the television
advertisements that the user might have viewed.

Input

Table 261: nPath Multiple-Input Example 2 Input Table impressions

userid ts imp
1 2012-01-01 ad1
1 2012-01-02 ad1
1 2012-01-03 ad1
1 2012-01-04 ad1
1 2012-01-05 ad1
1 2012-01-06 ad1
1 2012-01-07 ad1
2 2012-01-08 ad2
2 2012-01-09 ad2
2 2012-01-10 ad2
2 2012-01-11 ad2
... ... ...

Table 262: nPath Multiple-Input Example 2 Input Table clicks2

userid ts click
1 2012-01-01 ad1
2 2012-01-08 ad2
3 2012-01-16 ad3
4 2012-01-23 ad4

5 2012-02-01 ad5
6 2012-02-08 ad6
7 2012-02-14 ad7
8 2012-02-24 ad8
9 2012-03-02 ad9
10 2012-03-10 ad10
11 2012-03-18 ad11
12 2012-03-25 ad12
13 2012-03-30 ad13
14 2012-04-02 ad14
15 2012-04-06 ad15

Table 263: nPath Multiple-Input Example 2 tv_spots

ts tv_imp
2012-01-01 ad1
2012-01-02 ad2
2012-01-03 ad3
2012-01-04 ad4
2012-01-05 ad5
2012-01-06 ad6
2012-01-07 ad7
2012-01-08 ad8
2012-01-09 ad9
2012-01-10 ad10
2012-01-11 ad11
2012-01-12 ad12
2012-01-13 ad13
2012-01-14 ad14
2012-01-15 ad15

SQL-MapReduce Call
The tables impressions and clicks2 have a userid column, but the table tv_spots is only a record of television
advertisements shown, which any user might have seen. Therefore, tv_spots must be a dimension table.

SELECT * FROM npath (


ON impressions PARTITION BY userid ORDER BY ts
ON clicks2 PARTITION BY userid ORDER BY ts
ON tv_spots DIMENSION ORDER BY ts
MODE ('nonoverlapping')
SYMBOLS (true as imp, true as click, true as tv_imp)
PATTERN ('(imp|tv_imp)*.click')
RESULT (COUNT(* of imp) as imp_cnt,
COUNT (* of tv_imp) as tv_imp_cnt)
) ORDER BY imp_cnt;

Output

Table 264: nPath Multiple-Input Example 2 Output Table

imp_cnt tv_imp_cnt
18 0
19 0
19 0
20 0
21 0
22 0
22 0
22 0
22 0
22 0
23 0
23 0
23 0
24 0
25 0



CHAPTER 5
Statistical Analysis

Statistical Analysis
• Approximate Distinct Count
• Approximate Percentile
• CMAVG
• ConfusionMatrix
• Correlation
• CoxPH
• CoxPredict
• CoxSurvFit
• CrossValidation
• Distribution Matching
• EMAVG
• FMeasure
• GLM
• GLMPredict
• Hidden Markov Model Functions
• Histogram
• KNN
• LARS Functions
• Linear Regression
• LRTEST
• Percentile
• Principal Component Analysis
• PCAPlot
• RandomSample
• Sample
• Shapley Value Functions
• SMAVG
• Support Vector Machines
• VectorDistance
• VWAP
• WMAVG


Approximate Distinct Count

Summary
The Approximate Distinct Count function, which is composed of the ApproxDCountReduce and
ApproxDCountMap functions, can estimate the number of distinct values (cardinality) in a column or
combination of columns, scanning the table only once.
This function is recommended when the column or combination of columns has a large cardinality. The
function can estimate the number of distinct values much faster than the SQL SELECT DISTINCT
command can return the precise number of distinct values.
When the cardinality is small, the SQL SELECT DISTINCT command is recommended.

Background
When the column or combination of columns has a large cardinality, the function uses the Flajolet-Martin
algorithm, which approximates the number of distinct elements in a large set of numbers with a single pass,
by counting some bitmap functions of the hashed values of the large set of numbers.
The value nmap/φ * 2^(S/nmap) asymptotically converges to the number of distinct values in the set, where:
• S is the calculated sum of the bitmap function.
• nmap is the number of hash map functions used, determined by the specified error tolerance.
• φ is a constant with the approximate value 0.77.
When the number of distinct values in the set is small, the function counts them, rather than using the
Flajolet-Martin algorithm. To understand why, consider the case where the distinct count is 5: the value
nmap/φ * 2^(S/nmap) is approximately 85 when the error is 10% and approximately 10590 when the error is 1%.
For more information about probabilistic counting algorithms, see Probabilistic Counting Algorithms for
Data Base Applications, by Philippe Flajolet and G. Nigel Martin (http://portal.acm.org/citation.cfm?id=5215).
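As a hedged numeric illustration (the values of nmap and S below are invented for exposition, not produced by the function):

nmap = 64, S = 320: estimate = (64/0.77) * 2^(320/64) = 83.1 * 32 ≈ 2660 distinct values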

Usage

Approximate Distinct Count Syntax


Version 1.0

SELECT * FROM ApproxDCountReduce (


ON ( [ SELECT * FROM ] ApproxDCountMap (
ON { table_name | view_name| (query) }
InputColumns ({ 'input_column' | 'input_column_range' }[,...])
[ ErrorRate ('error_tolerance') ]
)
)
PARTITION BY expression[,...]
);

Arguments
Argument Category Description
InputColumns Required Specifies the columns for which to estimate the number of
distinct values.
ErrorRate Optional Specifies the acceptable error rate, expressed as a decimal (for
example, if error_tolerance is 10, then the acceptable error rate is
10%). The error_tolerance must be in the range (5.0E-4, 10]. The
default value is 10.

Input
The input table requires only the columns specified by the InputColumns argument. The table can have
additional columns, but the function ignores them.
Table 265: Approximate Distinct Count Input Table Schema

Column Data Description


Name Type
column_name Any Column for which to estimate the number of distinct values. The table has one or
more such columns.

Output
Table 266: Approximate Distinct Count Output Table Schema

Column Name Data Type Description


column_name   Inherited from input table   Name of input column (column_name) or column range (start_column_end_column) for which the approximate distinct count was computed.
cnt INTEGER Approximate distinct count.
method VARCHAR Method used for calculating the approximate distinct count—'approx'
(Flajolet-Martin algorithm) or 'nearExact' (counting).

Example
This example calculates the number of distinct values for each specified column with a 1% error rate
(accuracy).

Input
The input table has more than 3000 rows of price and advertisement information for U.S. cracker brands
Sunshine, Keebler, Nabisco, and a private label (such as a store brand). In the input column names:
• dispbrand means that the seller displayed the brand prominently.
• featbrand means that the seller featured the brand.

• pricebrand is the price of the brand.
Table 267: Approximate Distinct Count Example Input Table cracker (Columns 1-8)

sn id dispsunshine dispkeebler dispnabisco dispprivate featsunshine featkeebler


1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 1 1 0 0 0 0 0
4 1 0 0 0 0 0 0
5 1 0 0 0 0 0 0
6 1 0 0 0 0 0 0
7 1 0 0 1 0 0 0
8 1 0 0 1 0 0 0
9 1 0 0 1 0 0 0
10 1 1 0 1 0 0 0
11 1 0 0 1 0 0 0
12 1 0 0 0 0 0 0
13 1 1 0 0 0 0 0
14 1 0 1 1 0 0 0
15 1 0 0 0 0 0 0
... ... ... ... ... ... ... ...

Table 268: Approximate Distinct Count Example Input Table crackers (Columns 9-15)

featnabisco featprivate pricesunshine pricekeebler pricenabisco priceprivate choice


0 0 98 88 120 71 nabisco
0 0 99 109 99 71 nabisco
0 0 49 109 109 78 sunshine
0 0 103 109 89 78 nabisco
0 0 109 109 119 64 nabisco
0 0 89 109 119 84 nabisco
0 0 109 109 129 78 sunshine
0 0 109 119 129 78 nabisco
0 0 109 121 109 78 nabisco
0 0 79 121 109 78 nabisco
0 0 109 113 109 96 nabisco
0 0 109 121 99 86 nabisco

0 0 89 121 99 86 nabisco
0 0 109 109 129 96 nabisco
0 0
... ... ... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM ApproxDCountReduce (


ON ApproxDCountMap (
ON cracker
InputColumns ('sn', 'id', 'dispsunshine', 'dispkeebler', 'dispnabisco',
'dispprivate', 'featsunshine', 'featkeebler', 'featnabisco',
'featprivate', 'pricesunshine', 'pricekeebler', 'pricenabisco',
'priceprivate', 'choice')
ErrorRate (1)
)
PARTITION BY column_name
) ORDER BY column_name;

Output
Table 269: Approximate Distinct Count Example Output Table

column_name cnt method


choice 4 nearExact
dispkeebler 2 nearExact
dispnabisco 2 nearExact
dispprivate 2 nearExact
dispsunshine 2 nearExact
featkeebler 2 nearExact
featnabisco 2 nearExact
featprivate 2 nearExact
featsunshine 2 nearExact
id 136 nearExact
pricekeebler 29 nearExact
pricenabisco 42 nearExact
priceprivate 39 nearExact
pricesunshine 27 nearExact

sn 3292 nearExact

Approximate Percentile

Summary
The Approximate Percentile function, composed of ApproxPercentileReduce and ApproxPercentileMap,
computes approximate percentiles for one or more columns of data. The nth percentile is the smallest value
in a data set that is greater than n% of the values. The larger the data set, the more accurate the approximate
percentile.

Background
The Approximate Percentile function is based on an algorithm developed by Greenwald and Khanna. The
function gives e-approximate quantile summaries of a set of N elements, where e is the error (the desired
accuracy of the approximation). Given any rank r, an e-approximate summary returns a value whose rank r'
is in the interval [r - eN, r + eN]. The algorithm has a worst-case space requirement of O((1/e) * log(eN)).
When running the Approximate Percentile function, you specify e with the ErrorRate argument.
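As a hedged numeric illustration (invented values): with N = 10,000 rows and e = 0.01, a query for the median (rank r = 5,000) can return any value whose true rank lies in [5,000 - (0.01)(10,000), 5,000 + (0.01)(10,000)], that is, between ranks 4,900 and 5,100.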

Usage

Approximate Percentile Syntax


ApproxPercentileMap version 1.1, ApproxPercentileReduce version 1.1

SELECT * FROM ApproxPercentileReduce (


ON ([ SELECT * FROM ] ApproxPercentileMap (
ON { table | view | (query) }
TargetColumns ({ 'target_column' | 'target_column_range' }[,...])
[ ErrorRate (error) ]
[ GroupColumns ({ 'group_column' | group_column_range }[,...]) ]
)
) PARTITION BY [ 1 | group_column [,...]]
[ Percentile (percentile [,...]) ]
[ TargetColumns
({ 'target_column' | 'target_column_range' }[,...]) ]
[ GroupColumns ({ 'group_column' | group_column_range }[,...]) ]
);

Arguments
Argument Category Description
TargetColumns Required Specifies the names of the input columns for which to compute
approximate percentiles.
If you specify only one target_column in the
ApproxPercentileMap function, then you can omit this argument
in the ApproxPercentileReduce function.
If you specify more than one target_column in the
ApproxPercentileMap function, then you must specify this
argument in the ApproxPercentileReduce function. In
ApproxPercentileReduce, this argument must specify at least one
target_column that it specifies in ApproxPercentileMap. Only the
target columns specified in ApproxPercentileReduce appear in
the output table.
ErrorRate Optional Specifies the error (desired accuracy) of the approximation; that
the quantile is to be correct within error%. The error must be in
the range [.01, 50]. The default value is 1. Lower error is more
accurate but takes longer to compute.
Percentile Optional Specifies the approximate percentiles to compute. Each percentile
is an INTEGER. By default, the function computes the
percentiles 0, 25, 50, 75, and 100.
GroupColumns Optional Specifies the names of the input columns by which to group the
data. If you specify this argument, the function computes the
approximate percentile for each group in each column. If you
omit this argument, the function computes the approximate
percentile for the entire column. For example, suppose that the
target columns are State, Town, and Population. If you specify
GroupColumns('State'), the function computes the approximate
percentile for the population of each state. If you omit this
argument, the function computes the approximate percentile for
the population across all towns.
To specify this argument, you must do so in both the
ApproxPercentileMap and the ApproxPercentileReduce
functions, and specify each group_column in the PARTITION BY
clause.
If you omit this argument, specify PARTITION BY 1.

Input
The following table describes the required columns of the input table. The input table can have additional
columns, but the function ignores them.
Table 270: ApproxPercentileMap Input Table Schema

Column Name Data Type Description

target_column   SMALLINT, INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION   Column for which to compute the approximate percentile.
group_column   INTEGER or VARCHAR   Column by which to group the data.

Output
Table 271: ApproxPercentileReduce Output Table Schema

Column Name Data Type Description


group_column   Same as in input table   Column by which the data is grouped. Appears only if you specify the GroupColumns argument.
percentile   INTEGER   Approximate percentile of group (if you specify the GroupColumns argument) or column (by default).
value   DOUBLE PRECISION   Value of the percentile, correct within error%.

Example
This example calculates the approximate percentiles 0, 25, 50, 75, and 100 within a 2% error rate for four
brands of crackers.

Input
The input table has more than 3000 rows of price and advertisement information for the U.S. cracker brands
Sunshine, Keebler, Nabisco, and a private label (such as a store brand). In the input column names:
• dispbrand means that the seller displayed the brand prominently.
• featbrand means that the seller featured the brand.
• pricebrand is the price of the brand.
Table 272: Approximate Percentile Example Input Table cracker (Columns 1-8)

sn id dispsunshine dispkeebler dispnabisco dispprivate featsunshine featkeebler


1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 1 1 0 0 0 0 0
4 1 0 0 0 0 0 0
5 1 0 0 0 0 0 0
6 1 0 0 0 0 0 0

7 1 0 0 1 0 0 0
8 1 0 0 1 0 0 0
9 1 0 0 1 0 0 0
10 1 1 0 1 0 0 0
11 1 0 0 1 0 0 0
12 1 0 0 0 0 0 0
13 1 1 0 0 0 0 0
14 1 0 1 1 0 0 0
15 1 0 0 0 0 0 0
... ... ... ... ... ... ... ...

Table 273: Approximate Percentile Example Input Table cracker (Columns 9-15)

featnabisco featprivate pricesunshine pricekeebler pricenabisco priceprivate choice


0 0 98 88 120 71 Nabisco
0 0 99 109 99 71 Nabisco
0 0 49 109 109 78 Sunshine
0 0 103 109 89 78 Nabisco
0 0 109 109 119 64 Nabisco
0 0 89 109 119 84 Nabisco
0 0 109 109 129 78 Sunshine
0 0 109 119 129 78 Nabisco
0 0 109 121 109 78 Nabisco
0 0 79 121 109 78 Nabisco
0 0 109 113 109 96 Nabisco
0 0 109 121 99 86 Nabisco
0 0 89 121 99 86 Nabisco
0 0 109 109 129 96 Nabisco
0 0
... ... ... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM ApproxPercentileReduce (


ON ( SELECT * FROM ApproxPercentileMap (

ON cracker
TargetColumns ('pricesunshine', 'pricekeebler',
'pricenabisco', 'priceprivate')
GroupColumns ('choice')
ErrorRate (2)
)
) PARTITION BY choice
GroupColumns ('choice')
Percentile (0, 25, 50, 75, 100)
TargetColumns ('pricesunshine', 'pricekeebler',
'pricenabisco', 'priceprivate')
) ORDER BY choice, percentile;

Output
Table 274: Approximate Percentile Example Output Table

choice percentile pricesunshine pricekeebler pricenabisco priceprivate


Keebler 0 49 88 88 38
Keebler 25 89 99 99 60.0000038146973
Keebler 50 97 109 109 65
Keebler 75 105 109 125 78
Keebler 100 129 135 129 96
Nabisco 0 49 88 0 38
Nabisco 25 89 105 99 61
Nabisco 50 97 115 105 65
Nabisco 75 105 121 119.000007629395 78
Nabisco 100 129 139 169.000015258789 115
private 0 49 88 89 38
private 25 89 109 99 55
private 50 98 113 109 58.9999961853027
private 75 109 121 125 79
private 100 129 135 129 115
Sunshine 0 49 88 88 49
Sunshine 25 79 107.000007629395 99 64
Sunshine 50 89 109 109 65
Sunshine 75 97 124 119.000007629395 78
Sunshine 100 129 135 129 96


CMAVG

Summary
The CMAVG (cumulative moving average) function computes the cumulative moving average of a value
from the beginning of a series.

Background
In a cumulative moving average, data are added to the data set in an ordered stream over time. The
objective is to compute the average of all the data at each point in time at which new data arrives. For example,
an investor may want to find the average price of all transactions for a particular stock over time,
up to the current time.
The cumulative moving average computes the arithmetic average of all the rows from the beginning of the
time series, using this formula:

CMAVG = SUM(a1, ..., aN)/N

N is the number of rows from the beginning of the data set.
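Equivalently (a standard algebraic identity, noted here for reference), each new row can update the running average incrementally, without re-summing the series:

CMAVG(N+1) = CMAVG(N) + (a(N+1) - CMAVG(N)) / (N+1)

As a check against the example output later in this section: after the first two IBM closing prices, 460 and 457, the cumulative moving average is (460 + 457)/2 = 458.5, which matches stockprice_mavg in the second output row.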

Usage

CMAVG Syntax
Version 1.2

SELECT * FROM CMAVG (


ON { table_name| view_name| (query) }
PARTITION BY partition_column
ORDER BY order_by_column
[ TargetColumns ({ 'target_column' | target_column_range }[,...]) ]
);

Arguments
Argument Category Description
TargetColumns Optional Specifies the names of the input columns for which the cumulative
moving average is to be computed. If you omit this argument, then the
function only copies all input columns to the output table.

Input
The following table describes the required columns of the input table. The input table can have additional
columns, but the function ignores them.

Table 275: CMAVG Input Table Schema

Column Name Data Type Description


partition_column   INTEGER, BIGINT, NUMERIC, or VARCHAR   Column on which the input data is partitioned. This column must contain all rows that contain the entity to be averaged. For example, if the function is to return the cumulative moving average of a particular stock share price, then all transactions of that stock must be in one partition.
cma_column   INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION   Column whose values are to be averaged.
order_by_column   INTEGER, BIGINT, TIMESTAMP, or TIME   Column that contains the time stamps of the time points in the time series.

Output
Table 276: CMAVG Output Table Schema

Column Name Data Type Description


partition_column   INTEGER, BIGINT, NUMERIC, or VARCHAR   Column on which the input data is partitioned.
cma_column   INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION   Column whose values are to be averaged.
order_by_column   INTEGER, BIGINT, TIMESTAMP, or TIME   Column that contains the time stamps of the time points in the time series.
cma_column_mavg   DOUBLE PRECISION   Cumulative moving average of the cma_column at the time that the row was added to the data set.

Example
This example computes a cumulative moving average for the price of IBM stock. The input data is a series of
IBM common stock closing prices from 17 May 1961 to 2 November 1962.

Input
Table 277: Input table ibm_stock for moving average function examples

id name period stockprice


1 IBM 1961-05-17 00:00:00 460
2 IBM 1961-05-18 00:00:00 457
3 IBM 1961-05-19 00:00:00 452
4 IBM 1961-05-22 00:00:00 459
5 IBM 1961-05-23 00:00:00 462
6 IBM 1961-05-24 00:00:00 459
7 IBM 1961-05-25 00:00:00 463
8 IBM 1961-05-26 00:00:00 479
9 IBM 1961-05-29 00:00:00 493
10 IBM 1961-05-31 00:00:00 490
11 IBM 1961-06-01 00:00:00 492
12 IBM 1961-06-02 00:00:00 498
13 IBM 1961-06-05 00:00:00 499
14 IBM 1961-06-06 00:00:00 497
15 IBM 1961-06-07 00:00:00 496
16 IBM 1961-06-08 00:00:00 490
17 IBM 1961-06-09 00:00:00 489
18 IBM 1961-06-12 00:00:00 478
19 IBM 1961-06-13 00:00:00 487
20 IBM 1961-06-14 00:00:00 491
... ... ... ...

SQL-MapReduce Call

SELECT * FROM CMAVG (


ON ibm_stock
PARTITION BY name
ORDER BY period
TargetColumns ('stockprice')
) ORDER BY period;

Output
Table 278: CMAVG Example Output Table

id name period stockprice stockprice_mavg


1 IBM 1961-05-17 00:00:00 460 460.0
2 IBM 1961-05-18 00:00:00 457 458.5
3 IBM 1961-05-19 00:00:00 452 456.3333333333333
4 IBM 1961-05-22 00:00:00 459 457.0
5 IBM 1961-05-23 00:00:00 462 458.0
6 IBM 1961-05-24 00:00:00 459 458.1666666666667
7 IBM 1961-05-25 00:00:00 463 458.85714285714283
8 IBM 1961-05-26 00:00:00 479 461.375
9 IBM 1961-05-29 00:00:00 493 464.8888888888889
10 IBM 1961-05-31 00:00:00 490 467.4
11 IBM 1961-06-01 00:00:00 492 469.6363636363636
12 IBM 1961-06-02 00:00:00 498 472.0
13 IBM 1961-06-05 00:00:00 499 474.0769230769231
14 IBM 1961-06-06 00:00:00 497 475.7142857142857
15 IBM 1961-06-07 00:00:00 496 477.06666666666666
16 IBM 1961-06-08 00:00:00 490 477.875
17 IBM 1961-06-09 00:00:00 489 478.52941176470586
18 IBM 1961-06-12 00:00:00 478 478.5
19 IBM 1961-06-13 00:00:00 487 478.94736842105266
20 IBM 1961-06-14 00:00:00 491 479.55
... ... ... ... ...

ConfusionMatrix

Summary
The ConfusionMatrix function shows how often a classification algorithm correctly classifies items. The
function takes an input table that includes two columns—one containing the observed class of an item and
the other containing the class predicted by the algorithm—and outputs three tables:
• A confusion matrix, which shows the performance of the algorithm
• A table of overall statistics

• A table of statistics for each class

Background
In the field of artificial intelligence (AI), a confusion matrix typically shows the performance of a supervised
learning algorithm. The analogous table for an unsupervised learning algorithm is usually called a matching
matrix. Outside AI, a confusion matrix is often called a contingency table or error matrix.

Usage

ConfusionMatrix Syntax
Version 2.0

SELECT * FROM ConfusionMatrix (


ON input_table PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
ObsColumn ('observed_column')
PredictColumn ('predicted_column')
OutputTable ('output_table')
[ Classes ('class' [,...] ) ]
[ Prevalence ('prevalence' [,...] ) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
ObsColumn Required Specifies the name of the input column that contains the observed class.
PredictColumn Required Specifies the name of the input column that contains the predicted class.
OutputTable Required Specifies the string with which to start the output table names (which are
output_table_1, output_table_2, and output_table_3).
Classes Optional Specifies the classes to output in output_table_3.
Prevalence Optional Specifies the prevalences for the classes to output in output_table_3.
Therefore, if you specify Prevalence, then you must also specify Classes,
and for every class, you must specify a prevalence.
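For example, this hedged sketch supplies both arguments (the input table and column names match the example later in this section, but the output table name and prevalence values are hypothetical, with one prevalence per class):

SELECT * FROM ConfusionMatrix (
  ON iris_category_expect_predict PARTITION BY 1
  ObsColumn ('expected_value')
  PredictColumn ('predicted_value')
  OutputTable ('cm_with_prevalence')
  Classes ('setosa', 'versicolor', 'virginica')
  Prevalence ('0.2', '0.5', '0.3')
);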

Input
The ConfusionMatrix function has one input table. The following table describes the required columns. The
input table can contain additional columns, but the function ignores them.
Table 279: ConfusionMatrix Input Table Schema

Column Name Data Type Description


observed_column VARCHAR The expected (observed) class.
predicted_column VARCHAR The predicted class.

Output
The ConfusionMatrix function returns a success message and creates 3 output tables:
• output_table_1, a confusion matrix (also called a contingency table)
• output_table_2, which contains overall statistics
• output_table_3, which contains statistics for each class
Table 280: ConfusionMatrix Output Table 1 (Confusion Matrix) Schema

Column Name Data Type Description


observation/predict VARCHAR One row for each unique value of observed_column in the input
table.
predicted_class INTEGER The number of times that items in the observed_column were
classified as predicted_class.

Table 281: ConfusionMatrix Output Table 2 (Overall Statistics) Schema

Column Name Data Type Description


key VARCHAR Each row contains one of the following statistic names:
∘ Accuracy
∘ 95% CI
∘ Null Error Rate
∘ P-Value [Acc > NIR]
∘ Kappa
∘ McNemar Test P-Value

value DOUBLE PRECISION Values of the statistics.

The schema of output_table_3 depends on the number of classes.


Table 282: ConfusionMatrix Output Table 3 (Class Statistics) Schema for Two Classes

Column Name Data Type Description


key VARCHAR Each row contains one of the following statistic names:
• Sensitivity

• Specificity
• Pos Pred Value
• Neg Pred Value
• Prevalence
• Detection Rate
• Detection Prevalence
• Balanced Accuracy

value DOUBLE PRECISION Values of the statistics.

Table 283: ConfusionMatrix Output Table 3 (Class Statistics) Schema for More Than Two Classes

Column Name Data Type Description


key VARCHAR Each row contains one of the following statistic names:
• Sensitivity
• Specificity
• Pos Pred Value
• Neg Pred Value
• Prevalence
• Detection Rate
• Detection Prevalence
• Balanced Accuracy

class:expect_class DOUBLE PRECISION Values of the statistics for the class expect_class. If you specify the Classes argument, there is one column for each specified value. Otherwise, there is one column for each unique value in the observed column of the input table.

Example

Input
The input table, iris_category_expect_predict, contains 30 rows of expected and predicted values for
different species of the flower iris. The predicted values can be derived from any of the classification
functions, such as SparseSVMPredict. The raw iris dataset has four prediction attributes (sepal_length,
sepal_width, petal_length, and petal_width), and its observations are grouped into three species: setosa, versicolor, and virginica.
Table 284: ConfusionMatrix Example Input Table iris_category_expect_predict

id expected_value predicted_value
5 setosa setosa
10 setosa setosa
15 setosa setosa

20 setosa setosa
25 setosa setosa
30 setosa setosa
35 setosa setosa
40 setosa setosa
45 setosa setosa
50 setosa setosa
55 versicolor versicolor
60 versicolor versicolor
65 versicolor versicolor
70 versicolor versicolor
75 versicolor versicolor
80 versicolor versicolor
85 virginica versicolor
90 versicolor versicolor
95 versicolor versicolor
100 versicolor versicolor
105 virginica virginica
110 virginica virginica
115 virginica virginica
120 versicolor virginica
125 virginica virginica
130 versicolor virginica
135 versicolor virginica
140 virginica virginica
145 virginica virginica
150 virginica virginica

SQL-MapReduce Call

SELECT * FROM ConfusionMatrix (


ON iris_category_expect_predict PARTITION BY 1
ObsColumn('expected_value')
PredictColumn ('predicted_value')

OutputTable ('confusionmatrix_output')
);

Output
The function returns a success message and creates 3 output tables.
Table 285: ConfusionMatrix Example Output Message

message
Success !
The result has been outputted to tables: "confusionmatrix_output_1",
"confusionmatrix_output_2" and "confusionmatrix_output_3"

The query below returns the output shown in the following table:

SELECT * FROM confusionmatrix_output_1 ORDER BY 1;

The following table provides the confusion matrix (also known as a contingency table):
Table 286: ConfusionMatrix Example Output Table confusionmatrix_output_1

observation/predict setosa versicolor virginica


setosa 10 0 0
versicolor 0 9 3
virginica 0 1 7

The query below returns the output shown in the following table:

SELECT * FROM confusionmatrix_output_2 ORDER BY 1;

The following table contains statistical values:


Table 287: ConfusionMatrix Example Output Table confusionmatrix_output_2

key value
95% CI (0.6928, 0.9624)
Accuracy 0.8667
Kappa 0.8
Mcnemar Test P-Value NA
Null Error Rate 0.4
P-Value [Acc > NIR] 0
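As a hand check against confusionmatrix_output_1: the correctly classified items are the diagonal entries, 10 + 9 + 7 = 26 of 30 rows, so Accuracy = 26/30 ≈ 0.8667, matching the value in the table above.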

The query below returns the output shown in the following table:

SELECT * FROM confusionmatrix_output_3 ORDER BY 1;

The following table contains accuracy/error measures like sensitivity and specificity for each class.
Table 288: ConfusionMatrix Example Output Table confusionmatrix_output_3

measure virginica setosa versicolor


Balanced Accuracy 0.8693 1 0.8472
Detection Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.2333 0.3333 0.3
Neg Pred Value 0.95 1 0.85
Pos Pred Value 0.7 1 0.9
Prevalence 0.2667 0.3333 0.4
Sensitivity 0.875 1 0.75
Specificity 0.8636 1 0.9444

Correlation

Summary
The Correlation function, which is composed of the Corr_Reduce and Corr_Map functions, computes
global correlations between specified pairs of table columns. Measuring correlation lets you determine if the
value of one variable is useful in predicting the value of another.

Usage

Correlation Syntax
Version 1.4

SELECT * FROM Corr_Reduce (


ON Corr_Map (
ON { table_name | view_name | (query) }
PARTITION BY group_column [, group_column [,...]]
[ TargetColumns
({ 'target_column_name' | target_column_range }[,...]) ]
KeyName ('key_name')
[ GroupByColumns
({ 'group_column' | 'group_column_range' }[,...]) ]
)

PARTITION BY key_name [, group_column [,...]]
);

Arguments
Argument Category Description
TargetColumns Required Specifies pairs of columns for which to calculate correlations. For
each column pair, 'col_name1:col_name2', the function calculates
the correlation between col_name1 and col_name2. For each column
range, '[col_index1:col_index2]', the function calculates the
correlation between every pair of columns in the range. For
example, if you specify '[1:3]', the function calculates the correlation
between the pairs (1,2), (1,3), (2,3), (1,1), (2,2), and (3,3). The minimum value of
col_index1 is 0, and col_index1 must be less than col_index2.
KeyName Required Specifies the name for the Corr_Map output table column that
contains the correlations, and by which the Corr_Map output table
is partitioned.
GroupByColumns Optional Specifies the names of the input columns that define the group for
correlation calculation. By default, all input columns belong to a
single group, for which the function calculates correlation.

Input
The Corr_Map input table must have at least two columns (one column pair). The table can have additional
columns, but the function ignores them. The Corr_Reduce input table is the Corr_Map output table.
Table 289: Correlation (Corr_Map) Input Table Schema

Column Name Data Type Description


group_column Any Defines a group for correlation calculation. This column appears only
if you specify the GroupByColumns argument.
target_column   INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION   Contains values to be correlated with corresponding values in another target column.

Output
The Corr_Map output table is input to the Corr_Reduce function, whose output table is described in the
following table.

Table 290: Correlation (Corr_Reduce) Output Table Schema

Column Name   Data Type           Description
group_column  VARCHAR or INTEGER  Column by which the correlation calculations are grouped.
corr          VARCHAR             Contains column pairs, in the format 'col_name1:col_name2', where col_name1 and col_name2 are input column names.
value         DOUBLE PRECISION    Contains correlation values for column pairs.

Examples

Input
The input table, corr_input, contains sample macroeconomic data for the states of California and Texas over a period of 16 years (1947-1962). The GDP (gross domestic product) numbers are in millions of dollars ($M). GDPdeflator is GDP data normalized to the year 1954 (that is, GDPdeflator is 100 for 1954). The other columns represent the number of people (in thousands) who were employed, unemployed, or in the armed forces.
Table 291: Correlation Example Input Table corr_input

year state gdpdeflator gdp unemployed armedforces employed


1947 CA 64.52 234.289 235.6 159 60.323
1948 CA 71.45 259.426 232.5 145.6 61.122
1949 CA 71.07 258.054 368.2 161.6 60.171
1950 CA 78.38 284.599 335.1 165 61.187
1951 CA 90.6 328.975 209.9 309.9 63.221
1952 CA 95.56 346.999 193.2 359.4 63.639
1953 CA 100.63 365.385 187 354.7 64.989
1954 CA 100 363.112 357.8 335 63.761
1955 TX 109.46 397.469 290.4 304.8 66.019
1956 TX 115.44 419.18 282.2 285.7 67.857
1957 TX 121.94 442.769 293.6 279.8 68.169
1958 TX 122.43 444.546 468.1 263.7 66.513
1959 TX 132.94 482.704 381.3 255.2 68.655
1960 TX 138.41 502.601 393.1 251.4 69.564
1961 TX 142.7 518.173 480.6 257.2 69.331
1962 TX 152.82 554.894 400.7 282.7 70.551

Example 1: Using PARTITION BY Clause
In this example, the PARTITION BY clause groups the data by state. The correlations between columns are calculated separately for the California (CA) and Texas (TX) data.

SQL-MapReduce Call
The function calculates the correlation between each pair of columns specified by the TargetColumns argument. This example correlates GDPdeflator with GDP (the column range '[2:3]'), and the employed population with GDP, with the number of people unemployed, and with the number of people in the armed forces.

SELECT * FROM Corr_Reduce (


ON Corr_Map (
ON corr_input PARTITION BY state
TargetColumns('[2:3]', 'employed:gdp', 'employed:unemployed',
'employed:armedforces')
KeyName ('key')
GroupByColumns ('state')
)
PARTITION BY key,state
) ORDER BY 1, 3 DESC;

Output
Because GDP and GDPdeflator represent the same data with different scaling, and correlation is invariant under positive linear rescaling, their correlation is 1. The correlation coefficients for all column pairs are shown below.
Table 292: Correlation Example 1 Output Table

state corr value


CA gdp:gdp 1
CA gdpdeflator:gdpdeflator 1
CA gdpdeflator:gdp 1
CA employed:gdp 0.967695
CA employed:armedforces 0.952826
CA employed:unemployed -0.437618
TX gdp:gdp 1
TX gdpdeflator:gdpdeflator 1
TX gdpdeflator:gdp 1
TX employed:gdp 0.912757
TX employed:unemployed 0.32077
TX employed:armedforces -0.451985
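
As a spot check, the CA employed:gdp value can be reproduced with a plain aggregate query; this sketch assumes that your SQL dialect provides the standard corr() aggregate (available in PostgreSQL-derived databases):

SELECT corr(employed, gdp) AS employed_gdp_corr
FROM corr_input
WHERE state = 'CA';
-- Expected: approximately 0.967695, matching Table 292.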

Example 2: Without PARTITION BY Clause
In this example, the PARTITION BY clause is not used, and correlation values are determined for the overall
population. Unlike in Example 1, the data are not grouped by state.

SQL-MapReduce Call

SELECT * FROM Corr_Reduce (


ON Corr_Map (
ON corr_input
TargetColumns ('[2:3]', 'employed:gdp', 'employed:unemployed',
'employed:armedforces')
KeyName ('key')
)
PARTITION BY key
) ORDER BY 2 DESC;

Output

Table 293: Correlation Example 2 Output Table

corr value
gdp:gdp 1
gdpdeflator:gdp 1
gdpdeflator:gdpdeflator 1
employed:gdp 0.983552
employed:unemployed 0.502498
employed:armedforces 0.457307

CoxPH

Summary
The CoxPH function is named for the Cox proportional hazards model, a statistical survival model. The function estimates the model coefficients from a set of explanatory variables. The output of the CoxPH function is input to the functions CoxPredict and CoxSurvFit.

Note:
The CoxPH and CoxPredict functions do not support interaction terms (for example, using AGE*AGE as an item in the Cox proportional hazards model).

Background
The Cox proportional hazards model, proposed by David Cox in 1972, is a statistical survival model. The
purpose of the model is to simultaneously explore the effects of several explanatory variables on survival.
The definition of the Cox proportional hazard model is:
h(t) = h0(t)exp(βX)
• h(t), the hazard function, is the probability that a subject will experience an event (such as machine failure or death) within the time interval t, given that the subject has survived until the beginning of t.
• h0(t), the baseline or underlying hazard function, is the probability of reaching an event when all
covariates have the value 0.
h0(t) is unspecified, but cannot be negative.
• A linear function of a set of k fixed covariates is vectorized in X. ß is a vector of coefficients for X. The
product ßX is the exponent of e.
Because h0(t) is unspecified, the Cox model is semiparametric.
For example, if the explanatory variables are age, weight, and treatment group, then the hazard (or risk) of dying at time t is:
h(t) = h0(t)exp(βage*Xage + βweight*Xweight + βgroup*Xgroup)
The model is called proportional because the hazard for any subject is a fixed proportion of the hazard for any other subject. The ratio of the hazards for individuals i and j is:
hi(t) / hj(t) = exp(β1(Xi1 - Xj1) + … + βk(Xik - Xjk))
h0(t) cancels out of the numerator and denominator; therefore, the ratio of the hazards is constant over time.
If an event does not occur by time t, then the event is said to be right censored.
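To make this concrete with hypothetical numbers: if the fitted coefficient for age were βage = 0.05, then a subject 10 years older than another (all other covariates being equal) would have a hazard ratio of exp(0.05 * 10) = exp(0.5) ≈ 1.65; that is, a 65% higher hazard at every time t.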

Usage

CoxPH Syntax
Version 1.2

SELECT * FROM CoxPH (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
FeatureColumns ({ 'feature_column' | 'feature_column_range' }[,...])
[ CategoricalColumns
({ 'categorical_column' | 'categorical_column_range' }[,...]) ]
TimeIntervalColumn ('time_interval_column')
EventColumn ('event_column')
CoefficientTable ('coefficient_table')

LinearPredictorTable ('linear_predictor_table')
[ Threshold ('threshold') ]
[ MaxIterNum ('max_iteration_number') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input parameters.
FeatureColumns Required Specifies the names of the input table columns that contain the
features of the input parameters.
CategoricalColumns Optional Specifies the names of the input table columns that contain
categorical predictors. Each categorical_column must also be a
feature_column. By default, the function detects the categorical
columns by their SQL data types.
TimeIntervalColumn Required Specifies the name of the column in input_table that contains the
time intervals of the input parameters; that is, end_time - start_time,
in any unit of time (for example, years, months, or days).
EventColumn Required Specifies the name of the column in input_table that contains 1 if
the event occurred by end_time and 0 if it did not. (0 represents
survival or right-censorship.) The function ignores values other
than 1 and 0.
CoefficientTable Required Specifies the name of the table where the function outputs the
estimated coefficients of the input parameters.
LinearPredictorTable Required Specifies the name of the table where the function outputs the
product ßX.
Threshold Optional Specifies the convergence threshold. The default value is
0.000000001.
MaxIterNum Optional Specifies the maximum number of iterations that the function runs
before finishing, if the convergence threshold has not been met. The
default value is 10.
Accumulate Optional Specifies the names of the columns in input_table that the function
copies to linear_predictor_table.

Input
The CoxPH function has one required input table, which must include the columns described in the
following table.
Table 294: CoxPH Input Table Schema

Column Name           Data Type                                                Description
time_interval_column  SMALLINT or INTEGER                                      Contains the time interval of the input data; that is, end_time - start_time, in any unit of time (for example, years, months, or days).
event_column          SMALLINT or INTEGER                                      Contains 1 if the event occurred by end_time and 0 otherwise. (0 represents survival or right-censorship.)
feature_column        SMALLINT, INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION  Contains a feature of the input data. The table must have one such column for each column specified by the FeatureColumns argument. (The CategoricalColumns argument can also specify a feature_column as a categorical_column.)

Output
The CoxPH function outputs information to the summary table, the coefficient table, and the linear predictor table.
The following table describes the information that the function outputs to the summary table.
Table 295: CoxPH Summary Table Schema

Column Name   Contents
predictor     One row for each feature specified by the FeatureColumns argument
category      Category name, if the predictor is categorical
coefficient   Estimated coefficient of the input parameters
exp_coef      Exponent of the coefficient (represents the increase in units of hazard when the covariate increases by 1)
std_error     Standard error of the coefficient
z_score       Wald test statistic for the coefficient (the Wald test assumes that the coefficient has a normal distribution; that is, N(0, std_error))
p_value       p value for the z score (represents the significance of each individual coefficient)
significance  Significance code for the p_value (refer to the following table)

Table 296: CoxPH Significance Codes

p_value Significance Code


[0, 0.001) ***
[0.001, 0.01) **
[0.01, 0.05) *
[0.05, 0.1) .
[0.1, 1]

Following the summary table, the function displays the values of the following:
• Iteration#
• Convergence
• Likelihood ratio test
• Wald test
• Score test
• Degree of freedom
The following table describes the schema of the coefficient table.
Table 297: CoxPH Coefficient Table Schema

Column Name   Data Type         Description
id            INTEGER           Row identifier
predictor     VARCHAR           Feature column name
coefficient   DOUBLE PRECISION  Estimated coefficient of the input parameters
exp_coef      DOUBLE PRECISION  Exponent of the coefficient
std_error     DOUBLE PRECISION  Standard error of the coefficient
z_score       DOUBLE PRECISION  Wald test statistic for the coefficient
p_value       DOUBLE PRECISION  p value for the z score
significance  VARCHAR           Significance code

The following table describes the schema of the linear predictor table.
Table 298: CoxPH Linear Predictor Table Schema

Column Name        Data Type         Description
linear_predictor   DOUBLE PRECISION  Contains the linear predictor ßX computed from the k fixed covariates and the estimated coefficients.
event              INTEGER           Contains 1 if the event occurred before the time interval ended, 0 otherwise.
time_interval      INTEGER           Contains the time intervals of the input data; that is, end_time - start_time, in any unit of time (for example, years, months, or days).
accumulate_column  Any               Column copied from input_table. The linear predictor table has one such column for each column specified by the Accumulate argument.

Example

Input
The input table, lungcancer, contains data from a randomized trial of two treatment regimens for lung cancer, used to model survival analysis. The variables are defined below. There are three categorical predictors: treatment (trt), type of cancer (celltype), and whether the patient has received prior therapy (prior); and three numerical predictors: the patient's self-rating on the Karnofsky scale (karno), the time between diagnosis and the start of the study (diagtime), and the patient's age (age). The censoring status (the survival event) is specified in the column status, and the survival time is specified in the column time_int.
• trt: Treatment plan. Has two values, 'standard' or 'test'.
• celltype: Type of cancerous cell. Has four values: 'squamous', 'smallcell', 'adeno', and 'large'.
• time_int: Survival time.
• status: Censoring status. 0 means survival/right-censorship; 1 indicates that the event occurred.
• karno: Karnofsky performance score (100 = good).
• diagtime: Months from diagnosis to randomization.
• age: Age in years.
• prior: Whether the patient has undergone prior therapy ('yes' or 'no').
Table 299: CoxPH Example Input Table lungcancer

id trt celltype time_int status karno diagtime age prior


1 standard squamous 72 1 60 7 69 no
2 standard squamous 411 1 70 5 64 yes
3 standard squamous 228 1 60 3 38 no
4 standard squamous 126 1 60 9 63 yes
5 standard squamous 118 1 70 11 65 yes
6 standard squamous 10 1 20 5 49 no
7 standard squamous 82 1 40 10 69 yes
8 standard squamous 110 1 80 29 68 no
9 standard squamous 314 1 50 18 43 no

10 standard squamous 100 0 70 6 70 no
... ... ... ... ... ... ... ... ...

SQL-MapReduce Call
The three categorical variables are specified in the CategoricalColumns argument. The function produces two output tables, a coefficient table and a linear predictor table, whose names are specified in the corresponding arguments.

SELECT * FROM CoxPH (


ON (SELECT 1)
PARTITION BY 1
InputTable ('lungcancer')
FeatureColumns ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
CategoricalColumns ('trt','celltype','prior')
TimeIntervalColumn ('time_int')
EventColumn ('status')
CoefficientTable ('lungcancer_coef')
LinearPredictorTable ('lungcancer_lp')
);

Output
The coefficients are estimated with 95% confidence intervals. From the following table, the coefficients of the variable karno and of the squamous and large cell types are significant.
Table 300: CoxPH Example Output Table (Columns 1-5)

predictor category coefficient exp_coef std_error


karno -0.032815 0.967717 0.005508
diagtime 8.1e-05 1.000081 0.009136
age -0.008706 0.991331 0.0093
trt standard 0 1 0
trt test 0.294603 1.342593 0.20755
celltype adeno 0 1 0
celltype large -0.794775 0.451683 0.302878
celltype smallcell -0.334506 0.715692 0.275978
celltype squamous -1.196066 0.302381 0.300917
prior no 0 1 0
prior yes 0.071594 1.074219 0.232305
Iteration # 5
Convergence

Likelihood ratio test 62.1039
Wald test 62.3673
Score test 66.7375

Table 301: CoxPH Example Output Table (Columns 6-9)

std_error z_score p_value significance


0.005508 -5.95802 0 ***
0.009136 0.008901 0.992898
0.0093 -0.93615 0.349196
0
0.20755 1.419433 0.155773
0
0.302878 -2.624078 0.008688 **
0.275978 -1.212075 0.225483
0.300917 -3.974739 7e-05 ***
0
0.232305 0.308187 0.75794
yes
0 on 8 degree of freedom
0 on 8 degree of freedom
0 on 8 degree of freedom

The coefficients are output in the table "lungcancer_coef", which is later used for prediction. Because celltype, trt, and prior are categorical variables, one of their categories is treated as the reference for the other categories; therefore the rows 'trt' = standard, 'celltype' = adeno, and 'prior' = no carry the default coefficient value 0 (exp_coef = 1).
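For example, the exp_coef value for karno (approximately 0.9677) means that each 1-point increase in the Karnofsky score multiplies the hazard by about 0.9677, so a 10-point increase multiplies it by exp(10 * -0.032815) ≈ 0.72, roughly a 28% lower hazard.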
The query below returns the output shown in the following two tables:

SELECT * FROM lungcancer_coef ORDER BY 1;

Table 302: CoxPH Example Output Table lungcancer_coef (Columns 1-5)

id predictor category coefficient exp_coef


1 karno -0.0328153261941663 0.967717255116806
2 diagtime 8.13205087074416e-05 1.00008132381531
3 age -0.00870647494549903 0.991331316650765
4 trt standard 0 1

5 trt test 0.294602821498042 1.34259300369677
6 celltype adeno 0 1
7 celltype large -0.794774719851903 0.451682978670067
8 celltype smallcell -0.334505911425932 0.71569161405743
9 celltype squamous -1.19606637417932 0.302381330550752
10 prior no 0 1
11 prior yes 0.0715936019179389 1.07421869492581

Table 303: CoxPH Example Output Table lungcancer_coef (Columns 6-9)

std_error z_score p_value significance


0.00550775688646227 -5.95802009250341 2.55312138097707e-09 ***
0.00913606224777197 0.00890104582280767 0.992898086742234
0.00930029912031493 -0.936149991829963 0.349195966726043
0 NaN NaN
0.207549603603519 1.41943331320844 0.155772725980423
0 NaN NaN
0.30287771543449 -2.62407790124726 0.00868839104930008 **
0.275977786191144 -1.21207549362053 0.225483483597614
0.300916994493076 -3.974738536101 7.04566159376308e-05 ***
0 NaN NaN
0.232305384067305 0.308187441308705 0.757939708088376

The linear predictor table “lungcancer_lp” is shown below.

SELECT * FROM lungcancer_lp ORDER BY 1;

Table 304: CoxPH Example Output Table lungcancer_lp

linear_predictor event time_interval


-4.41189466565077 1 467
-4.41097447125404 1 110
-4.39448171575977 1 389
-4.29871049135928 1 283
-4.28779288293039 0 182
-4.26989200998715 1 143

-4.25242310919077 1 999
-4.20170368038227 0 25
-4.10210453090365 0 100
-4.04859022189228 1 112
... ... ...

CoxPredict

Summary
The CoxPredict function takes as input the coefficient table generated by the function CoxPH and outputs
the hazard ratios between predictive features and either their corresponding reference features or their unit
differences.
This function can be used with real-time applications. Refer to AMLGenerator.

Note:
The CoxPH and CoxPredict functions do not support interaction terms (for example, using AGE*AGE as
an item in the Cox proportional hazard model). The CoxPredict function supports only relative hazard
ratio calculation. It does not calculate or output confidence intervals.

Background
In survival analysis, the hazard ratio (HR) is the ratio of the hazard rates corresponding to the conditions
described by two levels of an explanatory variable. For example, in a drug study, if the treated population
might die at twice the rate as the control population, then the hazard ratio is 2, indicating a higher hazard of
death from the treatment.
The definition of the Cox proportional hazard model is:
h(t) = h0(t)exp(β1X1 + … + βnXn)
The definition of HR is:
HR = h1(t) / h2(t)
   = h0(t)exp(β1X1 + … + βnXn) / h0(t)exp(β1X'1 + … + βnX'n)
   = exp(β1(X1 - X'1) + … + βn(Xn - X'n))
where h1(t) is the hazard for covariate values X1, …, Xn and h2(t) is the hazard for covariate values X'1, …, X'n. The natural logarithm of HR is:
ln(HR) = β1(X1 - X'1) + … + βn(Xn - X'n)

For two groups that differ only in treatment condition, the ratio of the hazard functions is given by eβ, where
β is the estimated treatment effect derived from the regression model. This hazard ratio (the ratio of the
predicted hazard for a member of one group to the predicted hazard for a member of the other group) is
given by holding everything else constant (that is, assuming proportionality of the hazard functions).

For a continuous explanatory variable, the same interpretation applies to a unit difference.
Researchers usually consider probabilities lower than .05 to be significant and provide a 95% confidence
interval for the hazard ratio. Statistically significant hazard ratios cannot include unity (one) in their
confidence intervals.
Suppose that you have the following Cox proportional hazards model:
h(t) = h0(t)exp(β1XAGE + β2XGENDER + β3XAGE*GENDER + β4XWEIGHT)
You can use the preceding model to calculate hazard ratios such as:
• The hazard ratio when AGE increases 1 unit
• The hazard ratio among AGE=20, 40, 60 at the group in which GENDER is female
• The hazard ratio when WEIGHT increases 1 unit at the group in which GENDER is male and AGE = (20,
40)
• The hazard ratio between the groups (GENDER=1, AGE=20, WEIGHT=80) and (GENDER=0, AGE=60,
WEIGHT=70)
• The hazard ratio when AGE increases 1 unit and WEIGHT increases 10 units
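For example, under the preceding model, the hazard ratio when WEIGHT increases 1 unit (all other covariates held constant) is exp(β4), and the hazard ratio when AGE increases 1 unit and WEIGHT increases 10 units (with GENDER fixed at 0, so that the interaction term does not change) is exp(β1 + 10β4).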

Usage

CoxPredict Syntax
Version 1.1

SELECT * FROM CoxPredict (


ON cox_coef_model_table AS cox_coef_model DIMENSION
ON predict_feature_table AS predicts PARTITION BY { 1 | id }
[ ON ref_feature_table AS refs PARTITION BY { 1 | id} ]
Predict_Feature_Names (predict_feature [,...])
{ Predict_Feature_Columns
({ 'pf_value_column' | 'pf_value_column_range' }[,...]) |
Predict_Feature_Units_Columns
({ 'pf_unit_column' | 'pf_unit_column_range' }[,...]) }
[ Ref_Feature_Columns
({ 'rf_value_column' | 'rf_value_column_range' }[,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments

Predict_Feature_Names (Required)
Specifies the names of the features in the Cox coefficient model (the coefficient table generated by the CoxPH function).

Predict_Feature_Columns (Required if Predict_Feature_Units_Columns is omitted, disallowed otherwise)
Specifies the names of the columns that contain the values of the features in the Cox coefficient model. This argument must specify a column for each feature specified by Predict_Feature_Names. The ith predict_feature corresponds to the ith pf_value_column.

Predict_Feature_Units_Columns (Required if Predict_Feature_Columns is omitted, disallowed otherwise)
Specifies the names of the columns that contain the unit values of the features in the Cox coefficient model. This argument must specify a column for each feature specified by Predict_Feature_Names. The ith predict_feature corresponds to the ith pf_unit_column.

Ref_Feature_Columns (Optional)
Specifies the names of the columns that contain the reference values. This argument must specify a column for each feature specified by Predict_Feature_Names. The ith predict_feature corresponds to the ith rf_value_column. The default reference values are the distinct feature-value combinations.
Note: The function ignores this argument if you specify Predict_Feature_Units_Columns.

Accumulate (Optional)
Specifies the names of the columns in predict_feature_table that the function copies to the output table.

Input
The CoxPredict function has two required input tables and one optional input table:
• Required: the Cox coefficient model table, cox_coef_model_table, output by the CoxPH function. Its schema is described in the table CoxPH Coefficient Table Schema in the Output section of CoxPH.
• Required: the predict feature table, predict_feature_table, whose schema follows.
• Optional: the reference feature table, ref_feature_table, whose schema follows.
The predict feature table and reference feature table can have additional columns, but the function ignores them.
Table 305: CoxPredict Predict Feature Table Schema

Column Name        Data Type          Description
pf_value_column    Numeric or string  This column appears if you specify the Predict_Feature_Columns argument, and contains the values of the prediction variables. The table must have one such column for each predict_feature.
pf_unit_column     Numeric            This column appears if you specify the Predict_Feature_Units_Columns argument, and contains the unit values of the prediction variables. The table must have one such column for each predict_feature.
accumulate_column  Any                Optional column that can contain anything. The table can have more than one such column. If the Accumulate argument specifies this column, the function copies it to the output table; otherwise, the function ignores it.

Table 306: CoxPredict Reference Feature Table Schema

Column Name               Data Type          Description
ref_feature_value_column  Numeric or string  Contains the reference values of the prediction variables. The table must have one column for each predict_feature.

Output
The CoxPredict function has one output table, whose schema depends on whether you specify the
Predict_Feature_Columns argument or the Predict_Feature_Units_Columns argument.
Table 307: CoxPredict Predict Output Table Schema (Predict_Feature_Columns Specified)

Column Name          Data Type                                          Description
accumulate_column    Same as in predict_feature_table                   Column copied from predict_feature_table. The table can have more than one such column.
predict_feature      Same as pf_value_column in predict_feature_table   Contains the values of the predictive features. The table has one such column for each predict_feature.
predict_feature_ref  Same as rf_value_column in ref_feature_table       Contains the reference values of the predictive features. The table has one such column for each predict_feature.
hazardratio          DOUBLE PRECISION                                   Contains the hazard ratios between corresponding predictive features and reference features.

Table 308: CoxPredict Predict Output Table Schema (Predict_Feature_Units_Columns Specified)

Column Name        Data Type                                         Description
accumulate_column  Same as in predict_feature_table                  Column copied from predict_feature_table. The table can have more than one such column.
pf_unit_column     Same as pf_unit_column in predict_feature_table   Contains the unit differences of the predictive features (for example, the value of a feature might increase or decrease 5 units). The table has one such column for each predict_feature.
hazardratio        DOUBLE PRECISION                                  Contains the hazard ratios for the unit differences.

Examples
These examples use different arguments and options for the CoxPredict function.

Input
All examples use the inputs below. Input table lc_new_predictors is a list of four patients who have been
diagnosed with lung cancer. The examples use the model table lungcancer_coef.
Table 309: CoxPredict Example Input Table: lc_new_predictors

id name trt celltype karno diagtime age prior


1 John standard squamous 30 4 63 yes
2 James standard large 80 12 41 no
3 Stella test smallcell 70 3 72 no
4 Steffi test adeno 60 5 63 yes

The preceding table includes all of the attributes that were used in the input to the CoxPH function.
The following table, used in Examples 3 and 4, contains alternate sets of reference values for each attribute.
Table 310: CoxPredict Example Input Table: lc_new_reference

id trt celltype karno diagtime age prior


1 standard squamous 58 12 60 yes
2 standard smallcell 54 8 58 no
3 test smallcell 52 12 61 no
4 test adeno 60 5 60 yes
5 standard adeno 58 6 52 yes
6 standard large 70 8 55 no
7 test squamous 64 10 57 yes
8 test large 60 8 62 no

Example 1: No Reference Values Provided


Four hazard ratios are calculated for each patient: one using the patient's own characteristics as the reference, and one using the characteristics of each of the other three patients. The SQL-MapReduce call uses the model created in the Example section of the CoxPH function description (refer to its Output section).

SQL-MapReduce Call

SELECT * FROM CoxPredict (


ON lungcancer_coef AS cox_coef_model DIMENSION
ON lc_new_predictors AS predicts PARTITION BY 1
Predict_Feature_Names ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Predict_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime',
'age', 'prior')
Accumulate ('id', 'name')
) ORDER BY 1, 2, 3, 4, 5, 6, 7, 8;

Output

Table 311: CoxPredict Example 1 Output Table (Columns 1-8)

id name trt celltype karno diagtime age prior


1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes

Table 312: CoxPredict Example 1 Output Table (Columns 9-15)

trt_ref celltype_ref karno_ref diagtime_ref age_ref prior_ref hazardratio


standard squamous 30 4 63 yes 1
standard large 80 12 41 no 3.06140892935981
test smallcell 70 3 72 no 1.35863831669632
test adeno 60 5 63 yes 0.60272711495711

standard large 80 12 41 no 1
test smallcell 70 3 72 no 0.443795111351046
test adeno 60 5 63 yes 0.196878995542471
standard squamous 30 4 63 yes 0.32664698610163
test smallcell 70 3 72 no 1
standard large 80 12 41 no 2.25329205848099
test adeno 60 5 63 yes 0.443625877137564
standard squamous 30 4 63 yes 0.736031059709555
test adeno 60 5 63 yes 1
standard large 80 12 41 no 5.07926199666272
standard squamous 30 4 63 yes 1.65912562283042
test smallcell 70 3 72 no 2.25415164339007
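
As a check on the second row of John's group: relative to James, ln(HR) = -0.032815*(30-80) + 8.1e-05*(4-12) - 0.008706*(63-41) + (-1.196066 - (-0.794775)) + (0.071594 - 0) ≈ 1.119, so HR = exp(1.119) ≈ 3.061, matching the reported hazardratio.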

Example 2: Partition by Name/ID and No Reference Values


The input table can be partitioned by name, id, or both, and the hazard ratio is determined for each partition. Because no reference values are provided in this example, each single-patient partition is compared with itself (an identity comparison), so every hazard ratio is 1.

SQL-MapReduce Call

SELECT * FROM CoxPredict (


ON lungcancer_coef AS cox_coef_model DIMENSION
ON lc_new_predictors AS predicts PARTITION BY name, id
Predict_Feature_Names ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Predict_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime',
'age', 'prior')
Accumulate ('id', 'name')
) ORDER BY 1, 2, 3, 4, 5, 6, 7, 8;

Output

Table 313: CoxPredict Example 2 Output Table (Columns 1-8)

id name trt celltype karno diagtime age prior


1 John standard squamous 30 4 63 yes
2 James standard large 80 12 41 no
3 Stella test smallcell 70 3 72 no
4 Steffi test adeno 60 5 63 yes

Table 314: CoxPredict Example 2 Output Table (Columns 9-15)

trt_ref celltype_ref karno_ref diagtime_ref age_ref prior_ref hazardratio


standard squamous 30 4 63 yes 1
standard large 80 12 41 no 1
test smallcell 70 3 72 no 1
test adeno 60 5 63 yes 1

Example 3: Use Reference Values


This example uses the tables from the Input section. Each of the four new patients in the table CoxPredict Example Input Table: lc_new_predictors is compared with each row of attribute reference values in the table CoxPredict Example Input Table: lc_new_reference, and a hazard ratio is calculated.

SQL-MapReduce Call

SELECT * FROM CoxPredict (
ON lungcancer_coef AS cox_coef_model DIMENSION
ON lc_new_predictors AS predicts PARTITION BY 1
ON lc_new_reference AS refs PARTITION BY 1
Predict_Feature_Names ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Predict_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime',
'age', 'prior')
Ref_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Accumulate ('id', 'name')
) ORDER BY 1, 2, 3, 4, 5, 6, 7, 8;

Output

Table 315: CoxPredict Example 3 Output Table (Columns 1-8)

id name trt celltype karno diagtime age prior


1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
1 John standard squamous 30 4 63 yes
2 James standard large 80 12 41 no

2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
2 James standard large 80 12 41 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
3 Stella test smallcell 70 3 72 no
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes
4 Steffi test adeno 60 5 63 yes

Table 316: CoxPredict Example 3 Output Table (Columns 9-15)

trt_ref celltype_ref karno_ref diagtime_ref age_ref prior_ref hazardratio


standard squamous 58 12 60 yes 2.44014910130533
standard smallcell 54 8 58 no 0.954796844944686
test smallcell 52 12 61 no 0.683385593586743
test adeno 60 5 60 yes 0.587188048537179
standard adeno 58 6 52 yes 0.688547408060605
standard large 70 8 55 no 2.49163199075597
test squamous 64 10 57 yes 2.15629505441063

test large 60 8 62 no 1.4206679593728
standard large 70 8 55 no 0.813884080254844
standard adeno 58 6 52 yes 0.224911935631086
standard smallcell 54 8 58 no 0.311881511740527
test large 60 8 62 no 0.464056907180278
test squamous 64 10 57 yes 0.704347280669083
test adeno 60 5 60 yes 0.191803206329567
test smallcell 52 12 61 no 0.223225844490383
standard squamous 58 12 60 yes 0.797067349579989
test squamous 64 10 57 yes 1.58710013394433
test smallcell 52 12 61 no 0.502993022637894
test adeno 60 5 60 yes 0.432188641613605
standard adeno 58 6 52 yes 0.506792278415114
standard large 70 8 55 no 1.83391853456235
test large 60 8 62 no 1.04565574363257
standard squamous 58 12 60 yes 1.79602552888308
standard smallcell 54 8 58 no 0.702760133591977
test smallcell 52 12 61 no 1.13382254859294
test adeno 60 5 60 yes 0.974218736747828
standard smallcell 54 8 58 no 1.58412791004538
standard squamous 58 12 60 yes 4.04851389750232
standard adeno 58 6 52 yes 1.14238664724683
test large 60 8 62 no 2.35706661292962
test squamous 64 10 57 yes 3.5775643751552
standard large 70 8 55 no 4.13393047852722

This example uses PARTITION BY 1, so each patient is compared with each row in the reference table. There are 8 reference rows and 4 patients, for a total of 32 comparisons.

Example 4: Use Reference values and Partition by id


This example uses tables from the Input section. In this example, the new patients in the table, CoxPredict
Example Input Table: lc_new_predictors, are compared with the reference table, but this example uses
partition by id. The hazard ratio is calculated only when the patient's id matches the reference id. There are
four patients with ids 1 through 4, and they are compared with the rows from the reference table, CoxPredict
Example Input Table: lc_new_reference, with the same id.

SQL-MapReduce Call

SELECT * FROM CoxPredict (


ON lungcancer_coef AS cox_coef_model DIMENSION
ON lc_new_predictors AS predicts PARTITION BY id
ON lc_new_reference AS refs PARTITION BY id
Predict_Feature_Names ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Predict_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime',
'age', 'prior')
Ref_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Accumulate ('id', 'name')
) ORDER BY 1, 2, 3, 4, 5, 6, 7, 8;

Output
Because the input is partitioned by id, there are only 4 comparison rows. The attributes of patient Steffi are similar to those of reference id 4, so her hazard ratio is very close to 1.0 (0.97).
Table 317: CoxPredict Example 4 Output Table (Columns 1-8)

id name trt celltype karno diagtime age prior


1 John standard squamous 30 4 63 yes
2 James standard large 80 12 41 no
3 Stella test smallcell 70 3 72 no
4 Steffi test adeno 60 5 63 yes

Table 318: CoxPredict Example 4 Output Table (Columns 9-15)

trt_ref   celltype_ref  karno_ref  diagtime_ref  age_ref  prior_ref  hazardratio
standard  squamous      58         12            60       yes        2.44014910130533
standard  smallcell     54         8             58       no         0.311881511740527
test      smallcell     52         12            61       no         0.502993022637894
test      adeno         60         5             60       yes        0.974218736747828
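
Steffi's attributes differ from those of reference row 4 only in age (63 versus 60), so ln(HR) = -0.008706 * (63 - 60) ≈ -0.0261 and HR = exp(-0.0261) ≈ 0.9742, as the last row shows.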

Example 5: Use Units Values


Many applications have a standard value or unit value against which they make comparisons. Unit values
apply only to numerical variables. This example increases the variable karno by 10%, decreases the variable
age by 10%, leaves the variable diagtime unchanged, and calculates the hazard ratios.

SQL-MapReduce Call

SELECT * FROM CoxPredict (


ON lungcancer_coef AS cox_coef_model DIMENSION

ON (SELECT id, name, karno * (1.1) as karno, diagtime * (1) AS
diagtime, age * (0.9) AS age FROM lc_new_predictors) AS predicts
PARTITION BY 1
Predict_Feature_Names ('karno', 'diagtime', 'age')
Predict_Feature_Units_Columns ('karno', 'diagtime', 'age')
Accumulate ('id', 'name')
) ORDER BY 1, 2, 3, 4, 5, 6;

Output
The numerical attributes, scaled as specified in the query, serve as the unit differences for comparison.
Table 319: CoxPredict Example 5 Output Table

id name karno_units diagtime_units age_units hazardratio


1 John 33.0 4 56.7 0.206751516164483
2 James 88.0 12 36.9 0.0404357176452642
3 Stella 77.0 3 64.8 0.0454693988565714
4 Steffi 66.0 5 56.7 0.070013860059579
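
With unit columns, the hazard ratio is the exponential of the sum of coefficient * unit value. For John: exp(-0.032815*33.0 + 8.1e-05*4 - 0.008706*56.7) ≈ exp(-1.576) ≈ 0.2068, matching the first row.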

Hypothesis-Test Mode

Summary
In hypothesis-test mode, the function tests the hypothesis that the sample data comes from the specified
reference distribution. In this mode, the function simultaneously performs whichever tests you specify and
reports a p-value for each test. The null hypothesis is that the data are consistent with the specified
distribution. Therefore, a low p-value suggests that the distribution is not a very good fit for the data.

Usage
Recommended syntax depends on whether the reference distribution is continuous or discrete and on the
sample data set. For both continuous and discrete distributions, there are two syntax options. Option 1
usually works better for large data sets that might be stored across multiple nodes, and option 2 usually
works better for small data sets that are stored on a single node. However, performance ultimately depends
on the data itself.

Note:
To run the CvM test on discrete distributions, you must use option 2; otherwise the results might be
incorrect.

Hypothesis-Test Mode Syntax (Continuous Distributions)

Option 1: For Multiple-Node Data Sets


Version 1.0

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY col [,...]
ORDER BY column_name) AS rank, *
FROM input_table
WHERE column_name IS NOT NULL
) AS input PARTITION BY ANY
ON (SELECT col[,...], COUNT(*) AS group_size
FROM input_table
WHERE column_name IS NOT NULL
GROUP BY col[,...]
) AS groupstats DIMENSION
ValueColumn (column_name)
[ Tests ('test' [,...]) ]
Distributions ('distribution:parameter' [,...])
[ GroupingColumns (col[,...]) ]
[ MinGroupSize (minGroupSize) ]
[ NumCell (cell_size) ]
)
PARTITION BY col[,...]
);

Option 2: For Single-Node Data Sets


Version 1.0

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY column[,...]
ORDER BY column) AS rank, *
FROM input_table
WHERE column IS NOT NULL
) AS input PARTITION BY column[,...]
ON (SELECT col[,...], COUNT(*) AS group_size
FROM input_table
WHERE column IS NOT NULL
GROUP BY column[,...]
) AS groupstats PARTITION BY column[,...]
ValueColumn (value_column)
[ Tests ('test' [,...]) ]
Distributions ('distribution:parameters' [,...])
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ MinGroupSize (min_group_size) ]
[ NumCell (cell_size) ]
) PARTITION BY column[,...]
);

Hypothesis-Test Mode Syntax (Discrete Distributions)

Option 1: For Multiple-Node Data Sets


Version 1.0

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT COUNT(1) AS counts,
SUM(COUNT(1)) OVER (PARTITION BY column[,...]
ORDER BY column) AS rank, column[,...]
FROM input_table
WHERE column IS NOT NULL
GROUP BY column[,...]
) AS input PARTITION BY ANY
ON (SELECT column[,...], COUNT(*) AS group_size
FROM input_table
WHERE column IS NOT NULL
GROUP BY column[,...]
) AS groupstats DIMENSION
ValueColumn (value_column)
[ Tests ('test' [,...]) ]
Distributions ('distribution:parameters' [,...])
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ MinGroupSize (min_group_size) ]
[ NumCell (cell_size) ]
)
PARTITION BY column[,...]
);

Option 2: For Single-Node Data Sets and Any CvM Test


Version 1.0

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT COUNT(1) AS counts,
SUM(COUNT(1)) OVER (PARTITION BY column[,...]
ORDER BY column) AS rank, column[,...]
FROM input_table
WHERE column IS NOT NULL
GROUP BY column [,...]
) AS input PARTITION BY column[,...]
ORDER BY column
ON (SELECT column[,...], COUNT(*) AS group_size
FROM input_table
WHERE column IS NOT NULL
GROUP BY column[,...]
) AS groupstats
PARTITION BY column[,...]
ValueColumn (value_column)
[ Tests ('test' [,...]) ]
Distributions ('distribution:parameters' [,...])
[ GroupByColumns

({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ MinGroupSize (min_group_size) ]
[ NumCell (cell_size) ]
) PARTITION BY column[,...]
);

Arguments
Argument Category Description
ValueColumn Required Specifies the name of the input table column that contains
the values of the sample data set.
Tests Optional Specifies one to four tests to perform. A test can be:
• 'KS' (Kolmogorov-Smirnov test)
• 'CvM' (Cramér-von Mises criterion)
• 'AD' (Anderson-Darling test)
• 'CHISQ' (Pearson's Chi-squared test)
By default, the function runs all of the preceding tests.
Distributions Required Specifies the reference distributions and their parameters.
All distributions must be continuous or all must be discrete.
The possible distribution and parameters values for continuous
distributions are in the first of the two following tables; those
for discrete distributions are in the second.
For discrete distributions:
• BINOMIAL, GEOMETRIC, NEGATIVEBINOMIAL,
and POISSON distributions are on N={0,1,2,...}.
• UNIFORMDISCRETE distribution is on events, which
are represented by integers.

GroupByColumns Optional Specifies the names of the input table columns that contain
the group identifications over which to run the test. The
function can run multiple tests for different partitions of the
data in parallel. If you omit this argument, then specify
PARTITION BY 1 and omit the GROUP BY clause in the
second ON clause.
MinGroupSize Optional Specifies the minimum group size. The function ignores
groups smaller than the minimum size when calculating
statistics. The default value is 50.
NumCell Optional Specifies the number of cells that you want to make discrete
in a continuous distribution. The cell_size must be greater
than 3 if distribution is NORMAL; otherwise, it must be
greater than 1. The quotient min_group_size/cell_size cannot
be less than 5. The default value is 10.
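
Note that with the default values, min_group_size/cell_size = 50/10 = 5, exactly the smallest ratio that the function allows.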

Table 320: Continuous Distributions and Parameters

distribution:parameters parameter Descriptions


BETA:α,β α > 0 is the first shape parameter.
β > 0 is the second shape parameter.
CAUCHY:x,θ x, a DOUBLE PRECISION value, is the median parameter.
θ > 0 is the scale parameter.
CHISQ:k k, a positive INTEGER, is the degree of freedom.
EXPONENTIAL:θ θ > 0 is the mean parameter, which is the inverse rate.
F:d1,d2 d1 > 0 and d2 > 0 are degrees of freedom.
GAMMA:k,θ k > 0 is the shape parameter.
θ > 0 is the scale parameter.
LOGNORMAL:μ,σ μ, a DOUBLE PRECISION value, is the mean.
σ > 0 is the standard deviation.
NORMAL:μ,σ μ, a DOUBLE PRECISION value, is the mean.
σ > 0 is the standard deviation.
T:k k, a positive INTEGER, is the degree of freedom.
TRIANGULAR:a,c,b a <= c <= b && a < b, where a is the lower limit of this distribution
(inclusive), b is the upper limit of this distribution (inclusive), and c is
the mode of this distribution.
UNIFORMCONTINUOUS:a,b a < b, where a is the lower bound of this distribution (inclusive) and b
is the upper bound of this distribution (exclusive).
WEIBULL:α,β α > 0 is the shape parameter.
β > 0 is the scale parameter.
The function uses the two-parameter form of the distribution defined
by the Weibull Distribution, http://mathworld.wolfram.com/
WeibullDistribution.html, equations (1) and (2).

Table 321: Discrete Distributions and Parameters

distribution:parameters parameter Descriptions


BINOMIAL:n,p n, a positive INTEGER, is the number of trials.
p, in [0,1], is the success probability in each trial.
GEOMETRIC:p p, in [0,1], is the success probability in each trial.
NEGATIVEBINOMIAL:r,p r, a positive INTEGER, is the number of successes until the function
stops the tests.
p, in [0,1], is the success probability in each trial.
The function represents the distribution of the number of failures
before r successes occur.
POISSON:λ λ > 0 is the rate parameter.

UNIFORMDISCRETE:a,b a < b, where a is the lower bound of this distribution (inclusive) and b
is the upper bound of this distribution (exclusive). Both a and b are
INTEGER values.

Input
The input table consists of an arbitrary number of grouping columns and a single value column that
contains the dataset to be matched to the specified distribution(s). The syntax shown includes clauses that
create two tables from the input table. One table ranks the data and the other table counts the number of
points in each group.
For continuous distributions, if your input table already includes a rank column, replace the clause ON (SELECT RANK() ...) with the clause ON (SELECT * FROM input_table).
Table 322: Distribution Matching Input Table Schema

Column Name   Data Type                                      Description
column        Any                                            Column used to partition or identify the values. The table can have several such columns.
value_column  INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION  Contains the values of the sample data set.

Output
The output table contains the columns described in the following table for each group defined by the
PARTITION BY clause.
Table 323: Distribution Matching Output Table Schema

Column Name            Data Type         Description
column                 Any               Column specified by the GroupByColumns argument.
group_size             INTEGER           Number of rows in the group.
disti_KS_statistic     DOUBLE PRECISION  Appears only if the Tests argument specifies 'KS' or is omitted.
disti_KS_p-value       DOUBLE PRECISION  Appears only if the Tests argument specifies 'KS' or is omitted.
disti_CvM_statistic    DOUBLE PRECISION  Appears only if the Tests argument specifies 'CvM' or is omitted.
disti_CvM_p-value      DOUBLE PRECISION  Appears only if the Tests argument specifies 'CvM' or is omitted.
disti_AD_statistic     DOUBLE PRECISION  Appears only if the Tests argument specifies 'AD' or is omitted.
disti_AD_p-value       DOUBLE PRECISION  Appears only if the Tests argument specifies 'AD' or is omitted.
disti_CHISQ_statistic  DOUBLE PRECISION  Appears only if the Tests argument specifies 'CHISQ' or is omitted.
disti_CHISQ_p-value    DOUBLE PRECISION  Appears only if the Tests argument specifies 'CHISQ' or is omitted.

For details of all statistic and p-value columns, refer to Results (statistics and p-values).

Results (statistics and p-values)


These results refer to the following R packages:
• nortest (https://cran.r-project.org/web/packages/nortest/nortest.pdf)
• ADGofTest (https://cran.r-project.org/web/packages/ADGofTest/ADGofTest.pdf)
• dgof (https://cran.r-project.org/web/packages/dgof/dgof.pdf)
For a normal distribution, the function computes p-values from the modified statistics as implemented in
the R package nortest.
For other continuous distributions:
• KS: The function computes p-values using approximation to the Kolmogorov-Smirnov distribution
(assuming large sample size). Results are comparable to the ks.test() from the stats package in R.
• CvM: The function computes results as in the R package nortest.
• AD: The function computes results that are comparable to the ad.test() from the R package ADGofTest.
• CHISQ: Statistics are comparable to pearson.test() from the R package nortest, with a minor modification
when data falls beyond the upper limit of the distribution's support. The function computes p-values
using DOF = cell_size - 1.
For discrete distributions:
• KS: Results are comparable to the ks.test() from the R package dgof, when reference distribution is a
'stepfun' object.
• CvM: Statistics are comparable to the cvm.test() from the R package dgof, when reference distribution is
a 'stepfun' object. P-values are not calculated.
• AD: Results are comparable to the ad.test() from the R package ADGofTest.
• CHISQ: Statistics are comparable to pearson.test() from the R package nortest, with a minor modification
when data falls beyond the upper limit of the distribution's support. The function computes p-values
using DOF = cell_size - 1.

Examples
Before running the examples in this section, switch the output mode in Act to expanded output by entering
"-x" at the Act command prompt. With expanded output mode turned on, each record is split into rows,
with one row for each value, and each new record is introduced with a text label in a form like: ---
[ RECORD 37 ]---. This mode helps make wide tables readable on a small screen.

Example 1: Normality Tests without 'groupingColumns'
This example uses an input table with a single column (price). The data in this column was drawn from a
normal distribution with a mean of 50 and a standard deviation of 2. The first 10 rows are shown in Input.

Input
Here is a snapshot of the input data:
Table 324: distnmatch (Hypothesis Test Mode) Example 1 Input Table raw_normal_50_2

price
48.0701
52.6426
48.6372
50.9832
50.523
52.1773
50.3103
48.4424
50.1352
50.1382
...

SQL-MapReduce Call
The function call uses the sample mean (49.97225) and standard deviation (2.009698). See the Arguments
table for more information.

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY 1 ORDER BY price) AS rank, * FROM
raw_normal_50_2 WHERE price IS NOT NULL) AS input PARTITION BY ANY
ON (SELECT COUNT(*) AS group_size FROM raw_normal_50_2
WHERE price IS NOT NULL) AS groupstats DIMENSION
ValueColumn ('price')
Tests ('KS', 'CvM', 'AD', 'CHISQ')
Distributions ('NORMAL:49.97225,2.009698')
MinGroupSize ('50')
NumCell ('10')
) PARTITION BY 1
);

Output
The reported p-value for each of the four tests is around 0.4, which does not rule out the null hypothesis that
the data are consistent with a normal distribution with the specified mean and standard deviation.

Table 325: distnmatch (Hypothesis Test Mode) Example 1 Output Table

Column Name Value


-[ RECORD 1 ]--------------------------------------- --------------------------------------------------
group_size 400
NORMAL:49.97225,2.009698_KS_statistic 0.0319503
NORMAL:49.97225,2.009698_KS_p-value 0.41181
NORMAL:49.97225,2.009698_CvM_statistic 0.0556535
NORMAL:49.97225,2.009698_CvM_p-value 0.430792
NORMAL:49.97225,2.009698_AD_statistic 0.376151
NORMAL:49.97225,2.009698_AD_p-value 0.410292
NORMAL:49.97225,2.009698_CHISQ_statistic 7.8
NORMAL:49.97225,2.009698_CHISQ_p-value 0.35056

Example 2: Normality Tests with 'groupingColumns'


This example shows the use of grouping columns, and also illustrates the syntax for testing against multiple
distributions in a single SQL-MapReduce command.

Input
The input represents hypothetical mean-time-to-failure data for four products manufactured in two
different factories. Only a subset of rows is shown.
Table 326: distnmatch (Hypothesis Test Mode) Example 2 Input Table: factory7

factory product mttf


F1 A 10039.5
F1 A 9926.6
F1 A 9971.34
F1 A 9868.7
F1 A 9940.17
F1 A 10266.7
F1 A 9768.64
F1 A 10043.2
F1 A 10133.7
F1 A 9731.33
.. .. ..
F2 B 9836.72

F2 B 10015.7
F2 B 10069.5
F2 B 9941.35
F2 B 10114.4
F2 B 10055
F2 B 9945.04
F2 B 10086.9
F2 B 9917.59
F2 B 10071
.. .. ..
F1 C 10010.5
F1 C 9793.85
F1 C 10081.4
F1 C 9867.01
F1 C 10031.3
F1 C 9852.22
F1 C 10006.7
F1 C 9747.12
F1 C 9968.97
F1 C 9996.6
.. .. ..
F2 D 9721.21
F2 D 10068.6
F2 D 9952
F2 D 9851.94
F2 D 10378.3
F2 D 9908.9
F2 D 9749.43
F2 D 10448
F2 D 9681.25
F2 D 10147.5
... ... ...

SQL-MapReduce Call
Apply all four fit tests to the data, evaluating four possible distributions (normal, gamma, Weibull, and
uniform).

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY factory, product ORDER BY mttf)
AS rank, * FROM factory7 WHERE mttf IS NOT NULL) AS input
PARTITION BY ANY
ON (SELECT factory, product, COUNT(*) AS group_size FROM
factory7 WHERE mttf IS NOT NULL GROUP BY factory,
product) AS groupstats DIMENSION
ValueColumn ('mttf')
Tests ('KS', 'CvM', 'AD', 'CHISQ')
Distributions ('NORMAL:10000,150', 'GAMMA:1,10000',
'WEIBULL:100,10000','UNIFORMCONTINUOUS:9500,10500')
GroupingColumns ('factory', 'product')
MinGroupSize ('50')
NumCell ('10')
) PARTITION BY factory, product
);

Output
The reported p-values support these conclusions:
• For product A from factory F1, all 4 tests fail to reject the null hypothesis that the data fit a normal
distribution with the specified parameters. All 4 tests reject the null hypothesis that the data fit the
specified gamma, Weibull, or uniform distributions.
• For product C from factory F1, all 4 tests fail to reject the null hypothesis that the data fit a Weibull
distribution with the specified parameters. All 4 tests reject the null hypothesis that the data fit the
specified gamma or uniform distributions.
• For product B from factory F2, all 4 tests reject the null hypothesis for each of the specified distributions.
• For product D from factory F2, all 4 tests fail to reject the null hypothesis that the data fit a uniform
distribution with the specified parameters.
Table 327: distnmatch (Hypothesis Test Mode) Example 2 Output Table

Column Name Value


-[ RECORD 1 ]------------------------------------------------ ----------------------------------------
factory F1
product A
group_size 4000
NORMAL:10000,150_KS_statistic 0.00772035
NORMAL:10000,150_KS_p-value 0.815753
NORMAL:10000,150_CvM_statistic 0.0322619
NORMAL:10000,150_CvM_p-value 0.814989

NORMAL:10000,150_AD_statistic 0.200188
NORMAL:10000,150_AD_p-value 0.883759
NORMAL:10000,150_CHISQ_statistic 6.395
NORMAL:10000,150_CHISQ_p-value 0.494456
GAMMA:1,10000_KS_statistic 0.614549
GAMMA:1,10000_KS_p-value 5.55112e-16
GAMMA:1,10000_CvM_statistic 390.822
GAMMA:1,10000_CvM_p-value 7.37e-10
GAMMA:1,10000_AD_statistic 1781.74
GAMMA:1,10000_AD_p-value 1.5e-07
GAMMA:1,10000_CHISQ_statistic 36000
GAMMA:1,10000_CHISQ_p-value 0
WEIBULL:100,10000_KS_statistic 0.193984
WEIBULL:100,10000_KS_p-value 3.33067e-16
WEIBULL:100,10000_CvM_statistic 53.7847
WEIBULL:100,10000_CvM_p-value 7.37e-10
WEIBULL:100,10000_AD_statistic Infinity
WEIBULL:100,10000_AD_p-value 1.5e-07
WEIBULL:100,10000_CHISQ_statistic 1653.06
WEIBULL:100,10000_CHISQ_p-value 0
UNIFORMCONTINUOUS:9500,10500_KS_statistic 0.213031
UNIFORMCONTINUOUS:9500,10500_KS_p-value 0
UNIFORMCONTINUOUS:9500,10500_CvM_statistic 85.2624
UNIFORMCONTINUOUS:9500,10500_CvM_p-value 7.37e-10
UNIFORMCONTINUOUS:9500,10500_AD_statistic Infinity
UNIFORMCONTINUOUS:9500,10500_AD_p-value 1.5e-07
UNIFORMCONTINUOUS:9500,10500_CHISQ_statistic 3468.4
UNIFORMCONTINUOUS:9500,10500_CHISQ_p-value 0
-[ RECORD 2 ]------------------------------------------------ ----------------------------------------
factory F1
product C
group_size 4000

NORMAL:10000,150_KS_statistic 0.193211
NORMAL:10000,150_KS_p-value 0
NORMAL:10000,150_CvM_statistic 56.5262
NORMAL:10000,150_CvM_p-value 7.37e-10
NORMAL:10000,150_AD_statistic 308.282
NORMAL:10000,150_AD_p-value 0
NORMAL:10000,150_CHISQ_statistic 885.585
NORMAL:10000,150_CHISQ_p-value 0
GAMMA:1,10000_KS_statistic 0.609315
GAMMA:1,10000_KS_p-value 7.77156e-16
GAMMA:1,10000_CvM_statistic 391.021
GAMMA:1,10000_CvM_p-value 7.37e-10
GAMMA:1,10000_AD_statistic 1782.73
GAMMA:1,10000_AD_p-value 1.5e-07
GAMMA:1,10000_CHISQ_statistic 36000
GAMMA:1,10000_CHISQ_p-value 0
WEIBULL:100,10000_KS_statistic 0.0110351
WEIBULL:100,10000_KS_p-value 0.714696
WEIBULL:100,10000_CvM_statistic 0.0825799
WEIBULL:100,10000_CvM_p-value 0.191825
WEIBULL:100,10000_AD_statistic 0.575701
WEIBULL:100,10000_AD_p-value 0.671256
WEIBULL:100,10000_CHISQ_statistic 7.795
WEIBULL:100,10000_CHISQ_p-value 0.55493
UNIFORMCONTINUOUS:9500,10500_KS_statistic 0.348953
UNIFORMCONTINUOUS:9500,10500_KS_p-value 2.22045e-16
UNIFORMCONTINUOUS:9500,10500_CvM_statistic 137.658
UNIFORMCONTINUOUS:9500,10500_CvM_p-value 7.37e-10
UNIFORMCONTINUOUS:9500,10500_AD_statistic Infinity
UNIFORMCONTINUOUS:9500,10500_AD_p-value 1.5e-07
UNIFORMCONTINUOUS:9500,10500_CHISQ_statistic 5733.04
UNIFORMCONTINUOUS:9500,10500_CHISQ_p-value 0

-[ RECORD 3 ]------------------------------------------------ ----------------------------------------
factory F2
product B
group_size 4000
NORMAL:10000,150_KS_statistic 0.105561
NORMAL:10000,150_KS_p-value 0
NORMAL:10000,150_CvM_statistic 20.4224
NORMAL:10000,150_CvM_p-value 7.37e-10
NORMAL:10000,150_AD_statistic 140.674
NORMAL:10000,150_AD_p-value 0
NORMAL:10000,150_CHISQ_statistic 789.855
NORMAL:10000,150_CHISQ_p-value 0
GAMMA:1,10000_KS_statistic 0.620046
GAMMA:1,10000_KS_p-value 2.88658e-15
GAMMA:1,10000_CvM_statistic 394.989
GAMMA:1,10000_CvM_p-value 7.37e-10
GAMMA:1,10000_AD_statistic 1799.61
GAMMA:1,10000_AD_p-value 1.5e-07
GAMMA:1,10000_CHISQ_statistic 36000
GAMMA:1,10000_CHISQ_p-value 0
WEIBULL:100,10000_KS_statistic 0.156687
WEIBULL:100,10000_KS_p-value 1.11022e-16
WEIBULL:100,10000_CvM_statistic 62.6341
WEIBULL:100,10000_CvM_p-value 7.37e-10
WEIBULL:100,10000_AD_statistic 397.239
WEIBULL:100,10000_AD_p-value 1.5e-07
WEIBULL:100,10000_CHISQ_statistic 825.73
WEIBULL:100,10000_CHISQ_p-value 0
UNIFORMCONTINUOUS:9500,10500_KS_statistic 0.291416
UNIFORMCONTINUOUS:9500,10500_KS_p-value 0
UNIFORMCONTINUOUS:9500,10500_CvM_statistic 148.82
UNIFORMCONTINUOUS:9500,10500_CvM_p-value 7.37e-10

UNIFORMCONTINUOUS:9500,10500_AD_statistic 784.555
UNIFORMCONTINUOUS:9500,10500_AD_p-value 1.5e-07
UNIFORMCONTINUOUS:9500,10500_CHISQ_statistic 6954.15
UNIFORMCONTINUOUS:9500,10500_CHISQ_p-value 0
-[ RECORD 4 ]------------------------------------------------ ----------------------------------------
factory F2
product D
group_size 4000
NORMAL:10000,150_KS_statistic 0.208405
NORMAL:10000,150_KS_p-value 0
NORMAL:10000,150_CvM_statistic 82.98
NORMAL:10000,150_CvM_p-value 7.37e-10
NORMAL:10000,150_AD_statistic 1162.94
NORMAL:10000,150_AD_p-value Infinity
NORMAL:10000,150_CHISQ_statistic 4221.82
NORMAL:10000,150_CHISQ_p-value 0
GAMMA:1,10000_KS_statistic 0.613272
GAMMA:1,10000_KS_p-value 1.22125e-15
GAMMA:1,10000_CvM_statistic 379.171
GAMMA:1,10000_CvM_p-value 7.37e-10
GAMMA:1,10000_AD_statistic 1731.83
GAMMA:1,10000_AD_p-value 1.5e-07
GAMMA:1,10000_CHISQ_statistic 36000
GAMMA:1,10000_CHISQ_p-value 0
WEIBULL:100,10000_KS_statistic 0.345556
WEIBULL:100,10000_KS_p-value 2.22045e-16
WEIBULL:100,10000_CvM_statistic 137.136
WEIBULL:100,10000_CvM_p-value 7.37e-10
WEIBULL:100,10000_AD_statistic Infinity
WEIBULL:100,10000_AD_p-value 1.5e-07
WEIBULL:100,10000_CHISQ_statistic 6596
WEIBULL:100,10000_CHISQ_p-value 0

UNIFORMCONTINUOUS:9500,10500_KS_statistic 0.0112432
UNIFORMCONTINUOUS:9500,10500_KS_p-value 0.69272
UNIFORMCONTINUOUS:9500,10500_CvM_statistic 0.0776112
UNIFORMCONTINUOUS:9500,10500_CvM_p-value 0.222524
UNIFORMCONTINUOUS:9500,10500_AD_statistic 0.53247
UNIFORMCONTINUOUS:9500,10500_AD_p-value 0.713925
UNIFORMCONTINUOUS:9500,10500_CHISQ_statistic 4.445
UNIFORMCONTINUOUS:9500,10500_CHISQ_p-value 0.879764

CoxSurvFit

Summary
The CoxSurvFit function takes as input the coefficient and linear prediction tables generated by the function
CoxPH and outputs a table of survival probabilities.

Note:
The CoxSurvFit function supports only the Nelson-Aalen-Breslow estimator with Efron ties modification
for baseline survival function estimation. It does not calculate or output confidence intervals, variance, or
standard error estimates.

Background
The definition of the Cox proportional hazard model is:
h(t) = h0(t)exp(βX)
Given an estimated time t and all values of conditional variables (x1, x2, ..., xn), the survival function is:

S(t) = S0(t)^exp(βx)
S0(t), the baseline survival function, is composed of the survival probabilities at times ti. Three estimators
often used to estimate these survival probabilities are:
• Breslow estimator
• Nelson-Aalen-Breslow estimator
• Kalbfleisch and Prentice estimator
The first two estimators can be used with Efron ties modification.
The CoxSurvFit function uses the Nelson-Aalen-Breslow estimator with Efron ties modification for baseline
function estimation.
The Nelson-Aalen estimator of the integrated hazard is:

H(t) = Σ(ti ≤ t) di / ni

where di is the number of events at time ti and ni is the number of subjects at risk just before ti.
In 1972, Breslow suggested estimating the survival function as:

S(t) = exp(-H(t))

In 1984, Cox and Oakes described a simpler estimator that extends the Nelson-Aalen estimate of the
cumulative hazard to the case of covariates:

H0(t) = Σ(ti ≤ t) [ di / Σ(j ∈ Ri) exp(βxj) ]

where the sum in the denominator is over the risk set Ri. The cumulative hazard and survival functions are
then estimated as:

H(t | x) = H0(t)exp(βx)

and:

S(t | x) = exp(-H(t | x)) = S0(t)^exp(βx)

With Efron ties modification, the contribution of the di tied events at time ti to the baseline cumulative
hazard becomes:

Σ(k = 0, ..., di - 1) 1 / [ Σ(j ∈ Ri) exp(βxj) - (k / di) Σ(j ∈ Di) exp(βxj) ]

where Di is the set of subjects that have events at time ti.
Usage

CoxSurvFit Syntax
Version 1.1

SELECT * FROM CoxSurvFit (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
Cox_Linear_Predictor_Model_Table (cox_linear_predictor_model_table)
Cox_Coef_Model_Table (cox_coef_model_table)
Predict_Table (predict_table)
Predict_Feature_Names (feature_name [,...])
Predict_Feature_Columns
({ 'pf_value_column' | 'pf_value_column_range' }[,...])
Output_Table (output_table)
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
Cox_Linear_Predictor_Model_Table Required Specifies the name of the Cox linear predictor model
table, which was output by the CoxPH function.
Cox_Coef_Model_Table Required Specifies the name of the Cox coefficient model table,
which was output by the CoxPH function.
Predict_Table Required Specifies the name of the predict table, which contains
new prediction feature values for survival calculation.
Predict_Feature_Names Required Specifies the names of features in the Cox model.
Predict_Feature_Columns Required Specifies the names of the columns that contain the
values for the features in the Cox model—one column
name for each feature name. The ith feature name
corresponds to the ith column name. For example,
consider this pair of arguments:
Predict_Feature_Names ('name', 'age')
Predict_Feature_Columns ('c1', 'c2')
The predictive values of the feature 'name' are in
column 'c1', and the predictive values of the feature 'age'
are in column 'c2'.
Output_Table Required Specifies the name of the output table that contains
survival probabilities. The table must not exist.
Accumulate Optional Specifies the names of the columns in predict_table that
the function copies to the output table.

Input
The CoxSurvFit function has three required input tables:
• The first two tables are output by the CoxPH function and are described in its Output section:
∘ CoxPH Linear Predictor Table Schema
∘ CoxPH Coefficient Table Schema
• Predict table, whose schema is described by the following table:

Table 328: CoxSurvFit Predict Table Schema

Column Name Data Type Description


prediction_variable Any Contains the values of the prediction variables. The table
must have one column for each prediction variable.
accumulate_column Any Optional column that can contain anything. The table can
have more than one such column. If the Accumulate
argument specifies this column, then the function copies it
to the output table; otherwise, the function ignores it.

The following is an example of a predict table.


Table 329: CoxSurvFit Predict Table Example

id x1 x2 x3 x4
1 a b c d

For the row in the preceding table, the function computes this survival probability:
S(t) = S0(t)^exp(βx1*a + βx2*b + βx3*c + βx4*d)
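As a hypothetical numeric illustration (the coefficient values here are invented, not taken from any example in
this guide): if βx1 = 0.5, βx2 = -0.2, βx3 = 0.1, and βx4 = 0.3, and the row contains (a, b, c, d) = (1, 2, 0, 1),
the exponent is exp(0.5 - 0.4 + 0 + 0.3) = exp(0.4) ≈ 1.49, so the survival curve for that row is
S(t) = S0(t)^1.49.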

Output
The CoxSurvFit function outputs a message table (usually to the screen) and a table of survival probabilities
(output_table).
Table 330: CoxSurvFit Message Table Schema

Column Name Data Type Description


result VARCHAR Contains a message that indicates whether the function completed
successfully or failed. If the function succeeded, the message is:
'Survival functions are successfully generated in output table:
output_table'

Table 331: CoxSurvFit Output Table Schema

Column Name Data Type Description


accumulate_column Any Column copied from predict_table. The table can have more
than one such column.
time_interval INTEGER Contains the analysis time interval (how much time elapses
before the event happens).
survival_prob DOUBLE Contains the survival probability at the analysis time
PRECISION interval.

Example

Input
The input table, lc_new_predictors, is used with the linear predictor model table lungcancer_lp and the
coefficient model table lungcancer_coef, both generated by the CoxPH function, to determine the survival
probabilities of the new patients.

SQL-MapReduce Call

SELECT * FROM CoxSurvFit (


ON (SELECT 1) PARTITION BY 1
Cox_Linear_Predictor_Model_Table ('lungcancer_lp')
Cox_Coef_Model_Table ('lungcancer_coef')
Predict_Table ('lc_new_predictors')
Output_Table ('lungcancer_survival_out')
Predict_Feature_Names ('trt', 'celltype', 'karno', 'diagtime', 'age',
'prior')
Predict_Feature_Columns ('trt', 'celltype', 'karno', 'diagtime',
'age', 'prior')
Accumulate ('id', 'name')
);

Output
The query below returns the output shown in the following table:

SELECT * FROM lungcancer_survival_out ORDER BY 1, 3;

Table 332: CoxSurvFit Output Table

id name time_interval survival_prob


1 John 1 0.9877627218694728
1 John 2 0.9816252528322383
1 John 3 0.9754215530836501
1 John 4 0.9691109375794283
1 John 7 0.9498304739372286
1 John 8 0.9232305157057336
1 John 10 0.909659027139288
...
2 James 1 0.9959861478722662
2 James 2 0.9939604281289419
2 James 3 0.9919041628384534

2 James 4 0.9898034044749178
2 James 7 0.9833274563958035
2 James 8 0.9742460761878555
2 James 10 0.9695446843319574
...
3 Stella 1 0.9909783601721555
3 Stella 2 0.9864425596922994
3 Stella 3 0.981850198266686
3 Stella 4 0.9771707737992631
3 Stella 7 0.9628238391265019
3 Stella 8 0.9429033822472632
3 Stella 10 0.9326815744370016
...
4 Steffi 1 0.9797788143757056
4 Steffi 2 0.9696989861155152
4 Steffi 3 0.959552512629006
4 Steffi 4 0.9492747226911316
4 Steffi 7 0.9181466846968994
4 Steffi 8 0.875881155109985
4 Steffi 10 0.8546228246487249

CrossValidation

Summary
Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how
the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings
where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in
practice. In a prediction problem, a model is usually given a dataset of known data on which training is run
(training dataset), and a dataset of previously unseen data against which the model is tested (testing dataset).
The goal of cross validation is to define a dataset to “test” the model in the training phase (the validation
dataset) to provide insight into how the model will generalize to an independent dataset. Cross-validation
can be useful to identify and avoid overfitting problems.
Cross-validation works as follows: the data are randomly partitioned into k equal-sized subsamples. One
subsample is kept aside as a validation set, and the model is trained on the rest of the data. The trained
model is used on the validation set and the error rate is calculated. The process is repeated k times, with
each of the k subsamples used as the validation set in turn.
Figure 11: K-fold cross-validation
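For example, with FoldNum ('5') on the 40-row admissions_train table used in the example later in this
section, each of the five iterations trains on 32 rows and validates on the remaining 8; the reported cverror
summarizes the five validation passes (by the usual k-fold convention, as an average of the per-fold errors).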

Usage

CrossValidation Syntax
Version 1.0

SELECT * FROM CrossValidation (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
Function ('function_name')
[ Arguments used by the training function ]
......
CVParams ('arguments_vary_in_cv')
[ FoldNum ('k') ]
[ CVTable ('tablename') ]
[ Metric ('error_function_name') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
Function Required The name of the function to be cross-validated. Only GLM
('glm') is supported.
[Arguments used by the Required and Required and optional arguments used by the function to be
training function] Optional cross-validated. The argument names and descriptions are the
same as those used when the function is run normally.
CVParams Required The list of the arguments to use in cross-validation.
FoldNum Optional The value of k in k-fold cross-validation. Default is 10.
CVTable Optional The name of the output table that contains the cross-validation
errors for all models. Default is 'cvtable'.
Metric Optional Error function used to calculate the cross-validation error.
Possible values are 'AUROC' (area under the ROC curve) and
'MSE' (mean squared error). Default is 'AUROC'.

Input
The input table is the same as the input table used by the function to be cross-validated.

Output
When the function completes, it displays a message. The output cross-validation table is created with the
name specified in the argument “CVTable”. The output table contains the training variable values specified
in the SQL-MapReduce call and the cross-validation error for each model analyzed.
Table 333: Cross-Validation Output Table schema

Column Name Data Type Description


<argument used in cross-validation> String A column appears for each argument
specified in the argument CVParams.
model String Model name.
cverror Double Cross-validation error for the model.

Example
This example uses cross-validation to compare the equi-weighted logit and probit link models of the
logistic regression GLM function.

The AUROC (area under the receiver operating characteristic curve) is the preferred metric for this scheme.

Input
The input table, admissions_train, has one numerical predictor (gpa) and three categorical predictors
(masters, stats, programming) that together determine the binary outcome of whether a student is admitted
(1) or not (0).
Table 334: Cross-Validation Example Input Table admissions_train

id masters gpa stats programming admitted


1 yes 3.95 Beginner Beginner 0
2 yes 3.76 Beginner Beginner 0
3 no 3.7 Novice Beginner 1
4 yes 3.5 Beginner Novice 1
5 no 3.44 Novice Novice 0
6 yes 3.5 Beginner Advanced 1
7 yes 2.33 Novice Novice 1
8 no 3.6 Beginner Advanced 1
9 no 3.82 Advanced Advanced 1
10 no 3.71 Advanced Advanced 1
11 no 3.13 Advanced Advanced 1
12 no 3.65 Novice Novice 1
13 no 4 Advanced Novice 1
14 yes 3.45 Advanced Advanced 0
15 yes 4 Advanced Advanced 1
16 no 3.7 Advanced Advanced 1
17 no 3.83 Advanced Advanced 1
18 yes 3.81 Advanced Advanced 1
19 yes 1.98 Advanced Advanced 0
20 yes 3.9 Advanced Advanced 1
21 no 3.87 Novice Beginner 1
22 yes 3.46 Novice Beginner 0
23 yes 3.59 Advanced Novice 1
24 no 1.87 Advanced Novice 1
25 no 3.96 Advanced Advanced 1
26 yes 3.57 Advanced Advanced 1

27 yes 3.96 Advanced Advanced 0
28 no 3.93 Advanced Advanced 1
29 yes 4 Novice Beginner 0
30 yes 3.79 Advanced Novice 0
31 yes 3.5 Advanced Beginner 1
32 yes 3.46 Advanced Beginner 0
33 no 3.55 Novice Novice 1
34 yes 3.85 Advanced Beginner 0
35 no 3.68 Novice Beginner 1
36 no 3 Advanced Novice 0
37 no 3.52 Novice Novice 1
38 yes 2.65 Advanced Beginner 1
39 yes 3.75 Advanced Beginner 0
40 yes 3.95 Novice Beginner 0

SQL-MapReduce Call
Choose the same weight and number of iterations (argument MaxIterNum) to compare the logit and probit
models, so that the cverror result reflects the true strength of the models.

DROP TABLE IF EXISTS logitmodel;


DROP TABLE IF EXISTS probitmodel;
DROP TABLE IF EXISTS glmcvtable;
SELECT * FROM CROSSVALIDATION (
ON (SELECT 1) PARTITION BY 1
Function ('glm')
InputTable ('admissions_train')
OutputTable ('logitmodel', 'probitmodel')
InputColumns ('admitted', 'masters', 'gpa', 'stats', 'programming')
CategoricalColumns ('masters', 'stats', 'programming')
Link ('LOGIT', 'PROBIT')
Weight ('1', '1')
MaxIterNum ('25', '25')
FoldNum ('5')
CVParams('LINK', 'WEIGHT', 'MaxIterNum')
CVTable ('glmcvtable')
Metric ('AUROC')
);

Output
Table 335: Cross-Validation Example Output Table

message
Finished. Results can be found in "glmcvtable"

The query below returns the output shown in the following table:

SELECT * FROM glmcvtable ORDER BY 1, 2;

Table 336: Cross-Validation Example Output Table glmcvtable

link weight maxiternum model cverror


LOGIT 1 25 "logitmodel" 0.527083333333333
PROBIT 1 25 "probitmodel" 0.527083333333333

The reported cverror is the same for both models because they are equi-weighted.

Distribution Matching

Summary
Given sample data and reference distributions, the function tests the hypothesis that the sample data comes
from the distributions (Hypothesis-Test Mode). Given the test results, the function finds the distribution
that best matches the sample data (Best-Match Mode).
The Distribution Matching function is composed of the functions DistnmatchReduce and
DistnmatchMultipleInput. DistnmatchReduce supports these distributions:
• For continuous variables:
∘ Beta
∘ Cauchy
∘ ChiSq
∘ Exponential
∘ F
∘ Gamma
∘ Lognormal
∘ Normal
∘ T
∘ Triangular
∘ Uniform
∘ Weibull
• For discrete variables:
∘ Binomial
∘ Geometric

∘ Negative binomial
∘ Poisson
∘ Uniform
For evaluating the fit of the distribution to the data, the function supports these tests:
• Anderson-Darling test
• Kolmogorov-Smirnov test
• Cramér-von Mises criterion (hypothesis testing only)
• Pearson’s Chi-squared test

Best-Match Mode

Summary
In best-match mode, the function uses the result of hypothesis-test mode to find the distribution that best
matches the sample data. For each specified test, the function reports the best match, identifying the
distribution type and parameters.

Usage

Best-Match Mode Syntax (DOUBLE PRECISION Input)


Version 1.0

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY column[,...]
ORDER BY column_name) AS rank, *
FROM input_table WHERE column_name IS NOT NULL)
AS input PARTITION BY ANY
ON (SELECT column[,...],
COUNT(*) AS group_size,
AVG (column_name) AS mean,
STDDEV (column_name) AS sd,
CASE
WHEN MIN (column_name) > 0 THEN AVG (LN (
CASE
WHEN column_name > 0 THEN column_name
ELSE 1
END)
)
ELSE 0
END AS mean_of_ln,
CASE
WHEN MIN (column_name) > 0 THEN STDDEV (LN (
CASE
WHEN column_name > 0 THEN column_name
ELSE 1

END)
)
ELSE -1
END AS sd_of_ln,
Max (column_name) AS maximum,
MIN (column_name) AS minimum
FROM input_table
WHERE column_name IS NOT NULL
GROUP BY column[,...]
) AS groupstats DIMENSION
ValueColumn (column_name)
[ Tests ('test' [,...]) ]
[ Distributions ('distribution1:parameter1',...) ]
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
MinGroupSize (minGroupSize)
[ NumCell (cell_Size) ]
)
PARTITION BY column[,...]
[ Top ('top') ]
);

Best-Match Mode Syntax (INTEGER Input)


Version 1.0

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT COUNT(1) AS counts,
SUM(COUNT(1)) OVER (PARTITION BY column[,...]
ORDER BY column_name) AS rank,
column [,...], column_name
FROM input_table
WHERE column_name IS NOT NULL
GROUP BY column [,...], column_name
) AS input PARTITION BY ANY
ON (SELECT column [,...],
COUNT(*) AS group_size,
AVG (column_name) AS mean,
STDDEV (column_name) AS sd,
MAX (column_name) AS maximum,
MIN (column_name) AS minimum
FROM input_table
WHERE column_name IS NOT NULL
GROUP BY column[,...]
) AS groupstats DIMENSION
ValueColumn (column_name)
[ Tests ('test' [,...]) ]
[ Distributions ('distribution1:parameter1' [,... ]) ]
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
MinGroupSize (minGroupSize)
[ NumCell (cell_Size) ]
)
PARTITION BY column[,...]

[ Top ('top') ]
);

Arguments
Argument Category Description
ValueColumn Required Specifies the name of the input table column that contains
the values of the sample data set.
Tests Optional Specifies one to three tests to perform. A test can be:
• 'KS' (Kolmogorov-Smirnov test)
• 'AD' (Anderson-Darling test)
• 'CHISQ' (Pearson's Chi-squared test)
By default, the function runs all of the preceding tests.
Distributions Optional Specifies the reference distributions (which must be
continuous) and their parameters. The possible distribution
and parameters values for continuous distributions are in
the table, Continuous Distributions and Parameters, of the
Arguments section of the function: Hypothesis-Test Mode.
By default, the function uses these distributions:
• Beta
• Cauchy
• CHISQ
• Exponential
• F
• Gamma
• Lognormal
• Normal
• T
• Triangular
• Uniformcontinuous
• Weibull

GroupByColumns Optional Specifies the names of the input table columns that contain
the group identifications over which to run the test. The
function can run multiple tests for different partitions of the
data in parallel. If you omit this argument, then specify
PARTITION BY 1 and omit the GROUP BY clause in the
second ON clause.
MinGroupSize Optional Specifies the minimum group size. The function ignores
groups smaller than the minimum size when calculating
statistics. The default value is 50.
NumCell Optional Specifies the number of cells that you want to make discrete
in a continuous distribution. The cell_size must be greater
than 3 if distribution is NORMAL; otherwise, it must be

greater than 1. The quotient min_group_size/cell_size cannot
be less than 5. The default value is 10.
Top Optional Specifies the number of the top matching distributions for
the function to output. The default value is 1.
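(A quick check of the NumCell constraint above: the example calls in this section use MinGroupSize ('50')
with NumCell ('10'), giving a quotient of 50/10 = 5, exactly the smallest permitted value.)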

Input
The input table consists of an arbitrary number of grouping columns and a single value column that
contains the dataset to be matched to the specified distribution(s). The syntax shown includes clauses that
create two tables from the input table. One table ranks the data and the other table counts the number of
points in each group.
For continuous distributions, if your input table already includes a rank column, replace the clause
ON (SELECT RANK() ...) with the clause ON (SELECT * FROM input_table).
Table 337: Distribution Matching Input Table Schema

Column Name Data Type Description


column Any Column used to partition or identify the values. The table can have
several such columns.
value_column INTEGER, Contains the values of the sample data set.
BIGINT,
NUMERIC, or
DOUBLE
PRECISION

Output
Table 338: Distribution Matching Output Table Schema

Column Name Data Type Description


column Any Column specified by the GroupByColumns argument.
group_size INTEGER Number of rows in the group.
best_match_KS VARCHAR Type and parameters of the distribution identified as the best match
by the Kolmogorov-Smirnov test.
p-value_KS DOUBLE P-value associated with the Kolmogorov-Smirnov test for the best-
PRECISION match distribution that it found.
best_match_AD VARCHAR Type and parameters of the distribution identified as the best match
by the Anderson-Darling test.
p-value_AD DOUBLE P-value associated with the Anderson-Darling test for the best-
PRECISION match distribution that it found.
best_match_CHISQ VARCHAR Type and parameters of the distribution identified as the best match
by Pearson's Chi-squared test.

p-value_CHISQ DOUBLE P-value associated with Pearson's Chi-squared test for the best-
PRECISION match distribution that it found.

Examples
Before running the examples in this section, switch the output mode in Act to expanded output by entering
"-x" at the Act command prompt. With expanded output mode turned on, each record is split into rows,
with one row for each value, and each new record is introduced with a text label in a form like: ---
[ RECORD 37 ]---. This mode helps make wide tables readable on a small screen.

Example 1: Input Values of Type DOUBLE PRECISION

Input
The input table, factory7, is the same as in Example 2: Normality Tests with 'groupingColumns' of the
Hypothesis-Test Mode section.

SQL-MapReduce Call

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY factory, product
ORDER BY mttf) AS rank, * FROM factory7
WHERE mttf IS NOT NULL) AS input PARTITION BY ANY
ON (SELECT factory, product, COUNT(*) AS group_size,
AVG(mttf) AS mean, STDDEV(mttf) AS sd,
CASE
WHEN MIN(mttf) > 0
THEN AVG(LN(CASE WHEN mttf > 0 THEN mttf ELSE 1 END))
ELSE 0
END AS mean_of_ln,
CASE
WHEN MIN(mttf) > 0
THEN STDDEV(LN(CASE WHEN mttf > 0 THEN mttf ELSE 1 END))
ELSE -1
END AS sd_of_ln,
MAX(mttf) AS maximum,
MIN(mttf) AS minimum
FROM factory7
WHERE mttf IS NOT NULL
GROUP BY factory, product) AS groupstats DIMENSION
ValueColumn ('mttf')
Tests ('KS', 'AD', 'CHISQ')
GroupingColumns ('factory', 'product')
MinGroupSize ('50')
NumCell ('10')
) PARTITION BY factory, product
);

Output
The output is shown in expanded output mode. The function has attempted to identify the best matching
distribution for each partition of the data, based on each test specified in the SQL-MapReduce call. For each
partition, the output shows the distribution and parameters identified by each test, and the associated
p-value.
Table 339: distnmatch (Best Match Mode) Example 1 Output Table

Column Name Value


-[ RECORD 1 ]------------- ---------------------------------------------------------------------------
factory F1
product A
group_size 4000
best_match_KS_top1 GAMMA:4494.418836778631,2.2249176761705147
p-value_KS_top1 0.921247
best_match_AD_top1 GAMMA:4494.418836778631,2.2249176761705147
p-value_AD_top1 0.992539
best_match_CHISQ_top1 LOGNORMAL:9.210200309753418,0.014922354370355606
p-value_CHISQ_top1 0.873597
-[ RECORD 2 ]------------- ---------------------------------------------------------------------------
factory F1
product C
group_size 4000
best_match_KS_top1 BETA:10,1
p-value_KS_top1 8.73764e-07
best_match_AD_top1 BETA:10,1
p-value_AD_top1 1.5e-07
best_match_CHISQ_top1 BETA:10,1
p-value_CHISQ_top1 0
-[ RECORD 3 ]------------- ---------------------------------------------------------------------------
factory F2
product B
group_size 4000
best_match_KS_top1 LOGNORMAL:9.210394859313965,0.009947648271918297
p-value_KS_top1 0.717749
best_match_AD_top1 GAMMA:10199.47546584695,0.9805440671668431

p-value_AD_top1 0.908634
best_match_CHISQ_top1 LOGNORMAL:9.210394859313965,0.009947648271918297
p-value_CHISQ_top1 0.356182
-[ RECORD 4 ]------------- ---------------------------------------------------------------------------
factory F2
product D
group_size 4000
best_match_KS_top1 UNIFORMCONTINUOUS:9500.3388671875,10499.8486328125
p-value_KS_top1 0.721553
best_match_AD_top1 BETA:10,1
p-value_AD_top1 1.5e-07
best_match_CHISQ_top1 UNIFORMCONTINUOUS:9500.3388671875,10499.8486328125
p-value_CHISQ_top1 0.911413

Example 2: Input Values of Type INTEGER


This example shows how the function finds the best matching distribution for integer-type input data. The
data come from several sources and are generated by different distributions.

Input
The input is hypothetical and represents the ages of children visiting three amusement parks during a one-
week period in spring and another one-week period in summer. Only a subset of rows is shown.
Table 340: distnmatch (Best Match Mode) Example 2 Input Table age_distribution

season park_name age


Spring Funland 12
Spring Funland 1
Spring Funland 10
Spring Funland 2
Spring Funland 10
Spring Funland 3
Spring Funland 3
Spring Funland 7
Spring Funland 11
Spring Funland 8

.. .. ..
Spring Wonderland 10
Spring Wonderland 12
Spring Wonderland 9
Spring Wonderland 12
Spring Wonderland 10
Spring Wonderland 8
Spring Wonderland 8
Spring Wonderland 8
Spring Wonderland 9
Spring Wonderland 8
.. .. ..
Spring KidsWorld 7
Spring KidsWorld 6
Spring KidsWorld 5
Spring KidsWorld 6
Spring KidsWorld 4
Spring KidsWorld 1
Spring KidsWorld 8
Spring KidsWorld 6
Spring KidsWorld 3
Spring KidsWorld 4
.. .. ..
Summer Funland 10
Summer Funland 6
Summer Funland 10
Summer Funland 7
Summer Funland 4
Summer Funland 8
Summer Funland 6
Summer Funland 7
Summer Funland 4

Summer Funland 10
.. .. ..
Summer Wonderland 7
Summer Wonderland 9
Summer Wonderland 9
Summer Wonderland 2
Summer Wonderland 4
Summer Wonderland 8
Summer Wonderland 5
Summer Wonderland 4
Summer Wonderland 7
Summer Wonderland 11
.. .. ..
Summer KidsWorld 5
Summer KidsWorld 5
Summer KidsWorld 3
Summer KidsWorld 8
Summer KidsWorld 4
Summer KidsWorld 7
Summer KidsWorld 6
Summer KidsWorld 3
Summer KidsWorld 2
Summer KidsWorld 0
... ... ...

SQL-MapReduce Call

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT COUNT(1) AS counts,
SUM(COUNT(1)) OVER (PARTITION BY park_name, season
ORDER BY age) AS rank, park_name, season, age
FROM age_distribution
WHERE age IS NOT NULL
GROUP BY park_name, season, age) AS input PARTITION BY ANY
ON (SELECT park_name, season,
COUNT(*) AS group_size,

AVG(age) AS mean,
STDDEV(age) AS sd,
MAX(age) AS maximum,
MIN(age) AS minimum
FROM age_distribution
WHERE age IS NOT NULL
GROUP BY park_name, season) AS groupstats DIMENSION
ValueColumn ('age')
Tests ('KS', 'AD', 'CHISQ')
GroupingColumns ('park_name', 'season')
MinGroupSize ('50')
NumCell ('10')
) PARTITION BY park_name, season
);

Output
The function has attempted to identify the best matching distribution for each partition of the data, based on
each test specified in the SQL-MapReduce call. For each partition, the output shows the distribution and
parameters identified by each test with the associated p-value.
Table 341: distnmatch (Best Match Mode) Example 2 Output Table (Columns 1-4)

park_name season group_size best_match_KS_top1


KidsWorld Spring 400 UNIFORMDISCRETE:1,8
KidsWorld Summer 400 POISSON:4.457499980926514
Funland Spring 400 UNIFORMDISCRETE:1,12
Funland Summer 400 POISSON:7.065000057220459
Wonderland Spring 400 BINOMIAL:16,0.5091542759742608
Wonderland Summer 400 POISSON:7.005000114440918

Table 342: distnmatch (Best Match Mode) Example 2 Output Table (Columns 5-7)

p-value_KS_top1 best_match_AD_top1 p-value_AD_top1


0.963945 BINOMIAL:100,0.5 1.5e-06
0.996074 BINOMIAL:100,0.5 1.5e-06
0.999293 NEGATIVEBINOMIAL:7,0.46705571565376613 1.5e-06
0.000672657 BINOMIAL:13,0.507505492077339 1.5e-06
0.0624169 BINOMIAL:16,0.5091542759742608 1.5e-06
0.993134 POISSON:7.005000114440918 6.47109e-06

Table 343: distnmatch (Best Match Mode) Example 2 Output Table (Columns 8-9)

best_match_CHISQ_top1 p-value_CHISQ_top1
BINOMIAL:100,0.5 0

BINOMIAL:100,0.5 0
UNIFORMDISCRETE:1,12 8.9484e-13
BINOMIAL:13,0.507505492077339 0
BINOMIAL:16,0.5091542759742608 0
BINOMIAL:110,0.06332936686652757 0

EMAVG

Summary
The EMAVG (exponential moving average) function computes the average over a number of points in a
time series, exponentially decreasing the weights of older values.

Background
Exponential moving average (EMA), or exponentially weighted moving average (EWMA), applies a damping
factor, alpha, that exponentially decreases the weights of older values. This technique gives much more
weight to recent observations, while retaining older observations.
The EMAVG function computes the arithmetic average of the first n rows and then, for each subsequent
row, computes the new value with this formula:

new_emavg = alpha * new_value + (1-alpha) * old_emavg

The initial value of old_emavg is the arithmetic average of the first n rows. The values n and alpha are
specified by the function arguments StartRows and Alpha, respectively.
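As a quick illustration with invented numbers (not from any table in this guide): with StartRows ('2') and
Alpha ('0.5'), the input series 10, 20, 30 yields an initial average of (10 + 20) / 2 = 15, and the next value is
0.5 * 30 + 0.5 * 15 = 22.5.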

Usage

EMAVG Syntax
Version 1.2

SELECT * FROM EMAVG (


ON { table_name | view_name | (query) }
PARTITION BY partition_column
ORDER BY order_column
[ TargetColumns ({ 'target_column' | 'target_column_range' }[,...]) ]
[ Alpha ('alpha') ]
[ StartRows ('n') ]

[ IncludeFirst ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Arguments
Argument Category Description
TargetColumns Optional Specifies the input column names for which the exponential moving average is
to be computed. If you omit this argument, then the function copies every
input column to the output table but does not compute any exponential
moving averages.
Alpha Optional Specifies the damping factor, a value in the range [0, 1], which represents a
percentage in the range [0, 100]. For example, if alpha is 0.2, then the damping
factor is 20%. A higher alpha discounts older observations faster. The default
value is 0.1.
StartRows Optional Specifies the number of rows at the beginning of the time series that the
function skips before it begins the calculation of the exponential moving
average. The function uses the arithmetic average of these rows as the initial
value of the exponential moving average. The value n must be an integer. The
default value of n is 2.
IncludeFirst Optional Specifies whether to include the starting rows in the output table. The default
value is 'false'. If you specify 'true', the output columns for the starting rows
contain NULL, because their exponential moving average is undefined.

Input
The input table must have the columns described in the following table. The table can have additional
columns, but the function ignores them.
Table 344: EMAVG Input Table Schema

Column Name Data Type Description


partition_column Any Column by which the input table is partitioned. Each partition
must include all rows of an entity. For example, to compute the
exponential moving average of a particular stock share price, all
transactions of that stock must be in the same partition.
order_column TIME, Column by which the input table is ordered.
TIMESTAMP,
INTEGER,
SMALLINT,
or BIGINT
input_column INTEGER, Input column for which the exponential moving average is to be
SMALLINT, computed.
BIGINT, or
DOUBLE
PRECISION

Output
Table 345: EMAVG Output Table Schema

Column Name Data Type Description


partition_column Same as in Column by which the input table is partitioned.
input table
input_column Same as in Column copied from the input table. The function copies
input table every input table column to the output table.
order_column Same as in Column by which the input table is ordered.
input table
input_column_mavg DOUBLE Exponential moving average for an input_column specified
PRECISION by the TargetColumns argument.

Example
This example computes an exponential moving average for the price of IBM stock. The input data is a series
of IBM common stock closing prices from 17 May 1961 to 2 November 1962.

Input
Table 346: EMAVG Example Input Table ibm_stock

id name period stockprice


1 IBM 1961-05-17 00:00:00 460
2 IBM 1961-05-18 00:00:00 457
3 IBM 1961-05-19 00:00:00 452
4 IBM 1961-05-22 00:00:00 459
5 IBM 1961-05-23 00:00:00 462
6 IBM 1961-05-24 00:00:00 459
7 IBM 1961-05-25 00:00:00 463
8 IBM 1961-05-26 00:00:00 479
9 IBM 1961-05-29 00:00:00 493
10 IBM 1961-05-31 00:00:00 490
11 IBM 1961-06-01 00:00:00 492
12 IBM 1961-06-02 00:00:00 498
13 IBM 1961-06-05 00:00:00 499
14 IBM 1961-06-06 00:00:00 497
15 IBM 1961-06-07 00:00:00 496

16 IBM 1961-06-08 00:00:00 490
17 IBM 1961-06-09 00:00:00 489
18 IBM 1961-06-12 00:00:00 478
19 IBM 1961-06-13 00:00:00 487
20 IBM 1961-06-14 00:00:00 491
... ... ... ...

SQL-MapReduce Call

SELECT * FROM EMAVG (


ON ibm_stock
PARTITION BY name
ORDER BY period
TargetColumns ('stockprice')
StartRows ('10')
IncludeFirst ('true')
) ORDER BY period;

Output
Table 347: EMAVG Example Output Table

id name period stockprice stockprice_mavg


1 IBM 1961-05-17 00:00:00 460
2 IBM 1961-05-18 00:00:00 457
3 IBM 1961-05-19 00:00:00 452
4 IBM 1961-05-22 00:00:00 459
5 IBM 1961-05-23 00:00:00 462
6 IBM 1961-05-24 00:00:00 459
7 IBM 1961-05-25 00:00:00 463
8 IBM 1961-05-26 00:00:00 479
9 IBM 1961-05-29 00:00:00 493
10 IBM 1961-05-31 00:00:00 490 467.4
11 IBM 1961-06-01 00:00:00 492 469.85999999999996
12 IBM 1961-06-02 00:00:00 498 472.674
13 IBM 1961-06-05 00:00:00 499 475.3066
14 IBM 1961-06-06 00:00:00 497 477.47594

15 IBM 1961-06-07 00:00:00 496 479.328346
16 IBM 1961-06-08 00:00:00 490 480.39551140000003
17 IBM 1961-06-09 00:00:00 489 481.25596026000005
18 IBM 1961-06-12 00:00:00 478 480.9303642340001
19 IBM 1961-06-13 00:00:00 487 481.53732781060006
20 IBM 1961-06-14 00:00:00 491 482.4835950295401
... ... ... ... ...
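These values can be checked against the formula in Background: with StartRows ('10'), the row-10 value is
the arithmetic mean of the first ten closing prices, (460 + 457 + 452 + 459 + 462 + 459 + 463 + 479 + 493 +
490) / 10 = 467.4, and the row-11 value follows from the recurrence with the default alpha of 0.1:
0.1 * 492 + 0.9 * 467.4 = 469.86.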

FMeasure

Summary
The FMeasure function calculates the accuracy of a test (usually the output of a classifier).

Background
In statistics, the F1 score (or F-score or F-measure) is a measure of a test’s accuracy that is based on both
precision and recall, which are defined as follows:
• Precision, p, is the number of correct results divided by the number of returned results.
• Recall, r, is the number of correct results divided by the number of expected results.
The F1 score can be interpreted as a weighted average of precision and recall, whose best value is 1 and worst
value is 0.
The traditional F1 score is the harmonic mean of precision and recall:
F1 = 2 * p * r / (p + r)
The general formula for a positive real β is:
Fβ = (1 + β²) * p * r / (β² * p + r)
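For example, with illustrative values p = 0.8 and r = 0.9 (not taken from the tables below),
F1 = 2 * 0.8 * 0.9 / (0.8 + 0.9) ≈ 0.847, while β = 2, which weights recall more heavily, gives
F2 = 5 * 0.8 * 0.9 / (4 * 0.8 + 0.9) ≈ 0.878.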

Usage

FMeasure Syntax
Version 1.4

SELECT * FROM FMeasure (


ON { table | view | (query) } PARTITION BY 1
ObsColumn ('observed_column')
PredictColumn ('predicted_column')

[ Classes ('class' [,...]) ]
[ Beta (beta_value) ]
);

Arguments
Argument Category Description
ObsColumn Required Specifies the name of the input table column that contains
the observed class.
PredictColumn Required Specifies the name of the input table column that contains
the predicted class.
Classes Optional Specifies the class or classes to output in the result. The
default is all classes.
Beta Optional Specifies the value of β in the general formula in
Background. The beta_value must be a positive DOUBLE
PRECISION value. The default value is 1.0.

Input
The FMeasure function has one input table, view, or query that contains the test data. The following table
describes the input columns that function arguments must specify. The function ignores any additional
columns.
Table 348: FMeasure Input Table Schema

Column Name Data Type Description


observed_column Any Contains the observed (expected) class.
predicted_column Any Contains the predicted class.

Note:
The function is intended for general, multiclass input data. To submit a binary classification problem to
the function in the expected format, input a query that includes WHERE clauses.

Output
Table 349: FMeasure Output Table Schema

Column Name Data Type Description


class VARCHAR Contains the observed classes. If you omit the Classes argument,
the output includes a final row, labeled -AVG-, whose score
columns contain averages across all classes.
precision DOUBLE Contains the value of the precision variable, p, that the function used
PRECISION to calculate the F1 score.

recall DOUBLE Contains the value of the recall variable, r, that the function used to
PRECISION calculate the F1 score.
beta DOUBLE Contains the value of the Beta argument that the function used to
PRECISION calculate the F1 score.
fmeasure DOUBLE Contains the F1 score.
PRECISION

Examples
• Input
• Example 1: Output All Classes
• Example 2: Output Specified Classes

Input
The input table has five attributes of personal computers—price, speed, hard disk size, RAM, and screen size.
The table has 500 rows, categorized into five price groups—SPECIAL, SUPER, HYPER, MEGA and UBER.
The predicted_compcategory values can be generated by a classification function, such as KNN.
Table 350: FMeasure Examples Input Table computers_category

compid price speed hd ram screen expected_compcategory predicted_compcategory


1 1499 25 80 4 14 SPECIAL SPECIAL
2 1795 33 85 2 14 SUPER SUPER
3 1595 25 170 4 15 SPECIAL SPECIAL
4 1849 25 170 8 14 SUPER HYPER
5 3295 33 340 16 14 HYPER SUPER
6 3695 66 340 16 14 UBER SPECIAL
7 1720 25 170 4 14 SPECIAL SPECIAL
8 1995 50 85 2 14 SUPER SUPER
9 2225 50 210 8 14 SUPER SUPER
12 2605 66 210 8 14 MEGA UBER
13 2045 50 130 4 14 SUPER SUPER
14 2295 25 245 8 14 MEGA MEGA
16 2225 50 130 4 14 SUPER SUPER
... ... ... ... ... ... ... ...

Example 1: Output All Classes

SQL-MapReduce Call

SELECT * FROM FMeasure (


ON computers_category
PARTITION BY 1
ObsColumn ('expected_compcategory')
PredictColumn ('predicted_compcategory')
Beta (1.0)
);

Output

Table 351: FMeasure Example 1 Output Table

class precision recall beta fmeasure


HYPER 0.936842105263158 0.89 1 0.912820512820513
MEGA 0.923076923076923 0.935064935064935 1 0.929032258064516
SPECIAL 0.84375 0.885245901639344 1 0.864
SUPER 0.935897435897436 0.954248366013072 1 0.944983818770227
UBER 0.896551724137931 0.8125 1 0.852459016393443
-AVG- 0.918 0.918 1 0.918
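These scores are consistent with the F1 formula in Background; for the SPECIAL class,
2 * 0.84375 * 0.885245901639344 / (0.84375 + 0.885245901639344) ≈ 0.864, which matches the fmeasure
column.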

Example 2: Output Specified Classes

SQL-MapReduce Call

SELECT * FROM FMeasure (


ON computers_category
PARTITION BY 1
ObsColumn ('expected_compcategory')
PredictColumn ('predicted_compcategory')
Classes ('SPECIAL', 'HYPER')
Beta (1.0)
);

Output

Table 352: FMeasure Example 2 Output Table

class precision recall beta fmeasure


SPECIAL 0.84375 0.885245901639344 1 0.864
HYPER 0.936842105263158 0.89 1 0.912820512820513


GLM

Summary
The generalized linear model (GLM) is an extension of the linear regression model that enables the linear
equation to be related to the dependent variables by a link function. GLM performs linear regression analysis
for any of a number of distribution functions using a user-specified distribution family and link function.
GLM selects the link function based upon the distribution family and the assumed nonlinear distribution of
expected outcomes. The table in Background describes the supported link function combinations.
A GLM has three parts:
1. A random component—the probability distribution of Y from the exponential family
2. A fixed linear component—the linear expression of the predictor values (X1, X2, ..., Xp), expressed as
η = β0 + β1X1 + β2X2 + ... + βpXp
3. A link function that describes the relationship of the distribution function to the expected value of Y
(described in the table in Background)
GLM also supports categorical variables. For example, in the following table, size and color are independent
(predictive) variables and outcome is the dependent (response) variable. Size is a quantitative variable and
color is a qualitative variable (with the values yellow, blue, and red). In regression analysis, a qualitative
variable is called a categorical (or dummy) variable.
Table 353: Categorical Variables

size color outcome


10 yellow 1
5 blue 0
6 red 1

Note:
The Aster Analytics GLM function implementation uses the Fisher Scoring Algorithm, which is highly
scalable compared to the least-squares algorithm used in the glm() function in the R package stats. The
results of the two algorithms usually match closely. However, when the input data is highly skewed or has
a large variance, the Fisher Scoring Algorithm might diverge, and you might need to use knowledge of
the dataset and trial and error to select the optimal family and link functions.

Background
Table 354: Supported Family/Link Function Combinations

Each entry lists the family name, the family function name, the supported links with their link expressions
(default link first), and when to use the family.

Family: Binomial or Logistic (family function name BINOMIAL or LOGISTIC)
Links and link expressions:
logit (default): log(μ/(1-μ))
probit: Φ^-1(μ)
cloglog: log[-log(1-μ)]
log: log(μ)
cauchit: tan(π(μ - 1/2))
Used: When the dependent variable (Y) has only two possible values (0 and 1, 'yes' and 'no', or 'true' and
'false'). The algorithm applies the model to the data, predicts the most likely outcome for each input, and
supplies a logit (logarithm of odds) for each outcome.

Family: Gamma (family function name GAMMA)
Links and link expressions:
inverse (default): 1/μ
identity: μ
log: log(μ)
Used: When data is continuous with constant response variance and appears to be right-skewed.

Family: Gaussian (family function name GAUSSIAN)
Links and link expressions:
identity (default): μ
inverse: 1/μ
log: log(μ)
Used: When the data is grouped around a single mean and can be graphed in a normal or bell curve
distribution.

Family: Inverse Gaussian (family function name INVERSE_GAUSSIAN)
Links and link expressions:
inverse_mu_squared (default): 1/μ²
identity: μ
inverse: 1/μ
log: log(μ)
Used: When the data is grouped around a single mean but the graph appears to have a right-skewed curve
distribution.

Family: Poisson (family function name POISSON)
Links and link expressions:
log (default): log(μ)
identity: μ
square_root: √μ
Used: To model count data (nonnegative integers) and contingency models (matrices of the frequency
distribution of variables). The algorithm assumes that the dependent variable (Y) has a Poisson distribution
(that is, that Y is segmented into intervals of, for example, time or geographic location) and then calculates
the discrete probability of one or more events occurring within these segments.

Family: Negative Binomial (family function name NEGATIVE_BINOMIAL)
Links and link expressions:
log (default): log(μ)
identity: μ
Used: To model count data (nonnegative integers), usually over-dispersed response variables.

The following table shows the common link functions for the common distribution exponential families. D
denotes the default link for each family.
Table 355: Common Link Functions for Distribution Exponential Families

Link      Link Descriptive Name   Binomial   Gamma   Gaussian   Inverse_Gaussian   Poisson   Negative_Binomial
                                  (Logistic)
logit     LOGIT                   D
probit    PROBIT                  *
cloglog   COMPLEMENTARY_LOG_LOG   *
identity  IDENTITY                           *       D          *                  *         *
inverse   INVERSE                            D       *          *
log       LOG                     *          *       *          *                  D         D
1/μ²      INVERSE_MU_SQUARED                                    D
sqrt      SQUARE_ROOT                                                              *
cauchit   CAUCHIT                 *

For more information about generalized linear models, see:


• Dobson, A.J.; Barnett, A.G. (2008). Introduction to Generalized Linear Models (3rd ed.). Boca Raton, FL:
Chapman and Hall/CRC. ISBN 1-58488-165-8.
• Hardin, James; Hilbe, Joseph (2007). Generalized Linear Models and Extensions (2nd ed.). College
Station: Stata Press. ISBN 1-59718-014-9.

Usage

GLM Syntax
Version 1.7

SELECT * FROM GLM (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
[ InputColumns ({ 'input_column' | 'input_column_range' }[,...]) ]
[ CategoricalColumns ('columnname_value_pair'[,...]) ]
[ Family ('family') ]
[ Link ('link') ]
[ Weight ('weight_column') ]
[ Threshold ('threshold') ]
[ MaxIterNum ('max_iterations') ]
[ Intercept ( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]

[ Step ( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the columns
described in the table in Input.
OutputTable Required Specifies the name for the output table of coefficients. This table
must not exist. For GLM, the output is written to the screen, and
the output table is the table where the coefficients are stored.
InputColumns Optional Specifies the name of the column that contains the dependent
variables (Y) followed by the names of the columns that contain
the predictor variables (Xi), in this format: 'Y,X1,X2,...,Xp'.
By default, the first column of the input table is Y and the
remaining input table columns are Xi, except for the column
specified by the Weight argument.
CategoricalColumns Optional Specifies columnname-value pairs, each of which contains the
name of a categorical input column and the category values in that
column that the function is to include in the model that it
generates.
Each columnname-value pair has one these forms:
• 'columnname:max_cardinality'
Limits the categories in the column to the max_cardinality
most common ones and groups the others together as 'others'.
For example, 'column_a:3' specifies that for column_a, the
function uses the 3 most common categories and sets the
category of the rows that do not belong to those 3 categories to
'others'.
• 'columnname:(category [, ...])'
Limits the categories in the column to those that you specify
and groups the others together as 'others'. For example,
'column_a : (red, yellow, blue)' specifies that for column_a, the
function uses the categories red, yellow, and blue, and sets the
category of the rows that do not belong to those categories to
'others'.
• 'columnname'
All category values appear in the model.

If you specify the InputColumns argument, then the columns that
you specify in the CategoricalColumns argument must also appear
in the InputColumns argument.
Family Optional Specifies the distribution exponential family. Supported values
are:
• 'BINOMIAL' (default)
• 'LOGISTIC' (equivalent to 'BINOMIAL')
• 'POISSON'
• 'GAUSSIAN'
• 'GAMMA'
• 'INVERSE_GAUSSIAN'
• 'NEGATIVE_BINOMIAL'

Link Optional Specifies the link function. The default value is 'CANONICAL'.
The canonical link functions (default link functions) and the link
functions that are allowed for each exponential family are listed in
the table in Background.
Weight Optional Specifies the name of an input table column that contains the
weights to assign to responses. The default value is 1.
You can use non-NULL weights to indicate that different
observations have different dispersions (with the weights being
inversely proportional to the dispersions). Equivalently, when the
weights are positive integers wi, each response yi is the mean of wi
unit-weight observations. A binomial GLM uses prior weights to
give the number of trials when the response is the proportion of
successes. A Poisson GLM rarely uses weights.
If the weight is less than the response value, then the function
throws an exception. Therefore, if the response value is greater
than 1 (the default weight), then you must specify a weight that is
greater than or equal to the response value.
Threshold Optional Specifies the convergence threshold. The default value is 0.01.
MaxIterNum Optional Specifies the maximum number of iterations that the algorithm
runs before quitting if the convergence threshold has not been
met. The parameter max_iterations must be a positive INTEGER
value. The default value is 25.
Intercept Optional Specifies whether the function uses an intercept. For example, in
β0 + β1*X1 + β2*X2 + ... + βp*Xp, the intercept is β0. The default
value is 'true'.
Step Optional Specifies whether the function uses a step. The default value is
'false'. If the function uses a step, then it runs with the GLM model
that has the lowest Akaike information criterion (AIC) score,
drops one predictor from the current predictor group, and repeats
this process until no predictor remains.

Input
The following describes the GLM input table columns that you can specify in function arguments. The input
table can contain additional columns, but the function ignores them.
Table 356: GLM Input Table Schema

Column Data Type Description


dependent_variable_column Any Required column that contains the dependent/response
variables. Cannot contain NULL values.
If you specify the InputColumns argument, it must
specify this column first.
predictor_variable_columns Any Columns that contain independent/predictor variables.
One such column is required; others are optional. These
columns cannot contain NULL values.
If you specify the InputColumns argument, it must
specify these columns after dependent_variable_column.
categorical_columns CHARACTER, VARCHAR, INTEGER, BOOLEAN, DATE, TIME (without TIME ZONE), IP4 Optional columns that contain categorical variables.
Both the InputColumns and CategoricalColumns
arguments must specify these columns.
weight_column INTEGER, DOUBLE PRECISION Optional column that contains weights.
The Weight argument must specify this column.

Onscreen Output
The onscreen output of the GLM function is a regression analysis of the data, using the family and link
functions specified.

Columns
When a particular column is not used for its corresponding row, the column contains a value of zero (0).
Table 357: GLM Onscreen Output Columns

Column Description
predictor Contains the column name for each predictor that was input to the function and the
labels of the rows whose values appear in the second table in Rows.
estimate Contains the mean of the supplied values for each predictor and each value in the second
table in Rows.
std_error Contains the standard deviation of the mean (standard error) for each predictor.

z_score Contains the likelihood that the null hypothesis is true, given this sample. The likelihood
is the difference between the observed sample mean and the hypothesized mean, divided
by the standard error.
p_value Contains the significance level for each predictor.
significance Contains the likelihood that the predictor is significant (refer to CoxPH Output).

Rows
The onscreen output includes a row for each parameter in the following table, with values for estimate,
standard error, z-score, p-value, and significance:
Table 358: GLM Onscreen Output Row Parameters

Parameter Description
Intercept The value of the logit (Y) when all predictors are 0.
Predictors A row for each predictor value (X1,X2,...,Xp).

The following values are also output in the second column (estimate).
Table 359: GLM Onscreen Output Values in the Estimate Column

Value Description (appears in significance column)


ITERATIONS# The number of Fisher Scoring iterations that the function performed.

Note:
With Step('true'), the function reports this number for each step.

ROWS# The number of rows of data received as input.


Residual deviance The deviance, with degrees of freedom noted in the significance column.
Note:
Residual deviance is not displayed when the Family is GAMMA,
NEGATIVE_BINOMIAL, or INVERSE_GAUSSIAN.

Pearson goodness of fit The sum of squared Pearson residuals.
AIC Akaike information criterion, a measure of the relative quality of the model for the given
set of data.
BIC Bayesian information criterion, partly based on the likelihood function and closely
related to the AIC. BIC is a criterion for model selection among a finite set of models; the
model with the lowest BIC is preferred.
Wald Test Tests the goodness of fit.
Dispersion parameter For GAUSSIAN, the value of this parameter is estimated from the data. For all other
families, this parameter has the value 1.

The coefficients are also stored in the table output_table_name for later use.

Note:
For the Gamma distribution density, AIC and BIC might have the value NaN when the dispersion
parameter is very small (for example, 0.00170243) and goodness-of-fit is poor (for example, 0.011).

Output Table
The output table specified by the OutputTable argument stores the estimated coefficients and statistics,
which are used by the functions GLMPredict and LRTEST.
When a particular column is not used for its corresponding row, the column contains a value of zero (0).
This is a description of the columns that appear in the output table:
Table 360: GLM Output Table Columns

Column Description
attribute The index of each predictor, starting from 0.
predictor The column name for each predictor that was supplied as input to the function.
category The category names of each predictor. Numeric predictors have NULL values in this
column.
estimate The mean of the supplied values for each predictor.
std_error Standard deviation of the mean for each predictor (standard error).
z_score or t_score If the Family argument specifies the BINOMIAL, LOGISTIC, POISSON, GAMMA,
INVERSE_GAUSSIAN, or NEGATIVE_BINOMIAL family, then the name of the column
is z_score.
The z-score is a measure of the likelihood that the NULL hypothesis is true, given this
sample. It is derived by taking the difference between the observed sample mean and the
hypothesized mean, divided by the standard error. The z-score statistic follows the
N(0,1) distribution.
If the Family argument specifies the GAUSSIAN family, then the name of the column is
t_score. The t_score statistic follows a t(N-p-1) distribution.
p_value The significance level (p-value) for each predictor.
significance The likelihood that the predictor is significant (refer to Output in the function: CoxPH).

The output includes a row for each of the following with a value for estimated value, standard error, z-score,
p-value, and significance:
Table 361: GLM Output Table Parameters

Parameter Description
Loglik The log likelihood of the model.
Intercept The value of the logit (Y) when all predictors are 0.
Predictors A row for each predictor value (X1,X2,...,Xp). Each numeric input column corresponds
to one predictor.

Odds Ratio and Confidence Intervals
You can exponentiate the coefficients (the estimate column in the output table) and
interpret them as odds ratios (ORs). To perform this type of computation, you can run the following SQL
queries on the output table of the GLM function.

-- odds ratios only
SELECT predictor, category,
       EXP(estimate) AS odds_ratio
FROM glm_output;
-- odds ratios and 95% CI
SELECT predictor, category,
EXP(estimate) AS odds_ratio,
EXP(estimate - 1.96 * std_err) AS lower_bound,
EXP(estimate + 1.96 * std_err) AS upper_bound
FROM glm_output;
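As a worked illustration with the coefficients from Example 1 later in this section: the masters.no row has
estimate 2.21655 and standard error 1.01999, so its odds ratio is exp(2.21655) ≈ 9.18, with an approximate
95% confidence interval of exp(2.21655 ± 1.96 × 1.01999), that is, roughly (1.24, 67.8).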

Goodness-of-Fit Tests
• Deviance
• Wald’s Test
• Rao's Score Test
• Pearson’s Chi-squared Statistic

Deviance
The deviance for a model M0, based on a dataset y, is defined as follows:

D(y) = -2 (log p(y | θ̂0) - log p(y | θ̂s))

In the preceding equation:
• θ̂0 denotes the fitted values of the parameters in the model M0.
• θ̂s denotes the fitted parameters for the “full model” (or “saturated model”).
Both sets of fitted values are implicitly functions of the observations y. In this case, the full model is a model
with a parameter for every observation so that the data are fitted exactly. This expression is -2 times the log-
likelihood ratio of the reduced model compared to the full model.
The deviance is used to compare two models—in particular in the case of generalized linear models where it
has a similar role to residual variance from ANOVA in linear models (RSS).
Suppose in the framework of the GLM that there are two nested models, M1 and M2. In particular, suppose
that M1 contains the parameters in M2, and k additional parameters. Then, under the null hypothesis that
M2 is the true model, the difference between the deviances for the two models follows an approximate chi-
squared distribution with k degrees of freedom. This provides an alternative way of computing the log-
likelihood ratio of two models.
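For a concrete illustration using the deviances that the function reports in Example 2 later in this section (a
worked restatement of the preceding paragraph, not additional function output): the model with 34 degrees
of freedom has residual deviance 44.7694, and the model with 33 degrees of freedom has residual deviance
38.9038, so the difference 44.7694 - 38.9038 = 5.8656 would be compared against a chi-squared distribution
with k = 1 degree of freedom.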

Deviance is implemented in the GLM function. It computes residual deviance from model deviance and
saturated deviance. The function does not compute null deviance.

Wald’s Test
Significance tests can be performed for individual regression coefficients (that is, H0: βj = 0) by computing
the Wald statistics, which are similar to the partial t-statistics from classical regression:

wj = β̂j / SE(β̂j)

where SE(β̂j) is the standard error of the estimated coefficient. Under the null hypothesis that βj = 0, the
Wald test statistic wj follows approximately a standard normal distribution (and its square is approximately a
chi-squared distribution on one degree of freedom).
The GLM function computes this quantity as the Wald Test, along with the corresponding p_value; both
appear in the output table and in the onscreen output.
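For example, applying this formula to the masters.no row of the Example 1 output later in this section
(estimate 2.21655, standard error 1.01999) gives w = 2.21655 / 1.01999 ≈ 2.17311, which is exactly the
z_score that the function reports for that predictor.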

Rao's Score Test


Rao’s score test, or the score test (often known as the Lagrange multiplier test in econometrics) is a statistical
test of a simple null hypothesis that a parameter of interest θ is equal to some particular value θ0. It is the
most powerful test when the true value of θ is close to θ0. The main advantage of the score test is that it does
not require an estimate of the information under the alternative hypothesis or unconstrained maximum
likelihood. This makes testing feasible when the unconstrained maximum likelihood estimate is a boundary
point in the parameter space.
Let L be the likelihood function, which depends on the parameter θ, and let x be the data. The score U(θ) is:

U(θ) = ∂ log L(θ | x) / ∂θ

The observed information I(θ) is:

I(θ) = -∂² log L(θ | x) / ∂θ²

Suppose that θ̂0 is the maximum likelihood estimate of θ under the null hypothesis H0: θ = θ0. Then

S(θ̂0) = U(θ̂0)² / I(θ̂0) ~ χ²(k)

asymptotically under H0, where k is the number of constraints imposed by the null hypothesis.

Pearson’s Chi-squared Statistic


The deviance generalizes the sum of squared errors. Another generalization of sum of squared errors is
Pearson’s chi-squared statistic. Given a generalized linear model with responses yi, weights wi, fitted means
μi, variance function v(μ), and dispersion φ = 1, the Pearson goodness-of-fit statistic is

X² = Σi wi (yi - μ̂i)² / v(μ̂i)

If the fitted model is correct and the observations yi are approximately normal, then X² is approximately
distributed as χ² on the residual degrees of freedom for the model. Both the deviance and the generalized
Pearson X² have exact χ² distributions for Normal-theory linear models (assuming, of course, that the model
is true), and asymptotic results are available for the other distributions. The deviance has a general advantage
as a measure of discrepancy in that it is additive for nested sets of models if maximum-likelihood estimates
are used, whereas X² in general is not. However, X² may sometimes be preferred because of its more direct
interpretation.
The GLM function computes the Pearson goodness-of-fit statistic.
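For instance, Example 1 later in this section reports a Pearson goodness-of-fit statistic of 37.7905 on 33
degrees of freedom; assessing fit amounts to comparing that value against a chi-squared distribution with 33
degrees of freedom.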

Examples
• Example 1: Logistic Regression Analysis with Intercept
• Example 2: Logistic Regression Analysis with Step Argument
• Example 3: Gaussian Distribution Analysis with Default Options

Example 1: Logistic Regression Analysis with Intercept


In logistic regression, the dependent variable (Y) has only two possible values (0 and 1, 'yes' and 'no', or 'true'
and 'false'). The algorithm applies the model to the data and predicts the most likely outcome.

Input
The input table, admissions_train, contains data about applicants to an academic program. For each
applicant, attributes in the table include a Masters Degree indicator, a grade point average (on a 4.0 scale), a
statistical skills indicator, a programming skills indicator, and an indicator of whether the applicant was
admitted. The Masters Degree, statistical skills, and programming skills indicators are categorical variables.
Masters degree has two categories (yes or no), while the other two have three categories (Novice, Beginner
and Advanced). For admitted status, "1" indicates that the student was admitted and "0" indicates otherwise.
Table 362: GLM Example 1 Input Table admissions_train

id masters gpa stats programming admitted


1 yes 3.95 Beginner Beginner 0
2 yes 3.76 Beginner Beginner 0
3 no 3.7 Novice Beginner 1
4 yes 3.5 Beginner Novice 1
5 no 3.44 Novice Novice 0
6 yes 3.5 Beginner Advanced 1
7 yes 2.33 Novice Novice 1
8 no 3.6 Beginner Advanced 1
9 no 3.82 Advanced Advanced 1

10 no 3.71 Advanced Advanced 1
11 no 3.13 Advanced Advanced 1
12 no 3.65 Novice Novice 1
13 no 4 Advanced Novice 1
14 yes 3.45 Advanced Advanced 0
15 yes 4 Advanced Advanced 1
16 no 3.7 Advanced Advanced 1
17 no 3.83 Advanced Advanced 1
18 yes 3.81 Advanced Advanced 1
19 yes 1.98 Advanced Advanced 0
20 yes 3.9 Advanced Advanced 1
21 no 3.87 Novice Beginner 1
22 yes 3.46 Novice Beginner 0
23 yes 3.59 Advanced Novice 1
24 no 1.87 Advanced Novice 1
25 no 3.96 Advanced Advanced 1
26 yes 3.57 Advanced Advanced 1
27 yes 3.96 Advanced Advanced 0
28 no 3.93 Advanced Advanced 1
29 yes 4 Novice Beginner 0
30 yes 3.79 Advanced Novice 0
31 yes 3.5 Advanced Beginner 1
32 yes 3.46 Advanced Beginner 0
33 no 3.55 Novice Novice 1
34 yes 3.85 Advanced Beginner 0
35 no 3.68 Novice Beginner 1
36 no 3 Advanced Novice 0
37 no 3.52 Novice Novice 1
38 yes 2.65 Advanced Beginner 1
39 yes 3.75 Advanced Beginner 0
40 yes 3.95 Novice Beginner 0

SQL-MapReduce Call
This example uses the default options: the intercept is included and the Step argument is 'false'. The response
variable ('admitted', in this example) must be the first column listed in the InputColumns argument,
followed by the predictor columns.

DROP TABLE IF EXISTS glm_admissions_model;
SELECT * FROM GLM (
ON (SELECT 1)
PARTITION BY 1
InputTable ('admissions_train')
OutputTable ('glm_admissions_model')
InputColumns ('admitted','masters', 'gpa', 'stats', 'programming')
CategoricalColumns ('masters', 'stats', 'programming')
Family ('LOGISTIC')
Link ('LOGIT')
Weight ('1')
Threshold ('0.01')
MaxIterNum ('25')
Step ('false')
Intercept ('true')
);

Output
The output table shows the model statistics.
Table 363: GLM Example 1 Model Statistics

predictor estimate std_error z_score p_value significance


(Intercept) 1.07751 2.92076 0.368914 0.712192
masters.no 2.21655 1.01999 2.17311 0.0297719 *
gpa -0.113935 0.802573 -0.141962 0.88711
stats.Novice 0.0406848 1.11567 0.0364667 0.97091
stats.Beginner 0.526618 1.2229 0.430631 0.666736
programming.Beginner -1.76976 1.069 -1.65553 0.0978177 .
programming.Novice -0.98035 1.14004 -0.859923 0.389831
ITERATIONS # 4 0 0 0 Number of Fisher Scoring iterations
ROWS # 40 0 0 0 Number of rows
Residual deviance 38.9038 0 0 0 on 33 degrees of freedom
Pearson goodness of fit 37.7905 0 0 0 on 33 degrees of freedom

AIC 52.9038 0 0 0 Akaike information criterion
BIC 64.726 0 0 0 Bayesian information criterion
Wald Test 9.89642 0 0 0.19452
Dispersion parameter 1 0 0 0 Taken to be 1 for BINOMIAL and POISSON.

For categorical variables, the model selects a reference category. In this example, the Advanced category was
used as a reference for the stats variable.
The query below returns the output shown in the following two tables:

SELECT * FROM glm_admissions_model ORDER BY attribute;

Table 364: GLM Example 1 Output Table (Columns 1-4)

attribute predictor category estimate


-1 Loglik -19.4519
0 (Intercept) 1.0775
1 masters yes
2 masters no 2.21655
3 gpa -0.113935
4 stats Advanced
5 stats Novice 0.0406848
6 stats Beginner 0.526618
7 programming Advanced
8 programming Beginner -1.76976
9 programming Novice -0.98035

Table 365: GLM Example 1 Output Table (Columns 5-8)

std_err z_score p_value significance


40 6 0
2.92076 0.368914 0.712192

1.01999 2.17311 0.0297719 *

0.802573 -0.141962 0.88711

1.11567 0.0364667 0.97091


1.2229 0.430631 0.666736

1.069 -1.65553 0.0978177 .


1.14004 -0.859923 0.389831

Example 2: Logistic Regression Analysis with Step Argument


This example generates the regression model using the Step argument with an intercept.
The Step argument is similar to the R function step(). After each step, the function drops one predictor from
the current predictor group. The next step starts with the GLM model that has the lowest AIC score.
The function repeats this process until only the intercept remains.

Input
GLM Example 1 input table admissions_train

SQL-MapReduce Call

DROP TABLE IF EXISTS glm_admissions_model1;
SELECT * FROM GLM (
ON (SELECT 1)
PARTITION BY 1
InputTable ('admissions_train')
OutputTable ('glm_admissions_model1')
InputColumns ('admitted', 'masters', 'gpa', 'stats', 'programming')
CategoricalColumns ('masters', 'stats', 'programming')
Family ('LOGISTIC')
Link ('LOGIT')
Weight ('1')
Threshold ('0.01')
MaxIterNum ('25')
Step ('true')
Intercept ('true')
);

Output
Note that the model starts with 33 degrees of freedom and successively increases the degrees of
freedom to 39, at which point the response is modeled with only the intercept. The model parameters are
obtained progressively by dropping one predictor variable at each step.

Table 366: GLM Example 2 Model Statistics

predictor estimate std_error z_score p_value significance


(Intercept) 1.07751 2.92076 0.368914 0.712192
masters.no 2.21655 1.01999 2.17311 0.0297719 *
gpa -0.113935 0.802573 -0.141962 0.88711
stats.Novice 0.0406848 1.11567 0.0364667 0.97091
stats.Beginner 0.526618 1.2229 0.430631 0.666736
programming.Beginner -1.76976 1.069 -1.65553 0.0978177 .
programming.Novice -0.98035 1.14004 -0.859923 0.389831
ITERATIONS # 4 0 0 0 Number of Fisher Scoring iterations
ROWS # 40 0 0 0 Number of rows
Residual deviance 38.9038 0 0 0 on 33 degrees of freedom
Pearson goodness of fit 37.7905 0 0 0 on 33 degrees of freedom
AIC 52.9038 0 0 0 Akaike information criterion
BIC 64.726 0 0 0 Bayesian information criterion
Wald Test 9.89642 0 0 0.19452
Dispersion parameter 1 0 0 0 Taken to be 1 for BINOMIAL and POISSON.
....
Residual deviance 44.7694 0 0 0 on 34 degrees of freedom
Pearson goodness of fit 39.895 0 0 0 on 34 degrees of freedom
AIC 56.7694 0 0 0 Akaike information criterion
BIC 66.9027 0 0 0 Bayesian information criterion
....
Residual deviance 41.8984 0 0 0 on 35 degrees of freedom
Pearson goodness of fit 41.8616 0 0 0 on 35 degrees of freedom
AIC 51.8984 0 0 0 Akaike information criterion
BIC 60.3428 0 0 0 Bayesian information criterion
...
Residual deviance 39.1062 0 0 0 on 36 degrees of freedom

Pearson goodness of fit 37.9515 0 0 0 on 36 degrees of freedom
AIC 47.1062 0 0 0 Akaike information criterion
BIC 53.8617 0 0 0 Bayesian information criterion
....
Residual deviance 45.6566 0 0 0 on 37 degrees of freedom
Pearson goodness of fit 40 0 0 0 on 37 degrees of freedom
AIC 51.6566 0 0 0 Akaike information criterion
BIC 56.7232 0 0 0 Bayesian information criterion
...
Residual deviance 42.8744 0 0 0 on 38 degrees of freedom
Pearson goodness of fit 40 0 0 0 on 38 degrees of freedom
AIC 46.8744 0 0 0 Akaike information criterion
BIC 50.2522 0 0 0 Bayesian information criterion
....
(Intercept) 0.619039 0.331497 1.86741 0.0618448 .
ITERATIONS # 3 0 0 0 Number of Fisher Scoring iterations
ROWS # 40 0 0 0 Number of rows
Residual deviance 51.7958 0 0 0 on 39 degrees of freedom
Pearson goodness of fit 40 0 0 0 on 39 degrees of freedom
AIC 53.7958 0 0 0 Akaike information criterion
BIC 55.4847 0 0 0 Bayesian information criterion
Wald Test 3.48721 0 0 0.0618447 .
Dispersion parameter 1 0 0 0 Taken to be 1 for BINOMIAL and POISSON.

The query below returns the output shown in the following table:

SELECT * FROM glm_admissions_model1 ORDER BY attribute;

Table 367: GLM Example 2 Output Table

attribute predictor category estimate std_err z_score p_value significance


-1 Loglik -22.3847 40 5 0
-1 Loglik -19.4519 40 6 0
-1 Loglik -19.4621 40 5 0
-1 Loglik -19.5493 40 4 0
-1 Loglik -20.9492 40 4 0
-1 Loglik -22.8158 40 3 0
-1 Loglik -19.5531 40 3 0
-1 Loglik -21.4127 40 2 0
-1 Loglik -22.8283 40 2 0
-1 Loglik -21.4372 40 1 0
-1 Loglik -25.8979 40 0 0
0 (Intercept) 0.676415 0.722718 0.935932 0.349308
0 (Intercept) 0.989811 2.90188 0.341094 0.733033
0 (Intercept) 1.46634 0.640513 2.28932 0.022061 *
0 (Intercept) 0.666854 2.75228 0.242292 0.808554
0 (Intercept) 1.04008 2.7457 0.378802 0.704835
0 (Intercept) 0.619039 0.331497 1.86741 0.0618448 .
0 (Intercept) 0.746249 0.70359 1.06063 0.288858
0 (Intercept) -0.182322 0.428174 -0.425811 0.670245
0 (Intercept) 0.385131 2.61141 0.14748 0.882753
0 (Intercept) 1.3463 2.81931 0.477529 0.632985
0 (Intercept) 1.07751 2.92076 0.368914 0.712192

Example 3: Gaussian Distribution Analysis with Default Options


For the Gaussian distribution the response variable must be a continuous numerical variable, where the data
is grouped around a single mean and the graph looks like a normal or bell curve distribution.

Input
The input table, housing_train, contains real estate data about homes. The model predicts the home price
using 12 predictors (six numerical and six categorical variables). The variable definitions are:
• Response variable:
∘ price - sale price of a house in $
• Predictors:

∘ lotsize - the lot size of a property in square feet
∘ bedrooms - number of bedrooms
∘ bathrms - number of full bathrooms
∘ stories - number of stories excluding basement
∘ driveway - does the house have a driveway?
∘ recroom - does the house have a recreational room?
∘ fullbase - does the house have a full finished basement?
∘ gashw - does the house use gas for hot water heating?
∘ airco - does the house have central air conditioning?
∘ garagepl - number of garage places
∘ prefarea - is the house located in the preferred neighborhood of the city?
∘ homestyle - style of home
Table 368: GLM Example 3 Input Table housing_train (Columns 1-7)

sn price lotsize bedrooms bathrms stories driveway


1 42000 5850 3 1 2 yes
2 38500 4000 2 1 1 yes
3 49500 3060 3 1 1 yes
4 60500 6650 3 1 2 yes
5 61000 6360 2 1 1 yes
6 66000 4160 3 1 1 yes
7 66000 3880 3 2 2 yes
8 69000 4160 3 1 3 yes
9 83800 4800 3 1 1 yes
10 88500 5500 3 2 4 yes
... ... ... ... ... ... ...

Table 369: GLM Example 3 Input Table housing_train (Columns 8-14)

recroom fullbase gashw airco garagepl prefarea homestyle


no yes no no 1 no Classic
no no no no 0 no Classic
no no no no 0 no Classic
yes no no no 0 no Eclectic
no no no no 0 no Eclectic
yes yes no yes 0 no Eclectic
no yes no no 2 no Eclectic
no no no no 0 no Eclectic
yes yes no no 0 no Eclectic

yes no no yes 1 no Eclectic
... ... ... ... ... ... ...

SQL-MapReduce Call
In this example, the family is GAUSSIAN and the default family link is IDENTITY.

DROP TABLE IF EXISTS glm_housing_model;
SELECT * FROM GLM (
ON (SELECT 1)
PARTITION BY 1
InputTable ('housing_train')
OutputTable ('glm_housing_model')
InputColumns ('price', 'lotsize', 'bedrooms', 'bathrms',
'stories', 'garagepl', 'driveway', 'recroom',
'fullbase', 'gashw', 'airco', 'prefarea',
'homestyle')
CategoricalColumns ('driveway', 'recroom', 'fullbase', 'gashw',
'airco', 'prefarea', 'homestyle')
Family ('GAUSSIAN')
Link ('IDENTITY')
Weight ('1')
Threshold ('0.01')
MaxIterNum ('25')
Step ('false')
Intercept ('true')
);

Output

Table 370: GLM Example 3 Model Statistics

predictor estimate std_error t_score p_value significance


(Intercept) 36349.3 2733.46 13.2979 0 ***
lotsize 2.08095 0.26133 7.96291 1.24345e-14 ***
bedrooms 782.093 766.84 1.01989 0.308296
bathrms 6772.31 1106.78 6.11894 1.96318e-09 ***
stories 2445.62 694.145 3.52321 0.000467307 ***
garagepl 1483.1 623.597 2.3783 0.0177847 *
driveway.no -2822.63 1481.25 -1.90558 0.0573049 .
recroom.yes 1208.53 1358.57 0.88956 0.37415
fullbase.yes 3588.3 1167.37 3.07382 0.00223419 **
gashw.yes 5787.25 2405.47 2.40587 0.0165127 *
airco.yes 6478.79 1152.16 5.62317 3.19341e-08 ***

prefarea.yes 6465.64 1212.84 5.33099 1.50887e-07 ***
homestyle.Classic -16550.9 1308.59 -12.6479 0 ***
homestyle.bungalow 37577.7 1850.17 20.3104 0 ***
ITERATIONS # 2 0 0 0 Number of Fisher Scoring iterations
ROWS # 492 0 0 0 Number of rows
Residual deviance Infinity 0 0 0 on 478 degrees of freedom
Pearson goodness of fit 5.30669e+10 0 0 0 on 478 degrees of freedom
AIC Infinity 0 0 0 Akaike information criterion
BIC Infinity 0 0 0 Bayesian information criterion
Wald Test 23174 0 0 0 ***
Dispersion parameter 1.11019e+08 0 0 0 Taken to be 1 for BINOMIAL and POISSON.

Many predictors are significant at the 95% confidence level (p-value < 0.05).
This query returns the following table:

SELECT * FROM glm_housing_model ORDER BY attribute;

Table 371: GLM Example 3 Output Table glm_housing_model

attribute predictor category estimate std_err z_score p_value significance


-1 Loglik -Infinity 492 13 0
0 (Intercept) 36349.3 2733.46 13.2979 0 ***
1 lotsize 2.08095 0.26133 7.96291 1.24345e-14 ***
2 bedrooms 782.093 766.84 1.01989 0.308296

3 bathrms 6772.31 1106.78 6.11894 1.96318e-09 ***
4 stories 2445.62 694.145 3.52321 0.000467307 ***
5 garagepl 1483.1 623.597 2.3783 0.0177847 *
6 driveway yes
7 driveway no -2822.63 1481.25 -1.90558 0.0573049 .
8 recroom no
9 recroom yes 1208.53 1358.57 0.88956 0.37415
10 fullbase no
11 fullbase yes 3588.3 1167.37 3.07382 0.00223419 **
12 gashw no
13 gashw yes 5787.25 2405.47 2.40587 0.0165127 *
14 airco no
15 airco yes 6478.79 1152.16 5.62317 3.19341e-08 ***
16 prefarea no
17 prefarea yes 6465.64 1212.84 5.33099 1.50887e-07 ***
18 homestyle Eclectic
19 homestyle Classic -16550.9 1308.59 -12.6479 0 ***
20 homestyle bungalow 37577.7 1850.17 20.3104 0 ***

GLMPredict

Summary
The GLMPredict function uses the model generated by the function GLM to perform generalized linear
model prediction on new input data.
This function can be used with real-time applications. Refer to AMLGenerator.

Usage

GLMPredict Syntax
Version 1.5

SELECT * FROM GLMPredict (


ON input_table
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
ModelTable ('model_table')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Family ('family') ]
[ Link ('link') ]
);

Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
ModelTable Required Specifies the name of the model table generated by the GLM function.
Accumulate Optional Specifies the names of input table columns to copy to the output table.
Family Optional Specifies the distribution exponential family. The default value is 'BINOMIAL'. If
you specify this argument, you must give it the same value that you used for the
Family argument of the function GLM when you generated the model table.
Link Optional The default value is 'CANONICAL'. The canonical link functions (default link
functions) and the link functions that are allowed for each exponential family are
listed in Background.

Note:
Use the same value that you used for the Link argument of the function GLM
when you generated the model table.

Input
The GLMPredict function has two input tables:
• A table of new data, whose schema follows
• A model table generated by the GLM function, described in Output Table of the function GLM
Table 372: GLMPredict Input Table Schema

Column Data Type Description


dependent_variable_column Any Required column that contains the dependent/response
variables. Cannot contain NULL values.
predictor_variable_columns Any Columns that contain independent/predictor variables. One
such column is required; others are optional. These columns
cannot contain NULL values.

Output
Table 373: GLMPredict Output Table Schema

Column Data Type Description


accumulate_column Same as in input_table Column copied from input_table.
fitted_value DOUBLE PRECISION Score of the input data, given by the equation g-1(Xβ), where
g-1 is the inverse link function, X is the matrix of predictors, and β is the
vector of coefficients estimated by the GLM function.
For the LOGISTIC family, the score is the probability of the
positive response class; for other values of Family, the scores are the
expected values of the dependent/response variable, conditional on the
predictors.
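As a worked restatement for the LOGIT link (the standard definition of the logistic function, not additional
function behavior): the inverse link is fitted_value = 1 / (1 + exp(-Xβ)), which is why the scores in
Example 1 fall between 0 and 1.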

Examples
• Example 1: Logistic Distribution Prediction
• Example 2: Gaussian Distribution Prediction

Example 1: Logistic Distribution Prediction

Input
The input test table, admissions_test, has admissions information for 20 students. The example uses
glm_admissions_model (refer to the table GLM Example 1 Model Statistics in the Output section of
Example 1 of the GLM function) to predict the admission status of these students.
Table 374: GLMPredict Example 1 Input Table admissions_test

id masters gpa stats programming admitted


50 yes 3.95 Beginner Beginner 0

51 yes 3.76 Beginner Beginner 0
52 no 3.7 Novice Beginner 1
53 yes 3.5 Beginner Novice 1
54 yes 3.5 Beginner Advanced 1
55 no 3.6 Beginner Advanced 1
56 no 3.82 Advanced Advanced 1
57 no 3.71 Advanced Advanced 1
58 no 3.13 Advanced Advanced 1
59 no 3.65 Novice Novice 1
60 no 4 Advanced Novice 1
61 yes 4 Advanced Advanced 1
62 no 3.7 Advanced Advanced 1
63 no 3.83 Advanced Advanced 1
64 yes 3.81 Advanced Advanced 1
65 yes 3.9 Advanced Advanced 1
66 no 3.87 Novice Beginner 1
67 yes 3.46 Novice Beginner 0
68 no 1.87 Advanced Novice 1
69 no 3.96 Advanced Advanced 1

SQL-MapReduce Call

CREATE TABLE glmpredict_admissions DISTRIBUTE BY hash(id) AS
SELECT * FROM GLMPredict (
ON admissions_test
ModelTable ('glm_admissions_model')
Accumulate ('id', 'masters', 'gpa', 'stats', 'programming',
'admitted')
Family ('LOGISTIC')
Link ('LOGIT')
) ORDER BY 1;

Output
The query below returns the output shown in the following table:

SELECT * FROM glmpredict_admissions ORDER BY 1;

Table 375: GLMPredict Example 1 Output Table glmpredict_admissions

id masters gpa stats programming admitted fitted_value


50 yes 3.95 Beginner Beginner 0 0.35076568493861
51 yes 3.76 Beginner Beginner 0 0.355711270634992
52 no 3.7 Novice Beginner 1 0.758307972010008
53 yes 3.5 Beginner Novice 1 0.556015248287644
54 yes 3.5 Beginner Advanced 1 0.769476126790063
55 no 3.6 Beginner Advanced 1 0.968031451283364
56 no 3.82 Advanced Advanced 1 0.945773239222729
57 no 3.71 Advanced Advanced 1 0.946412421748131
58 no 3.13 Advanced Advanced 1 0.949666663764171
59 no 3.65 Novice Novice 1 0.874190783966478
60 no 4 Advanced Novice 1 0.8650601595144
61 yes 4 Advanced Advanced 1 0.650621000328301
62 no 3.7 Advanced Advanced 1 0.946470175529378
63 no 3.83 Advanced Advanced 1 0.945714776638306
64 yes 3.81 Advanced Advanced 1 0.655525617391707
65 yes 3.9 Advanced Advanced 1 0.653206427288345
66 no 3.87 Novice Beginner 1 0.754740354720256
67 yes 3.46 Novice Beginner 0 0.260036221097295
68 no 1.87 Advanced Novice 1 0.890966489834395
69 no 3.96 Advanced Advanced 1 0.94494933637213

Categorizing Column fitted_value


The fitted_value column gives the probability that a student belongs to one of the output classes. A typical
logistic regression graph maps the input on the x-axis to a probability value between 0 and 1 on the y-axis.

A fitted_value probability >= 0.5 implies class 1 (student admitted), and a probability < 0.5 implies
class 0 (student rejected).
The following statements add a fitted_category column based on that cutoff and return the following table:

ALTER TABLE glmpredict_admissions
ADD COLUMN fitted_category int;
UPDATE glmpredict_admissions SET fitted_category = 1
WHERE fitted_value > 0.4999;
UPDATE glmpredict_admissions SET fitted_category = 0
WHERE fitted_value < 0.4999;
SELECT * FROM glmpredict_admissions ORDER BY 1;

Table 376: GLMPredict Example 1 Output Table glmpredict_admissions

id masters gpa stats programming admitted fitted_value fitted_category


50 yes 3.95 Beginner Beginner 0 0.35076568493861 0
51 yes 3.76 Beginner Beginner 0 0.355711270634992 0
52 no 3.7 Novice Beginner 1 0.758307972010008 1
53 yes 3.5 Beginner Novice 1 0.556015248287644 1
54 yes 3.5 Beginner Advanced 1 0.769476126790063 1
55 no 3.6 Beginner Advanced 1 0.968031451283364 1
56 no 3.82 Advanced Advanced 1 0.945773239222729 1
57 no 3.71 Advanced Advanced 1 0.946412421748131 1
58 no 3.13 Advanced Advanced 1 0.949666663764171 1
59 no 3.65 Novice Novice 1 0.874190783966478 1
60 no 4 Advanced Novice 1 0.8650601595144 1
61 yes 4 Advanced Advanced 1 0.650621000328301 1
62 no 3.7 Advanced Advanced 1 0.946470175529378 1
63 no 3.83 Advanced Advanced 1 0.945714776638306 1
64 yes 3.81 Advanced Advanced 1 0.655525617391707 1
65 yes 3.9 Advanced Advanced 1 0.653206427288345 1
66 no 3.87 Novice Beginner 1 0.754740354720256 1
67 yes 3.46 Novice Beginner 0 0.260036221097295 0
68 no 1.87 Advanced Novice 1 0.890966489834395 1
69 no 3.96 Advanced Advanced 1 0.94494933637213 1
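An equivalent way to derive the category in a single query, without altering the table, is a CASE expression
(a sketch using standard SQL; it assumes the same 0.5 cutoff as above):

SELECT *,
       CASE WHEN fitted_value >= 0.5 THEN 1 ELSE 0 END AS fitted_category
FROM glmpredict_admissions
ORDER BY 1;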

Prediction Accuracy
The prediction accuracy is calculated as follows:

SELECT (SELECT COUNT(id) FROM glmpredict_admissions
WHERE admitted = fitted_category)/(SELECT COUNT(id)
FROM glmpredict_admissions) AS prediction_accuracy;

Table 377: GLMPredict Example 1 Prediction Accuracy

prediction_accuracy
1.00000000000000000000
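To see how the predictions break down against the actual labels, you could also cross-tabulate the two
columns (a sketch; it uses only the columns created above):

SELECT admitted, fitted_category, COUNT(*) AS n
FROM glmpredict_admissions
GROUP BY admitted, fitted_category
ORDER BY admitted, fitted_category;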

Example 2: Gaussian Distribution Prediction

Input
The input test table, housing_test, contains test data on a sample of 54 homes. The example uses the
Gaussian model glm_housing_model from GLM Example 3 to predict prices for these new homes,
comparing the predictions with the original prices by computing the root mean squared error (RMSE).
Table 378: GLMPredict Example 2 Input Table housing_test (Columns 1-7)

sn price lotsize bedrooms bathrms stories driveway


13 27000 1700 3 1 2 yes
16 37900 3185 2 1 1 yes
25 42000 4960 2 1 1 yes
38 67000 5170 3 1 4 yes
53 68000 9166 2 1 1 yes
104 132000 3500 4 2 2 yes
111 43000 5076 3 1 1 no
117 93000 3760 3 1 2 yes
132 44500 3850 3 1 2 yes
140 43000 3750 3 1 2 yes
142 40000 2650 3 1 2 yes
157 60000 2953 3 1 2 yes
161 63900 3162 3 1 2 yes
... ... ... ... ... ... ...

Table 379: GLMPredict Example 2 Input Table housing_test (Columns 8-14)

recroom fullbase gashw airco garagepl prefarea homestyle


no no no no 0 no Classic
no no no yes 0 no Classic
no no no no 0 no Classic
no no no yes 0 no Eclectic
no yes no yes 2 no Eclectic
no no yes no 2 no bungalow
no no no no 0 no Classic
no no yes no 2 no Eclectic
no no no no 0 no Classic
no no no no 0 no Classic
no yes no no 1 no Classic
no yes no yes 0 no Eclectic
no no no yes 1 no Eclectic
... ... ... ... ... ... ...

SQL-MapReduce Call
The “canonical” link specifies the default family link, which is “identity” for the Gaussian distribution.

DROP TABLE IF EXISTS glmpredict_housing;
CREATE TABLE glmpredict_housing DISTRIBUTE BY hash(sn) AS
SELECT * FROM GLMPredict (
ON housing_test
ModelTable ('glm_housing_model')
Accumulate ('sn', 'price')
Family ('GAUSSIAN')
Link ('CANONICAL')
) ORDER BY 1;

Output
The following query returns the output shown in the following table:

SELECT * FROM glmpredict_housing ORDER BY 1;

Table 380: GLMPredict Example 2 Output Table

sn price fitted_value
13 27000 37345.844

16 37900 43687.13175
25 42000 40902.028
38 67000 72487.6705
53 68000 79238.6937
104 132000 111528.007
111 43000 39102.8812
117 93000 66936.951
132 44500 41819.8865
140 43000 41611.7915
... ... ...

The fitted_value column gives the predicted home price.
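As a check, the prediction for sn 13 can be reproduced by hand from the Example 3 coefficients (driveway
yes, recroom/fullbase/gashw/airco/prefarea no, and homestyle Classic; reference categories contribute 0):

36349.3 + 2.08095*1700 + 782.093*3 + 6772.31*1 + 2445.62*2 + 1483.1*0 - 16550.9 = 37345.844

which matches the fitted_value in the preceding table.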

RMSE

SELECT SQRT(AVG(POWER(glmpredict_housing.price -
glmpredict_housing.fitted_value, 2))) AS RMSE FROM glmpredict_housing;

Table 381: GLMPredict Example 2 RMSE

rmse
10246.7521984348
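To judge this error relative to typical home prices, one option is to normalize the RMSE by the mean price
(a sketch; this step is not part of the documented workflow):

SELECT SQRT(AVG(POWER(price - fitted_value, 2))) / AVG(price) AS nrmse
FROM glmpredict_housing;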

Hidden Markov Model Functions

Overview
The Hidden Markov model is a statistical model that describes the evolution of observable events that
depend on internal factors that are not directly observable. The following graph shows the key elements of
the model.
The graph has two parts, divided by a dashed line. Below the dashed line is the observed sequence; above the
dashed line are the hidden states (so called because they are not directly observable). The hidden states have
state transitions that introduce the state sequences.
In the following graph, the observed sequence is the weather, the hidden states are the seasons, and the state
sequence is summer, fall, winter, spring. The states have outgoing edges to the observations, where the edge
represents the emission. For example, summer emits good weather and winter emits bad weather.


The HMM model addresses three problems: Learning, Decoding, and Evaluating. The following graph
assumes that historical weather data from years 2011 to 2013 is available to train the model. If the hidden
states are labeled in the training data, this type of training process is called “supervised learning.” Otherwise,
it is called “unsupervised learning.” To use unsupervised learning, you must specify the number of hidden
states. After the model is trained, you can make predictions. Given the observed sequence, inferring the
internal state is called “decoding.” Given the sequence, measuring the probability of the sequence is called
“evaluation.”

Models and Descriptions


The following table describes each HMM model type.
Table 382: HMM Models and Descriptions

HMM Model Description


Unsupervised Learning Given an observation sequence and the number of states, find the model that
maximizes the probability of the observed sequence.
Supervised Learning Given observation sequences and their states, find the model that maximizes the
probability of the observed sequence.

Decoding Given the trained model and an observation sequence, find an optimal state
sequence.
Evaluation Given the trained model and an observation sequence, find the probability of
the sequence.

Aster Distributed Platforms


The following table describes the Aster platform on which each function is available.
Table 383: Functions and Aster Distributed Platforms

Function Name Aster Distributed Platform


HMMUnsupervisedLearner SQL-Graph Engine
HMMSupervisedLearner SQL-Graph Engine
HMMDecoder SQL-Map Reduce Engine
HMMEvaluator SQL-Map Reduce Engine

HMMUnsupervisedLearner

Summary
The HMMUnsupervisedLearner function is available on the SQL-Graph platform. The function can produce
multiple HMM models simultaneously, where each model is learned from a set of sequences and where each
sequence represents a vertex.

Usage

HMMUnsupervisedLearner Syntax
Version 1.3

SELECT * FROM HMMUnsupervisedLearner (


ON { table_name | view_name | (query) } AS vertices
PARTITION BY [ model_key, ...,] sequence_key_attributes
ORDER BY [ model_key, ]
time_ordered_sequence_attributes ASC
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]

[ ModelColumn ('model_attribute') ]
SeqColumn ('sequence_attribute')
ObsColumn ('observed_attribute')
HiddenStateNum ('number')
[ MaxIterNum ('max_iterations') ]
[ Epsilon ('epsilon') ]
[ SkipColumn ('skip_attribute') ]
[ InitMethods ( { 'random' | 'flat' | 'input' }, 'seed_number') ]
[ InitParams ('init_state_probability_vector',
'state_transition_probability_matrix',
'observation_emission_probability_matrix')
[ OutputTables('init_state_prob',
'state_transition_prob',
'emit_prob')]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
ModelColumn Optional The name of the column that contains the model attribute. If you
specify this argument, then model_attribute must match a
model_key in the PARTITION BY clause. The values in the column
can be integers or strings.
SeqColumn Required The name of the column that contains the sequence attribute. The
sequence_attribute must be a sequence attribute in the PARTITION
BY clause. A sequence must contain more than two observation
symbols.
ObsColumn Required The name of the column that contains the observed symbols. The
function scans the input table to find all possible observed symbols.

Note:
Observed symbols are case-sensitive.

HiddenStateNum Required The number of hidden states.

Note:
The number of hidden states can influence model quality and
performance, so choose the number appropriately.

MaxIterNum Optional The number of iterations that the training process runs before the
function completes. The default is 10.
Epsilon Optional The threshold value in determining the convergence of HMM
training. If the parameter value difference is less than the threshold,
the training process converges. There is no default value. If you do
not specify Epsilon, only MaxIterNum determines when the training
process converges.
SkipColumn Optional The name of the column whose values determine whether the
function skips the row. The function skips the row if the value is
“true”, “yes”, “y”, or “1”. The function does not skip the row if the
value is “false”, “f”, “no”, “n”, “0”, or NULL.
InitMethods Optional The method that the function uses to generate the initial parameters
for the initial state probabilities, state transition probabilities, and
emission probabilities. The possibilities are:
• random (default): The initial parameters are based on uniform
distribution.
• flat: The probabilities are equal. Each cell holds the same
probability in the matrix or vector.
• input: The function takes the initial parameters from the
InitParams argument.
The names of the preceding methods are case-insensitive.
The seed number is meaningful only when the specified method is
random.
InitParams Required when InitMethods has the value 'input'. Specifies the
initial parameters for the models. The first parameter specifies the
initial state probabilities, the second parameter specifies the state
transition probabilities, and the third parameter specifies the
emission probabilities.
For example, if the HiddenStateNum argument specifies three
hidden states and two observed symbols ('yes' and 'no'), then the
InitParams values are:
• init_state_probability_vector (the initial state probabilities):

'0.3333333333 0.3333333333 0.3333333333'


• state_transition_probability_matrix (the state transition
probabilities):

'0.3333333333 0.3333333333 0.3333333333;
0.3333333333 0.3333333333 0.3333333333;
0.3333333333 0.3333333333 0.3333333333'
• observation_emission_probability_matrix (the emission
probabilities):

'no:0.25 yes:0.75; no:0.35 yes:0.65; no:0.45 yes:0.55'

The sum of the probabilities in each row for the initial state
probabilities, state transition probabilities, or emission probabilities
parameters must be rounded to 1.0. The observed symbols are case-
sensitive. The number of states and the number of observed symbols
must be consistent with the HiddenStateNum argument and the
observed symbols in the input table; otherwise, the function displays
error messages.
OutputTables Optional The names of the output tables:
• init_state_prob
Initial state probability table (default name is Pi).
• state_transition_prob
State transition probabilities table (default name is A).
• emit_prob
Emission probability table (default name is B).
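For example, a call that supplies its own initial parameters through InitMethods('input') might look like the
following (a hedged sketch: the table my_sequences and its columns seq_id, t, and obs are hypothetical, and
the values repeat the three-state, two-symbol example from the InitParams description):

SELECT * FROM HMMUnsupervisedLearner (
  ON my_sequences AS vertices
  PARTITION BY seq_id
  ORDER BY t
  SeqColumn ('seq_id')
  ObsColumn ('obs')
  HiddenStateNum ('3')
  InitMethods ('input')
  InitParams ('0.3333333333 0.3333333333 0.3333333333',
              '0.3333333333 0.3333333333 0.3333333333;
               0.3333333333 0.3333333333 0.3333333333;
               0.3333333333 0.3333333333 0.3333333333',
              'no:0.25 yes:0.75; no:0.35 yes:0.65; no:0.45 yes:0.55')
);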

Input
The HMMUnsupervisedLearner function takes a vertices table as the input fact table. Each sequence
represents a vertex.
The PARTITION BY clause specifies attributes that represent the unique sequence across the table. For
example, in the following table, a valid PARTITION BY clause is PARTITION BY model_id, seq_id.
Table 384: HMMUnsupervisedLearner Example Vertices Table (Sequences)

model_id seq_id time observed_id


1 1 1 M
1 1 2 M
1 1 3 M
1 1 4 L
1 1 5 L
1 1 6 M
1 1 7 M
1 1 8 M
1 1 9 L
1 2 1 L
1 2 2 M
1 2 3 M
1 2 4 M
1 2 5 L

The ORDER BY clause ensures that the observations in each sequence are sorted chronologically in
ascending order. For example, in the preceding table, a valid ORDER BY clause is ORDER BY model_id,
seq_id, time. When seq_id is 1, the observed sequence is MMMLLMMML.

The function can train either one HMM or multiple HMMs. Each model_id in the input table corresponds
to an output HMM model.

Output
The HMMUnsupervisedLearner function outputs console messages and generates the following three tables
through JDBC:
• Initial-state probability table
• State-transition probability table
• Emission probability table
Table 385: HMMUnsupervisedLearner Console Message Table Schema

Column Data Type Description


message VARCHAR This is the console message, which describes the output. For
example:
“HMM models will be saved to the tables pi_loan, A_loan,
and B_loan once the training process is successfully
completed.”

Table 386: HMMUnsupervisedLearner Initial-State Probability Table

Column Data Type Description


model_attribute BIGINT, INTEGER, or VARCHAR The name of the column that contains the model attribute, specified by the ModelColumn argument.
state VARCHAR The hidden state of the learned HMM.
probability DOUBLE PRECISION The initial state probability determined by the function.

Table 387: HMMUnsupervisedLearner State-Transition Probability Table

Column Data Type Description


model_attribute BIGINT, INTEGER, or VARCHAR The name of the column that contains the model attribute, specified by the ModelColumn argument.
from_state VARCHAR The hidden state of the learned HMM, from which a transition emanates.
to_state VARCHAR The hidden state of the learned HMM, to which a transition is made.
probability DOUBLE PRECISION The probability of a transition from the from_state to the to_state.

Table 388: HMMUnsupervisedLearner Observation Probability Table

Column Data Type Description


model_attribute BIGINT, INTEGER, or VARCHAR The name of the column that contains the model attribute, specified by the ModelColumn argument.
state VARCHAR The hidden state of the learned HMM.
observed_key_attribute VARCHAR An observed symbol found in the observed_id column of the input table.
probability DOUBLE PRECISION The probability of emitting an observed symbol given that the HMM is in a hidden state.

Example

Loan Default Prediction


Financial institutions try to predict whether a customer is going to default on a loan by analyzing previous
transaction data. To reduce the number of defaulting customers, the financial institution may pro-actively
suggest default prevention initiatives to customers.
The statuses of loans are updated depending on payments made by the customer. These statuses can be used
to build a Hidden Markov Model to predict loan defaults. Let's assume that the statuses of the loans are the
following: current, late, one month late, two months late, three months late, four months late, defaulted and
paid. From transactions of customers who have paid off their loans, and of customers who have defaulted,
two Hidden Markov Models can be built. All transaction sequences ending with paid are used to train one
model, and all sequences ending with defaulted are used to train a second model. If a customer's current
sequence of statuses is evaluated against both the models, a default prediction is based on which model gives
a higher probability.
The hidden states of the HMM inherently denote the 'financial health' of the customer. A hidden state with a
high probability of emission of defaulted status would denote a state of poor financial health, while a hidden
state with high probability of emission of paid status would denote a state of a good financial health.
The observation symbols to build the HMM are the statuses referenced by number in the order shown in the
following table.
Table 389: HMMUnsupervisedLearner Example Observation Symbols

status symbols
current 1
late 2
one month late 3
two months late 4
three months late 5
four months late 6

defaulted 7
paid 8

Input
The input data used to train the two models are shown in the following table.
Table 390: HMMUnsupervisedLearner Example Input Table loan_prediction

model_id seq_id seq_vertex_id observed_id


1 1 0 1
1 1 1 1
1 1 2 1
1 1 3 1
1 1 4 1
1 1 5 1
1 1 6 1
1 1 7 1
1 1 8 1
1 1 9 1
1 1 10 1
1 1 11 1
1 1 12 4
1 1 13 5
1 1 14 6
1 1 15 6
1 1 16 6
1 1 17 7
...
2 1 0 1
2 1 1 1
2 1 2 1
2 1 3 1
2 1 4 1
2 1 5 1

2 1 6 1
2 1 7 1
2 1 8 1
2 1 9 1
2 1 10 1
2 1 11 1
2 1 12 8

The status of the loan is shown in the model_id column, where a value of “1” denotes a defaulted loan and a
value of “2” denotes a paid loan. Rows with the same model id are used to train a single model. The use of
two model ids ensures that two different models are trained. Also notice that the defaulted loans end with
observed_id=7 and paid loans end with observed_id=8. The seq_vertex_id column provides the ordering of
the symbols in the sequences.

SQL-MapReduce Call
Assume that there are three hidden states to train the models and use the default method, random. The
query outputs three state tables: pi_loan (initial state probabilities), A_loan (state transition probabilities)
and B_loan (emission, or observation, probabilities).

DROP TABLE IF EXISTS pi_loan;
DROP TABLE IF EXISTS A_loan;
DROP TABLE IF EXISTS B_loan;
SELECT * FROM HMMUnsupervisedLearner (
ON loan_prediction AS "vertices"
PARTITION BY model_id, seq_id ORDER BY seq_vertex_id
ModelColumn ('model_id')
SeqColumn ('seq_id')
ObsColumn ('observed_id')
HiddenStateNum ('3')
InitMethods ('random')
OutputTables ('pi_loan', 'A_loan', 'B_loan')
);

Output

Table 391: HMMUnsupervisedLearner Example Output Message

message
HMM models will be saved to the tables pi_loan, A_loan, and B_loan once the training process is
successfully completed.

The query below returns the output shown in the following table:

SELECT * FROM pi_loan ORDER BY 1, 2, 3;

Table 392: HMMUnsupervisedLearner Example Output Table pi_loan

model_id state probability


1 0 3.27859360752863e-09
1 1 6.31101164254861e-11
1 2 0.999999996658296
2 0 0.0281195637034241
2 1 0.0342773193334441
2 2 0.937603116963132

The following query returns the output shown in the following table:

SELECT * FROM A_loan ORDER BY 1, 2, 3;

Table 393: HMMUnsupervisedLearner Example Output Table A_loan

model_id from_state to_state probability


1 0 0 0.0640625915875049
1 0 1 0.935924962187232
1 0 2 1.24462252627054e-05
1 1 0 0.965881665949451
1 1 1 0.034116494612011
1 1 2 1.83943853841951e-06
1 2 0 0.0200526405800346
1 2 1 0.117982394786916
1 2 2 0.86196496463305
2 0 0 0.0135275442521027
2 0 1 0.831238578948845
2 0 2 0.155233876799053
2 1 0 0.650183335972221
2 1 1 0.176213616071229
2 1 2 0.17360304795655
2 2 0 0.143610221301369
2 2 1 0.259795319424887
2 2 2 0.596594459273744

The following query returns the output shown in the following table:

SELECT * FROM B_loan ORDER BY 1, 2, 3;

Table 394: HMMUnsupervisedLearner Example Output Table B_loan

model_id state observed probability


1 0 1 0.0278489806401534
1 0 2 2.13490305615227e-08
1 0 3 2.16452516177902e-08
1 0 4 0.303429629806469
1 0 5 4.02001410009148e-08
1 0 6 0.668721306358954
1 0 7 9.38935391851262e-218
1 0 8 1.07065392354079e-218
1 1 1 8.06980973060956e-05
1 1 2 0.0692752214932328
1 1 3 0.156873643540954
1 1 4 0.0227441708323421
1 1 5 0.213428380115597
1 1 6 0.537597885920567
1 1 7 1.71260624384861e-218
1 1 8 1.08244279935844e-217
1 2 1 0.993584745503789
1 2 2 0.0010870231448872
1 2 3 0.00531194778573366
1 2 4 1.62821429202875e-05
1 2 5 6.37778366603837e-11
1 2 6 1.35889208728403e-09
1 2 7 5.52574644335947e-220
1 2 8 5.07942195209965e-220
2 0 1 0.742492680184544
2 0 2 2.65054280803536e-06
2 0 3 3.46492926202034e-05
2 0 4 0.196209179240367

2 0 5 1.73539886740778e-10
2 0 6 0.0612608405661211
2 0 7 9.55020607438646e-216
2 0 8 1.0889956532584e-216
2 1 1 0.686900699423504
2 1 2 0.066121858529193
2 1 3 0.155155089264256
2 1 4 9.67394943767393e-06
2 1 5 0.0668633786585835
2 1 6 0.0249493001750253
2 1 7 1.5969261489751e-218
2 1 8 1.00932786930684e-217
2 2 1 0.998254250327946
2 2 2 0.000671519946212711
2 2 3 0.000761504268610617
2 2 4 0.00012244326239558
2 2 5 2.2387837278705e-06
2 2 6 0.000188043411107735
2 2 7 3.45823602960767e-219
2 2 8 3.17890807773867e-219

HMMSupervisedLearner

Summary
The HMMSupervisedLearner function is available on the SQL-Graph platform. The function can produce
multiple HMM models simultaneously, where each model is learned from a set of sequences and where each
sequence represents a vertex.

Usage

HMMSupervisedLearner Syntax
Version 1.3

SELECT * FROM HMMSupervisedLearner (


ON {table_name | view_name | (query)} AS "vertices"
PARTITION BY [ model_key, ...,] sequence_key_attributes
ORDER BY [ model_key, ] time_ordered_sequence_attributes ASC
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
[ ModelColumn ('model_attribute') ]
SeqColumn ('sequence_attribute')
ObsColumn ('observed_attribute1')
StateColumn ('state_attributes')
[ SkipColumn ('skip_attribute') ]
[ OutputTables ('init_state_prob', 'state_transition_prob',
'emit_prob')]
[ BatchSize ('size') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
ModelColumn Optional The name of the column that contains the model attribute. If you specify
this argument, then its value must match a model_key in the PARTITION
BY clause.
SeqColumn Required The name of the column that contains the sequence attribute. The
sequence_attribute must be a sequence attribute in the PARTITION BY
clause. A sequence must contain more than two observation symbols.
ObsColumn Required The name of the column that contains the observed symbols. The function
scans the input table to find all possible observed symbols.

Note:
Observed symbols are case-sensitive.

StateColumn Required The state attributes. You can specify multiple states. The states are case-sensitive.

SkipColumn Optional The name of the column whose values determine whether the function
skips the row. The function skips the row if the value is “true”, “yes”, “y”,
or “1”. The function does not skip the row if the value is “false”, “f”, “no”,
“n”, “0”, or NULL.
OutputTables Optional The names of the output tables:
init_state_prob—Initial state probability table (default name is Pi).
state_transition_prob—State transition probability table (default name is
A).
emit_prob—Emission probability table (default name is B).
BatchSize Optional The number of models to process in each batch. The size must be a positive integer.
If you do not specify the batch size, the function avoids out-of-memory errors
by determining the appropriate size itself. If you specify the batch size and there
is insufficient free memory, the function reduces it. The batch size is determined
dynamically, based on memory conditions. For example, at time T1, the specified
batch size of 1000 might be adjusted to 980, and at time T2, it might be adjusted to 800.

Input
The HMMSupervisedLearner function takes a vertices table as the input fact table. Each sequence represents
a vertex.
The PARTITION BY clause consists of the list of attributes that uniquely identify a sequence across the
entire table. For example, in the following table, a valid PARTITION BY clause is PARTITION BY model_id,
seq_id.
Table 395: HMMSupervisedLearner Example vertices table (sequences)

model_id seq_id time state_id observed_id


1 1 1 C M
1 1 2 A M
1 1 3 C M
1 1 4 C L
1 1 5 A L
1 1 6 C M
1 1 7 C M
1 1 8 A M
1 1 9 C L
1 2 1 A L
1 2 2 C M
1 2 3 A M

1 2 4 A M
1 2 5 A L

The ORDER BY clause ensures that the observations in each sequence are sorted chronologically in
ascending order. For example, in the preceding table, a valid ORDER BY clause is ORDER BY model_id,
seq_id, time. When seq_id is 1, the observed sequence is MMMLLMMML.
The training function can train either one HMM or multiple HMMs. Each model id corresponds to an
output HMM model.

Output
The HMMSupervisedLearner function outputs console messages and generates the following three tables
through JDBC:
• Initial-state probability table
• State-transition probability table
• Emission probability table
Table 396: HMMSupervisedLearner Console Message Table Schema

Column Data Type Description


message VARCHAR This is the console message, which describes the output. For example:
“HMM models will be saved to the tables pi_loan, A_loan, and B_loan
once the training process is successfully completed.”

Table 397: HMMSupervisedLearner Initial-State Probability Table

Column Data Type Description


model_attribute BIGINT, INTEGER, or VARCHAR The model attribute, specified by the ModelColumn argument.
state VARCHAR The hidden state of the learned HMM.
probability DOUBLE PRECISION The initial state probability determined by the function.

Table 398: HMMSupervisedLearner State-Transition Probability Table

Column Data Type Description


model_attribute BIGINT, INTEGER, or VARCHAR The model attribute, specified by the ModelColumn argument.
from_state VARCHAR The hidden state of the learned HMM, from which a transition emanates.

to_state VARCHAR The hidden state of the learned HMM, to which a transition is made.
probability DOUBLE PRECISION The probability of a transition from the from_state to the to_state.

Table 399: HMMSupervisedLearner Observation Probability Table

Column Data Type Description


model_attribute BIGINT, INTEGER, or VARCHAR The model attribute, specified by the ModelColumn argument.
state VARCHAR The hidden state of the learned HMM.
observed_key_attribute VARCHAR An observed symbol found in the observed_id column of the input table.
probability DOUBLE PRECISION The probability of emitting an observed symbol given that the HMM is in a hidden state.

Example

Customer Loyalty Prediction


Customer loyalty is an internal state, which cannot be directly observed or measured, but can be inferred
probabilistically. This use case shows how to predict the loyalty level of customers to a retailer. Shopping
patterns of customers can provide insights about the customer's loyalty to the retailer.
Customers who make frequent purchases or costlier purchases tend to be more loyal to a brand than
infrequent shoppers buying low-value items. A customer who makes a purchase each week is more loyal
than a customer who makes a purchase once a year. Similarly, a customer spending more money is more
loyal than a customer spending less money.
Based on the shopping patterns of several customers, you can build a Hidden Markov Model to analyze the
loyalty of new customers who do not have extensive purchase histories. Gaining insight about the loyalty
levels of customers can help retailers devise strategic marketing plans to retain loyal customers and convert
low-loyalty customers to high loyalty. Retailers can also take proactive action to turn around the fading
relationship with customers who show a downward trajectory of loyalty. This is particularly important for
high-value customers.
The dataset for this use case is the shopping history of customers. This dataset is transformed into
observations that can be used to train an HMM. Based on the time between purchases and the difference in
amount spent between purchases, purchases are classified into the following different levels:
Table 400: HMMSupervisedLearner Example Purchase Levels

Time between purchases Difference in amount spent


Small (S) Less (L)
Medium (M) Same (S)

Large (L) More (M)

To determine the level that a purchase belongs to, you can use K-Means clustering with K=3 to generate
clusters based on either time difference between purchases or difference in amount spent. Purchases
clustered using either metric are classified into one of the three purchase levels. The observation associated
with a purchase is the combination of the levels from both metrics.
From the different levels of time difference and spending amount difference between purchases, nine
combinations of spending profiles are possible: SL, SS, SM, ML, MS, MM, LL, LS, LM. These nine spending
profiles serve as the observation symbols for the HMM.
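As an illustration of the final transformation step, the following is a minimal sketch in standard SQL. It
assumes that the two cluster-derived level labels for each purchase are already stored in hypothetical
columns time_level (S, M, or L) and spend_level (L, S, or M) of a hypothetical table purchase_levels;
concatenating the two labels yields the observation symbol:

-- purchase_levels and its columns are hypothetical names, for illustration only
SELECT purchase_id,
time_level || spend_level AS observation
FROM purchase_levels;
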
If you assume that a customer belongs to one of three loyalty levels, low (L), normal (N), or high (H), then
the number of hidden states is three. The definition of these loyalty levels is at the discretion of the
business. This example uses supervised learning, so each purchase in the input dataset is labeled with the
customer's loyalty level. You can use these labeled observations to train an HMM, which can later be used to
assign loyalty levels to new, unlabeled purchase data.

Input
Table 401: HMMSupervisedLearner Example Input Table customer_loyalty

id user_id seq_id purchase_date loyalty_level observation


1 1 1 2014-01-01 L LL
2 1 1 2014-01-02 L ML
3 1 1 2014-01-03 L SL
4 1 1 2014-01-04 L LM
5 1 1 2014-01-05 L ML
6 1 1 2014-01-06 L LL
7 1 1 2014-01-07 L MM
8 1 1 2014-01-08 L MS
9 1 1 2014-01-09 L ML
10 1 1 2014-01-10 L LM
...
101 2 2 2014-01-01 N ML
102 2 2 2014-01-02 L LS
103 2 2 2014-01-03 N ML
104 2 2 2014-01-04 N ML
105 2 2 2014-01-05 N MS
106 2 2 2014-01-06 N ML
107 2 2 2014-01-07 N SM

108 2 2 2014-01-08 N MS
109 2 2 2014-01-09 N LL
110 2 2 2014-01-10 N MM
...
201 3 3 2014-01-01 L LL
202 3 3 2014-01-02 L LS
203 3 3 2014-01-03 H SS
204 3 3 2014-01-04 H SM
205 3 3 2014-01-05 H LL
206 3 3 2014-01-06 N ML
207 3 3 2014-01-07 H SS
208 3 3 2014-01-08 H MM
209 3 3 2014-01-09 H MS
210 3 3 2014-01-10 H ML
... ... ... ... ... ...

SQL-MapReduce Call
The following SQL query generates the probabilities with state information:

DROP TABLE IF EXISTS pi_loyalty;


DROP TABLE IF EXISTS A_loyalty;
DROP TABLE IF EXISTS B_loyalty;
SELECT * FROM HMMSupervisedLearner (
ON customer_loyalty AS "vertices"
PARTITION BY user_id, seq_id
ORDER BY user_id, seq_id, purchase_date ASC
ModelColumn ('user_id')
SeqColumn ('seq_id')
ObsColumn ('observation')
StateColumn ('loyalty_level')
OutputTables ('pi_loyalty', 'A_loyalty', 'B_loyalty')
);

Output
The function outputs three tables: pi_loyalty (initial state probabilities), A_loyalty (state transition
probabilities), and B_loyalty (emission probabilities).

Table 402: HMMSupervisedLearner Output Message

message
HMM models will be saved to the tables pi_loyalty, A_loyalty, and B_loyalty once the training process is
successfully completed.

The following query returns the output shown in the following table:

SELECT * FROM pi_loyalty ORDER BY 1, 2;

Table 403: HMMSupervisedLearner Output Table pi_loyalty

user_id state probability


1 H 0
1 L 1
1 N 0
2 H 0
2 L 0
2 N 1
3 H 0
3 L 1
3 N 0

The following query returns the output shown in the following table:

SELECT * FROM A_loyalty ORDER BY 1, 2, 3;

Table 404: HMMSupervisedLearner Output Table A_loyalty

user_id from_state to_state probability


1 H H 0.166666666666667
1 H L 0.666666666666667
1 H N 0.166666666666667
1 L H 0.0444444444444444
1 L L 0.933333333333333
1 L N 0.0222222222222222
1 N H 0.333333333333333
1 N L 0.666666666666667
1 N N 0
2 H H 0.461538461538462

2 H L 0.230769230769231
2 H N 0.307692307692308
2 L H 0.2
2 L L 0.1
2 L N 0.7
2 N H 0.0657894736842105
2 N L 0.0789473684210526
2 N N 0.855263157894737
3 H H 0.921348314606742
3 H L 0.0224719101123595
3 H N 0.0561797752808989
3 L H 0.75
3 L L 0.25
3 L N 0
3 N H 0.833333333333333
3 N L 0
3 N N 0.166666666666667

The following query returns the output shown in the following table:

SELECT * FROM B_loyalty ORDER BY 1, 2, 3;

Table 405: HMMSupervisedLearner Output Table B_loyalty

user_id state observed probability


1 H LL 0
1 H LM 0
1 H LS 0
1 H ML 0
1 H MM 0.166666666666667
1 H MS 0.5
1 H SL 0.166666666666667
1 H SM 0.166666666666667
1 H SS 0
1 L LL 0.307692307692308

1 L LM 0.142857142857143
1 L LS 0.131868131868132
1 L ML 0.153846153846154
1 L MM 0.0549450549450549
1 L MS 0.0879120879120879
1 L SL 0.0549450549450549
1 L SM 0.010989010989011
1 L SS 0.0549450549450549
1 N LL 0
1 N LM 0
1 N LS 0.333333333333333
1 N ML 0.333333333333333
1 N MM 0
1 N MS 0.333333333333333
1 N SL 0
1 N SM 0
1 N SS 0
2 H LL 0
2 H LM 0.0769230769230769
2 H LS 0
2 H ML 0.153846153846154
2 H MM 0.0769230769230769
2 H MS 0.153846153846154
2 H SL 0.153846153846154
2 H SM 0.153846153846154
2 H SS 0.230769230769231
2 L LL 0.1
2 L LM 0.2
2 L LS 0.3
2 L ML 0
2 L MM 0.2
2 L MS 0

2 L SL 0.2
2 L SM 0
2 L SS 0
2 N LL 0.0909090909090909
2 N LM 0.0779220779220779
2 N LS 0.0779220779220779
2 N ML 0.207792207792208
2 N MM 0.12987012987013
2 N MS 0.194805194805195
2 N SL 0.051948051948052
2 N SM 0.0909090909090909
2 N SS 0.0779220779220779
3 H LL 0.0555555555555556
3 H LM 0.0444444444444444
3 H LS 0.0777777777777778
3 H ML 0.0666666666666667
3 H MM 0.144444444444444
3 H MS 0.111111111111111
3 H SL 0.0555555555555556
3 H SM 0.2
3 H SS 0.244444444444444
3 L LL 0.25
3 L LM 0.25
3 L LS 0.5
3 L ML 0
3 L MM 0
3 L MS 0
3 L SL 0
3 L SM 0
3 L SS 0
3 N LL 0
3 N LM 0

3 N LS 0.166666666666667
3 N ML 0.333333333333333
3 N MM 0.166666666666667
3 N MS 0
3 N SL 0
3 N SM 0.166666666666667
3 N SS 0.166666666666667

HMMEvaluator

Summary
The HMMEvaluator function measures the probability of each observed sequence with respect to each trained HMM.

Usage

HMMEvaluator Syntax
Version 1.3

SELECT * FROM HMMEvaluator (


ON input_table AS "InitStateProb" PARTITION BY model_key
ON state_transition_table AS "TransProb" PARTITION BY model_key
ON emission_table AS "EmissionProb" PARTITION BY model_key
ON observation_table AS "observation" PARTITION BY model_key
ORDER BY time_ordered_sequence_attributes ASC
InitStateModelColumn ('model_key_attribute')
InitStateColumn ('state_key_attribute')
InitStateProbColumn ('probability')
TransAttributeColumn ('model_key_attribute')
TransFromStateColumn ('from_state_key_attribute')
TransToStateColumn ('to_state_key_attribute')
TransProbColumn ('probability')
EmitModelColumn ('model_key_attribute')
EmitStateColumn ('state_key_attribute')
EmitObsColumn ('observed_key_attribute')
EmitProbColumn ('probability')
ModelColumn ('model_key_attribute')
SeqColumn ('seq_key_attribute')
ObsColumn ('observed_key_attribute1')
[ Incremental
( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
[ ShowChangeRate

( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
[ SeqProbColumn ('sequence_probability_attribute') ]
[ SkipColumn ('skip_key_attribute') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
InitStateModelColumn Required The name of the model attribute column in the
InitStateProb table.
InitStateColumn Required The name of the state attribute column in the
InitStateProb table.
InitStateProbColumn Required The name of the initial probability column in the
InitStateProb table.
TransAttributeColumn Required The name of the model attribute column in the
TransProb table.
TransFromStateColumn Required The name of the source of the state transition column in
the TransProb table.
TransToStateColumn Required The name of the target of the state transition column in
the TransProb table.
TransProbColumn Required The name of the state transition probability column in
the TransProb table.
EmitModelColumn Required The name of the model attribute column in the
EmissionProb table.
EmitStateColumn Required The name of the state attribute in the EmissionProb
table.
EmitObsColumn Required The name of the observation attribute column in the
EmissionProb table.
EmitProbColumn Required The name of the emission probability in the
EmissionProb table.
ModelColumn Required The name of the column that contains the model
attribute. If you specify this argument, then
model_attribute must match a model_key in the
PARTITION BY clause.
SeqColumn Required The name of the column that contains the sequence
attribute. The sequence_attribute must be a sequence
attribute in the PARTITION BY clause.
ObsColumn Required The name of the column that contains the observed
symbols.

Note:
Observed symbols are case-sensitive.

Incremental Optional Specifies whether only new sequence probabilities are computed. If 'true'
(the default), only new sequence probabilities are computed; if 'false', all
probabilities are computed.

Note:
If the SeqProbColumn argument is omitted, the
function cannot determine whether the observed
sequence is new; therefore, it treats all model
sequences in the input tables as new.

ShowChangeRate Optional If 'true' (the default), the function also outputs the percentage
change from the previous predicted probability for the applied model.
SeqProbColumn Optional The name of the column that contains the previously computed
sequence probability. The function uses this value to calculate the change rate.
SkipColumn Optional The name of the column whose values determine
whether the function skips the row. The function skips
the row if the value is “true”, “yes”, “y”, or “1”. The
function does not skip the row if the value is “false”, “f”,
“no”, “n”, “0”, or NULL.
Accumulate Optional Specifies the names of the columns in input_table that
the function copies to the output table.

Input
HMMEvaluator accepts four input tables. Three tables contain the HMM parameters output by the
HMMUnsupervisedLearner or HMMSupervisedLearner function. The fourth table contains the newly
observed sequences, and has a schema similar to that of the input table or views for HMMUnsupervisedLearner.
Table 406: HMMEvaluator Initial-State Probability Table Schema

Column Data Type Description


model_attribute ANY The model attribute, specified by the InitStateModelColumn argument.
state_key_attribute ANY A symbol defining a hidden state.
probability DOUBLE PRECISION The initial probability of the hidden state.

Table 407: HMMEvaluator State-Transition Probability Table Schema

Column Data Type Description


model_attribute ANY The model attribute, specified by the TransAttributeColumn argument.
from_state_key_attribute ANY A symbol defining a hidden state from which a transition emanates.
to_state_key_attribute ANY A symbol defining a hidden state to which a transition is made.
probability DOUBLE PRECISION The probability of a transition from the from_state to the to_state.

Table 408: HMMEvaluator Emission Probability Table Schema

Column Data Type Description


model_attribute ANY The model attribute, specified by the EmitModelColumn argument.
state_key_attribute ANY A symbol defining a hidden state.
observed_key_attribute ANY A symbol defining an observation value.
probability DOUBLE PRECISION The probability of emitting an observed symbol given that the HMM is in a hidden state.

Output
Table 409: HMMEvaluator Output table

Column Data Type Description


model_attribute VARCHAR The model attribute, specified by the ModelColumn argument.
seq_attribute VARCHAR The sequence attribute, specified by the SeqColumn argument.
observed_key_attribute VARCHAR A symbol defining an observation value.
sequenceprob DOUBLE PRECISION The cumulative probability of the observed sequence up to and including this observation.
change_rate DOUBLE PRECISION The percentage change from the immediately previous sequence probability.

Example

Loan Default Prediction (from HMMUnsupervisedLearner)


The example evaluates a given loan sequence against two models: model 1 (trained on defaulted loan cases)
and model 2 (trained on paid loan cases). The models are trained in, and referenced from, the loan default
prediction example for HMMUnsupervisedLearner.

Input
The input, test_loan_prediction, is a test loan sequence, which HMMEvaluator uses to predict whether this
loan is more likely to be paid in full or to default. The input table does not include observations of 7 (default)
or 8 (paid). The sequence of observations in this table is the same for evaluation of both models (that is,
model_id = 1 and model_id = 2).
Table 410: HMMEvaluator Example Input Table: test_loan_prediction

model_id seq_id seq_vertex_id observed_id


1 17 0 1
1 17 1 1
1 17 2 1
1 17 3 1
1 17 4 1
1 17 5 1
1 17 6 1
1 17 7 2
1 17 8 1
1 17 9 1
1 17 10 1
1 17 11 1
1 17 12 1
1 17 13 1
1 17 14 3
1 17 15 4
1 17 16 5
1 17 17 6
1 17 18 6
1 17 19 6
1 17 20 6

1 17 21 6
1 17 22 6
1 17 23 6
1 17 24 6
1 17 25 6
2 17 0 1
2 17 1 1
2 17 2 1
2 17 3 1
2 17 4 1
2 17 5 1
2 17 6 1
2 17 7 2
2 17 8 1
2 17 9 1
2 17 10 1
2 17 11 1
2 17 12 1
2 17 13 1
2 17 14 3
2 17 15 4
2 17 16 5
2 17 17 6
2 17 18 6
2 17 19 6
2 17 20 6
2 17 21 6
2 17 22 6
2 17 23 6
2 17 24 6
2 17 25 6

SQL-MapReduce Call
The following SQL-MapReduce call generates the probabilities for each observation of the sequence for each
model.

SELECT model_id, seq_id, observed_id, sequence_probability


FROM hmmevaluator (
ON pi_loan AS "InitStateProb" PARTITION BY model_id
ON A_loan AS "TransProb" PARTITION BY model_id
ON B_loan AS "EmissionProb" PARTITION BY model_id
ON test_loan_prediction AS "observation"
PARTITION BY model_id ORDER BY seq_id, seq_vertex_id
InitStateModelColumn ('model_id')
InitStateColumn ('state')
InitStateProbColumn ('probability')
TransAttributeColumn ('model_id')
TransFromStateColumn ('from_state')
TransToStateColumn ('to_state')
TransProbColumn ('probability')
EmitModelColumn ('model_id')
EmitStateColumn ('state')
EmitObsColumn ('observed')
EmitProbColumn ('probability')
ModelColumn ('model_id')
SeqColumn ('seq_id')
ObsColumn ('observed_id')
) ORDER BY seq_id, model_id, sequence_probability DESC;

Output
For the sequence used in this example (seq_id 17), the output table shows the final sequence_probability given
by each model. For model 1, based on defaulted loans, the final sequence probability is 1.74E-09. For model
2, based on paid loans, the final sequence probability is 3.13E-20. Because the sequence probability is higher
for the model based on defaulted loans, this sequence is considered a potential loan default.
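If the evaluator output is materialized in a table, the final probability for each model can be extracted
with standard SQL. The following is a minimal sketch, assuming the HMMEvaluator result was stored in a
hypothetical table named loan_eval; because the cumulative sequence probability never increases along a
sequence, MIN returns the final value for each model:

-- loan_eval is a hypothetical table holding the HMMEvaluator result
SELECT model_id, seq_id,
MIN(sequence_probability) AS final_sequence_probability
FROM loan_eval
GROUP BY model_id, seq_id
ORDER BY final_sequence_probability DESC;
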
Table 411: HMMEvaluator Example Output Table

model_id seq_id observed_id sequence_probability


1 17 1 0.993584742274833
1 17 1 0.851505310052646
1 17 1 0.729260447782143
1 17 1 0.624564355984876
1 17 1 0.534898929195643
1 17 1 0.458106297146588
1 17 1 0.392338380263591
1 17 2 0.00358854867114758
1 17 1 0.000401495697642893
1 17 1 0.00026980372354861

1 17 1 0.00023093168726785
1 17 1 0.000197777640704223
1 17 1 0.000169383741802722
1 17 1 0.000145066206936303
1 17 3 3.36084804243903e-06
1 17 4 7.98372410594703e-07
1 17 5 1.58729988019435e-07
1 17 6 1.05435855365332e-07
1 17 6 5.7911331859339e-08
1 17 6 3.77256426226253e-08
1 17 6 2.10830918021873e-08
1 17 6 1.35206328300816e-08
1 17 6 7.66239716502205e-09
1 17 6 4.85221627629922e-09
1 17 6 2.78103185612748e-09
1 17 6 1.74324702481316e-09
2 17 1 0.980389981473457
2 17 1 0.857908863735501
2 17 1 0.72057154568008
2 17 1 0.593696584487765
2 17 1 0.485393734573205
2 17 1 0.395244199718448
2 17 1 0.321329565874656
2 17 2 0.00804174571505229
2 17 1 0.006257126737294
2 17 1 0.00485161500208196
2 17 1 0.00391022519558841
2 17 1 0.003136639612957
2 17 1 0.00254436934344932
2 17 1 0.00205892814732463
2 17 3 0.000117258620385012
2 17 4 1.49044721206875e-05

2 17 5 8.28281995237949e-07
2 17 6 3.6659469863682e-08
2 17 6 8.74083676644003e-10
2 17 6 3.47433552040545e-11
2 17 6 9.01243772422103e-13
2 17 6 3.33090622426949e-14
2 17 6 9.1607521394785e-16
2 17 6 3.22066536691838e-17
2 17 6 9.22501807653862e-19
2 17 6 3.13332757322232e-20

HMMDecoder

Summary
The HMMDecoder function finds the state sequence with the highest probability, given the learned model
and observed sequences.

Usage

HMMDecoder Syntax
Version 1.3

SELECT * FROM HMMDecoder (


ON init_prob_table AS "InitStateProb" PARTITION BY model_key
ON trans_prob_table AS "TransProb" PARTITION BY model_key
ON emission_prob_table AS "EmissionProb" PARTITION BY model_key
ON observation_table AS "observation" PARTITION BY model_key
ORDER BY time_ordered_sequence_attributes ASC
InitStateModelColumn ('model_attribute')
InitStateColumn ('state_key_attribute')
InitStateProbColumn ('probability')
TransAttributeColumn ('model_attribute')
TransFromStateColumn ('from_state_key_attribute')
TransToStateColumn ('to_state_key_attribute')
TransProbColumn ('probability')
EmitModelColumn ('model_attribute')
EmitStateColumn ('state_key_attribute')
EmitObsColumn ('observed_key_attribute')
EmitProbColumn ('probability')

ModelColumn ('model_attribute')
SeqColumn ('seq_attribute')
ObsColumn ('observed_attribute')
[ SequenceMaxSize ('range') ]
[ SkipColumn ('skip_attribute') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
SequenceMaxSize Optional The maximum length, in rows, of a sequence in the observation table.

For descriptions of the other arguments, see HMMEvaluator.

Input
The HMMDecoder function accepts four input tables. Three tables are dimensional tables that are generated
by the HMMUnsupervisedLearner or HMMSupervisedLearner function. The fourth table contains the
newly observed sequences. The schema of the fourth table is similar to the schema of the input table of the
HMMUnsupervisedLearner or HMMSupervisedLearner function.
Table 412: HMMDecoder Initial-State Probability Table Schema

Column Name Data Type Description


model_attribute ANY A value specified by the InitStateModelColumn argument.
state_key_attribute ANY A symbol defining a hidden state.
probability DOUBLE PRECISION The initial probability of the hidden state.

Table 413: HMMDecoder State Transition Probability Table Schema

Column Name Data Type Description


model_attribute ANY A value specified by the TransAttributeColumn argument; the name of the column that contains the model attribute.
from_state_key_attribute ANY A symbol defining a hidden state from which a transition is made.
to_state_key_attribute ANY A symbol defining a hidden state to which a transition is made.
probability DOUBLE PRECISION The probability of a transition from the from_state to the to_state.

Table 414: HMMDecoder Emission Probability Table Schema

Column Data Type Description


model_attribute ANY A value specified by the EmitModelColumn argument; the name of the column that contains the model attribute.
state_key_attribute ANY A symbol defining a hidden state.
observed_key_attributes ANY A symbol defining an observation value.
probability DOUBLE PRECISION The probability of emitting an observed symbol given that the HMM is in a hidden state.

The model attribute, sequence attributes, state attributes, observed key attributes, and percent change key
attributes specified in the arguments are used for output column names.

Output
Table 415: HMMDecoder Output Table Schema

Column Name Data Type Description


model_attribute STRING or INTEGER The model attribute, specified by the ModelColumn argument.
sequence_attribute STRING or INTEGER The sequence attribute, specified by the SeqColumn argument.
observed_key_attribute ANY The observed symbols, specified by the ObsColumn argument.
state_key_attribute ANY The most likely state of the HMM that corresponds to the observed value in the given sequence.

Examples
• Example 1: Loan Default Prediction (from Unsupervised Learner)
• Example 2: Customer Loyalty Prediction (from Supervised Learner)
• Example 3: Part-of-Speech Tagging
• Example 4: Bank Customer Churn

Example 1: Loan Default Prediction (from Unsupervised Learner)


Suppose that you want to find the three hidden states for the table test_loan_prediction (shown in the
Input section of the HMMEvaluator example), an unsupervised loan sequence trained with two models
(1 and 2).

Input
The input consists of the trained model tables from the Output section of the HMMUnsupervisedLearner
example (pi_loan, A_loan, and B_loan). The function predicts, or decodes, the hidden state information for
the new test sequence.

SQL-MapReduce Call

SELECT * FROM HMMDecoder (


ON pi_loan AS "InitStateProb" PARTITION BY model_id
ON A_loan AS "TransProb" PARTITION BY model_id
ON B_loan AS "EmissionProb" PARTITION BY model_id
ON test_loan_prediction AS "observation" PARTITION BY model_id
ORDER BY model_id, seq_id, seq_vertex_id ASC
InitStateModelColumn ('model_id')
InitStateColumn ('state')
InitStateProbColumn ('probability')
TransAttributeColumn ('model_id')
TransFromStateColumn ('from_state')
TransToStateColumn ('to_state')
TransProbColumn ('probability')
EmitModelColumn ('model_id')
EmitStateColumn ('state')
EmitObsColumn ('observed')
EmitProbColumn ('probability')
ModelColumn ('model_id')
SeqColumn ('seq_id')
ObsColumn ('observed_id ')
Accumulate ('seq_vertex_id ')
) ORDER BY 1, 2, 5;

Output
For the same sequence, the hidden states are different for each model.
Table 416: HMMDecoder Example 1 Output Table

model_id seq_id observed_id state seq_vertex_id


1 17 1 2 0
1 17 1 2 1
1 17 1 2 2
1 17 1 2 3
1 17 1 2 4
1 17 1 2 5
1 17 1 2 6
1 17 2 2 7
1 17 1 2 8
1 17 1 2 9
1 17 1 2 10
1 17 1 2 11
1 17 1 2 12
1 17 1 2 13

1 17 3 1 14
1 17 4 0 15
1 17 5 1 16
1 17 6 0 17
1 17 6 1 18
1 17 6 0 19
1 17 6 1 20
1 17 6 1 21
1 17 6 1 22
1 17 6 1 23
1 17 6 1 24
1 17 6 0 25
2 17 1 2 0
2 17 1 2 1
2 17 1 2 2
2 17 1 2 3
2 17 1 2 4
2 17 1 2 5
2 17 1 2 6
2 17 2 1 7
2 17 1 1 8
2 17 1 1 9
2 17 1 1 10
2 17 1 1 11
2 17 1 1 12
2 17 1 1 13
2 17 3 1 14
2 17 4 0 15
2 17 5 1 16
2 17 6 1 17
2 17 6 1 18
2 17 6 1 19

2 17 6 1 20
2 17 6 1 21
2 17 6 1 22
2 17 6 1 23
2 17 6 1 24
2 17 6 0 25

Example 2: Customer Loyalty Prediction (from Supervised Learner)

Input
The input table, customer_loyalty_newseq, is a collection of three new test sequences (seq_id 4, 5, 6) for
user_id 1. HMMSupervisedLearner was trained on multiple users, producing the model tables pi_loyalty,
A_loyalty, and B_loyalty. This example uses these model tables with the input to determine the loyalty levels
of customers from the new sequences of purchases. The loyalty levels are low (L), normal (N), and high (H).
Table 417: HMMDecoder Example 2 Input Table customer_loyalty_newseq

id user_id seq_id purchase_date observation


301 1 4 2014-05-01 LL
302 1 4 2014-05-02 LS
303 1 4 2014-05-03 SS
304 1 4 2014-05-04 SM
305 1 4 2014-05-05 LL
306 1 4 2014-05-06 ML
307 1 4 2014-05-07 SS
308 1 4 2014-05-08 MM
309 1 4 2014-05-09 MS
310 1 4 2014-05-10 ML
311 1 5 2014-05-01 ML
312 1 5 2014-05-02 SM
313 1 5 2014-05-03 MS
314 1 5 2014-05-04 MS
315 1 5 2014-05-05 SS
316 1 5 2014-05-06 MS
317 1 5 2014-05-07 ML

318 1 5 2014-05-08 MM
319 1 5 2014-05-09 SM
320 1 5 2014-05-10 SM
321 1 6 2014-05-01 SM
322 1 6 2014-05-02 MS
323 1 6 2014-05-03 SS
324 1 6 2014-05-04 LM
325 1 6 2014-05-05 SL
326 1 6 2014-05-06 SS
327 1 6 2014-05-07 SS
328 1 6 2014-05-08 SM
329 1 6 2014-05-09 LS
330 1 6 2014-05-10 LS

SQL-MapReduce Call

SELECT * FROM HMMDecoder (


ON pi_loyalty AS "InitStateProb" PARTITION BY user_id
ON A_loyalty AS "TransProb" PARTITION BY user_id
ON B_loyalty AS "EmissionProb" PARTITION BY user_id
ON customer_loyalty_newseq AS "observation" PARTITION BY user_id
ORDER BY user_id, seq_id, purchase_date ASC
InitStateModelColumn ('user_id')
InitStateColumn ('state')
InitStateProbColumn ('probability')
TransAttributeColumn ('user_id')
TransFromStateColumn ('from_state')
TransToStateColumn ('to_state')
TransProbColumn ('probability')
EmitModelColumn ('user_id')
EmitStateColumn ('state')
EmitObsColumn ('observed')
EmitProbColumn ('probability')
ModelColumn ('user_id')
SeqColumn ('seq_id')
ObsColumn ('observation')
Accumulate ('purchase_date')
) ORDER BY 1, 2, 5;

Output
The output table shows the decoded loyalty levels for the new sequences. For seq_id 5, the loyalty level
increased toward the end of the sequence (from L to H); for the other sequences (seq_id 4 and seq_id 6), the
loyalty level did not change.

Table 418: HMMDecoder Example 2 Output Table

user_id seq_id observation state purchase_date


1 4 LL L 2014-05-01
1 4 LS L 2014-05-02
1 4 SS L 2014-05-03
1 4 SM L 2014-05-04
1 4 LL L 2014-05-05
1 4 ML L 2014-05-06
1 4 SS L 2014-05-07
1 4 MM L 2014-05-08
1 4 MS L 2014-05-09
1 4 ML L 2014-05-10
1 5 ML L 2014-05-01
1 5 SM L 2014-05-02
1 5 MS L 2014-05-03
1 5 MS L 2014-05-04
1 5 SS L 2014-05-05
1 5 MS L 2014-05-06
1 5 ML L 2014-05-07
1 5 MM L 2014-05-08
1 5 SM H 2014-05-09
1 5 SM H 2014-05-10
1 6 SM L 2014-05-01
1 6 MS L 2014-05-02
1 6 SS L 2014-05-03
1 6 LM L 2014-05-04
1 6 SL L 2014-05-05
1 6 SS L 2014-05-06
1 6 SS L 2014-05-07
1 6 SM L 2014-05-08
1 6 LS L 2014-05-09
1 6 LS L 2014-05-10

Example 3: Part-of-Speech Tagging

Input
The HMMDecoder function can be used to decode the parts of speech (adjective, noun, verb, and so on) for
a word set, if the set of phrases or words has been trained using the HMMSupervisedLearner or
HMMUnsupervisedLearner function. Assume that you have a set of phrases (shown in the following table)
whose parts of speech are unknown, and that the three trained state tables (initial, state_transition, and
emission) are readily available.
In this example, the parts of speech correspond to the hidden states of the HMM. There are two
hidden states in this example: A (adjective) and N (noun). HMMDecoder can be used to find these parts of
speech.
Table 419: HMMDecoder Example 3 Input Table phrases

model phrase_id word


1 1 clown
1 1 crazy
1 1 killer
1 1 problem
1 2 nice
1 2 weather

The following is a table of initial states:


Table 420: HMMDecoder Example 3 Input Table initial

model tag probability


1 A 0.25
1 N 0.75

The following is a table of state transitions:


Table 421: HMMDecoder Example 3 Input Table state_transition

model from_tag to_tag probability


1 A A 0
1 A N 1
1 N A 0.5
1 N N 0.5

The following is a table of emissions:

Table 422: HMMDecoder Example 3 Input Table emission

model tag word probability


1 A clown 0
1 N clown 0.4
1 A crazy 1
1 N crazy 0
1 A killer 0
1 N killer 0.3
1 A problem 0
1 N problem 0.3

SQL-MapReduce Call

SELECT * FROM HMMDecoder (


ON initial AS "InitStateProb" PARTITION BY model
ON state_transition as "TransProb" PARTITION BY model
ON emission AS "EmissionProb" PARTITION BY model
ON phrases AS "observation" PARTITION BY model
ORDER BY model, phrase_id ASC
InitStateModelColumn ('model')
InitStateColumn ('tag')
InitStateProbColumn ('probability')
TransAttributeColumn ('model')
TransFromStateColumn ('from_tag')
TransToStateColumn ('to_tag')
TransProbColumn ('probability')
EmitModelColumn ('model')
EmitStateColumn ('tag')
EmitObsColumn ('word')
EmitProbColumn ('probability')
ModelColumn ('model')
SeqColumn ('phrase_id')
ObsColumn ('word')
) ORDER by 1, 2, 3;

Output

Table 423: HMMDecoder Example 3 Output Table

model phrase_id word tag


1 1 clown N
1 1 crazy A
1 1 killer N
1 1 problem N

1 2 nice A
1 2 weather A

Example 4: Bank Customer Churn

Input
HMMDecoder can also be used to find the propensity of customer churn, given the actions or transactions
of a customer in a bank. The input table, churn_data, contains different transactions (column action) of a
customer (column id). The order of transactions is shown in the column path_id. Assume that the trained
tables with their state probabilities are readily available, as shown in the three tables that follow churn_data,
and that the states correspond to T (true, the customer is likely to churn) or F (false, the customer is
unlikely to churn).
Table 424: HMMDecoder Example 4 Input Table churn_data

model action path_id path_max id product


1 CALL_COMPLAINT 1 4 1 BROKERAGE
1 CALL_COMPLAINT 2 4 1 BROKERAGE
1 FEE_REVERSAL 3 4 1 BROKERAGE
1 BALANCE_TRANSFER 4 4 1 BROKERAGE
1 ACCOUNT_BOOKED_ONLINE 1 4 2 CREDITCARD
1 FEE_REVERSAL 2 4 2 CREDITCARD
1 LINK_EXTERNAL_ACCOUNT 3 4 2 CREDITCARD
1 BALANCE_TRANSFER 4 4 2 CREDITCARD
1 STARTS_APPLICATION 1 5 3 CD
1 MORTGAGE_CALC 2 5 3 CD
1 COMPARE 3 5 3 CD
1 BROWSE 4 5 3 CD
1 COMPLETE_APPLICATION 5 5 3 CD

The following is a table of initial states:


Table 425: HMMDecoder Example 4 Input Table churn_initial

model tag probability


1 F 0.909
1 T 0.091

The following is a table of state transitions:

Table 426: HMMDecoder Example 4 Input Table churn_state_transition

model from_tag to_tag probability


1 F F 1
1 F T 0
1 T F 0
1 T T 1

The following is a table of emissions:


Table 427: HMMDecoder Example 4 Input Table churn_emission

model state observed probability


1 F ACCOUNT_BOOKED_OFFLINE 0.005
1 F ACCOUNT_BOOKED_ONLINE 0.028
1 F ADD_DIRECT_DEPOSIT 0.001
1 F BALANCE_TRANSFER 0
1 F BROWSE 0.545
1 F CALL_COMPLAINT 0
1 F CLICK 0.003
1 F COMPARE 0.113
1 F COMPLETE_APPLICATION 0.06
1 F ENROLL_AUTO_SAVINGS 0.001
1 F FEE_REVERSAL 0
1 F LINK_EXTERNAL_ACCOUNT 0
1 F LOAN_CALC 0.024
1 F MORTGAGE_CALC 0.011
1 F OLB 0.039
1 F REFERRAL 0.003
1 F STARTS_APPLICATION 0.167
1 T ACCOUNT_BOOKED_OFFLINE 0.027
1 T ACCOUNT_BOOKED_ONLINE 0.074
1 T ADD_DIRECT_DEPOSIT 0
1 T BALANCE_TRANSFER 0.238
1 T BROWSE 0.016
1 T CALL_COMPLAINT 0.233
1 T CLICK 0

1 T COMPARE 0.033
1 T COMPLETE_APPLICATION 0.028
1 T ENROLL_AUTO_SAVINGS 0
1 T FEE_REVERSAL 0.221
1 T LINK_EXTERNAL_ACCOUNT 0.106
1 T LOAN_CALC 0.006
1 T MORTGAGE_CALC 0.002
1 T OLB 0.002
1 T REFERRAL 0.001
1 T STARTS_APPLICATION 0.013

SQL-MapReduce Call

SELECT * FROM HMMDecoder (


ON churn_initial AS "InitStateProb" PARTITION BY model
ON churn_state_transition as "TransProb" PARTITION BY model
ON churn_emission AS "EmissionProb" PARTITION BY model
ON churn_data AS "observation" PARTITION BY model
ORDER BY model,id, path_id ASC
InitStateModelColumn ('model')
InitStateColumn ('tag')
InitStateProbColumn ('probability')
TransAttributeColumn ('model')
TransFromStateColumn ('from_tag')
TransToStateColumn ('to_tag')
TransProbColumn ('probability')
EmitModelColumn ('model')
EmitStateColumn ('state')
EmitObsColumn ('observed')
EmitProbColumn ('probability')
ModelColumn ('model')
SeqColumn ('id')
ObsColumn ('action')
Accumulate ('path_id')
) ORDER BY 1, 2, 5, 3;

Output

Table 428: HMMDecoder Example 4 Output Table

model id action tag path_id


1 1 CALL_COMPLAINT T 1
1 1 CALL_COMPLAINT T 2
1 1 FEE_REVERSAL T 3

1 1 BALANCE_TRANSFER T 4
1 2 ACCOUNT_BOOKED_ONLINE T 1
1 2 FEE_REVERSAL T 2
1 2 LINK_EXTERNAL_ACCOUNT T 3
1 2 BALANCE_TRANSFER T 4
1 3 STARTS_APPLICATION F 1
1 3 MORTGAGE_CALC F 2
1 3 COMPARE F 3
1 3 BROWSE F 4
1 3 COMPLETE_APPLICATION F 5

Histogram

Summary
Histograms are useful for assessing the shape of a data distribution. The Histogram function calculates the
frequency distribution of a dataset using sophisticated binning techniques that can automatically calculate
the bin width and number of bins. The function maps each input row to one bin and returns the frequency
(row count) and proportion (percentage of rows) of each bin.
The Aster Analytics histogram implementation, redesigned for release 6.21, includes the following
capabilities:
• User-selected or automatic bin determination
• User-selected left-inclusive or right-inclusive binning
• Multiple histograms for distinct groups

Background
The Histogram function uses either Sturges' or Scott's algorithm to compute binning (bin width and number
of bins). The bin width is the range for each group of values. Binning algorithms make strong assumptions
about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis,
different bin widths may be appropriate.

Sturges’ Algorithm
Sturges' algorithm for calculating bin width can be written as:

w = r / (1 + log2(n))

where w is the bin width, r is the range of the data values and n is the number of elements in the data set.
This algorithm performs best if the data is normally distributed and n is at least 30.
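For example, with the 32-row cars_hist table used in the examples later in this section, the hp values range
from 52 to 335, so r = 283 and w = 283 / (1 + log2(32)) = 283/6, or approximately 47.2, consistent with the
bin width of 50 that appears in Example 1.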

Scott’s Algorithm
Scott's algorithm for calculating bin width can be written as:

w = 3.49s / n^(1/3)

where w is the bin width, s is the standard deviation of the data values and n is the number of elements in the
data set. The number of bins is r/w, where r is the range of the data values.
This algorithm performs best on normally distributed data.
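As a rough check against the same cars_hist data (n = 32, with a sample standard deviation of hp of
approximately 68.6), w = 3.49 × 68.6 / 32^(1/3), or approximately 75, and r/w = 283/75 suggests about four
bins; Example 2 indeed produces four bins.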

Usage

Histogram Syntax
Version 1.0
The function provides several options for how bins are defined:
• The user can specify a target for the number of bins.
• The user can specify the bin boundaries.
• The function can determine the bin boundaries automatically, using one of two built-in algorithms.
• The user can provide the minimum and maximum values for the histogram, and an optional bin size. If a
bin size is provided, the bins are equally sized; if not, they might not be.

SELECT * FROM hist (


ON (SELECT 1) PARTITION BY 1
InputTable (data_table)
OutputTable (out_table)
[ AutoBin ({ 'Sturges' | 'Scott' | number_of_bins }) ]
[ CustomBinTable (bin_table) ]
[ CustomBinColumn ('breaks_col') ]
[ StartValue ('bin_start')]
[ BinSize ('bin_size')]
[ EndValue ('bin_end')]
ValueColumn ('value_col')
[ Inclusion ({ 'left' | 'right' }) ]
[ GroupbyColumns('groupby_col') ]
);
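
Because none of the examples later in this section demonstrates custom bins, the following is a hedged
sketch of such a call. It assumes a hypothetical one-column table hp_breaks whose column breaks lists the
boundary values (for example, 50, 150, 250, and 350):

SELECT * FROM hist (
ON (SELECT 1) PARTITION BY 1
InputTable ('cars_hist')
OutputTable ('cars_custom_out')
CustomBinTable ('hp_breaks')
CustomBinColumn ('breaks')
ValueColumn ('hp')
);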

Arguments
Argument Category Description
InputTable Required Table containing the input data.
OutputTable Required Name for the table that the function generates, containing the output.

AutoBin Optional Specifies either the algorithm to be used for selecting bin boundaries or
the approximate number of bins to be found. The permitted values are
STURGES, SCOTT, or a positive integer.
If this argument is present, CustomBinTable, CustomBinColumn,
StartValue, BinSize, and EndValue cannot be present.
CustomBinTable Optional A table containing the boundary points between bins.
If this argument is present, CustomBinColumn must also be present.
AutoBin, StartValue, BinSize, and EndValue cannot be present.
CustomBinColumn Optional The column in the CustomBinTable containing the boundary values.
Input columns must contain numeric SQL types.
If this argument is present, CustomBinTable must also be present.
AutoBin, StartValue, BinSize, and EndValue cannot be present.
StartValue Optional The smallest value to be used in binning.
If this argument is present, BinSize and EndValue must also be present.
AutoBin, CustomBinTable and CustomBinColumn cannot be present.
EndValue Optional The largest value to be used in binning.
If this argument is present, StartValue and BinSize must also be present.
AutoBin, CustomBinTable and CustomBinColumn cannot be present.
BinSize Optional For equally sized bins, a double value specifying the width of the bin. Omit
this argument if you are not using equally sized bins. The input value must
be greater than 0.0.
If this argument is present, StartValue and EndValue must also be present.
AutoBin, CustomBinTable and CustomBinColumn cannot be present.
GroupByColumns Optional Columns in the input table used to group values for binning. These
columns cannot contain doubles or floats.
Inclusion Optional Specifies whether to include points on bin boundaries in the bin on the left
or the bin on the right. The default value is left.
ValueColumn Required The column in the input table for which statistics are to be computed. The
column must contain a numeric SQL type (INTEGER, BIGINT, REAL,
DOUBLE PRECISION, NUMERIC, DECIMAL, SMALLINT).

Input
The input table must include a column with the data to be binned, and may include one or more GroupBy
columns. Other columns are ignored.
Table 429: Histogram Input Table Schema

Column Name Data Type Description


ValueColumn NUMERIC Contains the values to be binned.
GroupByColumns INTEGER or VARCHAR If this argument is present, the input table is
grouped by the specified columns, and a separate
histogram is produced for each group. Columns
specified in this argument are accumulated to the
output table.

Output
The output table contains any user-specified GroupBy columns and the bin information (lower boundary,
upper boundary, and number and percentage of input rows falling into that bin).
Table 430: Histogram Output Table Schema

Column Name Data Type Description


GroupByColumns Any Columns that were used to group the input data.
bin INTEGER Bin identifier. Starts at 0.
bin_start DOUBLE PRECISION Lower boundary of the bin.
bin_end DOUBLE PRECISION Upper boundary of the bin.
bin_count INTEGER Number of input rows in that bin.
bin_percent REAL Percent of input rows in that bin.

Examples
• Example 1: Bins with Sturges' Algorithm
• Example 2: Bins with Scott's Algorithm
• Example 3: You Specify Bins

Input
All examples use the same input table, cars_hist, which has the cylinder (cyl) and horsepower (hp) data for
different car models. The example computes the histograms on the hp column.
Table 431: Histogram Example Input Table cars_hist

id name cyl hp
1 Mazda RX4 6 110
2 Mazda RX4 Wag 6 110
3 Datsun 710 4 93
4 Hornet 4 Drive 6 110
5 Hornet Sportabout 8 175
6 Valiant 6 105
7 Duster 360 8 245
8 Merc 240D 4 62

9 Merc 230 4 95
10 Merc 280 6 123
11 Merc 280C 6 123
12 Merc 450SE 8 180
13 Merc 450SL 8 180
14 Merc 450SLC 8 180
15 Cadillac Fleetwood 8 205
16 Lincoln Continental 8 215
17 Chrysler Imperial 8 230
18 Fiat 128 4 66
19 Honda Civic 4 52
20 Toyota Corolla 4 65
21 Toyota Corona 4 97
22 Dodge Challenger 8 150
23 AM CJavelin 8 150
24 Camaro Z28 8 245
25 Pontiac Firebird 8 175
26 Fiat X1-9 4 66
27 Porsche 914-2 4 91
28 Lotus Europa 4 113
29 Ford Pantera L 8 264
30 Ferrari Dino 6 175
31 Maserati Bora 8 335
32 Volvo 142E 4 109

Example 1: Bins with Sturges' Algorithm

SQL-MapReduce Call

DROP TABLE IF EXISTS cars_sturges_out;


SELECT * FROM hist (
ON (SELECT 1) PARTITION BY 1
InputTable ('cars_hist')
OutputTable ('cars_sturges_out')
AutoBin ('Sturges')

ValueColumn ('hp')
);

Output

Table 432: Histogram Example 1 Output Table

output table output columns


"cars_sturges_out" bin bin_start bin_end bin_count bin_percent

The following query returns the output shown in the following table:

SELECT * FROM cars_sturges_out ORDER BY 1;

Table 433: Histogram Example 1 Output Table cars_sturges_out

bin bin_start bin_end bin_count bin_percent


0 50 100 9 28.13
1 100 150 8 25.00
2 150 200 8 25.00
3 200 250 5 15.63
4 250 300 1 3.13
5 300 350 1 3.13

Example 2: Bins with Scott's Algorithm

SQL-MapReduce Call

DROP TABLE IF EXISTS cars_scott_out;


SELECT * FROM hist (
ON (SELECT 1) PARTITION BY 1
InputTable ('cars_hist')
OutputTable ('cars_scott_out')
AutoBin ('Scott')
ValueColumn ('hp')
);

Output

Table 434: Histogram Example 2 Output Table

output table output columns


"cars_scott_out" bin bin_start bin_end bin_count bin_percent

The following query returns the output shown in the following table:
SELECT * FROM cars_scott_out ORDER BY 1;
Table 435: Histogram Example 2 Output Table cars_scott_out

bin bin_start bin_end bin_count bin_percent


0 0 100 9 28.13
1 100 200 16 50.00
2 200 300 6 18.75
3 300 400 1 3.13

Example 3: You Specify Bins


The user specifies the bin size, start value, and end value, and the output is grouped by each value of the
grouping variable (cyl).

SQL-MapReduce Call

DROP TABLE IF EXISTS cars_hist_out;


SELECT * FROM hist (
ON (SELECT 1) PARTITION BY 1
InputTable ('cars_hist')
OutputTable ('cars_hist_out')
StartValue ('20')
BinSize ('50')
EndValue ('400')
Inclusion ('right')
ValueColumn ('hp')
GroupbyColumns ('cyl')
);

Output

Table 436: Histogram Example 3 Output Table

output table output columns


"cars_hist_out" cyl bin bin_start bin_end bin_count bin_percent

The following query returns the output shown in the following table:
SELECT * FROM cars_hist_out ORDER BY 1;

Table 437: Histogram Example 3 Output Table cars_hist_out

cyl bin bin_start bin_end bin_count bin_percent


4 0 20 70 5 45.45
4 1 70 120 6 54.55
4 2 120 170 0 0.00
4 3 170 220 0 0.00
4 4 220 270 0 0.00
4 5 270 320 0 0.00
4 6 320 370 0 0.00
4 7 370 400 0 0.00
6 0 20 70 0 0.00
6 1 70 120 4 57.14
6 2 120 170 2 28.57
6 3 170 220 1 14.29
6 4 220 270 0 0.00
6 5 270 320 0 0.00
6 6 320 370 0 0.00
6 7 370 400 0 0.00
8 0 20 70 0 0.00
8 1 70 120 0 0.00
8 2 120 170 2 14.29
8 3 170 220 7 50.00
8 4 220 270 4 28.57
8 5 270 320 0 0.00
8 6 320 370 1 7.14
8 7 370 400 0 0.00

KNN

Summary
The KNN function uses training data objects to map test data objects to categories. The function is
optimized for both small and large training sets. The function supports user-defined distance metrics and
distance-weighted voting.


Background
At the IEEE International Conference on Data Mining (ICDM) in December 2006, the K-Nearest Neighbor
(kNN) classification algorithm was presented as one of the top 10 data-mining algorithms.
The kNN algorithm classifies data objects based on proximity to other data objects with known
classification. The objects with known classification serve as training data.
kNN classifies data based on the following parameters:
• Training data
• A metric that measures distance between objects
• The number of nearest neighbors (k)
The following figure shows an example of data classification using kNN. The red and blue dots represent
training data objects—the red dots are classified as cancerous tissue and the blue dots are classified as
normal tissue. The gray dot represents a test data object.
The inner circle represents k=4 and the outer circle represents k=10. When k=4, most of the nearest
neighbors of the gray dot are red, so the algorithm classifies the gray dot as cancerous tissue. When k=10,
most of the nearest neighbors of the gray dot are blue, so the algorithm classifies the gray dot as normal
tissue.
Figure 12: KNN Example

Usage

KNN Syntax
Version 1.3

SELECT * FROM KNN (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
TrainingTable ('training_table')
TestTable ('test_table')
K (k)
ResponseColumn ('response_column')
IDColumn ('test_id_column')
DistanceFeatures ({ 'df_column' | 'df_column_range' }[,... ])
[ VotingWeight (voting_weight) ]
[ OutputTable ('output_table') ]
[ CustomizedDistance ('jar', 'distance_class') ]
[ ForceMapreduce
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ PartitionBlockSize ('partition_block_size') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
TrainingTable Required Specifies the name of the table that contains the training
data. Each row represents a classified data object.
TestTable Required Specifies the name of the table that contains the test data to
be classified by the kNN algorithm. Each row represents a
test data object.
K Required Specifies the number of nearest neighbors to use for
classifying the test data.
ResponseColumn Required Specifies the name of the training table column that contains
the class label or classification of the classified data objects.
IDColumn Required Specifies the name of the testing table column that uniquely
identifies a data object.

DistanceFeatures Required Specifies the names of the training table columns that the
function uses to compute the distance between a test object
and the training objects. The test table must also have these
columns.
VotingWeight Optional Specifies the voting weight of the distance between a test
object and the training objects. The voting_weight must be a
nonnegative integer. The default value is 0.
The function calculates distance-weighted voting, w, with
this equation:

w = 1/POWER(distance, voting_weight)

Where distance is the distance between the test object and


the training object.
OutputTable Optional Specifies the name of the output table. By default, the
function displays the output to the console.
CustomizedDistance Optional Specifies the distance function. The parameter jar is the
name of the JAR file that contains the distance metric class.
The parameter distance_class is the distance metric class
defined in the jar file. The KNN function installs the JAR file
on the Aster Database server. The default distance function
is Euclidean distance.
ForceMapreduce Optional Specifies whether to partition the training data. The default
value is 'false', which causes the KNN function to load all
training data into memory and use only the row function. If
you specify 'true', the KNN function partitions the training
data and uses the map and reduce functions.
PartitionBlockSize Optional Specifies the partition block size to use with
ForceMapreduce ('true'). The recommended value depends
on the training data size and the number of vworkers. For
example, if your training data size is 10 billion rows and you
have 10 vworkers, the recommended partition_block_size is
1/n billion, where n is an integer that corresponds to the
memory of your vworker nodes. Omitting this argument or
specifying an inappropriate partition_block_size can degrade
performance.
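
As an illustration of the optional arguments, the following is a minimal sketch (not one of this guide's
worked examples) that weights votes by inverse squared distance and persists the result. It uses the example
tables described below; the output table name knn_output is arbitrary:

SELECT * FROM KNN (
ON (SELECT 1)
PARTITION BY 1
TrainingTable ('computers_train1_clustered')
TestTable ('computers_test1')
K (50)
ResponseColumn ('computer_category')
IDColumn ('id')
DistanceFeatures ('price', 'speed', 'hd', 'ram', 'screen')
VotingWeight (2)
OutputTable ('knn_output')
);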

Input
The KNN function has two input tables, a training table and a test table.
The following table describes the required training table columns. The training table can have additional
columns, but the function ignores them.

Teradata Aster Analytics Foundation User Guide 545


Chapter 5: Statistical Analysis
KNN
Table 438: KNN Training Table Schema

Column Name Data Type Description


id_column Any Contains unique row identifiers.
df_column INTEGER, Column that the function uses to compute the distance between a
BIGINT, test object and the training objects. The testing table must have a
SMALLINT, column with the same name and data type.
or NUMERIC
response_column VARCHAR Column that contains the class label or classification of the
classified data objects.

The following table describes the required test table columns. The test table can have additional columns, but the function ignores them.

Table 439: KNN Test Table Schema

Column Name Data Type Description
df_column INTEGER, BIGINT, SMALLINT, or NUMERIC Column that the function uses to compute the distance between a test object and the training objects. The training table must have a column with the same name and data type.
test_id_column Any Contains unique test data object identifiers.

Output
The KNN function outputs a message and a table. By default, the function displays the table on the console. If you specify an output table name, the function writes the table to that output table and displays only the message on the console.

Table 440: KNN Output Table Schema

Column Name Data Type Description
test_id_column Same as in test table Contains unique test data object identifiers.
category VARCHAR Contains the categories from the training table response_column, to which the function mapped the test data objects.

Example

Input
The training input has as dimensions five attributes of personal computers—price, speed, hard disk size,
RAM, and screen size. The training table has 5008 rows, categorized into eight price groups, which the
following table describes.

Table 441: KNN Example Training Table computers_train1_clustered

id price speed hd ram screen computer_category


1 1499 25 80 4 14 SPECIAL
2 1795 33 85 2 14 SUPER
3 1595 25 170 4 15 SPECIAL
4 1849 25 170 8 14 SUPER
5 3295 33 340 16 14 HYPER
6 3695 66 340 16 14 UBER
7 1720 25 170 4 14 SPECIAL
8 1995 50 85 2 14 SUPER
9 2225 50 210 8 14 SUPER
12 2605 66 210 8 14 MEGA
13 2045 50 130 4 14 SUPER
14 2295 25 245 8 14 MEGA
16 2225 50 130 4 14 SUPER
17 1595 33 85 2 14 SPECIAL
18 2325 33 210 4 15 MEGA
19 2095 33 250 4 15 SUPER
20 4395 66 452 8 14 UBER
... ... ... ... ... ... ...

Table 442: KNN Example Price Categories

clusterid category
0 SPECIAL
1 UBER
2 MEGA
3 ULTRASUPER
4 SUPER
5 EXTREME
6 HYPER
7 ULTRA

The test table has more than 1000 rows.

Table 443: KNN Example computers_test1

id price speed hd ram screen


10 2575 50 210 4 15
11 2195 33 170 8 15
15 2699 50 212 8 14
29 3095 33 340 16 14
30 3244 66 245 8 14
38 3795 66 500 8 14
45 3495 50 340 16 14
46 2695 33 245 8 14
48 1749 25 120 4 14
51 2499 33 170 4 14
52 2395 33 130 4 14
59 2945 66 210 8 17
65 2195 66 85 2 14
66 1495 25 170 4 14
70 3095 66 245 8 14
86 1999 33 120 8 14
91 2975 50 210 4 17
92 2145 66 130 4 14
93 2420 33 170 8 15
94 2505 50 210 8 14
104 2999 66 330 4 15
... ... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM KNN (


ON (SELECT 1)
PARTITION BY 1
TrainingTable ('computers_train1_clustered')
TestTable ('computers_test1')
K (50)
ResponseColumn ('computer_category')
DistanceFeatures ('price', 'speed', 'hd', 'ram', 'screen')
VotingWeight (1)
IDColumn ('id')

OutputTable ('knn_output')
);

Output
The following query returns the output shown in the following table:

SELECT * FROM knn_output ORDER BY id;

Table 444: KNN Example Output Table knn_output

id computer_category
10 MEGA
11 SUPER
15 MEGA
29 HYPER
30 HYPER
38 UBER
45 UBER
46 MEGA
48 SPECIAL
51 MEGA
52 MEGA
59 HYPER
65 SUPER
66 SPECIAL
70 HYPER
86 SUPER
91 HYPER
92 SUPER
93 MEGA
94 MEGA
104 HYPER
... ...

User-Defined Distance Metric
A user-defined distance metric is used in this example. The Java class com.example.MyDistance defines this
metric:

package com.example;
import com.asterdata.ncluster.sqlmr.data.RowView;
import com.asterdata.sqlmr.analytics.classification.knn.distance.Distance;
public class MyDistance implements Distance {
/**
 * Calculates the distance between the test row and the training row.
 * Notes:
 * 1. Do not reverse the sequence of the parameters.
 * 2. The columns of trainingRowView are 'responseColumn, f1, f2, ..., fn'.
 * 3. The columns of testRowView are the same as the columns of TEST_TABLE.
 * 4. Column indexes in trainingRowView and testRowView are zero-based
 *    (0 <= index && index < getColumnCount()).
 *
 * @param testRowView
 *          represents a point in the test data set
 * @param trainingRowView
 *          represents a point in the training data set; the columns are
 *          the columns in the DistanceFeatures argument
 * @return the distance as a double value
 */
@Override
public double calculate(RowView testRowView, RowView trainingRowView) {
return Math.abs(testRowView.getIntAt(1) - trainingRowView.getIntAt(1));
}
}
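
To use this metric, package the class in a JAR file and pass the file name and class name to the CustomizedDistance argument. The following call is a sketch, assuming the class is packaged as mydistance.jar (a hypothetical file name); the other arguments are taken from the earlier KNN example:

SELECT * FROM KNN (
    ON (SELECT 1)
    PARTITION BY 1
    TrainingTable ('computers_train1_clustered')
    TestTable ('computers_test1')
    K (50)
    ResponseColumn ('computer_category')
    DistanceFeatures ('price', 'speed', 'hd', 'ram', 'screen')
    IDColumn ('id')
    CustomizedDistance ('mydistance.jar', 'com.example.MyDistance')
);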

LARS Functions

Summary
Least angle regression (LARS) and its most important modification, least absolute shrinkage and selection
operator (LASSO), are variants of linear regression that select the most important variables, one by one, and
fit the coefficients dynamically. Aster Database provides two LARS functions:
• LARS
• LARSPredict
The output of the LARS function is input to the LARSPredict function.

Background
LARS is a model-selection algorithm, a useful and less greedy version of traditional forward-selection
methods.
LASSO is a version of ordinary least squares (OLS) that constrains the sum of the absolute regression
coefficients. LASSO is an important sparse learning method. LASSO estimates the coefficients for each input
variable and uses them to make predictions for the response variables. Compared to ordinary least squares,
LASSO does the fitting in a smarter way. It always finds the most significant variables (those that have the
greatest absolute correlation with the current residuals) in a sequential manner. That is, it performs the
variable selection job.
This form of fitting is very efficient when you have thousands of input variables. The time complexity is the
same as running linear regression, which is linear in the number of rows. In addition, LASSO can work in
some situations where ordinary least squares cannot, such as when there is multicollinearity.
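
In symbols, the constraint that LASSO places on ordinary least squares can be written in the standard textbook formulation (a restatement of the description above, not taken from this guide):

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{M} x_{ij}\,\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} \lvert \beta_j \rvert \le t$$

where t >= 0 controls the amount of shrinkage; as t decreases, more coefficients are driven exactly to zero, which is what performs the variable selection.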
For more information about LARS, see Least Angle Regression, Bradley Efron and others, Department of
Statistics, Stanford University, 2004.

LARS

Summary
The LARS function generates a model that the function LARSPredict uses to make predictions for the
response variables.

Usage

LARS Syntax
Version 1.1

SELECT * FROM LARS (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
InputColumns ('response', 'predictor_columns')
[ Method ({ 'lar' | 'lasso'} ) ]
[ Intercept ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Normalize ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ MaxIterNum ('max_iterations') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the input table.
OutputTable Required Specifies the name of the output table.
InputColumns Required Specifies the names of the columns of the input table that contain
the response and predictors.
The syntax of predictor_columns is:
{col[,...] | [start_column:end_column]}[,...]
where col is a column name and start_column and end_column are
the column indexes of the first and last columns in a range of
columns. The range includes start_column and end_column.
The leftmost column has column index 0, the column to its
immediate right has column index 1, and so on.

Note:
In a column range, brackets do not indicate optional elements.
You must include the bracket characters (for example, '[2:6]').

Note:
This function can take at most 799 response and predictor
variables.

Method Optional Specifies either 'lar' (least angle regression) or 'lasso'. The default
value is 'lasso'.
Intercept Optional Specifies whether an intercept is included in the model (and not
penalized). The default value is 'true'.
Normalize Optional Specifies whether each predictor is standardized to have unit L2
norm. The default value is 'true'.
MaxIterNum Optional Specifies the maximum number of steps the function executes. The
default value is 8*min(number_of_predictors, sample_size -
intercept). For example, if the number of predictors is 11, the sample
size (number of rows in the input table) is 1532, and the intercept is
1, then the default value is 8*min(11, 1532 - 1) = 88.

Input
The LARS function has one required input table, which contains the response column and predictor
columns. The input table can have additional columns, but the function ignores them.

Table 445: LARS Input Table Schema

Column Name Data Type Description
response Numeric or, if both Intercept and Normalize are 'false', BOOLEAN Contains the response variables.
predictor Numeric or, if both Intercept and Normalize are 'false', BOOLEAN Contains the predictors. The table has the predictor columns specified by the InputColumns argument.

Note:
The LARS function skips input rows that contain NULL values.

Output
Table 446: LARS Output Table Schema

Column Name Data Type Description
steps INTEGER Contains the sequence number for each step. One LAR or LASSO move represents one step.
var_id INTEGER Contains the sequence number of each predictor. The InputColumns argument specifies the sequence of the predictors.
var_name VARCHAR Contains the column name of each predictor.
max_abs_corr DOUBLE PRECISION Contains the modified maximum absolute correlation (common for all active variables) between the active variables and the current residuals. This value is not necessarily in the range [0,1].
step_length DOUBLE PRECISION Contains the distance to move along the equiangular direction in each step.
intercept DOUBLE PRECISION Contains the constant item in the model. This value evolves along the path.
predictor DOUBLE PRECISION Contains the coefficient for a predictor. The table has one such column for each predictor.

Interpreting the Output

At the beginning of step i, the variable X_k (identified by var_id and var_name) either enters into (positive var_id) or drops from (negative var_id) the regression model, and the current common correlation between the active variables and the current residuals is max_abs_corr.
After moving along the equiangular direction for a distance of step_length, either an inactive variable qualifies to enter the model or a currently active variable is dropped from the model, at which point the process reaches step i+1. The intercept and coefficients therefore correspond to the end of step i as well as the beginning of step i+1.
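
For example, in the output of Example 1 below, at step 1 the variable bmi (var_id 3) enters the model while the common correlation is 949.435; after moving step_length 60.1193 along the equiangular direction, the bmi coefficient reaches 60.1193 and ltg qualifies to enter, which begins step 2. In Example 2, the negative var_id of -7 at step 11 indicates that hdl drops from the model at that step; it re-enters at step 12.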

Examples
• Input
• Example 1: Method ('lar')
• Example 2: Method ('lasso')

Input
This input is diabetes data from “Least Angle Regression,” by Bradley Efron and others.
The input table has one response (vector y) and ten baseline predictors measured on 442 diabetes patients.
The baseline predictors are age, sex, body mass index (bmi), mean arterial pressure (map) and six blood
serum measurements (tc, ldl, hdl, tch, ltg, glu).
Table 447: LARS Examples Input Table: diabetes, Columns 1-6

id age sex bmi map tc


1 0.0380759 0.0506801 0.0616962 0.0218724 -0.0442235
2 -0.00188202 -0.0446416 -0.0514741 -0.0263278 -0.00844872
3 0.0852989 0.0506801 0.0444512 -0.00567061 -0.0455994
4 -0.0890629 -0.0446416 -0.011595 -0.0366564 0.0121906
5 0.00538306 -0.0446416 -0.0363847 0.0218724 0.00393485
... ... ... ... ... ...

Table 448: LARS Examples Input Table: diabetes, Columns 7-12

ldl hdl tch ltg glu y


-0.0348208 -0.0434008 -0.00259226 0.0199084 -0.0176461 151
-0.0191633 0.0744116 -0.0394934 -0.0683297 -0.092204 75
-0.0341945 -0.0323559 -0.00259226 0.00286377 -0.0259303 141
0.0249906 -0.0360376 0.0343089 0.022692 -0.00936191 206
0.0155961 0.00814208 -0.00259226 -0.0319914 -0.0466409 135
... ... ... ... ... ...

The column id is the row identifier, y is the response, and the other columns are predictors.
This data set is atypical in that each predictor has mean 0 and norm 1, which means that:
• The value of the Normalize argument is irrelevant.
• If the value of the Intercept argument is 'true', then the intercept is considered to be constant along the
entire path (which is typically not true).

Example 1: Method ('lar')

SQL-MapReduce Call

SELECT * FROM LARS (ON (SELECT 1)


PARTITION BY 1
InputTable ('diabetes')
OutputTable ('diabetes_lars')
InputColumns ('y', 'age', 'sex', 'bmi', 'map', 'tc', 'ldl', 'hdl',
'tch', 'ltg', 'glu')
Method ('lar')
Intercept ('true')
Normalize ('true')
MaxIterNum ('20')
);

Output

Table 449: LARS Example 1 Output Message

message
Successful.
Result has been stored in table: '"diabetes_lars"'.
(2 rows)

The following query returns the output shown in the following table:

SELECT * FROM diabetes_lars WHERE steps <> 0 ORDER BY steps;

Table 450: LARS Example 1 Output Table diabetes_lars, Columns 1–7

steps var_id var_name max_abs_corr step_length intercept age


1 3 bmi 949.435 60.1193 152.133 0
2 9 ltg 889.316 513.224 152.133 0
3 4 map 452.901 175.553 152.133 0
4 7 hdl 316.074 259.367 152.133 0
5 2 sex 130.131 88.6592 152.133 0
6 10 glu 88.7824 43.6779 152.133 0
7 5 tc 68.9652 135.984 152.133 0
8 8 tch 19.9813 54.0156 152.133 0
9 6 ldl 5.47747 5.56726 152.133 0
10 1 age 5.08918 73.5291 152.133 -10.0122

Table 451: LARS Example 1 Output Table diabetes_lars, Columns 8–16

sex bmi map tc ldl hdl tch ltg glu


0 60.1193 0 0 0 0 0 0 0
0 361.895 0 0 0 0 0 301.775 0
0 434.758 79.2364 0 0 0 0 374.916 0
0 505.66 191.27 0 0 -114.101 0 439.665 0
-74.9165 511.348 234.155 0 0 -169.711 0 450.667 0
-111.979 512.044 252.527 0 0 -196.045 0 452.393 12.0781
-197.757 522.265 297.16 -103.946 0 -223.926 0 514.75 54.7677
-226.134 526.885 314.389 -195.106 0 -152.477 106.343 529.916 64.4874
-227.176 526.391 314.95 -237.341 33.6284 -134.599 111.384 545.483 64.6067
-239.819 519.84 324.39 -792.184 476.746 101.045 177.064 751.279 67.6254

The following figure represents the results and shows how the standardized coefficients evolved during the
model-building process. The x-axis represents the ratio of the norm of the current beta to the full beta. The
y-axis represents the standardized coefficients, which are estimated when standardized predictors are used.
The numbers on the top of the graph represent the steps of the model-building process. The numbers on the
right represent the predictor IDs.
Figure 13: LAR Results

Example 2: Method ('lasso')

SQL-MapReduce Call

SELECT * FROM LARS (


ON (SELECT 1)
PARTITION BY 1
InputTable ('diabetes')
OutputTable ('diabetes_lasso')
InputColumns('y', 'age', '[2:5]', 'ldl', 'hdl', '[8:10]')
Method ('lasso')
Intercept ('true')
Normalize ('true')
MaxIterNum ('20')
);

Output

Table 452: LARS Example 2 Output Message

message
Successful.
Result has been stored in table: '"diabetes_lasso"'.
(2 rows)

The following query returns the output shown in the following table:

SELECT * FROM diabetes_lasso WHERE steps <> 0 ORDER BY steps;

Table 453: LARS Example 2 Output Table diabetes_lasso, Columns 1–7

steps var_id var_name max_abs_corr step_length intercept age


1 3 bmi 949.435 60.1193 152.133 0
2 9 ltg 889.316 513.224 152.133 0
3 4 map 452.901 175.553 152.133 0
4 7 hdl 316.074 259.367 152.133 0
5 2 sex 130.131 88.6592 152.133 0
6 10 glu 88.7824 43.6779 152.133 0
7 5 tc 68.9652 135.984 152.133 0
8 8 tch 19.9813 54.0156 152.133 0
9 6 ldl 5.47747 5.56726 152.133 0
10 1 age 5.08918 41.9996 152.133 -5.71894
11 -7 hdl 2.18225 7.2707 152.133 -7.01124

12 7 hdl 1.31044 27.97 152.133 -10.0122

Table 454: LARS Example 2 (LASSO) Output Table diabetes_lasso, Columns 8–16

sex bmi map tc ldl hdl tch ltg glu


0 60.1193 0 0 0 0 0 0 0
0 361.895 0 0 0 0 0 301.775 0
0 434.758 79.2364 0 0 0 0 374.916 0
0 505.66 191.27 0 0 -114.101 0 439.665 0
-74.9165 511.348 234.155 0 0 -169.711 0 450.667 0
-111.979 512.044 252.527 0 0 -196.045 0 452.393 12.0781
-197.757 522.265 297.16 -103.946 0 -223.926 0 514.75 54.7677
-226.134 526.885 314.389 -195.106 0 -152.477 106.343 529.916 64.4874
-227.176 526.391 314.95 -237.341 33.6284 -134.599 111.384 545.483 64.6067
-234.398 522.649 320.343 -554.266 286.736 0 148.9 663.033 66.331
-237.101 521.075 321.549 -580.439 313.862 0 139.858 674.937 67.1794
-239.819 519.84 324.39 -792.184 476.746 101.045 177.064 751.279 67.6254

The following figure represents the results and shows how the standardized coefficients evolved during the
model-building process. The x-axis represents the ratio of the norm of the current beta to the full beta. The
y-axis represents the standardized coefficients, which are estimated when standardized predictors are used.
The numbers on the top of the graph represent the steps of the model-building process. The numbers on the
right represent the predictor IDs.

Figure 14: LASSO Results

LARSPredict

Summary
The LARSPredict function takes new data and the model generated by the function LARS and uses the
predictors in the model to output predictions for the new data.

Usage

LARSPredict Syntax
Version 1.1

SELECT * FROM LARSPredict (


ON input_table AS data PARTITION BY ANY
ON model_table AS model DIMENSION
[ MODE ({ 'STEP' | 'FRACTION' | 'NORM' | 'LAMBDA' })]
[ S ('coef_position' [,...]) ]

[ TargetCol ('target_column') ]
);

Arguments
Argument Category Description
Mode Optional Specifies the mode for the S argument:
• 'STEP' (default)
The S argument indicates the steps corresponding to the steps in the
model generated by the LARS function. The S argument can include any
real values in [1, k], where k is the maximum step in the model.
• 'FRACTION'
The S argument indicates the fractions of the L1 norm of the coefficients
against the maximum L1 norm. The maximum L1 norm is that of the full
OLS solution, which is the coefficients at the last step. The S argument
can include any real values in [0, 1].
• 'NORM'
The S argument indicates the L1 norm of the coefficients. The S
argument can include any real values in [0, max L1 norm]. For
maximum L1 norm, see above.
• 'LAMBDA'
The S argument indicates the maximum absolute correlations. For
definition, see the description of max_abs_corr in the Output section of
the function LARS. The S argument can include any real values.

S Optional Specifies the positions of the coefficients at which to generate predictions. Each coefficient is a different DOUBLE PRECISION value in the range specified by the Mode argument.
TargetCol Optional Specifies the name of the response column in the input table (for prediction comparison). The sum-of-square error (SSE) for each prediction appears in the last row of the output table.
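
For example, a call of the following form (a sketch; the S values here are hypothetical) scores the test data at three points along the path and emits one prediction column per value, named prediction_2, prediction_5, and prediction_8:

SELECT * FROM LARSPredict (
    ON diabetes_test AS data PARTITION BY ANY
    ON diabetes_lasso AS model DIMENSION
    Mode ('step')
    S ('2', '5', '8')
    TargetCol ('y')
) ORDER BY id;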

Input
The LARSPredict function has two required input tables, the table that contains the new data (described by
the following table) and the model table generated by the LARS function (described in the Output section).
The data table can have columns that are not predictors, but the function ignores them.
Table 455: LARSPredict Data Table Schema

Column Name Data Type Description
predictor Same as in the model table generated by the LARS function Contains the same predictors as the model table.

Output
Table 456: LARSPredict Output Table Schema

Column Name Data Type Description
input_table_column Same as in input table Column copied from the input table.
prediction_coef_position DOUBLE PRECISION Contains the prediction for coef_position. The table has one such column for each coef_position specified by the S argument. If you specify the TargetCol argument, then the last row of each prediction column contains the SSE for that column.

Examples
• Example 1: Model ('diabetes_lars')
• Example 2: Model ('diabetes_lasso')

Example 1: Model ('diabetes_lars')

Input
• Data table diabetes_test (below), obtained by a 20% sampling rate (88 rows) from the table diabetes (from the Input section of the examples for the LARS function)
• Model table diabetes_lars, found in the Output section of the LARS function, Example 1
Table 457: LarsPredict Example Data Table: diabetes_test, Columns 1-6

id age sex bmi map tc


3 0.0852989 0.0506801 0.0444512 -0.00567061 -0.0455994
6 -0.0926955 -0.0446416 -0.0406959 -0.0194421 -0.0689906
8 0.0635037 0.0506801 -0.00189471 0.0666297 0.0906199
13 0.0162807 -0.0446416 -0.02884 -0.00911348 -0.00432087
14 0.00538306 0.0506801 -0.00189471 0.00810087 -0.00432087
17 -0.00551456 -0.0446416 0.0422956 0.0494153 0.0245741
26 -0.0672677 0.0506801 -0.0126728 -0.0400993 -0.0153285
28 -0.0236772 -0.0446416 0.0595406 -0.0400993 -0.0428475
30 0.0671362 0.0506801 -0.00620595 0.0631868 -0.0428475
33 0.0344434 0.0506801 0.125287 0.0287581 -0.0538552
35 0.0162807 -0.0446416 -0.06333 -0.0573137 -0.057983
... ... ... ... ... ...

Table 458: LarsPredict Example Data Table: diabetes_test, Columns 7-12

ldl hdl tch ltg glu y


-0.0341945 -0.0323559 -0.00259226 0.00286377 -0.0259303 141
-0.0792878 0.0412768 -0.0763945 -0.0411804 -0.0963462 97
0.108914 0.0228686 0.0177034 -0.0358167 0.00306441 63
-0.00976889 0.0449585 -0.0394934 -0.0307512 -0.0424988 179
-0.0157187 -0.00290283 -0.00259226 0.0383932 -0.013504 185
-0.0238606 0.0744116 -0.0394934 0.05228 0.0279171 166
0.00463594 -0.0581274 0.0343089 0.019199 -0.0342146 202
-0.0435889 0.0118237 -0.0394934 -0.0159983 0.0403434 85
-0.0958847 0.0523217 -0.0763945 0.0594238 0.0527697 283
-0.0129004 -0.102307 0.108111 0.000271486 0.0279171 341
-0.0489124 0.00814208 -0.0394934 -0.0594727 -0.0673514 65

SQL-MapReduce Call

SELECT * FROM LARSPredict (


ON diabetes_test AS data PARTITION BY ANY
ON diabetes_lars AS model DIMENSION
Mode ('step')
S ('1.6')
TargetCol ('y')
) ORDER BY id;

Output

Table 459: LARSPredict Example 1 Output Table, Columns 1–7

id age sex bmi map tc ldl


3 0.0852989 0.0506801 0.0444512 -0.00567061 -0.0455994 -0.0341945
6 -0.0926955 -0.0446416 -0.0406959 -0.0194421 -0.0689906 -0.0792878
8 0.0635037 0.0506801 -0.00189471 0.0666297 0.0906199 0.108914
13 0.0162807 -0.0446416 -0.02884 -0.00911348 -0.00432087 -0.00976889
... ... ... ... ... ... ...

Table 460: LARSPredict Example 1 Output Table, Columns 8–13

hdl tch ltg glu y prediction_1.6


-0.0323559 -0.00259226 0.00286377 -0.0259303 141 153.737
0.0412768 -0.0763945 -0.0411804 -0.0963462 97 150.666

0.0228686 0.0177034 -0.0358167 0.00306441 63 152.065
0.0449585 -0.0394934 -0.0307512 -0.0424988 179 151.093
... ... ... ... ... ...

Example 2: Model ('diabetes_lasso')

Input
• Data table diabetes_test, obtained by a 20% sampling rate (88 rows) from the table diabetes
• Model table diabetes_lasso, output by LARS Example 2

SQL-MapReduce Call

SELECT * FROM LARSPredict (


ON diabetes_test AS data PARTITION BY 1
ON diabetes_lasso AS model DIMENSION
Mode ('step')
S ('1.6')
TargetCol ('y')
) ORDER BY id;

Output

Table 461: LARSPredict Example 2 Output Table, Columns 1–7

id age sex bmi map tc ldl


3 0.0852989 0.0506801 0.0444512 -0.00567061 -0.0455994 -0.0341945
6 -0.0926955 -0.0446416 -0.0406959 -0.0194421 -0.0689906 -0.0792878
8 0.0635037 0.0506801 -0.00189471 0.0666297 0.0906199 0.108914
13 0.0162807 -0.0446416 -0.02884 -0.00911348 -0.00432087 -0.00976889
... ... ... ... ... ... ...
440 0.0417084 0.0506801 -0.0159063 0.0172819 -0.0373437 -0.0138398
442 -0.0454725 -0.0446416 -0.0730303 -0.0814138 0.0837401 0.0278089

Table 462: LARSPredict Example 2 Output Table, Columns 8–13

hdl tch ltg glu y prediction_1.6


-0.0323559 -0.00259226 0.00286377 -0.0259303 141 153.737
0.0412768 -0.0763945 -0.0411804 -0.0963462 97 150.666
0.0228686 0.0177034 -0.0358167 0.00306441 63 152.065

0.0449585 -0.0394934 -0.0307512 -0.0424988 179 151.093
... ... ... ... ... ...
-0.0249927 -0.0110795 -0.0468795 0.0154907 132 151.56
0.173816 -0.0394934 -0.00421986 0.00306441 57 149.499
The last row of the output contains only the SSE for the prediction_1.6 column: 502871 (see the TargetCol argument description).

Linear Regression

Summary
The LinRegMatrix function takes a data set and outputs a linear regression model. The LinReg function
takes the linear regression model and outputs its coefficients. The 0th coefficient corresponds to the slope
intercept.

Background
The linear regression model is probably the easiest predictive technique to use. This model can be as simple
as having one input variable and one output variable or as complex as having dozens of input variables. All
linear regression models fit this pattern: Independent variables are used first to model and then to predict
the result—the dependent variable. In matrix notation, a linear regression model is given by the formula Y =
Xβ + ε, where:
• X is the independent (predictor) variable or vector.
• β is the vector of parameters.
• ε is the error vector.
• Y is the dependent (response) vector.
That is, written out for each of the N observations:

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_M x_iM + ε_i,  for i = 1, ..., N
The input table contains all the predictor columns and, in its last column, the response vector. The output table contains the beta coefficients (indexed by the coefficient_index column, with the coefficient values in the value column). The 0th coefficient corresponds to the intercept and the ith coefficient corresponds to the ith predictor variable. The LinReg function is limited to outputting the coefficients; it does not give the significance of the predictor variables by a p-value or the goodness of fit by an R² value.

Usage

Linear Regression Syntax


LinReg version 1.0, LinRegMatrix version 1.1

SELECT * FROM LinReg (
    ON LinRegMatrix (
        ON { table_name | view_name | (query) }
    ) PARTITION BY 1
);

Note:
PARTITION BY 1 is required because all input data must be submitted to one worker.

Input
The input table for the LinRegMatrix function has one row for each data point and one column for each data
point component. A data point can have multiple x components but only one y component. The column
that represents the y component must be the last column in the input table.
Table 463: LinRegMatrix Input Table Schema

Column Name Data Type Description
x_component_i SMALLINT, INTEGER, BIGINT, or DOUBLE PRECISION Contains the values for the ith component of the data points. The table has one such column for each x component.
y_component SMALLINT, INTEGER, BIGINT, or DOUBLE PRECISION Contains the values for the y components of the data points. This column must be last.

Note:
If an input table row contains a NULL value, then the LinRegMatrix function skips that row.

Output
Table 464: LinReg Output Table Schema

Column Name Data Type Description
coefficient_index INTEGER Contains the indexes of the coefficients of the linear regression model generated by the LinRegMatrix function.
value DOUBLE PRECISION Contains the values of the coefficients of the linear regression model generated by the LinRegMatrix function.

Example
This example uses the LinRegMatrix and LinReg functions to find the coefficients of the variables that
determine the selling price of a home in a given neighborhood. The response variable, SellingPrice (the
selling price of the house in dollars) is modeled on these independent (predictor) variables:
• House size (in square feet)
• Lot size (in square feet)
• Number of bedrooms
• Whether the kitchen counter is granite (0 or 1)
• Whether the bathrooms are upgraded (0 or 1)

Input
Table 465: LinRegMatrix Example Input Table housing_data

housesize lotsize bedrooms granite upgradedbathroom sellingprice


2397 14156 4 1 0 189900
2200 9600 4 0 1 195000
4032 10150 5 0 1 197900
3529 9191 6 0 0 205000
3247 10061 5 1 1 224900
2983 9365 5 0 1 230000
3536 19994 6 1 1 325000
3198 9669 5 1 1

Note:
The LinRegMatrix function skips the last row of housing_data because it contains a NULL value.

SQL-MapReduce Call

SELECT * FROM LinReg (


ON LinRegMatrix(ON housing_data) PARTITION BY 1
) ORDER BY value;

Output
Table 466: LinReg Example Output Table

CoefficientName value
Intercept -21739.2966650368
housesize -26.9307835091457
lotsize 6.33452410459345

granite 7140.67629349537
upgradedbathroom 43179.1998888263
bedrooms 44293.7605841832

The 0th coefficient index is the intercept. The coefficient indices 1, 2, 3, 4, and 5 correspond to HouseSize, LotSize, Bedrooms, Granite, and UpgradedBathroom, respectively.
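
As a sanity check, you can apply these coefficients to a row of the input data. The following sketch computes the predicted selling price for the first house in housing_data (2397 square feet, 14156-square-foot lot, 4 bedrooms, granite counter, no upgraded bathroom):

SELECT -21739.2966650368                   -- intercept
     + (-26.9307835091457) * 2397          -- housesize
     + 6.33452410459345 * 14156            -- lotsize
     + 44293.7605841832 * 4                -- bedrooms
     + 7140.67629349537 * 1                -- granite
     + 43179.1998888263 * 0                -- upgradedbathroom
  AS predicted_sellingprice;

The result, approximately 187695, is close to the actual selling price of 189900.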

LRTEST

Summary
The LRTEST function performs the likelihood ratio test for two GLM models, generated by the function
GLM.

Background
A likelihood ratio test is useful for comparing the fit of a null model and an alternative model. The null model
is a special case of the alternative model. The likelihood ratio expresses how many times more likely the data
are under one model than the other. You can use the likelihood ratio or its logarithm to compute a p-value,
or compare it to a critical value to decide whether to reject the null model in favor of the alternative model.
When you use the logarithm of the likelihood ratio, the statistic is known as the log-likelihood ratio statistic.
You can use Wilks’s theorem to approximate the probability distribution of this statistic (assuming that the
null model is true).

Usage

LRTEST Syntax
Version 1.1

SELECT * FROM LRTEST (


ON (SELECT * FROM glm_output1 WHERE attribute = -1) AS "model1"
PARTITION BY 1
ON (SELECT * FROM glm_output2 WHERE attribute = -1) AS "model2"
PARTITION BY 1
Statistic ('predictor_column')
LogLik ('estimate_column')
ObsNum ('std_err_column')
ParamNum ('z_score_column')
);

Arguments
Argument Category Description
Statistic Required Specifies the name of the input column that contains the name of the
statistic. This column corresponds to the GLM output column
'predictor'.
LogLik Required Specifies the name of the input column that contains the log-likelihood
of the GLM model. This column corresponds to the GLM output
column 'estimate'.
ObsNum Required Specifies the name of the input column that contains the number of
observations. This column corresponds to the GLM output column
'std_err'.
ParamNum Required Specifies the name of the input column that contains the number of
parameters (excluding the intercept). This column corresponds to the
GLM output column 'z_score'.

Input
The LRTEST function has two required input tables, both of which are model tables generated by the function GLM. For the GLM model table schema, refer to the Output section of the GLM function.

Output
Table 467: LRTEST Output Table Schema

Column Name Data Type Description


distribution VARCHAR Type of Chi-squared distribution used for the test.
statistic DOUBLE PRECISION Value of the Chi-squared statistic.

p_value DOUBLE PRECISION Corresponding p-value for the Chi-squared test.

Example
This example compares two models generated by the GLM function.

Input
The input table, glm_tempdamage, has 22 observations with one numerical predictor variable (temp) and
one response variable (damage). The value of the response variable shows whether there is damage due to
temperature (1 means yes, 0 means no).
Table 468: LRTEST Example Input Table glm_tempdamage

id temp damage
1 53 1
2 57 1
3 58 1
4 63 1
5 66 0
6 67 0
7 67 0
8 67 0
9 68 0
10 69 0
11 70 1
12 70 0
13 70 1
14 70 0
15 72 0
16 73 0
17 75 0
18 75 1
19 76 0
20 76 0
21 78 0

22 79 0

Because this is a binary outcome, the two models use GLM with logistic regression. SQL-MapReduce Call 1
generates a model using the predictor variable (the second table in its Output section). SQL-MapReduce Call
2 generates the null model (the second table in its Output section). The null model is produced with only the
intercept. SQL-MapReduce Call 3 uses the LRTEST function to compare the two GLM models.

SQL-MapReduce Call 1 - Create model based on temp variable

SELECT * FROM GLM (


ON (SELECT 1)
PARTITION BY 1
InputTable ('glm_tempdamage')
OutputTable ('damage_glm1')
ColumnNames ('damage','temp')
Family ('LOGISTIC')
Link ('CANONICAL')
Threshold ('0.01')
MaxIterNum ('10')
);

Output
Table 469: LRTEST Example Output Table

predictor estimate std_error z_score p_value significance
(Intercept) 14.8054 7.44367 1.98899 0.0467025 *
temp -0.228562 0.109324 -2.09069 0.036556 *
ITERATIONS # 4 0 0 0 Number of Fisher Scoring iterations
ROWS # 22 0 0 0 Number of rows
Residual deviance 20.268 0 0 0 on 20 degrees of freedom
Pearson goodness of fit 22.7719 0 0 0 on 20 degrees of freedom
AIC 24.268 0 0 0 Akaike information criterion
BIC 26.4501 0 0 0 Bayesian information criterion
Wald Test 6.03947 0 0 0.0488142 *
Dispersion parameter 1 0 0 0 Taken to be 1 for BINOMIAL and POISSON.

The model table is shown by the following SQL statement.

SELECT * FROM damage_glm1 ORDER BY 1;

Table 470: LRTEST Example 1 Model Table damage_glm1

attribute predictor category estimate std_err z_score p_value significance
-1 Loglik  -10.134 22 1 0
0 (Intercept)  14.8054 7.44367 1.98899 0.0467025 *
1 temp  -0.228562 0.109324 -2.09069 0.036556 *

SQL-MapReduce Call 2 - Create null model

SELECT * FROM GLM (


ON (SELECT 1)
PARTITION BY 1
InputTable ('glm_tempdamage')
OutputTable ('damage_glm2')
ColumnNames ('damage')
Family ('LOGISTIC')
Link ('CANONICAL')
Threshold ('0.01')
MaxIterNum ('10')
);

Output
Table 471: LRTEST Example Output Table

predictor estimate std_error z_score p_value significance
(Intercept) -0.76214 0.457738 -1.66502 0.0959097 .
ITERATIONS # 3 0 0 0 Number of Fisher Scoring iterations
ROWS # 22 0 0 0 Number of rows
Residual deviance 27.5216 0 0 0 on 21 degrees of freedom
Pearson goodness of fit 22 0 0 0 on 21 degrees of freedom
AIC 29.5216 0 0 0 Akaike information criterion
BIC 30.6126 0 0 0 Bayesian information criterion
Wald Test 2.77228 0 0 0.0959095 .
Dispersion parameter 1 0 0 0 Taken to be 1 for BINOMIAL and POISSON.

The model table is shown by the following SQL statement.

SELECT * FROM damage_glm2 ORDER BY 1;

Table 472: LRTEST Example Model Table damage_glm2

attribute predictor category estimate std_err z_score p_value significance
-1 Loglik  -13.7608 22 0 0
0 (Intercept)  -0.76214 0.457738 -1.66502 0.0959097 .

SQL-MapReduce Call 3 - LRTEST

SELECT * FROM LRTEST (


ON (SELECT * FROM damage_glm1 WHERE attribute = -1) AS "model1"
PARTITION BY 1
ON (SELECT * FROM damage_glm2 WHERE attribute = -1) AS "model2"
PARTITION BY 1
Statistic ('predictor')
Loglik ('estimate')
ObsNum ('std_err')
ParamNum ('z_score')
);

Output
Table 473: LRTEST Example Output Table

distribution statistic p_value


Chi-squared 1 d.f. 7.2536 0.0070759

The final output compares the two GLM models and displays a chi-squared statistic and a p-value. The chi-squared statistic suggests that the data are more likely to have been generated by model 1, which includes the temp predictor, than by the null model. The result is statistically significant at the 95% confidence level (p-value < 0.05).
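
You can verify the statistic from the two model tables: the log-likelihood values are -10.134 for model 1 and -13.7608 for the null model, so the log-likelihood ratio statistic is 2 × (-10.134 - (-13.7608)) = 7.2536, and the degrees of freedom equal the difference in parameter counts, 1 - 0 = 1, matching the output above.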

Percentile

Summary
The Percentile function generates percentiles for groups of numbers. The nth percentile is the smallest value
in a data set that is greater than n% of the values.
Use this function when the input data is partitioned into a large number of groups and you want to find the
percentile for each group. Each group must fit on a single worker node. The maximum number of input
rows in each group that the function can process depends on the cluster configuration. To find percentile statistics for a very large input group that requires multiple workers, use the function Approximate Percentile.

Usage

Percentile Syntax
Version 1.0

SELECT * FROM Percentile (


ON input_table
PARTITION BY partition_column [,...]
Percentile ('percentile' [,...])
Target_Columns ({ 'target_column' | 'target_column_range' }[,...])
[ Group_Columns ({ 'group_column' | 'group_column_range' }[,...]) ]
);

Arguments
Argument Category Description
Percentile Required Specifies the percentiles for the function to generate.
Target_Columns Required Specifies the names of the columns that contain the groups of numbers
whose percentiles are to be generated.
Group_Columns Optional Specifies the names of the columns to copy to the output table. Typically,
the list of group columns is the same as the list of partition columns.

Input
Table 474: Percentile Input Table Schema

Column Name Data Type Description


group_column VARCHAR Column to copy to the output table. Typically a
partition_column, used to group a predictor (independent)
variable.
target_column NUMERIC Contains values for which to calculate percentiles.

Output
Table 475: Percentile Output Table Schema

Column Name Data Type Description


group_column VARCHAR Column copied from the input table. Typically a
partition_column that groups a predictor (independent) variable.

percentile INTEGER Contains the percentiles specified by the Percentile argument.
target_column NUMERIC Contains values for which the function calculated percentiles.

Example

Input
This example uses data from participants in the 2012 London Olympic Games. The input data consists of the
age, height, weight, sex, sport and country for a subset of the participants in the 2012 Summer Olympics.
Table 476: Percentile Example Input Table london_olympics

name country age height weight sex sport


Adriatik Hoxha Albania 22 194 130 M Athletics
Arben Kucana Albania 44 184 95 M Shooting
Briken Calja Albania 22 169 69 M Weightlifting
Daniel Godelli Albania 20 168 68 M Weightlifting
Endri Karina Albania 23 186 94 M Weightlifting
Klodiana Shala Albania 32 F Athletics
Majlinda Kelmendi Albania 21 51 F Judo
Noel Borshi Albania 16 164 54 F Swimming
Romela Begaj Albania 25 160 58 F Weightlifting
Sidni Hoxha Albania 20 193 86 M Swimming
Ali Hasan Mahboob Bahrain 30 175 70 M Athletics
... ... ... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM percentile (


ON london_olympics PARTITION BY country
Percentile ('0', '100')
TargetColumns ('height', 'weight', 'age')
GroupColumns ('country')
) ORDER BY 1,2;

Output
The output table displays the values for each target column, partitioned by country, corresponding to the
requested percentiles.

Table 477: Percentile Example Output Table

country percentile height weight age


Albania 0 160 51 16
Albania 100 194 130 44
Bahrain 0 162 43 15
Bahrain 100 175 72 30
Costa Rica 0 163 50 21
Costa Rica 100 195 96 32
Cyprus 0 162 55 16
Cyprus 100 193 110 37
El Salvador 0 159 54 19
El Salvador 100 185 80 29
Eritrea 0 160 51 18
Eritrea 100 188 71 34
Grenada 0 162 56 19
Grenada 100 187 75 32
... ... ... ... ...
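
To request quartiles instead of the minimum and maximum, change only the Percentile argument (a sketch of the same call):

SELECT * FROM percentile (
    ON london_olympics PARTITION BY country
    Percentile ('25', '50', '75')
    TargetColumns ('height', 'weight', 'age')
    GroupColumns ('country')
) ORDER BY 1,2;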

Principal Component Analysis

Summary
Principal component analysis (PCA) is a common unsupervised learning technique that is useful for both
exploratory data analysis and dimension reduction. PCA is often used as the core procedure for factor
analysis.
The PCA function is composed of two functions, PCA_Map and PCA_Reduce.

If the version of PCA_Reduce is AA 6.21 or later, you can input the PCA output to the function PCAPlot.

Background
When you have thousands of input variables, there is a high probability that some of them are linearly
correlated. Reasons to reduce the thousands of potentially linearly correlated input variables to a few linearly
uncorrelated variables, called principal components, are:
• Some statistical analysis tools, such as linear regression, do not allow linearly correlated inputs.
• High dimensionality causes many problems for statistical tools.
Given a data set with N observations and M variables, represented by an NxM matrix, PCA generates an
MxM rotation matrix. Each column of the rotation matrix represents an axis in M-dimensional space. The
first k columns are the k dimensions along which the data varies most (and thus in some cases are
considered the most important). Discarding the remaining M-k columns leaves an Mxk rotation matrix.
Multiplying the original NxM matrix by the Mxk rotation matrix produces an Nxk matrix that represents the
data set with a reduced dimensionality of k, where k is less than or equal to M.
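
In symbols, restating the preceding paragraph: if X is the NxM data matrix and R_k is the Mxk matrix whose columns are the k top-ranked eigenvectors, the reduced data set is

T = X R_k

where T is an Nxk matrix whose columns are the principal component scores.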
Each eigenvector (output row, less the last standard deviation column) is a weighting scheme over the
original input variables; therefore, the linear combination of the original variables using this eigenvector is a
principal component. The multiplication works because the length of the eigenvector is the same as the
number of the original input variables. Selecting the first k eigenvectors produces k principal components
with the k highest standard deviations (due to the eigenvector computation). These principal components
are linearly uncorrelated and can be used as input variables in further analysis.
The rank of principal components decreases in standard deviation, and thus in significance. The first several
principal components usually explain 80%–90% of the total variance, which is sufficient in most
applications.

Usage

Principal Component Analysis Syntax


PCA_Reduce version 1.2, PCA_Map version 1.1

SELECT * FROM PCA_Reduce (


ON PCA_Map (
ON target_table
Target_Columns ({ 'target_column' | 'target_column_range' }[,...])
) PARTITION BY 1
[ Components (num_components) ]
) ORDER BY component_rank;

Arguments
Argument Category Description
Target_Columns Optional Specifies the target_table columns that contain the data, which
must be numeric. The default value is every target_table column.
Components Optional Specifies the number of principal components to return (an integer). If num_components is k, then the function returns the top k components. By default, the function returns every principal component.

Input
The PCA_Map function has one required input table, which contains the data set whose principal
components (eigenvectors) are to be returned.

Note:
The function ignores input rows that have missing values.

Table 478: PCA_Map Input Table Schema

Column Name Data Type Description
target_column_i NUMERIC Values of the ith dimension of the data set. The table has one such column for each dimension.

Output
The PCA_Reduce function outputs a table in which each row represents a principal component, or
eigenvector. The first row represents the largest eigenvalue in the matrix.
Table 479: PCA_Reduce Output Table Schema

Column Name Data Type Description
component_rank INTEGER Rank of the principal component. Components are ranked in descending order of standard deviation (and variance).
dimension_i DOUBLE PRECISION Values of the ith dimension of the data set. The table has one such column for each dimension.
sd DOUBLE PRECISION Standard deviation of the components in the eigenvector represented by the row.
var_proportion DOUBLE PRECISION Proportion of variance of the components in the eigenvector represented by the row.
cumulative_var DOUBLE PRECISION Cumulative variance of the components in the eigenvector represented by the row.
mean VARCHAR One row of this column contains a list of average values, one for each target_column_i in the input table. The list has this format: [average[,average...]] The outer brackets appear in the table. The other rows of this column contain NULL.
Example
This example uses PCA for dimension reduction; that is, it determines the principal components that
capture most of the variance of the explanatory variables. The principal components are mutually
orthogonal because the eigenvectors that span them are orthogonal.
The example uses the PCA_Map and PCA_Reduce functions to output a table of eigenvectors with their
component ranks and standard deviations and then uses SQL statements to derive the principal components
of the top three eigenvectors.

Input
The input is medical data for 25 patients, identified by patient ID (pid). The data has these attributes (also
called variables or dimensions):
• Age (years)
• Body mass index (BMI) (kg/m2)
• Blood pressure (mm Hg)
• Blood glucose level (mg/dL)
• Strokes (number experienced)
• Cigarettes (number smoked/month)
• Insulin (mg/dL)
• High-density lipoproteins (HDL) (mg/dL)
Table 480: PCA Example Input Table patient_pca_input

pid age bmi bloodpressure glucose strokes cigarettes insulin hdl


1 50 33.6 72 148 6 35 0 62.7
2 31 26.6 66 85 1 29 0 35.1
3 32 23.3 64 183 8 0 0 67.2
4 21 28.1 66 89 1 23 94 16.7
5 33 43.1 40 137 0 35 168 228.8
6 30 25.6 74 116 5 0 0 20.1
7 26 31 50 78 3 32 88 24.8
8 29 35.3 0 115 10 0 0 13.4
9 53 30.5 70 197 2 45 543 15.8
10 54 0 96 125 8 0 0 23.2
11 30 37.6 92 110 4 0 0 19.1
12 34 38 74 168 10 0 0 53.7
13 57 27.1 80 139 10 0 0 144.1
14 59 30.1 60 189 1 23 846 39.8
15 51 25.8 72 166 5 19 175 58.7
16 32 30 0 100 7 0 0 48.4

17 31 45.8 84 118 0 47 230 55.1
18 31 29.6 74 107 7 0 0 25.4
19 33 43.3 30 103 1 38 83 18.3
20 32 34.6 70 115 1 30 96 52.9
21 27 39.3 88 126 3 41 235 70.4
22 50 35.4 84 99 8 0 0 38.8
23 41 39.8 90 196 7 0 0 45.1
24 29 29 80 119 9 35 0 26.3
25 51 36.6 94 143 11 33 146 25.4

SQL to Create Table of Eigenvectors

CREATE TABLE pca_health_ev DISTRIBUTE BY REPLICATION AS


SELECT * FROM pca_reduce (
ON pca_map (
ON patient_pca_input
TargetColumns('[1:8]')
)
PARTITION BY 1
) ORDER BY component_rank;

Output (Table of Eigenvectors)


The output table lists the eigenvectors in descending order of standard deviation (sd).
The query below returns the output shown in the following three tables.

SELECT * FROM pca_health_ev ORDER BY component_rank;

Table 481: PCA Example Output Table pca_health_ev (Columns 1-5)

component_rank age bmi bloodpressure glucose
1 0.0233531960338227 0.00504956503997912 0.00158669391367759 0.0842032256610642
2 0.0401859758008148 0.0404126752942246 -0.0223449106172612 0.217685269075731
3 -0.168207263826717 0.0638054862550784 -0.472824089141383 -0.82076625363468
4 -0.0129652073386729 0.00378045426074681 -0.858379856202759 0.406899030303581
5 -0.233598151299198 0.320787735507453 -0.124354522206257 0.309945063989336
6 -0.743337103992336 0.515142674213825 0.153195632015632 0.0282396614323736
7 -0.585637274413613 -0.790707158491968 0.005530071477415 0.0925491110066536
8 -0.13888402260094 -0.0275921930861039 0.0124923981893404 -0.0294977512854948

Table 482: PCA Example Output Table pca_health_ev (Columns 6-9)

strokes cigarettes insulin hdl


-0.00859257029795621 0.0413329219482391 0.99526340917443 0.00222103039038422
-0.00344527374716509 0.0171848390743044 -0.022445679214246 0.973681022974589
-0.0539753477374444 0.180457886400629 0.0654659811769922 0.175074647858464
0.0391336176479417 -0.290464771196449 -0.0201355968289739 -0.105490035413321
-0.0568956998524478 0.846124412456638 -0.0575951508612638 -0.0922837968094822
-0.070564649365954 -0.389709106729591 0.0277394005213322 0.0137681856551496
-0.107921723264969 0.101814040327821 0.00467778052427058 0.0343536591559544
0.987727660277147 0.0538110654780203 0.0121302727639756 0.016583625709794

Table 483: PCA Example Output Table pca_health_ev (Columns 10-13)

sd var_proportion cumulative_var mean
194.717385062003 0.903560729160365 0.903560729160365 [37.88, 31.96399963378906, 66.8, 130.84, 5.12, 18.6, 108.16, 49.17200023651123]
46.3369853980967 0.0511685890757095 0.954729318236074
32.126518678328 0.0245966082061308 0.979325926442205
23.0594563502538 0.0126720249199549 0.99199795136216
14.4734867683913 0.00499222587415763 0.996990177236318
8.8921277044859 0.00188434002234855 0.998874517258666
6.5381934611736 0.00101874015287284 0.999893257411539
2.11638619509984 0.00010674258846095 1

In the following table, which is derived from the output table, the cumulative variance calculation shows that
the three top-ranked eigenvectors account for ~98% of the total variance.

Table 484: Values Derived from PCA Example Output Table pca_health_ev

component_rank sd variance variance_proportion cumulative_variance
1 194.7174 37914.86 0.903560729 0.903560729
2 46.33699 2147.116 0.051168589 0.954729318
3 32.12652 1032.113 0.024596608 0.979325926
4 23.05946 531.7385 0.012672025 0.991997951
5 14.47349 209.4818 0.004992226 0.996990177
6 8.892128 79.06994 0.00188434 0.998874517
7 6.538193 42.74797 0.00101874 0.999893257
8 2.116386 4.479091 0.000106743 1
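
These derived values follow directly from the sd column: variance = sd² (for example, 194.7174² ≈ 37914.86), and variance_proportion = variance / total variance, where the total variance is the sum of the variance column (≈ 41961.61), giving 37914.86 / 41961.61 ≈ 0.9036 for the first component.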

SQL to Create Table of Principal Components


This SQL statement creates a table that contains the principal components that correspond to the three top-
ranked eigenvectors—that is, the principal components that explain ~98% of the total variance. The
statement calculates these components by linearly combining the input attributes for each patient (from
Input) with the three top-ranked eigenvectors (shown in the table pca_health_ev).

Note:
This method of generating the principal components uses uncentered input, whereas the function PCAPlot uses centered input that it computes internally. The two methods generate principal components that are different but equivalent. PCAPlot, which is more compact, is recommended when the data has many dimensions and requires many principal components.

CREATE TABLE pca_health_pc DISTRIBUTE BY REPLICATION AS


SELECT pid, pca_1, pca_2, pca_3
FROM
(SELECT pid,
(a.age * b.age) +
(a.bmi * b.bmi) +
(a.bloodpressure * b.bloodpressure) +
(a.glucose * b.glucose) +
(a.strokes * b.strokes) +
(a.cigarettes * b.cigarettes) +
(a.insulin * b.insulin) +
(a.hdl * b.hdl) as pca_1
FROM pca_health_ev a,
patient_pca_input b
WHERE a.component_rank = 1) a
JOIN
(SELECT pid,
(a.age * b.age) +
(a.bmi * b.bmi) +
(a.bloodpressure * b.bloodpressure) +
(a.glucose * b.glucose) +
(a.strokes * b.strokes) +
(a.cigarettes * b.cigarettes) +
(a.insulin * b.insulin) +
(a.hdl * b.hdl) as pca_2
FROM pca_health_ev a,
patient_pca_input b
WHERE a.component_rank = 2) b
USING(pid)
JOIN
(SELECT pid,
(a.age * b.age) +
(a.bmi * b.bmi) +
(a.bloodpressure * b.bloodpressure) +
(a.glucose * b.glucose) +
(a.strokes * b.strokes) +
(a.cigarettes * b.cigarettes) +
(a.insulin * b.insulin) +
(a.hdl * b.hdl) as pca_3
FROM pca_health_ev a,
patient_pca_input b
WHERE a.component_rank = 3) c
USING(pid) order by pid;

Output (Table of Principal Components)


In the following table, each value in column pca_1, pca_2, and pca_3 is calculated by multiplying corresponding attribute values in the input table (Input) and output table (Output (Table of Eigenvectors)) and then adding those products. (The attribute values are in columns age through hdl.)
For example, for pid 1:
pca_1 = 50 * 0.023353196 + 33.6 * 0.005049565 + 72 * 0.001586694 + 148 * 0.084203226 + 6 * (- 0.00859257)
+ 35* 0.041332922+ 0 * 0.995263409 + 62.7 * 0.00222103 = 15.4479999925239
Table 485: PCA Example Output Table pca_health_pc

pid pca_1 pca_2 pca_3


1 15.4479999925239 95.6063494855112 -144.813884414743
2 9.38828381804583 54.0203436788621 -93.1642974316123
3 16.4562085141836 106.037696072979 -173.023717761683
4 102.765035084519 34.4211125353598 -92.8198997861495
5 181.746678156338 251.60582517126 -76.7873728335953
6 10.7165341344642 45.3918741685635 -130.361541710363
7 96.3460432611523 40.8714816471788 -74.3408072276598
8 10.4826993814663 40.6386392479444 -95.2075494656953
9 560.396754962081 48.6448813248549 -155.430694416748
10 11.9215857535403 49.7974281943461 -153.440157326562
11 10.3068415817 43.1982704402206 -133.30321137986
12 15.2827930280187 90.0718255965771 -167.310396453308
13 13.5332840582506 172.129432304726 -135.082617375201
14 860.562747564309 63.5444807135366 -125.049203724188
15 190.457102950088 91.1554213623796 -152.330669708236
16 9.36646167258605 71.3689044734852 -77.4493074289404
17 242.000092281589 76.2015591039508 -105.674466212198
18 9.99684287848697 48.7881615552709 -122.067685941854
19 93.9194009716571 41.4321906209861 -86.0707423033147
20 107.610631620675 76.0189342246316 -109.754823547146
21 247.290353480677 92.1018846926632 -112.112500012403
22 9.83325145607444 60.8650374623456 -120.763638056819
23 17.8451088790566 87.8002159712053 -200.263354330633
24 12.3985316709805 52.6325871176354 -128.090051863924
25 160.200377629451 54.5407906248303 -148.692017737866

SQL to Create View of Correlations
To interpret each principal component, you must compute the correlations between the original values for
each attribute (in Input) and each principal component (in Output (Table of Principal Components)).
The first step is to create a view that shows both the principal components and the patient attribute values,
using this SQL statement:

CREATE VIEW v_pca_health_corr_input AS


SELECT PC.pid, PC.pca_1, PC.pca_2, PC.pca_3,
SR.age, SR.bmi, SR.bloodpressure, SR.glucose, SR.strokes,
SR.cigarettes, SR.insulin, SR.hdl
FROM pca_health_pc PC
JOIN patient_pca_input SR ON (SR.pid = PC.pid);

Output (View of Correlations)


Table 486: PCA Example View v_pca_health_corr_input, Columns 1-8

pid pca_1 pca_2 pca_3 age bmi bloodpressure glucose
1 15.4479999925239 95.6063494855112 -144.813884414743 50 33.6 72 148
2 9.38828381804583 54.0203436788621 -93.1642974316123 31 26.6 66 85
3 16.4562085141836 106.037696072979 -173.023717761683 32 23.3 64 183
4 102.765035084519 34.4211125353598 -92.8198997861495 21 28.1 66 89
5 181.746678156338 251.60582517126 -76.7873728335953 33 43.1 40 137
6 10.7165341344642 45.3918741685635 -130.361541710363 30 25.6 74 116
7 96.3460432611523 40.8714816471788 -74.3408072276598 26 31 50 78
8 10.4826993814663 40.6386392479444 -95.2075494656953 29 35.3 0 115
9 560.396754962081 48.6448813248549 -155.430694416748 53 30.5 70 197
10 11.9215857535403 49.7974281943461 -153.440157326562 54 0 96 125
11 10.3068415817 43.1982704402206 -133.30321137986 30 37.6 92 110
12 15.2827930280187 90.0718255965771 -167.310396453308 34 38 74 168
13 13.5332840582506 172.129432304726 -135.082617375201 57 27.1 80 139
14 860.562747564309 63.5444807135366 -125.049203724188 59 30.1 60 189
15 190.457102950088 91.1554213623796 -152.330669708236 51 25.8 72 166
16 9.36646167258605 71.3689044734852 -77.4493074289404 32 30 0 100
17 242.000092281589 76.2015591039508 -105.674466212198 31 45.8 84 118
18 9.99684287848697 48.7881615552709 -122.067685941854 31 29.6 74 107
19 93.9194009716571 41.4321906209861 -86.0707423033147 33 43.3 30 103
20 107.610631620675 76.0189342246316 -109.754823547146 32 34.6 70 115
21 247.290353480677 92.1018846926632 -112.112500012403 27 39.3 88 126
22 9.83325145607444 60.8650374623456 -120.763638056819 50 35.4 84 99
23 17.8451088790566 87.8002159712053 -200.263354330633 41 39.8 90 196
24 12.3985316709805 52.6325871176354 -128.090051863924 29 29 80 119
25 160.200377629451 54.5407906248303 -148.692017737866 51 36.6 94 143

Table 487: PCA Example View v_pca_health_corr_input, Columns 9-12

strokes cigarettes insulin hdl


6 35 0 62.7
1 29 0 35.1
8 0 0 67.2
1 23 94 16.7
0 35 168 228.8
5 0 0 20.1
3 32 88 24.8
10 0 0 13.4

2 45 543 15.8
8 0 0 23.2
4 0 0 19.1
10 0 0 53.7
10 0 0 144.1
1 23 846 39.8
5 19 175 58.7
7 0 0 48.4
0 47 230 55.1
7 0 0 25.4
1 38 83 18.3
1 30 96 52.9
3 41 235 70.4
8 0 0 38.8
7 0 0 45.1
9 35 0 26.3
11 33 146 25.4

To compute the correlations between the original values for each attribute, use the preceding view and the
functions Corr_Map and Corr_Reduce (described in Correlation).

SQL-MapReduce Call to Get pca_1 Correlation Coefficients

SELECT * FROM Corr_Reduce (


ON Corr_Map (
ON v_pca_health_corr_input
ColumnPairs ('pca_1:age','pca_1:bmi', 'pca_1:bloodpressure',
'pca_1:glucose', 'pca_1:strokes',
'pca_1:cigarettes','pca_1:insulin', 'pca_1:hdl')
KEY_NAME ('pid')
)
PARTITION BY pid
);

Output (pca_1 Correlation Coefficients)


The first principal component, pca_1, is very strongly correlated with the variable insulin (99.99%), which
means that pca_1 is essentially a measure of insulin. The other attributes are only mildly correlated (the
correlation coefficients are less than 0.5).

Table 488: PCA Example pca_1 Correlation Coefficients

corr value
pca_1:bmi 0.111364
pca_1:glucose 0.478427
pca_1:hdl 0.00949362
pca_1:insulin 0.999914
pca_1:age 0.409505
pca_1:bloodpressure 0.0123209
pca_1:cigarettes 0.459847
pca_1:strokes -0.471619

SQL-MapReduce Call to Get pca_2 Correlation Coefficients

SELECT * FROM Corr_Reduce (


ON Corr_Map (
ON v_pca_health_corr_input
ColumnPairs ('pca_2:age','pca_2:bmi', 'pca_2:bloodpressure',
'pca_2:glucose', 'pca_2:strokes',
'pca_2:cigarettes','pca_2:insulin', 'pca_2:hdl')
KEY_NAME ('pid')
)
PARTITION BY pid
);

Output (pca_2 Correlation Coefficients)


The second principal component, pca_2, is essentially a measure of hdl (its correlation coefficient is 0.99)
and is mildly correlated to the other attributes.
Table 489: PCA Example pca_2 Correlation Coefficients

corr value
pca_2:bmi 0.212095
pca_2:glucose 0.294333
pca_2:hdl 0.990415
pca_2:insulin -0.00536637
pca_2:age 0.167691
pca_2:bloodpressure -0.0412905
pca_2:cigarettes 0.0454973
pca_2:strokes -0.0450002

SQL-MapReduce Call to Get pca_3 Correlation Coefficients

SELECT * FROM Corr_Reduce (


ON Corr_Map (
ON v_pca_health_corr_input
ColumnPairs ('pca_3:age','pca_3:bmi', 'pca_3:bloodpressure',
'pca_3:glucose', 'pca_3:strokes',
'pca_3:cigarettes','pca_3:insulin', 'pca_3:hdl')
KEY_NAME ('pid')
)
PARTITION BY pid
);

Output (pca_3 Correlation Coefficients)


The third principal component, pca_3, is negatively correlated with blood pressure and glucose level.
Table 490: PCA Example pca_3 Correlation Coefficients

corr value
pca_3:age -0.48665
pca_3:bloodpressure -0.605769
pca_3:cigarettes 0.331247
pca_3:strokes -0.48879
pca_3:bmi 0.23217
pca_3:glucose -0.769423
pca_3:hdl 0.123469
pca_3:insulin 0.0108517

Summary
If the three principal components are used in determining patient health condition, then the most important
attributes are higher insulin and hdl levels and lower blood pressure and glucose levels.

PCAPlot

Summary
The PCAPlot function takes the principal components output by the PCA function (Principal Component
Analysis) and input data, centers the input data, changes the basis of the input data to the principal
components, and outputs the result.
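In other words, using notation introduced here for explanation only (e_k for the kth eigenvector and mean_j
for the mean of input column j): if x_1, ..., x_d are the attribute values of an input row, then the kth output
value for that row is

principal_component_k = e_k1 * (x_1 - mean_1) + e_k2 * (x_2 - mean_2) + ... + e_kd * (x_d - mean_d)

Because of the centering step, these values differ from the pca_* values in the Principal Component Analysis
example (which are plain sums of products) by a constant shift per component, as the example below
illustrates.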

Note:
The version of PCA_Reduce, a component of the PCA function, must be AA 6.21 or later.

Usage

PCAPlot Syntax
Version 1.0

SELECT * FROM PCAPlot (


ON input_table AS inputtable PARTITION BY ANY
ON pca_table AS pca_table DIMENSION
Components ('num_components')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
Components Required Specifies the number of principal components to return (an
integer). If num_components is k, then the function returns the
top k components.
Accumulate Optional Specifies the names of the input table columns to copy to the
output table.

Input
The function requires an input table and a PCA table. The PCA table is output by the PCA_Reduce function.
Table 491: PCAPlot Input Table Schema

Column Name Data Type Description


input_column NUMERIC Contains the data to be transformed.
accumulate_column Any Column to copy to the output table.

Output
Table 492: PCAPlot Output Table Schema

Column Name Data Type Description


dimension_i DOUBLE PRECISION Values of the ith dimension of the data set. The table has one such column for each dimension. The number of dimensions is determined by the PCA_Map argument TargetColumn.
accumulate_column Same as in input table Column copied from the input table.

Example

Input
• Input table patient_pca_input
• PCA table pca_health_ev

SQL-MapReduce Call

SELECT * FROM PCAPlot (


ON patient_pca_input AS inputtable PARTITION BY ANY
ON pca_health_ev AS pca_table DIMENSION
Accumulate ('pid')
Components ('3')
) ORDER BY 1;

Output
Table 493: PCAPlot Example Output Table

pid principal_component_1 principal_component_2 principal_component_3


1 -105.202865798753 20.0509363738591 -20.2777000767145
2 -111.262581973231 -21.535069432790 31.3718869064158
3 -104.194657277093 30.4822829613273 -48.4875334236547
4 -17.8858307067573 -41.1343005762923 31.7162845518786
5 61.0958123650612 176.050412059608 47.7488115044328
6 -109.934331656812 -30.1635389430887 -5.82535737233537
7 -24.3048225301243 -34.6839314644733 50.1953771103683
8 -110.16816640981 -34.9167738637078 29.3286348723329
9 439.745889170805 -26.9105317867973 -30.8945100787203
10 -108.729280037736 -25.7579849173061 -28.9039729885343
11 -110.344024209577 -32.3571426714316 -8.76702704183158
12 -105.368072763258 14.5164124849249 -42.7742121152798
13 -107.117581733026 96.5740191930736 -10.5464330371725
14 739.911881773033 -12.0109323981156 -0.513019386160228
15 69.8062371588109 15.6000082507274 -27.7944853702083
16 -111.284404118691 -4.186508638167 47.0868769090877
17 121.349226490312 0.646145992298679 18.8617181258305
18 -110.65402291279 -26.7672515563812 2.46849839617399

19 -26.7314648196195 -34.1232224906661 38.4654420347134
20 -13.0402341706013 0.463521112979414 14.7813607908824
21 126.6394876894 16.5464715810111 12.4236843256251
22 -110.817614335202 -14.6903756493066 3.77254628120944
23 -102.80575691222 12.2448028595531 -75.7271699926051
24 -108.252334120296 -22.9228259940168 -3.55386752589592
25 39.5495118381742 -21.0146224868219 -24.1558333998384

RandomSample

Summary
The RandomSample function takes a data set and uses a specified sampling method to output one or more
random samples. Each sample has exactly the number of rows specified.
The RandomSample function is useful for generating test sets, training sets, and initial centers for clustering
algorithms.
In addition to the default basic sampling, in which each input table row has a probability of being selected
that is proportional to its weight, this function provides two alternate methods, KMeans++ and KMeans||,
which are designed for generating a set of initial seeds for the function KMeans.

Usage

RandomSample Syntax
Version 1.0

SELECT * FROM RandomSample (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
NumSample ('sample_size' [,...])
[ WeightColumn ('weight_column') ]
[ SamplingMode ({ 'basic' | 'kmeans++' | 'kmeans||' }) ]
[ Distance ({ 'euclidean' | 'manhattan' }) ]
[ InputColumns ( { 'input_column' | 'input_column_range' }[,...] ) ]
[ AsCategories ( { 'ascat_column' | 'ascat_column_range' }[,...] ) ]
[ CategoryWeights ('category_weight' [,...]) ]

[ CategoricalDistance ({ 'overlap' | 'hamming' }) ]
[ Seed ('seed')
SeedColumn ({ 'seed_column' | 'seed_column_range' } [,...]) ]
[ OverSamplingRate ('rate') ]
[ IterationNum ('number_of_iterations') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the data set from which to
take samples.
NumSample Required Specifies both the number of samples and their sizes. For each
sample_size (an INTEGER value), the function selects a sample that
has sample_size rows.
WeightColumn Optional Specifies the name of the input_table column that contains weights for
weighted sampling. The weight_column must have a numeric SQL data
type. By default, rows have equal weight.
SamplingMode Optional Specifies the sampling mode:
• 'basic' (default)
Each input_table row has a probability of being selected that is
proportional to its weight. The weight of each row is in
weight_column.
• 'kmeans++'
One row is selected in each of k iterations, where k is the number of
desired output rows. The first row is selected randomly. In
subsequent iterations, the probability of a row being selected is
proportional to the value in the WeightColumn multiplied by the
distance from the nearest row in the set of selected rows. The
distance is calculated using the methods specified by the Distance
and CategoricalDistance arguments.
• 'kmeans||'
Enhanced version of KMeans++ that exploits parallel architecture
to accelerate the sampling process. The algorithm is described in
the paper Scalable K-Means++ by Bahmani et al
(http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). Briefly, at
each iteration, the probability that a row is selected is proportional
to the value in the WeightColumn multiplied by the distance from
the nearest row in the set of selected rows (as in KMeans++).
However, the KMeans|| algorithm oversamples at each iteration,
significantly reducing the required number of iterations; therefore,
the resulting set of rows might have more than k data points. Each
row in the resulting set is then weighted by the number of rows in
the table that are closer to that row than to any other selected row,
and the rows are clustered to produce exactly k rows.

Tip:
For optimal performance, use 'kmeans++' when the desired sample
size is less than 15 and 'kmeans||' otherwise.

Distance Optional For KMeans++ and KMeans|| sampling, specifies the function for
computing the distance between numerical variables:
• 'euclidean' (default): The distance between two variables is defined
in Euclidean Distance (found in the Background section of the
function VectorDistance).
• 'manhattan': The distance between two variables is defined in
Manhattan Distance (found in the Background section of the
function VectorDistance).

InputColumns Optional For KMeans++ and KMeans|| sampling, specifies the names of the
input_table columns to use to calculate the distance between numerical
variables.
AsCategories Optional For KMeans++ and KMeans|| sampling, specifies the names of the
input_table columns that contain numerical variables to treat as
categorical variables.
CategoryWeights Optional For KMeans++ and KMeans|| sampling, specifies the weights
(DOUBLE PRECISION values) of the categorical variables, including
those that the AsCategories argument specifies. Specify the weights in
the order (from left to right) that the variables appear in the input
table. When calculating the distance between two rows, distances
between categorical values are scaled by these weights.
CategoricalDistance Optional For KMeans++ and KMeans|| sampling, specifies the function for
computing the distance between categorical variables:
'overlap' (default): The distance between two variables is 0 if they are
the same and 1 if they are different.
'hamming': The distance between two variables is the Hamming
distance between the strings that represent them. The strings must
have equal length.
Seed Optional Specifies the random seed with which to initialize the algorithm (a
LONG value). If you specify Seed, you must also specify SeedColumn.
SeedColumn Optional Specifies the names of the input_table columns by which to partition
the input. Function calls that use the same input data, seed, and
seed_column output the same result. If you specify SeedColumn, you
must also specify Seed.

Note:
Ideally, the number of distinct values in the seed_column is the
same as the number of workers in the cluster. A very large number
of distinct values in the seed_column degrades function
performance.

OverSamplingRate Optional For KMeans|| sampling, specifies the oversampling rate (a DOUBLE
PRECISION value greater than 0.0). The function multiplies rate by
sample_size (for each sample_size). The default rate is 1.0.
IterationNum Optional For KMeans|| sampling, specifies the number of iterations (an
INTEGER value greater than 0). The default number_of_iterations is 5.

Input
Table 494: RandomSample Input Table Schema

Column Name Data Type Description


input_column Any Column to use for sampling. The table must have at least one such column, and must have every input_column that the InputColumns argument specifies.
weight_column BYTEINT, SMALLINT, INTEGER, BIGINT, NUMERIC, DOUBLE PRECISION, or NUMBER Optional. Contains the weights for weighted sampling of the rows. Can be an input_column.
seed_column Any Optional. Column by which to partition the input. Can be an input_column.

Output
Table 495: RandomSample Output Table Schema

Column Name Data Type Description


set_id INTEGER Leftmost column. Identifies the sample set to which the row belongs.
input_column Same as in input table Column copied from the input table. Every input table column appears in the output table.

Examples
• Input

• Example 1: Basic Sampling (Weighted)
• Example 2: KMeans++ Sampling
• Example 3: KMeans|| Sampling

Input
The input table has 32 observations of 11 variables for different models of cars:
• mpg: miles per U. S. gallon
• cyl: number of cylinders
• disp: displacement (cubic inches)
• hp: gross horsepower
• drat: drive ratio
• wt: weight (lbs/1000)
• qsec: quarter-mile time (seconds)
• vs: engine configuration (V or S (straight))
• am: transmission type (automatic or manual)
• gear: number of forward gears
• carb: number of carburetors
The variables vs and am are categorical; the others are numerical.
Table 496: RandomSample Examples Input Table fs_input

sn model mpg cyl disp hp drat wt qsec vs am gear carb


1 Mazda RX4 21 6 160 110 3.9 2.62 16.46 S manual 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 S manual 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.61 V manual 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 V automatic 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 S automatic 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.22 V automatic 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.84 S automatic 3 4
8 Merc 240D 24.4 4 146.7 62 3.69 3.19 20 V automatic 4 2
9 Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 V automatic 4 2
10 Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 V automatic 4 4
11 Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 V automatic 4 4
12 Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 S automatic 3 3
13 Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 S automatic 3 3
14 Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18 S automatic 3 3
15 Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 17.98 S automatic 3 4
16 Lincoln Continental 10.4 8 460 215 3 5.424 17.82 S automatic 3 4
17 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 S automatic 3 4
18 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.47 V manual 4 1
19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 V manual 4 2
20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 V manual 4 1
21 Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 V automatic 3 1
22 Dodge Challenger 15.5 8 318 150 2.76 3.52 16.87 S automatic 3 2
23 AMC Javelin 15.2 8 304 150 3.15 3.435 17.3 S automatic 3 2
24 Camaro Z28 13.3 8 350 245 3.73 3.84 15.41 S automatic 3 4
25 Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 S automatic 3 2
26 Fiat X1-9 27.3 4 79 66 4.08 1.935 18.9 V manual 4 1
27 Porsche 914-2 26 4 120.3 91 4.43 2.14 16.7 S manual 5 2
28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 V manual 5 2
29 Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 S manual 5 4
30 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 S manual 5 6
31 Maserati Bora 15 8 301 335 3.54 3.57 14.6 S manual 5 8
32 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 V manual 4 2

Example 1: Basic Sampling (Weighted)


This example uses basic sampling to select one sample of 10 rows, which are weighted by car weight. Because
the function call includes the Seed and SeedColumn arguments, it always produces the same output from the
same input.

SQL-MapReduce Call

SELECT * FROM RandomSample (


ON (SELECT 1) PARTITION BY 1
InputTable ('fs_input')
SamplingMode ('basic')
NumSample ('10')
WeightColumn ('wt')
Seed ('1')
SeedColumn ('model')
) ORDER BY 1, 2, 3;

Output

Table 497: RandomSample Example 1 Output Table

set_id sn model mpg cyl disp hp drat wt qsec vs am gear carb


0 1 Mazda RX4 21 6 160 110 3.9 2.62 16.46 S manual 4 4
0 7 Duster 360 14.3 8 360 245 3.21 3.57 15.84 S automatic 3 4
0 11 Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 V automatic 4 4
0 12 Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 S automatic 3 3
0 13 Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 S automatic 3 3
0 15 Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 17.98 S automatic 3 4
0 17 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 S automatic 3 4
0 23 AMC Javelin 15.2 8 304 150 3.15 3.435 17.3 S automatic 3 2
0 27 Porsche 914-2 26 4 120.3 91 4.43 2.14 16.7 S manual 5 2
0 31 Maserati Bora 15 8 301 335 3.54 3.57 14.6 S manual 5 8

Example 2: KMeans++ Sampling


This example uses KMeans++ sampling with the Manhattan distance metric, and treats the numeric
variables cyl, gear, and carb as categorical variables (and the categorical variables vs and am). The category
weights are assigned in the order that the columns appear in the input table: 1000 to cyl, 10 to vs, 100 to am,
100 to gear, and 100 to carb.

SQL-MapReduce Call

SELECT * FROM RandomSample (


ON (SELECT 1) PARTITION BY 1
InputTable ('fs_input')
NumSample ('10')
SamplingMode ('kmeans++')
InputColumns ('mpg:carb')
CategoryWeights ('1000', '10', '100', '100', '100')
AsCategories ('cyl', 'gear', 'carb')
Distance ('manhattan')
Seed (1)
SeedColumn ('model')
) ORDER BY 1, 2, 3;

Output

Table 498: RandomSample Example 2 Output Table

set_id sn model mpg cyl disp hp drat wt qsec vs am gear carb


0 2 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 S manual 4 4
0 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 V automatic 3 1
0 13 Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 S automatic 3 3
0 18 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.47 V manual 4 1
0 21 Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 V automatic 3 1
0 24 Camaro Z28 13.3 8 350 245 3.73 3.84 15.41 S automatic 3 4
0 25 Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 S automatic 3 2
0 27 Porsche 914-2 26 4 120.3 91 4.43 2.14 16.7 S manual 5 2
0 30 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 S manual 5 6
0 31 Maserati Bora 15 8 301 335 3.54 3.57 14.6 S manual 5 8

Example 3: KMeans|| Sampling


This example uses KMeans|| sampling. Like Example 2, this example treats the numeric variables cyl, gear,
and carb as categorical variables (and the categorical variables vs and am). However, this example uses the
Manhattan distance metric for the numerical variables and the Hamming distance metric for the categorical
variables. Because the Hamming distance metric requires categories of equal length, assume that in input
table column am, 'manual' has been changed to 'manualsys' (which is the same length as 'automatic').

Create and Populate Input Table

CREATE TABLE fs_input1 AS SELECT * FROM fs_input;


UPDATE fs_input1 SET am='manualsys' WHERE am='manual';

SQL-MapReduce Call

SELECT * FROM RandomSample (


ON (SELECT 1) PARTITION BY 1
InputTable ('fs_input1')
NumSample ('10')
SamplingMode ('kmeans||')
InputColumns('mpg:carb')
CategoryWeights ('1000', '10', '100', '100', '100')
AsCategories ('cyl', 'gear', 'carb')
CategoricalDistance ('hamming')
Distance ('manhattan')
Seed (1)
IterationNum ('2')
SeedColumn ('model')
) ORDER BY 1,2,3;

Output

Table 499: RandomSample Example 3 Output Table

set_id mpg cyl disp hp drat wt qsec vs am gear carb


0 12.42 8 414.4 228 3.324 4.7398 16.808 S automatic 3 4
0 15.8 8 351 264 4.22 3.17 14.5 S manualsys 5 4
0 17.225 8 349 162.5 2.9375 3.58125 16.9525 S automatic 3 2
0 17.3 8 275.8 180 3.07 3.73 17.6 S automatic 3 3
0 19.2 6 167.6 123 3.92 3.44 18.3 V automatic 4 4
0 19.7 6 145 175 3.62 2.77 15.5 S manualsys 5 6
0 21.4 4 121 109 4.11 2.78 18.6 V manualsys 4 2
0 21.4 6 258 110 3.08 3.215 19.44 V automatic 3 1
0 21.5 4 120.1 97 3.7 2.465 20.01 V automatic 3 1
0 23.6 4 143.75 78.5 3.805 3.17 21.45 V automatic 4 2
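Unlike the output of Examples 1 and 2, several of these rows are not rows of the input table. Because
KMeans|| oversamples and then clusters the candidate rows down to exactly the requested number of rows,
the output can contain weighted cluster centers with averaged attribute values, which is also why the sn and
model columns do not appear.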

Sample

Summary
The Sample function draws rows randomly from the input table.

The function offers two sampling schemes:


• A simple Bernoulli (Binomial) sampling on a row-by-row basis with given sample rates
• Sampling without replacement that selects a given number of rows

Sampling can be either unconditional or conditional. Unconditional sampling applies to all input data and
always uses the same random number generator. Conditional sampling applies only to input data that meets
specified conditions and uses a different random number generator for each condition.

Note:
The Sample function does not guarantee the exact sizes of samples. If each sample must have an exact
number of rows, use the RandomSample function.

Usage

Sample Syntax

Unconditional Sampling, Single Sample Rate


Version 1.2

SELECT * FROM Sample (


ON { table_name | view_name | (query) }
PARTITION BY { ANY | key }
SampleFraction ('fraction')
[ Seed ('seed') ]
);

Unconditional Sampling, Approximate Sample Size


Version 1.2

SELECT * FROM Sample (


ON { table_name | view_name | (query) } AS data
PARTITION BY { ANY | key }
ON { table_name | view_name | (query) } AS summary DIMENSION
ApproxSampleSize ('size')
[ Seed ('seed') ]
);

Conditional Simple Sampling, Single Sample Rate


Version 1.2

SELECT * FROM Sample (


ON { table_name | view_name | (query) }
PARTITION BY { ANY | key }
StratumColumn ('column')
Strata ('condition' [,...])
SampleFraction ('fraction')
[ Seed ('seed') ]
);

Conditional Sampling, Variable Sample Rates
Version 1.2

SELECT * FROM Sample (


ON { table_name | view_name | (query) }
PARTITION BY { ANY | key }
StratumColumn ('column')
Strata ('condition' [,...])
SampleFraction ('fraction' [,...])
[ Seed ('seed') ]
);

Conditional Sampling, Approximate Sample Size


Version 1.2

SELECT * FROM sample (


ON { table_name | view_name | (query) } AS data
PARTITION BY { ANY | key }
ON { table_name | view_name | (query) } AS summary DIMENSION
StratumColumn ('column')
Strata ('condition' [,...])
ApproxSampleSize ('total_sample_size')
[ Seed ('seed') ]
);

Conditional Sampling, Variable Approximate Sample Sizes


Version 1.2

SELECT * FROM sample (


ON { table_name | view_name | (query) } AS data
PARTITION BY { ANY | key }
ON { table_name | view_name | (query) } AS summary DIMENSION
StratumColumn ('column')
Strata ('condition' [,...])
ApproxSampleSize ('size' [,...])
[ Seed ('seed') ]
);

Arguments
Argument Category Description
SampleFraction Required Specifies one or more fractions to use in sampling the data. (Syntax
options that do not use SampleFraction require ApproxSampleSize.)
If you specify only one fraction, then the function uses fraction for all
strata defined by the sample conditions.
If you specify more than one fraction, then the function uses each fraction
for sampling a particular stratum defined by the condition arguments.

Note:
For conditional sampling with variable sample sizes, specify one
fraction for each condition that you specify with the Strata argument.

Seed Optional Specifies an integer to add to each task ID to create a real random seed for
the task. The default value is 0.
ApproxSampleSize Optional Specifies one or more approximate sample sizes to use in sampling the
data. (Syntax options that do not use ApproxSampleSize require
SampleFraction.) Each sample size is approximate because the function
maps the size to the sample fractions and then generates the sample data.
If you specify only one size, then it represents the total sample size for the
entire population. If you also specify the Strata argument, then the
function proportionally generates sample units for each stratum.
If you specify more than one size, then each size corresponds to a stratum,
and the function uses each size to generate sample units for the
corresponding stratum.

Note:
For conditional sampling with variable approximate sample sizes,
specify one size for each condition that you specify with the Strata
argument.

StratumColumn Optional Specifies the name of the column that contains the sample conditions. If
the function has only one input table (the data table), then
condition_column is in the data table. If the function has two input tables,
data and summary, then condition_column is in the summary table.
Strata Optional Specifies the sample conditions that appear in the condition_column
specified by StratumColumn. If Strata specifies a condition that does not
appear in condition_column, then the function issues an error message.
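For example, the table students used in the examples that follow has 100 rows, so ApproxSampleSize ('10')
corresponds to an overall sample fraction of roughly 0.1; the sample actually drawn in Example 3 has 12
rows, which illustrates that the sample size is approximate rather than exact.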

Input
The Sample function always requires a data table. Some syntax options also require a summary table.
Table 500: Sample Data Table Schema

Column Name Data Type Description


data_column Any Contains data to be sampled.
condition_column Any (Optional) Contains the data conditions. For some syntax options, the
data conditions are in the summary table.

Table 501: Sample Summary Table Schema

Column Name Data Type Description


data_column Any Contains data to be sampled. (Optional for Unconditional Sampling,
Approximate Sample Size, in the Sample Syntax section.)
condition_column Any Contains the data conditions.

Note:
The summary input must summarize the population statistics faithfully. That is, the sum of stratum_count
over rows with a non-null stratum value must equal the total population size. Otherwise, the final sample
output might not approximate the target sample fractions well.

Output
Table 502: Sample Output Table Schema

Column Name Data Type Description


data_column Same as in data table Copied from the data table. The output table contains every data table row that meets a specified condition.

Examples
• Input
• Example 1: Unconditional Sampling with Single Sample Rate
• Example 2: Conditional Sampling with Variable Sample Rates
• Example 3: Unconditional Sampling with Total (Single) Approximate Sample Size
• Example 4: Conditional Sampling with Variable Approximate Sample Sizes

Input
The input table (score_category) is obtained by categorizing the students (in the table students) based on
their score in a given subject. There are 100 students grouped into three categories: excellent (score >= 90),
very good (80 < score < 90), and fair (score <= 80), as shown in the SQL CASE statement that follows the table.
Table 503: Sample Example Input Table students

id score
1 5
2 83
3 95
4 95
5 90
6 55
7 40

8 57
9 65
10 27
... ...

CREATE TABLE score_category DISTRIBUTE BY hash(id) AS (


SELECT *, CASE
WHEN score <= 80 THEN 'fair'
WHEN score > 80 AND score < 90 THEN 'very good'
WHEN score >= 90 THEN 'excellent'
END AS stratum FROM students
);

Table 504: Sample Example Input Table score_category

id score stratum
1 5 fair
2 83 very good
3 95 excellent
4 95 excellent
5 90 excellent
6 55 fair
7 40 fair
8 57 fair
9 65 fair
10 27 fair

Example 1: Unconditional Sampling with Single Sample Rate


This example selects a sample of approximately 20% of the rows in the input table (SampleFraction ('0.2')).

SQL-MapReduce Call

SELECT * FROM Sample (


ON students PARTITION BY ANY
SampleFraction ('0.2')
Seed ('2')
) ORDER BY id;

Output

Table 505: Sample Example 1 Output Table

id score
4 95
6 55
15 22
16 19
21 5
22 25
27 44
32 3
36 15
39 50
44 70
47 79
65 53
67 29
68 30
69 18
74 7
79 71
81 13
83 81
85 79
92 32

Example 2: Conditional Sampling with Variable Sample Rates


In this example, different sampling rates (20%, 30%, and 40%) are applied to the categories fair, very good,
and excellent, respectively. The number of rows sampled is rounded to the nearest integer. The arguments
StratumColumn and Strata are always used together.

SQL-MapReduce Call

SELECT * FROM Sample (


ON score_category PARTITION BY ANY
SampleFraction ('0.2', '0.3', '0.4')

StratumColumn ('stratum')
Strata ('fair', 'very good', 'excellent')
Seed ('2')
) ORDER BY stratum, id, score;

Output

Table 506: Sample Example 2 Output Table

id score stratum
12 93 excellent
28 90 excellent
60 97 excellent
78 91 excellent
90 100 excellent
8 57 fair
10 27 fair
21 5 fair
24 11 fair
27 44 fair
32 3 fair
37 14 fair
42 39 fair
46 19 fair
49 8 fair
54 43 fair
61 6 fair
79 71 fair
81 13 fair
85 79 fair
94 76 fair
99 44 fair
100 18 fair
20 85 very good
95 84 very good

Example 3: Unconditional Sampling with Total (Single) Approximate Sample Size
This example uses the ApproxSampleSize argument and a summary (dimension) table as inputs. The
summary table containing the stratum count information is generated with the following SQL code:

SELECT stratum, count(*) AS stratum_count FROM score_category GROUP BY stratum;

Table 507: Sample Example 3 Summary Table

stratum stratum_count
very good 9
fair 77
excellent 14

SQL-MapReduce Call

SELECT * FROM sample (


ON (select stratum, count(*) AS stratum_count FROM
score_category GROUP BY stratum) AS summary DIMENSION
ON score_category PARTITION BY ANY
ApproxSampleSize ('10')
Seed ('2')
) ORDER BY id;

Output

Table 508: Sample Example 3 Output Table

id score stratum
4 95 excellent
6 55 fair
15 22 fair
16 19 fair
21 5 fair
27 44 fair
36 15 fair
39 50 fair
69 18 fair
81 13 fair
85 79 fair
92 32 fair

Example 4: Conditional Sampling with Variable Approximate Sample Sizes
In this example, approximate sample sizes of 5, 10, and 5 are applied to the categories 'excellent', 'fair', and
'very good', respectively. A summary (dimension) table is required with ApproxSampleSize.

SQL-MapReduce Call

SELECT * FROM sample (


ON (SELECT stratum, count(*) AS stratum_count FROM
score_category GROUP BY stratum) AS summary DIMENSION
ON score_category PARTITION BY ANY
StratumColumn ('stratum')
Strata ('excellent', 'fair', 'very good')
ApproxSampleSize ('5', '10', '5')
Seed ('2')
) ORDER BY stratum, id;

Output

Table 509: Sample Example 4 Output Table

id score stratum
12 93 excellent
28 90 excellent
60 97 excellent
78 91 excellent
90 100 excellent
8 57 fair
10 27 fair
21 5 fair
24 11 fair
27 44 fair
37 14 fair
46 19 fair
49 8 fair
79 71 fair
81 13 fair
85 79 fair
94 76 fair
20 85 very good
53 87 very good

95 84 very good

Note:
The summary input must summarize the population statistics faithfully. That is, the sum over
stratum_count with a non-null stratum value must be equal to the total population size. If this condition
does not hold, the final sample output might not approximate the target sample fractions well.

Shapley Value Functions

Summary
The Shapley value is intended to reflect the importance of each player to the coalition in a cooperative game
(a game between coalitions of players, rather than between individual players).
The Shapley value functions are:
• GenerateCombination, a function that takes combinations of players (coalitions) and generates input for
AddOnePlayer
• SortCombination, a function that sorts combinations of players
• AddOnePlayer, a function that takes sorted combinations and outputs a table
• SQL Statements to Compute the Shapley Value, which query the AddOnePlayer input and output tables
The input to GenerateCombination can be either unsorted user data or sorted output from the function
nPath. If the input is unsorted, GenerateCombination inputs it to SortCombination.
The input to SortCombination can come from either GenerateCombination or the user.
The input to AddOnePlayer can come from either GenerateCombination or SortCombination.
Figure 15: Computing a Shapley Value

Background
The Shapley value of a player is the difference between the average coalition payoff if the player is a member
and the average coalition payoff if the player is not a member.
If N is the set of players, S is an arbitrary coalition of players that does not include player i, and v is the payoff
function, then this formula computes the Shapley value of player i:

   φ_i(v) = Σ over S ⊆ N \ {i} of [ |S|! * (|N| - |S| - 1)! / |N|! ] * ( v(S ∪ {i}) - v(S) )

Instead of computing the sum over all 2^|N| possible coalitions, the Aster Analytics Shapley Value feature
computes an approximate Shapley value by sampling over coalitions whose observed payoff values are
included in the user data.

GenerateCombination
The GenerateCombination function takes combinations of players and generates input for AddOnePlayer.

Usage

GenerateCombination Syntax
Version 1.0

SELECT * FROM GenerateCombination (


ON { table | view | (query) }
);

Input
The first two columns of the input table must be index and payoff, respectively. The tables can have other
columns after these two, but the function ignores them.
Table 510: GenerateCombination Input Table Schema

Column Name Data Type Description


index STRING First column of the table. Contains bit strings that indicate which
players are in the combination. The length of each bit string is the total
number of players. Each bit represents a player. The bit has the value 1
if the player is in the combination and 0 otherwise. The most and least
significant bits represent the first and last players, respectively. Each
row represents a unique combination.
payoff DOUBLE PRECISION Second column of the table. Contains the observed payoff value for the
corresponding combination in the index column.
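For example (an illustrative reading of the schema, not additional function output): with three players, the
index value '110' indicates that players 1 and 2 are in the combination and player 3 is not, so
GenerateCombination emits that combination as comb '1 2' with size 2. The project_index column in the
AddOnePlayer Examples uses exactly this encoding.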

Output
Table 511: GenerateCombination Output Table Schema

Column Name Data Type Description


comb VARCHAR Combination, represented by space-separated player numbers. For
example, '1 2' or '2 3 4 5'. The first player is number 1, the second is
number 2, and so on.
size INTEGER Combination size.
value DOUBLE PRECISION Assigned value of the combination.

Examples
Refer to the AddOnePlayer Examples.

SortCombination
The SortCombination function takes a table of combinations, generated by either the GenerateCombination
function or a SQL statement, and outputs a table of sorted combinations that can be input to AddOnePlayer.

Usage

SortCombination Syntax
Version 1.1

SELECT * FROM SortCombination (


ON { table | view | (query) } PARTITION BY key
CombinationColumn ('combination_column')
ValueColumn ('value_column')
[ Delimiter ('delimiter') ]
);

Arguments
Argument Category Description
CombinationColumn Required Specifies the name of the input table column that
contains the combinations.
ValueColumn Required Specifies the name of the input table column that
contains the assigned value of each combination.

Delimiter Optional Specifies the character that separates player numbers
in combinations—either ' ' (space, the default), '#', '$',
'%', or '&'.

Input
The following table describes the required columns of the input table. The table can have additional
columns, but the function ignores them.
Table 512: SortCombination Input Table Schema

Column Name Data Type Description


comb VARCHAR Combination, represented by space-separated player numbers. For
example, '1 2' or '2 3 4 5'. The first player is number 1, the second is
number 2, and so on.
value DOUBLE PRECISION Assigned value of the combination.

Output
Table 513: SortCombination Output Table Schema

Column Name Data Type Description


comb VARCHAR Combination, represented by space-separated player numbers. For
example, '1 2' or '2 3 4 5'.
size INTEGER Combination size.
value DOUBLE PRECISION Assigned value of the combination.
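For example, assuming the sort places player numbers in ascending order, an input row with comb '3 1 2'
and value 20000 would be output as comb '1 2 3' with size 3 and value 20000. This canonical ordering is
what allows AddOnePlayer to match comb strings reliably.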

Examples
Refer to the AddOnePlayer Examples.

AddOnePlayer
The AddOnePlayer function takes a table of sorted combinations, generated by either GenerateCombination
or SortCombination, and outputs a table. The AddOnePlayer input and output tables are queried by the SQL
Statements to Compute the Shapley Value.

Usage

AddOnePlayer Syntax
Version 1.0

SELECT * FROM AddOnePlayer (


ON { table | view | (query) } PARTITION BY key
CombinationColumn ('combination_column')
SizeColumn ('size_column')
ValueColumn ('value_column')
NumPlayers ('number_of_players')
[ Delimiter ('delimiter') ]
);

Arguments
Argument Category Description
CombinationColumn Required Specifies the name of the input table column that contains the
combinations.
SizeColumn Required Specifies the name of the input table column that contains the size
of each combination.
ValueColumn Required Specifies the name of the input table column that contains the
characteristic value of each combination.
NumPlayers Required Specifies the number of players in the game, a positive integer.
Delimiter Optional Specifies the character that separates player numbers in
combinations—either ' ' (space, the default), '#', '$', '%', or '&'.

Input
The AddOnePlayer input table has the same schema as the SortCombination output table.

Note:
The comb column values must be unique; otherwise, the Shapley Values will be incorrect.

Output
Table 514: AddOnePlayer Output Table Schema

Column Name Data Type Description


comb1 VARCHAR Combination represented by input table column comb.
comb2 VARCHAR Combination produced by adding one new player to comb1.
player INTEGER Number of the player added to comb1 to produce comb2.
size INTEGER Size of comb1.

value DOUBLE PRECISION Payoff value of comb1.
divisor INTEGER Divisor used to compute the factor that the Shapley value calculation
must use to weigh the effect of adding the player to comb1.
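In both AddOnePlayer examples below, divisor equals the binomial coefficient C(n-1, size), where n is the
NumPlayers value; for instance, in Example 1 (n = 3), every size 1 row has divisor 2 = C(2, 1). Dividing each
marginal payoff by this divisor, and then dividing the per-player sum by n (as the SQL statements below do),
weights each marginal contribution by size! * (n - size - 1)! / n!, which is the weighting in the Shapley value
formula in Background.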

SQL Statements to Compute the Shapley Value


Assume that the AddOnePlayer input and output tables have the names InputTable and OutputTable,
respectively.
To compute a table that contains the Shapley value of each player:
1. Create a table that contains the weighted payoff produced by adding a player to each combination. For
example:

CREATE TABLE stratum (PARTITION KEY(player)) AS


SELECT player,
(InputTable.value - OutputTable.value) / divisor AS partial_value,
OutputTable.size AS size
FROM inputTable INNER JOIN
OutputTable ON (InputTable.comb = OutputTable.comb2);
2. Create a table that contains the partial Shapley value produced when each player is added to a
combination of a given size. For example:

CREATE TABLE stratum_avg (PARTITION KEY(player)) AS


SELECT player, size, SUM(partial_value) AS partial_avg
FROM stratum GROUP BY player, size;
3. Create a table that lists the Shapley value for each player. For example:

CREATE TABLE shapley_values (PARTITION KEY(player)) AS


SELECT player, SUM(partial_avg) / numberOfPlayers AS shapley_value
FROM stratum_avg GROUP BY player;

Alternatively, combine the preceding statements into this statement:

CREATE TABLE shapley_values (PARTITION KEY(player)) AS


SELECT player, SUM(partial_avg) / numberOfPlayers AS shapley_value
FROM (
SELECT player, size, SUM(partial_value) AS partial_avg
FROM (
SELECT player,
(inputTable.value - outputTable.value) / divisor AS partial_value,
outputTable.size AS size
FROM inputTable INNER JOIN
outputTable ON (inputTable.comb=outputTable.comb2)
) AS stratum
GROUP BY player, size
) AS stratum_avg
GROUP BY player;

To normalize the Shapley values, so that their sum is 1:

SELECT player, shapley_value / (


SELECT SUM(shapley_value)
FROM shapley_values
) AS normalized_shapley_values
FROM shapley_values;
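As an optional sanity check, when the input contains payoffs for all coalitions (as in Example 1 below), the
Shapley values should sum to the payoff of the grand coalition (the efficiency property):

SELECT SUM(shapley_value) FROM shapley_values;

In Example 1, this sum is 20000, the total cost of implementing all three projects.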

Examples
• Example 1: Use GenerateCombination and AddOnePlayer
• Example 2: Use nPath to Create Input to GenerateCombination

Example 1: Use GenerateCombination and AddOnePlayer


Assume there are three projects (players), each of which has a capital_cost and an operating_cost, as shown
in the following table. The player position is represented by the three digits shown in the column
project_index. The first digit corresponds to project1, the second digit to project2, etc. The digit '1' indicates
that the project is included, while '0' indicates it is excluded. The column projects_combined also indicates
which projects are included.

Input

Table 515: AddOnePlayer Example 1 Input Table project_cost

serialnum projects_combined project_index capital_cost operating_cost total_cost


1 p_1 100 0 3000 3000
2 p_2 010 10000 3000 13000
3 p_3 001 10000 0 10000
4 p_12 110 10000 4000 14000
5 p_13 101 10000 3000 13000
6 p_23 011 16000 3000 19000
7 p_123 111 16000 4000 20000

Assume that you implemented all three projects and want to know the average contribution of each project
to the total cost (capital_cost + operating_cost). Shapley value calculates that average cost contribution over
all possible orderings of the players.
Given a cost-sharing game, let the players join the game one at a time in a predetermined order. As each
player joins, the number of players to be served increases. The player's cost contribution is its net addition to
cost when it joins (that is, the incremental cost of adding it to the group of players who have already
joined).
The last line in the preceding table that includes all the projects (p_123) is analyzed as follows to derive the
average contribution.
Sharing the Capital Cost of 16000 for p_123:

• p_1 adds nothing to any coalition and therefore must not be charged
• p_2 and p_3 are symmetric players and therefore must pay equally
• The players must share the 16000 capital cost as follows:
∘ p_1 pays 0
∘ p_2 pays 8000
∘ p_3 pays 8000
Sharing the Operating Cost of 4000 for p_123:
• p_3 adds nothing to any coalition and therefore must not be charged
• p_1 and p_2 are symmetric players and therefore must pay equally
• The players must share the 4000 operating cost as follows:
∘ p_1 pays 2000
∘ p_2 pays 2000
∘ p_3 pays 0
Sharing the Total Cost of 20000 for p_123:
• p_1 pays 0 + 2000 = 2000
• p_2 pays 8000 + 2000 = 10000
• p_3 pays 8000 + 0 = 8000
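You can also check these numbers directly from the preceding table by averaging a project's incremental
cost over all 3! = 6 join orders. For p_2, the incremental total costs are:
• order (1,2,3): 14000 - 3000 = 11000
• order (1,3,2): 20000 - 13000 = 7000
• orders (2,1,3) and (2,3,1): 13000 - 0 = 13000 each
• order (3,1,2): 20000 - 13000 = 7000
• order (3,2,1): 19000 - 10000 = 9000
The average, (11000 + 7000 + 13000 + 13000 + 7000 + 9000) / 6, is 10000, matching the value derived above.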
The preceding values are the Shapley values of the projects, which are also obtained in Output from the
analytic function, as the following sections show.

Generate Payoff Tables for Each Combination


Different combinations of projects are generated by running the GenerateCombination function on the
input table. The value column contains the total cost (the sum of capital_cost and operating_cost), as
specified in the SQL-MapReduce call.

SQL-MapReduce Call

CREATE FACT TABLE project_comb_cost (PARTITION KEY(comb)) AS


SELECT * FROM GenerateCombination (
ON (
SELECT project_index, total_cost AS value FROM project_cost
)
);

Output

Table 516: AddOnePlayer Example 1 Output Table (Generate Payoff Tables for Each Combination)

comb size value


1 1 3000
2 1 13000
3 1 10000
12 2 14000

13 2 13000
23 2 19000
123 3 20000

Add One Player to Each Combination


Using the output of the GenerateCombination function, the AddOnePlayer function computes the divisor
value by adding one player to each combination. The divisor is key to the calculation of the final Shapley
numbers.

SQL-MapReduce Call

CREATE FACT TABLE project_addone_cost (PARTITION KEY(comb2)) AS


SELECT * FROM AddOnePlayer (
ON project_comb_cost
CombinationColumn ('comb')
SizeColumn ('size')
ValueColumn ('value')
NumPlayers ('3')
);

Output

Table 517: AddOnePlayer Example 1 Output Table (Add One Player to Each Combination)

comb1 comb2 player size value divisor


1 12 2 1 3000 2
1 13 3 1 3000 2
12 123 3 2 14000 1
13 123 2 2 13000 1
2 12 1 1 13000 2
2 23 3 1 13000 2
23 123 1 2 19000 1
3 13 1 1 10000 2
3 23 2 1 10000 2
1 1 0 0 1
2 2 0 0 1
3 3 0 0 1

Compute Shapley Values
The Shapley number for each project (player) is computed as an average of the sum of incremental
contributions of each project (in this example, three in total), using the combined SQL query shown below.
The results match those derived above.

SQL-MapReduce Call

SELECT player, SUM(partial_avg) / 3 AS shapley_value FROM (


SELECT player, size, SUM(partial_value) AS partial_avg FROM (
SELECT player, (project_comb_cost.value -
project_addone_cost.value) / divisor AS
partial_value, project_addone_cost.size AS size
FROM project_comb_cost INNER JOIN
project_addone_cost ON (project_comb_cost.comb =
project_addone_cost.comb2)
) AS stratum GROUP BY player, size
) AS stratum_avg GROUP BY player ORDER BY player;

Output

Table 518: AddOnePlayer Example 1 Output Table (Shapley Values Computation)

player shapley_value
1 2000
2 10000
3 8000
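Note that the three Shapley values sum to 20000, the total cost of the full project combination p_123.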

Example 2: Use nPath to Create Input to GenerateCombination


This example shows how to calculate Shapley Values using nPath to create the input to the
GenerateCombination function in the Shapley Value feature.

Input
The input data is a simulated click stream.
Table 519: AddOnePlayer Example 2 Input Table Schema

Column Data Type Description


indiv_prod_id INTEGER Product ID.
mat_intractn_dt_ts INTEGER Timestamp.
mat_intractn_typ_cd VARCHAR Event type.
Impact events:
EMLOP
CLKNI
CLKIN

CLKAC
CLKCO
Conversion events:
ACTVD
APEAP
APEST
APESU
FUNDD
REACT
Irrelevant event, which is coded as XXX.

Generate and Load Click-Stream Sequences


Use the following procedure to generate the click-stream sequences and load them into a database table. The
Python script in step 1 generates the event types randomly with equal probability.
1. Generate the click-stream sequences as a .csv file (table.csv) by running this Python script:

#!/usr/bin/python
# Generates a simulated click stream as table.csv.
# Example input values: length (of each click stream) = 100000, partition = 5.
import sys, getopt
import csv
import random

def usage():
    print '\nUsage:-'
    print 'generate_table_data.py -l length -p partition'
    print ' '
    sys.exit(2)

def main(argv):
    length = 0
    partition = 0
    try:
        opts, args = getopt.getopt(argv, "hl:p:", ["length=", "partition="])
    except getopt.GetoptError:
        usage()
    for opt, arg in opts:
        if opt == '-h':
            usage()
        elif opt in ("-l", "--length"):
            length = long(arg)
        elif opt in ("-p", "--partition"):
            partition = long(arg)
        else:
            usage()

    random.seed(50)
    # The twelve event types are generated with equal probability.
    event = ['EMLOP', 'CLKNI', 'CLKIN', 'CLKAC', 'CLKCO', 'ACTVD',
             'APEAP', 'APEST', 'APESU', 'FUNDD', 'REACT', 'XXX']

    f = open('table.csv', 'wb')
    writer = csv.writer(f)
    for indiv_prod_id in range(0, length):
        for mat_intractn_dt_ts in range(0, partition):
            index = random.randint(0, 11)
            writer.writerow([mat_intractn_dt_ts, indiv_prod_id,
                             event[index]])

if __name__ == "__main__":
    main(sys.argv[1:])

# Example invocation: python generate_table_data.py -l 100000 -p 5


2. Create the table atrbtn_table_old_direct_noprsnt in the database:

DROP TABLE IF EXISTS atrbtn_table_old_direct_noprsnt;


CREATE TABLE atrbtn_table_old_direct_noprsnt (
indiv_prod_id INTEGER,
mat_intractn_dt_ts INTEGER,
mat_intractn_typ_cd VARCHAR
) DISTRIBUTE BY HASH(indiv_prod_id);
3. Load the click-stream sequences into the database table using ncluster_loader (with the username and
password for your cluster):

ncluster_loader -h Queen_IP_address -U username


-w password atrbtn_table_old_direct_noprsnt table.csv -D ','

Output

Table 520: AddOnePlayer Example 2 Output Table atrbtn_table_old_direct_noprsnt

indiv_prod_id mat_intractn_dt_ts mat_intractn_typ_cd


0 0 ACTVD
0 1 XXX
0 2 CLKNI
0 3 XXX
0 4 REACT
0 5 APEST
0 6 APESU
0 7 ACTVD
0 8 APEST

0 9 XXX
0 10 CLKNI
0 11 APEAP
0 12 XXX
0 13 APESU
0 14 EMLOP
0 15 XXX
0 16 ACTVD
0 17 ACTVD
0 18 APEST
0 19 XXX
0 20 CLKNI
0 21 APEAP
0 22 EMLOP
0 23 EMLOP
0 24 EMLOP
... ... ...

Use nPath to Generate Conversion Counts

CREATE TABLE atrbtn_old_dr_pth_noprsnt_conv


distribute by hash(num_conv) compress low as
SELECT (IND_EMLOP || IND_CLKNI || IND_CLKIN || IND_CLKAC || IND_CLKCO) AS IND,
num_conv FROM
(SELECT case when num_EMLOP > 0 then '1' ELSE '0' END as IND_EMLOP
, case when num_CLKNI > 0 then '1' ELSE '0' END as IND_CLKNI
, case when num_CLKIN > 0 then '1' ELSE '0' END as IND_CLKIN
, case when num_CLKAC > 0 then '1' ELSE '0' END as IND_CLKAC
, case when num_CLKCO > 0 then '1' ELSE '0' END as IND_CLKCO
, count(conversion_evt) as num_conv
FROM NPATH(
ON atrbtn_table_old_direct_noprsnt
PARTITION BY indiv_prod_id
ORDER BY mat_intractn_dt_ts
MODE(nonoverlapping)
PATTERN ('(B|C|D|E|F)+.(act|ape|aps|apu|fun|rea){1}')
SYMBOLS(mat_intractn_typ_cd = 'EMLOP' as B,
mat_intractn_typ_cd = 'CLKNI' as C,
mat_intractn_typ_cd = 'CLKIN' as D,
mat_intractn_typ_cd = 'CLKAC' as E,
mat_intractn_typ_cd = 'CLKCO' as F,
mat_intractn_typ_cd = 'ACTVD' as act,

mat_intractn_typ_cd = 'APEAP' as ape,
mat_intractn_typ_cd = 'APEST' as aps,
mat_intractn_typ_cd = 'APESU' as apu,
mat_intractn_typ_cd = 'FUNDD' as fun,
mat_intractn_typ_cd = 'REACT' as rea)
RESULT (first(indiv_prod_id of ANY(B,C,D,E,F)) as indiv_prod_id,
count(mat_intractn_typ_cd of B) as num_EMLOP,
count(mat_intractn_typ_cd of C) as num_CLKNI,
count(mat_intractn_typ_cd of D) as num_CLKIN,
count(mat_intractn_typ_cd of E) as num_CLKAC,
count(mat_intractn_typ_cd of F) as num_CLKCO,
LAST_NOTNULL(mat_intractn_typ_cd OF ANY(act,ape,aps,apu,fun,rea))
as conversion_evt,
ACCUMULATE (mat_intractn_typ_cd OF ANY(B,C,D,E,F)) as intractn_path)
)group by 1,2,3,4,5) a;

Output

Table 521: AddOnePlayer Example 2 Output Table atrbtn_old_dr_pth_noprsnt_conv

ind num_conv
00001 13223
00010 13124
00011 2693
00100 13278
00101 2753
00110 2677
00111 920
01000 13167
01001 2664
01010 2607
01011 843
01100 2628
01101 874
01110 931
01111 440
10000 13204
10001 2599
10010 2716
10011 882
10100 2651

10101 901
10110 847
10111 460
11000 2595
11001 850
11010 841
11011 460
11100 859
11101 463
11110 444
11111 351

Use nPath to Generate Total Counts

CREATE TABLE atrbtn_old_dr_pth_noprsnt_tot


distribute by hash(num_tot) compress low as
SELECT (IND_EMLOP || IND_CLKNI || IND_CLKIN || IND_CLKAC || IND_CLKCO) AS IND,
num_tot FROM
(SELECT case when num_EMLOP > 0 then '1' ELSE '0' END as IND_EMLOP
, case when num_CLKNI > 0 then '1' ELSE '0' END as IND_CLKNI
, case when num_CLKIN > 0 then '1' ELSE '0' END as IND_CLKIN
, case when num_CLKAC > 0 then '1' ELSE '0' END as IND_CLKAC
, case when num_CLKCO > 0 then '1' ELSE '0' END as IND_CLKCO
, count(intractn_path) as num_tot
FROM NPATH(
ON atrbtn_table_old_direct_noprsnt
PARTITION BY indiv_prod_id
ORDER BY mat_intractn_dt_ts
MODE(nonoverlapping)
PATTERN ('(B|C|D|E|F)+')
SYMBOLS(mat_intractn_typ_cd = 'EMLOP' as B,
mat_intractn_typ_cd = 'CLKNI' as C,
mat_intractn_typ_cd = 'CLKIN' as D,
mat_intractn_typ_cd = 'CLKAC' as E,
mat_intractn_typ_cd = 'CLKCO' as F)
RESULT (first(indiv_prod_id of ANY(B,C,D,E,F)) as indiv_prod_id,
count(mat_intractn_typ_cd of B) as num_EMLOP,
count(mat_intractn_typ_cd of C) as num_CLKNI,
count(mat_intractn_typ_cd of D) as num_CLKIN,
count(mat_intractn_typ_cd of E) as num_CLKAC,
count(mat_intractn_typ_cd of F) as num_CLKCO,
ACCUMULATE (mat_intractn_typ_cd OF ANY(B,C,D,E,F)) as intractn_path)
)group by 1,2,3,4,5) a;

Output

Table 522: AddOnePlayer Example 2 Output Table atrbtn_old_dr_pth_noprsnt_tot

ind num_tot
00001 00001
00010 15345
00011 3128
00100 15512
00101 3169
00110 3128
00111 1090
01000 15402
01001 3114
01010 3030
01011 981
01100 3038
01101 1008
01110 1087
01111 520
10000 15418
10001 3024
10010 3144
10011 1038
10100 3078
10101 1046
10110 998
10111 536
11000 3003
11001 985
11010 997
11011 535
11100 993
11101 533
11110 514

11111 404

In this simulated case, with five players (impact events), all 31 nonempty combinations of players are seen; the empty set cannot be matched by the pattern. Because the data is generated randomly, each combination is very likely to be observed at least once. In real applications with many more players (dozens), this is very unlikely.
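
To confirm the number of observed combinations (a minimal check against the table just created):

SELECT count(*) FROM atrbtn_old_dr_pth_noprsnt_tot;

This returns 31, one row for each nonempty subset of the five players.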

Compute Shapley Values

CREATE FACT TABLE atrbtn_cvs (PARTITION KEY(comb)) AS
SELECT * FROM GenerateCombination ( ON (
SELECT atrbtn_old_dr_pth_noprsnt_conv.ind,
atrbtn_old_dr_pth_noprsnt_conv.num_conv /
atrbtn_old_dr_pth_noprsnt_tot.num_tot::real AS value
FROM atrbtn_old_dr_pth_noprsnt_conv INNER JOIN atrbtn_old_dr_pth_noprsnt_tot
ON atrbtn_old_dr_pth_noprsnt_conv.ind = atrbtn_old_dr_pth_noprsnt_tot.ind));

Output

Table 523: AddOnePlayer Example 2 Output Table

comb size value


1 1 0.856402
2 1 0.854889
3 1 0.855982
4 1 0.855262
5 1 0.858413
12 2 0.864136
13 2 0.861274
14 2 0.863868
15 2 0.859458
23 2 0.865043
24 2 0.860396
25 2 0.855491
34 2 0.855818
35 2 0.868728
45 2 0.860933
123 3 0.865055
124 3 0.843531

125 3 0.862944
134 3 0.848697
135 3 0.861377
145 3 0.849711
234 3 0.856486
235 3 0.867063
245 3 0.859327
345 3 0.844037
1234 4 0.863813
1235 4 0.868668
1245 4 0.859813
1345 4 0.858209
2345 4 0.846154
12345 5 0.868812
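
Each comb value lists the players in the coalition. For example, comb 12345 is the full coalition; its characteristic value is the conversion rate of paths that touch all five channels, num_conv/num_tot for ind 11111: 351/404 ≈ 0.868812.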

Compute Augmented Characteristic Values Table (ACVS)

CREATE FACT TABLE atrbtn_acvs (PARTITION KEY(comb2)) AS
SELECT * FROM AddOnePlayer (
ON atrbtn_cvs
CombinationColumn ('comb')
SizeColumn ('size')
ValueColumn ('value')
NumPlayers ('5')
);

Output

Table 524: AddOnePlayer Example 2 Output Table

comb1 comb2 player size value divisor


1 12 2 1 0.856402 4
1 13 3 1 0.856402 4
1 14 4 1 0.856402 4
1 15 5 1 0.856402 4
12 123 3 2 0.864136 6
12 124 4 2 0.864136 6
12 125 5 2 0.864136 6

123 1234 4 3 0.865055 4
123 1235 5 3 0.865055 4
1234 12345 5 4 0.863813 1
1235 12345 4 4 0.868668 1
124 1234 3 3 0.843531 4
124 1235 5 3 0.843531 4
1245 12345 3 4 0.859813 1
125 1235 3 3 0.862944 4
125 1245 4 3 0.862944 4
13 123 2 2 0.861274 6
13 134 4 2 0.861274 6
13 135 5 2 0.861274 6
134 1234 2 3 0.848697 4
134 1345 5 3 0.848697 4
1345 12345 2 4 0.858209 1
135 1235 2 3 0.861377 4
135 1345 4 3 0.861377 4
14 124 2 2 0.863868 6
14 134 3 2 0.863868 6
14 145 5 2 0.863868 6
145 1245 2 3 0.849711 4
145 1345 3 3 0.849711 4
15 125 2 2 0.859458 6
15 135 3 2 0.859458 6
15 145 4 2 0.859458 6
2 12 1 1 0.854889 4
2 23 3 1 0.854889 4
2 24 4 1 0.854889 4
2 25 5 1 0.854889 4
23 123 1 2 0.865043 6
23 234 4 2 0.865043 6
23 235 5 2 0.865043 6

234 1234 1 3 0.856486 4
234 2345 5 3 0.856486 4
2345 12345 1 4 0.846154 1
235 1235 1 3 0.867063 4
235 2345 4 3 0.867063 4
24 124 1 2 0.860396 6
24 234 3 2 0.860396 6
24 245 5 2 0.860396 6
245 1245 1 3 0.859327 4
245 2345 3 3 0.859327 4
25 125 1 2 0.855491 6
25 235 3 2 0.855491 6
25 245 4 2 0.855491 6
3 13 1 1 0.855982 4
3 23 2 1 0.855982 4
3 34 4 1 0.855982 4
3 35 5 1 0.855982 4
34 134 1 2 0.855818 6
34 234 2 2 0.855818 6
34 345 5 2 0.855818 6
345 1345 1 3 0.844037 4
345 2345 2 3 0.844037 4
35 135 1 2 0.868728 6
35 235 2 2 0.868728 6
35 345 4 2 0.868728 6
4 14 1 1 0.855262 4
4 24 2 1 0.855262 4
4 34 3 1 0.855262 4
4 45 5 1 0.855262 4
45 145 1 2 0.860933 6
45 245 2 2 0.860933 6
45 345 3 2 0.860933 6

5 15 1 1 0.858413 4
5 25 2 1 0.858413 4
5 35 3 1 0.858413 4
5 45 4 1 0.858413 4
1 1 0 0 1
2 2 0 0 1
3 3 0 0 1
4 4 0 0 1
5 5 0 0 1
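
Each output row pairs a coalition (comb1) with the coalition comb2 formed by adding one player; the final rows, with size 0, represent adding each player to the empty set, whose value is 0. The divisor for a base coalition of size s appears to be C(4, s), the number of size-s coalitions that exclude the added player; the computation below divides each marginal contribution by this value to average within each size stratum.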

Compute Unnormalized Shapley Values from CVS and ACVS Table

SELECT player, SUM(partial_avg) / 5 AS shapley_value FROM (
SELECT player, size, SUM(partial_value) AS partial_avg FROM (
SELECT player, (atrbtn_cvs.value - atrbtn_acvs.value) / divisor AS
partial_value, atrbtn_acvs.size AS size FROM atrbtn_cvs INNER JOIN atrbtn_acvs
ON (atrbtn_cvs.comb = atrbtn_acvs.comb2)) AS stratum
GROUP BY player, size) AS stratum_avg
GROUP BY player ORDER BY player;

Output

Table 525: AddOnePlayer Example 2 Output Table

player shapley_value
1 0.177030985554059
2 0.175257593393326
3 0.174638753135999
4 0.168013821045558
5 0.173870752255122
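
As a sanity check, the five Shapley values sum to approximately 0.8688, matching the characteristic value of the full coalition (comb 12345 in Table 523, 0.868812), as expected from the efficiency property of Shapley values.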

SMAVG

Summary
The SMAVG (simple moving average) function computes the simple moving average over a number of
points in a series.

Background
A Simple Moving Average (SMA) is the unweighted mean of the previous N data points. For example, a 10-day simple moving average of closing price is the mean of the previous 10 days' closing prices.
To calculate the SMA:
1. Compute the arithmetic average of the first N rows, where N is the value of the WindowSize argument.
2. For each subsequent row, compute the new value as:

new_smavg = old_smavg + (pM - pM-N) / N

where pM is the newest data point entering the window and pM-N is the oldest data point leaving it.
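
For example, with WindowSize 3 and the series 10, 20, 30, 40: the first average is (10 + 20 + 30)/3 = 20, and the next value is 20 + (40 - 10)/3 = 30, which equals (20 + 30 + 40)/3.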

Usage

SMAVG Syntax
Version 1.2

SELECT * FROM SMAVG (
ON { table_name | view_name | (query) }
PARTITION BY partition_column
ORDER BY order_by_column
[ TargetColumns ({ 'target_column' | 'target_column_range' }[,...]) ]
[ IncludeFirst ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ WindowSize ('window_size') ]
);

Arguments
Argument Category Description
TargetColumns Optional Specifies the names of the input columns for which the simple moving
average is to be computed. If you omit this argument, then the function
copies all numeric input columns to the output table.
IncludeFirst Optional Specifies whether to output the leading rows for which the simple moving
average is not yet defined (the first window_size - 1 rows). If 'true', the function
returns NULL values in the moving-average column for those rows. The default value is 'false'.
WindowSize Optional Specifies the number of previous values to include in the computation of
the simple moving average. The default value is 10.

Note:
The SMAVG function treats the schema names, table names, and column names as case-insensitive
arguments. If any of these arguments contain capital letters, you must surround each one of them with
double quotation marks. For example:

TargetColumns ('"City"', 'date')

Input
The input table requires only the columns specified by the PARTITION BY clause, ORDER BY clause, and
TargetColumns argument. The table can have additional columns, but the function ignores them.
Table 526: SMAVG Input Table Schema

Column Name Data Type Description


column_name SMALLINT, INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION Column for which to calculate the simple moving average.
partition_column INTEGER, BIGINT, NUMERIC, or VARCHAR Column by which to partition the input. Input data must be partitioned such that each partition contains all rows of the column or columns whose simple moving average is to be calculated.
order_by_column INTEGER, BIGINT, TIME, or TIMESTAMP Column by which to order the input.

Output
Table 527: SMAVG Output Table Schema

Column Name Data Type Description


column_name Same as in input table Input table column for which to calculate the simple moving average. These columns are in the same order in the input and output tables.
column_name_smavg DOUBLE PRECISION Simple moving average for the values in column_name.

Example
This example computes a simple moving average for the price of IBM stock. The input data is a series of
IBM common stock closing prices from 17 May 1961 to 2 November 1962.

Input
Table 528: SMAVG Example Input Table ibm_stock

id name period stockprice


1 IBM 1961-05-17 00:00:00 460
2 IBM 1961-05-18 00:00:00 457

3 IBM 1961-05-19 00:00:00 452
4 IBM 1961-05-22 00:00:00 459
5 IBM 1961-05-23 00:00:00 462
6 IBM 1961-05-24 00:00:00 459
7 IBM 1961-05-25 00:00:00 463
8 IBM 1961-05-26 00:00:00 479
9 IBM 1961-05-29 00:00:00 493
10 IBM 1961-05-31 00:00:00 490
11 IBM 1961-06-01 00:00:00 492
12 IBM 1961-06-02 00:00:00 498
13 IBM 1961-06-05 00:00:00 499
14 IBM 1961-06-06 00:00:00 497
15 IBM 1961-06-07 00:00:00 496
16 IBM 1961-06-08 00:00:00 490
17 IBM 1961-06-09 00:00:00 489
18 IBM 1961-06-12 00:00:00 478
19 IBM 1961-06-13 00:00:00 487
20 IBM 1961-06-14 00:00:00 491
... ... ... ...

SQL-MapReduce Call

SELECT * FROM SMAVG (


ON ibm_stock
PARTITION BY name
ORDER BY period
TargetColumns ('stockprice')
WindowSize ('10')
IncludeFirst ('true')
) ORDER BY period;

Output
Table 529: SMAVG Example Output Table

id name period stockprice stockprice_mavg


1 IBM 1961-05-17 00:00:00 460

2 IBM 1961-05-18 00:00:00 457
3 IBM 1961-05-19 00:00:00 452
4 IBM 1961-05-22 00:00:00 459
5 IBM 1961-05-23 00:00:00 462
6 IBM 1961-05-24 00:00:00 459
7 IBM 1961-05-25 00:00:00 463
8 IBM 1961-05-26 00:00:00 479
9 IBM 1961-05-29 00:00:00 493
10 IBM 1961-05-31 00:00:00 490 467.4
11 IBM 1961-06-01 00:00:00 492 470.59999999999997
12 IBM 1961-06-02 00:00:00 498 474.7
13 IBM 1961-06-05 00:00:00 499 479.4
14 IBM 1961-06-06 00:00:00 497 483.2
15 IBM 1961-06-07 00:00:00 496 486.59999999999997
16 IBM 1961-06-08 00:00:00 490 489.7
17 IBM 1961-06-09 00:00:00 489 492.3
18 IBM 1961-06-12 00:00:00 478 492.2
19 IBM 1961-06-13 00:00:00 487 491.59999999999997
20 IBM 1961-06-14 00:00:00 491 491.7
... ... ... ... ...
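
For example, the first moving average, 467.4, is the mean of the stock prices in rows 1 through 10. The value for row 11 is then 467.4 + (492 - 460)/10 = 470.6, displayed with floating-point rounding as 470.59999999999997.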

Support Vector Machines


Support Vector Machines (SVMs) are popular classification algorithms. The objective of an SVM is similar
to that of a binary logistic regression algorithm: Given a set of predictor variables, classify an object as having
one of two possible outcomes. However, the methods by which the two types of algorithms achieve this
objective are very different.
A binary logistic regression algorithm develops a probabilistic model from a training data set. Then, given
test instance x, the algorithm estimates the probability that x belongs in a particular class.
An SVM takes a training data set and seeks the boundary that maximizes the distance between the two
classes. Then, given test instance x, the SVM determines the side of the boundary on which x lies, to predict
its class.
Aster Analytics has two versions of SVM classifiers:
• SparseSVM functions use a linear kernel method for input in sparse format.
• DenseSVM functions can use linear or nonlinear kernel methods for input in dense format.

These SVM functions, though binary, can classify objects into more than two classes by using the machine-learning reduction technique one-against-all. One binary SVM is trained for each class; each SVM labels its own class positive and all other classes negative. Each trained SVM then scores each test observation, and the class whose SVM produces the most confident positive prediction is the resulting prediction.
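
In standard one-against-all notation (a sketch; this formula is not from this guide), if f_k(x) denotes the decision value that the binary SVM trained for class k assigns to a test observation x, the predicted class is:

predicted_class(x) = argmax over k of f_k(x)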

SparseSVM Functions
The SparseSVM functions are:
• SparseSVMTrainer, which takes training data and builds a predictive model in binary format
• SparseSVMPredictor, which uses the model to predict the class of each sample in a test data set
• SVMModelPrinter, which displays readable information about the model

The SparseSVMTrainer and SparseSVMPredictor functions are designed for input that is in sparse format; that is, each table row represents one attribute of one sample, and each sample (observation) often consists of many attributes. These functions are suitable for tasks like text classification, in which the number of attributes (unique words) might exceed the maximum number of columns that a table can have.
This implementation of SparseSVM functions solves the primal form of a linear kernel support vector
machine, using gradient descent on the objective function. The implementation is based primarily on
Pegasos: Primal Estimated Sub-Gradient Solver for SVM (by S. Shalev-Shwartz, Y. Singer, and N. Srebro;
presented at ICML 2007).

SparseSVMTrainer

Summary
The SparseSVMTrainer function takes training data (in sparse format) and outputs a predictive model in
binary format, which is input to the functions SparseSVMPredictor and SVMModelPrinter.

Usage

SparseSVMTrainer Syntax
Version 1.1

SELECT * FROM SparseSVMTrainer (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
ModelTable ('model_table')
SampleIDColumn ('id_column')
AttributeColumn ('attribute_column')
[ ValueColumn ('value_column') ]
LabelColumn ('label_column')
[ Cost ('cost') ]
[ Bias ('bias') ]
[ Hash ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ HashBuckets (buckets_number) ]
[ ClassWeights ('class:weight' [,...]) ]
[ MaxStep ('max_step') ]
[ Epsilon ('epsilon') ]
[ Seed ('seed') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the training
samples.
ModelTable Required Specifies the name for the model table that the function creates (which
must not exist).
SampleIDColumn Required Specifies the name of the input_table column that contains the
identifiers of the training samples.
AttributeColumn Required Specifies the name of the input_table column that contains the
attributes of the samples.
ValueColumn Optional Specifies the name of the input_table column that contains the
attribute values. By default, each attribute has the value 1.

LabelColumn Required Specifies the name of the input_table column that contains the classes
of the samples.
Cost Optional The regularization parameter λ in the SVM soft-margin loss function. In the Pegasos formulation on which this implementation is based, the objective minimized is (λ/2)||w||^2 + (1/m) Σ max(0, 1 - y(w · x)), summed over the m training samples (x, y). Must be greater than 0.0. The default value is 1.0.


Bias Optional A non-negative value. If the value is greater than zero, the function converts each sample x in the training set to (x, b); that is, it adds another dimension containing the bias value b. This argument addresses situations where not all samples center at 0. The default value is 0.0.
Hash Optional Specifies whether to use hash projection on attributes. Hash projection
can accelerate processing speed but can slightly decrease accuracy. The
default value is 'false'.

Note:
You must use hash projection if the dataset has more features than
fit into memory.

HashBuckets Optional Valid only if Hash is 'true'. Specifies the number of buckets for hash
projection. In most cases, the function can determine the appropriate
number of buckets from the scale of the input data set. However, if the
dataset has a very large number of features, you might have to specify
buckets_number to accelerate the function.
ClassWeights Optional Specifies the weights for different classes. The format is: 'classlabel m:weight m, classlabel n:weight n'. If a weight for a class is given, the cost parameter for that class is weight * cost. A weight larger than 1 often increases the accuracy of the corresponding class; however, it may decrease global accuracy. Classes not assigned a weight in this argument are assigned a weight of 1.0.
MaxStep Optional A positive integer value that specifies the maximum number of
iterations of the training process. One step means that each sample is
seen once by the trainer. The input value must be in the range (0,
10000]. The default value is 100.
Epsilon Optional Termination criterion. When the difference between the values of the
loss function in two sequential iterations is less than this number, the
function stops. Must be greater than 0.0. The default value is 0.01.
Seed Optional A long integer value used to order the training set randomly and consistently. You can use this value to ensure that the same model is generated if the function is run multiple times in a given database with the same arguments. The input value must be in the range [0, 9223372036854775807]. The default value is 0.

Input
The SparseSVMTrainer function has one required input table, which contains the training data.
Table 530: SparseSVMTrainer Input Table Schema

Column Name Data Type Description


id_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Contains the identifiers of the training samples.
attribute_column INTEGER, SMALLINT, BIGINT, TEXT, VARCHAR, or VARCHAR(n) Contains the attributes of the training samples.
value_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), or NUMERIC(p,a) Contains the attribute values.
label_column INTEGER, SMALLINT, BIGINT, TEXT, VARCHAR, or VARCHAR(n) Contains the classes of the training samples.

Output
The SparseSVMTrainer function outputs console messages and a model table.
Table 531: SparseSVMTrainer Console Message Table Schema

Column Name Data Type Description


message VARCHAR Console message.

The model table, which is input to the function SparseSVMPredictor, is in binary format. To display its
readable content, use the function SVMModelPrinter.

Example

Input
The input data is a table of four iris attributes (sepal length, sepal width, petal length, and petal width),
grouped into three categories (setosa, versicolor, and virginica):
Table 532: SparseSVMTrainer Example Input Table svm_iris

id sepal_length sepal_width petal_length petal_width species


1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa

5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3 1.4 0.1 setosa
14 4.3 3 1.1 0.1 setosa
... ... ... ... ... ...

Create Input Table in Sparse Format


To create, from svm_iris, an input table in the sparse format that the SparseSVMTrainer function requires, use the Unpivot function:

CREATE TABLE svm_iris_input DISTRIBUTE BY hash(ID) AS
SELECT id, species, attribute, value :: double AS value FROM Unpivot (
ON svm_iris
ColsToUnpivot ('Sepal_Length', 'Sepal_Width', 'Petal_Length',
'Petal_Width')
ColsToAccumulate ('ID', 'Species')
) ORDER BY ID;

Table 533: SparseSVMTrainer Example Input Table svm_iris_input

id species attribute value


1 setosa sepal_length 5.1
1 setosa sepal_width 3.5
1 setosa petal_length 1.4
1 setosa petal_width 0.2
2 setosa sepal_length 4.9
2 setosa sepal_width 3.0
2 setosa petal_length 1.4
2 setosa petal_width 0.2
3 setosa sepal_length 4.7
3 setosa sepal_width 3.2

3 setosa petal_length 1.3
3 setosa petal_width 0.2
... ... ... ...

Create Training and Testing Tables


From the sparse input table svm_iris_input, create a training table (with 80% of the rows) and a testing table
(with 20% of the rows):

CREATE TABLE svm_iris_input_train AS
SELECT * FROM svm_iris_input WHERE id%5!=0;
CREATE TABLE svm_iris_input_test AS
SELECT * FROM svm_iris_input WHERE id%5=0;
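
To confirm the split sizes (a minimal check; the trainer output below reports that the model is trained with 120 samples):

SELECT count(DISTINCT id) FROM svm_iris_input_train;
SELECT count(DISTINCT id) FROM svm_iris_input_test;

The first query returns 120; the second returns 30.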

The testing table is input to the SparseSVMPredictor function.
Table 534: SparseSVMTrainer Example Input Table svm_iris_input_train

id species attribute value


1 setosa sepal_length 5.1
1 setosa sepal_width 3.5
1 setosa petal_length 1.4
1 setosa petal_width 0.2
2 setosa sepal_length 4.9
2 setosa sepal_width 3.0
2 setosa petal_length 1.4
2 setosa petal_width 0.2
3 setosa sepal_length 4.7
3 setosa sepal_width 3.2
... ... ... ...

SQL-MapReduce Call

SELECT * FROM SparseSVMTrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('svm_iris_input_train')
ModelTable ('svm_iris_model')
SampleIDColumn ('id')
AttributeColumn ('attribute')
LabelColumn ('species')
Valuecolumn('value')

MaxStep (150)
Seed ('0')
);

Output
Table 535: SparseSVMTrainer Example Output Message

message
Model table "svm_iris_model" is created successfully
The model is trained with 120 samples and 4 unique attributes
There are 3 different classes in the training set
The model is not converged after 150 steps with epsilon 0.01, the average value of the loss function for the
training set is 35.46831950524983
The corresponding training parameters are cost:1.0 bias:0.0

SparseSVMPredictor

Summary
The SparseSVMPredictor function takes the model generated by the function SparseSVMTrainer and a set
of test samples (in sparse format) and outputs a prediction for each sample.
This function can be used with real-time applications. Refer to AMLGenerator.

Usage

SparseSVMPredictor Syntax
Version 1.1

SELECT * FROM SparseSVMPredictor (
ON sample_table AS input PARTITION BY id_column
ON model_table AS model DIMENSION
SampleIDColumn ('id_column')
AttributeColumn ('attribute_column')
[ ValueColumn ('value_column') ]
[ AccumulateLabel
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ OutputClassNum ('output_class_number') ]
);

Arguments
Argument Category Description
SampleIDColumn Required Specifies the name of the sample_table column that contains
the identifiers of the test samples. The sample_table must be
partitioned by this column.
AttributeColumn Required Specifies the name of the sample_table column that contains
the attributes of the test samples.
ValueColumn Optional Specifies the name of the sample_table column that contains
the attribute values. By default, each attribute has the value
1.
AccumulateLabel Optional Specifies the names of the sample_table columns to copy to
the output table.
OutputClassNum Optional Valid only for multiple-class models. Specifies the number of class labels to appear in the output table, each with its corresponding prediction confidence. The default value is 1.

Input
The SparseSVMPredictor function has two required input tables:
• The sample table, which contains the test data
The following table describes the required sample table columns. The function ignores any additional
columns, except those specified by the AccumulateLabel argument, which it copies to the output table.
• The model table, which is output by the SparseSVMTrainer function
The model table is in binary format. To display its readable content, use the function SVMModelPrinter.
Table 536: SparseSVMPredictor Sample Table Schema

Column Name Data Type Description


id_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Contains the identifiers of the test samples.
attribute_column INTEGER, SMALLINT, BIGINT, TEXT, VARCHAR, or VARCHAR(n) Contains the attributes of the test samples.
value_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), or NUMERIC(p,a) Contains the attribute values.

Output
The SparseSVMPredictor function outputs a table that contains the predicted class of each test sample.

Table 537: SparseSVMPredictor Output Table Schema

Column Name Data Type Description


id_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Contains the identifiers of the test samples.
predict_value VARCHAR Contains the predicted classes of the test samples.
predict_confidence DOUBLE PRECISION Contains the prediction confidences. Each prediction confidence is a value between 0 and 1, computed as described after this table. The higher the value, the more dependable the prediction.
accumulate_column Any Column copied from the sample table.

The predict_confidence is computed from the decision value Σi (wi * xi), mapped to a value between 0 and 1, where i is the attribute id, xi is the value of attribute i in the sample, and wi is the weight of attribute i.

Example

Input
This example takes two tables as input:
• The binary-format model svm_iris_model (produced by SparseSVMTrainer)
• The test data svm_iris_input_test
Table 538: SparseSVMPredictor Example Input Table svm_iris_input_test

id species attribute value


5 setosa sepal_length 5.0
5 setosa sepal_width 3.6
5 setosa petal_length 1.4
5 setosa petal_width 0.2
10 setosa sepal_length 4.9
10 setosa sepal_width 3.1
10 setosa petal_length 1.5
10 setosa petal_width 0.1

15 setosa sepal_length 5.8
15 setosa sepal_width 4.0
15 setosa petal_length 1.2
15 setosa petal_width 0.2
... ... ... ...

SQL-MapReduce Call

CREATE TABLE svm_iris_predict_out DISTRIBUTE BY hash (id) AS
SELECT * FROM SparseSVMPredictor (
ON svm_iris_input_test AS input PARTITION BY id
ON svm_iris_model AS model dimension
SampleIDColumn ('id')
AttributeColumn ('attribute')
ValueColumn ('value')
AccumulateLabel ('species')
) ORDER BY id;

Output
The query below returns the output shown in the following table:

SELECT * FROM svm_iris_predict_out ORDER BY id;

Table 539: SparseSVMPredictor Example Output Table

id predict_value predict_confidence species


5 setosa 0.867398140899245 setosa
10 setosa 0.819661754676374 setosa
15 setosa 0.927883208386689 setosa
20 setosa 0.864916306271814 setosa
25 setosa 0.762567726157518 setosa
30 setosa 0.793833738774273 setosa
35 setosa 0.81125364470445 setosa
40 setosa 0.843753709530057 setosa
45 setosa 0.797242598289854 setosa
50 setosa 0.846642017701705 setosa
55 versicolor 0.504706688492186 versicolor
60 versicolor 0.271953255632149 versicolor

65 versicolor 0.261217569263106 versicolor
70 versicolor 0.571486181065923 versicolor
75 versicolor 0.508780607775202 versicolor
80 versicolor 0.528403904925176 versicolor
85 virginica 0.357815100538804 versicolor
90 versicolor 0.457613698073775 versicolor
95 versicolor 0.437749794631343 versicolor
100 versicolor 0.396542514919166 versicolor
105 virginica 0.871426508573154 virginica
110 virginica 0.785585474251325 virginica
115 virginica 0.929674476937472 virginica
120 versicolor 0.718867788404435 virginica
125 virginica 0.669828467529083 virginica
130 versicolor 0.708950643747213 virginica
135 versicolor 0.752700333623089 virginica
140 virginica 0.513515135953709 virginica
145 virginica 0.856957955346407 virginica
150 virginica 0.626924808331081 virginica

Check Prediction Accuracy


To check the prediction accuracy:

SELECT (SELECT count(id)
        FROM svm_iris_predict_out
        WHERE predict_value = species)::numeric /
       (SELECT count(id) FROM svm_iris_predict_out) AS prediction_accuracy;

Table 540: SparseSVMPredictor Example Prediction Accuracy

prediction_accuracy
0.86666666666666666667
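
This equals 26/30: of the 30 test samples, the four with ids 85, 120, 130, and 135 are mispredicted.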


SVMModelPrinter

Summary
The SVMModelPrinter function takes the training data and the model generated by the function
SparseSVMTrainer and displays specified information.

Usage

SVMModelPrinter Syntax
Version 1.1

SELECT DISTINCT * FROM SVMModelPrinter (
ON inputtable AS input PARTITION BY ANY
ON modeltable AS model DIMENSION
AttributeColumn ('attribute_column')
[ Summary ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Arguments
Argument Category Description
AttributeColumn Required Specifies the name of the input table column that contains the
attribute names.
Summary Optional Specifies whether the output is a summary of the model. If 'false', the
output is the weight of each attribute in the model. The default value
is 'false'.

Input
The SVMModelPrinter function has two required input tables:
• The input table that contains the training data, described in the Input section of SparseSVMTrainer
• The model table, which is output by the SparseSVMTrainer function in binary format

Output
The SVMModelPrinter function outputs either a summary of the model (if Summary is 'true') or a table that
contains the weight of each attribute in the model.
Table 541: SVMModelPrinter Console Message Table Schema (Summary('true'))

Column Name Data Type Description


message VARCHAR Summary of the model.

Table 542: SVMModelPrinter Output Table Schema (Summary('false'))

Column Name Data Type Description


classid INTEGER Contains the identifiers of the classes of the model attributes.
classlabel VARCHAR Contains the labels of the classes of the model attributes.
attribute VARCHAR Contains the model attributes. The attribute of the bias is '<bias>'.
attributeid INTEGER Contains the identifiers of the model attributes.
weight DOUBLE Contains the weights of the model attributes.
PRECISION

Examples
• Example 1: Summary('true')
• Example 2: Summary('false')

Example 1: Summary('true')

SQL-MapReduce Call

SELECT DISTINCT * FROM SvmModelPrinter(
ON svm_iris_input_train AS input PARTITION BY any
ON svm_iris_model AS model DIMENSION
AttributeColumn ('attribute')
Summary ('true')
);

Output

Table 543: SVMModelPrinter Example 1 Output Message

message
The corresponding training parameters are cost:1.0 bias:0.0
The model is not converged after 150 steps with epsilon 0.01, the average value of the loss function for the
training set is 35.46831950524983
The model is trained with 120 samples and 4 unique attributes
There are 3 different classes in the training set

Example 2: Summary('false')

SQL-MapReduce Call

SELECT DISTINCT * FROM SvmModelPrinter(
ON svm_iris_input_train AS input PARTITION BY any

ON svm_iris_model AS model DIMENSION
AttributeColumn ('attribute')
Summary ('false')
) ORDER BY classid, attributeid;

Output

Table 544: SVMModelPrinter Example 2 Output Table

classid classlabel attribute attributeid weight


0 setosa petal_length 0 -1.07927133381139
0 setosa petal_width 1 -0.558808502012156
0 setosa sepal_length 2 0.293032173674806
0 setosa sepal_width 3 0.565479913733184
0 setosa <bias> 4 0
1 versicolor petal_length 0 0.806536177499727
1 versicolor petal_width 1 -2.37374487610651
1 versicolor sepal_length 2 0.637750658139575
1 versicolor sepal_width 3 -1.5271432393421
1 versicolor <bias> 4 0
2 virginica petal_length 0 1.91938311998953
2 virginica petal_width 1 2.70710226326744
2 virginica sepal_length 2 -1.71970806488612
2 virginica sepal_width 3 -1.33210459268234
2 virginica <bias> 4 0

DenseSVM Functions
The DenseSVM functions are:
• DenseSVMTrainer, which takes training data and builds a predictive model in binary format
• DenseSVMPredictor, which uses the model to predict the class of each sample in a test data set
• DenseSVMModelPrinter, which displays readable information about the model
The DenseSVMTrainer and DenseSVMPredictor functions are designed for input in dense format; that is,
each table column contains values of a single attribute and there is a single row for each sample
(observation).
This implementation of the DenseSVM functions includes a linear SVM based on the Pegasos algorithm and a nonlinear SVM based on the Hash-SVM model described in the paper "Hash-SVM: Scalable Kernel Machines for Large-Scale Visual Classification," by Yadong Mu, Gang Hua, Wei Fan, and Shih-Fu Chang (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6909525).


DenseSVMTrainer

Summary
The DenseSVMTrainer function takes training data in dense format and outputs a predictive model in
binary format, which is input to the functions DenseSVMPredictor and DenseSVMModelPrinter.

Usage

DenseSVMTrainer Syntax
Version 1.1

SELECT * FROM DenseSVMTrainer (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
SampleIdColumn ('InputTable_column')
AttributeColumns
({ 'attribute_column' | 'attribute_column_range' }[,...])
[ KernelFunction ({ 'linear' | 'polynomial' | 'rbf' | 'sigmoid' }) ]
[ Gamma ('double') ]
[ Constant ('double') ]
[ Degree ('integer') ]
[ SubspaceDimension ('integer') ]
[ HashBits ('integer') ]
InputTable ('table_name')
ModelTable ('table_name')
LabelColumn ('InputTable_column')
[ Cost ('double') ]
[ Bias ('double') ]
[ ClassWeights ('string') ]
[ MaxStep ('integer') ]
[ Epsilon ('double') ]
[ Seed ('long') ]
[ OverwriteOutput ('boolean')]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments

Argument Category Description


SampleIdColumn Required Name of the column in the InputTable that contains the
identifier of the training samples.
AttributeColumns Required Specifies all the attribute columns. Attribute columns must
have a numeric value.
KernelFunction Optional Specifies the kernel function that the DenseSVMTrainer function uses to compute the hash function:
• Linear (default)
DenseSVMTrainer uses a Pegasos algorithm to solve the linear SVM.
• Polynomial
DenseSVMTrainer uses a Hash-SVM algorithm.
The polynomial kernel is: γ(u^T v + c)^d
• RBF
DenseSVMTrainer uses a Hash-SVM algorithm.
The RBF kernel is: exp(-γ * ||x - x'||^2)
• Sigmoid
DenseSVMTrainer uses a Hash-SVM algorithm.
The sigmoid kernel is: tanh(γ * u^T v + c)
When DenseSVMTrainer uses a Hash-SVM algorithm, each sample is represented by compact hash bits, over which an inner product is defined to serve as the surrogate of the original nonlinear kernels.
Gamma Optional Use only when KernelFunction is polynomial, RBF, or
sigmoid. A positive double that specifies γ. The minimum
value is 0.0. The default value is 1.0.
Constant Optional Use only when KernelFunction is polynomial or sigmoid. A
double value that specifies c. If KernelFunction is
polynomial, the minimum value is 0.0. The default value is
1.0.
Degree Optional Use only when KernelFunction is polynomial. A positive
integer that specifies the degree (d) of the polynomial kernel.
The input value must be greater than 0. The default value is
2.
SubspaceDimension Optional Valid only if kernel is polynomial, RBF, or sigmoid. A
positive integer that specifies the random subspace
dimension of the basis matrix V obtained by the Gram-
Schmidt process. Because the Gram-Schmidt process cannot
be parallelized, this dimension cannot be too large. Accuracy
increases with higher values of this number, but
computation costs also increase. The input value must be in
the range [1, 2048]. The default value is 256.

HashBits Optional Valid only if kernel is polynomial, RBF, or sigmoid. A
positive integer specifying the number of compact hash bits
used to represent a data point. Accuracy increases with
higher values of this number, but computation costs also
increase. The input value must be in the range [8, 8192]. The
default value is 256.
InputTable Required Name of the table containing the training samples. Each row
consists of a sample_id, a set of attribute values, and a
corresponding label.
ModelTable Required Name for the model table that the function creates.
LabelColumn Required Column that identifies the class of the corresponding
sample. Must be an integer or a string.
Cost Optional The regularization parameter λ in the SVM soft-margin loss function. In the Pegasos formulation on which this implementation is based, the objective minimized is (λ/2)||w||^2 + (1/m) Σ max(0, 1 - y(w · x)), summed over the m training samples (x, y). Must be greater than 0.0. The default value is 1.0.


Bias Optional A nonnegative value. If the value is greater than zero, each sample x in the training set is converted to (x, b); that is, another dimension containing the bias value b is added. This argument addresses situations where not all samples center at 0. The default value is 0.0.
ClassWeights Optional Specifies the weights for different classes. The format is:
classlabel m:weight m, classlabel n:weight
n. If weight for a class is given, the cost parameter for this
class is weight * cost. A weight larger than 1 often increases
the accuracy of the corresponding class; however, it may
decrease global accuracy. Classes not assigned a weight have
default weight 1.0.
MaxStep Optional A positive integer value that specifies the maximum number
of iterations of the training process. One step means that
each sample is seen once by the trainer. The input value
must be in the range (0, 10000]. The default value is 100.
Epsilon Optional Termination criterion. When the difference between the
values of the loss function in two sequential iterations is less
than this number, the function stops. Must be greater than
0.0. The default value is 0.01.
Seed Optional A long integer value used to order the training set randomly
and consistently. This value can be used to ensure that the
same model is generated if the function is run multiple times
in a given database with the same arguments. The input

value must be in the range [0, 9223372036854775807]. The
default value is 0.
OverwriteOutput Optional If true, the function overwrites the output table specified in
the ModelTable argument. The default value is 'false'.

Input
Table 545: DenseSVMTrainer Input Table Schema

Column Name Data Type Description


id_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Identifier of the training samples.
attribute_column DOUBLE Can be more than one column. Each column contains the values of an attribute of the training samples.
label INTEGER or STRING Class of the sample.

Output
The output is a binary table. To display its readable content, use the function DenseSVMModelPrinter.
The resulting model may be different even if the function is run with the same arguments, unless the
function is run on the same database with the seed value set.

Examples
• Input
• Train and Test Set
• Example 1: Linear Model
• Example 2: Polynomial Model
• Example 3: Radial Basis Model (RBF) Model
• Example 4: Sigmoid Model
In all of these examples, the DenseSVMTrainer function creates the model and the DenseSVMPredictor function uses that model on a test set to make a prediction. The polynomial, RBF, and sigmoid models generally obtain better prediction accuracy with higher values of the HashBits and SubspaceDimension arguments. The value of the SubspaceDimension argument cannot be greater than the number of input rows.
You can tune the model using the Cost and Bias arguments. For details on model-specific tuning parameters, refer to the Arguments section.

Input
The well-known iris data set, svm_iris, is used for all examples. The data has values for four attributes (sepal_length, sepal_width, petal_length, and petal_width) and is grouped into three categories (setosa, versicolor, and virginica).
Table 546: DenseSVMTrainer Example Input Table svm_iris

id sepal_length sepal_width petal_length petal_width species


1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3 1.4 0.1 setosa
14 4.3 3 1.1 0.1 setosa
... ... ... ... ... ...

Train and Test Set


The DenseSVM functions require input in dense (pivoted) format. The input data set has 150 rows, which are split into a training set (80%) and a test set (20%). You can perform the split either with the following code or with the Sample function (giving the SampleFraction argument the value 0.8).

DROP TABLE IF EXISTS svm_iris_input;
CREATE TABLE svm_iris_input DISTRIBUTE BY hash(ID) AS
SELECT id, species, attribute, value :: double AS value FROM Unpivot (
ON svm_iris
colsToUnpivot ('Sepal_Length', 'Sepal_Width', 'Petal_Length',
'Petal_Width')
colsToAccumulate ('ID', 'Species')
) ORDER BY ID;
SELECT * FROM svm_iris_input ORDER BY id;
DROP TABLE IF EXISTS svm_iris_train;
DROP TABLE IF EXISTS svm_iris_test;
CREATE TABLE svm_iris_train AS
SELECT * FROM svm_iris WHERE id%5 != 0;

CREATE TABLE svm_iris_test AS
SELECT * FROM svm_iris WHERE id%5 = 0;

The following query returns the output shown in the following table:

SELECT * FROM svm_iris_train ORDER BY id;

Table 547: DenseSVMTrainer Example Train Set Table svm_iris_train

id sepal_length sepal_width petal_length petal_width species


1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
11 5.4 3.7 1.5 0.2 setosa
... ... ... ... ... ...

The following query returns the output shown in the following table:

SELECT * FROM svm_iris_test ORDER BY id;

Table 548: DenseSVMTrainer Example Test Set Table svm_iris_test

id sepal_length sepal_width petal_length petal_width species


5 5 3.6 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
15 5.8 4 1.2 0.2 setosa
20 5.1 3.8 1.5 0.3 setosa
25 4.8 3.4 1.9 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
45 5.1 3.8 1.9 0.4 setosa
50 5 3.3 1.4 0.2 setosa
55 6.5 2.8 4.6 1.5 versicolor

60 5.2 2.7 3.9 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
70 5.6 2.5 3.9 1.1 versicolor
... ... ... ... ... ...

Example 1: Linear Model


The DenseSVM linear model is similar to the model created by the SparseSVM functions.

SQL-MapReduce Call

DROP TABLE IF EXISTS densesvm_iris_linear_model;
SELECT * FROM DenseSVMTrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('svm_iris_train')
ModelTable ('densesvm_iris_linear_model')
SampleIdColumn ('id')
AttributeColumns ('[1:4]')
LabelColumn ('species')
Cost ('1')
Bias ('0')
KernelFunction ('linear')
MaxStep (100)
Seed ('1')
);

Output

Table 549: DenseSVMTrainer Example 1 Output Table

message
Model table "densesvm_iris_linear_model" is created successfully
The model is trained with 120 samples and 4 unique attributes
There are 3 different classes in the training set
The model is not converged after 100 steps with epsilon 0.01, the average value of the loss function for the
training set is 38.035578853191694
The corresponding training parameters are cost:1.0 bias:0.0

Example 2: Polynomial Model

SQL-MapReduce Call

DROP TABLE IF EXISTS densesvm_iris_polynomial_model;
SELECT * FROM DenseSVMTrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('svm_iris_train')
ModelTable ('densesvm_iris_polynomial_model')
SampleIdColumn ('id')
AttributeColumns ('[1:4]')
LabelColumn ('species')
Cost ('1')
Bias ('0')
KernelFunction ('polynomial')
Gamma ('0.1')
HashBits ('512')
Degree ('2')
SubspaceDimension ('120')
MaxStep (100)
Seed ('1')
);

Output

Table 550: DenseSVMTrainer Example 2 Output Table

message
Model table "densesvm_iris_polynomial_model" is created successfully
The model is trained with 120 samples and 512 unique attributes with hash projection
There are 3 different classes in the training set
The model is not converged after 100 steps with epsilon 0.01, the average value of the loss function for the
training set is 12981.195818669565
The corresponding training parameters are cost:1.0 bias:0.0

Example 3: Radial Basis Model (RBF) Model

SQL-MapReduce Call

DROP TABLE IF EXISTS densesvm_iris_rbf_model;
SELECT * FROM DenseSVMTrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('svm_iris_train')
ModelTable ('densesvm_iris_rbf_model')
SampleIdColumn ('id')
AttributeColumns ('[1:4]')
LabelColumn ('species')
Cost ('1')

Bias ('0')
KernelFunction ('rbf')
Gamma ('0.1')
HashBits ('512')
SubspaceDimension ('120')
MaxStep (100)
Seed ('1')
);

Output

Table 551: DenseSVMTrainer Example 3 Output Table

message
Model table "densesvm_iris_rbf_model" is created successfully
The model is trained with 120 samples and 512 unique attributes with hash projection
There are 3 different classes in the training set
The model is converged after 39 steps with epsilon 0.01, the average value of the loss function for the
training set is 16.640287770464468
The corresponding training parameters are cost:1.0 bias:0.0

Example 4: Sigmoid Model

SQL-MapReduce Call

DROP TABLE IF EXISTS densesvm_iris_sigmoid_model;
SELECT * FROM DenseSVMTrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('svm_iris_train')
ModelTable ('densesvm_iris_sigmoid_model')
SampleIdColumn ('id')
AttributeColumns ('[1:4]')
LabelColumn ('species')
Cost ('1')
Bias ('0')
KernelFunction ('sigmoid')
Gamma ('0.1')
HashBits ('512')
SubspaceDimension ('120')
MaxStep (30)
Seed ('1')
);

Output

Table 552: DenseSVMTrainer Example 4 Output Table

message
Model table "densesvm_iris_sigmoid_model" is created successfully
The model is trained with 120 samples and 512 unique attributes with hash projection
There are 3 different classes in the training set
The model is converged after 18 steps with epsilon 0.01, the average value of the loss function for the
training set is 55.1275114033879
The corresponding training parameters are cost:1.0 bias:0.0

Only the models with RBF and sigmoid kernels converged. This may mean that the true boundaries in the data set are hard to capture with a linear or polynomial model.

DenseSVMPredictor

Summary
The DenseSVMPredictor function takes the model generated by the function DenseSVMTrainer and a set of
test samples in dense format and outputs a prediction for each sample.

Usage

DenseSVMPredictor Syntax
Version 1.1

SELECT * FROM DenseSVMPredictor (
ON { table | view | query } AS input PARTITION BY ANY
ON { table | view | query } AS model DIMENSION
AttributeColumns
({ 'attribute_column' | 'attribute_column_range' }[,...])
SampleIdColumn ('input_column')
[ AccumulateLabel
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ OutputClassNum ('integer') ]
);

Arguments
Argument Category Description
AttributeColumns Required Input table columns that contain the attributes of the test
samples. Attribute columns must be numeric (INT, REAL,
BIGINT, SMALLINT, or FLOAT).
SampleIdColumn Required Name of the input table column that contains the identifiers
of the test samples.
AccumulateLabel Optional Columns to be copied from the input table to the output
table.
OutputClassNum Optional Only valid for multiple class models. If the value of this
argument is k, the output table includes k class labels with
corresponding predict_confidence instead of a single
predicted result. The input value must be no less than 1. The
default value is 1.

Input
DenseSVMPredictor takes two input tables, a sample table containing data whose class is to be predicted,
and the model table produced by DenseSVMTrainer.
• The model table is in binary format. To display its readable content, use the function
DenseSVMModelPrinter.
• The schema of the sample table is shown in the following table. The function ignores any additional
columns, except those specified by the AccumulateLabel argument, which it copies to the output table.
Table 553: DenseSVMPredictor Input Sample Table Schema

Column Name Data Type Description


id_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Identifier of the test samples to be predicted.
attribute_column DOUBLE PRECISION Can be more than one column. Each column contains the values of an attribute of the test samples.

Output
DenseSVMPredictor outputs a table containing the predicted class of each test sample. The schema is shown
in the following table.

Table 554: DenseSVMPredictor Output Sample Table Schema

Column Name Data Type Description


id_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Identifier of the test samples.
predict_value STRING Predicted class of the sample.
predict_confidence DOUBLE PRECISION Estimation of the quality of the prediction. The value is between 0 and 1.
accumulate Any Columns copied from the input table.

Note:
The predict_confidence values may be different if the function is run on a different cluster.

Examples
• Input
• Example 1: Linear Model
• Example 2: Polynomial Model
• Example 3: Radial Basis Model (RBF) Model
• Example 4: Sigmoid Model

Input
These examples use the test dataset as input to the DenseSVMPredictor function.

Example 1: Linear Model

SQL-MapReduce Call

SELECT * FROM DenseSVMPredictor (
ON svm_iris_test AS input PARTITION BY ANY
ON densesvm_iris_linear_model AS model DIMENSION
SampleIdColumn ('id')
AttributeColumns ('[1:4]')
AccumulateLabel ('id', 'species')
) ORDER BY id;

Output

Table 555: DenseSVMPredictor Example 1 Output Table

id predict_value predict_confidence species


5 setosa 0.834460442660295 setosa
10 setosa 0.77698179935506 setosa
15 setosa 0.904620370750784 setosa
20 setosa 0.831036067464484 setosa
25 setosa 0.713640318278595 setosa
30 setosa 0.749760136354252 setosa
35 setosa 0.766957250093069 setosa
40 setosa 0.804297108005061 setosa
45 setosa 0.75080175280448 setosa
50 setosa 0.808477367597895 setosa
55 versicolor 0.509380447983795 versicolor
60 virginica 0.437785512959727 versicolor
65 versicolor 0.254388995091932 versicolor
70 versicolor 0.572603245374737 versicolor
75 versicolor 0.508482636653637 versicolor
80 versicolor 0.525738899884489 versicolor
85 virginica 0.643995565386973 versicolor
90 versicolor 0.459132712526973 versicolor
95 versicolor 0.436287581706344 versicolor
100 versicolor 0.39310360317574 versicolor
105 virginica 0.969991307456827 virginica
110 virginica 0.949973903066985 virginica
115 virginica 0.981893518564686 virginica
120 virginica 0.847012885109085 virginica
125 virginica 0.902480100166191 virginica
130 versicolor 0.717661810166786 virginica
135 virginica 0.884508297150359 virginica
140 virginica 0.815440611025279 virginica
145 virginica 0.965899115798588 virginica
150 virginica 0.866724666351753 virginica

The prediction accuracy with the linear model is 90%.
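That is, 27 of the 30 test samples match the species column; the mispredicted samples are ids 60, 85, and 130.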

Example 2: Polynomial Model

SQL-MapReduce Call

SELECT * FROM DenseSVMPredictor (
ON svm_iris_test AS input PARTITION BY ANY
ON densesvm_iris_polynomial_model AS model DIMENSION
SampleIdColumn ('id')
AttributeColumns ('[1:4]')
AccumulateLabel ('id', 'species')
) ORDER BY id;

Output

Table 556: DenseSVMPredictor Example 2 Output Table

id predict_value predict_confidence species


5 setosa 1 setosa
10 setosa 1 setosa
15 setosa 1 setosa
20 setosa 1 setosa
25 setosa 1 setosa
30 setosa 1 setosa
35 setosa 1 setosa
40 setosa 1 setosa
45 setosa 1 setosa
50 setosa 1 setosa
55 virginica 1 versicolor
60 virginica 1 versicolor
65 setosa 1 versicolor
70 versicolor 1 versicolor
75 versicolor 1 versicolor
80 setosa 1 versicolor
85 virginica 1 versicolor
90 virginica 1 versicolor
95 virginica 1 versicolor
100 virginica 1 versicolor

105 virginica 1 virginica
110 virginica 1 virginica
115 virginica 1 virginica
120 virginica 1 virginica
125 virginica 1 virginica
130 virginica 1 virginica
135 virginica 1 virginica
140 virginica 1 virginica
145 virginica 1 virginica
150 virginica 1 virginica

The prediction accuracy with the polynomial model is 73.34%.

Example 3: Radial Basis Model (RBF) Model

SQL-MapReduce Call

SELECT * FROM DenseSVMPredictor (
ON svm_iris_test AS input PARTITION BY ANY
ON densesvm_iris_rbf_model AS model DIMENSION
SampleIdColumn ('id')
AttributeColumns ('[1:4]')
AccumulateLabel ('id', 'species')
) ORDER BY id;

Output

Table 557: DenseSVMPredictor Example 3 Output Table

id predict_value predict_confidence species


5 setosa 0.759462243280868 setosa
10 setosa 0.749956836983807 setosa
15 setosa 0.744312392653372 setosa
20 setosa 0.761685486307989 setosa
25 setosa 0.706123801835284 setosa
30 setosa 0.74546864465768 setosa
35 setosa 0.758375257643847 setosa
40 setosa 0.762861263069852 setosa

45 setosa 0.72063801834686 setosa
50 setosa 0.764070094615493 setosa
55 versicolor 0.718583789317861 versicolor
60 versicolor 0.69705275515147 versicolor
65 versicolor 0.743635327519085 versicolor
70 versicolor 0.774094441049014 versicolor
75 versicolor 0.805953546799106 versicolor
80 versicolor 0.76991045016646 versicolor
85 versicolor 0.626297282449495 versicolor
90 versicolor 0.742172821606089 versicolor
95 versicolor 0.764061283278648 versicolor
100 versicolor 0.778180754495943 versicolor
105 virginica 0.819921814514784 virginica
110 virginica 0.789239991456868 virginica
115 virginica 0.791714097446213 virginica
120 versicolor 0.543320639355163 virginica
125 virginica 0.792639816244284 virginica
130 virginica 0.652651245896941 virginica
135 virginica 0.640968578224784 virginica
140 virginica 0.741382118250009 virginica
145 virginica 0.800886347061692 virginica
150 virginica 0.658977731573751 virginica

The prediction accuracy with the RBF model is 96.67%.
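That is, 29 of the 30 test samples are classified correctly; only id 120 is mispredicted.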

Example 4: Sigmoid Model

SQL-MapReduce Call

SELECT * FROM DenseSVMPredictor (
ON svm_iris_test AS input PARTITION BY ANY
ON densesvm_iris_sigmoid_model AS model DIMENSION
SampleIdColumn ('id')
AttributeColumns ('[1:4]')
AccumulateLabel ('id', 'species')
) ORDER BY id;

Output

Table 558: DenseSVMPredictor Example 4 Output Table

id predict_value predict_confidence species


5 setosa 0.746591233570276 setosa
10 setosa 0.735684055731196 setosa
15 versicolor 0.478705967446573 setosa
20 setosa 0.733761949232838 setosa
25 setosa 0.737952913158837 setosa
30 setosa 0.744839260424954 setosa
35 setosa 0.735684302199767 setosa
40 setosa 0.74610706105241 setosa
45 setosa 0.682566149898358 setosa
50 setosa 0.737950987191321 setosa
55 virginica 0.588477773284318 versicolor
60 setosa 0.650348601535809 versicolor
65 versicolor 0.428915851630996 versicolor
70 setosa 0.469605915151742 versicolor
75 virginica 0.5459264767513 versicolor
80 setosa 0.450354714402099 versicolor
85 versicolor 0.495077171256058 versicolor
90 setosa 0.541238079806321 versicolor
95 versicolor 0.452171681401104 versicolor
100 versicolor 0.503884155446139 versicolor
105 virginica 0.663559257854133 virginica
110 virginica 0.725772461767464 virginica
115 versicolor 0.462751539260052 virginica
120 versicolor 0.495331599436918 virginica
125 virginica 0.707555441131219 virginica
130 virginica 0.713214560807804 virginica
135 virginica 0.534033824524893 virginica
140 virginica 0.707555440928665 virginica
145 virginica 0.707555441266978 virginica
150 virginica 0.455661070255452 virginica

The prediction accuracy with the sigmoid model is 70%.

DenseSVMModelPrinter

Summary
DenseSVMModelPrinter extracts readable information from the model produced by DenseSVMTrainer.
The function can display either a summary of the model training results or a table containing the weights for
each attribute.

Usage

DenseSVMModelPrinter Syntax
Version 1.1

SELECT * FROM DenseSVMModelPrinter (


ON { table | view | query } AS input PARTITION BY ANY
ON { table | view | query } AS model DIMENSION
AttributeColumns ('input_column' [,…] )
[ Summary ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

Arguments
Argument Category Description
AttributeColumns Required Input table columns that contain the attributes of the test samples.
Attribute columns must be numeric (INT, REAL, BIGINT,
SMALLINT, or FLOAT).
Summary Optional If true, the output contains only summary information of the model.
If false, the output contains the weight of each attribute in the
model. The default value is false.

Input
The function takes two input tables. One is the model table produced by DenseSVMTrainer. The other is the
input table to DenseSVMTrainer that was used to produce the model.

Output
The DenseSVMModelPrinter function outputs either a summary of the model (if Summary is 'true') or a
table that contains the weight of each attribute in the model.

Table 559: DenseSVMModelPrinter Console Message Table Schema (Summary('true'))

Column Name Data Type Description


message VARCHAR Summary of the model.

Table 560: DenseSVMModelPrinter Output Table Schema (Summary('false'))

Column Name Data Type Description


classid INTEGER Contains the identifiers of the classes of the model attributes.
classlabel VARCHAR Contains the labels of the classes of the model attributes.
attribute VARCHAR Contains the model attributes. The attribute of the bias is '<bias>'.
attributeid INTEGER Contains the identifiers of the model attributes.
weight DOUBLE Contains the weights of the model attributes.
PRECISION

Example

Input
The RBF model is used as an example. Set the Summary argument to 'false' to output the model
parameters (weights, attributes, and so on). The inputs for this function are the training table
svm_iris_train (from the Train and Test Set section) and the model table densesvm_iris_rbf_model,
generated by the function DenseSVMTrainer.

SQL-MapReduce Call with Summary('false')

SELECT distinct * FROM DenseSvmModelPrinter (


ON svm_iris_train AS input PARTITION BY ANY
ON densesvm_iris_rbf_model AS model DIMENSION
AttributeColumns ('[1:4]')
Summary ('false')
) ORDER BY 5 DESC;

Output
Table 561: DenseSVMModelPrinter Example Output Table

classid classlabel attribute attributeid weight


0 setosa sepal_length 1 0.281050166755443
2 virginica petal_length 3 0.271141503270778
2 virginica <bias> 4 0.233917713610281
2 virginica petal_width 4 0.233917713610281
0 setosa petal_width 4 -0.00892865636672077

0 setosa <bias> 4 -0.00892865636672077
2 virginica sepal_width 2 -0.035103102416801
0 setosa sepal_width 2 -0.0384081558222031
2 virginica sepal_length 1 -0.0552545484922601
1 versicolor sepal_width 2 -0.0790714407369086
1 versicolor <bias> 4 -0.11658738297731
1 versicolor petal_width 4 -0.11658738297731
1 versicolor petal_length 3 -0.125011362810483
0 setosa petal_length 3 -0.152141642400254
1 versicolor sepal_length 1 -0.197845664963323

SQL-MapReduce Call with Summary('true')

SELECT distinct * FROM DenseSvmModelPrinter (


ON svm_iris_train AS input PARTITION BY ANY
ON densesvm_iris_rbf_model AS model DIMENSION
AttributeColumns ('[1:4]')
Summary ('true')
);

Output
Table 562: DenseSVMModelPrinter Example Output Table

message
The model is trained with 120 samples and 512 unique attributes with hash projection
There are 3 different classes in the training set
The model is converged after 28 steps with epsilon 0.01, the average value of the loss function for the
training set is 15.81343373822361
The corresponding training parameters are cost:1.0 bias:0.0

VectorDistance

Summary
The VectorDistance function takes a table of target vectors and a table of reference vectors and returns a
table that contains the distance between each target-reference pair.


Information retrieval and text mining applications use the vector distance between the Term Frequency
Inverse Document Frequency (TF-IDF) representations of two documents to measure the similarity of their
subject matter.

Background
The VectorDistance function computes the distance between each vector in the target table and each vector
in the reference table.

The VectorDistance function supports the following distance measurement algorithms:


• Cosine Similarity
• Euclidean Distance
• Manhattan Distance
• Binary Distance

Cosine Similarity
The cosine similarity between two vectors of an inner product space is the cosine of the angle between them.
The cosine of 0° is 1 and the cosine of any other angle is less than 1. Therefore, the cosine similarity
measures orientation and not magnitude. Regardless of their magnitude, two vectors with the same
orientation have a cosine similarity of 1, two vectors at 90° have a cosine similarity of 0, and two
diametrically opposed vectors have a cosine similarity of -1.

Note:
Cosine similarity is not a proper distance metric, because it does not have the triangle inequality property
and it violates the coincidence axiom (which says that two things separated by zero distance must be
identical).

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and
magnitude as:
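
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \; \sqrt{\sum_{i=1}^{n} B_i^2}}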

Cosine similarity is most commonly used in high-dimensional positive spaces. In positive space, cosine
similarity is often used for the complement, that is:
D_{\cos}(A, B) = 1 - S_{\cos}(A, B)

Euclidean Distance
The Euclidean distance between vectors p and q is the length of the line segment connecting them. If
p = (p_1, p_2, …, p_n) and q = (q_1, q_2, …, q_n) are vectors in Euclidean n-space, then the Euclidean
distance between them is:
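
d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}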

Manhattan Distance
The Manhattan distance (or taxicab distance) between vectors p and q is the sum of the absolute differences
of their Cartesian coordinates. If p = (p_1, p_2, …, p_n) and q = (q_1, q_2, …, q_n) are vectors in an
n-dimensional real vector space with a fixed Cartesian coordinate system, then the Manhattan distance
between them is:
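
d(p, q) = \sum_{i=1}^{n} \left| p_i - q_i \right|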

For example, in the plane, the Manhattan distance between (p_1, p_2) and (q_1, q_2) is |p_1 - q_1| + |p_2 - q_2|.

Binary Distance
The Binary distance between vectors p and q is 1 if the vectors are identical (that is, if they have the same
length and value) and 0 otherwise.

Usage

VectorDistance Syntax
Version 1.1

SELECT * FROM VectorDistance (


ON target_input_table AS target PARTITION BY target_id_column [,...]
ON ref_input_table AS ref DIMENSION
TargetIDColumns
({ 'target_id_column' | 'target_id_column_range' }[,...])
TargetFeatureColumn (feature_column)

[ TargetValueColumn (value_column) ]
[ RefIDColumns ({ 'ref_id_column' | 'ref_id_column_range' }[,...])
[ RefTableSize ({ 'SMALL' | 'LARGE' }) ]
[ RefFeatureColumn (feature_column) ]
[ RefValueColumn (value_column) ]
[ DistanceMeasure (
{ 'cosine' | 'euclidean' | 'manhattan' | 'binary' }[,...])]
[ IgnoreMismatch
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ ReplaceInvalid (
{ 'PositiveInfinity' |'NegativeInfinity' | custom })]
[ TopK ('k') ]
[ MaxDistance ('threshold' [,...]) ]
);

Arguments
Argument Category Description
TargetIDColumns Required Specifies the names of the columns that comprise the target
vector identifier. You must partition the target input table by
these columns and specify them with this argument.
TargetFeatureColumn Required Specifies the name of the column that contains the target
vector feature name (for example, the axis of a 3-D vector).

Note:
An entry with a NULL value in a feature_column is
dropped.

TargetValueColumn Optional Specifies the name of the column that contains the value for
the target vector feature.

Note:
An entry with a NULL value in a value_column is
dropped.

RefIDColumns Optional Specifies the names of the columns that comprise the
reference vector identifier. The default value is the
TargetIDColumns argument value.
RefFeatureColumn Optional Specifies the name of the column that contains the reference
vector feature name. The default value is the
TargetFeatureColumn argument value.

Note:
An entry with a NULL value in a feature_column is
dropped.

RefValueColumn Optional Specifies the name of the column that contains the value for
the reference vector feature. The default value is the
TargetValueColumn argument value.

Note:
An entry with a NULL value in a value_column is
dropped.

RefTableSize Optional Specifies the size of the reference table. Specify 'LARGE' only
if the reference table does not fit in memory. The default
value, 'SMALL', allows faster processing.
DistanceMeasure Optional Specifies the distance measures that the function uses. The
default value is 'cosine'.
IgnoreMismatch Optional Specifies whether to drop mismatched dimensions. The
default value is 'true'. If DistanceMeasure is 'cosine', then
this argument is 'false'.
If you specify 'true', then two vectors with no common
features become two empty vectors when only their
common features are considered, and the function cannot
measure the distance between them.
ReplaceInvalid Optional Specifies the value to return when the function encounters
an infinite value or empty vectors. For custom, you can
supply any DOUBLE PRECISION value. The default value is
'PositiveInfinity'.
TopK Optional Specifies, for each target vector and for each measure, the
maximum number of closest reference vectors to include in
the output table. For k, you can supply any INTEGER value.
The default value is the maximum INTEGER value
(2,147,483,647).
MaxDistance Optional Specifies the maximum distance between a pair of target and
reference vectors. If the distance exceeds the threshold, the
pair does not appear in the output table.
If the DistanceMeasure argument specifies multiple
measures, then the MaxDistance argument must specify a
threshold for each measure. The ith threshold corresponds to
the ith measure. Each threshold can be any DOUBLE
PRECISION value.
If you omit this argument, then the function returns all
results.

Input
The VectorDistance function requires two input tables: target, which contains the target vectors, and ref,
which contains the reference vectors.

Table 563: VectorDistance Target Input Table Schema

Column Name Data Type Description


target_id_column Any Column that is, or is part of, the target vector identifier. The
vector identifier can have multiple columns.
feature Any except Feature name of the vector.
BYTEA
value SMALLINT, Value of the feature.
INTEGER,
BIGINT,
NUMERIC,
or DOUBLE
PRECISION

Table 564: VectorDistance Reference Input Table Schema

Column Name Data Type Description


ref_id_column Any Column that is, or is part of, the reference vector identifier.
The vector identifier can have multiple columns.
feature Any except Feature name of the vector.
BYTEA
value SMALLINT, Value of the feature.
INTEGER,
BIGINT,
NUMERIC,
or DOUBLE
PRECISION

Output
Table 565: VectorDistance Output Table Schema

Column Name Data Type Description


target_target_id_column Data type of Column that is, or is part of, the target vector
target_id_column in identifier. The output table has one such column for
target input table every target_id_column in the target input table.
ref_ref_id_column Data type of Column that is, or is part of, the reference vector
ref_id_column in identifier. The output table has one such column for
reference input every ref_id_column in the reference input table.
table
type VARCHAR Distance measurement type.
distance DOUBLE Distance between the target vector and the reference
PRECISION vector.

Examples
These examples find and compare the Cosine, Euclidean, and Manhattan distances between a reference
vector and each of several target vectors. Short distance implies similarity—the closer a target vector is to the
reference vector, the more similar to the reference vector it is considered to be.
• Input
• Example 1: Default Thresholds
• Example 2: Specified Thresholds

Input
The raw input is mobile telephone user data where each user (who is identified with UserID) has these
attributes (for a specific time period):
• CallDuration—total time spent on telephone calls (in minutes)
• SMS—number of Short Message Service (SMS) messages sent and received
• DataCounter—data consumed (in megabytes)
Table 566: VectorDistance Examples Raw Input Data

UserID CallDuration SMS DataCounter


1 25000 24 4
2 40000 27 5
3 55000 32 7
4 27000 25 5
5 53000 30 5

The CallDuration values are so much higher than the values of the other attributes that they skew the
distribution. Normalizing the raw data to the range [0, 1] solves this problem.
In the following table, the raw input data in the preceding table has been normalized to the range [0, 1] using
the Min-Max normalization technique.
Table 567: VectorDistance Examples Normalized Input Data

UserID CallDuration SMS DataCounter


1 0.0000333 0.1 0.2
2 0.5 0.4 0.4
3 1 0.9 0.8
4 0.01 0.2 0.4
5 0.93 0.7 0.4

This technique transforms the value a (in column A) to the value b in the range [C, D], using this formula:
b = ((a - minimum_value_in_A) / (maximum_value_in_A - minimum_value_in_A)) * (D - C) + C
The following table shows the minimum and maximum values that the formula uses for each input table
column.

Table 568: VectorDistance Examples Minimum and Maximum Values

Column Name Minimum Value Maximum Value


CallDuration 24999 55001
SMS 23 33
DataCounter 3 8
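
For example, applying the formula to the CallDuration value of UserID 2 (a = 40000, C = 0, D = 1) gives:

b = ((40000 - 24999) / (55001 - 24999)) * (1 - 0) + 0 = 15001 / 30002 = 0.5

which matches the value in the normalized input data table.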

From the normalized input data, you choose one or more users to be the reference vector; the remaining
users are the target vectors. The choice of reference vector depends on the application. For example, if the
mobile telephone service is expanding its range to include a new area with similar users, then one or more
typical users (with average or median attribute values) can be the reference vector. When the company has
identified similar users in the new area, it can send them promotional offers.
For these examples, the reference vector is UserID 5. The following two tables are the reference and target
tables for the VectorDistance function.
Table 569: VectorDistance Examples Reference Table ref_mobile_data

UserID Feature Value


5 CallDuration 0.93
5 SMS 0.7
5 DataCounter 0.4

Table 570: VectorDistance Examples Target Table target_mobile_data

UserID Feature Value


1 CallDuration 0.0000333
1 SMS 0.1
1 DataCounter 0.2
2 CallDuration 0.5
2 SMS 0.4
2 DataCounter 0.4
3 CallDuration 1
3 SMS 0.9
3 DataCounter 0.8
4 CallDuration 0.01
4 SMS 0.2
4 DataCounter 0.4

Example 1: Default Thresholds

SQL-MapReduce Call

SELECT * FROM VectorDistance (


ON target_mobile_data AS target PARTITION BY UserID
ON ref_mobile_data AS ref DIMENSION
TargetIDColumns ('UserID')
TargetFeatureColumn ('Feature')
TargetValueColumn ('Value')
DistanceMeasure ('Cosine', 'Euclidean', 'Manhattan')
) ORDER BY Target_UserID;

Output

Table 571: VectorDistance Example 1 Output Table

target_userid ref_userid type distance


1 5 cosine 0.454865178527558
1 5 euclidean 1.12465019517762
1 5 manhattan 1.72996669672284
2 5 cosine 0.0260892301077248
2 5 euclidean 0.524309064791334
2 5 manhattan 0.729999989271164
3 5 cosine 0.0241505454220814
3 5 euclidean 0.452658810804166
3 5 manhattan 0.669999986886978
4 5 cosine 0.438222433743287
4 5 euclidean 1.04709120838197
4 5 manhattan 1.41999999247491

The following table (which is not output by the VectorDistance function) shows the distances of the target
vectors from the reference vector (UserID 5) and their similarity ranks. The shorter the distance, the higher
the similarity rank. In this example, the similarity ranking is the same for all three measures: UserID 3 is
most similar to UserID 5, and UserID 1 is least similar.
Table 572: VectorDistance Example 1 Target Distances from Reference and Similarity Ranks

target_userid Cosine Distance Euclidean Distance Manhattan Distance Similarity Rank


1 0.454865179 1.124650195 1.7299667 4
2 0.02608923 0.524309065 0.72999999 2
3 0.024150545 0.452658811 0.66999999 1

4 0.438222434 1.047091208 1.41999999 3

Example 2: Specified Thresholds


This example uses the MaxDistance argument to limit the output to target vectors within the specified
distances for each measure.

SQL-MapReduce Call

SELECT * FROM VectorDistance (


ON target_mobile_data AS target PARTITION BY UserID
ON ref_mobile_data AS ref DIMENSION
TargetIDColumns ('UserID')
TargetFeatureColumn ('Feature')
TargetValueColumn ('Value')
DistanceMeasure ('Cosine', 'Euclidean', 'Manhattan')
MaxDistance ('0.03', '0.8', '1.0')
) ORDER BY target_userid;

Output
As the following table shows, only UserID 2 and UserID 3 meet the threshold criteria.
Table 573: VectorDistance Example 2 Output Table

target_userid ref_userid type distance


2 5 cosine 0.0260892301077248
2 5 euclidean 0.524309064791334
2 5 manhattan 0.729999989271164
3 5 cosine 0.0241505454220814
3 5 euclidean 0.452658810804166
3 5 manhattan 0.669999986886978

VWAP

Summary
The VWAP function computes the volume-weighted average price of a traded item (usually an equity share)
for each time interval in a series of equal-length time intervals.

Where sum applies to the current time interval:

VWAP = sum(volume*price)/sum(volume)

Usage

VWAP Syntax
Version 1.2

SELECT * FROM VWAP (


ON { table_name | view_name| (query) }
PARTITION BY expression [,...]
ORDER BY date_column
[ Price ('price_column') ]
[ Volume ('volume_column') ]
[ TimeInterval ('number_of_seconds') ]
[ DT ('date_column') ]
);

Arguments
Argument Category Description
Price Optional Specifies the name of the input table column that contains the price
at which the item traded. The default value is 'price'.
Volume Optional Specifies the name of the input table column that contains the
number of units traded in the transaction. The default value is
'volume'.
DT Optional Specifies the name of the input table column that contains the date
and time of the trade.
TimeInterval Optional Specifies the number of seconds in each time interval. The default
value is 0, which makes each row an interval, causing the function to
calculate no averages.

Input
You must partition the input table such that each partition contains all rows of the entity whose volume-
weighted average is to be calculated. For example, if the entity is a particular equity share, then all
transactions of that share must be in the same partition.
You must sort the input data on the DT column in ascending order.
Table 574: VWAP Input Table Schema

Column Name Data Type Description


input_column Any (Optional) Column to be copied to the output table.

price_column NUMERIC, Contains the price at which the item traded.
INTEGER,
BIGINT, or
DOUBLE
PRECISION
volume_column NUMERIC, Contains the number of units traded in the transaction.
INTEGER,
BIGINT, or
DOUBLE
PRECISION
date_column TIMESTAMP Contains the timestamp (date and time) of the trade. The
timestamp of the first row in the partition is considered to be the
start time of the first interval. The next interval starts immediately
after the end of the first, and so on.

Output
Table 575: VWAP Output Table Schema

Column Name Data Type Description


input_column Same as in input Column copied from the input table (if it exists there).
table
timestamp TIMESTAMP Contains the last timestamp in the interval.
vwap DOUBLE Contains the volume-weighted average price for the interval.
PRECISION

Example

Input
Table 576: VWAP Example Input Table stock_vol

id name period stockprice volume


1 IBM 1961-05-17 00:00:00 460 640000
1 IBM 1961-05-18 00:00:00 457 707200
1 IBM 1961-05-19 00:00:00 452 747200
1 IBM 1961-05-22 00:00:00 459 1.3312e+06
1 IBM 1961-05-23 00:00:00 462 1.848e+06
1 IBM 1961-05-24 00:00:00 459 779200
1 IBM 1961-05-25 00:00:00 463 528000

1 IBM 1961-05-26 00:00:00 479 843200
1 IBM 1961-05-29 00:00:00 493 1.728e+06
1 IBM 1961-05-31 00:00:00 490 880000
1 IBM 1961-06-01 00:00:00 492 760000
1 IBM 1961-06-02 00:00:00 498 1.848e+06
1 IBM 1961-06-05 00:00:00 499 1.3472e+06
1 IBM 1961-06-06 00:00:00 497 1.48e+06
1 IBM 1961-06-07 00:00:00 496 1.0912e+06
1 IBM 1961-06-08 00:00:00 490 1.096e+06
1 IBM 1961-06-09 00:00:00 489 760000
1 IBM 1961-06-12 00:00:00 478 904000
1 IBM 1961-06-13 00:00:00 487 1.9552e+06
1 IBM 1961-06-14 00:00:00 491 1.0912e+06
1 IBM 1961-06-15 00:00:00 487 1.3232e+06
1 IBM 1961-06-16 00:00:00 482 376000
1 IBM 1961-06-19 00:00:00 479 339200
1 IBM 1961-06-20 00:00:00 478 640000
1 IBM 1961-06-21 00:00:00 479 640000
2 GE 1961-05-17 00:00:00 68.2502 2.7264e+06
2 GE 1961-05-18 00:00:00 67.7501 1.6704e+06
2 GE 1961-05-19 00:00:00 68.375 1.5168e+06
2 GE 1961-05-22 00:00:00 67.1251 1.8528e+06
2 GE 1961-05-23 00:00:00 67.1251 1.7184e+06
2 GE 1961-05-24 00:00:00 66 1.2672e+06
... ... ... ... ..

SQL-MapReduce Call

SELECT * FROM VWAP (


ON (SELECT * FROM stock_vol) PARTITION BY id ORDER BY period
Price ('stockprice')
Volume ('volume')
DT ('period')
TimeInterval ('432000')
) ORDER BY id;

Output
Because the time interval (432,000 seconds) equals five days, the function groups the input rows into
five-day intervals and calculates the volume-weighted average price for each interval.
Table 577: VWAP Example Output Table

id name timestamp vwap


1 IBM 1961-05-22 00:00:00 457.247
1 IBM 1961-05-26 00:00:00 465.132
1 IBM 1961-06-02 00:00:00 494.12
1 IBM 1961-06-09 00:00:00 494.896
1 IBM 1961-06-16 00:00:00 486
1 IBM 1961-06-21 00:00:00 478.605
2 GE 1961-05-22 00:00:00 67.8986
2 GE 1961-05-26 00:00:00 65.7631
2 GE 1961-06-02 00:00:00 65.8226
2 GE 1961-06-09 00:00:00 66.6285
2 GE 1961-06-16 00:00:00 66.4806
2 GE 1961-06-21 00:00:00 64.1541
3 PG 1961-05-22 00:00:00 18.0168
3 PG 1961-05-26 00:00:00 17.5237
3 PG 1961-06-02 00:00:00 19.2977
3 PG 1961-06-09 00:00:00 22.785
3 PG 1961-06-16 00:00:00 24.1296
3 PG 1961-06-21 00:00:00 24.4742
... ... ... ...

WMAVG

Summary
The WMAVG (weighted moving average) function computes the average over a number of points in a time
series, applying weights to older values. The weights for the older values decrease arithmetically.

Background
A weighted average has multiplying factors that give different weights to different data points.
Mathematically, the moving average is the convolution of the data points with a moving average function. In
technical analysis, a weighted moving average (WMA) has the specific meaning of weights that decrease
arithmetically. In an n-day WMA, the most recent data point has weight n, the second most recent data
point has weight (n - 1), and so on, down to weight 1 for the oldest data point in the window.
Where n is the number of old values to be considered for calculating the new weighted moving average:

\mathrm{Total}_{M+1} = \mathrm{Total}_M + p_{M+1} - p_{M-n+1}
\mathrm{Numerator}_{M+1} = \mathrm{Numerator}_M + n \, p_{M+1} - \mathrm{Total}_M
\mathrm{WMAVG}_{M+1} = \mathrm{Numerator}_{M+1} / \left( n(n+1)/2 \right)

Usage

WMAVG Syntax
Version 1.2

SELECT * FROM WMAVG (


ON { table_name| view_name| (query) }
PARTITION BY partition_column
ORDER BY order_by_column
[ TargetColumns ({ 'target_column' | 'target_column_range' }[,...]) ]
[ WindowSize ('window_size') ]
[ IncludeFirst ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Arguments
Argument Category Description
TargetColumns Optional Specifies the names of the input columns for which the weighted
moving average is to be computed. The function copies these
columns to the output table. If you omit this argument, the
function copies all input columns to the output table but does not
compute any weighted moving averages.
WindowSize Optional Specifies the number of old values to consider for computing the
new weighted moving average. The default value is '10'.
IncludeFirst Optional Specifies whether to output the first window_size - 1 rows. Because
the weighted moving average for those rows is undefined, the
function returns NULL values for their average columns.
The default value is 'false'.

Input
Table 578: WMAVG Input Table Schema

Column Name Data Type Description


partition_column INTEGER, Column by which to partition the input. Input data must be
BIGINT, partitioned such that each partition contains all values
NUMERIC, whose weighted moving average is to be calculated.
or
VARCHAR
order_by_column INTEGER, Column by which to order the input.
BIGINT,
TIME, or
TIMESTAMP
target_column INTEGER, Column whose weighted moving average is to be computed
BIGINT, or (if specified with the TargetColumns argument). The
NUMERIC function copies this column to the output table.

Output
Table 579: WMAVG Output Table Schema

Column Name Data Type Description


input_column Same as in Column copied from the input table.
input table
input_column_wmavg DOUBLE Contains the weighted moving average of input_column.
PRECISION

Example

Input
The input is hypothetical stock price and volume data of three companies IBM, General Electric (GE), and
Procter & Gamble (PG) between the time period of '05/17/1961' and '06/21/1961'. The WMAVG function in
the example calculates the weighted moving average for stockprice and volume for each company.
Data is assumed to be partitioned such that each partition contains all the rows of an entity. For example, if
the weighted moving average of a particular equity share price is required, then all transactions of that equity
share must be part of one partition. It is assumed that the input rows are provided in the correct order.
Table 580: WMAVG Example Input Table stock_data

id name period stockprice volume


1 IBM 1961-05-17 00:00:00 460 640000
1 IBM 1961-05-18 00:00:00 457 707200

1 IBM 1961-05-19 00:00:00 452 747200
1 IBM 1961-05-22 00:00:00 459 1.3312e+06
1 IBM 1961-05-23 00:00:00 462 1.848e+06
1 IBM 1961-05-24 00:00:00 459 779200
1 IBM 1961-05-25 00:00:00 463 528000
1 IBM 1961-05-26 00:00:00 479 843200
1 IBM 1961-05-29 00:00:00 493 1.728e+06
1 IBM 1961-05-31 00:00:00 490 880000
1 IBM 1961-06-01 00:00:00 492 760000
1 IBM 1961-06-02 00:00:00 498 1.848e+06
1 IBM 1961-06-05 00:00:00 499 1.3472e+06
1 IBM 1961-06-06 00:00:00 497 1.48e+06
1 IBM 1961-06-07 00:00:00 496 1.0912e+06
1 IBM 1961-06-08 00:00:00 490 1.096e+06
1 IBM 1961-06-09 00:00:00 489 760000
1 IBM 1961-06-12 00:00:00 478 904000
1 IBM 1961-06-13 00:00:00 487 1.9552e+06
1 IBM 1961-06-14 00:00:00 491 1.0912e+06
1 IBM 1961-06-15 00:00:00 487 1.3232e+06
1 IBM 1961-06-16 00:00:00 482 376000
1 IBM 1961-06-19 00:00:00 479 339200
1 IBM 1961-06-20 00:00:00 478 640000
1 IBM 1961-06-21 00:00:00 479 640000
2 GE 1961-05-17 00:00:00 68.2502 2.7264e+06
2 GE 1961-05-18 00:00:00 67.7501 1.6704e+06
2 GE 1961-05-19 00:00:00 68.375 1.5168e+06
2 GE 1961-05-22 00:00:00 67.1251 1.8528e+06
2 GE 1961-05-23 00:00:00 67.1251 1.7184e+06
2 GE 1961-05-24 00:00:00 66 1.2672e+06
... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM WMAVG (


ON stock_data PARTITION BY id ORDER BY period
TargetColumns ('stockprice', 'volume')
WindowSize ('5')
IncludeFirst ('true')
) ORDER BY id, period;

Output
Because the window size is 5, the values in the stockprice_mavg and volume_mavg columns show the
weighted average of the five most recent rows (days, in this example). Because the IncludeFirst
argument is set to 'true', the first four rows appear in the output even though their average columns
contain NULL values.
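
For example, the first computed stockprice_mavg value (row 1961-05-23) is the weighted average of the
five prices 460, 457, 452, 459, and 462, with the most recent price weighted most heavily:

WMAVG = (5*462 + 4*459 + 3*452 + 2*457 + 1*460) / (5(5+1)/2) = 6876 / 15 = 458.4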
Table 581: WMAVG Example Output Table

id name period stockprice volume stockprice_mavg volume_mavg


1 IBM 1961-05-17 00:00:00 460 640000
1 IBM 1961-05-18 00:00:00 457 707200
1 IBM 1961-05-19 00:00:00 452 747200
1 IBM 1961-05-22 00:00:00 459 1.3312e+06
1 IBM 1961-05-23 00:00:00 462 1.848e+06 458.4 1257386.6666666667
1 IBM 1961-05-24 00:00:00 459 779200 458.73333333333335 1165546.6666666667
1 IBM 1961-05-25 00:00:00 463 528000 460.46666666666664 980693.3333333334
1 IBM 1961-05-26 00:00:00 479 843200 467.1333333333333 912853.3333333334
1 IBM 1961-05-29 00:00:00 493 1.728e+06 476.6666666666667 1133546.6666666667
1 IBM 1961-05-31 00:00:00 490 880000 482.93333333333334 1045120.0
1 IBM 1961-06-01 00:00:00 492 760000 488.0 981226.6666666666
1 IBM 1961-06-02 00:00:00 498 1.848e+06 492.8666666666667 1281280.0
1 IBM 1961-06-05 00:00:00 499 1.3472e+06 495.73333333333335 1326400.0
... ... ... ... ... ... ...

CHAPTER 6
Text Analysis

Text Analysis
• LDA Functions
• Levenshtein Distance (LDist)
• Naive Bayes Text Classifier
• NER Functions (CRF Model Implementation)
• NER Functions (Max Entropy Model Implementation)
• nGram
• POSTagger
• Sentenizer
• Sentiment Extraction Functions
• Text Classifier
• Text_Parser
• TextChunker
• TextMorph
• TextTagging
• TextTokenizer
• TF_IDF

LDA Functions

Summary
The Latent Dirichlet Allocation (LDA) functions are:
• LDATrainer, which uses training data and parameters to build a topic model
• LDAInference, which uses the topic model to estimate the topic distribution in a set of documents
• LDATopicPrinter, which displays the readable information from the topic model

Background
Topic modeling, which is useful in text analysis, assumes that a document consists of multiple abstract topics
with corresponding probabilities, and that each topic emits a list of words with specific probabilities. That is,
each word in a given document is generated by some topic with a probability determined by that topic, and
the probability of each topic is determined by the document.
In the model, a document can contain many topics. For example, it might contain the words “rainy” and
“sunny,” which are related to weather. It might also contain the words “basketball” and “football,” which are
related to sports. If 20% of a document is about weather and the remainder is about sports, there are
probably about 4 times more sports-related words than weather-related words. Topic modeling is used to
obtain the latent factors based on a statistical framework.
Latent Dirichlet Allocation (LDA) is a well-known generative model that was introduced in the article Latent
Dirichlet Allocation (https://www.cs.princeton.edu/%7Eblei/papers/BleiNgJordan2003.pdf). In an LDA
model, the term-topic and topic-document probabilities are modeled with Dirichlet distributions.
As a Bayesian method, the main advantage of LDA is that it is not as susceptible to overfitting and works
well for smaller datasets. LDA has been successfully used in text modeling, content-based image retrieval,
and bioinformatics.

LDATrainer

Summary
The LDATrainer function uses training data and parameters to build a topic model, using an unsupervised
method to estimate the correlation between the topics and words according to the topic number and other
parameters. Optionally, the function generates the topic distributions for each training document.
The function uses an iterative algorithm; therefore, applying it to large data sets with a large number of
topics can be time-consuming.
The function assumes that the model table fits into the memory of the vworkers.

Usage

LDATrainer Syntax
Version 1.1

SELECT * FROM LDATrainer (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')

ModelTable ('model_table')
[ OutputTable ('output_table') ]
TopicNum ('topic_number')
[ Alpha ('alpha') ]
[ Eta ('eta') ]
DocIDColumn ('doc_column')
WordColumn ('word_column')
[ CountColumn ('count_column') ]
[ MaxIterate ('max_iterate') ]
[ ConvergenceDelta ('convergence_delta') ]
[ Seed (seed) ]
[ OutputTopicNum ('topic_number') ]
[ OutputTopicWordNum ('topic_word_number') ]
);

Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the training
documents.
ModelTable Required Specifies the name for the model table that the function creates in the
database. This table must not already exist.
OutputTable Optional Specifies the name of the output table that contains the topic distribution
of each document in the input table, which the function creates in the
database. This table must not already exist. If you omit this argument, the
function does not generate this table.
TopicNum Required Specifies the number of topics for all the documents in the input table, an
INTEGER value in the range [2, 1000].
Alpha Optional Specifies a hyperparameter of the model, the prior smooth parameter for
the topic distribution over documents. As alpha decreases, fewer topics are
associated with each document. The default value is 0.1.
Eta Optional Specifies a hyperparameter of the model, the prior smooth parameter for
the word distribution over topics. As eta decreases, fewer words are
associated with each topic. The default value is 0.1.
DocIDColumn Required Specifies the name of the input column that contains the document
identifiers.
WordColumn Required Specifies the name of the input column that contains the words (one word
in each row).

CountColumn Optional Specifies the name of the input column that contains the count of the
corresponding word in the row, a positive value. By default, the count of
each word is 1.
MaxIterate Optional Specifies the maximum number of iterations to perform if the model does
not converge, a positive INTEGER value. The default value is 50.
ConvergenceDelta Optional Specifies the convergence delta of log perplexity, a NUMERIC value in the
range [0.0, 1.0]. The default value is 1e-4.
Seed Optional Specifies the seed with which to initialize the model, a LONG value. Given
the same seed, cluster configuration, and input table, the function
generates the same model. By default, the function initializes the model
randomly.
OutputTopicNum Optional Ignored unless OutputTable is specified. Specifies the number of top-
weighted topics and their weights to include in the output table for each
training document. The value must be a positive INTEGER. The
default value, 'all', specifies all topics and their weights.
OutputTopicWordNum Optional Ignored unless OutputTable is specified. Specifies the number of top topic
words and their topic identifiers to include in the output table for each
training document. The value must be a positive INTEGER.
The value 'all' specifies all topic words and their topic identifiers. The
default value, 'none', specifies no topic words or topic identifiers.

Note:
The function might produce different results with different Seed settings and cluster configurations.

Input
Table 582: LDATrainer Training Table Schema

Column Name Data Type Description
doc_column INTEGER, SMALLINT, Contains the document identifiers.
BIGINT, NUMERIC,
NUMERIC(p),
NUMERIC(p,a), TEXT,
VARCHAR, VARCHAR(n),
UUID, or BYTEA.
word_column INTEGER, SMALLINT, Contains the words (one word in each row).
BIGINT,
TEXT, VARCHAR, or
VARCHAR(n)
count_column INTEGER, SMALLINT, Optional. Contains the counts of the words. The default
BIGINT, NUMERIC, value is 1.
NUMERIC(p),
NUMERIC(p,a), or
DOUBLE PRECISION

Note:
You can use the output of the function TextTokenizer with the argument OutputByWord('true') as
input to the LDATrainer function. Teradata recommends that you filter out words with very low and
very high frequency, because topics composed of common high-frequency words are not
meaningful in a topic model.

Output
The LDATrainer function outputs a message, a model table, and (optionally) an output table.
Table 583: LDATrainer Output Message Schema

Column Name Data Type Description


message TEXT, Reports the iteration steps and perplexity of the model.
VARCHAR, or The perplexity formula is:
VARCHAR(n)
\mathrm{perplexity} = 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}
where H(p) is the entropy of the distribution.
There is no uniform standard for using perplexity to determine
whether a model is good—perplexity varies with the training
documents. However, you can use perplexity to find the best model for
a specified set of training documents: Generate models for several
subsets of the training documents and then choose the model with the
lowest perplexity.

Table 584: LDATrainer Model Table Schema

Column Name Data Type Description


topicid INTEGER Internally generated topic identifier.
value BYTEA Model in binary format.

Table 585: LDATrainer Output Table Schema

Column Name Data Type Description


docid Same as data Contains document identifiers from the input table.
type of
doc_column in
input table
topicid INTEGER Contains topic identifiers from the model table.
topicweight DOUBLE Contains topic weights.
PRECISION

topicwords TEXT, Optional. Contains topic words, separated by commas.
VARCHAR, or
VARCHAR(n)

Note:
Because the model table contents are in BYTEA format, they are not readable. To view the model,
use the function LDATopicPrinter.

Example

Input
The training table is a log of vehicle complaints. The category column indicates whether the car has been in
a crash.
Table 586: LDATrainer Example Training Table complaints

doc_id text_data category


1 consumer was driving approximately 45 mph hit a deer with the front crash
bumper and then ran into an embankment head-on passenger's side air bag
did deploy hit windshield and deployed outward. driver's side airbag cover
opened but did not inflate it was still folded causing injuries.
2 when vehicle was involved in a crash totalling vehicle driver's side/ crash
passenger's side air bags did not deploy. vehicle was making a left turn and
was hit by a ford f350 traveling about 35 mph on the front passenger's side.
driver hit his head-on the steering wheel. hurt his knee and received neck
and back injuries.
3 consumer has experienced following problems; 1.) both lower ball joints no_crash
wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut
itself off while driving without foot pressing on brake pedal.
... ... ...

The stop words file, stopwords.txt, contains:

a
an
in
is
to
into
was
the
and
this
with
they

but
will

To generate a tokenized, filtered input table for the LDATrainer function, apply the function Text_Parser to
the training table:

SELECT * FROM Text_Parser (


ON complaints
TextColumn ('text_data')
ToLowerCase ('true')
Stemming ('false')
Punctuation ('\[.,?\!\]')
ListPositions ('true')
StopWords ('stopwords.txt')
RemoveStopWords ('true')
Accumulate ('doc_id', 'category')
) ORDER BY doc_id;
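
The preceding SELECT only displays the parsed rows. One way to materialize them as the table
complaints_traintoken used below is to wrap the same call in CREATE TABLE ... AS (a minimal sketch;
the distribution key doc_id is an assumption, not part of the documented example):

CREATE TABLE complaints_traintoken
DISTRIBUTE BY HASH (doc_id) AS
SELECT * FROM Text_Parser (
  ON complaints
  TextColumn ('text_data')
  ToLowerCase ('true')
  Stemming ('false')
  Punctuation ('\[.,?\!\]')
  ListPositions ('true')
  StopWords ('stopwords.txt')
  RemoveStopWords ('true')
  Accumulate ('doc_id', 'category')
);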

The following query returns the output shown in the following table:

SELECT * FROM complaints_traintoken ORDER BY doc_id;

Table 587: LDATrainer Example Tokenized and Filtered Input Table complaints_traintoken

doc_id category token frequency position


1 crash consumer 1 0
1 crash driving 1 2
1 crash approximately 1 3
1 crash 45 1 4
1 crash mph 1 5
1 crash hit 2 6,26
1 crash deer 1 8
1 crash front 1 11
1 crash bumper 1 12
1 crash then 1 14
1 crash ran 1 15
1 crash embankment 1 18
1 crash head-on 1 19
1 crash passenger's 1 20
1 crash side 2 21,32
... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM LDATrainer (


ON (SELECT 1) PARTITION BY 1
InputTable ('complaints_traintoken')
ModelTable ('ldamodel')
OutputTable ('ldaout1')
TopicNum (5)
DocIDColumn ('doc_id')
WordColumn ('token')
CountColumn ('frequency')
MaxIterate (30)
ConvergenceDelta (1e-3)
Seed (2)
);

Output
Table 588: LDATrainer Example Message

message
Outputtable "ldaout1" is created successfully.
Training converged after 7 iterate steps with delta 4.2582574160277766E-5
There are 20 documents with 520 words in the training set, the perplexity is 92.139160

The following query returns the output shown in the following table:

SELECT * FROM ldaout1 ORDER BY docid, topicid;

Table 589: LDATrainer Example Output Table ldaout1

docid topicid topicweight


1 0 0.00442669824335036
1 1 0.00364972124026978
1 2 0.00313760355154859
1 3 0.985083744464884
1 4 0.00370223249994785
2 0 0.00333404274412358
2 1 0.00272130082493761
2 2 0.00322554604431533
2 3 0.986818406648743
... ... ...


LDAInference

Summary
The LDAInference function uses the model table generated by the function LDATrainer to infer the topic
distribution in a set of new documents. You can use the distribution for tasks such as classification and
clustering.
This function can be used with real-time applications. Refer to AMLGenerator.

Usage

LDAInference Syntax
Version 1.1

SELECT * FROM LDAInference (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
ModelTable ('model_table')
OutputTable ('output_table')
DocIDColumn ('doc_column')
WordColumn ('word_column')
[ CountColumn ('count_column') ]
[ OutputTopicNum ('topic_number') ]
[ OutputTopicWordNum ('topic_word_number') ]
);

Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table or view that contains the new documents.
ModelTable Required Specifies the name of the model table generated by the function
LDATrainer.

OutputTable Required Specifies the name of the output table that contains the topic distribution
of each document in the input table, which the function creates in the
database. This table must not already exist.
DocIDColumn Required Specifies the name of the input column that contains the document
identifiers.
WordColumn Required Specifies the name of the input column that contains the words (one word
in each row).
CountColumn Optional Specifies the name of the input column that contains the count of the
corresponding word in the row, a NUMERIC value. The default value is 1.
OutputTopicNum Optional Specifies the number of top-weighted topics and their weights to include
in the output table for each input document. The value must be a
positive INTEGER. The default value, 'all', specifies all topics and their
weights.
OutputTopicWordNum Optional Specifies the number of top topic words and their topic identifiers to
include in the output table for each input document. The value
must be a positive INTEGER. The value 'all' specifies all topic
words and their topic identifiers. The default value, 'none', specifies no
topic words or topic identifiers.

Input
The LDAInference function requires an input table and a model table. Their schemas are the same as those
of the training table and model table of the function LDATrainer.

Output
The LDAInference function output table has the same schema as the output table of the LDATrainer
function.

Example

Input
The input table is a log of vehicle complaints.
Table 590: LDAInference Example Input Table complaints_test

doc_id text_data
1 ELECTRICAL CONTROL MODULE IS SHORTENING OUT, CAUSING THE VEHICLE
TO STALL. ENGINE WILL BECOME TOTALLY INOPERATIVE. CONSUMER HAD TO
CHANGE ALTERNATOR/ BATTERY AND STARTER, AND MODULE REPLACED 4
TIMES, BUT DEFECT STILL OCCURRING CANNOT DETERMINE WHAT IS CAUSING
THE PROBLEM.

2 ABS BRAKES FAIL TO OPERATE PROPERLY, AND AIR BAGS FAILED TO DEPLOY
DURING A CRASH AT APPROX. 28 MPH IMPACT. MANUFACTURER NOTIFIED.
3 WHILE DRIVING AT 60 MPH GAS PEDAL GOT STUCK DUE TO THE RUBBER THAT
IS AROUND THE GAS PEDAL.
4 THERE IS A KNOCKING NOISE COMING FROM THE CATALYITC
CONVERTER ,AND THE VEHICLE IS STALLING. ALSO, HAS PROBLEM WITH THE
STEERING.
5 CONSUMER WAS MAKING A TURN ,DRIVING AT APPROX 5- 10 MPH WHEN
CONSUMER HIT ANOTHER VEHICLE. UPON IMPACT, DUAL AIRBAGS DID NOT
DEPLOY . ALL DAMAGE WAS DONE FROM ENGINE TO TRANSMISSION,TO THE
FRONT OF VEHICLE, AND THE VEHICLE CONSIDERED A TOTAL LOSS.
6 WHEEL BEARING AND HUBS CRACKED, CAUSING THE METAL TO GRIND WHEN
MAKING A RIGHT TURN. ALSO WHEN APPLYING THE BRAKES, PEDAL GOES TO
THE FLOOR, CAUSE UNKNOWN. WAS ADVISED BY MIDAS NOT TO DRIVE
VEHICLE- WHEELE COULD COME OFF.
7 DRIVING ABOUT 5-10 MPH, THE VEHICLE HAD A LOW FRONTAL IMPACT IN
WHICH THE OTHER VEHICLE HAD NO DAMAGES. UPON IMPACT, DRIVER'S AND
THE PASSENGER'S AIR BAGS DID NOT DEPLOY, RESULTING IN INJURIES. PLEASE
PROVIDE FURTHER INFORMATION AND VIN#.
8 THE AIR BAG WARNING LIGHT HAS COME ON. INDICATING AIRBAGS ARE
INOPERATIVE.THEY WERE FIXED ONE AT THE TIME, BUT PROBLEM HAS
REOCCURRED.
9 CONSUMER WAS DRIVING WEST WHEN THE OTHER CAR WAS GOING EAST. THE
OTHER CAR TURNED IN FRONT OF CONSUMER'S VEHICLE, CONSUMER HIT
OTHER VEHICLE AND STARTED TO SPIN AROUND ,COULDN'T STOP, RESULTING
IN A CRASH. UPON IMPACT, AIRBAGS DIDN'T DEPLOY.
10 WHILE DRIVING ABOUT 65 MPH AND THE TRANSMISISON MADE A STRANGE
NOISE, AND THE LEFT FRONT AXLE LOCKED UP. THE DEALER HAS REPAIRED
THE VEHICLE.

This example uses this stop words file, stopwords.txt:

a
an
in
is
to
into
was
the
and
this
with
they
but
will

To generate a tokenized, filtered input table for the LDAInference function, apply the function Text_Parser
to the input table, using the file stopwords.txt:

SELECT * FROM Text_Parser (


ON complaints_test
TextColumn ('text_data')
ToLowerCase ('true')
Stemming ('false')
Punctuation ('\[.,?\!\]')
ListPositions ('true')
StopWords ('stopwords.txt')
RemoveStopWords ('true')
Accumulate ('doc_id')
) ORDER BY doc_id;
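
As with the training data, the parsed rows can be materialized as the table complaints_testtoken used
below (a minimal sketch; the distribution key doc_id is an assumption):

CREATE TABLE complaints_testtoken
DISTRIBUTE BY HASH (doc_id) AS
SELECT * FROM Text_Parser (
  ON complaints_test
  TextColumn ('text_data')
  ToLowerCase ('true')
  Stemming ('false')
  Punctuation ('\[.,?\!\]')
  ListPositions ('true')
  StopWords ('stopwords.txt')
  RemoveStopWords ('true')
  Accumulate ('doc_id')
);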

The following query returns the output shown in the following table:

SELECT * FROM complaints_testtoken ORDER BY doc_id;

Table 591: LDAInference Example Tokenized and Filtered Input Table complaints_testtoken

doc_id token frequency position


1 electrical 1 0
1 control 1 1
1 module 2 2,25
1 shortening 1 4
1 out 1 5
1 causing 2 6,37
1 vehicle 1 8
1 stall 1 10
1 engine 1 11
1 become 1 13
1 totally 1 14
... ... ... ...

SQL-MapReduce Call
The table ldamodel was generated by the function LDATrainer.

SELECT * FROM LDAInference (


ON (SELECT 1) PARTITION BY 1
InputTable ('complaints_testtoken')
ModelTable ('ldamodel')
OutputTable ('ldaout2')
DocIDColumn ('doc_id')

WordColumn ('token')
OutputTopicNum (5)
OutputTopicWordNum (5)
);

Output
Table 592: LDAInference Example Output Message

message
There are 10 valid documents with 153 recognized words in the input, the perplexity is 145.758867
Outputtable "ldaout2" is created successfully.

The following query returns the output shown in the following table:

SELECT * FROM ldaout2 ORDER BY docid, topicid;

Table 593: LDAInference Example Output Table ldaout2

docid topicid topicweight topicwords


1 0 0.00421317772559819 wipers,would,switch,when,on
1 1 0.982025483899112 vehicle,causing,consumer,replaced,which
1 2 0.00449350162478127 vehicle,manufacturer,would,transmission,when
1 3 0.00431128170088637 did,not,deploy,hit,vehicle
1 4 0.0049565550496219 vehicle,side,car,engine,while
2 0 0.237979551143677 wipers,would,switch,when,on
2 1 0.0131031974444299 vehicle,causing,consumer,replaced,which
2 2 0.0980050092051074 vehicle,manufacturer,would,transmission,when
2 3 0.6322490967996 did,not,deploy,hit,vehicle
2 4 0.0186631454071852 vehicle,side,car,engine,while
... ... ... ...
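
As the LDAInference Summary notes, the inferred topic distribution supports tasks such as classification.
For example, a standard SQL query (illustrative only, not part of the documented example) can label each
document with its highest-weighted topic:

SELECT o.docid, o.topicid, o.topicweight
FROM ldaout2 o
JOIN (
  SELECT docid, MAX(topicweight) AS maxweight
  FROM ldaout2
  GROUP BY docid
) m
ON o.docid = m.docid AND o.topicweight = m.maxweight
ORDER BY o.docid;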

LDATopicPrinter

Summary
The LDATopicPrinter function displays in readable form information from the binary model table
generated by the function LDATrainer.

Usage

LDATopicPrinter Syntax
Version 1.1

SELECT * FROM LDATopicPrinter (


ON model_table_name PARTITION BY 1
[ Summary ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OutputTopicWordNum ('topic_words') ]
[ WordWeight ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ WordCount ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OutputByWord ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Arguments
Argument Category Description
Summary Optional Specifies whether to display only a summary of the information in
the model table. The default value is 'false'.
OutputTopicWordNum Optional Specifies the number of top topic words and their topic identifiers to
include in the output table for each topic. The value must be a
positive INTEGER. The default value, 'all', specifies all topic words
and their topic identifiers.
WordWeight Optional Specifies whether to display the weight (probability of occurrence) of
each unique word in each topic. The weights for the unique words in
each topic are normalized to 1. The default value is 'false'.
WordCount Optional Specifies whether to display the count (number of occurrences) of
each unique word in each topic. Topic distribution is factored into
word counts. The default value is 'false'.
OutputByWord Optional Specifies whether to display each topic-word pair in its own row. The
default value is 'true'. If you specify 'false', each row contains a unique
topic and all words that occur in that topic, separated by commas.

Input
The input to the LDATopicPrinter function is the model table generated by the function LDATrainer.

Output
The LDATopicPrinter function outputs a message and an output table.

Table 594: LDATopicPrinter Output Message Schema

Column Name Data Type Description


message TEXT, Reports this information about the model:
VARCHAR, or Number of topics
VARCHAR(n) Number of unique words (vocabulary size)
Hyperparameter values (specified by LDATrainer arguments Alpha
and Eta)
Number of training documents and words
Perplexity (defined in the Output section of the function LDATrainer)

The schema of the output table depends on the values of the arguments Summary and OutputByWord.
If Summary is 'true', the function outputs only the preceding table.
Table 595: LDATopicPrinter Output Table (Summary('false') and OutputByWord('true'))

Column Name Data Type Description


topicid INTEGER Topic identifier.
word TEXT, Unique word that appears in topic.
VARCHAR, or
VARCHAR(n)
wordweight DOUBLE Optional. Weight (probability of occurrence) of word in topic.
PRECISION Weights for unique words in each topic are normalized to 1.
wordcount DOUBLE Optional. Count (number of occurrences) of word in topic. Topic
PRECISION distribution is factored into word count.

Table 596: LDATopicPrinter Output Table (Summary('false') and OutputByWord('false'))

Column Name Data Type Description


topicid INTEGER Unique topic identifier.
wordsequence TEXT, Unique words that appear in topic, separated by commas.
VARCHAR, or
VARCHAR(n)

Examples
• Input
• Example 1: Summary ('true')
• Example 2: OutputByWord ('false')
• Example 3: WordWeight('true') and WordCount('true')

Input
Model table ldamodel, generated by the LDATrainer Example.

Example 1: Summary ('true')

SQL-MapReduce Call

SELECT * FROM LDATopicPrinter (


ON ldamodel PARTITION BY 1
Summary ('true')
);

Output

Table 597: LDATopicPrinter Example 1 Output Message

message
The model table is trained with the parameters: topicNumber:5, vocabularySize:309, alpha:0.100000, eta:
0.100000
There are 20 documents with 520 words in the training set, the perplexity is 92.139160

Example 2: OutputByWord ('false')

SQL-MapReduce Call

SELECT * FROM LDATopicPrinter (


ON ldamodel PARTITION BY 1
OutputByWord ('false')
OutputTopicWordNum (10)
) ORDER BY topicid;

Output

Table 598: LDATopicPrinter Example 2 Output Table

topicid wordsequence
0 wipers,would,switch,when,on,recall,windshield,notified,manufacturer,dealer
1 vehicle,causing,consumer,replaced,which,module,control,out,has,at
2 vehicle,manufacturer,would,transmission,when,problem,at,has,also,dealer
3 did,not,deploy,hit,vehicle,side,air,passenger's,bags,head-on
4 vehicle,side,car,engine,while,fire,for,from,left,wheel

Example 3: WordWeight('true') and WordCount('true')

SQL-MapReduce Call

SELECT * FROM LDATopicPrinter (


ON ldamodel PARTITION BY 1
OutputByWord ('true')
WordWeight ('true')
WordCount ('true')
) ORDER BY topicid, wordweight desc;

Output

Table 599: LDATopicPrinter Example 3 Output Table

topicid word wordweight wordcount


0 wipers 0.0473832596384327 3.99849732474236
0 would 0.0363395766162982 3.04325478408759
0 switch 0.0358233149828459 2.9985998375998
0 when 0.0249627005285442 2.05919213062327
0 on 0.0248539560903985 2.04978609160819
0 recall 0.0246409921706825 2.03136540755472
0 windshield 0.0243743208776925 2.00829921098562
0 notified 0.013018595230812 1.02606600163296
0 manufacturer 0.0129851924509451 1.02317677018355
0 dealer 0.012895292750406 1.01540073947214
0 has 0.012882546189796 1.01429820357743
0 repaired 0.0128536151128524 1.0117957598376
0 under 0.0128535972733459 1.01179421677851
0 after 0.012853557971549 1.01179081730133
0 by 0.0128080265518535 1.00785249808822
0 work 0.0128035467275753 1.00746500790795
... ... ... ...


Levenshtein Distance (LDist)

Summary
The Levenshtein Distance (LDist) function computes the Levenshtein distance between two text values. The
Levenshtein distance (or edit distance) is the number of edits needed to transform one string into the other.
An edit is an insertion, deletion, or substitution of a single character.
The Levenshtein distance is useful for fuzzy matching of sequences and strings. The LDist function is often
used to resolve a user-entered value to a standard value; for example, when a user enters "Jon Dow" while
searching for "John Doe".
A typical application of the LDist function is genome sequencing.
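For instance, resolving "Jon Dow" to "John Doe" takes two edits: inserting an 'h' yields "John Dow", and substituting the final 'w' with 'e' yields "John Doe", so the Levenshtein distance between the two strings is 2.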

Usage

Levenshtein Distance (LDist) Syntax


Version 1.1

SELECT * FROM LDist (
  ON { table | view | query }
  PARTITION BY { key | ANY } DIMENSION
  TargetColumn ('target_column')
  Source ({ 'source_column' | 'source_column_range' }[,...])
  [ Threshold ('threshold') ]
  [ OutputColumnName ('output_distance_column') ]
  [ OutputTargetColumn ('output_target_column') ]
  [ PrintSourceColumn ('output_source_column') ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TargetColumn  Required  Specifies the name of the input column that contains the target text.
Source  Required  Specifies the names of the input columns that contain the source text.
Threshold  Optional  Specifies the value that determines whether to return the Levenshtein distance for a source-target pair. The threshold must be a positive integer. The function returns the Levenshtein distance for a pair if it is less than or equal to threshold; otherwise, the function returns -1. By default, the function returns the Levenshtein distance of every pair.

OutputColumnName  Optional  Specifies the name of the output column that contains the Levenshtein distance for a source-target pair. The default value is 'distance'.
OutputTargetColumn  Optional  Specifies the name of the output column that contains the compared target text. The default value is 'target'.
PrintSourceColumn  Optional  Specifies the name of the output column that contains the compared source text. The default value is 'source'.
Accumulate  Optional  Specifies the names of the input columns to copy to the output table.
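The output-naming arguments control the names of the three generated columns. For example, the following call is a sketch that reuses the example input table levendist_input shown below; the new column names ldist, target_text, and source_text are illustrative:

SELECT * FROM LDist (
  ON levendist_input
  TargetColumn ('tar_text')
  Source ('src_text1')
  OutputColumnName ('ldist')
  OutputTargetColumn ('target_text')
  PrintSourceColumn ('source_text')
  Accumulate ('id')
) ORDER BY id;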

Input
Table 600: Levenshtein Distance (LDist) Input Table Schema

Column Name  Data Type  Description
source_column  CHAR or VARCHAR  Source text.
target_column  CHAR or VARCHAR  Target text.
accumulate_column  Any  Column to be copied to the output table.

Output
Table 601: Levenshtein Distance (LDist) Output Table Schema

Column Name  Data Type  Description
accumulate_column  Same as in input table  Column copied from the input table.
output_distance_column  VARCHAR  Levenshtein distance of the source-target pair.
output_target_column  VARCHAR  Compared target text.
output_source_column  VARCHAR  Compared source text.

Example

Input
A typical application of this function is genome sequencing, to find differences in base pairs (Adenine(A),
Thymine(T), Cytosine(C), Guanine(G)).

Table 602: Levenshtein Distance (LDist) Example Input Table levendist_input

id src_text1 src_text2 tar_text


1 astre astter aster
2 hone fone phone
3 acqiese acquire acquiesce
4 AAAACCCCCGGGGA CCCGGGAACCAACC CCAGGGAAACCCAC
5 alice allen allies
6 angela angle angels
7 senter center centre
8 chef cheap chief
9 circus circle circuit
10 debt debut debris
11 deal dell lead
12 bare bear bear

SQL-MapReduce Call

SELECT * FROM LDist (
  ON levendist_input
  Source ('src_text1', 'src_text2')
  TargetColumn ('tar_text')
  Threshold (10)
  Accumulate ('id')
) ORDER BY id;

Output
For id 4, the Levenshtein distance between the first source string and the target exceeds the threshold of 10, so the function returns -1 for that pair.
Table 603: Levenshtein Distance (LDist) Example Output Table

id target source distance


1 aster astre 2
1 aster astter 1
2 phone hone 1
2 phone fone 2
3 acquiesce acqiese 2
3 acquiesce acquire 3
4 CCAGGGAAACCCAC AAAACCCCCGGGGA -1
4 CCAGGGAAACCCAC CCCGGGAACCAACC 4

5 allies alice 3
5 allies allen 2
6 angels angela 1
6 angels angle 2
7 centre senter 3
7 centre center 2
8 chief chef 1
8 chief cheap 3
9 circuit circus 2
9 circuit circle 3
10 debris debt 3
10 debris debut 3
11 lead deal 2
11 lead dell 3
12 bear bare 2
12 bear bear 0

Naive Bayes Text Classifier


• Summary
• NaiveBayesTextClassifierTrainer
• NaiveBayesTextClassifierPredict

Summary
The Naive Bayes Text Classifier is a variant of the Naive Bayes classification algorithm that is designed
specifically for document classification.

Note:
For information about the Naive Bayes classification algorithm and functions, refer to the chapter Naive
Bayes.

Naive Bayes Text Classifier executes these functions:


• NaiveBayesTextClassifierTrainer, which generates a model from training data
• NaiveBayesTextClassifierPredict, which uses the model to make predictions about testing data
The preceding functions process tokens, not documents. Therefore, internal functions translate the training
and input documents to tokens and then input the tokens to the functions.


NaiveBayesTextClassifierTrainer

Summary
The NaiveBayesTextClassifierTrainer function takes training data as input and outputs a model table.

Usage

NaiveBayesTextClassifierTrainer Syntax
Version 1.1

SELECT * FROM NaiveBayesTextClassifierTrainer (
  ON (SELECT * FROM NaiveBayesTextClassifierInternal (
    ON token_table AS tokens PARTITION BY category
    [ ON categories_table AS categories DIMENSION ]
    [ ON stop_words_table AS stop_words DIMENSION ]
    TokenColumn ('token_column')
    [ ModelType ({ 'Multinomial' | 'Bernoulli' }) ]
    [ DocIDColumns ({ 'doc_id_column' | 'doc_id_column_range' }[,...]) ]
    DocCategoryColumn ('doc_category_column')
    [ CategoryColumn ('category_column') |
      Categories ('category' [,...]) ]
    [ StopWordsColumn ('stop_words_column') |
      StopWords ('word' [,...]) ]
  )
  ) PARTITION BY 1
);

Arguments
Argument  Category  Description
TokenColumn  Required  Specifies the name of the token_table column that contains the tokens to be classified.
ModelType  Optional  Specifies the model type of the text classifier. The default value is 'Multinomial'. The formulas for the two model types follow this table.
DocIDColumns  Required if ModelType is 'Bernoulli', unnecessary otherwise  Specifies the names of the token_table columns that contain the document identifier.
DocCategoryColumn  Required  Specifies the name of the token_table column that contains the document category.
CategoryColumn  Optional  Specifies the name of the categories_table column that contains the prediction categories. The default value is the first column of categories_table.
Categories  Optional  Specifies the prediction categories.
  Note: Specify either this argument or the categories_table, but not both.
StopWordsColumn  Optional  Specifies the name of the stop_words_table column that contains the stop words. The default value is the first column of stop_words_table.
StopWords  Optional  Specifies words to ignore (such as a, an, and the).
  Note: Specify either this argument or the stop_words_table, but not both.

The Multinomial (default) model formula uses these symbols:

p(Ci|D)  Probability that new document D is classified to category i
TC       Total token count (including duplicate tokens)
Tj       Count of token j in new document D (including duplicate tokens)
TCi      Token count in category i (including duplicate tokens)
TCij     Count of token j in category i (including duplicate tokens)
|V|      Number of unique tokens in training set V

The Bernoulli model formula uses these symbols:

p(Ci|D)  Probability that new document D is classified to category i
DC       Total document count
DCi      Document count in category i
V        Set of unique tokens in the training set
Tk       Token in V that is not in document D
DCij     Document count in category i that contains token j
|C|      Number of unique categories in category set C
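In terms of these symbols, a standard naive Bayes formulation with add-one (Laplace) smoothing is sketched below. The exact smoothing constants the function uses are an assumption here, so treat the sketch as illustrative rather than definitive; the function classifies a document to the category i that maximizes p(Ci|D).

Multinomial:

p(C_i \mid D) \propto \frac{TC_i}{TC} \prod_{j \in D} \left( \frac{TC_{ij} + 1}{TC_i + |V|} \right)^{T_j}

Bernoulli (a smoothed prior is one common way the symbol |C| enters):

p(C_i \mid D) \propto \frac{DC_i + 1}{DC + |C|} \prod_{t_j \in D} \frac{DC_{ij} + 1}{DC_i + 2} \prod_{t_k \notin D} \left( 1 - \frac{DC_{ik} + 1}{DC_i + 2} \right)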

Input
The NaiveBayesTextClassifierTrainer function has one required input table, token, and two optional input
tables, categories and stop_words.
The token table, which contains the classified training tokens, is usually generated by a tokenizing function,
such as TextTokenizer or Text_Parser. The following table describes its schema.
Table 604: NaiveBayesTextClassifierTrainer Token Table Schema

Column Name  Data Type  Description
doc_id_column  CHARACTER, VARCHAR, TEXT, INTEGER, or SMALLINT  Contains the identifiers of the documents that contain the classified training tokens. The table can have more than one such column.
token_column  CHARACTER, VARCHAR, or TEXT  Contains the classified training tokens.
doc_category_column  CHARACTER, VARCHAR, or TEXT  Contains the categories of the documents that contain the classified training tokens.
  Note: Partition the table by this column.

The categories table contains all possible prediction categories. If you omit this table, then you must specify
all possible prediction categories with the Categories argument.

Table 605: NaiveBayesTextClassifierTrainer Categories Table Schema

Column Name  Data Type  Description
category_column  CHARACTER, VARCHAR, or TEXT  Contains all possible prediction categories.

The stop_words table contains all possible stop words (a, an, the, and so on). If you omit this table, then you
must specify all possible stop words with the StopWords argument.
Table 606: NaiveBayesTextClassifierTrainer Stop_Words Table Schema

Column Name  Data Type  Description
stop_words_column  CHARACTER, VARCHAR, or TEXT  Contains all possible stop words.

Output
The NaiveBayesTextClassifierTrainer function outputs a model table, described in the following table.
Table 607: NaiveBayesTextClassifierTrainer Model Table Schema

Column Name  Data Type  Description
TOKEN  VARCHAR  Contains the classified training tokens.
CATEGORY  VARCHAR  Contains the prediction categories for the tokens.
PROB  DOUBLE PRECISION  Contains the probabilities that the tokens are in the categories.

The following is an example of a model table.


Table 608: NaiveBayesTextClassifierTrainer Model Table Example

TOKEN CATEGORY PROB


ASTER_NAIVE_BAYES_PRIOR_PROB C000008 0.1
ASTER_NAIVE_BAYES_MISSING_TOKEN_PROB C000008 0.0555555555555556
ASTER_NAIVE_BAYES_PRIOR_PROB C000013 0.1
ASTER_NAIVE_BAYES_MISSING_TOKEN_PROB C000013 0.0555555555555556
... ... ...
bank C000008
bank C000013
bank C000020
bank C000022
bank C000024

bank C000007
transportation ... ...
... ... ...

Examples
• English Example
• Chinese Example

English Example

Input
The training table is a log of vehicle complaints. The category column identifies whether the car has been in a
crash.
Table 609: NaiveBayesTextClassifierTrainer English Example Training Table complaints

doc_id text_data category


1 consumer was driving approximately 45 mph hit a deer with the front crash
bumper and then ran into an embankment head-on passenger's side air bag
did deploy hit windshield and deployed outward. driver's side airbag cover
opened but did not inflate it was still folded causing injuries.
2 when vehicle was involved in a crash totalling vehicle driver's side/ crash
passenger's side air bags did not deploy. vehicle was making a left turn and
was hit by a ford f350 traveling about 35 mph on the front passenger's side.
driver hit his head-on the steering wheel. hurt his knee and received neck
and back injuries.
3 consumer has experienced following problems; 1.) both lower ball joints no_crash
wear out excessively; 2.) head gasket leaks; and 3.) cruise control would
shut itself off while driving without foot pressing on brake pedal.
... ... ...

The TextTokenizer function is invoked within the NaiveBayesTextClassifierTrainer function, as shown below, to generate the model table complaints_tokens_model.

SQL Statement to Create Model Table

CREATE DIMENSION TABLE complaints_tokens_model AS
SELECT * FROM NaiveBayesTextClassifierTrainer (
  ON (SELECT * FROM NaiveBayesTextClassifierInternal (
       ON (SELECT doc_id, lower(token) AS token, category
           FROM TextTokenizer (
             ON complaints PARTITION BY ANY
             TextColumn ('text_data')
             OutputByWord ('true')
             Accumulate ('doc_id', 'category')
           )
          ) AS "TOKENS" PARTITION BY category
       TokenColumn ('token')
       ModelType ('Bernoulli')
       DocIDColumns ('doc_id')
       DocCategoryColumn ('category')
     )
  ) PARTITION BY 1
) ORDER BY prob;

Output
The following query returns the output shown in the following table:

SELECT * FROM complaints_tokens_model;

Table 610: NaiveBayesTextClassifierTrainer English Example Model Table complaints_tokens_model

token category prob


ASTER_NAIVE_BAYES_TEXT_MODEL_TYPE BERNOULLI 1
been crash 0.285714285714286
been no_crash 0.235294117647059
accurate no_crash 0.117647058823529
joints no_crash 0.117647058823529
shift no_crash 0.117647058823529
about crash 0.285714285714286
about no_crash 0.117647058823529
bag crash 0.285714285714286
... ... ...

Chinese Example
This example uses two files, news.data and stop_words.data. You must install these files onto the database
with the command \install filename.ext.

Input
The training table is a collection of categorized news articles in Simplified Chinese, from news.data.
To create the training table, use this statement:

CREATE FACT TABLE news (
  doc_id VARCHAR(10),
  content TEXT,
  category VARCHAR(8)
) DISTRIBUTE BY HASH(doc_id);

To load the training table with data, use this command:

ncluster_loader -h queen_ip_address -U username -w password news news.data;

NaiveBayesTextClassifierTrainer Chinese Example Training Table news

To create the stop words table, use this statement:

CREATE DIMENSION TABLE stop_words (word TEXT);

To load the stop words table with data from stop_words.data, use this command:

ncluster_loader -h queen_ip_address -U username -w password stop_words stop_words.data;

NaiveBayesTextClassifierTrainer Chinese Example stop_words:

SQL Statement to Create Model Table

CREATE DIMENSION TABLE news_model AS
SELECT * FROM NaiveBayesTextClassifierTrainer (
  ON (SELECT * FROM NaiveBayesTextClassifierInternal (
       ON (SELECT token, category
           FROM TextTokenizer (
             ON news PARTITION BY ANY
             TextColumn ('content')
             OutputByWord ('true')
             Language ('zh_CN')
             Accumulate ('category')
           )
          ) AS "TOKENS" PARTITION BY category
       ON stop_words AS "STOP_WORDS" DIMENSION
       TokenColumn ('token')
       ModelType ('Multinomial')
       DocCategoryColumn ('category')
       StopWordsColumn ('word')
     )
  ) PARTITION BY 1
);

Output
The following query returns the output shown in the following table:

SELECT * FROM news_model;

NaiveBayesTextClassifierTrainer Chinese Example Model Table news_model:


NaiveBayesTextClassifierPredict

Summary
The NaiveBayesTextClassifierPredict function uses the model table generated by the
NaiveBayesTextClassifierTrainer function to predict outcomes for test data.
This function can be used with real-time applications. Refer to AMLGenerator.

Usage

NaiveBayesTextClassifierPredict Syntax
Version 1.1

SELECT * FROM NaiveBayesTextClassifierPredict (
  ON input_table PARTITION BY doc_id_column
  ON model_table DIMENSION
  InputTokenColumn ('token_column')
  [ ModelType ({ 'Multinomial' | 'Bernoulli' }) ]
  [ DocIDColumns ({ 'doc_id_column' | 'doc_id_column_range' }[,...]) ]
  [ ModelTokenColumn ('model_token_column')
    ModelCategoryColumn ('model_category_column')
    ModelProbColumn ('model_token_count_column') ]
  [ TopK ({ num_of_top_k_predictions | 'num_of_top_k_predictions' }) ]
);

Arguments
Argument  Category  Description
InputTokenColumn  Required  Specifies the name of the input_table column that contains the tokens.
ModelType  Optional  Specifies the model type of the text classifier. The default value is 'Multinomial'.
DocIDColumns  Required  Specifies the names of the input_table columns that contain the document identifier.
ModelTokenColumn  Optional  Specifies the name of the model_table column that contains the tokens. The default value is the first column of model_table.
ModelCategoryColumn  Optional  Specifies the name of the model_table column that contains the prediction categories. The default value is the second column of model_table.
ModelProbColumn  Optional  Specifies the name of the model_table column that contains the token counts. The default value is the third column of model_table.

TopK Optional Specifies the number of most likely prediction categories to
output with their loglikelihood values (for example, the top 10
most likely prediction categories). The default is all prediction
categories.

Note:
Specify either all or none of the arguments ModelTokenColumn, ModelCategoryColumn, and
ModelProbColumn.
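If the model table columns do not follow the default order, you can name them explicitly. The following call is a sketch: the pre-tokenized test table tokenized_test is hypothetical, and the model-table column names token, category, and prob match the model table generated in the trainer example:

SELECT * FROM NaiveBayesTextClassifierPredict (
  ON tokenized_test PARTITION BY doc_id
  ON complaints_tokens_model AS "model" DIMENSION
  InputTokenColumn ('token')
  ModelType ('Bernoulli')
  DocIDColumns ('doc_id')
  ModelTokenColumn ('token')
  ModelCategoryColumn ('category')
  ModelProbColumn ('prob')
) ORDER BY doc_id;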

Input
The NaiveBayesTextClassifierPredict function has two required input tables, the model_table output by the
function NaiveBayesTextClassifierTrainer, and input_table, which contains the test data for which to predict
outcomes.
The test data must be in the form of document-token pairs (as in the following table). To transform the
input documents into this form, input them to the function TextTokenizer or Text_Parser.
Table 611: NaiveBayesTextClassifierPredict Input Table Schema

Column Name  Data Type  Description
doc_id_column  CHARACTER, VARCHAR, TEXT, INTEGER, or SMALLINT  Contains the identifiers of the documents that contain the test tokens. A document identifier can span multiple columns; therefore, the table can have more than one such column.
token_column  CHARACTER, VARCHAR, or TEXT  Contains the test tokens.

Output
The NaiveBayesTextClassifierPredict function outputs a table of predictions for the test data.
Table 612: NaiveBayesTextClassifierPredict Output Table Schema

Column Name  Data Type  Description
DOC_ID  CHARACTER, VARCHAR, TEXT, INTEGER, or SMALLINT  Contains the single- or multiple-column document identifier.
PREDICTION  VARCHAR  Contains prediction categories.
LOGLIK  DOUBLE PRECISION  Contains loglikelihoods that the documents belong to the categories.

Examples
• English Example
• Chinese Example

English Example

Input
The input table is a log of vehicle complaints. The example applies TextTokenizer to the complaints_test log
to generate test data, and uses the model complaints_tokens_model, generated by NaiveBayesTextClassifierTrainer.
Table 613: NaiveBayesTextClassifierPredict English Example Input Table complaints_test

doc_id text_data
1 ELECTRICAL CONTROL MODULE IS SHORTENING OUT, CAUSING THE VEHICLE TO
STALL. ENGINE WILL BECOME TOTALLY INOPERATIVE. CONSUMER HAD TO
CHANGE ALTERNATOR/ BATTERY AND STARTER, AND MODULE REPLACED 4
TIMES, BUT DEFECT STILL OCCURRING CANNOT DETERMINE WHAT IS CAUSING
THE PROBLEM.
2 ABS BRAKES FAIL TO OPERATE PROPERLY, AND AIR BAGS FAILED TO DEPLOY
DURING A CRASH AT APPROX. 28 MPH IMPACT. MANUFACTURER NOTIFIED.
3 WHILE DRIVING AT 60 MPH GAS PEDAL GOT STUCK DUE TO THE RUBBER THAT IS
AROUND THE GAS PEDAL.
4 THERE IS A KNOCKING NOISE COMING FROM THE CATALYITC CONVERTER ,AND
THE VEHICLE IS STALLING. ALSO, HAS PROBLEM WITH THE STEERING.
5 CONSUMER WAS MAKING A TURN ,DRIVING AT APPROX 5- 10 MPH WHEN
CONSUMER HIT ANOTHER VEHICLE. UPON IMPACT, DUAL AIRBAGS DID NOT
DEPLOY . ALL DAMAGE WAS DONE FROM ENGINE TO TRANSMISSION,TO THE
FRONT OF VEHICLE, AND THE VEHICLE CONSIDERED A TOTAL LOSS.
6 WHEEL BEARING AND HUBS CRACKED, CAUSING THE METAL TO GRIND WHEN
MAKING A RIGHT TURN. ALSO WHEN APPLYING THE BRAKES, PEDAL GOES TO
THE FLOOR, CAUSE UNKNOWN. WAS ADVISED BY MIDAS NOT TO DRIVE
VEHICLE- WHEELE COULD COME OFF.
7 DRIVING ABOUT 5-10 MPH, THE VEHICLE HAD A LOW FRONTAL IMPACT IN
WHICH THE OTHER VEHICLE HAD NO DAMAGES. UPON IMPACT, DRIVER'S AND
THE PASSENGER'S AIR BAGS DID NOT DEPLOY, RESULTING IN INJURIES. PLEASE
PROVIDE FURTHER INFORMATION AND VIN#.
8 THE AIR BAG WARNING LIGHT HAS COME ON. INDICATING AIRBAGS ARE
INOPERATIVE.THEY WERE FIXED ONE AT THE TIME, BUT PROBLEM HAS
REOCCURRED.
9 CONSUMER WAS DRIVING WEST WHEN THE OTHER CAR WAS GOING EAST. THE
OTHER CAR TURNED IN FRONT OF CONSUMER'S VEHICLE, CONSUMER HIT OTHER
VEHICLE AND STARTED TO SPIN AROUND ,COULDN'T STOP, RESULTING IN A
CRASH. UPON IMPACT, AIRBAGS DIDN'T DEPLOY.

10 WHILE DRIVING ABOUT 65 MPH AND THE TRANSMISISON MADE A STRANGE
NOISE, AND THE LEFT FRONT AXLE LOCKED UP. THE DEALER HAS REPAIRED THE
VEHICLE.

SQL-MapReduce Call

SELECT * FROM NaiveBayesTextClassifierPredict (
  ON (SELECT doc_id, lower(token) AS token
      FROM TextTokenizer (
        ON complaints_test PARTITION BY ANY
        TextColumn ('text_data')
        OutputByWord ('true')
        Accumulate ('doc_id')
      )
  ) AS predicts PARTITION BY doc_id
  ON complaints_tokens_model AS "model" DIMENSION
  InputTokenColumn ('token')
  ModelType ('Bernoulli')
  DocIDColumns ('doc_id')
  TopK ('1')
) ORDER BY doc_id;

Output

Table 614: NaiveBayesTextClassifierPredict English Example Output Table

doc_id prediction loglik


1 no_crash -98.5428474553942
2 no_crash -93.588731591964
3 no_crash -74.6281619901653
4 no_crash -80.2178104775681
5 crash -115.803709744721
6 no_crash -116.281532818164
7 crash -111.174594434561
8 no_crash -92.1427644980375
9 no_crash -109.322164262443
10 no_crash -82.820437621875

Chinese Example
This example uses two files, news.data and stop_words.data. You must install these files onto the database
with the command \install filename.ext.

Input
The input table is a collection of categorized news articles in Simplified Chinese, from news.data.
To create the input table, use this statement:

CREATE FACT TABLE news_test (
  doc_id VARCHAR(10),
  content TEXT,
  category VARCHAR(8)
) DISTRIBUTE BY HASH(doc_id);

To load the input table with data, use this command:

ncluster_loader -h queen_ip_address -U username -w password news_test news_test.data;

NaiveBayesTextClassifierPredict Chinese Example Test Data news_test

SQL-MapReduce Call

SELECT doc_id, prediction, loglik
FROM NaiveBayesTextClassifierPredict (
  ON (SELECT doc_id, token, category
      FROM TextTokenizer (
        ON news_test PARTITION BY ANY
        TextColumn ('content')
        OutputByWord ('true')
        Language ('zh_CN')
        Accumulate ('doc_id', 'category')
      )
  ) AS "predicts" PARTITION BY doc_id
  ON news_model AS "model" DIMENSION
  InputTokenColumn ('token')
  ModelType ('Multinomial')
  DocIDColumns ('doc_id')
  TopK ('1')
) ORDER BY 1, 3 DESC, 2;

Output

Table 615: NaiveBayesTextClassifierPredict Chinese Example Output Table

doc_id prediction loglik


C000007_18 C000007 -4923.28359580986
C000007_19 C000007 -1596.10129952798
C000008_18 C000008 -5979.02464596063
C000008_19 C000010 -833.320507790382
C000010_18 C000010 -480.210632735085
C000010_19 C000007 -2846.53595294302
C000013_18 C000020 -8735.82177627438
C000013_19 C000007 -2489.0641900242
C000014_18 C000014 -809.047993943993
C000014_19 C000014 -785.245941548395
C000016_18 C000016 -3108.02074488676
C000016_19 C000016 -1448.22847590781
C000020_18 C000020 -7332.18575906246
C000020_19 C000010 -2330.35384502299
C000022_18 C000022 -1481.35692698142
C000022_19 C000020 -1396.88420115986
C000023_18 C000023 -1694.44629981671
C000023_19 C000023 -1836.41141238941
C000024_18 C000008 -10101.4254565714
C000024_19 C000007 -1043.25992547555

NER Functions (CRF Model Implementation)

Summary
Named entity recognition (NER) is a process for finding specified entities in text. For example, a simple news
named-entity recognizer for English might find the person “John J. Smith” and the location “Seattle” in the
text string “John J. Smith lives in Seattle.”
NER functions let you specify how to extract named entities when training the data models. The Aster
Analytics Foundation provides two sets of NER functions.
The NER functions that use the Conditional Random Fields (CRF) model are:

• NERTrainer, which takes training data and outputs a CRF model (a binary file)
• NER, which takes input documents and extracts specified entities, using one or more CRF models and, if
appropriate, rules (regular expressions) or a dictionary
The function uses models to extract the names of persons, locations, and organizations; rules to extract
entities that conform to rules (such as phone numbers, times, and dates); and a dictionary to extract
known entities.
• NEREvaluator, which evaluates a CRF model
The NER functions that use the Max Entropy Model are documented in NER Functions (Max Entropy
Model Implementation).

NERTrainer

Summary
The NERTrainer function takes training data and outputs a CRF model (a binary file) that can be specified
in the functions NER and NEREvaluator.

Usage

NERTrainer Syntax
Version 1.1

SELECT * FROM NERTrainer (
  ON input_table PARTITION BY 1
  [ Domain ('host:port') ]
  [ Database ('db_name') ]
  [ UserID ('user_id') ]
  [ Password ('password') ]
  [ SSLSettings ('SSLsettings') ]
  [ SSLTrustStorePassword ('SSLtruststorepassword') ]
  TextColumn ('text_column')
  [ ExtractorJAR ('jar_file') ]
  FeatureTemplate ('template_file')
  ModelFile ('model_file')
  [ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
  [ MaxIterNum (max_iteration_times) ]
  [ Eta (eta_threshold_value) ]
  [ MinOccurNum (threshold_value) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text
to analyze.
ExtractorJAR Optional Specifies the name of the JAR file that contains the Java classes that
extract features. You must install this JAR file in Aster Database
under the user search path before calling the function.

Note:
The name jar_file is case-sensitive.

FeatureTemplate Required Specifies the name of the file that specifies how to generate features
when training the model. You must install this feature template file
in Aster Database under the user search path before calling the
function. For more information about template_file, refer to Feature
Template.
ModelFile Required Specifies the name of the model file that the function generates and
installs in Aster Database.
Language Optional Specifies the language of the input text:
• 'en' (English, the default)
• 'zh_CN' (Simplified Chinese)
• 'zh_TW' (Traditional Chinese)

MaxIterNum Optional Specifies the maximum number of iterations. The default value is
1000.
Eta  Optional  Specifies the tolerance of the termination criterion, defined as the difference of the loss function values between two sequential epochs. The default value is 0.0001.
  When training a model, the function performs n iterations. At the end of each epoch, the function calculates the loss or cost function on the training samples. If the loss function value changes very little between two sequential epochs, the function considers the training process to have converged.
  The function defines Eta as:
  Eta = (f(n) - f(n-1)) / f(n-1)
  where f(n) is the loss function value of the nth epoch.


MinOccurNum  Optional  Specifies the minimum number of times that a feature must occur in the input text before the function uses the feature to construct the model. The default value is 0.
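For example, a training call that overrides the convergence controls might look like the following sketch; it reuses the training table and template file from the example later in this section, and the numeric values are illustrative:

SELECT * FROM NERTrainer (
  ON ner_sports_train PARTITION BY 1
  TextColumn ('content')
  FeatureTemplate ('template_1.txt')
  ModelFile ('ner_model.bin')
  MaxIterNum (500)
  Eta (0.001)
  MinOccurNum (2)
);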

Feature Template
The template_file has two parts. Part 1 declares the classes used to extract features and Part 2 specifies the
features to use to train the model. For example:

#part 1: extractor classes
0: Defaul_Token
1: Begin_with_Uppercase
2: com.asterdata.ner.SuffixExtractor
#part 2: templates
%x[0,0]
%x[0,1]
%x[0,2]
%x[-1,0]
%x[1,0]
%x[-1,1]%x[0,1]

Part 1
Part 1 of the example template file declares three extractor classes—Defaul_Token, Begin_with_Uppercase,
and com.asterdata.ner.SuffixExtractor, with serial numbers 0, 1, and 2, respectively. (Serial numbers must
start with 0 and be incremented by 1.)
Defaul_Token and Begin_with_Uppercase are default extractor classes, defined by the function. The
following table lists the default extractor classes and describes the features that they extract.
Table 616: NERTrainer Default Extractor Classes and Features

Extractor Class Feature


Defaul_Token The token itself.
Begin_with_Uppercase "T" (true) if the token begins with an uppercase letter, "F" (false) otherwise.
All_Uppercase "T" (true) if all characters of the token are uppercase, "F" (false) otherwise.
Is_Digital "T" (true) if the token represents a digit (for example, 1 or '2'), "F" (false)
otherwise.
Has_Hyphen "T" (true) if the token has a hyphen, "F" (false) otherwise.
Prefix_n The first n characters of the token, where n is a positive integer.

Suffix_n The last n characters of the token, where n is a positive integer.

The extractor class com.asterdata.ner.SuffixExtractor is user-defined. You must package user-defined extractor classes in a JAR file, specify the JAR file in the ExtractorJAR argument, and implement the interface com.asterdata.sqlmr.text_analysis.ner.Extractor. The interface declaration is:

package com.asterdata.sqlmr.text_analysis.ner;
import java.io.Serializable;
import java.util.List;
/*
 * To define a function that generates features from a sequence,
 * you must implement this interface.
 */
public interface Extractor extends Serializable
{
  /**
   * Extract the feature of a token.
   * @param sequence the token sequence
   * @param i the current token index
   * @return the feature flag
   */
  String extract(List<String> sequence, int i);
}

The following is an example of a Java class (SuffixExtractor) that implements the Extractor interface:

import java.util.List;
import com.asterdata.sqlmr.text_analysis.ner.Extractor;

public class SuffixExtractor implements Extractor
{
  @Override
  public String extract(List<String> sequence, int i)
  {
    // Return the last character of the current token as its feature
    String token = sequence.get(i);
    return String.valueOf(token.charAt(token.length()-1));
  }
}
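Applied to the example input text, this extractor maps "More" to "e" and "restaurants" to "s"; these values appear in the third column of the feature matrix below.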

Suppose that the function applies the extractor classes in the example template file to the input text "More restaurants open in San Diego." For the token "More":
• Defaul_Token extracts the feature "More".
• Begin_with_Uppercase extracts the feature "T" (true).
• com.asterdata.ner.SuffixExtractor extracts the feature "e".
Applying the same three extractor classes to the entire input text generates this matrix, whose columns 0, 1, and 2 are the columns referenced by the %x[row, column] macros in Part 2:

More        T e
restaurants F s
open        F n
in          F n
San         T n
Diego       T o
.           F .

Part 2
In Part 2 of the template file, each line is a template in which the macro %x[row, column] specifies a token in
the input text. In the macro, row specifies the relative position from the current token and column specifies
the absolute position of the column. For example, if the input text is "More restaurants open in San Diego."
and the current token is "San", then the following table shows the selected feature for each template.
Table 617: Selected Features for Templates in NERTrainer Example Template File

Template Selected Feature


%x[0,0] San
%x[0,1] T
%x[0,2] n
%x[-1,0] in
%x[1,0] Diego
%x[-1,1]%x[0,1] FT

Input
The input table must have a column of text to be analyzed. The table can have other columns, but the
function ignores them.
Table 618: NERTrainer Input Table Schema

Column Name Data Type Description


text_column VARCHAR Text to analyze. Within the text, each entity must be identified with
this syntax:
<START:entity_type> entity <END>
For example:
<START:location>Country1<END> has arrived

Output
The function outputs a message to the console and a CRF model (a binary file installed in the database).
Table 619: NERTrainer Output Message Schema

Column Name Data Type Description


train_result VARCHAR Message indicating whether the function ran successfully.

Example

Input
The input training table, ner_sports_train, is a collection of sports news items in XML format (with tags like
<START:PER> Roger <END>). There are 500 rows of training data.
Table 620: NERTrainer Example Input Table ner_sports_train

id content
2 CRICKET - <START:ORG> LEICESTERSHIRE <END> TAKE OVER AT TOP AFTER INNINGS
VICTORY .
3 <START:LOC> LONDON <END> 1996-08-30
4 West Indian all-rounder <START:PER> Phil Simmons <END> took four for 38 on Friday as
<START:ORG> Leicestershire <END> beat <START:ORG> Somerset <END> by an innings and
39 runs in two days to take over at the head of the county championship .
5 Their stay on top
6 After bowling <START:ORG> Somerset <END> out for 83 on the opening morning at
<START:LOC> Grace Road <END>
7 Trailing by 213
8 <START:ORG> Essex <END>
9 <START:PER> Hussain <END>
10 By the close <START:ORG> Yorkshire <END> had turned that into a 37-run advantage but off-
spinner <START:PER> Such <END> had scuttled their hopes
... ...

The function generates a model file (ner_model.bin) by training on the input data, guided by a template file (template_1.txt) that specifies how to extract features.
The following is the example template file (template_1.txt):

#part 1: extractor classes
0: Defaul_Token
1: Begin_with_Uppercase
2: Prefix_2
#part 2: templates
%x[0,0]
%x[0,1]
%x[0,2]
%x[-1,0]
%x[1,0]
%x[-1,1]%x[0,1]

SQL-MapReduce Call

SELECT * FROM NERTrainer (
  ON ner_sports_train PARTITION BY 1
  TextColumn ('content')
  FeatureTemplate ('template_1.txt')
  ModelFile ('ner_model.bin')
);

Output
Table 621: NERTrainer Example Output Table

train_result
Model generated.
Training time(s): 7.468
File name: ner_model.bin
File size(KB): 373
Model successfully installed.

The model file, ner_model.bin, is in binary format.

NER

Summary
The NER function takes input documents and extracts specified entities, using one or more CRF models
(generated by the function NERTrainer) and, if appropriate, rules (regular expressions) or a dictionary.
The function uses models to extract the names of persons, locations, and organizations; rules to extract
entities that conform to rules (such as phone numbers, times, and dates); and a dictionary to extract known
entities.

Usage

NER Syntax
Version 1.1

SELECT * FROM NER (
  ON input_table PARTITION BY ANY
  [ ON rules_table AS rules DIMENSION ]
  [ ON dictionary_table AS dict DIMENSION ]
  TextColumn ('text_column')
  [ Models ('model_file[:jar_file]' [,...]) ]
  [ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
  [ ShowEntityContext ('context_words') ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text
to analyze.
Models Optional Specifies the CRF models (binary files) to use, generated by
NERTrainer. If you specified the ExtractorJAR argument in the
NERTrainer call that generated model_file, then you must specify
the same jar_file in this argument. You must install model_file and
jar_file in Aster Database under the user search path before calling
the NER function.

Note:
The names model_file and jar_file are case-sensitive.

Language Optional Specifies the language of the input text:


• 'en' (English, the default)
• 'zh_CN' (Simplified Chinese)
• 'zh_TW' (Traditional Chinese)

ShowEntityContext  Optional  Specifies the number of context words to output. If context_words is n (which must be a positive integer), the function outputs the n words that precede the entity, the entity, and the n words that follow the entity. The default value is 0.
Accumulate  Optional  Specifies the names of the input table columns to copy to the output table.

Input
The NER function has a required input table, an optional rules table, and an optional dictionary table.

Note:
Use the function TextTokenizer to tokenize the input text before inputting it to the NER function.

The following table describes the required columns of the input table. The table can have other columns, but
the function ignores them.
Table 622: NER Input Table Schema

Column Name Data Type Description


text_column VARCHAR Text to analyze. (Tokenize this column.)
accumulate_column Any Column to copy to the output table.

Table 623: NER Rules Table Schema

Column Name  Data Type  Description
type  VARCHAR  Entity type.
regex  VARCHAR  Regular expression that represents an entity of this type. The regular expression must conform to the Java Regex standard, which is documented at http://docs.oracle.com/javase/tutorial/essential/regex/quant.html.

Table 624: NER Dictionary Table Schema

Column Name Data Type Description


type VARCHAR Entity type.
dict VARCHAR Dictionary word.

Output
Table 625: NER Output Table Schema

Column Name Data Type Description


accumulate_column Same as in Column copied from the input table.
input table
sn INTEGER Serial number of the extracted entity.
entity VARCHAR Extracted entity.
type VARCHAR Type of the extracted entity.
start INTEGER Start position of the extracted entity in the input text.
end INTEGER End position of the extracted entity in the input text.
context  VARCHAR  Context of the extracted entity (if the ShowEntityContext argument was specified).
approach  VARCHAR  Method used to identify the extracted entity: CRF, RULE, or DICT.

Example

Input
This example uses two input tables:
• Input test table ner_sports_test contains the text to be analyzed.
• Rules table rule_table contains regular expressions used to parse emails. This table must be given with the
alias “rules”.
Model file ner_model.bin, generated by the NERTrainer Example, is also used.

Table 626: NER Example Input Table ner_sports_test

id content
528 email [email protected] to contact for all sport info
529 email [email protected] to contact for all cricket info
530 email [email protected] to contact for all tennis info
531 1= <START:PER> Igor Trandenkov <END> ( <START:LOC> Russia <END> ) 5.86
532 3. <START:PER> Maksim Tarasov <END> ( <START:LOC> Russia <END> ) 5.86
533 4. <START:PER> Tim Lobinger <END> ( <START:LOC> Germany <END> ) 5.80
534 5. <START:PER> Igor Potapovich <END> ( <START:LOC> Kazakstan <END> ) 5.80
535 6. <START:PER> Jean Galfione <END> ( <START:LOC> France <END> ) 5.65
536 7. <START:PER> Pyotr Bochkary <END> ( <START:LOC> Russia <END> ) 5.65
537 8. <START:PER> Dmitri Markov <END> ( <START:LOC> Belarus <END> ) 5.65
583 <START:LOC> GENEVA <END> 1996-08-30
584 <START:ORG> UEFA <END> came down heavily on Belgian club <START:ORG> Standard
Liege <END> on Friday for disgraceful behaviour in an Intertoto final match against
<START:ORG> Karlsruhe <END> of <START:LOC> Germany <END> .
... ...

Table 627: NER Example Rules Table rule_table

type regex
email [\w\-]([\.\w])+[\w]+@([\w\-]+\.)+[a-zA-Z]{2,4}

SQL-MapReduce Call

SELECT * FROM NER (
  ON ner_sports_test PARTITION BY ANY
  ON rule_table AS rules DIMENSION
  TextColumn ('content')
  Models ('ner_model.bin')
  ShowEntityContext (2)
  Accumulate ('id')
) ORDER BY id, sn;

Output
Depending on the text content and the model, the function outputs the entity, its type, and the approach used to identify it (CRF or RULE).

Table 628: NER Example Output Table

id sn entity type start end context approach


528 1 [email protected] email 2 2 ... email RULE
[email protected]
to contact
529 1 [email protected] email 2 2 ... email RULE
[email protected]
to contac
530 1 [email protected] email 2 2 ... email RULE
[email protected]
to contact
531 1 Igor Trandenkov PER 3 4 1= <START:PER> CRF
Igor Trandenkov
<END> (
532 1 Maksim Tarasov PER 3 4 3. <START:PER> CRF
Maksim Tarasov
<END> (
533 1 Tim Lobinger PER 3 4 4. <START:PER> CRF
Tim Lobinger
<END> (
534 1 Igor Potapovich PER 3 4 5. <START:PER> CRF
Igor Potapovich
<END> (
535 1 Jean Galfione PER 3 4 6. <START:PER> CRF
Jean Galfione
<END> (
536 1 Pyotr Bochkary PER 3 4 7. <START:PER> CRF
Pyotr Bochkary
<END> (
537 1 Dmitri Markov PER 3 4 8. <START:PER> CRF
Dmitri Markov
<END> (
584 1 Standard Lieg PER 11 12 club CRF
<START:ORG>
Standard Liege
<END> on
587 1 Roberto Bisconti PER 2 3 ... <START:PER> CRF
Roberto Bisconti
<END> will
592 1 MONTE CARLO PER 2 3 ... <START:LOC> CRF
MONTE CARLO
<END>
1996-08-30

593 1 Kenny Harrison PER 4 5 champion CRF
<START:PER>
Kenny Harrison
<END> and
593 2 Jonathan Edwards PER 12 13 holder CRF
<START:PER>
Jonathan Edwards
<END> will
595 1 Milan ORG 53 53 in <START:LOC> CRF
Milan <END> .
596 1 What ORG 9 9 saying : What type CRF
of
596 2 Milan ORG 26 26 in <START:LOC> CRF
Milan <END>
where
598 1 TO COACH PER 6 7 BARATELLI CRF
<END> TO
COACH
<START:ORG>
NICE
600 1 Dominique PER 5 6 goalkeeper CRF
Baratelli <START:PER>
Dominique
Baratelli <END> is

NEREvaluator

Summary
The NEREvaluator function evaluates a CRF model (generated by the function NERTrainer).

Usage

NEREvaluator Syntax
Version 1.1

SELECT * FROM NEREvaluator (
  ON input_table PARTITION BY 1
  TextColumn ('text_column')
  Model ('model_file[:jar_file]')
  [ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
);

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text
to analyze.
Model  Required  Specifies the CRF model file to evaluate, generated by NERTrainer. If you specified the ExtractorJAR argument in the NERTrainer call that generated model_file, then you must specify the same jar_file in this argument. You must install model_file and jar_file in Aster Database under the user search path before calling the function.

Note:
The names model_file and jar_file are case-sensitive.

Language Optional Specifies the language of the input text:


• 'en' (English, the default)
• 'zh_CN' (Simplified Chinese)
• 'zh_TW' (Traditional Chinese)

Input
The inputs are a table of text to evaluate (specified by the TextColumn argument) and a CRF model (a binary file) generated by the function NERTrainer.

Output
Table 629: NEREvaluator Output Table Schema

Column Name  Data Type  Description
type  VARCHAR  Entity type. Final row value: -AVG-
precision  DOUBLE PRECISION  Precision value of the entity type. Final row value: Average precision value for all entity types.
recall  DOUBLE PRECISION  Recall value of the entity type. Final row value: Average recall value for all entity types.
f1_measure  DOUBLE PRECISION  F1 score (F-measure) of the entity type. Final row value: Average F1 score for all entity types.
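The f1_measure column is the harmonic mean of precision and recall, the standard definition, which the example output below is consistent with:

f1\_measure = \frac{2 \cdot precision \cdot recall}{precision + recall}

For example, the PER row in the example output has precision 0.7222 and recall 0.8125, giving 2(0.7222)(0.8125)/(0.7222 + 0.8125) ≈ 0.7647.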

Example
This example evaluates the efficacy of the model file ner_model.bin, created by the NERTrainer function, in
terms of precision, recall, and F1 measure.

Input
Model file ner_model.bin, generated by the NERTrainer Example, evaluated against the input table ner_sports_test.

SQL-MapReduce Call

SELECT * FROM NEREvaluator (
  ON ner_sports_test PARTITION BY 1
  TextColumn ('content')
  Model ('ner_model.bin')
);

Output
Table 630: NEREvaluator Example Output Table

type precision recall f1_measure


LOC 1.0000 0.4444 0.6154
ORG 0.0000 0.0000 -1.0000
PER 0.7222 0.8125 0.7647
-AVG- 0.7778 0.4884 0.6000

NER Functions (Max Entropy Model Implementation)

Summary
Named entity recognition (NER) is a process of finding instances of specified entities in text. For example, a
simple news named-entity recognizer for the English language might find the person “John J. Smith” and the
location “Seattle” in the text “John J. Smith lives in Seattle”.
NER functions let you specify how to extract entities when training the data models. The Aster Analytics
Foundation provides two sets of NER functions.
The NER functions that use the Max Entropy model are:
• TrainNamedEntityFinder, which takes training data and outputs a Max Entropy data model
• FindNamedEntity, which takes input documents (in XML format) and extracts specified entities, using a
Max Entropy model and, if appropriate, rules (regular expressions) or a dictionary

The function uses a model to extract the entity types "person", "location", and "organization", and rules to
extract the entity types "date", "time", "email", and "money". If you specify these entity names, the function
invokes the default model types and model file names. To extract all entities in one FindNamedEntity
call, specify "all".
• Evaluate Named Entity Finder, which evaluates a Max Entropy model
The NER functions that use the Conditional Random Fields (CRF) Model are documented in NER
Functions (CRF Model Implementation).

FindNamedEntity

Summary
The FindNamedEntity function evaluates the input text, identifies tokens based on the specified model, and
outputs the tokens with detailed information. The function does not identify sentences; it simply tokenizes.
Token identification is not case-sensitive.

Usage

FindNamedEntity Syntax
Version 1.2

SELECT * FROM FindNamedEntity (
  ON { table | view | (query) } PARTITION BY ANY
  [ ON (configure_table) AS ConfigureTable DIMENSION ]
  TextColumn ('text_column')
  [ Model ({ 'entity_type[:model_type:{ model_file | regular_expression }]' [,...] | 'all' }) ]
  [ ShowEntityContext ('context_words') ]
  [ EntityColumn ('entity_column') ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
If the input is a query, you must map it to an alias.

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text
to analyze.
Model Optional Specifies the model items to load. Optional if you specify
configuration_table; required otherwise (and you cannot specify
'all').
If you specify both configuration_table and this argument, then the
function loads the specified model items from configuration_table. If
you specify configuration_table but omit this argument, its default
value is 'all' (every model item from configuration_table).
The entity_type is the name of an entity type (for example, PERSON,
LOCATION, or EMAIL), which appears in the output table.
The model_type is one of these model types:
• 'max entropy': maximum entropy language model generated by
training
• 'rule': rule-based model, a plain text file with one regular
expression on each line
• 'dictionary': dictionary-based model, a plain text file with one
word on each line
• 'reg exp': regular expression that describes entity_type
If model_type is 'reg exp', specify regular_expression (a regular
expression that describes entity_type); otherwise, specify model_file
(the name of the model file). Before calling the function, add the
location of every specified model_file to the user/session default
search path.
If you specify configuration_table, you can use entity_type as a
shortcut. For example, if the configure_table has the row
'organization, max entropy, en-ner-organization.bin', you can
specify Model('organization') as a shortcut for
Model('organization:max entropy:en-ner-organization.bin').

Note:
For model_type 'max entropy', if you specify configuration_table and omit this argument, then the JVM of the worker node needs more than 2GB of memory.

ShowEntityContext  Optional  Specifies the number of context words to output. If context_words is n (which must be a positive integer), the function outputs the n words that precede the entity, the entity, and the n words that follow the entity. The default value is 0.
EntityColumn Optional Specifies the name of the output table column that contains the
entity names. The default value is 'entity'.
Accumulate Optional Specifies the names of input columns to copy to the output table. No
accumulate_column can be an entity_column. By default, the
function copies all input columns to the output table.

Creating the Table of Default Models

Before calling the FindNamedEntity function, you must create the table of default models. To create the table, use this command:

CREATE DIMENSION TABLE nameFind_configure (model_name VARCHAR,
  model_type VARCHAR, model_file VARCHAR);

Default English-language models are provided with the SQL-MapReduce functions. Before using these
models, you must install them (using the \install command in ACT) and create a default configure_table, as
follows:

DROP TABLE IF EXISTS nameFind_configure;


CREATE DIMENSION TABLE nameFind_configure
(model_name VARCHAR, model_type VARCHAR, model_file VARCHAR);
INSERT INTO nameFind_configure VALUES ('person','max entropy','en-ner-
person.bin');
INSERT INTO nameFind_configure VALUES ('location','max entropy','en-ner-
location.bin');
INSERT INTO nameFind_configure VALUES ('organization','max entropy','en-ner-
organization.bin');
INSERT INTO nameFind_configure VALUES ('date','rules','date.rules');
INSERT INTO nameFind_configure VALUES ('time','rules','time.rules');
INSERT INTO nameFind_configure VALUES ('phone','rules','phone.rules');
INSERT INTO nameFind_configure VALUES ('money','rules','money.rules');
INSERT INTO nameFind_configure VALUES ('email','rules','email.rules');
INSERT INTO nameFind_configure VALUES
('percentage','rules','percentage.rules');
\install evaluatenamedentityfinderpartition.zip
\install evaluatenamedentityfinderrow.zip
\install findnamedentity.zip
\install trainnamedentityfinder.zip
\install nameFinderModel/date.rules
\install nameFinderModel/time.rules
\install nameFinderModel/en-sent.bin
\install nameFinderModel/email.rules
\install nameFinderModel/email.bin
\install nameFinderModel/en-ner-location.bin
\install nameFinderModel/percentage.rules
\install nameFinderModel/en-token.bin
\install nameFinderModel/names.txt
\install nameFinderModel/en-ner-organization.bin

\install nameFinderModel/money.rules
\install nameFinderModel/en-ner-person.bin
\install nameFinderModel/phone.rules
\install nameFinderModel/country.txt

Table 631: Default English-Language Models in Table nameFind_configure

model_name model_type model_file


person max entropy en-ner-person.bin
location max entropy en-ner-location.bin
organization max entropy en-ner-organization.bin
date rules date.rules
time rules time.rules
phone rules phone.rules
money rules money.rules
email rules email.rules
percentage rules percentage.rules

Input
The following table describes the required input table columns. The table can have additional columns, but
the function ignores them.
Table 632: FindNamedEntity Input Table Schema

Column Name Data Type Description


text_column VARCHAR Contains input text.
accumulate_column Any Column to copy to the output table.

Table 633: FindNamedEntity Configuration Table Schema

Column Name Data Type Description


model_name VARCHAR Name of an entity type (for example, PERSON, LOCATION,
or EMAIL).
model_type VARCHAR One of these model types:
'max entropy': maximum entropy language model generated
by training
'rule': rule-based model, a plain text file with one regular
expression on each line
'dictionary': dictionary-based model, a plain text file with one
word on each line
'reg exp': regular expression that describes entity_type

model_file VARCHAR Name of model file that describes the entity type. This
column appears if model_type is not 'reg exp'.
reg_exp VARCHAR Regular expression that describes the entity type. This column
appears if model_type is 'reg exp'.

Output
Table 634: FindNamedEntity Output Table Schema

Column Name Data Type Description


accumulate_column Same as in Column copied from the input table.
input table
TYPE VARCHAR Entity type.
ENTITY VARCHAR Entity name.
START INTEGER Start position. This column appears only if you specify the
ShowEntityContext argument.
END INTEGER End position. This column appears only if you specify the
ShowEntityContext argument.
CONTEXT VARCHAR Words before and after the entity. This column appears only
if you specify the ShowEntityContext argument.

Example

Input
Table 635: FindNamedEntity Example Input Table assortedtext_input

id source content
1001 misc contact Alan by email at [email protected] for all sport info
1002 misc contact Mark at [email protected] for all cricket info
1003 misc contact Roger at [email protected] for all tennis info
1004 wiki The contiguous United States consists of the 48 adjoining U.S. states plus
Washington, D.C., on the continent of North America
1005 wiki California's economy is centered on Technology, Finance, real estate services, Government, and professional, Scientific and Technical business Services; together comprising 58% of the State Government economy
1006 wiki Houston is the largest city in Texas and the fourth-largest in the United States, while
San Antonio is the second largest and seventh largest in the state.

1007 wiki Thomas is a photographer whose natural landscapes of the West are also a statement
about the importance of the preservation of the wildness

SQL-MapReduce Call

SELECT * FROM FindNamedEntity (
  ON assortedtext_input PARTITION BY ANY
  ON namefind_configure AS "ConfigureTable" DIMENSION
  TextColumn ('content')
  Model ('all')
  Accumulate ('id', 'source')
);

Output
Table 636: FindNamedEntity Example Output Table

id source ENTITY TYPE


1002 misc Mark person
1002 misc [email protected] email
1004 wiki United States location
1004 wiki U.S. location
1004 wiki Washington location
1004 wiki North America location
1006 wiki Texas location
1006 wiki United States location
1006 wiki San Antonio location
1001 misc [email protected] email
1003 misc Roger person
1003 misc [email protected] email
1005 wiki State Government organization
1005 wiki 58% percentage
1007 wiki Thomas person


TrainNamedEntityFinder

Summary
The TrainNamedEntityFinder function takes training data and outputs a Max Entropy data model. The function is based on OpenNLP and follows its annotation conventions. For more information about OpenNLP, see http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html.
The trainer supports only the English language.

Usage

TrainNamedEntityFinder Syntax

Version 1.3

SELECT * FROM TrainNamedEntityFinder (
  ON { table | view | (query) }
  PARTITION BY 1
  [ Domain ('host:port') ]
  [ Database ('db_name') ]
  [ UserID ('user_id') ]
  [ Password ('password') ]
  [ SSLSettings ('SSLsettings') ]
  [ SSLTrustStorePassword ('SSLtruststorepassword') ]
  TextColumn ('text_column')
  EntityType ('entity_type')
  Model ('model_file')
  [ IterNum ('iterator') ]
  [ Cutoff ('cutoff') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text to
analyze.
EntityType Required Specifies the entity type to be trained (for example, PERSON). The input
training documents must contain the same tag.
Model Required Specifies the name of the data model file to be generated.

IterNum Optional Specifies the number of training iterations (an OpenNLP training parameter). The default value is 100.
Cutoff Optional Specifies the cutoff threshold for training (an OpenNLP training parameter). The default value is 5.

Input
Table 637: TrainNamedEntityFinder Input Table Schema

Column Name Data Type Description


text_column VARCHAR Text to analyze. Within the text, each entity must be identified with this
syntax:

<START:entity_type>entity<END>

For example:

<START:location>Country1<END> has arrived
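For instance, a minimal training table consistent with this schema could be created and loaded as follows. This is a sketch only; the table and column names are hypothetical, and only the annotation syntax is prescribed by the function:

CREATE TABLE my_ner_train (id INTEGER, content VARCHAR) DISTRIBUTE BY HASH (id);
INSERT INTO my_ner_train VALUES (1, '<START:LOCATION> LONDON <END> 1996-08-30');
INSERT INTO my_ner_train VALUES (2, 'He moved to <START:LOCATION> Paris <END> last year.');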

Output
The function outputs a message to the console and a Max Entropy model (a binary file). The file is
automatically installed on the Aster Database cluster.
Table 638: TrainNamedEntityFinder Output Message Schema

Column Name Data Type Description


train_result VARCHAR Message indicating whether the function ran successfully.

Example

Input
The input table, nermem_sports_train, is a collection of sports news items in XML format (with tags such as <START:PER> Roger <END>). There are 50 rows of training data, with an id column and a content column (which contains the text). The function generates the model file location.sports. The EntityType argument accepts only one tag ('LOCATION').
Table 639: TrainNamedEntityFinder Example Input Table nermem_sports_train

id content
2 CRICKET - <START:ORG> LEICESTERSHIRE <END> TAKE OVER AT TOP AFTER INNINGS
VICTORY .

3 <START:LOCATION> LONDON <END> 1996-08-30
4 West Indian all-rounder <START:PER> Phil Simmons <END> took four for 38 on Friday as
<START:ORG> Leicestershire <END> beat <START:ORG> Somerset <END> by an innings and
39 runs in two days to take over at the head of the county championship .
5 Their stay on top
6 After bowling <START:ORG> Somerset <END> out for 83 on the opening morning at
<START:LOCATION> Grace Road <END>
7 Trailing by 213
8 <START:ORG> Essex <END>
9 <START:PER> Hussain <END>
10 By the close <START:ORG> Yorkshire <END> had turned that into a 37-run advantage but off-
spinner <START:PER> Such <END> had scuttled their hopes
11 At the <START:LOCATION> Oval <END>
12 He was well backed by <START:LOCATION> England <END> hopeful <START:PER> Mark
Butcher <END> who made 70 as <START:ORG> Surrey <END> closed on 429 for seven
... ...

SQL-MapReduce Call

SELECT * FROM TrainNamedEntityFinder (
  ON nermem_sports_train
  PARTITION BY 1
  EntityType ('LOCATION')
  TextColumn ('content')
  Model ('location.sports')
);

Output
Table 640: TrainNamedEntityFinder Example Output Table

train_result
model installed


Evaluate Named Entity Finder

Summary
The EvaluateNamedEntityFinderRow and EvaluateNamedEntityFinderPartition functions operate as a row function and a partition function, respectively. Each function takes a set of evaluation data and generates the precision, recall, and F-measure values of a specified maximum entropy data model. Neither function supports regular-expression-based or dictionary-based models.

Usage

Evaluate Named Entity Finder Syntax


Version 1.1

SELECT * FROM EvaluateNamedEntityFinderPartition (
  ON EvaluateNamedEntityFinderRow (
    ON { table | view | (query) }
    TextColumn ('text_column')
    Model ('model_file')
  )
  PARTITION BY 1
);

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text to
analyze.
Model Required Specifies the name of the model file to evaluate.

Input
Table 641: EvaluateNamedEntityFinderRow Input Table Schema

Column Name Data Type Description


text_column VARCHAR Text to analyze. Within the text, each entity must be identified with this
syntax:
<START:entity_type> entity <END>
For example:
<START:location>Country1<END> has arrived

Output
Table 642: EvaluateNamedEntityFinderPartition Output Table Schema

Column Name Data Type Description


Precision INTEGER Precision value of the model.
Recall DOUBLE PRECISION Recall value of the model.
F-Measure DOUBLE PRECISION F-measure (F1 score) of the model.

Example
The EvaluateNamedEntityFinderPartition function invokes the EvaluateNamedEntityFinderRow function, which takes as input the test data (nermem_sports_test). The test set is a collection of sports news items in XML format, similar to the training data. The function evaluates the efficacy of the maximum entropy data model location.sports (from TrainNamedEntityFinder) in terms of its precision, recall, and F-measure values.

Input
Table 643: EvaluateNamedEntityFinderRow Example Input Table nermem_sports_test

id content
3 <START:LOCATION> LONDON <END> 1996-08-30
4 West Indian all-rounder <START:PER> Phil Simmons <END> took four for 38 on Friday as
<START:ORG> Leicestershire <END> beat <START:ORG> Somerset <END> by an innings and
39 runs in two days to take over at the head of the county championship .
6 After bowling <START:ORG> Somerset <END> out for 83 on the opening morning at
<START:LOCATION> Grace Road <END>
9 <START:PER> Hussain <END>
10 By the close <START:ORG> Yorkshire <END> had turned that into a 37-run advantage but off-
spinner <START:PER> Such <END> had scuttled their hopes
11 At the <START:LOCATION> Oval <END>
12 He was well backed by <START:LOCATION> England <END> hopeful <START:PER> Mark
Butcher <END> who made 70 as <START:ORG> Surrey <END> closed on 429 for seven
14 Australian <START:PER> Tom Moody <END> took six for 82 but <START:PER> Chris Adams
<END>
16 They were held up by a gritty 84 from <START:PER> Paul Johnson <END> but ex-England fast
bowler <START:PER> Martin McCague <END> took four for 55 .
20 <START:LOCATION> LONDON <END> 1996-08-30
22 <START:LOCATION> Leicester <END> : <START:ORG> Leicestershire <END> beat
<START:ORG> Somerset <END> by an innings and 39 runs .

... ...

SQL-MapReduce Call

SELECT * FROM EvaluateNamedEntityFinderPartition (
  ON EvaluateNamedEntityFinderRow (
    ON nermem_sports_test
    Model ('location.sports')
    TextColumn ('content')
  )
  PARTITION BY 1
);

Output
Table 644: EvaluateNamedEntityFinderPartition Example Output Table

Precision Recall F-Measure


1 0.952380952380952 0.975609756097561

nGram

Summary
The nGram function tokenizes (splits) an input stream of text and outputs n-word combinations (called n-grams), based on the specified delimiter and reset parameters. nGram provides more flexibility than standard tokenization when performing text analysis. Many two-word phrases carry important meaning (for example, "machine learning") that unigrams (single-word tokens) do not capture. This, combined with additional analytical techniques, can be useful for performing sentiment analysis, topic identification, and document classification.
nGram considers each input row to be one document, and it returns a row for each unique n-gram in each document. nGram also returns, for each document, the count of each n-gram and the total number of n-grams.
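For example, a call of the following form would list each document's two-word phrases together with their within-document counts. This is a sketch only; the table docs and the columns docid and txt are hypothetical stand-ins for your own input:

SELECT docid, ngram, frequency FROM nGram (
  ON docs
  TextColumn ('txt')
  Grams ('2')
  Accumulate ('docid')
) WHERE ngram = 'machine learning';

Because ToLowerCase defaults to 'true', the filter matches the phrase regardless of its capitalization in the source text.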

Background
For general information about tokenization, see http://en.wikipedia.org/wiki/Lexical_analysis#Tokenizer.

Usage

nGram Syntax
Version 1.5

SELECT * FROM nGram (
  ON { table_name | view_name | (query) }
  TextColumn ('text_column_name')
  [ Delimiter ('delimiter_regular_expression') ]
  Grams (gram_number | 'range_of_values' [,...])
  [ OverLapping ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
  [ ToLowerCase ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
  [ Punctuation ('punctuation_regular_expression') ]
  [ Reset ('reset_regular_expression') ]
  [ TotalGramCount ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
  [ TotalCountColumn ('total_count_column_name') ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
  [ NGramColumn ('ngram_column_name') ]
  [ NumGramsColumn ('numgrams_column_name') ]
  [ FrequencyColumn ('count_column_name') ]
);

Arguments
Argument Category Description
TextColumn Required The name of the column that contains the input text. Input columns
must contain string SQL types.
Delimiter Optional A regular expression that specifies the character or string that
separates words in the input text. The default value is the space
character (' ').
Grams Required A list of integers or ranges of integers that specify the length, in
words, of each n-gram (that is, the value of n). A range_of_values
has the syntax integer1-integer2, where integer1 <= integer2. The
values of n, integer1, and integer2 must be positive.
OverLapping Optional A Boolean value that specifies whether the function allows
overlapping n-grams. When this value is 'true' (the default), each
word in each sentence starts an n-gram, if enough words follow it
(in the same sentence) to form a whole n-gram of the specified size.
For information on sentences, see the description of the Reset
argument.
ToLowerCase Optional A Boolean value that specifies whether the function converts all
letters in the input text to lowercase. The default value is 'true'.

Punctuation Optional A regular expression that specifies the punctuation characters for
the function to remove before evaluating the input text. The default
characters to remove are: `~#^&*()-
Reset Optional A regular expression that specifies the character or string that ends a
sentence. The default sentence-ending characters are: .,?!
At the end of a sentence, the function discards any partial n-grams
and searches for the next n-gram at the beginning of the next
sentence. An n-gram cannot span two sentences.
TotalGramCount Optional A Boolean value that specifies whether the function returns the total number of n-grams in the document (that is, in the row). The default value is 'false'. If you specify 'true', then the name of the returned column is specified by the TotalCountColumn argument.

Note:
The total number of n-grams is not necessarily the number of
unique n-grams.

TotalCountColumn Optional The name of the column to return if the value of the TotalGramCount argument is 'true'. The default value is 'totalcnt'.
Accumulate Optional The names of the columns to return for each n-gram. These columns cannot have the same names as those specified by the arguments NGramColumn, NumGramsColumn, and TotalCountColumn. By default, the function returns all input columns for each n-gram.
NGramColumn Optional The name of the column that is to contain the generated n-grams.
The default value is 'ngram'.
NumGramsColumn Optional The name of the column that is to contain the length of the n-gram (in words). The default value is 'n'.
FrequencyColumn Optional The name of the column that is to contain the count of each unique
n-gram (that is, the number of times that each unique n-gram
appears in the document). The default value is 'frequency'.
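For example, with Grams('2') and the default Delimiter and Reset values, the input 'the quick brown fox.' would be expected to yield the overlapping bigrams 'the quick', 'quick brown', and 'brown fox'; with OverLapping('false'), only 'the quick' and 'brown fox' would be produced.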

Input
Each row of the input table contains a document to be tokenized. The input table can have additional columns, some or all of which the function returns in the output table.
Table 645: Input Table Schema

Column Name Data Type Description


text_column_name VARCHAR Required. Documents to be tokenized.

column_name VARCHAR Optional. The input table can have many such columns. The
Accumulate argument determines which of these columns
appear in the output table.

Output
The output table has a row for each unique n-gram in each input document.
Table 646: Output Table Schema

Column Name Data Type Description


total_count_column_name INTEGER Total number of n-grams in the document. This column appears in the output table only if the value of the TotalGramCount argument is 'true'.
accumulate_column_name VARCHAR Column from the input table. The output table can
have many such columns.
ngram_column_name VARCHAR Generated n-grams.
numgrams_column_name INTEGER Length of n-gram in words (the value n).
count_column_name INTEGER Count of each unique n-gram in the document.

Examples
The nGram function tokenizes a given document based on the lengths specified by the Grams argument. It also provides additional control of tokenization by allowing the user to specify punctuation delimiters with the Punctuation argument.
These examples show the use of the TotalGramCount and OverLapping arguments:
• Input
• Example 1: Overlapping ('true') and TotalGramCount ('true')
• Example 2: Overlapping ('false') and TotalGramCount ('false')

Input
The input table contains paragraphs about common analytics topics (regression, decision trees, and so on).
Table 647: nGram Example Input Table paragraphs_input

paraid paratopic paratext


1 Decision Trees Decision tree learning uses a decision tree as a predictive model which
maps observations about an item to conclusions about the items target
value. It is one of the predictive modelling approaches used in
statistics, data mining and machine learning. Tree models where the
target variable can take a finite set of values are called classification
trees. In these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels.

Decision trees where the target variable can take continuous values
(typically real numbers) are called regression trees.
2 Simple Regression In statistics, simple linear regression is the least squares estimator of a
linear regression model with a single explanatory variable. In other
words, simple linear regression fits a straight line through the set of n
points in such a way that makes the sum of squared residuals of the
model (that is, vertical distances between the points of the data set and
the fitted line) as small as possible
... ... ...

Example 1: Overlapping ('true') and TotalGramCount ('true')

SQL-MapReduce Call

SELECT * FROM nGram (
  ON paragraphs_input
  TextColumn ('paratext')
  Delimiter (' ')
  Grams ('4-6')
  OverLapping ('true')
  ToLowerCase ('true')
  Punctuation ('\[.,?\!\]')
  Reset ('\[.,?\!\]')
  TotalGramCount ('true')
  Accumulate ('paraid', 'paratopic')
) ORDER BY paraid;

Output

Table 648: nGram Example 1 Output Table

paraid paratopic ngram n frequency totalcnt


1 Decision Trees decision tree learning uses 4 1 73
1 Decision Trees decision tree learning uses a 5 1 66
1 Decision Trees decision tree learning uses a decision 6 1 60
1 Decision Trees tree learning uses a 4 1 73
1 Decision Trees tree learning uses a decision 5 1 66
1 Decision Trees tree learning uses a decision tree 6 1 60
1 Decision Trees learning uses a decision 4 1 73
1 Decision Trees learning uses a decision tree 5 1 66
1 Decision Trees learning uses a decision tree as 6 1 60
... ... ... ... ... ...

Example 2: Overlapping ('false') and TotalGramCount ('false')

SQL-MapReduce Call

SELECT * FROM nGram (
  ON paragraphs_input
  TextColumn ('paratext')
  Delimiter (' ')
  Grams ('4-6')
  OverLapping ('false')
  ToLowerCase ('true')
  Punctuation ('\[.,?\!\]')
  Reset ('\[.,?\!\]')
  TotalGramCount ('false')
  Accumulate ('paraid', 'paratopic')
) ORDER BY paraid;

Output

Table 649: nGram Example 2 Output Table

paraid paratopic ngram n frequency


1 Decision Trees decision tree learning uses 4 1
1 Decision Trees a decision tree as 4 1
1 Decision Trees a predictive model which 4 1
1 Decision Trees maps observations about an 4 1
1 Decision Trees item to conclusions about 4 1
1 Decision Trees the items target value 4 1
1 Decision Trees decision tree learning uses a 5 1
1 Decision Trees decision tree as a predictive 5 1
1 Decision Trees model which maps observations about 5 1
1 Decision Trees an item to conclusions about 5 1
1 Decision Trees decision tree learning uses a decision 6 1
1 Decision Trees tree as a predictive model which 6 1
1 Decision Trees maps observations about an item to 6 1
1 Decision Trees conclusions about the items target value 6 1
... ... ... ... ...


POSTagger

Summary
The POSTagger function generates part-of-speech (POS) tags for the words contained in the input text (typically sentences). POS tagging is the first step in the syntactic analysis of a language, and an important preprocessing step in many natural language processing applications.

Background
The POSTagger function was developed on the Penn Treebank Project and Chinese Penn Treebank Project datasets. Its POS tags comply with the tags defined by those two projects.

Part-of-Speech Tags for English Text


The following table lists the POS tags used in the Penn Treebank Project for English text.

Number Tag Description Examples


1 CC Coordinating conjunction and
2 CD Cardinal number 1, third
3 DT Determiner the
4 EX Existential there there is
5 FW Foreign word d'hoevre
6 IN Preposition / subordinating conjunction in, of, like
7 JJ Adjective green
8 JJR Adjective, comparative greener
9 JJS Adjective, superlative greenest
10 LS List item marker 1)
11 MD Modal could, will
12 NN Noun, singular or mass table
13 NNS Noun, plural tables
14 NNP Proper noun, singular John

15 NNPS Proper noun, plural Vikings
16 PDT Predeterminer both the boys
17 POS Possessive ending friend's
18 PRP Personal pronoun I, he, it
19 PRP$ Possessive pronoun my, his
20 RB Adverb however, usually,
naturally, here, good
21 RBR Adverb, comparative better
22 RBS Adverb, superlative best
23 RP Particle give up
24 SYM Symbol
25 TO to to go, to him
26 UH Interjection uhhuhhuhh
27 VB Verb, base form take
28 VBD Verb, past tense took
29 VBG Verb, gerund or present participle taking
30 VBN Verb, past participle taken
31 VBP Verb, non-3rd person singular present take
32 VBZ Verb, 3rd person singular present takes
33 WDT Wh-determiner which
34 WP Wh-pronoun who, what
35 WP$ Possessive wh-pronoun whose
36 WRB Wh-adverb where, when
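For example, given the well-formed input sentence 'John takes the green table.', the expected word/tag pairs, consistent with the preceding table, would be John/NNP, takes/VBZ, the/DT, green/JJ, and table/NN.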

Part-of-Speech Tags for Chinese Text


The following tables list the POS tags used in the Penn Treebank Project for Chinese text.
Table 650: Chinese POS Tags: Verb, adjective

Number Tag Description


1 VA Predicative adjective
2 VC Copula
3 VE Only 有, 没有, and 无 are tagged as VE when they are the main verbs.
4 VV Other verb. Includes the remaining verbs, such as modals, raising predicates, control verbs, action verbs, psych-verbs, and so on.

Table 651: Chinese POS Tags: Noun

Number Tag Description


5 NR Proper Noun. An NR is a name of a particular person, politically or geographically defined location, or organization. An NR is usually unique and cannot be modified by a Det+M.
6 NT Temporal Noun.
7 NN Other Noun.

Table 652: Chinese POS Tags: Localizer

Number Tag Description


8 LC

Table 653: Chinese POS Tags: Pronoun

Number Tag Description


9 PN

Table 654: Chinese POS Tags: Determiner and number

Number Tag Description


10 DT Determiner
11 CD Cardinal Number
12 OD Ordinal Number

Table 655: Chinese POS Tags: Measure word

Number Tag Description


13 M

Table 656: Chinese POS Tags: Adverb

Number Tag Description


14 AD

Table 657: Chinese POS Tags: Preposition

Number Tag Description


15 P

Table 658: Chinese POS Tags: Conjunction

Number Tag Description


16 CC Coordinating conjunction.
17 CS Subordinating conjunction.

Table 659: Chinese POS Tags: Particle

Number Tag Description


18 DEC This only includes 的 and 之 when they function as a complementizer or a nominalizer.
19 DEG This only includes 的 and 之 when they function as a genitive marker or an
associative marker.
20 DER 得 is tagged as DER in potential form V-得-R, and in V-de construction.
21 DEV This only includes 地 when it occurs in "XP 地 VP", where XP modifies the
VP.
22 AS Aspect Particle. Verbal particles that indicate aspect are tagged as AS.
23 SP Sentence-final particle.
24 ETC The tag is used for the word 等 and 等等.
25 MSP Other particle. This includes particles, such as 所, 以, 来, and 而, when they
appear before a VP.

Table 660: Chinese POS Tags: Others

Number Tag Description


26 IJ Interjection. Interjections appear in the sentence-initial position.
27 ON Onomatopoeia
28 LB This only includes 被, 叫, 给, and 为 when they occur in the long bei-construction.
29 SB This only includes 被 and 给 when they occur in the short bei-construction.
31 JJ Other noun-modifier.
32 FW Foreign Word.
33 PU Punctuation.

Usage

POSTagger Syntax
Version 2.1

SELECT * FROM POSTagger (
  ON { table | view | (query) }
  TextColumn ('text_column_name')
  [ Language ({ 'en' | 'zh_CN' }) ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input column that contains the text to be
tagged.
Language Optional Specifies the language of the input text. The supported values are:
• 'en' (English, the default)
• 'zh_CN' (Simplified Chinese)
Accumulate Optional Specifies the names of the input table columns to copy to the output
table.

Note:
If you intend to use the POSTagger output table as input to the
function TextChunker, then this argument must specify the input
table columns that comprise the partition key.

Input
The POSTagger function requires a model file and an input table.
Two model files are provided with this function:
• pos_model_2.0_en_141008.bin for English
• pos_model_2.0_zh_cn_141008.bin for Simplified Chinese

Note:
Before running POSTagger, add the model file locations to the default search path for the user or session.

The following table describes the input table columns that you can specify with function arguments. The
input table can have additional columns, but the function ignores them.
Table 661: POSTagger Input Table Schema

Column Name Data Type Description


accumulate_column Any Column to copy to the output table.
text_column_name VARCHAR Contains the text to be tagged. Each row of this column
must contain a well-formatted sentence. To convert English
text to formatted sentences, you can use the function
Sentenizer.

Output
Table 662: POSTagger Output Table Schema

Column Name Data Type Description


accumulate_column Same as in input table Copied from the input table.

word_sn INTEGER Word serial number (position of the word in the input text).
word VARCHAR Word extracted from input text.
pos_tag VARCHAR POS tag of the word.

Example

Input
The input table is the output of the Sentenizer function example.

SQL-MapReduce Call

SELECT * FROM POSTagger (
  ON (
    SELECT * FROM Sentenizer (
      ON paragraphs_input
      TextColumn ('paratext')
      Accumulate ('paraid')
    )
  )
  TextColumn ('sentence')
  Accumulate ('paraid', 'sentence', 'sentence_sn')
) ORDER BY paraid, sentence_sn, word_sn;

Output
Table 663: POSTagger Example Output Table

sentence sentence_sn word_sn word pos_tag


Decision tree learning uses a decision tree as 1 1 Decision NN
a predictive model which maps observations
about an item to conclusions about the items
target value.
Cluster analysis or clustering is the task of 1 1 Cluster NN
grouping a set of objects in such a way that
objects in the same group (called a cluster)
are more similar (in some sense or another)
to each other than to those in other groups
(clusters).
Association rule learning is a method for 1 1 Association NNP
discovering interesting relations between
variables in large databases.


In statistics, simple linear regression is 1 1 In statistics, simple JJ
the least squares estimator of a linear
regression model with a single explanatory
variable.
Logistic regression was developed by 1 1 Logistic JJ
statistician David Cox in 1958[2]
[3] (although much work was done in the
single independent variable case almost two
decades earlier).
Cluster analysis or clustering is the task of 1 2 analysis or clustering is VBZ
grouping a set of objects in such a way that
objects in the same group (called a cluster)
are more similar (in some sense or another)
to each other than to those in other groups
(clusters).
Association rule learning is a method for 1 2 rule NN
discovering interesting relations between
variables in large databases.
In statistics, simple linear regression is 1 2 linear NN
the least squares estimator of a linear
regression model with a single explanatory
variable.
Logistic regression was developed by 1 2 regression NN
statistician David Cox in 1958[2]
[3] (although much work was done in the
single independent variable case almost two
decades earlier).
Decision tree learning uses a decision tree as 1 2 tree NN
a predictive model which maps observations
about an item to conclusions about the items
target value.
In statistics, simple linear regression is 1 3 regression is VBZ
the least squares estimator of a linear
regression model with a single explanatory
variable.
Association rule learning is a method for 1 3 learning NN
discovering interesting relations between
variables in large databases.
Cluster analysis or clustering is the task of 1 3 the DT
grouping a set of objects in such a way that
objects in the same group (called a cluster)
are more similar (in some sense or another)
to each other than to those in other groups
(clusters).


Logistic regression was developed by 1 3 was VBD
statistician David Cox in 1958[2]
[3] (although much work was done in the
single independent variable case almost two
decades earlier).
Decision tree learning uses a decision tree as 1 3 learning NN
a predictive model which maps observations
about an item to conclusions about the items
target value.
In statistics, simple linear regression is 1 4 the least JJ
the least squares estimator of a linear
regression model with a single explanatory
variable.
Association rule learning is a method for 1 4 is VBZ
discovering interesting relations between
variables in large databases.
Cluster analysis or clustering is the task of 1 4 task NN
grouping a set of objects in such a way that
objects in the same group (called a cluster)
are more similar (in some sense or another)
to each other than to those in other groups
(clusters).
Decision tree learning uses a decision tree as 1 4 uses VBZ
a predictive model which maps observations
about an item to conclusions about the items
target value.
Logistic regression was developed by 1 4 developed VBN
statistician David Cox in 1958[2]
[3] (although much work was done in the
single independent variable case almost two
decades earlier).
In statistics, simple linear regression is 1 5 squares estimator NN
the least squares estimator of a linear
regression model with a single explanatory
variable.
Association rule learning is a method for 1 5 a DT
discovering interesting relations between
variables in large databases.
Cluster analysis or clustering is the task of 1 5 of IN
grouping a set of objects in such a way that
objects in the same group (called a cluster)
are more similar (in some sense or another)
to each other than to those in other groups
(clusters).


Logistic regression was developed by 1 5 by IN
statistician David Cox in 1958[2]
[3] (although much work was done in the
single independent variable case almost two
decades earlier).
Decision tree learning uses a decision tree as 1 5 a DT
a predictive model which maps observations
about an item to conclusions about the items
target value.

Sentenizer

Summary
The Sentenizer function extracts sentences from English input text. A sentence ends with a punctuation
mark such as period (.), question mark (?), or exclamation mark (!).
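For example, the input text 'It works! Would you buy it again? Definitely.' would be expected to yield three rows, with sentence_sn values 1, 2, and 3 for the sentences 'It works!', 'Would you buy it again?', and 'Definitely.'.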

Background
Many Natural Language Processing (NLP) processing tasks (such as Part-Of-Speech tagging and chunking)
begin by identifying sentences.

Usage

Sentenizer Syntax
Version 1.1

SELECT * FROM Sentenizer (
  ON input_table
  TextColumn ('text_column')
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input column that contains the text from which
to extract sentences.
Accumulate Optional Specifies the names of the input columns to copy to the output table.

Input
This function requires an input table and the sentenizer_default_model.bin file. The model file is distributed
with the function.

Note:
Before running this function, add the schema of the model file that the function uses to the user/session
default search path.
Table 664: Sentenizer Input Table Schema

Column Name Data Type Description


accumulate_column Any Column to copy to the output table.
text_column VARCHAR Contains the text from which to extract sentences.

Output
Table 665: Sentenizer Output Table Schema

Column Name Data Type Description


accumulate_column Same as in input table Column copied from the input table.
sentence_sn INTEGER Position or sequence number of the extracted sentence.
sentence VARCHAR Extracted sentence.

Example

Input
The input table contains paragraphs about common analytics topics (regression, decision trees, and so on).
Table 666: Sentenizer Example Input Table paragraphs_input

paraid paratopic paratext


1 Decision Trees Decision tree learning uses a decision tree as a predictive model which
maps observations about an item to conclusions about the items target
value. It is one of the predictive modelling approaches used in
statistics, data mining and machine learning. Tree models where the
target variable can take a finite set of values are called classification
trees. In these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels.
Decision trees where the target variable can take continuous values
(typically real numbers) are called regression trees.
2 Simple Regression In statistics, simple linear regression is the least squares estimator of a
linear regression model with a single explanatory variable. In other
words, simple linear regression fits a straight line through the set of n

points in such a way that makes the sum of squared residuals of the
model (that is, vertical distances between the points of the data set and
the fitted line) as small as possible.
... ... ...

SQL-MapReduce Call

SELECT * FROM Sentenizer (
  ON paragraphs_input
  TextColumn ('paratext')
  Accumulate ('paraid', 'paratopic')
) ORDER BY 1, 3;

Output
Table 667: Sentenizer Example Output Table

paraid paratopic sentence_sn sentence


1 Decision Trees 1 Decision tree learning uses a decision tree as a
predictive model which maps observations about an
item to conclusions about the items target value.
1 Decision Trees 2 It is one of the predictive modelling approaches used
in statistics, data mining and machine learning.
1 Decision Trees 3 Tree models where the target variable can take a
finite set of values are called classification trees.
1 Decision Trees 4 In these tree structures, leaves represent class labels
and branches represent conjunctions of features that
lead to those class labels.
1 Decision Trees 5 Decision trees where the target variable can take
continuous values (typically real numbers) are called
regression trees.
2 Simple Regression 1 In statistics, simple linear regression is the least
squares estimator of a linear regression model with a
single explanatory variable.
2 Simple Regression 2 In other words, simple linear regression fits a
straight line through the set of n points in such a way
that makes the sum of squared residuals of the
model (that is, vertical distances between the points
of the data set and the fitted line) as small as
possible.
3 Logistic Regression 1 Logistic regression was developed by statistician
David Cox in 1958[2][3](although much work was

done in the single independent variable case almost
two decades earlier).
3 Logistic Regression 2 The binary logistic model is used to estimate the
probability of a binary response based on one or
more predictor (or independent) variables (features).
3 Logistic Regression 3 As such it is not a classification method.
3 Logistic Regression 4 It could be called a qualitative response/discrete
choice model in the terminology of economics.
4 Cluster analysis 1 Cluster analysis or clustering is the task of grouping
a set of objects in such a way that objects in the same
group (called a cluster) are more similar (in some
sense or another) to each other than to those in
other groups (clusters).
4 Cluster analysis 2 It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used
in many fields, including machine learning, pattern
recognition, image analysis, information retrieval,
and bioinformatics.
4 Cluster analysis 3 Cluster analysis itself is not one specific algorithm,
but the general task to be solved.
4 Cluster analysis 4 It can be achieved by various algorithms that differ
significantly in their notion of what constitutes a
cluster and how to efficiently find them.
5 Association rule learning 1 Association rule learning is a method for discovering
interesting relations between variables in large
databases.
5 Association rule learning 2 It is intended to identify strong rules discovered in
databases using different measures of
interestingness.
5 Association rule learning 3 Based on the concept of strong rules, Rakesh
Agrawal et al.[2] introduced association rules for
discovering regularities between products in large-
scale transaction data recorded by point-of-sale
(POS) systems in supermarkets.
5 Association rule learning 4 For example, the rule {onions, potatoes} => {burger}
found in the sales data of a supermarket would
indicate that if a customer buys onions and potatoes
together, they are likely to also buy hamburger meat.


Sentiment Extraction Functions

Summary
Sentiment extraction is the process of inferring user sentiment (positive, negative, or neutral) from text
(typically call center logs, forums, and social media).
The sentiment extraction functions are:
• TrainSentimentExtractor, which trains a model—takes training documents and outputs a maximum
entropy classification model
• ExtractSentiment, which uses either the classification model or a dictionary model to extract the
sentiment of each input document or sentence; that is, to output predictions
• EvaluateSentimentExtractor, which uses test data to evaluate the precision and recall of the predictions

Background
As user-generated content has increased, sentiment extraction has become more important. Typical use
cases are:
• Support Forum
A software company has an online forum where users can share knowledge and ask each other questions
about how to use its products. If a user post shows appreciation or shares information, the company
support staff need not respond. However, if a user post shows frustration at an unanswered question, or
anger at a product, then the support staff can react as soon as possible.
• Mining User-Generated Reviews
A retailer has a web site where customers can submit reviews of its products. The retailer wants to get
feedback about the products by analyzing these reviews, rather than by sending customers
questionnaires.
• Online Reputation Management
A company wants to protect its brand and reputation by monitoring negative news, blog entries, reviews,
and comments on the Internet.


TrainSentimentExtractor

Summary
The TrainSentimentExtractor function trains a model; that is, takes training documents and outputs a
maximum entropy classification model, which it installs on Aster Database. For information about
maximum entropy, see http://en.wikipedia.org/wiki/Maximum_entropy_method.

Usage

TrainSentimentExtractor Syntax
Version 2.1

SELECT * FROM TrainSentimentExtractor (
  ON (SELECT 1) PARTITION BY 1
  [ Domain ('host:port') ]
  [ Database ('db_name') ]
  [ UserID ('user_id') ]
  [ Password ('password') ]
  [ SSLSettings ('SSLsettings') ]
  [ SSLTrustStorePassword ('SSLtruststorepassword') ]
  InputTable ('training_table')
  TextColumn ('text_column')
  SentimentColumn ('sentiment_column')
  ModelFile ('model_file')
  [ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the training data.
TextColumn Required Specifies the name of the input table column that contains the training
data.
SentimentColumn Required Specifies the name of the input table column that contains the
sentiment values, which are 'POS' (positive), 'NEG' (negative), and
'NEU' (neutral).
ModelFile Required Specifies the name of the file to which the function outputs the model.
Language Optional Specifies the language of the training data:

• 'en' (English, the default)
• 'zh_CN' (Simplified Chinese)
• 'zh_TW' (Traditional Chinese)

Input
Table 668: TrainSentimentExtractor Input Table Schema

Column Name Data Type Description


text_column VARCHAR Training data (one document in each row).
sentiment_column VARCHAR Sentiment values, which are 'POS' (positive), 'NEG' (negative), and
'NEU' (neutral).
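A minimal training table consistent with this schema might be created as follows. This is a sketch only; the table and column names are hypothetical:

CREATE TABLE my_sentiment_train (id INTEGER, review VARCHAR, category VARCHAR) DISTRIBUTE BY HASH (id);
INSERT INTO my_sentiment_train VALUES (1, 'i love this camera. great value.', 'POS');
INSERT INTO my_sentiment_train VALUES (2, 'poor customer support. very disappointed.', 'NEG');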

Output
The function outputs a binary file that contains a maximum entropy classification model, which it installs on
Aster Database, and a message.
Table 669: TrainSentimentExtractor Output Message Schema

Column Name Data Type Description


train_result VARCHAR Message about the model file.

Example

Input
The input table is a collection of user reviews for different products.
Table 670: TrainSentimentExtractor Example Input Table sentiment_train

id product category review


1 camera POS we primarily bought this camera for high
image quality and excellent video capability
without paying the price for a dslr. it has
excelled in what we expected of it, and
consequently represented excellent value for
me. all my friends want my camera for their
vacations. i would recommend this camera to
anybody. definitely worth the price. plus, when
you buy some accessories, it becomes even
more powerful.

2 office suite POS it is the best office suite i have used to date. it
is launched before office 2010 and it is ages
ahead of it already. the fact that i could
comfortable import xls, doc, ppt and modify
them, and then export them back to the doc,
xls, ppt is terrific. i needed the compatibility. it
is a very intuitive suite and the drag drop
functionality is terrific.
3 camera POS this is a nice camera, delivering good quality
video images decent photos. light small, using
easily obtainable, high quality minidv i love it.
minor irritations include touchscreen based
menu only digital photos can only be
transferred via usb, requiring ilink and usb if
you use ilink.
... ... ... ...

SQL-MapReduce Call

SELECT * FROM TrainSentimentExtractor (
  ON (SELECT 1)
  PARTITION BY 1
  InputTable ('sentiment_train')
  TextColumn ('review')
  SentimentColumn ('category')
  ModelFile ('sentimentmodel1.bin')
);

Output
Table 671: TrainSentimentExtractor Example Output Table

train_result
Model generated.
Training time(s): 0.167
File name: sentimentmodel1.bin
File size(KB): 3
Model successfully installed


ExtractSentiment

Summary
The ExtractSentiment function extracts the sentiment (positive, negative, or neutral) of each input
document or sentence, using either a classification model output by the function TrainSentimentExtractor
or a dictionary model.
The dictionary model consists of WordNet, a lexical database of the English language, and the following
negation words:
• no
• not
• neither
• never
• scarcely
• hardly
• nor
• little
• nothing
• seldom
• few
The function handles negated sentiments as follows:
• -1 if the sentiment is negated (for example, “I am not happy”)
• -1 if the sentiment and a negation word are separated by one or two words (for example, “I am not very
happy” or “I am not at all happy”)
• +1 if the sentiment and a negation word are separated by three words (for example, “I am not saying I am
happy”)
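For example, if the dictionary assigns 'happy' an opinion score of 1, then 'I am not happy' and 'I am not at all happy' each contribute -1 to the score, while 'I am not saying I am happy' contributes +1.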
This function can be used with real-time applications. Refer to AMLGenerator.

Usage

ExtractSentiment Syntax
Version 3.1

SELECT * FROM ExtractSentiment (
  ON { table | view | (query) } [ PARTITION BY ANY ]
  [ ON { table | view | (query) } AS dict DIMENSION ]
  TextColumn ('text_column')
  [ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
  [ Model ({ 'dictionary[:dict_file]' | 'classification:model_file' }) ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
  [ Level ({ 'document' | 'sentence' }) ]
  [ HighPriority ({ 'NEGATIVE_RECALL' | 'NEGATIVE_PRECISION' | 'POSITIVE_RECALL' | 'POSITIVE_PRECISION' | 'NONE' }) ]
  [ Filter ({ 'POSITIVE' | 'NEGATIVE' | 'ALL' }) ]
);

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input column that contains text from which to
extract sentiments.
Language Optional Specifies the language of the input text:
• 'en' (English, the default)
• 'zh_CN' (Simplified Chinese)
• 'zh_TW' (Traditional Chinese)

Model Optional Specifies the model type and file. The default model type is dictionary. If
you omit this argument or specify dictionary without dict_file, then you
must specify a dictionary table with alias 'dict'. If you specify both dict
and dict_file, then whenever their words conflict, dict has higher priority.
The dict_file must be a text file in which each line contains only a
sentiment word, a space, and the opinion score of the sentiment word.
If you specify 'classification:model_file', then model_file must be the name of a model file generated and installed on the database by the function TrainSentimentExtractor.

Note:
Before running the function, add the location of dict_file or
model_file to the user/session default search path.

Accumulate Optional Specifies the names of the input columns to copy to the output table.
Level Optional Specifies the level of analysis—whether to analyze each document (the
default) or each sentence.
HighPriority Optional Specifies the highest priority when returning results:
• NEGATIVE_RECALL
Give highest priority to negative results, including those with lower-
confidence sentiment classifications (maximizes the number of
negative results returned).
• NEGATIVE_PRECISION
Give highest priority to negative results with high-confidence
sentiment classifications.
• POSITIVE_RECALL
Give highest priority to positive results, including those with lower-
confidence sentiment classifications (maximizes the number of
positive results returned).
• POSITIVE_PRECISION
Give highest priority to positive results with high-confidence
sentiment classifications.
• NONE
Give all results the same priority.

Filter Optional Specifies the kind of results to return:


• POSITIVE
Return only results with positive sentiments.
• NEGATIVE
Return only results with negative sentiments.
• ALL (default)
Return all results.
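As a sketch, a custom dict_file in the format described for the Model argument might contain lines like these (the words and scores here are hypothetical, not a shipped dictionary):

excellent 2
good 1
poor -1
terrible -2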

Input
The function has a required input table and an optional dictionary table.
The following table describes the required input table columns. The table can have additional columns, but
the function ignores them.
Table 672: ExtractSentiment Input Table Schema

Column Name Data Type Description


text_column VARCHAR Text from which to extract sentiments.
accumulate_column Any Column to copy to the output table.

The following table describes the required first and second columns of the dictionary table. The table can
have additional columns, but the function ignores them.
Table 673: ExtractSentiment Dictionary Table Schema

Column Name Data Type Description


sentiment_word VARCHAR First column, containing the sentiment word.
opinion_score INTEGER Second column, containing the opinion score for sentiment word.

Output
Table 674: ExtractSentiment Output Table Schema

Column Name Data Type Description


accumulate_column Same as in Column copied from the input table.
input table
out_content This column appears only for Level ('sentence'), and contains the sentence.
out_polarity Depends on the value of out_content:
If out_content is NULL: nothing
If out_content is an empty string: UNKNOWN
Otherwise: POS (positive), NEG (negative), or NEU (neutral)
out_strength Strength of out_polarity if it is POS, NEG, or NEU; otherwise nothing:
0: neutral
1: higher than neutral
2: higher than 1
out_sentiment_words This column appears only when the function uses a dictionary
model, and contains all sentiment words in the document or
sentence.

Examples
• Prerequisites
• Input
• Example 1: Model ('dictionary'), Level ('document')
• Example 2: Model ('dictionary'), Level ('sentence')
• Example 3: Model ('classification:default_sentiment_classification_model.bin')
• Example 4: Model ('classification:sentimentmodel1.bin')
• Example 5: Dictionary Table Instead of Model File

Prerequisites
These files must be installed in the directory sentimentAnalysisModel:
• default_sentiment_classification_model.bin
• For English input text: default_sentiment_lexicon.txt
• For Simplified Chinese input text: default_sentiment_lexicon_zh_cn.txt
• For Traditional Chinese input text: default_sentiment_lexicon_zh_tw.txt

Input
Table 675: ExtractSentiment Examples Input Table sentiment_extract_input

id product category review


1 camera POS we primarily bought this camera for high
image quality and excellent video capability
without paying the price for a dslr. it has
excelled in what we expected of it, and
consequently represented excellent value for
me. all my friends want my camera for their
vacations. i would recommend this camera to
anybody. definitely worth the price. plus, when
you buy some accessories, it becomes even
more powerful.

2 office suite POS it is the best office suite i have used to date. it
is launched before office 2010 and it is ages
ahead of it already. the fact that i could
comfortable import xls, doc, ppt and modify
them, and then export them back to the doc,
xls, ppt is terrific. i needed the compatibility. it
is a very intuitive suite and the drag drop
functionality is terrific.
3 camera POS this is a nice camera, delivering good quality
video images decent photos. light small, using
easily obtainable, high quality minidv i love it.
minor irritations include touchscreen based
menu only digital photos can only be
transferred via usb, requiring ilink and usb if
you use ilink.
4 gps POS it is a fine gps. outstanding performance,
works great. you can even get incredible
coordinate accuracy from streets and trips to
compare.
5 gps POS nice graphs and map route info. i would not
run outside again without this unique gadget.
great job. big display, good backlight, really
watertight, training assistant. i use in trail
running and it worked well through out the
race.
6 gps NEG most of the complaints i have seen in here are
from a lack of rtfm. i have never seen so many
mistakes do to what i think has to be none
update of data to the system. i wish i could
make all the rating stars be empty.
7 gps NEG this machine is all screwed up. on my way
home from a friends house it told me there is
no possible route. i found their website
support difficult to navigate. i am is so
disappointed and just returned it and now
looking for another one
8 camera NEG i hate my camera, and im stuck with it. this
camera sucks so bad, even the dealers on ebay
have difficulty selling it. horrible indoors, does
not capture fast action, screwy software, no
suprise, and screwy audio/video codec that
does not work with hardly any app.
9 television NEG $3k is way too much money to drop onto a
piece of crap. poor customer support. after
about 1 and a half years and hardly using the

tv, a big yellow pixilated stain appeared.
product is very inferior and subject to several
lawsuits. i expressed my dissatisfaction with
the situation as this is a known issue.
10 camera NEG i returned my camera to the vendor as i will
not tolerate a sub standard product that is a
known issue especially from vendor who will
not admit that this needs to be removed from
the shelf due to failing parts updated. due to
the constant need for repair, i would never
recommend this product.

Example 1: Model ('dictionary'), Level ('document')


This example uses the dictionary model file default_sentiment_lexicon.txt.

SQL-MapReduce Call

SELECT * FROM ExtractSentiment (
  ON sentiment_extract_input
  TextColumn ('review')
  Model ('dictionary')
  Level ('document')
  Accumulate ('id', 'product')
) ORDER BY id;

Output

Table 676: ExtractSentiment Example 1 Output Table

id product out_polarity out_strength out_sentiment_words


1 camera POS 2 excellent 1, capability 1, excelled 1,
excellent 1, recommend 1, worth 1,
powerful 1. In total, positive score:7
negative score:0
2 office suite POS 2 best 1, comfortable 1, terrific 1, intuitive
1, drag -1, terrific 1. In total, positive
score:5 negative score:-1
3 camera POS 2 nice 1, good 1, decent 1, obtainable 1,
love 1, irritations -1. In total, positive
score:5 negative score:-1
4 gps POS 2 fine 1, outstanding 1, works 1, great 1,
incredible 1. In total, positive score:5
negative score:0

5 gps POS 2 nice 1, great 1, good 1, worked 1, well 1.
In total, positive score:5 negative score:0
6 gps NEG 2 complaints -1, lack -1, mistakes -1. In
total, positive score:0 negative score:-3
7 gps NEG 2 screwed -1, support 1, difficult -1,
disappointed -1. In total, positive score:1
negative score:-3
8 camera NEG 2 hate -1, stuck -1, sucks -1, bad -1,
difficulty -1, horrible -1, not fast -1,
screwy -1, screwy -1, not work -1. In
total, positive score:0 negative score:-10
9 television NEG 2 crap -1, poor -1, support 1, stain -1,
inferior -1, issue -1. In total, positive
score:1 negative score:-5
10 camera NEG 2 issue -1, failing -1, never recommend -1.
In total, positive score:0 negative score:-3

Example 2: Model ('dictionary'), Level ('sentence')


This example uses the dictionary model file default_sentiment_lexicon.txt.

SQL-MapReduce Call

SELECT * FROM ExtractSentiment (
  ON sentiment_extract_input
  TextColumn ('review')
  Model ('dictionary')
  Level ('sentence')
  Accumulate ('id', 'product')
) ORDER BY id;

Output

Table 677: ExtractSentiment Example 2 Output Table

id product out_content out_polarity out_strength out_sentiment_words
1 camera we primarily bought this POS 2 excellent 1, capability
camera for high image quality 1, excelled 1, excellent
and excellent video capability 1. In total, positive
without paying the price for a score:4 negative score:
dslr. it has excelled in what we 0
expected of it, and
consequently represented
excellent value for me. all my

Teradata Aster Analytics Foundation User Guide 773


Chapter 6: Text Analysis
ExtractSentiment

i product out_content out_polarity out_strength out_sentiment_words


d
friends want my camera for
their vacations.
1 camera i would recommend this POS 2 recommend 1, worth
camera to anybody. definitely 1, powerful 1. In total,
worth the price. plus, when positive score:3
you buy some accessories, it negative score:0
becomes even more powerful.
2 office suite it is the best office suite i have POS 2 best 1. In total,
used to date. positive score:1
negative score:0
2 office suite it is launched before office NEU 0
2010 and it is ages ahead of it
already.
2 office suite the fact that i could POS 2 comfortable 1, terrific
comfortable import xls, doc, 1. In total, positive
ppt and modify them, and score:2 negative score:
then export them back to the 0
doc, xls, ppt is terrific.
... ... ... ... ... ...

Example 3: Model
('classification:default_sentiment_classification_model.bin')
This example uses the maximum entropy classification model file
default_sentiment_classification_model.bin.

SQL-MapReduce Call

SELECT * FROM ExtractSentiment (
  ON sentiment_extract_input
  TextColumn ('review')
  Model ('classification:default_sentiment_classification_model.bin')
  Level ('document')
  Accumulate ('id')
) ORDER BY id;

Output

Table 678: ExtractSentiment Example 3 Output Table

id out_polarity out_strength
1 NEG 2
2 POS 2

3 NEG 2
4 POS 2
5 POS 1
6 NEG 2
7 NEG 2
8 NEG 2
9 NEG 2
10 NEG 2

Example 4: Model ('classification:sentimentmodel1.bin')


Model file sentimentmodel1.bin is the output of the TrainSentimentExtractor function; see the Example section of that function.

SQL-MapReduce Call

SELECT * FROM ExtractSentiment (
  ON sentiment_extract_input
  TextColumn ('review')
  Model ('classification:sentimentmodel1.bin')
  Level ('document')
  Accumulate ('id')
) ORDER BY id;

Output

Table 679: ExtractSentiment Example 4 Output Table

id out_polarity out_strength
1 POS 2
2 POS 2
3 POS 2
4 POS 2
5 POS 2
6 NEG 2
7 NEG 2
8 NEG 2
9 NEG 2
10 NEG 2

Example 5: Dictionary Table Instead of Model File
Table 680: ExtractSentiment Example 5 Dictionary Table sentiment_word

word opinion
screwed 2
excellent 2
incredible 2
terrific 2
outstanding 2
fun 1
love 1
nice 1
big 0
update 0
constant 0
small 0
mistake -1
difficulty -1
disappointed -1
not tolerate -1
stuck -1
terrible -2
crap -2

SQL-MapReduce Call

SELECT * FROM ExtractSentiment (
  ON sentiment_extract_input PARTITION BY ANY
  ON sentiment_word AS dict DIMENSION
  TextColumn ('review')
  Level ('document')
  Accumulate ('id', 'product')
) ORDER BY id;

Output

Table 681: ExtractSentiment Example 5 Output Table

id  product  out_polarity  out_strength  out_sentiment_words

1   camera        POS  2  excellent 2, excellent 2. In total, positive score:4 negative score:0
2   office suite  POS  2  terrific 2, terrific 2. In total, positive score:4 negative score:0
3   camera        POS  2  nice 1, small 0, love 1. In total, positive score:2 negative score:0
4   gps           POS  2  outstanding 2, incredible 2. In total, positive score:4 negative score:0
5   gps           POS  2  nice 1, big 0. In total, positive score:1 negative score:0
6   gps           NEU  0  update 0. In total, positive score:0 negative score:0
7   gps           POS  2  screwed 2. In total, positive score:2 negative score:0
8   camera        NEG  2  stuck -1, difficulty -1. In total, positive score:0 negative score:-2
9   television    NEG  2  crap -2, big 0. In total, positive score:0 negative score:-2
10  camera        NEG  2  not tolerate -1, constant 0. In total, positive score:0 negative score:-1


EvaluateSentimentExtractor

Summary
The EvaluateSentimentExtractor function uses test data to evaluate the precision and recall of the
predictions output by the function ExtractSentiment. The precision and recall are affected by the model that
ExtractSentiment uses; therefore, if you change the model, you must rerun EvaluateSentimentExtractor on
the new predictions.
For basic information on precision and recall calculations, see http://en.wikipedia.org/wiki/Precision_and_recall.

Usage

EvaluateSentimentExtractor Syntax
Version 1.1

SELECT * FROM EvaluateSentimentExtractor (


ON { table | view | (query) } PARTITION BY 1
ObsColumn ('observed_column')
SentimentColumn ('sentiment_column')
);

Arguments
Argument         Category  Description
ObsColumn        Required  Specifies the name of the input column with the observed sentiment (POS, NEG, or NEU).
SentimentColumn  Required  Specifies the name of the input column with the predicted sentiment (POS, NEG, or NEU).

Input
The input table, which contains the test data, must have the columns described in the following table. The
table can have additional columns, but the function ignores them.
Table 682: EvaluateSentimentExtractor Input Table Schema

Column Name       Data Type  Description
observed_column   VARCHAR    Observed sentiment (POS, NEG, or NEU).
sentiment_column  VARCHAR    Sentiment predicted by the ExtractSentiment function (POS, NEG, or NEU).

Output
Table 683: EvaluateSentimentExtractor Output Table Schema

Column Name        Data Type  Description
evaluation_result  VARCHAR    Reports these values:
                              • positive record (total relevant, relevant, total retrieved)
                              • recall and precision
                              • negative record (total relevant, relevant, total retrieved)
                              • recall and precision
                              • positive and negative record (total relevant, relevant, total retrieved)
                              • recall and precision

Given these definitions:


• POS_EXPECT = count of rows in which observed sentiment is POS
• POS_RETURN = count of rows in which predicted sentiment is POS
• POS_TRUE = count of rows in which both observed and predicted sentiment are POS
• NEG_EXPECT = count of rows in which observed sentiment is NEG
• NEG_RETURN = count of rows in which predicted sentiment is NEG
• NEG_TRUE = count of rows in which both observed and predicted sentiment are NEG
• NEU_EXPECT = count of rows in which observed sentiment is NEU
• NEU_RETURN = count of rows in which predicted sentiment is NEU
• NEU_TRUE = count of rows in which both observed and predicted sentiment are NEU
Precision and recall are calculated as follows:
• Positive sentiment:
∘ Precision = POS_TRUE / POS_RETURN
∘ Recall = POS_TRUE / POS_EXPECT
• Negative sentiment:
∘ Precision = NEG_TRUE / NEG_RETURN
∘ Recall = NEG_TRUE / NEG_EXPECT
• Neutral sentiment:
∘ Precision = NEU_TRUE / NEU_RETURN
∘ Recall = NEU_TRUE / NEU_EXPECT
• All sentiment:
∘ Precision = (POS_TRUE + NEG_TRUE + NEU_TRUE) / (POS_RETURN + NEG_RETURN +
NEU_RETURN)
∘ Recall = (POS_TRUE + NEG_TRUE + NEU_TRUE) / (POS_EXPECT + NEG_EXPECT +
NEU_EXPECT)
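
For spot-checking, the aggregate positive-and-negative metrics can also be computed with ordinary SQL. The following is a minimal sketch, not part of the function: it assumes the ExtractSentiment output has been saved to a table named sentiment_results with an observed-label column category and a predicted column out_polarity (both names are illustrative).

SELECT
  -- correct POS/NEG predictions divided by all POS/NEG predictions
  SUM(CASE WHEN category = out_polarity
            AND out_polarity IN ('POS', 'NEG') THEN 1 ELSE 0 END)::NUMERIC
  / NULLIF(SUM(CASE WHEN out_polarity IN ('POS', 'NEG')
                    THEN 1 ELSE 0 END), 0) AS pos_neg_precision,
  -- correct POS/NEG predictions divided by all observed POS/NEG documents
  SUM(CASE WHEN category = out_polarity
            AND category IN ('POS', 'NEG') THEN 1 ELSE 0 END)::NUMERIC
  / NULLIF(SUM(CASE WHEN category IN ('POS', 'NEG')
                    THEN 1 ELSE 0 END), 0) AS pos_neg_recall
FROM sentiment_results;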

Example
• Input
• Example 1: Model ('dictionary')

• Example 2: Model ('classification:default_sentiment_classification_model.bin')
• Example 3: Model ('classification:sentimentmodel1.bin')
• Example 4: Dictionary Table Instead of Model File

Input
The input to the function EvaluateSentimentExtractor is the output from the function ExtractSentiment;
therefore, these examples have the same Prerequisites as ExtractSentiment.

Example 1: Model ('dictionary')


This example uses the dictionary model file default_sentiment_lexicon.txt.

SQL-MapReduce Call

SELECT * FROM EvaluateSentimentExtractor (


ON ExtractSentiment (
ON sentiment_extract_input
TextColumn ('review')
Accumulate ('category')
Model ('dictionary')
)
PARTITION BY 1
ObsColumn ('category')
SentimentColumn ('out_polarity')
);

Output

Table 684: EvaluateSentimentExtractor Example 1 Output Table

evaluation_result
positive record (total relevant, relevant, total retrieved): 5 5 5
recall and precision: 1.00 1.00
negative record (total relevant, relevant, total retrieved): 5 5 5
recall and precision: 1.00 1.00
positive and negative record (total relevant, relevant, total retrieved): 10 10 10
recall and precision: 1.00 1.00

Example 2: Model ('classification:default_sentiment_classification_model.bin')

This example uses the classification model file default_sentiment_classification_model.bin.

SQL-MapReduce Call

SELECT * FROM EvaluateSentimentExtractor (


ON ExtractSentiment (

ON sentiment_extract_input
TextColumn ('review')
Accumulate ('category')
Model ('classification:default_sentiment_classification_model.bin')
)
PARTITION BY 1
ObsColumn ('category')
SentimentColumn ('out_polarity')
);

Output

Table 685: EvaluateSentimentExtractor Example 2 Output Table

evaluation_result
positive record (total relevant, relevant, total retrieved): 5 3 3
recall and precision: 0.60 1.00
negative record (total relevant, relevant, total retrieved): 5 5 7
recall and precision: 1.00 0.71
positive and negative record (total relevant, relevant, total retrieved): 10 8 10
recall and precision: 0.80 0.80
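
To read this output, using the formulas above: of the 5 documents whose observed sentiment is positive, 3 were retrieved as positive and all 3 are correct, so recall = 3/5 = 0.60 and precision = 3/3 = 1.00; of the 5 observed negatives, 7 documents were retrieved as negative and 5 are correct, so recall = 5/5 = 1.00 and precision = 5/7 ≈ 0.71.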

Example 3: Model ('classification:sentimentmodel1.bin')

This example uses the model file sentimentmodel1.bin, output by the TrainSentimentExtractor function; see its Example section.

SQL-MapReduce Call

SELECT * FROM EvaluateSentimentExtractor (


ON ExtractSentiment (
ON sentiment_extract_input
TextColumn ('review')
Accumulate ('category')
Model ('classification:sentimentmodel1.bin')
)
PARTITION BY 1
ObsColumn ('category')
SentimentColumn ('out_polarity')
);

Output

Table 686: EvaluateSentimentExtractor Example 3 Output Table

evaluation_result
positive record (total relevant, relevant, total retrieved): 5 5 5
recall and precision: 1.00 1.00
negative record (total relevant, relevant, total retrieved): 5 5 5

recall and precision: 1.00 1.00
positive and negative record (total relevant, relevant, total retrieved): 10 10 10
recall and precision: 1.00 1.00

Example 4: Dictionary Table Instead of Model File


The dictionary table, sentiment_word, is from the ExtractSentiment function; see Example 5: Dictionary Table Instead of Model File.

SQL-MapReduce Call

SELECT * FROM EvaluateSentimentExtractor (


ON ExtractSentiment (
ON sentiment_extract_input PARTITION BY ANY
ON sentiment_word AS dict DIMENSION
TextColumn ('review')
Accumulate ('category')
)
PARTITION BY 1
ObsColumn ('category')
SentimentColumn ('out_polarity')
);

Output

Table 687: EvaluateSentimentExtractor Example 4 Output Table

evaluation_result
positive record (total relevant, relevant, total retrieved): 5 5 6
recall and precision: 1.00 0.83
negative record (total relevant, relevant, total retrieved): 5 3 3
recall and precision: 0.60 1.00
positive and negative record (total relevant, relevant, total retrieved): 10 8 9
recall and precision: 0.80 0.89

Text Classifier

Summary
Text Classifier is composed of these functions:
• TextClassifierTrainer, which trains the text classifier and creates a model
• TextClassifier, which classifies the text
• TextClassifierEvaluator, which evaluates the trained classifier model


Background
Text classification is the task of choosing the correct class label for a given text input. In basic text
classification tasks, each input is considered in isolation from all other inputs, and the set of class labels is
defined in advance.
Text classification is a two-stage process:
1. Train the model:
   ∘ Preprocess the text data to produce tokens, using natural language processing (NLP) functionality such as tokenization, stemming, and stop words.
   ∘ From the tokens, use statistical measures to select a subset.
   ∘ Generate the feature for each word in the subset.
   ∘ Use machine learning algorithms to train a classifier.
2. Classify the text.

TextClassifierTrainer

Summary
The TextClassifierTrainer function trains a machine learning classifier for text classification and creates a
model file. After installing the model file in Aster Database, you can input it to the function TextClassifier.

Usage

TextClassifierTrainer Syntax
Version 1.4

SELECT * FROM TextClassifierTrainer (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]

[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
TextColumn ('text_column')
CategoryColumn ('category_column')
ModelFile ('model_file')
ClassifierType ({ 'KNN' | 'MaxEnt' })
[ ClassifierParameters ('name:value' [,...]) ]
[ NLPParameters ('name:value' [,...]) ]
[ FeatureSelectionMethod ('DF:[{ min:max | min: | :max}]') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument  Category  Description

InputTable  Required  Specifies the name of the table that contains the documents to use to train the model.

TextColumn  Required  Specifies the name of the column that contains the text of the training documents.

CategoryColumn  Required  Specifies the name of the column that contains the category of the training documents.

ModelFile  Required  Specifies the name for the model file to be generated.

ClassifierType  Required  Specifies the classifier type of the model: KNN algorithm or maximum entropy model.

ClassifierParameters  Optional  Applies only if the classifier type of the model is KNN. Specifies parameters for the classifier. The name must be 'compress' and the value must be in the range (0, 1). The n training documents are clustered into value*n groups (for example, if there are 100 training documents, then ClassifierParameters ('compress:0.6') clusters them into 60 groups), and the model uses the center of each group as the feature vector.

NLPParameters  Optional  Specifies natural language processing (NLP) parameters for preprocessing the text data and producing tokens. Each name:value pair must be one of the following:
  • tokenDictFile:token_file, where token_file is the name of an Aster Database file in which each line contains a phrase, followed by a space, followed by the token for the phrase (and nothing else).
  • stopwordsFile:stopword_file, where stopword_file is the name of an Aster Database file in which each line contains exactly one stop word (a word to ignore during tokenization, such as a, an, or the).
  • useStem:{'true' | 'false'}, which specifies whether the function stems the tokens. The default value is 'false'.
  • stemIgnoreFile:stem_ignore_file, where stem_ignore_file is the name of an Aster Database file in which each line contains exactly one word to ignore during stemming. Specifying this parameter with 'useStem:false' causes an exception.
  • useBgram:{'true' | 'false'}, which specifies whether the function uses Bigram, which considers the proximity of adjacent tokens when analyzing them. The default value is 'false'.
  • language:{'en' | 'zh_CN' | 'zh_TW'}, which specifies the language of the input text: English (en), Simplified Chinese (zh_CN), or Traditional Chinese (zh_TW). The default value is 'en'. For the values zh_CN and zh_TW, the function ignores the parameters useStem and stemIgnoreFile.
  Example:

  NLPParameters ('tokenDictFile:token_dict.txt',
    'stopwordsFile:fileName',
    'useStem:true',
    'stemIgnoreFile:fileName',
    'useBgram:true',
    'language:zh_CN')

FeatureSelectionMethod  Optional  Specifies the feature selection method, DF (document frequency). The values min and max must be in the range (0, 1). The function selects only the tokens that appear in at least min*n documents and at most max*n documents, where n is the number of training documents. For example, FeatureSelectionMethod ('DF:[0.1:0.9]') causes the function to select only the tokens that appear in at least 10% but no more than 90% of the training documents. If min exceeds max, the function uses min as max and max as min.

Input
The input table must have the columns described in the following table. The input table can have additional
columns, but the function ignores them.

Table 688: TextClassifierTrainer Input Table Schema

Column Name Data Type Description


text_column VARCHAR Contains the text of the training documents.
category_column VARCHAR Contains the category of the training documents.

Output
The function outputs a binary file with the name specified by the ModelFile argument, installs the binary file in Aster Database, and prints a message about the model generation.
Table 689: TextClassifierTrainer Output Message Schema

Column Name Data Type Description


train_result VARCHAR Contains information about the model generation.

Example

Input
Table 690: TextClassifierTrainer Example Input Table texttrainer_input

id  content  category

1   Tennis star Roger Federer was born on August 8, 1981, in Basel, Switzerland, to Swiss father Robert Federer and South African mother Lynette Du Rand  sports
2   Federer took an interest in sports at an early age, playing tennis and soccer at the age of 8.  sports
3   At age 14, Federer became the national junior champion in Switzerland  sports
4   Federer won the Wimbledon boys singles and doubles titles in 1998, and turned professional later that year.  sports
5   In 2003, following a successful season on grass, Federer became the first Swiss man to win a Grand Slam title when he emerged victorious at Wimbledon.  sports
6   A natural disaster is a major adverse event resulting from natural processes of the Earth. Examples include floods, volcanic eruptions, earthquakes, tsunamis, and other geologic processes.  natural disaster
7   In a vulnerable area, however, such as San Francisco in 1906, an earthquake can have disastrous consequences and leave lasting damage, requiring years to repair.  natural disaster
8   An earthquake is the result of a sudden release of energy in the Earth crust that creates seismic waves.  natural disaster
9   Volcanoes can cause widespread destruction and consequent disaster in several ways.  natural disaster
10  A flood is an overflow of water that submerges land  natural disaster

This example uses this stop words file, stopwords.txt:

a
an
in
is
to
into
was
the
and
this
with
they
but
will

SQL-MapReduce Call

SELECT * FROM TextClassifierTrainer (


ON (SELECT 1) PARTITION BY 1
InputTable ('texttrainer_input')
TextColumn ('content')
CategoryColumn ('category')
ModelFile ('knn.bin')
ClassifierType ('knn')
ClassifierParameters ('compress:0.9')
NLPParameters ('useStem:true', 'stopwordsFile: stopwords.txt')
FeatureSelectionMethod ('DF:[0.1:0.99]')
);

Output
Table 691: TextClassifierTrainer Example Output Table

train_result
Model generated.
Training time(s): 0.216
File name: knn.bin
File size(KB): 1
Model successfully installed


TextClassifier

Summary
The TextClassifier function classifies input text, using a model output by the function TextClassifierTrainer.

Usage

TextClassifier Syntax
Version 1.2

SELECT * FROM TextClassifier (


ON input_table
TextColumn ('text_column')
Model ('model_name')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument    Category  Description
TextColumn  Required  Specifies the column of the input table that contains the text to be used for predicting classification.
Model       Required  Specifies the model (which you must install in the database before calling the function).
Accumulate  Optional  Specifies the names of the input columns to copy to the output table.

Input
Table 692: TextClassifier Input Table Schema

Column Name Data Type Description


text_column VARCHAR Text to be classified.
accumulate_column Any Column to copy to the output table.

Output
Table 693: TextClassifier Output Table Schema

Column Name        Data Type               Description
out_category       VARCHAR                 Predicted category of the text. If the text is empty or contains no model features, then this value is NULL.
accumulate_column  Same as in input table  Column copied from the input table.

Example

Input
• The following table, textclassifier_input
• Model file knn.bin, output by the TextClassifierTrainer function; see its Example section.
Table 694: TextClassifier Example Input Table: textclassifier_input

id  content  category

16  At the beginning of 2004, Federer had a world ranking of No. 2, and that same year, he won the Australian Open, the U.S. Open, the ATP Masters and retained the Wimbledon singles title.  sports
17  Federer held on to his No. 1 ranking from 2004 into 2008. In 2006 and 2007, he won the singles championships at the Australian Open, Wimbledon and the U.S. Open.  sports
18  A paragon of graceful athleticism, Federer was named the Laureus World Sportsman of the Year from 2005-08.  sports
19  Cyclone, tropical cyclone, hurricane, and typhoon are different names for the same phenomenon, which is a cyclonic storm system that forms over the oceans.  natural disaster
20  Drought is the unusual dryness of soil, resulting in crop failure and shortage of water and for other uses which is caused by significant low rainfall than average over a prolonged period.  natural disaster
21  A tornado is a violent, dangerous, rotating column of air that is in contact with both the surface of the earth and a cumulonimbus cloud or, in rare cases, the base of a cumulus cloud.  natural disaster

SQL-MapReduce Call

SELECT * FROM TextClassifier (


ON textclassifier_input
TextColumn ('content')
Model ('knn.bin')

Accumulate ('id', 'category')
) ORDER BY id;

Output
Table 695: TextClassifier Example Output Table

id category out_category
16 sports sports
17 sports sports
18 sports natural disaster
19 natural disaster natural disaster
20 natural disaster natural disaster
21 natural disaster natural disaster

TextClassifierEvaluator

Summary
The TextClassifierEvaluator function evaluates the precision, recall, and F-measure of the predictions that the TextClassifier function makes with a trained model.

Usage

TextClassifierEvaluator Syntax
Version 1.2

SELECT * FROM TextClassifierEvaluator (


ON { table_name| view_name| (query) } PARTITION BY 1
ObsColumn ('expected_column')
PredictColumn ('predicted_column')
);

Arguments
Argument       Category  Description
ObsColumn      Required  Specifies the name of the input column that contains the expected (correct) category.
PredictColumn  Required  Specifies the name of the input column that contains the predicted category (assigned by the function TextClassifier).

Input
Table 696: TextClassifierEvaluator Input Table Schema

Column Name Data Type Description


expected_column VARCHAR Expected (correct) category.
predicted_column VARCHAR Predicted category (assigned by the function TextClassifier).

Output
Table 697: TextClassifierEvaluator Output Table Schema

Column Name  Data Type         Description
precision    DOUBLE PRECISION  Precision of the model.
recall       DOUBLE PRECISION  Recall of the model.
f-measure    DOUBLE PRECISION  F-measure of the model.
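
The f-measure column is presumably the balanced F1 score, the harmonic mean of precision and recall:

F-measure = 2 * (precision * recall) / (precision + recall)

This reading is consistent with the example below, where precision and recall are equal and the F-measure matches them.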

Example

Input
• The table TextClassifier Example Output Table, from the Output section of the TextClassifier example.
• Model file knn.bin, output by the TextClassifierTrainer function; see its Example section.

SQL-MapReduce Call

SELECT * FROM TextClassifierEvaluator (


ON TextClassifier (
ON (SELECT * FROM textclassifier_input)
TextColumn ('content')
Accumulate ('category')
Model ('knn.bin')
) PARTITION BY 1
ObsColumn ('category')
PredictColumn ('out_category')
);

Output
Table 698: TextClassifierEvaluator Example Output Table

precision recall f-measure


0.833333333333333 0.833333333333333 0.833333333333333
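
These values correspond to the TextClassifier example output above: 5 of the 6 test documents are classified correctly (document 18 is predicted as natural disaster but expected to be sports), so precision = recall = 5/6 ≈ 0.8333, and with equal precision and recall the F-measure is the same value. That the metrics equal overall accuracy suggests micro-averaging across categories, though the guide does not state this explicitly.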

Text_Parser

Summary
The Text_Parser function tokenizes an input stream of words, optionally stems them (reduces them to their
root forms), and then outputs them. The function can either output all words in one row or output each
word in its own row with (optionally) the number of times that the word appears.
This function can be used with real-time applications. Refer to AMLGenerator.

Background
Parsing English language text includes:
• Punctuating sentences
• Breaking a sentence into words (tokenizing it)
• Removing stop words
• Stemming words (reducing them to their root forms)
The Text_Parser function reads a document into a memory buffer and creates a hash table. The dictionary
for the document must not exceed available memory; however, a million-word dictionary with an average
word length of ten bytes requires only 10 MB of memory.
The Text_Parser function uses Porter2 as the stemming algorithm.
For general information about tokenization, see: http://en.wikipedia.org/wiki/Lexical_analysis#Tokenizer
For general information about stemming, see: http://en.wikipedia.org/wiki/Stemming

Usage

Text_Parser Syntax
Version 1.3

SELECT * FROM Text_Parser (


ON { table_name| view_name| (query) }
[ PARTITION BY expression [,...] ]

TextColumn ('text_column_name')
[ ToLowerCase ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Stemming ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Delimiter ('delimiter_regular_expression') ]
[ TotalWordsNum
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Punctuation ('punctuation_regular_expression') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ TokenColumn ('token_column') ]
[ FrequencyColumn ('frequency_column') ]
[ TotalColumn ('total_column') ]
[ RemoveStopWords
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ PositionColumn ('position_column') ]
[ ListPositions
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OutputByWord ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ StemmingExceptions ('exception_rule_file') ]
[ StopWords ('stop_word_file') ]
);

Note:
If you include the PARTITION BY clause, the function treats all rows in the same partition as a single
document. If you omit the PARTITION BY clause, the function treats each row as a single document.

Arguments
Argument  Category  Description

TextColumn  Required  Specifies the name of the input column whose contents are to be tokenized.

ToLowerCase  Optional  Specifies whether to convert input text to lowercase. The default value is 'true'.
  Note: The function ignores this argument if the Stemming argument has the value 'true'.

Stemming  Optional  Specifies whether to stem the tokens, that is, whether to apply the Porter2 stemming algorithm to each token to reduce it to its root form. Before stemming, the function converts the input text to lowercase and applies the RemoveStopWords argument. The default value is 'false'.

Delimiter  Optional  Specifies a regular expression that represents the word delimiter. The default value is '[\t\b\f\r]+'.

TotalWordsNum  Optional  Specifies whether to output a column that contains the total number of words in the input document. The default value is 'false'.

Punctuation  Optional  Specifies a regular expression that represents the punctuation characters to remove from the input text. With Stemming ('true'), the recommended value is '[\\\[.,?\!:;~()\\\]]+'. The default value is '[.,!?]'.

Accumulate  Optional  Specifies the names of the input columns to copy to the output table. By default, the function copies all input columns to the output table.
  Note: No accumulate_column can be the same as token_column or total_column.

TokenColumn  Optional  Specifies the name of the output column that contains the tokens. The default value is 'token'.

FrequencyColumn  Optional  Specifies the name of the output column that contains the frequency of each token. The default value is 'frequency'.
  Note: The function ignores this argument if the OutputByWord argument has the value 'false'.

TotalColumn  Optional  Specifies the name of the output column that contains the total number of words in the input document. The default value is 'total_count'.

RemoveStopWords  Optional  Specifies whether to remove stop words from the input text before parsing. The default value is 'false'.

PositionColumn  Optional  Specifies the name of the output column that contains the position of a word within a document. The default value is 'position'.

ListPositions  Optional  Specifies whether to output the position of a word in list form. The default value is 'false', which causes the function to output a row for each occurrence of the word.
  Note: The function ignores this argument if the OutputByWord argument has the value 'false'.

OutputByWord  Optional  Specifies whether to output each token of each input document in its own row in the output table. The default value is 'true'. If you specify 'false', then the function outputs each tokenized input document in one row of the output table.

StemmingExceptions  Optional  Specifies the location of the file that contains the stemming exceptions. A stemming exception is a word followed by its stemmed form, separated by white space. Each stemming exception is on its own line in the file. For example:

  bias bias
  news news
  goods goods
  lying lie
  ugly ugli
  sky sky
  early earli

  The words 'lying', 'ugly', and 'early' are to become 'lie', 'ugli', and 'earli', respectively. The other words are not to change.

StopWords  Optional  Specifies the location of the file that contains the stop words (words to ignore when parsing text). Each stop word is on its own line in the file. For example:

  a
  an
  the
  and
  this
  with
  but
  will

Input
The Text_Parser function has one input table. If you include the PARTITION BY clause, then the function
treats all rows in the same partition as a single document. If you omit the PARTITION BY clause, then the
function treats each row as a single document.
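
For example, to treat all rows that share a document key as one document, you could partition on that key. The following call is a sketch only; it reuses the complaints table and doc_id column from the examples later in this section, and is not one of the documented examples:

SELECT * FROM Text_Parser (
  ON complaints PARTITION BY doc_id
  TextColumn ('text_data')
  OutputByWord ('true')
  Accumulate ('doc_id')
) ORDER BY doc_id;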
Table 699: Text_Parser Input Table Schema

Column Name Data Type Description


text_column VARCHAR Contains text to be parsed.
accumulate_column Any Column to copy to the output table.

Output
The Text_Parser function has one output table, whose schema depends on the value of the OutputByWord
argument.
Table 700: Text_Parser Output Table Schema, OutputByWord ('true')

Column Name        Data Type               Description
accumulate_column  Same as in input table  Column copied from the input table.
token_column       VARCHAR                 Contains tokens from the input document, one to a row.
frequency_column   INTEGER                 Contains the frequency of each token.
position_column    VARCHAR                 Contains the position of a word within a document.

Table 701: Text_Parser Output Table Schema, OutputByWord ('false')

Column Name        Data Type               Description
accumulate_column  Same as in input table  Column copied from the input table.
token_column       VARCHAR                 Contains all tokens from the input document in one row.

Examples
• Example 1: With StopWords and without StemmingExceptions
• Example 2: With StemmingExceptions and without StopWords

Example 1: With StopWords and without StemmingExceptions

Input
The input table is a log of vehicle complaints. The column category indicates whether the car was involved in a crash.
Table 702: Text_Parser Examples Input Table complaints

doc_id  text_data  category

1  consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an embankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries.  crash
2  when vehicle was involved in a crash totalling vehicle driver's side/passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries.  crash
3  consumer has experienced following problems; 1.) both lower ball joints wear out excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while driving without foot pressing on brake pedal.  no_crash
...  ...  ...

This example uses this stop words file, stopwords.txt:

a
an
in
is
to
into
was
the
and

this
with
they
but
will

SQL-MapReduce Call

CREATE DIMENSION TABLE complaints_traintoken AS


SELECT * FROM Text_Parser (
ON complaints
TextColumn ('text_data')
ToLowerCase ('true')
Stemming ('false')
OutputByWord ('true')
Punctuation ('\[.,?\!\]')
RemoveStopWords ('true')
ListPositions ('true')
Accumulate ('doc_id', 'category')
StopWords ('stopwords.txt')
) ORDER BY doc_id;

Output
This query returns the table complaints_traintoken:

SELECT * FROM complaints_traintoken;

Table 703: Text_Parser Example 1 Output Table complaints_traintoken

doc_id category token frequency position


1 crash consumer 1 0
1 crash driving 1 2
1 crash approximately 1 3
1 crash 45 1 4
1 crash mph 1 5
1 crash hit 2 6,26
1 crash deer 1 8
1 crash front 1 11
1 crash bumper 1 12
1 crash then 1 14
1 crash ran 1 15
1 crash embankment 1 18
... ... ... ... ...

Example 2: With StemmingExceptions and without StopWords

Input
The input table is the first two rows of Text_Parser Examples Input Table complaints.
Table 704: Text_Parser Example 2 Input Table complaints_mini

doc_id  text_data  category

1  consumer was driving approximately 45 mph hit a deer with the front bumper and then ran into an enbankment head-on passenger's side air bag did deploy hit windshield and deployed outward. driver's side airbag cover opened but did not inflate it was still folded causing injuries.  crash
2  when vehicle was involved in a crash totalling vehicle driver's side/passenger's side air bags did not deploy. vehicle was making a left turn and was hit by a ford f350 traveling about 35 mph on the front passenger's side. driver hit his head-on the steering wheel. hurt his knee and received neck and back injuries.  crash

The stemming exceptions file, stemmingexception.txt, contains:

consumer customer
enbankment embankment

SQL-MapReduce Call

SELECT * FROM Text_Parser (


ON complaints_mini
TextColumn ('text_data')
ToLowerCase ('true')
Stemming ('true')
OutputByWord ('false')
Punctuation ('\[.,?\!\]')
Accumulate ('doc_id', 'category')
StemmingExceptions ('stemmingexception.txt')
) ORDER BY doc_id;

Output

Table 705: Text_Parser Example 2 Output Table

doc_id  category  tokens

1  crash  customer was drive approxim 45 mph hit a deer with the front bumper and then ran into an embankment head-on passeng side air bag did deploy hit windshield and deploy outward driver side airbag cover open but did not inflat it was still fold caus injuri
2  crash  when vehicl was involv in a crash total vehicl driver side/passeng side air bag did not deploy vehicl was make a left turn and was hit by a ford f350 travel about 35 mph on the front passeng side driver hit his head-on the steer wheel hurt his knee and receiv neck and back injuri

TextChunker

Summary
The TextChunker function divides text into phrases and assigns each phrase a tag that identifies its type.

Background
Text chunking (also called shallow parsing) divides text into phrases in such a way that syntactically related
words become members of the same phrase. Phrases do not overlap; that is, a word is a member of only one
chunk.
For example, the sentence “He reckons the current account deficit will narrow to only # 1.8 billion in
September .” can be divided as follows, with brackets delimiting phrases:
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to]
[NP only # 1.8 billion] [PP in] [NP September]
After each opening bracket is a tag that identifies the chunk type (NP, VP, and so on). For information about
chunk types, refer to Output.
For more information about text chunking, see:
• Erik F. Tjong Kim Sang and Sabine Buchholz, Introduction to the CoNLL-2000 Shared Task: Chunking.
In: Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
• Fei Sha and Fernando Pereira, Shallow Parsing with Conditional Random Fields. [2003]

Usage

TextChunker Syntax
Version 1.2

SELECT * FROM TextChunker (


ON input_table PARTITION BY partition_key ORDER BY word_sn
WordColumn ('word_column')
POSColumn ('pos_tag_column')
);

Note:
The input_table is the output table of the POSTagger function, which contains the columns partition_key and word_sn.

Arguments
Argument    Category  Description
WordColumn  Required  Specifies the name of the input table column that contains the words to chunk into phrases. Typically, this is the word column of the output table of the POSTagger function (described in the Output section of Usage).
POSColumn   Required  Specifies the name of the input table column that contains the part-of-speech (POS) tags of the words. Typically, this is the pos_tag column of the output table of the POSTagger function (described in the Output section of Usage).

Input
The TextChunker function requires:
• An input table generated by the POSTagger function (for its schema, refer to POSTagger Output Table
Schema)
When running POSTagger to generate this table, specify in the Accumulate argument the name of the
input column that contains the unique row identifiers.
• The model file, chunker_default_model.bin, which is provided with the function

Note:
Before running TextChunker, add the model file location to the default search path for the user or
session.

Output
Table 706: TextChunker Output Table Schema

Column Name    Data Type  Description
partition_key  VARCHAR    Unique key that identifies the partition that contains the text.
chunk_sn       INTEGER    Sequence number of a phrase in the sentence.
chunk          VARCHAR    Text chunk (syntactically related words).
chunk_tag      VARCHAR    Phrase type tag (refer to the following table).

Table 707: TextChunker Phrase Type Tags

Tag Phrase Type


NP noun phrase
VP verb phrase
PP prepositional phrase
ADVP adverb phrase
SBAR subordinated clause
ADJP adjective phrase
PRT particles
CONJP conjunction phrase
INTJ interjection
LST list marker
UCP unlike coordinated phrase
O punctuation marks

Example
• Example 1: Using Output from POSTagger
• Example 2. Using Output from Sentenizer and POSTagger

Example 1: Using Output from POSTagger

Input

Table 708: TextChunker Example 1 Input Table cities

paraid paratext
1 I live in Los Angeles.
2 New York is a great city.
3 Chicago is a lot of fun, but the winters are very cold and windy.
4 Philadelphia and Boston have many historical sites.

SQL-MapReduce Call

SELECT * FROM TextChunker (


ON (
SELECT * FROM POSTagger (
ON cities
Accumulate ('paraid')
TextColumn ('paratext')
) ORDER BY paraid
) PARTITION BY paraid ORDER BY word_sn
WordColumn ('word')
POSColumn ('pos_tag')
);

Output

Table 709: TextChunker Example 1 Output Table

partition_key chunk_sn chunk chunk_tag


1 1 I NP
1 2 live VP
1 3 in PP
1 4 Los Angeles NP
1 5 . O
3 1 Chicago NP
3 2 is VP
3 3 a lot NP
3 4 of PP
3 5 fun NP
3 6 , O
3 7 but O
3 8 the winters NP
3 9 are VP
3 10 very cold and windy NP
3 11 . O
2 1 New York NP
2 2 is VP
2 3 a great city NP
2 4 . O
4 1 Philadelphia and Boston NP

4 2 have VP
4 3 many historical sites NP
4 4 . O

Example 2. Using Output from Sentenizer and POSTagger

Input

Table 710: TextChunker Example 2 Input Table paragraphs_input

paraid  paratopic  paratext

1  Decision Trees  Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the items target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
2  Simple Regression  In statistics, simple linear regression is the least squares estimator of a linear regression model with a single explanatory variable. In other words, simple linear regression fits a straight line through the set of n points in such a way that makes the sum of squared residuals of the model (that is, vertical distances between the points of the data set and the fitted line) as small as possible.
3  Logistic Regression  Logistic regression was developed by statistician David Cox in 1958[2][3] (although much work was done in the single independent variable case almost two decades earlier). The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). As such it is not a classification method. It could be called a qualitative response/discrete choice model in the terminology of economics.
4  Cluster analysis  Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them.
5  Association rule learning  Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal et al.[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} => {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat.

SQL-MapReduce Call
TextChunker requires each sentence to have a unique identifier, and the input to TextChunker must be partitioned by that identifier. In this call, the expression paraid*1000+sentence_sn constructs that identifier; for example, the first sentence of paragraph 1 gets sentence_id 1001, which appears as partition_key in the output below.

SELECT * FROM TextChunker (


ON POSTagger (
ON (
SELECT paraid*1000+sentence_sn
AS sentence_id, sentence FROM Sentenizer (
ON paragraphs_input
TextColumn ('paratext')
Accumulate ('paraid')
)
)
TextColumn ('sentence')
Accumulate ('sentence_id')
) PARTITION BY sentence_id ORDER BY word_sn
WordColumn ('word')
POSColumn ('pos_tag')
) ORDER BY 1, 2;

Output

Table 711: TextChunker Example 2 Output Table

partition_key chunk_sn chunk chunk_tag


1001 1 Decision tree learning NP
1001 2 uses VP
1001 3 a decision tree NP
1001 4 as PP
1001 5 a predictive model NP

1001 6 which NP
1001 7 maps VP
1001 8 observations NP
1001 9 about PP
1001 10 an item NP
1001 11 to PP
1001 12 conclusions NP
1001 13 about PP
1001 14 the items target value NP
1001 15 . O
1002 1 It NP
1002 2 is VP
1002 3 one NP
1002 4 of PP
1002 5 the predictive modelling approaches NP
1002 6 used VP
1002 7 in PP
1002 8 statistics , data mining and machine learning NP
1002 9 . O
1003 1 Tree models NP
1003 2 where ADVP
1003 3 the target variable NP
1003 4 can take VP
1003 5 a finite set NP
1003 6 of PP
1003 7 values NP
1003 8 are called VP
1003 9 classification trees NP
1003 10 . O
... ... ... ...


TextMorph

Summary
The TextMorph function outputs each input word in its standard forms (called morphs) with their
corresponding parts of speech. The following table shows examples of words and their standard forms.
Table 712: Examples of Words and Their Standard Forms

Input Word Standard Forms


books book
ran run
better good, well

Background
Lemmatization is a basic text analysis tool that determines the lemmas (standard forms) of words, so that all
forms of a word can be grouped together, improving the accuracy of text analysis.
The TextMorph function implements a lemmatization algorithm based on the WordNet 3.0 dictionary,
which is packaged with the function. If an input word is in the dictionary, the function outputs its morphs
with their parts of speech; otherwise, the function outputs the input word itself and sets its part of speech to
NULL.
When an input word has multiple morphs, the function outputs them in order of precedence of their parts of speech: noun, verb, adj, adv. That is, if an input word has a noun form, that form is listed first; a verb form is listed next; and so on.

Usage

TextMorph Syntax
Version 1.2

SELECT * FROM TextMorph (


ON { table | view | (query) }
WordColumn ('word_column')
[ POSTagColumn ('pos_tag_column') ]
[ SingleOutput ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ POS ('pos' [,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument      Category  Description
WordColumn    Required  Specifies the name of the input table column that contains the words.
POSTagColumn  Optional  Specifies the name of the input table column that contains the part-of-speech (POS) tags of the words, generated by the function POSTagger. If you specify this argument, the function outputs each morph according to its POS tag.
SingleOutput  Optional  Specifies whether to output only one morph for each word. The default value is 'true'. If you specify 'false', the function outputs all morphs for each word.
POS           Optional  Specifies the parts of speech to output. A pos can be 'noun', 'verb', 'adj', or 'adv'. Specification order is irrelevant; the order of precedence is: 'noun', 'verb', 'adj', 'adv'. By default, the function outputs all parts of speech. If you specify this argument and SingleOutput ('true'), then the function outputs only the first pos.
              Note: The function does not determine the part of speech of the word from its context; it uses all possible parts of speech for the word in the dictionary.
Accumulate    Optional  Specifies the names of the input columns to copy to the output table.

Input
Table 713: TextMorph Input Table Schema

Column Name        Data Type  Description
word_column        VARCHAR    Contains one input word or phrase in each row.
pos_tag_column     VARCHAR    Contains the POS tags of the words, generated by the function POSTagger. The following table summarizes the English POSTagger tags and gives their corresponding TextMorph parts of speech.
accumulate_column  Any        Column to copy to the output table.

Table 714: English POSTagger Tags and Corresponding TextMorph Tags

Tag Number  POSTagger Tag  Description  Examples  TextMorph POS

1   CC    Coordinating conjunction  and
2   CD    Cardinal number  1, third
3   DT    Determiner  the
4   EX    Existential there  there is
5   FW    Foreign word  hors d’oeuvre
6   IN    Preposition or coordinating conjunction  in, of, like
7   JJ    Adjective  green  adj
8   JJR   Adjective, comparative  greener  adj
9   JJS   Adjective, superlative  greenest  adj
10  LS    List item marker  1)
11  MD    Modal  could, will
12  NN    Noun, singular or mass  table  noun
13  NNS   Noun, plural  tables  noun
14  NNP   Noun, proper, singular  John  noun
15  NNPS  Noun, proper, plural  Vikings  noun
16  PDT   Predeterminer  both boys
17  POS   Possessive  friend’s  noun
18  PRP   Pronoun, personal  I, he, it
19  PRPS  Pronoun, possessive  my, his
20  RB    Adverb  however, usually, naturally, here, good  adv
21  RBR   Adverb, comparative  better  adv
22  RBS   Adverb, superlative  best  adv
23  RP    Particle  give up
24  SYM   Symbol
25  TO    to  to go, to him
26  UH    Interjection  uh, huh, hmm
27  VB    Verb, base form  take  verb
28  VBD   Verb, past tense  took  verb
29  VBG   Verb, gerund or present participle  taking  verb
30  VBN   Verb, past participle  taken  verb
31  VBP   Verb, non-third-person singular present tense  take  verb
32  VBZ   Verb, third-person singular present tense  takes  verb
33  WDT   Wh- determiner  which
34  WP    Wh- pronoun  who, what
35  WP    Wh- pronoun, possessive  whose
36  WRB   Wh- adverb  where, when

Output
Table 715: TextMorph Output Table Schema

Column Name        Data Type               Description
accumulate_column  Same as in input table  Column copied from the input table.
morph              VARCHAR                 Morph of the input word.
pos                VARCHAR                 Part of speech of the morph.

Examples
• Input
• Example 1: SingleOutput ('true')
• Example 2: SingleOutput ('false')
• Example 3: POS ('noun', 'verb') and SingleOutput ('false')
• Example 4: POS ('noun', 'verb') and SingleOutput ('true')
• Example 5: Using TextMorph with POSTagger and TextTagging

Input
Table 716: TextMorph Examples 1-4 Input Table words_input

id word
1 regression
2 Roger
3 better

4 datum
5 quickly
6 proud
7 father
8 juniors
9 doing
10 being
11 negating
12 yearly

Example 1: SingleOutput ('true')

SQL-MapReduce Call

SELECT * FROM TextMorph (


ON words_input
WordColumn ('word')
SingleOutput ('true')
Accumulate ('id', 'word')
) ORDER BY id;

Output

Table 717: TextMorph Example 1 Output Table

id word morph pos


1 regression regression noun
2 Roger Roger
3 better good adj
4 datum datum noun
5 quickly quickly adv
6 proud proud adj
7 father father noun
8 juniors junior noun
9 doing do verb
10 being be verb
11 negating negate verb

12 yearly yearly noun

Example 2: SingleOutput ('false')

SQL-MapReduce Call

SELECT * FROM TextMorph (


ON words_input
WordColumn ('word')
SingleOutput ('false')
Accumulate ('id', 'word')
) ORDER BY id;

Output

Table 718: TextMorph Example 2 Output Table

id word morph pos


1 regression regression noun
2 Roger Roger
3 better good adj
3 better well adj
3 better well adv
4 datum datum noun
5 quickly quickly adv
6 proud proud adj
7 father father noun
7 father father verb
8 juniors junior noun
9 doing do verb
10 being be verb
11 negating negate verb
12 yearly yearly noun
12 yearly yearly adj
12 yearly yearly adv

Example 3: POS ('noun', 'verb') and SingleOutput ('false')

SQL-MapReduce Call

SELECT * FROM TextMorph (


ON words_input
WordColumn ('word')
SingleOutput ('false')
POS ('noun', 'verb')
Accumulate ('id', 'word')
) ORDER BY id;

Output
For the input word better, the POS argument excludes the adjective and adverb morphs (good and well). However, the dictionary contains better itself as both a noun and a verb, so the function outputs those forms.
With SingleOutput ('false'), the words better and father appear in the output table as both nouns and verbs.
Table 719: TextMorph Example 3 Output Table

id word morph pos


1 regression regression noun
3 better better noun
3 better better verb
4 datum datum noun
7 father father noun
7 father father verb
8 juniors junior noun
9 doing do verb
10 being be verb
11 negating negate verb
12 yearly yearly noun

Example 4: POS ('noun', 'verb') and SingleOutput ('true')

SQL-MapReduce Call

SELECT * FROM TextMorph (


ON words_input
WordColumn ('word')
SingleOutput ('true')
POS ('noun', 'verb')

Accumulate ('id', 'word')
) ORDER BY id;

Output
With SingleOutput ('true'), the words better and father appear in the output table only as nouns.
Table 720: TextMorph Example 4 Output Table

id word morph pos


1 regression regression noun
3 better better noun
4 datum datum noun
7 father father noun
8 juniors junior noun
9 doing do verb
10 being be verb
11 negating negate verb
12 yearly yearly noun

Example 5: Using TextMorph with POSTagger and TextTagging


This example uses the function POSTagger to create the input table for TextMorph, whose output table is
input to the function TextTagging.

POSTagger Input

Table 721: TextMorph Example 5 POSTagger Input Table pos_input

id txt
s1 Roger Federer born on 8 August 1981, is a greatest tennis player, who has been continuously
ranked inside the top 10 since October 2002 and has won Wimbledon, USOpen, Australian and
FrenchOpen titles mutiple times

Statement to Create POSTagger Output Table

CREATE TABLE postagger_output DISTRIBUTE BY HASH(id) AS


SELECT * FROM PosTagger (
ON pos_input
Accumulate ('id')
TextColumn ('txt')
);

POSTagger Output and TextMorph Input

Table 722: TextMorph Example 5 Table postagger_output

id word_sn word pos_tag


s1 1 Roger NNP
s1 2 Federer NNP
s1 3 born NN
s1 4 on IN
s1 5 8 CD
s1 6 August NNP
s1 7 1981 CD
s1 8 , O
s1 9 is VBZ
s1 10 a DT
s1 11 greatest JJS
s1 12 tennis NN
s1 13 player NN
s1 14 , O
s1 15 who WP
s1 16 has VBZ
s1 17 been VBN
s1 18 continuously RB
s1 19 ranked VBN
s1 20 inside IN
s1 21 the DT
s1 22 top JJ
s1 23 10 CD
s1 24 since IN
s1 25 October NNP
s1 26 2002 CD
s1 27 and CC
s1 28 has VBZ
s1 29 won VBN
s1 30 Wimbledon NNP

s1 31 , O
s1 32 USOpen NNP
s1 33 , O
s1 34 Australian JJ
s1 35 and CC
s1 36 FrenchOpen JJ
s1 37 titles NNS
s1 38 mutiple JJ
s1 39 times NNS

Statement to Create TextMorph Output Table

CREATE TABLE textmorph_output DISTRIBUTE BY hash(id) AS


SELECT * FROM TextMorph (
ON postagger_output
WordColumn ('word')
POSTagColumn ('pos_tag')
Accumulate ('id', 'word_sn', 'word', 'pos_tag')
);

TextMorph Output and TextTagging Input

Table 723: TextMorph Example 5 Table textmorph_output

id word_sn word pos_tag morph pos


s1 1 Roger NNP Roger
s1 2 Federer NNP Federer
s1 3 born NN born noun
s1 4 on IN on
s1 5 8 CD 8
s1 6 August NNP august noun
s1 7 1981 CD 1981
s1 8 , O ,
s1 9 is VBZ be verb
s1 10 a DT a
s1 11 greatest JJS great adj
s1 12 tennis NN tennis noun

s1 13 player NN player noun
s1 14 , O ,
s1 15 who WP who
s1 16 has VBZ have verb
s1 17 been VBN be verb
s1 18 continuously RB continuously adv
s1 19 ranked VBN rank verb
s1 20 inside IN inside
s1 21 the DT the
s1 22 top JJ top adj
s1 23 10 CD 10
s1 24 since IN since
s1 25 October NNP october noun
s1 26 2002 CD 2002
s1 27 and CC and
s1 28 has VBZ have verb
s1 29 won VBN win verb
s1 30 Wimbledon NNP wimbledon noun
s1 31 , O ,
s1 32 USOpen NNP USOpen
s1 33 , O ,
s1 34 Australian JJ Australian adj
s1 35 and CC and
s1 36 FrenchOpen JJ FrenchOpen adj
s1 37 titles NNS title noun
s1 38 mutiple JJ mutiple adj
s1 39 times NNS time noun

SQL-MapReduce Call to TextTagging

SELECT * FROM TextTagging (


ON textmorph_output
Rules('equal(morph, "Australian") as grandslam',
'equal(morph, "wimbledon") as grandslam',
'equal(morph, "USOpen") as grandslam',

'equal(morph, "FrenchOpen") as grandslam')
Accumulate ('id', 'word_sn', 'morph')
) ORDER BY id, word_sn;

TextTagging Output

Table 724: TextMorph Example 5 TextTagging Output Table

id word_sn morph tag


s1 1 Roger
s1 2 Federer
s1 3 born
s1 4 on
s1 5 8
s1 6 august
s1 7 1981
s1 8 ,
s1 9 be
s1 10 a
s1 11 great
s1 12 tennis
s1 13 player
s1 14 ,
s1 15 who
s1 16 have
s1 17 be
s1 18 continuously
s1 19 rank
s1 20 inside
s1 21 the
s1 22 top
s1 23 10
s1 24 since
s1 25 october
s1 26 2002
s1 27 and

s1 28 have
s1 29 win
s1 30 wimbledon grandslam
s1 31 ,
s1 32 USOpen grandslam
s1 33 ,
s1 34 Australian grandslam
s1 35 and
s1 36 FrenchOpen grandslam
s1 37 title
s1 38 mutiple
s1 39 time

TextTagging

Summary
The TextTagging function tags text documents according to user-defined rules that use text-processing and
logical operators.

This function can be used with real-time applications. Refer to AMLGenerator.

Usage

TextTagging Syntax
Version 1.3

SELECT * FROM TextTagging (


ON { text_table | text_view | (text_query) } PARTITION BY ANY
[ ON { rules_table | rules_view | (rules_query) } AS rules DIMENSION ]
[ Language ({ 'en' | 'zh_cn' | 'zh_tw' }) ]
[ Rules ('rule AS tag_name' [,...]) ]
[ Tokenize ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OutputByTag ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ TagDelimiter ('delimiter_string') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
Language Optional Specifies the language of the input text:
• 'en': English (default)
• 'zh_cn': Simplified Chinese
• 'zh_tw': Traditional Chinese
If Tokenize specifies 'true', then the function uses the value of
Language to create the word tokenizer.
Rules Optional Specifies the tag names and tagging rules. Use this argument if and only
if you do not specify a rules table. For information about defining
tagging rules, refer to Defining Tagging Rules.
Tokenize Optional Specifies whether the function tokenizes the input text before evaluating
the rules and tokenizes the text string parameter in the rule definition
when parsing a rule. If you specify 'true', then you must also specify the
Language argument. The default value is 'false'.
OutputByTag Optional Specifies whether the function outputs a tuple when a text document
matches multiple tags. The default value is 'false', which means that one
tuple in the output stands for one document and the matched tags are
listed in the output column tag.
TagDelimiter Optional Specifies the delimiter that separates multiple tags in the output column
tag if OutputByTag has the value 'false' (the default). The default value is
the comma (,). If OutputByTag has the value 'true', specifying this
argument causes an error.
Accumulate Optional Specifies the names of text table columns to copy to the output table.


Note:
Do not use the name 'tag' for an accumulate_column, because the
function uses that name for the output table column that contains the
tags.

Defining Tagging Rules


You can specify tagging rules with either the Rules argument or a rules table.
The following table explains the operations that a rule can use. In the table:
• The operand opn (where n is 1, 2, or 3) can be any of the following:
∘ A string literal
You must enclose the string literal in double quotation marks (for example, "Start countdown").
If the string literal contains double quotation marks, then you must precede each double quotation
mark with two backslashes (for example, "\\"Start countdown\\"").
The empty string ("") is not allowed.
If an operation has only string literal operands, matches are case-insensitive and do not consider
overlapping.
∘ A Java regular expression (regex"exp")
An operation with one or more Java regular expression operands uses fuzzy matching. Fuzzy
matching evaluates the original text input; that is, matching is case-sensitive and the text is not
tokenized.
∘ In the superdist operation only, a list of string literals or Java regular expressions
For details, see the description of the superdist operation in the following table.
• The operands lower and upper are nonnegative integers.
You can omit either lower or upper, but not both. For example, all of the following are valid syntax for the
contain operation:

contain(col, op1, lower, upper)
contain(col, op1, lower,)
contain(col, op1, , upper)

If x is the number of times that op1 appears in col, then the preceding operations have the following
meanings, respectively:
lower <= x <= upper
lower <= x
x <= upper
The meanings of lower, x, and upper depend on the operation.
For simplicity, the following table shows only the syntax that specifies both lower and upper.
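For example, with a hypothetical column and search string: contain(content, "storm", 2, 5) is satisfied when "storm" appears in content between two and five times, contain(content, "storm", 2,) when it appears at least twice, and contain(content, "storm", , 5) when it appears at most five times.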

Table 725: Rule Operations

Syntax Description

equal(col, op1)
Returns 'true' if the text in column col and the value of op1 are equal; 'false' otherwise.

contain(col, op1, lower, upper)
Returns 'true' if, in column col, the number of times that the value of op1 appears is in the range [lower, upper]; 'false' otherwise.

dist(col, op1, op2, lower, upper)
Returns 'true' if, in column col, the distance between the values of op1 and op2 (that is, the number of words between them) is in the range [lower, upper]; 'false' otherwise.
The distance computation depends on the Language and Tokenize arguments.
By default, Language is 'en' (English) and Tokenize is 'false', and words are delimited by whitespace characters.
If Language is 'zh_cn' (Simplified Chinese) or 'zh_tw' (Traditional Chinese) and Tokenize is 'true', then the function performs word segmentation before computing the distance between words.

superdist(col, op1, op2, con1, op3, con2)
Returns 'true' if, in column col, the values of op1, op2, and op3 satisfy the context rules con1 and con2; 'false' otherwise.
The rule con1 specifies the context for inclusion. The possible values of con1 and their meanings are:
∘ nwn: op2 appears n or fewer words before or after op1.
∘ nrn: op2 appears n or fewer words after op1.
∘ para: op2 appears in the same paragraph as op1.
∘ sent: op2 appears in the same sentence as op1.
The rule con2 specifies the context for exclusion. The possible values of con2 and their meanings are:
∘ nwn: op3 does not appear n or fewer words before or after op1.
∘ nrn: op3 does not appear n or fewer words after op1.
∘ para: op3 does not appear in the same paragraph as op1.
∘ sent: op3 does not appear in the same sentence as op1.
The distance computation depends on the Language and Tokenize arguments (for details, refer to the description of the dist operation).
A paragraph ends with either "\n" or "\r\n". A sentence ends with a period (.), question mark (?), or exclamation mark (!). The function fragments the input into paragraphs or sentences and then checks the context rule on each piece of text. If one piece satisfies the rule, then the function tags the whole input.
opn (where n is 1, 2, or 3) can be a list of words. Enclose the list in double quotation marks and separate the words with semicolons. For example: "good;bad;neutral"
If opn is a Java regular expression, then exp can be a list. Separate the items with semicolons. For example: regex"invest[\w]*;volatil[\w]*;risk"
When a list appears in an inclusion context, the rule is satisfied if at least one item appears in the context. When a list appears in an exclusion context, the rule is satisfied if no item appears in the context.
The operand-context pairs after op1 are optional; that is, the following are valid syntax:
superdist(col, op1, op2, con1, op3, con2)
superdist(col, op1, op2, con1,,)
superdist(col, op1,,, op3, con2)
superdist(col, op1,,,,)
The final syntax in the preceding list returns 'true' if op1 appears in col.

dict(col, "[schema/]dictionary", lower, upper)
Returns 'true' if, in column col, the number of dictionary items (lines in the dictionary) that appear is in the range [lower, upper]; 'false' otherwise.
Note:
This operation requires that the dictionary file [schema.]dictionary is installed on your Aster Database cluster. The dictionary name, dictionary, is case-sensitive. If the dictionary is in the public schema, then you can omit the schema name, schema.

operation1 and operation2
Returns 'true' if both operation1 and operation2 return 'true'; 'false' otherwise.

operation1 or operation2
Returns 'true' if one or both of operation1 and operation2 returns 'true'; 'false' otherwise.

not operation
Returns 'true' if operation returns 'false'; 'false' if operation returns 'true'.
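As an illustration of how the logical operators combine with the text-processing operations, the following hypothetical Rules argument (the strings and tag name are illustrative only) tags a document that mentions a flood at least once but never mentions a drought:

Rules ('contain(content, "flood", 1,) and
not contain(content, "drought", 1,) AS Flood-Only')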

Input
The TextTagging function has a required text table and an optional rules table. If you omit the rules table,
then you must specify the tagging rules with the Rules argument.
The following table describes the columns of the text table. The table can have additional columns, but the
function ignores them unless you specify them in rules.
Table 726: TextTagging Text Table Schema

Column Name Data Type Description


text_column VARCHAR Text to be tagged.
accumulate_column Any Column to be copied to the output table.

Table 727: TextTagging Rules Table Schema

Column Name Data Type Description


tagname VARCHAR Name of tag.
definition VARCHAR Definition of tag.

Output
Table 728: TextTagging Output Table Schema

Column Name Data Type Description


accumulate_column Same as in text Column copied from the text table. Typically, one
table accumulate_column contains document identifiers.
tag VARCHAR Tuple of tags that match the text document. Tag names come
from the Rules argument or rules table. If a text document
matches no tag, then its value in this column is an empty string.

Examples
• Input
• Example 1: Specify Rules Argument
• Example 2: Specify Rules Table
• Example 3: Specify Dictionary File in Rules Argument
• Example 4: Specify Superdist in Rules Argument

Input
Table 729: TextTagging Examples 1-4 Input Table: text_inputs

id title content catalog

1 Chennai Floods Chennai floods have battered the capital city of Tamil Nadu and its adjoining areas. Normal life came to a standstill when roads were submerged in water and all modes of transport were severely affected. In the past, Chennai has had tsunamis and earthquakes Regional
2 Tennis Superstars Roger Federer born on 8 August 1981, is a greatest tennis player, who has been continuously ranked inside the top 10 since October 2002 and has won Wimbledon, USOpen, Australian and FrenchOpen titles mutiple times sports
3 Sports Rivalry The Federer Nadal rivalry, known by many as Fedal, is between two professional tennis players, Roger Federer of Switzerland and Rafael Nadal of Spain. They are currently engaged in a storied rivalry, which many consider to be the greatest in tennis history. They have played 34 times, most recently in the 2015 Swiss Indoors final, and Nadal leads their eleven-year-old rivalry with an overall record of 23–11 sports
4 Sports Rivalry The India Pakistan cricket rivalry is one of the most intense sports rivalries in the world. An India-Pakistan cricket match has been estimated to attract up to one billion viewers, according to TV ratings firms and various other reports. The 2011 World Cup semifinal between the two teams attracted around 988 million television viewers sports
5 Sports Rivalry An Ashes series is traditionally of five Tests, hosted in turn by England and Australia at least once every four years. As of August 2015, England hold the ashes, having won three of the five Tests in the 2015 Ashes series. Overall, Australia has won 32 series, England 32 and five series have been drawn. sports

Example 1: Specify Rules Argument

SQL-MapReduce Call

SELECT * FROM TextTagging (


ON text_inputs
Rules ('contain(content, "floods", 1,) or
contain(content, "tsunamis", 1,) AS Natural-Disaster',
'contain(content, "Roger", 1,) and
contain(content, "Nadal", 1,) AS Tennis-Rivalry',
'contain(title, "Tennis", 1,) and
contain(content, "Roger", 1,) AS Tennis-Greats',
'contain(content, "India", 1,) and
contain(content, "Pakistan", 1,) AS Cricket-Rivalry',
'contain(content,"Australia",1,) and
contain(content, "England", 1,) AS The-Ashes'
)
OutputByTag ('true')
Accumulate ('id')
) ORDER BY id;

Output

Table 730: TextTagging Example 1 Output Table

id tag
1 Natural-Disaster
2 Tennis-Greats
3 Tennis-Rivalry
4 Cricket-Rivalry
5 The-Ashes

Example 2: Specify Rules Table
The rules table defines the same rules as the Rules argument in Example 1.
Table 731: TextTagging Example 2 Rule Inputs Table rule_inputs

tagname definition
Cricket-Rivalry contain(content,"India",1,) and
contain(content,"Pakistan",1,)
Natural-Disaster contain(content, "floods",1,) or
contain(content,"tsunamis",1,)
Tennis-Greats contain(title,"Tennis",1,) and
contain(content,"Roger",1,)
Tennis-Rivalry contain(content,"Roger",1,) and
contain(content,"Nadal",1,)
The-Ashes contain(content,"Australia",1,) and
contain(content,"England",1,)

SQL-MapReduce Call

SELECT * FROM TextTagging (


ON text_inputs PARTITION BY ANY
ON rule_inputs AS rules DIMENSION
Accumulate ('id')
) ORDER BY id;

Output

Table 732: TextTagging Example 2 Output Table

id tag
1 Natural-Disaster
2 Tennis-Greats
3 Tennis-Rivalry
4 Cricket-Rivalry
5 The-Ashes

Example 3: Specify Dictionary File in Rules Argument


This example uses this dictionary file, keywords.txt:

floods
tsunamis
Roger
Nadal
India
Pakistan
England
Australia

SQL-MapReduce Call

SELECT * FROM TextTagging (


ON text_inputs
Rules ('dict(content, "keywords.txt", 1,) and
equal(title, "Chennai Floods") AS Natural-Disaster',
'dict(content, "keywords.txt", 2,) and
equal(catalog, "sports") AS Great-Sports-Rivalry '
)
Accumulate ('id')
) ORDER BY id;

Output

Table 733: TextTagging Example 3 Output Table

id tag
1 Natural-Disaster
2
3 Great-Sports-Rivalry
4 Great-Sports-Rivalry
5 Great-Sports-Rivalry

Example 4: Specify Superdist in Rules Argument

SQL-MapReduce Call

SELECT * FROM TextTagging (


ON text_inputs
Rules ('superdist(content, "Chennai", "floods", sent, ,)
AS Chennai-Flood-Disaster',
'superdist(content, "Roger", "titles", para, "Nadal", para)
AS Roger-Champion',
'superdist(content, "Roger", "Nadal", para, ,)
AS Tennis-Rivalry',
'contain(content, regex"[A|a]shes", 2,)
AS Aus-Eng-Cricket',
'superdist(content, "Australia", "won", nw5, ,)
AS Aus-victory'
)
Accumulate ('id')
) ORDER BY id;

Output

Table 734: TextTagging Example 4 Output Table

id tag
1 Chennai-Flood-Disaster
2 Roger-Champion
3 Tennis-Rivalry
4
5 Aus-Eng-Cricket,Aus-victory

TextTokenizer

Summary
The TextTokenizer function extracts English, Chinese, or Japanese tokens from text. Examples of tokens are
words, punctuation marks, and numbers. Tokenization is the first step of many types of text analysis.

This function can be used with real-time applications. Refer to AMLGenerator.

Usage

TextTokenizer Syntax
Version 3.2

SELECT * FROM TextTokenizer (


ON input_table PARTITION BY ANY
[ ON dict_table AS dict DIMENSION ]
TextColumn ('text_column')
[ Language ({ 'en' | 'zh_CN' | 'zh_TW' | 'jp' }) ]
[ Model ('model_file') ]
[ OutputDelimiter ('delimiter') ]
[ OutputByWord ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ UserDictionaryFile ('user_dictionary_file') ]
);

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the text to
tokenize.
Language Optional Specifies the language of the text in text_column:
• en (English, the default)
• zh_CN (Simplified Chinese)
• zh_TW (Traditional Chinese)
• jp (Japanese)

Model Optional Specifies the name of model file that the function uses for tokenizing.
The model must be a conditional random-fields model and model_file
must already be installed on the database. If you omit this argument,
or if model_file is not installed on the database, then the function uses
white spaces to separate English words and an embedded dictionary
to tokenize Chinese text.

Note:
If you specify Language('jp'), the function ignores this argument.

OutputDelimiter Optional Specifies the delimiter for separating tokens in the output. The default
value is slash (/).
OutputByWord Optional Specifies whether to output one token in each row. The default value
is 'false' (output one line of text in each row).
Accumulate Optional Specifies the names of the input table columns to copy to the output
table.


UserDictionaryFile Optional Specifies the name of the user dictionary to use to correct results
specified by the model. If you specify both this argument and a
dictionary table (dict), then the function uses the union of
user_dictionary_file and dict as its dictionary. Input describes the
format of user_dictionary_file and dict.

Note:
If the function finds more than one matched term, it selects the
longest term for the first match.

Input
The function has a required input table and an optional dictionary table.
Table 735: TextTokenizer Input Table Schema

Column Name Data Type Description


text_column VARCHAR Text to tokenize.
accumulate_column Any Column to copy to the output table.

Table 736: TextTokenizer Dictionary Table Schema

Column Name Data Type Description


entry VARCHAR Dictionary entry.

The following table describes the format of both the dictionary table (dict) and the user dictionary file
(specified by the UserDictionaryFile argument).
Table 737: TextTokenizer Dictionary Table and User Dictionary File Format

Language Format
Chinese and English One dictionary word on each line.
Japanese A dictionary entry consists of the following comma-separated words:
word—The original word.
tokenized_word—The tokenized form of the word.
reading—The reading of word in Katakana.
pos—The part-of-speech of the word.
For example:
成田空港,成田空港,ナリタクウコウ,カスタム名詞

Output
The schema of the output table depends on the value of the OutputByWord argument.

Table 738: TextTokenizer Output Table Schema for OutputByWord ('true')

Column Name Data Type Description


accumulate_column Same as in Column copied from the input table.
input table
sn VARCHAR Serial number, in the input text, of the extracted token.
token VARCHAR Extracted token.

Table 739: TextTokenizer Output Table Schema for OutputByWord ('false')

Column Name Data Type Description


accumulate_column Same as in Column copied from the input table.
input table
token VARCHAR Tokenized text, separated by the delimiter specified by the
OutputDelimiter argument.

Examples
• Example 1: Chinese Tokenization
• Example 2: Japanese Tokenization
• Example 3: English Tokenization

Example 1: Chinese Tokenization

Input

Table 740: TextTokenizer Example 1 Input Table cn_input

id txt
t1 我从小就不由自主地认为自己长大以后一定得成为一个象我父亲一样的画家, 可能是父母
潜移默化的影响。
t2 中华人民共和国 辽宁省 铁岭市 靠山屯 村支书 赵本山。

Table 741: TextTokenizer Example 1 Dictionary Table cn_dict

txt
辽宁省铁岭市靠山屯村
赵本山

SQL-MapReduce Call 1

SELECT * FROM TextTokenizer (
ON cn_input PARTITION BY ANY
ON cn_dict AS dict DIMENSION
Language ('zh_CN')
OutputDelimiter (' ')
OutputByWord ('false')
Accumulate ('id')
TextColumn ('txt')
);

Output

Table 742: TextTokenizer Example 1 Output Table 1

id token
t1 我 从小 就 不由自主 地 认为 自己 长大 以后 一定 得 成为 一个 象 我 父亲 一样 的 画家 , 可
能 是 父母 潜移默化 的 影响 。
t2 中华人民共和国 辽宁省 铁岭市 靠山屯 村支书 赵本山 。

SQL-MapReduce Call 2

SELECT * FROM TextTokenizer(


ON cn_input PARTITION BY any
ON cn_dict AS dict DIMENSION
Language ('zh_CN')
OutputByWord ('true')
Accumulate ('id')
TextColumn ('txt')
);

Output

Table 743: TextTokenizer Example 1 Output Table 2

id sn token
t1 1 我
t1 2 从小
t1 3 就
t1 4 不由自主
… ... ...
t2 1 中华人民共和国
t2 2 辽宁省
... ... ...

Example 2: Japanese Tokenization

Input

Table 744: TextTokenizer Example 2 Input Table jp_input

id txt
t1 総務省は 28 日、全国の主要 51 市を対象に 2013 年の物価水準を比較した消費者物価
地域差指数を発表した。
t2 ソチ五輪6位の浅田真央(23)=中京大=はSP女子世界最高の78・66点で首
位に立った。

Table 745: TextTokenizer Example 2 Japanese Dictionary jp_dict

word
地域差指数,地域差指数,チイキサシスウ,カスタム名詞

User dictionary file user_dict_jp.txt:


SP女子,SP女子,エスピージョシ,カスタム名詞

SQL-MapReduce Call 1

SELECT * FROM TextTokenizer (


ON jp_input PARTITION BY any
ON jp_dict AS dict DIMENSION
Language ('jp')
OutputByWord ('false')
Accumulate ('id')
TextColumn ('txt')
UserDictionaryFile ('user_dict_jp.txt')
);

Output

Table 746: TextTokenizer Example 2 Output Table 1

id token
t1 総務省/は/28 日/、/全国/の/主要/51 市/を/対象/に/2013 年/の/物価水準/を/比較/し/た/
消費者/物価/地域差指数/を/発表/し/た/。
t2 ソチ五輪/6位/の/浅田真央/(/23/)/=/中京大/=/は/SP女子/世界最高/の/78・
66点/で/首位/に/立っ/た/。

SQL-MapReduce Call 2

SELECT * FROM TextTokenizer (
ON jp_input PARTITION BY any
ON jp_dict AS dict DIMENSION
Language ('jp')
OutputByWord ('true')
Accumulate ('id')
TextColumn ('txt')
);

Output

Table 747: TextTokenizer Example 2 Output Table 2

id sn token
t1 1 総務省
t1 2 は
t1 3 28 日
t1 4 、
... ... ...
t2 12 SP女子
t2 13 世界最高
... ... ...

Example 3: English Tokenization

Input
The input table is a log of vehicle complaints. The category column indicates whether the car has been
involved in a crash.
Table 748: TextTokenizer Example 3 Input Table complaints

doc_id text_data category


1 consumer was driving approximately 45 mph hit a deer with the front bumper and crash
then ran into an embankment head-on passenger's side air bag did deploy hit
windshield and deployed outward. driver's side airbag cover opened but did not
inflate it was still folded causing injuries.
2 when vehicle was involved in a crash totalling vehicle driver's side/ passenger's side crash
air bags did not deploy. vehicle was making a left turn and was hit by a ford f350
traveling about 35 mph on the front passenger's side. driver hit his head-on the
steering wheel. hurt his knee and received neck and back injuries.
3 consumer has experienced following problems; 1.) both lower ball joints wear out no_crash
excessively; 2.) head gasket leaks; and 3.) cruise control would shut itself off while
driving without foot pressing on brake pedal.
... ... ...

SQL-MapReduce Call

SELECT * FROM TextTokenizer (


ON complaints PARTITION BY any
Language ('en')
OutputDelimiter (' ')
OutputByWord ('true')
Accumulate ('doc_id')
TextColumn ('text_data')
) ORDER BY doc_id;

Output

Table 749: TextTokenizer Example 3 Output Table

doc_id sn token
1 1 consumer
1 2 was
1 3 driving
1 4 approximately
1 5 45
1 6 mph
1 7 hit
1 8 a
1 9 deer
1 10 with
1 11 the
1 12 front
... ... ...

TF_IDF

Summary
The TF_IDF function can do either of the following:
• Take any document set and output the inverse document frequency (IDF) and term frequency-inverse document frequency (TF-IDF) scores for each term.
• Use the output of a previous run of the TF_IDF function on a training document set to predict TF_IDF
scores of an input (test) document set.

You can use the TF-IDF scores as input for many document clustering and classification algorithms,
including:
• Cosine-similarity
• Latent Dirichlet allocation
• K-means clustering
• K-nearest neighbors
You can use the TF-IDF scores derived from a training document set to generate a model in a classification
function (for example, SparseSVMTrainer) and then use the resulting TF-IDF scores in a classification
prediction function (for example, SparseSVMPredictor).

Background
TF-IDF stands for "term frequency-inverse document frequency," a technique for evaluating the
importance of a specific term in a specific document in a document set. Term frequency (tf) is the number of
times that the term appears in the document, and inverse document frequency (idf) reflects how rare the
term is across the document set: the fewer documents that contain the term, the higher its idf (for the exact
formula, see the Output section). The TF-IDF score for a term is tf × idf. A term with a high TF-IDF
score is especially relevant to the specific document.
The TF_IDF function represents each document as an N-dimensional vector, where N is the number of
terms in the document set (therefore, the document vector is usually very sparse). Each entry in the
document vector is the TF-IDF score of a term.
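As a quick worked example of the idf component (using the formula given in the Output section, with the natural logarithm): in a five-document set, a term that appears in only one document has idf = log(5/1) ≈ 1.6094, a term that appears in two documents has idf = log(5/2) ≈ 0.9163, and a term that appears in all five documents has idf = log(5/5) = 0. These are the idf values that appear in the example outputs later in this section.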

Usage

TF_IDF Syntax
TF_IDF version 2.1, TF version 1.1

SELECT * FROM TF_IDF (


ON TF (
ON { table | view | (query) } PARTITION BY docid
[ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
) AS tf PARTITION BY term
[ ON (SELECT COUNT (DISTINCT docid)
FROM doccount_table) AS doccount DIMENSION ]
[ ON (SELECT term, COUNT (DISTINCT docid)
FROM docperterm_table
GROUP BY term) AS docperterm PARTITION BY term ]
[ ON (SELECT DISTINCT (term) AS term, idf
FROM tf_idf_output_table ) AS idf PARTITION BY term ]
);

Recommended for large document sets:

SELECT * FROM TF_IDF (


ON TF (
ON input_table PARTITION BY docid
[ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
) AS tf PARTITION BY term
ON (SELECT COUNT (DISTINCT docid) FROM doccount_table
) AS doccount DIMENSION
ON (SELECT term, COUNT (DISTINCT docid)
FROM docperterm_table
GROUP BY term) AS docperterm PARTITION BY term
) ORDER BY docid;

Acceptable for small document sets:

SELECT * FROM TF_IDF (


ON TF (
ON input_table PARTITION BY docid
) AS tf PARTITION BY term
ON (SELECT COUNT (DISTINCT docid) FROM input_table
) AS doccount DIMENSION
) ORDER BY docid;

Arguments
Argument Category Description
Formula Optional Specifies the formula for calculating the term frequency (tf) of term t in
document d, where f(t,d) is the number of times that t occurs in d (that is, the raw frequency, rf):
• 'normal' (normalized frequency, default)
tf(t,d) = f(t,d) / sum { f(w,d) : w ∈ d }
This value is rf divided by the number of terms in the document.
• 'bool' (Boolean frequency)
tf(t,d) = 1 if t occurs in d; otherwise, tf(t,d) = 0.
• 'log' (logarithmically-scaled frequency)
tf(t,d) = log(f(t,d) + 1)
• 'augment' (augmented frequency, which prevents bias towards longer documents)
tf(t,d) = 0.5 + (0.5 × f(t,d) / max { f(w,d) : w ∈ d })
This value is based on rf divided by the maximum raw frequency of any term in the document.

Note:
When using the output of a previous run of the TF_IDF function on a
training document set to predict TF_IDF scores on an input
document set, use the same Formula value for the input document set
that you used for the training document set.
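As a quick check of the formulas, consider a hypothetical document d that contains 10 terms in total, in which the term t occurs twice (rf = 2) and the most frequent term occurs 4 times:
• 'normal': tf(t,d) = 2/10 = 0.2
• 'bool': tf(t,d) = 1
• 'log': tf(t,d) = log(2 + 1) ≈ 1.0986 (assuming the natural logarithm)
• 'augment': tf(t,d) = 0.5 + (0.5 × 2/4) = 0.75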

Input
The TF_IDF function always requires as input the output of the TF function. The input for the TF function
is the document set. The other TF_IDF input tables depend on your reason for running the function:
• If you are running TF_IDF to output the IDF and TF-IDF values for each term in the document set, then
TF_IDF also requires the input table doccount and has optional input table docperterm.
• If you are running the function to predict TF_IDF values, then TF_IDF also requires the input table idf.
The table idf is the output of an earlier call to TF_IDF, using the training document set as input to the TF
function, the doccount table, and optionally, the docperterm table.
If you omit the docperterm table, the function creates it by processing the entire document set, which can
require a large amount of memory. If there is not enough memory to process the entire document set, then
the docperterm table is required.
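For large document sets, you can materialize the docperterm input once and reuse it across calls rather than recomputing the aggregate in each query. A minimal sketch, assuming a tokenized input table with columns docid, term, and count (such as tfidf_input1 from Example 1; the output table name is illustrative only):

-- One row per term: the number of distinct documents that contain it
CREATE FACT TABLE tfidf_docperterm DISTRIBUTE BY HASH(term) AS
SELECT term, COUNT (DISTINCT docid) AS count
FROM tfidf_input1
GROUP BY term;

You can then supply this table as docperterm_table in the syntax shown earlier.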
Table 750: TF Input Table (Document Set) Schema

Column Name Data Type Description


docid Any Document identifier.
term VARCHAR Term.
count INTEGER Number of times that term appears in the document.

Table 751: TF Output and TF_IDF Input Table Schema

Column Name Data Type Description


docid Any Document identifier.
term VARCHAR Term.
tf DOUBLE Term frequency.
PRECISION
count INTEGER Number of times that term appears in the document.

Table 752: TF_IDF doccount Table Schema

Column Name Data Type Description


count BIGINT Number of documents in the document set.

Table 753: TF_IDF docperterm Table Schema

Column Name Data Type Description


term VARCHAR Term.
count BIGINT Number of documents that contain term.

Output
Table 754: TF_IDF Output Schema

Column Name Data Type Description


docid Any Document identifier of document d.
term VARCHAR Term t.
tf DOUBLE Term frequency of term t in document d, calculated as specified by the
PRECISION Formula argument.
idf DOUBLE Inverse document frequency of term t in document d, calculated by
PRECISION this formula:
IDF(t) = log (doccount / doccount(t))
where doccount is the number of documents in the document set and
doccount(t) is the number of documents that contain the term t.
tf_idf DOUBLE TF_IDF score of term t in document d, calculated by this formula:
PRECISION TF_IDF(t, d) = TF(t, d) * IDF(t)

Examples
• Example 1: TF_IDF on Tokenized Training Document Set
• Example 2: TF_IDF on Tokenized Test Set

Note:
The examples tokenize the document sets with the nGram function, but alternatively, you can use
the TextTokenizer function.

Example 1: TF_IDF on Tokenized Training Document Set


This example tokenizes a training document set and inputs the tokenized set to the TF_IDF function.

Input

Table 755: TF_IDF Example 1 Input Table tfidf_train

docid content
1 Chennai floods have battered the capital city of Tamil Nadu and its adjoining areas. Normal life
came to a standstill when roads were submerged in water and all modes of transport were
severely affected. In the past, Chennai has had tsunamis and earthquakes

docid content
2 Roger Federer born on 8 August 1981, is a greatest tennis player, who has been continuously
ranked inside the top 10 since October 2002 and has won Wimbledon, USOpen, Australian and
FrenchOpen titles mutiple times
3 The Federer Nadal rivalry, known by many as Fedal, is between two professional tennis players,
Roger Federer of Switzerland and Rafael Nadal of Spain. They are currently engaged in a storied
rivalry, which many consider to be the greatest in tennis history. They have played 34 times, most
recently in the 2015 Swiss Indoors final, and Nadal leads their eleven-year-old rivalry with an
overall record of 23–11
4 The India Pakistan cricket rivalry is one of the most intense sports rivalries in the world. An
India-Pakistan cricket match has been estimated to attract up to one billion viewers, according to
TV ratings firms and various other reports. The 2011 World Cup semifinal between the two
teams attracted around 988 million television viewers
5 An Ashes series is traditionally of five Tests, hosted in turn by England and Australia at least once
every four years. As of August 2015, England hold the ashes, having won three of the five Tests in
the 2015 Ashes series. Overall, Australia has won 32 series, England 32 and five series have been
drawn.

Step 1: Create Tokenized Training Document Set

CREATE FACT TABLE tfidf_token1 DISTRIBUTE BY HASH(docid) AS


SELECT * FROM nGram (
ON tfidf_train
TextColumn ('content')
Delimiter (' ')
Grams ('1')
Overlapping ('false')
ToLowerCase ('true')
Punctuation ('\[.,?\!\]')
Reset ('\[.,?\!\]')
Total ('false')
Accumulate ('docid')
);

Step 2: Create Input for TF_IDF Function


Create a table with the required tokenized input in the column term:

CREATE fact TABLE tfidf_input1 DISTRIBUTE BY hash(term) AS


SELECT docid, ngram AS term, frequency AS count
FROM tfidf_token1;

This query returns the following table:

SELECT * FROM tfidf_input1 ORDER BY 1, 3, 2;

Table 756: TF_IDF Example 1 Output Table tfidf_input1

docid term count


1 a 1
1 adjoining 1
1 affected 1
1 all 1
1 areas 1
1 battered 1
1 came 1
1 capital 1
1 city 1
1 earthquakes 1
1 floods 1
1 had 1
1 has 1
1 have 1
1 its 1
1 life 1
1 modes 1
1 nadu 1
1 normal 1
... ... ...

SQL-MapReduce Call

CREATE fact TABLE tfidf_output1 DISTRIBUTE BY HASH(term) AS


SELECT * FROM tf_idf (
ON TF (
ON tfidf_input1 PARTITION BY docid
) AS tf PARTITION BY term
ON (SELECT COUNT (DISTINCT docid) FROM tfidf_input1
) AS doccount DIMENSION
);

Output
This query returns the following table:

SELECT * FROM tfidf_output1 ORDER BY tf_idf DESC;

Table 757: TF_IDF Example 1 Output Table

docid term tf idf tf_idf


5 series 0.0727272727272727 1.6094379124341 0.117050029995207
5 ashes 0.0545454545454545 1.6094379124341 0.0877875224964055
5 england 0.0545454545454545 1.6094379124341 0.0877875224964055
5 five 0.0545454545454545 1.6094379124341 0.0877875224964055
1 were 0.0465116279069767 1.6094379124341 0.0748575773225163
1 chennai 0.0465116279069767 1.6094379124341 0.0748575773225163
3 nadal 0.0447761194029851 1.6094379124341 0.0720643841388403
4 viewers 0.037037037037037 1.6094379124341 0.0596088115716333
4 world 0.037037037037037 1.6094379124341 0.0596088115716333
4 cricket 0.037037037037037 1.6094379124341 0.0596088115716333
4 one 0.037037037037037 1.6094379124341 0.0596088115716333
5 tests 0.0363636363636364 1.6094379124341 0.0585250149976036
5 australia 0.0363636363636364 1.6094379124341 0.0585250149976036
5 32 0.0363636363636364 1.6094379124341 0.0585250149976036
3 they 0.0298507462686567 1.6094379124341 0.0480429227592269
3 many 0.0298507462686567 1.6094379124341 0.0480429227592269
2 australian 0.0285714285714286 1.6094379124341 0.04598394035526
2 10 0.0285714285714286 1.6094379124341 0.04598394035526
2 mutiple 0.0285714285714286 1.6094379124341 0.04598394035526
2 1981 0.0285714285714286 1.6094379124341 0.04598394035526
2 inside 0.0285714285714286 1.6094379124341 0.04598394035526
2 born 0.0285714285714286 1.6094379124341 0.04598394035526
2 frenchopen 0.0285714285714286 1.6094379124341 0.04598394035526
2 on 0.0285714285714286 1.6094379124341 0.04598394035526
2 continuously 0.0285714285714286 1.6094379124341 0.04598394035526
2 player 0.0285714285714286 1.6094379124341 0.04598394035526
2 usopen 0.0285714285714286 1.6094379124341 0.04598394035526

2 ranked 0.0285714285714286 1.6094379124341 0.04598394035526
2 8 0.0285714285714286 1.6094379124341 0.04598394035526
2 titles 0.0285714285714286 1.6094379124341 0.04598394035526
2 2002 0.0285714285714286 1.6094379124341 0.04598394035526
... ... ... ... ...

Example 2: TF_IDF on Tokenized Test Set


This example uses the IDF values from the Example 1 output table, tfidf_output1, to predict the TF_IDF scores of a test document set.

Input

Table 758: TF_IDF Example 2 Input Table tfidf_test

docid content
6 In Chennai, India, floods have closed roads and factories, turned off power, shut down the
airport and forced thousands of people out of their homes.
7 Spanish tennis star Rafael Nadal said he was happy with the improvement in his game after a
below-par year, and looked forward to reigniting his long-time rivalry with Roger Federer in
India.
8 Nadal, the world number five, said he has always enjoyed playing against Federer and hoped they
would do so for years to come.

Step 1. Create Tokenized Test Document Set

CREATE fact TABLE tfidf_token2 DISTRIBUTE BY HASH(docid) AS


SELECT * FROM nGram (
ON tfidf_test
TextColumn ('content')
Delimiter (' ')
Grams ('1')
Overlapping ('false')
ToLowerCase ('true')
Punctuation ('\[.,?\!\]')
Reset ('\[.,?\!\]')
Total ('false')
Accumulate ('docid')
);

Step 2. Create Input for TF_IDF Function
Create a table with the required tokenized input in the column term:

CREATE fact TABLE tfidf_input2 DISTRIBUTE BY HASH(term) AS


SELECT docid, ngram AS term, frequency AS count
FROM tfidf_token2;

This query returns the following table:

SELECT * FROM tfidf_input2 ORDER BY 1, 3, 2;

Table 759: TF_IDF Example 2 Output Table tfidf_input2

docid term count


6 airport 1
6 chennai 1
6 closed 1
6 down 1
6 factories 1
6 floods 1
6 forced 1
6 have 1
6 homes 1
6 in 1
6 india 1
6 off 1
6 out 1
6 people 1
6 power 1
6 roads 1
6 shut 1
6 the 1
6 their 1
6 thousands 1
6 turned 1
6 and 2
6 of 2

7 a 1
7 after 1
7 and 1
7 below-par 1
7 federer 1
... ... ...

SQL-MapReduce Call
The second ON clause (the idf input) references the IDF values from the Example 1 output table, tfidf_output1.

CREATE fact TABLE tfidf_output2 DISTRIBUTE BY hash(term) AS


SELECT * FROM tf_idf (
ON TF (
ON tfidf_input2 PARTITION BY docid
Formula ('normal')
) AS tf PARTITION BY term
ON (SELECT DISTINCT(term) AS term, idf FROM tfidf_output1
) AS idf PARTITION BY term
);

Output
The query below returns the following table:

SELECT * FROM tfidf_output2 ORDER BY tf_idf DESC;

Table 760: TD_IDF Example 2 Output Table

docid term tf idf tf_idf


7 with 0.0625 1.6094379124341 0.100589869527131
8 five 0.0434782608695652 1.6094379124341 0.0699755614101783
8 they 0.0434782608695652 1.6094379124341 0.0699755614101783
8 years 0.0434782608695652 1.6094379124341 0.0699755614101783
8 world 0.0434782608695652 1.6094379124341 0.0699755614101783
8 nadal 0.0434782608695652 1.6094379124341 0.0699755614101783
6 floods 0.04 1.6094379124341 0.064377516497364
6 chennai 0.04 1.6094379124341 0.064377516497364
6 india 0.04 1.6094379124341 0.064377516497364
6 their 0.04 1.6094379124341 0.064377516497364

6 roads 0.04 1.6094379124341 0.064377516497364
7 nadal 0.03125 1.6094379124341 0.0502949347635656
7 rafael 0.03125 1.6094379124341 0.0502949347635656
7 india 0.03125 1.6094379124341 0.0502949347635656
8 federer 0.0434782608695652 0.916290731874155 0.0398387274727894
7 federer 0.03125 0.916290731874155 0.0286340853710673
7 roger 0.03125 0.916290731874155 0.0286340853710673
7 tennis 0.03125 0.916290731874155 0.0286340853710673
7 rivalry 0.03125 0.916290731874155 0.0286340853710673
8 to 0.0434782608695652 0.510825623765991 0.0222098097289561
6 have 0.04 0.510825623765991 0.0204330249506396
6 of 0.08 0.22314355131421 0.0178514841051368
7 to 0.03125 0.510825623765991 0.0159633007426872
7 a 0.03125 0.510825623765991 0.0159633007426872
7 in 0.0625 0.22314355131421 0.0139464719571381
8 has 0.0434782608695652 0.22314355131421 0.00970189353540043
6 in 0.04 0.22314355131421 0.00892574205256839
6 the 0.04 0 0
7 the 0.03125 0 0
6 and 0.08 0 0
7 and 0.03125 0 0
8 and 0.0434782608695652 0 0
8 the 0.0434782608695652 0 0



CHAPTER 7
Cluster Analysis
• Canopy
• Gaussian Mixture Model Functions
• KMeans
• KMeansPlot
• KModes
• KModesPredict
• Minhash
• Modularity

Note:
The Modularity function, which discovers clusters in input graphs, is described in the Graph Analysis chapter.

Canopy

Summary
The Canopy function takes a set of data points and identifies each point with one or more canopies.
Canopies are groups of points that are interrelated, close, or similar. Canopy clustering is often performed in
preparation for more rigorous clustering techniques, such as k-means clustering.

Note:
The canopy clustering algorithm is nondeterministic, and the randomness of the canopy assignments
cannot be controlled by a seed argument.

Background
Canopy clustering is a very simple, fast, and surprisingly accurate method for grouping objects into
preliminary clusters. Each object is represented as a point in a multidimensional feature space.
The canopy clustering algorithm uses a fast approximate distance metric and two distance thresholds, T1
and T2 (T1 > T2), for processing. A point can belong to a canopy if the distance from the point to the canopy
center is less than T1.
Judicious selection of canopy centers (with no canopy center less than T2 from the next) and points in a
canopy enables more efficient execution of clustering algorithms, which are often called within canopies.

Note:
• For distance measurement, the Canopy function uses Euclidean distance.
• If there are more than 10,000 canopy centers, the function fails. Run the function again, increasing the
value of T2 (specified by the TightDistance argument).
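For reference, the Euclidean distance between a point p and a canopy center c in D dimensions is d(p, c) = sqrt((p1 - c1)^2 + ... + (pD - cD)^2); both the LooseDistance threshold (T1) and the TightDistance threshold (T2) are compared against this quantity.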

Canopy clustering is often an initial step in more rigorous clustering techniques, because after the data
points are clustered into canopies:
• More expensive distance measurements can be restricted to points inside the canopies, which can
significantly reduce their number.
• The more rigorous clustering technique need perform only intra-canopy clustering, which can be
parallelized.
Points that belong to different canopies do not have to be considered at the same time in this clustering
process.
Canopy clustering is done in three map-reduce steps:
1. Each mapper performs canopy clustering on the points in its input set and outputs its canopies' centers
(which are local to the mapper).
2. The reducer takes all the points in each (local) canopy and calculates centroids to produce the final
canopy centers.
3. Final canopy centers that are too close to each other are deleted (to eliminate the effects of earlier
localization).
A driver extracts information from the initial canopy-generation step and uses it to make another SQL-MapReduce call that finishes the clustering process.

Usage

Canopy Syntax
Version 2.0

SELECT * FROM Canopy (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('string')
LooseDistance ('maximum')
TightDistance ('minimum')
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the data to be clustered.
LooseDistance Required Specifies the maximum distance that any point can be from a canopy
center to be considered part of that canopy (T1 in Background).
TightDistance Required Specifies the minimum distance that separates two canopy centers (T2 in
Background).

Input
The Canopy function has one required input table, which contains the data to be clustered. The input table
cannot have any columns not described in the following table.
Table 761: Canopy Input Table Schema

Column Name Data Type Description


id Any Leftmost column, which identifies the row.
dimension_i DOUBLE Contains the data in the ith dimension. The table has one such column
PRECISION for each dimension.

Output
The Canopy function outputs a table of canopies and their centers.
Table 762: Canopy Output Table Schema

Column Name Data Type Description


canopyid INTEGER Identifies the canopy.
dimension_i DOUBLE Contains the canopy center for the ith dimension. The table has one
PRECISION such column for each dimension.

Example

Input
The input table has more than 6000 rows of computer specifications.
Table 763: Canopy Example Input Table computers_train1

id price speed hd ram screen


1 1499 25 80 4 14
2 1795 33 85 2 14
3 1595 25 170 4 15

4 1849 25 170 8 14
5 3295 33 340 16 14
6 3695 66 340 16 14
7 1720 25 170 4 14
8 1995 50 85 2 14
9 2225 50 210 2 14
12 2605 66 210 8 14
13 2045 50 130 4 14
14 2295 25 245 8 14
16 2225 50 130 4 14
17 1595 33 85 2 14
18 2325 33 210 4 15
19 2095 33 250 4 15
20 4395 66 452 8 14
... ... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM Canopy (


ON (SELECT 1) PARTITION BY 1
InputTable ('computers_train1')
LooseDistance ('1000')
TightDistance ('500')
) ORDER BY canopyid;

Output
Table 764: Canopy Example Output Table

canopyid price speed hd ram screen


1 1946.92 48.8442 341.391 6.07969 14.4756
2 2511.81 54.7 445.15 9.80651 14.7219
3 3764.45 64.2245 541.782 12.1361 15.2585
4 4489.81 64.5714 734.286 14.4762 15.7619
5 3052.15 61.7052 637.072 15.0761 14.9484


Gaussian Mixture Model Functions


A Gaussian Mixture Model (GMM) is a method of clustering numerical data. Applications that use GMM
include market segmentation, network analysis, customer profiling, and recommender systems.
A GMM uses soft assignment; that is, it computes the probability that each point is a member of each cluster.
Each cluster in a GMM is specified by a weight, a mean point, and a covariance; therefore, clusters of
different eccentricities can be located within data that is not perfectly scaled.
The basic GMM fitting algorithm requires a known, fixed number of clusters. An advanced variant, the
Dirichlet Process GMM (DP-GMM) estimates the number of clusters in the data, using an algorithm based
on variational Bayesian methods. The DP-GMM uses a "stick-breaking process" to define the prior
probability of each cluster, enforcing a "rich-get-richer" clustering approach. The DP-GMM does not start a
new cluster unless it is very unlikely that a particular data point is in a preexisting cluster.
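For context, a sketch of the standard stick-breaking construction (general Dirichlet Process notation, not a description of this function's internals): with concentration parameter α, draw vk ~ Beta(1, α) for k = 1, 2, ... and set the prior weight of cluster k to πk = vk × (1 − v1) × ... × (1 − vk-1). Early clusters tend to claim most of the remaining weight, which produces the rich-get-richer behavior, and a larger α leaves more weight for later clusters.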
You can use GMMs in situations where k-means clustering is insufficient (for example, when clusters are
not roughly spherical because input attributes are on different scales or can be correlated with each other).
You can use GMMs either directly or in conjunction with k-means clustering. For example, you can use k-
means clustering to find an initial set of cluster centers, which you can use to initialize a GMM function that
produces a more refined model.
The Aster Analytics GMM package has three functions:
• GMMFit, to fit a GMM to training data
• GMMPredict, to predict cluster assignments for test data
• GMMProfile, to compute statistics about each cluster in a GMM
You can specify whether GMMFit uses a basic GMM or DP-GMM algorithm.

GMMFit

Summary
GMMFit is a driver function that fits a Gaussian Mixture Model (GMM) to data supplied in an input table.
You specify whether GMMFit uses a basic GMM algorithm with a fixed number of clusters or a Dirichlet
Process GMM (DP-GMM) algorithm with a variable number of clusters.
The output table of the GMMFit function can be input to the function GMMPredict.

Usage

GMMFit Syntax
Version 1.0

SELECT * FROM GMMFit (


ON { table | view | (query) | (SELECT 1) } AS init_params
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
[ MaxClusterNum ('max_clusters') ]
[ ClusterNum ('clusters') ]
[ CovarianceType ({ 'spherical' | 'diagonal' | 'tied' | 'full' }) ]
[ Tolerance ('tolerance') ]
[ MaxIterNum ('max_iterations') ]
[ ConcentrationParam ('concentration') ]
[ PackOutput ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input data to be
clustered.
OutputTable Required Specifies the name of the output table to which the function
outputs cluster information. The table must not already exist.
MaxClusterNum Required if Specifies the maximum number of clusters in a Dirichlet Process
ClusterNum is model and causes the function to use the DP-GMM algorithm.
omitted, otherwise This value must have the data type INTEGER. The default value is
not allowed 20.
ClusterNum Required if Specifies the number of clusters in a model and causes the function
MaxClusterNum is to use the basic GMM algorithm. This value must have the data
omitted, otherwise type INTEGER and be greater than 0. The default value is 10.
not allowed
CovarianceType Optional Specifies the type of the covariance matrices, thereby determining
how many parameters the function estimates for each cluster:
'spherical': Each covariance matrix is of the form σI. The function
estimates one parameter for each cluster.
'diagonal' (default): Each covariance matrix has zeros on the
nondiagonal. The function estimates D parameters for each cluster,
where D is the number of dimensions in the matrix.
'tied': Each cluster has the same covariance matrix. The function
estimates (1/2)D(D-1) parameters.
'full': Each cluster has an arbitrary covariance matrix. The function
estimates (1/2)D(D-1) parameters for each cluster.

Tolerance Optional Specifies the minimum change in log-likelihood between iterations
that causes the function to terminate. This value must have the
data type DOUBLE PRECISION and be greater than 0. The default
value is 0.001.
MaxIterNum Optional Specifies the maximum number of iterations for which the
function runs. This value must have the data type INTEGER and
be greater than 0. The default value is 10.
ConcentrationParam Optional Specifies the concentration parameter, α, which determines the
number of clusters that the DP-GMM algorithm generates. This
value must have the data type DOUBLE PRECISION and be
greater than 0.
The expected number of clusters is α log N, where N is the number
of points in the data set; therefore, a larger α value tends to cause
the algorithm to find more clusters.
Note:
Specify this argument only if you specify MaxClusterNum.
PackOutput Optional Specifies whether the function packs the output. The default value
is 'false'.

Input
The GMMFit function has two input tables, input_table and (optionally) init_params.
Table 765: GMMFit input_table Schema

Column Name Data Type Description
id Any Leftmost column, which identifies the data point.
dim_n SQL numeric data type The table has one such column for each dimension (n has
the values 1 through D).

In the init_params table, you can specify initial values for the cluster weights, means, and covariances of each
cluster. You can specify one, two, or all three of these initial values. If you do not want to specify any of these
values, then omit init_params and specify (SELECT 1) instead of a reference to a table, view, or query.
The init_params table must have the same schema as the GMMFit output table. The following table
describes the init_params table and explains how the function assigns initial values that you do not specify.
Table 766: GMMFit init_params Table Schema

Column Name Data Type Description
weight SQL numeric data type Initial weight of the cluster. If you do not specify this value,
then the function gives all clusters the same initial weight.

dim_n SQL numeric data type Initial mean of the cluster. If you do not specify this value,
then the function selects the initial means from a
multivariate standard normal distributed centered at the
origin.
The table has one such column for each dimension (n has
the values 1 through D).
covariance VARCHAR Initial covariance of the cluster. Possible values depend on
CovarianceType:
'spherical': Positive numeric value (for example, 1.0.)
'diagonal': JSON representation of a DOUBLE PRECISION
array (for example, [1.0,2.0,3.0,4.0])
'tied' or 'full': JSON representation of a two-dimensional
DOUBLE PRECISION array (for example, [[1.0,2.0],
[2.0,4.0]])
If you do not specify this value, then the function assigns
each cluster an initial covariance matrix equal to the
covariance matrix.

Output
The GMMFit function outputs a message and output_table. The message describes these properties:
Table 767: GMMFit Output Message Properties

Property Value
Output Table Name of the output table to which the function outputs cluster
information (output_table).
Algorithm Used Algorithm that the function used—Basic GMM or DP-DMM.
Stopping Criterion Why the function stopped—maximum iterations reached or
convergence reached.
Delta Log Likelihood Change in the mean log-likelihood for each data point between the
next-to-last and the final iterations.
Number of Iterations Number of iterations that the function performed before stopping.
Number of Clusters Number of clusters in the GMM.
Covariance Type Spherical, diagonal, tied, or full.
Number of Data Points Number of data points in the data set.
Global Mean Mean of the data set.
Global Covariance Covariance of the data set.
Log Likelihood Log-likelihood of the data, given the GMM.
Akaike Information Criterion Akaike Information Criterion.

Bayesian Information Criterion Bayesian Information Criterion.
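For reference, the information criteria in this report follow the usual definitions: with k estimated parameters, n data points, and maximized log-likelihood LL, AIC = 2k - 2LL and BIC = k log(n) - 2LL (natural logarithm). You can check these against Example 1, where k = 17, n = 120, and LL = -364.450 give AIC = 34 + 728.9 ≈ 762.9 and BIC = 17 log(120) + 728.9 ≈ 810.3, matching the reported values.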

The output_table format depends on the PackOutput argument. For PackOutput('false'), the default, the
following table describes output_table.
Table 768: GMMFit output_table Schema for PackOutput('false')

Column Name Data Type Description


cluster_id INTEGER Identification number of the cluster
points_assigned INTEGER Number of points in the training data set assigned to the
cluster
covariance_type VARCHAR Covariance type (specified by the CovarianceType
argument)
weight DOUBLE PRECISION Weight assigned to the cluster
dim_n DOUBLE PRECISION Mean of the cluster n.
The table has one such column for each dimension (n has
the values 1 through D).
cov_n DOUBLE PRECISION Covariance of the cluster n.
Depends on the covariance type:
'spherical': Each row in a single covariance column
contains a single DOUBLE PRECISION value.
'diagonal': There are D covariance rows, each containing a
DOUBLE PRECISION value.
'tied' or 'full': There are D(D-1) covariance rows, each
containing a DOUBLE PRECISION value.
The table has one such column for each dimension (n has
the values 1 through D).
determinant DOUBLE PRECISION Determinant of the covariance matrix
precision VARCHAR Precision matrix, the inverse of the covariance matrix.
The precision matrix is serialized and stored to improve
the performance of the function GMMPredict.

For PackOutput('true'), the following table describes output_table.


Table 769: GMMFit output_table Schema for PackOutput('true')

Column Name Data Type Description


cluster_id INTEGER Identification number of the cluster
points_assigned INTEGER Number of points in the training data set assigned to the
cluster
covariance_type VARCHAR Covariance type (specified by the CovarianceType
argument)
weight DOUBLE PRECISION Weight assigned to the cluster

mean VARCHAR A vector of D DOUBLE PRECISION values that specify
the mean of each cluster (for example, [4.5, 2.3, 1.3)].
covariance VARCHAR Depends on the covariance type:
'spherical': Each row in a single covariance column
contains a single DOUBLE PRECISION value.
'diagonal': Each row contains a white-space separated list
of D DOUBLE PRECISION values.
'tied' or 'full': Each row contains a white-space separated
list of D*D DOUBLE PRECISION values.
determinant DOUBLE PRECISION Determinant of the covariance matrix
precision VARCHAR Precision matrix, the inverse of the covariance matrix.
The precision matrix is serialized and stored to improve
the performance of the function GMMPredict.

Examples
• Input
• Example 1: Basic GMM, Spherical Covariance, Packed Output
• Example 2: Basic GMM, Diagonal Covariance, Unpacked Output
• Example 3: DP-GMM, Full Covariance, Unpacked Output

Input
This example uses the well-known iris dataset (gmm_iris_input). The data has values for four attributes
(sepal_length, sepal_width, petal_length, and petal_width), which are the data dimensions. The input does
not include the species column, because the goal is data clustering, not classification. Each example outputs
three clusters.
From the raw data, a train set and a test set are created.
The function GMMFit uses the train set to generate the model. The GMMPredict function uses the model
information to predict clusters for the test data.
Table 770: GMMFit Example ‘Iris’ Dataset gmm_iris_input

id sepal_length sepal_width petal_length petal_width


1 5.1 3.5 1.4 0.2
2 4.9 3 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3

8 5 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
10 4.9 3.1 1.5 0.1
... ... ... ... ...

Split Input into Training and Testing Data Sets


This code divides the 150 data rows into a training data set (80%) and a testing dataset (20%):

DROP TABLE IF EXISTS gmm_iris_train;
DROP TABLE IF EXISTS gmm_iris_test;

CREATE TABLE gmm_iris_train AS
SELECT * FROM gmm_iris_input WHERE id%5!=0;

CREATE TABLE gmm_iris_test AS
SELECT * FROM gmm_iris_input WHERE id%5=0;
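Because the 150 ids are split on id%5, the training table gets 120 rows and the testing table gets 30. A quick sanity check (a minimal sketch):

SELECT COUNT(*) FROM gmm_iris_train;  -- expect 120 rows
SELECT COUNT(*) FROM gmm_iris_test;   -- expect 30 rows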

Example 1: Basic GMM, Spherical Covariance, Packed Output

Input

Table 771: GMMFit Example 1 Input Table gmm_iris_train

id sepal_length sepal_width petal_length petal_width


1 5.1 3.5 1.4 0.2
2 4.9 3 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3
8 5 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
11 5.4 3.7 1.5 0.2
12 4.8 3.4 1.6 0.2
13 4.8 3 1.4 0.1
14 4.3 3 1.1 0.1
16 5.7 4.4 1.5 0.4
... ... ... ... ...

SQL-MapReduce Call

SELECT * FROM gmmfit (
ON (SELECT 1) AS init_params PARTITION BY 1
InputTable ('gmm_iris_train')
OutputTable ('gmm_output_ex1')
ClusterNum (3)
CovarianceType ('spherical')
MaxIterNum (10)
PackOutput (1)
);

Output
Because the SQL-MapReduce call set PackOutput to 1, a single mean column displays a vector containing
the mean value for each dimension. Refer to the schema for argument definitions.
An output message table (immediately following) and an output table are shown below.
Table 772: GMMFit Example 1 Output Message Table

property value
Output Table gmm_output_ex1
Algorithm Used Basic GMM
... ...
Stopping Criterion Iteration limit reached
Delta Log Likelihood 0.013310
Number of Iterations 10
Number of Clusters 3
Covariance Type spherical
... ...
Number of Data Points 120
Global Mean [5.866, 3.055, 3.770, 1.205]
Global Covariance [[0.7197, -0.04204, 1.326, 0.5265], [-0.04204, 0.1916, -0.3241, -0.1213], [1.326,
-0.3241, 3.167, 1.298], [0.5265, -0.1213, 1.298, 0.5708]]
... ...
Log Likelihood -364.450
Akaike Information Criterion 762.899 on 17 parameters
Bayesian Information Criterion 810.287 on 17 parameters

The following query returns the output shown in the following table:

SELECT * FROM gmm_output_ex1 ORDER BY cluster_id;

Table 773: GMMFit Example 1 Output Table gmm_output_ex1 (Columns 1-4)

cluster_id points_assigned covariance_type weight


0 1 spherical 0.00833333333333333
1 39 spherical 0.324999998499867
2 80 spherical 0.6666666681668

Table 774: GMMFit Example 1 Output Table gmm_output_ex1 (Columns 5-8)

mean covariance determinant prec


[4.5, 2.299999952316284, 1.2999999523162842, 0.30000001192092896] 1.0E-12 1e-48 1.0E12
[5.010256409188803, 3.446153855805786, 1.44615384307372, 0.2512820560787926] 0.07235370835810885 2.74058439183871e-05 13.820991662937034
[6.299999998810371, 2.873750007907585, 4.9337499775605185, 1.6812499819268993] 0.3514238290308256 0.0152519307815099 2.84556685514995

Example 2: Basic GMM, Diagonal Covariance, Unpacked Output


In this example, PackOutput is set to 0, so the mean for each dimension is output in a separate column.
Refer to the schema for argument definitions.

SQL-MapReduce Call

DROP TABLE IF EXISTS gmm_output_ex2;

SELECT * FROM gmmfit (
ON (SELECT 1) AS init_params PARTITION BY 1
InputTable ('gmm_iris_train')
OutputTable ('gmm_output_ex2')
ClusterNum (3)
CovarianceType ('diagonal')
MaxIterNum (10)
PackOutput (0)
);

Output

Table 775: GMMFit Example 2 Output Message Table

property value
Output Table gmm_output_ex2
Algorithm Used Basic GMM
... ...
Stopping Criterion Iteration limit reached
Delta Log Likelihood 0.018931
Number of Iterations 10
Number of Clusters 3
Covariance Type diagonal
... ...
Number of Data Points 120
Global Mean [5.866, 3.055, 3.770, 1.205]
Global Covariance [[0.7197, -0.04204, 1.326, 0.5265], [-0.04204, 0.1916, -0.3241, -0.1213],
[1.326, -0.3241, 3.167, 1.298], [0.5265, -0.1213, 1.298, 0.5708]]
... ...
Log Likelihood -305.091
Akaike Information Criterion 662.182 on 26 parameters
Bayesian Information Criterion 734.657 on 26 parameters

The following query returns the output shown in the table gmm_output_ex2:

SELECT * FROM gmm_output_ex2 ORDER BY cluster_id;

Table 776: GMMFit Example 2 Output Table gmm_output_ex2 (Columns 1-6)

cluster_id points_assigned covariance_type weight sepal_length sepal_width


0 56 diagonal 0.463728582737372 6.56751643513457 2.99034957312515
1 40 diagonal 0.333333333332912 4.99750000238412 3.41750001311301
2 24 diagonal 0.202938083929716 5.68870509210384 2.60731133573207

Table 777: GMMFit Example 2 Output Table gmm_output_ex2 (Columns 7-11)

petal_length petal_width cov_1 cov_2 cov_3


5.31297748907166 1.87945392924265 0.328907194931553 0.0789954902732131 0.412726614048047
1.44249999523138 0.252500005066006 0.13174372760081 0.152943747872323 0.0244437555312951

4.06718699627474 1.22833926311197 0.183744196287492 0.0867936412267367 0.195533280241609

Table 778: GMMFit Example 2 Output Table gmm_output_ex2 (Columns 12-14)

cov_4 determinant prec


0.101126958343901 0.00108443891101065 [3.040371312668016, 12.658950486178506, 2.422911355756636, 9.888560047453522]
0.011993750630212 5.90724008666195e-06 [7.590494198175791, 6.538351609081782, 40.91024387474792, 83.37675434747011]
0.0317389021878551 9.89724055263653e-05 [5.442348766408755, 11.521581372391491, 5.114218913344879, 31.5070758932125]

Example 3: DP-GMM, Full Covariance, Unpacked Output


Dirichlet Process GMM (DP-GMM) estimates the number of clusters in the data using an algorithm based
on variational Bayesian methods. This example uses full covariance and PackOutput set to 0.

SQL-MapReduce Call

DROP TABLE IF EXISTS dpgmm_output_ex3;

SELECT * FROM gmmfit (
ON (SELECT 1) AS init_params PARTITION BY 1
InputTable ('gmm_iris_train')
OutputTable ('dpgmm_output_ex3')
MaxClusterNum (3)
CovarianceType ('full')
MaxIterNum (10)
PackOutput (0)
);

Output

Table 779: GMMFit Example 3 Output Message Table

property value
Output Table dpgmm_output_ex3
Algorithm Used Dirichlet Process GMM
Stopping Criterion Algorithm converged with tolerance 0.001
Delta Log Likelihood 0.000494
Number of Iterations 9
Number of Clusters Found 1
Covariance Type full
Number of Data Points 120

Global Mean [5.866, 3.055, 3.770, 1.205]
Global Covariance [[0.7197, -0.04204, 1.326, 0.5265], [-0.04204, 0.1916, -0.3241, -0.1213],
[1.326, -0.3241, 3.167, 1.298], [0.5265, -0.1213, 1.298, 0.5708]]
Log Likelihood 1550.435
Akaike Information Criterion -3012.870 on 44 parameters
Bayesian Information Criterion -2890.220 on 44 parameters

The following query returns the output shown in the table dpgmm_output_ex3:

SELECT * FROM dpgmm_output_ex3 ORDER BY cluster_id;

Table 780: GMMFit Example 3 Output Table: dpgmm_output_ex3 (Columns 1-6)

cluster_id points_assigned covariance_type weight sepal_length sepal_width


0 120 full 0.999991735613739 5.7341932211886 3.01843399034949
1 0 full 8.25613837088043e-06 0 0
2 0 full 8.24789043818501e-09 0 0

Table 781: GMMFit Example 3 Output Table: dpgmm_output_ex3 (Columns 7-11)

petal_length petal_width cov_11 cov_12 cov_13


3.61166170612114 1.14486231099299 1.66229689039324 -0.214185091078124 1.69427436339271
0 0 0.166666666666667 0 0
0 0 0.166666666666667 0 0

Table 782: GMMFit Example 3 Output Table: dpgmm_output_ex3 (Columns 12-16)

cov_14 cov_22 cov_23 cov_24 cov_33


0.693859013357687 1.14410673773899 0.0262470085044228 0.0196024370046391 4.00032158430361
0 0.166666666666667 0 0 0.166666666666667
0 0.166666666666667 0 0 0.166666666666667

Table 783: GMMFit Example 3 Output Table: dpgmm_output_ex3 (Columns 17-20)

cov_34 cov_44 determinant prec


1.27929539653395 1.50738875289365 6.57941093165913 [[0.7824065815292398, -0.04166719847723184, -0.2990852200079339, -0.10577545748404801], [-0.09328114699788496, 0.8754360535621004, 0.03258314800179726, 0.00390066150799786], [-0.16999962390216933, 0.13304206710705144, 0.4009441383070519, -0.26375290851001104], [-0.04690499550266552, 0.04519483233542283, -0.2524462687807997, 0.898648654482997]]
0 0.166666666666667 0.000771604938271605 [[6.0, 0.0, 0.0, 0.0], [0.0, 6.0, 0.0, 0.0], [0.0, 0.0, 6.0, 0.0], [0.0, 0.0, 0.0, 6.0]]
0 0.166666666666667 0.000771604938271605 [[6.0, 0.0, 0.0, 0.0], [0.0, 6.0, 0.0, 0.0], [0.0, 0.0, 6.0, 0.0], [0.0, 0.0, 0.0, 6.0]]

GMMPredict

Summary
The GMMPredict function takes the output from the function GMMFit and predicts the cluster assignments
for each point in a specified data set. Because GMM functions do soft assignments of data points to clusters
(that is, GMM functions give probabilities that each data point is in each cluster), you can specify the top N
most likely clusters for a given point and the probability that the point is a member of each of those clusters.
The output table of the GMMPredict function can be input to the function GMMProfile.

Usage

GMMPredict Syntax
Version 1.0

SELECT * FROM GMMPredict (
ON { table | view | (query) } AS modeldata DIMENSION
ON { table | view | (query) } AS testdata PARTITION BY key
[ OutputFormat ({ 'sparse' | 'dense' }) ]
[ TopNClusters (n) ]
[ PrintLogLikelihood
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Attributes ({ 'testdata_column' | 'testdata_column_range' }[,...])]

[ IDColumn ('testdata_column')]
);

Arguments
Argument Category Description
OutputFormat Optional Specifies how the function outputs the weights that it assigns to each
of the top N clusters:
'sparse' (default): The function outputs each weight to a separate
row of the output table.
'dense': The function outputs the weights to a single row of the
output table.
TopNClusters Optional Specifies the number of cluster weights that the function outputs.
This value must be an INTEGER. For the value n, the function
outputs for each data point the cluster with the greatest weight, the
cluster with the second-greatest weight, and so on, ending with the
cluster with the nth-greatest weight. The default value is 1.
PrintLogLikelihood Optional Specifies whether to output the log likelihood of an observation,
given the data. The default value is 'false'.
Accumulate Optional Specifies the names of testdata columns to copy to the output table.
Attributes Optional Specifies the names of testdata columns that correspond to the
attributes in the modeldata table. By default, these columns are all
testdata columns except the first.
IDColumn Optional Specifies the input table column that defines the row identifier. The
default value is the first input table column.

Input
The GMMPredict function has two input tables, testdata and modeldata. For the schema of testdata, refer to
GMMFit input_table Schema. For the schema of modeldata, refer to GMMFit output_table Schema.

Output
The GMMPredict function has one output table, whose format depends on the OutputFormat argument.
The following table describes the output table for OutputFormat('sparse'), the default. The table has D+3
columns, where D is the number of dimensions of the input data.
Table 784: GMMPredict Output Table Schema for OutputFormat('sparse')

Column Name Data Type Description


accumulate_column Same as in Column copied from the testdata table. Typically, one
input table accumulate_column contains the unique identifier of a data
point.

data_point NUMERIC Input data point. The table has D such columns, which are
copied from input_table to the output table. Their names are
the same in input_table and the output table.
cluster_rank INTEGER Rank of the cluster for the data point, by probability (1 is the
most probable cluster)
cluster_id INTEGER Identification number of the cluster
prob DOUBLE PRECISION Probability that the data point belongs to the cluster

The following table describes the output table for OutputFormat('dense'). The table has D+2n columns,
where D is the number of dimensions of the input data and n is the number of cluster weights that the
function outputs (the value of the TopNClusters argument).
Table 785: GMMPredict Output Table Schema for OutputFormat('dense')

Column Name Data Type Description


accumulate_column Same as in Column copied from the testdata table. Typically, one
input table accumulate_column contains the unique identifier of a data
point.
id Any Identification of the data point
data_point NUMERIC Input data point. The table has D such columns, which are
copied from input_table to the output table. Their names are
the same in input_table and the output table.
cluster_rank_i INTEGER Rank i of the cluster for the data point, by probability.
cluster_id_i INTEGER Identification number of the cluster with the ith-greatest
probability for the data point. The table has n such
columns (cluster_id_1, ..., cluster_id_n).
prob_i DOUBLE Probability assigned to the data point by the cluster named
PRECISION in cluster_id_i. The table has n such columns (prob_1, ...,
prob_n).
For each i, the column prob_i immediately follows the
column cluster_id_i.
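As an illustration, a call like the following (a minimal sketch, reusing the model table gmm_output_ex1 and
the test table gmm_iris_test from the example below) requests dense output for the top two clusters:

SELECT * FROM GMMPredict (
ON gmm_output_ex1 AS modeldata DIMENSION
ON gmm_iris_test AS testdata PARTITION BY id
OutputFormat ('dense')
TopNClusters (2)
) ORDER BY id;

With this call, each test row would produce a single output row containing the paired cluster_id_i and
prob_i columns described in the preceding table.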

Example
The GMMPredict function applies the model created by GMMFit to the test input to cluster the test data.

Input
Table 786: GMMPredict Example Input Table gmm_iris_test

id sepal_length sepal_width petal_length petal_width


5 5 3.6 1.4 0.2

10 4.9 3.1 1.5 0.1
15 5.8 4 1.2 0.2
20 5.1 3.8 1.5 0.3
25 4.8 3.4 1.9 0.2
30 4.7 3.2 1.6 0.2
35 4.9 3.1 1.5 0.2
40 5.1 3.4 1.5 0.2
45 5.1 3.8 1.9 0.4
50 5 3.3 1.4 0.2
55 6.5 2.8 4.6 1.5
60 5.2 2.7 3.9 1.4
65 5.6 2.9 3.6 1.3
70 5.6 2.5 3.9 1.1
75 6.4 2.9 4.3 1.3
80 5.7 2.6 3.5 1
85 5.4 3 4.5 1.5
90 5.5 2.5 4 1.3
95 5.6 2.7 4.2 1.3
100 5.7 2.8 4.1 1.3
105 6.5 3 5.8 2.2
110 7.2 3.6 6.1 2.5
115 5.8 2.8 5.1 2.4
120 6 2.2 5 1.5
125 6.7 3.3 5.7 2.1
130 7.2 3 5.8 1.6
135 6.1 2.6 5.6 1.4
140 6.9 3.1 5.4 2.1
145 6.7 3.3 5.7 2.5
150 5.9 3 5.1 1.8

SQL-MapReduce Call

SELECT * FROM GMMPredict (
ON gmm_output_ex1 AS modeldata DIMENSION
ON gmm_iris_test AS testdata PARTITION BY id
TopNClusters (3)
) ORDER BY id, prob DESC;

Output
The output table shows the dimensions and id of each sample, and the probability that it belongs to each of
the three clusters.
Table 787: GMMPredict Example Output Table (Columns 1-4)

id sepal_length sepal_width petal_length


5 5 3.59999990463257 1.39999997615814
5 5 3.59999990463257 1.39999997615814
5 5 3.59999990463257 1.39999997615814
10 4.90000009536743 3.09999990463257 1.5
10 4.90000009536743 3.09999990463257 1.5
10 4.90000009536743 3.09999990463257 1.5
15 5.80000019073486 4 1.20000004768372
15 5.80000019073486 4 1.20000004768372
15 5.80000019073486 4 1.20000004768372
20 5.09999990463257 3.79999995231628 1.5
20 5.09999990463257 3.79999995231628 1.5
20 5.09999990463257 3.79999995231628 1.5
25 4.80000019073486 3.40000009536743 1.89999997615814
25 4.80000019073486 3.40000009536743 1.89999997615814
25 4.80000019073486 3.40000009536743 1.89999997615814
... ... ... ...

Table 788: GMMPredict Example Output Table (Columns 5-8)

petal_width cluster_rank cluster_id prob


0.200000002980232 1 0 0.999999996516264
0.200000002980232 2 2 3.48373546251981e-09
0.200000002980232 3 0 0
0.100000001490116 1 1 0.999999999978146


0.100000001490116 2 2 2.18542496451431e-11
0.100000001490116 3 0 0
0.200000002980232 1 1 0.999999998968956
0.200000002980232 2 2 1.03104372199014e-09
0.200000002980232 3 0 0
0.300000011920929 1 1 0.999999999970437
0.300000011920929 2 2 2.9562773726376e-11
0.300000011920929 3 0 0
0.200000002980232 1 1 0.999999998740311
0.200000002980232 2 2 1.25968923928218e-09
0.200000002980232 3 0 0
... ... ... ...

GMMProfile

Summary
The GMMProfile function takes the output of the function GMMFit and outputs information about how
each cluster diverges from the global data statistics.

Usage

GMMProfile Syntax
Version 1.0

SELECT * FROM GMMProfile (
ON { table | view | (query) } PARTITION BY 1
);

Input
The GMMProfile function takes as input the model table that the GMMFit function outputs.

Output
Table 789: GMMProfile Output Table Schema

Column Name Data Type Description

cluster_id INTEGER Identification number of the cluster.
dimension VARCHAR Name of the input dimension.
delta_mean DOUBLE PRECISION Difference between the mean of the cluster and the mean of
the data set along each dimension.
divergence DOUBLE PRECISION Estimated Kullback-Leibler divergence between the cluster
and the data set along each dimension.

Examples
The examples in this section show the delta mean and divergence for each of the models created with
GMMFit.

Example 1

Input
Use the following tables from the Output section of Example 1: Basic GMM, Spherical Covariance, Packed
Output of the function GMMFit:
• GMMFit Example 1 Output Table: gmm_output_ex1 (Columns 1-4)
• GMMFit Example 1 Output Table: gmm_output_ex1 (Columns 5-8)

SQL-MapReduce Call

SELECT * FROM gmmprofile (ON gmm_output_ex1 PARTITION BY 1);

Output

Table 790: GMMProfile Example 1 Output Table

cluster_id dimension delta_mean divergence


0 0 -1.36583333412806 16.0871661269885
0 1 -0.755000054836273 15.6918314391749
0 2 -2.47000003655752 25.0016000723501
0 3 -0.904999979833762 14.2441750796929
1 0 -0.855576924939259 1386313799.54973
1 1 0.391153848653229 5775093662.30758



1 2 -2.32384614580008 390477795.61446
1 3 -0.953717935675898 311363117.008069
2 0 0.434166664682309 14964265940.4416
2 1 -0.181249999244972 2835887314.8319
2 2 1.16374998868672 56481512840.0448
2 3 0.476249990172208 9413647124.31546

Example 2

Input
Use the following tables from the Output section of Example 2: Basic GMM, Diagonal Covariance,
Unpacked Output of the function GMMFit:
• GMMFit Example 2 Output Table gmm_output_ex2 (Columns 1-6)
• GMMFit Example 2 Output Table gmm_output_ex2 (Columns 7-11)
• GMMFit Example 2 Output Table gmm_output_ex2 (Columns 12-14)

SQL-MapReduce Call

SELECT * FROM gmmprofile (ON gmm_output_ex2 PARTITION BY 1);

Output

Table 791: GMMProfile Example 2 Output Table

cluster_id dimension delta_mean divergence


0 0 -0.868334230250709 3.16530577290871
0 1 0.36249887847439 0.706865126726026
0 2 -2.32750007414059 21.9891140697781
0 3 -0.952500183235945 3.8803081334208
1 0 0.371891983059538 3.36329375822824
1 1 -0.148943256528575 0.709903097697611
1 2 1.05285833592309 23.9019007586876
1 3 0.440314978830265 4.19452603457996
2 0 0.488281841506716 4.96718581315709
2 1 -0.209322477802817 1.80990130022945
2 2 1.26010901309915 27.9126125013201

2 3 0.507475356350264 5.55756178817895

Example 3

Input
Use the following tables from the Output section of Example 3: DP-GMM, Full Covariance, Unpacked
Output of the function GMMFit:
• GMMFit Example 3 Output Table dpgmm_output_ex3 (Columns 1-6)
• GMMFit Example 3 Output Table dpgmm_output_ex3 (Columns 7-11)
• GMMFit Example 3 Output Table dpgmm_output_ex3 (Columns 12-16)
• GMMFit Example 3 Output Table dpgmm_output_ex3 (Columns 17-20)

SQL-MapReduce Call

SELECT * FROM gmmprofile (ON dpgmm_output_ex3 PARTITION BY 1);

Output

Table 792: GMMProfile Example 3 Output Table

cluster_id dimension delta_mean divergence


0 0 -0.868334230250709 3.16530577290871
0 1 0.36249887847439 0.706865126726026
0 2 -2.32750007414059 21.9891140697781
0 3 -0.952500183235945 3.8803081334208
1 0 0.371891983059538 3.36329375822824
1 1 -0.148943256528575 0.709903097697611
1 2 1.05285833592309 23.9019007586876
1 3 0.440314978830265 4.19452603457996
2 0 0.488281841506716 4.96718581315709
2 1 -0.209322477802817 1.80990130022945
2 2 1.26010901309915 27.9126125013201
2 3 0.507475356350264 5.55756178817895


KMeans

Summary
The KMeans function takes a data set and outputs the centroids of its clusters and, optionally, the clusters
themselves.

Background
K-means clustering is a simple unsupervised learning algorithm that is popular for cluster analysis in data
mining. The algorithm aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean—the centroid for the cluster.
The algorithm aims to minimize an objective function (in this case, a squared error function). The objective
function, which is a chosen distance measure between a data point and the cluster center, indicates the
distance of the n data points from their respective centroids.
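In standard notation, for k clusters S_1, ..., S_k with centroids m_1, ..., m_k, this squared-error objective
can be written as:

J = Σ_{i=1..k} Σ_{x ∈ S_i} ||x - m_i||²

Each data point x contributes the squared distance to the centroid of the cluster to which it is assigned.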
The algorithm has these steps:
1. Place k points into the space represented by the objects that are being clustered.
These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. Recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
Now the objects are in groups from which the metric to be minimized can be calculated.
Although the procedure always terminates, the k-means algorithm does not necessarily find the optimal
configuration, corresponding to the global objective function minimum. The algorithm is significantly
sensitive to the initial randomly selected cluster centers. To reduce the effect of these limitations, the
k-means algorithm can be run multiple times.
The k-means algorithm in map-reduce consists of an iteration (until convergence) of a map and a reduce
step. The map step assigns each point to a cluster. The reduce step takes all the points in each cluster and
calculates the new centroid of the cluster.

Usage

KMeans Syntax
Version 1.6

SELECT * FROM KMeans (
ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
[ ClusteredOutput ('clustered_output_table') ]
[ UnpackColumns
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ InitialSeeds (starting_clusters) ]
[ NumClusters (number_of_means) [ Seed (seed) ] ]
[ CentroidsTable ('centroids_table') ]
[ Threshold ('threshold') ]
[ MaxIterNum ('max_iterations') ]
);

Note:
You must specify only one of the arguments NumClusters, InitialSeeds, and CentroidsTable. If you
specify more than one, the function gives top priority to InitialSeeds, then to NumClusters, and then to
CentroidsTable.

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description

InputTable Required Specifies the name of the table that contains the features by which to cluster the
data.
OutputTable Required Specifies the name of the table in which to output the centroids of the clusters.
ClusteredOutput Optional Specifies the name of the table in which to store the clustered output. If you
omit this argument, the function does not generate a table of clustered output.
UnpackColumns Optional Specifies whether the means for each centroid appear unpacked (that is, in
separate columns) in output_table. By default, the function concatenates the
means for the centroids and outputs the result in a single VARCHAR column.



InitialSeeds Optional Specifies the initial seed means as strings of underscore-delimited DOUBLE
PRECISION values. For example, this clause initializes eight clusters in
eight-dimensional space:
InitialSeeds ('50_50_50_50_50_50_50_50',
'150_150_150_150_150_150_150_150',
'250_250_250_250_250_250_250_250',
'350_350_350_350_350_350_350_350',
'450_450_450_450_450_450_450_450',
'550_550_550_550_550_550_550_550',
'650_650_650_650_650_650_650_650',
'750_750_750_750_750_750_750_750')
The dimensionality of the means must match the dimensionality of the data
(that is, each mean must have n numbers in it, where n is the number of input
columns minus one).
By default, the algorithm chooses the initial seed means randomly.

Note:
With InitialSeeds, the function uses a deterministic algorithm and the
function supports up to 1596 dimensions.

NumClusters Optional Specifies the number of clusters to generate from the data.

Note:
With NumClusters, the function uses a nondeterministic algorithm and the
function supports up to 1543 dimensions.

CentroidsTable Optional Specifies the table that contains the initial seed means for the clusters. The
schema of the centroids table depends on the value of the UnpackColumns
argument.

Note:
With CentroidsTable, the function uses a deterministic algorithm and the
function supports up to 1596 dimensions.

Threshold Optional Specifies the convergence threshold. When the centroids move by less than this
amount, the algorithm has converged. The default value is 0.0395.
MaxIterNum Optional Specifies the maximum number of iterations that the algorithm runs before
quitting if the convergence threshold has not been met. The default value is 10.

Input
The KMeans function has one required input table (specified by the InputTable argument) and one optional
input table (specified by the CentroidsTable argument).
The required input table contains the features by which to cluster the data.
Table 793: KMeans Input Table Schema

Column Name Data Type Description


id INTEGER Contains the identifier of the user or item.

dimension_i DOUBLE Contains the data in dimension i. The table has columns dimension_1
PRECISION through dimension_n, where n is the number of dimensions. Each
dimension is a feature by which to cluster the data.
For example, if the required application is the clustering of points by
latitude and longitude on the Earth's surface, then the input table has
three columns: point-id, latitude, and longitude. Clustering is
performed on the latitude and longitude columns. The dimensionality
n of the data is not specified as an argument, but is implicitly derived
from the data.

The optional input table contains the initial seed means for the clusters. This table has the same
schema as the table of cluster centroids (specified by the OutputTable argument), which is affected by the
UnpackColumns argument and is described by KMeans Results Messages and KMeans Output Table
Schema for UnpackColumns('true').

Output
The KMeans function has two required outputs and one optional output. The required outputs are the result
messages (output to the screen) and the table of cluster centroids (specified by the OutputTable argument).
The optional output is a table of the clusters themselves (specified by the ClusteredOutput argument).
The results messages table starts with information about each cluster, described by the following two tables.
Table 794: KMeans Results Messages Table Schema

Column Name Data Type Description


clusterid INTEGER Contains the cluster identifiers of the centroids.
feature_set VARCHAR Column name is the concatenation of the feature names. For example, if
the feature names are 'p1', 'p2', and 'p3', then the column name is 'p1 p2
p3'.
Contains the concatenation of the means in the centroid. For example,
means 3, 5, and 6 are represented as '3 5 6'.

Note:
The UnpackColumns argument does not affect this column.

size INTEGER Contains the number of points in the cluster.


withinss DOUBLE PRECISION Contains the within-cluster-sum-of-squares—the sum of squared
differences of each point from its cluster centroid.

Table 795: KMeans Results Messages

Label Value
Converged : 'True' if the algorithm converged, 'False' otherwise.
Number of iterations : Number of iterations that the algorithm performed.
Number of clusters : Number of clusters.

Output table : Name of the output table specified by the OutputTable argument.
Total_WithinSS : Sum of withinss values in the preceding table.
Between_SS : Between sum of squares—the sum of squared distances of centroids to the
global mean, where the squared distance of each mean to the global mean is
multiplied by the number of data points it represents.
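For example, you can recompute Total_WithinSS directly from the table of cluster centroids with a simple
aggregate (a minimal sketch, using the kmeanssample_centroid table created in Example 1 below):

SELECT SUM(withinss) AS total_withinss
FROM kmeanssample_centroid;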

The schema of the table of cluster centroids is affected by the UnpackColumns argument.
Table 796: KMeans Output Table Schema for UnpackColumns('false') (Default)

Column Name Data Type Description


clusterid INTEGER Contains the cluster identifiers of the centroids.
feature_set VARCHAR Column name is the concatenation of the feature names. For example, if
the feature names are 'p1', 'p2', and 'p3', then the column name is 'p1 p2
p3'.
Contains the concatenation of the means of the features in the centroid.
For example, means 3, 5, and 6 are represented as '3 5 6'.
size INTEGER Contains the number of points in the cluster.
withinss DOUBLE PRECISION Contains the within-cluster-sum-of-squares—the sum of squared
differences of each point from its cluster centroid.

Table 797: KMeans Output Table Schema for UnpackColumns('true')

Column Name Data Type Description


clusterid INTEGER Contains the cluster identifiers of the centroids.
feature_i INTEGER or Contains the means for feature i. The table has one such column for each
VARCHAR feature.
size INTEGER Contains the number of points in the cluster.
withinss DOUBLE PRECISION Contains the within-cluster-sum-of-squares—the sum of squared
differences of each point from its cluster centroid.

The following table describes the optional table of the clusters themselves.
Table 798: KMeans Clustered Output Table Schema

Column Name Data Type Description


pointid INTEGER Contains the identifier of the user or item (from input_table).
centroidid INTEGER Contains the identifier of the centroid for pointid.

Examples
The input (Input) contains attributes of personal computers (price, speed, hard disk size, RAM, and screen
size). The table contains over 6000 rows. These examples show how to use various arguments to find eight
clusters based on all five attributes.

Input
Table 799: KMeans Examples Input Table computers_train1

id price speed hd ram screen


1 1499 25 80 4 14
2 1795 33 85 2 14
3 1595 25 170 4 15
4 1849 25 170 8 14
5 3295 33 340 16 14
6 3695 66 340 16 14
7 1720 25 170 4 14
8 1995 50 85 2 14
9 2225 50 210 8 14
12 2605 66 210 8 14
13 2045 50 130 4 14
14 2295 25 245 8 14
16 2225 50 130 4 14
17 1595 33 85 2 14
18 2325 33 210 4 15
19 2095 33 250 4 15
20 4395 66 452 8 14
... ... ... ... ... ...

Example 1: NumClusters and UnpackColumns('false') by Default

SQL-MapReduce Call
This call tries to group the 5-dimensional data points into 8 clusters.

SELECT * FROM kmeans (
ON (SELECT 1)
PARTITION BY 1
InputTable ('computers_train1')

OutputTable ('kmeanssample_centroid')
NumClusters ('8')
Threshold ('0.05')
MaxIterNum ('10')
);

Output

Table 800: KMeans Example 1 Results Message Table

clusterid price speed hd ram screen size withinss


0 2861.5646437995 57.7915567282 428.9525065963 11.746701847 14.7941952507 758 3.23717069498672E7
1 1463.1038461539 39.1371794872 236.5679487179 3.9282051282 14.208974359 780 2.45706992256417E7
2 2392.8843069874 52.4788087056 334.5612829324 7.3195876289 14.5841924399 873 2.60713860412369E7
3 3636.914893617 65.4723404255 615.4042553191 13.6340425532 15.1276595745 235 5.03470581617022E7
4 2095.0655462185 53.6352941176 512.6991596639 8.6588235294 14.7210084034 595 2.04614168806725E7
5 2689.9612756264 62.9225512528 980.2961275626 19.5353075171 15.15261959 439 5.32294371890669E7
6 1762.2406947891 64.9081885856 595.4863523573 7.0967741935 14.682382134 403 1.95361545012407E7
7 1891.4 42.9837837838 197.0627027027 4.08 14.3221621622 925 2.13319771956758E7
Converged: False
NumberofIterations: 10
Numberofclusters: 8
Outputtable: "kmeanssample_centroid"
Total_WithinSS: 2.479198361451039E8
Between_SS: 1.790667656379848E9

The following query returns the output shown in the following table:

SELECT * FROM kmeanssample_centroid;

Table 801: KMeans Example 1 Output Table kmeanssample_centroid

clusterid price speed hd ram screen size withinss


0 2861.5646437995 57.7915567282 428.9525065963 11.746701847 14.7941952507 758 3.23717069498672E7
1 1463.1038461539 39.1371794872 236.5679487179 3.9282051282 14.208974359 780 2.45706992256417E7

2 2392.8843069874 52.4788087056 334.5612829324 7.3195876289 14.5841924399 873 2.60713860412369E7
3 3636.914893617 65.4723404255 615.4042553191 13.6340425532 15.1276595745 235 5.03470581617022E7
4 2095.0655462185 53.6352941176 512.6991596639 8.6588235294 14.7210084034 595 2.04614168806725E7
5 2689.9612756264 62.9225512528 980.2961275626 19.5353075171 15.15261959 439 5.32294371890669E7
6 1762.2406947891 64.9081885856 595.4863523573 7.0967741935 14.682382134 403 1.95361545012407E7
7 1891.4 42.9837837838 197.0627027027 4.08 14.3221621622 925 2.13319771956758E7

Example 2: NumClusters and UnpackColumns('true')

SQL-MapReduce Call

SELECT * FROM kmeans (
ON (SELECT 1)
PARTITION BY 1
InputTable ('computers_train1')
OutputTable ('kmeanssample_centroid')
UnpackColumns ('true')
NumClusters ('8')
Threshold ('0.05')
MaxIterNum ('10')
);

Output

Table 802: KMeans Example 2 Results Message Table

clusterid price speed hd ram screen size withinss


0 1494.1547619048 39.7965367965 237.25 3.9372294372 14.2067099567 924 3.39788217456717E7
1 3723.5736842105 66.0894736842 598.4631578947 13.0105263158 15.2 190 3.93415275684214E7
2 2490.1091269841 53.244047619 280.875 6.6706349206 14.7023809524 504 1.20976457876997E7
3 2417.9474497682 57.5023183926 615.9103554869 12.142194745 14.8253477589 647 3.59278789428129E7
4 2093.8691695108 47.6769055745 279.3970420933 5.590443686 14.4357224118 879 2.47440773060303E7
5 2843.8243243243 61.8581081081 1079.7972972973 21.8648648649 15.1013513514 296 3.01185782972965E7
6 2941.4206219313 60 435.1931260229 12.1047463175 14.8346972177 611 2.32189197119465E7
7 1832.9561128527 53.0208986416 417.9059561129 6.0752351097 14.5454545455 957 5.19993211306162E7
Converged: False
NumberofIterations: 10
Numberofclusters: 8

Outputtable: "kmeanssample_centroid"
Total_WithinSS: 2.514267704904952E8
Between_SS: 1.7871607220344727E9

The following query returns the output shown in the table kmeanssample_centroid:

SELECT * FROM kmeanssample_centroid;

Table 803: KMeans Example 2 Output Table kmeanssample_centroid (Columns 1-4)

clusterid price speed hd


0 1494.15476190476 39.7965367965368 237.25
1 3723.57368421053 66.0894736842105 598.463157894737
2 2490.10912698413 53.2440476190476 280.875
3 2417.94744976816 57.5023183925811 615.910355486862
4 2093.86916951081 47.6769055745165 279.397042093288
5 2843.82432432432 61.8581081081081 1079.7972972973
6 2941.42062193126 60 435.193126022913
7 1832.95611285266 53.0208986415883 417.905956112853

Table 804: KMeans Example 2 Output Table kmeanssample_centroid (Columns 5-8)

ram screen size withinss


3.93722943722944 14.20670995671 924 33978821.7456717
13.0105263157895 15.2 190 39341527.5684214
6.67063492063492 14.702380952381 504 12097645.7876997
12.1421947449768 14.8253477588872 647 35927878.9428129
5.59044368600683 14.4357224118316 879 24744077.3060303
21.8648648648649 15.1013513513514 296 30118578.2972965
12.1047463175123 14.8346972176759 611 23218919.7119465
6.07523510971787 14.5454545454545 957 51999321.1306162

Example 3: InitialSeeds and ClusteredOutput

SQL-MapReduce Call

SELECT * FROM kmeans (
ON (SELECT 1)
PARTITION BY 1
InputTable ('computers_train1')
OutputTable ('kmeanssample_output')
InitialSeeds ('2249_51_408_8_14', '2165_51_398_7_14.6',
'2182_51_404_7_14.6', '2204_55_372_7.19_14.6',
'2419_44_222_6.6_14.3', '2394_44.3_277_7.3_14.5',
'2326_43.6_301_7.11_14.3', '2288_44_325_7_14.4')
ClusteredOutput ('kmeanssample_clusteredoutput')
);

Output

Table 805: KMeans Example 3 Results Message Table

clusterid price speed hd ram screen size withinss


0 2857.75 62.027027027027 1075.40540540541 21.9459459459459 15.1047297297297 296 2.89798655236483E7
1 1471.77529411765 41.42 261.765882352941 4.13647058823529 14.2341176470588 850 3.12681781317644E7
2 1966.25868725869 64.3745173745174 682.048262548263 9.08880308880309 14.8552123552124 518 3.75512305289583E7
3 1863.05935613682 44.3581488933602 242.107645875252 4.62374245472837 14.3460764587525 994 2.36222057082472E7
4 3765.46783625731 65.7953216374269 603.818713450292 12.9824561403509 15.2222222222222 171 3.52858142923985E7
5 2977.40142095915 60.8081705150977 432.916518650089 12.113676731794 14.8827708703375 563 2.21058786252213E7
6 2553.49551856594 54.0832266325224 465.946222791293 10.6171574903969 14.7477592829706 781 2.97090384046078E7
7 2226.79880239521 50.6071856287425 309.419161676647 6.3185628742515 14.5329341317365 835 2.19925897556877E7
Converged: False
NumberofIterations: 10
Numberofclusters: 8
Outputtable: "kmeanssample_output"
ClusteredOutputtable: "kmeanssample_clusteredoutput"
Total_WithinSS: 2.305148009705335E8
Between_SS: 1.8080726915544217E9

The following query returns the output shown in the following table:

SELECT * FROM kmeanssample_output ORDER BY clusterid;

Table 806: KMeans Example 3 Output Table kmeanssample_output

clusterid price speed hd ram screen size withinss


0 2857.75 62.027027027027 1075.40540540541 21.9459459459459 15.1047297297 296 28979865.5236483
1 1471.77529411765 41.42 261.765882352941 4.13647058823529 14.2341176471 850 31268178.1317644
2 1966.25868725869 64.3745173745174 682.048262548263 9.08880308880309 14.8552123552 518 37551230.5289583
3 1863.05935613682 44.3581488933602 242.107645875252 4.62374245472837 14.3460764588 994 23622205.7082472
4 3765.46783625731 65.7953216374269 603.818713450292 12.9824561403509 15.2222222222 171 35285814.2923985
5 2977.40142095915 60.8081705150977 432.916518650089 12.113676731794 14.8827708703 563 22105878.6252213
6 2553.49551856594 54.0832266325224 465.946222791293 10.6171574903969 14.747759283 781 29709038.4046078
7 2226.79880239521 50.6071856287425 309.419161676647 6.3185628742515 14.5329341317 835 21992589.7556877

The following query returns the output shown in the following table:

SELECT * FROM kmeanssample_clusteredoutput ORDER BY id;

Table 807: KMeans Example 3 Output Table kmeanssample_clusteredoutput

id clusterid
1 1
2 3
3 1
4 3
5 5
6 4
7 3
8 3
9 7

12 6
13 3
14 7
16 7
17 1
18 7
19 7
20 4
... ...

Example 4: CentroidsTable and ClusteredOutput


This example uses the following tables from the Output section of Example 2: NumClusters and
UnpackColumns('true').
• KMeans Example 2 Output Table kmeanssample_centroid (Columns 1-4)
• KMeans Example 2 Output Table kmeanssample_centroid (Columns 5-8)

SQL-MapReduce Call

SELECT * FROM kmeans (
ON (SELECT 1)
PARTITION BY 1
InputTable ('computers_train1')
OutputTable ('kmeanssample_output')
CentroidsTable ('kmeanssample_centroid')
ClusteredOutput ('kmeanssample_clusteredoutput')
);

Output

Table 808: KMeans Example 4 Results Message Table

clusterid price speed hd ram screen size withinss


0 1510.0854271357 39.9145728643 239.6783919598 3.9939698492 14.224120603 995 3.78087315477371E7
1 3804.3660130719 63.5294117647 555.9346405229 11.8954248366 15.2679738562 153 2.9924035307189E7
2 2450.5174180328 53.5676229508 375.3278688525 8.3114754098 14.6270491803 976 3.3481207880125E7
3 2347.8198198198 64.7027027027 880.2702702703 14.9189189189 15.1681681682 333 1.99780178258257E7
4 2008.3631889764 44.7716535433 230.8828740157 4.7933070866 14.3818897638 1016 2.92639514901571E7
5 3003.3223140496 60.8595041322 1077.520661157 23.0743801653 14.9752066116 242 1.98665569917355E7
6 2941.2778625954 60.0106870229 445.1160305344 12.2320610687 14.8595419847 655 2.93420013404598E7
7 1863.894984326 59.8761755486 523.7539184953 7.1849529781 14.6943573668 638 2.41424811222563E7
Converged: False
NumberofIterations: 10
Numberofclusters: 8
Outputtable: "kmeanssample_output"
ClusteredOutputtable: "kmeanssample_clusteredoutput"
Total_WithinSS: 2.2380698350548548E8
Between_SS: 1.8147805090194747E9

The following query returns the output shown in the following table:

SELECT * FROM kmeanssample_output ORDER BY clusterid;

Table 809: KMeans Example 4 Output Table kmeanssample_output

clusterid price speed hd ram screen size withinss


0 1510.08542713568 39.9145728643216 239.678391959799 3.99396984924623 14.2241206030151 995 37808731.5477371
1 3804.3660130719 63.5294117647059 555.934640522876 11.8954248366013 15.2679738562092 153 29924035.307189
2 2450.51741803279 53.5676229508197 375.327868852459 8.31147540983607 14.6270491803279 976 33481207.880125
3 2347.81981981982 64.7027027027027 880.27027027027 14.9189189189189 15.1681681681682 333 19978017.8258257
4 2008.36318897638 44.7716535433071 230.882874015748 4.79330708661417 14.3818897637795 1016 29263951.4901571
5 3003.32231404959 60.8595041322314 1077.52066115702 23.0743801652893 14.9752066115702 242 19866556.9917355
6 2941.27786259542 60.0106870229008 445.116030534351 12.2320610687023 14.8595419847328 655 29342001.3404598
7 1863.89498432602 59.8761755485893 523.753918495298 7.18495297805643 14.6943573667712 638 24142481.1222563

The following query returns the output shown in the following table:

SELECT * FROM kmeanssample_clusteredoutput ORDER BY id;

Table 810: KMeans Example 4 Output Table kmeanssample_clusteredoutput

id clusterid
1 0
2 4
3 0
4 4
5 6
6 1
7 0
8 4
9 4
12 2
13 4
14 2
16 4
17 0
18 2
19 4
20 1
... ...


KMeansPlot

Summary
The KMeansPlot function takes a model—a table of cluster centroids output by the KMeans function—and
an input table of test data, and uses the model to assign the test data points to the cluster centroids.

Usage

KMeansPlot Syntax
Version 1.1

SELECT * FROM KMeansPlot (
ON { table | view | query } PARTITION BY ANY
ON { table | view | query } DIMENSION
[ CentroidsTable ('centroids_table') ]
[ PrintDistance ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Note:
When calling KMeansPlot on a view, you must provide aliases (a requirement of multi-input SQL-
MapReduce). For example:

SELECT *
FROM KMeansPlot (
ON pa_prdwk.seg_data_v AS input_data PARTITION BY ANY
ON pa_prdwk.seg_data_output AS segmentation_data_output DIMENSION
CentroidsTable ('segmentation_data_output')
);

Arguments
Argument Category Description
CentroidsTable Optional Specifies the name of the table of cluster centroids output by the KMeans
function.
PrintDistance Optional Specifies whether to print the distance between each data point and the
nearest cluster. The default value is 'false'.

Input
The KMeansPlot function has two required input tables:
• An input table of test data (the input with the PARTITION BY ANY clause), which has the same schema
as the KMeans input table
• The table of cluster centroids output by the KMeans function (the input with the DIMENSION clause)

Output
Table 811: KMeansPlot Output Table Schema

Column Name Data Type Description


id INTEGER Identifier of the data point.
clusterid INTEGER Identifier of the cluster to which the data point is assigned.
distance DOUBLE PRECISION Distance between the data point and the center of the assigned cluster.
This column appears only if the PrintDistance argument has the value
'true'.
attribute INTEGER Data point attribute. This table has one such column for each data point
attribute.
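For example, a call like the following (a minimal sketch, reusing the tables from the example below) includes
the distance column in the output:

SELECT * FROM KMeansPlot (
ON computers_test1 PARTITION BY ANY
ON kmeanssample_centroid DIMENSION
CentroidsTable ('kmeanssample_centroid')
PrintDistance ('true')
) ORDER BY id, clusterid;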

Example
This example uses the table of cluster centroids output by a KMeans function example.

Input
The input table of test data, computers_test1, contains attributes of personal computers (price, speed, hard
disk size, RAM, and screen size). This table has over 1000 rows. If a row contains a null value, KMeansPlot
assigns the cluster ID -1 to that row.
The table of cluster centroids, kmeanssample_centroid, was output by the KMeans function.
Table 812: KMeansPlot Example Input Table computers_test1

id price speed hd ram screen


10 2575 50 210 4 15
11 2195 33 170 8 15

15 2699 50 212 8 14
29 3095 33 340 16 14
30 3244 66 245 8 14
38 3795 66 500 8 14
45 3495 50 340 16 14
46 2695 33 245 8 14
48 1749 25 120 4 14
51 2499 33 170 4 14
52 2395 33 130 4 14
59 2945 66 210 8 17
65 2195 66 85 2 14
66 1495 25 170 4 14
70 3095 66 245 8 14
86 1999 33 120 8 14
91 2975 50 210 4 17
92 2145 66 130 4 14
93 2420 33 170 8 15
94 2505 50 210 8 14
104 2999 66 330 4 15
... ... ... ... ... ...

Table 813: KMeansPlot Example Input Table kmeanssample_centroid

clusterid price speed hd ram screen size withinss


0 2224.57057057057 51.7417417417417 335.113113113113 6.63863863863864 14.5755755755756 999 31124476.9549561
1 1854.76587795766 42.0884184308842 195.638854296389 4.01743462017435 14.3013698630137 803 14330020.7995024
2 1799.80802792321 59.4502617801047 521.869109947644 7.0087260034904 14.6212914485166 573 21619217.6753922
3 2692.29752953813 55.0601503759398 406.770139634801 10.2556390977444 14.7948442534909 931 41217046.99893
4 1452.08707482993 38.9197278911565 236.696598639456 3.94285714285714 14.2068027210884 735 22305606.1986394
5 2327.02958579882 65.491124260355 897.396449704142 14.9585798816568 15.1479289940828 338 22052669.0769224
6 3060.36743215031 62.6179540709812 784.100208768267 18.321503131524 14.8329853862213 479 62626318.7348671
7 3812.38 62.92 543.633333333333 11.8133333333333 15.3066666666667 150 29785847.880002

SQL-MapReduce Call

SELECT *
FROM KMeansPlot (
ON computers_test1 PARTITION BY ANY
ON kmeanssample_centroid DIMENSION
CentroidsTable ('kmeanssample_centroid')
) ORDER BY id, clusterid;

Output
Table 814: KMeansPlot Example Output Table

id clusterid price speed hd ram screen


10 2 2575 50 210 4 15
11 4 2195 33 170 8 15
15 2 2699 50 212 8 14
29 6 3095 33 340 16 14
30 6 3244 66 245 8 14
38 1 3795 66 500 8 14
45 1 3495 50 340 16 14
46 2 2695 33 245 8 14
48 0 1749 25 120 4 14
51 2 2499 33 170 4 14
52 2 2395 33 130 4 14
59 6 2945 66 210 8 17
65 4 2195 66 85 2 14
66 0 1495 25 170 4 14
70 6 3095 66 245 8 14
86 4 1999 33 120 8 14
91 6 2975 50 210 4 17
92 4 2145 66 130 4 14

93 2 2420 33 170 8 15
94 2 2505 50 210 8 14
104 6 2999 66 330 4 15
... ... ... ... ... ...

KModes

Summary
KModes is an extension of KMeans that supports categorical data. KModes models are fit similarly to
KMeans models. The core algorithm is an expectation-maximization algorithm that finds a locally optimal
solution. The main steps to fitting the model are:
• Initialization - A set of K initial cluster centers is selected. This set can be generated using the
RandomSample function (RandomSample) which allows the user to sample rows from an input table
using the kmeans++ and kmeans|| algorithms. These initialization algorithms generate initial cluster
centers that are more likely to lead to better local optima.
• E step - Performed by a mapper. Each point in the input table is assigned to one of the K clusters, and the
sums of the numerical attributes and counts of the categorical attributes are stored.
• M step - Performed by a reducer. The statistics generated by each worker in the E step are aggregated and
new cluster centers are generated. For numerical attributes, the new center is the mean of the value of the
attribute for the points assigned to the cluster. For categorical attributes, the new center is the mode of
the attribute value for the points assigned to the cluster.
The algorithm runs for either a set number of iterations or until the change in movement of the cluster
centers drops below a user-specified threshold.
When assigning points to a cluster, a hybrid distance function that combines a numeric distance and a
categorical distance is required. The default distance between two data points in a KModes model is the
squared Euclidean distance:

d(x, y) = Σ_{j ∈ N} (x_j - y_j)² + Σ_{j ∈ C} w_j δ(x_j, y_j)

where N denotes the indices of numerical attributes, C denotes the indices of categorical attributes, w_j
denotes the weight to be assigned to a category, and δ(x_j, y_j) is 0 if x_j = y_j and 1 otherwise.
The Manhattan distance can also be used:

d(x, y) = Σ_{j ∈ N} |x_j - y_j| + Σ_{j ∈ C} w_j δ(x_j, y_j)
Usage

KModes Syntax
Version 1.0

SELECT * FROM KModes (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('table_name')
OutputTable ('table_name')
[ InitialSeedTable ('table_name') ]
[ ModelIdColumn ('column_name') ]
[ NumClusters ('integer' [,...] ) ]
InputColumns ('InputTable_column' [,...] )
[ Threshold ('double') ]
[ MaxIterNum ('integer') ]
[ Distance ({ 'manhattan' | 'euclidean' }) ]
[ CategoricalDistance ({ 'overlap' | 'hamming' }) ]
[ CategoryWeights ('double' [,...] ) ]
[ AsCategories ({ 'ascat_column' | 'ascat_column_range' }[,...]) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the table that contains the features by which to
cluster the data.
OutputTable Required Specifies the table where the output is stored. The output
table contains the centroids of the clusters.
InitialSeedTable Optional An input table containing the points that serve as initial cluster
centers. InitialSeedTable cannot be used if NumClusters is used
and is required if NumClusters is not used.
ModelIdColumn Optional If this argument is present, it indicates that the table specified
in InitialSeedTable contains more than one set of seed values
(that is, it contains seed values for more than one model).
This argument specifies the column in InitialSeedTable that
identifies which rows are associated with each model.

NumClusters Optional An integer or a list of integers. If a single value is given, the
function trains a model with that number of clusters. If a list of
integers is supplied, the function trains a model for each value.
Initial seeds are selected by performing KMeans|| sampling
with the RandomSample function. NumClusters cannot be used
if InitialSeedTable is used and is required if InitialSeedTable is
not used.
InputColumns Required Specifies the input table columns to use for clustering.
Threshold Optional This is the convergence threshold. When the centroids move by
less than this amount, the algorithm has converged. The input
value must be no less than 0.0. The default value is 0.0395.
MaxIterNum Optional Specifies the maximum number of iterations that the algorithm
runs before quitting if the convergence threshold is not met.
The input value must be an integer greater than 0. The default
value is 10.
Distance Optional Specifies the distance metric that the Kmodes function uses for
numeric dimensions. The permitted input values are
[Manhattan, Euclidean]. The default value is Euclidean.
CategoricalDistance Optional Specifies the distance metric that the Kmodes function uses for
categorical dimensions. The permitted input values are
[Overlap, Hamming]. The default value is Overlap.
Overlap: Distance is 0 if two points are in the same category or
1 if they are in different categories.
Hamming: Used for categories that are strings of equal length.
The fraction of characters that are different.
CategoryWeights Optional The weights to be assigned to each category in the KModes
distance.
AsCategories Optional Indicates which numeric categories to interpret as categorical
variables. Input columns must contain numeric SQL types.

Input
The function has one required input table and one optional input table. The required input table contains
the data points to be clustered with one dimension in each column.
Table 815: KModes Input Table Schema

Column Name Data Type Description


dimension_i Any Data for dimension i. The table has columns dimension_1
through dimension_n, where n is the number of dimensions.
Each dimension is a feature by which to cluster the data. You
need not specify n; the function determines it automatically.

If you do not provide a value in the NumClusters argument, you must provide an initial seed table. This
table has the same schema as the preceding table.
The function allows you to try different combinations of seeds to generate multiple models simultaneously.
You can then compare the model metrics to find the best model. There are two ways to generate multiple
models:
• You can specify multiple values in the NumClusters argument. For example, NumClusters('3', '3', '4') fits
3 models, two with 3 clusters and one model with 4 clusters. It is a good practice to try multiple
initializations when fitting KModes, which is why you might use the same number more than once.
• You can use the function RandomSample to select multiple sets of rows from the input data table, and
use these randomly selected samples as seeds (a sketch of this workflow follows the list). To do this,
follow these steps:
1. Run RandomSample. Assign the argument NumSample a set of values x1, x2, ..., xn, where n is the
number of different sets of rows to generate (this becomes the number of models later created by
KModes) and xi is the number of seed rows to select for each model (this determines the number of
clusters in model i later created by KModes).
2. Save the output of the RandomSample run to a table. This table has a column, set_id, that identifies
each set of points.
3. In the KModes function call, set InitialSeedTable to the name of the table you generated, and set
ModelIdColumn to 'set_id'.
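The following is a minimal sketch of that workflow. The KModes arguments are the ones documented
above, but the RandomSample invocation is abbreviated (its ON clause and remaining arguments are
assumptions here, and the column choices are only illustrative); refer to the RandomSample section for the
exact syntax:

-- Steps 1 and 2 (sketch): draw two seed sets (3 rows and 4 rows) from the
-- input table and save them; RandomSample emits a set_id column that
-- identifies each seed set. Other RandomSample arguments are omitted.
CREATE TABLE kmodes_seeds AS
SELECT * FROM RandomSample (
ON kmodes_input AS data PARTITION BY ANY
NumSample ('3', '4')
);

-- Step 3: fit one model per seed set; set_id tells KModes which seed rows
-- belong to which model.
SELECT * FROM KModes (
ON (SELECT 1) PARTITION BY 1
InputTable ('kmodes_input')
OutputTable ('kmodes_models')
InitialSeedTable ('kmodes_seeds')
ModelIdColumn ('set_id')
InputColumns ('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec', 'cyl', 'gear', 'carb', 'vs', 'am')
AsCategories ('cyl', 'gear', 'carb')
);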

Output
The output displayed on the screen is a summary table containing statistics about the KModes run. There
are four columns if a single model is trained, or five if multiple models are trained simultaneously. The three
right-most columns are separated from the summary so that users can sort by them and quickly find the best
model.
Table 816: KMode Output Summary Table

Column Name Data Type Description


<model_id> VARCHAR Only appears if multiple models are trained.
Integer, starting with 0, identifying the
model.
summary VARCHAR Presents the following data about the model:
Number of Clusters (number of clusters
found in the model)
Number of Iterations (number of iterations
required)
Model Converged (whether or not the
algorithm converged)
Number of Data Points (number of input
rows used to build the model)
between_cluster_error DOUBLE PRECISION Sum of squared distances of centroids to the global mean, where
the squared distance of each mean to the global mean is multiplied by the number of data points in the
cluster.

total_within_cluster_error DOUBLE PRECISION The sum over all clusters of the within-cluster error
(within_cluster_ss).
pseudo_f DOUBLE PRECISION The value given by this formula:
(between_cluster_error / (K - 1)) / (total_within_cluster_error / (N - K)),
where N is the total number of data points, or the total weight if the points are weighted, and K is the
number of clusters.

Table 817: KModes Output Model Table

Column Name Data Type Description


<model_id> INTEGER ID of the model.
cluster_id INTEGER ID assigned to the cluster.
numerical attributes DOUBLE PRECISION One column appears for each numerical dimension from the
input table.
categorical attributes VARCHAR One column appears for each categorical dimension from the input
table.
within_cluster_ss DOUBLE PRECISION The sum, over all points in the cluster, of the distance between
each point and the cluster center, as calculated by the Distance metric.
cluster_weight DOUBLE PRECISION Total weight of the data points assigned to the cluster.
distance_metric VARCHAR The value of the Distance argument in the function call, copied to the output
table so that you do not have to specify it again when using KModesPredict.
category_weights VARCHAR The value of the CategoryWeights argument in the function call, copied to
the output table so that you do not have to specify it again when using KModesPredict.

Examples
• Input
• Example 1: Using InitialSeedTable
• Example 2: Using NumClusters

Input
Both examples use the input table kmodes_input, which has 32 observations of 11 variables, describing
different models of cars. The table has two categorical variables ('vs', 'am'); three numeric variables ('cyl',
'gear', 'carb') that are treated as categories in the SQL-MapReduce call; and six normalized numeric
variables ('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec'). The variables are listed below:

• mpg - miles/(US) gallon
• cyl - Number of cylinders
• disp - Displacement (cu.in.)
• hp - Gross horsepower
• drat - Rear axle ratio
• wt - Weight (lb/1000)
• qsec - 1/4 mile time
• vs - V/S
• am - Transmission (automatic/manual)
• gear - Number of forward gears
• carb - Number of carburetors
Table 818: KModes Example Input Table kmodes_input (Columns 1-5)

model mpg cyl disp hp


AMC Javelin -0.811459624 8 0.591244935 0.048313323
Cadillac Fleetwood -1.607882616 8 1.946753815 0.850496796
Camaro Z28 -1.126710392 8 0.962396176 1.433902959
Chrysler Imperial -0.894420352 8 1.688561647 1.215125648
Datsun 710 0.449543447 4 -0.990182091 -0.783040459
Dodge Challenger -0.761683187 8 0.704204008 0.048313323
Duster 360 -0.960788935 8 1.043081228 1.433902959
Ferrari Dino -0.064813069 6 -0.691647397 0.412942174
Fiat 128 2.042389431 4 -1.226589294 -1.176839619
Fiat X1-9 1.196190002 4 -1.224168743 -1.176839619
Ford Pantera L -0.71190675 8 0.970464681 1.711020886
Honda Civic 1.710546517 4 -1.25079481 -1.381031775
Hornet 4 Drive 0.217253407 6 0.220093694 -0.53509284
Hornet Sportabout -0.230734526 8 1.043081228 0.412942174
Lincoln Continental -1.607882616 8 1.849931752 0.996348337
Lotus Europa 1.710546517 4 -1.094265808 -0.491337378
Maserati Bora -0.844643915 8 0.567039419 2.746566825
Mazda RX4 0.150884825 6 -0.570619819 -0.53509284
Mazda RX4 Wag 0.150884825 6 -0.570619819 -0.53509284
Merc 230 0.449543447 4 -0.725535119 -0.753870151
Merc 240D 0.715017777 4 -0.677930938 -1.235180235
Merc 280 -0.147773797 6 -0.509299179 -0.345485837
Merc 280C -0.380063837 6 -0.509299179 -0.345485837

Merc 450SE -0.612353876 8 0.363713088 0.485867945
Merc 450SL -0.463024565 8 0.363713088 0.485867945
Merc 450SLC -0.811459624 8 0.363713088 0.485867945
Pontiac Firebird -0.147773797 8 1.365821438 0.412942174
Porsche 914-2 0.980492108 4 -0.890939476 -0.812210767
Toyota Corolla 2.291271616 4 -1.287909934 -1.191424773
Toyota Corona 0.233845553 4 -0.892553178 -0.724699843
Valiant -0.3302874 6 -0.046166978 -0.60801861
Volvo 142E 0.217253407 4 -0.885291523 -0.549677994

Table 819: KModes Example Input Table kmodes_input (Columns 6-12)

drat wt qsec vs am gear carb


-0.835197792 0.22254417 -0.307088658 S automatic 3 2
-1.246659826 2.077504765 0.073449451 S automatic 3 4
0.24956575 0.636460997 -1.364760755 S automatic 3 4
-0.685575235 2.174596366 -0.239934874 S automatic 3 4
0.473999587 -0.917004624 0.426006817 V manual 4 1
-1.564607761 0.309415603 -0.54772305 S automatic 3 2
-0.722980874 0.360516446 -1.124126363 S automatic 3 4
0.043834734 -0.457097039 -1.314395417 S manual 5 6
0.90416444 -1.039646647 0.907275602 V manual 4 1
0.90416444 -1.310481114 0.588295128 V manual 4 1
1.166003916 -0.048290296 -1.874010283 S manual 5 4
2.493904115 -1.637526508 0.375641479 V manual 4 2
-0.96611753 -0.002299538 0.890487156 V automatic 3 1
-0.835197792 0.227654255 -0.46378082 S automatic 3 2
-1.115740088 2.255335698 -0.016088927 S automatic 3 4
0.324377029 -1.741772228 -0.530934604 V manual 5 2
-0.105787824 0.360516446 -1.818048797 S manual 5 8
0.567513685 -0.610399567 -0.777165145 S manual 4 4
0.567513685 -0.349785269 -0.46378082 S manual 4 4
0.604919325 -0.068730634 2.826754593 V automatic 4 2
0.174754472 -0.027849959 1.203871481 V automatic 4 2

0.604919325 0.227654255 0.252526208 V automatic 4 4
0.604919325 0.227654255 0.588295128 V automatic 4 4
-0.98482035 0.871524874 -0.251127171 S automatic 3 3
-0.98482035 0.524039143 -0.139204198 S automatic 3 3
-0.98482035 0.575139986 0.084641749 S automatic 3 3
-0.96611753 0.641571082 -0.446992374 S automatic 3 2
1.55876313 -1.100967659 -0.642857578 S manual 5 2
1.166003916 -1.4126828 1.147909994 V manual 4 1
0.193457291 -0.76881218 1.20946763 V automatic 3 1
-1.564607761 0.248094592 1.326986752 V automatic 3 1
0.960272899 -0.44687687 0.420410668 V manual 4 2

The kmodes_init table is an additional input that contains three points that serve as initial cluster centers.
Table 820: KModes Example Input Table kmodes_init (Columns 1-5)

model mpg cyl disp hp


Datsun 710 0.449543447 4 -0.990182091 -0.783040459
Ferrari Dino -0.064813069 6 -0.691647397 0.412942174
Lincoln Continental -1.607882616 8 1.849931752 0.996348337

Table 821: KModes Example Input Table kmodes_init (Columns 6-12)

drat wt qsec vs am gear carb


0.473999587 -0.917004624 0.426006817 V manual 4 1
0.043834734 -0.457097039 -1.314395417 S manual 5 6
-1.115740088 2.255335698 -0.016088927 S automatic 3 4

Example 1: Using InitialSeedTable

SQL-MapReduce Call

DROP TABLE IF EXISTS kmodes_clusters;


SELECT * FROM kmodes (
ON (SELECT 1) PARTITION BY 1
InputTable ('kmodes_input')
InitialSeedTable ('kmodes_init')
OutputTable ('kmodes_clusters')
InputColumns ('mpg:carb')
AsCategories ('cyl', 'gear', 'carb')
);

Output
With the InitialSeedTable argument, the cluster centers and assignments are the same every time you run
the function with the same distance metric (in this case the default, Euclidean).
Table 822: KModes Example 1 Output Table

summary between_cluster_error total_within_cluster_error pseudo_f


Number of Clusters: 3 195.82116758156 113.178832431262 16.7251942136624
Number of Iterations: 5
Model Converged: true
Number of Data Points: 32.0

The model table 'kmodes_clusters' is generated.


The following query returns the output shown in the table kmodes_clusters:

SELECT * FROM kmodes_clusters ORDER BY 1;

Table 823: KModes Example 1 Output Table kmodes_clusters (Columns 1-5)

cluster_id mpg disp hp drat


0 -0.724943435928571 0.890010157642857 0.511912862714286 -0.9434069635
1 -0.2639188168 -0.059076587 0.760068841 0.4478156392
2 0.882215552923077 -0.935750713230769 -0.843624945153846 0.843739945692308

Table 824: KModes Example 1 Output Table kmodes_clusters (Columns 6-12)

wt qsec cyl vs am gear carb


0.794435602785714 -0.180375863 8 S automatic 3 4
-0.221011145 -1.2494800924 6 S manual 5 4
-0.770541747153846 0.674820195846154 4 V manual 4 2

Table 825: KModes Example 1 Output Table kmodes_clusters (Columns 13-16)

within_cluster_ss cluster_weight distance_metric category_weights


43.6174097337196 14 EUCLIDEAN,OVERLAP [1.0,1.0,1.0,1.0,1.0]
20.7870494221671 5
48.7743732753756 13

Example 2: Using NumClusters
This example specifies NumClusters('3') to obtain three clusters. Because different initial cluster centers are
chosen each time you run the example, the resulting cluster centers and assignments might differ from those
shown.

SQL-MapReduce Call

DROP TABLE IF EXISTS kmodes_clusters1;


SELECT * FROM kmodes (
ON (SELECT 1) PARTITION BY 1
InputTable ('kmodes_input')
NumClusters ('3')
OutputTable ('kmodes_clusters1')
InputColumns ('mpg:carb')
AsCategories ('cyl', 'gear', 'carb')
);

Output

Table 826: KModes Example 2 Output Table

set_id summary between_cluster_error total_within_cluster_error pseudo_f


0 Number of Clusters: 3 189.040612061326 110.959387951497 16.4690218375959
Number of Iterations: 3
Model Converged: true
Number of Data Points: 32.0

The following query returns the output shown in the table kmodes_clusters1:

SELECT * FROM kmodes_clusters1 ORDER BY 1;

Table 827: KModes Example 2 Output Table kmodes_clusters1 (Columns 1-6)

set_id cluster_id mpg disp hp drat


0 0 -0.6870185315 0.45206321975 1.576108211 0.338404144
0 1 -0.694038285461538 0.884442002384615 0.440990547615385 -1.03517409530769
0 2 0.7847047892 -0.887066594 -0.802487331133333 0.8069097776

Table 828: KModes Example 2 Output Table kmodes_clusters1 (Columns 7-13)

wt qsec cyl vs am gear carb


0.122897527 -1.592803813 8 S manual 5 4

0.806587495538462 -0.0892693328461538 8 S automatic 3 2
-0.731815169933333 0.502114438733333 4 V manual 4 2

Table 829: KModes Example 2 Output Table kmodes_clusters1 (Columns 14-17)

within_cluster_ss cluster_weight distance_metric category_weights


12.1398871106861 4 EUCLIDEAN,OVERLAP [1.0, 1.0, 1.0, 1.0, 1.0]
39.4522701602673 13
59.3672306805433 15

KModesPredict

Summary
KModesPredict is the prediction function corresponding to KModes.

Usage

KModesPredict Syntax
Version 1.0

SELECT * FROM KModesPredict (


ON { table | view | query } AS input PARTITION BY ANY
ON { table | view | query } AS model DIMENSION
[ TestModels ('string' [,...] ) ]
[ PrintDistance ({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TestModels Optional Specifies the model IDs to use for prediction. The default behavior is to
use all models.
PrintDistance Optional Specifies whether to output the distance from each observation to its
closest centroid. The default value is false.
Accumulate Optional Columns from the input table to be passed through to the output table.

Input
The model table must be the output table of the KModes function. The input table can have any columns;
however, every attribute in the model table must also be present as a column in the input table.

Output
The output schema is:
Table 830: KModesPredict Output Table Schema

Column Name Data Type Description


<set_id> VARCHAR Only appears if multiple models are used for prediction.
The model identifier.
cluster_id VARCHAR Cluster to which the point is assigned.
distance Double Only appears if PrintDistance was set to True. Contains the
distance from each observation to its closest centroid.
numerical attributes Double Numerical attributes from the input table that were used to
create the model(s).
categorical attributes VARCHAR Categorical attributes from the input table that were used
to create the model(s).
accumulate columns Any Columns from the input table passed through to the output
table.
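To illustrate the optional distance column, the following sketch enables PrintDistance in a call of the same
shape as the example below; the table names are the ones that example uses, and the output shown in this
chapter was produced without this argument.

-- Sketch: the example call that follows, with PrintDistance enabled so
-- that each row also reports the distance from the observation to its
-- assigned cluster center.
SELECT * FROM kmodespredict (
ON kmodes_clusters AS model DIMENSION
ON kmodes_input AS input PARTITION BY ANY
PrintDistance ('true')
Accumulate ('model')
) ORDER BY cluster_id, model;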

Example

Input
KModes Example Input Table kmodes_input and the model table produced by the KModes function
'kmodes_clusters' are used for cluster prediction.

SQL-MapReduce Call

SELECT * FROM kmodespredict (


ON kmodes_clusters AS model DIMENSION
ON kmodes_input AS input PARTITION BY any
Accumulate ('model')
) ORDER BY cluster_id, model;

Output
Each input row is assigned to one of the three clusters, as shown below.

Table 831: KModesPredict Example Output Table (Columns 1-5)

cluster_id mpg disp hp drat


0 -0.811459624 0.591244935 0.048313323 -0.835197792
0 -1.607882616 1.946753815 0.850496796 -1.246659826
0 -1.126710392 0.962396176 1.433902959 0.24956575
0 -0.894420352 1.688561647 1.215125648 -0.685575235
0 -0.761683187 0.704204008 0.048313323 -1.564607761
0 -0.960788935 1.043081228 1.433902959 -0.722980874
0 0.217253407 0.220093694 -0.53509284 -0.96611753
0 -0.230734526 1.043081228 0.412942174 -0.835197792
0 -1.607882616 1.849931752 0.996348337 -1.115740088
0 -0.612353876 0.363713088 0.485867945 -0.98482035
0 -0.463024565 0.363713088 0.485867945 -0.98482035
0 -0.811459624 0.363713088 0.485867945 -0.98482035
0 -0.147773797 1.365821438 0.412942174 -0.96611753
0 -0.3302874 -0.046166978 -0.60801861 -1.564607761
1 -0.064813069 -0.691647397 0.412942174 0.043834734
1 -0.71190675 0.970464681 1.711020886 1.166003916
1 -0.844643915 0.567039419 2.746566825 -0.105787824
1 0.150884825 -0.570619819 -0.53509284 0.567513685
1 0.150884825 -0.570619819 -0.53509284 0.567513685
2 0.449543447 -0.990182091 -0.783040459 0.473999587
2 2.042389431 -1.226589294 -1.176839619 0.90416444
2 1.196190002 -1.224168743 -1.176839619 0.90416444
2 1.710546517 -1.25079481 -1.381031775 2.493904115
2 1.710546517 -1.094265808 -0.491337378 0.324377029
2 0.449543447 -0.725535119 -0.753870151 0.604919325
2 0.715017777 -0.677930938 -1.235180235 0.174754472
2 -0.147773797 -0.509299179 -0.345485837 0.604919325
2 -0.380063837 -0.509299179 -0.345485837 0.604919325
2 0.980492108 -0.890939476 -0.812210767 1.55876313
2 2.291271616 -1.287909934 -1.191424773 1.166003916
2 0.233845553 -0.892553178 -0.724699843 0.193457291
2 0.217253407 -0.885291523 -0.549677994 0.960272899

Table 832: KModesPredict Example Output Table (Columns 6-13)

wt qsec cyl vs am gear carb model


0.22254417 -0.307088658 8 S automatic 3 2 AMC Javelin
2.077504765 0.073449451 8 S automatic 3 4 Cadillac Fleetwood
0.636460997 -1.364760755 8 S automatic 3 4 Camaro Z28
2.174596366 -0.239934874 8 S automatic 3 4 Chrysler Imperial
0.309415603 -0.54772305 8 S automatic 3 2 Dodge Challenger
0.360516446 -1.124126363 8 S automatic 3 4 Duster 360
-0.002299538 0.890487156 6 V automatic 3 1 Hornet 4 Drive
0.227654255 -0.46378082 8 S automatic 3 2 Hornet Sportabout
2.255335698 -0.016088927 8 S automatic 3 4 Lincoln Continental
0.871524874 -0.251127171 8 S automatic 3 3 Merc 450SE
0.524039143 -0.139204198 8 S automatic 3 3 Merc 450SL
0.575139986 0.084641749 8 S automatic 3 3 Merc 450SLC
0.641571082 -0.446992374 8 S automatic 3 2 Pontiac Firebird
0.248094592 1.326986752 6 V automatic 3 1 Valiant
-0.457097039 -1.314395417 6 S manual 5 6 Ferrari Dino
-0.048290296 -1.874010283 8 S manual 5 4 Ford Pantera L
0.360516446 -1.818048797 8 S manual 5 8 Maserati Bora
-0.610399567 -0.777165145 6 S manual 4 4 Mazda RX4
-0.349785269 -0.46378082 6 S manual 4 4 Mazda RX4 Wag
-0.917004624 0.426006817 4 V manual 4 1 Datsun 710
-1.039646647 0.907275602 4 V manual 4 1 Fiat 128
-1.310481114 0.588295128 4 V manual 4 1 Fiat X1-9
-1.637526508 0.375641479 4 V manual 4 2 Honda Civic
-1.741772228 -0.530934604 4 V manual 5 2 Lotus Europa
-0.068730634 2.826754593 4 V automatic 4 2 Merc 230
-0.027849959 1.203871481 4 V automatic 4 2 Merc 240D
0.227654255 0.252526208 6 V automatic 4 4 Merc 280
0.227654255 0.588295128 6 V automatic 4 4 Merc 280C
-1.100967659 -0.642857578 4 S manual 5 2 Porsche 914-2

-1.4126828 1.147909994 4 V manual 4 1 Toyota Corolla
-0.76881218 1.20946763 4 V automatic 3 1 Toyota Corona
-0.44687687 0.420410668 4 V manual 4 2 Volvo 142E

Minhash

Summary
The Minhash function uses transaction history to cluster similar items or users together. For example, the
function can cluster items that are frequently bought together or users that bought the same items.

Background
Data analysis frequently requires the detection of similarity between items in large transactional data sets.
Canopy and k-means clustering perform well with physical data, but transactional data often requires less
restrictive forms of analysis. Locality-sensitive hashing, or minhash, is a particularly effective way of
clustering items based on the Jaccard metric of similarity.
Minhash assigns a pair of users to the same cluster with probability proportional to the overlap between the
set of items that they have bought. Each user u is represented by a set of items that he or she has bought. The
similarity between users ui and uj is defined as the overlap between their item sets, given by the intersection
of the item sets divided by the union of the item sets. This quotient is called the Jaccard coefficient or Jaccard
metric.
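In symbols (a standard statement of the metric, added here for reference), with Ii and Ij denoting the item
sets of users ui and uj:

% Jaccard coefficient of two users' item sets:
S(u_i, u_j) = \frac{|I_i \cap I_j|}{|I_i \cup I_j|}
% For example, I_i = \{1, 3, 5, 8\} and I_j = \{3, 5, 9\} share two of
% five distinct items, giving S = 2/5 = 0.4.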
Minhash calculates one or more cluster identifiers for each user as the hash value(s) of a randomly chosen
item from a permutation of the set of items that the user has bought. With a universal class of hash
functions, the probability that two users are hashed to the same cluster identifier equals their Jaccard
coefficient S.
If cluster identifiers are formed by concatenating p hash values, each generated by hashing a random item
from the item set with one of several hash functions, then the probability that any two users have the same
hash key is S^p.
If each user is assigned to multiple clusters, the probability that two users have the same hash key increases,
causing more effective clustering. Therefore, minhash computes several cluster identifiers for each user.
Minhash produces each cluster identifier by selecting an item from the user’s item set, hashing it with each
of several hash functions, and concatenating p hash values. Therefore, p (the number of key groups) must be
a divisor of the number of hash functions. (The item that minhash selects from the item set is the one that
produces the minimum hash value for a particular hash function, hence the name of the algorithm.)
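The effect of concatenation can be made concrete; the numbers here are illustrative only:

% Probability that two users with Jaccard similarity S share a cluster
% identifier built from p concatenated hash values:
P(\text{same key}) = S^{p}
% Example: S = 0.4 and p = 3 give 0.4^3 = 0.064, so a shared key is
% strong evidence of similarity; computing several identifiers per user
% restores the chance that similar users land in at least one common cluster.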

Usage

Minhash Syntax
Version 2.2

SELECT * FROM Minhash (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
IDColumn ('id_column')
ItemsColumn ('items_column')
[ SeedTable ('seed_table_to_use') ]
[ SaveSeedTo ('seed_table_to_save') ]
HashNum ('number_of_hash_functions')
KeyGroups ('number_of_key_groups')
[ InputFormat ({ 'bigint' | 'integer' | 'string' | 'hex' }) ]
[ MinClusterSize ('minimum_cluster_size') ]
[ MaxClusterSize ('maximum_cluster_size') ]
[ Delimiter ('delimiter') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the input table.
OutputTable Required Specifies the name of the output table.
IDColumn Required Specifies the name of the input table column that contains the values to
be hashed into the same cluster. Typically these values are customer
identifiers.
ItemsColumn Required Specifies the name of the input column that contains the values to use for
hashing.
SeedTable Optional Specifies the name of the table that contains the seeds to use for hashing.
Typically, this table was created by an earlier Minhash call that specified
its name in the SaveSeedTo argument.
SaveSeedTo Optional Specifies the name of the table where seeds are to be saved.

HashNum Required Specifies the number of hash functions to generate. The
number_of_hash_functions determines the number and size of clusters
generated.
KeyGroups Required Specifies the number of key groups to generate. The
number_of_keygroups must be a divisor of number_of_hash_functions.
A large number_of_keygroups decreases the probability that multiple
users are assigned to the same cluster identifier.
InputFormat Optional Specifies the format of the values to be hashed (the values in
items_column).
MinClusterSize Optional Specifies the minimum cluster size. The default value is 3.
MaxClusterSize Optional Specifies the maximum cluster size. The default value is 5.
Delimiter Optional Specifies the delimiter used between hashed values (typically customer
identifiers) in the output. The default value is the space character.

Example

Input
The input table (salesdata) consists of 341 distinct users and the items they have purchased in an office
supplies store. For ease of use, the items are assigned itemids (shown in the following table) which are then
used in the input table.
Table 833: Minhash Example Items and Itemids

item itemid
Storage 1
Appliances 2
Binders 3
Telephones 4
Paper 5
Rubber Bands 6
Computer Peripherals 7
Office Furnishings 8
Office Machines 9
Envelopes 10
Bookcases 11
Tables 12
Pens & Art Supplies 13

Chairs & Chairmats 14
Scissors 15
Rulers & Trimmers 16
Copiers & Fax Storage 17
Labels 18

Table 834: Minhash Example Input Table salesdata

userid itemid
1 1
2 23
3 4
4 2
5 31
6 1
7 5
8 56
9 21
10 3
11 8
12 10
13 11 4
... ...

SQL-MapReduce Call

SELECT *
FROM minhash (
ON (SELECT 1)
PARTITION BY 1
InputTable ('salesdata')
OutputTable ('minhashoutput')
IDColumn ('userid')
ItemsColumn ('itemid')
HashNum ('1002')
KeyGroups ('3')
InputFormat ('integer')
MinClusterSize ('3')
MaxClusterSize ('5')
);

The number of hash functions must be an integer multiple of the number of key groups, because each
clusterid is generated by concatenating KeyGroups hash codes. In this call, 1002 hash functions with 3 key
groups give each user 1002/3 = 334 candidate cluster identifiers. The larger the number of key groups, the
lower the probability that multiple users share a cluster identifier, so fewer clusters are obtained.

Output
The following query returns the output shown in the following table:

SELECT * FROM minhashoutput ORDER BY clusterid;

Table 835: Minhash Example Output Table

clusterid userid
1002732123681872942919652130 142 153 22 229 273
10191305779223184216324476 106 65 94
102623915513963258275858860 15 154 200 219 227
10521510524181490254808958 106 162 41 76
1057328301636481327290076924 145 336 64 73
111640426347546462487275395 159 199 329
111640426379300784959427683 172 201 8
1145291930783954549119382258 116 16 255
11574213171254045121408249132 116 126 264
1174195802405410071547744710 220 323 336 64 73
1178104602478564384799399977 233 336 64 73
12111042574047172271448914 105 233 336 64 73
... ...

CHAPTER 8
Naive Bayes

Naive Bayes
• What is Naive Bayes?
• Naive Bayes Functions
• Naive Bayes Example

What is Naive Bayes?


The Naive Bayes classification algorithm is very simple, yet surprisingly effective. Given a training data set
with known discrete outcomes and either discrete or continuous input variables, the algorithm generates a
model and uses it to predict the outcome of future observations, based on their input variables.
The main components of the Naive Bayes model are:
• Bayes’ Theorem, a classical law that states that the probability of observing an outcome given the data is
proportional to the probability of observing the data given the outcome times the prior probability of the
outcome.
• The naive probability model, which assumes that the input variables are mutually independent,
conditional on the outcome. This is a very strong assumption that is rarely true in real life, but it makes
computation of all model parameters extremely simple, and violating the assumption usually does not
significantly hurt the model.
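Combined, these two components give the familiar classification rule, written here in symbols for reference
(a standard formulation, not reproduced from the original guide); C is the outcome and x_1, ..., x_n are the
input variables:

% Bayes' theorem plus the naive independence assumption:
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
% The predicted class maximizes the equivalent sum of log likelihoods,
% which is what the loglik_* columns of NaiveBayesPredict report:
\hat{C} = \arg\max_{C} \Big( \log P(C) + \sum_{i=1}^{n} \log P(x_i \mid C) \Big)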

Naive Bayes Functions


• Summary
• NaiveBayesMap and NaiveBayesReduce
• NaiveBayesPredict
• Naive Bayes Example

Note:
For the Naive Bayes functions designed specifically for text classification, refer to Naive Bayes Text
Classifier.

Summary
The Naive Bayes classifier executes these functions:
• NaiveBayesMap and NaiveBayesReduce, which generate a model from training data
• NaiveBayesPredict, which uses the model to make predictions about testing data


Note:
You must grant the EXECUTE privilege on the NaiveBayesMap, NaiveBayesReduce, and
NaiveBayesPredict functions to the database user who will run them. For more information, refer to Set
Permissions to Allow Users to Run Functions.
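A minimal sketch of such a grant follows; db_user is a placeholder, and the authoritative GRANT syntax for
SQL-MapReduce functions is given in Set Permissions to Allow Users to Run Functions.

-- Hypothetical sketch only; db_user is a placeholder user name.
GRANT EXECUTE ON FUNCTION NaiveBayesMap TO db_user;
GRANT EXECUTE ON FUNCTION NaiveBayesReduce TO db_user;
GRANT EXECUTE ON FUNCTION NaiveBayesPredict TO db_user;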

NaiveBayesMap and NaiveBayesReduce

Summary
The NaiveBayesMap and NaiveBayesReduce functions generate a model from training data. A table of
training data is input to the NaiveBayesMap function, whose output is input to the NaiveBayesReduce
function, which outputs the model.

Usage

Naive Bayes Syntax


Version 1.3

CREATE TABLE model_table_name (PARTITION KEY(column_name)) AS


SELECT * FROM NaiveBayesReduce (
ON (
SELECT * FROM NaiveBayesMap (
ON input_table
ResponseColumn ('response_column')
NumericInputs ({ 'numeric_input_column' |
'numeric_input_column_range' }[,...] )
CategoricalInputs ({ 'categorical_input_column' |
'categorical_input_column_range' }[,...] )
)
) PARTITION BY column_name
);

Arguments
Argument Category Description
ResponseColumn Required Specifies the name of the input table column that contains the
response variable, passed as text.
NumericInputs Either NumericInputs or CategoricalInputs is required Specifies input table columns that
contain numeric values to be included in the model. The syntax of numeric_input_columns is:

{'column_name [,...]' | '[start_column:end_column]'}[,...]

For example:

'input1','[4:21]','[25:53]','input73, input80', '[25:53]'

The first column index is 0.

CategoricalInputs Either NumericInputs or CategoricalInputs is required Specifies input table columns
that contain categorical values to be included in the model. The syntax of categorical_input_columns is the
same as the syntax of numeric_input_columns.

Input
The NaiveBayesMap function has one input table, which contains the training data. Each row represents one
observation. The following table describes the input table columns that function arguments can specify.
Table 836: NaiveBayesMap Input (Training) Table Schema

Column Name Data Type Description


response_column VARCHAR, BOOLEAN, or INTEGER Contains responses for the observations. Specified
by the ResponseColumn argument. Note: The function ignores rows that have NULL in this column.
numeric_column Any SQL numeric data type Contains numeric values to be included in the model. The
table can have many such columns. Specified by the NumericInputs argument.
categorical_column VARCHAR or INTEGER Contains categorical values to be included in the model. The
table can have many such columns. Specified by the CategoricalInputs argument.
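For concreteness, a minimal training table matching this schema could be created as follows. This is an
illustrative sketch only; the table and column names are invented and do not appear elsewhere in this guide.

-- Sketch: one response column, one numeric input, one categorical input.
CREATE TABLE nb_train (
id INTEGER,            -- row identifier (not used by the trainer)
age DOUBLE PRECISION,  -- a candidate for NumericInputs
color VARCHAR,         -- a candidate for CategoricalInputs
outcome VARCHAR        -- a candidate for ResponseColumn
) DISTRIBUTE BY hash(id);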

Output
The NaiveBayesMap function output is input to the NaiveBayesReduce function. The NaiveBayesReduce
function outputs a model table. The following table describes the model table.

Table 837: NaiveBayesReduce Output (Model) Table Schema

Column Name Data Type Description


class VARCHAR Response value (class label).
variable VARCHAR Input variable (name of the input column).
type VARCHAR Type of the input variable ('NUMERIC' or 'CATEGORICAL').
category VARCHAR If type is 'NUMERIC': NULL. If type is 'CATEGORICAL': category of the input
categorical variable.
cnt BIGINT Count of observations with this class, variable, and category.
sum DOUBLE PRECISION If type is 'NUMERIC': sum of variable values for observations with this class,
variable, and category. If type is 'CATEGORICAL': NULL.
sumSq DOUBLE PRECISION If type is 'NUMERIC': sum of squared variable values for observations with
this class, variable, and category. If type is 'CATEGORICAL': NULL.
totalcnt BIGINT Total count of observations.
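The model table therefore stores sufficient statistics rather than fitted parameters. For a numeric variable
and class, the per-class Gaussian parameters can be recovered from cnt, sum, and sumSq as sketched below;
the exact denominator the function uses (cnt or cnt - 1) is not stated here, so this reconstruction is
indicative only.

% Mean and variance of numeric variable v within class c:
\hat{\mu}_{c,v} = \frac{\text{sum}}{\text{cnt}}, \qquad
\hat{\sigma}^{2}_{c,v} \approx \frac{\text{sumSq} - \text{sum}^{2}/\text{cnt}}{\text{cnt} - 1}
% The class prior follows from the counts: P(c) = cnt / totalcnt.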

Example

For a complete worked example that builds a model from the iris data set with NaiveBayesMap and
NaiveBayesReduce, scores a test set with NaiveBayesPredict, and computes prediction accuracy, refer to
Naive Bayes Example.

NaiveBayesPredict

Summary
The NaiveBayesPredict function uses the model output by the NaiveBayesReduce function to predict the
outcomes for a test set of data.
This function can be used with real-time applications. Refer to AMLGenerator.

Usage

NaiveBayesPredict Syntax
Version 1.4

SELECT * FROM NaiveBayesPredict (


ON input_table
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
Model ('model_table_name')
IDCol ('test_point_id_col')
NumericInputs ({ 'numeric_input_column' |
'numeric_input_column_range' }[,...] )
CategoricalInputs ({ 'categorical_input_column' |
'categorical_input_column_range' }[,...] )
);

Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
Model Required Specifies the name of the model table generated by the NaiveBayesReduce function.
IDCol Required Specifies the name of the column that contains the ID that uniquely identifies the test
input data.
NumericInputs Either NumericInputs or CategoricalInputs is required Specifies the same
numeric_input_columns that you specified when you used the NaiveBayesMap and NaiveBayesReduce
functions to generate the model table from the training data.
CategoricalInputs Either NumericInputs or CategoricalInputs is required Specifies the same
categorical_input_columns that you specified when you used the NaiveBayesMap and NaiveBayesReduce
functions to generate the model table from the training data.

Input
The NaiveBayesPredict function has two input tables: the model table output by the NaiveBayesReduce
function (see the Output section) and a table of test data. The test table has the same schema as the training
table (see the Input section of NaiveBayesMap and NaiveBayesReduce).

Output
The NaiveBayesPredict function outputs a table of predictions for the observations in the test table. Each
row represents one observation.
Table 844: NaiveBayesPredict Output Table Schema

Column Name Data Type Description


id INTEGER Row identifier.
prediction VARCHAR Prediction for this observation.
loglik_response_i DOUBLE PRECISION Loglikelihood (natural logarithm of the probability)
that this observation has response_i. The table has one
column for each possible response. For example, if the
possible responses are 'Yes' and 'No', then the table has
columns loglik_Yes and loglik_No.

Example

For a complete worked example of NaiveBayesPredict, including the model-generation step and a
prediction-accuracy check, refer to Naive Bayes Example.


Naive Bayes Example

NaiveBayesMap Input: Training Table


This example uses the well-known 'iris' dataset (nb_input_iris). Each observation has values for four
attributes (sepal_length, sepal_width, petal_length, and petal_width) and belongs to one of three species
(setosa, versicolor, and virginica). From the raw input data, a training set and a test set are created. The
functions NaiveBayesMap and NaiveBayesReduce use the training set to generate the model. The
NaiveBayesPredict function uses that model to predict the species for the test set. Finally, SQL code is shown
that determines prediction accuracy by comparing the original and predicted results.
Table 851: Naive Bayes Example Iris Table nb_input_iris

id sepal_length sepal_width petal_length petal_width species


1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3 1.4 0.1 setosa
14 4.3 3 1.1 0.1 setosa
... ... ... ... ... ...

Split Input into Training and Testing Data Sets


This code divides the 150 data rows into a training data set (80%) and a testing dataset (20%):

DROP TABLE IF EXISTS nb_iris_input_train;


DROP TABLE IF EXISTS nb_iris_input_test;
CREATE TABLE nb_iris_input_train AS
SELECT * FROM nb_input_iris WHERE id%5!=0;
CREATE TABLE nb_iris_input_test AS
SELECT * FROM nb_input_iris WHERE id%5=0;
SELECT * FROM nb_iris_input_train ORDER BY id;

Table 852: Naive Bayes Example Train Table nb_iris_input_train

id sepal_length sepal_width petal_length petal_width species


1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3 1.4 0.1 setosa
14 4.3 3 1.1 0.1 setosa
16 5.7 4.4 1.5 0.4 setosa
... ... ... ... ... ...

SELECT * FROM nb_iris_input_test ORDER BY id;

Table 853: Naive Bayes Example Test Table nb_iris_input_test

id sepal_length sepal_width petal_length petal_width species


5 5 3.6 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
15 5.8 4 1.2 0.2 setosa
20 5.1 3.8 1.5 0.3 setosa
25 4.8 3.4 1.9 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
45 5.1 3.8 1.9 0.4 setosa
50 5 3.3 1.4 0.2 setosa
55 6.5 2.8 4.6 1.5 versicolor
60 5.2 2.7 3.9 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor

70 5.6 2.5 3.9 1.1 versicolor
75 6.4 2.9 4.3 1.3 versicolor
80 5.7 2.6 3.5 1 versicolor
85 5.4 3 4.5 1.5 versicolor
90 5.5 2.5 4 1.3 versicolor
95 5.6 2.7 4.2 1.3 versicolor
100 5.7 2.8 4.1 1.3 versicolor
105 6.5 3 5.8 2.2 virginica
110 7.2 3.6 6.1 2.5 virginica
115 5.8 2.8 5.1 2.4 virginica
120 6 2.2 5 1.5 virginica
125 6.7 3.3 5.7 2.1 virginica
130 7.2 3 5.8 1.6 virginica
135 6.1 2.6 5.6 1.4 virginica
140 6.9 3.1 5.4 2.1 virginica
145 6.7 3.3 5.7 2.5 virginica
150 5.9 3 5.1 1.8 virginica

SQL-MapReduce Call to Generate the Model

DROP TABLE IF EXISTS nb_iris_model;


CREATE TABLE nb_iris_model (PARTITION KEY(class)) AS
SELECT * FROM NaiveBayesReduce (
ON (
SELECT * FROM NaiveBayesMap (
ON nb_iris_input_train
Response ('species')
NumericInputs ('[1:4]')
)
) PARTITION BY class
);

NaiveBayesReduce and NaiveBayesMap Output: Model Table


The query below returns the output shown in the following table.

SELECT * FROM nb_iris_model ORDER BY 1;

Table 854: Naive Bayes Example Model Table nb_iris_model

class variable type category cnt sum sumSq totalcnt


setosa sepal_width NUMERIC 40 136.700000524521 473.290003499985 40
setosa petal_width NUMERIC 40 10.1000002026558 3.03000012755394 40
setosa sepal_length NUMERIC 40 199.900000095367 1004.27000005722 40
setosa petal_length NUMERIC 40 57.6999998092651 84.2099996709824 40
versicolor sepal_width NUMERIC 40 111.10000038147 313.130002088547 40
versicolor petal_width NUMERIC 40 53.299999833107 72.7099995040894 40
versicolor sepal_length NUMERIC 40 239.599999427795 1446.13999296188 40
versicolor petal_length NUMERIC 40 172.399999141693 752.219992570878 40
virginica sepal_width NUMERIC 40 118.799999952316 356.539999780655 40
virginica petal_width NUMERIC 40 81.1999989748001 166.999995970726 40
virginica sepal_length NUMERIC 40 264.400000572205 1764.92000530243 40
virginica petal_length NUMERIC 40 222.299999713898 1249.1499958992 40

NaiveBayesPredict Input
The input for the SQL-MapReduce call shown below is as follows:
• Model File - NaiveBayesReduce and NaiveBayesMap Output: Model Table
• Test Dataset - Split Input into Training and Testing Data Sets

SQL-MapReduce Call to Predict Outcomes of Test Table Data

DROP TABLE IF EXISTS nb_iris_predict;


CREATE TABLE nb_iris_predict (PARTITION KEY(id)) AS
SELECT * FROM NaiveBayesPredict (
ON nb_iris_input_test
Model ('nb_iris_model')
IDCol ('id')
NumericInputs ('[1:4]')
) ORDER BY id;
SELECT * FROM nb_iris_predict ORDER BY 1;

NaiveBayesPredict Output: Predict Outcomes Table


The output provides a prediction for each row in the test data set and specifies the log likelihood values that
were used to make the predictions for each category.

Table 855: Naive Bayes Example Output Table

id prediction loglik_virginica loglik_setosa loglik_versicolor


5 setosa -60.9907330174083 0.940424559067427 -38.2319825308929
10 setosa -61.5861966261907 -0.173043897170957 -37.6660830556247
15 setosa -64.7169548001753 -3.55476375390931 -42.613272284101
20 setosa -57.7992844148636 0.531796840642284 -35.7613053354934
25 setosa -55.0939143017897 -3.23703029869347 -32.1179858509341
30 setosa -58.0673073752287 0.109611164911179 -34.9285997859276
35 setosa -58.1980267787658 0.660202577013632 -34.9335988704833
40 setosa -58.3538858459019 0.976840811041703 -35.4425587940391
45 setosa -50.3847602463201 -4.36921429673761 -29.0537478266948
50 setosa -59.4745348026195 1.00257959230347 -36.5026022674224
55 versicolor -5.22108005914589 -270.465431908161 -1.7396367893394
60 versicolor -11.3356467465064 -174.565470791378 -2.31925264962004
65 versicolor -12.6496488706934 -138.435722453706 -2.1898005756116
70 versicolor -15.236843619572 -152.47255627778 -2.3538459106499
75 versicolor -8.34632493685681 -214.383653794905 -1.14727508911532
80 versicolor -18.455946984498 -109.900955754698 -3.72743011721095
85 versicolor -7.00283150694931 -249.656488976769 -2.00455589365379
90 versicolor -12.0279925543069 -177.470336291088 -1.74539749109463
95 versicolor -10.1802450220293 -198.037109900803 -1.10567314638237
100 versicolor -10.1315405651018 -187.294956922171 -1.02885306444447
105 virginica -1.58321671192447 -540.56351949849 -14.859643718252
110 virginica -6.11301966870239 -654.801984259278 -28.8385135092999
115 virginica -3.64635253153959 -456.647579953406 -15.3298808321577
120 versicolor -7.73615017754911 -322.909009762056 -3.53629430321742
125 virginica -1.87627054598219 -509.817023097936 -13.7515396871732
130 virginica -3.36908052149115 -469.802937074554 -9.13832860900173
135 versicolor -5.81482980902253 -403.678170868448 -4.51644862072851
140 virginica -1.48430911768034 -463.610989255182 -12.0238603485835
145 virginica -3.82266629516761 -576.395460020916 -22.6942168473031
150 virginica -2.57004648415525 -366.506113945482 -4.84887216455807

Prediction Accuracy
The prediction accuracy (proportion of correct predictions) of 93.33% is obtained using the following SQL
statements:

DROP TABLE IF EXISTS nb_predict_accuracy;

CREATE TABLE nb_predict_accuracy DISTRIBUTE BY hash(id) AS
SELECT nb_iris_input_test.id, species, prediction
FROM nb_iris_predict, nb_iris_input_test
WHERE nb_iris_input_test.id = nb_iris_predict.id;

SELECT (SELECT count(id) FROM nb_predict_accuracy
        WHERE prediction = species)/(SELECT count(id)
        FROM nb_predict_accuracy) AS prediction_accuracy;

Table 856: Naive Bayes Example Prediction Accuracy

prediction_accuracy
0.93333333333333333333
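To see which classes account for the misclassifications, you can break the join table down by actual and predicted class (a sketch over the nb_predict_accuracy table created above):

SELECT species, prediction, count(*) AS cnt
FROM nb_predict_accuracy
GROUP BY species, prediction
ORDER BY species, prediction;

Rows in which species and prediction differ identify the misclassified categories.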



CHAPTER 9
Ensemble Methods

Ensemble Methods
• Random Forest Functions
• Single Decision Tree Functions
• AdaBoost Functions

Random Forest Functions

Summary
SQL-MapReduce provides a suite of functions to create a predictive model based on a combination of the
Classification and Regression Trees (CART) algorithm for training decision trees, and the ensemble learning
method of bagging.

The random forest functions are:


• Forest_Drive, which builds a predictive model based on training data.
• Forest_Predict, which uses the model generated by the Forest_Drive function to analyze the input data
and make predictions.
• Forest_Analyze, which analyzes the model generated by the Forest_Drive function and gives weights to
the variables used in the model. This helps you understand the basis by which the Forest_Predict
function makes predictions.

Background
Decision trees are a supervised learning technique used for both classification and regression problems. In
technical terms, a decision tree creates a piecewise constant approximation function for the training data.
Decision trees are a common procedure used in data mining and supervised learning because of their
robustness to many of the problems of real world data, such as missing values, irrelevant variables, outliers
in input variables, and variable scalings. The decision tree algorithm has few parameters to tune.
The SQL-MapReduce decision tree functions implement an algorithm for decision tree training and
prediction based on Classification and Regression Trees, by Breiman, Friedman, Olshen and Stone (1984).
In the original Random Forest algorithm developed by Leo Breiman, each tree grows as follows:
• If the number of cases in the training set is N, sample N cases at random, but with replacement from the
original data. This sample becomes the training set for growing the tree.
• If there are M input variables, a number m<<M is specified such that at each node, m variables are
selected at random from M and the best split on those m variables is used to split the node. The value of
m is held constant during the forest growing.
• Each tree is grown to the largest extent possible. There is no pruning.
Teradata Aster’s implementation of the Random Forest algorithm differs from Leo Breiman’s algorithm in
the following ways:
• The Forest_Drive function lets you specify m using the optional argument Mtry. If you do not specify
Mtry, the function uses all variables to train the decision tree (equivalent to bootstrap aggregating or
bagging).
• The Forest_Drive function randomly assigns rows to individual vworkers. Each vworker creates trees
with a bootstrapping technique, using only its local data.

Decision Tree Basics


Decision trees are very simple models. For example, suppose you want to predict the value of a variable, y,
and you have two predictor variables, x1 and x2. You want to model y as a function of x1 and x2 (y = f(x1,
x2)).
You can visualize x1 and x2 as forming a plane, and values of y at particular coordinates of (x1, x2) rising out
of the plane in the third dimension. A decision tree partitions the plane into rectangles and assigns each
partition to predict a constant value of y, which is usually the average value of all the y values in that region.
You can extend this two-dimensional example into arbitrarily many dimensions to fit models with large
numbers of predictors.

In this example, the x1-x2 plane has four regions, R1, R2, R3 and R4. The predicted value of y for any test
observation in R1 is the average value of y for all training observations in R1.

This information can be represented by a decision tree:

The algorithm starts at the Root node. If the x1 value for a data point is greater than 5, then the algorithm
travels down the right path; if the value of x1 is less than 5, then the algorithm travels down the left path. At
each subsequent node, the algorithm determines which branch to follow, until it reaches a leaf node, to
which it assigns a prediction value.
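Because each leaf assigns a constant, a fitted tree of this kind is equivalent to a nested CASE expression. The following sketch encodes the example tree; the root split at x1 = 5 comes from the description above, while the second-level split (x2 = 3) and the leaf constants are invented for illustration:

SELECT CASE
  WHEN x1 < 5 AND x2 < 3 THEN 2.7   -- R1: average y of training rows in R1
  WHEN x1 < 5 AND x2 >= 3 THEN 4.1  -- R2 (hypothetical leaf constant)
  WHEN x1 >= 5 AND x2 < 3 THEN 6.0  -- R3 (hypothetical leaf constant)
  ELSE 8.3                          -- R4 (hypothetical leaf constant)
END AS predicted_y
FROM test_points;

Here test_points is a hypothetical table with predictor columns x1 and x2.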

Advantages of Decision Trees


Decision trees are easy to visualize and understand, and offer an interpretable reason for the decisions they
make (for example, person X is high-risk because his income is below C, his total debt is above D, and his
age is below A). In addition, decision trees are robust to spurious, colinear, and correlated input variables.

Disadvantages of Decision Trees


Decision tree training is a highly unstable procedure. Small differences in the training set can cause vastly
different decision tree structures, and often very different outcomes. Also, because they are piecewise-
constant approximations, observations on the boundaries between regions are more likely to have high error
rates.
The error rate of the random forest is determined by two factors:
• The higher the correlation between trees, the higher the error rate.
• The stronger the individual trees, the lower the overall error rate.

Implementation Notes
In the original Random Forest algorithm developed by Leo Breiman, each tree grows as follows:
• If the number of cases in the training set is N, sample N cases at random, but with replacement from the
original data. This sample becomes the training set for growing the tree.
• If there are M input variables, a number m<<M is specified such that at each node, m variables are
selected at random from M and the best split on those m variables is used to split the node. The value of
m is held constant during the forest growing.
• Each tree is grown to the largest extent possible. There is no pruning.

Teradata Aster’s implementation of the Random Forest algorithm differs from Leo Breiman’s algorithm in
the following ways:
• The Forest_Drive function lets you specify m using the optional argument Mtry. If you do not specify
Mtry, the function uses all variables to train the decision tree (equivalent to bootstrap aggregating or
bagging).
• The Forest_Drive function randomly assigns rows to individual vworkers. Each vworker creates trees
with a bootstrapping technique, using only its local data.

Usage
The SQL-MapReduce decision tree functions create a decision model that predicts an outcome based on a set of input variables. When constructing the tree, the splitting of branches stops when any stopping criterion is met.
The SQL-MapReduce decision tree functions support these predictive models:

Model                                                      Description
Regression problems (continuous response variable)        This model is used when the predicted outcome from the data is a real number. For example, the dollar amount of insurance claims for a year or the GPA expected for a college student.
Multiclass classification (classification tree analysis)  This model is used to classify data by predicting to which of the provided classes the data belongs. For example, whether the input data is political news, economic news, or sports news.
Binary classification (binary response variable)          This model is used to make predictions when the outcome can be represented as a binary value (true/false, yes/no, 0/1). For example, whether the input insurance claim description data represents an accident.

Forest_Drive

Summary
The Forest_Drive function takes as input a training set of data and uses it to generate a predictive model.
You can input the model to the Forest_Predict function, which uses it to make predictions.

Note:
The size of each individual decision tree generated by the Forest_Drive function must be less than 32 MB.
The factors that affect the size of a decision tree are the depth of the tree, the number of categorical
inputs, the number of numerical inputs, and the number of surrogates. If the size of a decision tree
exceeds 32 MB, the function issues an error message. Therefore, control the factors in the input data that
increase the size of decision trees.
Aster Analytics provides a tree_size_estimator function that you can use to estimate maximum values for
the arguments TreeSize and NumTrees, based on the cluster configuration and the number of predictor
variables. The syntax is:

SELECT * FROM tree_size_estimator (
    ON inputtable
    NumericInputs ('predictor' [,...])
);

The query result includes a row_count column. The average value of this column is the recommended maximum value for the NumTrees argument.
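For example, the recommendation can be read directly with an aggregate (a sketch; housing_train and its predictor columns are borrowed from the Forest_Drive example later in this chapter):

SELECT avg(row_count) AS recommended_max_numtrees
FROM tree_size_estimator (
  ON housing_train
  NumericInputs ('price', 'lotsize', 'bedrooms', 'bathrms', 'stories', 'garagepl')
);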

Usage

Forest_Drive Syntax
Version 1.5

SELECT * FROM Forest_Drive (
ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database( 'db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table_name')
OutputTable ('output_table_name')
ResponseColumn ('response_column')
[ NumericInputs ({ 'numeric_input_column_name' |
'numeric_input_column_range' }[,...]) ]
[ MaxNumCategoricalValues (max_cat_values) ]
[ CategoricalInputs ({ 'categorical_input_column_name' |
'categorical_input_column_range' }[,...]) ]
[ TreeType ( { 'regression' | 'classification' } ) ]
[ NumTrees (number_of_trees) ]
[ TreeSize (tree_size) ]
[ MinNodeSize (min_node_size) ]
[ Variance (variance) ]
[ MaxDepth (max_depth) ]
[ NumSurrogates (num_surrogates) ]
[ MonitorTable ('monitor_table_name') ]
[ DropMonitorTable
( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
[ Mtry ('mtry') ]
[ MtrySeed ('mtryseed') ]

[ Seed ('seed') ]
);

Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the input data set.
OutputTable Required Specifies the name of the output table in which the function stores
the predictive model that it generates.
If a table with this name exists in the database, the function drops
the existing table and creates a new table with the same name.
ResponseColumn Required Specifies the name of the column that contains the response
variable (that is, the quantity that you want to predict).
NumericInputs  Either NumericInputs or CategoricalInputs is required  Specifies the names of the columns that contain the numeric predictor variables (which must be numeric values).
MaxNumCategoricalValues  Optional  Specifies the maximum number of distinct values for a single categorical variable. The max_cat_values must be a positive INTEGER. The default value is 20. A max_cat_values greater than 20 is not recommended.
CategoricalInputs  Either NumericInputs or CategoricalInputs is required  Specifies the names of the columns that contain the categorical predictor variables (which can be either numeric or VARCHAR values). Each categorical input column can have at most max_cat_values distinct categorical values. If max_cat_values exceeds 20, the function might run out of memory, because classification trees grow rapidly as max_cat_values increases.
NumTrees Optional Specifies the number of trees to grow in the forest model. When
specified, number_of_trees must be greater than or equal to the
number of vworkers.
When not specified, the function builds the minimum number of
trees that provides the input dataset with full coverage.
TreeType Optional Specifies whether the analysis is a regression (continuous
response variable) or a multiclass classification (predicting result
from the number of classes). The default value is 'regression' if the
response variable is numeric and 'classification' if the response
variable is nonnumeric.
TreeSize Optional Specifies the number of rows that each tree uses as its input data
set. If not specified, the function builds a tree using either the
number of rows on a vworker or the number of rows that fit into
the vworker’s memory, whichever is less.
MinNodeSize Optional Specifies a decision tree stopping criterion; the minimum size of
any node within each decision tree. The default value is 1.
Variance Optional Specifies a decision tree stopping criterion. If the variance within
any node dips below this value, the algorithm stops looking for
splits in the branch. The default value is 0.
MaxDepth Optional Specifies a decision tree stopping criterion. If the tree reaches a
depth past this value, the algorithm stops looking for splits.
Decision trees can grow to at most (2^(max_depth+1) - 1) nodes. This
stopping criterion has the greatest effect on the performance of the
function. The default value is 12.
NumSurrogates Optional Specifies the number of surrogate splits to evaluate for each node.
The default value is 0.
MonitorTable Optional Specifies the name of the table in which the function stores
monitoring information. The default value is
'default_dt_monitor_table' in the current schema.
DropMonitorTable Optional Specifies whether to drop the table specified by MonitorTable, if it
exists. The default value is 'true'.
Mtry  Optional  Specifies the number of variables to randomly sample at each split. For example, if mtry is 3, then the function randomly samples 3 variables at each split. The mtry value must be an INTEGER.
MtrySeed Optional Specifies a LONG value to use in determining the random seed
for mtry.
Seed Optional Specifies a LONG value to use in determining the seed for the
random number generator. If you specify this value, you can
specify the same value in future calls to this function and the
function will build the same tree.

Input
Table 857: Forest_Drive Input Table Schema

Column Name Data Type Description


id_column Any Optional. Unique row identifier (not an argument to the
function).

response_column NUMERIC, Column specified by the ResponseColumn argument.
INTEGER,
BIGINT,
DOUBLE
PRECISION,
VARCHAR, or
BOOLEAN
numeric_input_columns NUMERIC, Columns specified by the NumericInputs argument.
INTEGER,
BIGINT, or
DOUBLE
PRECISION
categorical_input_columns INTEGER, Columns specified by the CategoricalInputs argument.
BIGINT, or Each column can have at most 20 distinct values.
VARCHAR

Note:
Forest_Drive skips input rows that contain NULL values.
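Because such rows are skipped silently, consider counting them before training (a sketch; extend the WHERE clause to cover every predictor and response column that you pass to the function):

SELECT count(*) AS rows_to_be_skipped
FROM housing_train
WHERE price IS NULL OR lotsize IS NULL OR homestyle IS NULL;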

Output
The Forest_Drive function populates the table specified by the OutputTable argument with the decision tree
that it creates. The following table shows the output table schema.
Table 858: Forest_Drive Output Table Schema

Column Data Type Description


worker_ip VARCHAR The IP address of the worker that produced the decision tree.
task_index INTEGER The ID of the worker that produced the decision tree.
tree_num INTEGER The ID of the decision tree.
tree VARCHAR A string representation of the decision tree.

Example
This example uses home sales data to create a model that predicts home style, which can be input to the
Forest_Predict Example.

Input
The following table describes the home sales data contained in the input table. There are six numerical
predictors and six categorical predictors. The response variable is homestyle.

Table 859: Forest_Drive Example Input Data Descriptions

Column Name Description


price Sale price in U. S. dollars (numeric)
lotsize Lot size in square feet (numeric)
bedrooms Number of bedrooms (numeric)
bathrms Number of full bathrooms (numeric)
stories Number of stories, excluding basement (numeric)
driveway Whether the house has a driveway—yes or no (categorical)
recroom Whether the house has a recreation room—yes or no (categorical)
fullbase Whether the house has a full finished basement—yes or no (categorical)
gashw Whether the house uses gas to heat water—yes or no (categorical)
airco Whether the house has central air conditioning—yes or no (categorical)
garagepl Number of garage places (numeric)
prefarea Whether the house is in a preferred neighborhood—yes or no (categorical)
homestyle Style of home (response variable)

The table of raw training data, housing_train, is described by the following two tables.
Table 860: Forest_Drive Example Input Table housing_train (Columns 1-7)

sn price lotsize bedrooms bathrms stories driveway


1 42000 5850 3 1 2 yes
2 38500 4000 2 1 1 yes
3 49500 3060 3 1 1 yes
4 60500 6650 3 1 2 yes
5 61000 6360 2 1 1 yes
6 66000 4160 3 1 1 yes
7 66000 3880 3 2 2 yes
8 69000 4160 3 1 3 yes
9 83800 4800 3 1 1 yes
10 88500 5500 3 2 4 yes
11 90000 7200 3 2 1 yes
12 30500 3000 2 1 1 no
14 36000 2880 3 1 1 no
15 37000 3600 2 1 1 yes
... ... ... ... ... ... ...

Table 861: Forest_Drive Example Input Table housing_train (Columns 8-14)

recroom fullbase gashw airco garagepl prefarea homestyle


no yes no no 1 no Classic
no no no no 0 no Classic
no no no no 0 no Classic
yes no no no 0 no Eclectic
no no no no 0 no Eclectic
yes yes no yes 0 no Eclectic
no yes no no 2 no Eclectic
no no no no 0 no Eclectic
yes yes no no 0 no Eclectic
yes no no yes 1 no Eclectic
no yes no yes 3 no Eclectic
no no no no 0 no Classic
no no no no 0 no Classic
no no no no 0 no Classic
... ... ... ... ... ... ...

SQL-MapReduce Call
This call uses the default values for MaxDepth, MinNodeSize, Variance, and NumSurrogates (passed explicitly), and builds 50 trees on two worker nodes. Both seed values are set to 100 for repeatability, and Mtry is assigned a value of 3 (sqrt(12) ≈ 3.5, rounded down), as this is a classification tree.
A good starting point for mtry is sqrt(p) for classification and p/3 for regression, where p is the number of variables used for prediction.

SELECT * FROM Forest_Drive (
ON (SELECT 1)
PARTITION BY 1
InputTable ('housing_train')
OutputTable ('rft_model')
TreeType ('classification')
ResponseColumn ('homestyle')
NumericInputs ('price', 'lotsize', 'bedrooms', 'bathrms',
'stories', 'garagepl')
CategoricalInputs ('driveway', 'recroom', 'fullbase', 'gashw',
'airco', 'prefarea')
MaxDepth (12)
MinNodeSize (1)
NumTrees (50)
NumSurrogates (0)
Variance (0.0)
Mtry ('3')

MtrySeed ('100')
Seed ('100')
);

Output
The summary table and the model table are shown below.
Table 862: Forest_Drive Example Output Summary Table

message
Computing 50 classification trees.
Each worker is computing 25 trees.
Each tree will contain approximately 246 points.
Poisson sampling parameter: 1.00
Query finished in 8.962 seconds.
Decision forest created in table "rft_model".

The following query returns the output shown in the following table:

SELECT task_index, tree_num, CAST (tree AS VARCHAR(50))
FROM rft_model ORDER BY 1;

Table 863: Forest_Drive Example Output Model Table rft_model

task_index tree_num cast(tree as character varying(50))


0 0 {"responseCounts_":{"Eclectic":148,"bungalow":30,"
0 1 {"responseCounts_":{"Eclectic":158,"bungalow":26,"
0 2 {"responseCounts_":{"Eclectic":120,"bungalow":38,"
0 3 {"responseCounts_":{"Eclectic":166,"bungalow":29,"
0 4 {"responseCounts_":{"Eclectic":138,"bungalow":32,"
0 5 {"responseCounts_":{"Eclectic":158,"bungalow":34,"
0 6 {"responseCounts_":{"Eclectic":168,"bungalow":32,"
0 7 {"responseCounts_":{"Eclectic":145,"bungalow":40,"
0 8 {"responseCounts_":{"Eclectic":150,"bungalow":34,"
0 9 {"responseCounts_":{"Eclectic":156,"bungalow":42,"
0 10 {"responseCounts_":{"Eclectic":148,"bungalow":18,"
0 11 {"responseCounts_":{"Eclectic":147,"bungalow":20,"
0 12 {"responseCounts_":{"Eclectic":150,"bungalow":31,"
0 13 {"responseCounts_":{"Eclectic":135,"bungalow":32,"

0 14 {"responseCounts_":{"Eclectic":139,"bungalow":24,"
0 15 {"responseCounts_":{"Eclectic":146,"bungalow":27,"
0 16 {"responseCounts_":{"Eclectic":152,"bungalow":23,"
0 17 {"responseCounts_":{"Eclectic":135,"bungalow":23,"
0 18 {"responseCounts_":{"Eclectic":148,"bungalow":29,"
0 19 {"responseCounts_":{"Eclectic":166,"bungalow":33,"
0 20 {"responseCounts_":{"Eclectic":142,"bungalow":28,"
0 21 {"responseCounts_":{"Eclectic":172,"bungalow":27,"
0 22 {"responseCounts_":{"Eclectic":147,"bungalow":37,"
0 23 {"responseCounts_":{"Eclectic":158,"bungalow":31,"
0 24 {"responseCounts_":{"Eclectic":158,"bungalow":33,"
1 0 {"responseCounts_":{"Eclectic":140,"bungalow":44,"
1 1 {"responseCounts_":{"Eclectic":161,"bungalow":28,"
1 2 {"responseCounts_":{"Eclectic":131,"bungalow":25,"
1 3 {"responseCounts_":{"Eclectic":167,"bungalow":28,"
1 4 {"responseCounts_":{"Eclectic":150,"bungalow":19,"
1 5 {"responseCounts_":{"Eclectic":158,"bungalow":24,"
1 6 {"responseCounts_":{"Eclectic":177,"bungalow":32,"
1 7 {"responseCounts_":{"Eclectic":156,"bungalow":24,"
1 8 {"responseCounts_":{"Eclectic":156,"bungalow":37,"
1 9 {"responseCounts_":{"Eclectic":165,"bungalow":24,"
1 10 {"responseCounts_":{"Eclectic":135,"bungalow":29,"
1 11 {"responseCounts_":{"Eclectic":140,"bungalow":20,"
1 12 {"responseCounts_":{"Eclectic":156,"bungalow":24,"
1 13 {"responseCounts_":{"Eclectic":147,"bungalow":34,"
1 14 {"responseCounts_":{"Eclectic":151,"bungalow":22,"
1 15 {"responseCounts_":{"Eclectic":161,"bungalow":18,"
1 16 {"responseCounts_":{"Eclectic":156,"bungalow":19,"
1 17 {"responseCounts_":{"Eclectic":126,"bungalow":29,"
1 18 {"responseCounts_":{"Eclectic":148,"bungalow":26,"
1 19 {"responseCounts_":{"Eclectic":177,"bungalow":21,"
1 20 {"responseCounts_":{"Eclectic":137,"bungalow":31,"

1 21 {"responseCounts_":{"Eclectic":171,"bungalow":28,"
1 22 {"responseCounts_":{"Eclectic":146,"bungalow":30,"
1 23 {"responseCounts_":{"Eclectic":149,"bungalow":21,"
1 24 {"responseCounts_":{"Eclectic":158,"bungalow":18,"

Forest_Predict

Summary
The Forest_Predict function uses the model generated by the Forest_Drive function to generate predictions
on a response variable for a test set of data. The model can be stored in either a table or a file.
This function can be used with real-time applications. Refer to AMLGenerator.

Usage

Forest_Predict Syntax
Version 1.5

SELECT * FROM Forest_Predict (
ON { table_name | view_name | (query) } [ PARTITION BY ANY ]
[ ON model_table AS ModelTable DIMENSION ]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
ModelFile ('model_file')
Forest ('model_table')
[ NumericInputs ({ 'numeric_input_column_name' |
'numeric_input_column_range' }[,...]) ]
[ CategoricalInputs ({ 'categorical_input_column_name' |
'categorical_input_column_range' }[,...]) ]
IDColumn ('id_column')
[ Detailed ({ 'true' | 'false' }) ]
[ Accumulate ({ 'accumulate_column' |
'accumulate_column_range' } [,...]) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
ModelFile  Either ModelFile, ModelTable, or Forest is required  Specifies the name of the text file or ZIP file that contains the trained model generated by the Forest_Drive function. You must have installed this model previously using the ACT \install command. If you specify ModelTable, then the function uses it and ignores ModelFile and Forest. If you specify both ModelFile and Forest, then the function uses Forest.
Forest  Either ModelFile, ModelTable, or Forest is required  Specifies the name of the table that contains the decision forest generated by the Forest_Drive function.
NumericInputs Optional Specifies the names of the columns that contain the numeric predictor
variables. By default, the function gets these variables from the model
generated by Forest_Drive. If you specify this argument, you must
specify it exactly as you specified it in the Forest_Drive call that
generated the model.
CategoricalInputs Optional Specifies the names of the columns that contain the categorical
predictor variables. By default, the function gets these variables from
the model generated by Forest_Drive. If you specify this argument,
you must specify it exactly as you specified it in the Forest_Drive call
that generated the model.
IDColumn Required Specifies the column that contains a unique identifier for each test
point in the test set.
Detailed Optional Specifies whether to output detailed information about the forest
trees; that is, the decision tree and the specific tree information,
including task index and tree index for each tree. The default value is
'false'.
Accumulate Optional Specifies the names of the input columns to copy to the output table.

Input
The Forest_Predict function has a required input table and an optional model table.
If you do not specify the optional model table, then you must specify the model with either the ModelFile or
Forest argument. Teradata recommends that you specify the optional model table.
The input table for the Forest_Predict function must contain an ID column (for example, user_id or
transaction_id), so that each test point can be associated with a prediction. It must also contain all columns

specified by the NumericInputs argument and all columns specified by the CategoricalInputs argument. The
following table shows its schema.
Table 864: Forest_Predict Input Table Schema

Column Name Data Type Description


id_column Any Contains unique identifiers for test points in the test set. Cannot
contain NULL values.
numeric_column NUMERIC, Contains numeric inputs. Cannot contain NULL values.
INTEGER,
BIGINT, or
DOUBLE
PRECISION
category_column INTEGER, Contains categorical inputs. Cannot contain NULL values.
BIGINT, or
VARCHAR
accumulate_column Any Column to copy to the output table.

Output
The output table is a set of predictions for each test point. The following table describes the output table
schema.
Table 865: Forest_Predict Output Table Schema

Column Data Type Description


accumulate_column Same as in Column copied from the input table.
input table
id_column Same as in Column copied from the input table.
input table
prediction DOUBLE Predicted value of the test point, as generated by the model.
PRECISION
confidence_lower DOUBLE Lower bound of the confidence interval.
PRECISION
confidence_upper DOUBLE Upper bound of the confidence interval.
PRECISION
tree_num  VARCHAR  Either the concatenation of task_index and tree_num from the model table, to show which tree generated the prediction, or 'final' to show the overall prediction. This column appears only if you specify Detailed ('true').

By design, for the classification tree, the columns confidence_lower and confidence_upper contain the same
value.

Example

Input
The input test data (housing_test) has 54 observations of 14 variables. The example uses the model
rft_model (see the Output section), created by the Forest_Drive function, to predict the homestyle of the test
dataset.
Table 866: Forest_Predict Example Input Table housing_test (Columns 1-7)

sn price lotsize bedrooms bathrms stories driveway


13 27000 1700 3 1 2 yes
16 37900 3185 2 1 1 yes
25 42000 4960 2 1 1 yes
38 67000 5170 3 1 4 yes
53 68000 9166 2 1 1 yes
104 132000 3500 4 2 2 yes
111 43000 5076 3 1 1 no
117 93000 3760 3 1 2 yes
132 44500 3850 3 1 2 yes
140 43000 3750 3 1 2 yes
142 40000 2650 3 1 2 yes
157 60000 2953 3 1 2 yes
... ... ... ... ... ... ...

Table 867: Forest_Predict Example Input Table housing_test (Columns 8-14)

recroom fullbase gashw airco garagepl prefarea homestyle


no no no no 0 no Classic
no no no yes 0 no Classic
no no no no 0 no Classic
no no no yes 0 no Eclectic
no yes no yes 2 no Eclectic
no no yes no 2 no bungalow
no no no no 0 no Classic
no no yes no 2 no Eclectic
no no no no 0 no Classic
no no no no 0 no Classic

no yes no no 1 no Classic
no yes no yes 0 no Eclectic
... ... ... ... ... ... ...

SQL-MapReduce Call
Use the Accumulate argument to pass the homestyle variable, to easily compare the actual and predicted
response for each observation.

CREATE TABLE rf_housing_predict DISTRIBUTE BY hash (sn) AS
SELECT * FROM Forest_Predict (
ON housing_test
Forest ('rft_model')
NumericInputs ('price', 'lotsize', 'bedrooms', 'bathrms',
'stories', 'garagepl')
CategoricalInputs ('driveway', 'recroom', 'fullbase', 'gashw',
'airco', 'prefarea')
IdColumn ('sn')
Accumulate ('homestyle')
Detailed ('false')
);

Output
The function’s predicted response is in the prediction column and the original classification values are in the
homestyle column. The upper and lower confidence intervals are also shown in the output table.
The following query returns the output shown in the following table:

SELECT * FROM rf_housing_predict ORDER BY 2;

Table 868: Forest_Predict Example Output Table

homestyle sn prediction confidence_lower confidence_upper


Classic 13 Classic 0.6 0.6
Classic 16 Classic 0.56 0.56
Classic 25 Classic 0.54 0.54
Eclectic 38 Eclectic 0.7 0.7
Eclectic 53 Eclectic 0.54 0.54
bungalow 104 bungalow 0.36 0.36
Classic 111 Classic 0.54 0.54
Eclectic 117 Eclectic 0.46 0.46
Classic 132 Classic 0.54 0.54

Classic 140 Classic 0.52 0.52
Classic 142 Eclectic 0.5 0.5
Eclectic 157 Eclectic 0.64 0.64
Eclectic 161 Eclectic 0.74 0.74
bungalow 162 Eclectic 0.46 0.46
Eclectic 176 Eclectic 0.48 0.48
Eclectic 177 Eclectic 0.56 0.56
Classic 195 Classic 0.76 0.76
Classic 198 Classic 0.48 0.48
Eclectic 224 Eclectic 0.56 0.56
Classic 234 Classic 0.64 0.64
Classic 237 Classic 0.48 0.48
Classic 239 Classic 0.52 0.52
Classic 249 Classic 0.7 0.7
Classic 251 Classic 0.6 0.6
Eclectic 254 Eclectic 0.66 0.66
Eclectic 255 Eclectic 0.6 0.6
Classic 260 Eclectic 0.5 0.5
Eclectic 274 Eclectic 0.66 0.66
Classic 294 Classic 0.62 0.62
Eclectic 301 Classic 0.56 0.56
Eclectic 306 Eclectic 0.7 0.7
Eclectic 317 Eclectic 0.5 0.5
bungalow 329 Eclectic 0.52 0.52
bungalow 339 bungalow 0.56 0.56
Eclectic 340 Eclectic 0.54 0.54
Eclectic 353 Eclectic 0.44 0.44
Eclectic 355 Classic 0.4 0.4
Eclectic 364 Eclectic 0.54 0.54
bungalow 367 bungalow 0.52 0.52
bungalow 377 Eclectic 0.46 0.46
Eclectic 401 Eclectic 0.56 0.56

Eclectic 403 Eclectic 0.56 0.56
Eclectic 408 Eclectic 0.56 0.56
Eclectic 411 Eclectic 0.54 0.54
Eclectic 440 Eclectic 0.66 0.66
Eclectic 441 Classic 0.5 0.5
Eclectic 443 Classic 0.52 0.52
Classic 459 Classic 0.74 0.74
Classic 463 Eclectic 0.56 0.56
Eclectic 469 Eclectic 0.62 0.62
Eclectic 472 Eclectic 0.54 0.54
bungalow 527 Eclectic 0.52 0.52
bungalow 530 Eclectic 0.58 0.58
Eclectic 540 Eclectic 0.42 0.42

Prediction Accuracy
The prediction accuracy is 77.78% as calculated by the following SQL statement:

SELECT (SELECT count(sn) FROM rf_housing_predict
WHERE homestyle = prediction) / (SELECT count(sn)
FROM rf_housing_predict) AS PA;

Table 869: Forest_Predict Accuracy

pa
0.77777777777777777778
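To list the misclassified observations directly, filter the prediction table for disagreements (a sketch over the rf_housing_predict table created above):

SELECT sn, homestyle, prediction
FROM rf_housing_predict
WHERE homestyle <> prediction
ORDER BY sn;

For example, observations 142, 162, and 260 appear in this result, because their homestyle and prediction values differ in the output table above.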

Forest_Analyze

Summary
The Forest_Analyze function analyzes the model generated by the Forest_Drive function and gives weights
to the variables in the model. This function shows variable/attribute counts in each tree level, helping you to
understand the importance of different variables in the decision-making process.

Usage

Forest_Analyze Syntax
Version 1.1

SELECT * FROM Forest_Analyze (
ON { table_name | view_name | (query) }
[ NumLevels (number_of_levels) ]
);

Arguments
Argument Category Description
NumLevels Optional Specifies the number of levels to analyze. The default value is 5.

Input
The input to the Forest_Analyze function is the model generated by the Forest_Drive function. Forest_Drive
Output Table Schema shows its schema.

Output
The output of the Forest_Analyze function is a table of model analysis data. The following table shows its
schema.
Table 870: Forest_Analyze Output Table Schema

Column Data Type Description


worker_ip VARCHAR The IP address of the worker that produced the decision tree.
task_index INTEGER The ID of the worker that produced the decision tree.
tree_num INTEGER The ID of the decision tree.
variable VARCHAR A string representation of the decision tree.
level INTEGER The highest level of the decision tree at which the variable appears.
cnt INTEGER The number of times that the variable is used as a split node in the
decision tree.
importance DOUBLE The importance statistics for each decision tree in the random forest.
PRECISION To find the overall importance of each variable, use this query, where
n is the number of trees:

SELECT variable, sum(importance)/n
FROM Forest_Analyze (
ON { table | view | (query) }

[ NumLevels (number_of_levels)]
) GROUP BY variable;

Examples

Input
The following examples show how to use the Forest_Analyze function to analyze the sample model
generated by Forest_Drive. The rft_model table, generated by the Forest_Drive function, is used as input.

Example 1

SQL-MapReduce Call

SELECT task_index, tree_num, variable, level, cnt
FROM Forest_Analyze (ON rft_model)
ORDER BY 1;

Output
There are two worker nodes, each constructing 25 trees. The level and count of each variable, for each tree on each worker, are output as shown in the following table.
Table 871: Forest_Analyze Example 1 Output Table

task_index tree_num variable level cnt


0 0 price 0 1
0 0 fullbase 1 1
0 0 airco 1 1
0 0 lotsize 2 1
0 0 driveway 2 1
0 0 recroom 2 1
0 0 price 2 1
0 0 stories 3 4
0 0 garagepl 3 1
0 0 bathrms 3 1
0 0 bedrooms 4 1
0 0 lotsize 4 2

0 0 price 4 4
0 0 bedrooms
0 0 lotsize
0 0 driveway
0 0 stories
0 0 recroom
0 0 price
0 0 garagepl
0 0 bathrms
0 0 fullbase
0 0 airco
0 1 lotsize 0 1
0 1 recroom 1 1
0 1 bathrms 1 1
0 1 bedrooms 2 1
0 1 driveway 2 1
0 1 price 2 1
0 1 fullbase 2 1
0 1 bedrooms 3 1
0 1 driveway 3 1
0 1 price 3 1
0 1 garagepl 3 1
0 1 airco 3 1
0 1 bedrooms 4 1
0 1 price 4 3
0 1 bathrms 4 1
0 1 bedrooms
0 1 lotsize
0 1 driveway
0 1 recroom
0 1 price
0 1 garagepl

0 1 bathrms
0 1 fullbase
0 1 airco
... .... .... ... ...

Example 2: Calculating Variable importance

SQL-MapReduce Call
The overall variable importance is calculated by averaging the importance over 50 trees, as shown below.

SELECT variable, SUM(importance)/50
FROM Forest_Analyze (ON rft_model)
GROUP BY variable
ORDER BY 2 DESC;

Output
The variable importance is shown in descending order. The top three variables for modeling and prediction
are price, lotsize and bedrooms.
Table 872: Forest_Analyze Example 2 Output Table

variable sum(importance) / 50
price 0.530036819315194
lotsize 0.40869314472933
bedrooms 0.216136248043658
stories 0.176956469036925
bathrms 0.171395287455378
garagepl 0.16108831869553
fullbase 0.0853787807623518
airco 0.0720778853448971
recroom 0.0607107804514478
driveway 0.0336033805550212
gashw 0.0161230714649009
prefarea 0.00464901131486607
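The same importances can also be expressed as a share of the total, which makes them easier to compare across models (a sketch; a window function over an aggregate is assumed to be supported, as in PostgreSQL-derived SQL):

SELECT variable,
  SUM(importance) / SUM(SUM(importance)) OVER () AS importance_share
FROM Forest_Analyze (ON rft_model)
GROUP BY variable
ORDER BY 2 DESC;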

Best Practices
Training a decision tree model is a relatively automatic procedure, but for best performance, be aware of the
following:
• Use the same set of columns for CategoricalInputs and NumericInputs while building the model and
using the model (for prediction); otherwise the Forest_Predict function fails.
• The Forest_Drive function computes several parameters that are important for the performance of the
model. If necessary, you can set these parameters to improve function performance:
∘ NumTrees—By default, the Forest_Drive function builds the number of trees such that the total
number of sampled points is equal to the size of the original input dataset. For example, if your input
dataset contains one billion rows, and the function determines that each tree must be trained on a
sample of one million rows, the function trains 1,000 trees. Depending on your dataset, you might
want more or fewer trees. Generally, a model of 300 decision trees works well for most prediction
tasks. If your dataset is small, specify a value for NumTrees that is a multiple of the number of
vworkers in your cluster.
∘ TreeSize—Each decision tree is built on a sample of the original dataset. The function computes the
value of this parameter such that the decision tree algorithm does not run out of memory. With the
TreeSize parameter, you can specify how many rows each decision tree is to contain. Setting this
parameter too high can result in Out of Memory errors.
• You can check the progress of the Forest_Drive and Forest_Predict functions in the AMC. Log into the AMC and click the Processes tab. If a function is still running, you see a process with its name. Click that process name and then click the View Logs link. The logs show stdout from the process, which helps you check progress and diagnose potential problems.
• If a specific variable can have more than 20 values, consolidate some of the categories to improve runtime
performance.
• The tree field in the output model table can become very large. If the trees are too large, or the model has too many trees, the Forest_Predict function can fail and output NaNs as predictions (NaN means "not a number"); a query for checking tree sizes appears after this list. If the Forest_Predict logs in the AMC show NaNs, try one of the following:
∘ Train fewer decision trees.
∘ Decrease the MaxDepth parameter in the Forest_Drive function.
∘ Reduce the cardinality of your categorical input variables.
• Each vworker trains decision trees using a subsample of the data on its partition. Significant data skew
can produce strange results.
• For better efficiency when running the Forest_Drive function, ensure that the training data in the input
table is randomly distributed.
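Following up on the tree-size caution above, the serialized size of each tree can be checked directly in the model table (a sketch; length() returns the character length of the tree string, a rough proxy for its size):

SELECT task_index, tree_num, length(tree) AS tree_chars
FROM rft_model
ORDER BY tree_chars DESC
LIMIT 5;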

Single Decision Tree Functions


Decision trees are a common procedure used in data mining and supervised learning because of their
robustness to many of the problems of real world data, such as missing values, irrelevant variables, outliers
in input variables, and variable scalings. The algorithm is an “off-the-shelf” procedure, with a few parameters
to tune.
This implementation creates only one decision tree, as opposed to generating multiple trees, as in the case of
random forests. These functions support classification trees on continuous variables and categorical
attributes, handle missing values during the prediction phase, and support GINI, entropy, and chi-square
impurity measurements.


Single_Tree_Drive

Summary
The Single_Tree_Drive function creates a single decision tree in a distributed fashion, either weighted or
unweighted. The model table that this function outputs can be input to the function Single_Tree_Predict.

Background

Tree Building
The Single_Tree_Drive function takes the entire data set as training input and builds a single decision tree
from it.

Installing the Helper Functions


Before calling the Single_Tree_Drive function, ensure that the helper functions are installed:
• single_tree_drive
• best_splits_by_attributes
• best_splits_by_nodes
• partition_data
• percentile
• approxPercentileMap
• approxPercentileReduce

Difference Between the Decision Tree Approaches


In a random forest, each vworker operates on its own data and builds one or more decision trees. During the
forest-building process, vworkers need not communicate with each other.
A forest is built based on the training data set. After the forest is built, all the future data points are predicted
against all the trees in the forest, and then the function calculates an aggregate value for the global predicted
value. Because many trees are involved, it is not clear which variables are the most important at the different
levels of the trees.

The single-tree approach requires vworkers to communicate with each other during the tree-building
process. This communication can be very expensive, depending on the number of variables and the number
of possible splits; therefore, the single-tree algorithm uses a sampling approach to reduce the number of
splits.
The single-tree approach implements the classification tree for numeric and categorical variables.
The Single_Tree_Drive function uses Approximate Percentile and Percentile for sampling the split values.
The split table has all the splits for all the numerical attributes to be considered for building the single
decision tree.

Usage

Single_Tree_Drive Syntax
Version 1.3

SELECT * FROM Single_Tree_Drive (
ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database( 'db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
{ InputTable ('input_table') |
AttributeTableName ('attribute_table')
ResponseTableName ('response_table') }
OutputTable ('output_table')
AttributeNameColumns ('attribute_column' [,...])
AttributeValueColumn ('node_column')
ResponseColumn ('response_column')
IDColumns ({ 'id_column' | 'id_column_range' } [,...])
[ CategoricalAttributeTableName
('categorical_attribute_table') ]
[ SaveFinalResponseTableTo ('final_response_table') ]
[ SplitsTable ('splits_table') ]
[ SplitsValueColumn ('splits_valcol') ]
[ NumSplits ('num_splits_to_consider') ]
[ ApproxSplits
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ IntermediateSplitsTable ('intermediate_splits_table') ]
[ DropTable ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ MinNodeSize ('minimum_split_size') ]
[ MaxDepth ('max_depth') ]
[ Weighted ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ WeightColumn ('weight_column') ]
[ SplitMeasure ( { 'gini' | 'entropy' | 'chisquare' } ) ]
);

Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable  Required if you omit AttributeTableName and ResponseTableName  Specifies the name of the table that contains the input data set.
AttributeTableName  Required if you omit InputTable  Specifies the name of the table that contains the attribute names and the values.
ResponseTableName  Required if you omit InputTable  Specifies the name of the table that contains the response values.
OutputTable Required Specifies the name for the output table that is to contain the
final decision tree (the model table). The name must not exceed
64 characters.
AttributeNameColumns Required Specifies the names of the attribute table columns that define
the attribute.
AttributeValueColumn Required Specifies the names of the attribute table columns that define
the value.
ResponseColumn Required Specifies the name of the response table column that contains
the response variable.
IDColumns Required Specifies the names of the columns in the response and
attribute tables that specify the ID of the instance.
CategoricalAttributeTableName  Optional  Specifies the name of the input table that contains the categorical attributes.
SaveFinalResponseTableTo Optional Specifies the name for the output table that is to contain the
final PID and response pair from the response table and the
node_id from the final single drive tree.
SplitsTable  Optional  Specifies the name of the input table that contains the user-specified splits. By default, the function creates new splits.
SplitsValueColumn  Optional  If you specify SplitsTable, this argument specifies the name of the column that contains the split value. If ApproxSplits is 'true', then the default value is splits_valcol; otherwise, the default value is the AttributeValueColumn argument value, node_column.

NumSplits Optional Specifies the number of splits to consider for each variable. The
default value is 10. The function does not consider all possible
splits for all attributes.
ApproxSplits Optional Specifies whether to use approximate percentiles (true) or exact
percentiles (false). The default value is true. Internally, the
function uses percentile values as split values.
IntermediateSplitsTable Optional Specifies the name for the intermediate splits table, if it is to be
saved. By default, the function does not save the intermediate
splits table.
DropTable  Optional  Specifies whether to drop the output table (specified by OutputTable) if it already exists. The default value is 'false'.
MinNodeSize Optional Specifies the decision tree stopping criterion and the minimum
size of any particular node within each decision tree. The
default value is 100.
MaxDepth  Required  Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow up to (2^(max_depth+1) - 1) nodes. This stopping criterion has the greatest effect on function performance. The maximum value is 60. The default value is 5.
Weighted Optional Specifies whether to build a weighted decision tree. The default
value is 'false'. If you specify 'true', then you must also specify
the WeightColumn argument.
WeightColumn Optional Specifies the name of the response table column that contains
the weights of the attribute values.
SplitMeasure Optional Specifies the impurity measurement to use while constructing
the decision tree. The default value is 'gini'. If the tree is
weighted, this value cannot be 'chisquare'.

Input
Single decision trees support millions of attributes. Because the database cannot have millions of columns,
you must spread the attributes across rows in the form of key-value pairs, where key is the name of the
attribute and value is the value of the attribute.
To convert an input table used in Forest_Drive into an input table for Single_Tree_Drive, use the Unpivot
function.
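For example, a minimal sketch of such a conversion (the ColsToUnpivot and ColsToAccumulate argument names are assumed from the Unpivot function; verify them against your Aster Analytics version):

SELECT * FROM Unpivot (
  ON housing_train
  ColsToUnpivot ('price', 'lotsize', 'bedrooms')
  ColsToAccumulate ('sn')
);

Each input row becomes one output row per unpivoted column, with the column name as the attribute and the column value as the attribute value.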
The Single_Tree_Drive function requires either an input table or both an attribute table and a response
table. The function has two optional input tables, the splits table and the categorical splits table.
Table 873: Single_Tree_Drive Input Table Schema

Column Name Data Type Description


id_column Any Data point identifier. Cannot be NULL.

attribute_column VARCHAR Attribute name. Cannot be NULL.
Every attribute in the attribute table must be given a non-
empty partition in the splits table.
node_column Numeric attribute: Attribute value. Can be NULL, in which case the function
NUMERIC, INTEGER, estimates its value by arithmetic means on an attribute
BIGINT, or DOUBLE basis. If this value is out of range, the function cannot use it
PRECISION to partition the training data; therefore, it is useless.
Categorical attribute:
Any
response_column NUMERIC, INTEGER, Response value for the data point. Can be NULL.
BIGINT, or DOUBLE
PRECISION
weight_column DOUBLE PRECISION Weight of the data point. Cannot be NULL. This column
appears only if the decision tree is weighted.

Table 874: Single_Tree_Drive Attribute Table Schema

Column Name Data Type Description


id_column Any Data point identifier. Cannot be NULL.
attribute_column VARCHAR Attribute name. Cannot be NULL.
Every attribute in the attribute table must be given a non-
empty partition in the splits table.
node_column Numeric attribute: Attribute value. Can be NULL, in which case the function
NUMERIC, INTEGER, estimates its value by arithmetic means on an attribute basis.
BIGINT, or DOUBLE If this value is out of range, the function cannot use it to
PRECISION partition the training data; therefore, it is useless.
Categorical attribute:
Any

Table 875: Single_Tree_Drive Response Table Schema

Column Name Data Type Description


id_column Any Data point identifier. Cannot be NULL.
response_column NUMERIC, Response value for the data point. Can be NULL.
INTEGER,
BIGINT, or
DOUBLE
PRECISION
weight_column DOUBLE Weight of the data point. Cannot be NULL. This column
PRECISION appears only if the decision tree is weighted.

Note:
The response table must not have a column named node_id.

Table 876: Single_Tree_Drive Splits Table Schema

Column Name Data Type Description


attribute_column VARCHAR Attribute name. Cannot be NULL.
Every attribute in the attribute table must be given a non-
empty partition in the splits table.
split_id INTEGER Split identifier. Cannot be NULL.
splits_valcol NUMERIC, Split value. Cannot be NULL.
INTEGER,
BIGINT, or
DOUBLE
PRECISION

Table 877: Single_Tree_Drive Categorical Splits Table Schema

Column Name Data Type Description


attribute VARCHAR Categorical attribute name.

Output
The Single_Tree_Drive function outputs console messages, a model table, and (optionally) an intermediate
splits table and final response table. The following table shows the schema of the message table.
Table 878: Single_Tree_Drive Console Message Table Schema

Column Data Type Description


message VARCHAR Console message.

The model table has a row for each node in the model (the single decision tree that the function creates). The
name of the model table is specified by the OutputTableName argument. The following table shows the
schema of the model table.
Table 879: Single_Tree_Drive Model Table Schema

Column Data Type Description


node_id INTEGER Node identifier.
node_size INTEGER Number of objects in the node.
node_gini[(p)]  DOUBLE PRECISION  GINI impurity value for the information in the node. If you specify SplitMeasure('gini'), the column name is node_gini(p); otherwise, it is node_gini.
node_entropy[(p)]  DOUBLE PRECISION  Entropy impurity value for the information in the node. If you specify SplitMeasure('entropy'), the column name is node_entropy(p); otherwise, it is node_entropy.
node_chisq_pv[(p)]  DOUBLE PRECISION  Chi-square impurity value for the information in the node. If you specify SplitMeasure('chisquare'), the column name is node_chisq_pv(p); otherwise, it is node_chisq_pv.

node_label VARCHAR Output category for the node.
node_majorvotes INTEGER Number of objects that belong to the category identified by
node_label.
split_value DOUBLE Numerical split value.
PRECISION
split_gini[(p)]  DOUBLE PRECISION  GINI impurity measurement for the information in the node after splitting. If you specify SplitMeasure('gini'), the column name is split_gini(p); otherwise, it is split_gini.
split_entropy[(p)]  DOUBLE PRECISION  Entropy impurity measurement for the information in the node after splitting. If you specify SplitMeasure('entropy'), the column name is split_entropy(p); otherwise, it is split_entropy.
split_chisq_pv[(p)]  DOUBLE PRECISION  Chi-square impurity measurement for the information in the node after splitting. If you specify SplitMeasure('chisquare'), the column name is split_chisq_pv(p); otherwise, it is split_chisq_pv.
left_id INTEGER Identifier of the left child of the node.
left_size INTEGER Number of objects in left child of the node.
left_label VARCHAR Output category for left child of the node.
left_majorvotes INTEGER Number of objects that belong to the category identified by
left_label.
right_id INTEGER Identifier of the right child of the node.
right_size INTEGER Number of objects in right child of the node.
right_label VARCHAR Output category for right child of the node.
right_majorvotes INTEGER Number of objects that belong to the category identified by
right_label.
left_bucket VARCHAR When the split value is the categorical attribute, the value in the
left child of the node.
right_bucket VARCHAR When the split value is the categorical attribute, the value in the
right child of the node.
attribute VARCHAR Split attribute.
node_majorfreq DOUBLE Weighted objects that belong to the category identified by
PRECISION node_label. This column appears only if the Weighted argument
is 'true'.
left_majorfreq DOUBLE Weighted objects that belong to the category identified by
PRECISION left_label. This column appears only if the Weighted argument is
'true'.

right_majorfreq DOUBLE Weighted objects that belong to the category identified by
PRECISION right_label. This column appears only if the Weighted argument
is 'true'.

The following table describes the intermediate splits table. The name of the intermediate splits table is specified by the IntermediateSplitsTable argument.
Table 880: Single_Tree_Drive Intermediate Splits Table Schema

Column Data Type Description


attribute VARCHAR Attribute name (from the attribute table, Input). For
each attribute, the table has the number of rows
specified by the MaxDepth argument.
percentile INTEGER Percentile of values at the split. For example, if
attribute A has 100 different values, then percentile=10 and value=1
means that the value at the 10th percentile (100*10% = the 10th value)
of attribute A is 1, and 1 is the split value.
value NUMERIC, INTEGER, Split value (from the attribute table, Input).
BIGINT, or DOUBLE
PRECISION

The following table describes the output response table. The name of the output response table is specified
by the SaveFinalResponseTableTo argument.
Table 881: Single_Tree_Drive Output Response Table Schema

Column Data Type Description


node_id INTEGER Node identifier.
pid Any Data point identifier.
response NUMERIC, Response value for the data point.
INTEGER,
BIGINT, or
DOUBLE
PRECISION

Examples
• Example 1
• Example 2

Example 1

Input
The well-known 'iris' dataset (iris_input) is used in this example. The data has values for four attributes
(sepal_length, sepal_width, petal_length, and petal_width) and is grouped into three categories: setosa (1),
versicolor (2), and virginica (3). From the raw data, a training set and a test set are created.
The Single_Tree_Drive function acts on the training set to generate the model. The Single_Tree_Predict
function uses that model and the test set to predict the output. The prediction accuracy is determined by
comparing the original and predicted results.
Table 882: Single_Tree_Drive Example 1 Iris Table iris_input

id sepal_length sepal_width petal_length petal_width species


1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
5 5 3.6 1.4 0.2 1
6 5.4 3.9 1.7 0.4 1
7 4.6 3.4 1.4 0.3 1
8 5 3.4 1.5 0.2 1
9 4.4 2.9 1.4 0.2 1
10 4.9 3.1 1.5 0.1 1
... ... ... ... ... ...

Split Input into Training and Testing Data Sets


This code divides the 150 data rows into a training data set (80%) and a testing data set (20%):

DROP TABLE IF EXISTS iris_train;


DROP TABLE IF EXISTS iris_test;
CREATE TABLE iris_train AS
SELECT * FROM iris_input WHERE id%5!=0;
CREATE TABLE iris_test AS
SELECT * FROM iris_input WHERE id%5=0;
SELECT * FROM iris_train ORDER BY id;

Table 883: Single_Tree_Drive Example 1 Train Table iris_train

id sepal_length sepal_width petal_length petal_width species


1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1

4 4.6 3.1 1.5 0.2 1
6 5.4 3.9 1.7 0.4 1
7 4.6 3.4 1.4 0.3 1
8 5 3.4 1.5 0.2 1
9 4.4 2.9 1.4 0.2 1
11 5.4 3.7 1.5 0.2 1
12 4.8 3.4 1.6 0.2 1
13 4.8 3 1.4 0.1 1
14 4.3 3 1.1 0.1 1
16 5.7 4.4 1.5 0.4 1
... ... ... ... ... ...

SELECT * FROM iris_test ORDER BY id;

Table 884: Single_Tree_Drive Example 1 Test Table iris_test

id sepal_length sepal_width petal_length petal_width species


5 5 3.6 1.4 0.2 1
10 4.9 3.1 1.5 0.1 1
15 5.8 4 1.2 0.2 1
20 5.1 3.8 1.5 0.3 1
25 4.8 3.4 1.9 0.2 1
30 4.7 3.2 1.6 0.2 1
35 4.9 3.1 1.5 0.2 1
40 5.1 3.4 1.5 0.2 1
45 5.1 3.8 1.9 0.4 1
50 5 3.3 1.4 0.2 1
55 6.5 2.8 4.6 1.5 2
60 5.2 2.7 3.9 1.4 2
65 5.6 2.9 3.6 1.3 2
70 5.6 2.5 3.9 1.1 2
75 6.4 2.9 4.3 1.3 2
80 5.7 2.6 3.5 1 2
85 5.4 3 4.5 1.5 2

90 5.5 2.5 4 1.3 2
95 5.6 2.7 4.2 1.3 2
100 5.7 2.8 4.1 1.3 2
105 6.5 3 5.8 2.2 3
110 7.2 3.6 6.1 2.5 3
115 5.8 2.8 5.1 2.4 3
120 6 2.2 5 1.5 3
125 6.7 3.3 5.7 2.1 3
130 7.2 3 5.8 1.6 3
135 6.1 2.6 5.6 1.4 3
140 6.9 3.1 5.4 2.1 3
145 6.7 3.3 5.7 2.5 3
150 5.9 3 5.1 1.8 3

Attribute Tables
Attribute tables, created from the raw train and test data, are used as inputs.

DROP TABLE IF EXISTS iris_attribute_train;


CREATE FACT TABLE iris_attribute_train (pid INTEGER, attribute varchar,
attrvalue real) DISTRIBUTE BY hash(pid);
INSERT INTO iris_attribute_train SELECT id, 'sepal_length',
sepal_length FROM iris_train;
INSERT INTO iris_attribute_train SELECT id, 'sepal_width',
sepal_width FROM iris_train;
INSERT INTO iris_attribute_train SELECT id, 'petal_length',
petal_length FROM iris_train;
INSERT INTO iris_attribute_train SELECT id, 'petal_width',
petal_width FROM iris_train;

The following query returns the output shown in the following table:

SELECT * FROM iris_attribute_train ORDER BY pid, attribute;

Table 885: Single_Tree_Drive Example 1 Attribute Table iris_attribute_train

pid attribute attrvalue


1 petal_length 1.4
1 petal_width 0.2
1 sepal_length 5.1
1 sepal_width 3.5

2 petal_length 1.4
2 petal_width 0.2
2 sepal_length 4.9
2 sepal_width 3
3 petal_length 1.3
3 petal_width 0.2
3 sepal_length 4.7
3 sepal_width 3.2
... ... ...

DROP TABLE IF EXISTS iris_attribute_test;


CREATE FACT TABLE iris_attribute_test (pid INTEGER, attribute varchar,
attrvalue real) DISTRIBUTE BY hash(pid);
INSERT INTO iris_attribute_test SELECT id, 'sepal_length', sepal_length
FROM iris_test;
INSERT INTO iris_attribute_test SELECT id, 'sepal_width', sepal_width
FROM iris_test;
INSERT INTO iris_attribute_test SELECT id, 'petal_length',
petal_length FROM iris_test;
INSERT INTO iris_attribute_test SELECT id, 'petal_width',
petal_width FROM iris_test;

The following query returns the output shown in the following table:

SELECT * FROM iris_attribute_test ORDER BY pid, attribute;

Table 886: Single_Tree_Drive Example 1 Attribute Table iris_attribute_test

pid attribute attrvalue


5 petal_length 1.4
5 petal_width 0.2
5 sepal_length 5
5 sepal_width 3.6
10 petal_length 1.5
10 petal_width 0.1
10 sepal_length 4.9
10 sepal_width 3.1
15 petal_length 1.2
15 petal_width 0.2

15 sepal_length 5.8
15 sepal_width 4
... ... ...

Response Tables
Response tables, created from the raw train and test data, are used as inputs.

DROP TABLE IF EXISTS iris_response_train;


CREATE DIMENSION TABLE iris_response_train (pid INTEGER, response VARCHAR);
INSERT INTO iris_response_train SELECT id, species FROM iris_train;

The following query returns the output shown in the following table:

SELECT * FROM iris_response_train ORDER BY pid;

Table 887: Single_Tree_Drive Example 1 Response Table iris_response_train

pid response
1 1
2 1
3 1
4 1
6 1
7 1
8 1
9 1
11 1
12 1
13 1
14 1
16 1
... ...

DROP TABLE IF EXISTS iris_response_test;


CREATE DIMENSION TABLE iris_response_test (pid INTEGER, response VARCHAR);
INSERT INTO iris_response_test SELECT id, species FROM iris_test;

The following query returns the output shown in the following table:

SELECT * FROM iris_response_test ORDER BY pid;

Table 888: Single_Tree_Drive Example 1 Response Table iris_response_test

pid response
5 1
10 1
15 1
20 1
25 1
30 1
35 1
40 1
45 1
50 1
55 2
60 2
65 2
70 2
75 2
80 2
85 2
90 2
95 2
100 2
105 3
110 3
115 3
120 3
125 3
130 3
135 3
140 3

145 3
150 3

SQL-MapReduce Call

DROP TABLE IF EXISTS iris_attribute_output;


DROP TABLE IF EXISTS splits_small;
SELECT * FROM single_tree_drive (
ON (select 1) PARTITION BY 1
AttributeTableName ('iris_attribute_train')
OutputTable ('iris_attribute_output')
IntermediateSplitsTable ('splits_small')
ResponseTableName ('iris_response_train')
NumSplits ('3')
SplitMeasure ('gini')
MaxDepth ('10')
IDColumns ('pid')
AttributeNameColumns ('attribute')
AttributeValueColumn ('attrvalue')
ResponseColumn ('response')
MinNodeSize ('10')
ApproxSplits ('false')
);

Output
The function call creates two tables, a model table 'iris_attribute_output' and an intermediate splits table
'splits_small'.
Table 889: Single_Tree_Drive Example 1 Output Message

message
Input tables:"iris_attribute_train", "iris_response_train"
Output model table: "iris_attribute_output"
Depth of the tree is:6

The following query returns the output shown in the table iris_attribute_output:

SELECT * FROM iris_attribute_output ORDER BY 1;

Table 890: Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 1-6)

node_id node_size node_gini(p) node_entropy node_chisq_pv node_label


0 120 0.666666666666667 1.58496250072116 1 1
2 80 0.5 1 1 2
5 39 0.0499671268902038 0.172036949353113 1 2

6 41 0.0928019036287924 0.281193796432043 1 3
14 37 0.0525931336742148 0.179256066928321 1 3
30 24 0.0798611111111112 0.249882292833186 1 3
61 14 0.13265306122449 0.371232326640875 1 3

Table 891: Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 7-11)

node_majorvotes split_value split_gini(p) split_entropy split_chisq_pv


40 3 0.333333333333333 0.666666666666666 0
40 1.70000004768372 0.0719199499687304 0.227979833481065 1.11022302462516e-16
38 4.90000009536743 0.0384615384615385 0.0832080127650393 0.00272911340977045
39 4.90000009536743 0.0840474620962426 0.240916755467913 0.0492232443463754
36 2.90000009536743 0.0518018018018018 0.162085811567472 0.455587679837851
23 3.20000004768372 0.0773809523809524 0.216552190540511 0.387955122826146
13 6.30000019073486 0.131868131868132 0.363297594798595 0.773484680980946

Table 892: Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 12-19)

left_id left_size left_label left_majorvotes right_id right_size right_label right_majorvotes


1 40 1 40 2 80 2 40
5 39 2 38 6 41 3 39
11 35 2 35 12 4 2 3
13 4 3 3 14 37 3 36
29 13 3 13 30 24 3 23
61 14 3 13 62 10 3 10
123 1 3 1 124 13 3 12

Table 893: Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 20-22)

left_bucket right_bucket attribute


petal_length
petal_width
petal_length
petal_length
sepal_width
sepal_width

sepal_length

The following query returns the output shown in the following table:

SELECT * FROM splits_small ORDER BY 1;

Table 894: Single_Tree_Drive Example 1 Output Table splits_small

attribute percentile attrvalue


petal_length 33.3333333333333 3
petal_length 66.6666666666667 4.90000009536743
petal_length 100 6.90000009536743
petal_width 33.3333333333333 1
petal_width 66.6666666666667 1.70000004768372
petal_width 100 2.5
sepal_length 33.3333333333333 5.40000009536743
sepal_length 66.6666666666667 6.30000019073486
sepal_length 100 7.90000009536743
sepal_width 33.3333333333333 2.90000009536743
sepal_width 66.6666666666667 3.20000004768372
sepal_width 100 4.40000009536743
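
In this output, the argument NumSplits('3') produces, for each attribute, candidate split values at the
100*(1/3) = 33.33rd, 100*(2/3) = 66.67th, and 100th percentiles of that attribute's values in the training
set, which is what the percentile column above shows.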

Example 2

Input
This example illustrates an alternate input format. The attribute table and response table of Example 1 are
combined into a single table, iris_altinput, which is passed to the InputTable() argument. (A sketch of one
way to build such a table follows the sample rows below.)
Table 895: Single_Tree_Drive Example 2 Input Table iris_altinput

pid attribute attrvalue response


1 petal_width 0.2 1
1 petal_length 1.4 1
1 sepal_width 3.5 1
1 sepal_length 5.1 1
2 petal_width 0.2 1
2 petal_length 1.4 1

2 sepal_width 3 1
2 sepal_length 4.9 1
3 petal_width 0.2 1
3 petal_length 1.3 1
3 sepal_width 3.2 1
3 sepal_length 4.7 1
4 petal_width 0.2 1
4 petal_length 1.5 1
4 sepal_width 3.1 1
4 sepal_length 4.6 1
... ... ... ....
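
A combined table of this form could be built, for example, by joining the Example 1 attribute and response
tables on the point identifier. The following statement is a sketch of that construction (it assumes the
Example 1 tables iris_attribute_train and iris_response_train):

DROP TABLE IF EXISTS iris_altinput;
CREATE TABLE iris_altinput DISTRIBUTE BY HASH(pid) AS
SELECT a.pid, a.attribute, a.attrvalue, r.response
FROM iris_attribute_train a JOIN iris_response_train r ON a.pid = r.pid;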

SQL-MapReduce Call

DROP TABLE IF EXISTS iris_attribute_output_2;


DROP TABLE IF EXISTS splits_small_2;
SELECT * FROM single_tree_drive (
ON (SELECT 1) PARTITION BY 1
InputTable ('iris_altinput')
OutputTable ('iris_attribute_output_2')
IntermediateSplitsTable ('splits_small_2')
NumSplits ('3')
SplitMeasure ('gini')
MaxDepth ('10')
IdColumns ('pid')
AttributeNameColumns ('attribute')
AttributeValueColumn ('attrvalue')
ResponseColumn ('response')
MinNodeSize ('10')
ApproxSplits ('false')
);

Output

Table 896: Single_Tree_Drive Example 2 Output Table

message
Input tables:"iris_altinput",
Output model table: "iris_attribute_output_2"
Depth of the tree is:6

The model table and the splits table can be viewed with the following queries; their contents are the same as in Example 1.

SELECT * FROM iris_attribute_output_2 ORDER BY 1;


SELECT * FROM splits_small_2 ORDER BY 1;

Single_Tree_Predict

Summary
The Single_Tree_Predict function applies a tree model to a data input, outputting predicted labels for each
data point.
This function can be used with real-time applications. Refer to AMLGenerator.

Usage

Single_Tree_Predict Syntax
Version 1.2

SELECT * FROM Single_Tree_Predict (


ON attribute_table AS attribute_table
PARTITION BY pid_col[,...]
ON model_table AS model_table DIMENSION
AttrTableGroupbyColumns ({ 'gcol' | 'gcol_range' } [,...])
AttrTablePIDColumns ({ 'pid_col' | 'pid_col_range' } [,...])
AttrTableValColumn ('value_column')
);

Arguments
Argument Category Description
AttrTableGroupByColumns Required Specifies the names of the columns on which attribute_table is
partitioned. Each partition contains one attribute of the input data.
AttrTablePIDColumns Required Specifies the names of the columns that define the data point
identifiers.
AttrTableValColumn Required Specifies the name of the column that contains the input values.

Input
The Single_Tree_Predict function has two input tables, the attribute table that is also input to the
Single_Tree_Drive function (described in the Input section Single_Tree_Drive) and the model table that is
output by the Single_Tree_Drive function (described in the Output section of Single_Tree_Drive).

Output
Table 897: Single_Tree_Predict Output Table Schema

Column Name Data Type Description


id_column Any Data point identifier from the attribute table.
pred_label VARCHAR Predicted response value for this input data point.

Example

Input
The Single_Tree_Predict function acts on the following tables (taken from the Single_Tree_Drive Examples)
and produces the prediction on the test set.
• Test Input:
∘ Single_Tree_Drive Example 1 Attribute Table iris_attribute_test
• Model Table Output:
∘ Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 1-6)
∘ Single_Tree_Drive Example 1 Output Table iris_attribute_output (Columns 7-11)

SQL-MapReduce Call

CREATE TABLE singletree_predict DISTRIBUTE BY hash(pid) AS


SELECT * FROM single_tree_predict (
ON iris_attribute_test AS attribute_table PARTITION BY pid
ORDER BY attribute
ON iris_attribute_output as model_table DIMENSION
AttrTable_GroupbyColumns ('attribute')
AttrTable_pidColumns ('pid')
AttrTable_valColumn ('attrvalue')
) ORDER BY pid;

Output
The predicted labels “1”, “2”, and “3” correspond to the species 'setosa', 'versicolor', and 'virginica'.
The following query returns the output shown in the following table:

SELECT * FROM singletree_predict ORDER BY 1;

Table 898: Single_Tree_Predict Example Output Table

pid pred_label
5 1
10 1

15 1
20 1
25 1
30 1
35 1
40 1
45 1
50 1
55 2
60 2
65 2
70 2
75 2
80 2
85 2
90 2
95 2
100 2
105 3
110 3
115 3
120 2
125 3
130 2
135 2
140 3
145 3
150 3

Prediction Accuracy
The following SQL code calculates and displays the prediction accuracy.

DROP TABLE IF EXISTS st_predict_accuracy;


CREATE TABLE st_predict_accuracy DISTRIBUTE BY hash(pid) AS
SELECT pid, pred_label :: INTEGER AS pred_label,species FROM
singletree_predict, iris_test WHERE id = pid;
SELECT (SELECT COUNT(pid) FROM st_predict_accuracy
WHERE pred_label = species)/(SELECT COUNT(pid)
FROM st_predict_accuracy) AS prediction_accuracy;

Table 899: Single_Tree_Predict Example Prediction Accuracy

prediction_accuracy
0.90000000000000000000
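
To see which species account for the remaining misclassifications, you can cross-tabulate the actual and
predicted labels. The following query is one possible follow-up; it reuses the st_predict_accuracy table
created above:

SELECT species, pred_label, COUNT(*) AS n
FROM st_predict_accuracy
GROUP BY species, pred_label
ORDER BY species, pred_label;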

AdaBoost Functions
• AdaBoost_Drive, which takes a training data set and a single decision tree and uses adaptive boosting to
produce a strong classifying model
• AdaBoost_Predict, which applies the strong classifying model to a new data set

Background
Boosting is a technique that develops a strong classifying algorithm from a collection of weak classifying
algorithms. A classifying algorithm is weak if its correct classification rate is slightly better than random
guessing (which is 50% for binary classification). The intuition behind boosting is that combining a set of
predictions, each of which has more than 50% probability of being correct, can produce an arbitrarily
accurate predictor function.
The AdaBoost algorithm (described by J. Zhu, H. Zou, S. Rosset, and T. Hastie, 2009, in
https://web.stanford.edu/~hastie/Papers/samme.pdf) is iterative. It starts with a weak classifying algorithm, and
each iteration gives higher weights to the data points that the previous iteration classified incorrectly—a
technique called Adaptive Boosting, for which the AdaBoost algorithm is named. AdaBoost constructs a
strong classifier as a linear combination of weak classifiers.
The AdaBoost_Drive function uses a single decision tree as the initial weak classifying algorithm.
Boosting can be very sensitive to noise in the data. Because weak classifiers are likely to incorrectly classify
outliers, the algorithm weights outliers more heavily with each iteration, thereby increasing their influence
on the final result.
The boosting process is:
1. Train on a data set, using a weak classifier. (For the first iteration, all data points have equal weight.)
2. Calculate the weighted training error.
3. Calculate the weight of the current classifier to use in the final calculation (step 6).
4. Update the weights for the next iteration by decreasing the weights of the correctly classified data points
and increasing the weights of the incorrectly classified data points.
5. Repeat steps 1 through 4 for each weak classifier.

6. Calculate the strong classifier as a weighted vote of the weak classifiers, using the weights calculated in
step 3.
Mathematically:
1. Assume that the training set has n data points classified into K classes:
(x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y_i is an element of {1, 2, ..., K}
2. In the first iteration, assign the same weight to each data point:
w_1(i) = 1/n
3. For each of the T weak classifiers:
a. Fit classifier h_t(x) to the training data, using weights w_t.
b. Calculate the weighted error rate for the classifier:
err_t = Σ_i w_t(i) I(h_t(x_i) ≠ y_i) / Σ_i w_t(i)
where I is the indicator function, which is 1 when its argument is true and 0 otherwise.
c. Calculate the weight of the classifier h_t:
α_t = log((1 - err_t) / err_t) + log(K - 1)
d. Update the weights, increasing the weights of the data points that h_t classified incorrectly:
w_{t+1}(i) = w_t(i) exp(α_t I(h_t(x_i) ≠ y_i)) for all i = 1, 2, ..., n
and then renormalize w_{t+1} so that the weights sum to 1.
4. Calculate the strong classifier:
H(x) = arg max_k Σ_{t=1..T} α_t I(h_t(x) = k)
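
As a hypothetical numeric illustration of steps 3c and 3d (these numbers are not from the function): with
K = 3 classes, a weak classifier with weighted error err_t = 0.3 receives weight
α_t = log(0.7/0.3) + log(3 - 1) ≈ 0.847 + 0.693 = 1.540, while a classifier with err_t = 0.5 receives only
log(1) + log(2) ≈ 0.693. The multi-class term log(K - 1) ensures that any classifier whose error is below
(K - 1)/K (that is, better than random guessing among K classes) receives a positive weight.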

AdaBoost_Drive

Summary
The AdaBoost_Drive function takes a training data set and a single decision tree and uses adaptive boosting
to produce a strong classifying model that can be input to the function AdaBoost_Predict.

Usage

AdaBoost_Drive Syntax

Version 1.5

SELECT * FROM AdaBoost_Drive (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database( 'db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
AttributeTable ('attribute_table')
AttributeNameColumns ('attribute_name_column' [,...] )
AttributeValueColumn ('attribute_value_column')
[ CategoricalAttributeTable ('cat_attribute_table') ]
ResponseTable ('response_table')
OutputTable ('output_table')
IdColumns ('id_column' [,...] )
ResponseColumn ('response_column')
[ IterNum ('iterations') ]
[ NumSplits ('splits') ]
[ ApproxSplits ({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
[ SplitMeasure ({ 'gini' | 'entropy' }) ]
[ MaxDepth ('max_depth') ]
[ MinNodeSize ('min_node_size') ]
[ DropOutputTable
({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
AttributeTable Required Specifies the name of the table that contains the
attributes and values of the data.
AttributeNameColumns Required Specifies the names of attribute table columns that
contain the data attributes.
AttributeValueColumn Required Specifies the name of the attribute table column that
contains the data values.
CategoricalAttributeTable Optional Specifies the name of the table that contains the
names of the categorical attributes.

ResponseTable Required Specifies the name of the table that contains the
responses (labels) of the data.
OutputTable Required Specifies the name of the output table in which the
function stores the predictive model it generates.
IdColumns Required Specifies the names of the columns in the response
and attribute tables that specify the identifier of the
instance.
ResponseColumn Required Specifies the name of the response table column that
contains the responses (labels) of the data.
IterNum Optional Specifies the number of iterations to boost the weak
classifiers, which is also the number of weak
classifiers in the ensemble (T). The iterations value
must be an INTEGER in the range [2, 200]. The
default value is 20.
NumSplits Optional Specifies the number of splits to try for each
attribute in the node splitting. The splits value must
be an INTEGER. The default value is 10.
ApproxSplits Optional Specifies whether to use approximate percentiles.
The default value is 'true'.
SplitMeasure Optional Specifies the type of measure to use in node splitting.
The default value is 'gini'.
MaxDepth Optional Specifies the maximum depth of the tree. The
max_depth value must be an INTEGER in the range
[1, 10]. The default value is 3.
MinNodeSize Optional Specifies the minimum size of any particular node
within each decision tree. The min_node_size value
must be an INTEGER. The default value is 100.
DropOutputTable Optional Specifies whether to drop output_table if it exists.
The default value is 'false'.

Input
The function requires an attribute table and a response table, and has an optional categorical attribute table.
Table 900: AdaBoost_Drive Attribute Table Schema

Column Name Data Type Description


id_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Point identifier. Cannot be NULL.
attribute_name_column VARCHAR Attribute name. Cannot be NULL.
Every categorical attribute in this table must be given a
nonempty partition in the categorical attribute table.
attribute_value_column Numeric attribute: INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION. Categorical attribute: Any. Attribute value. Can be NULL, in which case the function estimates the value as the arithmetic mean of that attribute. If the value is out of range, the function cannot use it to partition the training data.

Table 901: AdaBoost_Drive Response Table Schema

Column Name Data Type Description


id_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Point identifier. Cannot be NULL.
response_column INTEGER, BIGINT, NUMERIC, VARCHAR, or DOUBLE PRECISION Response value for the point. Can be NULL.

Table 902: AdaBoost_Drive Categorical Attribute Table Schema

Column Name Data Type Description


categorical_attribute_name_column VARCHAR Categorical attribute name.

Output
The function outputs a message and a model table.
Table 903: AdaBoost_Drive Output Message Schema

Column Name Data Type Description


message VARCHAR Reports the names of the input tables, the number of classification trees
computed, and the name of the output table.

Table 904: AdaBoost_Drive Model Table Schema

Column Name Data Type Description


classifier_id INTEGER Classifier identifier in the range [1, T].
classifier_weight DOUBLE PRECISION Classifier weight (the weight αt computed in step 3c of the boosting process).
node_id INTEGER Node identifier.
node_label VARCHAR Node label.
node_majorfreq DOUBLE The weighted sum of the objects that belong to the category
PRECISION identified by node_label.
attribute VARCHAR Name of the attribute.
split_value DOUBLE Split value for numeric attributes. For categorical attributes, this
PRECISION column contains NaN.
left_bucket VARCHAR When the split value is a categorical attribute, the value in the left
child of the node.
right_bucket VARCHAR When the split value is a categorical attribute, the value in the right
child of the node.
left_label VARCHAR The output category for left child of the node.
right_label VARCHAR The output category for right child of the node.
left_majorfreq DOUBLE The weighted sum of the objects that belong to the category
PRECISION identified by left_label.
right_majorfreq DOUBLE The weighted sum of the objects that belong to the category
PRECISION identified by right_label.
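
One way to gauge the relative influence of each weak classifier in a generated model is to list the distinct
classifier weights. The following query is a sketch against the model table abd_model that is created in the
example below:

SELECT DISTINCT classifier_id, classifier_weight
FROM abd_model
ORDER BY classifier_weight DESC;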

Example
This example uses home sales data to create a model that predicts home style, which can be input to the
AdaBoostPredict Example.

Input
Forest_Drive Example Input Data Descriptions describes the real estate sales data contained in the input
table. There are six numerical predictors and six categorical predictors. The response variable is homestyle.
The table of raw training data, housing_train, is described by the following two tables.

Table 905: AdaBoostDrive Example Raw Input Table housing_train, Columns 1-9

sn price lotsize bedrooms bathrms stories driveway recroom fullbase


1 42000 5850 3 1 2 yes no yes
2 38500 4000 2 2 1 yes no no
3 49500 3060 3 3 1 yes no no
4 60500 6650 3 1 2 yes yes no
5 61000 6360 2 1 1 yes no no
6 66000 4160 3 1 1 yes yes yes
7 66000 3880 3 2 2 yes no yes
8 69000 4160 3 1 1 yes no no
9 83800 4800 3 1 1 yes yes yes
10 88500 5500 3 2 4 yes yes no
11 90000 7200 3 2 1 yes no yes
12 30500 3000 2 1 1 yes no no
14 36000 2880 3 1 1 no no no
15 37000 3600 2 1 1 yes no no
... ... ... ... ... ... ... ... ...

Table 906: AdaBoostDrive Example Input Table housing_train, Columns 10-14

gashw airco garagepl prefarea homestyle


no no 1 no Classic
no no 0 no Classic
no no 0 no Classic
no no 0 no Eclectic
no no 0 no Eclectic
no yes 0 no Eclectic
no no 2 no Eclectic
no no 0 no Eclectic
no no 0 no Eclectic
no yes 1 no Eclectic
no yes 3 no Eclectic
no no 0 no Classic
no no 0 no Classic
no no 0 no Classic

982 Teradata Aster Analytics Foundation User Guide


Chapter 9: Ensemble Methods
AdaBoost_Drive

gashw airco prefarea garagepl homestyle


... ... ... ... ...

Create the input table for the AdaBoostDrive function, housing_train_att, by using the Unpivot function on
the table of raw data, housing_train:

CREATE TABLE housing_train_att distribute BY HASH(sn) AS (


SELECT * FROM unpivot (
ON housing_train
ColsToUnpivot ('price', 'lotsize', 'bedrooms', 'bathrms',
'stories', 'driveway', 'recroom', 'fullbase', 'gashw', 'airco',
'garagepl', 'prefarea')
ColsToAccumulate ('sn')
)
);

This query returns the following table:

SELECT * FROM housing_train_att ORDER BY 1, 2;

Table 907: AdaBoost Functions Example Input Table housing_train_att

sn attribute value
1 airco no
1 bathrms 1
1 bedrooms 3
1 driveway yes
1 fullbase yes
1 garagepl 1
1 gashw no
1 lotsize 5850.0
1 prefarea no
1 price 42000.0
1 recroom no
1 stories 2
2 airco no
2 bathrms 1
2 bedrooms 2
2 fullbase no
2 garagepl 0

Teradata Aster Analytics Foundation User Guide 983


Chapter 9: Ensemble Methods
AdaBoost_Drive

sn attribute value
2 gashw no
2 lotsize 4000.0
2 prefarea no
2 price 38500.0
2 recroom no
2 stories 1
... ... ...

Create the response table for the AdaBoostDrive function, housing_train_response, by selecting the columns
sn and homestyle from the table of raw data, housing_train:

CREATE TABLE housing_train_response DISTRIBUTE BY HASH(sn) AS


(SELECT sn, homestyle AS response FROM housing_train);

This query returns the following table:

SELECT * FROM housing_train_response ORDER BY 1;

Table 908: AdaBoost Functions Example Input Table housing_train_response

sn response
1 Classic
2 Classic
3 Classic
4 Eclectic
5 Eclectic
6 Eclectic
7 Eclectic
8 Eclectic
9 Eclectic
10 Eclectic
... ...

Create and populate the categorical attribute table, housing_cat, for the AdaBoostDrive function:

CREATE TABLE housing_cat (attribute VARCHAR)


DISTRIBUTE BY HASH(attribute);
INSERT INTO housing_cat VALUES ('driveway');
INSERT INTO housing_cat VALUES ('recroom');
INSERT INTO housing_cat VALUES ('fullbase');

984 Teradata Aster Analytics Foundation User Guide


Chapter 9: Ensemble Methods
AdaBoost_Drive
INSERT INTO housing_cat VALUES ('gashw');
INSERT INTO housing_cat VALUES ('airco');
INSERT INTO housing_cat VALUES ('prefarea');

This query returns the following table:

SELECT * FROM housing_cat ORDER BY 1;

Table 909: AdaBoost Functions Example Input Table housing_cat

attribute
airco
driveway
fullbase
gashw
prefarea
recroom

SQL-MapReduce Call
Create the model, abd_model, using the default values for the optional arguments:

SELECT * FROM adaboost_drive (


ON (SELECT 1) PARTITION BY 1
AttributeTable ('housing_train_att')
CategoricalAttributeTable ('housing_cat')
ResponseTable ('housing_train_response')
OutputTable ('abd_model')
IDColumns ('sn')
AttributeNameColumns ('attribute')
AttributeValueColumn ('value')
ResponseColumn ('response')
IterNum (20)
NumSplits (10)
MaxDepth (3)
MinNodeSize (100)
DropOutputTable ('true')
);

Output
Because the argument IterNum has the value 20, the function builds 20 classification trees.
Table 910: AdaBoost_Drive Example Output Message

message
Input tables:"housing_train_att", "housing_cat", "housing_train_response"

Running 20 round AdaBoost, computing 20 classification trees.
AdaBoost model created in table "abd_model"

This query returns the model table, described by the following two tables:

SELECT * FROM abd_model ORDER BY 1,3;

Table 911: AdaBoost_Drive Example Model Table, Columns 1-7

classifier_id classifier_weight node_id node_label node_majorfreq attribute split_value


1 2.82867818889083 0 Eclectic 0.601626016260163 price 49500
1 2.82867818889083 2 Eclectic 0.601626016260163 price 90000
1 2.82867818889083 5 Eclectic 0.538617886178861 price 55000
2 3.284500712218 0 Eclectic 0.598193473193469 price 55000
2 3.284500712218 1 Classic 0.359382284382285 gashw NaN
2 3.284500712218 2 Eclectic 0.568648018648017 garagepl 1
2 3.284500712218 4 Classic 0.357867132867133 fullbase NaN
2 3.284500712218 5 Eclectic 0.257575757575758 lotsize 16200
2 3.284500712218 6 Eclectic 0.311072261072261 stories 4
3 2.87425191371606 0 Eclectic 0.458459069137445 price 95000
3 2.87425191371606 1 Eclectic 0.384960363716916 price 51500
3 2.87425191371606 3 Classic 0.142718762142357 garagepl 2
... ... ... ... ... ... ...

Table 912: AdaBoost_Drive Example Model Table, Columns 8-13

left_bucket right_bucket left_label right_label left_majorfreq right_majorfreq


Classic Eclectic 0.241869918699188 0.601626016260163
Eclectic bungalow 0.538617886178861 0.113821138211382
Eclectic Eclectic 0.0792682926829268 0.459349593495934
Classic Eclectic 0.359382284382285 0.568648018648015
yes no Eclectic Classic 0.00303030303030304 0.357867132867133
Eclectic Eclectic 0.257575757575758 0.311072261072259
yes no Classic Classic 0.0294871794871793 0.328379953379953
Eclectic bungalow 0.257575757575758 0.000757575757575758
Eclectic Eclectic 0.295221445221445 0.0158508158508158

986 Teradata Aster Analytics Foundation User Guide


Chapter 9: Ensemble Methods
AdaBoost_Predict

left_bucket right_bucket left_label right_label left_majorfreq right_majorfreq


Eclectic bungalow 0.384960363716916 0.398822168720203
Classic Eclectic 0.142718762142357 0.348728479658944
Classic Eclectic 0.13649642855846 0.0144927536231885
... ... ... ... ...

AdaBoost_Predict

Summary
The AdaBoost_Predict function applies the model output by the AdaBoost_Drive function to a new data set.

Usage

AdaBoost_Predict Syntax
Version 1.5

SELECT * FROM AdaBoost_Predict (


ON { table | view | query } AS attributetable PARTITION BY key
ON { table | view | query } AS model DIMENSION
AttrTableGroupbyColumns ('group_column' [,...] )
AttrTablePidColumns ('pid_column' [,...])
AttrTableValColumn ('value_column')
);

Arguments
Argument Category Description
AttrTableGroupbyColumns Required Specifies the names of the columns on which the attribute table is
partitioned.
AttrTablePidColumns Required Specifies the names of the attribute table columns that contain the
data point identifiers.
AttrTableValColumn Required Specifies the name of the attribute table column that contains the
data point values.

Input
The function requires the attribute table that is also input to the AdaBoost_Drive function and the model
table output by the AdaBoost_Drive function.

Output
Table 913: AdaBoost_Predict Output Table Schema

Column Name Data Type Description


pid_column INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), TEXT, VARCHAR, VARCHAR(n), UUID, or BYTEA Data point identifier from the attribute table.
pred_label VARCHAR Predicted response value for this input data point.

Example
This example uses test data and the model output by the AdaBoostDrive Example to use real estate sales data
to predict home style.

Input
The table of raw test data, housing_test, is described by the following two tables.
Table 914: AdaBoostPredict Example Raw Input Table housing_test, Columns 1-9

sn price lotsize bedrooms bathrms stories driveway recroom fullbase


13 27000 1700 3 1 2 yes no no
16 37900 3185 2 1 1 yes no no
25 42000 4960 2 1 1 yes no no
38 67000 5170 3 1 4 yes no no
53 68000 9166 2 1 1 yes no yes
104 132000 3500 4 2 2 yes no no
111 43000 5076 3 1 1 no no no
117 93000 3760 3 1 2 yes no no
132 44500 3850 3 1 2 yes no no
140 43000 3750 3 1 2 yes no no
142 40000 2650 3 1 2 yes no yes
157 60000 2953 3 1 2 yes no yes

... ... ... ... ... ... ... ... ...

Table 915: AdaBoostPredict Example Input Table housing_test, Columns 10-14

gashw airco garagepl prefarea homestyle


no no 0 no Classic
no yes 0 no Classic
no no 0 no Classic
no yes 0 no Eclectic
no yes 2 no Eclectic
yes no 2 no bungalow
no no 0 no Classic
yes no 2 no Eclectic
no no 0 no Classic
no no 0 no Classic
no no 1 no Classic
no yes 0 no Eclectic
... ... ... ... ...

Create the input table for the AdaBoostPredict function, housing_test_att, by using the Unpivot function on
the table of raw data, housing_test:

CREATE TABLE housing_test_att distribute BY HASH(sn) AS (


SELECT * FROM unpivot (
ON housing_test
ColsToUnpivot ('price', 'lotsize', 'bedrooms', 'bathrms',
'stories', 'driveway', 'recroom', 'fullbase', 'gashw', 'airco',
'garagepl', 'prefarea')
ColsToAccumulate ('sn')
)
);

This query returns the following table:

SELECT * FROM housing_test_att ORDER BY 1, 2;

Table 916: AdaBoost Functions Example Input Table housing_test_att

sn attribute value
13 airco no
13 bathrms 1

13 bedrooms 3
13 driveway yes
13 fullbase no
13 garagepl 0
13 gashw no
13 lotsize 1700.0
13 prefarea no
13 price 27000.0
13 recroom no
13 stories 2
16 airco yes
16 bathrms 1
16 bedrooms 2
16 driveway yes
16 fullbase no
16 garagepl 0
16 gashw no
16 lotsize 3185.0
16 prefarea no
16 price 37900.0
16 recroom no
16 stories 1
... ... ...

SQL-MapReduce Call

CREATE DIMENSION TABLE housing_prediction AS


SELECT * FROM adaboost_predict (
ON housing_test_att AS attributetable PARTITION BY sn
ON abd_model AS model DIMENSION
AttrTableGroupbyColumns ('attribute')
AttrTablePidColumns ('sn')
AttrTableValColumn ('value')
);

This query returns the following table:

SELECT * FROM housing_prediction ORDER BY 1;

Output
The pred_label column contains the predicted response.
Table 917: AdaBoost_Predict Example Output Table

sn pred_label
13 Classic
16 Classic
25 Classic
38 Eclectic
53 Eclectic
104 Bungalow
111 Classic
117 Eclectic
132 Classic
140 Classic
142 Classic
157 Eclectic
161 Eclectic
162 Bungalow
176 Eclectic
177 Eclectic
195 Classic
198 Classic
224 Eclectic
234 Classic
237 Classic
239 Classic
249 Classic
251 Classic
254 Eclectic

255 Eclectic
260 Classic
274 Eclectic
294 Classic
301 Eclectic
306 Eclectic
317 Eclectic
329 Bungalow
339 Bungalow
340 Eclectic
353 Eclectic
355 Eclectic
364 Eclectic
367 Bungalow
377 Bungalow
401 Eclectic
403 Eclectic
408 Eclectic
411 Eclectic
440 Eclectic
441 Eclectic
443 Eclectic
459 Classic
463 Classic
469 Eclectic
472 Eclectic
527 Bungalow
530 Eclectic
540 Eclectic

Prediction Accuracy
This query displays the prediction accuracy:

SELECT (SELECT COUNT(*) FROM housing_prediction, housing_test
WHERE housing_prediction.sn = housing_test.sn
AND housing_prediction.pred_label = housing_test.homestyle) /
(SELECT COUNT(sn) FROM housing_prediction) AS PA;

Table 918: AdaBoost_Predict Example Prediction Accuracy

PA
0.98148148148148148148

The prediction accuracy is 98.1%, a large improvement over the Forest_Predict function, whose prediction
accuracy is 77.7% on the same input.

CHAPTER 10
Association Analysis

Association Analysis
• Basket_Generator
• CFilter
• FPGrowth
• Recommender Functions

Basket_Generator

Summary
The Basket_Generator function generates baskets (sets) of items. The input is typically a set of purchase
transaction records or web page view logs. Each basket is a unique combination or permutation of items.
You can use the baskets as part of a collaborative filtering algorithm, which is useful for analyzing purchase
behavior of users in a store or on a web site. You can also use this function on activity data (for example,
“users who viewed this page also viewed these pages”).

Background
Retailers mine transaction data to find combinations (baskets) of items that customers purchase together or
shop for at the same time. Retailers frequently must automatically identify such baskets, look for trends over
time, and compare other attributes (such as stores).
The Basket_Generator function is intended to facilitate market basket analysis by operating on data that is
structured in a form typical of retail transaction history databases.

Usage

Basket_Generator Syntax
Version 1.3

SELECT * FROM Basket_Generator (


ON { table_name | view_name| (query) }
PARTITION BY partition_column [,...]
BasketItem ('basket_item' [,...])
[ BasketSize ('basket_size') ]

[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Combination ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ MaxItems ('item_set_max') ]
);

Arguments
Argument Category Description
BasketItem Required Specifies the names of the input columns that contain the items to be
collected into baskets. If you specify multiple columns, the function
treats every unique combination of column values as one item.
For example, you could specify only the column that contains the stock
keeping unit (SKU) that identifies an item that was sold. Alternatively,
you could specify the SKU column and the columns that contain the
month manufactured, color and size.
BasketSize Optional Specifies the number of items to be included in a basket (an INTEGER
value). The default value is 2.
Accumulate Optional Specifies the names of the input columns to copy to the output table.
Each accumulate_column must be a partition_column; otherwise, the
function is nondeterministic. However, not every partition_column must
be an accumulate_column.
Combination Optional Specifies whether the function returns a basket for each unique
combination of items. The default value is 'true'. If you specify 'false',
then the function returns a basket for each unique permutation of items.
In a combination, item order is irrelevant. For example, the baskets
"tomatoes and basil" and "basil and tomatoes" are equivalent.
In a permutation, item order is relevant. For example, the baskets
"tomatoes and basil" and "basil and tomatoes" are not equivalent.
The function returns combinations and permutations in lexicographical
order.
MaxItems Optional Specifies the maximum number of items in a partition (an INTEGER
value). If the number of items in a partition exceeds item_set_max, then
the function ignores that partition. The default value is 100.

Input
The following table describes the input table columns that you can specify in function arguments. The input
table can have additional columns, but the function ignores them.
Table 919: Basket_Generator Input Table Schema

Column Name Data Type Description


basket_item BOOLEAN, INTEGER, BIGINT, or VARCHAR Contains items to be collected into baskets. The input table must have at least one such column.
accumulate_column Any Column to be copied to the output table. Must be a
partition_column.

Output
In the output table, each row represents a basket.
Table 920: Basket_Generator Output Table Schema

Column Name Data Type Description


accumulate_column VARCHAR Copied from the input table.
basket_item VARCHAR Contains items (values) from the input column basket_item.
If the BasketItem argument specifies c columns and the basket
size is r, then the output table has c * r such columns.

If the number of combinations or permutations exceeds one million, then the function outputs no rows.
If n is the number of distinct items that can appear in a basket and r is basket_size, then:
• The maximum possible number of combinations is nCr, or n!/(r!(n-r)!)
• The maximum possible number of permutations is nPr, or n!/(n-r)!
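
For example, with n = 5 distinct items and basket_size r = 2, there are 5!/(2!3!) = 10 possible combinations
and 5!/3! = 20 possible permutations. This matches Example 1 below, in which transaction 999 contains five
distinct items and therefore yields ten two-item combination baskets.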

Examples
• Input
• Example 1: Partition by tranid
• Example 2: Increase BasketSize

Input
These examples both use the same grocery data (grocery_transaction) for a sample of five transactions or
customers. The function outputs the different combinations of two items (basket size of 2) grouped by the
transaction id(tranid).
Table 921: Basket_Generator Example Input Table grocery_transaction

tranid period storeid region item sku category


999 20100715 1 west milk 1 dairy
999 20100715 1 west butter 2 dairy
999 20100715 1 west eggs 3 dairy
999 19990715 1 west flour 4 baking
999 19990715 1 west spinach 4 produce

1000 20100715 1 west milk 1 dairy
1000 20100715 1 west eggs 3 dairy
1000 19990715 1 west flour 4 baking
1000 19990715 1 west spinach 2 produce
1001 20100715 1 west milk 1 dairy
1001 20100715 1 west butter 2 dairy
1001 20100715 1 west eggs 3 dairy
1002 20100715 1 west milk 1 dairy
1002 20100715 1 west butter 2 dairy
1002 20100715 1 west spinach 3 produce
1500 20100715 3 west butter 2 dairy
1500 20100715 3 west eggs 3 dairy
1500 20100715 3 west flour 4 baking

Example 1: Partition by tranid


Partition by tranid is recommended for use on a small number of transactions.

SQL-MapReduce Call

SELECT * FROM Basket_Generator (


ON grocery_transaction PARTITION BY tranid
BasketItem ('item')
BasketSize ('2')
Accumulate ('tranid')
Combination ('true')
) ORDER BY tranid;

Output

Table 922: Basket_Generator Example 1 Output Table

tranid item1 item2


999 butter eggs
999 butter flour
999 butter milk
999 butter spinach
999 eggs flour

999 eggs milk
999 eggs spinach
999 flour milk
999 flour spinach
999 milk spinach
1000 eggs flour
1000 eggs milk
1000 eggs spinach
1000 flour milk
1000 flour spinach
1000 milk spinach
1001 butter eggs
1001 butter milk
1001 eggs milk
1002 butter milk
1002 butter spinach
1002 milk spinach
1500 butter eggs
1500 butter flour
1500 eggs flour

Example 2: Increase BasketSize


The number of combinations typically decreases as the basket size increases.

SQL-MapReduce Call

SELECT * FROM Basket_Generator (


ON grocery_transaction PARTITION BY tranid
BasketItem ('item')
BasketSize ('4')
Accumulate ('tranid')
Combination ('true')
) ORDER BY tranid;

Output

Table 923: Basket_Generator Example 2 Output Table

tranid item1 item2 item3 item4


999 butter eggs flour milk
999 butter eggs flour spinach
999 butter eggs milk spinach
999 butter flour milk spinach
999 eggs flour milk spinach
1000 eggs flour milk spinach

CFilter

Summary
The CFilter function performs collaborative filtering by using a series of SQL commands and
SQL-MapReduce functions. You run this function by using an internal JDBC wrapper function.

Background
Analysts use collaborative filtering to find items or events that are frequently paired with other items or
events. For example, an online store that tells a shopper, “Other shoppers who bought this item also bought
these items” uses a collaborative filtering algorithm. A networking site that tells a user, “Those who viewed
this profile also viewed these profiles” also uses a collaborative filtering algorithm. CFilter is a general-
purpose collaborative filter that can provide answers in many similar use cases.

Usage

CFilter Syntax
Version 1.7

SELECT * FROM CFilter (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')

InputColumns ({ 'input_column' | 'input_column_range' }[,...])
JoinColumns ({ 'join_column' | 'join_column_range' }[,...])
[ AddColumns ({ 'add_column' | 'add_column_range' } [,...]) ]
[ PartitionKeyColumn ('partition_key_column') ]
[ MaxItemSet ('max_item_set') ]
[ DropTable ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

Note:
For information about the authentication arguments, refer to the following usage notes in Aster Analytics
Foundation User Guide, Chapter 2:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the data to filter.
OutputTable Required Specifies the name of the output table that the function creates. The table
must not exist.
InputColumns Required Specifies the names of the input table columns that contain the data to
filter.
JoinColumns Required Specifies the names of the input table columns to join.
AddColumns Optional Specifies the names of the input columns to copy to the output table. The
function partitions the input data and the output table on these columns.
By default, the function treats the input data as belonging to one
partition.

Note:
Specifying a column as both an add_column and a join_column
causes incorrect counts in partitions.

PartitionKeyColumn Optional Specifies the name of the output column to use as the partition key. The
default value is 'col1_item1'.
MaxItemSet Optional Specifies the maximum size of the item set. The default value is 100.
DropTable Optional Specifies whether to drop the output table if it exists. The default value is
'false'.

Input
The CFilter function has one input table. The following table describes the columns that appear in function
arguments. The table can have additional columns, but the function ignores them.

Table 924: CFilter Input Table Schema

Column Name Data Type Description


input_column VARCHAR Contains data to filter.
join_column Any Column to join.
add_column Any Column to copy to the output table, used to partition the input data and
output table. Must not be a join_column.

Output
Table 925: CFilter Output Table Schema

Column Name Data Type Description


col1_item1 VARCHAR Name of item1.
col1_item2 VARCHAR Name of item2.
cntb INTEGER Count of co-occurrence of both items in the partition.
cnt1 INTEGER Count of occurrence of item1 in the partition.
cnt2 INTEGER Count of occurrence of item2 in the partition.
score DOUBLE Product of two conditional probabilities:
PRECISION P({ item2 | item1 }) * P({ item1 | item2 })
The preceding product equals the following quotient:
(cntb * cntb)/(cnt1 * cnt2)
support DOUBLE Percentage of transactions in the partition in which the two items co-
PRECISION occur, calculated with this formula:
cntb/tran_cnt
where tran_cnt is the number of transactions in the partition.
For example, if eggs and milk were purchased together 3 times in 5
transactions in the same store, and the data is partitioned by store, then
the support value in the partition is 3/5 = 0.6 = 60%.
confidence DOUBLE Percentage of transactions in the partition in which item1 occurs, in
PRECISION which item2 also occurs, calculated with this formula:
cntb/cnt1
For example, if, in the same store, the number of times that a customer
buys both milk (item1) and butter (item2) is 3 (cntb) and the number of
times that a customer buys milk is 4 (cnt1), then the confidence that a
person who buys milk will also buy butter is 3/4 = 0.75 = 75%.
lift DOUBLE Ratio of the observed support value to the expected support value if item1
PRECISION and item2 were independent; that is:
support(item1 and item2) / [support(item1) * support(item2)]
The value is calculated with this formula:
(cntb/tran_cnt) / [(cnt1/tran_cnt) * (cnt2/tran_cnt)]

If lift > 1, the occurrence of item1 or item2 has a positive effect on the
occurrence of the other item.
If lift = 1, the occurrence of item1 or item2 has no effect on the
occurrence of the other item.
If lift < 1, the occurrence of item1 or item2 has a negative effect on the
occurrence of the other item.
z_score DOUBLE Significance of co-occurrence, assuming that cntb follows a normal
PRECISION distribution, calculated with this formula:
(cntb - mean(cntb)) / sd(cntb)
If all cntb values are equal, then sd(cntb) is 0, and the function does not
calculate z_score.
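
As a worked check of these formulas against Example 2 below: for the pair (Corporate, Home Office),
cntb = 16, cnt1 = 17, cnt2 = 16, and tran_cnt = 17. Then score = (16*16)/(17*16) ≈ 0.941,
support = 16/17 ≈ 0.941, confidence = 16/17 ≈ 0.941, and lift = (16/17) / ((17/17) * (16/17)) = 1, matching
the values in the output tables of that example.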

Deleting Duplicate Output Table Rows


The output table often includes many duplicate rows, because each pair of items appears in two rows—one
row has item1 in col1_item1 and item2 in col1_item2, and the other row has item2 in col1_item1 and item1
in col1_item2. To delete duplicate rows from the output table, use this code (where output_table is the
output table name):

DROP TABLE IF EXISTS Copy;


CREATE TABLE Copy DISTRIBUTE BY HASH(col1_item1)
AS SELECT *, ROW_NUMBER() OVER(ORDER BY col1_item1, col1_item2) rn
FROM output_table;
DROP TABLE IF EXISTS DuplicatesRemoved;
CREATE TABLE DuplicatesRemoved DISTRIBUTE BY HASH(col1_item1)
AS SELECT col1_item1, col1_item2,
ROW_NUMBER() OVER(ORDER BY col1_item1, col1_item2) rn FROM Copy;
DELETE FROM DuplicatesRemoved WHERE rn IN (
SELECT CASE WHEN a.rn > b.rn THEN b.rn ELSE a.rn END minrn
FROM DuplicatesRemoved a
JOIN Copy b
ON a.col1_item1=b.col1_item2 AND a.col1_item2=b.col1_item1);
DROP TABLE IF EXISTS Copy;

Table 926: CFilter DuplicatesRemoved Table Schema

Column Name Data Type Description


col1_item1 VARCHAR Name of item1.
col1_item2 VARCHAR Name of item2.
rn INTEGER Row number in output_table when ordered by col1_item1, col1_item2.
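
Because DuplicatesRemoved retains only the item columns and row numbers, a join back to the original
output table can recover the full statistics for the surviving unordered pairs. The following query is a sketch
(output_table is again the CFilter output table name):

SELECT o.*
FROM output_table o, DuplicatesRemoved d
WHERE o.col1_item1 = d.col1_item1
AND o.col1_item2 = d.col1_item2;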

Examples
• Input
• Example 1: Collaborative Filtering by Product
• Example 2: Collaborative Filtering by Customer Segment

Input
The input is sales transaction data of an office supply chain store. The columns are:
• orderid: order (transaction) identifier
• orderdate: order date
• orderqty: quantity of product ordered
• region: geographic region of store where order was placed
• customer_segment: segment of customer who ordered product
• prd_category: category of product ordered
• product: product ordered
Table 927: CFilter Examples Input Table sales_transaction

orderid orderdate orderqty region customer_segment prd_category product


3 2010-10-13 00:00:00 6 Nunavut Small Business Office Supplies Storage & Organization
293 2012-10-01 00:00:00 49 Nunavut Consumer Office Supplies Appliances
293 2012-10-01 00:00:00 27 Nunavut Consumer Office Supplies Binders and Binder Accessories
483 2011-07-10 00:00:00 30 Nunavut Corporate Technology Telephones and Communication
515 2010-08-28 00:00:00 19 Nunavut Consumer Office Supplies Appliances
515 2010-08-28 00:00:00 21 Nunavut Consumer Furniture Office Furnishings
613 2011-06-17 00:00:00 12 Nunavut Corporate Office Supplies Binders and Binder Accessories
613 2011-06-17 00:00:00 22 Nunavut Corporate Office Supplies Storage & Organization
643 2011-03-24 00:00:00 21 Nunavut Corporate Office Supplies Storage & Organization
678 2010-02-26 00:00:00 44 Nunavut Home Office Office Supplies Paper
807 2010-11-23 00:00:00 45 Nunavut Home Office Office Supplies Paper
807 2010-11-23 00:00:00 32 Nunavut Home Office Office Supplies Rubber Bands
868 2012-06-08 00:00:00 32 Nunavut Home Office Office Supplies Appliances

... ... ... ... ... ... ...

Example 1: Collaborative Filtering by Product


Collaborative filtering by product is also called item-based collaborative filtering.

SQL-MapReduce Call

SELECT * FROM cfilter (


ON (SELECT 1)
PARTITION BY 1
InputTable ('sales_transaction')
OutputTable ('cfilter_output')
InputColumns ('product')
JoinColumns ('orderid')
AddColumns ('region')
DropTable ('true')
);

Output
This query returns the output shown in the table cfilter_output:

SELECT * FROM cfilter_output ORDER BY region, score;

Because all cntb values are equal, no z-scores appear.


Table 928: CFilter Example 1 Output Table cfilter_output (Columns 1-6)

region col1_item1 col1_item2 cntb cnt1 cnt2


Atlantic Paper Office Furnishings 1 4 5
Atlantic Office Furnishings Paper 1 5 4
Atlantic Binders and Binder Accessories Office Furnishings 1 3 5
Atlantic Office Furnishings Binders and Binder Accessories 1 5 3
Atlantic Computer Peripherals Office Furnishings 1 2 5
Atlantic Office Furnishings Computer Peripherals 1 5 2
Atlantic Labels Pens & Art Supplies 1 3 2
Atlantic Pens & Art Supplies Labels 1 2 3
... ... ... ... ... ...

Table 929: CFilter Example 1 Output Table cfilter_output (Columns 7-11)

score support confidence lift z_score


0.05 0.0555555555555556 0.25 0.9
0.05 0.0555555555555556 0.2 0.9
0.0666666666666667 0.0555555555555556 0.333333333333333 1.2
0.0666666666666667 0.0555555555555556 0.2 1.2
0.1 0.0555555555555556 0.5 1.8
0.1 0.0555555555555556 0.2 1.8
0.166666666666667 0.0555555555555556 0.333333333333333 3
0.166666666666667 0.0555555555555556 0.5 3
... ... ... ... ...
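If you want to examine only the strongest item pairs, you can filter the persisted output table on the computed measures. The following query is an illustrative sketch, not part of the original example; it assumes the cfilter_output table created by the call above:

SELECT region, col1_item1, col1_item2, score, lift
FROM cfilter_output
WHERE lift > 1
ORDER BY lift DESC;

A lift greater than 1 indicates that the two items co-occur more often than expected if they were independent.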

Example 2: Collaborative Filtering by Customer Segment

SQL-MapReduce Call

SELECT * FROM cfilter (
  ON (SELECT 1)
  PARTITION BY 1
  InputTable ('sales_transaction')
  OutputTable ('cfilter_output1')
  InputColumns ('customer_segment')
  JoinColumns ('product')
  DropTable ('true')
);

Output
This query returns the output shown in the table cfilter_output1:

SELECT * FROM cfilter_output1 ORDER BY col1_item1, score;

Table 930: CFilter Example 2 Output Table cfilter_output1 (Columns 1-7)

col1_item1 col1_item2 cntb cnt1 cnt2 score support


Consumer Small Business 13 13 17 0.764705882352941 0.764705882352941
Consumer Corporate 13 13 17 0.764705882352941 0.764705882352941
Consumer Home Office 13 13 16 0.8125 0.764705882352941
Corporate Consumer 13 17 13 0.764705882352941 0.764705882352941
Corporate Home Office 16 17 16 0.941176470588235 0.941176470588235

Corporate Small Business 17 17 17 1 1
Home Office Consumer 13 16 13 0.8125 0.764705882352941
Home Office Small Business 16 16 17 0.941176470588235 0.941176470588235
Home Office Corporate 16 16 17 0.941176470588235 0.941176470588235
Small Business Consumer 13 17 13 0.764705882352941 0.764705882352941
Small Business Home Office 16 17 16 0.941176470588235 0.941176470588235
Small Business Corporate 17 17 17 1 1

Table 931: CFilter Example 2 Output Table cfilter_output1 (Columns 8-10)

confidence lift z_score


1 1 -0.98058067569092
1 1 -0.98058067569092
1 1.0625 -0.98058067569092
0.764705882352941 1 -0.98058067569092
0.941176470588235 1 0.784464540552737
1 1 1.37281294596729
0.8125 1.0625 -0.98058067569092
1 1 0.784464540552737
1 1 0.784464540552737
0.764705882352941 1 -0.98058067569092
0.941176470588235 1 0.784464540552737
1 1 1.37281294596729
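The cntb column of cfilter_output1 is a co-occurrence count that can serve as an item-similarity score for downstream functions such as WSRecommender (the WSRecommender example later in this chapter uses these values). A sketch of such an extraction, assuming the cfilter_output1 table created above:

SELECT col1_item1 AS product_category_a,
       col1_item2 AS product_category_b,
       cntb AS product_similarity
FROM cfilter_output1;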

FPGrowth

Summary
The FPGrowth (frequent pattern growth) function uses an FP-growth algorithm to generate association
rules from patterns in a data set, and then determines their interestingness.

Background
Association rule mining identifies strong rules in databases, using measures of interestingness, to discover regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the association rule {onions, potatoes}=>{hamburger} indicates that a customer who buys onions and potatoes together is also likely to buy hamburger. Unlike sequence mining, association rule mining typically does not consider the order of items, either within a transaction or across transactions.
Association rules can be very useful in making marketing decisions, such as when to offer promotional
pricing or where to place products. They are also useful in many other areas, such as Web usage mining,
intrusion detection, continuous production, and bioinformatics. However, association rule mining in large
databases is computationally expensive, especially when there is a large number of patterns.
An efficient, scalable method for mining the complete set of frequent patterns from a data set is the
frequent-pattern growth (FP-growth) algorithm. The FP-growth algorithm stores crucial information about
frequent pattern growth in a frequent-pattern tree (FP-tree), an extended prefix-tree structure of
compressed information, and then uses a divide-and-conquer strategy. Because the FP-tree retains the item set association information, the FP-growth algorithm avoids the computationally expensive candidate-generation step.
The FPGrowth function does the following:
1. Divides a large data set into independent partitions.
2. Compresses the data in each partition, creating an FP-tree to represent association information between
items.
3. Divides the FP-tree into conditional databases, each associated with one frequent pattern.
4. Mines each conditional database for patterns.
5. Generates association rules from the patterns.
To generate an association rule from a pattern, the function takes one or more items in the pattern as the
consequence and the remaining items as the antecedent. In the rule {onions, potatoes}=>{hamburger},
{onions, potatoes} is the antecedent and {hamburger} is the consequence.
6. Determines the interestingness of the association rules.

Note:
The FPGrowth function automatically truncates long transactions by removing low-frequency items,
guaranteeing that a single transaction generates at most 1 million patterns. Automatic truncation
depends only on the value of the MaxPatternLength argument.

Usage

FPGrowth Syntax
Version 1.2

SELECT * FROM FPGrowth (
  ON (SELECT 1)
  PARTITION BY 1
  [ Domain ('host:port') ]
  [ Database ('db_name') ]
  [ UserID ('user_id') ]
  [ Password ('password') ]
  [ SSLSettings ('SSLsettings') ]
  [ SSLTrustStorePassword ('SSLtruststorepassword') ]
  InputTable ('input_table')
  OutputPatternTable ('output_pattern_table')
  OutputRuleTable ('output_rule_table')
  TranItemColumns ({ 'item_column' | 'item_column_range' }[,...])
  TranIDColumns ({ 'id_column' | 'id_column_range' }[,...])
  [ PatternsOrRules ({ 'patterns' | 'rules' | 'both' }) ]
  [ GroupByColumns ({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
  [ PatternDistributionKeyColumn ('p_distribution_key_column') ]
  [ RuleDistributionKeyColumn ('r_distribution_key_column') ]
  [ Compress ({ 'nocompress' | 'high' | 'medium' | 'low' }) ]
  [ DropTable ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
  [ GroupSize (group_size) ]
  [ MinSupport (min_support) ]
  [ MinConfidence (min_confidence) ]
  [ MaxPatternLength (pattern_length) ]
  [ AntecedentCountRange ('lower_bound-upper_bound') ]
  [ ConsequenceCountRange ('lower_bound-upper_bound') ]
  [ Delimiter ('delimiter') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments

Argument                      Category         Description
InputTable                    Required         Specifies the name of the table that contains the data set.
OutputPatternTable            See Description  Required if PatternsOrRules is 'patterns' or 'both'; otherwise, not allowed. Specifies the name of the table where the function outputs the patterns.
OutputRuleTable               See Description  Required if PatternsOrRules is 'rules' or 'both'; otherwise, not allowed. Specifies the name of the table where the function outputs the rules.
TranItemColumns               Required         Specifies the names of the columns that contain transaction items to analyze.
TranIDColumns                 Required         Specifies the names of the columns that contain identifiers for the transaction items.
PatternsOrRules               Optional         Specifies whether the function outputs patterns, rules, or both. An example of a pattern is {onions, potatoes, hamburger}. The default value is 'both'.
GroupByColumns                Optional         Specifies the names of columns that define the partitions into which the function groups the input data and calculates output for it. At least one column must be usable as a distribution key. If you omit this argument, the function considers all input data to be in a single partition.
                                               Note: Do not specify the same column in both this argument and the TranIDColumns argument, because this causes incorrect counting in the partitions.
PatternDistributionKeyColumn  Optional         Specifies the name of the column to use as the distribution key for output_pattern_table. The default value is 'pattern_tranitemcolumns'.
RuleDistributionKeyColumn     Optional         Specifies the name of the column to use as the distribution key for output_rule_table. The default value is 'antecedent_tranitemcolumns'.
Compress                      Optional         Specifies the compression level of the output tables. The default value is 'nocompress'. Realized compression ratios depend on both this value and the data characteristics; they typically range from 3x to 12x. For more information about compression, see Aster Database User Guide for Aster Appliances.
DropTable                     Optional         Specifies whether the function drops and then re-creates output_pattern_table or output_rule_table if it exists ('true') or issues an error message ('false'). The default value is 'false'.
GroupSize                     Optional         Specifies the number of transaction items to be assigned to each worker. This value must be an INTEGER in the range from 1 to the number of distinct transaction items, inclusive. For a machine with limited RAM, use a relatively small value. The default value is 4.
MinSupport                    Optional         Specifies the minimum support value of returned patterns (including the specified support value). This value must be a DECIMAL in the range [0, 1]. The default value is 0.05.
MinConfidence                 Optional         Specifies the minimum confidence value of returned patterns (including the specified confidence value). This value must be a DECIMAL in the range [0, 1]. The default value is 0.8.
MaxPatternLength              Optional         Specifies the maximum length of returned patterns. The length of a pattern is the sum of the item numbers in the antecedent and consequence columns. This value must be an INTEGER greater than 2. The default value is 10. MaxPatternLength also limits the length of returned rules to this value.
AntecedentCountRange          Optional         Specifies the range for na, the number of items in the antecedent. The function returns only patterns for which na is in the range [lower_bound, upper_bound]. The lower_bound must be an integer greater than 0. The lower_bound and upper_bound can be equal. The default value is '1-infinite'.
ConsequenceCountRange         Optional         Specifies the range for nc, the number of items in the consequence. The function returns only patterns for which nc is in the range [lower_bound, upper_bound]. The lower_bound must be an integer greater than 0. The lower_bound and upper_bound can be equal. The default value is '1-1'.
Delimiter                     Optional         Specifies the delimiter that separates items in the output. The default value is ',' (comma).

Input
The FPGrowth function has one required input table. The following table describes its columns.
Table 932: FPGrowth Input Table Schema

Column Name                Data Type           Description
item_column                Any except BYTEINT  The table can have more than one such column. These columns contain the transaction items to analyze.
id_column                  Any                 The table can have more than one such column. These columns contain the identifiers of the transaction items.
group_by_column            Any                 The table can have more than one such column. These columns define the partitions for the input data. At least one column must be usable as a distribution key.
p_distribution_key_column  Any                 Contains the distribution key for output_pattern_table.
r_distribution_key_column  Any                 Contains the distribution key for output_rule_table.

Output
The FPGrowth function outputs a pattern table, a rule table, or both, depending on the value of the PatternsOrRules argument.
The following table describes the columns of the pattern table.
Table 933: FPGrowth Pattern Table Schema

Column Name          Data Type         Description
group_by_column      Same as in input  Column copied from the input table.
pattern_item_column  VARCHAR           Pattern composed of transaction items.
length_of_pattern    INTEGER           Number of items in the pattern.
count                BIGINT            Count of occurrences of the pattern.
support              DOUBLE PRECISION  Percentage of transactions that contain the pattern: count/t, where t is the number of transactions. For example, if eggs and milk were purchased together 3 times in 5 transactions, then the support value is 3/5, 60%.

The rule table has one row for each rule. The following table describes its columns.
Table 934: FPGrowth Rule Table Schema

Column Name              Data Type         Contents
antecedent_item_column   VARCHAR           Items in the antecedent of the rule.
consequence_item_column  VARCHAR           Items in the consequence of the rule.
count_of_antecedent      INTEGER           Count of items in the antecedent of the rule.
count_of_consequence     INTEGER           Count of items in the consequence of the rule.
cntb                     BIGINT            Count of transactions that contain both the antecedent and consequence.
cnt_antecedent           BIGINT            Count of transactions that contain the antecedent.
cnt_consequence          BIGINT            Count of transactions that contain the consequence.
score                    DOUBLE PRECISION  Product of two conditional probabilities: (cntb / cnt_antecedent) * (cntb / cnt_consequence)
support                  DOUBLE PRECISION  Percentage of transactions that contain both the antecedent and consequence: cntb/t, where t is the number of transactions. For example, if eggs and milk were purchased together 3 times in 5 transactions, then the support value is 3/5, 60%.
confidence               DOUBLE PRECISION  Percentage of the transactions containing the antecedent that also contain the consequence: cntb / cnt_antecedent. For example, for the antecedent milk and consequence butter, if cntb=3 and cnt_antecedent=4, then the confidence value is 3/4, 75%. In other words, 75% of the time, when a person buys milk, the person also buys butter.
lift                     DOUBLE PRECISION  Ratio of the observed support value to the expected support value if the antecedent and consequence were independent: (cntb/t) / ((cnt_antecedent/t) * (cnt_consequence/t))
conviction               DOUBLE PRECISION  More reliable alternative to confidence: (1 - cnt_consequence/t) / (1 - cntb/cnt_antecedent)
leverage                 DOUBLE PRECISION  Difference between the percentage of transactions that contain both the antecedent and consequence (cntb/t) and the expectation for cntb/t if the antecedent and consequence were statistically independent: (cntb/t) - ((cnt_antecedent/t) * (cnt_consequence/t))
coverage                 DOUBLE PRECISION  Percentage of transactions in which the rule applies: cnt_antecedent/t. Another name for coverage is antecedent support.
chi_square               DOUBLE PRECISION  Chi-squared test result, used to test the hypothesis that the antecedent and consequence are not associated. The formula follows this table.
z_score                  DOUBLE PRECISION  Significance of cntb, assuming that it follows a normal distribution: (cntb - mean(cntb)) / standard_deviation(cntb). If every cntb is the same, then standard_deviation(cntb) is 0, and the function does not compute z_score.

The formula for the value of chi_square is:

(t * (cntb * (t + cntb - cnt_antecedent - cnt_consequence) - (cnt_antecedent - cntb) * (cnt_consequence - cntb))**2) /
(cnt_antecedent * (t - cnt_antecedent) * cnt_consequence * (t - cnt_consequence))
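As a quick sanity check, some of these measures can be recomputed directly from the count columns of the rule table. The following query is illustrative only; it assumes a rule table named fpgrowth_out_rule, like the one produced in the example below:

SELECT antecedent_product, consequence_product,
       cntb::DOUBLE PRECISION / cnt_antecedent AS confidence_check,
       (cntb::DOUBLE PRECISION / cnt_antecedent) *
       (cntb::DOUBLE PRECISION / cnt_consequence) AS score_check
FROM fpgrowth_out_rule;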

Example

Input
The input (sales_transaction) is sales transaction data of an office supply chain store by different geographic
regions and customer segments. The column product specifies the items that are purchased by a customer in
a given transaction (column orderid).
Table 935: FPGrowth Example Input Table sales_transaction

orderid  orderdate            orderqty  region   customer_segment  prd_category     product
3        2010-10-13 00:00:00  6         Nunavut  Small Business    Office Supplies  Storage & Organization
293      2012-10-01 00:00:00  49        Nunavut  Consumer          Office Supplies  Appliances
293      2012-10-01 00:00:00  27        Nunavut  Consumer          Office Supplies  Binders and Binder Accessories
483      2011-07-10 00:00:00  30        Nunavut  Corporate         Technology       Telephones and Communication
515      2010-08-28 00:00:00  19        Nunavut  Consumer          Office Supplies  Appliances
515      2010-08-28 00:00:00  21        Nunavut  Consumer          Furniture        Office Furnishings
613      2011-06-17 00:00:00  12        Nunavut  Corporate         Office Supplies  Binders and Binder Accessories
613      2011-06-17 00:00:00  22        Nunavut  Corporate         Office Supplies  Storage & Organization
643      2011-03-24 00:00:00  21        Nunavut  Corporate         Office Supplies  Storage & Organization
678      2010-02-26 00:00:00  44        Nunavut  Home Office       Office Supplies  Paper
807      2010-11-23 00:00:00  45        Nunavut  Home Office       Office Supplies  Paper
807      2010-11-23 00:00:00  32        Nunavut  Home Office       Office Supplies  Rubber Bands
868      2012-06-08 00:00:00  32        Nunavut  Home Office       Office Supplies  Appliances
...      ...                  ...       ...      ...               ...              ...

SQL-MapReduce Call

SELECT * FROM FPGrowth (
  ON (SELECT 1)
  PARTITION BY 1
  InputTable ('sales_transaction')
  OutputRuleTable ('fpgrowth_out_rule')
  OutputPatternTable ('fpgrowth_out_pattern')
  TranItemColumns ('product')
  TranIDColumns ('orderid')
  GroupByColumns ('region')
  MinSupport (0.01)
  MinConfidence (0.0)
  MaxPatternLength (4)
  ConsequenceCountRange ('1-1')
  PatternsOrRules ('both')
);

Output
Table 936: FPGrowth Example Output Message

Output Information
Patterns are kept in table "fpgrowth_out_pattern"
Rules are kept in table "fpgrowth_out_rule"

The query below returns the output shown in the following table:

SELECT * FROM fpgrowth_out_pattern ORDER BY region, pattern_product;

Table 937: FPGrowth Example Output Table fpgrowth_out_pattern

region                 pattern_product                                               length_of_pattern  count  support
Atlantic               Appliances, Chairs & Chairmats                                2                  1      0.0555555555555556
Atlantic               Chairs & Chairmats, Computer Peripherals                      2                  1      0.0555555555555556
Atlantic               Labels, Pens & Art Supplies                                   2                  1      0.0555555555555556
Atlantic               Office Furnishings, Binders and Binder Accessories            2                  1      0.0555555555555556
Atlantic               Office Furnishings, Computer Peripherals                      2                  1      0.0555555555555556
Atlantic               Office Furnishings, Paper                                     2                  1      0.0555555555555556
Atlantic               Office Furnishings, Paper, Storage & Organization             3                  1      0.0555555555555556
Atlantic               Office Furnishings, Storage & Organization                    2                  1      0.0555555555555556
Atlantic               Paper, Labels                                                 2                  1      0.0555555555555556
Atlantic               Paper, Storage & Organization                                 2                  1      0.0555555555555556
Northwest Territories  Binders and Binder Accessories, Computer Peripherals          2                  3      0.0111524163568773
Northwest Territories  Binders and Binder Accessories, Office Furnishings            2                  4      0.0148698884758364
Northwest Territories  Binders and Binder Accessories, Office Machines               2                  3      0.0111524163568773
Northwest Territories  Binders and Binder Accessories, Rubber Bands                  2                  3      0.0111524163568773
Northwest Territories  Binders and Binder Accessories, Telephones and Communication  2                  4      0.0148698884758364
Northwest Territories  Computer Peripherals, Chairs & Chairmats                      2                  3      0.0111524163568773
...                    ...                                                           ...                ...    ...

The query below returns the output shown in the table fpgrowth_out_rule:

SELECT * FROM fpgrowth_out_rule ORDER BY region, score;

Table 938: FPGrowth Example Output Table fpgrowth_out_rule (Columns 1-6)

region    antecedent_product              consequence_product             count_of_antecedent  count_of_consequence  cntb
Atlantic  Paper                           Office Furnishings              1                    1                     1
Atlantic  Office Furnishings              Paper                           1                    1                     1
Atlantic  Binders and Binder Accessories  Office Furnishings              1                    1                     1
Atlantic  Office Furnishings              Binders and Binder Accessories  1                    1                     1
Atlantic  Labels                          Paper                           1                    1                     1

Table 939: FPGrowth Example Output Table fpgrowth_out_rule (Columns 7-12)

cnt_antecedent cnt_consequence score support confidence lift


4 5 0.05 0.0555555555555556 0.25 0.9
5 4 0.05 0.0555555555555556 0.2 0.9
3 5 0.0666666666666667 0.0555555555555556 0.333333333333333 1.2
5 3 0.0666666666666667 0.0555555555555556 0.2 1.2
3 4 0.0833333333333333 0.0555555555555556 0.333333333333333 1.5

Table 940: FPGrowth Example Output Table fpgrowth_out_rule (Columns 13-17)

conviction leverage coverage chi_square z_score


0.962962962962963 -0.00617283950617285 0.222222222222222 0.0197802197802198
0.972222222222222 -0.00617283950617285 0.277777777777778 0.0197802197802198
1.08333333333333 0.00925925925925925 0.166666666666667 0.0553846153846154
1.04166666666667 0.00925925925925925 0.277777777777778 0.0553846153846154
1.16666666666667 0.0185185185185185 0.166666666666667 0.257142857142857

The output tables contain only the rows that satisfy the MinSupport value of 0.01; rows that do not satisfy the condition are excluded.
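Because this example sets MinConfidence to 0.0, the rule table can include many weak rules. You can apply stricter thresholds after the fact with an ordinary query; the following is an illustrative sketch using the tables created above:

SELECT region, antecedent_product, consequence_product,
       support, confidence, lift
FROM fpgrowth_out_rule
WHERE confidence >= 0.3 AND lift > 1
ORDER BY region, lift DESC;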

Recommender Functions
A recommender system is an information filtering system that predicts the ratings or preferences that users
assign to entities like books, songs, movies or other products. Recommender systems are widely used among
online retailers and other businesses.
The goal of a recommender system is to generate accurate recommendations to users of items or products
that might interest them. The typical recommendation task is to predict the rating a user would give to an
item. The Teradata Aster recommender functions are based on Collaborative Filtering (CF), which relies
only on historical rankings of products by users to identify similarities between users and between products,
and thus to identify products that are new to a particular user that the user would rate highly.

Teradata Aster provides these recommendation functions:
• The WSRecommender function is an item-based, collaborative filtering function that uses a weighted-
sum algorithm to make recommendations.1 The function predicts the rating a user would give to an item
by calculating the average of ratings the user has given similar items, weighted by a similarity score
between the items.
• KNNRecommenderTrain and KNNRecommenderPredict take a similar approach to WSRecommender,
but attempt to increase prediction accuracy by adjusting for systematic biases and replacing heuristic
calculations of similarity coefficients with a global optimization that simultaneously estimates all
weights.2

1 "Item-Based Collaborative Filtering Recommendation Algorithms." Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl.
2 "Improved Neighborhood-based Collaborative Filtering." Robert M. Bell and Yehuda Koren.

WSRecommender

Summary
The WSRecommender function is an item-based, collaborative filtering function that uses a weighted-sum
algorithm to make recommendations (for example, items or products for users to consider purchasing).

Usage

WSRecommender Syntax
Version 1.0

SELECT * FROM WSRecommender (
  ON (SELECT * FROM WSRecommenderReduce (
    ON item_table_name AS item_table PARTITION BY item1_column
    ON user_table_name AS user_table PARTITION BY item_column
    [ Item1 ('item1_column') ]
    [ Item2 ('item2_column') ]
    [ ItemSimilarity ('similarity_column') ]
    [ UserItem ('item_column') ]
    [ UserID ('user_column') ]
    [ UserPref ('preference_column') ]
    [ AccumulateItem ({ 'accumulate_item_column' | 'accumulate_item_column_range' }[,...]) ]
    [ AccumulateUser ({ 'accumulate_user_column' | 'accumulate_user_column_range' }[,...]) ]
  )
  ) AS temporary_table PARTITION BY usr, col1_item2
);

Arguments

Argument        Category  Description
Item1           Optional  Specifies the name of the item_table column that contains the first item (item1). The default value is 'col1_item1'.
Item2           Optional  Specifies the name of the item_table column that contains the second item (item2). The default value is 'col1_item2'.
ItemSimilarity  Optional  Specifies the name of the item_table column that contains the similarity score for item1 and item2. The default value is 'cntb'.
UserItem        Optional  Specifies the name of the user_table column that contains the names of the items that the user viewed or purchased. The default value is 'item'.
UserID          Optional  Specifies the name of the user_table column that contains the unique user identifiers. The default value is 'usr'.
UserPref        Optional  Specifies the name of the user_table column that contains user preferences for an item, expressed as numeric values. The value 0 indicates no preference. The default value is 'preference'.
AccumulateItem  Optional  Specifies the names of item_table columns to copy to the output table.
AccumulateUser  Optional  Specifies the names of user_table columns to copy to the output table.

Input
The WSRecommender function requires an item table and a user table.
The item table must be symmetric with respect to item1_column and item2_column. That is, if a row has
'apple' in item1_column and 'bread' in item2_column, then another row must have 'bread' in item1_column
and 'apple' in item2_column, and these two rows must have the same value in similarity_column.
Table 941: WSRecommender Item Table Schema

Column Name             Data Type                    Description
item1_column            VARCHAR                      Contains the first item (item1). Column on which the table is partitioned.
                                                     Note: The database handles NULL values in partitioning columns. You need not exclude them with a WHERE clause.
item2_column            VARCHAR                      Contains the second item (item2).
similarity_column       INTEGER or DOUBLE PRECISION  Contains the similarity score for item1 and item2. You can compute this score with the CFilter function: the number of times item1_column co-occurs with item2_column is their similarity score.
accumulate_item_column  Any                          Column to be copied to the output table.

The function gives the best results when the items in item1_column and item2_column satisfy the triangle inequality.
Table 942: WSRecommender User Table Schema

Column Name             Data Type  Description
item_column             VARCHAR    Name of item that the user viewed or bought. Column on which the table is partitioned.
                                   Note: The database handles NULL values in partitioning columns. You need not exclude them with a WHERE clause.
user_column             VARCHAR    Unique user identifier.
preference_column       INTEGER    User preference score for the item.
accumulate_user_column  Any        Column to be copied to the output table.

Output
Table 943: WSRecommender Output Table Schema

Column Name             Data Type              Description
item                    VARCHAR                Name of item that the user viewed or bought. Corresponds to the user table column item_column.
usr                     VARCHAR                Unique user identifier. Corresponds to the user table column user_column.
recommendation          DOUBLE PRECISION       Recommendation score. If the user viewed or bought the item, this value is the user preference score for the item (in preference_column in the user table). Otherwise, this value is the recommendation score calculated by the function.
new_reco_flag           INTEGER                New recommendation flag, which indicates whether the item is to be recommended to the user. This value is 1 (yes) if the user never viewed or bought the item and the recommendation score is greater than 0; 0 (no) otherwise.
accumulate_item_column  Same as in item table  Column copied from the item table.
accumulate_user_column  Same as in user table  Column copied from the user table.

Example

Input
The item table, recommender_product, contains product categories and their similarity scores. The similarity scores are from column cntb in CFilter Example 2 Output Table cfilter_output1, which contains the number of co-occurrences of the items in product_category_a and product_category_b.
Table 944: WSRecommender Example Item Table recommender_product

region product_category_a product_category_b product_similarity


Northern California Consumer Corporate 13
Northern California Consumer Home Office 13
Northern California Consumer Small Business 13
Northern California Corporate Corporate 13
Northern California Corporate Home Office 16
Northern California Corporate Small Business 17
Northern California Home Office Corporate 13
Northern California Home Office Home Office 16
Northern California Home Office Small Business 16
Northern California Small Business Corporate 13
Northern California Small Business Home Office 17
Northern California Small Business Small Business 16

The user table, recommender_user, shows the product preference (business presence) of four companies in
four product categories, on a scale of 0 to 10 (10 is highest). For example, the table shows that in the
Consumer product category, Walmart has a high business presence, while Staples has none. The
prod_preference 0 means that the company has never viewed or bought a product in that category.
Table 945: WSRecommender Example User Table recommender_user

product_category companyname prod_preference month


Consumer Costco 5 December
Consumer Ikea 3 December

Consumer Staples 0 December
Consumer Walmart 9 December
Corporate Costco 8 December
Corporate Ikea 5 December
Corporate Staples 1 December
Corporate Walmart 0 December
Home Office Costco 0 December
Home Office Ikea 9 December
Home Office Staples 7 December
Home Office Walmart 2 December
Small Business Costco 9 December
Small Business Ikea 0 December
Small Business Staples 8 December
Small Business Walmart 2 December

SQL-MapReduce Call

SELECT * FROM WSRecommender (
  ON (SELECT * FROM WSRecommenderReduce (
    ON (SELECT * FROM recommender_product) AS item_table
      PARTITION BY product_category_a
    ON (SELECT * FROM recommender_user) AS user_table
      PARTITION BY product_category
    Item1 ('product_category_a')
    Item2 ('product_category_b')
    ItemSimilarity ('product_similarity')
    UserItem ('product_category')
    UserID ('companyname')
    UserPref ('prod_preference')
  )
  ) AS temp_input_table PARTITION BY usr, col1_item2
) ORDER BY recommendation DESC, item;

Output
If the company (usr) has ever viewed or bought items in the product category (item), then the
recommendation column contains the value in the prod_preference column of the user table; otherwise, the
column contains the recommendation score calculated by the function.
If the recommendation value is greater than 0 and the company has never viewed or bought items in the
product category (that is, the value in the prod_preference column of the user table is 0), then the
new_reco_flag is 1, meaning that the product category is to be recommended to the company.

Table 946: WSRecommender Example Output Table

item usr recommendation new_reco_flag


Consumer Costco 8.5 0
Corporate Staples 7.51515 0
Home Office Costco 7.48889 1
Corporate Costco 7.26667 0
Consumer Ikea 7 0
Small Business Costco 6.7 0
Corporate Ikea 6.31034 0
Small Business Ikea 5.82609 1
Consumer Staples 5.33333 1
Home Office Walmart 5.13793 0
Small Business Walmart 5.13793 0
Home Office Staples 4.5 0
Home Office Ikea 4.10345 0
Corporate Walmart 3.97826 1
Small Business Staples 3.90909 0
Consumer Walmart 2 0
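To surface only the new recommendations (product categories that a company has never viewed or bought, but that have a positive recommendation score), you can filter on new_reco_flag. This is an illustrative follow-up; it assumes the output above has been saved in a table with the hypothetical name ws_reco_out:

SELECT item, usr, recommendation
FROM ws_reco_out
WHERE new_reco_flag = 1
ORDER BY recommendation DESC;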


KNNRecommenderTrain

Summary
The KNNRecommenderTrain function takes a table of user ratings of items and trains a k-nearest-neighbor recommender, producing the interpolation-weights and bias model tables that the KNNRecommenderPredict function uses to make recommendations.
Usage

KNNRecommenderTrain Syntax
Version 1.0

SELECT * FROM KNNRecommenderTrain (
  ON (SELECT 1) PARTITION BY 1
  RatingTable ('user_rating_table')
  [ Domain ('host:port') ]
  [ Database ('db_name') ]
  [ UserId ('user_id') ]
  [ Password ('password') ]
  [ SSLSettings ('SSLsettings') ]
  [ SSLTrustStorePassword ('SSLtruststorepassword') ]
  [ UserIdColumn ('userid_column') ]
  [ ItemIdColumn ('itemid_column') ]
  [ RatingColumn ('rating_column') ]
  WeightModelTable ('weight_model_table')
  BiasModelTable ('bias_model_table')
  [ NearestItemsTable ('item_neighbors_table') ]
  [ K (number_of_item_neighbors) ]
  [ LearningRate (learning_rate) ]
  [ MaxIterNum (max_iteration_number) ]
  [ Threshold (threshold_to_stop_iteration) ]
  [ ItemSimilarity ('method_to_calculate_item_similarity') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments

Argument           Category  Description
RatingTable        Required  The user rating table.
UserIdColumn       Optional  The user id column in the rating table. The default is the first column in the rating table.
ItemIdColumn       Optional  The item id column in the rating table. The default is the second column in the rating table.
RatingColumn       Optional  The rating column in the rating table. The default is the third column in the rating table.
WeightModelTable   Required  Name for the output table containing the interpolation weights.
BiasModelTable     Required  Name for the output table containing the global, user, and item bias statistics.
NearestItemsTable  Optional  Name for the output table containing the nearest neighbors for each item. If this argument is not present, the nearest-neighbors table is not produced. If the argument is used and a table with the specified name exists, the function uses the existing table to train the model. If the argument is used and no table with the specified name exists, the function creates a table with the specified name.
K                  Optional  The number of nearest neighbors used in the calculation of the interpolation weights. Default is 20.
LearningRate       Optional  Initial learning rate. The learning rate adjusts automatically during training based on changes in the rmse. Default is 0.001.
MaxIterNum         Optional  Maximum number of iterations. Default is 10.
Threshold          Optional  The function stops when the rmse drops below this level. Default is 0.0002.
ItemSimilarity     Optional  The method used to calculate item similarity. Options are Pearson (Pearson correlation coefficient, the default) and adjustedcosine (adjusted cosine similarity).

Input
KNNRecommenderTrain takes a single input table that contains ratings of various items by a set of users. The schema is shown in the following table.
Table 947: KNNRecommenderTrain Input Table Schema

Column Name  Data Type         Description
userid       INTEGER           User id. The user providing the rating.
itemid       INTEGER           Item id. The item being rated by the user.
rating       Any numeric type  The rating assigned by the user to the item.
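For reference, a table matching this schema could be defined as follows. This is an illustrative sketch; the table name and distribution key are assumptions, not requirements of the function:

CREATE TABLE ml_ratings (
  userid INTEGER,
  itemid INTEGER,
  rating DOUBLE PRECISION
) DISTRIBUTE BY HASH (userid);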

Output
When the function completes, KNNRecommenderTrain displays a table of the root mean square error (rmse) at each iteration (schema shown in the following table).
Table 948: KNNRecommenderTrain Output Table Schema

Column Name  Data Type  Description
iternum      INTEGER    Iteration number. The baseline rmse shows null in this column.
rmse         DOUBLE     Root mean square error after that iteration.

The function also creates three output tables: a table of the interpolation weights, a table of the bias values calculated by the function, and an optional table of nearest (item) neighbors.
Table 949: KNNRecommenderTrain Interpolation Weights Table Schema

Column Name  Data Type  Description
itemid       INTEGER    Item id.
weights      bytea      Interpolation weights for that item in compressed binary format.

Table 950: KNNRecommenderTrain Bias Values Table Schema

Column Name  Data Type  Description
label        CHAR(1)    U - user statistics
                        I - item statistics
                        G - global statistics
id           INTEGER    Item id or user id.
value        DOUBLE     If label = I, average rating across all users for that item.
                        If label = U, average rating across all items from that user.
                        If label = G, global average rating across all users and all items.

Table 951: KNNRecommenderTrain Nearest Neighbors Table Schema

Column Name  Data Type  Description
itemi        INTEGER    Item id of item i.
itemj        INTEGER    Item id of item j.
sij          DOUBLE     Calculated similarity between item i and item j.

Example

Input
The input table, ml_ratings, is a collection of movie ratings from 50 users on approximately 2900 movies,
with an average of about 150 ratings per user. There are 10 possible ratings, ranging from 0.5 to 5 in steps of
0.5. A higher number indicates a better rating.
Table 952: KNNRecommenderTrain Input Table ml_ratings

userid itemid rating


1 1 5
1 2 3
1 10 3
1 32 4
1 34 4
1 47 3
1 50 4
1 62 4
1 150 4
1 153 3
1 160 3
1 161 4
1 165 4
1 185 3

... ... ...

SQL-MapReduce Call
KnnRecommenderTrain uses the input data to generate three model tables: the weights model
('ml_weights'), the bias model table ('ml_bias') and the optional nearest items or neighbors table
('ml_itemngbrs').

DROP TABLE IF EXISTS ml_weights;
DROP TABLE IF EXISTS ml_bias;
DROP TABLE IF EXISTS ml_itemngbrs;

SELECT iternum, rmse::VARCHAR(6) AS rmse FROM KNNRecommenderTrain (
  ON (SELECT 1) PARTITION BY 1
  RatingTable ('ml_ratings')
  UserIdColumn ('userid')
  ItemIdColumn ('itemid')
  RatingColumn ('rating')
  WeightModelTable ('ml_weights')
  BiasModelTable ('ml_bias')
  NearestItemsTable ('ml_itemngbrs')
  K (15)
  MaxIterNum (20)
  Threshold (0.0002)
  LearningRate (0.001)
) ORDER BY iternum NULLS FIRST;

Output
The rmse value is output for each of the 20 iterations. The null iteration or first row of the table shows the
rmse of the default initialized model.
Table 953: KNNRecommenderTrain Output Table

iternum rmse
0.4825
0 0.4803
1 0.4780
2 0.4757
3 0.4734
4 0.4710
5 0.4686
6 0.4661
7 0.4636

8 0.4611
9 0.4585
10 0.4560
11 0.4534
12 0.4508
13 0.4482
14 0.4455
15 0.4429
16 0.4403
17 0.4376
18 0.4350
19 0.4323

The sij value (similarity between itemi and itemj) in the following table is the default Pearson correlation
coefficient.

SELECT * FROM ml_itemngbrs ORDER BY 1;

Table 954: KNNRecommenderTrain Default Pearson Correlation Coefficient

itemi itemj sij


1 145 1
1 1235 1
1 2867 1
1 1086 1
1 2067 1
1 7143 1
1 97304 1
1 1882 0.999607184150001
1 2723 0.999351517785325
1 1081 0.998812351124897
1 2301 0.998812351124897
1 26614 0.998812351124897
1 7458 0.998330805669821
1 3624 0.997986152770463

1 1094 0.99714268802795
2 160 1
2 63082 1
2 76251 1
2 4874 1
2 6934 1
2 8636 1
2 33004 1
2 112 0.999314833766767
2 1246 0.999282588328635
2 1259 0.998868137724437
2 349 0.998268396969243
2 223 0.998124077136759
2 8644 0.997785157856609
2 4370 0.997458699830735
2 35836 0.997054485501582
... ... ...

The following query returns the bias table, which shows the global (G), user (U), and item (I) statistics.

SELECT * FROM ml_bias ORDER BY 1, 2;

Table 955: KNNRecommenderTrain Bias Table

label id value
G 3.53538298436258
I 1 3.78125
I 2 3
I 3 2
I 5 3.16666666666667
I 6 3.65
I 7 3
I 9 3
I 10 3.65
... ... ...


KNNRecommenderPredict

Summary
The KNNRecommenderPredict function applies the model tables generated by KNNRecommenderTrain to a rating table, predicting the ratings that users would give to items and recommending the top-rated items for each user.
Usage

KNNRecommenderPredict Syntax
Version 1.0

SELECT * FROM KNNRecommenderPredict (
  ON rating_table AS ratings
    PARTITION BY ANY | PARTITION BY userid_column
  ON weight_model_table AS weights DIMENSION
  ON bias_model_table AS bias DIMENSION
  [ UserIdColumn (userid_column) ]
  [ ItemIdColumn (itemid_column) ]
  [ RatingColumn (rating_column) ]
  [ TopK (top_k_recommendations) ]
);

Arguments

Argument      Category  Description
UserIdColumn  Optional  The user id column in the rating table. The default is the first column in the rating table.
ItemIdColumn  Optional  The item id column in the rating table. The default is the second column in the rating table.
RatingColumn  Optional  The rating column in the rating table. The default is the third column in the rating table.
TopK          Optional  Number of items to recommend for each user. The TopK highest-rated items are recommended.

Input
The input to the KnnRecommenderPredict function is the rating table and the output tables from
KnnRecommenderTrain (the interpolation weights table and the bias values table).

Output
Table 956: KNNRecommenderPredict Output Table Schema

Column Name  Data Type  Description
userid       INTEGER    Userid of a user for whom items are recommended.
itemid       INTEGER    The itemid of a recommended item for that user.
prediction   DOUBLE     The predicted rating of the item for the user. The highest-rated k items are shown, where k is specified by the TopK argument.

Example

Input
The function uses the model tables ml_weights and ml_bias from the KNNRecommenderTrain example output and recommends five movies for each of the first ten users in the ratings table.

SQL-MapReduce Call

SELECT userid, itemid, prediction::VARCHAR(4) AS prediction
FROM KNNRecommenderPredict (
  ON (SELECT * FROM ml_ratings WHERE userid <= 10) AS ratings
    PARTITION BY userid
  ON ml_bias AS bias DIMENSION
  ON ml_weights AS weights DIMENSION
  TopK (5)
) ORDER BY 1, 3 DESC;

Output
Table 957: KNNRecommenderPredict Example Output Table

userid itemid prediction


1 3703 5.11
1 2176 4.98
1 27773 4.94
1 1217 4.94
1 141 4.90
2 1269 4.86
2 2176 4.85
2 446 4.85
2 27773 4.77
2 3703 4.75
3 318 4.73
3 1217 4.65
3 2352 4.60
3 898 4.59
3 1247 4.58
4 3703 5.05
4 141 4.85
4 538 4.82
4 1240 4.81
4 858 4.79
5 44555 5.08
5 232 5.04
5 6664 4.99
5 76251 4.96
5 1516 4.92
6 2019 5.01
6 106487 4.94
6 951 4.92
6 141 4.86
6 3703 4.85

7 3703 4.99
7 318 4.94
7 141 4.93
7 1279 4.84
7 3702 4.76
8 1269 4.84
8 2183 4.64
8 1224 4.63
8 1883 4.61
8 1279 4.61
9 3703 4.83
9 8783 4.69
9 922 4.64
9 3504 4.57
9 7981 4.57
10 1269 4.45
10 3730 4.44
10 27773 4.43
10 1939 4.42
10 1136 4.39

Some predicted ratings are higher than 5, even though the maximum input rating is 5. The weighted KNN recommendation algorithm does not limit its results to the range of the input data. The outcomes of interest are the items with the highest recommendation scores; if the resulting ratings must be limited to a specific range, the output data can be normalized as needed.
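For example, a min-max rescaling of the predictions into the 0.5 to 5 range of the input ratings could look like the following. This is an illustrative sketch; it assumes the predictions have been saved in a table with the hypothetical name knn_predictions and a numeric prediction column:

SELECT userid, itemid,
       0.5 + (prediction - min_p) * (5 - 0.5) / (max_p - min_p) AS scaled_prediction
FROM knn_predictions,
     (SELECT MIN(prediction) AS min_p, MAX(prediction) AS max_p
      FROM knn_predictions) AS bounds
ORDER BY userid, scaled_prediction DESC;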



CHAPTER 11
Graph Analysis

• Overview of Graph Analysis
• AllPairsShortestPath
• Betweenness
• Closeness
• EigenvectorCentrality
• gTree
• LocalClusteringCoefficient
• LoopyBeliefPropagation
• Modularity
• nTree
• PageRank
• pSALSA
• RandomWalkSample
Before rerunning the examples in this chapter, manually remove the existing output tables.

Overview of Graph Analysis


The Aster Graph Analytic Premium Portfolio provides a set of functions that are widely used in social
network analysis. These functions perform these tasks:
• Finding shortest paths
• Computing importance/influence scores
• Predicting unobserved variables based on knowledge of observed variables and network structures

Important:
Before rerunning the examples in this chapter, remove the existing output tables.

Graph Functions
The functions in this chapter were developed using the Teradata Aster SQL-GR™ framework, which allows
large-scale graph analysis in Aster Database. For more information about SQL-GR™, see Aster Developer
Guide.

Iterations
When running graph functions, the iteration number displayed in the Aster Management Console (AMC) is not the same as the function's iteration number. For example, when running the EigenvectorCentrality function, each EigenvectorCentrality iteration (EI) consumes 2 SQL-GR™ iterations (GI). For directed graphs, including the overhead, GI = 2*EI + 1; for example, 5 EigenvectorCentrality iterations on a directed graph consume 11 SQL-GR™ iterations. For undirected graphs, GI = 2*EI + 3.
For the PageRank function, the number of SQL-GR™ iterations equals the number of PageRank iterations + 1.

What is a Graph?
A graph is a representation of interconnected objects. An object is represented as a vertex (also called a node)
—for example, cities, computers, and people. A link connecting two vertices is called an edge. Edges can
represent roads that connect cities, computer network cables, interpersonal connections (such as co-worker
relationships), and so on.
Figure 16: Graph Example

In Aster Database, to process graphs using SQL-GR™, it is recommended that you represent a graph using
two tables:
• Vertices table
• Edges table
The graph in the preceding figure is represented by the following two tables.
In the following table, each row represents a vertex.
Table 958: Vertex Table Example

Vertex City Name


A Albany
B Berkeley
C Cerrito
D Danville
E East Palo Alto
F Foster City
G Gilroy

In the following table, each row represents an edge.

Table 959: Edges Table Example

Source Destination
A B
A C
A E
B D
C D
C F
C G
E C
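For example, the two tables above could be created and populated as follows. This is an illustrative sketch; the table names, column names, and distribution keys are assumptions, not requirements:

CREATE TABLE cities (
  vertex VARCHAR,
  city_name VARCHAR
) DISTRIBUTE BY HASH (vertex);

CREATE TABLE roads (
  source VARCHAR,
  destination VARCHAR
) DISTRIBUTE BY HASH (source);

INSERT INTO cities VALUES ('A', 'Albany');
INSERT INTO cities VALUES ('B', 'Berkeley');
INSERT INTO cities VALUES ('C', 'Cerrito');
INSERT INTO roads VALUES ('A', 'B');
INSERT INTO roads VALUES ('A', 'C');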

Directed Graphs
SQL-GR™ is based on a simple directed graph data model where each directed edge can be represented as an
ordered pair of vertices.

Graph Discovery
Graphs can form complex structures as in social, fraud, or communication networks. Graph discovery refers
to the application of algorithms that analyze the structure of these networks. Graph discovery has
applicability in diverse business areas such as marketing, human resources, security, and operations.

AllPairsShortestPath

Summary
The AllPairsShortestPath function computes the shortest distances between all combinations of the specified
source and target vertices. The function works on directed, undirected, weighted, and unweighted graphs.


The function is useful in social network analysis. The resulting pairs and distances can be aggregated to
determine a closeness metric or the k-degree for each vertex in a graph.

Usage

AllPairsShortestPath Syntax
Version 1.2

SELECT * FROM AllPairsShortestPath (
  ON vertices_table AS vertices PARTITION BY vertex_key
  ON edges_table AS edges PARTITION BY source_vertex_key
  [ ON sources_table AS sources PARTITION BY source_vertex_key ]
  [ ON targets_table AS targets PARTITION BY target_vertex_key ]
  TargetKey ({ 'target_key_column' | 'target_key_column_range' }[,...])
  [ EdgeWeight ('edge_weight') ]
  [ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
  [ MaxDistance ('max_distance') ]
  [ GroupSize ('group_size') ]
);

Arguments

Argument     Category  Description
TargetKey    Required  Specifies the target key (the names of the edges table columns that identify the target vertex). If you specify targets_table, then the function uses only the vertices in targets_table as targets (which must be a subset of those that this argument specifies).
EdgeWeight   Optional  Specifies the name of the edges table column that contains edge weights. Each edge_weight is a positive value. By default, each edge_weight is 1; that is, the graph is unweighted.
Directed     Optional  Specifies whether the graph is directed. The default value is 'true'.
MaxDistance  Optional  Specifies the maximum distance between source and target for which the function outputs the vertices. The max_distance must be an integer. If max_distance is negative, the distance between source and target is unbounded. The default value is 10.
GroupSize    Optional  Specifies the number of source vertices that execute a single-node shortest path (SNSP) algorithm in parallel. If group_size exceeds the number of source vertices in a partition, then the number of source vertices in the partition is the group size. By default, the function uses cluster and query characteristics to determine the optimal group size.

Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a vertex of the graph, and each row of the edges table represents an edge of the graph.
The function has two optional input tables, sources and targets, which specify the vertices that are sources and targets, respectively. For a directed graph, these tables are required. By default, all vertices are sources and targets; that is, the graph is undirected.
For an undirected graph, the edges table might have duplicate rows. Remove them, using the code in Deleting Duplicate Edges Table Rows.
Table 960: AllPairsShortestPath Vertices Table Schema

Column Name        Data Type                           Description
vertex_key_column  Any allowed in PARTITION BY clause  Column that is, or is part of, the unique vertex key. Every column that is part of this key must appear in the PARTITION BY clause. This column cannot contain NULL values.

Table 961: AllPairsShortestPath Edges Table Schema

Column Name               Data Type                                      Description
source_vertex_key_column  Any allowed in PARTITION BY clause             Column that is, or is part of, the key that identifies the source vertex of the edge. This key must be a vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY clause. This column cannot contain NULL values.
target_vertex_key_column  Same as data type of source_vertex_key_column  Column that is, or is part of, the key that identifies the target vertex of the edge. This key must be a vertex key in the vertices table. This column can contain NULL values.
edge_weight               SMALLINT, INTEGER, or NUMERIC                  Column that contains the weights of the edges, which must be positive values. This column is required only for a weighted graph. This column can contain NULL values.

Table 962: AllPairsShortestPath Sources Table Schema

Column Name               Data Type                                                               Description
source_vertex_key_column  Same as data type of corresponding vertex_key_column in vertices table  Column that is, or is part of, the key that identifies a source vertex. This key must be a vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Table 963: AllPairsShortestPath Targets Table Schema

Column Name               Data Type                                                               Description
target_vertex_key_column  Same as data type of corresponding vertex_key_column in vertices table  Column that is, or is part of, the key that identifies a target vertex. This key must be a vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Deleting Duplicate Edges Table Rows

The edges table of an undirected graph can have many duplicate rows, because each edge between vertices A and B is represented by two rows: one row has A in the source column and B in the target column, and the other row has B in the source column and A in the target column. Teradata recommends deleting duplicate rows from the edges table, using this code (where edges_table is the edges table name):

DROP TABLE IF EXISTS Copy;
CREATE TABLE Copy DISTRIBUTE BY HASH(source)
  AS SELECT *, ROW_NUMBER() OVER(ORDER BY source, target) rn
  FROM edges_table;
DROP TABLE IF EXISTS DuplicatesRemoved;
CREATE TABLE DuplicatesRemoved DISTRIBUTE BY HASH(source)
  AS SELECT source, target, ROW_NUMBER() OVER(ORDER BY source, target) rn
  FROM Copy;
DELETE FROM DuplicatesRemoved WHERE rn IN (
  SELECT CASE WHEN a.rn > b.rn THEN b.rn ELSE a.rn END minrn
  FROM DuplicatesRemoved a
  JOIN Copy b
  ON a.source = b.target AND a.target = b.source);
DROP TABLE IF EXISTS Copy;

Table 964: AllPairsShortestPath DuplicatesRemoved Table Schema

Column Name Data Type Description


source VARCHAR Source key.
target VARCHAR Target key.
rn INTEGER Row number in edges_table.
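After running this code, you can confirm that no mirrored pairs remain. This check is illustrative, not part of the original procedure, and assumes the graph has no self-loops:

SELECT COUNT(*) AS remaining_duplicates
FROM DuplicatesRemoved a
JOIN DuplicatesRemoved b
  ON a.source = b.target AND a.target = b.source;

A count of 0 means that each undirected edge is now represented by a single row.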

Output
For source and target vertices connected by a path, the function outputs their corresponding source and target vertex keys and the distance of the shortest path between them. The function does not output cycle information.
Table 965: AllPairsShortestPath Output Table Schema

Column Name  Data Type           Description
source       INTEGER or VARCHAR  Source vertex key.
target       INTEGER or VARCHAR  Target vertex key.
distance     INTEGER             Distance of shortest path between the source and target vertices.

Examples
• Input
• Example 1: Unweighted, Unbounded Graph
• Example 2: Weighted, Unbounded Graph
• Example 3: Weighted, Bounded Graph with Sources

Input
In the graph in the following figure, the nodes represent persons—light blue for males and dark blue for
females. The directed edges represent phone calls from one person to another. Node size represents number
of connections (degree centrality).

Figure 17: Graph of Phone Calls Between Persons

The graph in the preceding figure is represented by the vertices and edges tables callers and calls,
respectively.
Table 966: AllPairsShortestPath Examples Vertices Table callers

callerid callername
1 John
2 Carla
3 Simon
4 Celine
5 Winston
6 Diana

Table 967: AllPairsShortestPath Examples Edges Table calls

callerfrom callerto calls


2 4 7
2 6 12
4 6 4
1 2 10
1 3 2
1 4 5
1 6 6
3 6 1
5 6 10

Example 1: Unweighted, Unbounded Graph

SQL-MapReduce Call

SELECT * FROM AllPairsShortestPath (
  ON callers AS vertices PARTITION BY callerid
  ON calls AS edges PARTITION BY callerfrom
  TargetKey ('callerto')
  MaxDistance ('-1')
) ORDER BY source, target;

Output

Table 968: AllPairsShortestPath Example 1 Output Table

source target distance


1 2 1
1 3 1
1 4 1
1 6 1
2 4 1
2 6 1
3 6 1
4 6 1
5 6 1
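As the summary notes, these pairs can be aggregated into per-vertex metrics. For example, the following query, an illustrative sketch rather than part of the original example, counts the callers reachable from each source within one hop:

SELECT source, COUNT(*) AS within_one_hop
FROM AllPairsShortestPath (
  ON callers AS vertices PARTITION BY callerid
  ON calls AS edges PARTITION BY callerfrom
  TargetKey ('callerto')
  MaxDistance ('1')
) GROUP BY source
ORDER BY source;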

Example 2: Weighted, Unbounded Graph

SQL-MapReduce Call

SELECT * FROM AllPairsShortestPath (
  ON callers AS vertices PARTITION BY callerid
  ON calls AS edges PARTITION BY callerfrom
  TargetKey ('callerto')
  EdgeWeight ('calls')
  MaxDistance ('-1')
) ORDER BY source, target;

Output

Table 969: AllPairsShortestPath Example 2 Output Table

source target distance


1 2 10
1 3 2
1 4 5
1 6 3
2 4 7
2 6 11
3 6 1
4 6 4
5 6 10

Example 3: Weighted, Bounded Graph with Sources

SQL-MapReduce Call

SELECT * FROM AllPairsShortestPath (
  ON callers AS vertices PARTITION BY callerid
  ON calls AS edges PARTITION BY callerfrom
  ON (SELECT callerid FROM callers
      WHERE callerid IN (1, 2)) AS sources PARTITION BY callerid
  TargetKey ('callerto')
  MaxDistance ('8')
  EdgeWeight ('calls')
) ORDER BY source, target;

Output

Table 970: AllPairsShortestPath Example 3 Output Table

source target distance


1 3 2
1 4 5
1 6 3
2 4 7


Betweenness

Summary
The Betweenness function returns the betweenness score, a centrality measurement, for every vertex (node)
in the input graph.

Background
Betweenness, a measure of the centrality of a node in a network, is the fraction of shortest paths between
node pairs that pass through the node of interest.
In a sense, betweenness is a measure of the influence that a node has over the spread of information through
the network. However, by counting only shortest paths, the conventional definition of betweenness implicitly assumes that information spreads only along those shortest paths. The conventional definition of betweenness is:

C_B(v) = Σ_{s≠v≠t} σ_{s,t}(v) / σ_{s,t}

where σ_{s,t} is the number of shortest paths between nodes s and t, and σ_{s,t}(v) is the number of shortest paths between s and t that pass through node v.
The Betweenness function uses a hybrid distributed AllPairShortestPath (APSP) algorithm, which executes a
single-node shortest path (SNSP) algorithm for each vertex in the graph. By restricting the number of
parallel SNSP executions to groups of K vertices, the APSP algorithm enables a trade-off between time and
memory usage. (For more information on APSP, see AllPairsShortestPath.)
Assume that N is the number of vertices in the graph, K is the number of vertices that start their SNSP
algorithms in parallel, and D is the number of iterations required to complete a single SNSP algorithm. The
APSP algorithm completes when N/K of these SNSP algorithms have completed. The hybrid distributed
APSP algorithm requires O(D*N/K) iterations and O(N*K) space. In the worst case, D is bounded by the
diameter DG of the graph; however, you can specify a bound M on the distance between vertices. For
unweighted graphs, D <= min(M, DG).
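As a rough worked illustration with invented numbers: for N = 1,000 vertices and K = 100, the algorithm
runs N/K = 10 batches of parallel SNSP executions. If each SNSP needs D = 6 iterations, the total is on the
order of D*N/K = 60 iterations, while the space used at any time is on the order of N*K = 100,000
intermediate entries.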

Usage

Betweenness Syntax
Version 1.2

SELECT * FROM Betweenness (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON targets_table AS targets PARTITION BY vertex_key ]
[ ON sources_table AS sources PARTITION BY source_vertex_key ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ EdgeWeight ('edge_weight_column') ]
[ MaxDistance ('max_distance') ]
[ GroupSize ('group_size') ]
[ SampleRate ('sample_rate') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TargetKey Required Specifies the target key (the names of the edges table columns that identify the
target vertex). If you specify targets_table, then the function uses only the
vertices in targets_table as targets (which must be a subset of those that this
argument specifies).
Directed Optional Specifies whether the graph is directed. The default value is 'true'.
EdgeWeight Optional Specifies the name of the edges table column that contains edge weights. The
weights are positive values. By default, the weight of each edge is 1 (that is, the
graph is unweighted).
MaxDistance Optional Specifies the maximum distance (an integer) between the source and target
vertices. A negative max_distance specifies an infinite distance. If vertices are
separated by more than max_distance, the function does not output them. The
default value is 10.
GroupSize Optional Specifies the number of source vertices that execute a SNSP algorithm in
parallel. If group_size exceeds the number of source vertices in each partition, s,
then s is the group size. By default, the function calculates the optimal group
size based on various cluster and query characteristics.
Running a group of vertices on each vworker, in parallel, uses less memory than
running all vertices on each vworker.

SampleRate Optional Specifies the sample rate (the percentage of source vertices to sample), a
DOUBLE PRECISION value in the range (0.0, 1.0]. The number of source
vertices that the function uses to generate betweenness is approximately
sample_rate*n, where n is the number of vertices in the graph.
Accumulate Optional Specifies the names of the vertices table columns to copy to the output table.

Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a
vertex of the graph, and each row of the edges table represents an edge of the graph.
The function has two optional input tables, sources and targets, which specify the vertices that are sources
and targets, respectively. For a directed graph, these tables are required. By default, all vertices are sources
and targets; that is, the graph is undirected.
For an undirected graph, the edges table might have duplicate rows (one row for each direction of the same
edge). Remove them, using the code in the section Deleting Duplicate Edges Table Rows of the function
AllPairsShortestPath, or a query like the sketch below.
For a large graph, specifying the optional sources and targets tables can improve function performance.
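The following query is a minimal sketch of that duplicate removal. It assumes a hypothetical edges table
my_edges with single-column source and target keys src and tgt (adapt the names to your schema); for each
undirected edge, it keeps the row whose source key is smaller:

DELETE FROM my_edges e1
WHERE EXISTS (
  SELECT 1 FROM my_edges e2
  WHERE e2.src = e1.tgt
  AND e2.tgt = e1.src
  AND e1.src > e2.src -- keep the duplicate with the smaller source key
);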
Table 971: Betweenness Vertices Table Schema

Column Name Data Type Description


vertex_key_column    Any allowed in PARTITION BY clause
    Column that is, or is part of, the unique vertex key. Every column that is part of this key must appear
    in the PARTITION BY clause. This column cannot contain NULL values.

Table 972: Betweenness Edges Table Schema

Column Name Data Type Description


source_vertex_key_column    Any allowed in PARTITION BY clause
    Column that is, or is part of, the key that identifies the source vertex of the edge. This key must be a
    vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY
    clause. This column cannot contain NULL values.
target_vertex_key_column    Same as data type of source_vertex_key_column
    Column that is, or is part of, the key that identifies the target vertex of the edge. This key must be a
    vertex key in the vertices table. This column can contain NULL values.
edge_weight    SMALLINT, INTEGER, or NUMERIC
    Column that contains the weights of the edges, which must be positive values. This column is required
    only for a weighted graph. This column can contain NULL values.

Table 973: Betweenness Sources Table Schema

Column Name Data Type Description


source_vertex_key_column    Same as data type of corresponding vertex_key_column in vertices table
    Column that is, or is part of, the key that identifies a source vertex. This key must be a vertex key in
    the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Table 974: Betweenness Targets Table Schema

Column Name Data Type Description


target_vertex_key_column    Same as data type of corresponding vertex_key_column in vertices table
    Column that is, or is part of, the key that identifies a target vertex. This key must be a vertex key in
    the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Output
Table 975: Betweenness Output Table Schema

Column Data Type Description


source    VARCHAR
    Key of the source vertex.
betweenness_score    DOUBLE PRECISION
    Betweenness score.
accumulate_column    Same as in vertices table
    Column copied from the vertices table.

Example
This example computes the betweenness score for each person in the social network shown in the following
figure.
Figure 18: Betweenness Example Social Network

Input
The vertices table contains the names of the people and the edges table represents the connections between
them.
Table 976: Betweenness Example Vertices Table soc_nw_vertices

vertexid
TED
RICKY
ETHEL
FRED
JOE
RANDY
LUCY

Table 977: Betweenness Example Edges Table soc_nw_edges

source target
TED ETHEL
RICKY FRED
ETHEL LUCY
ETHEL RANDY
FRED ETHEL
ETHEL FRED
JOE ETHEL
RANDY RICKY
RICKY RANDY
FRED LUCY

SQL-MapReduce Call

SELECT * FROM Betweenness (


ON soc_nw_vertices AS vertices PARTITION BY vertexid
ON soc_nw_edges AS edges PARTITION BY source
TargetKey ('target')
Accumulate ('vertexid')
) ORDER BY vertexid;

Output
Ethel has the highest betweenness score.

Table 978: Betweenness Example Output Table

vertexid betweenness
ETHEL 10
FRED 4
JOE 0
LUCY 0
RANDY 4
RICKY 3
TED 0

Closeness

Summary
The Closeness function returns closeness and k-degree scores for each specified source vertex in a graph.
The closeness scores are the inverse of the sum, the inverse of the average, and the sum of inverses for the
shortest distances to all reachable target vertices (excluding the source vertex itself). The graph can be
directed or undirected, weighted or unweighted.
For a large graph, you can apply the function to a random sample of the specified target vertices to get an
efficient approximation of the closeness and k-degree scores.

Background
Closeness and k-degree scores are fundamental distance-based centrality metrics used in network structure
analysis. Both measure the time needed to spread information from vertex v to a set of target vertices.
The closeness score is classically defined for each vertex v as either the inverse sum or the inverse average of
the shortest distances from v to all other reachable vertices u. The classical definition does not apply to
disconnected graphs; alternative definitions of closeness have been proposed for them.
The Closeness function applies the classical definition of closeness to connected graphs and an alternative
definition to disconnected graphs. The alternative definition that the function uses adds 0 to the sum for
each unreachable target vertex, which is consistent with the classic definition, because the inverse distance is
effectively 0 for a disconnected graph.
The k-degree score is defined for vertex v as the number of vertices whose distance from v is less than or
equal to k.
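For example, with purely illustrative numbers: if the shortest distances from v to its three reachable targets
are 1, 2, and 4, and a fourth target is unreachable, then the inverse sum is 1/(1+2+4) = 1/7, the inverse
average is 3/(1+2+4) = 3/7, and the sum of inverses is 1/1 + 1/2 + 1/4 + 0 = 1.75 (the unreachable target
contributes 0). With k = 2, the k-degree of v is 2, because only the targets at distances 1 and 2 qualify.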
The Closeness function uses a hybrid distributed all pairs shortest path (APSP) algorithm to calculate the
shortest distances from each specified source vertex to each specified target vertex and then aggregates these
shortest distances into closeness and k-degree scores for each source vertex. By restricting the number of
parallel single node shortest path (SNSP) executions to groups of P vertices, the APSP algorithm enables a
trade-off between time and memory usage. The APSP algorithm completes when N/P of these groups have
completed, where N is the number of vertices in the graph. (For more information on APSP, see
AllPairsShortestPath.)

Usage

Closeness Syntax
Version 1.2

SELECT * FROM Closeness (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON sources_table AS sources PARTITION BY source_vertex_key ]
[ ON targets_table AS targets PARTITION BY target_vertex_key ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ EdgeWeight ('edge_weight') ]
[ MaxDistance ('max_distance') ]
[ GroupSize ('group_size') ]
[ SampleRate ('sample_rate') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TargetKey Required Specifies the target key (the names of the edges table columns that identify the
target vertex). If you specify targets_table, then the function uses only the
vertices in targets_table as targets (which must be a subset of those that this
argument specifies).
Directed Optional Specifies whether the graph is directed. The default value is 'true'.
EdgeWeight Optional Specifies the name of the edges table column that contains edge weights. The
weights are positive values. By default, the weight of each edge is 1 (that is, the
graph is unweighted).
MaxDistance Optional Specifies the maximum distance (an integer) between the source and target
vertices. A negative max_distance specifies an infinite distance. If vertices are
separated by more than max_distance, the function does not output them. The
default value is 10.
GroupSize Optional Specifies the number of source vertices that execute a SNSP algorithm in
parallel. If group_size exceeds the number of source vertices in each partition, s,
then s is the group size. By default, the function calculates the optimal group
size based on various cluster and query characteristics.
Running a group of vertices on each vworker, in parallel, uses less memory than
running all vertices on each vworker.
SampleRate Optional Specifies the sample rate (the percentage of source vertices to sample), a
numeric value in the range (0, 1]. The default value is 1.
Accumulate Optional Specifies the names of the vertices table columns to copy to the output table.

Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a
vertex of the graph, and each row of the edges table represents an edge of the graph.
The function has two optional input tables, sources and targets, which specify the vertices that are sources
and targets, respectively. For a directed graph, these tables are required. By default, all vertices are sources
and targets; that is, the graph is undirected.
For an undirected graph, the edges table might have duplicate rows. Remove them, using the code in the
section Deleting Duplicate Edges Table Rows of the function AllPairsShortestPath.
Table 979: Closeness Vertices Table Schema

Column Name Data Type Description


vertex_key_column    Any allowed in PARTITION BY clause
    Column that is, or is part of, the unique vertex key. Every column that is part of this key must appear
    in the PARTITION BY clause. This column cannot contain NULL values.

Table 980: Closeness Edges Table Schema

Column Name Data Type Description


source_vertex_key_column    Any allowed in PARTITION BY clause
    Column that is, or is part of, the key that identifies the source vertex of the edge. This key must be a
    vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY
    clause. This column cannot contain NULL values.
target_vertex_key_column    Same as data type of source_vertex_key_column
    Column that is, or is part of, the key that identifies the target vertex of the edge. This key must be a
    vertex key in the vertices table. This column can contain NULL values.
edge_weight    SMALLINT, INTEGER, or NUMERIC
    Column that contains the weights of the edges, which must be positive values. This column is required
    only for a weighted graph. This column can contain NULL values.

Table 981: Closeness Sources Table Schema

Column Name Data Type Description


source_vertex_key_column    Same as data type of corresponding vertex_key_column in vertices table
    Column that is, or is part of, the key that identifies a source vertex. This key must be a vertex key in
    the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Table 982: Closeness Targets Table Schema

Column Name Data Type Description


target_vertex_key_column    Same as data type of corresponding vertex_key_column in vertices table
    Column that is, or is part of, the key that identifies a target vertex. This key must be a vertex key in
    the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Output
The output table has a row for each source vertex v. The reachable target vertices exclude v itself; that is, the
function does not calculate the closeness and k-degree scores for loops.
Table 983: Closeness Output Table Schema

Column Data Type Description


inv_sum_dist    DOUBLE PRECISION
    Inverse of the sum of the shortest distances to all reachable target vertices. If there are no reachable
    target vertices, then this value is NULL.

inv_avg_dist    DOUBLE PRECISION
    Inverse of the average shortest distance to all reachable target vertices. If there are no reachable
    target vertices, then this value is NULL.
sum_inv_dist    DOUBLE PRECISION
    Sum of the inverse distances to all reachable target vertices. If there are no reachable target vertices,
    then this value is NULL.
kdegree    DOUBLE PRECISION
    Total number of reachable target vertices. If there are no reachable target vertices, then this value is 0.
accumulate_column    Same as in vertices table
    Column copied from the vertices table.

Examples
• Input
• Example 1: Unweighted, Unbounded Graph
• Example 2: Weighted, Bounded Graph, MaxDistance=12
• Example 3: Weighted, Bounded Graph, MaxDistance=8

Input
In the graph in the following figure, the nodes represent persons—light blue for males and dark blue for
females. The directed edges represent phone calls from one person to another. Node size represents number
of connections (degree centrality).
Figure 19: Graph of Phone Calls Between Persons

The graph in the preceding figure is represented by the vertices and edges tables callers and calls,
respectively.
Table 984: Closeness Examples Vertices Table callers

callerid callername
1 John
2 Carla
3 Simon
4 Celine
5 Winston
6 Diana

Table 985: Closeness Examples Edges Table calls

callerfrom callerto calls


1 2 10
1 3 2
1 4 5
1 6 6
2 4 7
2 6 12
3 6 1
4 6 4
5 6 10

Example 1: Unweighted, Unbounded Graph

SQL-MapReduce Call

SELECT * FROM Closeness (


ON callers AS vertices PARTITION BY callerid
ON calls AS edges PARTITION BY callerfrom
TargetKey ('callerto')
MaxDistance ('-1')
Accumulate ('callerid', 'callername')
) ORDER BY callerid;

Output
Because callerid 6 (Diana) has no outbound calls, the k-degree is 0.
Table 986: Closeness Example 1 Output Table

callerid callername inv_sum_dist inv_avg_dist sum_inv_dist kdegree


1 John 0.25 1 4 4
2 Carla 0.5 1 2 2
3 Simon 1 1 1 1
4 Celine 1 1 1 1
5 Winston 1 1 1 1
6 Diana 0

Example 2: Weighted, Bounded Graph, MaxDistance=12

SQL-MapReduce Call

SELECT * FROM Closeness (


ON callers AS vertices PARTITION BY callerid
ON calls AS edges PARTITION BY callerfrom
TargetKey ('callerto')
EdgeWeight ('calls')
MaxDistance ('12')
Accumulate ('callerid', 'callername')
) ORDER BY callerid;

Output

Table 987: Closeness Example 2 Output Table

callerid callername inv_sum_dist inv_avg_dist sum_inv_dist kdegree


1 John 0.05 0.2 1.13333333333333 4
2 Carla 0.0555555555555556 0.111111111111111 0.233766233766234 2
3 Simon 1 1 1 1
4 Celine 0.25 0.25 0.25 1
5 Winston 0.1 0.1 0.1 1
6 Diana 0

Example 3: Weighted, Bounded Graph, MaxDistance=8

SQL-MapReduce Call

SELECT * FROM Closeness (


ON callers AS vertices PARTITION BY callerid
ON calls AS edges PARTITION BY callerfrom
TargetKey ('callerto')
EdgeWeight ('calls')
MaxDistance ('8')
Accumulate ('callerid', 'callername')
) ORDER BY callerid;

Output

Table 988: Closeness Example 3 Output Table

callerid callername inv_sum_dist inv_avg_dist sum_inv_dist kdegree


1 John 0.1 0.3 1.03333333333333 3
2 Carla 0.142857142857143 0.142857142857143 0.142857142857143 1
3 Simon 1 1 1 1
4 Celine 0.25 0.25 0.25 1
5 Winston 0
6 Diana 0

EigenvectorCentrality

Summary
The EigenvectorCentrality function calculates the centrality (relative importance) of each node in a graph.

Background

Centrality Formulas
In the centrality formulas:
• G represents a graph.
• V represents a vertex.
• N represents the total number of vertices.
• A represents the adjacency matrix of vertices.
• aij represents the element in the matrix that represents the relationship between vertex i and vertex j.
• ci represents the centrality value of vertex i.

Eigenvector Centrality
Bonacich (1972) suggests that the eigenvector of the largest eigenvalue of an adjacency matrix could make a
good network centrality measure. He defines Eigenvector Centrality as:

ci = (1/λ) Σj aij cj

where λ is the largest eigenvalue of the adjacency matrix A.
For more information about this formula, see:


Bonacich, P. Factoring and Weighting Approaches to Status Scores and Clique Identification. Journal of
Mathematical Sociology 2 (1972), 113-120.

Katz Centrality
Katz (1953) gives a measure of centrality as:

ci = α Σj aij cj + β
For more information about this formula, see:


Katz, L. A New Status Index Derived from Sociometric Analysis. Psychometrika 18 (1953), 39-43.

Bonacich Centrality
Bonacich (1987) writes a more generic centrality measure as:

ci = Σj (β + α cj) aij

Note:
α and β are exchanged in the above formula (compared to the original one) to be consistent with Katz
centrality.

For more information about this formula, see:


Bonacich, P. Power and Centrality: A Family of Measures. American Journal of Sociology 92 (1987),
1170-1182.

Eigenvector and Eigenvalue


An eigenvector of a square matrix Α is a nonzero vector υ such that Αυ=λυ. The number λ is called the
eigenvalue of Α corresponding to υ.
The eigenvalues of a matrix Α can be determined by finding the roots of the characteristic equation
det(A−λI) = 0.

Power Iteration
Power Iteration is an eigenvalue algorithm to find the largest eigenvalue and corresponding eigenvector.
This algorithm does not compute a matrix decomposition; therefore, it can be used when Α is a very large
sparse matrix.

The power iteration algorithm starts with a vector b0, which can be an approximation to the dominant
eigenvector or a random vector. The method is described by the following iteration:
bk+1 = Abk / || Abk ||
At every iteration, the vector bk is multiplied by matrix Α and normalized.
The sequence (bk) does not necessarily converge. A subsequence of (bk) converges to an eigenvector
associated with the dominant eigenvalue under these conditions:
• Α has an eigenvalue that is strictly greater in magnitude than its other eigenvalues.
• Starting vector b0 has a nonzero component in the direction of an eigenvector associated with the
dominant eigenvalue.
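As a small worked illustration of the iteration (the matrix is chosen only for this sketch), let
A = [[2, 1], [1, 2]] and b0 = (1, 0):

Ab0 = (2, 1), ||Ab0|| ≈ 2.236, so b1 ≈ (0.894, 0.447)
Ab1 ≈ (2.236, 1.789), ||Ab1|| ≈ 2.864, so b2 ≈ (0.781, 0.625)

The sequence converges toward (0.707, 0.707), an eigenvector of the dominant eigenvalue λ = 3.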

Centrality Calculation
To calculate centrality using the formulas described in Centrality Formulas, the EigenvectorCentrality
function uses an in-neighbors relation matrix of the input source key and target key. In this matrix, aij has
the value 1 if there is an edge from j to i.
Figure 20: In-Neighbors Relation Matrix

If you need an out-neighbors adjacency matrix (for example, to calculate the contribution of a vertex to other
vertices), exchange the source key and target key columns and then invoke this function.
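The following is a minimal sketch of that exchange. The tables my_vertices and my_edges and their columns
are illustrative names only; the view simply swaps the source and target key columns before the function call:

CREATE VIEW my_edges_reversed AS
  SELECT tgt AS src, src AS tgt, weight
  FROM my_edges;

SELECT * FROM EigenvectorCentrality (
  ON my_vertices AS vertices PARTITION BY vid
  ON my_edges_reversed AS edges PARTITION BY src
  TargetKey ('tgt')
  EdgeWeight ('weight')
  Accumulate ('vid')
) ORDER BY centrality DESC;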

Usage

EigenvectorCentrality Syntax
Version 1.1

SELECT * FROM EigenvectorCentrality (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
TargetKey ({ 'edge_attribute' | 'edge_attribute_range' }[,...])
[ EdgeWeight ('edge_weight') ]
[ Family ({ 'katz' | 'bonacich' | 'eigenvector' }) ]
[ Alpha ('alpha_value') ]
[ Beta ('beta_value') ]
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ MaxIterNum ('max_iteration_number') ]
[ Threshold ('threshold') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TargetKey Required Specifies the names of the target key columns in the edges table. The
number and data types of columns must correspond to those of
vertex_key.
EdgeWeight Optional Specifies the name of the edges table column that contains the edge
weights. The edge weights must be positive values. If you omit this
argument, then the graph is unweighted.
Family Optional Specifies the centrality formula. The default value is 'eigenvector'. For
descriptions of the centrality formulas, refer to Centrality Formulas.
Alpha Optional Specifies the alpha value for the Katz or Bonacich centrality formula. The
default value is 0.85.
Beta Optional Specifies the beta value for the Katz or Bonacich centrality formula. The
default value is 1 for Katz and 0 for Bonacich.
Directed Optional Specifies whether the graph is directed. The default value is 'true'.
MaxIterNum Optional Specifies the maximum number of iterations for the function. The
default value is 20.
Threshold Optional Specifies the threshold for convergence (the difference between bk+1
and bk). The default value is 0.001.
Accumulate Optional Specifies the names of the input columns to copy to the output table.

Input
The EigenvectorCentrality function has two required input tables, vertices and edges.
The vertices table defines the set of vertices in the graph. Each row represents a vertex. The following table
describes the vertices table columns that the function uses. The table can have additional columns, but the
function ignores them.
Table 989: EigenvectorCentrality Vertices Table Schema

Column Name Data Type Description


vertex_key_column    Any allowed in PARTITION BY clause
    Column that is, or is part of, the unique vertex key. Every column that is part of this key must appear
    in the PARTITION BY clause.
accumulate_column    Any
    Column to be copied to the output table.

The edges table defines the set of edges in the graph. Each row represents an edge. The following table
describes the schema of the edges table. The table can have additional columns, but the function ignores
them.
Table 990: EigenvectorCentrality Edges Table Schema

Column Name Data Type Description


source_vertex_key_column    Any allowed in PARTITION BY clause
    Column that is, or is part of, the unique source vertex key. Every column that is part of this key must
    appear in the PARTITION BY clause.
target_vertex_key_column    Same as data type of source_vertex_key_column
    Column that is, or is part of, the unique target vertex key.
edge_weight    SMALLINT, INTEGER, or NUMERIC
    Column that contains the weights of the edges, which must be positive values. This column is required
    only for a weighted graph.
accumulate_column    Any
    Column to be copied to the output table.

Output
Table 991: EigenvectorCentrality Output Table Schema

Column Name Data Type Description


accumulate_column    Same as in input table
    Column copied from the input table.
centrality    DOUBLE PRECISION
    Centrality value (relative importance) of the node in the graph.

Examples
• Input
• Example 1: Eigenvector Centrality (by Default)
• Example 2: Katz Centrality
• Example 3: Bonacich Centrality

Input
In the graph in the following figure, the nodes represent college sophomores and the edges represent the
number of elective subjects that both sophomores have taken.

Figure 21: EigenvectorCentrality Example Input Graph

The graph in the preceding figure is represented by the vertices and edges tables sophomores and
common_classes, respectively.
Table 992: EigenvectorCentrality Example Vertices Table sophomores

id name
A Allen
B Becky
C Cathy
D Darren

Table 993: EigenvectorCentrality Example Edges Table common_classes

startid endid electives


A B 1
A C 1
B C 1
C D 1
D A 1

Example 1: Eigenvector Centrality (by Default)

SQL-MapReduce Call

SELECT * FROM EigenvectorCentrality (


ON sophomores AS vertices PARTITION BY id
ON common_classes AS edges PARTITION BY startid
TargetKey ('endid')
Accumulate ('id')
EdgeWeight ('electives')
) ORDER BY centrality DESC;

Output

Table 994: EigenvectorCentrality Example 1 Output Table

id centrality
C 0.649450550239096
D 0.528366549347061
A 0.418290184899757
B 0.352244366231374

Example 2: Katz Centrality

SQL-MapReduce Call

SELECT * FROM EigenvectorCentrality (


ON sophomores AS vertices PARTITION BY id
ON common_classes AS edges PARTITION BY startid
TargetKey ('endid')
Accumulate ('id','name')
Family ('katz')
EdgeWeight ('electives')
) ORDER BY centrality DESC;

Output

Table 995: EigenvectorCentrality Example 2 Output Table

id name centrality
C Cathy 0.632103166675334
D Darren 0.50712609393646
A Allen 0.441313029999005
B Becky 0.385371925652169

Example 3: Bonacich Centrality

SQL-MapReduce Call

SELECT * FROM EigenvectorCentrality (


ON sophomores AS vertices PARTITION BY id
ON common_classes AS edges PARTITION BY startid
TargetKey ('endid')
Accumulate ('id')
Family ('bonacich')
Beta ('0.01')
EdgeWeight ('electives')
) ORDER BY centrality DESC;

Output

Table 996: EigenvectorCentrality Example 3 Output Table

id centrality
C 0.632123195866825
D 0.529800961949689
A 0.445129026468483
B 0.348699520733132

gTree

Summary
The gTree function follows all paths in a graph, starting from a given set of root vertices, and calculates
specified aggregate functions along those paths.

Background
The gTree function is similar to the function nTree, but gTree is implemented using the SQL-GR™ engine.
The SQL-GR™ engine allows the gTree function to traverse arbitrary graphs.
Some information in nTree arguments is input to the gTree function differently, as the following table
shows.
Table 997: gTree Analogs of nTree Arguments

nTree Argument    gTree Analog

Root_Node    gTree gets the root nodes of the trees from the input table root.
Node_ID    gTree gets the nodes from the partition keys for the input tables vertices, edges, and root. The
    names of these keys can differ, but each partition must have the same number of keys, and corresponding
    keys must have the same data type.
Parent_ID    The input table edges and the TargetKey argument define the parent ID. If the TargetKey
    specifies a vertex, then gTree finds its parent in the corresponding partition key entries.
Mode    To change the direction in which gTree traverses the graph, reverse the direction of the edges by
    switching the columns in the edges table partition key with the columns specified by the TargetKey
    argument, as in the sketch after this table.

Result    gTree gets the aggregate functions and their aliases from its Results argument, which is very
    similar to the nTree Result argument. However, the parser that gTree uses requires that you enclose each
    aggregate function and its alias in quotation marks.
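For example, a minimal sketch of that reversal, using the bus-route edges table gtree_edges from the
examples later in this section (a view is one convenient way to express it):

CREATE VIEW gtree_edges_reversed AS
  SELECT endnodeid AS nodeid, endnodestring AS nodestring,
         nodeid AS endnodeid, nodestring AS endnodestring
  FROM gtree_edges;

Calling gTree on gtree_edges_reversed as the edges input, with the same PARTITION BY and TargetKey
columns, traverses the bus route in the opposite direction.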

Usage

gTree Syntax
Version 1.0

SELECT * FROM gTree (


ON { table | view | (query) } AS vertices PARTITION BY key
ON { table | view | (query) } AS edges PARTITION BY key
ON { table | view | (query) } AS root PARTITION BY key
TargetKey ({ 'edges_column' | 'edges_column_range' }[,...])
[ AllowCycles ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ MaxIterNum ('max_depth') ]
[ Output ({ 'all' | 'end' }) ]
[ Results ('func(expr) [ AS alias]' [,...]) ]
[ EdgeResults ('func(expr) [ AS alias]' [,...]) ]
[ FinalEdgeFlag ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})]
);

Arguments
Argument Category Description
TargetKey Required Specifies the names of the columns in the edges table that identify the
target vertex of an edge.
AllowCycles Optional Specifies whether the input graph can include cycles. The default value is
'false'.
MaxIterNum Optional Specifies the maximum depth to which the function traverses the graph
(a nonnegative integer). The default value is 1000.
Output Optional Specifies whether the function outputs all paths ('all') or only paths that
end by reaching a leaf vertex, a cycle, or the maximum number of
iterations ('end'). The default value is 'end'.
Results Either Results or EdgeResults is required. Specifies the aggregate functions
that the function calculates along each vertex in each path (refer to the
following table). The function outputs one column of results for each
aggregate function. The column name is alias, if specified; otherwise it is
func(expr).
EdgeResults Either Results or EdgeResults is required. Specifies the aggregate functions
that the function calculates along each edge in each path (refer to the
following table). The function outputs one column of results for each
aggregate function. The column name is alias, if specified; otherwise it is
func(expr).
FinalEdgeFlag Optional Specifies whether the function includes the edge that follows the final
vertex when calculating the functions that EdgeResults specifies. The
default value is 'true'.

The following table describes the aggregate functions that the Results and EdgeResults arguments support.
In function syntax, expr, expr1, and expr2 are values from the vertices or edges table.
Table 998: Aggregate Functions Supported by gTree Function

Aggregate Function Description Return Value Data Type


Returns the Pearson product-moment correlation DOUBLE PRECISION
correl(expr1, expr2) between expr1 and expr2 on the path.

Returns the number of vertices in the path where INTEGER


count(expr [,...]) no expr is null.

Returns the number of distinct values of expr in the INTEGER


countdistinct(expr) vertices on the path.

Returns the value of expr at the final vertex in the Same as input
current(expr) path.

Returns 'true' if the gTree function ends the path BOOLEAN


cycle() by completing a cycle (that is, by visiting a vertex a
second time), 'false' otherwise.

Note:
Specify this function only in the EdgeResults
argument.

Returns a JSON string that contains the counts of CHARACTER


histogram(expr) each distinct value of expr on the path. VARYING

Returns 'true' if the gTree function ends the path BOOLEAN


leaf() by reaching a leaf vertex (that is, a vertex with no
outgoing edges), 'false' otherwise.
Returns the number of vertices on the path from INTEGER
level() the root vertex to the last visited vertex.

Returns the maximum value of expr over all Same as input


max(expr) vertices from the root vertex to the last visited
vertex.

1066 Teradata Aster Analytics Foundation User Guide


Chapter 11: Graph Analysis
gTree

Aggregate Function Description Return Value Data Type


Returns 'true' if the gTree function ends the path BOOLEAN
maxiteration() by reaching the maximum number of iterations
specified by the MaxIterNum argument, 'false'
otherwise.
Returns the mean value of expr over all vertices DOUBLE PRECISION
mean(expr) from the root vertex to the last visited vertex.

Returns the minimum value of expr over all Same as input


min(expr) vertices from the root vertex to the last visited
vertex.
Returns a string that represents the path from the CHARACTER
path(expr[,...]) root vertex to the last visited vertex. The string has VARYING
the form 'v1->v2->v3...' for a vertex path and '.-v1-
>.-v2->.-v3->...' for an edge path, where vn is the
value of the nth expr.
Returns the value of expr at the root vertex. Same as input
propagate(expr)

Returns the difference between the minimum and Same as input


range(expr) maximum values of expr on the path (max(expr) -
min(expr)).
Returns the standard deviation of expr over all DOUBLE PRECISION
stdev(expr) vertices from the root vertex to the last visited
vertex.
Returns the sum of expr over all vertices from the Same as input
sum(expr) root vertex to the last visited vertex.

Calculates the product of expr1 and expr2 at each Same as input if expr1
sumproduct(expr1, expr2) vertex on the path and then returns the sum of the and expr2 have the same
products. type, otherwise numeric

Input
The gTree function has three required input tables:
• vertices, which defines the set of vertices in the graph
• edges, which defines the set of edges in the graph
• root, which defines the set of root vertices from which the function starts traversing the graph
Table 999: gTree Vertices Table Schema

Column Name Data Type Description


nodeid INTEGER Numerical ID of the vertex.
nodestring VARCHAR Name of the vertex.

value SQL numeric data type Value of the vertex.

Table 1000: gTree Edges Table Schema

Column Name Data Type Description


nodeid INTEGER Numerical ID of the vertex at which the edge starts (the
source node of the edge).
nodestring VARCHAR Name of the vertex at which the edge starts.
endnodeid SQL numeric data type Numerical ID of the vertex at which the edge ends (the
target node of the edge).
endnodestring VARCHAR Name of the vertex at which the edge ends.

Table 1001: gTree Root Table Schema

Column Name Data Type Description


nodeid INTEGER Numerical ID of the root vertex.
nodestring VARCHAR Name of the root vertex.
value SQL numeric data type Value of the root vertex.

Output
Table 1002: gTree Output Table Schema

Column Name Data Type Description


alias if specified, otherwise func(expr)    Type returned by aggregate function (refer to Arguments)
    Result of the aggregate function. The table has one such column for each specified aggregate function.

Examples
• Input
• Example 1: Show All Paths from Root Nodes
• Example 2: Show Only Paths That Cycle or End at Leaves

Input
The vertices (nodes) are bus stops in a small town. The vertices table lists each bus stop and the boarding
fare at that stop.
Table 1003: gTree Example Vertices Table gtree_vertices

nodeid nodestring value


1 Park St 2.25
2 Main St 2.25
3 Walnut St 2.25
4 Water St 3.5
5 High St 2.25

The edges table represents the bus route. The columns nodeid and nodestring identify the source vertices
(where the bus starts) and the columns endnodeid and endnodestring identify the target vertices (where the
bus stops).
Table 1004: gTree Example Edges Table gtree_edges

nodeid nodestring endnodeid endnodestring


1 Park St 3 Walnut St
2 Main St 3 Walnut St
3 Walnut St 4 Water St
4 Water St 1 Park St
4 Water St 5 High St

The root table defines the set of root vertices from which the function starts traversing the graph.
Table 1005: gTree Example Root Table gtree_root

nodeid nodestring value


1 Park St 2.25
2 Main St 2.25

Example 1: Show All Paths from Root Nodes

SQL-GR™ Call

SELECT * FROM gTree (


ON gtree_vertices AS vertices PARTITION BY nodeid, nodestring
ON gtree_edges AS edges PARTITION BY nodeid, nodestring
ON gtree_root AS root PARTITION BY nodeid, nodestring
TargetKey ('endnodeid', 'endnodestring')
AllowCycles ('t')
MaxIterNum (10)
Output ('all')
Results (
'Propagate (nodeid) AS start_vertex',
'Current (nodeid) AS end_vertex',
'Path (nodestring)',
'Sum (value)',
'Cycle()',
'Leaf()'

)
EdgeResults ('PATH(nodestring, endnodestring) AS edgepath')
) ORDER BY 1,2,7;

Output
The output table has one column for each function that the Results or EdgeResults argument specifies. The
edgepath column shows the links that comprise the path, the cycle column shows whether the path is a cycle,
and the sum column shows the total fare for the path (the sum of the boarding fares at each node in the
path).
Table 1006: gTree Example 1 Output Table

start_vertex | end_vertex | path(nodestring) | sum(value) | cycle() | leaf() | edgepath

1 | 1 | Park St-> | 2.25 | false | false | .-Park StWalnut St->.
1 | 3 | Park St->Walnut St-> | 4.5 | false | false | .-Park StWalnut St->.-Walnut StWater St->.
1 | 4 | Park St->Walnut St->Water St-> | 8 | false | false | .-Park StWalnut St->.-Walnut StWater St->.-Water StHigh St->.
1 | 4 | Park St->Walnut St->Water St-> | 8 | true | false | .-Park StWalnut St->.-Walnut StWater St->.-Water StPark St->
1 | 5 | Park St->Walnut St->Water St->High St | 10.25 | false | true | .-Park StWalnut St->.-Walnut StWater St->.-Water StHigh St->.
2 | 1 | Main St->Walnut St->Water St->Park St-> | 10.25 | true | false | .-Main StWalnut St->.-Walnut StWater St->.-Water StPark St->.-Park StWalnut St->
2 | 2 | Main St-> | 2.25 | false | false | .-Main StWalnut St->.
2 | 3 | Main St->Walnut St-> | 4.5 | false | false | .-Main StWalnut St->.-Walnut StWater St->.
2 | 4 | Main St->Walnut St->Water St-> | 8 | false | false | .-Main StWalnut St->.-Walnut StWater St->.-Water StHigh St->.
2 | 4 | Main St->Walnut St->Water St-> | 8 | false | false | .-Main StWalnut St->.-Walnut StWater St->.-Water StPark St->.
2 | 5 | Main St->Walnut St->Water St->High St | 10.25 | false | true | .-Main StWalnut St->.-Walnut StWater St->.-Water StHigh St->.

Example 2: Show Only Paths That Cycle or End at Leaves

SQL-GR™ Call

SELECT * FROM gTree (


ON gtree_vertices AS vertices PARTITION BY nodeid, nodestring
ON gtree_edges AS edges PARTITION BY nodeid, nodestring
ON gtree_root AS root PARTITION BY nodeid, nodestring
TargetKey ('endnodeid', 'endnodestring')
AllowCycles ('t')
MaxIterNum (10)
Output ('end')
Results (
'Propagate (nodeid) AS start_vertex',
'Current (nodeid) AS end_vertex',
'Path (nodestring)',
'Sum (value)',
'Cycle()',
'Leaf()'
)
EdgeResults ('PATH(nodestring, endnodestring) AS edgepath')
) ORDER BY 1,2,7;

Output

Table 1007: gTree Example 2 Output Table

start_vertex | end_vertex | path(nodestring) | sum(value) | cycle() | leaf() | edgepath

1 | 4 | Park St->Walnut St->Water St-> | 8 | true | false | .-Park StWalnut St->.-Walnut StWater St->.-Water StPark St->
1 | 5 | Park St->Walnut St->Water St->High St | 10.25 | false | true | .-Park StWalnut St->.-Walnut StWater St->.-Water StHigh St->.
2 | 1 | Main St->Walnut St->Water St->Park St-> | 10.25 | true | false | .-Main StWalnut St->.-Walnut StWater St->.-Water StPark St->.-Park StWalnut St->
2 | 5 | Main St->Walnut St->Water St->High St | 10.25 | false | true | .-Main StWalnut St->.-Walnut StWater St->.-Water StHigh St->.


LocalClusteringCoefficient

Summary
The LocalClusteringCoefficient function extends the clustering coefficient to directed and weighted graphs.
The clustering coefficient, which was introduced in the context of binary undirected graphs, is a frequently
used tool for analyzing the structure of a network.
The LocalClusteringCoefficient function is based on the paper "Clustering in complex directed networks" by
Giorgio Fagiolo.

Background
The definition of the local clustering coefficient depends on the graph type:
• Unweighted, Undirected Network (BUN)
• Unweighted, Directed Network (BDN)
• Weighted, Directed Network (WDN)
• Weighted, Undirected Network (WUN)

Unweighted, Undirected Network (BUN)


The local clustering coefficient was originally defined on an unweighted, undirected (also called bi-directed)
network (BUN).
Let G=(V,E) be an undirected, simple (no self-loops, no multiple edges) graph (network) with a set of nodes
(vertices) V and a set of edges E.
The degree di of node i is defined to be the number of nodes in V that are adjacent to i. A complete subgraph
of three nodes of G can be considered as a triangle Δ.
Motivated by this, the number of triangles of node i is defined as:

δi = (1/2) Σj Σh aij aih ajh    (summing over j ≠ i and h ≠ i, h ≠ j)

where
aij = 1 if there is an edge from i to j; otherwise aij = 0.
A triple γ at node i is a path of length two for which i is the center node. The maximum number of triples
of node i is then defined as:

τi = (1/2) di (di - 1)

This occurs when every neighbor of node i is connected to every other neighbor of node i.
The clustering coefficient was introduced by Watts and Strogatz (D. J. Watts and S. H. Strogatz. Collective
dynamics of “small-world” networks. Nature, 393:440–442, 1998) in the context of social network analysis.
Given three actors i, j, and h, with mutual relations between i and j, as well as between i and h, it is supposed
to represent the likeliness that j and h are also related.

Based on the above, the clustering coefficient for a node i with di ≥ 2 is defined as:
ci = δi / τi
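As a quick numeric illustration with invented values: a node with degree di = 3 can center at most
τi = (1/2)(3)(3 - 1) = 3 triangles. If only two of those triangles exist (δi = 2), then ci = 2/3.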

Unweighted, Directed Network (BDN)


For an unweighted, directed network (BDN), given a vertex i, the BDN triangle types can be categorized
into four patterns.
Table 1008: Triangle Type Patterns

Pattern    Description

Cycle    There exists a cyclical relation among i and any two of its neighbors: i→j→h→i, or vice versa.
Middleman    One of i's neighbors, say j, both holds an outward edge to a third neighbor, say h, and uses i
    as a medium to reach h in two steps.
In    i holds two inward edges.
Out    i holds two outward edges.

For each pattern, its clustering coefficient (CC) can be defined as:
ci* = δi* / τi*
where {*}={cycle, middleman, in, out}.
Also, triples for each pattern can be defined as:
τi^cyc = di^in di^out - di^↔

where di^↔ is the number of bilateral edges between i and its neighbors.

τi^mid = di^in di^out - di^↔

τi^in = di^in (di^in - 1)

τi^out = di^out (di^out - 1)
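These formulas can be checked against Example 3 later in this section: country 1 (USA) has in_degree 2,
out_degree 3, and bi_degree 1, so τ1^cyc = (2)(3) - 1 = 5; with cyc_tri_cnt = 2, the cycle clustering
coefficient is 2/5 = 0.4, which is the cyc_cc value in the example output.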

Weighted, Directed Network (WDN)


For a weighted, directed network (WDN), CC is extended as:

ci*(w) = δi*(w) / τi*

where {*}={cycle, middleman, in, out}.

Assuming wij is the weight on the edge i→j, the weighted triangle count δi*(w) for node i is obtained from
the corresponding unweighted triangle count by replacing each aij with wij^(1/3), so that each triangle
contributes the geometric mean of its three edge weights.
Weighted, Undirected Network (WUN)


For a weighted, undirected network (WUN), assuming that wij is the weight on the edge i→j, for node i, the
clustering coefficient is defined as:

ci(w) = δi(w) / τi, where δi(w) = (1/2) Σj Σh (wij wih wjh)^(1/3)

Usage

LocalClusteringCoefficient Syntax
Version 1.1

SELECT * FROM LocalClusteringCoefficient (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
TargetKey
({ 'target_vertex_column' | 'target_vertex_column_range' }[,...])
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ EdgeWeight ('edge_weight') ]
[ DegreeRange ({ '[min:max]' | '[min:]' | '[:max]' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
In the DegreeRange argument, you must type the brackets. They do not indicate that their contents are
optional.

Arguments
Argument Category Description
TargetKey Required Specifies the key of the target vertex of an edge. The key consists of one
or more edges table column names.
Directed Optional Specifies whether the graph is directed. The default value is 'false'.
EdgeWeight Optional Specifies the name of the edges table column that contains the edge
weights. Each edge weight is a positive value in the range (0, 1]. By
default, the function treats the input graph as unweighted.
DegreeRange Optional Specifies the edge degree range—at least min and at most max
([min:max]), at least min ([min:]), or at most max ([:max]). The min and
max must be positive integers. The function outputs only nodes with
degrees in the specified range. By default, the function outputs all nodes.
Accumulate Optional Specifies the names of the vertices table columns to copy to the output
table.

Input
The LocalClusteringCoefficient function has two required input tables, vertices and edges.
The following table describes the vertices table columns that you must or can specify in the function call.
The table can have additional columns, but the function ignores them.
Table 1009: LocalClusteringCoefficient VerticesTable Schema

Column Name Data Type Description


vertex_key_column    INTEGER
    Column that is, or is part of, the vertex key. Every column in the key appears in this table. If the table
    has duplicate vertices, the function uses the first one and ignores the others.
accumulate_column    Any
    Column to be copied to the output table (optional).

The following table describes the edges table columns that you must or can specify in the function call. The
table can have additional columns, but the function ignores them.
The data-checking rules for the edges table are, given nodes A and B:
• No graph can have multiple A→B edges.
• An undirected graph cannot have edges A→B and B→A.
Table 1010: LocalClusteringCoefficient Edges Table Schema

Column Name Data Type Description


source_vertex_key_column    INTEGER
    Column that is, or is part of, the key of the source vertex. Every column in the key appears in this table.
target_vertex_key_column    INTEGER
    Column that is, or is part of, the key of the target vertex. Every column in the key appears in this table.
edge_weight    DOUBLE PRECISION
    Column that contains the edge weights. The function uses only positive edge weights. The sum of the
    edge weights that the function uses must be 1.

Output
The output table schema depends on the graph type. The graph types are:
• BUN: Unweighted, undirected network
• BDN: Unweighted, directed network
• WUN: Weighted, undirected network
• WDN: Weighted, directed network
The output table schemas for BDN and WDN graphs (described by the following two tables) refer to cycle,
middleman, in, and out triangles, which are explained in Unweighted, Directed Network (BDN).

Table 1011: LocalClusteringCoefficient Output Table Schema for BUN Graph

Column Name Data Type Description


accumulate_column    Same as in vertices table
    Column copied from the vertices table.
degree    INTEGER
    Number of neighbors (nodes directly connected to this node).
tri_cnt    INTEGER
    Number of triangles of the node.
cc    DOUBLE PRECISION
    Clustering coefficient of the node.

Table 1012: LocalClusteringCoefficient Output Table Schema for BDN Graph

Column Name Data Type Description


accumulate_column    Same as in vertices table
    Column copied from the vertices table.
in_degree    INTEGER
    Number of incoming edges.
out_degree    INTEGER
    Number of outgoing edges.
bi_degree    INTEGER
    Number of bilateral edges between the node and its neighbors.
cyc_tri_cnt    INTEGER
    Number of cycle triangles for the node.
mid_tri_cnt    INTEGER
    Number of middleman triangles for the node.
in_tri_cnt    INTEGER
    Number of in triangles for the node.
out_tri_cnt    INTEGER
    Number of out triangles for the node.
tri_cnt    INTEGER
    Total number of triangles for the node.
cyc_cc    DOUBLE PRECISION
    Clustering coefficient for cycle triangles for the node.
mid_cc    DOUBLE PRECISION
    Clustering coefficient for middleman triangles for the node.
in_cc    DOUBLE PRECISION
    Clustering coefficient for in triangles for the node.
out_cc    DOUBLE PRECISION
    Clustering coefficient for out triangles for the node.
avg_cc    DOUBLE PRECISION
    Overall clustering coefficient for the node.

Table 1013: LocalClusteringCoefficient Output Table Schema for WUN Graph

Column Name Data Type Description


accumulate_column    Same as in vertices table
    Column copied from the vertices table.
degree    INTEGER
    Number of neighbors (nodes directly connected to this node).
tri_cnt    INTEGER
    Number of triangles of the node.
cc    DOUBLE PRECISION
    Clustering coefficient of the node.
w_cc    DOUBLE PRECISION
    Weighted clustering coefficient of the node.

Table 1014: LocalClusteringCoefficient Output Table Schema for WDN Graph

Column Name Data Type Description


accumulate_column    Same as in vertices table
    Column copied from the vertices table.
in_degree    INTEGER
    Number of incoming edges.
out_degree    INTEGER
    Number of outgoing edges.
bi_degree    INTEGER
    Number of bilateral edges between the node and its neighbors.
cyc_tri_cnt    INTEGER
    Number of cycle triangles for the node.
mid_tri_cnt    INTEGER
    Number of middleman triangles for the node.
in_tri_cnt    INTEGER
    Number of in triangles for the node.
out_tri_cnt    INTEGER
    Number of out triangles for the node.
tri_cnt    INTEGER
    Total number of triangles for the node.
cyc_cc    DOUBLE PRECISION
    Clustering coefficient for cycle triangles for the node.
mid_cc    DOUBLE PRECISION
    Clustering coefficient for middleman triangles for the node.
in_cc    DOUBLE PRECISION
    Clustering coefficient for in triangles for the node.
out_cc    DOUBLE PRECISION
    Clustering coefficient for out triangles for the node.
avg_cc    DOUBLE PRECISION
    Overall clustering coefficient for the node.
w_cyc_cc    DOUBLE PRECISION
    Weighted clustering coefficient for cycle triangles for the node.
w_mid_cc    DOUBLE PRECISION
    Weighted clustering coefficient for middleman triangles for the node.
w_in_cc    DOUBLE PRECISION
    Weighted clustering coefficient for in triangles for the node.
w_out_cc    DOUBLE PRECISION
    Weighted clustering coefficient for out triangles for the node.
w_avg_cc    DOUBLE PRECISION
    Overall weighted clustering coefficient for the node.

Examples
• Input
• Example 1: WUN
• Example 2: WUN with DegreeRange 3 or Greater
• Example 3: WDN

Input
In the graph in the following figure, the nodes represent countries, edges connect countries that trade with
each other, and the numbers on the edges represent trade propensity.
Figure 22: Graph of Trading Partners

The graph in the preceding figure is represented by the vertices and edges tables country and trade,
respectively.
Table 1015: LocalClusteringCoefficient Examples Vertices Table country

countryid name
1 USA
2 China
3 UK
4 Japan
5 France

Table 1016: LocalClusteringCoefficient Examples Edges Table trade

fromid toid tradeweight


1 2 0.8
1 3 0.5
1 4 0.8
2 3 0.5
3 1 0.2
3 4 0.3
3 5 0.4
5 1 0.5

Example 1: WUN
This example treats the input graph as a weighted, undirected network (WUN).

SQL-MapReduce Call

SELECT * FROM LocalClusteringCoefficient (


ON trade AS edges PARTITION BY fromid
ON country AS vertices PARTITION BY countryid
TargetKey ('toid')
EdgeWeight ('tradeweight')
Directed ('f')
Accumulate ('countryid', 'name')
) ORDER BY countryid;

Output

Table 1017: LocalClusteringCoefficient Example 1 Output Table

countryid name degree tri_cnt cc w_cc


1 USA 4 6 1.00000 0.44642
2 China 2 2 2.00000 1.01569
3 UK 4 6 1.00000 0.44642
4 Japan 2 2 2.00000 0.85667
5 France 2 2 2.00000 0.80615

Example 2: WUN with DegreeRange 3 or Greater


This example treats the input graph as a weighted, undirected network and outputs only nodes with degree
3 or greater.

SQL-MapReduce Call

SELECT * FROM LocalClusteringCoefficient (


ON trade AS edges PARTITION BY fromid
ON country AS vertices PARTITION BY countryid
TargetKey ('toid')
Directed ('f')
EdgeWeight ('tradeweight')
DegreeRange ('[3:]')
Accumulate ('countryid', 'name')
) ORDER BY countryid;

Output

Table 1018: LocalClusteringCoefficient Example 2 Output Table

countryid name degree tri_cnt cc w_cc


1 USA 4 6 1.00000 0.44642
3 UK 4 6 1.00000 0.44642

Example 3: WDN
This example treats the input graph as a weighted, directed network (WDN).

SQL-MapReduce Call

SELECT * FROM LocalClusteringCoefficient (


ON trade AS edges PARTITION BY fromid
ON country AS vertices PARTITION BY countryid
TargetKey ('toid')
EdgeWeight ('tradeweight')
Directed ('t')
Accumulate ('countryid')
) ORDER BY countryid;

Output

Table 1019: LocalClusteringCoefficient Example 3 Output Table (Columns 1-6)

countryid in_degree out_degree bi_degree cyc_tri_cnt mid_tri_cnt


1 2 3 1 2 1
2 1 1 0 1 1
3 2 3 1 2 1
4 2 0 0 0 0
5 1 1 0 1 1

Table 1020: LocalClusteringCoefficient Example 3 Output Table (Columns 7-12)

in_tri_cnt out_tri_cnt tri_cnt cyc_cc mid_cc in_cc


1 2 6 0.40000 0.20000 0.50000
0 0 2 1.00000 1.00000 0.00000
1 2 6 0.40000 0.20000 0.50000
2 0 2 0.00000 0.00000 1.00000
0 0 2 1.00000 1.00000 0.00000

Table 1021: LocalClusteringCoefficient Example 3 Output Table (Columns 13-19)

out_cc avg_cc w_cyc_cc w_mid_cc w_in_cc w_out_cc w_avg_cc


0.33333 0.33333 0.17901 0.07268 0.17100 0.17967 0.14881
0.00000 1.00000 0.43089 0.58480 0.00000 0.00000 0.50785
0.33333 0.33333 0.17901 0.09865 0.29240 0.11757 0.14881
0.00000 1.00000 0.00000 0.00000 0.42833 0.00000 0.42833
0.00000 1.00000 0.46416 0.34200 0.00000 0.00000 0.40308

LoopyBeliefPropagation

Summary
Belief propagation, or sum-product message passing, is an algorithm for inferring probabilities from graphical
models, such as Bayesian networks and Markov random fields.
The LoopyBeliefPropagation function calculates, for a Bayesian network of binary variables, the marginal
distribution for each unobserved variable, conditional on any observed variables.

Background
A Bayesian network is a probabilistic graphical model that represents a set of random variables and their
conditional dependencies with a directed acyclic graph (DAG). For example, a Bayesian network can
represent the probabilistic relationships between symptoms and diseases. Given symptoms, belief
propagation can use the graph to compute the probabilities of the presence of various diseases.
Formally, Bayesian networks are DAGs whose vertices (or nodes) represent random variables in the
Bayesian sense: they may be observable quantities, latent variables, unknown parameters, or hypotheses.
Each vertex is associated with a probability function that takes as input the values of the vertex's parent variables and returns the probability of the variable represented by the vertex. For example, if the parents are m binary variables, then the probability function can be represented by a table of 2^m entries, one entry for each of the 2^m possible combinations of its parents' values (for m = 3 parents, 2^3 = 8 entries). If variables are conditionally dependent on each other, then the vertices that represent them are connected by edges.

For example, suppose that there are two reasons that grass can be wet: the sprinkler is on or it is raining.
Also, suppose that when it is raining, the sprinkler is less likely to be on. The situation can be modeled with
the Bayesian network shown in the following figure. The three variables are binary; their possible values are
T (true) and F (false).

To use the LoopyBeliefPropagation function, you must specify only the conditional dependence between variables (directed edges, possibly weighted) and the values for observed variables. The function computes the potential tables at the factor nodes.

Usage

LoopyBeliefPropagation Syntax
Version 1.0

SELECT * FROM LoopyBeliefPropagation (
ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON observation_table AS observation PARTITION BY source_vertex_key ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ ObservationColumn ('observation_column') ]
[ EdgeWeight ('edge_weight') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ MaxIterNum ('max_iter_num') ]
[ Threshold ('threshold') ]
);

Arguments
Argument Category Description
TargetKey Required Specifies the names of the edges table columns that comprise the key of the target vertices.
ObservationColumn Required with observation table, otherwise optional Specifies the name of the observation table column that contains the observations.
EdgeWeight Optional Specifies the name of the edges table column that contains the edge weights. The function uses only positive edge weights. The sum of the edge weights that the function uses must be 1.
Accumulate Optional Specifies the names of the vertices table columns to copy to
the output table.
MaxIterNum Optional Specifies the maximum number of iterations that the
algorithm can run. The default value is 20.
Threshold Optional Specifies the threshold value for convergence. The default
value is 0.0001.

Input
The LoopyBeliefPropagation function has two required input tables, vertices and edges, and one optional
input table, observation.
The following table describes the vertices table columns that you must or can specify in the function call.
The table can have additional columns, but the function ignores them.
Table 1022: LoopyBeliefPropagation Vertices Table Schema

Column Name Data Type Description

vertex_key_column INTEGER Column that is, or is part of, the vertex key. Every column in the key appears in this table. Every variable is represented in the graph by a vertex.

accumulate_column Any Column to be copied to the output table.

Note:
The per-vertex computational cost is exponential in the vertex in-degree: a vertex with in-degree n has a potential table of 2^n entries, so an in-degree of 20 implies roughly one million entries. Therefore, avoid vertices with an in-degree higher than 20; otherwise, the function can take an unexpectedly long time to finish.

If variables are conditionally dependent on each other, then the vertices that represent them are connected
by edges. The edges table contains the columns that comprise the keys of the source and target vertices of the
edges, and optionally, a column that contains the weights of the edges.
Table 1023: LoopyBeliefPropagation Edges Table Schema

Column Name Data Type Description


source_vertex_key_column INTEGER Column that is, or is part of, the key of the source vertex. Every column in the key appears in this table.
target_vertex_key_column INTEGER Column that is, or is part of, the key of the target vertex. Every column in the key appears in this table.
edge_weight DOUBLE PRECISION Optional column that contains the edge weights. The function uses only positive edge weights. The sum of the edge weights that the function uses must be 1.

The observations table contains the vertices (which represent variables) and the observations for observed
variables.
Table 1024: LoopyBeliefPropagation Observation Table Schema

Column Name Data Type Description


vertex INTEGER Column that is, or is part of, the vertex key. Every column in the key
appears in this table.
observation INTEGER For an observed variable: 1 or 0
For an unobserved variable: NULL

Output
Table 1025: LoopyBeliefPropagation Output Table Schema

Column Name Data Type Description


accumulate_column Same as in vertices table Column copied from the vertices table.
prob_true DOUBLE PRECISION Marginal probability that the variable represented by the vertex is true.

Examples
These examples use the LoopyBeliefPropagation function to determine the marginal probability of the
disease hepatitis by observing its symptoms.
Hepatitis is an inflammation of the liver that can be caused by drugs, alcohol, or (most often) a virus. Its
most common symptoms are:
• Jaundice (yellowing of the skin and whites of the eyes)
• Internal bleeding
• Loss of appetite
• Fatigue
• Fever
• Dark urine
• Stupor
• Nausea/vomiting
Given the presence or absence of a given symptom, the LoopyBeliefPropagation function determines the
conditional or marginal probability of hepatitis.
The following figure shows the DAG that represents the relationship between hepatitis and its symptoms.
The DAG represents each symptom by a conditional node and the disease by the dependent, unobserved
node. These examples assume that each observed node variable is independent and binary.
Figure 23: Relationship Between Hepatitis and Symptoms

• Example 1: Equally Weighted Symptoms/Edges
• Example 2: Unequally Weighted Symptoms/Edges

Example 1: Equally Weighted Symptoms/Edges


In this example, the probability of hepatitis depends on all symptoms equally; therefore, the edges table does
not include edge weights.

Input

Table 1026: LoopyBeliefPropagation Examples Vertices Table lbp_vertices

id vertex
1 Jaundice
2 Internal bleeding
3 Loss of appetite
4 Fatigue
5 Fever
6 Dark urine
7 Stupor
8 Nausea/vomiting
9 Hepatitis

Table 1027: LoopyBeliefPropagation Example 1 Edges Table lbp_edges

id source target
1 Jaundice Hepatitis
2 Internal bleeding Hepatitis
3 Loss of appetite Hepatitis
4 Fatigue Hepatitis
5 Fever Hepatitis
6 Dark urine Hepatitis
7 Stupor Hepatitis
8 Nausea/vomiting Hepatitis

In the observation table, 't' means that the symptom is present and 'f' means that it is absent.
Table 1028: LoopyBeliefPropagation Examples Observation Table lbp_observation

id vertex obs
1 Jaundice t
2 Internal bleeding t
3 Loss of appetite t
4 Fatigue t
5 Fever f
6 Dark urine t
7 Stupor f
8 Nausea/vomiting f

SQL-MapReduce Call

SELECT * FROM LoopyBeliefPropagation (
ON lbp_edges AS edges PARTITION BY source
ON lbp_vertices AS vertices PARTITION BY vertex
ON lbp_observation AS observation PARTITION BY vertex
TargetKey('target')
ObservationColumn('obs')
Accumulate('vertex')
MaxIterNum('20')
Threshold('1E-10')
) ORDER BY vertex;

Output
In the output table, 1 means that the symptom is present and 0 means that it is absent. Five of the eight
symptoms are present, so the conditional probability of hepatitis is 5/8 (0.625).
Table 1029: LoopyBeliefPropagation Example 1 Output Table

vertex prob_true
Dark urine 1
Fatigue 1
Fever 0
Hepatitis 0.625
Internal bleeding 1
Jaundice 1
Loss of appetite 1
Nausea/vomiting 0
Stupor 0

Example 2: Unequally Weighted Symptoms/Edges


In this example, the probability of hepatitis depends more on some symptoms than others; therefore, the
edges table includes edge weights (and the SQL-MapReduce call includes the EdgeWeight argument).

Input
Use the table below and the following tables from the Input section of Example 1:
• LoopyBeliefPropagation Examples Vertices Table lbp_vertices
• LoopyBeliefPropagation Examples Observation Table lbp_observation

Table 1030: LoopyBeliefPropagation Example 2 Edges Table lbp_weighted_edges

id source target edgewt


1 Jaundice Hepatitis 0.2
2 Internal bleeding Hepatitis 0.15
3 Loss of appetite Hepatitis 0.05
4 Fatigue Hepatitis 0.1
5 Fever Hepatitis 0.1
6 Dark urine Hepatitis 0.25
7 Stupor Hepatitis 0.05
8 Nausea/vomiting Hepatitis 0.1

SQL-MapReduce Call

SELECT * FROM LoopyBeliefPropagation (
ON lbp_weighted_edges AS edges PARTITION BY source
ON lbp_vertices AS vertices PARTITION BY vertex
ON lbp_observation AS observation PARTITION BY vertex
TargetKey('target')
ObservationColumn('obs')
EdgeWeight ('edgewt')
Accumulate('vertex')
MaxIterNum('20')
Threshold('1E-10')
) ORDER BY vertex;

Output
In the output table, 1 means that the symptom is present and 0 means that it is absent. The conditional
probability of hepatitis is the sum of the weights of the symptoms that are present (0.25 + 0.1 + 0.05 + 0.15
+ 0.2 = 0.75).
Table 1031: LoopyBeliefPropagation Example 2 Output Table

vertex prob_true
Dark urine 1
Fatigue 1
Fever 0
Hepatitis 0.75
Internal bleeding 1
Jaundice 1
Loss of appetite 1
Nausea/vomiting 0
Stupor 0

Modularity

Summary
The Modularity function uses a clustering algorithm to detect communities in networks (graphs). The function needs no prior knowledge or estimation of starting cluster centers and assumes no particular distribution of the input data set. The graph can be directed or undirected, weighted or unweighted.

Background
Many real-world, large data sets can be represented as networks (graphs). Most of these networks have a
community structure—groups of densely interconnected nodes that are only sparsely connected with the
rest of the network. Community detection and identification is very important for understanding network
dynamics, because community properties—node degree, clustering coefficient, betweenness, centrality, and
so on—can be quite different from those of the whole network. For example, a closely connected social
community tends to have a faster information transmission rate than a loosely connected community.
Modularity measures the strength of division of a network into modules (also called communities, clusters,
or groups). Maximizing modularity leads to the identification of communities in a given network. For
detailed information on modularity, see:
• http://en.wikipedia.org/wiki/Community_structure
• http://en.wikipedia.org/wiki/Modularity_(networks)
• M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, 2006.
• V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," J. Stat. Mech., 2008.
Real-world use cases for modularity include:
• Social network: Identifying communities of acquaintances or circles of influences.
• Telecom:

∘ Identifying groups of customers where most of the communication remains within the group.
∘ Identifying different customer tiers based on data usage.
• Banking:
∘ Identifying business communities (supply chain or capital chain), based on corporation transaction
details.
∘ Fraud detection analysis.
• Web: Identifying web pages with similar subjects or that are interlinked and frequently accessed together.
• Airline: Identifying personal and business social networks, based on order transaction data.
• Retail:
∘ Establishing communities based on product demands.
∘ Identifying target customer bases for particular products.
∘ Identifying related products (books on a similar topics, novels and movies of a similar genre, home
improvement topics, and so on).
• Cyber security: Identifying networks affected by injected viruses or malware attacks.
• Health care: Identifying circles of people affected by a contagious disease.

Definitions

Quality Metric
If e_ij represents the number of edges between clusters i and j, C represents the entire set of clusters, and m represents the total number of edges in the graph, then the modularity of the graph is given by Q:

Q = \sum_{C_i \in C} \left[ \frac{e_{ii}}{m} - \left( \frac{2e_{ii} + \sum_{j \neq i} e_{ij}}{2m} \right)^{2} \right]

In other words, modularity Q is the fraction of edges that fall within clusters minus the fraction expected if edges were placed at random while preserving vertex degrees:

Q = \frac{1}{2m} \sum_{u,v} \left[ A_{uv} - \frac{k_u k_v}{2m} \right] \delta(c_u, c_v)

where A_uv is the adjacency matrix, k_u is the degree of vertex u, and δ(c_u, c_v) is 1 when vertices u and v are in the same cluster and 0 otherwise.
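For example (a small worked case, not taken from the function output): consider an unweighted graph of two triangles joined by a single edge, clustered into the two triangles. Then m = 7, e_11 = e_22 = 3, and e_12 = 1, so

Q = 2 × [ 3/7 − ((2×3 + 1)/(2×7))^2 ] = 6/7 − 1/2 = 5/14 ≈ 0.357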

Resolution
Resolution controls the hierarchical-level information about community formation. It represents the level in a dendrogram at which to converge for the number of communities to be detected. Think of different resolution points as different hierarchical levels in a tree of nodes interconnected through an edge table. Two ways to think of resolution are:
• Higher resolution (> 1.0) provides a visualization of the graph nearer the root of the tree, and lower resolution (in the range (0.0, 1.0]) provides a visualization of the graph nearer the leaves of the tree.
• Higher resolution “zooms out,” providing fewer, larger communities; lower resolution “zooms in,” providing more, smaller communities.

In the following figure, the x-axis represents the nodes in a sample graph and y-axis represents the
community resolution level of the hierarchical tree structure of the graph. For example, at resolution level
0.75, the graph has 7 communities, and at resolution level 1.0, the graph has 5 communities.
Figure 24: Resolution Levels

Resolution is useful for:


• Visualizing hierarchical graph structure
• Finding the expected number of communities in a graph
• Finding the community structure in a graph when some nodes are expected to belong to the same
community
For visualizing hierarchical graph structure, the default resolution (1.0) maximizes modularity and is
therefore expected to provide the best community structure. However, you might want to change the
resolution in the second and third preceding use cases or if your graph includes estimation errors (for
example, when edge weights are statistically computed based on behavior similarity among nodes).
For the third use case, an alternative to changing the resolution is to put the nodes that are expected to
belong to the same community in one group in the vertices table and specify that group in the
CommunityAssociation argument.
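For example, a single call can request several resolution levels at once (a sketch that reuses the friends and followers_leaders tables from the Examples section later in this chapter; the resulting communities depend on the data and the seed):

SELECT * FROM Modularity (
ON friends AS "vertices" PARTITION BY friends_name
ON followers_leaders AS "edges" PARTITION BY follower
TargetKey ('leader')
Resolution ('0.5', '1.0', '2.0')
Accumulate ('friends_name', 'location')
) ORDER BY friends_name;

The output then contains one row per vertex for each specified resolution level (plus the default level, as described under the Resolution argument), so you can compare community assignments across levels.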

Usage

Modularity Syntax
Version 1.1

SELECT * FROM Modularity (
ON { table | view | (query) } AS "vertices" PARTITION BY vertex_key
ON { table | view | (query) } AS "edges" PARTITION BY source_vertex_key
[ ON { table | view | query} AS "sources" PARTITION BY vertex_key ]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ Directed ( { 'true' | 'yes' | 't' | 'y' | '1' | 'false' | 'no' | 'f' | 'n'
| '0' } )]
[ EdgeWeight (edge_weight) ]
[ CommunityAssociation (community_id) ]
[ Resolution (resolution [,...]) ]
[ CommunityEdgeTable (community_edge_table) ]
[ Seed ('seed') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
TargetKey Required Specifies the key of the target vertex of an edge. The key
consists of the names of one or more edges table columns.
Directed Optional Legacy argument that determined whether the graph was
directed. The default value was 'true'. The function now
ignores this argument, treating all graphs as undirected.
EdgeWeight Optional Specifies the name of the edges table column that contains edge weights. The weights are positive values. By default, the weight of each edge is 1 (that is, the graph is unweighted). This argument determines how the function treats duplicate edges (that is, edges with the same source and destination, which might have different weights). For a weighted graph, the function treats duplicate edges as a single edge whose weight is the sum of the weights of the duplicate edges. For an unweighted graph, the function uses only one of the duplicate edges.
CommunityAssociation Optional Specifies the name of the column that represents the
community association of the vertices. Use this argument if
you already know some vertex communities.
Resolution Optional Specifies hierarchical-level information for the
communities. For details, refer to Resolution. The default
resolution is 1.0. If you specify a list of resolution values, the
function incrementally finds the communities for each value
and for the default value.
Each resolution must be a distinct DOUBLE PRECISION
value in the range [0.0, 1000000.0]. The value 0.0 puts each
node in its own community of size 1. You can specify a
maximum of 500 resolution values. To get the modularity of
more than 500 resolution points, call the function multiple
times, specifying different values in each call.
CommunityEdgeTable Optional Specifies the name of the table that the function generates to
output the weights of the edges between the communities at
different resolution levels. If a table with
community_edge_table exists, the function overwrites the
existing table. If you omit this argument, the function does
not create this table.
Seed Optional Specifies the seed to use to create a random number during
modularity computation. The seed must be a positive
BIGINT value. The function multiplies seed by the hash
code of vertex_key to generate a unique seed for each vertex.
The default seed is 1.
The seed significantly impacts community formation (and
modularity score), because the function uses seed for these
purposes:
To break ties between different vertices during community
formation
To determine how deeply to analyze the graph
Deeper analysis of the graph can improve community
formation, but can also increase execution time.
Accumulate Optional Specifies the names of the vertices columns to copy to the
community vertex table. By default, the function copies the
vertex_key columns to the output vertex table for each
vertex, changing the column names to id, id_1, id_2, and so
on.

Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a
vertex of the graph and each row of the edges table represents an edge of the graph.
As a legacy, the function has an optional input table, sources, which specifies the vertices that are sources.
For a directed graph, this table was required. The function now ignores this optional table and treats all
graphs as undirected.
Table 1032: Modularity Vertices Table Schema

Column Name Data Type Description

vertex_key_column Any allowed in PARTITION BY clause Column that is, or is part of, the unique vertex key. Every column that is part of this key must appear in the PARTITION BY clause. This column cannot contain NULL values.
accumulate_column Any Column to copy to the community vertex table.

Table 1033: Modularity Edges Table Schema

Column Name Data Type Description

source_vertex_key_column Any allowed in PARTITION BY clause Column that is, or is part of, the key that identifies the source vertex of the edge. This key must be a vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY clause. This column cannot contain NULL values.
target_key_column Same as data type of source_vertex_key_column Column that is, or is part of, the key that identifies the target vertex of the edge. This key must be a vertex key in the vertices table and must be specified by the TargetKey argument. This column can contain NULL values.
edge_weight SMALLINT, INTEGER, or NUMERIC Column that contains the weights of the edges, which must be positive values. This column is required only for a weighted graph. This column can contain NULL values.

Table 1034: Modularity Sources Table Schema

Column Name Data Type Description

source_vertex_key_column Same as data type of the corresponding vertex_key_column in the vertices table Column that is, or is part of, the key that identifies a source vertex. This key must be a vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Output
The function outputs a community vertex table and, optionally, a community edges table.
The community vertex table has a row of modularity results for each specified resolution level.
Table 1035: Community Vertex Table Schema (Default Resolution)

Column Name Data Type Description


accumulate_column Same as in vertices table Column copied from the vertices table. For details, refer to the description of the Accumulate argument in Arguments.
resolution DOUBLE PRECISION Resolution at which the modularity score is computed.
community_id VARCHAR or same as in vertices table Identity of the community to which the vertex belongs.
num_communities INTEGER Number of communities at this resolution. (All rows at the same resolution level have the same value.)
modularity_score DOUBLE PRECISION Modularity score at this resolution. (All rows at the same resolution level have the same value.)

The community edges table contains the edge weights (strength) between different communities at the specified resolutions. The table is created implicitly in the database and is not displayed in the function output. To display the contents of the community edges table, use the command SELECT * FROM community_edge_table (where community_edge_table is the name that you specified in the CommunityEdgeTable argument).
Table 1036: Community Edges Table Schema

Column Name Data Type Description

src_community_id VARCHAR or same as in vertices table Identity of the first community of the edge.
target_community_id VARCHAR or same as in vertices table Identity of the second community of the edge.
resolution DOUBLE PRECISION Resolution at which the modularity score is computed.
weight DOUBLE PRECISION Strength of the link from the first community to the second community.

Examples
• Input
• Example 1: Unweighted Edges
• Example 2: Weighted Edges and Community Edge Table

Input
In the graph in the following figure, the nodes represent persons who are geographically distributed across
the United States and are connected on an online social network, on which they follow each other. The
directed edges start at the follower and end at the leader. For example, Alex follows Bob and Casey.
Figure 25: Graph of Social Network

The graph in the preceding figure is represented by the vertices and edges tables friends and
followers_leaders, respectively. The edges table column intensity represents the fervor with which the
follower follows the leader, on a scale from 1 (lowest) to 10 (highest).
Table 1037: Modularity Examples Vertices Table friends

friends_name location group_id


Alex SanFrancisco SanFrancisco
Bob LosAngeles LosAngeles
Casey LosAngeles LosAngeles
Danny NewYorkCity NewYorkCity
Eve Birmingham Birmingham
Fox Austin Austin
Gohar Miami Miami
Harry Chicago Chicago

Table 1038: Modularity Examples Edges Table followers_leaders

follower leader intensity


Alex Bob 5
Alex Casey 6
Casey Bob 1
Eve Danny 9
Fox Danny 7
Fox Eve 8
Gohar Casey 10
Harry Gohar 4
Harry Danny 3

Example 1: Unweighted Edges


Followers follow leaders with equal intensity (all edges have default weight 1).

SQL-MapReduce Call

SELECT * FROM Modularity (
ON friends AS "vertices" PARTITION BY friends_name
ON followers_leaders AS "edges" PARTITION BY follower
TargetKey ('leader')
Directed ('true')
CommunityAssociation ('group_id')
CommunityEdgeTable ('community_edges')
Accumulate ('friends_name', 'location')
) ORDER BY friends_name;

Output

Table 1039: Modularity Example 1 Community Vertex Table

friends_name location resolution community_id num_communities modularity_score
Alex SanFrancisco 1 LosAngeles 3 0.425926
Bob LosAngeles 1 LosAngeles 3 0.425926
Casey LosAngeles 1 LosAngeles 3 0.425926
Danny NewYorkCity 1 Austin 3 0.425926
Eve Birmingham 1 Austin 3 0.425926
Fox Austin 1 Austin 3 0.425926
Gohar Miami 1 Chicago 3 0.425926
Harry Chicago 1 Chicago 3 0.425926

Example 2: Weighted Edges and Community Edge Table


Followers follow leaders with different intensity.

SQL-MapReduce Call

SELECT * FROM Modularity (
ON friends AS "vertices" PARTITION BY friends_name
ON followers_leaders AS "edges" PARTITION BY follower
TargetKey ('leader')
Directed ('true')
CommunityAssociation ('group_id')
EdgeWeight ('intensity')
CommunityEdgeTable ('community_edges')
Accumulate ('friends_name', 'location')
) ORDER BY friends_name;

Output

Table 1040: Modularity Example 2 Community Vertex Table

friends_name location resolution community_id num_communities modularity_score
Alex SanFrancisco 1 LosAngeles 2 0.442684
Bob LosAngeles 1 LosAngeles 2 0.442684
Casey LosAngeles 1 LosAngeles 2 0.442684
Danny NewYorkCity 1 Birmingham 2 0.442684
Eve Birmingham 1 Birmingham 2 0.442684
Fox Austin 1 Birmingham 2 0.442684
Gohar Miami 1 LosAngeles 2 0.442684
Harry Chicago 1 LosAngeles 2 0.442684

This query returns the following table:

SELECT * FROM community_edges;

Table 1041: Modularity Example 2 Community Edge Table community_edges

src_community_id target_community_id resolution weight


LosAngeles LosAngeles 1 52
LosAngeles Birmingham 1 3
Birmingham LosAngeles 1 3
Birmingham Birmingham 1 48

To verify that the modularity scores for the community_edges table and the community vertex table match:
1. Remove one direction of each bidirectional edge pair from the community_edges table.
(For this example, remove either the LosAngeles-to-Birmingham row or the Birmingham-to-LosAngeles row from the preceding table.)
2. Run the Modularity function using the community_edges table as edges and the unique community ids
as vertices, and specify the weight column with the EdgeWeight argument.
(For this example, the modularity score is 0.426, which matches the modularity score in the community
vertex table.)
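A sketch of step 2 follows. The table names community_nodes and community_edges_dedup are illustrative, not from this guide: community_nodes holds the distinct community ids and community_edges_dedup is the community_edges table with one direction of each bidirectional pair removed.

SELECT * FROM Modularity (
ON community_nodes AS "vertices" PARTITION BY community_id
ON community_edges_dedup AS "edges" PARTITION BY src_community_id
TargetKey ('target_community_id')
EdgeWeight ('weight')
Accumulate ('community_id')
);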

Tips
• To organize a data set into communities or clusters, the data set first must be transformed into a graph. In graph terminology, the objects of interest are represented by vertices and the relations among objects are represented by edges.
• The edges and their weights can be based on heuristics that are typically specific to the use case at hand.
∘ For simple data points, edges can be established by connecting any two objects whose distance is less than some threshold d, by connecting each object to its k closest objects regardless of distance, or by a combination of these two techniques.
∘ For complex data points, the edge weight represents the similarity or strength of the relation between two objects. It can be represented by closeness among vertices, by a similarity metric score such as cosine similarity, or by the inverse of the distance between objects.
∘ To compute the edge weights for a general data set in which each object is represented by a combination of attributes, other statistical techniques can be employed to establish the similarity metric among objects. Techniques such as principal component analysis, data normalization, and feature extraction can be used to produce meaningful graphs from input data.
• In some cases, when you specify multiple resolution points in the Resolution argument, the modularity that the function reports for the resolution points varies slightly from the modularity that the function reports for the same resolution points when you specify them individually. In such cases, specifying the resolution points individually is recommended.
• The function runs faster on graphs with vertices of data type INTEGER or BIGINT, because they use much less memory than graphs with vertices of other data types.
Suppose that you have a vertices table string_nodes with column id of data type VARCHAR and an edges table string_edges with columns src_id, dest_id, and weight of data types VARCHAR, VARCHAR, and INTEGER, respectively. You can generate equivalent INTEGER-based vertices and edges tables with statements such as these:

CREATE TABLE int_nodes (id_string VARCHAR, id_int INTEGER)
DISTRIBUTE BY HASH(id_int);

INSERT INTO int_nodes
SELECT id, row_number() OVER (ORDER BY id)
FROM string_nodes;

CREATE TABLE int_edges
DISTRIBUTE BY HASH(src) AS
SELECT m1.id_int AS src, m2.id_int AS dest, weight
FROM string_edges e, int_nodes m1, int_nodes m2
WHERE e.src_id = m1.id_string AND e.dest_id = m2.id_string;
• If the column community_id that the function generates has data type VARCHAR, and you want community_id to have the data type INTEGER, use either of these statements (where modularity_output is the output table generated by the function):

SELECT id, rank() OVER (ORDER BY community_id) AS comm_int_id
FROM modularity_output;

SELECT id, dense_rank() OVER (ORDER BY community_id) AS comm_int_id
FROM modularity_output;

(dense_rank assigns consecutive integer ids; rank leaves gaps after communities that contain more than one vertex.)

Troubleshooting
• Problem: The function runs slowly for large graphs. The function continues to execute while spending
hours in a graph iteration or terminates unsuccessfully, displaying a failure message on the console.
• Work Arounds:
∘ Consult the logs for error message details and troubleshooting.
∘ The logs can help determine the time taken for each iteration. If you know the number of iterations
that the function takes (usually 25-50), you can estimate the total execution time.
∘ All graph functions operate much faster on INTEGER and BIGINT single-column vertex ids. To
generate an INTEGER vertex based graph, see Tips section above.
∘ Compute the modularity in incremental steps by choosing a subset of nodes using the Sources
argument.
∘ If you already know some of the groupings in the graph, then specify them with the
CommunityAssociation argument.
∘ If you specified the CommunityEdgeTable argument, the reason for the slow execution might be that
the function is writing the community table to the database through JDBC. Run the function without
the CommunityEdgeTable argument first, to obtain the modularity of the resultant graph.
• Problem: The function does not accept the edges table, or terminates with errors on the vertices table or edges table.
• Workarounds:
∘ Consult the logs for error message details and troubleshooting.
∘ Ensure that all source and target vertices in the edges table are listed in the vertices table, and that the columns representing source and target vertices are not NULL.
• Problem: The function completes successfully, but the results do not show a good modularity score or community detection.
• Workarounds:
∘ Change the value of the Seed argument.
∘ Multiply the edge weights by a constant.
∘ Change the values in the Resolution argument.
∘ If you already know some of the groupings in the graph, specify them with the CommunityAssociation argument.

Note:
The modularity score might be poor only because the graph has no inherent community structure.


nTree

Summary
The nTree function is a hierarchical analysis SQL-MapReduce function that can build and traverse tree
structures on all worker machines. The function reads the data only once from the disk and creates the trees
in memory.
The input data must be partitionable, and each partition must fit in memory. Each partition can consist of
multiple trees of any size. The function has different ways of handling cycles.

Background
Two use cases for nTree are equity trading and social networking.

Equity Trading
A large stock buy or sell order with multiple counterparties is typically divided into child orders, which can
be further divided. An order to sell a specific stock can cause a cascade of transactions, each of which
descends from the original order. For example, an order to sell 100 shares of a specific stock can trigger
orders to sell 70 and 30 shares of that stock, and the order to sell 70 shares can trigger orders to sell 50 and 20
shares, and so on.
All stock transactions are stored in a single table. Each row represents one transaction, which is identified by
its order_id and linked to its parent by its parent_id.
A stock broker must be able to identify the root order for each transaction. To do so with SQL requires an
unknown number of self-joins, but with Aster Database SQL-MapReduce, you can partition the data by
stock symbol and then by date and use the nTree function to create a tree from each root order.
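For instance, a sketch of such a call follows. The table stock_orders and its columns symbol, trade_date, order_id, and parent_id are assumed for illustration and are not taken from this guide:

SELECT * FROM nTree (
ON stock_orders PARTITION BY symbol, trade_date
Root_Node (parent_id IS NULL)
Node_ID (order_id)
Parent_ID (parent_id)
Allow_Cycles ('false')
Starts_With ('root')
Mode ('down')
Output ('all')
Result (PROPAGATE(order_id) AS root_order_id)
);

Because PROPAGATE evaluates its expression at the Starts_With node (here, the root order) and propagates the result to every node, each transaction is output with the identifier of its root order.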

Social Networking
Social networks use multiple data sources to identify people and their relationships. For example, a user-user
connection graph shows connections that users have created on the network, a user-person invitation graph
shows a mixture of user-user connections and user-email connections, and address book data provides a
user-email graph.
Suppose that you are the administrator of a social network, and you want to know who has multiple
accounts on the network. You can use the nTree function to generate a tree for every account and then
compare these trees to find those that are very likely to have the same person as the root node.

Usage

nTree Syntax
Version 1.1

SELECT * FROM nTree (
ON input_table
PARTITION BY partition_columns
[ ORDER BY ordering_columns]
Root_Node (boolean_expression)
Node_ID (expression)
Parent_ID (expression)
Allow_Cycles ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})
Starts_With ({ 'root' | 'leaf' | expression })
Mode ({ 'up' | 'down' })
Output ({ 'end' | 'all' })
[ Max_Distance (expression) ]
[ Logging ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})]
Result (aggregate [,...])
);

Arguments
Argument Category Description
Root_Node Required Specifies the BOOLEAN SQL expression that defines the root nodes of
the trees (for example, parent_id IS NULL).
Node_ID Required Specifies the SQL expression whose value uniquely identifies a node in
the input table (for example, order_id).

Note:
A node can appear multiple times in the data set, with different
parents.

Parent_ID Required Specifies the SQL expression whose value identifies the parent node.
Allow_Cycles Required Specifies whether trees can contain cycles. If not, a cycle in the data set
causes the function to throw an exception. The default value is 'false'. For
information about cycles, refer to Cycles in nTree.
Starts_With Required Specifies the node from which to start tree traversal—must be 'root',
'leaf', or a SQL expression that identifies a node.
Mode Required Specifies the direction of tree traversal from the start node—up to the
root node or down to the leaf nodes.
Output Required Specifies when to output a tuple—at every node along the traversal path
('all') or only at the end of the traversal path ('end'). The default value is
'end'.
Max_Distance Optional Specifies the maximum tree depth. The default value is 5.

Logging Optional Specifies whether the function prints log messages. The default value is
'false'.
Result Required Specifies aggregate operations to perform during tree traversal. The
function reports the result of each aggregate operation in the output
table. The syntax of aggregate is:

operation (expression) [ ALIAS alias ]

operation is either PATH, SUM, LEVEL, MAX, MIN, IS_CYCLE, AVG, or PROPAGATE.
expression is a SQL expression. If operation is LEVEL or IS_CYCLE, then
expression must be *.
alias is the name of the output table column that contains the result of
the operation. The default value is the string
operation(expression) (without the quotation marks); for
example, PATH(node_name).

Note:
The function ignores alias if it is the same as an input table column
name.

For the path from the Starts_With node to the last traversed node, the
operations do the following:
• PATH
Outputs the value of expression for each node, separating values
with '->'.
• SUM
Computes the value of expression for each node and outputs the sum
of these values.
• LEVEL
Outputs the number of hops.
• MAX
Computes the value of expression for each node and outputs the
highest of these values.
• MIN
Computes the value of expression for each node and outputs the
lowest of these values.
• IS_CYCLE
Outputs the cycle (if any).
• AVG
Computes the value of expression for each node and outputs the
average of these values.
• PROPAGATE
Evaluates expression with the value of the Starts_With node and
propagates the result to every node.
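For example, a single Result argument can combine several of these operations (a sketch based on the emp_table_dept table shown in Cycles in nTree; total_salary, depth, and max_salary are illustrative aliases):

Result (PATH(name) AS path, SUM(salary) AS total_salary,
        LEVEL(*) AS depth, MAX(salary) AS max_salary)

The output table then contains one column per aggregate: path, total_salary, depth, and max_salary.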

Cycles in nTree
A cycle is unidirectional; that is, the output can be different for Mode('up') and Mode('down'). For example,
consider the following table:
Table 1042: emp_table_dept

department order_column id name salary mgr_id


Marketing 1 7 Don 20000.00 2
Marketing 2 2 Pat 30000.00 3
Marketing 3 3 Donna 60000.00 6
Marketing 4 9 Kim 50000.00 5
Marketing 5 4 Fred 40000.00 4
Marketing 6 5 Mark 70000.00 7
Marketing 7 6 Rob 10000.00 1
Marketing 8 5 Mark 10000.00 1
Marketing 9 1 Dave 10000.00 none
Marketing 10 1 Dave 10000.00 9
Engineering 1 10 Peter 10000.00 12
Engineering 1 10 Peter 10000.00 none
Engineering 2 11 Sarah 20000.00 10
Engineering 3 12 Sophia 30000.00 10
Engineering 4 15 Elizabeth 40000.00 12
Engineering 5 16 Richard 50000.00 12
Engineering 6 18 Carter 60000.00 17
Engineering 7 11 Sarah 20000.00 15
Engineering 8 13 John 70000.00 11
Engineering 9 14 Jessica 80000.00 11
Engineering 10 14 Jessica 80000.00 12
Engineering 11 15 Elizabeth 40000.00 14
Engineering 12 16 Richard 50000.00 15
Engineering 13 17 Gary 90000.00 16
Engineering 14 16 Richard 50000.00 18

This query, which specifies Mode('up'), outputs the following table:

SELECT * FROM nTree (
ON emp_table_dept
PARTITION BY department
ORDER BY order_column
Root_Node (mgr_id = 'none')
Parent_ID (mgr_id)
Node_ID (id)
Starts_With (id = '11')
Mode ('up')
Output ('end')
Result (PATH(name) AS path, PATH(id) AS path2, IS_CYCLE(*))
Allow_Cycles ('true')
) ORDER BY path, path2;

Table 1043: Cycle in nTree with Mode ('up')

id path path2 is_cycle


14 Sarah->Elizabeth->Jessica 11->15->14 11
10 Sarah->Elizabeth->Jessica->Sophia->Peter 11->15->14->12->10 12
10 Sarah->Elizabeth->Sophia->Peter 11->15->12->10 12
12 Sarah->Peter->Sophia 11->10->12 10

If you specify Mode('down') in the preceding query, it outputs the following table.
Table 1044: Cycle in nTree with Mode ('down')

id path path2 is_cycle


15 Sarah->Jessica->Elizabeth 11->14->15 11
18 Sarah->Jessica->Elizabeth->Richard->Gary->Carter 11->14->15->16->17->18 16
13 Sarah->John 11->13

Very Deep Trees
When using the nTree function with trees that have more than 7500 levels, you might need to increase the Java stack size. To increase the Java stack size, add the following line to /etc/profile on all worker nodes (it is unnecessary on the queen node):

export JAVA_TOOL_OPTIONS='-Xss24M'
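Here -Xss24M requests a 24 MB stack for each Java thread; trees substantially deeper than 7500 levels might need a larger value.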

Input
The nTree function has one required input table, which contains the data from which to build trees. The
only restriction on its schema is that the data must be partitionable and each partition must fit in memory.

Output
The nTree function outputs a table of data collected and computed by traversing the trees that it builds.

Table 1045: nTree Output Table Schema

Column Name Data Type Description

id INTEGER or VARCHAR Contains the unique node identifiers from the column specified by the Node_ID argument.
alias VARCHAR Contains the result of an aggregate operation specified by the Result argument. The table has one column for each aggregate operation.

Examples
• Example 1: Find an Employee’s Reports
• Example 2: Find an Employee’s Management Chain
• Example 3: Show Reporting Structure by Department

Example 1: Find an Employee’s Reports

Input
The input table contains the data to build a tree of employees. Each row represents one employee,
identifying both the employee and his or her manager by identifier and name. The employee with no
manager, Don, becomes the root of the tree. The other employees become children of their managers.
Table 1046: nTree Examples 1 and 2 Input Table employee_table

emp_id emp_name mgr_id mgr_name


100 Don NULL NULL
200 Pat 100 Don
300 Donna 100 Don
400 Kim 200 Pat
500 Fred 400 Kim

SQL-MapReduce Call
This call finds the employees who report to employee 100 (either directly or indirectly) by traversing the tree
of employees from employee 100 downward.

SELECT * FROM ntree (
ON employee_table PARTITION BY 1
Root_Node (mgr_id IS NULL)
Node_ID (emp_id)
Parent_ID (mgr_id)
Starts_With (emp_id=100)
Mode ('down')
Output ('end')
Result (PATH(emp_name) AS path)
) ORDER BY 1;

Output
The output table shows that employee 100, Don, has two direct reports, Donna and Pat, and two indirect
reports, Kim (who reports to Pat) and Fred (who reports to Kim).
Table 1047: nTree Example 1 Output Table

id path
300 Don->Donna
500 Don->Pat->Kim->Fred

Example 2: Find an Employee’s Management Chain

Input
The input table is the same as for Example 1.

SQL-MapReduce Call
This call finds the management chain of employee 500 by traversing the tree of employees from employee
500 upward.

SELECT * FROM ntree (
ON employee_table PARTITION BY 1
Root_Node (mgr_id IS NULL)
Node_ID (emp_id)
Parent_ID (mgr_id)
Starts_With (emp_id=500)
Mode ('up')
Output ('end')
Result (PATH(emp_name) AS path)
) ORDER BY 1;

Output
The output table shows that employee 500, Fred, reports to Kim, who reports to Pat, who reports to Don.
Table 1048: nTree Example 2 Output Table

id path
100 Fred->Kim->Pat->Don

Example 3: Show Reporting Structure by Department

Input
The input table contains the data to build trees of departments, and is partitioned by department. In each partition, each row represents one employee in the department, identifying the employee by identifier and name and the employee's manager by identifier. Each employee with no manager becomes the root of the department tree. The other employees in that department become children of their managers.
Table 1049: nTree Example 3 Input Table emp_table_by_dept

department id name mgr_id


Marketing 1 Dave none
Marketing 2 Kim 1
Marketing 3 Donna 1
Marketing 4 Rob 1
Marketing 5 Fran 2
Marketing 6 Mark 2
Marketing 7 Richard 3
Marketing 8 Pat 4
Marketing 9 Don 4
Engineering 10 Peter none
Engineering 11 Sarah 10
Engineering 12 Dale 10
Engineering 13 John 10
Engineering 14 Sophia 15
Engineering 15 Jessy 12
Engineering 16 Gary 12
Engineering 17 Elizabeth 13
Engineering 18 Richard 13

SQL-MapReduce Call

SELECT * FROM ntree (
ON emp_table_by_dept PARTITION BY department
Root_Node (mgr_id = 'none')
Parent_ID (mgr_id)
Node_ID (id)
Starts_With ('root')
Mode ('down')
Output ('all')
Result (PATH(name) AS path, PATH(id) AS path2)
) ORDER BY path, path2;

Output
The output table shows two department trees, whose roots are Dave and Peter.

Table 1050: nTree Example 3 Output Table

id path path2
1 Dave 1
3 Dave->Donna 1->3
7 Dave->Donna->Richard 1->3->7
2 Dave->Kim 1->2
5 Dave->Kim->Fran 1->2->5
6 Dave->Kim->Mark 1->2->6
4 Dave->Rob 1->4
9 Dave->Rob->Don 1->4->9
8 Dave->Rob->Pat 1->4->8
10 Peter 10
12 Peter->Dale 10->12
16 Peter->Dale->Gary 10->12->16
15 Peter->Dale->Jessy 10->12->15
14 Peter->Dale->Jessy->Sophia 10->12->15->14
13 Peter->John 10->13
17 Peter->John->Elizabeth 10->13->17
18 Peter->John->Richard 10->13->18
11 Peter->Sarah 10->11

PageRank

Summary
The PageRank function computes the PageRank values for a directed graph, weighted or unweighted.

Background
PageRank is a link analysis algorithm. It assigns a numerical weight (between 0 and 1) to each node in a
directed graph, for the purpose of measuring the relative importance of the node to the graph. The sum of
the PageRanks of the nodes is 1. PageRank is applicable to any collection of entities with reciprocal
quotations and references.
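As a sketch of the computation (the standard damped PageRank recurrence, extended with edge weights as this function supports; not quoted from this guide), the score of a vertex v satisfies:

PR(v) = (1 − d)/N + d × Σ_{u ∈ In(v)} PR(u) × w(u,v)/W(u)

where d is the damp factor (DampFactor argument), N is the number of vertices, In(v) is the set of vertices with an edge to v, w(u,v) is the weight of edge (u,v), and W(u) is the total weight of the out-edges of u. For an unweighted graph, w(u,v)/W(u) reduces to 1/outdegree(u).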

Usage

PageRank Syntax
Version 1.1

SELECT * FROM PageRank (
ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY edge_src_vertex_key
TargetKey ('target_key_column' [,...])
[ EdgeWeight ('edge_weight') ]
[ DampFactor ('damp_factor') ]
[ MaxIterNum ('max_iterations') ]
[ Threshold ('threshold') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TargetKey Required Specifies the target key columns in the edges table.
EdgeWeight Optional Specifies the column in the edges table that contains the edge weight,
which must be a positive value. By default, all edges have the same weight
(that is, the graph is unweighted).
DampFactor Optional Specifies the value to use in the PageRank formula. The damp_factor
must be a DOUBLE PRECISION value between 0 and 1. The default
value is 0.85.
MaxIterNum Optional Specifies the maximum number of iterations for which the algorithm
runs before the function completes. The max_iterations must be a
positive INTEGER value. The default value is 20.

Note:
max_iterations is the number of SQL-GR™ iterations (that is, the
algorithm iterations shown in AMC minus 3).

Threshold Optional Specifies the convergence criteria value. The threshold must be a
DOUBLE PRECISION value. The default value is 0.0001.
Accumulate Optional Specifies the vertices table columns to copy to the output table.

Note:
When either the maximum number of iterations or the threshold is satisfied, the iteration stops and the
function outputs the result and records the iteration number and convergence value in the log. You can
access the log using AMC.

Input
The PageRank function requires two input tables, vertices and edges.
The vertices table must contain the unique identifier (vertex key attributes) of each vertex. The unique
identifier of a vertex can consist of multiple columns. The vertices table can also have columns that are not
vertex key attributes. If these additional columns are specified by the Accumulate argument, then the
function copies them to the output table; otherwise, it ignores them.
Table 1051: PageRank Vertices Table

Column Name Data Type Description


vertex_key_attribute VARCHAR Contains a vertex key attribute. The vertices table must have at least one such column, and can have more than one. The vertex key attribute columns comprise the vertex_key on which the vertices table must be partitioned.
accumulate_column Any Column to be copied to the output table. The vertices table can have zero or more such columns.

The edges table must contain columns for the source and target vertices of each edge. The source and target vertex columns must have the same data types.
Table 1052: PageRank Edges Table

Column Name Data Type Description


edge_source INTEGER Contains the numbers of the source nodes of the edges. The edges
table must be partitioned on this column.
edge_target INTEGER Contains the numbers of the target nodes of the edges.

Output
PageRank outputs each vertex’s pagerank, a double value. In addition, the function outputs the accumulated
columns from the vertices table.
Table 1053: PageRank Output Table

Column Name Data Type Description


accumulate_column Any Column copied from the vertices table. The output table can have zero or more such columns, which appear in the same order as they appear in the vertices table.
pagerank DOUBLE PRECISION Contains the PageRank of each vertex.

Example
In the graph in the following figure, the nodes represent persons—light blue for males and dark blue for
females. The directed edges represent phone calls from one person to another. Node size represents number
of connections (degree centrality).
Figure 26: Graph of Phone Calls Between Persons

The graph in the preceding figure is represented by the vertices and edges tables callers and calls,
respectively.
Table 1054: PageRank Examples Vertices Table callers

callerid callername
1 John
2 Carla
3 Simon
4 Celine
5 Winston
6 Diana

Table 1055: PageRank Examples Edges Table calls

callerfrom callerto calls


2 4 7
2 6 12
4 6 4
1 2 10
1 3 2
1 4 5
1 6 6
3 6 1

Teradata Aster Analytics Foundation User Guide 1113


Chapter 11: Graph Analysis
pSALSA

callerfrom callerto calls


5 6 10

SQL-MapReduce Call

SELECT * FROM PageRank (
ON callers AS vertices PARTITION BY callerid
ON calls AS edges PARTITION BY callerfrom
TargetKey ('callerto')
EdgeWeight ('calls')
Accumulate ('callerid', 'callername')
) ORDER BY callerid;

Output
Table 1056: PageRank Example Output Table

callerid callername pagerank


1 John 0.0890114925255936
2 Carla 0.122657805325077
3 Simon 0.0957407550854902
4 Celine 0.144421189008967
5 Winston 0.0890114925255936
6 Diana 0.459157265529279
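As a sanity check (a sketch, not part of the original example): the PageRank values form a probability distribution, so they should sum to approximately 1. You can verify this by wrapping the call in an aggregate:

SELECT sum(pagerank) AS total FROM PageRank (
ON callers AS vertices PARTITION BY callerid
ON calls AS edges PARTITION BY callerfrom
TargetKey ('callerto')
EdgeWeight ('calls')
Accumulate ('callerid')
);

For the preceding output, the six values sum to approximately 1.0.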

pSALSA

Summary
The pSALSA (personalized SALSA) function is a SQL-GR™ function that evaluates the similarity of nodes in
a bipartite graph according to their proximity. The typical application of pSALSA is for recommendation.


The pSALSA function assigns numerical scores (between 0 and 1) to the vertices on both sides of the bipartite graph for each seed (hub) vertex. Also, for each seed vertex, the function outputs the K hub and authority vertices with the highest scores.
The score that pSALSA assigns to a node is the probability of visiting that node in a random walk that restarts from a seed node.

Background

SALSA
Stochastic Approach for Link-Structure Analysis (SALSA) is a link analysis algorithm originally developed for evaluating the importance of web pages (similar to the PageRank algorithm).
However, unlike PageRank, in SALSA, a collection of web pages is transformed into a bipartite graph:
G = (V, X, E)
where
{v ∈ V} are hub vertices, located on the left side of the graph,
{x ∈ X} are authority vertices, located on the right side of the graph,
{(v, x) ∈ E} are edges linking the hub vertices and authority vertices.
The following figure shows an example of transforming (a) a collection of web pages into (b) the bipartite graph G.


For each hub vertex v in the graph, SALSA computes a hub score h_v, and each authority vertex x is associated with an authority score a_x. The hub and authority scores are defined by analyzing a random walk on the bipartite graph, wherein the steps from a hub vertex to an authority vertex (from the left side of the bipartite graph to the right side) are called forward steps, and the steps from an authority vertex to a hub vertex are called backward steps.
The hub score h_v is defined as the probability of visiting the hub vertex v, and the authority score a_x is the probability of visiting the authority vertex x, in a random walk on the bipartite graph.
The hub score h_v and authority score a_x can be computed by applying the following update rules until convergence:

a_x = \sum_{v : (v,x) \in E} \frac{h_v}{\deg(v)}, \qquad h_v = \sum_{x : (v,x) \in E} \frac{a_x}{\deg(x)}

For more information about SALSA, refer to the following paper:
Lempel, R.; Moran, S. (April 2001). "SALSA: The Stochastic Approach for Link-Structure Analysis" (PDF). ACM Transactions on Information Systems 19 (2): 131-160. doi:10.1145/382979.383041

pSALSA
You can use the personalized version of the SALSA algorithm to evaluate the similarity between vertices on each side. Unlike the standard SALSA algorithm, which generates a global score for each vertex, the personalized SALSA algorithm computes, for each hub vertex v_i, a set of hub scores h_{v_j}(i) and a set of authority scores a_{x_k}(i).
A higher hub score indicates that vertex v_j shares more connections with (or is closer to) v_i. A higher authority score indicates that vertex x_k is more important in building the closeness relationship with v_i.
The updating rule for the hub and authority scores personalized on v_j modifies the forward steps: with probability ε, the walk jumps back to the seed vertex instead of following an edge.

Personalized SALSA can be used in recommendation applications where the users are the hub vertices, the products to be recommended are the authority vertices, and there is an edge between a user node and a product node if there is a purchase record.
Unlike the standard SALSA algorithm, it is hard to use the power iteration method to get the scores personalized on each of the hub vertices, because the power iteration must run N times, assuming there is a total of N hub vertices.
This function implements the personalized SALSA algorithm described in the following paper:
Bahmani, Bahman; Abdur Chowdhury; and Ashish Goel. "Fast incremental and personalized PageRank." Proceedings of the VLDB Endowment 4.3 (2010): 173-184.
The paper solves the problem through Monte Carlo simulation: from each hub vertex h_i, the algorithm starts a random walk, which lasts L steps (L is an input parameter) and keeps track of the path of the random walk. After the walk stops, the score of a hub vertex h_j relative to h_i, and of an authority vertex a_k relative to h_i, are computed as follows:
score_h(i,j) = 2 × v_h(i,j) / L
score_a(i,k) = 2 × v_a(i,k) / L
where v_h(i,j) and v_a(i,k) are the numbers of times that vertex h_j is visited as a hub vertex and vertex a_k is visited as an authority vertex in the random walk path starting from the seed vertex h_i.
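For example (simple arithmetic, assuming the default walk length): with L = 5000, if the random walk started from h_i visits hub vertex h_j 250 times, then score_h(i,j) = 2 × 250 / 5000 = 0.1.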

Usage

pSALSA Syntax
Version 1.1

SELECT * FROM pSALSA (
ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON sources_table AS sources PARTITION BY source_vertex_key ]
[ ON targets_table AS targets PARTITION BY target_vertex_key ]
SourceKey
({ 'source_vertex_column' | 'source_vertex_column_range' }[,...])
TargetKey
({ 'target_vertex_column' | 'target_vertex_column_range' }[,...])
[ EdgeWeight ('weight_column') ]
MaxHubNum ('max_hubs')
MaxAuthorityNum ('max_authority')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ TeleprotProb ('eta') ]
[ RandomWalkLength ('L') ]
);

Arguments

Argument | Category | Description

SourceKey | Required | Specifies the source key (the names of the edges table columns that identify the source vertex). If you specify sources_table, then the function uses only the vertices in sources_table as sources (which must be a subset of those that this argument specifies). The function uses these names to construct the column names for recommended hub nodes in the output table.

TargetKey | Required | Specifies the target key (the names of the edges table columns that identify the target vertex). If you specify targets_table, then the function uses only the vertices in targets_table as targets (which must be a subset of those that this argument specifies).

EdgeWeight | Optional | Specifies the name of the edges table column that contains edge weights. Each edge_weight is a positive value. By default, each edge_weight is 1; that is, the graph is unweighted.

MaxHubNum | Required | Specifies the maximum number of hub vertices with the highest score output for each hub vertex. The max_hubs value must be an INTEGER.

MaxAuthorityNum | Required | Specifies the maximum number of authority vertices with the highest score output for each hub vertex. The max_authority value must be an INTEGER.

Accumulate | Optional | Specifies the names of the vertices table columns to copy to the output table.

TeleportProb | Optional | Specifies the probability ε, a DOUBLE PRECISION value between 0 and 1, of jumping back to the seed vertex during the random walk. The default value is 0.15.

RandomWalkLength | Optional | Specifies the random walk length L for each hub vertex. L must be an INTEGER. The final hub/authority score is computed as ε*Kv/L, where Kv is the number of times that the random walk visits vertex v. The default value is 5000.

Input
The function has two required input tables, vertices and edges. Each row of the vertices table represents a
vertex of the graph, and each row of the edges table represents an edge of the graph.
The function has two optional input tables, sources and targets, which specify the vertices that are sources
and targets, respectively. For a directed graph, these tables are required. By default, all vertices are sources
and targets; that is, the graph is undirected.
Table 1057: pSALSA Vertices Table Schema

Column Name | Data Type | Description

vertex_key_column | Any allowed in PARTITION BY clause | Column that is, or is part of, the unique vertex key. Every column that is part of this key must appear in the PARTITION BY clause. This column cannot contain NULL values.

Table 1058: pSALSA Edges Table Schema

Column Name | Data Type | Description

source_vertex_key_column | Any allowed in PARTITION BY clause | Column that is, or is part of, the key that identifies the source vertex of the edge. This key must be a vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY clause. This column cannot contain NULL values.

target_vertex_key_column | Same as data type of source_vertex_key_column | Column that is, or is part of, the key that identifies the target vertex of the edge. This key must be a vertex key in the vertices table. This column can contain NULL values.

edge_weight | SMALLINT, INTEGER, or NUMERIC | Column that contains the weights of the edges, which must be positive values. This column is required only for a weighted graph. This column can contain NULL values.

Table 1059: pSALSA Sources Table Schema

Column Name | Data Type | Description

source_vertex_key_column | Same as data type of corresponding vertex_key_column in vertices table | Column that is, or is part of, the key that identifies a source vertex. This key must be a vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Table 1060: pSALSA Targets Table Schema

Column Name | Data Type | Description

target_vertex_key_column | Same as data type of corresponding vertex_key_column in vertices table | Column that is, or is part of, the key that identifies a target vertex. This key must be a vertex key in the vertices table. Every column that is part of this key must appear in the PARTITION BY clause.

Output
Table 1061: pSALSA Output Table Schema

Column Name | Data Type | Description

source_vertex_key_column | Same as in edges table | Column that is, or is part of, the key that identifies the source vertex of the edge.

accumulate_column | Same as in vertices table | Column copied from the vertices table.

hub_source_vertex_key_column | Same as in edges table | Hub node with the top hub score, personalized on the source node.

hub_score | DOUBLE PRECISION | Hub score for the top hub node.

authority_target_key_column | Same as in edges table | Authority node with the highest authority score, personalized on the source node.

authority_score | DOUBLE PRECISION | Authority score for the node with the highest authority score.

Examples
• Example 1: User Similarity in a Social Network Without Edge Weight
• Example 2: User Similarity in a Social Network with Edge Weight
• Example 3: User Similarity and Product Recommendation
• Example 4: Using the Sources and Targets Tables as Inputs
Examples 1 and 2 analyze a social network of users (for example, in an application such as Twitter) as shown
in the following figure and their relationships as followers and leaders based on the 'likes' each user gets from
others.
Figure 27: pSALSA Example Diagram (Network Of Users)

The preceding figure is converted to the following figure, which is a bipartite representation of the same network. The nodes on the left side of the following figure are the source vertices, or hubs, and correspond to the followers column of the users_edges table. The nodes on the right side are the target vertices, or authorities, and correspond to the leaders column of the users_edges table.
Figure 28: pSALSA Example Diagram (Bipartite Representation)

The pSALSA algorithm assigns scores to both sides. In the output table, the hub_followers column shows similar users for each follower, based on the hub_score. Likewise, based on the authority_score, leaders who are close to followers are output in the authority_leaders column. A higher score indicates greater similarity. Typically, the authority_score is interpreted as a user recommendation (in this case, the closer leader) and the hub_score is interpreted as user similarity.

Note:
Because the function is stochastic, the output can vary with each run.

Example 1: User Similarity in a Social Network Without Edge Weight

Input
The input consists of two tables: a user vertex table with 6 users (users_vertex) and a user edges table
(users_edges), showing relationships between users.

Table 1062: pSALSA Example 1 Input Table users_vertex

userid username
1 John
2 Carla
3 Simon
4 Celine
5 Winston
6 Diana

Table 1063: pSALSA Example 1 Input Table users_edges

followers leaders likes


Carla Celine 7
Carla Diana 12
Celine Diana 4
John Carla 10
John Celine 5
John Diana 6
John Simon 2
Simon Diana 1
Winston Diana 10

The 'likes' column is not used as the edge weight in this example.

SQL-MapReduce Call
This example uses the arguments MaxHubNum and MaxAuthorityNum to output a maximum of two hub
and two authority users.

SELECT * FROM pSALSA(


ON users_vertex AS vertices PARTITION BY username
ON users_edges AS edges PARTITION BY followers
SourceKey ('followers')
TargetKey ('leaders')
MaxHubNum ('2')
MaxAuthorityNum ('2')
TeleportProb ('0.15')
RandomWalkLength ('1000')
) ORDER BY followers;

Output
The output shows that the users John and Simon are similar to Carla. John is more similar, as he has a higher
hub_score. The output varies with every run.
Table 1064: pSALSA Example 1 Output Table

followers hub_followers hub_score authority_leaders authority_score


Carla John 0.354
Carla Simon 0.146
Carla Simon 0.0898203592814371
Carla Carla 0.0778443113772455
Celine John 0.314
Celine Carla 0.19
Celine Celine 0.148
Celine Simon 0.084
John Carla 0.190291262135922
John Simon 0.116504854368932
Simon John 0.318
Simon Carla 0.18
Simon Celine 0.148
Simon Simon 0.082
Winston John 0.316
Winston Carla 0.19
Winston Celine 0.146
Winston Simon 0.092

Example 2: User Similarity in a Social Network with Edge Weight

Input
The input consists of two tables: a user vertex table with six users (Input from Example 1) and a user edges
table (Input from Example 1), showing relationships between users.
The 'likes' column is used as the edge weight in this example.

SQL-MapReduce Call
Use the arguments MaxHubNum and MaxAuthorityNum to output a maximum of two hub and two
authority users.

SELECT * FROM pSALSA(


ON users_vertex AS vertices PARTITION BY username
ON users_edges AS edges PARTITION BY followers
SourceKey ('followers')
TargetKey ('leaders')
EdgeWeight ('likes')
MaxHubNum ('2')
MaxAuthorityNum ('2')
TeleportProb ('0.15')
RandomWalkLength ('1000')
) ORDER BY 1, 3 DESC, 5 DESC;

Output
The output shows that John and Winston are similar to Carla. John is more similar, as he has a higher
hub_score. The output varies with every run.
Table 1065: pSALSA Example 2 Output Table

followers hub_followers hub_score authority_leaders authority_score


Carla Carla 0.158
Carla Simon 0.022
Carla John 0.342
Carla Winston 0.124
Celine Celine 0.168
Celine Carla 0.12
Celine John 0.306
Celine Carla 0.262
John Carla 0.229083665338645
John Winston 0.127490039840637
Simon Celine 0.186
Simon Carla 0.14
Simon John 0.308
Simon Carla 0.278
Winston Celine 0.169660678642715
Winston Carla 0.147704590818363
Winston John 0.33
Winston Carla 0.256

Example 3: User Similarity and Product Recommendation

Input
While CFilter is good for stable product lines, pSALSA is very powerful for recommending products to
similar users when the product line has limited pairwise history or changes frequently. Consider clothing
retailers for women's apparel that change their stock based on seasons and trends. The input vertices table
(user_product_nodes) gives the list of women users and the products they buy.
Table 1066: pSALSA Example 3 Input Table user_product_nodes

nodeid nodename
1 Sandra
2 Susan
3 Stacie
4 Stephanie
5 Sally
6 coats
7 sweaters
8 jackets
9 blazers
10 pants
11 pajamas

The edges table (women_apparel_log) reflects the shopping pattern of the users. The goal is to find users
who are similar and thus determine product recommendations.
Table 1067: pSALSA Example 3 Input Table women_apparel_log

username product frequency


Sally blazers 2
Sally coats 10
Sally jackets 8
Sally sweaters 9
Sandra coats 10
Sandra jackets 8
Sandra sweaters 9
Stacie pajamas 9
Stacie pants 9
Stephanie blazers 5
Stephanie jackets 4
Stephanie pajamas 7
Stephanie pants 6
Susan blazers 5
Susan jackets 2
Susan pajamas 5
Susan pants 4
Susan sweaters 4

SQL-MapReduce Call
Output a maximum of two similar users (hub) and recommend two products (authority) for each user. Use
frequency of purchase as a weight factor.

SELECT * FROM pSALSA (


ON user_product_nodes AS vertices PARTITION BY nodename
ON women_apparel_log AS edges PARTITION BY username
SourceKey ('username')
TargetKey ('product')
EdgeWeight ('frequency')
MaxHubNum ('2')
MaxAuthorityNum ('2')
TeleportProb ('0.15')
RandomWalkLength ('500')
) ORDER BY username, hub_score DESC, authority_score DESC;

Output
The output shows possible recommendations, based on hub_score and authority_score. For example, the
seller might recommend pajamas to Sandra and Susan because they and Sally have similar scores. The
hub_score and authority_score values vary with every run.
Table 1068: pSALSA Example 3 Output Table

username hub_username hub_score authority_product authority_score


Sally pajamas 0.126984126984127
Sally pants 0.119047619047619
Sally Sandra 0.239043824701195
Sally Susan 0.159362549800797
Sandra pants 0.111111111111111
Sandra pajamas 0.107142857142857
Sandra Sally 0.270916334661355
Sandra Susan 0.151394422310757
Stacie sweaters 0.107569721115538
Stacie blazers 0.103585657370518
Stacie Stephanie 0.212
Stacie Susan 0.164
Stephanie coats 0.119521912350598
Stephanie Susan 0.2
Stephanie Stacie 0.184
Susan coats 0.146825396825397
Susan Stacie 0.183266932270916
Susan Stephanie 0.175298804780877

Example 4: Using the Sources and Targets Tables as Inputs

Input
This example uses the same input as Example 3 and shows how to limit the vertices and edges used. This can
be useful if the original vertices and edges tables are very large, but only a subset of the information is of
interest. The hubs and authorities are calculated for the nodes specified in the sources and targets tables
(user_source_nodes and product_target_nodes).
Table 1069: pSALSA Example 4 Input Table user_source_nodes

nodeid username
1 Sandra
2 Susan

Table 1070: pSALSA Example 4 Input Table product_target_nodes

nodeid product
8 jackets
11 pajamas

SQL-MapReduce Call

SELECT * FROM pSALSA(


ON user_product_nodes AS vertices PARTITION BY nodename
ON women_apparel_log AS edges PARTITION BY username
ON user_source_nodes AS sources PARTITION BY username
ON product_target_nodes AS targets PARTITION BY product
SourceKey ('username')
TargetKey ('product')
EdgeWeight ('frequency')
MaxHubNum ('2')
MaxAuthorityNum ('2')
TeleportProb ('0.15')
RandomWalkLength ('500')
) ORDER BY username, hub_score DESC, authority_score DESC;

Output
Pajamas are recommended to Sandra, as she has no purchase history for pajamas. No recommendations are made for Susan, as she has bought all of the items in the past (refer to the Input section of Example 3). The *_score results vary with every run.
Table 1071: pSALSA Example 4 Output Table

username hub_username hub_score authority_product authority_score


Sandra pajamas 0.333333333333333
Sandra Sally 0.214007782101167
Sandra Susan 0.140077821011673
Susan Stacie 0.206766917293233
Susan Sally 0.176691729323308

RandomWalkSample

Summary
The RandomWalkSample function takes an input graph (which is typically large) and outputs a sample
graph.

Note:
The sample graph is not deterministic. The function can produce different results when it runs on different clusters.

Background
Graph sampling is the process of inducing a subgraph of the original graph while preserving the original graph's properties. In some cases, if the subgraph is a good representation of the original, substituting it for the original graph in analytic functions greatly decreases execution time without significantly decreasing accuracy.
Random walk sampling is a graph sampling technique that randomly selects a starting vertex and then either explores a neighboring vertex or returns ("flies back") to the starting vertex. If the sampling process reaches a sink vertex (an isolated component or a loop), it randomly selects another vertex and continues until it reaches the desired sample size (the desired number of vertices).
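One way to formalize a single step of the walk (a sketch; the function's exact transition rule may differ): from the current vertex v, with fly-back probability r the walk returns to the starting vertex, and otherwise it moves to a uniformly chosen neighbor of v:

P(next = start) = r
P(next = u) = (1 − r) / deg(v), for each neighbor u of v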

The resulting subset graph includes the edges between sampled vertices and their nearest neighbors (edges
that exist in the original graph), even if the sampling process did not explore those edges. Including those
edges makes the subset graph more representative of the original graph.

Note:
For more information about sampling from large graphs, see: http://cs.stanford.edu/people/jure/pubs/sampling-kdd06.pdf

Usage

RandomWalkSample Syntax
Version 1.2

SELECT * FROM RandomWalkSample (


ON { table | view | (query) } AS "vertices"
PARTITION BY vertex_attributes
ON { table | view | (query) } AS "edges"
PARTITION BY source_vertex_attributes
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
TargetKey
({ 'target_vertex_column' | 'target_vertex_column_range' }[,...])
[ SampleRate ('sample_rate') ]
[ FlyBackRate ('fly_back_rate') ]
[ Seed ('seed') ]
OutputTables ('vertex_table', 'edge_table')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments

Argument | Category | Description

TargetKey | Required | The names of the columns in the edges table that identify the target vertex of an edge. This set of columns must have the same schema as the vertex_attributes and source_vertex_attributes.

SampleRate | Optional | The sampling rate. This value must be in the range (0, 1.0). The default value is 0.15 (15%).

FlyBackRate | Optional | The chance, when visiting a vertex, of flying back to the starting vertex. This value must be in the range (0, 1.0). The default value is 0.15 (15%).

Seed | Optional | The seed used to generate a series of random numbers for sample_rate, flyback_rate, and any random number used internally. Specifying this value guarantees that the function result is repeatable on the same cluster. The default value is 1000.

OutputTables | Required | The names of the output vertex and edge tables.

Accumulate | Optional | The names of columns in the input vertex table (vertices) to copy to the output vertex table (vertex_table).

Input
The RandomWalkSample function has two required input tables:
• vertices, which defines the set of vertices in the input graph
• edges, which defines the set of edges in the input graph
Neither input table can contain NULL values; otherwise, the function displays an error message.

Output
The RandomWalkSample function has three output tables: the summary table (which is usually displayed on
the screen) and the vertex and edges tables (which are saved to the database).
The summary table displays statistics:
Table 1072: RandomWalkSample Summary Table

name count
vertices (number of vertices in original graph)
edges (number of edges in original graph)
sampled vertices (number of sampled vertices)
sampled edges (number of sampled edges)

The vertex and edges tables (whose names are specified in the OutputTables argument) have the same
schemas as the input tables, vertices and edges. However, the output table column names are different from
the input table column names.
If the input table vertices has only one vertex attribute, then the output vertex table has only one column, named id. If vertices has n vertex attributes, then the vertex table has n columns, named id_1, ..., id_n.
If the input table edges has only one source vertex attribute, then the output edge table has two columns, named source and target. If the input table edges has n source vertex attributes, then the output edge table has 2n columns, named source_1, ..., source_n and target_1, ..., target_n.
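For example (an illustrative schema, not the one used in the example below): if the vertices table has two key columns (cid, mid) and the edges table identifies its source vertex by the same two columns, the output vertex table has columns id_1 and id_2, and the output edge table has columns source_1, source_2, target_1, and target_2.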

Example

Input
The input table is a set of 34,546 vertices or nodes.
Table 1073: RandomWalkSample Example Input Table citvertices

id
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1013
1014
1015
1016
1017
1018
1019
1020
...

The input table (citedges) is a set of 421,578 node pairs that specify edges between a source (column from_id) and a target (column to_id). This example uses a 15% sampling rate on this large collection of vertices and edges.
Table 1074: RandomWalkSample Example Input Table citedges

from_id to_id
1001 9212308
1001 9305239
1001 9306240
1001 9312276
1001 9312333
1001 9401294
1001 9403226
1001 9409265
1001 9511336
1001 9601359
1001 9602280
1001 9610553
1001 9701390
1001 9702424
1001 9708239
1001 9709423
1001 9710255
... ....

SQL-GR™ Call
Specifying the Seed value guarantees that the result is repeatable on the same cluster, yet it can differ
between clusters as the sample graph is not deterministic.

SELECT * FROM randomwalksample (


ON citvertices AS "vertices" PARTITION BY id
ON citedges AS "edges" PARTITION BY from_id
TargetKey ('to_id')
FlyBackRate (0.15)
SampleRate (0.15)
Seed (1000)
OutputTables ('rw_vertices_15', 'rw_edges_15')
);

Output
Table 1075: RandomWalkSample Example Output Summary

name count
vertices 34546
edges 421578

sampled vertices 5181
sampled edges 33650

Table 1076: RandomWalkSample Example Output Table (sampled vertices)

id
1005
1013
1042
1046
1051
1073
1075
1086
1092
1113
1121
1123
1162
1166
...

Table 1077: RandomWalkSample Example Output Table (edges)

source target
4043 1005
111294 1005
10244 1005
11359 1005
7113 1005
102181 1005
211064 1005
103074 1005
102277 1005
104307 1013
210311 1013
109216 1042
110097 1046
211046 1046



CHAPTER 12
Neural Networks

• Introduction to Neural Networks
• NeuralNet
• NeuralNetPredict

Introduction to Neural Networks


A neural network model imitates the connections between neurons in the brain. Neural networks can be applied to complex classification or regression problems. A neural network consists of layers of nodes, or neurons, as shown in the following figure. The first layer takes a set of input values and the final layer produces the output values. The network can contain any number of intermediate, or "hidden," layers. The input to each node (in every layer other than the first) is the combined weighted output from each node in the previous layer. Each layer also contains a bias unit, whose output is always 1. Each node then applies an activation function to its inputs to generate its output.

Figure 29: A Neural Network

In the preceding figure, the weights are shown as wijk, where i is the network layer of the origin node, j is the number of the origin node in layer i, and k is the number of the destination node in layer (i+1). Each node in Layer 2 receives as input a weighted sum of the Layer 1 outputs and applies the activation function g(x) to that sum to produce its own output; the output of the network is formed in the same way from the Layer 2 outputs, as sketched below.
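A plausible reconstruction of these relations for the network in the figure (assuming bias units x0 = Φ0 = 1, in the weight notation above):

z_k = Σ_{j=0..2} w_1jk · x_j    (input to hidden node Φk, for k = 1, 2, 3)
Φ_k = g(z_k)    (output of hidden node Φk)
y = g(Σ_{k=0..3} w_2k1 · Φ_k)    (network output)

When the activation function is not applied to the output neurons (the LinearOutput option of the functions below), the outer g(·) in the last equation is omitted.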

A neural network is a supervised learning model. In the Teradata Aster implementation, the weights applied
to each input are trained based on a training dataset using a backpropagation algorithm. The initial weights
can be supplied by the user; if none are supplied, a random set of weights is used. For more information
about the backpropagation algorithm, see https://en.wikipedia.org/wiki/Backpropagation.
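As a sketch of the weight update performed by standard backpropagation (with η corresponding to the LearningRate argument of the NeuralNet function and E to the chosen error function):

w_ijk ← w_ijk − η · ∂E/∂w_ijk

where the partial derivatives are computed by propagating the output error backward through the layers.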

NeuralNet

Summary
The NeuralNet function uses backpropagation to train neural networks. You must provide input data and
argument settings for training the networks; the function creates the fitted weights of the neural network.
The NeuralNet function is optimized for performance on very large datasets (millions of rows).

Usage

NeuralNet Syntax
Version 1.0

SELECT * FROM NeuralNet (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserId ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('table_name')
OutputTable ('output_table')
[ WeightTable ('weight_table') ]
InputColumns ('input_column' [, ...] )
ResponseColumns ('string' [, ...] )
[ GroupByColumns ('group_by_column' [, ...] ) ]
[ HiddenLayers ('integer' [, ...] ) ]
[ Threshold ('double') ]
[ MaxIterNum ('integer') ]
[ LearningRate ('double') ]
[ ActivationFunction ({ 'logistic' | 'tanh' }) ]
[ ErrorFunction ({ 'sse' | 'ce' }) ]
[ Algorithms ('backprop') ]
[ LinearOutput ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ OverwriteOutput ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Seed ('seed') ]
);

Arguments

Argument | Category | Description

InputTable | Required | Specifies the table that contains the input data to be trained.

OutputTable | Required | Specifies the table to which the function outputs the trained network weight data.

WeightTable | Optional | Specifies the table that lists the starting values for the neural network weights. If you do not specify a weight table, the function assigns the initial weights for the neural network randomly.

InputColumns | Required | Specifies the names of the InputTable columns that contain the numerical predictor variables x1, x2, x3, and so on.

ResponseColumns | Required | Specifies the names of the InputTable columns that contain the numerical dependent variables y1, y2, y3, and so on.

GroupByColumns | Optional | Specifies the weight table columns to use to output different neural networks for different groups.

HiddenLayers | Optional | Specifies the number of hidden neurons in each layer, from left to right, as a list of integers. The default value is 1 layer with 1 neuron. For example, HiddenLayers('5','5') produces a 3-layer network with 5 neurons in each hidden layer, while HiddenLayers('3') produces the network shown in Introduction to Neural Networks.

Threshold | Optional | Specifies the threshold for the partial derivatives of the error function as stopping criteria. The default value is 0.01.

MaxIterNum | Optional | Specifies the maximum number of steps for the training of the neural network. The default value is 1.

LearningRate | Optional | Specifies the learning rate used by traditional backpropagation. The default value is 0.001.

ActivationFunction | Optional | Specifies the name of the differentiable function that is applied to the result of the cross-product of the neurons and the weights. Available choices are 'logistic' (the default) and hyperbolic tangent ('tanh').

ErrorFunction | Optional | Specifies the name of the differentiable function that is used for the calculation of the error. Available choices are 'sse' (sum of squared errors, the default) and cross-entropy ('ce').

Algorithms | Optional | Specifies the algorithm type that is used to calculate the neural network. Currently, only 'backprop' is supported.

LinearOutput | Optional | Specifies whether to skip applying the ActivationFunction to the output neurons.

OverwriteOutput | Optional | Specifies whether ('true') or not ('false') to overwrite the output table.

Seed | Optional | Specifies the seed with which to initialize the model, an INTEGER. Given the same seed, cluster configuration, and input table, the function generates the same model. By default, the function initializes the model randomly.

Input
The NeuralNet function has a required input table and an optional weight table.

Table 1078: NeuralNetwork Input Table Schema

Column Name | Data Type | Description

input_column | Any numeric type | Predictor variable. The table has one such column for each column specified by the InputColumns argument.

response_column | Any numeric type | Response variable. The table has one such column for each column specified by the ResponseColumns argument.

Table 1079: NeuralNetwork Weight Table Schema

Column Name | Data Type | Description

group_by_column | Any numeric type | Column that identifies the group for which the function outputs a neural network. The table has one such column for each column specified by the GroupByColumns argument.

weight_n | DOUBLE PRECISION | Initial weight for attribute n. The table has n such columns.

Weights Table
The weight table is optional. If the table is not provided, the initial weights for the neural network are randomly assigned. The schema for the weight table is shown in the following table. Grouping columns (groupcoln) are optional.

Table 1080: NeuralNetwork Weights Table

groupcol1 | ... | weight0 | weight1 | ...
input_column | ... | DOUBLE PRECISION | DOUBLE PRECISION | ...
Output
The NeuralNet function displays a message after it finishes.
Table 1081: NeuralNet Output Table Schema

Property | Value

Reached Threshold | The error threshold reached when training stops.
Threshold Steps | Max iterations reached or convergence reached.

The query below displays the OutputTable containing the fitted weights:

SELECT * FROM output_table;

This output table has the same schema as the weight table and has a row for each neural network, with one column for each weight. As shown in the following table, the weight columns are labeled 0, 1, 2, and so on, and are ordered by layer, origin node, and destination node. For example, the weight in column 0 corresponds to w101 in Introduction to Neural Networks, column 1 corresponds to w102, and so on, up to column 12, which corresponds to w231.

Table 1082: NeuralNet Output Table

Initial layer | From neuron | To neuron | Weight | Corresponding column in <OutputTable>

Input layer | X0 | Φ1 | W101 | 0
Input layer | X0 | Φ2 | W102 | 1
Input layer | X0 | Φ3 | W103 | 2
Input layer | X1 | Φ1 | W111 | 3
... | ... | ... | ... | ...
Input layer | X2 | Φ3 | W123 | 8
Hidden layer | Φ0 | Output | W201 | 9
... | ... | ... | ... | ...
Hidden layer | Φ3 | Output | W231 | 12
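A consequence of this layout (a sketch assuming one bias unit per non-output layer, as in the figure in Introduction to Neural Networks) is that for d inputs, hidden layers of sizes h1, ..., hL, and m outputs, the output table has

(d+1)·h1 + (h1+1)·h2 + ... + (h(L-1)+1)·hL + (hL+1)·m

weight columns. For the network in the figure (d = 2, one hidden layer with h1 = 3, m = 1), this gives (2+1)·3 + (3+1)·1 = 13 columns, matching columns 0 through 12 above.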

Example

Input
The input table, breast_cancer_data, is assessment data from biopsies of 699 breast tumors. Each tumor is
rated on 9 predictor variables, and is classified as either benign (class = 2) or malignant (class = 4). The nine
attributes (clumpthickness, uniformityofcellsize, uniformityofcellshape, marginaladhesion,
singleepithelialcell, barenuclei, blandchromatin, normalnucleoli, mitoses) are scored on a scale of 1 to 10.
Table 1083: NeuralNet Example Input Table breast_cancer_data (Columns 1-5)

samplecode clumpthickness uniformityofcellsize uniformityofcellshape marginaladhesion


61634 5 4 3 1
63375 9 1 2 6
76389 10 4 7 2
95719 6 10 10 10
128059 1 1 1 1
142932 7 6 10 5
144888 8 10 10 8
145447 8 4 4 1
160296 5 8 8 10
167528 4 1 1 1
169356 3 1 1 1
183913 1 2 2 1
... ... ... ... ...

Table 1084: NeuralNet Example Input Table breast_cancer_data (Columns 6-11)

singleepithelialcell barenuclei blandchromatin normalnucleoli mitoses class


2 2 3 1 2
4 10 7 7 2 4
2 8 6 1 1 4
8 10 7 10 7 4
2 5 5 1 1 2
3 10 9 10 2 4
5 10 7 8 1 4
2 9 3 3 1 4
5 10 8 10 3 4
2 1 3 6 1 2
2 3 1 1 2
2 1 1 1 1 2
... ... ... ... ... ...

Split Input into Training and Testing Data Sets


This code divides the 699 data rows into a training data set of 489 rows (70%) and a testing data set of 210 rows (30%). The training data set is input for the NeuralNet function.

DROP TABLE IF EXISTS breast_cancer_train;


CREATE TABLE breast_cancer_train DISTRIBUTE BY hash(samplecode) AS
SELECT * from breast_cancer_data ORDER BY samplecode ASC LIMIT 489;
SELECT * FROM breast_cancer_train ORDER BY samplecode;

Table 1085: NeuralNet Example Train Table breast_cancer_train (Columns 1-5)

samplecode clumpthickness uniformityofcellsize uniformityofcellshape marginaladhesion


61634 5 4 3 1
63375 9 1 2 6
76389 10 4 7 2
95719 6 10 10 10
128059 1 1 1 1
142932 7 6 10 5
144888 8 10 10 8
145447 8 4 4 1
160296 5 8 8 10
167528 4 1 1 1
169356 3 1 1 1
183913 1 2 2 1
... ... ... ... ...

Table 1086: NeuralNet Example Train Table breast_cancer_train (Columns 6-11)

singleepithelialcell barenuclei blandchromatin normalnucleoli mitoses class


2 2 3 1 2
4 10 7 7 2 4
2 8 6 1 1 4
8 10 7 10 7 4
2 5 5 1 1 2
3 10 9 10 2 4
5 10 7 8 1 4
2 9 3 3 1 4
5 10 8 10 3 4
2 1 3 6 1 2
2 3 1 1 2
2 1 1 1 1 2
... ... ... ... ... ...

DROP TABLE IF EXISTS breast_cancer_test;


CREATE TABLE breast_cancer_test DISTRIBUTE BY hash(samplecode) AS
SELECT * FROM breast_cancer_data ORDER BY samplecode DESC LIMIT 210;
SELECT * FROM breast_cancer_test ORDER BY samplecode;

Table 1087: NeuralNet Example Test Table breast_cancer_test (Columns 1-5)

samplecode clumpthickness uniformityofcellsize uniformityofcellshape marginaladhesion


1222936 8 7 8 7
1223003 5 3 3 1
1223282 1 1 1 1
1223306 3 1 1 1
1223426 1 1 1 1
1223543 1 2 1 3
1223793 6 10 7 7
1223967 6 1 3 1
... ... ... ... ...

Table 1088: NeuralNet Example Test Table breast_cancer_test (Columns 6-11)

singleepithelialcell barenuclei blandchromatin normalnucleoli mitoses class


5 5 5 10 2 4
2 1 2 1 1 2
2 1 2 1 1 2
2 4 1 1 1 2
2 1 3 1 1 2
2 1 1 2 1 2
6 4 8 10 2 4
2 1 3 1 1 2
... ... ... ... ... ...

SQL-MapReduce Call
To ensure that the model has good prediction accuracy, the function must converge at a low threshold value.
If the function does not converge, the model prediction is not reliable. The function typically requires several
thousand iterations to converge at a low threshold, a process that may take several hours.

Train a model with MaxIterNum ('6000') and use a Seed ('100') to ensure reproducibility.

SELECT * FROM NeuralNet (


ON (SELECT 1) PARTITION BY 1
InputTable ('breast_cancer_train')
OutputTable ('cancer_output')
InputColumns ('clumpthickness', 'uniformityofcellsize',
'uniformityofcellshape', 'marginaladhesion',
'singleepithelialcell', 'barenuclei', 'blandchromatin',
'normalnucleoli', 'mitoses')
ResponseColumns ('class')
HiddenLayers ('10')
OverwriteOutput ('true')
MaxIterNum ('6000')
Threshold ('1')
Seed ('100')
);

Output
Table 1089: NeuralNet Example Output Table

message
Run 1591 Iterations.
neural net converge
Reach Threshold: 0.9993143624713949.
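Applying the weight-column counting from the Output section above (assuming the same bias-unit convention), this model has (9+1)·10 + (10+1)·1 = 111 weight columns in cancer_output: the 9 predictors plus a bias unit feed 10 hidden nodes, which with their own bias unit feed 1 output node.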

NeuralNetPredict

Summary
The NeuralNetPredict function predicts the output for specific arbitrary covariate inputs, using a particular
trained neural network output weight table.

Usage

NeuralNetPredict Syntax
Version 1.0

SELECT * FROM NeuralNetPredict (


ON { table | view | query } AS testdata PARTITION BY { key | ANY }
ON { table | view | query } AS modeldata PARTITION BY { key | DIMENSION }
InputColumns ('testdata_column' [,...])
[ GroupByColumns ('testdata_column' [,...]) ]
HiddenLayers ('integer' [,...])
[ ActivationFunction ({ 'logistic' | 'tanh' }) ]
[ LinearOutput ({'TRUE'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ NumOutputs ('integer') ]
[ Accumulate ('testdata_column' [,...]) ]
);

Arguments

Argument | Category | Description

InputColumns | Required | Specifies the names of the columns of testdata that contain the numerical input predictor variables x1, x2, x3, and so on.

GroupByColumns | Optional | Specifies the columns that are used to output different neural networks for different groups (these columns must be in the weight table, if specified).

HiddenLayers | Required | Specifies the number of hidden neurons in each layer, from left to right, as a list of integers. The default value is 1 layer with 1 neuron.

ActivationFunction | Optional | Specifies the name of the differentiable function that is applied to the result of the cross-product of the neurons and the weights. Available choices are 'logistic' (the default) and hyperbolic tangent ('tanh').

LinearOutput | Optional | Specifies whether to skip applying the ActivationFunction to the output neurons. The default value is 'true'.

NumOutputs | Optional | Specifies the number of outputs from the neural net. The default value is 1. The maximum value is 1000.

Accumulate | Optional | Specifies the names of the testdata columns that the function copies to the output table.

Input

Table 1090: NeuralNetPredict Input Table Schema

Name | Option | Description

modeldata | Required | Contains a neural network weights model output by NeuralNet.
testdata | Required | Contains the data whose dependent values are to be predicted.

Output
The NeuralNetPredict function output schema is shown in the following table.

Table 1091: NeuralNetPredict Output Table Schema

Column Name | Data Type | Description

input_column | Any numeric type | Predictors from the input table specified in the InputColumns argument.

predictOut_n | DOUBLE PRECISION | Predicted output value for output node n.

accumulate_column | Same as testdata column | Columns copied from the testdata table.

Example

Input
This example uses the model created by the NeuralNet function (cancer_output) and the test dataset as input
to the NeuralNetPredict function.

SQL-MapReduce Call
The NeuralNetPredict call must use the same argument values that were used in the NeuralNet call that
created the model for the following arguments: InputColumns, HiddenLayers, WeightTable, NumOutputs,
and GroupByColumns. In this example, only InputColumns and HiddenLayers are applicable.

DROP TABLE IF EXISTS nn_bc_predict;

CREATE TABLE nn_bc_predict DISTRIBUTE BY hash(samplecode) AS


SELECT * FROM NeuralNetPredict (
ON breast_cancer_test AS testdata PARTITION BY ANY
ON cancer_output AS modeldata DIMENSION
InputColumns ('clumpthickness', 'uniformityofcellsize',
'uniformityofcellshape', 'marginaladhesion',
'singleepithelialcell', 'barenuclei','blandchromatin',
'normalnucleoli', 'mitoses')
HiddenLayers ('10')
Accumulate ('samplecode', 'class')
) ORDER BY samplecode;

Output
Table 1092: NeuralNetPredict Example Output Table (Columns 1-4)

clumpthickness uniformityofcellsize uniformityofcellshape marginaladhesion


8 7 8 7
5 3 3 1
1 1 1 1
3 1 1 1
1 1 1 1
1 2 1 3
6 10 7 7
6 1 3 1
1 1 1 2
6 1 1 1
6 2 3 1
10 6 4 3
4 1 1 3
7 5 6 3
3 1 1 3
10 5 5 6
1 1 1 1
10 5 7 4
8 9 9 5
1 1 1 1
... ... ... ...

Table 1093: NeuralNetPredict Example Output Table (Columns 5-8)

singleepithelialcell barenuclei blandchromatin normalnucleoli


5 5 5 10
2 1 2 1
2 1 2 1
2 4 1 1
2 1 3 1
2 1 1 2
6 4 8 10
2 1 3 1
2 1 3 1
2 1 3 1
2 1 1 1
10 10 9 10
1 5 2 1
3 8 7 4
2 1 1 1
3 10 7 9
2 1 2 1
4 10 8 9
3 5 7 7
1 1 3 1
... ... ... ...

Table 1094: NeuralNetPredict Example Output Table (Columns 9-12)

mitoses predictOut_0 samplecode class


2 3.96268312639419 1222936 4
1 2.31746186983614 1223003 2
1 1.95114934587798 1223282 2
1 2.05592368996713 1223306 2
1 1.96487311523041 1223426 2
1 1.99881759348247 1223543 2
2 3.96052773888495 1223793 4
1 2.25816579533148 1223967 2
1 1.97102865872411 1224329 2
1 2.10973172443552 1224565 2
1 2.24875143249325 1225382 2
1 3.96328503886686 1225799 4
1 2.31584924028088 1226012 4
1 3.93721997447082 1226612 4
1 1.85365395051263 1227081 2
2 3.96425933122442 1227210 4
1 1.95114934587798 1227244 2
1 3.96432038514118 1227481 4
1 3.96129081279718 1228152 4
1 1.96119498074045 1228311 2
... ... ... ...

Predictions are in the “predictOut_0” column.

Prediction Accuracy
This query calculates prediction accuracy based on root mean squared error (rmse):

SELECT SQRT((SUM((class - CAST("predictOut_0" AS NUMERIC)) ^ 2) / 210)) AS rmse
FROM nn_bc_predict;

Table 1095: NeuralNetPredict Example Prediction Accuracy

rmse
0.27583997522107218006
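Because class takes only the values 2 (benign) and 4 (malignant), a simple classification accuracy can also be computed by thresholding the prediction at the midpoint. The following query is a sketch; the threshold of 3 is an illustrative choice, not part of the function output:

-- Count a prediction as correct when it falls on the same side of 3 as the true class
SELECT SUM(CASE WHEN (CASE WHEN "predictOut_0" < 3 THEN 2 ELSE 4 END) = class
                THEN 1 ELSE 0 END) / (COUNT(*) * 1.0) AS accuracy
FROM nn_bc_predict;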



CHAPTER 13
Data Transformation

• Antiselect
• Apache_Log_Parser
• Categorize
• Fellegi-Sunter Functions
• Geometry Functions
• IdentityMatch
• IPGeo
• JSONParser
• Multi_Case
• MurmurHash
• OutlierFilter
• Pack
• Pivot
• PSTParserAFS
• Scale Functions
• StringSimilarity
• Unpack
• Unpivot
• URIPack
• URIUnpack
• XMLParser
• XMLRelation

Antiselect

Summary
Antiselect returns all columns except those specified in the Exclude argument.

Usage

Antiselect Syntax
Version 1.0

SELECT * FROM Antiselect (


ON input_table
Exclude ({ 'column_name' | 'column_name_range' }[,...])
);

Arguments
Argument Category Description
Exclude Required Specifies the names of the columns not to return.

Input
The input table can have any schema.

Output
The output table has all input table columns except those specified by the Exclude argument.

Example

Input
The input table, antiselect_input, is a sample set of sales data containing 13 columns.
Table 1096: Antiselect Example Input Table antiselect_input (Columns 1-8)

rowid | orderid | orderdate | priority | quantity | sales | discount | shipmode

1 | 3 | 2010-10-13 00:00:00 | Low | 6 | 261.54 | 0.04 | Regular Air
49 | 293 | 2012-10-01 00:00:00 | High | 49 | 10123 | 0.07 | Delivery Truck
50 | 293 | 2012-10-01 00:00:00 | High | 27 | 244.57 | 0.01 | Regular Air
80 | 483 | 2011-07-10 00:00:00 | High | 30 | 4965.76 | 0.08 | Regular Air
85 | 515 | 2010-08-28 00:00:00 | Not Specified | 19 | 394.27 | 0.08 | Regular Air
86 | 515 | 2010-08-28 00:00:00 | Not Specified | 21 | 146.69 | 0.05 | Regular Air
97 | 613 | 2011-06-17 00:00:00 | High | 12 | 93.54 | 0.03 | Regular Air

Table 1097: Antiselect Example Input Table antiselect_input (Columns 9-13)

custname | province | region | custsegment | prodcat

Muhammed MacIntyre | Nunavut | Nunavut | Small Business | Office Supplies
Barry French | Nunavut | Nunavut | Consumer | Office Supplies
Barry French | Nunavut | Nunavut | Consumer | Office Supplies
Clay Rozendal | Nunavut | Nunavut | Corporate | Technology
Carlos Soltero | Nunavut | Nunavut | Consumer | Office Supplies
Carlos Soltero | Nunavut | Nunavut | Consumer | Furniture
Carl Jackson | Nunavut | Nunavut | Corporate | Office Supplies

SQL-MapReduce Call
This query excludes the columns rowid, orderdate, discount, province and custsegment:

SELECT * FROM antiselect (


ON antiselect_input
EXCLUDE ('rowid', 'orderdate', 'discount', 'province', 'custsegment')
) ORDER BY 1, 4;

Output
The output table excludes the specified columns and outputs the remaining eight columns.
Table 1098: Antiselect Example Output Table

orderid | priority | quantity | sales | shipmode | custname | region | prodcat

3 | Low | 6 | 261.54 | Regular Air | Muhammed MacIntyre | Nunavut | Office Supplies
293 | High | 27 | 244.57 | Regular Air | Barry French | Nunavut | Office Supplies
293 | High | 49 | 10123 | Delivery Truck | Barry French | Nunavut | Office Supplies
483 | High | 30 | 4965.76 | Regular Air | Clay Rozendal | Nunavut | Technology
515 | Not Specified | 21 | 146.69 | Regular Air | Carlos Soltero | Nunavut | Furniture
515 | Not Specified | 19 | 394.27 | Regular Air | Carlos Soltero | Nunavut | Office Supplies
613 | High | 12 | 93.54 | Regular Air | Carl Jackson | Nunavut | Office Supplies

Apache_Log_Parser

Summary
The Apache_Log_Parser function parses Apache log file content from a given server access log and outputs
specified columns of information, which can include search engines and search terms.

Background
The function parses Apache log files with the following constraints:
• Log files are loaded into a table
• One line of the Apache log file is loaded to one row in the table
• If you specify a custom log format, then the input must conform to that format.
• If you do not specify a custom format string with the LogFormat argument, the function parses the logs with the NCSA extended/combined format, which is defined as:

"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

Apache Log Configuration


The Apache web server lets users specify the content and reporting conditions of the server access log. For
example, the user can configure the server to report the remote host name and time of each request, and to
report the requested path only when the page fails to open.
The Apache server configuration file contains the desired log format as the argument of either the
LogFormat or CustomLog directive. LogFormat defines only a custom format. CustomLog defines both a
custom format and a log file that has that format. The argument for each directive is a format string. In the
format string, the symbol % precedes each request characteristic. For example, %v represents the canonical
ServerName of the server serving the request. At logging time, v is replaced by the appropriate ServerName.
The following figure shows some lines of an Apache server configuration file. For detailed information about
custom log formats, see http://httpd.apache.org/docs/current/mod/mod_log_config.html.
Figure 30: Apache Server Configuration File Sample Lines

#
# The following directives define some format nicknames for use with
# a CustomLog directive (see below).
# If you are behind a reverse proxy, you might want to change %h into %{X-Forwarded-For}i
#
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %O" common
LogFormat "%{Referer}i -> %U" referrer
LogFormat "%{User-agent}i" agent

Apache Log Parser Item-Name Mapping


The following table shows the mapping between request characteristics in the format string (log items) and
table column names, and gives output examples.
Table 1099: Apache Log Parser Item-Name Mapping

Log Item Column Name Output Example


%h remote_host 153.65.52.112
%a remote_IP 153.65.52.112
%A local_IP 153.65.52.112
%t request_time [22/Jun/2012:17:27:02 -0700]
%b bytes_sent_CLF 512
%B bytes_sent 455
%O bytes_sent_including_header 512
%I bytes_received_including_header 512
%p canonical_server_port 80
%{canonical}p canonical_server_port 80
%{local}p actual_server_port 80
%{remote}p actual_client_port 7777
%P process_ID 8311
%k live_connections 2
%D request_duration_microseconds 312
%T request_duration_seconds 0
%U requested_URL /index.html
%l remote_log_name
%u remote_user
%f requested_file
%{VARNAME}i request:VARNAME
%{Referer}i Referer
%{VARNAME}o reply:VARNAME
%{VARNAME}n note:VARNAME
%{VARNAME}e env:VARNAME
%{VARNAME}C cookie:VARNAME
%V server_name
%v canonical_server_name
%L log_ID
%H protocol HTTP/1.1
%m method GET
%q query
%X connection_status X = Connection aborted before
the response completed.
+ = Connection may be kept alive
after the response is sent.
- = Connection closes after the
response is sent.
%r request_line
%>s final_status 404
%<s original_request_status
%s original_request_status
%R handler

Usage

Apache_Log_Parser Syntax
version 2.2

SELECT * FROM Apache_Log_Parser (


ON { table_name | view_name | (query) }
TargetColumn ('log_column')
[ LogFormat ('format_string') ]
[ ExcludeFiles ('.file_suffix' [,...]) ]
[ SearchInfoFlag
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

Arguments

Argument | Category | Description

TargetColumn | Required | Specifies the name of the input column that contains the information to be parsed.

LogFormat | Optional | String that specifies the format used to generate the server access logs, which you can find in the Apache server configuration file. For each log item in the format string, the function builds an output table column from the input table columns. The default log format is the NCSA extended/combined format, which is:
"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

ExcludeFiles | Optional | Specifies the types of files to exclude, by suffix. The default value is '.png', '.xml', '.js'. If an input row contains a file of an excluded type, then the function does not generate an output row for that input row.

SearchInfoFlag | Optional | Specifies whether to return search information. The default value is 'false'. If you specify 'true', the function extracts the search engine and search terms (if they exist) into two output columns. The supported search engines are Google, Bing, and Yahoo. The function provides more complete parsing capabilities for Google.

Input
The input table must have a column that contains the information to be parsed. The table can have
additional columns, but the function ignores them.
Table 1100: Apache_Log_Parser Input Table Schema

Column | Data Type | Description

log_column | VARCHAR | Contains the information to be parsed. If you specify a custom log format, then the contents of this column must conform to that format. The input table has one row for each line of the Apache log file.

Output
The output table has one row for each input row that the function parses, except those that contain
requested files of excluded types. Its schema depends on the log format and function arguments. The
following table is a possible output table schema.
Table 1101: Apache_Log_Parser Output Table Schema

Column | Data Type | Description

remote_host | VARCHAR | Remote host that made the HTTP request.
remote_log_name | VARCHAR | Log name on remote host.
remote_user | VARCHAR | User logged into remote host.
request_time | TIMESTAMP | Timestamp when the HTTP request was made.
requested_page | VARCHAR | Requested landing page.
final_status | INTEGER | Status for request that was internally redirected. Applies to the final request, not the original request.
bytes_sent_including_header | INTEGER | Response size in bytes in Custom Log Format (CLF).
referrer | VARCHAR | URL from which the request was initiated.
request:User-Agent | VARCHAR | Information about the system from which the request was initiated.
The possible output column names are listed in Apache Log Parser Item-Name Mapping and the following table. The following table describes the output table columns that appear only if the SearchInfoFlag argument has the value 'true' and the log file contains referrer information.

Table 1102: Apache_Log_Parser Output Table Columns Extracted When SearchInfoFlag = 'true'

Column | Data Type | Description

search_engine | VARCHAR | Name of the search engine referrer (Google, Bing, or Yahoo) if the log file contains that information; otherwise blank.
search_terms | VARCHAR | Search terms that led to landing on the page.

Examples
The examples in this section show the results of two different values of the LogFormat argument.

Input
The input table for both examples, apache_logs, has a sample of five records of apache web user logs.
Table 1103: Apache_Log_Parser Example Input Table apache_logs

id | logdata

1 | 69.236.77.51 - Frank [26/Mar/2011:09:17:31 -0700] "GET /about/careers.php HTTP/1.1" 200 5976 "http://www.bing.com/search?q=Aster+data&src=ie9tr" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

2 | 168.187.7.114 - Lewis [27/Mar/2011:00:16:49 -0700] "GET / HTTP/1.0" 200 7203 "http://search.yahoo.com/search;_ylt=AtMGk4Fg.FlhWyX_ro.u0VybvZx4?p=ASTER&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-383-1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2;.NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0;InfoPath.2)"

3 | 75.36.209.106 - Patrick [20/May/2008:15:43:57 -0400] "GET / HTTP/1.1" 200 15251 "http://www.google.com/search?hl=en&q=%22Aster+Data+Systems%22" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.2.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; MS-RTC LM 8)"

4 | 159.41.1.23 - - [06/Jul/2010:07:19:45 -0400] "GET /public/js/common.js HTTP/1.1" 200 16711 "http://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=aster%20data&rsv_pq=d31bd31c000dd71c&rsv_t=982dONZ4XBYXizw4wA%2BQD411WcEyn1YoJu4QSpNTQwwoTE7hgPFD9OBTObk&rsv_enter=1&rsv_sug3=11&rsv_sug1=1&rsv_sug2=0&rsv_sug7=100&inputT=3572&rsv_sug4=6596" "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"

5 | 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Example 1: Default Extended/Combined Log Format

The default log format, known as the combined or extended log format, is given in the LogFormat argument as '%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"'. Refer to Apache Log Parser Item-Name Mapping for information about item-name mapping.

SQL-MapReduce Call

SELECT * FROM apache_log_parser (
ON apache_logs
TargetColumn ('logdata')
LogFormat ('%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"')
SearchInfoFlag ('true')
) ORDER BY remote_user;

Output
The output has 11 columns, with one row for each input row, except that there is no output corresponding to input id=4, because .js pages are omitted by default. (This behavior is controlled by the ExcludeFiles argument; refer to Arguments.) Also, the first row in the output, which corresponds to input id=5, is empty in the search_engine and search_terms columns, because the referrer for that input row, "http://www.example.com/start.html", is not a search engine. The only supported search engines are Google, Bing, and Yahoo.
Table 1104: Apache_Log_Parser Example 1 Output Table (Columns 1-5)

remote_host remote_log_name remote_user request_time requested_page


127.0.0.1 - - 2000-10-10 13:55:36 /apache_pb.gif
69.236.77.51 - Frank 2011-03-26 09:17:31 /about/careers.php
168.187.7.114 - Lewis 2011-03-27 00:16:49 /
75.36.209.106 - Patrick 2008-05-20 15:43:57 /

Table 1105: Apache_Log_Parser Example 1 Output Table (Columns 6-8)

final_status | bytes_sent_including_header | referrer

200 | 2326 | http://www.example.com/start.html
200 | 5976 | http://www.bing.com/search?q=Aster+data&src=ie9tr
200 | 7203 | http://search.yahoo.com/search;_ylt=AtMGk4Fg.FlhWyX_ro.u0VybvZx4?p=ASTER&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-383-1
200 | 15251 | http://www.google.com/search?hl=en&q=%22Aster+Data+Systems%22

Table 1106: Apache_Log_Parser Example 1 Output Table (Columns 9-11)

search_engine | search_terms | request:User-Agent

 | | Mozilla/4.08 [en] (Win98; I ;Nav)
bing | Aster data | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
yahoo | ASTER | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2;.NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0;InfoPath.2)
google | "Aster Data Systems" | Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.2.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; MS-RTC LM 8)

Example 2: Common Log Format

The common log format is '%h %l %u %t \"%r\" %>s %O'. This format excludes some columns that are present in the default extended/combined format.

SQL-MapReduce Call

SELECT * FROM apache_log_parser (


ON apache_logs
TargetColumn ('logdata')
LogFormat ('%h %l %u %t \"%r\" %>s %O')
SearchInfoFlag ('true')
) ORDER BY remote_user;

Output
The referrer, user agent, search engine, and search terms columns are absent from the common log format output.

Table 1107: Apache_Log_Parser Example 2 Output Table (Columns 1-4)

remote_host remote_log_name remote_user request_time


127.0.0.1 - - 2000-10-10 13:55:36
69.236.77.51 - Frank 2011-03-26 09:17:31
168.187.7.114 - Lewis 2011-03-27 00:16:49
75.36.209.106 - Patrick 2008-05-20 15:43:57

Table 1108: Apache_Log_Parser Example 2 Output Table (Columns 5-7)

requested_page final_status bytes_sent_including_header


/apache_pb.gif 200 2326
/about/careers.php 200 5976
/ 200 7203
/ 200 15251

Categorize

Summary
The Categorize function converts specified columns from any numeric type to VARCHAR. This operation is
necessary when columns contain numbers that represent codes or categories (for example, billing codes or
zip codes) and you want to input them to another function as categorical data.

Note:
The Categorize function is similar to the R function factor.

Usage

Categorize Syntax
Version 1.0

SELECT * FROM Categorize (


ON { table | view | query } PARTITION BY ANY
Columns ( { column | column_range }[,...] )
);

Arguments
Argument Category Description
Columns Required Specifies the names of the input columns that contain numeric values to
be converted to VARCHAR.
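
The Columns argument accepts column names, bracketed index ranges, or a mix of both. In the example
below, Columns ('[4:9]') selects the 0-based column indexes 4 through 9 (recroom through homestyle). The
following call is a sketch of the mixed form, using the columns of that example:

SELECT * FROM Categorize (
ON categorize_input PARTITION BY ANY
Columns ('driveway', '[4:8]')  -- the named column driveway plus columns 4-8 (recroom through prefarea)
);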

Input
Table 1109: Categorize Input Table Schema

Column Name Data Type Description


input_column BYTEINT, SMALLINT, INTEGER, BIGINT, NUMERIC,
DOUBLE PRECISION, or NUMBER
Column specified by the Columns argument, which contains values to
be converted to VARCHAR.
other_column Any Column to copy to the output table.

Output
The output table has the same column names, in the same order, as the input table.
Table 1110: Categorize Output Table Schema

Column Name Data Type Description


input_column VARCHAR Column converted from a numeric data type.
other_column Same as in input table Column copied from the input table.

Example

Input
The input table contains information about houses. All columns have numeric data types. The first three
columns represent numeric values, but the other columns represent codes or categories. The following two
tables are the input table itself and its schema description, respectively.
Table 1111: Categorize Example Input Table categorize_input

sn price lotsize driveway recroom fullbase gashw airco prefarea homestyle


1 27000 1700 1 0 0 0 0 0 1
2 37900 3185 1 0 0 0 1 0 1

3 42000 4960 1 0 0 0 0 0 1
4 67000 5170 1 0 0 0 1 0 2
5 68000 9166 1 0 1 0 1 0 2
6 132000 3500 1 0 0 1 0 0 3
7 43000 5076 0 0 0 0 0 0 1
8 93000 3760 1 0 0 1 0 0 2
9 44500 3850 1 0 0 0 0 0 1
10 43000 3750 1 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ...

To check the input data variable types, use this command: \d categorize_input
The command returns:

Table "public"."categorize_input"
Column | Type | Modifiers
-------------------------------
sn | integer |
price | real |
lotsize | real |
driveway | integer |
recroom | integer |
fullbase | integer |
gashw | integer |
airco | integer |
prefarea | integer |
homestyle | integer |

Table Type:
fact

Distribution Key:
sn

Compression Level:
none

Storage Type:
row

Persistence:
permanent

SQL-MapReduce Call

CREATE TABLE categorize_output DISTRIBUTE BY HASH(sn) AS


SELECT * FROM Categorize (

ON categorize_input PARTITION BY ANY
Columns ('[4:9]')
);

Output
This query returns the output table:

SELECT * FROM categorize_output ORDER BY 1;

The output table looks like the input table, but its schema is different. To check the output data variable
types, use this command: \d categorize_output
The command returns:

Table "public"."categorize_output"
Column | Type | Modifiers
-----------------------------------------
sn | integer |
price | real |
lotsize | real |
driveway | integer |
recroom | character varying |
fullbase | character varying |
gashw | character varying |
airco | character varying |
prefarea | character varying |
homestyle | character varying |

Table Type:
fact

Distribution Key:
sn

Compression Level:
none

Storage Type:
row

Persistence:
permanent

Fellegi-Sunter Functions

Summary
Aster Analytics Foundation provides two Fellegi-Sunter functions:
• FellegiSunterTrainer, which estimates the parameters of the Fellegi-Sunter model

• FellegiSunterPredict, which predicts whether two objects are duplicates
FellegiSunterTrainer output is input to FellegiSunterPredict.

Background
The Fellegi-Sunter model is a tool in the field of record linkage (RL), the task of finding records in a data set
that refer to the same entity across different data sources (for example, data files, websites, and databases).
The data sources might or might not have a common identifier (such as a database key, URI, or National
identification number). A data set that has undergone RL-oriented reconciliation is cross-linked.
RL was introduced by Halbert L. Dunn in 1946. In 1959, Howard Borden Newcombe laid the probabilistic
foundations of modern record linkage theory, which were formalized by Ivan Fellegi and Alan Sunter.
Fellegi and Sunter proved that the probabilistic decision rule that they described was optimal when the
comparison attributes were conditionally independent. Their article, "A Theory For Record Linkage,"
published in the Journal of the American Statistical Association in December 1969, remains the
mathematical foundation for many record linkage applications.
Since the late 1990s, various machine learning techniques have been developed that can estimate the
conditional probabilities required by the Fellegi-Sunter model. Although several researchers have reported
that the conditional independence assumption of the Fellegi-Sunter model is often violated in practice,
published efforts to explicitly model the conditional dependencies among the comparison attributes have
not improved record-linkage quality.

FellegiSunterTrainer

Summary
The FellegiSunterTrainer function estimates the parameters of the Fellegi-Sunter model.
The function can use either supervised or unsupervised learning. For supervised learning, specify the
TagColumn argument. For unsupervised learning, omit the TagColumn argument and specify the
arguments InitialM, InitialU, InitialP, and MaxIteration.

Usage

FellegiSunterTrainer Syntax
Version 1.1

SELECT * FROM FellegiSunterTrainer (


ON (SELECT 1) PARTITION BY 1
InputTable ('input_table')
ComparisonFields ('field_name[:threshold_value]' [,...])
[ TagColumn ('tag_column') ]
[ InitialM ('initial_value_of_M') ]
[ InitialU ('initial_value_of_U') ]
[ InitialP ('initial_value_of_P') ]

[ MaxIteration ('max_iteration') ]
[ Eta ('eta_value') ]
[ Lambda ('lambda_value') ]
[ Mu ('mu_value') ]
);

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the object pairs and their
field-pair similarity values.
ComparisonFields Required Specifies the columns of input_table that contain the field-pair
similarity values to use in the training process. If the value in the column
is less than threshold_value, then the field pair does not agree; otherwise,
the field pair agrees. The default value of threshold_value is 1.
TagColumn Optional If you specify this argument, then the function uses supervised learning; if
you omit it, then the function uses unsupervised learning.
This argument specifies the name of the column that indicates whether
two objects match. The column must contain only the values 'M'
(matched) and 'U' (unmatched).
InitialM Optional For unsupervised learning, this argument specifies the initial value of m,
which is the probability that a field agrees, given that the object-pair
belongs to the same object. The default value is 0.9.
For supervised learning, the function ignores this argument.
InitialU Optional For unsupervised learning, this argument specifies the initial value of u,
which is the probability that a field agrees, given that the object-pair
belongs to a different object. The default value is 0.1.
For supervised learning, the function ignores this argument.
InitialP Optional For unsupervised learning, this argument specifies the initial value of p,
which is the percentage of all possible object-pairs that contain the same
object. The default value is 0.1.
For supervised learning, the function ignores this argument.
MaxIteration Optional For unsupervised learning, this argument specifies the maximum number
of iterations. The default value is 100.
For supervised learning, the function ignores this argument.
Eta Optional For unsupervised learning, this argument specifies the tolerance of the
termination criterion. At the end of each iteration, the function computes
the difference between the current value of p and the value of p at the end
of the previous iteration. If the difference is less than eta_value, then the
function terminates. The default value is 1.0E-5.
Lambda Optional Specifies the Type I (false negative) error, which occurs if an unmatched
comparison is erroneously linked. The default value is 0.9.
Mu Optional Specifies the Type II (false positive) error, which occurs if a matched
comparison is erroneously not linked. The default value is 0.9.

Note:
Lambda and Mu determine the values of the model properties lower_bound and upper_bound (described
in the FellegiSunterTrainer Model Properties table below). For details, see: Fellegi, Ivan; Sunter, Alan
(December 1969). "A Theory for Record Linkage." Journal of the American Statistical Association 64.

Input
The FellegiSunterTrainer function has one required input table, which contains the object pairs, their field-
pair similarity values, and (for supervised learning) a tag column. The following table shows the schema of
the input table.
Table 1112: FellegiSunterTrainer input_table Schema

Column Name Data Type Description


field-pair_i_sim DOUBLE PRECISION The field-pair similarity value for field-pair i. In input_table, the
columns appear in this order: A_field_1, ..., A_field_n, B_field_1, ...,
B_field_n, field-pair_1_sim, ..., field-pair_n_sim.
tag_column VARCHAR This column is required only for supervised learning. Row i of this
column contains 'M' if the object pair in row i matches; otherwise, it
contains 'U'.

Note:
To create the input table for the FellegiSunterTrainer function, you can use the function StringSimilarity.

Output
The FellegiSunterTrainer function has one output table, which is a model table that is input to the function
FellegiSunterPredict. The following table shows the schema of the output table.
Table 1113: FellegiSunterTrainer Output (Model) Table Schema

Column Name Data Type Description


_key VARCHAR Model property name.
_value VARCHAR Model property value.

Table 1114: FellegiSunterTrainer Model Properties

Property Name Data Type Description


is_supervised BOOLEAN Has the value 'true' for supervised learning and 'false' for
unsupervised learning.
comparison_field_cnt INTEGER Count of comparison fields, equal to the length of the list
specified by the ComparisonFields argument.
comparison_field_name_i VARCHAR Name of comparison field i, where i is in the range [0,
comparison_field_cnt-1].
The table has a column for each comparison field.

comparison_field_threshold_i DOUBLE Threshold of comparison field i, where i is in the range
PRECISION [0, comparison_field_cnt-1]. If the similarity value
exceeds this value, then the two objects agree on field i.
The table has a column for each comparison field.
m_i DOUBLE Probability that the two objects agree on field i, given that
PRECISION the object pair matches, where i is in the range [0,
comparison_field_cnt-1].
u_i DOUBLE Probability that the two objects agree on field i, given that
PRECISION the object pair does not match, where i is in the range [0,
comparison_field_cnt-1].
p DOUBLE Percentage of object pairs that contain the same object.
PRECISION This column appears only in output for unsupervised
learning.
lower_bound DOUBLE If the weight of an object pair is less than lower bound,
PRECISION then the objects do not match.
upper_bound DOUBLE If the weight of an object pair is greater than upper
PRECISION bound, then the objects match.
lambda DOUBLE Type I (false negative) error, which occurs if an
PRECISION unmatched comparison is erroneously linked.
mu DOUBLE Type II (false positive) error, which occurs if a matched
PRECISION comparison is erroneously not linked.
time_used DOUBLE Time that the function used to learn the model
PRECISION parameters.

Examples

Input
Both examples use the same input. The input table is generated from the output of the StringSimilarity
function, using the following SQL query and adding the match_tag column (which is used for the supervised
FellegiSunter function).

DROP TABLE IF EXISTS fstrainer_input;


CREATE FACT TABLE fstrainer_input (PARTITION KEY (id)) AS
SELECT * FROM StringSimilarity (
ON strsimilarity_input PARTITION BY ANY
ComparisonColumnPairs (
'jaro (src_text1 , tar_text ) AS jaro1_sim',
'LD (src_text1 , tar_text, 2) AS ld1_sim',
'n_gram (src_text1 , tar_text, 2) AS ngram1_sim',
'jaro_winkler (src_text1 , tar_text, 2) AS jw1_sim'
)
CaseSensitive ('true')
Accumulate ('id','src_text1','tar_text')

);
ALTER TABLE fstrainer_input
ADD column match_tag varchar;
update fstrainer_input set match_tag= 'M' where id = 1;
update fstrainer_input set match_tag= 'M' where id = 2;
update fstrainer_input set match_tag= 'M' where id = 3;
update fstrainer_input set match_tag= 'U' where id = 4;
update fstrainer_input set match_tag= 'U' where id = 5;
update fstrainer_input set match_tag= 'M' where id = 6;
update fstrainer_input set match_tag= 'U' where id = 7;
update fstrainer_input set match_tag= 'M' where id = 8;
update fstrainer_input set match_tag= 'M' where id = 9;
update fstrainer_input set match_tag= 'U' where id = 10;
update fstrainer_input set match_tag= 'U' where id = 11;
update fstrainer_input set match_tag= 'U' where id = 12;
SELECT * FROM fstrainer_input ORDER BY 1;


Table 1115: FellegiSunterTrainer Example Input Table fstrainer_input (Columns 1-4)

id src_text1 tar_text jaro1_sim


1 astre aster 0.933333333333333
2 hone phone 0.933333333333333

3 acqiese acquiesce 0.925925925925926
4 AAAACCCCCGGGGA CCAGGGAAACCCAC 0.824175824175824
5 alice allies 0.822222222222222
6 angela angels 0.888888888888889
7 senter centre 0.822222222222222
8 chef chief 0.933333333333333
9 circus circuit 0.849206349206349
10 debt debris 0.75
11 deal lead 0.666666666666667
12 bare bear 0.833333333333333

Table 1116: FellegiSunterTrainer Example Input Table fstrainer_input (Columns 5-8)

ld1_sim ngram1_sim jw1_sim match_tag


0.6 0.5 0.953333333333333 M
0.8 0.75 0.933333333333333 M
0.777777777777778 0.5 0.948148148148148 M
0.214285714285714 0.384615384615385 0.824175824175824 U
0.5 0.4 0.857777777777778 U
0.833333333333333 0.8 0.933333333333333 M
0.5 0.4 0.822222222222222 U
0.8 0.5 0.946666666666667 M
0.714285714285714 0.666666666666667 0.90952380952381 M
0.5 0.4 0.825 U
0.5 0.333333333333333 0.666666666666667 U
0.5 0.333333333333333 0.85 U

The above input table compares the source column (src_text1) with the reference column (tar_text) and
gives the similarity scores based on the jaro, Levenshtein distance (LD), n-gram, and jaro-winkler metrics,
as described in StringSimilarity.

Example 1: Unsupervised Learning

SQL-MapReduce Call
The unsupervised model is generated by specifying user-defined threshold values for the different
metrics in the ComparisonFields argument. The initialization parameters InitialP, InitialM, InitialU,
Lambda, and Mu are set to their default values. The column match_tag is not used for unsupervised
learning.

DROP TABLE IF EXISTS fg_unsupervised_model;


CREATE DIMENSION TABLE "fg_unsupervised_model" AS
SELECT * FROM FellegiSunterTrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('fstrainer_input ')
ComparisonFields ('jaro1_sim: 0.8', 'ld1_sim:0.8', 'ngram1_sim:0.5',
'jw1_sim:0.8')
InitialP (0.1)
InitialM (0.9)
InitialU (0.1)
Lambda (0.9)
Mu (0.9)
);

Output
The unsupervised model “fg_unsupervised_model” is shown below. The time used to generate the model
may vary in each run.
This query returns the output shown in the following table.

SELECT * FROM fg_unsupervised_model ORDER BY 1;

Table 1117: FellegiSunterTrainer Example 1 Output (Model) Table fg_unsupervised_model

_key _value
comparison_filed_cnt 4
comparison_filed_name_0 jaro1_sim
comparison_filed_name_1 ld1_sim
comparison_filed_name_2 ngram1_sim
comparison_filed_name_3 jw1_sim
comparison_filed_threshold_0 0.8
comparison_filed_threshold_1 0.8
comparison_filed_threshold_2 0.5
comparison_filed_threshold_3 0.8
is_supervised false
lambda 0.9
lower_bound -13.7991041364018
m_0 0.9999999
m_1 0.333315282254649
m_2 0.999945434344193

m_3 0.9999999
mu 0.9
p 0.250013539042196
time_used 271.020000 seconds
u_0 0.777773766137301
u_1 1.0E-7
u_2 1.37483178043644E-7
u_3 0.88888688306865
upper_bound 22.2091654088378

Example 2: Supervised Learning


The supervised model “fg_supervised_model” is generated in a similar way, but in this case the
TagColumn ('match_tag') argument specifies the data on which the model is trained. The
initialization parameters are not used in the supervised learning mode. The time used to generate the model
can vary in each run.

SQL-MapReduce Call

CREATE DIMENSION TABLE "fg_supervised_model" AS


SELECT * FROM FellegiSunterTrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('fstrainer_input ')
ComparisonFields ('jaro1_sim: 0.8', 'ld1_sim:0.8', 'ngram1_sim:0.5',
'jw1_sim:0.8')
TagColumn ('match_tag')
);

Output
This query returns the output shown in the following table:

SELECT * FROM fg_supervised_model ORDER BY 1;

Table 1118: FellegiSunterTrainer Example 2 Output (Model) Table fg_supervised_model

_key _value
comparison_filed_cnt 4
comparison_filed_name_0 jaro1_sim
comparison_filed_name_1 ld1_sim
comparison_filed_name_2 ngram1_sim

comparison_filed_name_3 jw1_sim
comparison_filed_threshold_0 0.8
comparison_filed_threshold_1 0.8
comparison_filed_threshold_2 0.5
comparison_filed_threshold_3 0.8
is_supervised true
lambda 0.9
lower_bound -0.415037499278844
m_0 0.9999999
m_1 0.166666666666667
m_2 0.5
m_3 0.9999999
mu 0.9
time_used 35.413000 seconds
u_0 0.666666666666667
u_1 1.0E-7
u_2 1.0E-7
u_3 0.833333333333333
upper_bound -0.415037499278844

FellegiSunterPredict

Summary
The FellegiSunterPredict function predicts whether two objects are duplicates.

Usage

FellegiSunterPredict Syntax
Version 1.1

SELECT * FROM FellegiSunterPredict (


ON { table | view | (query) } PARTITION BY ANY
ON model_table DIMENSION

[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]')]
);

Arguments
Argument Category Description
Accumulate Optional Specifies the names of input table columns to be copied to the output
table.

Input
The FellegiSunterPredict function has two required input tables:
• The table, view, or query ("input table") that contains the object pairs whose duplicate status the function
is to predict
• The model table output by the FellegiSunterTrainer function
The input table must include the columns named by the comparison_field_name_i properties of the model table.

Output
Table 1119: FellegiSunterPredict Output Table Schema

Column Name Data Type Description


accumulate_column Any Column copied from the input table, specified by the Accumulate
argument.
weight DOUBLE PRECISION Weight of the object pair.
match_result VARCHAR Indicates whether the objects match:
'M': Yes (weight > upper bound)
'U': No (weight < lower bound)
'P': Possibly (weight is in range [lower bound, upper bound])
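
Rows with match_result 'P' typically require manual (clerical) review. The following query is a sketch of how
to isolate them, reusing the input table and model from the examples below:

SELECT * FROM FellegiSunterPredict (
ON fspredict_input PARTITION BY ANY
ON fg_unsupervised_model AS model DIMENSION
Accumulate ('id', 'src_text2', 'tar_text')
) WHERE match_result = 'P' ORDER BY weight;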

Examples

Input
Both examples use the same input table, which is generated from the output of the StringSimilarity function
by the following SQL query.

DROP TABLE IF EXISTS fspredict_input;


CREATE FACT TABLE fspredict_input (PARTITION KEY (id)) AS
SELECT * FROM StringSimilarity (
ON strsimilarity_input PARTITION BY ANY
ComparisonColumnPairs (

'jaro (src_text2 , tar_text ) AS jaro1_sim',
'LD (src_text2 , tar_text, 2) AS ld1_sim',
'n_gram (src_text2 , tar_text, 2) AS ngram1_sim',
'jaro_winkler (src_text2 , tar_text, 2) AS jw1_sim'
)
CaseSensitive ('true')
Accumulate ('id','src_text2','tar_text')
);
SELECT * FROM fspredict_input ORDER BY 1;


Table 1120: FellegiSunterPredict Example Input Table fspredict_input (Columns 1-4)

id src_text2 tar_text jaro1_sim


1 astter aster 0.944444444444445
2 fone phone 0.783333333333333
3 acquire acquiesce 0.841269841269841
4 CCCGGGAACCAACC CCAGGGAAACCCAC 0.875457875457875
5 allen allies 0.822222222222222
6 angle angels 0.877777777777778
7 center centre 0.944444444444445
8 cheap chief 0.733333333333333
9 circle circuit 0.746031746031746
10 debut debris 0.7
11 dell lead 0.5
12 bear bear 1

Table 1121: FellegiSunterPredict Example Input Table fspredict_input (Columns 5-7)

ld1_sim ngram1_sim jw1_sim


0.833333333333333 0.8 0.961111111111111
0.6 0.5 0.783333333333333
0.666666666666667 0.5 0.904761904761905
0.714285714285714 0.692307692307692 0.9003663003663
0.666666666666667 0.4 0.875555555555556
0.666666666666667 0.4 0.914444444444445
0.666666666666667 0.6 0.966666666666667
0.4 0.25 0.786666666666667
0.571428571428571 0.5 0.847619047619048
0.5 0.4 0.79
0.25 0 0.5
1 1 1

The above input table compares the source column (src_text2) with the reference column (tar_text) and
gives the similarity scores based on the jaro, Levenshtein distance, n-gram, and jaro-winkler metrics, as
described in StringSimilarity. The tar_text column is the same as in the FellegiSunterTrainer input, while
src_text2 is the new column required for prediction.

Example 1: Use Unsupervised Learning Model (fg_unsupervised_model)


The model fg_unsupervised_model and the input table fspredict_input are used for prediction.

SQL-MapReduce Call

SELECT * FROM FellegiSunterPredict (


ON fspredict_input PARTITION BY ANY
ON fg_unsupervised_model AS model DIMENSION
Accumulate ('id', 'src_text2', 'tar_text', 'jaro1_sim',
'ld1_sim','ngram1_sim', 'jw1_sim')
) ORDER BY id;

Output
The model prediction is shown in the final column ('M' for match, 'U' for no match). The weight of the
object pair is shown in the 'weight' column.
Table 1122: FellegiSunterPredict Example 1 Output Table (Columns 1-4)

id src_text2 tar_text jaro1_sim


1 astter aster 0.944444444444445

2 fone phone 0.783333333333333
3 acquire acquiesce 0.841269841269841
4 CCCGGGAACCAACC CCAGGGAAACCCAC 0.875457875457875
5 allen allies 0.822222222222222
6 angle angels 0.877777777777778
7 center centre 0.944444444444445
8 cheap chief 0.733333333333333
9 circle circuit 0.746031746031746
10 debut debris 0.7
11 dell lead 0.5
12 bear bear 1

Table 1123: FellegiSunterPredict Example 1 Output Table (Columns 5-9)

ld1_sim ngram1_sim jw1_sim weight match_result


0.833333333333333 0.8 0.961111111111111 44.9951243578567 M
0.6 0.5 0.783333333333333 -55.9137657950372 U
0.666666666666667 0.5 0.904761904761905 -14.2140648912983 U
0.714285714285714 0.692307692307692 0.9003663003663 22.741745029409 M
0.666666666666667 0.4 0.875555555555556 -14.2140648912983 U
0.666666666666667 0.4 0.914444444444445 -14.2140648912983 U
0.666666666666667 0.6 0.966666666666667 22.741745029409 M
0.4 0.25 0.786666666666667 -55.9137657950372 U
0.571428571428571 0.5 0.847619047619048 -35.6602399749748 U
0.5 0.4 0.79 -55.9137657950372 U
0.25 0 0.5 -55.9137657950372 U
1 1 1 44.9951243578567 M

Example 2: Use Supervised Learning Model (fg_supervised_model)


The model fg_supervised_model and the input table fspredict_input are used for prediction.

SQL-MapReduce Call

SELECT * FROM FellegiSunterPredict (


ON fspredict_input PARTITION BY ANY
ON fg_supervised_model AS model DIMENSION
Accumulate ('id', 'src_text2', 'tar_text', 'jaro1_sim',
'ld1_sim','ngram1_sim', 'jw1_sim')
) ORDER BY id;

Output
The model prediction is shown in the final column ('M' for match, 'U' for no match). The weight of the
object pair is shown in the 'weight' column.
Table 1124: FellegiSunterPredict Example 2 Output Table (Columns 1-4)

id src_text2 tar_text jaro1_sim


1 astter aster 0.944444444444445
2 fone phone 0.783333333333333
3 acquire acquiesce 0.841269841269841
4 CCCGGGAACCAACC CCAGGGAAACCCAC 0.875457875457875
5 allen allies 0.822222222222222
6 angle angels 0.877777777777778
7 center centre 0.944444444444445
8 cheap chief 0.733333333333333
9 circle circuit 0.746031746031746
10 debut debris 0.7
11 dell lead 0.5
12 bear bear 1

Table 1125: FellegiSunterPredict Example 2 Output Table (Columns 5-9)

ld1_sim ngram1_sim jw1_sim weight match_result


0.833333333333333 0.8 0.961111111111111 43.7700274457179 M
0.6 0.5 0.783333333333333 -43.6001024457943 U
0.666666666666667 0.5 0.904761904761905 -0.415037499278844 U
0.714285714285714 0.692307692307692 0.9003663003663 22.8384590206632 M
0.666666666666667 0.4 0.875555555555556 -0.415037499278844 U
0.666666666666667 0.4 0.914444444444445 -0.415037499278844 U
0.666666666666667 0.6 0.966666666666667 22.8384590206632 M
0.4 0.25 0.786666666666667 -43.6001024457943 U

0.571428571428571 0.5 0.847619047619048 -22.6685340199802 U
0.5 0.4 0.79 -43.6001024457943 U
0.25 0 0.5 -43.6001024457943 U
1 1 1 43.7700274457179 M

In these examples, the predictions are the same for both supervised and unsupervised models.

Geometry Functions
Aster Analytics Foundation provides three geometry functions:
• GeometryLoader loads data from different providers into the database and converts the data to the
format used by the other geometry functions.
• PointInPolygon computes a list of binary values for every point and polygon combination, which indicate
whether the point is contained in the polygon.
• GeometryOverlay finds the result of overlaying two geometries.

Note:
PointInPolygon and GeometryOverlay work only on 2D spatial objects.


GeometryLoader

Summary
The GeometryLoader function fetches file-based geospatial data from AFS, parses it, and stores it in Aster
Database. The function loads supported input formats from AFS and converts them to WKT or other
supported output formats in the database.

Usage

GeometryLoader Syntax
Version 1.1
For the ON clause, create a table mr_driver once, with no rows. For example:

SELECT * FROM GeometryLoader (


ON mr_driver
Path ('input_path' [,...])
[ Host ('afs_server_ip_address') ]
[ Port ('afs_server_port_number') ]
[ InputFormat ({ 'kml' | 'geojson' | 'shp' | 'mapinfo' }) ]
[ OutputFormat ({ 'wkt' | 'json' | 'kml' | 'gml' }) ]
[ OutputAttributes ('colname [ coltype ]' [,...]) ]
);

Arguments
Argument Category Description
Path Required The AFS directory or file name to fetch the geometry files from (for
example, /test or /test/testfile.xml or /test/*.xml).
Regular expressions are allowed in the argument.
Before calling this function, ensure that the directories and files that
you want to specify are available on the AFS system.

Note:
A zip file is treated as a directory.

Host Optional The namenode/IP address of the AFS (cluster) server.


Port Optional The port on the AFS server to connect to. The default is 2601.
InputFormat Optional The format of the geospatial data in the specified files. By default, the
function determines the format.
OutputFormat Optional The representation format of geospatial output data. The default value
is wkt.

Note:
Only WKT is supported by the PointInPolygon and
GeometryOverlay functions.

OutputAttributes Optional The output column names and types. The supported column types are
VARCHAR, INT, and DOUBLE PRECISION. The default column type
is VARCHAR.

Input
Table 1126: Geospatial File Formats That GeometryLoader Accepts

Provider Format URL


MapInfo (Pitney Bowes) MAP/MIF/TAB http://www.mapinfo.com/
(Multiple types of boundary data products/data/
files)
International data sets
ESRI SHP http://www.esri.com/data/
(Multiple types of boundary data esri_data
files)
International data sets
TIGER files SHP, KML https://www.census.gov/geo/
(US Census Tracts boundary files) maps-data/data/tiger.html
Other providers for USPS 2000 SHP, MAP, KML http://www.zipboundary.com/
ZIP Codes index.html

Output
Table 1127: GeometryLoader Output Table Schema

Column Name Data Type Description


path VARCHAR The basic path element of the geospatial input. For shp and mapinfo
formats, the directory is the basic path element, because those formats
use multiple files to represent geospatial objects. For kml and geojson
formats, the file is the basic path element.

Note:
ZIP files are treated as directories.

OUTPUTATTRIBUTESn VARCHAR, INTEGER, or DOUBLE PRECISION Geospatial object
attribute n, as specified by the OutputAttributes argument.
{ wkt | json | kml | gml } VARCHAR Geospatial objects in the output format.

Example

Input
The input files, stored in AFS, are sample ArcGIS shapefiles. You can get these files from
http://www.arcgis.com/home/item.html?id=b07a9393ecbd430795a6f6218443dccc.
• states.dbf
• states.prj
• states.sbn
• states.sbx
• states.shp
• states.shp.xml
• states.shx
Install these files into AFS (Aster File Server) as follows. From the beehive=> prompt:

beehive=> \afs -mkdir /data

The following command copies the files from your local directory to AFS. This example assumes that the
files have been unzipped into a local directory /home/states.

\afs -put /home/states /data/

Confirm that all files have been uploaded. You should see the following list of files:

beehive=> \afs -ls /data/states


Found 7 items
-rwxrwxrwx 2 db_superuser db_superuser 2846 2015-12-30 11:06 /data/states/
states.dbf
-rwxrwxrwx 2 db_superuser db_superuser 167 2015-12-30 11:06 /data/states/
states.prj
-rwxrwxrwx 2 db_superuser db_superuser 596 2015-12-30 11:06 /data/states/
states.sbn
-rwxrwxrwx 2 db_superuser db_superuser 148 2015-12-30 11:06 /data/states/
states.sbx
-rwxrwxrwx 2 db_superuser db_superuser 222392 2015-12-30 11:06 /data/states/

states.shp
-rwxrwxrwx 2 db_superuser db_superuser 2842 2015-12-30 11:06 /data/states/
states.shp.xml
-rwxrwxrwx 2 db_superuser db_superuser 508 2015-12-30 11:06 /data/states/
states.shx

You must also create an empty table mr_driver, as follows:

DROP TABLE IF EXISTS mr_driver;


CREATE FACT TABLE mr_driver (id int, partition key (id));

SQL-MapReduce Call

SELECT * FROM GeometryLoader (


ON mr_driver
PATH ('/data/states/')
OUTPUTATTRIBUTES ('STATE_NAME varchar', 'SUB_REGION varchar',
'STATE_ABBR varchar')
);

Note:
The STATE_NAME, SUB_REGION, and STATE_ABBR columns represent attributes defined in the
shapefiles.


Output
Table 1128: GeometryLoader Output Table

path STATE_NAME SUB_REGION STATE_ABBR wkt


/data/ Hawaii Pacific HI MULTIPOLYGON
states/ (((-160.07380334546815
22.004177347957729,-160.049709345
445706
21.988164347942817,-160.089858345
483094 21.915870347875487,...

/data/ Washington Pacific WA MULTIPOLYGON
states/ (((-122.402015310383547
48.225216372377972,-122.462855310
440204
48.228363372380912,-122.454419310
432343 48.128492372287894,...
/data/ Montana Mountain MT POLYGON ((-111.475425300207363
states/ 44.702162369096875,-111.480804300
21237
44.691416369086866,-111.460692300
193642 44.670023369066939,...
... ... ... ... ...
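
To reuse the loaded geometries, for example as the reference input to the PointInPolygon function, you can
materialize the output in a table. The following statement is a sketch; the table name states_wkt is
illustrative:

CREATE DIMENSION TABLE states_wkt AS
SELECT * FROM GeometryLoader (
ON mr_driver
PATH ('/data/states/')
OUTPUTATTRIBUTES ('STATE_NAME varchar', 'SUB_REGION varchar',
'STATE_ABBR varchar')
);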

PointInPolygon

Summary
The PointInPolygon function takes a list of location points and a list of polygons and returns a list of binary
values for every point and polygon combination, which indicates whether the point is contained in the
polygon.

Note:
The function works only on 2D spatial objects.

Background
The PointInPolygon function judges whether a given point in the plane lies inside or outside of a polygon. It
has various applications in many fields such as computer graphics, geographical information systems (GIS),
and CAD.
For example, a point A can lie inside a polygon while a point B lies outside of the same polygon.


A use case for this function is to determine in which “drive-time polygon” surrounding a store a customer
resides. This information helps in mailer targeting.
Another use case is to determine which cell phones are frequently within a polygon surrounding an airport.
This information helps in identifying frequent fliers.

Usage

PointInPolygon Syntax

Small Polygon Count and Large Point Count


Version 1.1

SELECT * FROM PointInPolygon (


ON source_table AS source PARTITION BY ANY
ON reference_table AS reference DIMENSION
SourceLocationColumn ('source_location_point_column'
[, 'source_location_point_column_2' ])
ReferenceLocationColumn ('reference_location_polygon_column')
ReferenceNameColumns ({ 'reference_name_column' |
'reference_name_column_range' }[,...])
[ OutputAll ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]')]
);

Large Polygon Count and Small Point Count


Version 1.1

SELECT * FROM PointInPolygon (


ON dimension_table as source DIMENSION
ON reference_table as reference PARTITION BY ANY
SourceLocationColumn ('source_location_point_column'
[, 'source_location_point_column_2' ])
ReferenceLocationColumn ('reference_location_polygon_column')
ReferenceNameColumns ({ 'reference_name_column' |
'reference_name_column_range' }[,...])
OutputAll ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})
[ Accumulate

({ 'accumulate_column' | 'accumulate_column_range' }[,...]')]
);

Only to Determine Relations of Points and Polygons in Same Group


Version 1.1

SELECT * FROM PointInPolygon (


ON { table_name | view_name |(query) } AS source
PARTITION BY group_key
ON { table_name | view_name |(query) } AS reference
PARTITION BY group_key
SourceLocationColumn ('source_location_point_column'
[, 'source_location_point_column_2' ])
ReferenceLocationColumn ('reference_location_polygon_column')
ReferenceNameColumns ({ 'reference_name_column' |
'reference_name_column_range' }[,...])
OutputAll ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]')]
);
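
For example, when points and polygons share a grouping key, a call of this form restricts matching to
point-polygon pairs within the same group. This is a hypothetical sketch; the table and column names are
illustrative:

SELECT * FROM PointInPolygon (
ON city_points AS source PARTITION BY city_id
ON city_polygons AS reference PARTITION BY city_id
SourceLocationColumn ('point_wkt')
ReferenceLocationColumn ('polygon_wkt')
ReferenceNameColumns ('polygon_name')
OutputAll ('false')
);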

Arguments
Argument Category Description
SourceLocationColumn Required The names of the columns that contain the point
coordinate values from the source input table.
If only one column is specified, the coordinates of the
point must be expressed using well-known text (WKT)
syntax. For example, the string “POINT (30 10)” is the
WKT markup syntax that describes a point whose x
coordinate is 30 and whose y coordinate is 10).
By supporting WKT, this function can process the output
of the GeometryLoader function, which is expressed in
WKT. For example, you can use the GeometryLoader
function to convert GIS data formats (for example,
shapefile (.shp), MapInfo TAB (.tab), Keyhole Markup
Language (KML), and GeoJSON) to WKT and use the
PointInPolygon function to process the resulting WKT
data.
For more information about WKT, refer to the following
URL:
http://www.geoapi.org/3.0/javadoc/org/opengis/
referencing/doc-files/WKT.html
If two columns are specified, the function assumes that
they represent the two coordinates (for example, latitude
and longitude) of the input points. The two-column
format for representing points is useful when the input
data consists of raw latitude and longitude pairs.

When two columns are specified, the output of the
IPGEO function can be used as input to this function.
ReferenceLocationColumn Required The column from the reference input table that contains
the polygon coordinate values. The content must be of
type WKT.
ReferenceNameColumns Required The columns from the reference input table that contains
the polygon name. These columns are passed to the
output.
OutputAll Optional Whether to specify, in the output table, that the point is
not in a polygon. The default value is 'false' (does not
specify that the point is not in a polygon).
Accumulate Optional The columns from the source input table that are passed
to the output table. By default, no columns are passed to
the output table.

Input
The PointInPolygon function requires two input tables, source and reference. These tables must have the
same coordinate reference system.
Table 1129: PointInPolygon Source Table Schema

Column Name Data Type Description


source_location_point_column If you specify only this column, its data type is CHAR, VARCHAR, or
TEXT, and it contains WKT content. If you specify two columns, the
data type is SMALLINT, INTEGER, BIGINT, DOUBLE PRECISION,
NUMERIC, or DECIMAL, and this column contains the x coordinates
of the points.
source_location_point_column_2 SMALLINT, INTEGER, BIGINT, DOUBLE PRECISION, NUMERIC,
or DECIMAL Contains the y coordinates of the points.
accumulate_column Any Column to copy to the output
table.

Table 1130: PointInPolygon Reference Table Schema

Column Name Data Type Description


reference_location_polygon_column CHAR, VARCHAR, or TEXT Polygon description.
reference_location_column Any Polygon name.

Output
Table 1131: PointInPolygon Output Table Schema

Column Name Data Type Description

pip_flag INTEGER 1 if the point is inside the polygon, 0
otherwise.
source_location_point_column Same as in If source table has only this column: WKT
source content.
table If source table has two such columns: Contains
the x coordinates of the points.
source_location_point_column_2 Same as in Appears in output table only if it appears in
source source table. Contains the y coordinates of the
table points.
ref_reference_location_polygon_column Same as in Polygon description.
reference
table
ref_reference_location_column Same as in Polygon name.
reference
table
accumulate_column Same as in Column copied from the source table.
source
table

Examples
These examples use PointInPolygon function in three modes:
• With outputall ('true')
• With outputall ('false')
• Using passenger coordinates as separate columns

Example 1: With OutputAll ('true')

Input
This example assumes that the parsed location file formats are grouped into the following relation, as shown
in the table source_passenger, as input to the function. There are four passengers whose x, y coordinates are
known and the goal is to determine in which of the two airport terminals (A or B) they are located. The
outlay, or geographical location, of the terminals is specified in the table reference_terminal as polygon
coordinates. In this table, the coordinates of the points are specified using WKT syntax.

Table 1132: PointInPolygon Example 1 Input Table source_passenger

customer_id source_location_point customer_name


1 POINT (30 10) Jeff
2 POINT (300 10) John
3 POINT (300 20) Maria
3 POINT (400 20) Macy

Table 1133: PointInPolygon Example 1 Input Table reference_terminal

terminal_id reference_location_polygon terminal_name


1 POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0)) Terminal A
2 POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0)) Terminal B

SQL-MapReduce Call

SELECT * FROM PointInPolygon (


ON source_passenger AS source partition BY ANY
ON reference_terminal AS reference dimension
SourceLocationColumn ('source_location_point')
ReferenceLocationColumn ('reference_location_polygon')
ReferenceNameColumns ('terminal_name')
OutputAll ('true')
Accumulate ('customer_id', 'customer_name')
) ORDER BY source_location_point;

Output
Because the OutputAll argument is set to true, the output table shows all passengers regardless of whether
they are in a terminal or not.
Table 1134: PointInPolygon Example 1 Output Table (Columns 1-2)

source_location_point ref_reference_location_polygon
POINT (30 10) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (30 10) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
POINT (300 10) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (300 10) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
POINT (300 20) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (300 20) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
POINT (400 20) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (400 20) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))

Table 1135: PointInPolygon Example 1 Output Table (Columns 3-6)

ref_terminal_name pip_flag customer_id customer_name


Terminal B 0 1 Jeff
Terminal A 1 1 Jeff
Terminal B 1 2 John
Terminal A 0 2 John
Terminal B 1 3 Maria
Terminal A 0 3 Maria
Terminal B 0 3 Macy
Terminal A 0 3 Macy
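
A typical follow-on step is to aggregate the results. For example, this sketch counts the passengers found in
each terminal, building on the Example 1 call (with OutputAll set to 'false' so that only contained points are
returned):

SELECT ref_terminal_name, COUNT(*) AS passenger_count
FROM PointInPolygon (
ON source_passenger AS source PARTITION BY ANY
ON reference_terminal AS reference DIMENSION
SourceLocationColumn ('source_location_point')
ReferenceLocationColumn ('reference_location_polygon')
ReferenceNameColumns ('terminal_name')
OutputAll ('false')
Accumulate ('customer_id')
) GROUP BY ref_terminal_name;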

Example 2: With OutputAll ('false')

SQL-MapReduce Call

SELECT * FROM PointInPolygon (


ON source_passenger AS source PARTITION BY ANY
ON reference_terminal AS reference DIMENSION
SourceLocationColumn ('source_location_point')
ReferenceLocationColumn ('reference_location_polygon')
ReferenceNameColumns ('terminal_name')
outputall ('false')
Accumulate ('customer_id', 'customer_name')
) ORDER BY source_location_point;

Output
In this example, which uses the same input as Example 1 but has the OutputAll argument set to false, the
output includes only passengers inside a terminal. Macy is not in any terminal and does not appear in the
output table.
Table 1136: PointInPolygon Example 2 Output Table (Columns 1-2)

source_location_point ref_reference_location_polygon
POINT (30 10) POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
POINT (300 10) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
POINT (300 20) POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))

Table 1137: PointInPolygon Example 2 Output Table (Columns 3-6)

ref_terminal_name pip_flag customer_id customer_name


Terminal A 1 1 Jeff

Terminal B 1 2 John
Terminal B 1 3 Maria

Example 3: Passenger Coordinates as Separate Columns


In this example, the input locations are given in a different format.

Input

Table 1138: PointInPolygon Example 3 Input Table source_passenger1

customer_id x y customer_name
1 30 10 Jeff
1 300 10 John
1 300 20 Maria
1 400 20 Macy

SQL-MapReduce Call

SELECT * FROM PointInPolygon(


ON source_passenger1 AS source PARTITION BY ANY
ON reference_terminal AS reference DIMENSION
SourceLocationColumn ('x', 'y')
ReferenceLocationColumn ('reference_location_polygon')
ReferenceNameColumns ('terminal_name')
OutputAll ('false')
Accumulate ('customer_id', 'customer_name')
) ORDER BY x, y;

Output

Table 1139: PointInPolygon Example 3 Output Table (Columns 1-3)

x y ref_reference_location_polygon
30 10 POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0))
300 10 POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))
300 20 POLYGON ((200 0, 400 0, 400 200, 200 200, 200 0))

Table 1140: PointInPolygon Example 3 Output Table (Columns 4-7)

ref_terminal_name pip_flag customer_id customer_name


Terminal A 1 1 Jeff
Terminal B 1 1 John

Terminal B 1 1 Maria

GeometryOverlay

Summary
The GeometryOverlay function takes two geometries described by the well-known text (WKT) markup
language and outputs the result of overlaying them, as specified by the boundary operator.
You can use this function to prepare sets of geometries for input to the PointInPolygon function. For
example, you can use this function to prepare a geometry that contains all cellular phone reception polygons
near an airport to create a geometry that is useful for identifying frequent fliers.

Note:
The function works only on 2D spatial objects.
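
For example, a hypothetical two-step pipeline for the frequent-flier use case might first merge cellular
reception polygons into one coverage geometry and then test phone locations against it. The table and
column names here are illustrative, not part of the examples below:

CREATE DIMENSION TABLE airport_coverage AS
SELECT * FROM GeometryOverlay (
ON cell_polygons AS source PARTITION BY ANY
ON airport_area AS reference DIMENSION
SourceLocationColumn ('boundary_coordinates')
ReferenceLocationColumn ('boundary_coordinates')
ReferenceNameColumns ('boundary_name')
BoundaryOperator ('union')
);

SELECT * FROM PointInPolygon (
ON phone_pings AS source PARTITION BY ANY
ON airport_coverage AS reference DIMENSION
SourceLocationColumn ('ping_point')
ReferenceLocationColumn ('overlay_boundary')
ReferenceNameColumns ('ref_boundary_name')
);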

Usage

GeometryOverlay Syntax

UNION, INTERSECTION, DIFFERENCE and SYMDIFFERENCE


Version 1.1

SELECT * FROM GeometryOverlay (


ON { table_name | view_name | (query) } AS source PARTITION BY ANY
ON { table_name | view_name | (query) } AS reference DIMENSION
SourceLocationColumn ('source_location_column')
ReferenceLocationColumn ('ref_location_column')
ReferenceNameColumns
({ 'ref_name_column' | 'ref_name_column_range' }[,...])
BoundaryOperator (
{ 'UNION' | 'INTERSECTION' | 'DIFFERENCE' | 'SYMDIFFERENCE' })
[ OutputAll ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]')]
);

CONVEXHULL
Version 1.1

SELECT * FROM GeometryOverlay (


ON { table_name | view_name | (query) }
SourceLocationColumn ('source_location_column')

1192 Teradata Aster Analytics Foundation User Guide


Chapter 13: Data Transformation
GeometryOverlay
BoundaryOperator ({ 'BUFFER' | 'CONVEXHULL' })
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]')]
);

BUFFER
Version 1.1

SELECT * FROM GeometryOverlay (


ON { table_name | view_name | (query) }
SourceLocationColumn ('source_location_column')
BoundaryOperator ({ 'BUFFER' | 'CONVEXHULL' })
Distance ('distance')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]')]
);

Arguments
Argument Category Description
SourceLocationColumn Required Specifies the name of the source table column that contains the
polygon description in WKT format.
ReferenceLocationColumn Required Specifies the name of the reference table column that contains
the location of the polygon description in WKT format.
ReferenceNameColumns Required Specifies the name of the reference table column that contains
the names of the polygons.
BoundaryOperator Required Specifies the boundary (geometry overlay) operator. For
descriptions of these operators, refer to the following table.
Distance Required Specifies the distance by which to extend or decrease the
by polygon.
BUFFER
operator
OutputAll Optional Specifies whether to include the result of non-intersection
geometries in the output. The default value is 'false'.
Accumulate Optional Specifies the names of the source table columns to copy to the
output table.

Table 1141: GeometryOverlay Boundary Operators

Boundary Operator Result Description


UNION Result contains the area covered by the source polygon and the area
covered by the reference polygon.
INTERSECTION Result contains the area that is common to the source and reference
polygons.
DIFFERENCE Result contains the area covered only by the source polygon.
SYMDIFFERENCE Result contains the area covered by only the source polygon or only
the reference polygon.
CONVEXHULL Result contains the smallest convex set that contains the source
polygon in the Euclidean plane.
BUFFER Result contains the source polygon and the specified buffer. A positive
buffer extends the area of the polygon; a negative buffer decreases it.

Note:
For the buffer operation, eight segments are used to approximate the
curve at each corner.

Input
For the boundary operators CONVEXHULL and BUFFER, the function requires only one input table,
source. For the other boundary operators, the function requires two input tables, source and reference,
which must use the same coordinate reference system.
Table 1142: GeometryOverlay Source Table Schema

Column Name Data Type Description


source_location_column VARCHAR Description of the source polygon in the WKT markup language.
accumulate_column Any Column to copy to the output table.

Table 1143: GeometryOverlay Reference Table Schema

Column Name Data Type Description


ref_location_column VARCHAR Location of the polygon description in WKT format.
ref_name_column Any Name of the polygon.

Output
Table 1144: GeometryOverlay Output Table Schema

Column Name Data Type Description


overlay_boundary VARCHAR Description of the resulting geometry in the WKT markup
language.
overlay_flag INTEGER Indicates whether the source and reference polygons intersect.
Column does not appear for the CONVEXHULL or BUFFER
operator.
ref_ref_name_column Any Name of the polygon (copied from the reference table).
accumulate_column Any Column copied from the source table.

Examples
The following three examples use the same Input.
• Example 1: Intersection
• Example 2: Union
• Example 3: Buffer (Single Input)

Input
The source input table, source_gatetype, provides the geometrical coordinates of the domestic and
international gates that are spread over three terminals (A, B and C) of an airport.
Table 1145: GeometryOverlay Input Table source_gatetype

id boundary_coordinates boundary_name boundary_attributes


1 POLYGON ((10 10, 10 20, 20 20, 20 10, 10 10)) Domestic Gates Domestic
2 POLYGON ((50 50, 50 150, 150 150, 150 50,50 50)) International Gates International

The terminal coordinates are given in the table ref_terminal. All coordinates are in WKT syntax.
Table 1146: GeometryOverlay Input Table ref_terminal

id boundary_coordinates boundary_name boundary_attributes


1 POLYGON ((0 0, 100 0, 100 100, 0 100, 0 0)) Terminal A A
2 POLYGON ((100 0, 200 0, 200 100, 100 100, 100 0)) Terminal B B
3 POLYGON ((0 100, 100 100, 100 200, 0 200, 0 100)) Terminal C C

Example 1: Intersection

SQL-MapReduce Call

SELECT * FROM GeometryOverlay (


ON source_gatetype AS source PARTITION BY ANY
ON ref_terminal AS reference DIMENSION
SourceLocationColumn ('boundary_coordinates')
ReferenceLocationColumn ('boundary_coordinates')
ReferenceNameColumns ('boundary_name')
BoundaryOperator ('intersection')
OutputAll ('false')
Accumulate ('boundary_attributes')
) ORDER BY 2, 1;

Output
The output shows that all domestic gates are contained within Terminal A. International gates are spread
over all three terminals.
Table 1147: GeometryOverlay Example 1 Output Table

overlay_boundary ref_boundary_name overlay_flag boundary_attributes


POLYGON ((10 10, 10 Terminal A 1 Domestic
20, 20 20, 20 10, 10 10))
POLYGON ((50 50, 50 Terminal A 1 International
100, 100 100, 100 50, 50
50))
POLYGON ((150 100, Terminal B 1 International
150 50, 100 50, 100 100,
150 100))
POLYGON ((50 100, 50 Terminal C 1 International
150, 100 150, 100 100, 50
100))

Example 2: Union
This example computes the union of area for both the gates and the terminals.

SQL-MapReduce Call

SELECT * FROM GeometryOverlay (


ON source_gatetype AS source PARTITION BY id
ON ref_terminal AS reference PARTITION BY id
SourceLocationColumn ('boundary_coordinates')
ReferenceLocationColumn ('boundary_coordinates')
ReferenceNameColumns ('boundary_name')
BoundaryOperator ('union')
OutputAll ('false')

Accumulate ('boundary_name', 'id')
) ORDER BY 2, 1;

Output
Because all domestic gates are contained within Terminal A, the union of the area specified by the domestic
gates and the area of Terminal A is just Terminal A, as shown in the first row of the output. The second row
shows the union of the coordinates of the international gates and all three terminals.
Table 1148: GeometryOverlay Example 2 Output Table

overlay_boundary                                                                     ref_boundary_name  overlay_flag  boundary_name        id
POLYGON ((0 0, 0 100, 100 100, 100 0, 0 0))                                          Terminal A         1             Domestic Gates       1
POLYGON ((50 50, 50 150, 150 150, 150 100, 200 100, 200 0, 100 0, 100 50, 50 50))   Terminal B         1             International Gates  2

Example 3: Buffer (Single Input)


This example uses the buffer operator to expand the domestic and international gates by a distance of 2.

SQL-MapReduce Call

SELECT * FROM GeometryOverlay (
  ON source_gatetype
  SourceLocationColumn ('boundary_coordinates')
  BoundaryOperator ('buffer')
  Distance (2)
  Accumulate ('boundary_name', 'id')
) ORDER BY 3;

Output

Table 1149: GeometryOverlay Example 3 Output Table

overlay_boundary  boundary_name  id
POLYGON ((10 8, 9.609819355967742 8.03842943919354, 9.234633135269819 8.152240934977428, 8.888859533960796 8.33706077539491, 8.585786437626904 8.585786437626904, 8.337060775394908 8.888859533960796, 8.152240934977426 9.23463313526982, 8.03842943919354 9.609819355967744, 8 10, 8 20, 8.03842943919354 20.390180644032256, 8.152240934977426 20.76536686473018, 8.33706077539491 21.111140466039203, 8.585786437626904 21.414213562373096, 8.888859533960796 21.66293922460509, 9.23463313526982 21.847759065022572, 9.609819355967744 21.96157056080646, 10 22, 20 22, 20.390180644032256 21.96157056080646, 20.76536686473018 21.847759065022572, 21.111140466039206 21.66293922460509, 21.414213562373096 21.414213562373096, 21.66293922460509 21.111140466039203, 21.847759065022572 20.76536686473018, 21.96157056080646 20.390180644032256, 22 20, 22 10, 21.96157056080646 9.609819355967744, 21.847759065022572 9.23463313526982, 21.66293922460509 8.888859533960796, 21.414213562373096 8.585786437626904, 21.111140466039206 8.33706077539491, 20.76536686473018 8.152240934977426, 20.390180644032256 8.03842943919354, 20 8, 10 8))  Domestic Gates  1
POLYGON ((50 48, 49.609819355967744 48.03842943919354, 49.23463313526982 48.15224093497743, 48.8888595339608 48.33706077539491, 48.58578643762691 48.58578643762691, 48.33706077539491 48.8888595339608, 48.15224093497743 49.23463313526982, 48.038429439193536 49.609819355967744, 48 50, 48 150, 48.038429439193536 150.39018064403226, 48.15224093497743 150.76536686473017, 48.33706077539491 151.11114046603922, 48.58578643762691 151.4142135623731, 48.8888595339608 151.66293922460508, 49.23463313526982 151.8477590650226, 49.609819355967744 151.96157056080645, 50 152, 150 152, 150.39018064403226 151.96157056080645, 150.76536686473017 151.8477590650226, 151.11114046603922 151.66293922460508, 151.4142135623731 151.4142135623731, 151.66293922460508 151.11114046603922, 151.8477590650226 150.76536686473017, 151.96157056080645 150.39018064403226, 152 150, 152 50, 151.96157056080645 49.609819355967744, 151.8477590650226 49.23463313526982, 151.66293922460508 48.8888595339608, 151.4142135623731 48.58578643762691, 151.11114046603922 48.33706077539491, 150.76536686473017 48.15224093497743, 150.39018064403226 48.038429439193536, 150 48, 50 48))  International Gates  2

IdentityMatch

Summary
The IdentityMatch function tries to match source data with reference data, using specified attributes to
calculate the similarity score of each source-reference pair, and then computes the final similarity score.
Typically, the source data is about business customers and the reference data is from external sources, such
as online forums and social networking services.


Background
Businesses can easily gather customer sentiments from external data sources such as online forums and
social networking services. However, businesses cannot easily tell if the customer whose database identifier
(ID) is John Q. Public is the person with the online forum ID JQPublic or the social networking service ID
JohnP. The IdentityMatch function is intended to make this job easier.
The IdentityMatch function supports both nominal (exact) matching and fuzzy matching. You specify the
nominal-match attributes and the fuzzy-match attributes. First, the function compares the nominal-match
attributes. If they match exactly, the function does not compare the fuzzy-match attributes; if not, the
function compares the fuzzy-match attributes and uses only their similarity score.
For example, suppose that the nominal-match attribute is user ID and the fuzzy-match attribute is email or
mobile phone number. Two user IDs might not match exactly, but if both are associated with the same email
or mobile phone number, then they are considered to identify the same user.
However, if the fuzzy-match attributes do not represent users (as location and many other profile attributes
do not), then the function uses weighted matching. For example, for customer 1 and external user 2, the
matching formula could be:

score(1, 2) = w1*f1(x1, y1) + w2*f2(x2, y2) + ... + wn*fn(xn, yn)

where each fi is a function that calculates the similarity of the two strings xi and yi (the values of the ith fuzzy-match attribute for customer 1 and external user 2) and returns a value between 0 and 1, and w1+w2+...+wn = 1.
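
For instance (an illustrative calculation, not taken from the function output), suppose there are two fuzzy-match attributes with normalized weights w1 = 0.3 and w2 = 0.7, and the per-attribute similarity scores for a given source-reference pair are f1 = 0.8 and f2 = 0.6. The final similarity score is then:

score = 0.3*0.8 + 0.7*0.6 = 0.24 + 0.42 = 0.66

With the default Threshold of 0.5, this pair would appear in the output.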

Usage

IdentityMatch Syntax

When Reference Data Fits in Memory


Use this syntax when at least one input fits in memory; the input that fits in memory is named b and is specified with the DIMENSION clause. The function compares each record of a with each record of b, so the number of comparisons is |a|*|b|.
Version 1.1

SELECT * FROM IdentityMatch (
  ON source_input_table AS a PARTITION BY ANY
  ON reference_input_table AS b DIMENSION
  IDColumn ('a.id_column: b.id_column')
  { NominalMatchColumns ('a.columnX: b.columnY' [,...]) |
    FuzzyMatchColumns ('a.columnX: b.columnY, match_metric,
      match_weight [, synonym_file ]' [,...]) }
  [ Accumulate ('{a|b}.accumulate_column' [,...]) ]
  [ Threshold ('threshold') ]
);

When Reference Data Does Not Fit in Memory


Use this syntax when neither input fits in memory. Partition the data by a categorical attribute (for example, age range), which distributes the records to the workers by that attribute and reduces the number of comparisons. If the categorical attribute has n values and each value covers the same number of rows, the number of comparisons is reduced by a factor of n.
Version 1.1

SELECT * FROM IdentityMatch (
  ON source_input_table AS a PARTITION BY key
  ON reference_input_table AS b PARTITION BY key
  IDColumn ('a.id_column: b.id_column')
  { NominalMatchColumns ('a.columnX: b.columnY' [,...]) |
    FuzzyMatchColumns ('a.columnX: b.columnY, match_metric,
      match_weight [, synonym_file ]' [,...]) }
  [ Accumulate ('{a|b}.accumulate_column' [,...]) ]
  [ Threshold ('threshold') ]
);
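
For instance, with the example tables used later in this section, both inputs have a zipcode column that could serve as the partition key. A minimal sketch (assuming the zipcode values are clean enough to act as a blocking key) might be:

SELECT * FROM IdentityMatch (
  ON applicant_reference AS a PARTITION BY zipcode
  ON applicant_external AS b PARTITION BY zipcode
  IDColumn ('a.id: b.id')
  FuzzyMatchColumns ('a.lastname: b.lastname, JARO-WINKLER, 1')
  Accumulate ('a.firstname', 'b.creditscore')
  Threshold (0.5)
);

Only records that share a zipcode value are compared, so dirty zipcode values (as in the example data) would prevent some true matches; choose the partition key accordingly.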

Arguments
Argument Category Description
IDColumn Required Specifies the names of the columns in the source and
reference input tables that contain row identifiers. The
function copies these columns to the output table.
NominalMatchColumns Optional* Specifies pairs of columns (attributes) to check for exact
matching (a.columnX and b.columnY are column names). If
any pair matches exactly, then their records are considered
to be exact matches.
*Required if you omit FuzzyMatchColumns.
FuzzyMatchColumns Optional* Specifies pairs of columns (attributes) to check for fuzzy
matching (a.columnX and b.columnY are column names)
and the fuzzy matching parameters match_metric,
match_weight, and synonym_file (whose descriptions
follow). If any pair is a fuzzy match, then their records are
considered to be fuzzy matches.
*Required if you omit NominalMatchColumns.
The parameter match_metric specifies the similarity metric,
which is a function that returns the similarity score of two
strings (a value between 0 and 1). The possible values of
match_metric are:
• EQUAL:
If strings a and b are equal, then their similarity score is
1.0; otherwise it is 0.0.
• LD:
The similarity score of strings a and b is f(a,b)=LD(a,b)/
max(len(a),len(b)), where LD(a,b) is the Levenshtein
distance between a and b.
• D-LD:
Like LD except that LD(a,b) is the Damerau-Levenshtein
distance between a and b.
• JARO:
The similarity score of strings a and b is the Jaro distance
between them.
• JARO-WINKLER:
The similarity score of strings a and b is the Jaro-Winkler
distance between them.
• NEEDLEMAN-WUNSCH:
The similarity score of strings a and b is the Needleman-
Wunsch distance between them.
• JD:
The similarity score of strings a and b is the Jaccard
distance between them. The function converts the strings
a and b to sets s and t by splitting them by space and then
uses the formula f(s,t)=|s∩t|/|s∪t|.
• COSINE:
The similarity score of strings a and b is calculated with
their term frequency-inverse document frequency (TF-IDF)
and cosine similarity.

Note:
The function calculates IDF only on the input relation
stored in memory.

The parameter match_weight specifies the weight (relative
importance) of the attribute represented by a.columnX and
b.columnY. The match_weight must be a positive number.
The function normalizes each match_weight to a value in the
range [0, 1]. Given match_weight values w1, w2, ..., wn, the
normalized value of wi is:
wi/(w1+w2+ ...+ wn)
For example, given two pairs of columns whose match
weights are 3 and 7, the function uses the weights
3/(3+7)=0.3 and 7/(3+7)=0.7 to compute the similarity score.

The parameter synonym_file specifies the dictionary that the
function uses to check the two strings for semantic equality.
In the dictionary, each line is a comma-separated list of
synonyms.

Note:
You must install the dictionary before running the
function.

Accumulate Optional Specifies input table columns to copy to the output table.
Threshold Optional Specifies the threshold similarity score, a DOUBLE
PRECISION value between 0 and 1. The default value is 0.5.
The function outputs only the records whose similarity
score exceeds threshold.

Input
The IdentityMatch function requires a source input table and a reference input table. The following two
tables describe the input table columns that appear in the function syntax. The tables can have additional
columns, but the function ignores them.
Table 1150: IdentityMatch Source Input Table Schema

Column Name Data Type Description


a.id_column Any Contains row identifiers.
a.columnX VARCHAR Contains strings.
a.accumulate_column Any Column to be copied to the output table.

Table 1151: IdentityMatch Reference Input Table Schema

Column Name Data Type Description


b.id_column Any Contains row identifiers.
b.columnY VARCHAR Contains strings.
b.accumulate_column Any Column to be copied to the output table.

Output
Table 1152: IdentityMatch Output Table Schema

Column Name          Data Type                         Description
a.id_column          Same as in source input table     Contains row identifiers from the source input table.
b.id_column          Same as in reference input table  Contains row identifiers from the reference input table.
a.accumulate_column  Same as in source input table     Column copied from the source input table.
b.accumulate_column  Same as in reference input table  Column copied from the reference input table.
score                DOUBLE PRECISION                  Contains the final similarity score of each record pair.

Example

Input
The input table, applicant_reference, is hypothetical information from people applying for employment at a
particular company. This table is a reference table against which external data from various sources can be
compared for identity matching.
Table 1153: IdentityMatch Example Input Table applicant_reference

id  firstname  lastname  email                        city             zipcode  department   gender
1   John       Dewey     [email protected]     Sugar Land       77459    Marketing    Male
2   Sarah      Anders    [email protected]   Pearland         77584    Sales        Female
3   Elizabeth  Hall      [email protected]    Galveston        77550    Engineering  Female
4   James      Nickson   [email protected]        Pasadena         77501    IT           Male
5   Kim        Lee       [email protected]         Clear Lake City  77058    Systems      Female
6   Jessica    Right     [email protected]   Sugar Land       77459    Marketing    Female

The example compares this table with information (including credit scores) from the external source shown
in the following table. This table has missing and incomplete information, as expected with data from
different sources.

Table 1154: IdentityMatch Example Input Table applicant_external

id  firstname  lastname  email                        city        zipcode  department  creditscore
1   John       Dewey     [email protected]     Sugar Land  7774     market      700
2              Hall                                   Galveston   77550    eng         790
3   Sarah      Anders    [email protected]   pear        77584    sales       650
4   Jessica    right                                  Sugar Land  77459    Marketing   690
5   James      Nickson                                Pasadena    7750     IT          620
6   Kim                                               77058                system      570

SQL-MapReduce Call
The objective is to match the information in applicant_external to the applicants in applicant_reference, and thus accurately identify each applicant's credit score. Assume the default threshold of 0.5 (a higher threshold demands closer matches). Look for exact matches (NominalMatchColumns) on the email address, and allow approximate matches (FuzzyMatchColumns) for the lastname, firstname, zipcode, city, and department columns, with different match metrics and match weights.

SELECT * FROM IdentityMatch (
  ON applicant_reference AS a PARTITION BY ANY
  ON applicant_external AS b DIMENSION
  IDColumn ('a.id: b.id')
  NominalMatchColumns ('a.email: b.email')
  FuzzyMatchColumns (
    'a.lastname: b.lastname, JARO-WINKLER, 3',
    'a.firstname: b.firstname, JARO-WINKLER, 2',
    'a.zipcode: b.zipcode, JD, 2',
    'a.city: b.city, LD, 2',
    'a.department: b.department, COSINE, 1'
  )
  Accumulate ('a.firstname', 'a.lastname', 'b.lastname', 'a.email',
    'b.email', 'a.zipcode', 'b.zipcode', 'a.department',
    'b.department', 'b.creditscore')
  Threshold (0.5)
) ORDER BY "a.id", score DESC;

Output
The output table shows the matching information from both input tables, with the similarity score in the last column. If multiple rows exist for the same id, which is typically the case, the row with the higher score gives the best match. For instance, in the output below, the first row is chosen over the second because it is a perfect match (score 1.0000 versus 0.6036). The creditscore column gives the credit score for the applicants.

Table 1155: IdentityMatch Output Table (Columns 1-6)

a.id  a.firstname  a.lastname  a.email                      a.zipcode  a.department
1     John         Dewey       [email protected]     77459      Marketing
1     John         Dewey       [email protected]     77459      Marketing
2     Sarah        Anders      [email protected]   77584      Sales
3     Elizabeth    Hall        [email protected]    77550      Engineering
4     James        Nickson     [email protected]        77501      IT
6     Jessica      Right       [email protected]   77459      Marketing

Table 1156: IdentityMatch Output Table (Columns 7-13)

b.id  b.lastname  b.email                      b.zipcode  b.department  b.creditscore  score
1     Dewey       [email protected]     7774       market        700            1.0000
4     right                                    77459      Marketing     690            0.6036
3     Anders      [email protected]   77584      sales         650            1.0000
2     Hall                                     77550      eng           790            0.7000
5     Nickson                                  7750       IT            620            0.8000
4     right                                    77459      Marketing     690            1.0000
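
If you store the function output in a table (here the hypothetical table identitymatch_results, created, for example, with CREATE TABLE ... AS SELECT ... FROM IdentityMatch (...)), a standard window query can keep only the best match per source row. The following is a sketch:

SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY "a.id" ORDER BY score DESC) AS match_rank
  FROM identitymatch_results
) ranked
WHERE match_rank = 1;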

IPGeo

Summary
IPGeo lets you map IP addresses to location information (country, region, city, latitude, longitude, ZIP code,
and ISP).
You can use the locations of web site visitors to improve the effectiveness of online applications. For
example:
• Targeted online advertising
• Content localization
• Geographic rights management
• Enhanced analytics
• Online security and fraud prevention

For general information about IP databases, see:
http://dev.maxmind.com/geoip/geoip2/geolite2/

Usage

IPGeo Syntax
Version 2.1

SELECT * FROM IPGeo (
  ON input_table
  IPAddressColumn ('ip_address_column')
  [ Converter ('file', 'class') ]
  [ IPDatabaseLocation ('geolocation_DB_loc') ]
  [ Accumulate
    ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
IPAddressColumn Required Specifies the name of the input table column that contains the IP
addresses.
Converter Optional Specifies the JAR filename and the name of the class that converts the
IP address to location information. The JAR file must be installed on
the Aster Database, and the class name must be the fully qualified
name, which includes the package information. The file and class
parameters are case-sensitive.
To use an IP database other than the default, you must supply a
user-defined converter class with the Converter argument; only the
JAR file declared by this argument can be used by the function.
The JAR file must contain all the classes needed by the user-defined
converter. In Aster Database, all of the installed files are stored in
the database. When a function is invoked, only a ZIP/JAR file
consistent with the SQL-MapReduce function name is temporarily
downloaded to the file system to be executed.
To create a new class, refer to Extending IPGeo.
IPDatabaseLocation Optional The location of the IP database that matches IP addresses to
locations. The IP databases can be stored in the file system or in
Aster Database. If the data is stored in a file system, each worker
must have the same path, and the absolute path must be set in this
parameter. If the data is installed in Aster Database, this argument is
ignored.
Accumulate Optional Specifies the names of input table columns to copy to the output
table.

Input
Table 1157: IPGeo Input Table Schema

Column Name        Data Type  Description
ip_address_column  VARCHAR    IP address.
accumulate_column  Any        Column to copy to the output table. Typically, one such column is the user identifier.

Output
Table 1158: IPGeo Output Table Schema

Column Name        Data Type     Description
country_code       VARCHAR       Country code.
country_name       VARCHAR       Country name.
state              VARCHAR       State or region name.
city               VARCHAR       City name.
postal_code        VARCHAR       Postal code.
latitude           DECIMAL(6,4)  Latitude.
longitude          DECIMAL(7,4)  Longitude.
isp                VARCHAR       Name of the ISP.
organization       VARCHAR       Name of the organization that owns this IP address.
organization_type  VARCHAR       Organization type.
area_code          INTEGER       Area code.
metro_code         INTEGER       Metro code.
dma_code           INTEGER       DMA code.

Examples
The examples have the same Input and Output, but different SQL-MapReduce calls.
Examples 1 and 2 require that the file MaxMindLite.jar be installed on the Aster database. This JAR file contains the converter class used in these examples, 'com.asterdata.sqlmr.analytics.location.ipgeo.MaxMindLite'.
Example 3 uses the default converter class that ships with the IPGeo function (GeoLite 1.2.8).
The function requires two database files that map IP addresses to geographic locations (GeoLiteCity.dat and GeoLiteCityv6.dat). In Example 1, the files are referenced by pathname using the IPDatabaseLocation argument and need not be installed on the Aster database. Examples 2 and 3 assume that the two database files have been installed on the Aster database.

Input
Table 1159: IPGeo Example Input Table ipgeo_1

id ip
1 159.41.1.23
2 153.65.16.10
3 75.36.209.106
4 202.106.0.20
5 69.236.77.51
6 168.187.7.114

Example 1: Specify Location of IP Database


The two database files from the GeoLite 1.2.8 database (GeoLiteCity.dat and GeoLiteCityv6.dat) must be present on the queen and on each worker node. The directory in which the files are located is specified by the IpDatabaseLocation argument (in this example, /home/maxmind).

SQL-MapReduce Call

SELECT * FROM IPGeo (
  ON ipgeo_1
  IpAddressColumn ('ip')
  IpDatabaseLocation ('/home/maxmind/')
  Converter ('MaxMindLite.jar',
    'com.asterdata.sqlmr.analytics.location.ipgeo.MaxMindLite')
  Accumulate ('id', 'ip')
) ORDER BY 1;

Example 2: IP Database Stored as Aster Database File


This example assumes that the two database files (GeoLiteCity.dat and GeoLiteCityv6.dat) have been
installed on the Aster database.

SQL-MapReduce Call

SELECT * FROM IPGeo (
  ON ipgeo_1
  IpAddressColumn ('ip')
  Converter ('MaxMindLite.jar',
    'com.asterdata.sqlmr.analytics.location.ipgeo.MaxMindLite')
  Accumulate ('id', 'ip')
) ORDER BY 1;

Example 3: Use Default Maxmind Geolite Database
This example assumes that the two database files (GeoLiteCity.dat and GeoLiteCityv6.dat) have been
installed on the Aster database.

SQL-MapReduce Call

SELECT * FROM IPGeo (
  ON ipgeo_1
  IpAddressColumn ('ip')
  Accumulate ('id', 'ip')
) ORDER BY 1;

Output
Table 1160: IPGeo Example Output Table (Columns 1-7)

id ip country_code country_name state city postal_code


1 159.41.1.23 US United States Michigan Saint Joseph 49085
2 153.65.16.10 US United States Ohio Miamisburg 45342
3 75.36.209.106 US United States California San Francisco
4 202.106.0.20 CN China Beijing Beijing
5 69.236.77.51 US United States California San Francisco
6 168.187.7.114 KW Kuwait Al Kuwayt Kuwait

Table 1161: IPGeo Example Output Table (Columns 8-15)

latitude  longitude  isp  organization  organization_type  area_code  metro_code  dma_code
42.0569   -86.4563                                         269        588         588
39.6182   -84.2488                                         937        542         542
37.7749   -122.4194                                        415        807         807
39.9289   116.3883                                         0          0           0
37.7749   -122.4194                                        415        807         807
29.3697   47.9783                                          0          0           0

Extending IPGeo
Because IPGeo cannot cover all the IP database providers for technical and license reasons, you can extend
this function to support new database providers.

Note:
Only Maxmind GeoLite 1.2.8 ships with this function.

To extend IPGeo:
1. Create a new class that implements the interface Converter. The interface Converter, which is defined in ipgeo.jar, ships with the analytics functions in this release. If you are using Aster Analytics Foundation version 5.10 or later, you can extract ipgeo.jar from ipgeo.zip. The interface Converter is defined as:

package com.asterdata.sqlmr.analytics.location.ipgeo;
public interface Converter
{
/**
* initialize a Converter instance with corresponding resource
* @param ipDatabasePath
*/
void initialize(String ipDatabasePath);
/**
* release resources used by this instance before the SQL-MR function close
*/
void finalize();
/**
* Lookup location information for the input ipv4 address and write the
result to a IpLocation instance
* @param ip
* input, IP address in ipv4 format
* @param ipLocation
* output, to hold the location information
*/
void findIpv4(String ip, IpLocation ipLocation);
/**
*
* Lookup location information for the input ipv6 address and write the
result to a IpLocation instance
* @param ip
* input, IP address in ipv6 format
* @param ipLocation
* output, to hold the location information
*/
void findIpv6(String ip, IpLocation ipLocation);
}

Class IpLocation is designed to hold the location information and emit it (you can also find this
information in the ipgeo.jar file). The code has get and set functions for the following member variables
corresponding to the SQL-MapReduce function output:

private String countryCode = null;
private String countryName = null;
private String state = null;
private String city = null;
private String postalCode = null;
private float latitude = -1;
private float longitude = -1;
private String isp = null;
private String organization = null;
private String organizationType = null;
private int areaCode = -1;
private int metroCode = -1;

private int dmaCode = -1;
}

The following class, MaxMindLite2, is an example converter for the MaxMind GeoLite2 database:

package com.asterdata.sqlmr.analytics.location.ipgeo;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import com.asterdata.ncluster.sqlmr.IllegalUsageException;
import com.asterdata.ncluster.sqlmr.data.InstalledFile;
import com.asterdata.sqlmr.analytics.location.ipgeo.Converter;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;
import com.maxmind.geoip2.model.CityResponse;
/**
* A Converter implementation for MaxMind GeoLite2 version
*/
public class MaxMindLite2 implements Converter
{
private DatabaseReader reader = null;
private static final String CITY_DATABASE = "GeoLite2-City.mmdb";
private String tmpCityDatabase_ = null;
//initialize the databaseReader
public void initialize(String ipDatabasePath)
{
if(ipDatabasePath == null)
{
tmpCityDatabase_ = downloadFile(CITY_DATABASE);
initializeDatabaseReader(tmpCityDatabase_);
}
else
{
String path = ipDatabasePath.endsWith("/") ?
ipDatabasePath + CITY_DATABASE : ipDatabasePath + "/" + CITY_DATABASE ;
initializeDatabaseReader(path);
}
}
//find address according to ipv4 address
public void findIpv4(String ip, IpLocation ipLocation)
{
CityResponse response = null;
try
{
response = reader.city(InetAddress.getByName(ip));
}
catch (UnknownHostException e)
{
// do nothing
}
catch (IOException e)
{
// do nothing
}
catch (GeoIp2Exception e)
{
// do nothing
}
if (response != null) {
setOutput(response, ipLocation);
}
}
//find address according to ipv6 address
public void findIpv6(String ip, IpLocation ipLocation)
{
findIpv4(ip, ipLocation);
}
//release resources
public void finalize()
{
if(reader != null)
{
try
{
reader.close();
}
catch (IOException e)
{
// do nothing
}
}
if(tmpCityDatabase_ != null)
{
new File(tmpCityDatabase_).delete();
}
}
//Set the output to ipLocation from response
private void setOutput(CityResponse response, IpLocation ipLocation)
{
ipLocation.setCountryCode(response.getCountry().getIsoCode());
ipLocation.setCountryName(response.getCountry().getName());
ipLocation.setState(response.getMostSpecificSubdivision().getName());
ipLocation.setCity(response.getCity().getName());
if(null != response.getLocation().getLatitude())
{
ipLocation.setLatitude(response.getLocation().getLatitude().floatValue());
}
if(null != response.getLocation().getLongitude())
{
ipLocation.setLongitude(response.getLocation().getLongitude().floatValue());
}
if(null != response.getLocation().getMetroCode())
{
ipLocation.setMetroCode(response.getLocation().getMetroCode().intValue());
}
ipLocation.setPostalCode(response.getPostal().getCode());
ipLocation.setIsp(response.getTraits().getIsp());
ipLocation.setOrganization(response.getTraits().getOrganization());
}

//save a file installed in Aster to file system
private String downloadFile(String file)
{
BufferedInputStream in = null;
try
{
in = new BufferedInputStream(InstalledFile.getFile(file).getStream());
}
catch (FileNotFoundException e1)
{
throw new IllegalUsageException("CAN'T find the default IP database," +
" please check whether the following file has been installed to Aster: " +
file);
}
String tmpFile = "/tmp/" + file + System.currentTimeMillis();
byte[] buffer = new byte[1024];
BufferedOutputStream out = null;
try
{
out = new BufferedOutputStream(new FileOutputStream(tmpFile));
for(int len = in.read(buffer); len>-1; len=in.read(buffer))
{
out.write(buffer, 0, len);
}
}
catch (FileNotFoundException e)
{
throw new IllegalUsageException("CAN'T create a tmp file: " + tmpFile
+ ". Please check for conflicts in the system");
}
catch (IOException e)
{
throw new IllegalUsageException("CAN'T write to tmp file: " + tmpFile
+ ". Please check if folder /tmp has more than 100M bytes free space.");
}
finally
{
try{
if (in != null) in.close();
if (out != null) out.close();
}
catch(IOException e)
{
//do nothing
}
}
return tmpFile;
}
//Return a DatabaseReader for the specified file
private void initializeDatabaseReader(String file)
{
// A File object pointing to your GeoIP2 or GeoLite2 database
File database = new File(file);
if(database.exists())
{
try {
// This creates the DatabaseReader object, which should be reused across lookups.
reader = new DatabaseReader.Builder(database).build();
} catch (IOException e) {
throw new IllegalUsageException("CAN'T initialize DatabaseReader. See
details:" + e.getMessage());
}
}
else
{
throw new IllegalUsageException("CAN'T find IP database: " + file);
}
}
}
2. Compile the new Converter class and package it in a JAR file with all the dependent libraries.
For this example, you must package this class, and the classes in the following JAR files, to a new JAR file:
• ipgeo.jar
• MaxMind-DB-Reader (from MaxMind)
• jackson-core-2.2.3.jar (this package is required by geolite2; version 2.2.3 is tested)
• jackson-databind-2.2.3.jar
• jackson-annotations-2.2.3.jar
3. Install the JAR file on Aster Database.
4. When calling the IPGeo function, set the JAR filename and class name parameters of the Converter
argument to the names of your JAR file and class.
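
As a sketch of steps 3 and 4 (the JAR name MaxMindLite2.jar and the input table ipgeo_1 are illustrative, and \install is ACT's file-installation command):

\install MaxMindLite2.jar

SELECT * FROM IPGeo (
  ON ipgeo_1
  IpAddressColumn ('ip')
  Converter ('MaxMindLite2.jar',
    'com.asterdata.sqlmr.analytics.location.ipgeo.MaxMindLite2')
  Accumulate ('id', 'ip')
) ORDER BY 1;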

JSONParser

Summary
The JSONParser function extracts element names and values from JSON strings and outputs them as a flattened relational table.
Background
On the Internet, most data is exchanged and processed using JSON or XML, and then displayed using
HTML. These languages are based on named structures that can be nested, using a Unicode-based
representation that is both human-readable and machine-readable.
Each language is optimized for its main function:
• HTML for representing web pages
• XML for representing document data
• JSON for representing programming language structures
In many applications, programmers work with only one of these formats. In others, programmers work with
all three. For traditional programming structures, programming with JSON is significantly easier than
programming with XML.
XQuery is the standard query language for XML, and has been implemented in databases, streaming
processors, data integration platforms, application integration platforms, XML message routing software,
web browser plugins, and other environments. However, there is currently no standard query language for
JSON. To solve this challenge, Teradata Aster created the JSONParser SQL-MapReduce function, which
takes the JSON string as input and parses it as specified by the function arguments, outputting it as a
relational table which can then be queried using SQL.

Usage

JSONParser Syntax
Version 1.5

SELECT * FROM JSONParser (
  ON tablename
  TextColumn ('text_columnname')
  Nodes ('parentnode/childnode' [,...])
  [ SearchPath ('nodename/...') ]
  [ Delimiter ('delimiter_string') ]
  [ MaxItemNum ('number') ]
  [ NodeIDOutputColumn ('columnname') ]
  [ ParentNodeOutputColumn ('columnname') ]
  [ Accumulate
    ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
  [ ErrorHandler ('{true|yes|t|y|1|false|no|f|n|0}
    [; [output_col_name:] input_col_name[,...]]') ]
);

Arguments
Argument Category Description
TextColumn Required Specifies the column name from the input table which
contains the JSON string.
Nodes Required Specifies the parent/children pair. Must contain at least one
parent/child pair, and all pairs specified must be in the same
format. Multiple children can be specified as parent/
{child1,child2,...}.
SearchPath Optional Specifies the path to find the direct value of the child. To
reach the parent of the parent, include the parent of the
parent in this path. When a path to the parent of the parent
is supplied, all the siblings of the parent can be printed by
including them in the NODES argument. If anything from
root is to be parsed, then supply this argument as '/' (or leave
it as an empty string).
Delimiter Optional Specifies the delimiter used to separate multiple child values
with the same name and which have the same parent node
in the JSON String. If not defined, defaults to comma ','.

Note:
The delimiter cannot include '#'.

MaxItemNum Optional The maximum number of nodes with the same name to
display in the output. The default value is 10.
NodeIDOutputColumn Optional The name of the column to use in the result schema to
contain the identifier (from the input table) of each node
extracted. If not defined, defaults to 'out_nodeid'.
ParentNodeOutputColumn Optional The name of the column to use in the result schema to
contain the tag name of the parent node extracted. If not
defined, defaults to 'out_parent_node'.
Accumulate Optional Specifies the input table columns to copy to the output table.
ErrorHandler Optional Specifies how the function acts when it encounters a data
problem. If not specified, the function aborts if the input
table contains bad data (for example, invalid UTF-8
characters).
ErrorHandler lets you specify an “additional” column to
hold any rows that were rejected as having bad data, also
referred to as the output column, in the output table. The
log information in the additional column lets you easily
identify which input table row contains unexpected data.
There are two parameters you can pass to ErrorHandler:
The first parameter tells the function whether to continue
processing if bad data is encountered. 'true' means continue
the processing without aborting. 'false' means abort the
process when an error occurs.
The second group of parameters designates the output and
input columns. The parameters in this group,
output_col_name: input_col_name1, input_col_name2,
input_col_name3,... are optional. If you specify an output
column, it is added to the output, and bad rows are logged
there. If you do not specify output_col_name, the function
uses “ERROR_HANDLER” as the name of the output
column. The error output column includes the data from the
input columns specified using input_col_namex, when an
error occurs. The data inserted into the output column is
merged from input columns and delimited by column using
a semicolon.
Using ErrorHandler('true') without specifying input
columns does not add any data to the output column.

Input
The table used as input must contain a column with JSON data.

Output
A row is output for each node in the JSON string which has its name indicated as a parent node in the Nodes
argument.
The output table contains columns with the node ID, the parent node name, and the children nodes. The
output also contains all columns specified in the Accumulate argument.
• Arrays can be formatted as one of two types in JSON:
∘ parent:[key:value,key:value]
For this type, use 'parent/key' in the Nodes argument
∘ parent[value,value]
For this type, use 'parent/parent' in the Nodes argument
• The root sometimes has a key:value pair, as shown in Example 2 below.
To get the value of such a pair, supply '/key' in the Nodes argument.

Examples
• Example 1: With Nondefault Options
• Example 2: With Default Argument Values
• Example 3: Parsing with Ancestor (Search Path Argument Specified)
• Example 4: Specifying ERROR_HANDLER When Calling JSONParser

Example 1: With Nondefault Options

Input
The input table below is a single JSON record with multiple fields.
Table 1162: JSONParser Example 1 Input Table

id data
1
{"menu": {
"id": "1",
"value": "File",
"popup": {
"menuitem": [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}}

SQL-MapReduce Call
The SQL-MapReduce call uses custom values for the NodeIDOutputColumn and ParentNodeOutputColumn arguments.

SELECT * FROM JSONParser (
  ON json_parser_data
  TextColumn ('data')
  Nodes ('menu/{id,value}', 'menuitem/value')
  Delimiter ('|')
  NodeIdOutputColumn ('Fieldnumber')
  ParentNodeOutputColumn ('ParentName')
  Accumulate ('id')
) ORDER BY 1, 2;

Output
The node values are output as shown below:
Table 1163: JSONParser Example 1 Output Table

id fieldnumber parentname menu:id menu:value menuitem:value


1 1 menu 1 File
1 2 menuitem New|Open|Close

Example 2: With Default Argument Values

Input
This example uses the default values for the NodeIDOutputColumn and ParentNodeOutputColumn arguments. The input table shown below is a single JSON record with multiple fields.
Table 1164: JSONParser Example 2 Input Table

id data
1
{
  "email":"[email protected]",
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language",
            "GlossSeeAlso": ["GML", "XML"]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}

SQL-MapReduce Call

SELECT * FROM JSONParser (
  ON json_parser_data_2
  TextColumn ('data')
  Nodes ('glossary/title', 'GlossDiv/title', 'GlossEntry/Abbrev',
    'GlossSeeAlso/GlossSeeAlso', '/email')
  Delimiter (' , ')
  Accumulate ('id')
  NodeIdOutputColumn ('out_nodeid')
  ParentNodeOutputColumn ('out_parent_node')
) ORDER BY 1, 2;

Output

Table 1165: JSONParser Example 2 Output Table (Columns 1-5)

id out_nodeid out_parent_node glossary:title GlossDiv:title


1 1 glossary example glossary
1 2 GlossDiv S
1 3 GlossEntry
1 4 GlossSeeAlso
1 5

Table 1166: JSONParser Example 2 Output Table (Columns 6-8)

GlossEntry:Abbrev GlossSeeAlso:GlossSeeAlso :email


ISO 8879:1986
GML, XML
[email protected]

Example 3: Parsing with Ancestor (Search Path Argument Specified)

Input
This example uses the same input as Example 2 (Input).
When a specific path is specified in the SearchPath argument, the function looks only for the fields (key-value pairs) within the search path. This example specifies SearchPath ('/glossary/GlossDiv/GlossList').

SQL-MapReduce Call

SELECT * FROM JSONParser (
  ON json_parser_data_2
  TextColumn ('data')
  Nodes ('GlossEntry/ID', 'GlossEntry/SortAs', 'GlossEntry/GlossTerm',
    'GlossEntry/Acronym', '/email')
  SearchPath ('/glossary/GlossDiv/GlossList')
  Delimiter (' | ')
  Accumulate ('id')
  MaxItemNum (10)
) ORDER BY 1, 2;

Output
Because the email field is not included in the specified search path, it returns an empty column.
Table 1167: JSONParser Example 3 Output Table (Columns 1-5)

id out_nodeid out_parent_node GlossEntry:ID GlossEntry:SortAs


1 1 GlossEntry SGML
1 2 GlossEntry SGML
1 3 GlossEntry
1 4 GlossEntry

Table 1168: JSONParser Example 3 Output Table (Columns 6-8)

GlossEntry:GlossTerm GlossEntry:Acronym :email


Standard Generalized Markup Language
SGML

Example 4: Specifying ERROR_HANDLER When Calling JSONParser

Input
The input table below is the same as the input for Example 1 (Input), except that this version has a
formatting error. In this example, the data column is missing a closing quotation mark and a colon after the
“menuitem” field.

Table 1169: JSONParser Example 4 Input Table

id data
1
{"menu": {
"id": "1",
"value": "File",
"popup": {
"menuitem [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}}

SQL-MapReduce Call
The ErrorHandler argument is applied to the data column:

SELECT * FROM JSONParser (
  ON json_parser_data_3
  TextColumn ('data')
  Nodes ('menuitem/value')
  ErrorHandler ('true; data')
) ORDER BY 1, 2;

Output
The ERROR_HANDLER column contains the rejected row, as shown below.
Table 1170: JSONParser Example 4 Output Table (Columns 1-3)

out_nodeid out_parent_node menuitem:value


0

Table 1171: JSONParser Example 4 Output Table (Column 4)

ERROR_HANDLER
{"menu": {
"id": "1",
"value": "File",
"popup": { "menuitem [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}

}};

Multi_Case

Summary
The Multi_Case function extends the capability of the SQL CASE statement by supporting matches to
multiple criteria in a single row.
When SQL CASE finds a match, it outputs the result and immediately proceeds to the next row without
searching for more matches in the current row.
The Multi_Case function iterates through the input data set only once and outputs matches whenever a
match occurs. If multiple matches occur for a given input row, the function outputs one output row for each
match.
Use the Multi_Case function when the conditions in your CASE statement do not form a mutually exclusive
set.
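
As a contrast, a sketch using the standard SQL CASE expression (against the people_age table from the example later in this section) returns at most one label per row, taking the first condition that matches:

SELECT id, name, age,
  CASE
    WHEN age >= 13 AND age <= 19 THEN 'teenager'
    WHEN age >= 16 AND age <= 25 THEN 'young adult'
  END AS label
FROM people_age;

A 17-year-old is labeled only 'teenager' here (and non-matching rows get NULL), whereas Multi_Case emits one output row for each matching label.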

Usage

Multi_Case Syntax
Version 1.1

SELECT * FROM Multi_Case (
  ON (SELECT *, condition AS case [,...]
      FROM { table | view | (query) })
  Labels ('case AS "label"' [,...])
);

Arguments
Argument Category Description
Labels Required Specifies a label for each case. Each case corresponds to a condition, which is
a SQL predicate that includes input column names. When an input value
satisfies condition, that is a match, and the function outputs the input row
and the corresponding label.

Input
Table 1172: Multi_Case Input Table Schema

Column Name   Data Type  Description
input_column  Any        Every column that appears in a condition must appear in the input table. The input table can also have columns that do not appear in any condition. The function copies every input column to the output table.

Output
Table 1173: Multi_Case Output Table Schema

Column Name   Data Type               Description
input_column  Same as in input table  Column copied from the input table.
labels        VARCHAR                 Label that corresponds to the case that the input value matches, in each row. The output table has a row for each match.

Example
This example labels people with the age groups to which they belong, which overlap:
• infant (younger than 1 year)
• toddler (1-2 years, inclusive)
• kid (2-12 years, inclusive)
• teenager (13-19 years, inclusive)
• young adult (16-25 years, inclusive)
• adult (21-40 years, inclusive)
• middle-aged person (35-60 years, inclusive)
• senior citizen (60 years or older)

Input
The input table contains the identifiers, names, and ages of people. The ages range from 0.5 years (6 months)
to 65 years.
Table 1174: Multi_Case Example Input Table people_age

id  name            age
1   John            0.5
2   Freddy          2
3   Marie           6
4   Tom Sawyer      17
5   Becky Thatcher  16
6   Philip          22
7   Joseph          25
8   Roger           35
9   Natalie         30
10  Henry           40
11  George          50
12  Sir William     65

SQL-MapReduce Call

SELECT * FROM Multi_Case (
  ON (SELECT *,
      (age < 1) AS case1,
      (age >= 1 AND age <= 2) AS case2,
      (age >= 2 AND age <= 12) AS case3,
      (age >= 13 AND age <= 19) AS case4,
      (age >= 16 AND age <= 25) AS case5,
      (age >= 21 AND age <= 40) AS case6,
      (age >= 35 AND age <= 60) AS case7,
      (age >= 60) AS case8
    FROM people_age
  )
  Labels (
    'case1 AS "infant"',
    'case2 AS "toddler"',
    'case3 AS "kid"',
    'case4 AS "teenager"',
    'case5 AS "young adult"',
    'case6 AS "adult"',
    'case7 AS "middle aged person"',
    'case8 AS "senior citizen"'
  )
) ORDER BY id;

Output
Several people have two labels. For example, Freddy is both a toddler and a kid, and Tom Sawyer and Becky
Thatcher are both teenagers and young adults.
Table 1175: Multi_Case Example Output Table

id  name            age  labels
1   John            0.5  infant
2   Freddy          2    toddler
2   Freddy          2    kid
3   Marie           6    kid
4   Tom Sawyer      17   teenager
4   Tom Sawyer      17   young adult
5   Becky Thatcher  16   teenager
5   Becky Thatcher  16   young adult
6   Philip          22   young adult
6   Philip          22   adult
7   Joseph          25   young adult
7   Joseph          25   adult
8   Roger           35   adult
8   Roger           35   middle aged person
9   Natalie         30   adult
10  Henry           40   adult
10  Henry           40   middle aged person
11  George          50   middle aged person
12  Sir William     65   senior citizen

MurmurHash

Summary
The MurmurHash function computes the hash values of the input columns.

Background
MurmurHash is a noncryptographic hash function suitable for hash-based searching. The function
computes the MurmurHash value of each column value in each row of the input table.
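
Because equal values always produce equal hash values, one common use is to spread rows into a fixed number of buckets. The following query is a sketch (the bucket count 16 is arbitrary, and murmurhash_input is the example input table defined later in this section):

SELECT id, ABS(city_text_murmurhash) % 16 AS bucket
FROM (
  SELECT * FROM MurmurHash (
    ON murmurhash_input
    InputColumns ('city_text')
    Accumulate ('id')
  )
) AS h;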

Usage

MurmurHash Syntax
Version 1.1

SELECT * FROM MurmurHash (
  ON { table | view | (query) }
  InputColumns ({ column | column_range }[,...])
  [ HashBit ({ '32' | '64' }) ]
  [ Accumulate
    ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
InputColumns Required Specifies the names of the input table columns for which to calculate
hash values.

Note:
NULL values in the input columns are output as NULL.

HashBit Optional Specifies whether the function generates 32-bit hash values (the default)
or 64-bit hash values.
Accumulate Optional Specifies the names of the input table columns to copy to the output
table.

Input
Table 1176: MurmurHash Input Table Schema

Column Name        Data Type  Description
column             Any        Column for which to calculate hash values.
accumulate_column  Any        Column to copy to the output table.

Output
Table 1177: MurmurHash Output Table Schema

Column Name        Data Type                                              Description
accumulate_column  Same as in input table                                 Column copied from input table.
column_murmurhash  INTEGER for HashBit ('32'), BIGINT for HashBit ('64')  Contains the hash value for the input column column.

1226 Teradata Aster Analytics Foundation User Guide


Chapter 13: Data Transformation
MurmurHash
Examples
• Input
• Example 1: 32-Bit Hash Value (by Default)
• Example 2: 64-Bit Value (Specified)

Input
The input table is a log of midnight temperatures (in degrees Fahrenheit) for five consecutive nights in five
cities. The rows of all columns except id are to be converted to hash values. Because the hash value depends
on data type, the input table has three columns each, with different data types, for the city name, time
period, and temperature:
• Columns 2-4 contain the city name in data types BYTEA, VARCHAR, and TEXT.
• Columns 5-7 contain the time period in data types TIMESTAMP, TEXT, and DATE.
• Columns 8-10 contain the time period in data types DOUBLE PRECISION, INTEGER, and TEXT.
Table 1178: MurmurHash Examples Input Table murmurhash_input, Columns 1-6

id  city_bytea   city_varchar  city_text    period_timestamp     period_text
1   Asheville    Asheville     Asheville    2008-10-10 00:00:00  10-10-2008 00:00:00
2   Greenville   Greenville    Greenville   2008-10-11 00:00:00  10-11-2008 00:00:00
3   Brownsville  Brownsville   Brownsville  2008-10-12 00:00:00  10-12-2008 00:00:00
4   Nashville    Nashville     Nashville    2008-10-13 00:00:00  10-13-2008 00:00:00
5   Knoxville    Knoxville     Knoxville    2008-10-14 00:00:00  10-14-2008 00:00:00

Table 1179: MurmurHash Examples Input Table murmurhash_input, Columns 7-10

period_date  temp_f_real  temp_f_integer  temp_f_text
2008-10-10   34.9         35              34.9
2008-10-11   34.4         34              34.4
2008-10-12   34           34              34
2008-10-13   35.6         36              35.6
2008-10-14   32           32              32

Example 1: 32-Bit Hash Value (by Default)

SQL-MapReduce Call

SELECT * FROM MurmurHash (
  ON murmurhash_input
  InputColumns ('city_bytea', '[2:8]', 'temp_f_text')
  Accumulate ('id')
) ORDER BY 1;

Note:
For the InputColumns argument, columns are numbered 0, 1, 2, and so on (not 1, 2, 3, and so on). For example, '[2:8]' specifies the columns city_varchar through temp_f_integer of the input table.

Output
The hash values for each city name in the output table are the same, regardless of data type, but the hash
values for each time period are different for each data type.
When the temperature value in the input table is an integer (as in rows 3 and 5 of Input), the hash values are
the same for INTEGER and TEXT, but different for REAL. When the temperature value in the input table is
real, the hash values are the same for REAL and TEXT, but different for INTEGER.
Table 1180: MurmurHash Example 1 Output Table, Columns 1-4

id city_bytea_murmurhash city_varchar_murmurhash city_text_murmurhash


1 -548788049 -548788049 -548788049
2 595880669 595880669 595880669
3 773070680 773070680 773070680
4 1115825340 1115825340 1115825340
5 -1362387812 -1362387812 -1362387812

Table 1181: MurmurHash Example 1 Output Table, Columns 5-7

period_timestamp_murmurhash period_text_murmurhash period_date_murmurhash


1962681181 -463297848 557349452
-1741097193 1584740227 -1509759604
1530898040 1693067691 -122411470
-1426854377 -2116413651 1981218164
-1560585364 2011908706 1085005088

Table 1182: MurmurHash Example 1 Output Table, Columns 8-10

temp_f_real_murmurhash temp_f_integer_murmurhash temp_f_text_murmurhash


491855154 -392240485 491855154

-499861358 -2026295078 -499861358
-141326552 -2026295078 -2026295078
138771797 -736537098 138771797
1047067045 -902858061 -902858061

Example 2: 64-Bit Value (Specified)


This example shows how specifying HashBit('64') affects the hash values. As in the 32-bit value example,
when the temperature value in the input table is an integer (as in rows 3 and 5 of Input), the hash values are
the same for INTEGER and TEXT, but different for REAL. When the temperature value in the input table is
real, the hash values are the same for REAL and TEXT, but different for INTEGER.

SQL-MapReduce Call

SELECT * FROM MurmurHash (
  ON murmurhash_input
  InputColumns ('city_bytea', '[2:8]', 'temp_f_text')
  HashBit ('64')
  Accumulate ('id')
) ORDER BY 1;

Output

Table 1183: MurmurHash Example 2 Output Table, Columns 1-4

id city_bytea_murmurhash city_varchar_murmurhash city_text_murmurhash


1 -2851230093024540625 -2851230093024540625 -2851230093024540625
2 -1558815564433480061 -1558815564433480061 -1558815564433480061
3 -2986357434795741281 -2986357434795741281 -2986357434795741281
4 6841491015964275489 6841491015964275489 6841491015964275489
5 -1532140499900981976 -1532140499900981976 -1532140499900981976

Table 1184: MurmurHash Example 2 Output Table, Columns 5-7

period_timestamp_murmurhash period_text_murmurhash period_date_murmurhash


2081499398607021206 1892844779878238375 -3374208292985920001
515846716270079702 -7113059074592314425 7968959960480043948
-1513387057007090709 5003897662396662390 1794743325310104289
5181116076770007143 -3056768082299206123 6786124399480028630
7275307893407845346 2474048315384364200 1414589266172155885

Table 1185: MurmurHash Example 2 Output Table, Columns 8-10

temp_f_real_murmurhash temp_f_integer_murmurhash temp_f_text_murmurhash


1293995461165576289 6315281623848420264 1293995461165576289
-3815305352492785825 2118418006307321788 -3815305352492785825
1044130231958811690 2118418006307321788 2118418006307321788
1386486412937455134 7355370209912386178 1386486412937455134
8029796110320680006 -5952800379241158136 -5952800379241158136

OutlierFilter

Summary
The OutlierFilter function filters outliers from a numeric data set, either deleting them or replacing them
with a specified value. Optionally, the function stores the outliers in their own table. The methods that the
function provides for filtering outliers are:
• Percentile
• Tukey’s method
• Carling’s modification
• Median absolute deviation (MAD)
The input data set is expected to have as many as millions of attribute-value pairs.

Usage

OutlierFilter Syntax
Version 1.3

SELECT * FROM OutlierFilter (
  ON (SELECT 1)
  PARTITION BY 1
  [ Domain ('host:port') ]
  [ Database ('db_name') ]
  [ UserID ('user_id') ]
  [ Password ('password') ]
  [ SSLSettings ('SSLsettings') ]
  [ SSLTrustStorePassword ('SSLtruststorepassword') ]
  InputTable ('input_table')
  OutputTable ('output_table')
  TargetColumn ({ 'target_column' | 'target_column_range' }[,...])
  [ OutlierTable ('outlier_table') ]
  [ GroupByColumns
    ({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
  [ Method (method [,...]) ]
  [ ApproxPercentile
    ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
  [ PercentileThreshold ('perc_lower', 'perc_upper') ]
  [ PercentileAccuracy ('accuracy') ]
  [ IQRMultiplier ('k') ]
  [ RemoveTail ({ 'both' | 'upper' | 'lower' }) ]
  [ ReplacementValue ({ 'delete' | 'null' | 'median' | 'newval' }) ]
  [ MADScaleConstant ('constant') ]
  [ MADThreshold ('madlimit') ]
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
InputTable Required Specifies the name of the table that contains the numeric data to be filtered
and (optionally) the columns by which to group the data.
OutputTable Required Specifies the name of the table where the function stores the copy of the
input table (including the PARTITION BY column) with the outliers
either deleted (by default) or replaced (as specified by the
ReplacementValue argument).
TargetColumn Required Specifies the names of the input table columns to be filtered.
OutlierTable Optional Specifies the name of the table where the function outputs copies of the
rows of the input table that contain outliers.
GroupByColumns Optional Specifies the names of the input table columns by which to group the data.
If the data schema format is name:value, then this list must include name.
Method Optional Specifies the method or methods of filtering outliers:
• 'percentile' (default)
• 'tukey' (Tukey’s method)
• 'carling' (Carling’s modification)
• 'MAD-median' (Median absolute deviation (MAD))
MAD is defined as the median of the absolute values of the residuals:
if the data points are xi and the median of the data is M, then
MAD = median_i(|x_i - M|).
Specify either one method, which the function uses for all columns
specified by TargetColumn, or specify a method for each column specified
by TargetColumn.
ApproxPercentile Optional Specifies whether the function calculates the percentiles used as filter
limits exactly. The default value is 'false'.
Approximate percentiles are typically faster, but might fail when the
number of groups exceeds one million.

PercentileThreshold Optional Specifies the range of percentile values for 'percentile' filtering,
[perc_lower, perc_upper]. The default filter range is [5, 95].
PercentileAccuracy Optional Specifies the accuracy of percentiles used for filtering. The default value is
0.5%.
IQRMultiplier Optional Specifies the multiplier of interquartile range for 'tukey' filtering. The
default value is 1.5.
RemoveTail Optional Specifies the side of the distribution to filter. The default value is 'both'.
ReplacementValue Optional Specifies how the function handles outliers:
• 'delete' (default)
The function does not copy the row to the output table.
• 'null'
The function copies the row to the output table, replacing each outlier
with the value NULL.
• 'median'
The function copies the row to the output table, replacing each outlier
with the median value for its group.
• newval
The function copies the row to the output table, replacing each outlier
with newval, which must be a numeric value.

MADScaleConstant Optional Specifies the scale constant used with 'MAD-median' filtering; a DOUBLE
PRECISION value. The default value is 1.4826, which means MAD =
1.4826 * median(|x - median(x)|).
MADThreshold Optional Specifies the threshold used with 'MAD-median' filtering; a DOUBLE
PRECISION value. The default value is 3, which means that any x for which
|x - median(x)|/MAD > 3 is flagged as an outlier.

Input
The input table must have one column that contains numeric data to be filtered for outliers, and you must
specify its name with the TargetColumn argument. The following table describes the input table columns
that you can specify with the TargetColumn and GroupByColumns arguments. The input table can have
other columns, but the function ignores them.
Table 1186: OutlierFilter Input Table Schema

Column Name Data Type Description


target_column BYTEINT, SMALLINT, INTEGER, BIGINT, NUMERIC, or DOUBLE PRECISION Contains the data to be filtered.
group_by_column Any Column by which to group the data.

Output
The function outputs a message, an output table, and (optionally) an outlier table. The output table and
optional outlier table have the same schema as the input table (that is, every column in the input table
appears in the output and outlier tables).
Table 1187: OutlierFilter Output Message Schema

Column Name Data Type Description


message VARCHAR Reports the names of the tables that the function created.

Examples
• Input
• Example 1: Method ('percentile'), ReplacementValue ('null')
• Example 2: Method ('MAD-median'), ReplacementValue ('median')

Input
The input table contains a time series of atmospheric pressure readings (in mbar) for five cities.
Table 1188: OutlierFilter Examples Input Table ville_pressuredata

sn city period pressure_mbar


1 Asheville 2010-01-01 00:00:00 1020.5
2 Asheville 2010-01-01 01:00:00 9000
3 Asheville 2010-01-01 02:00:00 1020
4 Asheville 2010-01-01 03:00:00 10000
5 Asheville 2010-01-01 04:00:00 1020.2
6 Asheville 2010-01-01 05:00:00 1020
7 Asheville 2010-01-01 06:00:00 1020.3
8 Asheville 2010-01-01 07:00:00 1020.8
9 Asheville 2010-01-01 08:00:00 1020.3
10 Asheville 2010-01-01 09:00:00 1020.7
... ... ... ...
25 Greenville 2010-01-01 00:00:00 1020.6
26 Greenville 2010-01-01 01:00:00 9000
27 Greenville 2010-01-01 02:00:00 1020.1
28 Greenville 2010-01-01 03:00:00 10000
29 Greenville 2010-01-01 04:00:00 1020.2
30 Greenville 2010-01-01 05:00:00 1020

... ... ... ...
49 Brownsville 2010-01-01 00:00:00 1020.5
50 Brownsville 2010-01-01 01:00:00 9000
51 Brownsville 2010-01-01 02:00:00 1020
52 Brownsville 2010-01-01 03:00:00 10000
53 Brownsville 2010-01-01 04:00:00 1020.2
54 Brownsville 2010-01-01 05:00:00 1020
... ... ... ...
73 Nashville 2010-01-01 00:00:00 1020.4
74 Nashville 2010-01-01 01:00:00 9000
75 Nashville 2010-01-01 02:00:00 1019.9
76 Nashville 2010-01-01 03:00:00 10000
77 Nashville 2010-01-01 04:00:00 1020.1
78 Nashville 2010-01-01 05:00:00 1019.9
... ... ... ...
97 Knoxville 2010-01-01 00:00:00 1020.4
98 Knoxville 2010-01-01 01:00:00 9000
99 Knoxville 2010-01-01 02:00:00 1019.9
100 Knoxville 2010-01-01 03:00:00 10000
101 Knoxville 2010-01-01 04:00:00 1020
102 Knoxville 2010-01-01 05:00:00 1019.9
... ... ... ...

Example 1: Method ('percentile'), ReplacementValue ('null')

SQL-MapReduce Call

SELECT * FROM OutlierFilter (


ON (SELECT 1)
PARTITION BY 1
InputTable ('ville_pressuredata')
OutputTable ('of_output1')
TargetColumn ('pressure_mbar')
Method ('percentile')
PercentileThreshold ('1','90')
RemoveTail ('both')
ReplacementValue ('null')

GroupByColumns ('city')
);

Output

Table 1189: OutlierFilter Example 1 Output Message

message
Created table "of_output1"

This query returns the following table:

SELECT * FROM of_output1 ORDER BY 1;

The outlying values have been replaced by NULL.


Table 1190: OutlierFilter Example 1 Output Table of_output1

sn city period pressure_mbar


1 Asheville 2010-01-01 00:00:00 1020.5
2 Asheville 2010-01-01 01:00:00 NULL
3 Asheville 2010-01-01 02:00:00 1020
4 Asheville 2010-01-01 03:00:00 NULL
5 Asheville 2010-01-01 04:00:00 1020.2
6 Asheville 2010-01-01 05:00:00 1020
7 Asheville 2010-01-01 06:00:00 1020.3
8 Asheville 2010-01-01 07:00:00 1020.8
9 Asheville 2010-01-01 08:00:00 1020.3
10 Asheville 2010-01-01 09:00:00 1020.7
11 Asheville 2010-01-01 10:00:00 NULL
12 Asheville 2010-01-01 11:00:00 1022
13 Asheville 2010-01-01 12:00:00 1021.1
14 Asheville 2010-01-01 13:00:00 1020
15 Asheville 2010-01-01 14:00:00 1019.3
... ... ... ...

Example 2: Method ('MAD-median'), ReplacementValue ('median')

SQL-MapReduce Call

SELECT * FROM OutlierFilter (


ON (SELECT 1)
PARTITION BY 1
InputTable ('ville_pressuredata')
OutputTable ('of_output2')
TargetColumn ('pressure_mbar')
OutlierTable ('of_outlier2')
ReplacementValue ('median')
Method ('MAD-median')
MADScaleConstant ('1.4826')
MADThreshold ('3')
GroupByColumns ('city')
);

Output

Table 1191: OutlierFilter Example 2 Output Message

message
Created tables "of_output2","of_outlier2"

This query returns the following table:

SELECT * FROM of_output2 ORDER BY 1;

The outlying values have been replaced with the median value for the group.
Table 1192: OutlierFilter Example 2 Output Table of_output2

sn city period pressure_mbar


1 Asheville 2010-01-01 00:00:00 1020.5
2 Asheville 2010-01-01 01:00:00 1020.5
3 Asheville 2010-01-01 02:00:00 1020
4 Asheville 2010-01-01 03:00:00 1020.5
5 Asheville 2010-01-01 04:00:00 1020.2
6 Asheville 2010-01-01 05:00:00 1020
7 Asheville 2010-01-01 06:00:00 1020.3
8 Asheville 2010-01-01 07:00:00 1020.8
9 Asheville 2010-01-01 08:00:00 1020.3
10 Asheville 2010-01-01 09:00:00 1020.7
11 Asheville 2010-01-01 10:00:00 1022.1

12 Asheville 2010-01-01 11:00:00 1022
13 Asheville 2010-01-01 12:00:00 1021.1
14 Asheville 2010-01-01 13:00:00 1020
15 Asheville 2010-01-01 14:00:00 1019.3
... ... ... ...

This query returns the following table:

SELECT * FROM of_outlier2 ORDER BY 1;

Table 1193: OutlierFilter Example 2 Output Table of_outlier2

sn city period pressure_mbar


2 Asheville 2010-01-01 01:00:00 9000
4 Asheville 2010-01-01 03:00:00 10000
26 Greenville 2010-01-01 01:00:00 9000
28 Greenville 2010-01-01 03:00:00 10000
50 Brownsville 2010-01-01 01:00:00 9000
52 Brownsville 2010-01-01 03:00:00 10000
74 Nashville 2010-01-01 01:00:00 9000
76 Nashville 2010-01-01 03:00:00 10000
98 Knoxville 2010-01-01 01:00:00 9000
100 Knoxville 2010-01-01 03:00:00 10000
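
The 'tukey' and 'carling' methods follow the same call pattern. The following is a minimal sketch (not one of the documented examples) of Tukey filtering on the same input; the output table name of_output3 is illustrative, and the IQRMultiplier value shown is the default. In Tukey filtering, values outside [Q1 - k*IQR, Q3 + k*IQR], where k is the IQRMultiplier value, are treated as outliers.

SELECT * FROM OutlierFilter (
ON (SELECT 1)
PARTITION BY 1
InputTable ('ville_pressuredata')
OutputTable ('of_output3')
TargetColumn ('pressure_mbar')
Method ('tukey')
IQRMultiplier ('1.5')
RemoveTail ('both')
GroupByColumns ('city')
);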

Pack

Summary
The Pack function takes data from multiple input columns and packs it into a single column. The packed
column has a virtual column for each input column. By default, virtual columns are separated by commas
and each virtual column value is labeled with its column name.
Pack is complementary to the function Unpack, but you can use it on any input columns that meet the input
requirements.
Before packing columns, note their data types—you need them if you want to unpack the packed column.

Usage

Pack Syntax
Version 1.2

SELECT * FROM Pack (


ON { table_name| view_name| (query) }
[ InputColumns ({ 'input_column' | 'input_column_range' }[,... ]) ]
[ Delimiter ('delimiter') ]
[ IncludeColumnName
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
OutputColumn ('output_column')
);

Arguments
Argument Category Description
InputColumns Optional Specifies the names of the input columns to pack into a
single output column. These names become the column
names of the virtual columns. By default, all input table
columns are packed into a single output column. If you
specify this argument, but do not specify all input table
columns, the function copies the unspecified input table
columns to the output table.
Delimiter Optional Specifies the delimiter (a string) that separates the virtual
columns in the packed data. The default delimiter is comma
(,).
IncludeColumnName Optional Specifies whether to label each virtual column value with its
column name (making the virtual column
'input_column:value'). The default value is 'true'.
OutputColumn Required Specifies the name to give to the packed output column.

Input
Table 1194: Pack Input Table Schema

Column Data Type Description
input_column Any Column to pack, with other input columns, into a single output
column.
other_input_column Any Column to copy to the output table.

Output
Table 1195: Pack Output Table Schema

Column Data Type Description


output_column VARCHAR Packed column.
other_input_column Any Column copied from the input table.

Examples
• Input
• Example 1: Default Options
• Example 2: Nondefault Options

Input
The input table contains temperature readings for the cities Nashville and Knoxville, in the state of
Tennessee.
Table 1196: Pack Examples Input Table ville_temperature

sn city state period temp_f


1 Nashville Tennessee 2010-01-01 00:00:00 35.1
2 Nashville Tennessee 2010-01-01 01:00:00 36.2
3 Nashville Tennessee 2010-01-01 02:00:00 34.5
4 Nashville Tennessee 2010-01-01 03:00:00 33.6
5 Nashville Tennessee 2010-01-01 04:00:00 33.1
6 Knoxville Tennessee 2010-01-01 03:00:00 33.2
7 Knoxville Tennessee 2010-01-01 04:00:00 32.8
8 Knoxville Tennessee 2010-01-01 05:00:00 32.4
9 Knoxville Tennessee 2010-01-01 06:00:00 32.2
10 Knoxville Tennessee 2010-01-01 07:00:00 32.4

Example 1: Default Options


This example specifies the default options for Delimiter and IncludeColumnName.

SQL-MapReduce Call

SELECT * FROM Pack (


ON ville_temperature
Delimiter (',')
OutputColumn ('packed_data')

IncludeColumnName ('true')
InputColumns ('city', 'state', 'period', 'temp_f')
) ORDER BY 2;

Output
The columns specified by InputColumns are packed in the column packed_data. Virtual columns are
separated by commas, and each virtual column value is labeled with its column name. The input column sn,
which was not specified by InputColumns, is unchanged in the output table.
Table 1197: Pack Example 1 Output

packed_data sn
city:Nashville,state:Tennessee,period:2010-01-01 00:00:00,temp_f:35.1 1
city:Nashville,state:Tennessee,period:2010-01-01 01:00:00,temp_f:36.2 2
city:Nashville,state:Tennessee,period:2010-01-01 02:00:00,temp_f:34.5 3
city:Nashville,state:Tennessee,period:2010-01-01 03:00:00,temp_f:33.6 4
city:Nashville,state:Tennessee,period:2010-01-01 04:00:00,temp_f:33.1 5
city:Knoxville,state:Tennessee,period:2010-01-01 03:00:00,temp_f:33.2 6
city:Knoxville,state:Tennessee,period:2010-01-01 04:00:00,temp_f:32.8 7
city:Knoxville,state:Tennessee,period:2010-01-01 05:00:00,temp_f:32.4 8
city:Knoxville,state:Tennessee,period:2010-01-01 06:00:00,temp_f:32.2 9
city:Knoxville,state:Tennessee,period:2010-01-01 07:00:00,temp_f:32.4 10

Example 2: Nondefault Options


This example specifies the pipe character (|) for Delimiter and 'false' for IncludeColumnName.

SQL-MapReduce Call

SELECT * FROM Pack (


ON ville_temperature
Delimiter ('|')
OutputColumn ('packed_data')
IncludeColumnName ('false')
InputColumns ('city', 'state', 'period', 'temp_f')
) ORDER BY 2;

Output
Virtual columns are separated by pipe characters and not labeled with their column names.

Table 1198: Pack Example 2 Output

packed_data sn
Nashville|Tennessee|2010-01-01 00:00:00|35.1 1
Nashville|Tennessee|2010-01-01 01:00:00|36.2 2
Nashville|Tennessee|2010-01-01 02:00:00|34.5 3
Nashville|Tennessee|2010-01-01 03:00:00|33.6 4
Nashville|Tennessee|2010-01-01 04:00:00|33.1 5
Knoxville|Tennessee|2010-01-01 03:00:00|33.2 6
Knoxville|Tennessee|2010-01-01 04:00:00|32.8 7
Knoxville|Tennessee|2010-01-01 05:00:00|32.4 8
Knoxville|Tennessee|2010-01-01 06:00:00|32.2 9
Knoxville|Tennessee|2010-01-01 07:00:00|32.4 10

Pivot

Summary
The Pivot function pivots data that is stored in rows into columns. The function takes as input a table of data
to be pivoted and constructs the output schema based on the values of its arguments. The function handles
NULL values automatically.
The reverse of this function is Unpivot.

Usage

Pivot Syntax
Version 1.5

SELECT * FROM pivot (


ON input_table PARTITION BY partition_column[,...]
[ ORDER BY order_column]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password')]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
PartitionColumns
({ 'partition_column' | 'partition_column_range' }[,...])
{ NumberOfRows ('number_of_rows') |
PivotColumn ('pivot_column')

[ PivotKeys ('pivot_key' [,...]) ]
}
TargetColumns ({ 'target_column' | 'target_column_range' }[,...])
);

Note:
For information about the authentication arguments, refer to the following topics:
• Connecting to Aster Database Using Authentication Cascading
• Connecting to Aster Database Using SSL JDBC Connections

Arguments
Argument Category Description
PartitionColumns Required Specifies the same columns as the PARTITION BY clause (in any
order).
NumberOfRows Optional Specifies the maximum number of rows in any partition. If a partition
has fewer than number_of_rows rows, then the function adds NULL
values; if a partition has more than number_of_rows rows, then the
function omits the extra rows.
If you omit this argument, then you must specify the PivotColumn
argument.

Note:
With this argument, the ORDER BY clause is optional. If omitted,
the order of values can vary. The function adds NULL values at the
end.

PivotColumn Optional Specifies the name of the column that contains the pivot keys. If the
pivot column contains numeric values, then the function casts them to
VARCHAR.
If you omit the NumberOfRows argument, then you must specify this
argument.

Note:
If you specify the PivotColumn argument, then you must order the
input data; otherwise, the output table column content is
nondeterministic. For details, see Ordering Input Data.

PivotKeys Optional If you specify the PivotColumn argument, then this argument specifies
the names of the pivot keys. Do not use this argument without the
PivotColumn argument.
If pivot_column contains a value that is not specified as a pivot_key,
then the function ignores the row containing that value (see Example
2: Specify Pivot Keys).
By default, every unique value in pivot_column is a pivot key (see
Example 3: Use Default Pivot Keys).
TargetColumns Required Specifies the names of the input columns that contain the values to
pivot.

Input
The Pivot function has one required input table, which contains the data to be pivoted. The following table
describes the required input table columns.
Table 1199: Pivot Input Table Schema

Column Name Data Type Description
partition_column Any Column on which the input data is partitioned.
metric_column Any Contains the values to pivot.

Ordering Input Data


If you do not specify the PivotColumn and PivotKeys arguments, then you must order the input data;
otherwise, the Pivot function output is nondeterministic with respect to the contents of the columns in the
output table.
For example, suppose that you want to pivot this table:
Table 1200: Pivot Input Table input_table_1

Id val
A x
A y
B w
B z

SELECT * FROM pivot (


ON (SELECT id, val FROM input_table_1) PARTITION BY id
PartitionColumns ('id')
NumberOfRows (3)
TargetColumns ('val')
);

Each time you make the preceding call, the output table can be any of the following:
Table 1201: Possible Pivot Output Table 1

Id val_0 val_1
A x y
B w z

Table 1202: Possible Pivot Output Table 2

Id val_0 val_1
A y x
B w z

Table 1203: Possible Pivot Output Table 3

Id val_0 val_1
A x y
B z w

Table 1204: Possible Pivot Output Table 4

Id val_0 val_1
A y x
B z w

Now suppose that you want to pivot this table:


Table 1205: Pivot Input Table input_table_2

Id val sequencenum
A x 4
A y 2
B w 9
B z 3

When you call Pivot, you order the input by sequencenum:

SELECT * FROM pivot (


ON (SELECT id, val FROM input_table_2 ORDER BY sequencenum)
PARTITION BY id
PartitionColumns ('id')
NumberOfRows (3)
TargetColumns ('val')
);

Every time you use the preceding call, you get this result:
Table 1206: Pivot Output Table for Ordered Input Data

Id val_0 val_1
A y x
B z w

Output
Table 1207: Pivot Output Table Schema

Column Name Data Type Description
partition_column Any Column on which the input data is partitioned. If there are p partition
columns and partition column i has v_i distinct values, then the output
table has p*v_i rows.
metric_column_i Any If you specify the NumberOfRows argument, then i is the number of
an input row in a partition; otherwise i is a pivot_key.
For example, if you specify NumberOfRows(3), then the output table
has the columns metric_column_0, metric_column_1, and
metric_column_2; if you specify PivotKeys('a', 'b'), then the output
table has the columns metric_column_a and metric_column_b.
The column metric_column_i contains the value of metric_column for
partition row or pivot key i.

Examples
• Input
• Example 1: Specify Maximum Number of Rows in Any Partition
• Example 2: Specify Pivot Keys
• Example 3: Use Default Pivot Keys

Input
The input table contains temperature, pressure, and dewpoint data for three cities, in sparse format.
Table 1208: Pivot Examples Input Table pivot_input

sn city week attribute value


1 Asheville 1 temp 32
1 Asheville 1 pressure 1020.8
1 Asheville 1 dewpoint 27.6F
2 Asheville 2 temp 32
2 Asheville 2 pressure 1021.3
2 Asheville 2 dewpoint 27.4F
3 Asheville 3 temp 34
3 Asheville 3 pressure 1021.7
3 Asheville 3 dewpoint 28.2F
4 Nashville 1 temp 42

4 Nashville 1 pressure 1021
4 Nashville 1 dewpoint 29.4F
5 Nashville 2 temp 44
5 Nashville 2 pressure 1019.8
5 Nashville 2 dewpoint 29.2F
6 Brownsville 2 temp 47
6 Brownsville 2 pressure 1019
6 Brownsville 2 dewpoint 28.9F
7 Brownsville 3 temp 46
7 Brownsville 3 pressure 1019.2
7 Brownsville 3 dewpoint 28.9F

Example 1: Specify Maximum Number of Rows in Any Partition

SQL-MapReduce Call

SELECT * FROM pivot (


ON pivot_input
PARTITION BY sn, city, week
ORDER BY week
PartitionColumns ('sn', 'city', 'week')
NumberOfRows (3)
TargetColumns ('value')
) ORDER BY 1,2,3;

Note:
The ORDER BY clause is optional. If omitted, the order of values can vary. The function always adds any
NULL values at the end.

Output
To create the output table, the function pivots the input table on the partition columns (sn, city, and week)
and outputs the contents of the target column (value) in dense format in the output columns value_0,
value_1, and value_2, which contain the temperature, pressure, and dewpoint, respectively.
Table 1209: Pivot Example 1 Output Table

sn city week value_0 value_1 value_2


1 Asheville 1 32 1020.8 27.6F
2 Asheville 2 32 1021.3 27.4F
3 Asheville 3 34 1021.7 28.2F

4 Nashville 1 42 1021 29.4F
5 Nashville 2 44 1019.8 29.2F
6 Brownsville 2 47 1019 28.9F
7 Brownsville 3 46 1019.2 28.9F

Example 2: Specify Pivot Keys


This example specifies the pivot keys; that is, it specifies both the PivotColumn and PivotKeys arguments.
Because the pivot column (attribute) contains a value that is not a pivot key (dewpoint), the function ignores
input rows that contain that value.

SQL-MapReduce Call

SELECT * FROM pivot (


ON pivot_input
PARTITION BY sn, city, week
ORDER BY week
PartitionColumns ('sn', 'city', 'week')
PivotKeys ('temp', 'pressure')
PivotColumn ('attribute')
TargetColumns ('value')
) ORDER BY 1,2,3;

Note:
The ORDER BY clause is required. If omitted, the output table column content is nondeterministic.

Output
The function outputs the contents of the input column value in dense format in the output columns
value_temp and value_pressure, which contain the temperature and pressure, respectively. Because these
values are numeric, the function casts them to VARCHAR.
Table 1210: Pivot Example 2 Output Table

sn city week value_pressure value_temp


1 Asheville 1 1020.8 32
2 Asheville 2 1021.3 32
3 Asheville 3 1021.7 34
4 Nashville 1 1021 42
5 Nashville 2 1019.8 44
6 Brownsville 2 1019 47
7 Brownsville 3 1019.2 46
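
Because the pivoted values are VARCHAR, you can cast them back to numeric types in a follow-up query if needed. The following is a minimal sketch, assuming the output of the preceding call was stored in a hypothetical table named pivot_out2:

SELECT sn, city, week,
CAST(value_temp AS DOUBLE PRECISION) AS temp,
CAST(value_pressure AS DOUBLE PRECISION) AS pressure
FROM pivot_out2
ORDER BY sn;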

Example 3: Use Default Pivot Keys
This example uses the default pivot keys (every unique value in the pivot column); that is, it specifies the
PivotColumn argument but not the PivotKeys argument.

SQL-MapReduce Call

SELECT * FROM pivot (


ON pivot_input
PARTITION BY sn, city, week
ORDER BY week
PartitionColumns ('sn', 'city', 'week')
PivotColumn ('attribute')
TargetColumns ('value')
) ORDER BY 1,2,3;

Output

Table 1211: Pivot Example 3 Output Table

sn city week value_dewpoint value_pressure value_temp


1 Asheville 1 27.6F 1020.8 32
2 Asheville 2 27.4F 1021.3 32
3 Asheville 3 28.2F 1021.7 34
4 Nashville 1 29.4F 1021 42
5 Nashville 2 29.2F 1019.8 44
6 Brownsville 2 28.9F 1019 47
7 Brownsville 3 28.9F 1019.2 46

PSTParserAFS

Summary
The PSTParserAFS function parses Personal Storage Table (PST) files (which store email in Microsoft
software such as Microsoft Outlook and Microsoft Exchange Client) directly from Aster File Store (AFS).
You can use the PSTParserAFS function to extract email content for customer attrition analysis, customer
service analysis, and spam detection. You can also input PSTParserAFS output to other Aster Analytics
functions, such as Text_Parser, the sentiment extraction functions, and the text classification functions.

Usage

Verifying that AFS is Working


Because PST Parser works directly on AFS using Aster nCluster Terminal (ACT), AFS must be working
before you run the function. To verify that AFS is running, run these commands on the Queen:

\afs -mkdir /test


\afs -ls /

If the second command lists the /test directory as the output, then AFS is working. (For more information
about the \afs command, see Aster Database User Guide for Aster Appliances 6.20.)

PSTParserAFS Syntax
Version 1.1

SELECT * FROM PSTParserAFS (


ON empty_table
Path ('input_path' [,...])
[ Host ('afs_server_ip_address') ]
[ Port ('afs_server_port_number') ]
[ OutputColumns ({ 'output_column' | 'output_column_range' }[,...] ]
[ Exclude ('message_folder' [,...]) ]
);

Note:
This function does not need an input table, but because SQL-MapReduce functions require that at least one
input table be provided, you can create an empty table and pass it to the function.

Arguments
Argument Category Description
Path Required Specifies the path to the PST files on AFS. The input_path represents
either a directory or a file name, and can use regular expressions. For
example:
/test
/test/testfile.pst
/test/*.pst
The PST files must be available on AFS before you call the function.
If input_path represents a directory, the function parses all PST files in
the directory.
If a file represented by an input_path is not a PST file, the function does
not parse that file and logs an error.
A single vworker processes each PST file.
Host Optional Specifies the IP address of the AFS server. The default value is IP address
of the Queen node.

Port Optional Specifies the port number of the AFS server. The default value is 2601.
OutputColumns Optional Specifies the custom columns to output. By default, the function outputs
only the default columns. For the names and descriptions of the default
and custom columns, refer to Output. If you specify a column not listed
there, the function issues an error message.
Exclude Optional Specifies the message folders to exclude while parsing the PST file (for
example, Drafts, Deleted, and Junk). The message_folder represents
either a directory or a file name, and can use regular expressions. If
message_folder represents a directory, the function excludes all PST files
in the directory. By default, the function parses all folders in the PST file.
However, the function does not parse or output PST files related to
Calendar, Contacts, Tasks, or RSS.

Input
The input PST files must be available on AFS before you call the function. To upload a PST file to AFS, use
the \afs -put command in ACT. For example, to copy all PST files in the current directory on your
Queen to the /test/ directory on AFS, use this command:

\afs -put *.pst /test/

If the specified directory does not exist in AFS, the command creates it before copying the files to that
directory.
Here are more examples for uploading PST files to AFS:

\afs -put /home/beehive/*.pst /test/subdir1/


\afs -put /home/beehive/test1.pst /test/subdir2/
\afs -put /home/beehive/test3.pst /test/subdir3/file1.pst

The third example stores the file test3.pst as file1.pst in AFS.
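
To confirm an upload, you can list the target directory with the \afs -ls command (shown in Verifying that AFS is Working). For example:

\afs -ls /test/subdir3/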

Output
The output table has the default columns and any custom columns specified by the OutputColumns
argument. The following table lists and describes the default and custom columns. In the following table,
column name aliases are in parentheses (for example, the alias of the message_id column name is id). The
function treats aliases as unique column names processes them independently. Column names are case-
insensitive.
The function creates the output table in memory, but you can direct it to a database table on disk or operate
on it directly within the SQL-MapReduce framework.

Note:
Because “date” and “to” are SQL keywords, you must enclose them with double quotation marks. For
example:

SELECT sender, id, "to" FROM PSTParserAFS (ON ...)

Table 1212: PSTParserAFS Output Table Schema

Status Column Name Data Type Description


Default message_id (id) VARCHAR Unique identifier (ID) of the
message (email), if assigned by the
exchange server; blank otherwise.
The exchange server does not
assign IDs to messages in the
Draft, Output, and Sent Items
folders.
To assign a unique global ID to
each email message in the output
table, concatenate the fields
input_path and node_id.
Default sender (sender_name) VARCHAR Display name of the sender.
Default sender_email_address VARCHAR Email address of the sender.
Default recipients (recipient, recipient_name, recipients_name, recipient_names, recipients_names) VARCHAR Recipient names (from the message fields TO, CC, and BCC).
Default recipients_email_addresses (recipient_email_address) VARCHAR Recipient email addresses.
Default received_date (receive_date, date) TIMESTAMP Date and time when the message
arrived at the recipient’s mailbox,
in this format:

yyyy-mm-dd hh:mm:ss

For example, 2014-02-17 22:34:58.


The date reflects the recipient’s
time zone but does not contain
time zone information. Time zone
information is in the custom
column received_date_timezone.
Default subject VARCHAR Subject of the message.
Default contents (body) VARCHAR Contents of the message (text).
Custom node_id (nodeid) BIGINT Node ID of the message (unique
within a PST file). Within each
PST file, each message is
represented as a node.
Custom input_path VARCHAR Input path location of this PST
File on AFS.
Custom importance INTEGER Level of importance assigned to
the Message object by the end user
—0 (Low), 1 (Normal), or 2
(High).
Custom priority INTEGER Priority at which the client
requested the message to be sent—
0 (Normal) or 1 (Urgent).
Custom message_size BIGINT Message size on the server, in
bytes.
Custom has_replied VARCHAR Whether the recipient replied to
the message—'true' or 'false'.
Custom has_forwarded VARCHAR Whether the recipient forwarded
the message—'true' or 'false'.
Custom is_flagged VARCHAR Whether the message has a due
date—'true' or 'false'.
Custom received_date_timezone (timezone, receive_date_timezone, date_timezone) VARCHAR Time zone of the received_date value, displayed as an offset from GMT. For example, -0800 is 8 hours behind GMT.
Custom sent_date (send_date) TIMESTAMP Date and time when the sender
sent the message, in this format:

yyyy-mm-dd hh:mm:ss

For example, 2014-02-17 22:34:58.


The date reflects the sender’s time
zone but does not contain time
zone information. Time zone
information is in the custom
column sent_date_timezone.
Custom sent_date_timezone (send_date_timezone) VARCHAR Time zone of the sent_date value, displayed as an offset from GMT. For example, -0800 is 8 hours behind GMT.

Custom action_date TIMESTAMP Date and time when the recipient
acted on the message, in this
format:

yyyy-mm-dd hh:mm:ss

For example, 2014-02-17 22:34:58.


The date reflects the recipient’s
time zone but does not contain
time zone information. Time zone
information is in the custom
column action_date_timezone.
Custom action_date_timezone VARCHAR Time zone of the action_date
value, displayed as an offset from
GMT. For example, -0800 is 8
hours behind GMT.
Custom folder (folder_name) VARCHAR Name of folder that contains the
message (for example, 'Inbox' or
'Sent/2014/February').
Custom sender_ip_address VARCHAR IP address of the sender system
(for example, 10.10.143.10).
Custom to VARCHAR Names of the recipients in the
message field TO.
Custom cc VARCHAR Names of the recipients in the
message field CC.
Custom bcc VARCHAR Names of the recipients in the
message field BCC.
Custom conversation_thread (thread) VARCHAR Conversation thread of the
message—the subject of the first
email in the thread, minus the
strings 'Re:' and 'Fwd:'.
Custom number_of_attachments INTEGER Number of files attached to the
message.
Custom attachment_size BIGINT Total size of files attached to the
message, in bytes.
Custom list_of_attachments (attachments) VARCHAR Names of files attached to the
message.
Custom message_type VARCHAR Type of the message
(MessageClass). A Normal
message has type IPM. Some other
qualified types are:

• IPM.Contact
• IPM.Appointment
• IPM.Activity
• IPM.Report
• IPM.Task
• IPM.Recall.Report
For a list of all qualified message
classes, and more information, see
the Microsoft website.

Alternative Way of Running PSTParserAFS


Another way to execute PSTParserAFS is to use the table_from_afs function:

SELECT * FROM table_from_afs (


ON empty_table
Path ('input_path' [,...])
[ Host ('afs_server_ip_address') ]
[ Port ('afs_server_port_number') ]
Input_Format (
'com.asterdata.sqlmr.analytics.parser.PSTParserAFS.inputformat.PSTInputFormat',
'columns colname [,...] STRINGS'
[,'exclude MessageFolder1 [,...] STRINGS'] )
SerDe ('com.asterdata.sqlmr.analytics.parser.PSTParserAFS.serde.PSTSerDe',
'field.delim=1', 'escape.delim=27')
OutputColumns ('colname coltype' [,...])
);

In the Input_Format argument, you can specify the output column names and folders to exclude.
For more information about the table_from_afs function, see the Aster Database User Guide for Aster
Appliances 6.20.

Examples
These examples assume that the PST files are stored in AFS in the directory /test/. You can find the dum1.pst
input file for these examples in the directory "pstParserFiles" (provided with the function) and upload it
to the AFS directory /test/. (For instructions for setting up AFS, refer to Verifying that AFS is Working.)
Examples 1 and 2 show input and output; the others show only SQL-MapReduce calls.

Example 1: Single PST File, Default Output Fields

Input
The input file is a PST file that contains information about an email. The following figure shows how the
email looks in Outlook. (The sender and recipient are the same.)

Table 1213: PSTParserAFS Example 1 Input Table dum1.pst, Columns 1-4

message_id sender sender_email_address recipients


<56a23f55.53adca0a.b5ff[email protected]m> Microsoft Outlook [email protected] dumfirst dumlast

Table 1214: PSTParserAFS Example 1 Input Table dum1.pst, Columns 5-8

recipients_email_addresses received_date subject contents


[email protected] 2016-01-22 06:40:21 Microsoft Outlook Test Message This is an e-mail message sent automatically by Microsoft Outlook while testing the settings for your account.

Figure 31: PSTParserAFS Input File Email in Outlook

SQL-MapReduce Call

SELECT * FROM PSTParserAFS (


ON empty_table
Path ('/test/dum1.pst')
);

Output

Table 1215: PSTParserAFS Example 1 Output Table dum1.pst, Columns 1-4

message_id sender sender_email_address recipients


<56a23f55.53adca0a.b5ff[email protected]m> Microsoft Outlook [email protected] dumfirst dumlast

Table 1216: PSTParserAFS Example 1 Output Table dum1.pst, Columns 5-8

recipients_email_addresses received_date subject contents


[email protected] 2016-01-22 06:40:21 Microsoft Outlook Test Message This is an e-mail message sent automatically by Microsoft Outlook while testing the settings for your account.

Example 2: Single PST File, Specified Output Fields

Input
The input file is dum1.pst.

SQL-MapReduce Call

SELECT input_path, sender, "to" FROM PSTParserAFS (


ON empty_table
Path ('/test/dum1.pst')
OutputColumns ('input_path', 'to')
);

Output

Table 1217: PSTParserAFS Example 2 Output Table

input_path sender to
/test/dum1.pst Microsoft Outlook dumfirst dumlast

Example 3: Directory of PST Files, Exclude Argument


This SQL-MapReduce call parses all PST files in the directory /test/, excluding those in subdirectories Drafts,
Deleted Items, Notes, and Sent Items:

SELECT * FROM PSTParserAFS (


ON empty_table
Path ('/test/')

Exclude ('Drafts', 'Deleted Items', 'Notes', 'Sent Items')
);

Example 4: Path and Exclude Arguments with Regular Expressions


This SQL-MapReduce call parses all PST files in /test/ subdirectories test1, test2, ..., test9, excluding those in
subdirectory Sent Items/2012/:

SELECT * FROM PSTParserAFS (


ON empty_table
Path ('/test[1-9]/[a-z]*')
Exclude ('Sent I[a-z]*/2012/[a-z]')
);

Example 5: Multiple PST Files, Specified Host and AFS Server Port Attributes

SELECT * FROM PSTParserAFS (


ON empty_table
Path ('/test/dum1.pst', '/test/test2.pst', '/test/dir1/')
Host ('192.0.2.10')
Port ('2601')
);

Example 6: Using table_from_afs


This SQL-MapReduce call uses the Alternative Way of Running PSTParserAFS:

SELECT * FROM table_from_afs (


ON empty_table
Path ('/test/dum1.pst')
Input_Format (
'com.asterdata.sqlmr.analytics.parser.PSTParserAFS.inputformat.PSTInputFormat',
'columns Sender,folder,Message_ID STRINGS',
'exclude Sent\\sItems STRINGS')
SerDe (
'com.asterdata.sqlmr.analytics.parser.PSTParserAFS.serde.PSTSerDe',
'field.delim=1',
'escape.delim=27')
OutputColumns ('Sender varchar', 'folder varchar', 'Message_ID varchar')
);

Scale Functions

Summary
The scale functions are:
• ScaleMap, which takes a data set and outputs its statistical information (assembled at the vworker level)
• Scale, which takes ScaleMap output and outputs scaled (normalized) values for the input data set
You can use Scale output as input to distance-based analysis functions, such as KMeans.
• ScalePrinter, which takes ScaleMap output and outputs global statistical information for the entire input
data set
• PartitionScale, which scales the sequences in each partition independently, using the same formula as
Scale
Scale Function Examples has examples of all scale functions.
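A typical composition, sketched here with placeholder table and column names, nests a ScaleMap call inside a Scale call as its statistic (DIMENSION) input; the full pattern appears in Scale Function Examples:

SELECT * FROM Scale (
ON input_table AS input PARTITION BY ANY
ON (SELECT * FROM ScaleMap (
ON input_table
InputColumns ('col1', 'col2'))
) AS statistic DIMENSION
Method ('maxabs')
);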

Background
The main purpose of the Scale function is to normalize the input data set. The function shifts the input data
and scales it to generate the normalized values.
Scaling a data set allows the comparison of normalized values from different columns without gross
influences. Some normalization methods require only a shift or a scaling step to arrive at values comparable
with statistical information (for example, mean and max). Other methods require a combination of shifting
and scaling steps.
Data-set scaling is a necessary step for many data preprocess flows. For some analytics functions such as
KMeans and Principal Component Analysis (PCA), the input data set consists of different variables, many of
which have different measures. Without scaling the data set, the influence of different columns varies
considerably and can produce unexpected results.
For example, an insurance company uses the KMeans function to cluster customers according to data about
their houses. The input variables are room area, number of rooms, house height, and house price. These
variables vary considerably in scale. For example, as shown in the following table, the scale for the room area
variable ranges from 50 through 150, while that for the house price variable ranges from $150,000 through
$300,000. If you use the input data set without scaling it with KMeans, the effect of the house price on
customer clustering is much greater than that of room area.

Table 1218: Input Data Example

id room area rooms number house height house price


1 100 3 2.6 200,000
2 150 4 3 300,000
3 50 2 2.7 150,000

To normalize the data so that each variable has the same effect on customer clustering, you can use the Scale
function and apply the MAXABS method to transform the variable values into a common range. The
following table shows the normalization results.
Table 1219: Output Table Example

id room area rooms number height price


1 0.667 0.75 0.8667 0.667
2 1 1 1 1
3 0.333 0.5 0.9 0.5
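
The MAXABS arithmetic can be checked by hand: maxabs uses location 0 and scale maxabsX, the maximum absolute value in the column (refer to Location and Scale for Statistical Methods), so each value is simply divided by its column maximum. For example, the first room area scales to 100/150 = 0.667, and the first house price scales to 200,000/300,000 = 0.667, matching the preceding table.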

ScaleMap
The ScaleMap function outputs statistical information for a data set. The statistical information is assembled
at the vworker level.
ScaleMap output can be input to the functions Scale (which outputs scaled values for the data set) and
ScalePrinter (which outputs global statistics for the data set).

Note:
The statistical data generated by this function is local and is intended for use by the Scale and ScalePrinter
functions; the data does not make sense before combination.

Usage

ScaleMap Syntax
Version 1.2

SELECT * FROM ScaleMap (


ON { table | view | (query) }
InputColumns ( { column | column_range }[,...] )
[ MissValue ({ 'KEEP' | 'OMIT' | 'ZERO' | 'LOCATION' })]
);

Arguments
Argument Category Description
InputColumns Required Specifies the input table columns that contain the attribute values of the
samples. The attribute values must be numeric values between -1e308
and 1e308. If a value is outside this range, the function treats it as
infinity.
MissValue Optional Specifies how the Scale, ScaleMap, and PartitionScale functions are to
process NULL values in input, as follows:
• KEEP (default): Keep NULL values.
• OMIT: Ignore any row that has a NULL value.
• ZERO: Replace each NULL value with zero.
• LOCATION: Replace each NULL value with its location value.

Input
Table 1220: ScaleMap, Scale, or PartitionScale Input Table Schema

Column Name Data Type Description


input_column SMALLINT, INT, BIGINT, NUMERIC, or DOUBLE PRECISION Contains numeric values.

Invalid Input Data Handling


• If the input table contains infinity or NaN values, the input is invalid. The Scale and PartitionScale
functions do not process these values, but the ScaleMap function counts the values in each column,
including invalid values. The count of input values is available in the output of the ScalePrinter function.
• If a column contains only invalid input values, the data in this column is not modified.
• If a sequence has only one unique value, the results of the USTD, STD, RANGE, and MIDRANGE
operations are NaN values, because the function cannot calculate the scale.

Output
Table 1221: ScaleMap Output Table Schema

Column Name Data Type Description


stattype VARCHAR Type of statistical information in output_column. For the values
that this column can contain, refer to the following table.
output_column DOUBLE Statistical value of the corresponding stattype.
PRECISION

Table 1222: Supported Statistical Data Types in ScaleMap Output Table

Data Type Description


min Minimum value in the corresponding column in the current vworker.

max Maximum value in the corresponding column in the current vworker.
sum Sum value in the corresponding column in the current vworker.
squaresum Square sum value in the corresponding column in the current vworker.
count Count of valid values in the corresponding column in the current vworker.
infinity Count of infinite values in the corresponding column in the current vworker.
nan Count of NaN values in the corresponding column in the current vworker.
null Count of NULL values in the corresponding column in the current vworker.
ignorerow Count of ignored rows in the corresponding column in the current vworker.
text_missvalue_* The asterisk (*) is the value of the MissValue argument—KEEP, OMIT, ZERO, or LOCATION.

Scale
The Scale function takes ScaleMap output as input and outputs scaled (normalized) values for the input data
set.

Note:
To scale the sequences in each partition independently, use the function PartitionScale.

Usage

Scale Syntax
Version 1.2

SELECT * FROM Scale (


ON input_table AS input PARTITION BY ANY
ON (SELECT * FROM ScaleMap ...) AS statistic DIMENSION
Method ('method' [,...])
[ Global ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ InputColumns ( { column | column_range }[,...] ) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]')]
[ Multiplier ('multiplier' [,...]) ]
[ Intercept ('intercept' [,...]) ]
);

Arguments
Argument Category Description
Method Required Specifies one or more statistical methods to use to scale the data set. For
method values and descriptions, refer to the following table.
If you specify multiple methods, the output table includes the column
scalemethod (which contains the method name) and a row for each
input-row/method combination.
Global Optional Specifies whether all input columns are scaled to the same location and
scale. The default value is 'false' (each input column is scaled separately).
InputColumns Optional Specifies the input table columns that contain the attribute values of the
samples. The attribute values must be numeric values between -1e308
and 1e308. If a value is outside this range, the function treats it as
infinity. The default input columns are all columns of the statistic table
except stattype.
Accumulate Optional Specifies the input table columns to copy to the output table. By default,
the function copies no input table columns to the output table.
Multiplier Optional Specifies one or more multiplying factors to apply to the input variables
—multiplier in the following formula:
X' = intercept + multiplier * (X - location)/scale
If you specify only one multiplier, it applies to all columns specified by
the InputColumns argument. If you specify multiple multiplying factors,
each multiplier applies to the corresponding input column. For example,
the first multiplier applies to the first column specified by the
InputColumns argument, the second multiplier applies to the second
input column, and so on. The default multiplier is 1.
Intercept Optional Specifies one or more addition factors incrementing the scaled results—
intercept in the following formula:
X' = intercept + multiplier * (X - location)/scale
If you specify only one intercept, it applies to all columns specified by the
InputColumns argument. If you specify multiple addition factors, each
intercept applies to the corresponding input column.
The syntax of intercept is:

[-]{number | min | mean | max }

where min, mean, and max are the global minimum, maximum, and mean
values in the corresponding columns.
The function scales the values of min, mean, and max. For example, if
intercept is '-min' and multiplier is 1, the scaled result is transformed
to a nonnegative sequence according to this formula, where scaledmin is
the scaled minimum value:
X' = -scaledmin + 1 * (X - location)/scale
The default intercept is 0.

The following table lists the location and scale values for each statistical method. X is an input value in a
category, and minX and maxX are the minimum and maximum values in that category, respectively.
Table 1223: Location and Scale for Statistical Methods

Method Location Scale


mean Xmean 1
sum 0 ΣX
ustd 0 stdX about the origin, which is calculated according to the biased estimator of the variance.
std Xmean stdX, which is calculated according to the unbiased estimator of the variance.
range minX maxX - minX
midrange (maxX + minX)/2 (maxX - minX)/2
maxabs 0 maxabsX

Input
The function has two input tables: the input table (for its schema, refer to ScaleMap, Scale, or PartitionScale Input Table Schema) and the statistic table, which is ScaleMap output (refer to ScaleMap Output Table Schema).

Output
The output table contains the normalized results of the input data set. If you specify multiple methods, the
output table has a row for each input-row/method combination.
Table 1224: Scale and PartitionScale Output Table Schema

Column Data Type Description


accumulate_column Any Column copied from the input table.
input_column DOUBLE Normalized value for column specified by the InputColumns
PRECISION argument.
scalemethod VARCHAR Scale method used to compute input_column value. For possible
values, refer to Location and Scale for Statistical Methods.
This column appears only if you specify multiple methods.

ScalePrinter
The ScalePrinter function takes as input ScaleMap output (statistics assembled at the vworker level) and
outputs global statistical information for the entire input data set.

Usage

ScalePrinter Syntax
Version 1.2

SELECT * FROM ScalePrinter (


ON (SELECT * FROM ScaleMap ...) PARTITION BY 1
);

Input
The ScalePrinter input table is the ScaleMap output table; for its schema, refer to ScaleMap Output Table
Schema.

Output
The ScalePrinter output table displays the statistics for the entire data set. The table has the same schema as
the ScaleMap output table; however, its stattype column values are different—the following table
describes them.
Table 1225: Supported Statistical Data Types in ScalePrinter Output Table

Data Type Description


min Minimum value in the corresponding column.
max Maximum value in the corresponding column.
sum Sum value in the corresponding column.
squaresum Square sum value in the corresponding column.
count Count of valid values in the corresponding column.
avg Average of the values in the corresponding column.
variance Variance of the values in the corresponding column. The variance is calculated according to
N-1 degrees of freedom (the number of samples minus one).
std Standard deviation of the values in the corresponding column. The standard deviation is
calculated according to N-1 degrees of freedom (the number of samples minus one).
infinity Count of infinite values in the corresponding column.
nan Count of NaN values in the corresponding column.
null Count of NULL values in the corresponding column.
ignorerow Count of ignored rows in the corresponding column.

PartitionScale
The PartitionScale function scales the sequences in each partition independently, using the same formula as
the function Scale.

Usage

PartitionScale Syntax
Version 1.2

SELECT * FROM PartitionScale (


ON input_table PARTITION BY partition_columns
Method ('method' [,…])
[ MissValue ({ 'KEEP' | 'OMIT' | 'ZERO' | 'LOCATION' })]
InputColumns ( { column | column_range }[,…] )
[ Global ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,…])
[ Multiplier ('multiplier' [,…]) ]
[ Intercept ('intercept' [,…]) ]
);

Arguments
Argument Category Description
Method Required Specifies one or more statistical methods to use to scale the data set. For
method values and descriptions, refer to the table Location and Scale for
Statistical Methods.
If you specify multiple methods, the output table includes the column
scalemethod (which contains the method name) and a row for each
input-row/method combination.
MissValue Optional Specifies how the PartitionScale function is to process NULL values in
input:
• KEEP (default): Keep NULL values.
• OMIT: Ignore any row that has a NULL value.
• ZERO: Replace each NULL value with zero.
• LOCATION: Replace each NULL value with its location value.

InputColumns Required Specifies the input table columns that contain the attribute values of the
samples. The attribute values must be numeric values between -1e308
and 1e308. If a value is outside this range, the function treats it as
infinity.

Global Optional Specifies whether all input columns are scaled to the same location and
scale. The default value is 'false' (each input column is scaled separately).
Accumulate Optional Specifies the input table columns to copy to the output table. By default,
the function copies no input table columns to the output table.

Tip:
To identify the sequences in the output, specify the partition columns
in this argument.

Multiplier Optional Specifies one or more multiplying factors to apply to the input variables
(multiplier in the following formula):
X' = intercept + multiplier * (X - location)/scale
If you specify only one multiplier, it applies to all columns specified by
the InputColumns argument.
If you specify multiple multiplying factors, each multiplier applies to the
corresponding input column. For example, the first multiplier applies to
the first column specified by the InputColumns argument, the second
multiplier applies to the second input column, and so on. The default
multiplier is 1.
Intercept Optional Specifies one or more addition factors incrementing the scaled results—
intercept in the following formula:
X' = intercept + multiplier * (X - location)/scale
If you specify only one intercept, it applies to all columns specified by the
InputColumns argument. If you specify multiple addition factors, each
intercept applies to the corresponding input column.
The syntax of intercept is:

[-]{number | min | mean | max }

where min, mean, and max are the global minimum, maximum, and mean
values in the corresponding columns.
The function scales the values of min, mean, and max. For example, if
intercept is '-min' and multiplier is 1, the scaled result is transformed to a
nonnegative sequence according to this formula, where scaledmin is the
scaled minimum value:
X' = -scaledmin + 1 * (X - location)/scale
The default intercept is 0.

Input
The PartitionScale input table has the same schema as the Scale input table described in ScaleMap, Scale, or
PartitionScale Input Table Schema.

Output
The PartitionScale output table schema is described in the Scale and PartitionScale Output Table Schema.
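
For illustration, the following is a minimal sketch of a PartitionScale call (it is not one of the examples that follow). It uses the scale_housing input table from Scale Function Examples and partitions by the type column, so the classic and bungalow sequences are scaled independently:

SELECT * FROM PartitionScale (
ON scale_housing PARTITION BY type
Method ('maxabs')
InputColumns ('[2:6]')
Accumulate ('type', 'id')
) ORDER BY id;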

Scale Function Examples

Example 1: Scale with Method('midrange')


This example scales (normalizes) input data using the midrange method and the default values for the
arguments Intercept and Multiplier (0 and 1, respectively).

Input
The input table contains data about houses—categorical data in the column type and numerical data in the
columns price, lotsize, bedrooms, bathrooms, and stories. The column id identifies the rows. The table has
some NULL values.
Table 1226: Scale Functions Examples Input Table scale_housing

type id price lotsize bedrooms bathrms stories


classic 1 42000 5850 3 1 2
classic 2 4000 2 1 1
classic 3 49500 3060 3 1 1
classic 4 60500 6650 3 1 2
classic 5 61000 6360 2 1 1
bungalow 6 66000 4160 2 1 1
bungalow 7 66000 3880 2 2
bungalow 8 69000 4160 3 1 3
bungalow 9 83800 4800 3 1 1
bungalow 10 88500 5500 3 2 4

SQL-MapReduce Call

SELECT * FROM Scale (


ON scale_housing AS INPUT PARTITION BY ANY
ON (SELECT * FROM ScaleMap (
ON scale_housing
InputColumns ('[2:6]')
MissValue ('omit'))
) AS statistic DIMENSION
Method ('midrange')
Accumulate ('id')
) ORDER BY id;

Output
The output table contains the midrange-scaled values for the input data set. As explained in the descriptions
of the arguments Multiplier and Intercept, the formula for computing the scaled value X' from the input
value X is:
X' = intercept + multiplier * (X - location)/scale
The formulas for computing location and scale for the midrange method (from Arguments) are:
• location = (maxX+minX)/2
• scale = (maxX-minX)/2
The values minX and maxX are the minimum and maximum values of X, respectively.
For example, consider row 1 of the price column in the input table and the following output table:
• intercept = 0 (default)
• multiplier = 1 (default)
• Input value X = 42000
• Minimum input price value minX = 42000
• Maximum input price value maxX = 88500
• location = (88500+42000)/2 = 65250
• scale = (88500-42000)/2 = 23250
• Scaled output value X' = 0 + 1 * (42000-65250)/23250 = -1
Table 1227: Scale and ScaleMap Example 1 Output Table

id price lotsize bedrooms bathrms stories


1 -1 0.554317548746518 1 -1 -0.333333333333333
3 -0.67741935483871 1 1 -1 -1
4 -0.204301075268817 1 1 -1 0.333333333333333
5 -0.182795698924731 0.838440111420613 -1 -1 -1
6 0.032258064516129 -0.387186629526462 1 -1 -1
8 0.161290322580645 -0.387186629526462 1 -1 -0.333333333333333
9 0.797849462365591 -0.0306406685236769 1 -1 -1
10 1 0.35933147632312 1 1 1

Example 2: Scale with Method('midrange') and Intercept(-min)


This example is like Example 1 except that the Intercept argument has the value -min (where min is the
global minimum value). This example also specifies a Multiplier value, but it is the default, as in Example 1.

Input
As in Example 1, the input is scale_housing (Input).

SQL-MapReduce Call

SELECT * FROM Scale (


ON scale_housing AS INPUT PARTITION BY ANY
ON (SELECT * FROM ScaleMap (
ON scale_housing
InputColumns ('[2:6]')
MissValue ('omit'))
) AS statistic DIMENSION
Method ('midrange')
Accumulate ('id')
Intercept ('-min')
Multiplier (1)
) ORDER BY id;

Output
As explained in the description of the Intercept argument, the formula for computing the scaled value X'
from the input value X when intercept is -min is:
X' = -scaledmin + 1 * (X - location)/scale
The formula for computing scaledmin when intercept is -min is:
scaledmin = (minX - location)/scale
For example, consider row 1 of the price column in the input table (Input) and the following output table:
• Input value X = 42000
• Minimum input price value minX = 42000
• Maximum input price value maxX = 88500
• location = (88500+42000)/2 = 65250
• scale = (88500-42000)/2 = 23250
• scaledmin = (42000 - 65250)/23250 = -1
• Scaled output value X' = -(-1) + 1 * (42000 - 65250)/23250 = 0
Table 1228: Scale and ScaleMap Example 2 Output Table

id price lotsize bedrooms bathrms stories


1 0 1.55431754874652 2 0 0.666666666666667
3 0.32258064516129 0 2 0 0
4 0.795698924731183 2 2 0 0.666666666666667
5 0.817204301075269 1.83844011142061 0 0 0
6 1.03225806451613 0.612813370473538 2 0 0
8 1.16129032258065 0.612813370473538 2 0 1.33333333333333
9 1.79784946236559 0.969359331476323 2 0 0
10 2 1.35933147632312 2 2 2

Example 3: Use Training Data to Scale Test Data
This example creates statistics from training data and then uses them to scale similar test data.

Input
• Training data: scale_housing
• Test data: scale_housing_test
Table 1229: Scale and ScaleMap Example 3 Input Table scale_housing_test

type id price lotsize bedrooms bathrms stories


bungalow 11 90000 7200 3 2 1
classic 12 30500 3000 2 1 1
classic 13 27000 1700 3 1 2
classic 14 36000 2880 3 1 1
classic 15 37000 3600 2 1 1

Step 1: Create Statistics Table from Training Data

CREATE DIMENSION TABLE scale_stat AS
SELECT * FROM ScaleMap (
  ON scale_housing
  InputColumns ('[2:6]')
  MissValue ('omit')
);

Step 2: Scale Test Data

SELECT * FROM Scale (
  ON scale_housing_test AS INPUT PARTITION BY ANY
  ON scale_stat AS STATISTIC DIMENSION
  Method ('midrange')
  Accumulate ('id')
);

Output
Because the location and scale statistics come from the training data, scaled test values can fall outside the range [-1, 1] (for example, the price of id 11, 90000, exceeds the training maximum of 88500 and scales to 1.06).

Table 1230: Scale and ScaleMap Example 3 Output Table

id price lotsize bedrooms bathrms stories


11 1.06451612903226 1.30640668523677 1 1 -1
13 -1.64516129032258 -1.75766016713092 1 -1 -0.333333333333333
15 -1.21505376344086 -0.6991643454039 -1 -1 -1
12 -1.49462365591398 -1.03342618384401 -1 -1 -1


14 -1.25806451612903 -1.10027855153203 1 -1 -1

Example 4: ScalePrinter
This example uses ScalePrinter to output the contents of the table scale_stat, created in Example 3, step 1.

SQL-MapReduce Call

SELECT * FROM ScalePrinter (
  ON scale_stat PARTITION BY 1
) ORDER BY 1;

Output

Table 1231: ScalePrinter Example Output Table (Columns 1-3)

stattype price lotsize


avg 65037.5 5067.5
count 8 8
ignorerow 2 2
infinity 0 0
max 88500 6650
min 42000 3060
nan 0 0
null 1 0
squaresum 35567190000 216159400
std 15712.500710308 1237.67927994291
sum 520300 40540
variance 246882678.571429 1531850

Table 1232: ScalePrinter Example Output Table (Columns 4-6)

stattype bedrooms bathrms stories

avg 2.875 1.125 1.875
count 8 8 8
ignorerow 2 2 2
infinity 0 0 0
max 3 4 4
min 2 1 1
nan 0 0 0
null 1 0 0
squaresum 67 11 37
std 0.353553390593274 0.353553390593274 1.1259916264596
sum 23 9 15
variance 0.125 0.125 1.26785714285714
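As a consistency check, the variance and std rows can be derived from the count, sum, and squaresum rows: variance = (squaresum - (sum*sum)/count)/(count - 1), and std is the square root of variance. For the price column in the preceding output, (35567190000 - (520300*520300)/8)/7 = 246882678.571429, and its square root is 15712.500710308, matching the variance and std rows.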

Example 5: Scale with Multiple Methods


This example scales input data using multiple methods—midrange, mean, maxabs, and range—in the same
function call.

Input
The input table is scale_housing.

SQL-MapReduce Call

SELECT * FROM Scale (
  ON scale_housing AS INPUT PARTITION BY ANY
  ON (SELECT * FROM ScaleMap (
    ON scale_housing
    InputColumns ('[2:6]')
  )
  ) AS statistic DIMENSION
  Method ('midrange', 'mean', 'maxabs', 'range')
  Accumulate ('id')
) ORDER BY 1,7;

Output

Table 1233: Scale and ScaleMap Example 5 Output Table (Columns 1-3)

id price lotsize
1 0.474576271186441 0.879699248120301
1 -23144.4444444444 1008
1 -1 0.554317548746518
1 0 0.777158774373259
2 0.601503759398496
2 -842

2 -0.476323119777159
2 0.261838440111421
3 0.559322033898305 0.46015037593985
3 -15644.4444444444 -1782
3 -0.67741935483871 -1
3 0.161290322580645 0
... ... ...

Table 1234: Scale and ScaleMap Example 5 Output Table (Columns 4-7)

bedrooms bathrms stories scalemethod


1 0.5 0.5 maxabs
0.222222222222222 -0.2 0.2 mean
1 -1 -0.333333333333333 midrange
1 0 0.333333333333333 range
0.666666666666667 0.5 0.25 maxabs
-0.777777777777778 -0.2 -0.8 mean
-1 -1 -1 midrange
0 0 0 range
1 0.5 0.25 maxabs
0.222222222222222 -0.2 -0.8 mean
1 -1 -1 midrange
1 0 0 range
... ... ... ...
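As a check of the four methods for row 1 of the price column (X = 42000): the minimum price is 42000, the maximum is 88500, and the mean is 586300/9 = 65144.4444444444 (the statistics evidently exclude the NULL price of row 2). The four output values for id 1 are consistent with the standard definitions of these methods:
• maxabs: 42000/88500 = 0.474576271186441
• mean: 42000 - 65144.4444444444 = -23144.4444444444
• midrange: (42000 - 65250)/23250 = -1
• range: (42000 - 42000)/(88500 - 42000) = 0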

Example 6: PartitionScale
This example scales the sequences in each partition independently.

Input
The input table is scale_housing, which is partitioned by the column type.

SQL-MapReduce Call

SELECT * FROM PartitionScale (
  ON scale_housing PARTITION BY type
  InputColumns ('[2:6]')
  Method ('maxabs')
  Accumulate ('type', 'id')
) ORDER BY 1 DESC, 2;

Output

Table 1235: PartitionScale Example 6 Output Table (Columns 1-3)

type id price
classic 1 0.688524590163934
classic 2
classic 3 0.811475409836066
classic 4 0.991803278688525
classic 5 1
bungalow 6 0.745762711864407
bungalow 7 0.745762711864407
bungalow 8 0.779661016949153
bungalow 9 0.946892655367232
bungalow 10 1

Table 1236: PartitionScale Example 6 Output Table (Columns 4-7)

lotsize bedrooms bathrms stories


0.879699248120301 1 1 1
0.601503759398496 0.666666666666667 1 0.5
0.46015037593985 1 1 0.5
1 1 1 1
0.956390977443609 0.666666666666667 1 0.5
0.756363636363636 1 0.5 0.25
0.705454545454545 1 0.5
0.756363636363636 1 0.5 0.75
0.872727272727273 1 0.5 0.25
1 1 1 1
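Because the input is partitioned by the column type, the maxabs statistics are computed separately within each partition. For the price column, the maximum is 61000 among the classic rows and 88500 among the bungalow rows, which is consistent with the output; for example, 42000/61000 = 0.688524590163934 for id 1 and 66000/88500 = 0.745762711864407 for id 6.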

Example 7: Using Scale Output in KMeans


This example uses the Scale function to scale data (using the maxabs method) before inputting it to the
function KMeans, which outputs the centroids of the clusters in the data set. (Background explains the
reason for scaling data before inputting it to a distance-based analysis function like KMeans.)

Input
The input table for this example is the input table for the KMeans Examples, computers_train1.

Step 1: Use Scale to Create Table of Scaled Data

CREATE TABLE computers_normalized DISTRIBUTE BY HASH(id) AS
SELECT * FROM Scale (
  ON computers_train1 AS INPUT PARTITION BY ANY
  ON (SELECT * FROM ScaleMap (
    ON computers_train1
    InputColumns ('[1:5]')
    MissValue ('omit'))
  ) AS statistic DIMENSION
  Method ('maxabs')
  Accumulate ('id')
) ORDER BY id;

Step 2: Input Scaled Data to KMeans

SELECT * FROM KMeans (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('computers_normalized')
  OutputTable ('computers_centroid')
  NumClusters ('8')
  Threshold ('0.05')
  MaxIterNum ('10')
);

Note:
The result of this query varies with each run. To ensure repeatability, use the InitialSeeds argument
instead of the NumClusters argument.
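For example, a repeatable variant of the preceding call might look like the following sketch. The seed coordinates are hypothetical placeholders, and the sketch assumes that InitialSeeds takes one underscore-delimited point (one value per clustered column, in column order) for each desired cluster, so a full equivalent of the preceding call would list eight such strings:

SELECT * FROM KMeans (
  ON (SELECT 1) PARTITION BY 1
  InputTable ('computers_normalized')
  OutputTable ('computers_centroid')
  InitialSeeds ('0.35_0.31_0.12_0.16_0.85', '0.55_0.61_0.51_0.76_0.88')
  Threshold ('0.05')
  MaxIterNum ('10')
);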

Output

Table 1237: Scale and KMeans Example Output Table

clusterid price speed hd ram screen size withinss

0 0.499153191737689 0.313013698630136 0.273377364644488 0.474743150684932 0.850523771152295 292 6.14410036408697
1 0.346259665529739 0.311809635722682 0.122839796318057 0.156984430082256 0.846547314577999 1702 26.9367642917337
2 0.517252416931197 0.601960461285007 0.174217462932454 0.236408566721582 0.879930225797071 607 12.6014867603556
3 0.351843035925946 0.582768595041321 0.11426800472255 0.118629476584022 0.84086857883649 726 11.1916445231536
4 0.43909933270926 1.0 0.276756756756757 0.300168918918919 0.867408585055643 370 15.9375922030147
5 0.357141678235243 0.648366336633661 0.263063020587773 0.2378300330033 0.871481265773634 606 10.3780145525625
6 0.549091201375517 0.608754716981132 0.511633423180592 0.756603773584906 0.882574916759156 265 19.8411759929145
7 0.533305831046153 0.625136363636362 0.288437229437229 0.497727272727273 0.879144385026736 440 11.5107079157115
Converged : False
Number of Iterations : 10
Number of clusters : 8
Output table : "computers_centroid"
Total_WithinSS : 114.54148660353306
Between_SS : 417.79553316469645

StringSimilarity

Summary
The StringSimilarity function calculates the similarity between two strings, using the Jaro, Jaro-Winkler, N-gram, or Levenshtein distance. The similarity is a value in the range [0, 1].

Note:
You can use the output of the StringSimilarity function as input to the function FellegiSunterTrainer.

Usage

StringSimilarity Syntax
Version 1.1

SELECT * FROM StringSimilarity (
  ON { table | view | (query) } PARTITION BY ANY
  ComparisonColumnPairs ('comparison_type (column1, column2 [, constant]) [ AS output_column ]' [,...])
  [ CaseSensitive ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}[,...]) ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
ComparisonColumnPairs Required Specifies pairs of input table columns that contain strings to
be compared (column1 and column2), how to compare them
(comparison_type), and (optionally) a constant and the
name of the output column for their similarity
(output_column). The similarity is a value in the range [0, 1].
For comparison_type, use one of these values:
• 'jaro' (Jaro distance)
• 'jaro_winkler' (Jaro-Winkler distance, which extends the Jaro distance by giving more favorable scores to strings that match from the beginning)
• 'n_gram' (N-gram similarity)
If you specify this comparison type, you can specify the
value of N with constant.
• 'LD' (Levenshtein distance)
The Levenshtein distance is the number of edits needed
to transform one string into the other, where edits
include insertions, deletions, or substitutions of
individual characters.
You can specify a different comparison_type for every pair of
columns.
The default output_column is 'sim_i', where i is the sequence
number of the column pair.
CaseSensitive Optional Specifies whether string comparison is case-sensitive. The
default value is 'false'.
You can specify either one value for all pairs or one value for
each pair. If you specify one value for each pair, then the ith
value applies to the ith pair.
Accumulate Optional Specifies the names of input table columns to be copied to
the output table.

Input
The StringSimilarity function has one required input table, which must contain pairs of columns of strings
to be compared. The input table can contain additional columns, but the function ignores them unless you
specify them with the Accumulate argument. The following table describes the required columns of the
input table.
Table 1238: StringSimilarity Input Table Schema

Column Name Data Type Description


column1_i VARCHAR Column of strings.
column2_i VARCHAR Column of strings.

Output
The StringSimilarity function has one output table:
Table 1239: StringSimilarity Output Table Schema

Column Name Data Type Description


accumulate_column Any Column copied from the input table, specified by the
Accumulate argument.
output_column_i DOUBLE PRECISION The similarity between the ith comparison column pair. The table has one such column for every comparison column pair. The names of the output columns are specified in the ComparisonColumnPairs argument.

Examples
The following examples both use the same Input:
• Example 1: Comparison of src_text1 with tar_text
• Example 2: Comparison of src_text2 with tar_text

Input
The input table, strsimilarity_input, has two source columns (src_text1 and src_text2) against which the function compares the target column (tar_text). The function calculates the similarity scores with the methods specified by the ComparisonColumnPairs argument (jaro, jaro_winkler, n_gram, and Levenshtein distance). For clarity, separate examples show the comparison of each source column with the target column. With some modifications, you can use the output of this function as input to the FellegiSunter functions.
Table 1240: StringSimilarity Example Input Table strsimilarity_input

id src_text1 src_text2 tar_text


1 astre astter aster
2 hone fone phone
3 acqiese acquire acquiesce
4 AAAACCCCCGGGGA CCCGGGAACCAACC CCAGGGAAACCCAC
5 alice allen allies
6 angela angle angels
7 senter center centre
8 chef cheap chief
9 circus circle circuit
10 debt debut debris
11 deal dell lead
12 bare bear bear

Example 1: Comparison of src_text1 with tar_text

SQL-MapReduce Call

SELECT * FROM StringSimilarity (
  ON strsimilarity_input PARTITION BY ANY
  ComparisonColumnPairs (
    'jaro (src_text1, tar_text) AS jaro1_sim',
    'LD (src_text1, tar_text, 2) AS ld1_sim',
    'n_gram (src_text1, tar_text, 2) AS ngram1_sim',
    'jaro_winkler (src_text1, tar_text, 2) AS jw1_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id', 'src_text1', 'tar_text')
) ORDER BY id;

Output

Table 1241: StringSimilarity Example 1 Output Table (Columns 1-3)

id src_text1 tar_text
1 astre aster
2 hone phone
3 acqiese acquiesce
4 AAAACCCCCGGGGA CCAGGGAAACCCAC
5 alice allies
6 angela angels
7 senter centre
8 chef chief
9 circus circuit
10 debt debris
11 deal lead
12 bare bear

Table 1242: StringSimilarity Example 1 Output Table (Columns 4-7)

jaro1_sim ld1_sim ngram1_sim jw1_sim


0.933333333333333 0.6 0.5 0.953333333333333
0.933333333333333 0.8 0.75 0.933333333333333
0.925925925925926 0.777777777777778 0.5 0.948148148148148
0.824175824175824 0.214285714285714 0.384615384615385 0.824175824175824
0.822222222222222 0.5 0.4 0.857777777777778



0.888888888888889 0.833333333333333 0.8 0.933333333333333
0.822222222222222 0.5 0.4 0.822222222222222
0.933333333333333 0.8 0.5 0.946666666666667
0.849206349206349 0.714285714285714 0.666666666666667 0.90952380952381
0.75 0.5 0.4 0.825
0.666666666666667 0.5 0.333333333333333 0.666666666666667
0.833333333333333 0.5 0.333333333333333 0.85
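These scores can be spot-checked by hand, and they are consistent with each measure being normalized to the range [0, 1]. For row 1 ('astre' versus 'aster'), the Levenshtein distance is 2 (two substitutions), so ld1_sim = 1 - 2/5 = 0.6, where 5 is the length of the longer string. The bigram sets are {as, st, tr, re} and {as, st, te, er}, of which 2 of 4 match, so ngram1_sim = 2/4 = 0.5. Similarly, for row 2 ('hone' versus 'phone'), ld1_sim = 1 - 1/5 = 0.8 and ngram1_sim = 3/4 = 0.75.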

Example 2: Comparison of src_text2 with tar_text

SQL-MapReduce Call

SELECT * FROM StringSimilarity (
  ON strsimilarity_input PARTITION BY ANY
  ComparisonColumnPairs (
    'jaro (src_text2, tar_text) AS jaro2_sim',
    'LD (src_text2, tar_text, 2) AS ld2_sim',
    'n_gram (src_text2, tar_text, 2) AS ngram2_sim',
    'jaro_winkler (src_text2, tar_text, 2) AS jw2_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id', 'src_text2', 'tar_text')
) ORDER BY id;

Output

Table 1243: StringSimilarity Example 2 Output Table (Columns 1-3)

id src_text2 tar_text
1 astter aster
2 fone phone
3 acquire acquiesce
4 CCCGGGAACCAACC CCAGGGAAACCCAC
5 allen allies
6 angle angels
7 center centre
8 cheap chief
9 circle circuit
10 debut debris

11 dell lead
12 bear bear

Table 1244: StringSimilarity Example 2 Output Table (Columns 4-7)

jaro2_sim ld2_sim ngram2_sim jw2_sim


0.944444444444445 0.833333333333333 0.8 0.961111111111111
0.783333333333333 0.6 0.5 0.783333333333333
0.841269841269841 0.666666666666667 0.5 0.904761904761905
0.875457875457875 0.714285714285714 0.692307692307692 0.9003663003663
0.822222222222222 0.666666666666667 0.4 0.875555555555556
0.877777777777778 0.666666666666667 0.4 0.914444444444445
0.944444444444445 0.666666666666667 0.6 0.966666666666667
0.733333333333333 0.4 0.25 0.786666666666667
0.746031746031746 0.571428571428571 0.5 0.847619047619048
0.7 0.5 0.4 0.79
0.5 0.25 0 0.5
1 1 1 1

Unpack

Summary
The Unpack function takes data from a single packed column and unpacks it into multiple columns. The
packed column is composed of multiple virtual columns, which become the output columns. To determine
the virtual columns, the function must have either the delimiter that separates them in the packed column or
their lengths.
Unpack is complementary to the function Pack, but you can use it on any packed column that meets the
input requirements.

Usage

Unpack Syntax
Version 1.2

SELECT * FROM Unpack (
  ON { table_name | view_name | (query) }
  InputColumn ('input_column')
  OutputColumns ({ 'output_column' | 'output_column_range' }[,...])
  OutputDataTypes ('datatype' [,...])
  [ Delimiter ('delimiter') ]
  [ ColumnLength ('column_length' [,...]) ]
  [ Regex ('regular_expression') ]
  [ RegexSet ('group_number') ]
  [ Exception ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

Arguments
Argument Category Description
InputColumn Required Specifies the name of the input column that contains the packed data.
OutputColumns Required Specifies the names to give to the output columns, in the order in
which the corresponding virtual columns appear in input_column.
OutputDataTypes Required Specifies the datatypes of the unpacked output columns.
If OutputDataTypes specifies only one value and OutputColumns
specifies multiple columns, then the specified value applies to every
output_column.
If OutputDataTypes specifies multiple values, then it must specify a
value for each output_column. The nth datatype corresponds to the nth
output_column.
Delimiter Optional Specifies the delimiter (a string) that separates the virtual columns in
the packed data. If delimiter contains a character that is a symbol in a
regular expression—such as an asterisk (*) or pipe character (|)—
precede it with two escape characters. For example, if the delimiter is
the pipe character, specify '\\|'. The default delimiter is comma (,).
If the virtual columns are separated by a delimiter, then specify the
delimiter with this argument; otherwise, specify the ColumnLength
argument. Do not specify both this argument and the ColumnLength
argument.
ColumnLength Optional Specifies the lengths of the virtual columns; therefore, to use this
argument, you must know the length of each virtual column.
If ColumnLength specifies only one value and OutputColumns
specifies multiple columns, then the specified value applies to every
output_column.


If ColumnLength specifies multiple values, then it must specify a value for each output_column. The nth column_length corresponds to the nth output_column. However, the last column_length can be an asterisk (*), which represents a single virtual column that contains the remaining data. For example, if the first three virtual columns have the lengths 2, 1, and 3, and all remaining data belongs to the fourth virtual column, you can specify ColumnLength ('2', '1', '3', '*').
If you specify this argument, you must omit the Delimiter argument.
Regex Optional Specifies a regular expression that describes a row of packed data,
enabling the function to find the data values.
A row of packed data contains one data value for each virtual column,
but the row might also contain other information (such as the virtual
column name). In the regular_expression, each data value is enclosed in
parentheses.
For example, suppose that the packed data has two virtual columns, age
and sex, and that one row of packed data is age:34,sex:male.
The regular_expression that describes the row is '.*:(.*)'. The
'.*:' matches the virtual column names, age and sex, and the
'(.*)' matches the values, 34 and male.
The default regular_expression is '(.*)', which matches the whole
string (between delimiters, if any). When applied to the preceding
sample row, the default regular_expression causes the function to
return 'age:34' and 'sex:male' as data values.
To represent multiple data groups in regular_expression, use multiple
pairs of parentheses. By default, the last data group in
regular_expression represents the data value (other data groups are
assumed to be virtual column names or unwanted data). If a different
data group represents the data value, specify its group number with the
RegexSet argument.
RegexSet Optional Specifies the ordinal number of the data group in regular_expression
that represents the data value in a virtual column. By default, the last
data group in regular_expression represents the data value.
For example, suppose that regular_expression is '([a-zA-Z]*):
(.*)'. If group_number is '1', then '([a-zA-Z]*)' represents the
data value. If group_number is '2', then '(.*)' represents the data
value.
Exception Optional Specifies whether the function ignores rows that contain invalid data;
that is, continues without outputting them. The default value is 'false',
which causes the function to fail if it encounters a row with invalid
data.
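As a sketch of how Regex and RegexSet work together (the table and column names here are hypothetical), suppose that each virtual column in the packed data is labeled, as in 'city:Nashville,state:Tennessee'. With the pattern '(.*):(.*)', data group 1 matches the label and data group 2 matches the value, so RegexSet ('2') selects the value:

SELECT * FROM Unpack (
  ON labeled_tempdata
  InputColumn ('packed_data')
  OutputColumns ('city', 'state')
  OutputDataTypes ('varchar', 'varchar')
  Delimiter (',')
  Regex ('(.*):(.*)')
  RegexSet ('2')
);

Because the last data group represents the value by default, omitting RegexSet here would produce the same result.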

Input
Table 1245: Unpack Input Table Schema

Column Data Type Description


input_column VARCHAR Contains packed data. The input_column is specified by the
InputColumn argument.

Output
Table 1246: Unpack Output Table Schema

Column Data Type Description


output_column VARCHAR Unpacked column. Each output_column is specified by the OutputColumns argument.
other_input_column VARCHAR Input table column other than the input_column that the
InputColumn argument specifies, copied from the input
table. In the output table, this column (or these columns)
follow the output columns.

Examples
• Example 1: Delimiter Separates Virtual Columns
• Example 2: No Delimiter Separates Virtual Columns

Example 1: Delimiter Separates Virtual Columns

Input
The input table is a collection of temperature readings for two cities, Nashville and Knoxville, in the state of
Tennessee. In the column of packed data, the delimiter comma (,) separates the virtual columns. The last
row contains invalid data.
Table 1247: Unpack Example 1 Input Table ville_tempdata

sn packed_temp_data
10 Nashville,Tennessee,35.1
11 Nashville,Tennessee,36.2
12 Nashville,Tennessee,34.5
13 Nashville,Tennessee,33.6
14 Nashville,Tennessee,33.1
15 Nashville,Tennessee,33.2
16 Nashville,Tennessee,32.8

17 Nashville,Tennessee,32.4
18 Nashville,Tennessee,32.2
19 Nashville,Tennessee,32.4
20 Thisisbaddata

SQL-MapReduce Call

SELECT * FROM Unpack (
  ON ville_tempdata
  InputColumn ('packed_temp_data')
  OutputColumns ('city', 'state', 'temp_f')
  OutputDataTypes ('varchar', 'varchar', 'real')
  Delimiter (',')
  Regex ('(.*)')
  RegexSet (1)
  Exception ('true')
) ORDER BY sn;

Note:
Because comma is the default delimiter, the Delimiter argument in the preceding call is optional.

Output
Because of Exception ('true'), the function did not fail when it encountered the row with invalid data, but it
did not output that row.
Table 1248: Unpack Example 1 Output Table

city state temp_f sn


Nashville Tennessee 35.1 10
Nashville Tennessee 36.2 11
Nashville Tennessee 34.5 12
Nashville Tennessee 33.6 13
Nashville Tennessee 33.1 14
Knoxville Tennessee 33.2 15
Knoxville Tennessee 32.8 16
Knoxville Tennessee 32.4 17
Knoxville Tennessee 32.2 18
Knoxville Tennessee 32.4 19

Example 2: No Delimiter Separates Virtual Columns

Input
The input table for this example is like the input table for the previous example, except that no delimiter separates the virtual columns in the packed data. To enable the function to determine the virtual columns, the function call specifies the column lengths: the city and state values each occupy nine characters, and the temperature values occupy four.
Table 1249: Unpack Example 2 Input Table ville_tempdata1

sn packed_temp_data
10 NashvilleTennessee35.1
11 NashvilleTennessee36.2
12 NashvilleTennessee34.5
13 NashvilleTennessee33.6
14 NashvilleTennessee33.1
15 NashvilleTennessee33.2
16 NashvilleTennessee32.8
17 NashvilleTennessee32.4
18 NashvilleTennessee32.2
19 NashvilleTennessee32.4
20 Thisisbaddata

SQL-MapReduce Call

SELECT * FROM Unpack (
  ON ville_tempdata1
  InputColumn ('packed_temp_data')
  OutputColumns ('city', 'state', 'temp_f')
  OutputDataTypes ('varchar', 'varchar', 'real')
  ColumnLength ('9', '9', '4')
  Regex ('(.*)')
  RegexSet (1)
  Exception ('true')
) ORDER BY sn;

Output

Table 1250: Unpack Example 2 Output Table

city state temp_f sn


Nashville Tennessee 35.1 10
Nashville Tennessee 36.2 11



Nashville Tennessee 34.5 12
Nashville Tennessee 33.6 13
Nashville Tennessee 33.1 14
Knoxville Tennessee 33.2 15
Knoxville Tennessee 32.8 16
Knoxville Tennessee 32.4 17
Knoxville Tennessee 32.2 18
Knoxville Tennessee 32.4 19

Unpivot

Summary
The Unpivot function pivots data that is stored in columns into rows—the reverse of the function Pivot.

Usage

Unpivot Syntax
Version 1.2

SELECT * FROM Unpivot (
  ON input_table
  { Unpivot ({ 'unpivot_column' | 'unpivot_range' }[,...]) |
    UnpivotRange ('[start_index:end_index]' [,...]) }
  Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...])
  [ InputTypes ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
  [ AttributeColumn ('attribute_column') ]
  [ ValueColumn ('value_column') ]
);

Arguments
Argument Category Description
Unpivot Optional Specifies the names of the unpivot columns—the input columns to unpivot
(convert to rows).


Note:
If you do not specify this argument, you must specify the UnpivotRange
argument.

UnpivotRange Optional Specifies ranges of unpivot columns. You must type the brackets in this context; they are range delimiters, not indicators that the syntax element is optional. The start_index and end_index are nonnegative integers that represent the positions of columns in the input table. The first column is in position 0. No start_index can be greater than its corresponding end_index. The range includes its endpoints.

Note:
If you do not specify this argument, you must specify the Unpivot
argument.

Accumulate Required Specifies the names of input columns—other than unpivot columns—to copy
to the output table. You must specify these columns in the same order that
they appear in the input table. No accumulate_column can be an unpivot
column.
InputTypes Optional Specifies whether the unpivoted value column, in the output table, has the
same data type as its corresponding unpivot column (if possible). The default
value is 'false'—for each unpivoted column, the function outputs the values in
a single VARCHAR column.
If you specify 'true', the function outputs each unpivoted value column in a
separate column. If the unpivot column has a real data type, the unpivoted
value column has the data type DOUBLE PRECISION; if the unpivot column
has an integer data type, the unpivoted value column has the data type
LONG; if the unpivot column has any other data type, the unpivoted value
column has the data type VARCHAR.
AttributeColumn Optional Specifies the name of the unpivoted attribute column in the output table. The default value is 'attribute'.
ValueColumn Optional Specifies the name of the unpivoted value column in the output table. The
default value is 'value'.

Input
Table 1251: Unpivot Input Table Schema

Column Name Data Type Description


unpivot_column Any Column to unpivot, specified by either the Unpivot or
UnpivotRange argument.
accumulate_column Any Column (other than unpivot_column) to copy to the output table.

Output
Table 1252: Unpivot Output Table Schema

Column Name Data Type Description


accumulate_column Same as in input table Column copied from the input table.
attribute_column VARCHAR Unpivoted attribute.
value_column VARCHAR Appears when InputTypes('false'). Contains the unpivoted
value of the corresponding attribute. Numeric values are cast to
VARCHAR.
value_column_double DOUBLE PRECISION Appears when InputTypes('true') and an unpivot column has a real data type. Contains the unpivoted value of the corresponding attribute if the value is real; NULL otherwise.
value_column_long LONG Appears when InputTypes('true') and an unpivot column has
an integer data type. Contains the unpivoted value of the
corresponding attribute if the value is integer; NULL otherwise.
value_column_str VARCHAR Appears when InputTypes('true') and an unpivot column has a
data type other than real or integer. Contains the unpivoted
value of the corresponding attribute.

Examples
• Input
• Example 1: Specified Unpivot Columns, Default Optional Values
• Example 2: Specified Unpivot Columns, Specified Optional Values
• Example 3: Specified Unpivot Range, Default Optional Values

Input
The input table contains temperature, pressure, and dewpoint data for three cities, in dense (pivoted)
format. The data types of the input table columns are:
• sn: INTEGER
• city: VARCHAR
• week: INTEGER
• temp: INTEGER
• pressure: DOUBLE PRECISION
• dewpoint: VARCHAR
Table 1253: Unpivot Examples Input Table unpivot_input

sn city week temp pressure dewpoint


1 Asheville 1 32 1020.8 27.6F
2 Asheville 2 32 1021.3 27.4F



3 Asheville 3 34 1021.7 28.2F
4 Nashville 1 42 1021 29.4F
5 Nashville 2 44 1019.8 29.2F
6 Brownsville 2 47 1019 28.9F
7 Brownsville 3 46 1019.2 28.9F

Example 1: Specified Unpivot Columns, Default Optional Values


This example specifies the columns to unpivot by name and specifies the default values for the optional
arguments. If you omit the optional arguments, the result is the same.

SQL-MapReduce Call

SELECT * FROM Unpivot (
  ON unpivot_input
  Unpivot ('temp', 'pressure', 'dewpoint')
  AttributeColumn ('attribute')
  ValueColumn ('value')
  InputTypes ('false')
  Accumulate ('sn', 'city', 'week')
) ORDER BY 1, 2, 3;

Output
Because InputTypes has the value 'false', the value column has the data type VARCHAR.
Table 1254: Unpivot Example 1 Output Table

sn city week attribute value


1 Asheville 1 temp 32
1 Asheville 1 pressure 1020.8
1 Asheville 1 dewpoint 27.6F
2 Asheville 2 temp 32
2 Asheville 2 pressure 1021.3
2 Asheville 2 dewpoint 27.4F
3 Asheville 3 temp 34
3 Asheville 3 pressure 1021.7
3 Asheville 3 dewpoint 28.2F
4 Nashville 1 temp 42
4 Nashville 1 pressure 1021.0



4 Nashville 1 dewpoint 29.4F
5 Nashville 2 temp 44
5 Nashville 2 pressure 1019.8
5 Nashville 2 dewpoint 29.2F
6 Brownsville 2 temp 47
6 Brownsville 2 pressure 1019.0
6 Brownsville 2 dewpoint 28.9F
7 Brownsville 3 temp 46
7 Brownsville 3 pressure 1019.2
7 Brownsville 3 dewpoint 28.9F

Example 2: Specified Unpivot Columns, Specified Optional Values


This example specifies the columns to unpivot by name and specifies nondefault values for the optional
arguments.

SQL-MapReduce Call

SELECT * FROM Unpivot (
  ON unpivot_input
  Unpivot ('temp', 'pressure', 'dewpoint')
  AttributeColumn ('climate_attributes')
  ValueColumn ('attributevalue')
  InputTypes ('true')
  Accumulate ('sn', 'city', 'week')
) ORDER BY 1, 2, 3;

Output
Because InputTypes has the value 'true', the output table has a separate value column for each unpivot column. The unpivot columns temp, pressure, and dewpoint have the data types INTEGER, DOUBLE PRECISION, and VARCHAR (respectively); therefore, their corresponding unpivoted columns have the data types LONG, DOUBLE PRECISION, and VARCHAR. Values such as 1020.79998779297 (rather than 1020.8) appear because these decimal fractions cannot be represented exactly in binary floating point.
Table 1255: Unpivot Example 2 Output Table

sn city week climate_attributes attributevalue_long attributevalue_double attributevalue_str
1 Asheville 1 temp 32
1 Asheville 1 pressure 1020.79998779297
1 Asheville 1 dewpoint 27.6F
2 Asheville 2 temp 32

2 Asheville 2 pressure 1021.29998779297
2 Asheville 2 dewpoint 27.4F
3 Asheville 3 temp 34
3 Asheville 3 pressure 1021.70001220703
3 Asheville 3 dewpoint 28.2F
4 Nashville 1 temp 42
4 Nashville 1 pressure 1021
4 Nashville 1 dewpoint 29.4F
5 Nashville 2 temp 44
5 Nashville 2 pressure 1019.79998779297
5 Nashville 2 dewpoint 29.2F
6 Brownsville 2 temp 47
6 Brownsville 2 pressure 1019
6 Brownsville 2 dewpoint 28.9F
7 Brownsville 3 temp 46
7 Brownsville 3 pressure 1019.20001220703
7 Brownsville 3 dewpoint 28.9F

Example 3: Specified Unpivot Range, Default Optional Values


This example specifies a range of unpivot columns and uses the default values for the optional arguments.

SQL-MapReduce Call
This call is equivalent to the call in Example 1.

SELECT * FROM Unpivot (
  ON unpivot_input
  Unpivot ('[3:5]')
  Accumulate ('sn', 'city', 'week')
  InputTypes ('false')
) ORDER BY 1, 2, 3;

Output
The output is the same as in Example 1 (Output).


URIPack

Summary
The URIPack function reconstructs hierarchical URI strings that were unpacked by the function
URIUnpack.

Usage

URIPack Syntax
Version 1.1

SELECT * FROM URIPack (
  ON input_table
  [ Queries ('query_parameter' [,...]) ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
  [ Scheme_Column ('scheme_column') ]
  [ Host_Column ('host_column') ]
  [ Path_Column ('path_column') ]
  [ Fragment_Column ('fragment_column') ]
  [ IgnoreValues ('string' [,...]) ]
);

Arguments
Argument Category Description
Queries Optional Specifies the names of the query parameters whose values are to be
included in the URIs.
Accumulate Optional Specifies names of the input table columns to copy to the output
table.
Scheme_Column Optional Specifies the name of the input table column that contains the URI
scheme.
Host_Column Optional Specifies the name of the input table column that contains the URI
host.
Path_Column Optional Specifies the name of the input table column that contains the URI
path.
Fragment_Column Optional Specifies the name of the input table column that contains the URI
fragment.
IgnoreValues Optional Specifies a list of (case-insensitive) strings for the function to treat as
null values. If you omit this argument, the function treats only the
string 'null' as a null value. If you specify this argument, you must
specify the string 'null' to have the function treat it as a null value.

Input
The URIPack input table is the output table of the URIUnpack function; for its schema, refer to URIUnpack
Output Table Schema.

Output
Table 1256: URIPack Output Table Schema

Column Name Data Type Description


accumulate_column Same as in input table Column copied from the input table.
uri VARCHAR Contains the reconstructed hierarchical URIs.

Example

Input
The input table is the output table from the URIUnpack example.

SQL-MapReduce Call

SELECT * FROM URIPack (
  ON uripack_input
  Queries ('p1', 'p2', 'p3')
  Accumulate ('id')
  Scheme_Column ('scheme')
  Host_Column ('host')
  Path_Column ('path')
  Fragment_Column ('fragment')
  IgnoreValues ('null', '192.0.2.16')
) ORDER BY id;

Output
Because IgnoreValues treats '192.0.2.16' as a null value, the host of URI 4 is omitted from the reconstructed URI, which becomes 'telnet:///'. URIs that lack the specified query parameters are reconstructed without a query string.
Table 1257: URIPack Example Output Table

id URI
1 https://www.google.com/webhp?p1=chrome&p2=hello+world&p3=UTF-8#fragment1
2 http://www.ietf.org/rfc/rfc2396.txt
3 ldap://[2001:db8::7]/c=GB
4 telnet:///
5 http://www.bar.com/baz/foo?p1=netscape&p2=%7Bhello+world%7D&p3=UTF#This+is
+fragment+too


URIUnpack

Summary
The URIUnpack function unpacks hierarchical uniform resource identifiers (URIs); that is, it outputs their
constituent components and the values of specified query parameters.
To repack the unpacked URIs, input the URIUnpack output to the function URIPack.

Background
A URI is a structured sequence of characters that identifies a resource (such as a file) on the Internet. URI
lists generated by web server logs and hypertext transfer protocol (HTTP) form submissions are a common
input for text analysis functions.
URI syntax is defined by the Internet Engineering Task Force (IETF). The following table describes the key
components of a hierarchical URI. The examples in the table are from this URI: https://www.google.com/webhp?p1=chrome&p2=hello%20world&p3=UTF-8#fragment1
Table 1258: Key Hierarchical URI Components

Component Example
scheme https
host www.google.com
path /webhp
query ?p1=chrome&p2=hello%20world&p3=UTF-8
A query starts with a question mark (?). An ampersand (&) precedes each query
parameter. Here, the query parameters are p1, p2, and p3. Their values are chrome,
hello%20world, and UTF-8, respectively. %20 represents a space character.
fragment #fragment1

A URI can contain the US-ASCII characters for the lowercase and uppercase letters of the English alphabet
and the Arabic numerals. Any character outside this character set is percent-encoded; that is, converted to a
sequence of the form %hh, where h is a hexadecimal digit. In a query, the space character is encoded as %20.
For example, "San José" is encoded as "San%20Jos%C3%A9". Outside a query, the space character is
encoded as the plus character (+). For example, "San José" is encoded as "San+Jos%C3%A9".

Usage

URIUnpack Syntax
Version 1.0

SELECT * FROM URIUnpack (
  ON input_table
  URI_Column ('uri_column')
  [ Queries ('query_parameter' [,...]) ]
  [ Output ({ 'scheme' | 'host' | 'path' | 'fragment' }[,...]) ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
  [ Print_Null_Queries ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

Arguments
Argument Category Description
URI_Column Required Specifies the name of the input table column that contains the URIs
to unpack. Malformed URIs are ignored.
Queries Optional Specifies the names of the query parameters whose values are to be
extracted from the URIs.
Output Optional Specifies the URI components (outside the query) to output. By default, the function outputs all four components. If you specify 'path', the function outputs the URI path in normalized form (for example, it reduces /./bar/baz to /bar/baz).
Accumulate Optional Specifies the names of the input table columns to copy to the output
table.
Print_Null_Queries Optional Specifies whether to output URIs that contain none of the
parameters specified by the Queries argument. The default value is
'true'.

Input
Table 1259: URIUnpack Input Table Schema

Column Name Data Type Description


uri_column VARCHAR Contains URIs to unpack.
accumulate_column Any Column to copy to the output table.

Output
Table 1260: URIUnpack Output Table Schema

Column Name Data Type Description


accumulate_column Same as in input table Column copied from the input table.
scheme VARCHAR Column appears only if the Output argument specifies 'scheme'.
Contains the scheme of the URI.



host VARCHAR Column appears only if the Output argument specifies 'host'.
Contains the host of the URI.
path VARCHAR Column appears only if the Output argument specifies 'path'.
Contains the path of the URI.
fragment VARCHAR Column appears only if the Output argument specifies 'fragment'.
Contains the fragment of the URI.
query_parameter or query_parameter_ VARCHAR Column appears only if the Queries argument specifies the query_parameter. Contains the value of query_parameter. The name of this column is query_parameter unless query_parameter is the name of a column specified by the Accumulate or Output argument, in which case it is query_parameter_.

Example

Input
The input table has five URIs, some of which include characters that are percent-encoded.
Table 1261: URIUnpack Input table uris_input

id uri_column
1 'https://www.google.com/webhp?p1=chrome&p2=hello%20world&p3=UTF-8#fragment1'
2 'http://www.ietf.org/rfc/rfc2396.txt'
3 'ldap://[2001:db8::7]/c=GB?objectClass?one'
4 'telnet://192.0.2.16:80/'
5 'http://www.bar.com/./baz/foo?p1=netscape&p2=%7bhello%20world%7d&p3=UTF#This
%2Bis%2Bfragment%2Btoo');

SQL-MapReduce Call

SELECT * FROM URIUnpack (
  ON uris_input
  URI_Column ('uri_column')
  Queries ('p1', 'p2', 'p3')
  Output ('scheme', 'host', 'path', 'fragment')
  Accumulate ('id')
  Print_Null_Queries ('true')
) ORDER BY id;

Output
The characters encoded in the input table as %20, %7b, %7d, and %2B are decoded in the output table as the
space character, left brace ({), right brace (}), and plus sign (+), respectively. When a URI does not have a
specified parameter, the value of that parameter is NULL.
Table 1262: URIUnpack Example Output Table

id scheme host path p1 p2 p3 fragment


1 https www.google.com webhp chrome hello world UTF-8 fragment1
2 http www.ietf.org /rfc/rfc2396.txt NULL NULL NULL NULL
3 ldap [2001:db8::7] c=GB NULL NULL NULL NULL
4 telnet 192.0.2.16 / NULL NULL NULL NULL
5 http www.bar.com /baz/foo netscape {hello world} UTF This+is+fragment+too

XMLParser

Summary
The XMLParser function takes XML documents and outputs their element names, attribute values, and text
in a relational table, which you can search with SQL queries.

Background
XML data is semistructured and hierarchical, unlike the data in relational database tables. Therefore, you
cannot search XML data with SQL queries unless you first relationalize the XML data (that is, put it in a
relational database table).
Not all XML data can be relationalized; therefore, the XMLParser function constrains the relationships in
the extracted data to grandparent/parent/child, parent/child, ancestor, and sibling relationships. The
function lets you specify these constraints and the output table schema.

Usage

XMLParser Syntax
Version 1.7

SELECT * FROM XMLParser (
  ON input_table
  Text_Column ('text_column')
  Nodes ('node_pair_string' [,...])
  [ Sibling ('sibling_node_string' [,...]) ]
  [ Delimiter ('delimiter') ]
  [ SiblingDelimiter ('sibling_delimiter') ]
  [ MaxItemNum ('max_item_number') ]
  [ Ancestor ('nodes_path' [,...]) ]
  [ OutputColumnNodeID ('output_column_node') ]
  [ OutputColumnParentNodeName ('output_column_parent_node') ]
  [ OutputColumnGrandparentNodeName ('output_column_grandparent_node') ]
  [ ErrorHandler ('{true|yes|t|y|1|false|no|f|n|0} [; [output_column:] column [,...]]') ]
  [ Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the XML
documents. The function skips malformed XML documents.
Nodes Required Specifies the node-pair strings from which the function extracts data.
The simplest syntax for node_pair_string is:

[grandparent/]parent/child[,...]

where grandparent, parent, and child are node names.


For each grandparent, parent, and child, you can specify one or more
attributes to extract:

{grandparent|parent|child}[:attribute[,...]]

For each node_pair_string, the function generates a row in the output table and adds a column for each specified attribute.

Note:
Node and attribute names are case-sensitive.

A grandparent or parent without attributes can contain wildcards. The wildcards can follow the rules of either the SQL LIKE statement or the Java regular expression.
The SQL LIKE statement syntax is 'like(expression)', where
expression can include these wildcards:
• Percent (%), which matches any sequence of zero or more characters
• Underscore (_), which matches any single character
• Backslash (\), which “escapes” the wildcard character that follows it, causing that wildcard character to be treated as an ordinary character.
For example, 'like(%a_c\_)/d' matches the XML fragment
<123abc_><d>text</d></123abc_>.



The Java Regular Expression syntax is 'regex(expression)', where
expression follows the rules for a Java regular expression.
If no node_pair_string contains a parent node, or no node_pair_string
contains a grandparent node, the function outputs nothing. If no
node_pair_string contains a child node, the function outputs NULL child
node values. If the argument specifies no attributes, the function outputs
NULL attribute values.
Sibling Optional Specifies the sibling nodes of one parent node specified in the Nodes
argument. The syntax for sibling_node_string is:

sibling_node_name[:attribute[,...]]

The function includes the values from the sibling nodes in every output
row and adds a column to the output table for every sibling node and
every specified attribute.
If no sibling_node_string contains a sibling node, the function outputs
NULL sibling node values. If the argument specifies no attributes, the
function outputs NULL attribute values.
Delimiter Optional Specifies the delimiter that separates multiple child node values in the
output. The default value is comma (,).
SiblingDelimiter Optional Specifies the delimiter that separates multiple sibling node values in the
output. The default value is comma (,).
MaxItemNum Optional Specifies the maximum number of sibling nodes with the same name to
be returned. This value must be a positive integer. The default value is
10.
Ancestor Optional Specifies the ancestor paths for all parent nodes specified in the Nodes
argument. The simplest syntax for nodes_path is:

node[/node]...

For each node, you can specify one or more attributes:

node[:attribute[,...]]

The default ancestor path is the root of the XML document.


A node without attributes can contain wildcards. The wildcards can
follow the rules of either the SQL LIKE statement or the Java regular
expression. For details, see the description of the Node argument.
If you specify multiple ancestor paths, then the function parses each XML document to get results for each ancestor path. If different ancestor paths contain duplicate node names, as in the following example, then the result might be ambiguous:

SELECT * FROM XMLParser (
  ON xml_inputs
  Text_Column ('xml')
  Nodes ('parent1/child1')
  Ancestor ('A/B:attr/C:attr', 'A/C:attr/B:attr')
);

If different ancestor paths contain duplicate node names, the function does not check for duplicate node names in the ancestor paths when constructing the output. Instead, the function maintains a list of column names for all ancestor paths in the output schema. For each result, the function fills the values of its ancestor path in the list and generates the output for the ancestor part.
If no nodes_path is an ancestor path, the function outputs nothing. If the argument specifies no attributes, the function outputs NULL attribute values.
OutputColumnNodeID Optional Specifies the name of the output table column where the function stores the IDs of the extracted nodes. The default name is out_nodeid.
OutputColumnParentNodeName Optional Specifies the name of the output table column where the function stores the names of the extracted parent nodes. The default name is out_parent_node.
OutputColumnGrandparentNodeName Optional Specifies the name of the output table column where the function stores the tag names of the extracted grandparent nodes. The default name is out_grandparent_node.
ErrorHandler Optional Specifies whether the function handles errors that occur when parsing an
XML document. The default value is 'false' (the function aborts and
throws an exception).
If you specify 'true':
• If an error occurs while parsing a row, the function skips that row.
When the function completes the parsing, it outputs only the nodes
that were error-free.
• You can tell the function to output an additional column named
output_column_name and populate it with the values of the specified
columns. In the output column, the values of the specified columns
are separated with semicolons.
For example, the following argument adds the column error_info to
the output table and populates it with the values of input columns
col1 and col2 (with a semicolon after each value):

ErrorHandler('true;error_info:col1,col2')

The default output_column_name is ErrorHandler.



Accumulate Optional Specifies the names of the input columns to copy to the output table. No accumulate_column can be specified by the argument OutputColumnNodeID, OutputColumnParentNodeName, or OutputColumnGrandparentNodeName. By default, the function copies all input columns to the output table.

Input
Table 1263: XMLParser Input Table Schema

Column Data Type Description


text_column VARCHAR Contains the XML documents to parse. The function skips
malformed XML documents.
accumulate_column Any Column to copy to the output table.

Output
In these cases, the function outputs nothing:
• No nodes_path specified by the Ancestor argument is an ancestor path.
• No node_pair_path specified by the Nodes argument contains a parent node.
• No node_pair_path specified by the Nodes argument contains a grandparent node.
Otherwise, the output table has a row for each node specified in the Nodes argument and for each
descendant of each ancestor path specified in the Ancestor argument.
Table 1264: XMLParser Output Table Schema

Column Data Type Description


output_column_node INTEGER or Identifier of extracted node.
VARCHAR
output_column_parent_node VARCHAR Name of extracted parent node.
parent_attribute VARCHAR Attribute of extracted parent node. The table has one
column for each specified parent attribute. If no
parent attributes are specified, this value is NULL.
output_column_grandparent_node VARCHAR Name of extracted grandparent node.
grandparent_attribute VARCHAR Attribute of extracted grandparent node. The table
has one column for each specified grandparent
attribute. If no grandparent attributes are specified,
this value is NULL.
sibling_node_name VARCHAR Name of extracted sibling node. The table has one
column for each specified sibling node. If no sibling
nodes are specified, this value is NULL.



sibling_attribute VARCHAR Attribute of extracted sibling node. The table has
one column for each specified sibling attribute. If no
sibling attributes are specified, this value is NULL.
child_node VARCHAR Name of extracted child node. The table has one
column for each specified child node. If no child
nodes are specified, this value is NULL.
child_attribute VARCHAR Attribute of extracted child node. The table has one
column for each specified child attribute. If no child
attributes are specified, this value is NULL.
ancestor_attribute VARCHAR Attribute of extracted ancestor node. The table has
one column for each specified ancestor attribute. If
no ancestor attributes are specified, this value is
NULL.
accumulate_column Same as in input table Column copied from the input table.

Examples
• Example 1: Specify Sibling and Sibling_Delimiter
• Example 2: Specify Ancestor
• Example 3: Use Regular Expressions in Nodes and Ancestor
• Example 4: Handle Errors
• Example 5: Show Grandparent, Parent, and Child Nodes

Example 1: Specify Sibling and Sibling_Delimiter

Input

Table 1265: XMLParser Example 1 & 2 Input Table xml_input1

xid xmldocument
1 <bookstore>
: <owner>Billy</owner>
: <book category="ASTRONOMY">
: <title lang="en">Cosmos</title>
: <author>Carl Sagan</author>
: <author>Ann Druyan</author>
: <year edition="1">1980</year>
: <year edition="2">1981</year>
: <price>
: <member>49.99</member>
: <public>60.00</public>

: </price>
: <reference>
: <title>Comet</title>
: </reference>
: <position value="1" locate="east"></position>
: </book>
: <book category="CHILDREN">
: <author>Judy Blume</author>
:
: <price>
: <member>99.99</member>
: <public>108.00</public>
: </price>
: </book>
: </bookstore>
2 <setTopRpt xsi:noNamespaceSchemaLocation="Set%20Top%2020Report%20.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchemainstance">
: <settopid type="string" length="5">ST789</settopid>
: <accountid type="string">8728</accountid>
: <zipcode type="string">94025</zipcode>
: <reportstamp type="dateTime">2009-10-03T12:52:06</reportstamp>
: <temperature>
: <read type="bigDecimal">46</read>
: </temperature>
: <storage>
: <used type="bigDecimal">98</used>
: <used type="bigDecimal">199</used>
: <used type="bigDecimal">247</used>
: <total type="bigDecimal">300</total>
: </storage>
: <feed>
: <feedstamp type="dateTime">2009-10-03T12:52:06</feedstamp>
: </feed>
: </setTopRpt>

SQL-MapReduce Call

SELECT * FROM XMLParser (
  ON xml_input1
  Text_Column ('xmldocument')
  Nodes ('price/member')
  Sibling ('author', 'year', 'title')
  Sibling_Delimiter (';')
  Accumulate ('xid')
) ORDER BY 1,2;

Output
The parent node, price, has two child nodes, member and public. However, the Nodes argument specifies
only member; therefore, only its value is output. Title, author, and year are siblings of price. The first
document has multiple author and year siblings, so the values of those siblings are separated by the specified
delimiter, semicolon (;).
Table 1266: XMLParser Example 1 Output Table

xid out_nodeid out_parent_node author year title member


1 1 price Carl Sagan;Ann Druyan 1980;1981 Cosmos 49.99
1 2 price Judy Blume 99.99

Example 2: Specify Ancestor

Input
The input is the same as in Example 1 (Input).

SQL-MapReduce Call

SELECT * FROM XMLParser (
  ON xml_input1
  Text_Column ('xmldocument')
  Nodes ('temperature/read:type', 'storage/{used, total}')
  Sibling ('settopid:{type, length}', 'accountid')
  Ancestor ('setTopRpt')
  OutputColumn_NodeID ('nid')
  MaxItemNum (1)
  Accumulate ('xid')
) ORDER BY 1,2;

Output
The output table contains the node and sibling values of the specified ancestor, setTopRpt. Because MaxItemNum (1) is specified, only the first of the three used values (98) appears in the used column.
Table 1267: XMLParser Example 2 Output Table (Columns 1-5)

xid nid out_parent_node settopid settopid:type


2 1 temperature ST789 string
2 2 storage ST789 string

Table 1268: XMLParser Example 2 Output Table (Columns 6-11)

settopid:length accountid read read:type used total


5 8728 46 bigDecimal
5 8728 98 300

Example 3: Use Regular Expressions in Nodes and Ancestor

Input

Table 1269: XMLParser Example 3 Input Table xml_inputs_fuzzy

xid xmldocument
1 <bookstore>
: <owner>Billy</owner><items>
: <bookitem category="ASTRONOMY">
: <title lang="en">Cosmos</title>
: <author>Carl Sagan</author>
: <author>Ann Druyan</author>
: <year edition="1">1980</year>
: <price>
: <member>49.99</member>
: <public>60.00</public>
: </price>
: </bookitem>
: </items>
: </bookstore>
2 <cdstore>
: <owner> Amy </owner>
: <items>
: <cditem category="pop">
: <title lang="en">Breathe</title>
: <author>Yu Quan</author>
: <year>2003</year>
: <price>
: <member>29</member>
: <public>35</public>
: </price>
: <position value="1" locate="east"/>
: </cditem>
: </items>
: </cdstore>

SQL-MapReduce Call

SELECT * FROM XMLParser (
  ON xml_inputs_fuzzy
  Text_Column ('xmlDocument')
  Nodes ('like(%store)/owner', 'regex([a-z]+item)/{title,author,year}')
  Ancestor ('like(%store)')
  Accumulate ('xid')
) ORDER BY 1,2;

The Ancestor argument specifies that any node whose value ends with 'store' is an ancestor. The Nodes
argument specifies that the function is to output the owner of each store and the title, author, and year of
each node that starts with a string of lowercase alphabetic characters and ends with 'item'.

Output
For bookstore and cdstore, the output table contains the value of owner; for bookitem and cditem, the
output table contains the values of title, author, and year. Multiple values are separated by the default
delimiter, comma (,).
Table 1270: XMLParser Example 3 Output Table

xid out_nodeid out_parent_node owner title author year


1 1 bookstore Billy
1 2 bookitem Cosmos Carl Sagan,Ann Druyan 1980
2 1 cdstore Amy
2 2 cditem Breathe Yu Quan 2003

Example 4: Handle Errors

Input
The second XML document is missing the closing tag </bookstore>.
Table 1271: XMLParser Example 4 Input Table xml_inputs_error

xid xmldocument
1 <bookstore owner="Judy">
: <owner>Billy</owner><items>
: <bookitem category="ASTRONOMY">
: <title lang="en">Cosmos</title>
: <author>Carl Sagan</author>
: <author>Ann Druyan</author>
: <year edition="1">1980</year>
: <price>

: <member>49.99</member>
: <public>60.00</public>
: </price>
: </bookitem>
: </items>
: </bookstore>
2 <bookstore>

SQL-MapReduce Call

SELECT * FROM XMLParser (


ON xml_inputs_error
Text_Column ('xmldocument')
Nodes ('bookstore/owner', 'bookitem/title', 'bookitem/author')
ErrorHandler ('true;xmldocument')
Accumulate ('xid')
) ORDER BY xid;

Output
The output table has the column ERROR_HANDLER, which contains the value of the input column
xmldocument followed by a semicolon.
Table 1272: XMLParser Example 4 Output Table

xid out_node_id out_parent_node owner title author ERROR_HANDLER


1 1 bookstore Billy
1 2 bookitem Cosmos Carl Sagan,Ann Druyan
2 <bookstore>;

Example 5: Show Grandparent, Parent, and Child Nodes


This example uses the Nodes and Ancestor arguments to show the hierarchy of grandparent, parent, and
child nodes.

Input

Table 1273: XMLParser Example 5 Input table xml_input2

xid xml
1 <School name="UCBerkeley">
: <Dept ID="CS" name="Computer Science">
: <Class A="sophomore" B="Senior">
: <Year>

: <Student>Harry</Student>
: <Grade>A+</Grade>
: </Year>
: </Class>
: </Dept>
: </School>

SQL-MapReduce Call

SELECT * FROM XMLParser (


ON xml_input2
Text_Column ('xml')
Nodes ('Class:{A,B}/Year/Student', 'Year/Grade')
Ancestor ('School/Dept')
Accumulate ('xid')
);

Output

Table 1274: XMLParser Example 5 Output Table

xid out_nodeid out_grandparent_node out_parent_node Class:A Class:B Student Grade


1 1 Class Year sophomore Senior Harry A+

XMLRelation

Summary
The XMLRelation function takes XML documents and outputs their element names, attribute values, text,
and structural information in a relational table, which you can search with SQL queries. The function
maintains multilevel paths from the input XML documents to the XML elements.

Usage

XMLRelation Syntax
Version 1.3

SELECT * FROM XMLRelation (


TextColumn ('text_column')

DocIDColumns ({ 'docid_column' | 'docid_column_range' }[,...])
[ MaxDepth ('max_depth') ]
[ ExcludeElements ('node[/...][{node[,...]}]' [,...]) ]
[ AttributeAsNode
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ AttributeDelimiter ('delimiter') ]
[ Output ({ 'fulldata' | 'parentchild' | 'fullpath' }) ]
[ ErrorHandler ('{true|yes|t|y|1|false|no|f|n|0}
[;[output_column:] column[,...]]') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
In the ExcludeElements argument, you must type the braces ({ and }). For example:
'root/book/{author,chapter}'

Arguments
Argument Category Description
TextColumn Required Specifies the name of the input table column that contains the XML
documents. The function skips malformed XML documents.
DocIDColumns Required Specifies the names of the input table columns that contain the
identifiers of the XML documents. No docid_column can have the
same name as an output table column. For output column names,
refer to Output.
MaxDepth Optional Specifies the maximum depth in the XML tree at which to process
XML documents. The MaxDepth and Output arguments determine
the schema of the output table, and the number of columns in the
output table must not exceed 1600. The default value is 5.
ExcludeElements Optional Specifies the paths to the nodes to exclude from processing. The
function excludes each specified node and its child nodes. Examples
of paths to nodes are:
'chapter'
'root/book'
'root/book/{author,chapter}'
AttributeAsNode Optional Specifies whether to treat the attributes of a node as its child nodes.
The default value is 'false' (attributes of a node are stored in one
element of the output tuple).
AttributeDelimiter Optional Specifies the delimiter used to separate multiple attributes of one
node in XML documents. The default value is a comma ','.
Output Optional Specifies the output table schema (refer to Example 1: Output Three
Different Output Table Schemas). The MaxDepth and Output
arguments determine the schema of the output table, and the
number of columns in the output table must not exceed 1600. The
default value is 'fullpath'.

ErrorHandler Optional Specifies whether the function handles errors that occur when
parsing an XML document. The default value is 'false' (the function
aborts and throws an exception).
If you specify 'true':
• If an error occurs while parsing a row, the function skips that
row. When the function completes the parsing, it outputs only
the nodes that were error-free.
• You can tell the function to output an additional column named
output_column_name and populate it with the values of the
specified columns. In the output column, the values of the
specified columns are separated with semicolons.
For example, the following argument adds the column error_info
to the output table and populates it with the values of input
columns col1 and col2 (with a semicolon after each value):

ErrorHandler('true;error_info:col1,col2')

The default output_column_name is ErrorHandler.

Accumulate Optional Specifies the names of the input table columns to copy to the output
table. No accumulate_column can have the same name as an output
table column. For output column names, refer to Output.

Input
Table 1275: XMLRelation Input Table Schema

Column Data Type Description


text_column VARCHAR Contains the XML documents to parse. The function skips
malformed XML documents.
docid_column INTEGER or Contains the identifiers of the XML documents.
VARCHAR
accumulate_column Any Column to copy to the output table.

Output
The output table schema depends on the Output argument.

Output ('fulldata')

Table 1276: XMLRelation Output Table Schema, Output ('fulldata')

Column Data Type Description


docid_column INTEGER or Identifier of the XML document. Cannot be NULL.
VARCHAR

out_nodeid INTEGER Identifier of extracted node, unique within the XML document.
Cannot be NULL.
DnElement VARCHAR Name of node at depth n. The table has a column for each n in
the range [0, max_depth). The node at depth 0 is the root.
DnAttributes VARCHAR Attributes of node at depth n. This value has this form:
attributename=value[delimiter value…]
The table has a column for each n in the range [0, max_depth).
DnValue VARCHAR Value (text content) of node at depth n. The table has a column
for each n in the range [0, max_depth).
DnID INTEGER Identifier of node at depth n (its out_nodeid).
accumulate_column Same as in Column copied from the input table.
input table

The columns DnElement, DnAttributes, DnValue, and DnID contain information about the node at depth
n, and the columns DiElement, DiAttributes, DiValue, and DiID, where i is in the range [0, n), contain
information about its ancestors.

Output ('parentchild')

Table 1277: XMLRelation Output Table Schema, Output ('parentchild')

Column Data Type Description


docid_column INTEGER or Identifier of the XML document. Cannot be NULL.
VARCHAR
out_nodeid INTEGER Identifier of extracted node, unique within the XML document.
Cannot be NULL.
Element VARCHAR Name of node. Cannot be NULL.
Attributes VARCHAR Attributes of node. This value has this form:
attributename=value[delimiter value…]
Value VARCHAR Value (text content) of node.
ParentID INTEGER Identifier of parent of node.
accumulate_column Same as in Column copied from the input table.
input table

Output ('fullpath')

Table 1278: XMLRelation Output Table Schema, Output ('fullpath')

Column Data Type Description


docid_column INTEGER or Identifier of the XML document. Cannot be NULL.
VARCHAR

out_nodeid INTEGER Identifier of extracted node, unique within the XML document.
Cannot be NULL.
Element VARCHAR Name of node. Cannot be NULL.
Attributes VARCHAR Attributes of node. This value has this form:
attributename=value[delimiter value…]
Value VARCHAR Value (text content) of node.
DnElement VARCHAR Name of node at depth n. The table has a column for each n in
the range [0, max_depth). The node at depth 0 is the root.
DnAttributes VARCHAR Attributes of node at depth n. This value has this form:
attributename=value[delimiter value…]
The table has a column for each n in the range [0, max_depth).
DnValue VARCHAR Value (text content) of node at depth n. The table has a column
for each n in the range [0, max_depth).
DnID INTEGER Identifier of node at depth n (its out_nodeid).
accumulate_column Same as in Column copied from the input table.
input table

Examples
• Example 1: Output Three Different Output Table Schemas
• Example 2: Output Attributes as Nodes
• Example 3: Enable Error Handling

Example 1: Output Three Different Output Table Schemas


This example calls the function with each possible Output argument value to show how the value affects the
output table schema.

Input
The input table contains an XML document that has these hierarchical nodes: School at level 1, Dept at level
2, Class at level 3, and Student and Grade at level 4.
Table 1279: XMLRelation Examples 1 & 2 Input Table xmlrelation_input

xid xmldocument
1 <School name="UCLA">
: <Dept name="EE">
: <Class A="grad" B="undergrad">
: <Student>Harry</Student>
: <Grade>A+</Grade>
: </Class>

: </Dept>
: </School>

SQL-MapReduce Call 1

SELECT * FROM XMLRelation (


ON xmlrelation_input
TextColumn ('xmldocument')
DocIDColumns ('xid')
MaxDepth ('3')
Output ('fulldata')
) ORDER BY 1, 2;

Output
The output table shows the elements, attributes, values, and ids for each node.
Table 1280: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 1-5

xid out_nodeid D0Element D0Attributes D0Value


1 1 School name=UCLA
1 2 School name=UCLA
1 3 School name=UCLA
1 4 School name=UCLA
1 5 School name=UCLA

Table 1281: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 6-10

D0ID D1Element D1Attributes D1Value D1ID


1
1 Dept name=EE 2
1 Dept name=EE 2
1 Dept name=EE 2
1 Dept name=EE 2

Table 1282: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 11-14

D2Element D2Attributes D2Value D2ID

Class A=grad,B=undergrad 3
Class A=grad,B=undergrad 3

Class A=grad,B=undergrad 3

Table 1283: XMLRelation Example 1 Output Table for Output ('fulldata'), Columns 15-18

D3Element D3Attributes D3Value D3ID

Student Harry 4
Grade A+ 5

SQL-MapReduce Call 2

SELECT * FROM XMLRelation (


ON xmlrelation_input
TextColumn ('xmldocument')
DocIDColumns ('xid')
MaxDepth ('3')
Output ('parentchild')
) ORDER BY 1,2;

Output
The output table shows the attribute, value, and parent node for each node.
Table 1284: XMLRelation Example 1 Output Table for Output ('parentchild')

xid out_nodeid Element Attributes Value ParentID


1 1 School name=UCLA
1 2 Dept name=EE 1
1 3 Class A=grad,B=undergrad 2
1 4 Student Harry 3
1 5 Grade A+ 3
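
Because the 'parentchild' output is an ordinary relational table, you can explore the hierarchy with standard SQL. The following query is a hypothetical sketch: it assumes the output above was saved in a table named xmlrel_pc (a name chosen for illustration) and joins that table to itself to pair each node with the name of its parent.

-- Pair each extracted node with its parent's element name.
-- xmlrel_pc is assumed to hold the 'parentchild' output shown above.
SELECT c.xid, c.out_nodeid, c.Element AS child_element,
p.Element AS parent_element
FROM xmlrel_pc c
LEFT JOIN xmlrel_pc p
ON c.xid = p.xid AND c.ParentID = p.out_nodeid
ORDER BY c.xid, c.out_nodeid;

For the output above, this query lists Dept under School, Class under Dept, and Student and Grade under Class; the root node School has a NULL parent.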

SQL-MapReduce Call 3
This call specifies the default Output argument value.

SELECT * FROM XMLRelation (


ON xmlrelation_input
TextColumn ('xmldocument')
DocIDColumns ('xid')
MaxDepth ('3')

Output ('fullpath')
) ORDER BY 1,2;

Output
The output table shows the ID for each node in a separate column.
Table 1285: XMLRelation Example 1 Output Table for Output ('fullpath')

xid out_nodeid Element Attributes Value D0ID D1ID D2ID D3ID


1 1 School name=UCLA 1
1 2 Dept name=EE 1 2
1 3 Class A=grad,B=undergrad 1 2 3
1 4 Student Harry 1 2 3 4
1 5 Grade A+ 1 2 3 5

Example 2: Output Attributes as Nodes


This example outputs the attributes as if they were nodes.

Input
The input is the same as in Example 1 (Input).

SQL-MapReduce Call

SELECT * FROM XMLRelation (


ON xmlrelation_input
TextColumn ('xmldocument')
DocIDColumns ('xid')
MaxDepth ('3')
AttributeAsNode ('true')
) ORDER BY 1,2;

Output
The Elements column contains both actual nodes and attributes that are output as nodes. For the latter, the
Attributes column contains a tilde (~).
Table 1286: XMLRelation Example 2 Output Table

xid out_nodeid Element Attributes Value D0ID D1ID D2ID D3ID


1 1 School name=UCLA 1
1 2 name ~ UCLA 1 2
1 3 Dept name=EE 1 3

1 4 name ~ EE 1 3 4
1 5 Class A=grad,B=undergrad 1 3 5
1 6 A ~ grad 1 3 5 6
1 7 B ~ undergrad 1 3 5 7
1 8 Student Harry 1 3 5 8
1 9 Grade A+ 1 3 5 9

Example 3: Enable Error Handling


This example handles a malformed input document.

Input
The second XML document is malformed.
Table 1287: XMLRelation Example 3 Input Table xmlrelation_error

xid xmldocument
1 <School name="UCLA">
: <Dept name="EE">
: </Dept>
: </School>
2 <School /School> name="UTA">

SQL-MapReduce Call

SELECT * FROM XMLRelation (


ON xmlrelation_error
TextColumn ('xmldocument')
DocIDColumns ('xid')
MaxDepth ('1')
ErrorHandler ('true;xmldocument')
);

Output

Table 1288: XMLRelation Example 3 Output Table

xid out_nodeid Element Attributes Value D0ID D1ID ErrorHandler


1 2 Dept name=EE 1 2
1 1 School name=UCLA 1

2 <School /School> name="UTA">

CHAPTER 14
Aster Scoring SDK

Aster Scoring SDK


• Introduction to Aster Scoring SDK
• AMLGenerator
• Scorer
• Aster Scoring SDK Functions
• FAQ

Introduction to Aster Scoring SDK


Aster Scoring SDK is intended for systems that follow events in real time and must act on those events
immediately, with the support of analytics. Aster Scoring SDK applies predictive analytics to make timely
decisions based on real events, and makes Aster Analytics functions available for real-time prediction.
Use cases for Aster Scoring SDK include:
• Fraud prevention
• Churn reduction
• System failure predictions
• Site personalization
• Purchase recommendations
• Dynamic promotion pricing
The workflow of Aster Scoring SDK is a four-step process:

1. Model training/data loading

The training phase trains a model on the Aster framework in the usual way. This step also involves loading
any additional tables (such as dictionaries or rules for text analytics) into the database, or installing them as
files on the database.
2. AML generation
Run the AMLGenerator function on the model (from Step 1), supplying the relevant information for the
corresponding function.
3. AML file transfer
Download the .aml file from the Aster framework (if using an ACT terminal, use the command \download
amlfile) and transfer (upload) it to the system working in real time, using any standard ssh/scp client
tool.
4. Scorer execution
Score input requests (queries), using the Scorer API, based on the trained model in the .aml file.

AMLGenerator

Summary
The AMLGenerator function translates an Aster model into an XML-based Aster Model Language (AML)
format, which is accepted by the Aster Scoring SDK functionality.

Usage

AMLGenerator Syntax
Version 1.0

SELECT * FROM AMLGenerator (


ON (SELECT 1) PARTITION BY 1
ModelType ('function_name')
[ ModelTable ('model_table' [,...]) ]
[ ModelTag ('model_tag' [,...]) ]
[ InstalledFile ('boolean' [,...]) ]
RequestColNames ('column_name' [,...])
RequestColTypes ('column_type' [,...])
[ AMLPrefix ('file_name') ]
[ OverwriteOutput ({'true'|'false'|'t'|'f'|'yes'|'no'|'1'|'0'}) ]
[ Domain ('host:port') ]
[ UserId ('user_id')]
[ Password ('password') ]
[ SSLSettings ('SSLSettings') ]
[ SSLTrustStorePassword ('SSLTrustStorePassword') ]
[ RequestArgName1 ('arg_name') ]
[ RequestArgVal1 ('arg_value') ]
[ RequestArgName2 ('arg_name') ]
[ RequestArgVal2 ('arg_value') ]

...
);

Arguments
Argument Category Description
ModelType Required Specifies the function name for which this .aml file is to be used.
For predictors, this represents the type of the model trained.
The model types of the functions are listed in Aster Scoring SDK
Functions.
ModelTable Optional Specifies the input model tables from which the .aml file is to be
generated. The argument clause accepts multiple input model
tables, which can be either database tables or files installed on the
database.
ModelTag Optional Specifies the input model table tag for each input model table
specified in the ModelTable clause. Tags are supported for functions
that accept multiple model tables, and are used to distinguish
the role of each model table in the context of the Aster Scoring
SDK function.
The supported tags are listed in Aster Scoring SDK Functions.
The number of entries must match the number of entries in the
ModelTable clause.
InstalledFile Optional Specifies whether the corresponding value in ModelTable is a file
installed on the database or a database table. If ModelTable is a
database table, set this to 'false' (the default); if it is a file, set this
to 'true'. The number of entries must match the number of entries
in the ModelTable clause.
RequestColNames Required Specifies the column names of the request to be scored. For a
predictor, these column names typically match the column
names of the training data used to train the model.
RequestColTypes Required Specifies the column types of the request to be scored. For a
predictor, these column types typically match the column types
of the training data used to train the model.
AMLPrefix Optional Specifies the name of the generated AML model file. The default
value is 'model'. The output file is stored with the suffix .aml.
OverwriteOutput Optional Specifies whether the output AML model file is to be overwritten
if it already exists.
Domain Optional Specifies the IP address of the queen node. The default is the
queen of the current cluster.
UserId Optional Specifies the Aster Database user name of the user. The default is
beehive.
Password Optional Specifies the Aster Database password of the user.

SSLSettings Optional Specifies the SSL connection information in a string, excluding
the SSL TrustStore password. Use this argument if you want the
function to use a JDBC SSL connection to connect to Aster
Database instead of a normal JDBC connection. The connection
string specified by this argument is appended to the end of the
SSL JDBC connection string.
SSLTrustStorePassword Optional Specifies the SSL TrustStore password. This password is required
if you use the SSLSettings argument. If SSLSettings is not
specified, do not specify this argument.
RequestArgNamen Optional Specifies the argument clause name for the function to be used in
scoring. These clauses are the same as mentioned in the
corresponding SQL-MapReduce function documentation, with
some exceptions. The supported clauses for each function are
listed in Aster Scoring SDK Functions. The RequestArgName
must start from 1 in a sequential manner. For example, the
following sequence of clauses is acceptable: RequestArgName1,
RequestArgName2, RequestArgName3. However, this sequence
is not acceptable: RequestArgName1, RequestArgName2,
RequestArgName4. The maximum value of n is 30. The values
for corresponding arguments are provided in RequestArgVal
clause.
RequestArgValn Optional Specifies the argument clause value for the function to be used in
scoring. The value in this clause corresponds to the clause
mentioned in RequestArgName for the same value of n. These
clause values are the same as mentioned in the documentation
for the corresponding SQL-MapReduce function. The
RequestArgVal must start from 1 in a sequential manner, and
follows the same format as RequestArgName. The maximum
value of n is 30.

Input
This is a driver function; the ON clause does not operate on any table. The ON clause must be
(SELECT 1) PARTITION BY 1 to execute the function.

Output
The function generates an AML file, installed on Aster Database, that conforms to a specific XSD (XML
schema definition) format. Statistics for the generated AML file are printed on the console. The AML file can
be downloaded from ACT using the command \download AMLFILE, where AMLFILE is the name of the
AML file specified in the AMLPrefix clause.
The following are some details about the generated AML file:

Header
The header consists of ModelType, AMLGenerator build version, and Teradata Copyright information.

Request Columns
Request columns specify the names and data types of columns expected in the scoring request and are
appended in the block “columns” with entity = “request”. These are extracted from argument clauses
RequestColNames and RequestColTypes. The data type is converted from Aster database SQL type to
Scoring data type.

Request Parameters
The parameters specified by the user in the clauses RequestArgName and RequestArgVal are appended in
the block “params”. These parameters are not validated by AMLGenerator (parameter parsing takes place
inside Scorer) and are appended as specified by the user.

Model Columns
Model columns are appended in the block “columns” with entity = “model”. They specify the names and
data types of columns in Model data. Each block contains ModelTag information, if provided. Each block
also contains InstalledFile information, if provided.
The behavior of Model columns varies based on whether the model is database table or installed file.
• Database Table:
∘ Model column names and types are retrieved from Aster database and converted to Scoring data
types.
∘ Database table name is appended in the AML file model column header.
• Installed File:
∘ Model columns are empty.
∘ Installed file name is appended in the AML file model column header.

Model Data
Model data contains the actual data for the model provided in ModelTable clause. The data is stored
differently depending upon whether it is a database table or installed file. This block also contains checksum
of the data stored for data integrity.
• Database Table:
∘ For time and timestamp sql types, data is converted to acceptable scoring format.
∘ Special characters are transformed as follows:
Backslash (\) is preceded by another backslash (\).
Comma (,) is preceded by backslash (\).
Ampersand (&) is replaced with "&amp;".
Less than (<) is replaced with “&lt;”.
Greater than (>) is replaced with “&gt;”.
∘ For binary data, data is stored using Base64 encoding.
• Installed File:
∘ Data is captured in binary format (bytes) and stored using Base64 encoding in the AML file.
∘ Because of binary encoded format, special character handling is not needed.

Note:
Scorer is expected to use only AML files generated by the AMLGenerator function. Manual (third-party)
generation or manipulation of an AML file may lead to Scorer failure or incorrect scoring results.

Example

Input
In this example, the input table to AMLGenerator, glass_attribute_table_output, is the output of the single
decision tree function (Single_Tree_Drive).
Table 1289: AMLGenerator Example Input Table glass_attribute_table_output

node_id node_size node_label left_id left_size right_id right_size attribute


0 100 2 1 32 2 68 Mg
1 32 2 3 11 4 21 Na
2 68 1 5 28 6 40 Ca
3 11 2 7 10 8 1 Fe
4 21 7 9 14 10 7 K
5 28 2 11 4 12 24 Al
6 40 1 13 29 14 11 Na

SQL-MapReduce Call
The RequestColNames argument lists all the attributes on which the single decision tree function was trained.

SELECT * FROM AMLGenerator (


ON (SELECT 1) PARTITION BY 1
ModelType ('SDT')
ModelTable ('glass_attribute_table_output')
OverwriteOutput ('true')
RequestColNames ('pid', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba',
'Fe')
RequestColTypes ('int', 'double', 'double', 'double', 'double',
'double', 'double', 'double', 'double', 'double')
AMLPrefix ('glass_model')
RequestArgName1 ('ATTRTABLE_AttributeColumns')
RequestArgVal1 ('RI', 'Na', 'Mg', 'Al', 'Si', 'K','Ca', 'Ba', 'Fe')
RequestArgName2 ('ATTRTABLE_pidColumns')
RequestArgVal2 ('pid')
);

Output
The AML file is generated with the name glass_model.aml (shown below) and is installed on the user's
database. From ACT, you can download the file to view its contents with the command
\download glass_model.aml. The file is stored in the directory from which ACT was invoked.

<?xml version="1.0" encoding="UTF-8"?>


<model xmlns:aml="Aster Model Language">
<name>SDT</name>
<version>6.21_rel_1.0_r55243</version>
<copyright>Copyright (c) 2015-2016 by Teradata Corporation. All rights
reserved.</copyright>
<columns entity="model" name="glass_attribute_table_output">
<c type="long">node_id</c>
<c type="long">node_size</c>
<c type="string">node_label</c>
<c type="long">left_id</c>
<c type="long">left_size</c>
<c type="long">right_id</c>
<c type="long">right_size</c>
<c type="string">attribute</c>
</columns>
<columns entity="request">
<c type="int">pid</c>
<c type="double">RI</c>
<c type="double">Na</c>
<c type="double">Mg</c>
<c type="double">Al</c>
<c type="double">Si</c>
<c type="double">K</c>
<c type="double">Ca</c>
<c type="double">Ba</c>
<c type="double">Fe</c>
</columns>
<data entity="model" name="glass_attribute_table_output">
<d>0,100,2,1,32,2,68,Mg</d>
<d>1,32,2,3,11,4,21,Na</d>
<d>2,68,1,5,28,6,40,Ca</d>
<d>3,11,2,7,10,8,1,Fe</d>
<d>4,21,7,9,14,10,7,K</d>
<d>5,28,2,11,4,12,24,Al</d>
<d>6,40,1,13,29,14,11,Na</d>
<checksum>d49eed5708d45967b88444a88e99bdb0bc11e11d</checksum>
</data>
<params>
<p name="AttrTableAttributeColumns">RI,Na,Mg,Al,Si,K,Ca,Ba,Fe</p>
<p name="AttrTablePidColumns">pid</p>
</params>
</model>


Scorer

Summary
The Scorer function provides a software framework to score input queries based on a given model and
predictor. Scorer is a set of Java classes packed into a .jar file that resides in the user's framework (a real-time
Java virtual machine environment). This figure shows how Scorer interacts with the rest of the system:

The scorer computation model follows three simple steps:

1. Instantiate a scorer object to use for scoring.


2. Configure the scorer with the AML model file.
3. Score the incoming requests based on the configured model.

Multiple Scorer Configuration
If needed, multiple scorers can be configured for multiple predictors in the user framework. An example
workflow is shown below (as a guideline):

Cloud Integration
Scorer can be configured for use in any framework, including web servers, cloud, and distributed computing
environments. The workflow in such environments is shown below:

Package
The Scorer package contains these files:

• scoring.jar, the main jar library for Aster Scoring SDK.
• scoring-doc.zip, the javadoc API for scoring.jar classes.
To view the API documentation, unzip the file and open index.html in a browser, or integrate it into
your development framework.
• scoring-examples.zip, examples for using the scoring.jar library.

Installation
The only library that Aster Scoring SDK needs is scoring.jar. You must load this jar file into the user
environment; then you can invoke Scorer like any jar library.
Some ways to load the library are:
• Use classpath
While compiling a package that contains Scoring classes, add scoring.jar to the classpath.
For example, assume that the application MyApp.java uses Scoring classes. The command to compile it
from the command line is:

javac -cp .:path_to_scoring.jar MyApp.java

(You can also add scoring.jar to the classpath when you compile such a package from a builder such
as ant or maven.)
The command to run the compiled application from the command line is:

java -cp .:path_to_scoring.jar MyApp


• Use Eclipse
Add scoring.jar to the Java build path of the project.
For instructions for installing and running Aster Scoring SDK in a cloud environment, see your cloud
environment documentation for running a third party jar library.


Functional Support
While most of the usage details for these functions are the same for Scorer as for the SQL-MapReduce
functions, the formats for scoring data types, argument clauses, and output generation are slightly
modified to isolate Scorer from the Aster framework. For these differences, refer to the description of the
Aster Scoring SDK version of the function in Aster Scoring SDK Functions. For information about the
function itself, refer to the description of its counterpart SQL-MapReduce function.

Input Formats
Scorer uses two input formats, CSVInputFormat and AMLInputFormat.

CSVInputFormat
CSVInputFormat supports comma separated values (CSV) file format to read input data (requests) and
request parameters from file system (input stream). The request can also be populated using the API, as
mentioned in Scoring API.

AMLInputFormat
AMLInputFormat supports AML (Aster Model Language) file format. AML is an XML-based file with a
predefined XSD schema, and is the data exchange format between Aster Database and Scorer, as mentioned
in AMLGenerator.

Data Types
The following data types are supported by the Scorer. The corresponding data types on Aster Database are
also listed.
Table 1290: Scoring Data Types

Scoring Data Type Aster (SQL) Data type


INT INTEGER, SERIAL
LONG BIGINT, BIGSERIAL
SHORT SMALLINT
FLOAT REAL
DOUBLE DOUBLE PRECISION, NUMERIC, DECIMAL
BOOLEAN BOOLEAN
BYTE BYTEA

STRING CHARACTER, CHARACTER VARYING, VARCHAR, IP4, IP4RANGE, BIT, BIT
VARYING, UUID, INTERVAL, TEXT
DATE DATE
TIME TIME (TIME WITHOUT TIME ZONE), TIME WITH TIME ZONE
TIMESTAMP TIMESTAMP (TIMESTAMP WITHOUT TIME ZONE), TIMESTAMP WITH
TIME ZONE

For date, time, and timestamp data types, format conversions are required in some cases when transforming
data from Aster database to Aster Scoring SDK for compatibility of SQL data types with the Java library
(java.sql package).

Output Formats
The output can be configured in the following ways:

API
The API for output format is described in Scoring API.

Logging
The output can be redirected to the console using Java's library methods such as System.out and System.err,
or piped to some file. The output can also be configured to go to an event log, the details of which are
discussed in Logging Support.

Scoring API
The scoring APIs are documented in the javadoc. After installation, you can invoke Scorer in the following
ways (the code blocks show high-level method calls to invoke scorer).

Scoring Requests from File System


This usage provides a simple interface to test scorer functionality.

// initialize
Scorer scorer = new Scorer ();
// configure
// make sure that AML file modelFile is available on file system
scorer.configure (modelFile);
// run scorer (multiple calls)
// make sure that CSV file requestFile is available on file system
scorer.score (requestFile);

Scoring Requests Using API
This usage is recommended in a production environment. It improves performance by avoiding the I/O
overhead of writing and reading request files on the file system.

// initialize
Scorer scorer = new Scorer ();
// configure
// make sure that AML file modelFile is available on file system
Request request = scorer.configure (modelFile);
// populate data structure request
// run scorer (multiple calls)
scorer.score (request);

Javadoc
The Javadoc is in the scoring package in the file scoring-doc.zip. The following figure shows a snapshot
of Javadoc for the Scorer class.

Examples
The scoring examples and their code are in the scoring package, in the file scoring-examples.zip.
Each example is in its own directory. To test the examples, run the script run_examples.sh.
The following figure shows an example scoring application with the GLM real-time predictor.


Logging Support
Scorer logs information that is helpful for monitoring progress and for debugging. Optionally, Scorer also
logs events in a separate event log file. Every day, a log file is stored with the date appended to the log file
name. The log files are configured to log to a file on the local file system where Scorer runs.
The following logging variables are supported as System Properties. An example that sets the system
properties is in the run_examples.sh script.

Variable Default Value Description


scoring.log.mode 'INFO' Logging modes, based on org.apache.log4j.Level.
scoring.event.mode 'ERROR' Scoring modes, based on org.apache.log4j.Level.
scoring.log.dir '/tmp/scoring' Directory that contains log files for scoring. The file scoring.log stores the logging information in this directory.
scoring.event.dir '/tmp/scoring' Directory that contains event logging for scoring. The file scoring-event.log stores the event logging information in this directory.

Compatibility
Scorer can run on any platform, and is compatible with Java Development Kit (JDK) 1.6 and later.

Performance
The response time for a Scorer input query is expected to be on the order of milliseconds. However, Scorer
performance strongly depends on the available CPU and memory resources, the trained model, the model
size, and the request to be scored.

Note:
Scorer expects to use the AML file generated by the AMLGenerator function. Using an AML file
generated or manipulated by a third-party might cause Scorer failure or incorrect scoring results.


Aster Scoring SDK Functions


This section provides details about the Aster Scoring SDK functions supported by Scorer. Most details are
the same as those for the corresponding SQL-MapReduce function; consult the section for the equivalent
SQL-MapReduce function for a detailed description. The javadoc for Scorer provides information about
how to run a scorer with a specific Aster Scoring SDK function.
These sections discuss scoring-specific information for each currently supported Aster Scoring SDK
function. Each function description has three (optionally four) sub-sections:
1. Model Format - This section describes the model and its type accepted by the function. The model is
incorporated inside the AML file. The related arguments of AMLGenerator are ModelType, ModelTable,
ModelTag, and InstalledFile.
2. Request Definition - This section describes the format for the input query to be scored against the Aster
Scoring SDK function, if different from the equivalent SQL-MapReduce function. In the context of a
SQL-MapReduce function, this format corresponds to the input table schema for test (query) data set. In
most of the functions, this format is the same as the input used to train the model. The format is supplied
using AMLGenerator arguments RequestColNames and RequestColTypes (to be incorporated in the
AML file). Inside scorer, a request object is instantiated based on the format and can be used to populate
request data for scoring.
3. Parameters - This section describes the parameters used by the Aster Scoring SDK function for scoring.
These parameters correspond to the arguments of the corresponding SQL-MapReduce (predictor)
function. This section also lists any arguments that are currently not supported by the Aster Scoring SDK
function. These parameters can be supplied using AMLGenerator arguments RequestArgName1,
RequestArgVal1, RequestArgName2, RequestArgVal2, etc. (to be incorporated in the AML file). Because
parameters can change frequently across queries for the same Aster Scoring SDK function, parameters
can also be supplied/modified directly inside Scorer. See the Scorer javadoc for further information.
4. Additional Notes - An optional section that provides additional details about the function not covered in
the other three sections.

Aster Scoring SDK Single Decision Tree
This is the Aster Scoring SDK version of the Single_Tree_Predict function, which is discussed in
Single_Tree_Predict.

Model Format
Table 1291: Aster Scoring SDK Single Decision Tree Model Format

Argument Description
ModelType sdt, single decision tree
ModelTable Database table
ModelTag No tags supported

Request Definition
The Aster Scoring SDK function uses a different request schema from the SQL-MapReduce function
Single_Tree_Predict. As an example, the request table for the SQL-MapReduce function looks like this:
Table 1292: Single_Tree_Predict Request Schema

PID ATTRIBUTE ATTRVALUE


9 RI 1.51545
9 Na 14.14
9 Mg 0.00
... ... ...

For the Aster Scoring SDK function, the request must be a flat data structure, as shown below for the same
example:
Table 1293: Aster Scoring SDK Single Decision Tree Request Schema

PID RI Na Mg ...
9 1.51545 14.14 0.00 ...

Parameters
Table 1294: Aster Scoring SDK Single Decision Tree Parameters

Parameter Supported Comments


AttrTableAttributeColumns Yes List of attribute columns in the request.
AttrTablePidColumns Yes List of pid columns in the request.
ON attribute_table No Provided using Request.
ON model_table No Populated in AML file.

AttrTableGroupByColumns No Replaced by AttrTableAttributeColumns.
AttrTableValColumn No The default value is used.
ModelTableNodeColumn No The default value is used.
ModelTableSizeColumn No The default value is used.
ModelTableLeftSizeColumn No The default value is used.
ModelTableRightSizeColumn No The default value is used.
ModelTableAttrColumns No The default value is used.
ModelTableSplitColumn No The default value is used.
ModelTableLabelColumn No The default value is used.
ModelTableLeftLabelColumn No The default value is used.
ModelTableRightLabelColumn No The default value is used.
ModelTableLeftBucketColumn No The default value is used.
ModelTableRightBucketColumn No The default value is used.

Aster Scoring SDK Generalized Linear Model


This is the Aster Scoring SDK version of the GLMPredict function, which is discussed in GLMPredict.

Model Format
Table 1295: Aster Scoring SDK Generalized Linear Model - Model Format

Argument Description
ModelType glm, generalized linear model
ModelTable Database table
ModelTag No tags supported

Request Definition
Same as SQL-MapReduce function GLMPredict.

Parameters
Table 1296: Aster Scoring SDK Generalized Linear Model Parameters

Parameter Supported Comments


Accumulate Yes

Family Yes
Link Yes
ON input_table No Provided using Request.
ModelTable No Populated in AML file.
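
As a guideline, the following AMLGenerator call is a hypothetical sketch of packaging a GLM model for scoring. The model table name glm_model_table, the request columns (id, x1, x2), and the Family value are illustrative assumptions; substitute the output table of your own GLM training run and the column names and types of its training data.

SELECT * FROM AMLGenerator (
ON (SELECT 1) PARTITION BY 1
ModelType ('glm')
ModelTable ('glm_model_table')
RequestColNames ('id', 'x1', 'x2')
RequestColTypes ('int', 'double', 'double')
AMLPrefix ('glm_model')
OverwriteOutput ('true')
RequestArgName1 ('Accumulate')
RequestArgVal1 ('id')
RequestArgName2 ('Family')
RequestArgVal2 ('LOGISTIC')
);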

Aster Scoring SDK Random Forest


This is the Aster Scoring SDK version of the Forest_Predict function, which is discussed in Forest_Predict.

Model Format
Table 1297: Aster Scoring SDK Random Forest Model Format

Argument Description
ModelType rf, random forest
ModelTable Database table
ModelTag No tags supported

Request Definition
Same as SQL-MapReduce function Forest_Predict.

Parameters
Table 1298: Aster Scoring SDK Random Forest Parameters

Parameter Supported Comments


IdCol Yes
CategoricalInputs Yes
NumericInputs Yes
Detailed Yes
ON { table_name | view_name | (query) } No Provided using Request.
ON model_table as ModelTable No Populated in AML file.
ModelFile No Use ModelTable instead.
Forest No Use ModelTable instead.

Aster Scoring SDK Naïve Bayes
This is the Aster Scoring SDK version of the NaiveBayesPredict function, which is discussed in
NaiveBayesPredict.

Model Format
Table 1299: Aster Scoring SDK Naïve Bayes Model Format

Argument Description
ModelType nb, naïve bayes
ModelTable Database table
ModelTag No tags supported

Request Definition
Same as SQL-MapReduce function NaiveBayesPredict.

Parameters
Table 1300: Aster Scoring SDK Naïve Bayes Parameters

Parameter Supported Comments


IdCol Yes
CategoricalInputs Yes
NumericInputs Yes
ON input_table No Provided using Request.
Model No Populated in AML file.

Aster Scoring SDK Naïve Bayes Text Classifier


This is the Aster Scoring SDK version of the NaiveBayesTextClassifierPredict function, which is discussed in
NaiveBayesTextClassifierPredict.

Model Format
Table 1301: Aster Scoring SDK Naïve Bayes Text Classifier Model Format

Argument Description
ModelType nbtc, naïve bayes text classifier
ModelTable Database table
ModelTag No tags supported

Request Definition
Same as SQL-MapReduce function NaiveBayesTextClassifierPredict.

Parameters
Table 1302: Aster Scoring SDK Naïve Bayes Text Classifier Parameters

Parameter Supported Comments


DocIdColumns Yes
InputTokenColumn Yes
TopK Yes
ModelType Yes
ON input_table No Provided using Request.
ON model_table No Populated in AML file.
ModelTokenColumn No The default value is used.
ModelCategoryColumn No The default value is used.
ModelProbColumn No The default value is used.

Aster Scoring SDK Text Tagging


This is the Aster Scoring SDK version of the TextTagging function, which is discussed in TextTagging.

Model Format
Table 1303: Aster Scoring SDK Text Tagging Model Format

Argument Description
ModelType ttag, text tagging
ModelTable Database table for rules, installed file for dictionary.
ModelTag PREDICT (for rules), DICT (for dictionary).

Request Definition
Same as SQL-MapReduce function TextTagging.

Parameters
Table 1304: Aster Scoring SDK Text Tagging Parameters

Parameter Supported Comments


Rules Yes If you do not specify a rules table in the ModelTable
argument of AMLGenerator, use this argument to
specify the tagging rules.
Language Yes
Tokenize Yes
OutputByTag Yes
TagDelimiter Yes
Accumulate Yes
ON texttable No Provided using Request.
ON rules No Populated in AML file.

Additional Notes
Rules can be provided using either Request parameters or a database table in the ModelTable argument.
A dictionary can be provided using the ModelTable, ModelTag, and InstalledFile arguments.
When a rule uses a dictionary file 'file1', you must first install the file on the database and then specify it in
the ModelTable clause, with ModelTag 'DICT', when invoking AMLGenerator.
The following examples highlight different variations in usage of rules and the dictionary:

Example 1
No model table is specified and rules are provided using Request parameters

RequestArgName1 ('Rules')
RequestArgVal1 ('contain(content, "floods", 1, ) OR
contain(content, "tsunamis", 1,) AS Natural-Disaster')

Example 2
Rules table is a database table. No dictionary file is specified.

ModelTable ('rules')

or

ModelTable ('rules')
ModelTag ('PREDICT')
InstalledFile ('false')

Example 3
Rules are provided using Request parameters and two dictionary files ‘file1’ and ‘file2’.

ModelTable ('file1', 'file2')


InstalledFile ('true', 'true')
ModelTag ('DICT', 'DICT')
RequestArgName1 ('Rules')
RequestArgVal1 ('tag1, DICT(content, "file1", 1, )',
'tag2, DICT(content, "file2", 1, )')

Example 4
Rules table and dictionary are both provided using the ModelTable argument. The rules in the rule table
‘rules’ may contain references to the dictionary files ‘file1’ and ‘file2’.

ModelTable ('rules', 'file1', 'file2')


InstalledFile ('false', 'true', 'true')
ModelTag ('PREDICT', 'DICT', 'DICT')

Aster Scoring SDK Extract Sentiment


This is the Aster Scoring SDK version of the ExtractSentiment function, which is discussed in
ExtractSentiment.

Model Format
Table 1305: Aster Scoring SDK Extract Sentiment Model Format

Argument Description
ModelType sent, extract sentiment
ModelTable Database table for dictionary, installed file for dictionary file or classification file.
ModelTag DICT (for dictionary database table and installed dictionary file), CLASS (for
installed classification file).

Request Definition
Same as SQL-MapReduce function ExtractSentiment.

Parameters
Table 1306: Aster Scoring SDK Extract Sentiment Parameters

Parameter Supported Comments


TextColumn Yes
Language Yes

Model Yes The corresponding model file (database table or
installed files) must be populated in the AML file.
The Model argument can also be used to specify
an installed dictionary or classification file.
Level Yes
HighPriority Yes
Filter Yes
Accumulate Yes
ON table_name No Provided using Request.
ON dict No Populated in AML file.

Additional Notes
The supporting files (dictionary model or file, classification model) are input using the ModelTable
argument in AMLGenerator.
The following examples highlight different variations in usage of the dictionary and classification model:

Example 1
Use the default model (determined by Language argument).

ModelTable ('default_sentiment_lexicon.txt')
ModelTag ('DICT')
InstalledFile ('true')

For Chinese language text, the default model is default_sentiment_lexicon_zh_cn.txt (for Simplified
Chinese) and default_sentiment_lexicon_zh_tw.txt (for traditional Chinese).

Example 2
Model is a dictionary table.

ModelTable ('dict_table')
ModelTag ('DICT')
InstalledFile ('false')

Example 3
A dictionary table and an installed dictionary file (other than the default) are used. In this case, the
sentiment words from the dictionary table have a higher priority than those in the dictionary file.

ModelTable ('sentiment_lexicon.txt', 'dict_table')


InstalledFile ('true', 'false')
ModelTag ('DICT', 'DICT')

RequestArgName1 ('Model')
RequestArgVal1 ('dictionary:sentiment_lexicon.txt')

Example 4
In this example, a classification file is used.

ModelTable ('sentiment_classification_model.bin')
InstalledFile ('true')
ModelTag ('CLASS')
RequestArgName1 ('Model')
RequestArgVal1 ('classification:sentiment_classification_model.bin')

Aster Scoring SDK Text Parser


This is the Aster Scoring SDK version of the Text_Parser function, which is discussed in
Text_Parser.
Model Format
Table 1307: Aster Scoring SDK Text Parser Model Format

Argument Description
ModelType tparser, text parser
ModelTable Installed file
ModelTag STOPWORDS, STEMEXCEPTIONWORDS

Request Definition
Same as SQL-MapReduce function Text_Parser.

Parameters
Table 1308: Aster Scoring SDK Text Parser Parameters

Parameter Supported Comments


TextColumn Yes
ToLowerCase Yes
Stemming Yes
Delimiter Yes
TotalWordsNum Yes
Punctuation Yes
RemoveStopWords Yes

ListPositions Yes
OutputByWord Yes
TokenColumn Yes
PositionColumn Yes
Accumulate Yes
FrequencyColumn Yes
TotalColumn Yes
ON table_name No Provided using Request.
StemmingExceptions No Populated in AML file.
StopWords No Populated in AML file.

Additional Notes
Model files for stop words and stemming exception words are input using the ModelTable, ModelTag, and
InstalledFile arguments of AMLGenerator. If neither table is provided, no stop words or stemming
exceptions are used. Either (or both) tables can be provided, as shown below:

Example 1

ModelTable ('stop_words_file')
ModelTag ('STOPWORDS')
InstalledFile ('true')

Example 2

ModelTable ('stem_exception_words_file')
ModelTag ('STEMEXCEPTIONWORDS')
InstalledFile ('true')

Example 3

ModelTable ('stop_words_file', 'stem_exception_words_file')


ModelTag ('STOPWORDS', 'STEMEXCEPTIONWORDS')
InstalledFile ('true', 'true')

Aster Scoring SDK Text Tokenizer


This is the Aster Scoring SDK version of the TextTokenizer function, which is discussed in TextTokenizer.

Model Format
Table 1309: Aster Scoring SDK Text Tokenizer Model Format

Argument Description
ModelType ttoken, text tokenizer
ModelTable Database table, installed file
ModelTag DICT, CRF

Request Definition
Same as SQL-MapReduce function TextTokenizer.

Parameters
Table 1310: Aster Scoring SDK Text Tokenizer Parameters

Parameter Supported Comments


TextColumn Yes
Language Yes
OutputDelimiter Yes
OutputByWord Yes
Accumulate Yes
ON input_table No Provided using Request.
ON dict No Populated in AML file.
Model No Populated in AML file.
UserDictionaryFile No Populated in AML file.

Additional Notes
If the function uses a dictionary table, a dictionary model, and/or a CRF model file, they can be provided
using the ModelTable, ModelTag, and InstalledFile arguments of AMLGenerator. If no table is provided, the
function uses the default embedded dictionaries for English or Chinese text.

Example 1
Chinese language:

ModelTable ('crf_model_file')
ModelTag ('CRF')
InstalledFile ('true')

Example 2
Chinese language:

ModelTable ('crf_model_file', 'dict_table', 'dict_file')


ModelTag ('CRF', 'DICT', 'DICT')
InstalledFile ('true', 'false', 'true')

Example 3
English or Japanese (CRF file is not supported):

ModelTable ('dict_table', 'dict_file')


ModelTag ('DICT', 'DICT')
InstalledFile ('false', 'true')

Aster Scoring SDK SparseSVM


This is the Aster Scoring SDK version of the SparseSVMPredictor function, which is discussed in
SparseSVMPredictor.

Model Format
Table 1311: Aster Scoring SDK SparseSVM Model Format

Argument Description
ModelType svm, sparse svm
ModelTable Database table
ModelTag No tags supported

Request Definition
Same as SQL-MapReduce function SparseSVMPredictor.

Parameters
Table 1312: Aster Scoring SDK SparseSVM Parameters

Parameter Supported Comments


SampleIdColumn Yes
AttributeColumn Yes
ValueColumn Yes
AccumulateLabel Yes
OutputClassNum Yes

ON sample_table as input No Provided using Request.
ON model_table as model No Populated in AML file.

Aster Scoring SDK CoxPH


This is the Aster Scoring SDK version of the CoxPredict function, which is discussed in CoxPredict.

Model Format
Table 1313: Aster Scoring SDK CoxPH Model Format

Argument Description
ModelType cox, coxph
ModelTable Database table
ModelTag No tags supported

Request Definition
Same as SQL-MapReduce function CoxPredict.

Parameters
Table 1314: Aster Scoring SDK CoxPH Parameters

Parameter Supported Comments


PredictFeatureNames Yes Required parameter.
PredictFeatureColumns Yes Required if PredictFeatureUnitsColumns is
omitted, otherwise disallowed.
RefFeatureColumns Yes Optional when the PredictFeatureColumns
parameter is provided.
Not supported when
PredictFeatureUnitsColumns is provided.
PredictFeatureUnitsColumns Yes Required if PredictFeatureColumns is
omitted, otherwise disallowed.
Accumulate Yes Optional parameter.
ON cox_coef_model_table as cox_coef_model No Populated in AML file.
ON predict_feature_table as predicts No Provided using Request.
ON ref_feature_table as refs No Provided using Request.
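
As a guideline, the following AMLGenerator call is a hypothetical sketch of packaging a Cox coefficient model for scoring. The table name cox_coef_model_table and the feature columns age and treatment are illustrative assumptions. Note that PredictFeatureNames is required, and that you supply either PredictFeatureColumns or PredictFeatureUnitsColumns, but not both.

SELECT * FROM AMLGenerator (
ON (SELECT 1) PARTITION BY 1
ModelType ('cox')
ModelTable ('cox_coef_model_table')
RequestColNames ('id', 'age', 'treatment')
RequestColTypes ('int', 'double', 'varchar')
AMLPrefix ('cox_model')
RequestArgName1 ('PredictFeatureNames')
RequestArgVal1 ('age', 'treatment')
RequestArgName2 ('PredictFeatureColumns')
RequestArgVal2 ('age', 'treatment')
);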

Aster Scoring SDK LDAInference
This is the Aster Scoring SDK version of the LDAInference function, which is discussed in LDAInference.

Model Format
Table 1315: Aster Scoring SDK LDAInference Model Format

Argument Description
ModelType lda, lda inference
ModelTable Database table
ModelTag No tags supported

Request Definition
Same as SQL-MapReduce function LDAInference.

Parameters
Table 1316: Aster Scoring SDK LDAInference Parameters

Parameter Supported Comments


DocIdColumn Yes Required parameter.
WordColumn Yes Required parameter.
CountColumn Yes Optional parameter.
OutputTopicNum Yes Optional parameter.
OutputTopicWordNum Yes Optional parameter.
InputTable No Provided using Request.
ModelTable No Populated in AML file.
OutputTable No Provided in Response.

FAQ

How is Aster Scoring SDK different from functions in the Aster Analytics suite?
Aster Scoring SDK enables enterprises to perform predictive analytics in real-time on a user framework (for
example, a web server or stream-processing engine) whereas Aster Analytics suite functions perform data
analytics in batch-mode on Aster Database.

Does Aster Scoring SDK include a real-time streaming engine or a listening framework?
No, Aster Scoring SDK is a computational library framework that can be plugged into any online Java
Runtime Environment (Java 1.6 or later) to perform deeper analysis of data. Because real-time systems come
with different architectures to handle input data (requests or streams), Aster Scoring SDK does not
incorporate another request-handling layer of its own and is universally applicable to all Java-based real-
time systems.

Does Aster Scoring SDK Need Aster Database and Aster Analytics Suite?
Yes, Aster Scoring SDK works on predictive models trained on an Aster Database using functions from the
Aster Analytics suite. Once trained, these models are exported through an Aster Model Language (AML) file
to the user environment where Aster Scoring SDK is deployed.

Can Aster Scoring SDK be invoked in a cloud environment such as Amazon Web Services (AWS)?
Yes. However, the cloud infrastructure must have a processing framework in which Aster Scoring SDK can
be integrated easily.

Is Aster Scoring SDK thread-safe? Can it be deployed in a multithreaded parallel system?
Yes, Aster Scoring SDK is thread-safe and can be deployed in a multithreaded system by sharing a
scorer object across multiple threads.

What is the recommended way to incorporate Aster Scoring SDK in a multithreaded system?
There are three different ways to incorporate Aster Scoring SDK in a multithreaded system (the second pattern is sketched after this list):
• The scorer is not shared across multiple threads. In this case, each thread creates its own scorer object
and there is no sharing of model or configuration across threads. There are as many scorer objects as the
number of parallel execution threads.
• The scorer is initialized and configured by a single thread (main process) and the configured scorer is
shared across multiple threads for parallel execution.
• The scorer is initialized by a single thread (main process) and the initialized scorer is shared across
multiple threads. The scorer is configured by the first thread that makes the configure() call. To change
the scorer configuration (for example, if there is a change in model or a parameter), the scorer must first
be reset using the reset() call and then reconfigured. In case of multiple configuration calls, all but the
first configuration call are ignored by the scorer.
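As a minimal sketch of the second pattern, assuming the same hypothetical Scorer stub as in the earlier
example (only the configure() call and the idea of a shared scorer object are taken from this FAQ):

// Sketch of the second pattern: the main thread initializes and configures
// one scorer, then shares it across worker threads for parallel scoring.
// The Scorer type is a hypothetical stand-in, as in the previous sketch.
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedScorerSketch {

    static class Scorer {
        void configure(File amlFile) { /* load the exported model */ }
        String score(String request) { return "scored:" + request; }
    }

    public static void main(String[] args) throws InterruptedException {
        // Single thread (main process) initializes and configures the scorer.
        Scorer scorer = new Scorer();
        scorer.configure(new File("model.aml"));  // illustrative AML file

        // The configured scorer is shared across threads for parallel scoring.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            final String request = "request-" + i;
            pool.submit(() -> System.out.println(scorer.score(request)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}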

Does Aster Scoring SDK work on Predictive Model Markup
Language (PMML) based models?
No, Aster Scoring SDK supports only AML-based models. PMML model support is inconsistent across tools and is
not fully standardized. Moreover, a PMML file captures only an individual model, whereas the AML model file
provides a complete configuration mechanism for real-time scoring. An AML model can be seamlessly integrated
into Aster Scoring SDK to deliver high performance.

How fast is the response time of Aster Scoring SDK?


In a typical configuration, Aster Scoring SDK processes an incoming request (query) in less than a
millisecond, on the order of a few microseconds. However, response time and throughput for the entire system
depend on other variables, including hardware specifications, available memory, number of parallel execution
threads, model complexity, and so on.


CHAPTER 15
Visualization Functions

Visualization Functions
Visualization functions are used with the AppCenter product. For information about these functions, refer
to the AppCenter User Guide.



CHAPTER 16
Aster Database System Utility Functions

Aster Database System Utility Functions


The built-in system utility functions are intended to be invoked through AMC Executables. These functions
are automatically installed as part of the Aster Database installation. If you type \dF in ACT, these out-of-
the-box functions do not appear, because they are internal-only functions. You can, however, use them in your
own custom scripts.
Aster Database includes the following system utility functions.
• nc_skew
• nc_relationstats
For more information on these functions, see one of these documents, depending on your platform:
• For appliances, see the Aster Database User Guide for Aster Appliances
• For installations on commodity hardware, see the Aster Database User Guide for Commodity Hardware.



APPENDIX A
List of Functions and Their Syntax

About the List of Functions


Within function categories, functions are in alphabetical order.

Time Series, Path, and Attribution Analysis

Arima (version 1.1)

SELECT * FROM Arima (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
ModelTable ('output_table')
[ ResidualTable ('residual_table') ]
TimestampColumns
({ 'timestamp_column' | 'timestamp_column_range' }[,...])
ValueColumn ('value_column')
Orders ('p, d, q')
[ SeasonalOrders ('sp, sd, sq') ]
[ Period ('period')]
[ IncludeMean ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Fixed ('fixed_params') ]
[ InitValues ('init_params') ]
[ MaxIterNum ('max_iteration_number') ]
);

ArimaPredictor (version 1.1)

SELECT * FROM ArimaPredictor (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]

[ SSLTrustStorePassword ('SSLtruststorepassword') ]
ModelTable ('input_table')
ResidualTable ('residual_table')
TimestampColumns
({ 'timestamp_column' | 'timestamp_column_range' }[,...])
[ ValueColumn ('value')]
[ ResidualColumn ('residual_column')]
StepAhead ('steps')
);

Attribution (Multiple-Input Version) (version 2.3)

SELECT * FROM Attribution (


ON { input_table | view | (query) }
PARTITION BY user_id
ORDER BY timestamp_column
[ ON { input_table_n | view_n | (query_n) }
PARTITION BY user_id
ORDER BY timestamp_column [,...] ]
ON conversion_event_table AS conversion DIMENSION
[ ON excluding_event_table AS excluding DIMENSION ]
[ ON optional_event_table AS optional DIMENSION ]
ON model1_table AS model1 DIMENSION
[ ON model2_table AS model2 DIMENSION ]
EventColumn ('event_column')
TimestampColumn ('timestamp_column')
WindowSize ({ 'rows:K' | 'seconds:K' | 'rows:K&seconds:K2' })
) ORDER BY user_id,time_stamp;

Attribution (Single-Input Version) (version 2.3)

SELECT * FROM attribution (


ON { input_table | view | (query) }
PARTITION BY expression [,...]
ORDER BY order_by_columns
EventColumn ('event_column')
ConversionEvents ('conversion_event' [,...])
[ ExcludeEvents ('exclude_event') ]
[ OptionalEvents ('optional_event' [,...]) ]
TimestampColumn ('timestamp_column')
WindowSize ('rows:K | seconds:K | rows:K&seconds:K')
Model1 ('type', { 'K' | 'EVENT:WEIGHT:MODEL:PARAMETERS' } [,...])
[ Model2 ('type', { 'K' | 'EVENT:WEIGHT:MODEL:PARAMETERS' } [,...]) ]
);

Burst (version 1.0)

SELECT * FROM Burst (


ON { table | view | (query) } AS input_table

PARTITION BY id
ORDER BY ordering_column
[ ON { table | view | (query) } AS time_table
PARTITION BY id
ORDER BY ordering_column ]
TimeColumn ('start_time_column', 'end_time_column')
[ TimeInterval (numeric_value) ]
ValueColumns ({ 'value_column' | 'value_column_range' }[,...])
[ TimeDataType (data_type) ]
[ ValueDataType (value_type [,...]) ]
[ StartTime (start_time) ]
[ EndTime (end_time) ]
[ NumPoints (data_points) ]
[ ValuesBeforeFirst ('before_first_value' [,...]) ]
[ ValuesAfterLast ('after_last_value' [,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

CCM (version 1.0)

SELECT * FROM CCM (


ON <table|view|query> AS input PARTITION BY <key>
ON <table|view|query> AS time_series_stats DIMENSION
ON <table|view|query> AS embedding_dimension DIMENSION
IdColumn ('input_column')
SequenceColumns ({ 'seq_column' | 'seq_column_range' }[,...])
[ EmbeddingDimension ('integer')]
[ Tau(num_time_steps) ]
[ LibraryLength (library_length_min:library_length_max) ]
[ BootstrapSamples ('integer') ]
[ Seed ('long') ]
);

CCMPrepare (version 1.0)

SELECT * FROM CCMPrepare (


ON <table|view|query> PARTITION BY <key>
);

ChangePointDetection (version 1.0)

SELECT * FROM ChangePointDetection (


ON { table |view | query }
PARTITION BY partition_expr ORDER BY order_by_expr
ValueColumn ('value_column')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ SegmentationMethod ('segmentation_method') ]

[ SearchMethod ('binary') ]
[ MaxChangeNum ('maximum_change_point_count') ]
[ Penalty ({ 'BIC' | 'AIC' | 'threshold' }) ]
[ OutputOption ({ 'CHANGEPOINT' | 'VERBOSE' | 'SEGMENT' }) ]
);

DTW (version 1.0)

SELECT * FROM DTW (


ON input_table AS input_table
PARTITION BY i_partition_column [,...]
ORDER BY i_ordering_column [,...]
ON template_table AS template_table DIMENSION
ORDER BY t_ordering_column [,...]
ON mapping_table AS mapping_table
PARTITION BY m_partition_column [,...]
InputColumns ('i_value', 'i_timestamp')
TemplateColumns ('t_value', 't_timestamp')
TimeseriesID ('timeseriesid' [,...])
TemplateID ('templateid' [,...])
[ Radius ('radius') ]
[ DistMethod ('distance_metric') ]
[ WarpPath ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
);

DWT (version 1.3)

SELECT * FROM DWT (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
MetaTable ('meta_table')
InputColumns ({ 'input_column' | 'input_column_range' }[, ...])
SortColumn ('sort_column')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
{ WaveletName ('wavelet') |
WaveletFilterTable ('wavelet_filter_table') }
Level (level)
[ ExtensionMode ('extension_mode') ]
);

DWT2D (version 1.3)

SELECT * FROM DWT2D (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
MetaTable ('meta_table')
InputColumns ({ 'input_column' | 'input_column_range' }[,...])
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
IndexColumns ('indexy_column', 'indexx_column')
[ Range ('(starty, startx), (endy, endx)') ]
{ Wavelet ('wavelet') |
WaveletFilterTable ('wavelet_filter_table') }
Level (level)
[ ExtensionMode ('extension_mode') ]
[ CompactOutput
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

FrequentPaths (version 2.1)

SELECT * FROM FrequentPaths (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
[ TimeColumn ('time_column') ]
[ PathFilters ([Separator (symbol),] 'filter' [,...]) ]
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ SeqPatternTable ('sequence_pattern_table') ]
{ ItemColumn ('sequence_column') |
ItemDefinition ('item_definition_table:
[ index_column:definition_column:item_column ]') |
PathColumn ('path_column')}
MinSupport ('minimum')
[ MaxLength ('maximum_length') ]
[ MinLength ('minimum_length') ]

[ ClosedPattern ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}]
);

IDWT (version 1.3)

SELECT * FROM IDWT (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
MetaTable ('meta_table')
OutputTable ('output_table')
InputColumns ({ 'input_column' | 'input_column_range' }[,...])
SortColumn ('sort_column')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
);

IDWT2D (version 1.3)

SELECT * FROM IDWT2D (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
MetaTable ('meta_table')
OutputTable ('output_table')
InputColumns ({ 'column_name' | 'column_range' }[,...])
SortColumn ('sort_column')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
[ VerboseFlag ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Interpolator (version 1.0)

SELECT * FROM Interpolator (


ON { table|view|(query)} AS input_table
PARTITION BY id
ORDER BY ordering_column
[ ON { table|view|(query) } AS time_table

DIMENSION ORDER BY ordering_column ]
[ ON { table|view|(query) } AS count_row_number
PARTITION BY id ]
TimeColumn ('time_column')
[ TimeInterval (time_interval) ]
ValueColumns ({ 'value_column' | 'value_column_range' }[,...])
[ InterpolationType (interpolation_type [,...] ) ]
[ AggregationType (aggregation_type [,...] ) ]
[ TimeDataType (time_data_type) ]
[ ValueDataType (value_type [,...])]
[ StartTime (start_time) ]
[ EndTime (end_time) ]
[ ValuesBeforeFirst ('value' [,...]) ]
[ ValuesAfterLast ('value' [,...]) ]
[ DuplicateRowsCount ('value1' [,'value2']) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Path_Analyzer (version 1.3)

SELECT * FROM Path_Analyzer (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ( { table | view | (query) } )
OutputTable ('output_table')
{ SeqColumn | SEQ } ('sequence_column')
{ CountColumn | CNT } ('count_column')
Hash ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
Delimiter ('delimiter')
);

Path_Generator (version 1.3)

SELECT * FROM Path_Generator (


ON { table | view | (query) }
{ SeqColumn | SEQ } ('sequence_column')
[ Delimiter ('delimiter') ]
);

Path_Start (version 1.2)

SELECT * FROM Path_Start (


ON table_name

PARTITION BY partition_column [,...]
{ CountColumn | CNT } ('count_column')
[ Delimiter (',') ]
Parent ('parent_column')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
Node ('node_column')
);

Path_Summarizer (version 1.2)

SELECT * FROM Path_Summarizer (


ON { table_name | view_name | (query)}
PARTITION BY partition_column [,...]
[ { CountColumn | CNT } ('count_column') ]
Delimiter ('delimiter')
{ SeqColumn | SEQ } ('sequence_column')
[ PartitionColumns
( { 'partition_column' | 'partition_column_range' }[,...]) ]
Hash ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
Prefix ('prefix_column')
);

RtChangePointDetection (version 1.0)

SELECT * FROM RtChangePointDetection (


ON { table | view | query }
PARTITION BY partition_expr ORDER BY order_by_expr
ValueColumn ('value_column')
Accumulate ({ 'accumulate_column' | 'accumulate_column_range' }[,...])
[ SegmentationMethod ('normal_distribution') ]
[ WindowSize ('window_size') ]
[ Threshold ('change_point_threshold') ]
[ OutputOption ({ 'CHANGEPOINT' | 'VERBOSE' | 'SEGMENT' }) ]
);

SAX2

Multiple-Input Version
Version 1.0

SELECT * FROM SAX2 (


ON { table | view | (query) } AS input
PARTITION BY key
ORDER BY order_columns
ON { table | view | (query) } AS meanstats PARTITION BY key
ON { table | view | (query) } AS stdevstats PARTITION BY key

ValueColumns ({ 'value_column' | 'value_column_range' }[,...])
[ TimeColumn ('time_column') ]
[ WindowType ( { 'global' | 'sliding' } ) ]
[ Output ( { 'string' | 'bytes' | 'bitmap' | 'characters' } ) ]
[ WindowSize ('window_size') ]
[ OutputFrequency ('output_frequency') ]
[ PointsPerSymbol ('points_per_symbol' [,...]) ]
[ SymbolsPerWindow ('symbols_per_window' [,...]) ]
[ AlphabetSize ('alphabet_size' [,...]) ]
[ BitmapLevel ('bitmap_level' [,...]) ]
[ PrintCodeStats
({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Single-Input Version
Version 1.0

SELECT * FROM SAX2 (


ON { table | view | (query) } AS input
PARTITION BY key
ORDER BY order_columns
ValueColumns ({ 'value_column' | 'value_column_range' }[,...])
[ Time_Column ('time_column') ]
[ Window_Type ( { 'global' | 'sliding' } ) ]
[ Output ( { 'string' | 'bytes' | 'bitmap' | 'characters' } ) ]
[ Mean ('mean_value' [,...]) ]
[ Stdev ('stdev_value' [,...]) ]
[ Window_Size ('window_size') ]
[ Output_Frequency ('output_frequency') ]
[ Points_Per_Symbol ('points_per_symbol' [,...]) ]
[ Symbols_Per_Window ('symbols_per_window' [,...]) ]
[ Alphabet_Size ('alphabet_size' [,...]) ]
[ Bitmap_Level ('bitmap_level' [,...]) ]
[ Print_Code_Stats
({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

SeriesSplitter (version 1.0)

SELECT * FROM SeriesSplitter (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ({ table | view | (query) })

PartitionByColumns
({ 'partition_column' | 'partition_column_range' }[,...])
[ DuplicateRowsCount ('value' [,...]) ]
[ OrderByColumns
({ 'ordering_column' | 'ordering_column_range'}[,...]) ]
[ SplitCount ('split_count') ]
[ RowsPerSplit ('rows_per_split') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ OutputTable ('output_table') ]
[ SplitIDColumn ('split_id_column') ]
[ StatsTable ('stats_table') ]
[ ReturnStatsTable
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OverwriteOutput
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ ValuesBeforeFirst ('value' [,...]) ]
[ ValuesAfterLast ('value' [,...]) ]
[ DuplicateColumn ('duplicate_column') ]
[ PartialSplitID
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

Sessionize (version 1.3)

SELECT * FROM Sessionize (


ON { table_name | view_name| (query) }
PARTITION BY expression [,...]
ORDER BY order_column [,...]
TimeColumn ('timestamp_column')
TimeOut (session_timeout)
[ ClickLag (min_human_click_lag) ]
[ EmitNull ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
);

SupervisedShapeletClassifier (version 1.1)

SELECT * FROM SupervisedShapeletClassifier (


ON { table | view | (query) } AS time_series
PARTITION BY id
ORDER BY time_instant
ON { table | view | (query) } AS shapelets DIMENSION
ORDER BY shapelet_id, time_instant
[ ValueColumn ('value_column') ]
[ TimeInterval ('num_data_points') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

SupervisedShapeletTrainer (version 1.1)

SELECT * FROM SupervisedShapeletTrainer (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_data_table')
[ CategoryTable ('input_categories_table') ]
IdColumn ('id_column')
TimeColumn ('time_column')
ValueColumn ('value_column')
CategoryColumn ('category_column')
[ SaxSymbolsPerWindow ('symbols_per_window') ]
[ SaxMinWindowSize ('min_window_size') ]
[ SaxMaxWindowSize ('max_window_size') ]
[ SaxOutputFrequency ('gap_between_windows') ]
[ ModelTable ('output_model_table') ]
[ OverwriteOutput
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ RandomProjections ('projections') ]
[ ShapeletCount ('num_shapelets') ]
[ TimeInterval ('num_data_points') ]
[ Seed ('seed') ]
);

UnsupervisedShapelet (version 1.0)

SELECT * FROM UnsupervisedShapelet (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
[ OutputTable ('output_table') ]
[ OverwriteOutput
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
TimeColumn ('time_column')
ValueColumn ('value_column')
SaxWindowSize ('window_size')
[ SaxSymbolsPerWindow ('symbols_per_window') ]
[ SaxOutputFrequency ('gap_between_windows') ]
ID ('id_column')
[ RandomProjections ('projections') ]
[ Threshold ('threshold') ]
[ MaxNumIter ('max_iterations') ]

[ ShapeletCutOff ('cut_off') ]
);

VARMAX (version 1.0)

SELECT * FROM VARMAX(


ON inputtable
PARTITION BY partitionColumns
ORDER BY timestampColumns
ResponseColumns('columns')
[ ExogenousColumns('columns') ]
[ PartitionColumns('columns') ]
Orders ('p,d,q')
[ SeasonalOrders('sp,sd,sq') ]
[ Period ('period') ]
[ ExogenousOrder ('b') ]
[ Lag ('lag') ]
[ IncludeMean ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ MaxIterNum ('max iteration number') ]
[ StepAhead ('predict steps') ]
);

Pattern Matching with Teradata Aster nPath

nPath (version 1.0)

SELECT * FROM nPath (


ON { table | view | (query) }
PARTITION BY partition_column
ORDER BY order_column [ ASC | DESC ]
[ ON { table | view | (query) }
[ PARTITION BY partition_column | DIMENSION ]
ORDER BY order_column [ ASC | DESC ]
][...]
Mode ({ OVERLAPPING | NONOVERLAPPING })
Pattern ('pattern')
Symbols ( { col_expr = symbol_predicate AS symbol } [,...])
[ Filter (filter_expression[,...]) ]
Result ({aggregate_function(col_expr OF symbol) AS alias}[,...])
);


Statistical Analysis

AddOnePlayer (version 1.0)

SELECT * FROM AddOnePlayer (


ON { table | view | (query) } PARTITION BY key
CombinationColumn ('combination_column')
SizeColumn ('size_column')
ValueColumn ('value_column')
NumPlayers ('number_of_players')
[ Delimiter ('delimiter') ]
);

Approximate Distinct Count (version 1.0)

SELECT * FROM ApproxDCountReduce (


ON ( [ SELECT * FROM ] ApproxDCountMap (
ON { table_name | view_name| (query) }
InputColumns ({ 'input_column' | 'input_column_range' }[,...])
[ ErrorRate ('error_tolerance') ]
)
)
PARTITION BY expression[,...]
);

Approximate Percentile (version 1.1)

SELECT * FROM ApproxPercentileReduce (


ON ([ SELECT * FROM ] ApproxPercentileMap (
ON { table | view | (query) }
TargetColumns ({ 'target_column' | 'target_column_range' }[,...])
[ ErrorRate (error) ]
[ GroupColumns ({ 'group_column' | group_column_range }[,...]) ]
)
) PARTITION BY { 1 | group_column [,...] }
[ Percentile (percentile [,...]) ]
[ TargetColumns
({ 'target_column' | target_column_range }[,...]) ]
[ GroupColumns ({ 'group_column' | group_column_range }[,...]) ]
);

CMAVG (version 1.2)

SELECT * FROM CMAVG (


ON { table_name| view_name| (query) }

PARTITION BY partition_column
ORDER BY order_by_column
[ TargetColumns ({ 'target_column' | target_column_range }[,...]) ]
);

ConfusionMatrix (version 2.0)

SELECT * FROM ConfusionMatrix (


ON input_table PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
ObsColumn ('observed_column')
PredictColumn ('predicted_column')
OutputTable ('output_table')
[ Classes ('class' [,...] ) ]
[ Prevalence ('prevalence' [,...] ) ]
);

Correlation (version 1.4)

SELECT * FROM Corr_Reduce (


ON Corr_Map (
ON { table_name | view_name | (query) }
PARTITION BY group_column [,...]
[ TargetColumns
({ 'target_column_name' | target_column_range }[,...]) ]
KeyName ('key_name')
[ GroupByColumns
({ 'group_column' | 'group_column_range' }[,...]) ]
)
PARTITION BY key_name[,group_column [,...]]
);

CoxPH (version 1.2)

SELECT * FROM CoxPH (


ON (SELECT 1)
PARTITION BY 1
InputTable ('input_table')
FeatureColumns ({ 'feature_column' | 'feature_column_range' }[,...])
[ CategoricalColumns
({ 'categorical_column' | 'categorical_column_range' }[,...]) ]
TimeIntervalColumn ('time_interval_column')
EventColumn ('event_column')
CoefficientTable ('coefficient_table')

LinearPredictorTable ('linear_predictor_table')
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
[ Threshold ('threshold') ]
[ MaxIterNum ('max_iteration_number') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

CoxPredict (version 1.1)

SELECT * FROM CoxPredict (


ON cox_coef_model_table AS cox_coef_model DIMENSION
ON predict_feature_table AS predicts PARTITION BY { 1 | id }
[ ON ref_feature_table AS refs PARTITION BY { 1 | id } ]
Predict_Feature_Names (predict_feature [,...])
{ Predict_Feature_Columns
({ 'pf_value_column' | 'pf_value_column_range'}[,...]) |
Predict_Feature_Units_Columns
({ 'pf_unit_column' | 'pf_unit_column_range'}[,...]) }
[ Ref_Feature_Columns
({ 'rf_value_column' | 'rf_value_column_range'}[,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

CoxSurvFit (version 1.1)

SELECT * FROM CoxSurvFit (


ON (SELECT 1) PARTITION BY 1
Cox_Linear_Predictor_Model_Table (cox_linear_predictor_model_table)
Cox_Coef_Model_Table (cox_coef_model_table)
Predict_Table (predict_table)
Predict_Feature_Names (feature_name [,...])
Predict_Feature_Columns
({ 'pf_value_column' | 'pf_value_column_range'}[,...])
Output_Table (output_table)
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
);

CrossValidation (version 1.0)

SELECT * FROM CrossValidation (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
Function ('function_name')
[ Arguments used by the training function ]
...
CVParams ('arguments_vary_in_cv')
[ FoldNum ('k') ]
[ CVTable ('tablename') ]
[ Metric ('error_function_name') ]
);

Distribution Matching, Hypothesis-Test Mode


• Continuous Distributions
• Discrete Distributions

Continuous Distributions
Version 1.0
• Option 1: For Multiple-Node Data Sets
• Option 2: For Single-Node Data Sets

Option 1: For Multiple-Node Data Sets

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY col [,...]
ORDER BY column_name) AS rank, *
FROM input_table
WHERE column_name IS NOT NULL
) AS input PARTITION BY ANY
ON (SELECT col [,...], COUNT(*) AS group_size
FROM input_table
WHERE column_name IS NOT NULL
GROUP BY col [,...]
) AS groupstats DIMENSION
ValueColumn (column_name)
[ Tests ('test' [,...]) ]
Distributions ('distribution:parameter' [,...])
[ GroupingColumns (col[,...]) ]
[ MinGroupSize (minGroupSize) ]
[ CellSize (cellSize) ]
)

PARTITION BY col [,...]
);

Option 2: For Single-Node Data Sets

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY column [,...]
ORDER BY column) AS rank, *
FROM input_table
WHERE column IS NOT NULL
) AS input PARTITION BY column [,...]
ON (SELECT col [,...], COUNT(*) AS group_size
FROM input_table
WHERE column IS NOT NULL
GROUP BY column [,...]
) AS groupstats PARTITION BY column [,...]
ValueColumn (value_column)
[ Tests ('test' [,...]) ]
Distributions ('distribution:parameters' [,...])
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ MinGroupSize (min_group_size) ]
[ NumCell (cell_size) ]
) PARTITION BY column [,...]
);

Discrete Distributions
Version 1.0
• Option 1: For Multiple-Node Data Sets
• Option 2: For Single-Node Data Sets and Any CvM Test

Option 1: For Multiple-Node Data Sets

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT COUNT(1) AS counts,
SUM(COUNT(1)) OVER (PARTITION BY column [,...]
ORDER BY column) AS rank, column [,...]
FROM input_table
WHERE column IS NOT NULL
GROUP BY column [,...]
) AS input PARTITION BY ANY
ON (SELECT column [,...], COUNT(*) AS group_size
FROM input_table
WHERE column IS NOT NULL
GROUP BY column [,...]
) AS groupstats DIMENSION
ValueColumn (value_column)
[ Tests ('test' [,...]) ]
Distributions ('distribution:parameters' [,...])

[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ MinGroupSize (min_group_size) ]
[ NumCell (cell_size) ]
)
PARTITION BY column [,...]
);

Option 2: For Single-Node Data Sets and Any CvM Test

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT COUNT(1) AS counts,
SUM(COUNT(1)) OVER (PARTITION BY column [,...]
ORDER BY column) AS rank, column [,...]
FROM input_table
WHERE column IS NOT NULL
GROUP BY column [,...]
) AS input PARTITION BY column [,...]
ORDER BY column
ON (SELECT column [,...], COUNT(*) AS group_size
FROM input_table
WHERE column IS NOT NULL
GROUP BY column [,...]
) AS groupstats
PARTITION BY column [,...]
ValueColumn (value_column)
[ Tests ('test' [,...]) ]
Distributions ('distribution:parameters' [,...])
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ MinGroupSize (min_group_size) ]
[ NumCell (cell_size) ]
) PARTITION BY column [,...]
);

Distribution Matching, Best-Match Mode


• DOUBLE PRECISION Input
• Integer Input

DOUBLE PRECISION Input
Version 1.0

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT RANK() OVER (PARTITION BY column [,...]
ORDER BY column_name) AS rank, *
FROM input_table WHERE column_name IS NOT NULL)
AS input PARTITION BY ANY
ON (SELECT column [,...],

COUNT(*) AS group_size,
AVG (column_name) AS mean,
STDDEV (column_name) AS sd,
CASE
WHEN MIN (column_name) > 0 THEN AVG (LN (
CASE
WHEN column_name > 0 THEN column_name
ELSE 1
END)
)
ELSE 0
END AS mean_of_ln,
CASE
WHEN MIN (column_name) > 0 THEN STDDEV (LN (
CASE
WHEN column_name > 0 THEN column_name
ELSE 1
END)
)
ELSE -1
END AS sd_of_ln,
Max (column_name) AS maximum,
MIN (column_name) AS minimum
FROM input_table
WHERE column_name IS NOT NULL
GROUP BY column [,...]
) AS groupstats DIMENSION
ValueColumn (column_name)
[ Tests ('test' [,...]) ]
[ Distributions ('distribution1:parameter1',...) ]
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
MinGroupSize (minGroupSize)
[ NumCell (cellSize) ]
)
PARTITION BY column [,...]
[ Top ('top') ]
);

Integer Input
Version 1.0

SELECT * FROM DistnmatchReduce (


ON DistnmatchMultipleInput (
ON (SELECT COUNT(1) AS counts,
SUM(COUNT(1)) OVER (PARTITION BY column [,...]
ORDER BY column_name) AS rank,
column [,...], column_name
FROM input_table
WHERE column_name IS NOT NULL
GROUP BY column [,...], column_name
) AS input PARTITION BY ANY
ON (SELECT column [,...],
COUNT(*) AS group_size,
AVG (column_name) AS mean,

STDDEV (column_name) AS sd,
MAX (column_name) AS maximum,
MIN (column_name) AS minimum
FROM input_table
WHERE column_name IS NOT NULL
GROUP BY column [,...]
) AS groupstats DIMENSION
ValueColumn (column_name)
[ Tests ('test' [,...]) ]
[ Distributions ('distribution1:parameter1' [ ,... ]) ]
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
MinGroupSize (minGroupSize)
[ NumCell (cellSize) ]
)
PARTITION BY column [,...]
[ Top ('top') ]
);

EMAVG (version 1.2)

SELECT * FROM EMAVG (


ON { table_name | view_name | (query) }
PARTITION BY partition_column
ORDER BY order_column
[ TargetColumns ({ 'target_column' | 'target_column_range' }[,...]) ]
[ Alpha ('alpha') ]
[ StartRows ('n') ]
[ IncludeFirst ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

FMeasure (version 1.4)

SELECT * FROM FMeasure (


ON { table | view | (query) } PARTITION BY 1
ObsColumn ('observed_column')
PredictColumn ('predicted_column')
[ Classes ('class' [,...]) ]
[ Beta (beta_value) ]
);

GenerateCombination (version 1.0)

SELECT * FROM GenerateCombination (


ON { table | view | (query) }
);

GLM (version 1.7)

SELECT * FROM GLM (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
[ InputColumns ({ 'input_column' | 'input_column_range' }[,...]) ]
[ CategoricalColumns ('columnname_value_pair'[,...]) ]
[ Family ('family') ]
[ Link ('link') ]
[ Weight ('weight_column') ]
[ Threshold ('threshold') ]
[ MaxIterNum ('max_iterations') ]
[ Intercept ( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
[ Step ( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
);

GLMPredict (version 1.5)

SELECT * FROM GLMPredict (


ON input_table
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
ModelTable ('model_table')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Family ('family') ]
[ Link ('link') ]
);

Histogram (version 1.0)

SELECT * FROM hist (


ON (SELECT 1) PARTITION BY 1
InputTable (data_table)
OutputTable (out_table)
[ AutoBin ({ 'Sturges' | 'Scott' | number_of_bins }) ]
[ CustomBinTable (bin_table) ]
[ CustomBinColumn ('breaks_col') ]
[ StartValue ('bin_start')]

[ BinSize ('bin_size')]
[ EndValue ('bin_end')]
ValueColumn ('value_col')
[ Inclusion ({ 'left' | 'right' }) ]
[ GroupbyColumns('groupby_col') ]
);

HMMDecoder (version 1.3)

SELECT * FROM HMMDecoder(


ON init_prob_table AS "InitStateProb" PARTITION BY model_key
ON trans_prob_table AS "TransProb" PARTITION BY model_key
ON emission_prob_table AS "EmissionProb" PARTITION BY model_key
ON observation_table AS "observation" PARTITION BY model_key
ORDER BY time_ordered_sequence_attributes ASC
InitStateModelKey ('model_attribute')
InitStateKey ('state_key_attribute')
InitStateProbKey ('probability')
StateTransModelKey ('model_attribute')
StateTransFromStateKey ('from_state_key_attribute')
StateTransToStateKey ('to_state_key_attribute')
StateTransProbKey ('probability')
EmitModelKey ('model_attribute')
EmitStateKey ('state_key_attribute')
EmitObservedKey ('observed_key_attribute')
EmitProbKey ('probability')
ModelColumn ('model_attribute')
SequenceKey ('seq_attribute')
ObservedKey ('observed_attribute')
[ SequenceMaxSize ('range') ]
[ SkipKey ('skip_attribute') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

HMMEvaluator (version 1.3)

SELECT * FROM HMMEvaluator(


ON input_table AS "InitStateProb"
PARTITION BY model_key_attributes
ON state_transition_table AS "TransProb"
PARTITION BY model_key_attributes
ON emission_table AS "EmissionProb"
PARTITION BY model_key_attributes
ON observation_table AS "observation"
PARTITION BY model_key_attribute
ORDER BY time_ordered_sequence_attributes ASC
InitStateModelColumn ('model_key_attribute')
InitStateColumn ('state_key_attribute')
InitStateProbColumn ('probability')
TransAttributeColumn ('model_key_attribute')
TransFromStateColumn ('from_state_key_attribute')

TransToStateColumn ('to_state_key_attribute')
TransProbColumn ('probability')
EmitModelColumn ('model_key_attribute')
EmitStateColumn ('state_key_attribute')
EmitObsColumn ('observed_key_attribute')
EmitProbColumn ('probability')
ModelColumn ('model_key_attribute')
SeqColumn ('seq_key_attribute')
ObsColumn ('observed_key_attribute1')
[ Incremental
( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
[ ShowChangeRate
( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
[ SeqProbColumn ('sequence_probability_attribute') ]
[ SkipColumn ('skip_key_attribute') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

HMMSupervisedLearner (version 1.3)

SELECT * FROM HMMSupervisedLearner (


ON {table_name | view_name | (query)} AS "vertices"
PARTITION BY [ model_key, ...,] sequence_key_attributes
ORDER BY [ model_key, ] time_ordered_sequence_attributes ASC
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
[ ModelKey ('model_attribute') ]
SequenceKey ('sequence_attribute')
ObservedKey ('observed_attribute1')
StateKey('state_attributes')
[ SkipKey('skip_attribute') ]
[ OutputTables('init_state_prob','state_transition_prob','emit_prob')]
[ BatchSize('size') ]
);

HMMUnsupervisedLearner (version 1.3)

SELECT * FROM HMMUnsupervisedLearner (


ON { table_name | view_name | (query) } AS vertices
PARTITION BY [ model_key, ...,] sequence_key_attributes
ORDER BY [ model_key, ] time_ordered_sequence_attributes ASC
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]

[ ModelColumn ('model_attribute') ]
SeqColumn ('sequence_attribute')
ObsColumn ('observed_attribute')
HiddenStateNum ('number')
[ MaxIterNum ('max_iterations') ]
[ Epsilon ('epsilon') ]
[ SkipColumn ('skip_attribute') ]
[ InitMethods ( { 'random' | 'flat' | 'input' }, 'seed_number') ]
[ InitParams ('init_state_probability_vector',
'state_transition_probability_matrix',
'observation_emission_probability_matrix') ]
[ OutputTables('init_state_prob','state_transition_prob','emit_prob')]
);

KNN (version 1.3)

SELECT * FROM KNN (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
TrainingTable ('training_table')
TestTable ('test_table')
K (k)
ResponseColumn ('response_column')
IDColumn ('test_id_column')
DistanceFeatures ({ 'df_column' | 'df_column_range' }[,... ])
[ VotingWeight (voting_weight) ]
[ OutputTable ('output_table') ]
[ CustomizedDistance ('jar', 'distance_class') ]
[ ForceMapreduce
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ PartitionBlockSize ('partition_block_size') ]
);

LARS (version 1.1)

SELECT * FROM LARS (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')

InputColumns ('response', 'predictor_columns')
[ Method ({ 'lar' | 'lasso'} ) ]
[ Intercept ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Normalize ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ MaxIterNum ('max_iterations') ]
);

LARSPredict (version 1.1)

SELECT * FROM LARSPredict (


ON input_table AS data PARTITION BY ANY
ON model_table AS model DIMENSION
[ MODE ({ 'STEP' | 'FRACTION' | 'NORM' | 'LAMBDA' })]
[ S ('coef_position' [,...]) ]
[ TargetCol ('target_column') ]
);

Linear Regression (LinReg version 1.1, LinRegMatrix version 1.0)

SELECT * FROM LinReg (


ON LinRegMatrix (
ON { table_name | view_name | (query) }
) PARTITION BY 1
);

LRTEST (version 1.1)

SELECT * FROM LRTEST (


ON (SELECT * FROM glm_output1 WHERE attribute = -1) AS "model1"
PARTITION BY 1
ON (SELECT * FROM glm_output2 WHERE attribute = -1) AS "model2"
PARTITION BY 1
Statistic ('predictor_column')
LogLik ('estimate_column')
ObsNum ('std_err_column')
ParamNum ('z_score_column')
);

Percentile (version 1.0)

SELECT * FROM Percentile (


ON input_table
PARTITION BY partition_column [,...]
Percentile ('percentile' [,...])
Target_Columns ({ 'target_column' | 'target_column_range' }[,...])

[ Group_Columns ({ 'group_column' | 'group_column_range' }[,...]) ]
);

Principal Component Analysis (PCA_Reduce version 1.2, PCA_Map version 1.1)

SELECT * FROM PCA_Reduce (


ON PCA_Map (
ON target_table
Target_Columns ({ 'target_column' | 'target_column_range' }[,...])
) PARTITION BY 1
[ Components (num_components) ]
) ORDER BY component_rank;

PCAPlot (version 1.0)

SELECT * FROM PCAPlot (


ON input_table AS inputtable PARTITION BY ANY
ON pca_table AS pca_table DIMENSION
Components ('num_components')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

RandomSample (version 1.0)

SELECT * FROM RandomSample (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
NumSample ('sample_size' [,...])
[ WeightColumn ('weight_column') ]
[ SamplingMode ({ 'basic' | 'kmeans++' | 'kmeans||' }) ]
[ Distance ({ 'euclidean' | 'manhattan' }) ]
[ InputColumns ( { 'input_column' | 'input_column_range' }[,...] ) ]
[ AsCategories ( { 'ascat_column' | 'ascat_column_range' }[,...] ) ]
[ CategoryWeights ('category_weight' [,...]) ]
[ CategoricalDistance ({ 'overlap' | 'hamming' }) ]
[ Seed ('seed')
SeedColumn ({ 'seed_column' | 'seed_column_range' } [,...]) ]
[ OverSamplingRate ('rate') ]
[ IterationNum ('number_of_iterations') ]
);

Sample (version 1.2)
• Unconditional Sampling, Single Sample Rate
• Unconditional Sampling, Approximate Sample Size
• Conditional Simple Sampling, Single Sample Rate
• Conditional Sampling, Variable Sample Rates
• Conditional Sampling, Approximate Sample Size
• Conditional Sampling, Variable Approximate Sample Sizes

Unconditional Sampling, Single Sample Rate

SELECT * FROM Sample (


ON { table_name | view_name | (query) }
SampleFraction ('fraction')
[ Seed ('seed') ]
);

Unconditional Sampling, Approximate Sample Size

SELECT * FROM Sample (


ON { table_name | view_name | (query) } AS data PARTITION BY ANY
ON { table_name | view_name | (query) } AS summary DIMENSION
ApproxSampleSize ('size')
[ Seed ('seed') ]
);

Conditional Simple Sampling, Single Sample Rate

SELECT * FROM Sample (


ON { table_name | view_name | (query) }
StratumColumn ('column')
Strata ('condition' [,...])
SampleFraction ('fraction')
[ Seed ('seed') ]
);

Conditional Sampling, Variable Sample Rates

SELECT * FROM Sample (


ON { table_name | view_name | (query) }
StratumColumn ('column')
Strata ('condition' [,...])
SampleFraction ('fraction' [,...])
[ Seed ('seed') ]
);

Conditional Sampling, Approximate Sample Size

SELECT * FROM sample (


ON { table_name | view_name | (query) } AS data PARTITION BY ANY
ON { table_name | view_name | (query) } AS summary DIMENSION
StratumColumn ('column')
Strata ('condition' [,...])
ApproxSampleSize ('total_sample_size')
[ Seed ('seed') ]
);

Conditional Sampling, Variable Approximate Sample Sizes

SELECT * FROM sample (


ON { table_name | view_name | (query) } AS data PARTITION BY ANY
ON { table_name | view_name | (query) } AS summary DIMENSION
StratumColumn ('column')
Strata ('condition' [,...])
ApproxSampleSize ('size' [,...])
[ Seed ('seed') ]
);

SMAVG (version 1.2)

SELECT * FROM SMAVG (


ON {table_name | view_name| (query) }
PARTITION BY partition_column
ORDER BY order_by_column
[ TargetColumns ({ 'target_column' | 'target_column_range' }[,...]) ]
[ IncludeFirst ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ WindowSize ('window_size') ]
);

SortCombination (version 1.1)

SELECT * FROM SortCombination (


ON { table | view | (query) } PARTITION BY key
CombinationColumn ('combination_column')
ValueColumn ('value_column')
[ Delimiter ('delimiter') ]
);

Support Vector Machines

DenseSVMModelPrinter (version 1.1)

SELECT * FROM DenseSvmModelPrinter (


ON <table|view|query> AS input PARTITION BY ANY
ON <table|view|query> AS model DIMENSION
AttributeColumns ('input_column1', 'input_column2', ...)
[ Summary ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

DenseSVMPredictor (version 1.1)

SELECT * FROM DenseSvmPredictor (


ON <table|view|query> AS input PARTITION BY ANY
ON <table|view|query> AS model DIMENSION
AttributeColumns
({ 'attribute_column' | 'attribute_column_range' }[,...])
SampleIdColumn ('input_column')
[ AccumulateLabel
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ OutputClassNum ('integer') ]
);

DenseSVMTrainer (version 1.1)

SELECT * FROM DenseSvmTrainer (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database( 'db_name') ]
[ UserId ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
SampleIdColumn ('InputTable_column')
AttributeColumns
({ 'attribute_column' | 'attribute_column_range' }[,...])
[ KernelFunction ('linear'|'polynomial'|'rbf'|'sigmoid') ]
[ Gamma ('double') ]
[ Constant ('double') ]
[ Degree ('integer') ]
[ SubspaceDimension ('integer') ]
[ HashBits ('integer') ]
InputTable ('table_name')
ModelTable ('table_name')
LabelColumn ('InputTable_column')
[ Cost ('double') ]
[ Bias ('double') ]
[ ClassWeights ('string') ]
[ MaxStep ('integer') ]
[ Epsilon ('double') ]

[ Seed ('long') ]
[ OverwriteOutput ('boolean')]
);

SparseSVMPredictor (version 1.1)

SELECT * FROM SparseSVMPredictor (


ON sample_table AS input PARTITION BY id_column
ON model_table AS model DIMENSION
SampleIDColumn ('id_column')
AttributeColumn ('attribute_column')
[ ValueColumn ('value_column') ]
[ AccumulateLabel
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ OutputClassNum ('output_class_number') ]
);

SparseSVMTrainer (version 1.1)

SELECT * FROM SparseSVMTrainer (


ON (select 1) PARTITION by 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
ModelTable ('model_table')
SampleIDColumn ('id_column')
AttributeColumn ('attribute_column')
[ ValueColumn ('value_column') ]
LabelColumn ('label_column')
[ Cost ('cost') ]
[ Bias ('bias') ]
[ Hash ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ HashBuckets (buckets_number) ]
[ ClassWeights ('class:weight' [,...]) ]
[ MaxStep ('max_step') ]
[ Epsilon ('epsilon') ]
[ Seed ('seed') ]
);

SVMModelPrinter (version 1.1)

SELECT DISTINCT * FROM SVMModelPrinter (


ON inputtable AS input PARTITION BY ANY
ON modeltable AS model DIMENSION
AttributeColumn ('attribute_column')

[ Summary ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

VectorDistance (version 1.1)

SELECT * FROM VectorDistance (


ON target_input_table AS target PARTITION BY target_id_column [,...]
ON ref_input_table AS ref DIMENSION
TargetIDColumns
({ 'target_id_column' | 'target_id_column_range' }[,...])
TargetFeatureColumn (feature_column)
[ TargetValueColumn (value_column) ]
[ RefIDColumns ({ 'ref_id_column' | 'ref_id_column_range' }[,...]) ]
[ RefTableSize ({ 'SMALL' | 'LARGE' }) ]
[ RefFeatureColumn (feature_column) ]
[ RefValueColumn (value_column) ]
[ DistanceMeasure (
{ 'cosine' | 'euclidean' | 'manhattan' | 'binary' }[,...])]
[ IgnoreMismatch
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ ReplaceInvalid (
{ 'PositiveInfinity' |'NegativeInfinity' | custom })]
[ TopK ('k') ]
[ MaxDistance ('threshold' [,...]) ]
);

VWAP (version 1.2)

SELECT * FROM VWAP (


ON { table_name | view_name| (query) }
PARTITION BY expression [,...]
ORDER BY date_column
[ Price ('price_column') ]
[ Volume ('volume_column') ]
[ TimeInterval ('number_of_seconds') ]
[ DT ('date_column') ]
);

WMAVG (version 1.2)

SELECT * FROM WMAVG (


ON { table_name| view_name| (query) }
PARTITION BY partition_column
ORDER BY order_by_column
[ TargetColumns ({ 'target_column' | 'target_column_range' }[,...]) ]
[ WindowSize ('window_size') ]
[ IncludeFirst ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);


Text Analysis

Evaluate Named Entity Finder (version 1.1)

SELECT * FROM EvaluateNamedEntityFinderPartition (


ON EvaluateNamedEntityFinderRow (
ON { table| view| (query) }
TextColumn ('text_column')
Model ('model_file')
)
PARTITION BY 1
);

EvaluateSentimentExtractor (version 1.1)

SELECT * FROM EvaluateSentimentExtractor (


ON { table | view | (query) } PARTITION BY 1
ObsColumn ('observed_column')
SentimentColumn ('sentiment_column')
);

ExtractSentiment (version 3.1)

SELECT * FROM ExtractSentiment (


ON { table| view| (query) } [ PARTITION BY ANY ]
[ ON { table| view| (query) } AS dict DIMENSION ]
TextColumn ('text_column')
[ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
[ Model ({ 'dictionary[:dict_file]' | 'classification:model_file' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Level ({ 'document' | 'sentence' }) ]
[ HighPriority ({ 'NEGATIVE_RECALL' | 'NEGATIVE_PRECISION' |
'POSITIVE_RECALL' | 'POSITIVE_PRECISION' | 'NONE' }) ]
[ Filter ({ 'POSITIVE' | 'NEGATIVE' | 'ALL' }) ]
);

FindNamedEntity (version 1.2)

SELECT * FROM FindNamedEntity (


ON { table | view | (query) } PARTITION BY ANY
[ ON (configure_table) AS ConfigureTable DIMENSION ]
TextColumn ('text_column')
[ Model ({ 'entity_type [:model_type: { model_file' |
'regular_expression' } ] [,...]} | 'all' }) ]

[ ShowEntityContext ('context_words') ]
[ EntityColumn ('entity_column') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
If the input is a query, you must map it to an alias.

LDAInference (version 1.1)

SELECT * FROM LDAInference (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
ModelTable ('model_table')
OutputTable ('output_table')
DocIDColumn ('doc_column')
WordColumn ('word_column')
[ CountColumn ('count_column') ]
[ OutputTopicNum ('topic_number') ]
[ OutputTopicWordNum ('topic_word_number') ]
);

LDATopicPrinter (version 1.1)

SELECT * FROM LDATopicPrinter (


ON model_table_name PARTITION by 1
[ Summary ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OutputTopicWordNum ('topic_words') ]
[ WordWeight ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ WordCount ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OutputByWord ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
);

LDATrainer (version 1.1)

SELECT * FROM LDATrainer (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]

[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
ModelTable ('model_table')
[ OutputTable ('output_table') ]
TopicNum ('topic_number')
[ Alpha ('alpha') ]
[ Eta ('eta') ]
DocIDColumn ('doc_column')
WordColumn ('word_column')
[ CountColumn ('count_column') ]
[ MaxIterate ('max_iterate') ]
[ ConvergenceDelta ('convergence_delta') ]
[ Seed (seed) ]
[ OutputTopicNum ('topic_number') ]
[ OutputTopicWordNum ('topic_word_number') ]
);

Levenshtein Distance (LDist) (version 1.1)

SELECT * FROM ldist (


ON { table | view | query }
PARTITION BY { key | ANY } DIMENSION
TargetColumn ('target_column')
Source ({ 'source_column' | 'source_column_range' }[,...])
[ Threshold ('threshold') ]
[ OutputColumnName ('output_distance_column') ]
[ OutputTargetColumn('output_target_column') ]
[ PrintSourceColumn('output_source_column') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

NaiveBayesTextClassifierPredict (version 1.1)

SELECT * FROM NaiveBayesTextClassifierPredict (


ON input_table PARTITION BY doc_id_column
ON model_table DIMENSION
InputTokenColumn ('token_column')
[ ModelType ({ 'Multinomial' | 'Bernoulli' }) ]
[ DocIDColumns ({ 'doc_id_column' | 'doc_id_column_range' }[,...])]
[ ModelTokenColumn ('model_token_column')
ModelCategoryColumn ('model_category_column')
ModelProbColumn ('model_token_count_column') ]
[ TopK ({ num_of_top_k_predictions | 'num_of_top_k_predictions' }) ]
);

NaiveBayesTextClassifierTrainer (version 1.1)

SELECT * FROM NaiveBayesTextClassifierTrainer (


ON (SELECT * FROM NaiveBayesTextClassifierInternal (
ON token_table AS tokens PARTITION BY category
[ ON categories_table AS categories DIMENSION ]
[ ON stop_words_table AS stop_words DIMENSION ]
TokenColumn ('token_column')
[ ModelType ({ 'Multinomial' | 'Bernoulli' }) ]
[ DocIDColumns ({ 'doc_id_column' | 'doc_id_column_range' }[,...])]
DocCategoryColumn ('doc_category_column')
[ CategoryColumn ('category_column') |
Categories ('category' [,...]) ]
[ StopWordsColumn ('stop_words_column') |
StopWords ('word' [,...]) ]
)
PARTITION BY 1
);

NER (version 1.1)

SELECT * FROM NER (


ON input_table PARTITION BY ANY
[ ON rules_table AS rules DIMENSION ]
[ ON dictionary_table AS dict DIMENSION ]
TextColumn ('text_column')
[ Models ('model_file[:jar_file]' [,...]) ]
[ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
[ ShowEntityContext ('context_words') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

NEREvaluator (version 1.1)

SELECT * FROM NEREvaluator (


ON input_table PARTITION BY 1
TextColumn ('text_column')
Model ('model_file[:jar_file]')
[ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
);

NERTrainer (version 1.1)

SELECT * FROM NERTrainer (


ON input_table PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]

[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
TextColumn ('text_column')
[ ExtractorJAR ('jar_file') ]
FeatureTemplate ('template_file')
ModelFile ('model_file')
[ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
[ MaxIterNum (max_iteration_times) ]
[ Eta (eta_threshold_value) ]
[ MinOccurNum (threshold_value) ]
);

nGram (version 1.5)

SELECT * FROM nGram (


ON { table_name | view_name| (query) }
TextColumn ('text_column_name')
[ Delimiter ('delimiter_regular_expression') ]
Grams ({ gram_number | 'range_of_values' }[,...])
[ OverLapping ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ ToLowerCase
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Punctuation ('punctuation_regular_expression') ]
[ Reset ('reset_regular_expression') ]
[ TotalGramCount
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ TotalCountColumn ('total_count_column_name') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ NGramColumn ('ngram_column_name') ]
[ NumGramsColumn ('numgrams_column_name') ]
[ FrequencyColumn ('count_column_name') ]
);
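
For example, this sketch emits overlapping, lowercased bigrams; the table paragraphs_input and its columns are illustrative:

SELECT * FROM nGram (
ON paragraphs_input
TextColumn ('paragraph')
Grams ('2')
OverLapping ('true')
ToLowerCase ('true')
Accumulate ('paraid')
);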

POSTagger (version 2.1)

SELECT * FROM PosTagger (


ON { table | view | query }
TextColumn ('text_column_name')
[ Language ({ 'en' | 'zh_CN' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Sentenizer (version 1.1)

SELECT * FROM Sentenizer (


ON input_table
TextColumn ('text_column')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

TextChunker (version 1.2)

SELECT * FROM TextChunker (


ON input_table PARTITION BY partition_key ORDER BY word_sn
WordColumn ('word_column')
POSColumn ('pos_tag_column')
);

TextClassifier (version 1.2)

SELECT * FROM TextClassifier (


ON input_table
TextColumn ('text_column')
Model ('model_name')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

TextClassifierEvaluator (version 1.2)

SELECT * FROM TextClassifierEvaluator (


ON { table_name| view_name| (query) } PARTITION BY 1
ObsColumn ('expected_column')
PredictColumn ('predicted_column')
);

TextClassifierTrainer (version 1.4)

SELECT * FROM TextClassifierTrainer (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
TextColumn ('text_column')
CategoryColumn ('category_column')
ModelFile ('model_file')
ClassifierType ({ 'KNN' | 'MaxEnt' })
[ ClassifierParameters ('name:value' [,...]) ]
[ NLPParameters ('name:value' [,...]) ]
[ FeatureSelectionMethod ('DF:[{ min:max | min: | :max }]') ]
);

Note:
In the FeatureSelectionMethod argument, you must type the brackets. They do not indicate that their contents are optional.
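
For example, to keep only tokens whose document frequency is at least 100, you would type the argument literally as:

FeatureSelectionMethod ('DF:[100:]')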

TextMorph (version 1.2)

SELECT * FROM TextMorph (


ON { table | view | (query) }
WordColumn ('word_column')
[ POSTagColumn ('pos_tag_column') ]
[ SingleOutput ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ POS ('pos' [,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

TextTagging (version 1.3)

SELECT * FROM TextTagging (


ON { text_table| text_view| (text_query) } PARTITION BY ANY
[ ON { rules_table| rules_view| (rules_query) } AS rules DIMENSION ]
[ Language ({ 'en' | 'zh_cn' | 'zh_tw' }) ]
[ Rules ('rule AS tag_name' [,...]) ]
[ Tokenize ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OutputByTag ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ TagDelimiter ('delimiter_string') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

TextTokenizer (version 3.2)

SELECT * FROM TextTokenizer (


ON input_table PARTITION BY ANY
[ ON dict_table AS dict DIMENSION ]
TextColumn ('text_column')
[ Language ({ 'en' | 'zh_CN' | 'zh_TW' | 'jp' }) ]
[ Model ('model_file') ]
[ OutputDelimiter ('delimiter') ]
[ OutputByWord ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ UserDictionaryFile ('user_dictionary_file') ]
);

Text_Parser (version 1.3)

SELECT * FROM Text_Parser (


ON { table_name| view_name| (query) }
[ PARTITION BY expression [,...] ]
TextColumn ('text_column_name')
[ ToLowerCase ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Stemming ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Delimiter ('delimiter_regular_expression') ]
[ TotalWordsNum
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Punctuation ('punctuation_regular_expression') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ TokenColumn ('token_column') ]
[ FrequencyColumn ('frequency_column') ]
[ TotalColumn ('total_column') ]
[ RemoveStopWords
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ PositionColumn ('position_column') ]
[ ListPositions
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ OutputByWord ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ StemmingExceptions ('exception_rule_file') ]
[ StopWords ('stop_word_file') ]
);

TF_IDF (TF_IDF version 2.1, TF version 1.1)

SELECT * FROM TF_IDF (


ON TF (
ON { table | view | (query) } PARTITION BY docid
[ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
) AS tf PARTITION BY term
[ ON (SELECT COUNT (DISTINCT docid)
FROM doccount_table) AS doccount DIMENSION ]
[ ON (SELECT term, COUNT (DISTINCT docid)
FROM docperterm_table
GROUP BY term) AS docperterm PARTITION BY term ]
[ ON (SELECT DISTINCT (term) AS term, idf
FROM tf_idf_output_table ) AS idf PARTITION BY term ]
);
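
For example, this sketch scores a document table docset with columns docid and term (both names illustrative); the nested TF call supplies term frequencies and the doccount input supplies the total number of documents:

SELECT * FROM TF_IDF (
ON TF (
ON (SELECT docid, term FROM docset) PARTITION BY docid
) AS tf PARTITION BY term
ON (SELECT COUNT (DISTINCT docid)
FROM docset) AS doccount DIMENSION
);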

TrainNamedEntityFinder (version 1.3)

SELECT * FROM TrainNamedEntityFinder (


ON { table| view| (query)}
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
TextColumn ('text_column')
EntityType ('entity_type')
Model ('model_file')
[ IterNum ('iterator')]
[ Cutoff ('cutoff')]
);

TrainSentimentExtractor (version 2.1)

SELECT * FROM TrainSentimentExtractor (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('training_table')
TextColumn ('text_column')
SentimentColumn ('sentiment_column')
ModelFile ('model_file')
[ Language ({ 'en' | 'zh_CN' | 'zh_TW' }) ]
);

Cluster Analysis
The Modularity function, which discovers clusters in input graphs, is in the Graph Analysis chapter.

Canopy (version 2.0)

SELECT * FROM Canopy (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('string')
LooseDistance ('maximum')
TightDistance ('minimum')
);

GMMFit (version 1.0)

SELECT * FROM GMMFit (


ON { table | view | (query) | (SELECT 1) } AS init_params
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
[ MaxClusterNum ('max_clusters') ]
[ ClusterNum ('clusters') ]
[ CovarianceType ({ 'spherical' | 'diagonal' | 'tied' | 'full' }) ]
[ Tolerance ('tolerance') ]
[ MaxIterNum ('max_iterations') ]
[ ConcentrationParam ('concentration') ]
[ PackOutput ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

GMMPredict (version 1.0)

SELECT * FROM GMMPredict (


ON { table | view | (query) } AS modeldata DIMENSION
ON { table | view | (query) } AS testdata PARTITION BY key
[ OutputFormat ({ 'sparse' | 'dense' }) ]
[ TopNClusters (n) ]
[ PrintLogLikelihood ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Attributes ({ 'testdata_column' | 'testdata_column_range' }[,...])]
[ IDColumn ('testdata_column')]
);

GMMProfile (version 1.0)

SELECT * FROM GMMProfile (


ON { table | view | (query) } PARTITION BY 1
);

KMeans (version 1.6)

SELECT * FROM KMeans (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
[ ClusteredOutput ('clustered_output_table') ]
[ UnpackColumns
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ InitialSeeds (starting_clusters) ]
[ NumClusters (number_of_means) ]
[ CentroidsTable ('centroids_table') ]
[ Threshold ('threshold') ]
[ MaxIterNum ('max_iterations') ]
);
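
For example, this sketch groups an illustrative table computers_train into eight clusters, stopping after 10 iterations or when centroid movement falls below the threshold:

SELECT * FROM KMeans (
ON (SELECT 1) PARTITION BY 1
InputTable ('computers_train')
OutputTable ('kmeans_centroids')
ClusteredOutput ('kmeans_assignments')
NumClusters (8)
Threshold ('0.05')
MaxIterNum ('10')
);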

KMeansPlot (version 1.1)

SELECT * FROM KMeansPlot (


ON { table | view | query } PARTITION BY ANY
ON { table | view | query } DIMENSION
[ CentroidsTable ('centroids_table') ]
[ PrintDistance ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]

);

Note:
When calling KMeansPlot on a view, you must provide aliases (a requirement of multi-input SQL-MapReduce). For example:

SELECT *
FROM KMeansPlot (
ON pa_prdwk.seg_data_v AS input_data PARTITION BY ANY
ON pa_prdwk.seg_data_output AS segmentation_data_output DIMENSION
CentroidsTable ('segmentation_data_output')
);

KModes (version 1.0)

SELECT * FROM KModes (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('table_name')
OutputTable ('table_name')
[ InitialSeedTable ('table_name')]
[ ModelIdColumn ('column_name') ]
[ NumClusters ('integer_value1', 'integer_value2', ...) ]
InputColumns ('InputTable_column1', 'InputTable_column2', ...)
[ Threshold ('double') ]
[ MaxIterNum ('integer') ]
[ Distance ('manhattan'|'euclidean') ]
[ CategoryWeights ('double_value1', 'double_value2',...) ]
[ AsCategories
({ 'ascat_column' | 'ascat_column_range' }[,...]) ]
);

KModesPredict (version 1.0)

SELECT * FROM KModesPredict (


ON <table|view|query> AS input PARTITION BY ANY
ON <table|view|query> AS model DIMENSION
[ TestModels ('string_value1', 'string_value2', ...) ]
[ PrintDistance ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Minhash (version 2.2)

SELECT * FROM Minhash (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
IDColumn ('id_column')
ItemsColumn ('items_column')
[ SeedTable ('seed_table_to_use') ]
[ SaveSeedTo ('seed_table_to_save') ]
HashNum ('number_of_hash_functions')
KeyGroups ('number_of_key_groups')
[ InputFormat ({ 'bigint' | 'integer' | 'string' | 'hex' }) ]
[ MinClusterSize ('minimum_cluster_size') ]
[ MaxClusterSize ('maximum_cluster_size') ]
[ Delimiter ('delimiter') ]
);

Naive Bayes

NaiveBayesMap and NaiveBayesReduce (version 1.3)

CREATE TABLE model_table_name (PARTITION KEY(column_name)) AS


SELECT * FROM NaiveBayesReduce (
ON (
SELECT * FROM NaiveBayesMap (
ON input_table
ResponseColumn ('response_column')
NumericInputs ({ 'numeric_input_column' |
'numeric_input_column_range' }[,...])
CategoricalInputs ({ 'categorical_input_column' |
'categorical_input_column_range' }[,...])
)
) PARTITION BY column_name
);
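
For example, a sketch of a model build over an illustrative training table housing_train; class_nb stands in for whatever class column the NaiveBayesMap output exposes, so check the map output before choosing the partition key:

CREATE TABLE nb_model (PARTITION KEY(class_nb)) AS
SELECT * FROM NaiveBayesReduce (
ON (
SELECT * FROM NaiveBayesMap (
ON housing_train
ResponseColumn ('homestyle')
NumericInputs ('price', 'lotsize')
CategoricalInputs ('driveway', 'gashw')
)
) PARTITION BY class_nb
);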

NaiveBayesPredict (version 1.4)

SELECT * FROM NaiveBayesPredict (


ON input_table
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
Model ('model_table_name')
IDCol ('test_point_id_col')
NumericInputs ({ 'numeric_input_column' |
'numeric_input_column_range' }[,...])
CategoricalInputs ({ 'categorical_input_column' |
'categorical_input_column_range' }[,...])
);


Ensemble Methods

AdaBoost_Drive (version 1.5)

SELECT * FROM AdaBoost_Drive (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
AttributeTable ('attribute_table')
AttributeNameColumns ('attribute_name_column' [,...] )
AttributeValueColumn ('attribute_value_column')
[ CategoricalAttributeTable ('cat_attribute_table') ]
ResponseTable ('response_table')
OutputTable ('output_table')
IdColumns ('id_column' [,...] )
ResponseColumn ('response_column')
[ IterNum ('iterations') ]
[ NumSplits ('splits') ]
[ ApproxSplits ({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
[ SplitMeasure ({ 'gini' | 'entropy' }) ]
[ MaxDepth ('max_depth') ]
[ MinNodeSize ('min_node_size') ]
[ DropOutputTable
({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
);

AdaBoost_Predict (version 1.5)

SELECT * FROM AdaBoost_Predict (


ON { table | view | query } AS attributetable PARTITION BY key
ON { table | view | query } AS model DIMENSION
AttrTableGroupbyColumns ('group_column' [,...] )
AttrTablePidColumns ('pid_column' [,...])
AttrTableValColumn ('value_column')
);

Forest_Analyze (version 1.1)

SELECT * FROM Forest_Analyze (


ON { table_name | view_name | (query) }
[ NumLevels (number_of_levels) ]
);

Forest_Drive (version 1.5)

SELECT * FROM Forest_Drive (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table_name')
OutputTable ('output_table_name')
ResponseColumn ('response_column')
[ NumericInputs ({ 'numeric_input_column_name' |
'numeric_input_column_range' }[,...]) ]
[ MaxNumCategoricalValues (max_cat_values) ]
[ CategoricalInputs ({ 'categorical_input_column_name' |
'categorical_input_column_range' }[,...]) ]
[ TreeType ( { 'regression' | 'classification' } ) ]
[ NumTrees (number_of_trees) ]
[ TreeSize (tree_size) ]
[ MinNodeSize (min_node_size) ]
[ Variance (variance) ]
[ MaxDepth (max_depth) ]
[ MonitorTable ('monitor_table_name') ]
[ DropMonitorTable
( {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ) ]
[ Mtry ('mtry') ]
[ MtrySeed ('mtryseed') ]
[ Seed ('seed') ]
);

Forest_Predict (version 1.5)

SELECT * FROM Forest_Predict (


ON { table_name| view_name| (query) } [ PARTITION BY ANY ]
[ ON model_table AS ModelTable DIMENSION ]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
ModelFile ('model_file')
Forest ('model_table')
[ NumericInputs ({ 'numeric_input_column_name' |
'numeric_input_column_range' }[,...]) ]
[ CategoricalInputs ({ 'categorical_input_column_name' |
'categorical_input_column_range' }[,...]) ]
IDColumn ('id_column')
[ Detailed ({ 'true' | 'false' }) ]
[ Accumulate ({ 'accumulate_column' |
'accumulate_column_range' } [,...]) ]
);

Single_Tree_Drive (version 1.3)

SELECT * FROM Single_Tree_Drive (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
{ InputTable ('input_table') |
AttributeTableName ('attribute_table')
ResponseTableName ('response_table') }
OutputTable ('output_table')
AttributeNameColumns ('attribute_column' [,...])
AttributeValueColumn ('node_column')
ResponseColumn ('response_column')
IDColumns ({ 'id_column' | 'id_column_range' } [,...])
[ CategoricalAttributeTableName
('categorical_attribute_table') ]
[ SaveFinalResponseTableTo ('final_response_table') ]
[ SplitsTable ('splits_table') ]
[ SplitsValueColumn ('splits_valcol') ]
[ NumSplits ('num_splits_to_consider') ]
[ ApproxSplits
({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ IntermediateSplitsTable ('intermediate_splits_table') ]
[ DropTable ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ MinNodeSize ('minimum_split_size') ]
[ MaxDepth ('max_depth') ]
[ Weighted ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ WeightColumn ('weight_column') ]
[ SplitMeasure ( { 'gini' | 'entropy' | 'chisquare' } ) ]
);

Single_Tree_Predict (version 1.2)

SELECT * FROM Single_Tree_Predict (


ON attribute_table AS attribute_table
PARTITION BY pid_col [,...]
ON model_table AS model_table DIMENSION
AttrTableGroupbyColumns ({ 'gcol' | 'gcol_range' } [,...])
AttrTablePIDColumns ({ 'pid_col' | 'pid_col_range' } [,...])
AttrTableValColumn ('value_column')
);


Association Analysis

Basket_Generator (version 1.3)

SELECT * FROM Basket_Generator (


ON { table_name | view_name | (query) }
PARTITION BY partition_column [,...]
BasketItem ('basket_item' [,...])
[ BasketSize ('basket_size') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Combination ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ MaxItems ('item_set_max') ]
);

CFilter (version 1.7)

SELECT * FROM CFilter (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
InputColumns ({ 'input_column' | 'input_column_range' }[,...])
JoinColumns ({ 'join_column' | 'join_column_range' }[,...])
[ AddColumns ({ 'add_column' | 'add_column_range' } [,...]) ]
[ PartitionKeyColumn ('partition_key_column') ]
[ MaxItemSet ('max_item_set') ]
[ DropTable ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

FPGrowth (version 1.2)

SELECT * FROM FPGrowth (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputPatternTable ('output_pattern_table')
OutputRuleTable ('output_rule_table')
TranItemColumns ({ 'item_column' | 'item_column_range' }[,...] )
TranIDColumns ({ 'id_column' | 'id_column_range' }[,...])
[ PatternsOrRules ({ 'patterns' | 'rules' | 'both' }) ]
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ PatternDistributionKeyColumn ('p_distribution_key_column') ]
[ RuleDistributionKeyColumn ('r_distribution_key_column') ]
[ Compress ({ 'nocompress' | 'high' | 'medium' | 'low' }) ]
[ DropTable ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ GroupSize (group_size) ]
[ MinSupport (min_support) ]
[ MinConfidence (min_confidence) ]
[ MaxPatternLength (pattern_length) ]
[ AntecedentCountRange ('lower_bound_upper_bound') ]
[ ConsequenceCountRange ('lower_bound_upper_bound') ]
[ Delimiter ('delimiter') ]
);
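
For example, this sketch mines patterns and rules from an illustrative transaction table sales_transactions, keeping itemsets that occur in at least 1% of transactions:

SELECT * FROM FPGrowth (
ON (SELECT 1) PARTITION BY 1
InputTable ('sales_transactions')
OutputPatternTable ('fpgrowth_patterns')
OutputRuleTable ('fpgrowth_rules')
TranItemColumns ('product')
TranIDColumns ('orderid')
MinSupport (0.01)
MinConfidence (0.3)
);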

KNNRecommenderPredict (version 1.0)

SELECT * FROM KnnRecommenderPredict (


ON <rating_table> AS ratings
{ PARTITION BY ANY | PARTITION BY <userid_column> }
ON <weight_model_table> AS weights DIMENSION
ON <bias_model_table> AS bias DIMENSION
[ UserIdColumn(<userid_column>) ]
[ ItemIdColumn(<itemid_column>) ]
[ RatingColumn(<rating_column>) ]
[ TopK(<top_k_recommendations>) ]
);

KNNRecommenderTrain (version 1.0)

SELECT * FROM KnnRecommenderTrain (


ON (SELECT 1) PARTITION BY 1
RatingTable(<user_rating_table>)
[ UserIdColumn(<userid_column>) ]
[ ItemIdColumn(<itemid_column>) ]
[ RatingColumn(<rating_column>) ]
WeightModelTable(<weight_model_table>)
BiasModelTable(<bias_model_table>)
[ NearestItemsTable(<item_neighbors_table>) ]
[ K(<number_of_item_neighbors>) ]
[ LearningRate(<learning_rate>) ]
[ MaxIterNum(<max_iteration_number>) ]
[ Threshold(<threshold_to_stop_iteration>) ]
[ ItemSimilarity(<method_to_calculate_item_similarity>) ]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
);

WSRecommender (version 1.0)

SELECT * FROM WSRecommender (


ON (SELECT * FROM WSRecommenderReduce (
ON item_table_name AS item_table PARTITION BY item1_column
ON user_table_name AS user_table PARTITION BY item_column
[ Item1 ('item1_column') ]
[ Item2 ('item2_column') ]
[ ItemSimilarity ('similarity_column') ]
[ UserItem ('item_column') ]
[ UserID ('user_column') ]
[ UserPref ('preference_column') ]
[ AccumulateItem ({ 'accumulate_item_column' |
'accumulate_item_column_range' }[,...]) ]
[ AccumulateUser ({ 'accumulate_user_column' |
'accumulate_user_column_range' }[,...]) ]
)
) AS temporary_table PARTITION BY usr, col1_item2
);

Graph Analysis

AllPairsShortestPath (version 1.2)

SELECT * FROM AllPairsShortestPath (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON sources_table AS sources PARTITION BY source_vertex_key ]
[ ON targets_table AS targets PARTITION BY target_vertex_key ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ EdgeWeight ('edge_weight') ]
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ MaxDistance ('max_distance') ]
[ GroupSize ('group_size') ]
);

Betweenness (version 1.2)

SELECT * FROM Betweenness (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON targets_table AS targets PARTITION BY vertex_key ]
[ ON sources_table AS sources PARTITION BY source_vertex_key ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ EdgeWeight ('edge_weight_column') ]
[ MaxDistance ('max_distance') ]
[ GroupSize ('group_size') ]
[ SampleRate ('sample_rate') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Closeness (version 1.2)

SELECT * FROM Closeness (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON sources_table AS sources PARTITION BY source_vertex_key ]
[ ON targets_table AS targets PARTITION BY target_vertex_key ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ EdgeWeight ('edge_weight') ]
[ MaxDistance ('max_distance') ]
[ GroupSize ('group_size') ]
[ SampleRate ('sample_rate') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

EigenvectorCentrality (version 1.1)

SELECT * FROM EigenvectorCentrality (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
TargetKey ({ 'edge_attribute' | 'edge_attribute_range' }[,...])
[ EdgeWeight ('edge_weight') ]
[ Family ({ 'katz' | 'bonacich' | 'eigenvector' }) ]
[ Alpha ('alpha_value') ]
[ Beta ('beta_value') ]
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ MaxIterNum ('max_iteration_number') ]
[ Threshold ('threshold') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

gTree (version 1.0)

SELECT * FROM gTree (


ON { table | view | (query) } AS vertices PARTITION BY key
ON { table | view | (query) } AS edges PARTITION BY key
ON { table | view | (query) } AS root PARTITION BY key
TargetKey ({ 'edges_column' | 'edges_column_range' }[,...])
[ AllowCycles ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ MaxIterNum ('max_depth') ]
[ Output ({ 'all' | 'end' }) ]
[ Results ('func(expr) [ AS alias ]' [,...]) ]
[ EdgeResults ('func(expr) [ AS alias ]' [,...]) ]
[ FinalEdgeFlag ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})]
);

LocalClusteringCoefficient (version 1.1)

SELECT * FROM LocalClusteringCoefficient (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
TargetKey
({ 'target_vertex_column' | 'target_vertex_column_range' }[,...])
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ EdgeWeight ('edge_weight') ]
[ DegreeRange ({ '[min:max]' | '[min:]' | '[:max]' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Note:
In the DegreeRange argument, you must type the brackets. They do not indicate that their contents are
optional.
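
For example, to compute coefficients only for vertices of degree 2 or greater, you would type the argument literally as:

DegreeRange ('[2:]')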

LoopyBeliefPropagation (version 1.0)

SELECT * FROM LoopyBeliefPropagation (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON observation_table AS observation PARTITION BY source_vertex_key ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ ObservationColumn ('observation_column') ]
[ EdgeWeight ('edge_weight') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ MaxIterNum ('max_iter_num') ]
[ Threshold ('threshold') ]
);

Modularity (version 1.1)

SELECT * FROM Modularity (


ON { table | view | (query) } AS "vertices" PARTITION BY vertex_key
ON { table | view | (query) } AS "edges" PARTITION BY source_vertex_key
[ ON { table | view | (query) } AS "sources" PARTITION BY vertex_key ]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
TargetKey
({ 'target_key_column' | 'target_key_column_range' }[,...])
[ Directed ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ EdgeWeight (edge_weight) ]
[ CommunityAssociation (community_id) ]
[ Resolution (resolution [,...]) ]
[ CommunityEdgeTable (community_edge_table) ]
[ Seed ('seed') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

nTree (version 1.1)

SELECT * FROM nTree (


ON input_table
PARTITION BY partition_columns
[ ORDER BY ordering_columns ]
Root_Node (boolean_expression)
Node_ID (expression)
Parent_ID (expression)
Allow_Cycles ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})
Starts_With ({ 'root' | 'leaf' | expression })
Mode ({ 'up' | 'down' })
Output ({ 'end' | 'all' })
[ Max_Distance (expression) ]
[ Logging ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})]
Result (aggregate [,...])
);
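
For example, a sketch that walks an illustrative employees hierarchy from each root downward and returns the chain of names to each leaf; PATH(emp_name) stands in for whichever aggregate you want in Result:

SELECT * FROM nTree (
ON employees
PARTITION BY department
Root_Node (mgr_id IS NULL)
Node_ID (emp_id)
Parent_ID (mgr_id)
Allow_Cycles ('false')
Starts_With ('root')
Mode ('down')
Output ('end')
Result (PATH(emp_name) AS path)
);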

PageRank (version 1.1)

SELECT * FROM PageRank (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY edge_src_vertex_key
TargetKey ('target_key_column' [,...])
[ EdgeWeight ('edge_weight') ]
[ DampFactor ('damp_factor') ]
[ MaxIterNum ('max_iterations') ]
[ Threshold ('threshold') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);
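
For example, this sketch ranks vertices of a directed graph stored as an illustrative persons vertex table and calls edge table:

SELECT * FROM PageRank (
ON persons AS vertices PARTITION BY id
ON calls AS edges PARTITION BY source_id
TargetKey ('target_id')
Accumulate ('id')
);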

pSALSA (version 1.1)

SELECT * FROM pSALSA (


ON vertices_table AS vertices PARTITION BY vertex_key
ON edges_table AS edges PARTITION BY source_vertex_key
[ ON sources_table AS sources PARTITION BY source_vertex_key ]
[ ON targets_table AS targets PARTITION BY target_vertex_key ]
SourceKey
({ 'source_vertex_column' | 'source_vertex_column_range' }[,...])
TargetKey
({ 'target_vertex_column' | 'target_vertex_column_range' }[,...])
[ EdgeWeight ('weight_column') ]
MaxHubNum ('max_hubs')
MaxAuthorityNum ('max_authority')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ TeleportProb ('eta') ]
[ RandomWalkLength ('L') ]
);

RandomWalkSample (version 1.2)

SELECT * FROM RandomWalkSample (


ON { table | view | (query) } AS "vertices"
PARTITION BY vertex_attributes
ON { table | view | (query) } AS "edges"
PARTITION BY source_vertex_attributes
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
TargetKey
({ 'target_vertex_column' | 'target_vertex_column_range' }[,...])
[ SampleRate ('sample_rate') ]
[ FlyBackRate ('fly_back_rate') ]
[ Seed ('seed') ]
OutputTables ('vertex_table_name', 'edge_table_name')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);


Neural Networks

NeuralNet (version 1.0)

SELECT * FROM NeuralNet (


ON (SELECT 1) PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('table_name')
OutputTable ('table_name')
[ WeightTable ('table_name') ]
InputColumns ('InputTable_column1', 'InputTable_column2', ...)
ResponseColumns ('string_value1', 'string_value2', ...)
[ GroupByColumns ('InputTable_column1', 'InputTable_column2', ...) ]
[ HiddenLayers ('integer_value1', 'integer_value2',...) ]
[ Threshold ('double') ]
[ MaxIterNum ('integer') ]
[ LearningRate ('double') ]
[ ActivationFunction ({ 'logistic' | 'tanh' }) ]
[ ErrorFunction ({ 'sse' | 'ce' }) ]
[ Algorithms ('backprop') ]
[ LinearOutput
({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
[ OverwriteOutput
({ 'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0' }) ]
);

NeuralNetPredict (version 1.0)

SELECT * FROM NeuralNetPredict (


ON <table|view|query> AS testdata { PARTITION BY <key> | PARTITION BY ANY }
ON <table|view|query> AS modeldata { PARTITION BY <key> | DIMENSION }
InputColumns ('testdata_column1', 'testdata_column2', ...)
[ GroupByColumns('testdata_column1', 'testdata_column2', ...) ]
HiddenLayers ('integer_value1', 'integer_value2', ...)
[ ActivationFunction ('logistic'|'tanh') ]
[ LinearOutput ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ NumOutputs ('integer') ]
[ Accumulate ('testdata_column1', 'testdata_column2', ...) ]
);


Data Transformation

Antiselect (version 1.0)

SELECT * FROM Antiselect (


ON input_table
Exclude ({ 'column_name' | 'column_name_range' }[,...])
);
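
For example, this sketch returns every column of the illustrative table web_clicks except the two named ones:

SELECT * FROM Antiselect (
ON web_clicks
Exclude ('session_cookie', 'raw_user_agent')
);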

Apache_Log_Parser (version 2.2)

SELECT * FROM Apache_Log_Parser (


ON { table_name | view_name | (query) }
TargetColumn ('log_column')
[ LogFormat ('format_string') ]
[ ExcludeFiles ('.file_suffix' [,...]) ]
[ SearchInfoFlag
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

Categorize (version 1.0)

SELECT * FROM Categorize (


ON { table | view | query } PARTITION BY ANY
Columns ( { column | column_range }[,...] )
);

FellegiSunterPredict (version 1.1)

SELECT * FROM FellegiSunterPredict (


ON { table | view | (query) } PARTITION BY ANY
ON model_table DIMENSION
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

FellegiSunterTrainer (version 1.1)

SELECT * FROM FellegiSunterTrainer (


ON (SELECT 1) PARTITION BY 1
InputTable ('input_table')
ComparisonFields ('field_name[:threshold_value]' [,...])
[ TagColumn ('tag_column') ]
[ InitialM ('initial_value_of_M') ]
[ InitialU ('initial_value_of_U') ]
[ InitialP ('initial_value_of_P') ]
[ MaxIteration ('max_iteration') ]
[ Eta ('eta_value') ]
[ Lambda ('lambda_value') ]
[ Mu ('mu_value') ]
);

GeometryLoader (version 1.1)

SELECT * FROM GeometryLoader (


ON mr_driver
Path ('inputPath' [,...])
[ Host ('afs_server_ip_address') ]
[ Port ('afs_server_port_number') ]
[ InputFormat ({ 'kml' | 'geojson' | 'shp' | 'mapinfo' }) ]
[ OutputFormat ({ 'wkt' | 'json' | 'kml' | 'gml' }) ]
[ OutputAttributes ('colname [ coltype ]' [,...]) ]
);

GeometryOverlay (version 1.1)

UNION, INTERSECTION, DIFFERENCE and SYMDIFFERENCE

SELECT * FROM GeometryOverlay (


ON { table_name | view_name | (query) } AS source PARTITION BY ANY
ON { table_name | view_name | (query) } AS reference DIMENSION
SourceLocationColumn ('source_location_column')
ReferenceLocationColumn ('ref_location_column')
ReferenceNameColumns
({ 'ref_name_column' | 'ref_name_column_range' }[,...])
BoundaryOperator (
{ 'UNION' | 'INTERSECTION' | 'DIFFERENCE' | 'SYMDIFFERENCE' })
[ OutputAll ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

CONVEXHULL

SELECT * FROM GeometryOverlay (


ON { table_name | view_name | (query) }
SourceLocationColumn ('source_location_column')
BoundaryOperator ({ 'BUFFER' | 'CONVEXHULL' })
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

BUFFER

SELECT * FROM GeometryOverlay (


ON { table_name | view_name | (query) }
SourceLocationColumn ('source_location_column')
BoundaryOperator ({ 'BUFFER' | 'CONVEXHULL' })
Distance ('distance')
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

IdentityMatch (version 1.1)

When Reference Data Fits in Memory

SELECT * FROM IdentityMatch (


ON source_input_table AS a PARTITION BY ANY
ON reference_input_table AS b DIMENSION
IDColumn ('a.id_column: b.id_column')
{ NominalMatchColumns ('a.columnX: b.columnY' [,...]) |
FuzzyMatchColumns ('a.columnX: b.columnY, match_metric,
match_weight [, synonym_file ]' [,...]) }
[ Accumulate ('{a|b}.accumulate_column' [,...]) ]
[ Threshold ('threshold') ]
);

When Reference Data Does Not Fit in Memory

SELECT * FROM IdentityMatch (


ON source_input_table AS a PARTITION BY key
ON reference_input_table AS b PARTITION BY key
IDColumn ('a.id_column: b.id_column')
{ NominalMatchColumns ('a.columnX: b.columnY' [,...]) |
FuzzyMatchColumns ('a.columnX: b.columnY, match_metric,
match_weight [, synonym_file ]' [,...]) }
[ Accumulate ('{a|b}.accumulate_column' [,...]) ]
[ Threshold ('threshold') ]
);

IPGeo (version 2.1)

SELECT * FROM IPGeo (


ON input_table
IPAddressColumn ('ip_address_column')
[ Converter ('file', 'class') ]
[ IPDatabaseLocation ('geolocation_DB_loc') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

JSONParser (version 1.5)

SELECT * FROM JSONParser (


ON tablename
TextColumn ('text_columnname')
Nodes ('parentnode/childnode' [,...])
[ SearchPath ('nodename/...') ]
[ Delimiter ('delimiter_string') ]
[ MaxItemNum ('number') ]
[ NodeIDOutputColumn ('columnname') ]
[ ParentNodeOutputColumn ('columnname') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ ErrorHandler ('{true|yes|t|y|1|false|no|f|n|0}
[;[output_col_name:] input_col_name[,...]]') ]
);

Multi_Case (version 1.1)

SELECT * FROM Multi_Case (


ON (SELECT *, condition AS case [,...]
FROM { table | view | (query) })
Labels ('case AS "label"' [,...])
);

MurmurHash (version 1.1)

SELECT * FROM MurmurHash (


ON { table | view | (query) }
InputColumns ( { column | column_range }[,...] )
[ HashBit ({ '32' | '64' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

OutlierFilter (version 1.3)

SELECT * FROM OutlierFilter (


ON (SELECT 1)
PARTITION BY 1
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
InputTable ('input_table')
OutputTable ('output_table')
TargetColumn ({ 'target_column' | 'target_column_range' }[,...])
[ OutlierTable ('outlier_table') ]
[ GroupByColumns
({ 'group_by_column' | 'group_by_column_range' }[,...]) ]
[ Method (method [,...]) ]
[ ApproxPercentile
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ PercentileThreshold ('perc_lower', 'perc_upper') ]
[ PercentileAccuracy ('accuracy') ]
[ IQRMultiplier ('k') ]
[ RemoveTail ({ 'both' | 'upper' | 'lower' }) ]
[ ReplacementValue ({ 'delete' | 'null' | 'median' | 'newval' }) ]
[ MADScaleConstant ('constant') ]
[ MADThreshold ('madlimit') ]
);

Pack (version 1.2)

SELECT * FROM Pack (


ON { table_name | view_name | (query) }
[ InputColumns ({ 'input_column' | 'input_column_range' }[,... ]) ]
[ Delimiter ('delimiter') ]
[ IncludeColumnName
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
OutputColumn ('output_column')
);

PartitionScale (version 1.2)

SELECT * FROM PartitionScale (


ON input_table PARTITION BY partition_columns
Method ('method' [,...])
[ MissValue ({ 'KEEP' | 'OMIT' | 'ZERO' | 'LOCATION' })]
InputColumns ( { column | column_range }[,...] )
[ Global ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Multiplier ('multiplier' [,...]) ]
[ Intercept ('intercept' [,...]) ]
);

Pivot (version 1.5)

SELECT * FROM pivot (


ON input_table PARTITION BY partition_column[,...]
[ ORDER BY order_column]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
PartitionColumns
({ 'partition_column' | 'partition_column_range' }[,...])
{ NumberOfRows ('number_of_rows') |
PivotColumn ('pivot_column')
[ PivotKeys ('pivot_key' [,...]) ]
}
TargetColumns ({ 'target_column' | 'target_column_range' }[,...])
);
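
For example, this sketch folds three rows per partition into the columns of a single output row; the table patient_obs and its columns are illustrative:

SELECT * FROM pivot (
ON patient_obs PARTITION BY patient_id ORDER BY obs_time
PartitionColumns ('patient_id')
NumberOfRows ('3')
TargetColumns ('temperature')
);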

PointInPolygon

Small Polygon Count and Large Point Count

SELECT * FROM PointInPolygon (


ON source_table AS source PARTITION BY ANY
ON reference_table AS reference DIMENSION
SourceLocationColumn ('source_location_point_column'
[, 'source_location_point_column_2' ])
ReferenceLocationColumn ('reference_location_polygon_column')
ReferenceNameColumns ({ 'reference_name_column' |
'reference_name_column_range' }[,...])
[ OutputAll ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Large Polygon Count and Small Point Count

SELECT * FROM PointInPolygon (


ON dimension_table as source DIMENSION
ON reference_table as reference PARTITION BY ANY
SourceLocationColumn ('source_location_point_column'
[, 'source_location_point_column_2' ])
ReferenceLocationColumn ('reference_location_polygon_column')
ReferenceNameColumns ({ 'reference_name_column' |
'reference_name_column_range' }[,...])
OutputAll ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Only to Determine Relations of Points and Polygons in Same Group

SELECT * FROM PointInPolygon (


ON { table_name | view_name |(query) } AS source
PARTITION BY group_key
ON { table_name | view_name |(query) } AS reference
PARTITION BY group_key
SourceLocationColumn ('source_location_point_column'
[, 'source_location_point_column_2' ])
ReferenceLocationColumn ('reference_location_polygon_column')
ReferenceNameColumns ({ 'reference_name_column' |
'reference_name_column_range' }[,...])
OutputAll ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

PSTParserAFS (version 1.1)

SELECT * FROM PSTParserAFS (


ON empty_table
Path ('input_path' [,...])
[ Host ('afs_server_ip_address') ]
[ Port ('afs_server_port_number') ]
[ OutputColumns ({ 'output_column' | 'output_column_range' }[,...]) ]
[ Exclude ('message_folder' [,...]) ]
);

Scale (version 1.2)

SELECT * FROM Scale (


ON input_table AS input PARTITION BY ANY
ON (SELECT * FROM ScaleMap ...) AS statistic DIMENSION
Method ('method' [,...])
[ Global ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ InputColumns ( { column | column_range }[,...] ) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Multiplier ('multiplier' [,...]) ]
[ Intercept ('intercept' [,...]) ]
);

ScaleMap (version 1.2)

SELECT * FROM ScaleMap (


ON { table | view | (query) }
InputColumns ( { column | column_range }[,...] )
[ MissValue ({ 'KEEP' | 'OMIT' | 'ZERO' | 'LOCATION' })]
);

ScalePrinter (version 1.2)

SELECT * FROM ScalePrinter (


ON (SELECT * FROM ScaleMap ...) PARTITION BY 1
);

StringSimilarity (version 1.1)

SELECT * FROM StringSimilarity (


ON { table | view | (query) } PARTITION BY ANY
ComparisonColumnPairs ('comparison_type (
column1, column2 [, constant]) [ AS output_column]') [,...]
[ CaseSensitive
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}[,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);
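
For example, this sketch compares two name columns; the table name_pairs, its columns, and the choice of comparison type are illustrative:

SELECT * FROM StringSimilarity (
ON name_pairs PARTITION BY ANY
ComparisonColumnPairs ('jaro (src_name, tgt_name) AS sim')
CaseSensitive ('false')
Accumulate ('id')
);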

Unpack (version 1.2)

SELECT * FROM Unpack (


ON { table_name | view_name| (query) }
InputColumn ('input_column')
OutputColumns ({ 'output_column' | 'output_column_range' }[,...])
OutputDataTypes ('datatype' [,...])
[ Delimiter ('delimiter') ]
[ ColumnLength ('column_length' [,...] ) ]
[ Regex ('regular_expression') ]
[ RegexSet ('group_number') ]
[ Exception ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);
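
For example, this sketch splits a delimited VARCHAR column into two typed columns; the table packed_readings and all column names are illustrative:

SELECT * FROM Unpack (
ON packed_readings
InputColumn ('packed_data')
OutputColumns ('sensor_id', 'reading')
OutputDataTypes ('varchar', 'real')
Delimiter (',')
);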

Unpivot (version 1.2)

SELECT * FROM unpivot (


ON input_timeseries_table
[ Unpivot ({ 'unpivot_column' | 'unpivot_range' }[,...]) |
UnpivotRange ('[start_index:end_index]' [,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ InputTypes ({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})]
[ AttributeColumn ('attribute_column')]
[ ValueColumn ('value_column')]
);

URIPack (version 1.1)

SELECT * FROM URIPack (


ON input_table
[ Queries ('query_parameter' [,...]) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Scheme_Column ('scheme_column') ]
[ Host_Column ('host_column') ]
[ Path_Column ('path_column') ]
[ Fragment_Column ('fragment_column') ]
[ IgnoreValues ('string' [,...]) ]
);

URIUnpack (version 1.0)

SELECT * FROM URIUnpack (


ON input_table
URI_Column ('uri_column')
[ Queries ('query_parameter' [,...]) ]
[ Output ({ 'scheme' | 'host' | 'path' | 'fragment' }) ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
[ Print_Null_Queries
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
);

XMLParser (version 1.7)

SELECT * FROM XMLParser (


ON input_table
Text_Column ('text_column')
Nodes ('node_pair_string[,...]')
[ Sibling ('sibling_node_string[,...]') ]
[ Delimiter ('delimiter') ]
[ SiblingDelimiter ('sibling_delimiter') ]
[ MaxItemNum ('max_item_number') ]
[ Ancestor ('nodes_path' [,...]) ]
[ OutputColumnNodeID ('output_column_node') ]
[ OutputColumnParentNodeName ('output_column_parent_node') ]
[ OutputColumnGrandparentNodeName
('output_column_grandparent_node') ]
[ ErrorHandler ('{true|yes|t|y|1|false|no|f|n|0}
[;[output_column:] column[,...]]') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

XMLRelation (version 1.3)

SELECT * FROM XMLRelation (


TextColumn ('text_column')
DocIDColumns ({ 'docid_column' | 'docid_column_range' } [,...])
[ MaxDepth ('max_depth') ]
[ ExcludeElements ('node[/...][{node[,...]}]' [,...]) ]
[ AttributeAsNode
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'}) ]
[ AttributeDelimiter ('delimiter') ]
[ Output ({ 'fulldata' | 'parentchild' | 'fullpath' }) ]
[ ErrorHandler ('{true|yes|t|y|1|false|no|f|n|0}
[;[output_column:] column[,...]]') ]
[ Accumulate
({ 'accumulate_column' | 'accumulate_column_range' }[,...]) ]
);

Aster Scoring SDK

AMLGenerator (version 1.0)

SELECT * FROM AMLGenerator (


ON (SELECT 1) PARTITION BY 1
ModelType ('function_name')
[ ModelTable ('list_of_tables') ]
[ ModelTag ('list_of_tags') ]
[ InstalledFile ('list_of_boolean') ]
RequestColNames ('list_of_col_names')
RequestColTypes ('list_of_col_types')
[ AMLPrefix ('file_name') ]
[ OverwriteOutput
({'true'|'yes'|'t'|'y'|'1'|'false'|'no'|'f'|'n'|'0'})]
[ Domain ('host:port') ]
[ Database ('db_name') ]
[ UserID ('user_id') ]
[ Password ('password') ]
[ SSLSettings ('SSLsettings') ]
[ SSLTrustStorePassword ('SSLtruststorepassword') ]
[ RequestArgName1 ('arg_name') ]
[ RequestArgVal1 ('arg_value') ]
[ RequestArgName2 ('arg_name') ]
[ RequestArgVal2 ('arg_value') ]
...
);

Scorer

Aster Scoring SDK Single Decision Tree


See Single_Tree_Predict Syntax.

Aster Scoring SDK Generalized Linear Model


See GLMPredict Syntax.

Aster Scoring SDK Random Forest


See Forest_Predict Syntax.

Aster Scoring SDK Naïve Bayes


See NaiveBayesPredict Syntax.

Aster Scoring SDK Naïve Bayes Text Classifier


See NaiveBayesTextClassifierPredict Syntax.

Aster Scoring SDK Text Tagging


See TextTagging Syntax.

Aster Scoring SDK Extract Sentiment


See ExtractSentiment Syntax.

Aster Scoring SDK Text Parser


See Text_Parser Syntax.

Aster Scoring SDK Text Tokenizer


See TextTokenizer Syntax.

Aster Scoring SDK SparseSVM


See SparseSVMPredictor Syntax.

Aster Scoring SDK CoxPH
See CoxPredict Syntax.

Aster Scoring SDK LDAInference


See LDAInference Syntax.

Visualization Functions
Visualization functions are used with the AppCenter product. For information about these functions, refer
to the AppCenter User Guide.
