
Teradata Vantage™ - Analytics Database Analytic Functions

Release 17.20
2022-09-22
B035-1206-172K
DOCS.TERADATA.COM
Copyright and Trademarks
Copyright © 2017 - 2022 by Teradata. All Rights Reserved.
All copyrights and trademarks used in Teradata documentation are the property of their respective owners. For more information, see
Trademark Information.

Product Safety

⚠ NOTICE: Indicates a situation which, if not avoided, could result in damage to property, such as to equipment or data, but not related to personal injury.

⚠ CAUTION: Indicates a hazardous situation which, if not avoided, could result in minor or moderate personal injury.

⚠ WARNING: Indicates a hazardous situation which, if not avoided, could result in death or serious personal injury.

Third-Party Materials
Non-Teradata (i.e., third-party) sites, documents or communications (“Third-party Materials”) may be accessed or accessible (e.g., linked or
posted) in or in connection with a Teradata site, document or communication. Such Third-party Materials are provided for your convenience only
and do not imply any endorsement of any third party by Teradata or any endorsement of Teradata by such third party. Teradata is not responsible
for the accuracy of any content contained within such Third-party Materials, which are provided on an “AS IS” basis by Teradata. Such third party
is solely and directly responsible for its sites, documents and communications and any harm they may cause you or others.

Warranty Disclaimer
Except as may be provided in a separate written agreement with Teradata or required by applicable laws, all designs, specifications,
statements, information, recommendations and content (collectively, "content") available from the Teradata Documentation website
or contained in Teradata information products is presented "as is" and without any express or implied warranties, including, but not
limited to, the implied warranties of merchantability, fitness for a particular purpose, or noninfringement, which are hereby disclaimed.
In no event shall Teradata corporation, its suppliers or partners be liable for any direct, indirect, incidental, special, exemplary, or
consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or
business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or
otherwise) arising in any way out of the use of content, even if advised of the possibility of such damage.
The Content available from the Teradata Documentation website or contained in Teradata information products may contain references or
cross-references to features, functions, products, or services that are not announced or available in your country. Such references do not imply
that Teradata Corporation intends to announce such features, functions, products, or services in your country. Please consult your local Teradata
Corporation representative for those features, functions, products, or services available in your country.
The Content available from the Teradata Documentation website or contained in Teradata information products may be changed or updated
by Teradata at any time without notice. Teradata may also make changes in the products or services described in the Content at any time
without notice.
The Content is subject to change without notice. Users are solely responsible for their application of the Content. The Content does not constitute
the technical or other professional advice of Teradata, its suppliers or partners. Users should consult their own technical advisors before
implementing any Content. Results may vary depending on factors not tested by Teradata.

Machine-Assisted Translation
Certain materials on this website have been translated using machine-assisted translation software/tools. Machine-assisted translations of any
materials into languages other than English are intended solely as a convenience to the non-English-reading users and are not legally binding.
Anybody relying on such information does so at his or her own risk. No automated translation is perfect nor is it intended to replace human
translators. Teradata does not make any promises, assurances, or guarantees as to the accuracy of the machine-assisted translations provided.
Teradata accepts no responsibility and shall not be liable for any damage or issues that may result from using such translations. Users are reminded
to use the English contents.

Feedback
To maintain the quality of our products and services, e-mail your comments on the accuracy, clarity, organization, and value of this document
to: [email protected].
Any comments or materials (collectively referred to as "Feedback") sent to Teradata Corporation will be deemed nonconfidential. Without any
payment or other obligation of any kind and without any restriction of any kind, Teradata and its affiliates are hereby free to (1) reproduce, distribute,
provide access to, publish, transmit, publicly display, publicly perform, and create derivative works of, the Feedback, (2) use any ideas, concepts,
know-how, and techniques contained in such Feedback for any purpose whatsoever, including developing, manufacturing, and marketing products
and services incorporating the Feedback, and (3) authorize others to do any or all of the above.
Confidential Information
Confidential Information means any and all confidential knowledge, data or information of Teradata, including, but not limited to, copyrights, patent
rights, trade secret rights, trademark rights and all other intellectual property rights of any sort.
The Content available from the Teradata Documentation website or contained in Teradata information products may include Confidential
Information and as such, the use of such Content is subject to the non-use and confidentiality obligations and protections of a non-disclosure
agreement or other such agreements to protect Confidential Information that you have executed with Teradata.
Contents

Chapter 1: Introduction to Analytics Database Analytic Functions
  Analytics Database Analytic Functions Overview
  Usage Notes
  AA 7.00 Usage Notes

Chapter 2: Data Cleaning Functions
  Handling Outliers
  Handling Missing Values
  Parsing Data

Chapter 3: Data Exploration Functions
  MovingAverage
  TD_CategoricalSummary
  TD_ColumnSummary
  TD_GetRowsWithMissingValues
  TD_Histogram
  TD_QQNorm
  TD_UnivariateStatistics
  TD_WhichMax
  TD_WhichMin

Chapter 4: Feature Engineering Transform Functions
  Antiselect
  TD_BinCodeFit
  TD_BinCodeTransform
  TD_ColumnTransformer
  TD_FunctionFit
  TD_FunctionTransform
  TD_NonLinearCombineFit
  TD_NonLinearCombineTransform
  TD_OneHotEncodingFit
  TD_OneHotEncodingTransform
  TD_OrdinalEncodingFit
  TD_OrdinalEncodingTransform
  TD_PolynomialFeaturesFit
  TD_PolynomialFeaturesTransform
  TD_RandomProjectionMinComponents
  TD_RandomProjectionFit
  TD_RandomProjectionTransform
  TD_RowNormalizeFit
  TD_RowNormalizeTransform
  TD_ScaleFit
  TD_ScaleTransform

Chapter 5: Feature Engineering Utility Functions
  TD_FillRowID
  TD_NumApply
  TD_RoundColumns
  TD_StrApply

Chapter 6: Model Training Functions
  TD_DecisionForest
  TD_KMeans
  TD_GLM
  TD_VectorDistance

Chapter 7: Model Scoring Functions
  GLMPredict
  SVMSparsePredict
  DecisionForestPredict
  DecisionTreePredict
  TD_KMeansPredict
  NaiveBayesPredict
  TD_GLMPredict

Chapter 8: Model Evaluation Functions
  TD_Silhouette
  TD_ClassificationEvaluator
  TD_Regression_Evaluator
  TD_ROC

Chapter 9: Text Analytic Functions
  NaiveBayesTextClassifierPredict
  NGramSplitter
  TD_NaiveBayesTextClassifierTrainer
  TD_SentimentExtractor
  TD_TextParser

Chapter 10: Path and Pattern Analysis Functions
  Terminology
  Attribution
  Sessionize
  nPath

Chapter 11: Hypothesis Testing Functions
  Hypothesis Test Components
  Hypothesis Test Types
  TD_ANOVA
  TD_ChiSq
  TD_FTest
  TD_ZTest

Chapter 12: Aster Compatibility Functions
  TD_BYONE

Appendix A: How to Read Syntax
Appendix B: Additional Information
Chapter 1: Introduction to Analytics Database Analytic Functions
The Advanced Analytics functions are a suite of scalable, distributed machine learning and analytic functions for performing analytics on your dataset.

Using the Analytics Database Analytical Content


The content describes the advanced analytic functions used for feature engineering, training, scoring, evaluation, pattern recognition, and text analytics use cases. The engine supports these use cases on a full dataset without compromising the end-to-end performance of an analytic workload.

Why Would I Use this Content?


You can use the content to understand which function to use for the following use cases:
• Data cleaning (which includes use cases related to handling outliers, handling missing values, and
parsing data)
• Data exploration (which includes use cases related to descriptive statistics and statistical tests)
• Feature Engineering (which includes use cases related to feature engineering utilities and categorical
and continuous variable transform)
• Model building (which includes use cases related to model training, model evaluation, and
model scoring)

How Do I Use this Content?


You can use the content as follows:
1. Select a function from the Overview section.
2. Read the function description, syntax, syntax elements, and input and output sections for the
selected function.
3. Download the zip file and use the dataset setup file to create the datasets.
4. Use the examples from the guide or SQL statements from the notepad and run the statements in the
Teradata Studio or BTEQ environment.

How Do I Get Started?


Before using the functions, read the following sections:
1. Read Usage Notes.
2. Read the How to Read Syntax section.


References to Other Relevant Content


ML Engine functions: Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003

Aster Analytics functions: Teradata Aster® Analytics Foundation User Guide, B700-1022

Installing model files output by ML Engine functions on Analytics Database: Teradata Vantage™ User Guide, B700-4002

Analytics Database Analytic Functions Overview


The Analytics Database provides the following analytic capabilities:

Data Cleaning Functions


Function Name Description

TD_GetFutileColumns Returns the futile column names.

TD_OutlierFilterFit Calculates the lower_percentile, upper_percentile, count of rows,


and median for the specified input table columns.

TD_OutlierFilterTransform Filters outliers from the input table.

TD_GetRowsWithoutMissingValues Displays the rows that have non-NULL values in the specified input
table columns.

TD_SimpleImputeFit Outputs a table of values to substitute for missing values in the


input table.

TD_SimpleImputeTransform Substitutes specified values for missing values in the input table.

TD_ConvertTo Converts the specified input table columns to specified data types.

Pack Compresses data in multiple columns into a single packed


data column.

Unpack Expands data from a single packed column to multiple


unpacked columns.

StringSimilarity Calculates the similarity between two strings, using the specified
comparison method.

Data Exploration Functions


Function Name Description

MovingAverage Computes average values in a series.


TD_CategoricalSummary Displays the distinct values and their counts for each specified input
table column.

TD_ColumnSummary Displays a summary of each specified input table column.

TD_GetRowsWithMissingValues Displays the rows that have NULL values in the specified input
table columns.

TD_Histogram Calculates the frequency distribution of a data set.

TD_QQNorm Checks whether the values in the specified input table columns are
normally distributed.

TD_UnivariateStatistics Displays descriptive statistics for each specified numeric input


table column.

TD_WhichMax Displays all rows that have the maximum value in a specified input
table column.

TD_WhichMin Displays all rows that have the minimum value in specified input
table column.

Feature Engineering Transform Functions


Function Name Description

Antiselect Returns all columns except those specified.

TD_BinCodeFit Converts numeric data to categorical data by binning the numeric


data into multiple numeric bins (intervals).

TD_BinCodeTransform Transforms input table columns from the BinCodeFit


function output.

TD_ColumnTransformer Transforms the input table columns in a single operation.

TD_FunctionFit Determines whether specified numeric transformations can be


applied to specified input columns.

TD_FunctionTransform Applies numeric transformations to input columns to the


FunctionFit output.

TD_NonLinearCombineFit Returns the target columns and a specified formula which uses
the non-linear combination of existing features.

TD_NonLinearCombineTransform Generates the values of the new feature using the specified
formula from the TD_NonLinearCombineFit function output.

TD_OneHotEncodingFit Outputs a table of attributes and categorical values to the TD_OneHotEncodingTransform function.

TD_OneHotEncodingTransform Encodes specified attributes and categorical values as one-hot numeric vectors using the output from the TD_OneHotEncodingFit function.


TD_OrdinalEncodingFit Identifies distinct categorical values from the input table or a


user-defined list and returns the distinct categorical values along
with the ordinal value for each category.

TD_OrdinalEncodingTransform Maps the categorical value to a specified ordinal value using the
TD_OrdinalEncodingFit output.

TD_PolynomialFeaturesFit Stores all the specified values in the argument in a tabular format.

TD_PolynomialFeaturesTransform Extracts values of arguments from the output of the TD_PolynomialFeaturesFit function and generates a feature matrix of all polynomial combinations of the features.

TD_RandomProjectionMinComponents Calculates the minimum number of components required for


applying RandomProjection on the given dataset for the specified
epsilon(distortion) parameter value.

TD_RandomProjectionFit Returns a random projection matrix based on the


specified arguments.

TD_RandomProjectionTransform Converts the high-dimensional input data to a lower-dimensional


space using the TD_RandomProjectionFit function output.

TD_RowNormalizeFit Outputs a table of parameters and specified input columns


to TD_RowNormalizeTransform which normalizes the input
columns row-wise.

TD_RowNormalizeTransform Normalizes the input columns row-wise using the output of the
TD_RowNormalizeFit function.

TD_ScaleFit Outputs a table of statistics to the TD_ScaleTransform function.

TD_ScaleTransform Scales the specified input table columns using the output of the
TD_ScaleFit function.

Feature Engineering Utility Functions


Function Name Description

TD_FillRowID Adds a column of unique row identifiers to the input table.

TD_NumApply Applies a specified numeric operator to the specified input table columns.

TD_RoundColumns Rounds the values of each specified input table column to a specified number of decimal places.

TD_StrApply Applies a specified string operator to the specified input table columns.

Model Training Functions


Function Name Description

TD_DecisionForest Used for classification and regression predictive modeling.


TD_KMeans Groups a set of observations into k clusters in which each observation belongs to the
cluster with the nearest mean (cluster centers or cluster centroid).

TD_GLM Performs regression analysis on data sets where the response follows an exponential
family distribution.

TD_VectorDistance Accepts a table of target vectors and a table of reference vectors and returns a table
that contains the distance between target-reference pairs.

Model Scoring Functions


Function Name Description

GLMPredict Uses the model file output by ML Engine GLM function to analyze the input data and
make predictions.

SVMSparsePredict Uses the model file output by ML Engine SVMSparse function to analyze the input
data and make predictions.

DecisionForestPredict Uses the model file output by Machine Learning Engine (ML Engine)
DecisionForest function to analyze the input data and make predictions.

DecisionTreePredict Uses the model file output by ML Engine DecisionTree function to analyze the input
data and make predictions.

TD_KMeansPredict Uses the cluster centroids in the TD_KMeans function output to assign the input
data points to the cluster centroids.

NaiveBayesPredict Uses the model file output by ML Engine Naive Bayes Classifier function to analyze
the input data and make predictions.

TD_GLMPredict Predicts target values (regression) and class labels (classification) for test data
using a GLM model of the TD_GLM function.

Model Evaluation Functions


Function Name Description

TD_Silhouette Determines how well the data is clustered among clusters.

TD_ClassificationEvaluator Computes the Confusion matrix, precision, recall, and F1-score based on the
observed labels (true labels) and the predicted labels.

TD_Regression_Evaluator Computes metrics to evaluate and compare multiple models and summarizes
how close predictions are to their expected values.

TD_ROC Accepts a set of prediction-actual pairs for a binary classification model


and calculates the True-positive rate (TPR), False-positive rate (FPR), The
area under the ROC curve (AUC), and Gini coefficient values for a range of
discrimination thresholds.


Text Analytic Functions


Function Name Description

NaiveBayesTextClassifierPredict Uses the model file output by ML Engine


NaiveBayesTextClassifierTrainer function to analyze the input data
and make predictions.

NGramSplitter Tokenizes (splits) an input stream and emits n multigrams, based


on specified delimiter and reset parameters. Useful for sentiment
analysis, topic identification, and document classification.

TD_NaiveBayesTextClassifierTrainer Calculates the conditional probabilities for token-category pairs, the prior probabilities, and the missing token probabilities for all categories.

TD_SentimentExtractor Uses a dictionary model to extract the sentiment (positive, negative,


or neutral) of each input document or sentence.

TD_TextParser Tokenizes an input stream of words and creates a row for each word
in the output table.

Path and Pattern Analysis Functions


Function Name Description

Attribution Calculates attributions with a wide range of distribution models. Often used in web-
page analysis.

nPath Performs regular pattern matching over a sequence of rows from one or more inputs.

Sessionize Maps each click in a clickstream to a unique session identifier.

Hypothesis Testing Functions


Function Name Description

TD_ANOVA Performs analysis of variance (ANOVA) test to analyze the difference between
the means.

TD_ChiSq Performs Pearson's chi-squared test for independence.

TD_FTest Performs an F-test, for which the test statistic has an F-distribution under the
null hypothesis.

TD_ZTest Performs a Z-test, for which the distribution of the test statistic under the null hypothesis
can be approximated by normal distribution.

Usage Notes
These usage notes apply to every function in this document.


Function Syntax Descriptions


SELECT Statement Clauses
The function syntax descriptions in this document are SQL SELECT statements. For simplicity, the
descriptions do not show every possible SELECT statement clause. However, you can use any valid
SELECT statement clauses. For information about SELECT statement options, see Teradata Vantage™
- SQL Data Manipulation Language, B035-1146.
Many examples in this document use ORDER BY clauses that the function syntax descriptions do
not show.
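For example, the following call wraps a function in an ordinary SELECT with an ORDER BY clause (a hedged sketch: the input_table name is illustrative, and the ColumnName output column is an assumption based on the TD_ColumnSummary description in this document):

SELECT * FROM TD_ColumnSummary (
    ON input_table AS InputTable
    USING
    TargetColumns ('[:]')
) AS dt
ORDER BY ColumnName;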

Function Syntax Element Order


Function syntax elements must appear after the USING clause, but they need not appear in the order
shown in the function syntax description.
Many examples in this document do not specify their syntax elements in the order shown in the function
syntax description.

Function Input Types


• PARTITION BY expression:
The syntax defines the distribution/partitioning of the input result set before the function operates on
it. All AMPs process the partitions in parallel, but if there are multiple partitions on an AMP, each partition on that AMP is processed sequentially.
• PARTITION BY ANY:
The syntax does not alter the data distribution of input and lets the function operate on the input
result set without any redistribution. If there is no PARTITION BY expression syntax, it defaults to the
PARTITION BY ANY syntax. The entire result set per AMP forms a single group or a partition.
• DIMENSION:
The syntax duplicates the input result set to all AMPs before the function operates on it. You can use
this syntax for multiple inputs only.
• HASH BY expression:
The syntax defines the distribution/partitioning of the input result set before the function operates on
it. However, unlike PARTITION BY expression, it does not sort the result set after partitioning, and


the entire result set per AMP forms a single group or partition. The HASH BY expression is always
supported with the PARTITION BY ANY clause.
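For example, a hedged sketch of a PARTITION BY expression input (the clickstream table and its columns are illustrative; see Sessionize later in this document for the authoritative syntax):

SELECT * FROM Sessionize (
    ON clickstream PARTITION BY user_id ORDER BY click_time
    USING
    TimeColumn ('click_time')
    TimeOut (600)
) AS dt;

Each user's rows are redistributed to a single partition and processed together; a DIMENSION input, by contrast, would be copied in full to every AMP.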
Related Information:
How to Read Syntax

AMP Configuration Impact on Function Execution


The execution strategy of some analytic functions for the PARTITION BY ANY syntax varies with the
number of AMPs in the system, and the function output may vary across different configurations.
The output is correct in all cases and does not impact the quality of results for large datasets.
If degraded quality is observed in the results for small data sets, you can restrict processing to one or a few AMPs by carefully selecting the primary index, to improve the quality of results.
The output of the following functions may vary by AMP configuration:
• TD_GLM
• TD_DecisionForest

Column Specification Syntax Elements


Some ML Engine functions have column specification syntax elements with this syntax:

syntax_element ( {'column' | column_range }[,...] )

The column is a column name. This is the syntax of column_range:

'start_column:end_column' [, '-exclude_column' ]

The range includes its endpoints.


The start_column and end_column can be:
• Column names (for example, 'column1:column2')
• Nonnegative integers that represent the indexes of columns in the table (for example, '[0:4]')
The first column has index 0; therefore, '[0:4]' specifies the first five columns in the table.
• Empty. For example:
◦ '[:4]' specifies all columns up to and including the column with index 4.
◦ '[4:]' specifies the column with index 4 and all columns after it.
◦ '[:]' specifies all columns in the table.
The exclude_column is a column in the specified range, represented by either its name or its index (for
example, '[0:99]', '-[50]', '-column10' specifies the columns with indexes 0 through 99, except
the column with index 50 and column10).
Column ranges cannot overlap, and a column range cannot include any individually specified column.
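For example, assuming a function with a TargetColumns column specification syntax element, the following selects the column price, plus the columns with indexes 2 through 6 except the column with index 4:

TargetColumns ('price', '[2:6]', '-[4]')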


Functions Ignore Disallowed Syntax Elements


If you call a function with a disallowed syntax element, most functions ignore the syntax element without
returning an error.

Input Table Schemas


Input table schemas show only the columns that a function uses. Unless otherwise noted, input tables can
have additional columns, but the function ignores them.

Function Names with and without TD Prefix


Functions without TD prefix have corresponding ML Engine functions. The syntax of corresponding
functions may differ, but given the same inputs and syntax element values, they produce the same results
(with the minor exceptions noted in specific functions).
The functions with the prefix 'TD' are a new generation of advanced analytic functions that use the SQL-MR
framework and keep the resource usage under budget.
To execute ML Engine functions on Teradata Vantage™, contact your Teradata Support representative.

Accumulated Columns Impact on Function Execution


Consider the following points if the functions display the accumulated columns as the first columns of
the output:
• The data type of the first column cannot be BLOB or CLOB.
• The first column becomes the Primary Index and must be selected carefully.
• The Primary Index column (by default, the first column) affects data distribution and performance. A column with many unique values is a better Primary Index choice than one with few unique values. If the first column in the output is an accumulated column and is not optimal, select the Primary Index column explicitly for better performance.
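For example, a hedged sketch of selecting the Primary Index explicitly when saving function output (the sales table, id column, and PrecisionDigit value are illustrative; see TD_RoundColumns later in this document for the authoritative syntax):

CREATE TABLE rounded_out AS (
    SELECT * FROM TD_RoundColumns (
        ON sales AS InputTable
        USING
        TargetColumns ('revenue')
        PrecisionDigit (2)
        Accumulate ('id')
    ) AS dt
) WITH DATA
PRIMARY INDEX (id);

The accumulated id column appears first in the output; naming it as the Primary Index keeps the data distribution even because it has many unique values.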
The following functions add the accumulated columns at the end of the output:
• NaiveBayesPredict
• NaiveBayesTextClassifierPredict
• DecisionTreePredict
• SVMSparsePredict
• Pack
• Unpack
• TD_KMeansPredict
• TD_Silhouette
The following functions add the accumulated columns at the beginning of the output:


• GLMPredict
• DecisionForestPredict
• StringSimilarity
• NGramSplitter
• TD_GetRowsWithMissingValues
• TD_GetRowsWithoutMissingValues
• TD_ConvertTo
• TD_QQNorm
• TD_TextParser
• TD_NumApply
• TD_StrApply
• TD_RoundColumns
• TD_BinCodeTransform
• TD_NonLinearCombineTransform
• TD_OrdinalEncodingTransform
• TD_PolynomialFeaturesTransform
• TD_RowNormalizeTransform
• TD_ScaleTransform
• TD_RandomProjectionTransform
• TD_SentimentExtractor

Datatype Change in Accumulated Columns


The following functions change the data type of a column in the input table to a different data type in the
output table:
• TD_NumApply
If you set the Inplace argument as false and specify the target columns in the Accumulate argument,
then the data type of target columns in the output can be REAL or FLOAT.
• TD_StrApply
If you set the Inplace argument as false and specify the target columns in the Accumulate argument,
then for the following StringOperation argument values, the data type of target columns is VARCHAR
(UNICODE) in the output:
STRINGCON, STRINGLIKE, STRINGPAD, STRINGTRIM, STRINGINDEX
• TD_NonLinearCombineTransform
If you specify the target columns in the Accumulate argument, then the data type of the target columns
in the output can be REAL or FLOAT.
• TD_KMeansPredict


If you specify the target columns in the Accumulate argument, then the data type of the target columns
in the output can be REAL or FLOAT.
• TD_Silhouette
If you specify the target columns in the Accumulate argument, then the data type of the target columns
in the output can be REAL or FLOAT.
• TD_RandomProjectionTransform
If you specify the target columns in the Accumulate argument, then the data type of the target columns
in the output can be REAL or FLOAT.
• TD_SentimentExtractor
If you specify the Text column in the Accumulate argument, then the data type of the Text column in
the output is VARCHAR (UNICODE).
• TD_FunctionTransform
If a numeric column is not specified in the IDColumns argument, then the data type of the numeric
column in the output can be REAL or FLOAT.
• Pack
If you set the Colcast argument as True and specify the target columns in the Accumulate argument,
then the data type of target columns in the output can be VARCHAR.
• TD_QQNorm and TD_PolynomialFeaturesTransform
The target columns are by default included in the output. The data type of the target columns included
in the output can change to Double Precision or FLOAT.
• TD_NaiveBayesTextClassifierTrainer
If you specify the datatype for the token or doccategory columns as Char or VARCHAR, then the data
type of token or category in the output can be VARCHAR Unicode.
• TD_Histogram
If you specify the datatype for label from minmaxtable table schema as BYTEINT, SMALLINT,
INTEGER, or BIGINT, then the data type of the label in the output can be BIGINT. If you specify
CHAR or VARCHAR as the datatype for the label, then the data type of the label in the output can be
VARCHAR Unicode.

TD_GLMPredict versus GLMPredict


The GLMPredict function accepts models from the GLM function in MLE (see the GLM function in
Teradata Vantage™ Machine Learning Engine Analytic Function Reference B700-4003), whereas the
TD_GLMPredict function accepts models from TD_GLM.


Displaying Online Help for Analytics Database Analytic Functions
Online help is available for each Analytics Database analytic function.
• For information about a function, type:
HELP 'function_name'
For example:
HELP 'SQL NPATH'

BC/BCE Timestamps
Analytics Database functions do not support Before the Common Era (BCE) timestamps. BCE is an
alternative to Before Christ (BC). These are examples of BC/BCE timestamps:

4713-01-01 11:07:11-07:52:58 BC
4713-01-01 11:07:11 BC

Workload Management Configuration for Analytics Database Analytic Functions
Analytics Database analytic functions can be memory- and compute-intensive and impact other workloads,
depending on function parameters and input table sizes. To learn to use Workload Management throttles to
limit concurrency and memory, see Teradata Vantage™ - Workload Management User Guide, B035-1197.

Avoid Deadlocks Using Volatile Tables


When you use the CREATE TABLE syntax to directly save the function results, and if multiple such queries
are running concurrently, deadlocks may occur due to the locks on the dictionary tables that get updated
with the metadata of the new tables.
In this case, you must use VOLATILE tables rather than permanent tables to avoid dictionary locks. You
can view the results in the VOLATILE tables and make a separate CREATE TABLE SQL call to store the
results in a permanent table.
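For example, a hedged sketch (the input_table and result table names are illustrative; see TD_ColumnSummary later in this document for the authoritative syntax):

CREATE VOLATILE TABLE results_vt AS (
    SELECT * FROM TD_ColumnSummary (
        ON input_table AS InputTable
        USING
        TargetColumns ('[:]')
    ) AS dt
) WITH DATA
ON COMMIT PRESERVE ROWS;

CREATE TABLE results_perm AS (
    SELECT * FROM results_vt
) WITH DATA;

Because a volatile table writes no dictionary metadata, concurrent queries of this form avoid the dictionary locks; the final CREATE TABLE is a separate, short call that stores the results permanently.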

AA 7.00 Usage Notes


These usage notes apply only if you use model tables created using a supported version of Aster Analytics
on Aster Database as input to these functions:
• DecisionTreePredict
• DecisionForestPredict
• GLMPredict
• NaiveBayesPredict


• NaiveBayesTextClassifierPredict
• SVMSparsePredict

AA 7.00 Limitations
• The minimum supported version of Aster Analytics is AA 7.00.
Models created using an earlier version of Aster Analytics must be recreated after upgrading to a
supported version of Aster Analytics.

AA 7.00 Model Tables


These notes apply only to model tables output by AA 7.00 functions.
• BLOB and CLOB data types are not supported.
• Aster data type BYTEA corresponds to Analytics Database data type VARBYTE.
• The maximum size of a VARBYTE or VARCHAR column is 64000.
• You can load a table created on Aster Database to Analytics Database using either the
load_to_teradata command or Open Database Connectivity (ODBC).

Analytic Functions on Analytics Database and Aster Database

The following table summarizes the differences between analytic functions on Analytics Database and
Aster Database.

Analytics Database: PARTITION BY clause lets you specify a column by its position, an integer. PARTITION BY 1 partitions rows by column 1.
Aster Database: PARTITION BY clause accepts only column names. PARTITION BY 1 causes the function to process all rows on a single worker node. (See TD_BYONE.)

Analytics Database: For table operator output, an alias is required.
Aster Database: For function output, an alias is optional.

Analytics Database: To specify function syntax elements, you must use a USING clause.
Aster Database: Function syntax does not include a USING clause.

Analytics Database: Function syntax elements do not support column ranges.
Aster Database: Function syntax elements support column ranges.

Loading Aster Tables to Analytics Database Using load_to_teradata

For load_to_teradata instructions, see Teradata Aster® Database User Guide and the following
usage notes.


load_to_teradata Usage Notes


• If a table column name includes a keyword, enclose the name in double quotation marks and alias it.
• In SELECT statements, enclose every camel-case table column name in double quotation marks.
This example shows both aliased columns and camel-case column names:

SELECT * FROM load_to_teradata (


ON (
SELECT "class" AS class_col,
"variable" AS variable_col,
"type" AS type_col,
category,
cnt,
"sum" AS sum_col,
"sumSq",
"totalCnt"
FROM aster_nb_modelSC
)
tdpid ('sdt12432.labs.teradata.com')
username ('sample_user')
password ('sample_user')
target_table ('td_nb_modelSC')
);

• Cast every REAL column to DOUBLE PRECISION.


For example:

SELECT * FROM load_to_teradata (


ON (
SELECT attribute,
predictor,
category,
CAST (estimate AS DOUBLE PRECISION) AS estimate,
CAST (std_err AS DOUBLE PRECISION) AS std_err,
CAST (z_score AS DOUBLE PRECISION) AS z_score,
CAST (p_value AS DOUBLE PRECISION) AS p_value,
significance,
"family"
FROM glm_housing_model
)
tdpid ('sdt12432.labs.teradata.com')
username ('sample_user')
password ('sample_user')

target_table ('glm_housing_model')
);

• If a model table column name contains Analytics Database reserved keywords or special characters
— characters other than letters, digits, or underscore (_)—enclose it in double quotation marks.
This rule applies to the following model column names:

AA 7.00 Function Model Column Name

Single_Tree_Drive node_gini(p)
node_entropy(p)
node_chisq_pv(p)
split_gini(p)
split_entropy(p)
split_chisq_pv(p)

NaiveBayesReduce class
variable
type
sum
sumSq
totalCnt

For example:

CREATE SET TABLE NBUSER.td_glass_modelPD1,


FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO,
MAP = TD_MAP1 (
node_id BIGINT,
node_size BIGINT,
"node_gini(p)" FLOAT,
node_entropy FLOAT,
node_chisq_pv FLOAT,
node_label VARCHAR(2048) CHARACTER SET UNICODE NOT CASESPECIFIC,
node_majorvotes BIGINT,
split_value FLOAT,
"split_gini(p)" FLOAT,
split_entropy FLOAT,
split_chisq_pv FLOAT,
left_id BIGINT,
left_size BIGINT,

left_label VARCHAR(2048) CHARACTER SET UNICODE NOT CASESPECIFIC,


left_majorvotes BIGINT,
right_id BIGINT,
right_size BIGINT,
right_label VARCHAR(2048) CHARACTER SET UNICODE NOT CASESPECIFIC,
right_majorvotes BIGINT,
left_bucket VARCHAR(2048) CHARACTER SET UNICODE NOT CASESPECIFIC,
right_bucket VARCHAR(2048) CHARACTER SET UNICODE NOT CASESPECIFIC,
left_label_problist VARCHAR(2048) CHARACTER SET UNICODE
NOT CASESPECIFIC,
right_label_problist VARCHAR(2048) CHARACTER SET UNICODE
NOT CASESPECIFIC,
prob_label_order VARCHAR(2048) CHARACTER SET UNICODE NOT CASESPECIFIC,
attribute VARCHAR(2048) CHARACTER SET UNICODE NOT CASESPECIFIC
)
PRIMARY INDEX (node_id);

Related Information:
Loading Aster Tables to Analytics Database Using ODBC

Loading Aster Tables to Analytics Database Using ODBC

The ODBC instructions follow. To follow them, you must have an account on
https://downloads.teradata.com.
1. Install Teradata Parallel Transporter Base.
2. Set up the Aster driver on the client machine.
3. If the table does not exist on Aster Database, create and populate it there.
4. On Analytics Database, do the following:
a. If the user who is to own the table does not exist, create it.
b. Write the tpt script.
c. Write the JobVariablesFile.
d. Use the tbuild command to run the tpt script.
Related Information:
Loading Aster Tables to Analytics Database Using load_to_teradata
Example: Loading Aster Table to Analytics Database Using ODBC

Installing Teradata Parallel Transporter Base

1. Go to https://support.teradata.com.
2. Log in.
3. Download the package TTU 16.20.04.00 Windows - Base.


Downloading the packages takes approximately 30-40 minutes.


4. On your client machine, go to the folder where the package was downloaded and unzip it.
5. Go to TeradataToolsAndUtilitiesBase\Windows and run TTU.exe.
6. Install Teradata Parallel Transporter Base.

Setting Up the Aster Driver on the Client

1. Go to https://support.teradata.com.
2. Log in.
3. Download AsterClients__windows_x8664.version.zip, where version is the version of Aster
Analytics on your client machine; for example:
AsterClients__windows_x8664.06.20.00.00.zip
Downloading the packages takes several minutes.
4. On your client machine, go to the folder where the package was downloaded and unzip it.
5. Go to the subfolder \stage\home\beehive\clients-winnn, where nn is 32, 64, or 86, depending
on your Windows machine. For example:
\stage\home\beehive\clients-win64
6. Install nClusterODBCInstaller_xnn.
If the installer requests a dependency package, install it from the web.
7. Open ODBC Data Sources (nn-bit).
8. On the System DSN tab, select Add.
If Aster ODBC driver installation succeeded, the window Aster ODBC Driver appears.
9. In the window Aster ODBC Driver, select Finish.
10. In the DSN Setup form that appears, enter the following values and select OK:
Field Value

Data Source Name of data source to use in tpt script

Server IP address of Aster queen

Port 2406

Database Aster Database

Username User name

Password User password

MaxLenVarchar Default length of unbounded VARCHAR data item

11. Select OK.


Writing the tpt Script

• Write the tpt script by substituting values for variables in the following script:

DEFINE JOB PRODUCT_SOURCE_LOAD
DESCRIPTION 'LOAD PRODUCT DEFINITION TABLE'
(
  DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
  DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
  (
    TD_compatible_table_definition
  );

  STEP STEP_CREATE_DDL
  (
    APPLY
    ('DROP TABLE '||@TargetTable||' ;'),
    ('CREATE MULTISET TABLE '||@TargetTable||' (TD_compatible_table_definition);')
    TO OPERATOR ($DDL() [1]);
  );

  STEP INSERT_TABLES
  (
    APPLY
    ('Ins '||@TargetTable||' (
      :"column_name_1"
      ,:"column_name_2"
      [...,:"column_name_k"]
    );')
    TO OPERATOR ($LOAD()[1])
    SELECT * FROM OPERATOR ($ODBC(PRODUCT_SOURCE_SCHEMA)[1]);
  );
);

Writing the JobVariablesFile

• Write the JobVariablesFile by substituting values for variables in the following script:

DDLTdpId = 'td_host_name_or_ip'
,DDLUserName = 'td_user'

,DDLUserPassword = 'td_user_password'
,DDLErrorList = ['3807']
,DDLPrivateLogName = 'DDL001S1'
,TargetTable = 'td_table_name'
,ODBCPrivateLogName = 'ODB039P1'
,ODBCDSNName = 'Data_Source_Name_specified_in_DSN_Setup_form'
,TruncateData = {'Y' | 'N'}
,ODBCUserName = 'aster_user'
,ODBCUserPassword = 'aster_user_password'
,LOADPrivateLogName = 'ODB039C1'
,LOADTDPID = 'td_host_name_or_ip'
,LOADUserName = 'td_user'
,LOADUserPassword = 'td_user_password'
,SelectStmt = 'SELECT * FROM aster_table_name;'
,LOADTargetTable = 'td_table_name'

For TruncateData, 'Y' trims unused space. The default is 'N'. Specify 'Y' when the MaxLenVarchar field of the DSN Setup form (in Setting Up the Aster Driver on the Client) exceeds the maximum VARCHAR length specified in TD_compatible_table_definition in the tpt script; otherwise, ODBC cannot load the data.

Running the tpt Script

You are on Analytics Database, where a folder contains the tpt script and JobVariablesFile that
you wrote.
1. Open the command prompt.
2. Go to the folder where the tpt script and JobVariablesFile are.
3. Run the tpt script with this command:

tbuild -f tptfile -v JobVariablesFile -j jobid

where tptfile and JobVariablesFile are the names of the tpt script and JobVariablesFile that you wrote
and jobid is the name you are giving to this tbuild job.

Example: Loading Aster Table to Analytics Database Using ODBC

This example shows the code for the following:


• Creating and populating a table on Aster Database
• Creating an Analytics Database user to own the table
• A tpt script
• A JobVariablesFile


• Running the tpt script

Code for Creating and Populating a Table on Aster Database


This code creates and populates a training table, aster_nb_trainerSC, and then uses the training table
and Naive Bayes Classifier function to create a model table, aster_nb_modelSC.

/* Create training table */

DROP TABLE IF EXISTS aster_nb_trainerSC;

CREATE TABLE aster_nb_trainerSC (


id INT,
year INT,
color VARCHAR (100),
type VARCHAR (100),
origin VARCHAR (100),
stolen VARCHAR (100),
PARTITION KEY (id)
);

/* Populate training table */

INSERT INTO aster_nb_trainerSC VALUES


(1,3,'red','sports','domestic','Yes'),
(2,9,'red','sports','domestic','No'),
(3,1,'red','sports','domestic','Yes'),
(4,8,'yellow','sports','domestic','No'),
(5,2,'yellow','sports','imported','Yes');

/* Create model table from training table */

DROP TABLE IF EXISTS aster_nb_modelSC;

CREATE TABLE aster_nb_modelSC distribute by hash(class_nb) AS (


SELECT * FROM naiveBayesReduce (
ON (
SELECT * FROM naiveBayesMap (
ON aster_nb_trainerSC
Response ('stolen')
NumericInputs ('year')
CategoricalInputs ('color','origin','type')
)
) PARTITION BY class_nb

)
);

Code for Creating an Analytics Database User

CREATE USER sample_user AS


PASSWORD = sample_user
PERM = 10e6*(HASHAMP()+1);

GRANT ALL ON dbc TO sample_user;

tpt Script, astermodel.tpt

DEFINE JOB PRODUCT_SOURCE_LOAD


DESCRIPTION 'LOAD PRODUCT DEFINITION TABLE'
(
DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
DESCRIPTION 'PRODUCT INFORMATION SCHEMA'
(
class_nb VARCHAR(128),
variable_nb VARCHAR(128),
type_nb VARCHAR(128),
category VARCHAR(32),
cnt BIGINT,
sum_nb FLOAT,
sum_sq FLOAT,
total_cnt BIGINT
);

STEP STEP_CREATE_DDL
(
APPLY
('DROP TABLE '||@TargetTable||';'),
('CREATE MULTISET TABLE '||@TargetTable||'(
class_nb VARCHAR(128),
variable_nb VARCHAR(128),
type_nb VARCHAR(128),
category VARCHAR(32),
cnt BIGINT,
sum_nb FLOAT,
sum_sq FLOAT,
total_cnt BIGINT) NO PRIMARY INDEX;
')
TO OPERATOR ($DDL()[1]);

);
STEP INSERT_TABLES
(
APPLY
('Ins '||@TargetTable||'(
:"class_nb"
,:"variable_nb"
,:"type_nb"
,:"category"
,:"cnt"
,:"sum_nb"
,:"sum_sq"
,:"total_cnt");'
)
TO OPERATOR ($LOAD()[1])

SELECT * FROM OPERATOR ($ODBC(PRODUCT_SOURCE_SCHEMA)[1]);


);
);

JobVariablesFile, attr.txt

DDLTdpid = 'td_host_name_or_ip'
,DDLUserName = 'alice'
,DDLUserPassword = 'alice'
,DDLErrorList = '[3807]'
,DDLPrivateLogName = 'DDL001S1'
,TargetTable = 'td_nb_modelsc'
,ODBCPrivateLogName = 'ODBC039P1'
,ODBCDSNName = 'shruti'
,TruncateData = 'Y'
,ODBCUserName = 'db_superuser'
,ODBCUserPassword = 'db_superuser'
,LOADPrivateLogName = 'ODBC039P1'
,LOADTDPID = 'td_host_name_or_ip'
,LOADUserName = 'alice'
,LOADUserPassword = 'alice'
,SelectStmt = 'SELECT * FROM aster_nb_modelsc;'
,LOADTargetTable = 'td_nb_modelsc'

Command for Running the tpt Script

tbuild -f astermodel.tpt -v attr.txt -j urr1


Aster Model Table Schemas

Forest_Predict Model Table Schema

CREATE FACT TABLE public.aster_fp_admissions_clsmodel


(
worker_ip VARCHAR,
task_index INTEGER,
tree_num INTEGER,
tree VARCHAR
)
DISTRIBUTE BY HASH (task_index)
STORAGE ROW;

GLMPredict Model Table Schema

CREATE ANALYTIC FACT TABLE public.glm_housing_model


(
attribute INTEGER,
predictor VARCHAR(1024),
category VARCHAR(1024),
estimate DOUBLE PRECISION,
std_err DOUBLE PRECISION,
{ t_score | z_score } DOUBLE PRECISION,
p_value DOUBLE PRECISION,
significance VARCHAR(50),
family VARCHAR(20)
)
DISTRIBUTE BY HASH (attribute)
STORAGE ROW;

The model table has the column t_score if created with Family ('GAUSSIAN'), otherwise it has the
column z_score.

NaiveBayesPredict Model Table Schema

CREATE ANALYTIC FACT TABLE public.aster_nb_modelsc


(
class_nb VARCHAR(128),
variable_nb VARCHAR(128),
type_nb VARCHAR(128),
category VARCHAR(32),
cnt BIGINT,

sum_nb DOUBLE PRECISION,


sum_sq DOUBLE PRECISION,
total_cnt BIGINT
)
DISTRIBUTE BY HASH (class_nb)
STORAGE ROW;

NaiveBayesTextClassifierPredict Model Table Schema

CREATE DIMENSION TABLE public.nbtcp_spam_multinomialmodel


(
token VARCHAR,
category VARCHAR,
prob DOUBLE PRECISION
)
DISTRIBUTE BY REPLICATION
STORAGE ROW;

Single_Tree_Predict Model Table Schema

CREATE DIMENSION TABLE public.glass_model


(
node_id BIGINT,
node_size BIGINT,
node_gini_p DOUBLE PRECISION,
node_entropy DOUBLE PRECISION,
node_chisq_pv DOUBLE PRECISION,
node_label VARCHAR(512),
node_majorvotes BIGINT,
split_value DOUBLE PRECISION,
split_gini_p DOUBLE PRECISION,
split_entropy DOUBLE PRECISION,
split_chisq_pv DOUBLE PRECISION,
left_id BIGINT,
left_size BIGINT,
left_label VARCHAR(512),
left_majorvotes BIGINT,
right_id BIGINT,
right_size BIGINT,
right_label VARCHAR(512),
right_majorvotes BIGINT,
left_bucket VARCHAR(512),
right_bucket VARCHAR(512),
attribute VARCHAR(512)

)
DISTRIBUTE BY REPLICATION
STORAGE ROW;

SparseSVMPredict Model Table Schema

CREATE FACT TABLE public.aster_svm_iris_model_default


(
classid INTEGER,
weights BYTEA
)
DISTRIBUTE BY HASH (classid)
STORAGE ROW;

2
Data Cleaning Functions

Data cleaning functions prepare the input data set for the next set of transformations.

Handling Outliers

TD_GetFutileColumns
The TD_GetFutileColumns function returns the names of futile columns. A column is futile if any of
these conditions is met:
• All values in the column are unique
• All values in the column are the same
• The count of distinct values in the column divided by the total number of rows in the input
table is greater than or equal to the threshold value

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_GetFutileColumns Syntax

SELECT * FROM TD_GetFutileColumns (
ON { table | view | (query) } AS InputTable PARTITION BY ANY
ON { table | view | (query) } AS CategoricalSummaryTable DIMENSION
USING
CategoricalSummaryColumn ('target_column')
ThresholdValue ('threshold_value')
) AS alias;

TD_GetFutileColumns Syntax Elements

CategoricalSummaryColumn
[Required]: Specify the column name from the CategoricalSummaryTable generated using
the TD_CategoricalSummary function.

ThresholdValue
[Optional]: Specify the threshold value for the input table columns. A column is futile if the
count of its distinct values divided by the total number of rows in the input table is greater
than or equal to the threshold value.

Note:
This function works only for categorical data.

TD_GetFutileColumns Input

Input Table Schema

Column Data Type Description

Target_Column VARCHAR The input table columns from the Categorical Summary table.

Categorical Summary Table Schema

Column Data Type Description

ColumnName VARCHAR CHARACTER SET UNICODE The column name of the target column.

DistinctValue VARCHAR CHARACTER SET UNICODE The distinct value in the target column.

DistinctValueCount BIGINT The count of distinct values in the target column.

TD_GetFutileColumns Output

Output Table Schema


Column Data Type Description

FutileColumns VARCHAR CHARACTER SET UNICODE The column names that are futile.


TD_GetFutileColumns Example

InputTable

passenger sex ticket cabin survived


--------- ------ ---------------- ----- --------
1 male A/5 21171 C 0
2 Female PC 17599 C 1
3 Female STON/O2. 3101282 C 1
4 male 113803 C 1
5 Female 373450 C 0

CategorySummary table

CREATE TABLE categorySummaryTable AS (
SELECT * FROM TD_CATEGORICALSUMMARY (
ON getFutileColumns_titanic AS InputTable
USING
TargetColumns ('Cabin','sex','Ticket')
) AS dt
) WITH DATA;

ColumnName DistinctValue DistinctValueCount


---------- ---------------- ------------------
cabin C 5
sex Female 3
sex male 2
ticket 373450 1
ticket A/5 21171 1
ticket PC 17599 1
ticket STON/O2. 3101282 1
ticket 113803 1

SQL Call

SELECT * FROM TD_GetFutileColumns (
ON getFutileColumns_titanic AS InputTable PARTITION BY ANY
ON categorySummaryTable AS CategoryTable DIMENSION
USING
CategoricalSummaryColumn ('ColumnName')
ThresholdValue (0.7)
) AS dt;

Output Table

ColumnName
----------
ticket
cabin

The ticket column is futile because its ratio of distinct values to rows (5/5 = 1.0) meets the 0.7
threshold, and cabin is futile because all of its values are the same.

TD_OutlierFilterFit
TD_OutlierFilterFit function calculates the lower_percentile, upper_percentile, count of rows, and
median for the specified input table columns. The calculated values for each column help the
TD_OutlierFilterTransform function detect outliers in the input table.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_OutlierFilterFit Syntax

CREATE TABLE fit_table AS (


SELECT * FROM TD_OutlierFilterFit (
ON { table | view | (query) } AS InputTable
[OUT [ PERMANENT | VOLATILE ] TABLE OutputTable(output_table_name)]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[GroupColumns ('group_column')]
[OutlierMethod ({ 'percentile' | 'tukey' | 'carling' })]
[LowerPercentile (min_value)]
[UpperPercentile (max_value)]
[IQRMultiplier (k)]
[ReplacementValue ({ 'delete' | 'null' | 'median' | replacement_value})]
[RemoveTail ({ 'both' | 'upper' | 'lower' })]
[PercentileMethod ({ 'PercentileCont' | 'PercentileDISC' })]

) AS alias
) WITH DATA;

TD_OutlierFilterFit Syntax Elements

TargetColumns
Specify the names of the numeric InputTable columns for which to compute metrics.

GroupColumns
[Optional] Specify the name of the InputTable column by which to group the input data.
Default behavior: Function does not group input data.

OutlierMethod
[Optional] Specify one of these methods for filtering outliers:

Method Values Outside This Range Are Outliers

percentile (default method) [min_value, max_value]

tukey [Q1 - k*(Q3-Q1), Q3 + k*(Q3-Q1)]
where:
Q1 = 25th percentile of data
Q3 = 75th percentile of data
k = interquartile range multiplier (see IQRMultiplier)

carling [Q2 - c*(Q3-Q1), Q2 + c*(Q3-Q1)]
where:
Q2 = median of data
Q1 = 25th percentile of data
Q3 = 75th percentile of data
c = (17.63*r - 23.64) / (7.74*r - 3.71)
r = count of rows in group_column if you specify GroupColumns, otherwise
count of rows in InputTable

LowerPercentile
[Optional] Specify the lower percentile used to detect whether a value is an outlier. Values
from 0 to 1 are supported. For the Tukey and Carling methods, use 0.25 as the lower
percentile. Default: 0.05

UpperPercentile
[Optional] Specify the upper percentile used to detect whether a value is an outlier. Values
from 0 to 1 are supported. For the Tukey and Carling methods, use 0.75 as the upper
percentile. Default: 0.95

IQRMultiplier
[Optional] Specify the interquartile range (IQR) multiplier, k, for Tukey filtering.
The IQR is an estimate of the spread (dispersion) of the data in the target columns
(IQR = |Q3-Q1|).
Use k = 1.5 for moderate outliers and k = 3.0 for serious outliers.
Default: 1.5

ReplacementValue
[Optional] Specify how to handle outliers:

Option Description

delete (default) Do not copy row to output table.

null Copy row to output table, replacing each outlier with NULL.

median Copy row to output table, replacing each outlier with the median value for its group.

replacement_value (must be numeric) Copy row to output table, replacing each outlier with
replacement_value.

RemoveTail
[Optional] Specify whether to remove the upper tail, the lower tail, or both.
Default: both

PercentileMethod
[Optional] Specify either the PercentileCont or the PercentileDISC method for calculating
the upper and lower percentiles of the input data values. The default value
is PercentileDISC.
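
For example, a Tukey-based fit over the titanic table used in the following example might look like this
sketch (the output table name outlier_fit_tukey is illustrative, not part of the example data):

/* Sketch: Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; outliers replaced with NULL */
CREATE TABLE outlier_fit_tukey AS (
SELECT * FROM TD_OutlierFilterFit (
ON titanic AS InputTable
USING
TargetColumns ('Fare')
OutlierMethod ('tukey')
LowerPercentile (0.25)
UpperPercentile (0.75)
IQRMultiplier (1.5)
ReplacementValue ('null')
) AS dt
) WITH DATA;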


TD_OutlierFilterFit Input

InputTable Schema
Column Data Type Description

target_column NUMERIC The input table column names for computing metrics and filtering outliers
using the TD_OUTLIERFILTERTRANSFORM function.

group_column Any [Optional] Column by which to group input data.

TD_OutlierFilterFit Output

FitTable (fit_table) Schema


Column Data Type Description

TD_OutlierMethod_OFTFIT VARCHAR CHARACTER SET UNICODE Value of OutlierMethod ('percentile', 'tukey', or 'carling').

group_column Same as in InputTable [Column appears only if you specify GroupColumns.] Column by which input data is grouped.

TD_IQRMultiplier_OFTFIT NUMERIC Value of IQRMultiplier (k).

TD_RemoveTail_OFTFIT VARCHAR CHARACTER SET UNICODE Value of RemoveTail ('both', 'upper', or 'lower').

TD_ReplacementValue_OFTFIT VARCHAR CHARACTER SET UNICODE Value of ReplacementValue ('delete', 'null', 'median', or replacement_value).

TD_MinThreshold_OFTFIT NUMERIC Value of LowerPercentile (min_value).

TD_MaxThreshold_OFTFIT NUMERIC Value of UpperPercentile (max_value).

TD_AttributeValue_OFTFIT VARCHAR CHARACTER SET UNICODE [Column appears once for each specified target_column.] target_column

TD_CountValue_OFTFIT NUMERIC Count of rows in group_column if you specify GroupColumns, otherwise count of rows in input table.

TD_MedianValue_OFTFIT NUMERIC Median values for target columns.

TD_LowerPercentile_OFTFIT NUMERIC Lower percentile of input data values, calculated by the method specified by PercentileMethod (PercentileCont or PercentileDISC).

TD_UpperPercentile_OFTFIT NUMERIC Upper percentile of input data values, calculated by the method specified by PercentileMethod.

TD_OutlierFilterFit Example

InputTable: titanic
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger pclass fare survived


--------- ------ ------------ --------
1 3 7.250000000 0
2 1 71.283300000 1
3 3 7.925000000 1
4 1 53.100000000 1
5 3 8.050000000 0

SQL Call

CREATE TABLE outlier_fit AS (


SELECT * FROM TD_OutlierFilterFit (
ON titanic AS InputTable
OUT TABLE OutputTable (outlier_fit)
USING
TargetColumns ('Fare')
LowerPercentile (0.1)
UpperPercentile (0.9)
OutlierMethod ('Percentile')
ReplacementValue ('median')
PercentileMethod ('PercentileCont')
) AS dt
) WITH DATA;


Output

The call returns a single fit-table row, shown here as column-value pairs:

TD_OUTLIERMETHOD_OFTFIT     PERCENTILE
TD_IQRMULTIPLIER_OFTFIT     1.500000000
TD_REMOVETAIL_OFTFIT        BOTH
TD_REPLACEMENTVALUE_OFTFIT  MEDIAN
TD_MINTHRESHOLD_OFTFIT      0.100000000
TD_MAXTHRESHOLD_OFTFIT      0.900000000
TD_ATTRIBUTEVALUE_OFTFIT    fare
TD_COUNTVALUE_OFTFIT        5
TD_MEDIANVALUE_OFTFIT       8.050000000
TD_LOWERPERCENTILE_OFTFIT   7.520000000
TD_UPPERPERCENTILE_OFTFIT   64.009980000

TD_OutlierFilterTransform
TD_OutlierFilterTransform filters outliers from the input table. The metrics for determining outliers come
from TD_OutlierFilterFit output.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_OutlierFilterTransform Syntax

TD_OutlierFilterFit Call without GroupColumns

SELECT * FROM TD_OutlierFilterTransform (


ON { table | view | (query) } AS InputTable PARTITION BY ANY
ON { table | view | (query) } AS FitTable DIMENSION
) AS alias;

TD_OutlierFilterFit Call with GroupColumns

SELECT * FROM TD_OutlierFilterTransform (
ON { table | view | (query) } AS InputTable PARTITION BY group_column
ON { table | view | (query) } AS FitTable PARTITION BY group_column
) AS alias;

TD_OutlierFilterTransform Input

InputTable Schema
See TD_OutlierFilterFit Input.

FitTable Schema
See TD_OutlierFilterFit Output.

TD_OutlierFilterTransform Output

Column Data Type Description

target_column NUMERIC Column for which metrics have been computed.

OtherColumns Any The columns from the input table excluding the target columns
are displayed.

TD_OutlierFilterTransform Example

Input
• InputTable: titanic, as in TD_OutlierFilterFit Example
• FitTable: outlier_fit, created by TD_OutlierFilterFit Example

SQL Call

SELECT * FROM TD_OutlierFilterTransform (


ON titanic AS InputTable PARTITION BY ANY
ON outlier_fit AS FitTable DIMENSION
) AS dt;

Output

passenger pclass fare survived
--------- ------ ------------ --------
1 3 8.050000000 0
2 1 8.050000000 1
3 3 7.925000000 1
4 1 53.100000000 1
5 3 8.050000000 0

Handling Missing Values

TD_GetRowsWithoutMissingValues
TD_GetRowsWithoutMissingValues displays the rows that have non-NULL values in the specified input
table columns.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

Related Information:
TD_GetRowsWithMissingValues

TD_GetRowsWithoutMissingValues Syntax

SELECT * FROM TD_GetRowsWithoutMissingValues (


ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ORDER BY order_column ] ]
[ USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
]
) AS alias;

TD_GetRowsWithoutMissingValues Syntax Elements

TargetColumns
[Optional] Specify the target column names to check for non-Null values.
Default: If omitted, the function considers all columns of the Input table.


Accumulate
[Optional]: Specify the input table column names to copy to the output table.

TD_GetRowsWithoutMissingValues Input

InputTable Schema
Column Data Type Description

target_column Any Columns for which non-NULL values are checked.

accumulate_column Any The input table column names to copy to the output table.

TD_GetRowsWithoutMissingValues Output

Output Table Schema


Same as InputTable schema

TD_GetRowsWithoutMissingValues Example

InputTable: input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass name sex age sibsp


parch ticket fare cabin embarked
--------- -------- ------ ------------------------------------ ------ ----
----- ----- --------- --------- ----------- --------
1 0 3 Braund; Mr. Owen Harris male 22
1 0 A/5 21171 7.25 null S
30 0 3 Todoroff; Mr. Lalio male null
0 0 349216 7.8958 null S
505 1 1 Maioni; Miss. Roberta female 16
0 0 110152 86.5 B79 S
631 1 1 Barkworth; Mr. Algernon Henry Wilson male 80
0 0 27042 30 A23 S
873 0 1 Carlsson; Mr. Frans Olof male 33
0 0 695 5 B51 B53 B55 S


SQL Call

SELECT * FROM TD_getRowsWithoutMissingValues (


ON input_table AS InputTable
USING
TargetColumns ('[name:cabin]')
) AS dt;

Output

passenger survived pclass name sex age sibsp


parch ticket fare cabin embarked
--------- -------- ------ ------------------------------------ ------ ---
----- ----- ------ ------ ----------- --------
505 1 1 Maioni; Miss. Roberta female 16
0 0 110152 86.5 B79 S
631 1 1 Barkworth; Mr. Algernon Henry Wilson male 80
0 0 27042 30 A23 S
873 0 1 Carlsson; Mr. Frans Olof male 33
0 0 695 5 B51 B53 B55 S

TD_SimpleImputeFit
Imputation is the process of replacing missing values with substitute values.
TD_SimpleImputeFit outputs a table of values to substitute for missing values in the input table. The output
table is input to TD_SimpleImputeTransform, which makes the substitutions.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.


TD_SimpleImputeFit Syntax

SELECT * FROM TD_SimpleImputeFit (


ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable (output_table) ]
USING
{ literal_specification |
  stats_specification |
  literal_specification stats_specification }
) AS alias;

literal_specification

ColsForLiterals ({'literal_column' | literal_column_range } [,...])


Literals ('literal' [,...])

stats_specification

ColsForStats ({'stats_column' | stats_column_range } [,...])


Stats ('statistic' [,...])
[ PartitionColumn ('partition_column') ]

TD_SimpleImputeFit Syntax Elements

OutputTable
[Optional] Specify a name for the output table.
If you omit OutputTable, you must create the output table for TD_SimpleImputeTransform
with a CREATE TABLE AS statement:

CREATE TABLE output_table AS (


SELECT * FROM TD_SimpleImputeFit ( ... ) AS alias
) WITH DATA;

ColsForLiterals
[Optional] Specify the names of the InputTable columns in which to find missing values to
replace with specified literal values.


Literals
[Optional] Specify the literal values to substitute for missing values in the columns specified
by ColsForLiterals. A literal must not exceed 128 characters.
The function maps each literal to the column in the same position in ColsForLiterals. For
example, ColsForLiterals ('[1:5]', '-[2:4]', '[3]') specifies the column with index 3 last, so the
function maps the last specified literal to it.

ColsForStats
[Optional] Specify the names of the InputTable columns in which to find missing values to
replace with specified statistics.

Stats
[Optional] Specify the statistics to substitute for missing values in the columns specified
by ColsForStats.
For numeric columns, the value of the Stats argument must be one of the following values:
• MIN
• MAX
• MEAN
• MEDIAN
For columns with the following data types, the value of the Stats argument can be MODE:
• CHARACTER
• VARCHAR
• BYTEINT
• SMALLINT
• INTEGER
CHARACTER and VARCHAR values must not exceed 128 characters.
In case of a tie, the output of MODE is the value that appears last in alphabetical order.
The function maps the value of each Stats argument to the column in the same position
in ColsForStats. For example, ColsForStats ('[1:5]', '-[2:4]', '[3]') specifies the column with
index 3 last, so the function maps the last specified Stats argument to it.

PartitionColumn
[Optional] Specify the name of the InputTable column on which to partition the input.
Default behavior: The function treats all rows as a single partition.
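
For example, to compute a separate median fare for each passenger class in the simpleimputefit_input
table shown in the following example, a sketch like this could be used (the output table name
fit_by_class is illustrative):

/* Sketch: one median per pclass partition */
CREATE TABLE fit_by_class AS (
SELECT * FROM TD_SimpleImputeFit (
ON simpleimputefit_input AS InputTable
USING
ColsForStats ('Fare')
Stats ('median')
PartitionColumn ('pclass')
) AS dt
) WITH DATA;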


TD_SimpleImputeFit Input

InputTable Schema
Column Data Type Description

target_column CHAR or VARCHAR (with CHARACTER SET LATIN or UNICODE), or numeric Column in which to find missing values.

TD_SimpleImputeFit Output

FitTable Schema

Column Data Type Description

TD_INDEX_SIMFIT INTEGER Unique row identifier.

TD_TARGETCOLUMN_SIMFIT VARCHAR CHARACTER SET UNICODE Target column name (literal_column or stats_column).

TD_NUM_COLVAL_SIMFIT NUMERIC If column is numeric, value substituted for missing value; otherwise NULL.

TD_STR_COLVAL_SIMFIT VARCHAR CHARACTER SET UNICODE If column is nonnumeric, value substituted for missing value or MODE value of column; otherwise NULL.

TD_ISNUMERIC_SIMFIT BYTEINT 1 if column is numeric, otherwise 0.

partition_column Same as in InputTable Column on which input is partitioned.

TD_SimpleImputeFit Example

InputTable: simpleimputefit_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger pclass sex fare survived
--------- ------ ------ ------------- --------
1 3 male 725.320000000 0
2 1 female 712.250000000 1
3 null female null 1
4 1 null 531.780000000 1
5 3 male 805.210000000 0

SQL Call

CREATE TABLE fit_table AS (


SELECT * FROM TD_SimpleImputeFit (
ON simpleimputefit_input AS InputTable
USING
ColsForLiterals ('Pclass')
Literals ('2')
ColsForStats ('Sex','Fare')
Stats ('mode','median')
) AS dt
) WITH DATA;

Output

TD_INDEX_SIMFIT TD_TARGETCOLUMN_SIMFIT TD_NUM_COLVAL_SIMFIT TD_STR_COLVAL_SIMFIT TD_ISNUMERIC_SIMFIT
--------------- ---------------------- -------------------- -------------------- -------------------
1 pclass 2.000000000 null 1
2 sex null male 0
3 fare 718.785000000 null 1

TD_SimpleImputeTransform
TD_SimpleImputeTransform substitutes specified values for missing values in the input table. The
specified values come from TD_SimpleImputeFit output.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.


TD_SimpleImputeTransform Syntax
TD_SimpleImputeTransform syntax depends on whether the TD_SimpleImputeFit call that output
FitTable omitted or specified PartitionColumn.

TD_SimpleImputeFit Call Omitted PartitionColumn

SELECT * FROM TD_SimpleImputeTransform (


ON { table | view | (query) } AS InputTable PARTITION BY ANY
ON { table | view | (query) } AS FitTable DIMENSION
) AS alias;

TD_SimpleImputeFit Call Specified PartitionColumn

SELECT * FROM TD_SimpleImputeTransform (


ON { table | view | (query) } AS InputTable PARTITION BY partition_column
ON { table | view | (query) } AS FitTable DIMENSION PARTITION BY partition_column
) AS alias;

TD_SimpleImputeTransform Input

InputTable Schema
See TD_SimpleImputeFit Input.

FitTable Schema
See TD_SimpleImputeFit Output.

TD_SimpleImputeTransform Output

Output Table Schema


Column Data Type Description

target_column Same as in InputTable Column in which missing values have been replaced.

TD_SimpleImputeTransform Example

Input
• InputTable: simpleimputetransform_test, with the same columns as simpleimputefit_input
• FitTable: fit_table, created by TD_SimpleImputeFit Example


SQL Call

SELECT * FROM TD_SimpleImputeTransform (


ON simpleimputetransform_test AS InputTable
ON fit_table AS FitTable DIMENSION
) AS dt;

Output

passenger pclass sex fare survived


--------- ------ ------ ------------- --------
5 3 male 805.210000000 0
4 1 male 531.780000000 1
3 2 female 718.785000000 1
1 3 male 725.320000000 0
2 1 female 712.250000000 1

Parsing Data

TD_ConvertTo
TD_ConvertTo converts the specified input table columns to specified data types.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_ConvertTo Syntax

SELECT * FROM TD_ConvertTo (
ON { table | view | (query) } AS InputTable
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
TargetDataType ('target_datatype' [,...])
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

TD_ConvertTo Syntax Elements

TargetColumns
Specify the names of the InputTable columns to convert to another data type.

Accumulate
[Optional]: Specify the input table column names to copy to the output table.

TargetDataType
Specify either a single target data type for all target columns or a target data type for each
target column. If you specify multiple target data types, the function assigns the nth target
data type to the nth target column.

Allowed values for the TargetDataType element, and the resulting output data type:

BYTEINT: BYTEINT

SMALLINT: SMALLINT

INTEGER: INTEGER

BIGINT: BIGINT

REAL: REAL

DECIMAL: DECIMAL(total_digits, precision), where you can specify values up to 38 for total_digits and up to 19 for precision.

VARCHAR: Depends on input data type:
• VARCHAR input: VARCHAR with same CHARLEN, CHARACTER SET, and CASESPECIFIC values as input data type.
• CHAR input: VARCHAR(32000) with same CHARACTER SET and CASESPECIFIC values as input data type.
• CLOB input: VARCHAR(32000) with same CHARACTER SET as input data type, NOT CASESPECIFIC.
• Other input: VARCHAR(32000), CHARACTER SET UNICODE, NOT CASESPECIFIC.

VARCHAR(charlen=len, charset={LATIN | UNICODE}, casespecific={YES | NO}): VARCHAR(len) with CHARACTER SET charset value, CASESPECIFIC casespecific value.

CHAR: Depends on input data type:
• CHAR input: CHAR with same CHARLEN, CHARACTER SET, and CASESPECIFIC values as input data type.
• VARCHAR input: CHAR(32000) with same CHARACTER SET and CASESPECIFIC values as input data type.
• CLOB input: CHAR(32000) with same CHARACTER SET as input data type, NOT CASESPECIFIC.
• Other input: CHAR(32000), CHARACTER SET UNICODE, NOT CASESPECIFIC.

CHAR(charlen=len, charset={LATIN | UNICODE}, casespecific={YES | NO}): CHAR(len) with CHARACTER SET charset value, CASESPECIFIC casespecific value.

DATE: DATE FORMAT 'YY/MM/DD'

TIME: TIME(6)

TIMESTAMP: TIMESTAMP(6)

TIME WITH ZONE: TIME(6) WITH TIME ZONE

TIMESTAMP WITH ZONE: TIMESTAMP(6) WITH TIME ZONE

INTERVAL YEAR: INTERVAL YEAR(4)

INTERVAL MONTH: INTERVAL MONTH(4)

INTERVAL DAY: INTERVAL DAY(4)

INTERVAL HOUR: INTERVAL HOUR(4)

INTERVAL MINUTE: INTERVAL MINUTE(4)

INTERVAL SECOND: INTERVAL SECOND(4,6)

INTERVAL YEAR TO MONTH: INTERVAL YEAR(4) TO MONTH

INTERVAL DAY TO HOUR: INTERVAL DAY(4) TO HOUR

INTERVAL DAY TO MINUTE: INTERVAL DAY(4) TO MINUTE

INTERVAL DAY TO SECOND: INTERVAL DAY(4) TO SECOND(6)

INTERVAL HOUR TO MINUTE: INTERVAL HOUR(4) TO MINUTE

INTERVAL HOUR TO SECOND: INTERVAL HOUR(4) TO SECOND(6)

INTERVAL MINUTE TO SECOND: INTERVAL MINUTE(4) TO SECOND(6)

CLOB: Depends on input data type:
• CLOB input: CLOB with same CHARLEN and CHARACTER SET values as input data type.
• VARCHAR or CHAR input: CLOB(1048544000) with same CHARACTER SET as input data type.
• Other input: CLOB(1048544000), CHARACTER SET UNICODE.

CLOB(charlen=len, charset={LATIN | UNICODE}): CLOB(len) with CHARACTER SET charset value.

BYTE: BYTE(32000)

BYTE(charlen=len): BYTE(len)

VARBYTE: VARBYTE(32000)

VARBYTE(charlen=len): VARBYTE(len)

BLOB: BLOB(2097088000)

BLOB(charlen=len): BLOB(len)

JSON: JSON(32000), CHARACTER SET UNICODE

XML: XML(2097088000) INLINE LENGTH 4046
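
As an illustration of the parameterized forms above, the following sketch converts the example table's
name column to a LATIN VARCHAR(50) and its fare column to DECIMAL (passing the DECIMAL digits
inline is an assumption based on the total_digits and precision limits above):

/* Sketch only; the parameterized DECIMAL form is assumed, not confirmed by this table */
SELECT * FROM TD_ConvertTo (
ON input_table AS InputTable
USING
TargetColumns ('name', 'fare')
TargetDataType ('varchar(charlen=50,charset=latin,casespecific=no)', 'decimal(10,2)')
) AS dt;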


TD_ConvertTo Input

InputTable Schema

Column Data Type Description

target_column Any Column to convert to target_datatype.

TD_ConvertTo Output

Output Table Schema

Column Data Type Description

target_column target_datatype Column converted to target_datatype.

input_column Same as in InputTable Column copied from InputTable.

accumulate_column Same as in InputTable Column specified in the Accumulate element, copied to the output table.

TD_ConvertTo Example

InputTable: input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass name sex age


sibsp parch ticket fare cabin embarked
--------- -------- ------ ------------------------------------ ------ ----
----- ----- -------- ------- ----------- --------
97 0 1 Goldschmidt; Mr. George B male 71
0 0 PC 17754 34.6542 A5 C
488 0 1 Kent; Mr. Edward Austin male 58
0 0 11771 29.7 B37 C
505 1 1 Maioni; Miss. Roberta female 16
0 0 110152 86.5 B79 S
631 1 1 Barkworth; Mr. Algernon Henry Wilson male 80
0 0 27042 30 A23 S

873 0 1 Carlsson; Mr. Frans Olof male 33
0 0 695 5 B51 B53 B55 S

SQL Call

SELECT * FROM TD_ConvertTo (


ON input_table AS InputTable
USING
TargetColumns ('fare')
TargetDataType ('integer')
) AS dt ORDER BY 1;

Output

passenger survived pclass name sex age sibsp


parch ticket fare cabin embarked
--------- -------- ------ ------------------------------------ ------ ---
----- ----- -------- ---- ----------- --------
97 0 1 Goldschmidt; Mr. George B male 71
0 0 PC 17754 34 A5 C
488 0 1 Kent; Mr. Edward Austin male 58
0 0 11771 29 B37 C
505 1 1 Maioni; Miss. Roberta female 16
0 0 110152 86 B79 S
631 1 1 Barkworth; Mr. Algernon Henry Wilson male 80
0 0 27042 30 A23 S
873 0 1 Carlsson; Mr. Frans Olof male 33
0 0 695 5 B51 B53 B55 S

Pack
The Pack function packs data from multiple input columns into a single column. The packed column has a
virtual column for each input column. By default, virtual columns are separated by commas and each virtual
column value is labeled with its column name.
Pack complements the function Unpack, but you can use it on any columns that meet the
input requirements.

Note:
To use Pack and Unpack together, you must run both on Analytics Database. Pack and Unpack are
incompatible with ML Engine functions Pack_MLE and Unpack_MLE.

Before packing columns, note their data types—you need them if you want to unpack the packed column.


Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support locale-based formatting with the SDF file.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.

Pack Syntax

SELECT * FROM Pack (


ON { table | view | (query) }
USING
[ TargetColumns ({ 'target_column' | target_column_range }[,...]) ]
[ Delimiter ('delimiter') ]
[ IncludeColumnName ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
OutputColumn ('output_column')
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ ColCast ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
) AS alias;

Related Information:
Column Specification Syntax Elements

Pack Syntax Elements

TargetColumns
[Optional] Specify the names of the input table columns to pack into a single output column.
Column names must be valid object names, which are defined in Teradata Vantage™ - SQL
Fundamentals, B035-1141.
These names become the column names of the virtual columns. If you specify this syntax
element, but do not specify all input table columns, the function copies the unspecified input
table columns to the output table.
Default behavior: All input table columns are packed into a single output column.

Delimiter
[Optional] Specify the delimiter—a single Unicode character in Normalization Form C (NFC)
—that separates the virtual columns in the packed data. The delimiter is case-sensitive.

Default: ',' (comma)

IncludeColumnName
[Optional] Specify whether to label each virtual column value with its column name (making
the virtual column target_column:value).
Default: 'true'

OutputColumn
Specify the name to give to the packed output column. The name must be a valid object
name, as defined in Teradata Vantage™ - SQL Fundamentals, B035-1141.

Accumulate
[Optional] Specify the input columns to copy to the output table.

ColCast
[Optional] Specify whether to cast each numeric target_column to VARCHAR.
Specifying 'true' decreases run time for queries with numeric target columns.
Default: false
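
For instance, because temp_f in the examples below is numeric, ColCast can cast it during packing; a
minimal sketch using the ville_temperature table:

/* Sketch: numeric temp_f is cast to VARCHAR while packing */
SELECT * FROM Pack (
ON ville_temperature
USING
TargetColumns ('city', 'temp_f')
OutputColumn ('packed_data')
ColCast ('true')
Accumulate ('sn')
) AS dt;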

Pack Input

Input Table Schema


Column Data Type Description

target_column Any Column to pack, with other input columns, into single output column.

accumulate_column or other_input_column Any Column to copy to output table. Typically, one such column contains row identifiers.

Pack Output

Output Table Schema

Column Data Type Description

output_column VARCHAR Packed column. If a column of type DATE is packed into the output, its format is 'YYYY-MM-DD'.

accumulate_column Same as in input table for nonnumeric columns, or for numeric columns with ColCast ('false'); VARCHAR for numeric columns with ColCast ('true'). Column copied from input table.

other_input_column Same as in input table [Column appears only without Accumulate, once for each specified other_input_column.] Column copied from input table.

Pack Examples

Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

Pack Examples Input

The input table, ville_temperature, contains temperature readings for the cities Nashville and
Knoxville, in the state of Tennessee.

ville_temperature
sn city state period temp_f

1 Nashville Tennessee 2010-01-01 00:00:00 35.1

2 Nashville Tennessee 2010-01-01 01:00:00 36.2

3 Nashville Tennessee 2010-01-01 02:00:00 34.5

4 Nashville Tennessee 2010-01-01 03:00:00 33.6

5 Nashville Tennessee 2010-01-01 04:00:00 33.1

6 Knoxville Tennessee 2010-01-01 03:00:00 33.2

7 Knoxville Tennessee 2010-01-01 04:00:00 32.8

8 Knoxville Tennessee 2010-01-01 05:00:00 32.4

9 Knoxville Tennessee 2010-01-01 06:00:00 32.2

10 Knoxville Tennessee 2010-01-01 07:00:00 32.4


Pack Example: Default Options

This example specifies the default options for Delimiter and IncludeColumnName.

Input
See Pack Examples Input.

SQL Call

SELECT * FROM Pack (


ON ville_temperature
USING
Delimiter (',')
OutputColumn ('packed_data')
IncludeColumnName ('true')
TargetColumns ('[1:4]')
Accumulate ('sn')
) AS dt ORDER BY 2;

Output
The columns specified by TargetColumns are packed in the column packed_data. Virtual columns are
separated by commas, and each virtual column value is labeled with its column name. The input column
sn, which was not specified by TargetColumns, is unchanged in the output table.

packed_data sn

city:Nashville,state:Tennessee,period:2010-01-01 00:00:00,temp_f:35.1 1

city:Nashville,state:Tennessee,period:2010-01-01 01:00:00,temp_f:36.2 2

city:Nashville,state:Tennessee,period:2010-01-01 02:00:00,temp_f:34.5 3

city:Nashville,state:Tennessee,period:2010-01-01 03:00:00,temp_f:33.6 4

city:Nashville,state:Tennessee,period:2010-01-01 04:00:00,temp_f:33.1 5

city:Knoxville,state:Tennessee,period:2010-01-01 03:00:00,temp_f:33.2 6

city:Knoxville,state:Tennessee,period:2010-01-01 04:00:00,temp_f:32.8 7

city:Knoxville,state:Tennessee,period:2010-01-01 05:00:00,temp_f:32.4 8

city:Knoxville,state:Tennessee,period:2010-01-01 06:00:00,temp_f:32.2 9

city:Knoxville,state:Tennessee,period:2010-01-01 07:00:00,temp_f:32.4 10


Pack Example: Nondefault Options

This example specifies the pipe character (|) for Delimiter and 'false' for IncludeColumnName.

Input
See Pack Examples Input.

SQL Call

SELECT * FROM Pack (


ON ville_temperature
USING
Delimiter ('|')
OutputColumn ('packed_data')
IncludeColumnName ('false')
TargetColumns ('city', 'state', 'period', 'temp_f')
) AS dt ORDER BY 2;

Output
Virtual columns are separated by pipe characters and not labeled with their column names.

packed_data sn

Nashville|Tennessee|2010-01-01 00:00:00|35.1 1

Nashville|Tennessee|2010-01-01 01:00:00|36.2 2

Nashville|Tennessee|2010-01-01 02:00:00|34.5 3

Nashville|Tennessee|2010-01-01 03:00:00|33.6 4

Nashville|Tennessee|2010-01-01 04:00:00|33.1 5

Knoxville|Tennessee|2010-01-01 03:00:00|33.2 6

Knoxville|Tennessee|2010-01-01 04:00:00|32.8 7

Knoxville|Tennessee|2010-01-01 05:00:00|32.4 8

Knoxville|Tennessee|2010-01-01 06:00:00|32.2 9

Knoxville|Tennessee|2010-01-01 07:00:00|32.4 10

Unpack
The Unpack function unpacks data from a single packed column into multiple columns. The packed
column is composed of multiple virtual columns, which become the output columns. To determine the
virtual columns, the function must have either the delimiter that separates them in the packed column or
their lengths.
Unpack complements the function Pack, but you can use it on any packed column that meets the
input requirements.

Note:

• To use Pack and Unpack together, you must run both on Analytics Database. Pack and Unpack
are incompatible with ML Engine functions Pack_MLE and Unpack_MLE.
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support the following:
◦ Locale-based parsing with the SDF file
◦ Pass Through Characters (PTCs)
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
◦ KanjiSJIS or Graphic data types

Unpack Syntax

SELECT * FROM Unpack (


ON { table | view | (query) }
USING
TargetColumn ('target_column')
OutputColumns ('output_column' [,...])
OutputDataTypes ('datatype' [,...])
[ Delimiter ('delimiter') ]
[ ColumnLength ('column_length' [,...] ) ]
[ Regex ('regular_expression') ]
[ RegexSet ('group_number') ]
[ IgnoreInvalid ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]

[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]


) AS alias;

Related Information:
Column Specification Syntax Elements

Unpack Syntax Elements

TargetColumn
Specify the name of the input column that contains the packed data.

OutputColumns
Specify the names to give to the output columns, in the order in which the corresponding
virtual columns appear in target_column. The names must be valid object names, as
defined in Teradata Vantage™ - SQL Fundamentals, B035-1141.
If you specify fewer output column names than there are virtual input columns, the function
ignores the extra virtual input columns. That is, if the packed data contains x+y virtual
columns and the OutputColumns syntax element specifies x output column names, the
function assigns the names to the first x virtual columns and ignores the remaining y
virtual columns.

OutputDataTypes
Specify the datatypes of the unpacked output columns. Supported data types are
VARCHAR, INTEGER, DOUBLE PRECISION, TIME, DATE, and TIMESTAMP.
If OutputDataTypes specifies only one value and OutputColumns specifies multiple
columns, the specified value applies to every output_column.
If OutputDataTypes specifies multiple values, it must specify a value for each
output_column. The nth datatype corresponds to the nth output_column.
The function can output only 16 VARCHAR columns.

Delimiter
[Optional] Specify the delimiter—a single Unicode character in Normalization Form C (NFC)
—that separates the virtual columns in the packed data. The delimiter is case-sensitive.
Do not specify both this syntax element and the ColumnLength syntax element. If the
virtual columns are separated by a delimiter, specify the delimiter with this syntax element;
otherwise, specify the ColumnLength syntax element.
Default: ',' (comma)


ColumnLength
[Optional] Specify the lengths of the virtual columns; therefore, to use this syntax element,
you must know the length of each virtual column.
If ColumnLength specifies only one value and OutputColumns specifies multiple columns,
the specified value applies to every output_column.
If ColumnLength specifies multiple values, it must specify a value for each output_column.
The nth value corresponds to the nth output_column. However, the last value can be an
asterisk (*), which represents a single virtual column that contains the remaining data.
For example, if the first three virtual columns have the lengths 2, 1, and 3, and all
remaining data belongs to the fourth virtual column, you can specify
ColumnLength ('2', '1', '3', '*').
Do not specify both this syntax element and the Delimiter syntax element.

Regex
[Optional] Specify a regular expression that describes a row of packed data, enabling the
function to find the data values.
A row of packed data contains a data value for each virtual column, but the row might also
contain other information (such as the virtual column name). In the regular_expression,
each data value is enclosed in parentheses.
For example, suppose that the packed data has two virtual columns, age and sex, and that
one row of packed data is age:34,sex:male. The regular_expression that describes the
row is '.*:(.*)'. The '.*:' matches the virtual column names, age and sex, and the
'(.*)' matches the values, 34 and male.
To represent multiple data groups in regular_expression, use multiple pairs of parentheses.
Without parentheses, the last data group in regular_expression represents the data value
(other data groups are assumed to be virtual column names or unwanted data). If a
different data group represents the data value, specify its group number with the RegexSet
syntax element.
Default: '(.*)', which matches the whole string (between delimiters, if any). When applied
to the preceding sample row, the default regular_expression causes the function to return
'age:34' and 'sex:male' as data values.

RegexSet
[Optional] Specify the ordinal number of the data group in regular_expression that
represents the data value in a virtual column.
Default behavior: The last data group in regular_expression represents the data value. For
example, suppose that regular_expression is '([a-zA-Z]*):(.*)'. If group_number is
'1', '([a-zA-Z]*)' represents the data value. If group_number is '2', '(.*)' represents
the data value.


Maximum: 30

IgnoreInvalid
[Optional] Specify whether the function ignores rows that contain invalid data.
IgnoreInvalid may not behave as you expect if an item in a virtual column has trailing special
characters. See Unpack Example: IgnoreInvalid ('true') with Trailing Special Characters.
Default: 'false' (The function fails if it encounters a row with invalid data.)

Accumulate
[Optional] Specify the input columns to copy to the output table.
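
Tying Regex to labeled data such as the age:34,sex:male row described above, a sketch like the
following strips the labels (the table packed_people and its columns id and packed_col are
hypothetical, introduced only for illustration):

/* Sketch: '.*:(.*)' discards each virtual column's label and keeps its value */
SELECT * FROM Unpack (
ON packed_people
USING
TargetColumn ('packed_col')
OutputColumns ('age', 'sex')
OutputDataTypes ('integer', 'varchar')
Delimiter (',')
Regex ('.*:(.*)')
RegexSet (1)
Accumulate ('id')
) AS dt;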

Unpack Input

Input Table Schema

Column Data Type Description

target_column CHARACTER, VARCHAR, or CLOB Packed data.

accumulate_column or other_input_column Any [Column appears zero or more times.] Column to copy to output table. Typically, one such column contains row identifiers.

Unpack Output

Output Table Schema

Column Data Type Description

output_column Specified by OutputDataTypes syntax element. Unpacked column.

accumulate_column Same as in input table Column copied from input table.

other_input_column Same as in input table [Column appears only without Accumulate, once for each input table column not specified by the TargetColumn syntax element.] Column copied from input table.


Unpack Examples

Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

Unpack Example: Delimiter Separates Virtual Columns

Input
The input table, ville_tempdata, is a collection of temperature readings for two cities, Nashville and
Knoxville, in the state of Tennessee. In the column of packed data, the delimiter comma (,) separates the
virtual columns. The last row contains invalid data.

ville_tempdata
sn packed_temp_data

10 Nashville,Tennessee,35.1

11 Nashville,Tennessee,36.2

12 Nashville,Tennessee,34.5

13 Nashville,Tennessee,33.6

14 Nashville,Tennessee,33.1

15 Nashville,Tennessee,33.2

16 Nashville,Tennessee,32.8

17 Nashville,Tennessee,32.4

18 Nashville,Tennessee,32.2

19 Nashville,Tennessee,32.4

20 Thisisbaddata

SQL Call
Because comma is the default delimiter, the Delimiter syntax element is optional.

SELECT * FROM Unpack (
ON ville_tempdata
USING
TargetColumn ('packed_temp_data')
OutputColumns ('city', 'state', 'temp_f')
OutputDataTypes ('varchar', 'varchar', 'real')
Delimiter (',')
Regex ('(.*)')
RegexSet (1)
IgnoreInvalid ('true')
Accumulate ('sn')
) AS dt ORDER BY sn;

Output
Because of IgnoreInvalid ('true'), the function did not fail when it encountered the row with invalid data,
but it did not output that row.

city state temp_f sn

Nashville Tennessee 3.51000000000000E 001 10

Nashville Tennessee 3.62000000000000E 001 11

Nashville Tennessee 3.45000000000000E 001 12

Nashville Tennessee 3.36000000000000E 001 13

Nashville Tennessee 3.31000000000000E 001 14

Nashville Tennessee 3.32000000000000E 001 15

Nashville Tennessee 3.28000000000000E 001 16

Nashville Tennessee 3.24000000000000E 001 17

Nashville Tennessee 3.22000000000000E 001 18

Nashville Tennessee 3.24000000000000E 001 19

Unpack Example: No Delimiter Separates Virtual Columns

Input
The input table, ville_tempdata1, is like the input table for the previous example, except that no delimiter
separates the virtual columns in the packed data. To enable the function to determine the virtual columns,
the function call specifies the column lengths.

ville_tempdata1
sn packed_temp_data

10 NashvilleTennessee35.1

11 NashvilleTennessee36.2

12 NashvilleTennessee34.5

13 NashvilleTennessee33.6

14 NashvilleTennessee33.1

15 NashvilleTennessee33.2

16 NashvilleTennessee32.8

17 NashvilleTennessee32.4

18 NashvilleTennessee32.2

19 NashvilleTennessee32.4

20 Thisisbaddata

SQL Call

SELECT * FROM Unpack (


ON ville_tempdata1
USING
TargetColumn ('packed_temp_data')
OutputColumns ('city', 'state', 'temp_f')
OutputDataTypes ('varchar', 'varchar', 'real')
ColumnLength ('9', '9', '4')
Regex ('(.*)')
RegexSet (1)
IgnoreInvalid ('true')
) AS dt ORDER BY sn;

Output
city state temp_f sn

Nashville Tennessee 3.51000000000000E 001 10

Nashville Tennessee 3.62000000000000E 001 11

Nashville Tennessee 3.45000000000000E 001 12

Nashville Tennessee 3.36000000000000E 001 13

Nashville Tennessee 3.31000000000000E 001 14

Nashville Tennessee 3.32000000000000E 001 15

Nashville Tennessee 3.28000000000000E 001 16

Nashville Tennessee 3.24000000000000E 001 17

Nashville Tennessee 3.22000000000000E 001 18

Nashville Tennessee 3.24000000000000E 001 19

Unpack Example: More Input Columns than Output Columns

Input
The input table is ville_tempdata1, as in Unpack Example: No Delimiter Separates Virtual Columns. Its
packed_temp_data column has three virtual columns.

SQL Call
The OutputColumns syntax element specifies only two output column names.

SELECT * FROM Unpack (


ON ville_tempdata1
USING
TargetColumn ('packed_temp_data')
OutputColumns ('city', 'state')
OutputDataTypes ('varchar', 'varchar')
ColumnLength ('9', '9')
Regex ('(.*)')
RegexSet (1)
IgnoreInvalid ('true')
) AS dt ORDER BY sn;

Output
The output table has columns for the first two virtual input columns, but not for the third.

city state sn

Nashville Tennessee 10

Nashville Tennessee 11

Nashville Tennessee 12

Nashville Tennessee 13

Nashville Tennessee 14

Nashville Tennessee 15

Nashville Tennessee 16

Nashville Tennessee 17

Nashville Tennessee 18

Nashville Tennessee 19

Unpack Example: IgnoreInvalid ('true') with Trailing Special Characters

In this example, the items in the first virtual input column have trailing special characters. No delimiter
separates the virtual columns, so the call specifies ColumnLength ('2','*'). The call to Unpack includes
IgnoreInvalid ('true'), but the output is unexpected.

Input
t2
c1

1,1919-04-05

1.1919-04-05

5,.1919-04-05

2,2019/04/05

4.,.1919-04-05

32019/04/05

SQL Call

SEL * FROM Unpack (


ON t2
USING
TargetColumn ('c1')
OutputColumns ('a','b')
OutputDataTypes ('int','date')
ColumnLength ('2','*')
IgnoreInvalid ('True')
) AS dt;


Output
a b

1 19/04/05

1 19/04/05

2 19/04/05

The reason for the unexpected output is the behavior of an internal library that Unpack uses, which is
as follows:

Input Row Behavior

1,1919-04-05 Library prunes trailing comma, converts "1" to integer and "1919-04-05" to date. (Output
row 1.)

1.1919-04-05 Library prunes trailing period, converts "1" to integer and "1919-04-05" to date. (Output
row 2.)

5,.1919-04-05 Library prunes trailing comma, converts 5 to integer, but cannot convert ".1919-04-05"
to date. (No output row.)
With ColumnLength ('3','*'), library prunes trailing comma and period, converts "5" to
integer and "1919-04-05" to date, and outputs a row for this input row.

2,2019/04/05 Library prunes trailing comma, converts "2" to integer and "2019/04/05" to date. (Output
row 3.)

4.,.1919-04-05 Library converts "4." to integer, but cannot convert ",.1919-04-05" to date. (No output
row.)

32019/04/05 Library converts "32" to integer, but cannot convert "019/04/05" to date. (No output row.)

StringSimilarity
The StringSimilarity function calculates the similarity between two strings, using the specified comparison
method. The similarity is a value in the range [0, 1].

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• When comparing strings, the function assumes that they are in the same Unicode script in
Normalization Form C (NFC).
• When used with this function, the ORDER BY clause supports only ASCII collation.


StringSimilarity Syntax

SELECT * FROM StringSimilarity (


ON { table | view | (query) } [ PARTITION BY ANY ]
USING
ComparisonColumnPairs ('comparison_type (column1,column2[,constant])[ AS
output_column]' [,...])
[ CaseSensitive ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}[,...]) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

Related Information:
Column Specification Syntax Elements

StringSimilarity Syntax Elements

ComparisonColumnPairs
Specify the names of the input table columns that contain strings to compare (column1 and
column2), how to compare them (comparison_type), and (optionally) a constant and the
name of the output column for their similarity (output_column). The similarity is a value in
the range [0, 1].
For column1 and column2:
• If column1 or column2 includes any special characters (that is, characters other than
letters, digits, or underscore (_)), surround the column name with double quotation
marks. For example, if column1 and column2 are c(col1) and c(col2), respectively,
specify them as "c(col1)" and "c(col2)".
If column1 or column2 includes double quotation marks, replace each double quotation
mark with a pair of double quotation marks. For example, if column1 and column2 are
c1"c and c2"c, respectively, specify them as "c1""c" and "c2""c".

Note:
These rules do not apply to output_column. For example, this is valid syntax:
ComparisonColumnPairs ('jaro ("c1""c", "c2""c") AS out"col')

• If column1 or column2 supports more than 200 characters, you can cast it to
VARCHAR(200), as in the following example; however, the string may be truncated.
For information about the CAST operation, see Teradata Vantage™ - SQL Functions,
Expressions, and Predicates, B035-1145.

SELECT * FROM StringSimilarity (
ON (
SELECT id, CAST(a AS VARCHAR(200)) AS a, CAST(b AS
VARCHAR(200)) AS b
FROM max_varchar_strlen
) PARTITION BY ANY
USING
ComparisonColumnPairs ('ld(a,b) AS sim_fn')
Accumulate ('id')
) AS dt ORDER BY 1;

For comparison_type, use one of these values:


comparison_type Description

'jaro'          Jaro distance.

'jaro_winkler'  Jaro-Winkler distance: 1 for an exact match, 0 otherwise. If you specify this
                comparison type, you can specify the value of factor p with constant.
                0 ≤ p ≤ 0.25. Default: p = 0.1

'n_gram'        N-gram similarity. If you specify this comparison type, you can specify the
                value of N with constant. Default: N = 2

'LD'            Levenshtein distance: Number of edits needed to transform one string into the
                other. Edits are insertions, deletions, or substitutions of individual
                characters.

'LDWS'          Levenshtein distance without substitution: Number of edits needed to transform
                one string into the other using only insertions or deletions of individual
                characters.

'OSA'           Optimal string alignment distance: Number of edits needed to transform one
                string into the other. Edits are insertions, deletions, substitutions, or
                transpositions of characters. A substring can be edited only once.

'DL'            Damerau-Levenshtein distance: Like 'OSA' except that a substring can be edited
                any number of times.

'hamming'       Hamming distance: For strings of equal length, number of positions where
                corresponding characters differ (that is, minimum number of substitutions
                needed to transform one string into the other). For strings of unequal
                length, -1.

'LCS'           Longest common substring: Length of longest substring common to both strings.

'jaccard'       Jaccard index-based comparison.

'cosine'        Cosine similarity.

'soundexcode'   Only for English strings: -1 if either string has a non-English character;
                otherwise, 1 if their soundex codes are the same and 0 otherwise.


The function ignores constant for every comparison_type except 'jaro_winkler' and 'n_gram'.
You can specify a different comparison_type for every pair of columns.
Default: output_column is 'sim_i', where i is the sequence number of the column pair.

CaseSensitive
[Optional] Specify whether string comparison is case-sensitive. You can specify either one
value for all pairs or one value for each pair. If you specify one value for each pair, the ith
value applies to the ith pair.
Default: 'false'
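For example, the following call compares two column pairs with different comparison types and applies case-sensitive matching only to the first pair. (This is a sketch based on the syntax above; the table name_table and its columns are hypothetical.)

SELECT * FROM StringSimilarity (
ON name_table PARTITION BY ANY
USING
ComparisonColumnPairs ('jaro (first_name, ref_first) AS first_sim',
'LD (last_name, ref_last) AS last_sim')
CaseSensitive ('true', 'false')
) AS dt;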

Accumulate
[Optional] Specify the names of input table columns to copy to the output table.

StringSimilarity Input

Column Data Type Description

column1 CHARACTER or VARCHAR String to compare to string in column2.

column2 CHARACTER or VARCHAR String to compare to string in column1.

accumulate_column Any Column to copy to output table.

If any column1 or column2 in the input table schema supports more than 200 characters, you must cast
it to VARCHAR(200). See example in StringSimilarity Syntax Elements.

StringSimilarity Output

Output Table Schema


Column Data Type Description

accumulate_column Any Column copied from input table.

output_column DOUBLE PRECISION Similarity between strings in column pair.


StringSimilarity Examples

Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

StringSimilarity Example: Specify Column Names

Input
strsimilarity_input
id src_text1 src_text2 tar_text

1 astre astter aster

2 hone fone phone

3 acqiese acquire acquiesce

4 AAAACCCCCGGGGA CCCGGGAACCAACC CCAGGGAAACCCAC

5 alice allen allies

6 angela angle angels

7 senter center centre

8 chef cheap chief

9 circus circle circuit

10 debt debut debris

11 deal dell lead

12 bare bear bear

SQL Call

SELECT * FROM StringSimilarity (
ON strsimilarity_input PARTITION BY ANY
USING
ComparisonColumnPairs ('jaro (src_text1, tar_text) AS jaro1_sim',
'LD (src_text1, tar_text) AS ld1_sim',
'n_gram (src_text1, tar_text, 2) AS ngram1_sim',
'jaro_winkler (src_text1, tar_text, 0.1) AS jw1_sim'
)
CaseSensitive ('true')
Accumulate ('id', 'src_text1', 'tar_text')
) AS dt ORDER BY id;

Output
Columns 1-3
id src_text1 tar_text

1 astre aster

2 hone phone

3 acqiese acquiesce

4 AAAACCCCCGGGGA CCAGGGAAACCCAC

5 alice allies

6 angela angels

7 senter centre

8 chef chief

9 circus circuit

10 debt debris

11 deal lead

12 bare bear

Columns 4-7
jaro1_sim         ld1_sim           ngram1_sim        jw1_sim

0.933333333333333 0.6               0.5               0.953333333333333
0.933333333333333 0.8               0.75              0.933333333333333
0.925925925925926 0.777777777777778 0.5               0.948148148148148
0.824175824175824 0.214285714285714 0.384615384615385 0.824175824175824
0.822222222222222 0.5               0.4               0.857777777777778
0.888888888888889 0.833333333333333 0.8               0.933333333333333
0.822222222222222 0.5               0.4               0.822222222222222
0.933333333333333 0.8               0.5               0.946666666666667
0.849206349206349 0.714285714285714 0.666666666666667 0.90952380952381
0.75              0.5               0.4               0.825
0.666666666666667 0.5               0.333333333333333 0.666666666666667
0.833333333333333 0.5               0.333333333333333 0.85

StringSimilarity Example: Specify Column Ranges

Input
The input table is strsimilarity_input, as in StringSimilarity Example: Specify Column Names.

SQL Call

SELECT * FROM StringSimilarity (
ON strsimilarity_input PARTITION BY ANY
USING
ComparisonColumnPairs ('jaro (src_text1, tar_text) AS jaro1_sim',
'LD (src_text1, tar_text) AS ld1_sim',
'n_gram (src_text1, tar_text, 2) AS ngram1_sim',
'jaro_winkler (src_text1, tar_text, 0.1) AS jw1_sim'
)
CaseSensitive ('true')
Accumulate ('[0:1]', 'tar_text')
) AS dt ORDER BY id;

Output
The column range [0:1] specifies the first two input columns, id and src_text1, so the output
is the same as in StringSimilarity Example: Specify Column Names.
3
Data Exploration Functions

Data exploration functions help you learn about the variables (columns) of the input data set.

MovingAverage
The MovingAverage function computes average values in a series, using the specified moving
average type.

Moving Average Type        Description

Cumulative Moving Average  Computes cumulative moving average of value from beginning of series.

Exponential Moving Average Computes average of points in series, applying damping factor that
                           exponentially decreases weights of older values.

Modified Moving Average    Computes first value as simple moving average. Computes subsequent
                           values by adding new value and subtracting last average from
                           resulting sum.

Simple Moving Average      Computes unweighted mean of previous n data points.

Triangular Moving Average  Computes double-smoothed average of points in series.

Weighted Moving Average    Computes average of points in series, applying weights to older
                           values. Weights for older values decrease arithmetically.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• The ORDER BY clause supports only ASCII collation.
• The PARTITION BY clause assumes column names are in Normalization Form C (NFC).

Weighted Moving Average


A weighted average has multiplying factors that give different weights to different data points.
Mathematically, the moving average is the convolution of the data points with a moving average function.
In technical analysis, a weighted moving average (WMA) has weights that decrease arithmetically. In an
n-point weighted moving average, the most recent data point has weight n, the second most recent data
point has weight (n - 1), and so on, until the weight is zero.


With MAvgType ('W'), the MovingAverage function uses this formula:

WMA_M = (n*V_M + (n-1)*V_(M-1) + … + 2*V_(M-n+2) + V_(M-n+1)) / (n + (n-1) + … + 2 + 1)

Where Total_M = V_M + … + V_(M-n+1), the following equations are true:

• Total_(M+1) = Total_M + V_(M+1) - V_(M-n+1)
• Numerator_(M+1) = Numerator_M + n*V_(M+1) - Total_M
• WMA_(M+1) = Numerator_(M+1) / (n*(n+1)/2)

V_M is the target column value at index M in the window under consideration.

The value n—the number of old values to use when calculating the new weighted moving average—is
specified by the WindowSize syntax element.
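For example (an illustrative calculation, not function output), a 3-point weighted moving average of the values 2, 4, 6 (oldest to newest) is:

WMA = (3*6 + 2*4 + 1*2) / (3 + 2 + 1) = 28/6 ≈ 4.67

The most recent value, 6, receives the largest weight, n = 3.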

Triangular Moving Average


The triangular moving average (TMA) differs from the simple moving average in that it is double-smoothed;
that is, averaged twice. Double-smoothing keeps the triangular moving average from responding to new
data points as fast as the simple moving average. For an average that responds quickly to new data points,
use the simple moving average or exponential moving average.
With MAvgType ('T'), the MovingAverage function uses this procedure:
1. Compute the window size, N:
N = ceil((window_size + 1)/2)
The value window_size is specified by the WindowSize syntax element.
2. Compute the simple moving average of each target column, using this formula:
SMA_i = (V_1 + V_2 + … + V_N)/N

The function calculates SMA_i on the ith window of the target column from the start of the row.

V_i is the value of the target column at index i in the window.

3. Compute the triangular moving average by computing the simple moving average with window size N
on the values obtained in step 2, using this formula:
TMA = (SMA_1 + SMA_2 + … + SMA_N)/N

The function writes the cumulative moving average values computed for the first n rows, where n is less
than N, to the output table.
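For example (an illustrative calculation, not function output), with window_size = 3, the window size is N = ceil((3 + 1)/2) = 2. For the series 2, 4, 6, the 2-point simple moving averages are SMA_1 = (2 + 4)/2 = 3 and SMA_2 = (4 + 6)/2 = 5, so TMA = (3 + 5)/2 = 4.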

Simple Moving Average


The simple moving average (SMA) is the unweighted mean of the previous n data points. For example, a
10-day simple moving average of closing price is the mean of closing prices for the previous 10 days.
With MAvgType ('S'), the MovingAverage function uses this procedure:


1. Compute the arithmetic average of the first window_size rows.


The value window_size is specified by the WindowSize syntax element. In the next step, it is called N.
2. For each subsequent row, compute the new simple moving average value with this formula:
SMA = (V_1 + V_2 + ... + V_N)/N

V_i is the value of the target column at index i in the window.
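For example (an illustrative calculation, not function output), with window_size = 3 and the series 1, 2, 3, 4, 5, the successive simple moving averages are (1+2+3)/3 = 2, (2+3+4)/3 = 3, and (3+4+5)/3 = 4.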

Modified Moving Average


The first point of the modified moving average (MMA) is calculated like the first point of the simple moving
average. Each subsequent point is calculated by adding the new value to the most recently calculated
modified moving average value and then, from that sum, subtracting the last average value. The difference
is the new modified moving average value.
With MAvgType ('M'), the MovingAverage function uses this procedure:
1. Compute the arithmetic average of the first window_size rows.
The value window_size is specified by the WindowSize syntax element.
2. For each subsequent row, compute the new modified moving average value with this formula:
MMA_M = MMA_(M-1) + (1/window_size) * (V_M - MMA_(M-1))

V_M is the current value.
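For example (an illustrative calculation, not function output), with window_size = 3 and the series 3, 6, 9, 12: the first modified moving average is (3 + 6 + 9)/3 = 6, and the next is 6 + (1/3)*(12 - 6) = 8.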

Exponential Moving Average


Exponential moving average (EMA), or exponentially weighted moving average (EWMA), applies a
damping factor, alpha, that exponentially decreases the weights of older values. This technique gives much
more weight to recent observations, while retaining older observations.
With MAvgType ('E'), the MovingAverage function uses this procedure:
1. Compute the arithmetic average of the first n rows.
The value n is specified by the StartRows syntax element.
2. For each subsequent row, compute the new exponential moving average value with this formula:
EMA_M = alpha * V + (1 - alpha) * EMA_(M-1)

The value alpha is specified by the Alpha syntax element. V is the new value.
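For example (an illustrative calculation), with StartRows = 2, Alpha = 0.5, and the series 4, 6, 8: the initial value is the arithmetic average (4 + 6)/2 = 5, and the next value is 0.5*8 + (1 - 0.5)*5 = 6.5. The output of MovingAverage Example: Exponential Moving Average below shows the same pattern: the row 10 value, 467.4, is the arithmetic average of the first 10 prices, and the row 11 value is 0.1*492 + 0.9*467.4 = 469.86.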

Cumulative Moving Average


In a cumulative moving average (CMA), the data are added to the data set in an ordered data stream over
time. The objective is to compute the average of all the data at each point in time when new data arrived.
A typical use case is an investor who wants to find the average price of all transactions of a specific stock
over time, up to the current time.


With MAvgType ('C'), the MovingAverage function computes the arithmetic average of all the rows from the
beginning of the series with this formula:
CMA = (V_1 + V_2 + ... + V_N)/N

V_i is a value. N is the number of rows from the beginning of the data set.
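For example, in the output of MovingAverage Example: Cumulative Moving Average below, the second stockprice_cmavg value is (460 + 457)/2 = 458.5 and the third is (460 + 457 + 452)/3 ≈ 456.33.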

MovingAverage Syntax
SELECT * FROM MovingAverage (
ON { table | view | (query) }
[ PARTITION BY partition_column [,...] ]
[ ORDER BY order_column [,...] ]
[ USING
[ MAvgType ({ 'C' | 'E' | 'M' | 'S' | 'T' | 'W' }) ]
[ TargetColumns ({'target_column'| 'target_column_range'}[,...])]
[ Alpha (alpha) ]
[ StartRows (n) ]
[ IncludeFirst ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ WindowSize (window_size) ]
]
) AS alias;

Note:
If the ON clause does not include the PARTITION BY and ORDER BY clauses, results
are nondeterministic.

MovingAverage Syntax Elements


MAvgType
[Optional] Specify one of the following moving average types:

Type Description

'C' (Default) Cumulative moving average.

'E' Exponential moving average.

'M' Modified moving average.

'S' Simple moving average.

'T' Triangular moving average.


'W' Weighted moving average.

TargetColumns
[Optional] Specify the input column names for which to compute the moving average.
Default behavior: The function copies every input column to the output table but does not
compute any moving averages.

Alpha
[Optional with MAvgType E, otherwise ignored.] Specify the damping factor, a value in the
range [0, 1], which represents a percentage in the range [0, 100]. For example, if alpha is 0.2,
the damping factor is 20%. A higher alpha discounts older observations faster.
Default: 0.1

StartRows
[Optional with MAvgType E, otherwise ignored.] Specify the number of rows to skip before
calculating the exponential moving average. The function uses the arithmetic average of
these rows as the initial value of the exponential moving average. The value n must be
an integer.
Default: 2

IncludeFirst
[Ignored with MAvgType C, otherwise optional.] Specify whether to include the starting rows
in the output table. If you specify 'true', the output columns for the starting rows contain NULL,
because their moving average is undefined.
Default: 'false'

WindowSize
[Optional with MAvgType M, S, T, and W; otherwise ignored.] Specify the number of previous
values to consider when computing the new moving average. The data type of window_size
must be BYTEINT, SMALLINT, or INTEGER.
Minimum value: 3
Default: '10'


MovingAverage Input
Input Table Schema

Column           Data Type                           Description

partition_column Any                                 Column by which input data is partitioned. This
                                                     column must contain all rows of an entity. For
                                                     example, if function is to compute moving average
                                                     of a particular stock share price, all
                                                     transactions of that stock must be in one
                                                     partition. PARTITION BY clause assumes column
                                                     names are in Normalization Form C (NFC).

order_column     Any                                 Column by which input table is ordered.
                                                     ORDER BY clause supports only ASCII collation.

target_column    INTEGER, SMALLINT, BIGINT, NUMERIC, Values to average.
                 NUMBER, or DOUBLE PRECISION

MovingAverage Output
Output Table Schema

Column                 Data Type              Description

partition_column       Same as in input table Column by which input data is partitioned.

order_column           Same as in input table Column by which input table is ordered.

target_column          Same as in input table Column copied from input table.

target_column_typemavg DOUBLE PRECISION       Moving average of target_column values when row was
                                              added to data set, where type is value of MAvgType
                                              syntax element.

MovingAverage Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

MovingAverage Examples Input

The input table, company1_stock, contains 25 observations of common stock closing prices.


company1_stock
id name period stockprice

1 Company1 1961-05-17 00:00:00.000000 460.000000000000

2 Company1 1961-05-18 00:00:00.000000 457.000000000000

3 Company1 1961-05-19 00:00:00.000000 452.000000000000

4 Company1 1961-05-22 00:00:00.000000 459.000000000000

5 Company1 1961-05-23 00:00:00.000000 462.000000000000

6 Company1 1961-05-24 00:00:00.000000 459.000000000000

7 Company1 1961-05-25 00:00:00.000000 463.000000000000

8 Company1 1961-05-26 00:00:00.000000 479.000000000000

9 Company1 1961-05-29 00:00:00.000000 493.000000000000

10 Company1 1961-05-31 00:00:00.000000 490.000000000000

11 Company1 1961-06-01 00:00:00.000000 492.000000000000

12 Company1 1961-06-02 00:00:00.000000 498.000000000000

13 Company1 1961-06-05 00:00:00.000000 499.000000000000

14 Company1 1961-06-06 00:00:00.000000 497.000000000000

15 Company1 1961-06-07 00:00:00.000000 496.000000000000

16 Company1 1961-06-08 00:00:00.000000 490.000000000000

17 Company1 1961-06-09 00:00:00.000000 489.000000000000

18 Company1 1961-06-12 00:00:00.000000 478.000000000000

19 Company1 1961-06-13 00:00:00.000000 487.000000000000

20 Company1 1961-06-14 00:00:00.000000 491.000000000000

21 Company1 1961-06-15 00:00:00.000000 487.000000000000

22 Company1 1961-06-16 00:00:00.000000 482.000000000000

23 Company1 1961-06-19 00:00:00.000000 479.000000000000

24 Company1 1961-06-20 00:00:00.000000 478.000000000000

25 Company1 1961-06-21 00:00:00.000000 479.000000000000


MovingAverage Example: Cumulative Moving Average

This example computes a cumulative moving average for the price of stock.

Input
See MovingAverage Examples Input.

SQL Call

SELECT * FROM MovingAverage (
ON company1_stock PARTITION BY name ORDER BY period
USING
MAvgType ('C')
TargetColumns ('stockprice')
) AS dt ORDER BY id;

Output
id name period stockprice stockprice_cmavg

1 Company1 1961-05-17 00:00:00.000000 460.000000000000 460.000000000000

2 Company1 1961-05-18 00:00:00.000000 457.000000000000 458.500000000000

3 Company1 1961-05-19 00:00:00.000000 452.000000000000 456.333333333333

4 Company1 1961-05-22 00:00:00.000000 459.000000000000 457.000000000000

5 Company1 1961-05-23 00:00:00.000000 462.000000000000 458.000000000000

6 Company1 1961-05-24 00:00:00.000000 459.000000000000 458.166666666667

7 Company1 1961-05-25 00:00:00.000000 463.000000000000 458.857142857143

8 Company1 1961-05-26 00:00:00.000000 479.000000000000 461.375000000000

9 Company1 1961-05-29 00:00:00.000000 493.000000000000 464.888888888889

10 Company1 1961-05-31 00:00:00.000000 490.000000000000 467.400000000000

11 Company1 1961-06-01 00:00:00.000000 492.000000000000 469.636363636364

12 Company1 1961-06-02 00:00:00.000000 498.000000000000 472.000000000000

13 Company1 1961-06-05 00:00:00.000000 499.000000000000 474.076923076923

14 Company1 1961-06-06 00:00:00.000000 497.000000000000 475.714285714286

15 Company1 1961-06-07 00:00:00.000000 496.000000000000 477.066666666667

16 Company1 1961-06-08 00:00:00.000000 490.000000000000 477.875000000000

17 Company1 1961-06-09 00:00:00.000000 489.000000000000 478.529411764706

18 Company1 1961-06-12 00:00:00.000000 478.000000000000 478.500000000000

19 Company1 1961-06-13 00:00:00.000000 487.000000000000 478.947368421053

20 Company1 1961-06-14 00:00:00.000000 491.000000000000 479.550000000000

21 Company1 1961-06-15 00:00:00.000000 487.000000000000 479.904761904762

22 Company1 1961-06-16 00:00:00.000000 482.000000000000 480.000000000000

23 Company1 1961-06-19 00:00:00.000000 479.000000000000 479.956521739130

24 Company1 1961-06-20 00:00:00.000000 478.000000000000 479.875000000000

25 Company1 1961-06-21 00:00:00.000000 479.000000000000 479.840000000000

MovingAverage Example: Exponential Moving Average

This example computes an exponential moving average for the price of stock.

Input
See MovingAverage Examples Input.

SQL Call

SELECT * FROM MovingAverage (
ON company1_stock PARTITION BY name ORDER BY period
USING
MAvgType ('E')
TargetColumns ('stockprice')
StartRows (10)
Alpha (0.1)
IncludeFirst ('true')
) AS dt ORDER BY id;

Output
id name period stockprice stockprice_emavg

1 Company1 1961-05-17 00:00:00.000000 460.000000000000 ?

2 Company1 1961-05-18 00:00:00.000000 457.000000000000 ?

3 Company1 1961-05-19 00:00:00.000000 452.000000000000 ?

4 Company1 1961-05-22 00:00:00.000000 459.000000000000 ?

5 Company1 1961-05-23 00:00:00.000000 462.000000000000 ?

6 Company1 1961-05-24 00:00:00.000000 459.000000000000 ?

7 Company1 1961-05-25 00:00:00.000000 463.000000000000 ?

8 Company1 1961-05-26 00:00:00.000000 479.000000000000 ?

9 Company1 1961-05-29 00:00:00.000000 493.000000000000 ?

10 Company1 1961-05-31 00:00:00.000000 490.000000000000 467.400000000000

11 Company1 1961-06-01 00:00:00.000000 492.000000000000 469.860000000000

12 Company1 1961-06-02 00:00:00.000000 498.000000000000 472.674000000000

13 Company1 1961-06-05 00:00:00.000000 499.000000000000 475.306600000000

14 Company1 1961-06-06 00:00:00.000000 497.000000000000 477.475940000000

15 Company1 1961-06-07 00:00:00.000000 496.000000000000 479.328346000000

16 Company1 1961-06-08 00:00:00.000000 490.000000000000 480.395511400000

17 Company1 1961-06-09 00:00:00.000000 489.000000000000 481.255960260000

18 Company1 1961-06-12 00:00:00.000000 478.000000000000 480.930364234000

19 Company1 1961-06-13 00:00:00.000000 487.000000000000 481.537327810600

20 Company1 1961-06-14 00:00:00.000000 491.000000000000 482.483595029540

21 Company1 1961-06-15 00:00:00.000000 487.000000000000 482.935235526586

22 Company1 1961-06-16 00:00:00.000000 482.000000000000 482.841711973927

23 Company1 1961-06-19 00:00:00.000000 479.000000000000 482.457540776535

24 Company1 1961-06-20 00:00:00.000000 478.000000000000 482.011786698881

25 Company1 1961-06-21 00:00:00.000000 479.000000000000 481.710608028993

MovingAverage Example: Modified Moving Average

This example computes a modified moving average for the price of stock.

Input
See MovingAverage Examples Input.


SQL Call

SELECT * FROM MovingAverage (
ON company1_stock PARTITION BY name ORDER BY period
USING
MAvgType ('M')
TargetColumns ('stockprice')
WindowSize (10)
IncludeFirst ('true')
) AS dt ORDER BY id;

Output
id name period stockprice stockprice_mmavg

1 Company1 1961-05-17 00:00:00.000000 460.000000000000 460.000000000000

2 Company1 1961-05-18 00:00:00.000000 457.000000000000 459.700000000000

3 Company1 1961-05-19 00:00:00.000000 452.000000000000 458.930000000000

4 Company1 1961-05-22 00:00:00.000000 459.000000000000 458.937000000000

5 Company1 1961-05-23 00:00:00.000000 462.000000000000 459.243300000000

6 Company1 1961-05-24 00:00:00.000000 459.000000000000 459.218970000000

7 Company1 1961-05-25 00:00:00.000000 463.000000000000 459.597073000000

8 Company1 1961-05-26 00:00:00.000000 479.000000000000 461.537365700000

9 Company1 1961-05-29 00:00:00.000000 493.000000000000 464.683629130000

10 Company1 1961-05-31 00:00:00.000000 490.000000000000 467.215266217000

11 Company1 1961-06-01 00:00:00.000000 492.000000000000 469.693739595300

12 Company1 1961-06-02 00:00:00.000000 498.000000000000 472.524365635770

13 Company1 1961-06-05 00:00:00.000000 499.000000000000 475.171929072193

14 Company1 1961-06-06 00:00:00.000000 497.000000000000 477.354736164974

15 Company1 1961-06-07 00:00:00.000000 496.000000000000 479.219262548476

16 Company1 1961-06-08 00:00:00.000000 490.000000000000 480.297336293629

17 Company1 1961-06-09 00:00:00.000000 489.000000000000 481.167602664266

18 Company1 1961-06-12 00:00:00.000000 478.000000000000 480.850842397839

19 Company1 1961-06-13 00:00:00.000000 487.000000000000 481.465758158055

20 Company1 1961-06-14 00:00:00.000000 491.000000000000 482.419182342250

21 Company1 1961-06-15 00:00:00.000000 487.000000000000 482.877264108025

22 Company1 1961-06-16 00:00:00.000000 482.000000000000 482.789537697222

23 Company1 1961-06-19 00:00:00.000000 479.000000000000 482.410583927500

24 Company1 1961-06-20 00:00:00.000000 478.000000000000 481.969525534750

25 Company1 1961-06-21 00:00:00.000000 479.000000000000 481.672572981275

MovingAverage Example: Simple Moving Average

This example computes a simple moving average for the price of stock.

Input
See MovingAverage Examples Input.

SQL Call

SELECT * FROM MovingAverage (
ON company1_stock PARTITION BY name ORDER BY period
USING
MAvgType ('S')
TargetColumns ('stockprice')
WindowSize (10)
IncludeFirst ('true')
) AS dt ORDER BY id;

Output
id name period stockprice stockprice_smavg

1 Company1 1961-05-17 00:00:00.000000 460.000000000000 ?

2 Company1 1961-05-18 00:00:00.000000 457.000000000000 ?

3 Company1 1961-05-19 00:00:00.000000 452.000000000000 ?

4 Company1 1961-05-22 00:00:00.000000 459.000000000000 ?

5 Company1 1961-05-23 00:00:00.000000 462.000000000000 ?

6 Company1 1961-05-24 00:00:00.000000 459.000000000000 ?

7 Company1 1961-05-25 00:00:00.000000 463.000000000000 ?

8 Company1 1961-05-26 00:00:00.000000 479.000000000000 ?

9 Company1 1961-05-29 00:00:00.000000 493.000000000000 ?

10 Company1 1961-05-31 00:00:00.000000 490.000000000000 467.400000000000

11 Company1 1961-06-01 00:00:00.000000 492.000000000000 470.600000000000

12 Company1 1961-06-02 00:00:00.000000 498.000000000000 474.700000000000

13 Company1 1961-06-05 00:00:00.000000 499.000000000000 479.400000000000

14 Company1 1961-06-06 00:00:00.000000 497.000000000000 483.200000000000

15 Company1 1961-06-07 00:00:00.000000 496.000000000000 486.600000000000

16 Company1 1961-06-08 00:00:00.000000 490.000000000000 489.700000000000

17 Company1 1961-06-09 00:00:00.000000 489.000000000000 492.300000000000

18 Company1 1961-06-12 00:00:00.000000 478.000000000000 492.200000000000

19 Company1 1961-06-13 00:00:00.000000 487.000000000000 491.600000000000

20 Company1 1961-06-14 00:00:00.000000 491.000000000000 491.700000000000

21 Company1 1961-06-15 00:00:00.000000 487.000000000000 491.200000000000

22 Company1 1961-06-16 00:00:00.000000 482.000000000000 489.600000000000

23 Company1 1961-06-19 00:00:00.000000 479.000000000000 487.600000000000

24 Company1 1961-06-20 00:00:00.000000 478.000000000000 485.700000000000

25 Company1 1961-06-21 00:00:00.000000 479.000000000000 484.000000000000

MovingAverage Example: Triangular Moving Average

This example computes the triangular moving average for the price of stock.

Input
See MovingAverage Examples Input.

SQL Call

SELECT * FROM MovingAverage (
ON company1_stock PARTITION BY name ORDER BY period
USING
MAvgType ('T')
TargetColumns ('stockprice')
WindowSize (10)
IncludeFirst ('true')
) AS dt ORDER BY id;

Output
id name period stockprice stockprice_tmavg

1 Company1 1961-05-17 00:00:00.000000 460.000000000000 460.000000000000

2 Company1 1961-05-18 00:00:00.000000 457.000000000000 459.250000000000

3 Company1 1961-05-19 00:00:00.000000 452.000000000000 458.277777777778

4 Company1 1961-05-22 00:00:00.000000 459.000000000000 457.958333333333

5 Company1 1961-05-23 00:00:00.000000 462.000000000000 457.966666666667

6 Company1 1961-05-24 00:00:00.000000 459.000000000000 458.000000000000

7 Company1 1961-05-25 00:00:00.000000 463.000000000000 457.777777777778

8 Company1 1961-05-26 00:00:00.000000 479.000000000000 458.416666666667

9 Company1 1961-05-29 00:00:00.000000 493.000000000000 460.555555555556

10 Company1 1961-05-31 00:00:00.000000 490.000000000000 463.444444444444

11 Company1 1961-06-01 00:00:00.000000 492.000000000000 467.000000000000

12 Company1 1961-06-02 00:00:00.000000 498.000000000000 471.611111111111

13 Company1 1961-06-05 00:00:00.000000 499.000000000000 477.138888888889

14 Company1 1961-06-06 00:00:00.000000 497.000000000000 482.555555555556

15 Company1 1961-06-07 00:00:00.000000 496.000000000000 486.916666666667

16 Company1 1961-06-08 00:00:00.000000 490.000000000000 490.416666666667

17 Company1 1961-06-09 00:00:00.000000 489.000000000000 493.000000000000

18 Company1 1961-06-12 00:00:00.000000 478.000000000000 493.944444444444

19 Company1 1961-06-13 00:00:00.000000 487.000000000000 493.555555555556

20 Company1 1961-06-14 00:00:00.000000 491.000000000000 492.500000000000

21 Company1 1961-06-15 00:00:00.000000 487.000000000000 491.111111111111

22 Company1 1961-06-16 00:00:00.000000 482.000000000000 489.500000000000

23 Company1 1961-06-19 00:00:00.000000 479.000000000000 487.694444444444

24 Company1 1961-06-20 00:00:00.000000 478.000000000000 486.444444444444

25 Company1 1961-06-21 00:00:00.000000 479.000000000000 485.305555555556


MovingAverage Example: Weighted Moving Average

This example computes the weighted moving average for the price of stock.

Input
See MovingAverage Examples Input.

SQL Call

SELECT * FROM MovingAverage (
ON company1_stock PARTITION BY name ORDER BY period
USING
MAvgType ('W')
TargetColumns ('stockprice')
WindowSize (10)
IncludeFirst ('true')
) AS dt ORDER BY id;

Output
id name period stockprice stockprice_wmavg

1 Company1 1961-05-17 00:00:00.000000 460.000000000000 ?

2 Company1 1961-05-18 00:00:00.000000 457.000000000000 ?

3 Company1 1961-05-19 00:00:00.000000 452.000000000000 ?

4 Company1 1961-05-22 00:00:00.000000 459.000000000000 ?

5 Company1 1961-05-23 00:00:00.000000 462.000000000000 ?

6 Company1 1961-05-24 00:00:00.000000 459.000000000000 ?

7 Company1 1961-05-25 00:00:00.000000 463.000000000000 ?

8 Company1 1961-05-26 00:00:00.000000 479.000000000000 ?

9 Company1 1961-05-29 00:00:00.000000 493.000000000000 ?

10 Company1 1961-05-31 00:00:00.000000 490.000000000000 473.454545454545

11 Company1 1961-06-01 00:00:00.000000 492.000000000000 477.927272727273

12 Company1 1961-06-02 00:00:00.000000 498.000000000000 482.909090909091

13 Company1 1961-06-05 00:00:00.000000 499.000000000000 487.327272727273

14 Company1 1961-06-06 00:00:00.000000 497.000000000000 490.527272727273

15 Company1 1961-06-07 00:00:00.000000 496.000000000000 492.854545454545

16 Company1 1961-06-08 00:00:00.000000 490.000000000000 493.472727272727

17 Company1 1961-06-09 00:00:00.000000 489.000000000000 493.345454545455

18 Company1 1961-06-12 00:00:00.000000 478.000000000000 490.745454545455

19 Company1 1961-06-13 00:00:00.000000 487.000000000000 489.800000000000

20 Company1 1961-06-14 00:00:00.000000 491.000000000000 489.690909090909

21 Company1 1961-06-15 00:00:00.000000 487.000000000000 488.836363636364

22 Company1 1961-06-16 00:00:00.000000 482.000000000000 487.163636363636

23 Company1 1961-06-19 00:00:00.000000 479.000000000000 485.236363636364

24 Company1 1961-06-20 00:00:00.000000 478.000000000000 483.490909090909

25 Company1 1961-06-21 00:00:00.000000 479.000000000000 482.272727272727

TD_CategoricalSummary
TD_CategoricalSummary displays the distinct values and their counts for each specified input table column.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_CategoricalSummary Syntax
SELECT * FROM TD_CategoricalSummary (
ON { table | view | (query) } AS InputTable
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
) AS alias;


TD_CategoricalSummary Syntax Elements


TargetColumns
Specify the names of the InputTable columns for which to display the distinct values and
their counts.
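You can summarize several columns in one call. For example, the following sketch (reusing the cat_titanic_train table from the example below) displays the distinct values of two columns; the output contains one row per distinct value per column:

SELECT * FROM TD_CategoricalSummary (
ON cat_titanic_train AS InputTable
USING
TargetColumns ('sex','embarked')
) AS dt;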

TD_CategoricalSummary Input
InputTable Schema
Column        Data Type                                        Description

target_column CHAR or VARCHAR (CHARACTER SET LATIN or UNICODE) Column for which to display distinct
                                                               values and their counts.

TD_CategoricalSummary Output
Output Table Schema
Column             Data Type                       Description

ColumnName         VARCHAR (CHARACTER SET UNICODE) Name of target_column.

DistinctValue      VARCHAR (CHARACTER SET UNICODE) Name of distinct value in target_column.
                                                   Table has one row for each distinct value.

DistinctValueCount BIGINT                          Count of distinct value in target_column.
                                                   Table has one row for each distinct value.

TD_CategoricalSummary Example
InputTable: cat_titanic_train
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass name                                 sex    age sibsp parch ticket   fare    cabin       embarked
--------- -------- ------ ------------------------------------ ------ --- ----- ----- -------- ------- ----------- --------
97        0        1      Goldschmidt; Mr. George B            male   71  0     0     PC 17754 34.6542 A5          C
488       0        1      Kent; Mr. Edward Austin              male   58  0     0     11771    29.7    B37         C
505       1        1      Maioni; Miss. Roberta                female 16  0     0     110152   86.5    B79         S
631       1        1      Barkworth; Mr. Algernon Henry Wilson male   80  0     0     27042    30      A23         S
873       0        1      Carlsson; Mr. Frans Olof             male   33  0     0     695      5       B51 B53 B55 S

SQL Call

SELECT * FROM TD_CategoricalSummary (
ON cat_titanic_train AS InputTable
USING
TargetColumns ('sex')
) AS dt;

Output

ColumnName DistinctValue DistinctValueCount
---------- ------------- ------------------
sex        female        1
sex        male          4

TD_ColumnSummary
TD_ColumnSummary displays the following for each specified input table column:
• Column name
• Column data type
• Count of these values:
◦ Non-NULL
◦ NULL
◦ Blank (all space characters) (NULL for numeric data type)
◦ Zero (NULL for nonnumeric data type)
◦ Positive (NULL for nonnumeric data type)
◦ Negative (NULL for nonnumeric data type)
• Percentage of NULL values
• Percentage of non-NULL values


Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_ColumnSummary Syntax
SELECT * FROM TD_ColumnSummary (
ON { table | view | (query) } AS InputTable
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
) AS alias;

TD_ColumnSummary Syntax Elements


TargetColumns
Specify the names of the InputTable columns to summarize.

TD_ColumnSummary Input
InputTable Schema
Column Data Type Description

target_column Any Column for which to display summary.

TD_ColumnSummary Output
Output Table Schema
Column            Data Type                        Description

ColumnName        VARCHAR (CHARACTER SET UNICODE)  Name of target_column.

DataType          VARCHAR (CHARACTER SET LATIN)    Data type of target_column.

NonNullCount      BIGINT                           Count of non-NULL values in target_column.

NullCount         BIGINT                           Count of NULL values in target_column.

BlankCount        BIGINT                           If data type is CHAR or VARCHAR, count of blank
                                                   values (values with all space characters) in
                                                   target_column; otherwise NULL.

ZeroCount         BIGINT                           If DataType is numeric, count of zero values in
                                                   target_column; otherwise NULL.

PositiveCount     BIGINT                           If DataType is numeric, count of positive values
                                                   in target_column; otherwise NULL.

NegativeCount     BIGINT                           If DataType is numeric, count of negative values
                                                   in target_column; otherwise NULL.

NullPercentage    DOUBLE PRECISION                 Percentage of NULL values in target_column.

NonNullPercentage DOUBLE PRECISION                 Percentage of non-NULL values in target_column.

TD_ColumnSummary Example
InputTable: col_titanic_train
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass name                                 sex    age  sibsp parch ticket fare   cabin       embarked
--------- -------- ------ ------------------------------------ ------ ---- ----- ----- ------ ------ ----------- --------
49        0        3      Samaan; Mr. Youssef                  male   null 2     0     2662   21.679 null        C
78        0        3      Moutal; Mr. Rahamin Haim             male   null 0     0     374746 8.05   null        S
505       1        1      Maioni; Miss. Roberta                female 16   0     0     110152 8.65   B79         S
631       1        1      Barkworth; Mr. Algernon Henry Wilson male   80   0     0     27042  30     A23         S
873       0        1      Carlsson; Mr. Frans Olof             male   33   0     0     695    5      B51 B53 B55 S


SQL Call

SELECT * FROM TD_ColumnSummary (
ON col_titanic_train AS InputTable
USING
TargetColumns ('age','pclass','embarked','cabin')
) AS dt;

Output

ColumnName DataType                        NonNullCount NullCount BlankCount ZeroCount PositiveCount NegativeCount NullPercentage NonNullPercentage
---------- ------------------------------- ------------ --------- ---------- --------- ------------- ------------- -------------- -----------------
age        INTEGER                         3            2         null       0         3             0             4.00E+001      6.00E+001
cabin      VARCHAR(20) CHARACTER SET LATIN 3            2         0          null      null          null          4.00E+001      6.00E+001
embarked   VARCHAR(20) CHARACTER SET LATIN 5            0         0          null      null          null          0.00E+000      1.00E+002
pclass     INTEGER                         5            0         null       0         5             0             0.00E+000      1.00E+002

TD_GetRowsWithMissingValues
TD_GetRowsWithMissingValues displays the rows that have NULL values in the specified input
table columns.


Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

Related Information:
TD_GetRowsWithoutMissingValues

TD_GetRowsWithMissingValues Syntax
SELECT * FROM TD_GetRowsWithMissingValues (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ORDER BY order_column ] ]
[ USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
]
) AS alias;

TD_GetRowsWithMissingValues Syntax Elements


TargetColumns
[Optional] Specify the target column names to check for NULL values.
Default: If omitted, the function considers all columns of the input table.

Accumulate
[Optional] Specify the input table column names to copy to the output table.
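For example, the following call (a sketch based on the syntax above, reusing input_table from the example below) checks only the age and cabin columns for NULL values and copies the passenger and name columns to the output:

SELECT * FROM TD_GetRowsWithMissingValues (
ON input_table AS InputTable
USING
TargetColumns ('age','cabin')
Accumulate ('passenger','name')
) AS dt;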

TD_GetRowsWithMissingValues Input
InputTable Schema
Column Data Type Description

target_column Any Columns for which NULL values are checked.

accumulate_column Any The input table column names to copy to the output table.


TD_GetRowsWithMissingValues Output
Output Table Schema
Same as Input Table schema

TD_GetRowsWithMissingValues Example
InputTable: input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass name                                 sex    age  sibsp parch ticket    fare   cabin       embarked
--------- -------- ------ ------------------------------------ ------ ---- ----- ----- --------- ------ ----------- --------
1         0        3      Braund; Mr. Owen Harris              male   22   1     0     A/5 21171 7.25   null        S
30        0        3      Todoroff; Mr. Lalio                  male   null 0     0     349216    7.8958 null        S
505       1        1      Maioni; Miss. Roberta                female 16   0     0     110152    86.5   B79         S
631       1        1      Barkworth; Mr. Algernon Henry Wilson male   80   0     0     27042     30     A23         S
873       0        1      Carlsson; Mr. Frans Olof             male   33   0     0     695       5      B51 B53 B55 S

SQL Call

SELECT * FROM TD_GetRowsWithMissingValues (
ON input_table AS InputTable
USING
TargetColumns ('[name:cabin]')
) AS dt;

Output

passenger survived pclass name                    sex  age  sibsp parch ticket    fare   cabin embarked
--------- -------- ------ ----------------------- ---- ---- ----- ----- --------- ------ ----- --------
1         0        3      Braund; Mr. Owen Harris male 22   1     0     A/5 21171 7.25   null  S
30        0        3      Todoroff; Mr. Lalio     male null 0     0     349216    7.8958 null  S

TD_Histogram
TD_Histogram calculates the frequency distribution of a data set using your choice of these methods:
• Sturges
• Scott
• Variable-width
• Equal-width

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_Histogram Syntax
SELECT * FROM TD_Histogram (
ON { table | view | (query) } AS InputTable
[ ON { table | view | (query) } AS MinMax DIMENSION ]
USING
MethodType ({ 'Sturges' | 'Scott' | 'Variable-Width' | 'Equal-Width' })
TargetColumn ('target_column')
[ NBins ('number_of_bins') ]
[ Inclusion ({ 'left' | 'right' }) ]
) AS alias;

TD_Histogram Syntax Elements


MethodType
Specify the method for calculating the frequency distribution of the data set:

Method         Description

Sturges        Algorithm for calculating bin width, w:
               w = r/(1 + log2(n))
               where:
               w = bin width
               r = data value range
               n = number of elements in data set
               Sturges algorithm performs best if data is normally distributed and n is
               at least 30.

Scott          Algorithm for calculating bin width, w:
               w = 3.49s/(n^(1/3))
               where:
               w = bin width
               s = standard deviation of data values
               n = number of elements in data set
               r = data value range
               Number of bins: r/w
               Scott algorithm performs best on normally distributed data.

Variable-Width Requires MinMax table, which specifies the minimum value and the maximum
               value of the bin in column1 and column2 respectively, and the label of
               the bin in column3.
               Maximum number of bins cannot exceed 3500.

Equal-Width    Requires MinMax table, which specifies the minimum value of the bins in
               column1 and the maximum value of the bins in column2.
               Algorithm for calculating bin width, w:
               w = (max - min)/k
               where:
               min = minimum value of the bins
               max = maximum value of the bins
               k = number of intervals into which algorithm divides data set
               Interval boundaries: min+w, min+2w, …, min+(k-1)w
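For example (an illustrative calculation), with an Equal-Width MinMax table specifying min = 0 and max = 100 and with NBins (4), the bin width is w = (100 - 0)/4 = 25, and the interval boundaries are 25, 50, and 75.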

TargetColumn
Specify the name of the InputTable column that contains the data set.

NBins
[Required with methods Variable-Width and Equal-Width, otherwise ignored.] Specify the
number of bins (number of data value ranges).

Inclusion
[Optional] Specify where to put data points that are on bin boundaries—in the bin to the left
of the boundary or the bin to the right of boundary.


Default: left

TD_Histogram Input
InputTable Schema

Column        Data Type                            Description

target_column BYTEINT, SMALLINT, INTEGER, BIGINT,  Data set.
              DECIMAL/NUMERIC, FLOAT, REAL, or
              DOUBLE PRECISION

MinMax Table Schema

Column   Data Type                            Description

MinValue BYTEINT, SMALLINT, INTEGER, BIGINT,  Minimum value of the bins.
         DECIMAL/NUMERIC, FLOAT, REAL, or
         DOUBLE PRECISION

MaxValue BYTEINT, SMALLINT, INTEGER, BIGINT,  Maximum value of the bins.
         DECIMAL/NUMERIC, FLOAT, REAL, or
         DOUBLE PRECISION

Label    CHAR, VARCHAR, BYTEINT, SMALLINT,    [Required only with MethodType ('Variable-Width').]
         INTEGER, or BIGINT                   Labels for bins or data value ranges.
         Character types: CHARACTER SET can   Maximum label length: 128 UNICODE characters.
         be either LATIN or UNICODE.

TD_Histogram Output
Output Table Schema
Column        Data Type                       Description

Label         VARCHAR (CHARACTER SET UNICODE) Labels for bins or data value ranges.
              or BIGINT

MinValue      DOUBLE PRECISION                Minimum values for bins or data value ranges.

MaxValue      DOUBLE PRECISION                Maximum values for bins or data value ranges.

CountOfValues BIGINT                          Counts of values in bins or data value ranges.

Bin_percent   DOUBLE PRECISION                Percentage of InputTable rows in bins or data
                                              value ranges.


TD_Histogram Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
• InputTable: hist_titanic_train

passenger survived pclass name                                 sex    age sibsp parch ticket   fare    cabin       embarked
--------- -------- ------ ------------------------------------ ------ --- ----- ----- -------- ------- ----------- --------
97        0        1      Goldschmidt; Mr. George B            male   71  0     0     PC 17754 34.6542 A5          C
488       0        1      Kent; Mr. Edward Austin              male   58  0     0     11771    29.7    B37         C
505       1        1      Maioni; Miss. Roberta                female 16  0     0     110152   86.5    B79         S
631       1        1      Barkworth; Mr. Algernon Henry Wilson male   80  0     0     27042    30      A23         S
873       0        1      Carlsson; Mr. Frans Olof             male   33  0     0     695      5       B51 B53 B55 S

• MinMax: hist_titanic_train_dim

minVal maxVal label


--------- --------- ----------
0 20 Young Age
21 45 Middle Age
46 91 Old Age

SQL Call

SELECT * FROM TD_Histogram (
ON hist_titanic_train AS InputTable
ON hist_titanic_train_dim AS MinMax DIMENSION
USING
TargetColumn ('age')
MethodType ('variable-width')
nbins (3)
) AS dt;


Output

Label MinValue MaxValue CountOfValues bin_Percent


---------- --------- --------- ------------- -----------
Middle Age 21 45 1 20
Old Age 46 90 3 60
Young Age 0 20 1 20

TD_QQNorm
TD_QQNorm checks whether the values in the specified input table columns are normally distributed. The
function returns the quantiles of the column values and corresponding theoretical quantile values from a
normal distribution. If the column values are normally distributed, then the quantiles of column values and
normal quantile values appear in a straight line when plotted on a 2D graph.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_QQNorm Syntax
SELECT * FROM TD_QQNorm (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ ORDER BY order_column ] ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
RankColumns ({ 'rank_column' | rank_column_range }[,...])
[ OutputColumns ('output_column' [,...]) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

TD_QQNorm Syntax Elements


TargetColumns
Specify the names of the numeric InputTable columns to check for normal distribution.


RankColumns
Specify the names of the InputTable columns that contain the ranks for the target columns.

OutputColumns
[Optional] Specify names for the output table columns that contain the theoretical quantiles
of the target columns.
Default: target_column_theoretical_quantiles

Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.

TD_QQNorm Input
InputTable Schema
Column            Data Type                            Description

target_column     BYTEINT, SMALLINT, INTEGER, BIGINT,  Column to check for normal distribution.
                  DECIMAL/NUMERIC, FLOAT, REAL, or
                  DOUBLE PRECISION

rank_column       BYTEINT, SMALLINT, INTEGER,          Ranks for target_column.
                  or BIGINT

accumulate_column Any                                  Column to copy to output table.

TD_QQNorm Output
Output Table Schema
Column                                Data Type        Description

accumulate_column                     Same as in       Column copied from InputTable.
                                      InputTable

target_column                         DOUBLE PRECISION Column checked for normal distribution.

output_column if specified, otherwise DOUBLE PRECISION Theoretical quantile values for
target_column_theoretical_quantiles                    target_column.


TD_QQNorm Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass name                                 sex    age sibsp parch ticket   fare    cabin       embarked
--------- -------- ------ ------------------------------------ ------ --- ----- ----- -------- ------- ----------- --------
97        0        1      Goldschmidt; Mr. George B            male   71  0     0     PC 17754 34.6542 A5          C
488       0        1      Kent; Mr. Edward Austin              male   58  0     0     11771    29.7    B37         C
505       1        1      Maioni; Miss. Roberta                female 16  0     0     110152   86.5    B79         S
631       1        1      Barkworth; Mr. Algernon Henry Wilson male   80  0     0     27042    30      A23         S
873       0        1      Carlsson; Mr. Frans Olof             male   33  0     0     695      5       B51 B53 B55 S

From input_table, create RankTable with this statement:

CREATE TABLE RankTable AS (
SELECT age,
fare,
CAST (ROW_NUMBER() OVER (ORDER BY age ASC NULLS LAST) AS BIGINT)
AS rank_age,
CAST (ROW_NUMBER() OVER (ORDER BY fare ASC NULLS LAST) AS BIGINT)
AS rank_fare
FROM input_table AS dt
) WITH DATA;

age fare rank_age rank_fare
--- --------- -------- ---------
16 86.5 1 5
33 5 2 1
58 29.7 3 2
71 34.6542 4 4
80 30 5 3


SQL Call

SELECT * FROM TD_QQNorm (
ON RankTable AS InputTable
USING
TargetColumns ('[0:1]')
RankColumns ('[2:3]')
) AS dt;

Output

age age_theoretical_quantiles fare    fare_theoretical_quantiles
--- ------------------------- ------- --------------------------
16  -1.17986882170049         86.5    1.17986882170049
33  -0.496788749686441        5       -1.17986882170049
58  -0.000000101006675468085  29.7    -0.496788749686441
71  0.496788749686441         34.6542 0.496788749686441
80  1.17986882170049          30      -0.000000101006675468085

TD_UnivariateStatistics
TD_UnivariateStatistics displays descriptive statistics for each specified numeric input table column.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_UnivariateStatistics Syntax
SELECT * FROM TD_UnivariateStatistics (
ON { table | view | (query) }
AS InputTable [ PARTITION BY ANY ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ PartitionColumns ('partition_column' [,...]) ]
[ Stats ('statistic' [,...]) ]
[ Centiles ('percentile' [,...]) ]
[ TrimPercentile ('trimmed_percentile') ]
) AS alias;

TD_UnivariateStatistics Syntax Elements


TargetColumns
Specify the names of the numeric InputTable columns for which to compute statistics.

PartitionColumns
[Optional] Specify the names of the InputTable columns on which to partition the input. The
function copies these columns to the output table.
Default behavior: The function treats all rows as a single partition.

Stats
[Optional] Specify the statistics to calculate. statistic is one of these:
• SUM
• COUNT or CNT
• MAXIMUM or MAX
• MINIMUM or MIN
• MEAN
• UNCORRECTED SUM OF SQUARES or USS
• NULL COUNT or NLC
• POSITIVE VALUES COUNT or PVC
• NEGATIVE VALUES COUNT or NVC
• ZERO VALUES COUNT or ZVC
• TOP5 or TOP
• BOTTOM5 or BTM
• RANGE or RNG
• GEOMETRIC MEAN or GM
• HARMONIC MEAN or HM
• VARIANCE or VAR
• STANDARD DEVIATION or STD
• STANDARD ERROR or SE
• SKEWNESS or SKW
• KURTOSIS or KUR
• COEFFICIENT OF VARIATION or CV
• CORRECTED SUM OF SQUARES or CSS
• MODE
• MEDIAN or MED
• UNIQUE ENTITY COUNT or UEC
• INTERQUARTILE RANGE or IQR
• TRIMMED MEAN or TM
• PERCENTILES or PRC
• ALL
Default: ALL

Centiles
[Optional] Specify the centile to calculate. percentile is an INTEGER in the range [1, 100].
The function ignores Centiles unless Stats specifies PERCENTILES, PRC, or ALL.
Default: 1, 5, 10, 25, 50, 75, 90, 95, 99

TrimPercentile
[Optional] Specify the trimmed lower percentile, an integer value in the range [1, 50].
The function calculates the mean of the values between the trimmed lower percentile
(trimmed_percentile) and the trimmed upper percentile (100 - trimmed_percentile).
The function ignores TrimPercentile unless Stats specifies TRIMMED MEAN, TM, or ALL.
Default: 20
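For example, the following call (a sketch based on the syntax above, reusing the titanic_train table from the example below) computes only the 25th, 50th, and 75th percentiles of fare:

SELECT * FROM TD_UnivariateStatistics (
ON titanic_train AS InputTable
USING
TargetColumns ('fare')
Stats ('PERCENTILES')
Centiles ('25','50','75')
) AS dt;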

TD_UnivariateStatistics Input
InputTable Schema
Column Data Type Description

target_column NUMERIC Column for which to calculate statistics.

partition_column Any Defines a partition for statistics calculation.

TD_UnivariateStatistics Output
Output Table Schema
Column Data Type Description

partition_column Same as in InputTable Column copied from InputTable. Defines a partition for
statistics calculation.


Attribute VARCHAR target_column for which function calculated statistics.

StatsName VARCHAR [Column appears once for each specified statistic.] Statistic.

StatsValue DOUBLE PRECISION [Column appears once for each specified statistic.]
Statistic value.

TD_UnivariateStatistics Example
InputTable: titanic_train
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived name                                 sex    age fare
--------- -------- ------------------------------------ ------ --- -------
97        0        Goldschmidt; Mr. George B            male   71  34.6542
488       0        Kent; Mr. Edward Austin              male   58  29.7
505       1        Maioni; Miss. Roberta                female 16  86.5
631       1        Barkworth; Mr. Algernon Henry Wilson male   80  30
873       0        Carlsson; Mr. Frans Olof             male   33  5

SQL Call

SELECT * FROM TD_UnivariateStatistics (
ON titanic_train
USING
TargetColumns ('age','fare')
Stats ('MEAN', 'MEDIAN', 'MODE')
) AS dt;

Output

ATTRIBUTE StatName StatValue
--------- -------- ---------------------
age       MEAN     5.16000000000000E+001
age       MEDIAN   5.80000000000000E+001
age       MODE     1.60000000000000E+001
fare      MEAN     3.71708400000000E+001
fare      MEDIAN   3.00000000000000E+001
fare      MODE     5.00000000000000E+000

TD_WhichMax
TD_WhichMax displays all rows that have the maximum value in a specified input table column.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_WhichMax Syntax
SELECT * FROM TD_WhichMax (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ORDER BY order_by_column ] ]
USING
TargetColumn ('target_column')
) AS alias;

TD_WhichMax Syntax Elements


TargetColumn
Specify the name of the input table column to check for maximum values.

TD_WhichMax Input
InputTable Schema
Column Data Type Description

target_column Any except BLOB, CLOB, and UDT. Columns for which maximum values
are checked.


TD_WhichMax Output
Output Table Schema
Same as InputTable schema

TD_WhichMax Example
InputTable: titanic_dataset
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ------ --- ----- ----- ------- ----- --------
1 0 3 male 22 1 0 7.25 null S
2 1 1 female 38 1 0 71.28 C85 C
3 1 3 female 26 0 0 7.93 null S
4 1 1 female 35 1 0 53.10 C123 S
5 0 3 male 35 0 0 8.05 null S

SQL Call

SELECT * FROM TD_WhichMax (
ON titanic_dataset AS InputTable
USING
TargetColumn ('fare')
) AS dt;

Output

passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ------ --- ----- ----- ------- ----- --------
2 1 1 female 38 1 0 71.28 C85 C

TD_WhichMin
TD_WhichMin displays all rows that have the minimum value in a specified input table column.


Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_WhichMin Syntax
SELECT * FROM TD_WhichMin (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ORDER BY order_by_column ] ]
USING
TargetColumn ('target_column')
) AS alias;

TD_WhichMin Syntax Elements


TargetColumn
Specify the name of the input table column to check for minimum values.

TD_WhichMin Input
InputTable Schema
Column Data Type Description

target_column Any except BLOB, CLOB, and UDT. Columns for which minimum values are checked.

TD_WhichMin Output
Output Table Schema
Same as InputTable schema


TD_WhichMin Example
InputTable: titanic_dataset
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ------ --- ----- ----- ------- ----- --------
1 0 3 male 22 1 0 7.25 null S
2 1 1 female 38 1 0 71.28 C85 C
3 1 3 female 26 0 0 7.93 null S
4 1 1 female 35 1 0 53.10 C123 S
5 0 3 male 35 0 0 8.05 null S

SQL Call

SELECT * FROM TD_WhichMin (
ON titanic_dataset AS InputTable
USING
TargetColumn ('fare')
) AS dt;

Output

passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ---- --- ----- ----- --------- ----- --------
1 0 3 male 22 1 0 7.25 null S

4: Feature Engineering Transform Functions

Feature engineering transform functions encapsulate variable transformations during the training phase so
you can chain them to create a pipeline for operationalization.
Each TD_nameFit function outputs a table to input to the TD_nameTransform function as FitTable. For
example, TD_BinCodeFit outputs a FitTable for TD_BinCodeTransform.
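The general pattern looks like the following sketch, which pairs TD_BincodeFit with TD_BincodeTransform (the table name my_input and the column names id and age are hypothetical; both functions are described later in this chapter):

CREATE TABLE my_fit AS (
SELECT * FROM TD_BincodeFit (
ON my_input AS InputTable
USING
TargetColumns ('age')
MethodType ('equal-width')
NBins (3)
) AS dt
) WITH DATA;

SELECT * FROM TD_BincodeTransform (
ON my_input AS InputTable
ON my_fit AS FitTable DIMENSION
USING
Accumulate ('id')
) AS dt;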

Antiselect
Antiselect returns all columns except those specified in the Exclude syntax element.

Note:

• This function requires the UTF8 client character set for UNICODE data.

Antiselect Syntax
SELECT * FROM Antiselect (
ON { table | view | (query) }
USING
Exclude ({ 'exclude_column' | exclude_column_range }[,...])
) AS alias;

Antiselect Syntax Elements


Exclude
Specify the names of the input table columns to exclude from the output table. Column
names must be valid object names, which are defined in Teradata Vantage™ - SQL
Fundamentals, B035-1141.
The exclude_column is a column name. This is the syntax of exclude_column_range:

'start_column:end_column' [, '-exclude_in-range_column' ]

The range includes its endpoints.


The start_column and end_column can be:
• Column names (for example, 'column1:column2')
Column names must contain only letters in the English alphabet, digits, and special
characters. If a column name includes any special characters, surround the column
name with double quotation marks. For example, if the column name is a*b, specify it as
"a*b". A column name cannot contain a double quotation mark.
• Nonnegative integers that represent the indexes of columns in the table (for
example, '[0:4]')
The first column has index 0; therefore, '[0:4]' specifies the first five columns in
the table.
• Empty. For example:
◦ '[:4]' specifies all columns up to and including the column with index 4.
◦ '[4:]' specifies the column with index 4 and all columns after it.
◦ '[:]' specifies all columns in the table.
The exclude_in-range_column is a column in the specified range, represented by either its
name or its index (for example, '[0:99]', '-[50]', '-column10' specifies the columns
with indexes 0 through 99, except the column with index 50 and column10).
Column ranges cannot overlap, and cannot include any specified exclude_column.
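For example, the following sketch (using the antiselect_test table from the examples later in this section) excludes the first five columns except the column with index 2 (orderdate), which is kept:

SELECT * FROM Antiselect (
ON antiselect_test
USING
Exclude ('[0:4]', '-[2]')
) AS dt;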

Antiselect Input
The input table can have any schema.

Antiselect Output
The output table has all input table columns except those specified by the Exclude syntax element.

Antiselect Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

Antiselect Example: No Column Ranges

Input
The input table, antiselect_test, is a sample set of sales data containing 13 columns.

antiselect_test
sno id  orderdate           priority      qty sales   disct dmode          custname           province region  custsegment    prodcat
--- --- ------------------- ------------- --- ------- ----- -------------- ------------------ -------- ------- -------------- ---------------
1   3   2010-10-13 00:00:00 Low           6   261.54  0.04  Regular Air    Muhammed MacIntyre Nunavut  Nunavut Small Business Office Supplies
49  293 2012-10-01 00:00:00 High          49  10123   0.07  Delivery Truck Barry French       Nunavut  Nunavut Consumer       Office Supplies
50  293 2012-10-01 00:00:00 High          27  244.57  0.01  Regular Air    Barry French       Nunavut  Nunavut Consumer       Office Supplies
80  483 2011-07-10 00:00:00 High          30  4965.76 0.08  Regular Air    Clay Rozendal      Nunavut  Nunavut Corporate      Technology
85  515 2010-08-28 00:00:00 Not specified 19  394.27  0.08  Regular Air    Carlos Soltero     Nunavut  Nunavut Consumer       Office Supplies
86  515 2010-08-28 00:00:00 Not specified 21  146.69  0.05  Regular Air    Carlos Soltero     Nunavut  Nunavut Consumer       Furniture
97  613 2011-06-17 00:00:00 High          12  93.54   0.03  Regular Air    Carl Jackson       Nunavut  Nunavut Corporate      Office Supplies

SQL Call

SELECT * FROM Antiselect (
ON antiselect_test
USING
Exclude ('id', 'orderdate', 'disct', 'province', 'custsegment')
) AS dt ORDER BY 1, 4;

Output
sno priority      qty sales                 dmode          custname           region  prodcat
--- ------------- --- --------------------- -------------- ------------------ ------- ---------------
1   Low           6   2.61540000000000E+002 Regular Air    Muhammed MacIntyre Nunavut Office Supplies
49  High          49  1.01230000000000E+004 Delivery Truck Barry French       Nunavut Office Supplies
50  High          27  2.44570000000000E+002 Regular Air    Barry French       Nunavut Office Supplies
80  High          30  4.96576000000000E+003 Regular Air    Clay Rozendal      Nunavut Technology
85  Not specified 19  3.94270000000000E+002 Regular Air    Carlos Soltero     Nunavut Office Supplies
86  Not specified 21  1.46690000000000E+002 Regular Air    Carlos Soltero     Nunavut Furniture
97  High          12  9.35400000000000E+001 Regular Air    Carl Jackson       Nunavut Office Supplies

Antiselect Example: Column Range

Input
The input table is antiselect_test, as in Antiselect Example: No Column Ranges.

SQL Call

SELECT * FROM Antiselect (
ON antiselect_test
USING
Exclude ('id', '[2:3]', 'custname:prodcat')
) AS dt ORDER BY 1, 4;

Output
sno qty sales                 disct dmode
--- --- --------------------- ----- --------------
1   6   2.61540000000000E+002 0.04  Regular Air
49  49  1.01230000000000E+004 0.07  Delivery Truck
50  27  2.44570000000000E+002 0.01  Regular Air
80  30  4.96576000000000E+003 0.08  Regular Air
85  19  3.94270000000000E+002 0.08  Regular Air
86  21  1.46690000000000E+002 0.05  Regular Air
97  12  9.35400000000000E+001 0.03  Regular Air

TD_BinCodeFit
TD_BinCodeFit outputs a table of information to input to TD_BinCodeTransform, which bin-codes the
specified input table columns.

Bin-coding is typically used to convert numeric data to categorical data by binning the numeric data into
multiple numeric bins (intervals). The bins can have a fixed width with auto-generated labels or can have
specified variable widths and labels.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_BinCodeFit Syntax
For Equal-Width Bins with Generated Labels

CREATE TABLE primary_output_table AS (


SELECT * FROM TD_BincodeFit (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable (output_table) ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
MethodType ('equal-width')
NBins ('single bin value' | 'separate bin values')
[ LabelPrefix ('single prefix' | 'separate prefix values') ]
) AS alias
) WITH DATA;

For Variable-Width Bins with Specified Labels

CREATE TABLE primary_output_table AS (


SELECT * FROM TD_BincodeFit (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS FitInput DIMENSION
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
MethodType ('variable-width')
[ MinValueColumn ('minvalue_column') ]
[ MaxValueColumn ('maxvalue_column') ]
[ LabelColumn ('label_column') ]
[ TargetColNames ('target_names_column') ]
) AS alias
) WITH DATA;

TD_BinCodeFit Syntax Elements


OutputTable
[MethodType ('equal-width') only.] [Optional] Specify a name for the secondary output table
which contains the actual bins used by the BincodeTransform function.

TargetColumns
Specify the names of the InputTable columns to bin-code.
The maximum number of target columns is 2018.

MethodType
Specify the bin-coding method:
MethodType     Description
-------------- ------------------------------------------------------------------------
equal-width    Bins have fixed width with auto-generated labels.
               Function finds minimum and maximum values of target columns (min and
               max) and computes bin width, w, with this formula:
               w = (max - min)/k
               k, the number of bins, is determined by NBins.
               For bin boundaries and names of generated labels, see LabelPrefix.
variable-width Bins have specified variable widths and specified labels, provided by
               FitInput table.
               Maximum number of bins is 3000.

NBins
[MethodType ('equal-width') only.] Specify either a single bin value for all the target columns
or separate bin values for each of the target columns.

LabelPrefix
[MethodType ('equal-width') only.][Optional] Specify either a prefix for all the target columns
or a separate prefix for each of the target columns.
Lower Bin Boundary Upper Bin Boundary prefix-Generated Bin Label prefix_count-Generated Bin Label
------------------ ------------------ -------------------------- --------------------------------
min                min + w            prefix_1                   target_column_1
min + w            min + 2w           prefix_2                   target_column_2
...                ...                ...                        ...
min + (k - 1)w     min + kw           prefix_k                   target_column_k
Default: Target column name is used as the prefix
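For example, the following equal-width sketch (assuming the bin_titanic_train table from the example later in this section, where the displayed sample's age values run from 16 to 80) computes w = (80 - 16)/4 = 16, so each of the four generated bins spans 16 years and is labeled age_bin_1 through age_bin_4 (the output table names are hypothetical):

CREATE TABLE FitOutputTableEW AS (
SELECT * FROM TD_BincodeFit (
ON bin_titanic_train AS InputTable
OUT TABLE OutputTable (bincode_bins_ew)
USING
TargetColumns ('age')
MethodType ('equal-width')
NBins (4)
LabelPrefix ('age_bin')
) AS dt
) WITH DATA;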

MinValueColumn
[MethodType ('variable-width') only.][Optional] Specify the name of the FitInput column that
has the minimum value of the bin (lower bin boundaries).
Default: MinValue

MaxValueColumn
[MethodType ('variable-width') only.][Optional] Specify the name of the FitInput column that
has the maximum value of the bin (upper bin boundaries).
Default: MaxValue

LabelColumn
[MethodType ('variable-width') only.][Optional] Specify the name of the FitInput column that
has the bin labels.
Default: Label

TargetColNames
[MethodType ('variable-width') only.][Optional] Specify the name of the FitInput column that
has the target column names.

Note:
Column ranges are not supported.

TD_BinCodeFit Input
InputTable Schema
Column Data Type Description

target_column BYTEINT, SMALLINT, INTEGER, BIGINT, DECIMAL/NUMERIC, Column to bin-code.
              FLOAT, REAL, or DOUBLE PRECISION


FitTable Schema
Required when you specify MethodType ('variable-width'); ignored otherwise.

Column              Data Type                           Description
------------------- ----------------------------------- -------------------------
target_names_column CHAR or VARCHAR                     Bin column name.
                    (CHARACTER SET LATIN or UNICODE)
minvalue_column     BYTEINT, SMALLINT, INTEGER, BIGINT, Minimum value of the bin.
                    DECIMAL/NUMERIC, FLOAT, REAL, or
                    DOUBLE PRECISION
maxvalue_column     BYTEINT, SMALLINT, INTEGER, BIGINT, Maximum value of the bin.
                    DECIMAL/NUMERIC, FLOAT, REAL, or
                    DOUBLE PRECISION
label_column        CHAR or VARCHAR                     Bin labels.
                    (CHARACTER SET LATIN or UNICODE)

TD_BinCodeFit Output
Output Table Schema
Column                Data Type               Description
--------------------- ----------------------- --------------------------------------------
TD_ColumnName_BINFIT  VARCHAR                 target_names_column. Bin column name.
                      (CHARACTER SET UNICODE)
TD_MinValue_BINFIT    DOUBLE PRECISION        minvalue_column. Minimum value of the bin.
TD_MaxValue_BINFIT    DOUBLE PRECISION        maxvalue_column. Maximum value of the bin.
TD_LabelPrefix_BINFIT VARCHAR                 [Column appears only with MethodType
                      (CHARACTER SET UNICODE) ('equal-width').] Label prefix.
TD_Label_BINFIT       VARCHAR                 [Column appears only with MethodType
                      (CHARACTER SET UNICODE) ('variable-width').] Bin label.
TD_Bins_BINFIT        INTEGER                 target_names_column. Bin count.
TD_IndexValue_BINFIT  SMALLINT                Index value.
TD_MaxLenLabel_BINFIT SMALLINT                Maximum bin label length.
target_column         Same as in Input table  Target column for TD_BinCodeTransform.


OutputTable Schema
The function outputs this secondary output table only if you specify MethodType ('equal-width').

Column     Data Type                       Description
---------- ------------------------------- -----------------------------
ColumnName VARCHAR (CHARACTER SET UNICODE) The column name of the bin.
MinValue   DOUBLE PRECISION                The minimum value of the bin.
MaxValue   DOUBLE PRECISION                The maximum value of the bin.
label      VARCHAR (CHARACTER SET UNICODE) The label of the bin.

TD_BinCodeFit Example
InputTable: bin_titanic_train
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass name                                 sex    age sibsp parch ticket   fare         cabin       embarked
--------- -------- ------ ------------------------------------ ------ --- ----- ----- -------- ------------ ----------- --------
97        0        1      Goldschmidt; Mr. George B            male   71  0     0     PC 17754 34.654200000 A5          C
488       0        1      Kent; Mr. Edward Austin              male   58  0     0     11771    29.700000000 B37         C
505       1        1      Maioni; Miss. Roberta                female 16  0     0     110152   86.500000000 B79         S
631       1        1      Barkworth; Mr. Algernon Henry Wilson male   80  0     0     27042    30.000000000 A23         S
873       0        1      Carlsson; Mr. Frans Olof             male   33  0     0     695      5.000000000  B51 B53 B55 S

Input: FitInput table

ColumnName MinValue MaxValue Label
---------- -------- -------- ----------
age        0.00     20.00    Young Age
age        21.00    45.00    Middle Age
age        46.00    90.00    Old Age

SQL Call

CREATE TABLE FitOutputTable AS (
SELECT * FROM TD_BincodeFit (
ON bin_titanic_train AS InputTable
ON FitInputTable AS FitInput DIMENSION
USING
TargetColumns ('age')
MethodType ('Variable-Width')
MinValueColumn ('MinValue')
MaxValueColumn ('MaxValue')
LabelColumn ('Label')
TargetColNames ('ColumnName')
) AS dt
) WITH DATA;

Output

TD_ColumnName_BINFIT TD_MinValue_BINFIT TD_MaxValue_BINFIT TD_Label_BINFIT TD_Bins_BINFIT TD_IndexValue_BINFIT TD_MaxLenLabel_BINFIT age
-------------------- ------------------ ------------------ --------------- -------------- -------------------- --------------------- ----
age                  46.000000000       90.000000000       Old Age         3              0                    10                    null
age                  21.000000000       45.000000000       Middle Age      3              0                    10                    null
age                  0.000000000        20.000000000       Young Age       3              0                    10                    null

TD_BinCodeTransform
TD_BinCodeTransform bin-codes input table columns, using TD_BinCodeFit output.


Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_BinCodeTransform Syntax
SELECT * FROM TD_BincodeTransform (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

TD_BinCodeTransform Syntax Elements


Accumulate
[Optional] Specify names of InputTable columns to copy to the output table.

TD_BinCodeTransform Input
InputTable Schema
See TD_BinCodeFit Input.

FitTable Schema
See TD_BinCodeFit Output.

TD_BinCodeTransform Output
Output Table Schema
Column            Data Type                       Description
----------------- ------------------------------- ----------------------------------------
accumulate_column Same as in InputTable           Column copied from InputTable.
target_column     VARCHAR (CHARACTER SET UNICODE) [Column appears once for each specified
                                                  target_column.] Bin labels.


TD_BinCodeTransform Example
• InputTable: bin_titanic_train, as in TD_BinCodeFit Example
• FitTable: FitOutputTable, created by TD_BinCodeFit Example

SQL Call

SELECT * FROM TD_BincodeTransform (
ON bin_titanic_train AS InputTable
ON FitOutputTable AS FitTable DIMENSION
USING
Accumulate ('passenger')
) AS dt;

Output

passenger age
--------- ----------
873 Middle Age
631 Old Age
505 Young Age
488 Old Age
97 Old Age

TD_ColumnTransformer
The TD_ColumnTransformer function transforms input table columns in a single operation. You provide
the FIT tables to the function, and the function runs all of the required transformations in one pass.
The function performs the following transformations:
• TD_Scale Transform
• TD_Bincode Transform
• TD_Function Transform
• TD_NonLinearCombine Transform
• TD_OutlierFilter Transform
• TD_PolynomialFeatures Transform
• TD_RowNormalize Transform
• TD_OrdinalEncoding Transform
• TD_OneHotEncoding Transform
• TD_SimpleImpute Transform


You must create the FIT tables before using the function, and you must provide the FIT tables in the
same order as in the training data sequence to transform the dataset. Each FIT table can have a
maximum of 128 columns.

Note:
The TD_BincodeFit function has a maximum of 5 columns when using the variable-width method.

TD_ColumnTransformer Syntax
SELECT * FROM TD_ColumnTransformer (
ON { table | view | (query) } AS InputTable
[ ON { table | view | (query) } AS BincodeFitTable DIMENSION ]
[ ON { table | view | (query) } AS FunctionFitTable DIMENSION ]
[ ON { table | view | (query) } AS NonLinearCombineFitTable DIMENSION ]
[ ON { table | view | (query) } AS OneHotEncodingFitTable DIMENSION ]
[ ON { table | view | (query) } AS OrdinalEncodingFitTable DIMENSION ]
[ ON { table | view | (query) } AS OutlierFilterFitTable DIMENSION ]
[ ON { table | view | (query) } AS PolynomialFeaturesFitTable DIMENSION ]
[ ON { table | view | (query) } AS RowNormalizeFitTable DIMENSION ]
[ ON { table | view | (query) } AS ScaleFitTable DIMENSION ]
[ ON { table | view | (query) } AS SimpleImputeFitTable DIMENSION ]
USING
[ FillRowIDColumnName ('output_column_name') ]
) AS dt;

TD_ColumnTransformer Syntax Elements


FillRowIdColumnName
[Optional] Specify a name for the output column in which unique identifiers for each row
are populated.
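For example, the following sketch (reusing the getCabin and ScaleFit tables from the example later in this section; the column name row_id is hypothetical) adds a generated row identifier to the transformed output:

select * from TD_ColumnTransformer(
On getCabin as inputtable
on ScaleFit as ScaleFitTable dimension
USING
FillRowIDColumnName ('row_id')
) as dt;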

TD_ColumnTransformer Input
Column       Data Type                       Description
------------ ------------------------------- ---------------------------------------------------
TargetColumn CHAR or VARCHAR for categorical The input table columns that require transformation
             columns; INTEGER, REAL,         based on the FIT table.
             DECIMAL, or NUMBER for          Functions with categorical columns:
             numeric columns                 • TD_OrdinalEncoding Fit
                                             • TD_OneHotEncoding Fit
                                             Functions with numeric columns:
                                             • TD_Scale Fit
                                             • TD_Bincode Fit
                                             • TD_Function Fit
                                             • TD_NonLinearCombine Fit
                                             • TD_OutlierFilter Fit
                                             • TD_PolynomialFeatures Fit
                                             • TD_RowNormalize Fit
                                             • TD_SimpleImpute Fit

TD_ColumnTransformer Output
Column       Data Type                       Description
------------ ------------------------------- ---------------------------------------------
TargetColumn CHAR or VARCHAR for categorical The columns transformed using the
             columns; INTEGER, REAL,         ColumnTransformer function.
             DECIMAL, or NUMBER for
             numeric columns
otherColumns CHAR or VARCHAR for categorical The columns copied unchanged from input
             columns; INTEGER, REAL,         to output.
             DECIMAL, or NUMBER for
             numeric columns
NewColumns   CHAR or VARCHAR for categorical The following FIT functions produce
             columns; INTEGER, REAL,         new columns:
             DECIMAL, or NUMBER for          • TD_PolynomialFeaturesFit
             numeric columns                 • TD_NonLinearCombineFit
                                             • TD_OneHotEncodingFit
                                             The following functions change the schema of
                                             the target column:
                                             • TD_BincodeFit
                                             • TD_OrdinalEncodingFit

TD_ColumnTransformer Example
Input Table: titanic_train
PassengerID Pclass Name                    Sex Age  SibSp Parch Fare  Cabin Embarked
----------- ------ ----------------------- --- ---- ----- ----- ----- ----- --------
149         2      Navratil, Michael       M   36   0     2     26.0  B21   S
152         1      Pearson, Mrs. Thomas    F   Null 1     0     66.6  C2    S
581         2      Christian, Miss Juliana F   25   1     1     30.0  Null  S
663         1      Collier, Dr. Edwin      M   47   0     0     25.70 A23   S
704         3      Gavin, Mr. Herbert      M   25   0     0     7.74  Null  Q

Create getCabin Input table

drop table getSubtitles;
create multiset table getSubtitles as (
select * from Unpack(
on titanic_train
Using
TargetColumn('Name')
OutputColumns('NTitle')
OutputDatatypes('Varchar')
Delimiter('$')
Regex('([A-Za-z]+)\.')
)as dt)with data;

drop table getCabin;
create multiset table getCabin as (
SELECT * FROM TD_strApply (
ON getSubtitles as inputtable
USING
TargetColumns ('cabin')
StringOperation('getNchars')
StringLength(1)
Accumulate('[:]','-cabin')
InPlace('True')
) as dt)with data;

SQL Call

select * from TD_ColumnTransformer(
On getCabin as inputtable
on imputeFit as SimpleImputeFitTable dimension
on NonLinearCombineFit as NonLinearCombineFitTable dimension
on ordinalFit_Title as OrdinalEncodingFitTable dimension
on ordinalFit_Sex as OrdinalEncodingFitTable dimension
on ordinalFit_Embarked as OrdinalEncodingFitTable dimension
on onehotfittable as OneHotEncodingFitTable dimension
on ScaleFit as ScaleFitTable dimension
)as dt order by 1,2,3,4,5,6,7;

Output

NTitle passenger survived pclass sex age sibsp parch ticket     fare                  embarked cabin FamilySize            cabin_A cabin_B cabin_C cabin_other
------ --------- -------- ------ --- --- ----- ----- ---------- --------------------- -------- ----- --------------------- ------- ------- ------- -----------
-1     8         0        3      1   2   3     1     349909     4.11356604308324E-002 2        ?     5.00000000000000E+000 0       0       0       1
-1     17        0        3      1   2   4     1     382652     5.68482139999047E-002 1        ?     6.00000000000000E+000 0       0       0       1
...    ...       ...      ...    ... ... ...   ...   ...        ...                   ...      ...   ...                   ...     ...     ...     ...
5      888       1        1      2   19  0     0     112053     5.85561002574126E-002 2        B     1.00000000000000E+000 0       1       0       0
5      889       0        3      2   28  1     2     W./C. 6607 4.57713517012109E-002 2        ?     4.00000000000000E+000 0       0       0       1

Comparison of serial processing of the functions to the TD_ColumnTransformer function, based on the size of the data set:

Data Set Serial Processing in Seconds TD_ColumnTransformer Processing in Seconds
-------- ---------------------------- ------------------------------------------
10M      89                           29
20M      167                          49
30M      332                          98

TD_FunctionFit
TD_FunctionFit determines whether specified numeric transformations can be applied to specified input
columns and outputs a table to use as input to TD_FunctionTransform, which does the transformations.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

Related Information:
TD_NumApply

TD_FunctionFit Syntax
CREATE TABLE output_table AS (
SELECT * FROM TD_FunctionFit (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS TransformationTable DIMENSION
) AS alias
) WITH DATA;


TD_FunctionFit Input
InputTable Schema
Column Data Type Description

input_column VARCHAR (CHARACTER SET LATIN Column whose name can appear as
or UNICODE) or NUMERIC TargetColumn in TransformationTable.

TransformationTable Schema
Column         Data Type                        Description
-------------- -------------------------------- --------------------------------------------------------
TargetColumn   VARCHAR                          Name of InputTable column to transform.
               (CHARACTER SET LATIN or UNICODE)
Transformation VARCHAR                          Transformation to apply to TargetColumn—for allowed
               (CHARACTER SET LATIN or UNICODE) transformations, see following table.
Parameters     VARCHAR                          [Optional] Transformation parameters in JSON format.
               (CHARACTER SET LATIN or UNICODE) If this column is absent and the transformation has a
                                                parameter, the function uses the default value in the
                                                Parameter column of the following table.
DefaultValue   NUMERIC                          [Optional] Default value for transformed value if
                                                TargetColumn is nonnumeric or NULL.
                                                If this column is absent, function uses default value 0.

Transformations
Transformation Parameter                         Operation on TargetColumn Value x
-------------- --------------------------------- ---------------------------------
ABS            None                              |x|
CEIL           None                              CEIL(x) (least integer ≥ x)
EXP            None                              e^x (e = 2.718)
FLOOR          None                              FLOOR(x) (greatest integer ≤ x)
LOG            [Optional] {"base": base}         log_base(x)
               Default: e
POW            [Optional] {"exponent": exponent} x^exponent
               Default: 1
SIGMOID        None                              1 / (1 + e^-x)
TANH           None                              (e^x - e^-x) / (e^x + e^-x)
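The example later in this section reads its transformations from a table named transformations. The original does not show how that table is created; a minimal sketch consistent with the example's fit output might look like this (the column lengths and the FLOAT type are assumptions):

CREATE TABLE transformations (
TargetColumn VARCHAR(128) CHARACTER SET UNICODE,
Transformation VARCHAR(32) CHARACTER SET UNICODE,
Parameters VARCHAR(128) CHARACTER SET UNICODE,
DefaultValue FLOAT
);
INSERT INTO transformations VALUES ('age', 'LOG', '{"base":2}', 0);
INSERT INTO transformations VALUES ('fare', 'POW', '{"exponent": 2}', 10);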

TD_FunctionFit Output
Output Table Schema
Same as TransformationTable schema (see TD_FunctionFit Input).

TD_FunctionFit Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
InputTable: function_input_table

passenger survived pclass name                                                 sex    age sibsp parch ticket           fare         cabin embarked
--------- -------- ------ ---------------------------------------------------- ------ --- ----- ----- ---------------- ------------ ----- --------
1         0        3      Braund; Mr. Owen Harris                              male   22  1     0     A/5 21171        7.250000000        S
2         1        1      Cumings; Mrs. John Bradley (Florence Briggs Thayer)  female 38  1     0     PC 17599         71.283300000 C85   C
3         1        3      Heikkinen; Miss. Laina                               female 26  0     0     STON/O2. 3101282 7.925000000        S
4         1        1      Futrelle; Mrs. Jacques Heath (Lily May Peel)         female 35  1     0     113803           53.100000000 C123  S
5         0        3      Allen; Mr. William Henry                             male   35  0     0     373450           8.050000000        S

SQL Call

CREATE TABLE fit_out AS (
SELECT * FROM TD_FunctionFit (
ON function_input_table AS InputTable
ON transformations AS TransformationTable DIMENSION
) AS dt
) WITH DATA;


Output

TargetColumn Transformation Parameters Defaultvalue


------------ -------------- --------------- ------------
age LOG {"base":2} 0.000000000
fare POW {"exponent": 2} 10.000000000

TD_FunctionTransform
TD_FunctionTransform applies numeric transformations to input columns, using TD_FunctionFit output.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_FunctionTransform Syntax
SELECT * FROM TD_FunctionTransform (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ IDColumns ({ 'id_column' | id_column_range }[,...])]
) AS alias;

TD_FunctionTransform Syntax Elements


IDColumns
[Optional] Specify the names of the InputTable columns with NUMERIC datatypes to exclude
from transformations. The columns with VARCHAR datatypes are automatically excluded.
No id_column can be a target_column in the TransformationTable for the TD_FunctionFit call
that output FitTable.


TD_FunctionTransform Input
InputTable Schema
See TD_FunctionFit Input.

FitTable Schema
See TD_FunctionFit Output.

TD_FunctionTransform Output
Output Table Schema
Column Data Type Description

input_column If NUMERIC in TD_FunctionFit InputTable: DOUBLE PRECISION Transformed values.


Otherwise: Same as in TD_FunctionFit InputTable

TD_FunctionTransform Example
Input
• InputTable: function_input_table, as in TD_FunctionFit Example
• FitTable: fit_out, created by TD_FunctionFit Example

SQL Call

SELECT * FROM TD_FunctionTransform (
ON function_input_table AS InputTable
ON fit_out AS FitTable DIMENSION
USING
IDColumns ('[0:2]','[6:7]')
) AS dt ORDER BY Passenger;

Output

passenger survived pclass name                                                 sex    age                   sibsp parch ticket           fare                  cabin embarked
--------- -------- ------ ---------------------------------------------------- ------ --------------------- ----- ----- ---------------- --------------------- ----- --------
1         0        3      Braund; Mr. Owen Harris                              male   4.45943161863730E+000 1     0     A/5 21171        5.25625000000000E+001       S
2         1        1      Cumings; Mrs. John Bradley (Florence Briggs Thayer)  female 5.24792751344359E+000 1     0     PC 17599         5.08130885889000E+003 C85   C
3         1        3      Heikkinen; Miss. Laina                               female 4.70043971814109E+000 0     0     STON/O2. 3101282 6.28056250000000E+001       S
4         1        1      Futrelle; Mrs. Jacques Heath (Lily May Peel)         female 5.12928301694497E+000 1     0     113803           2.81961000000000E+003 C123  S
5         0        3      Allen; Mr. William Henry                             male   5.12928301694497E+000 0     0     373450           6.48025000000000E+001       S

TD_NonLinearCombineFit
The TD_NonLinearCombineFit function returns the target columns and a specified formula that defines a
nonlinear combination of the existing features.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_NonLinearCombineFit Syntax
SELECT * FROM TD_NonLinearCombineFit (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable(output_table_name) ]
USING
TargetColumns ({'target_column' | 'target_column_range'}[,...])
Formula ('Y = <expression>')
ResultColumn ('result_column')
) as alias;

TD_NonLinearCombineFit Syntax Elements


TargetColumns
[Required] Specify the input table column names to run the non-linear combination.

Formula
[Required] Specify the formula. See the Arithmetic, Trigonometric, Hyperbolic Operators/
Functions section in the SQL Functions, Operators, Expressions, and Predicates Guide.

ResultColumn
[Required] Specify the name of the new feature column generated by the Transform function.
The Fit function saves the specified formula in this column.
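As the example later in this section illustrates, the formula refers to the target columns positionally: X0 is the first column in the TargetColumns list, X1 the second, and so on, and Y is the result. For instance, Formula ('Y=(X0+X1+1)*X2') with TargetColumns ('SibSp', 'Parch', 'Fare') computes (SibSp + Parch + 1) * Fare for each row.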

TD_NonLinearCombineFit Input
Input Table Schema
Column Data Type Description

TargetColumns BYTEINT,SMALLINT,INTEGER, The input table column names to use in the


BIGINT, Decimal/Numeric,Float,Real, non-linear combination.
Double precision

TD_NonLinearCombineFit Output
Output Table Schema
Column Data Type Description

ResultColumn VARCHAR CHARACTER The Fit function saves the specified formula in
SET UNICODE this column.

TargetColumns Same as Input The specified target columns displayed as


NULL values.

Teradata Vantage™ - Analytics Database Analytic Functions - 17.20,


Release 17.20 137
4: Feature Engineering Transform Functions

TD_NonLinearCombineFit Example
Input table

passenger survived pclass  sex    age sibsp parch fare         cabin embarked
--------- -------- ------- ------ --- ----- ----- ------------ ----- --------
1         0        General male   22  1     0     7.250000000  null  S
2         1        Deluxe  female 38  1     1     71.280000000 C85   C
3         1        General female 26  0     0     7.930000000  null  S
4         1        Deluxe  female 35  1     0     53.100000000 C123  S
5         0        General male   35  0     1     8.050000000  null  S

SQL Call

SELECT * FROM TD_NonLinearCombineFit (
ON nonLinearCombineFit_input AS InputTable
OUT TABLE FitTable (nonLinearCombineFit_output)
USING
TargetColumns ('SibSp', 'Parch', 'Fare')
Formula ('Y=(X0+X1+1)*X2')
ResultColumn ('TotalCost')
) as dt order by 1;

Output Table

total_cost sibsp parch fare


-------------- ----- ----- ----
Y=(X0+X1+1)*X2 null null null

TD_NonLinearCombineTransform
TD_NonLinearCombineTransform generates the values of the new feature using the specified formula from
the TD_NonLinearCombineFit function output.

Teradata Vantage™ - Analytics Database Analytic Functions - 17.20,


Release 17.20 138
4: Feature Engineering Transform Functions

TD_NonLinearCombineTransform Syntax
SELECT * FROM TD_NonLinearCombineTransform (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ Accumulate ({'accumulate_column' | 'accumulate_column_range'}[,...]) ]
) as alias;

TD_NonLinearCombineTransform Syntax Elements


Accumulate
[Optional] Specify the input table column names to copy to the output table.

TD_NonLinearCombineTransform Input
Input Table Schema
Column Data Type Description

TargetColumns BYTEINT,SMALLINT, The input table column names to use in the


INTEGER,BIGINT, Decimal non-linear combination.
/Numeric,Float,Real,
Double precision

AccumulateColumns ANY The input table column names to copy to the


output table.

TD_NonLinearCombineTransform Output
Output Table Schema
Column Data Type Description

AccumulateColumns Same as Input The specified column names in the Accumulate element copied
to the output table.

ResultColumn REAL The values calculated using the specified formula are displayed.

TD_NonLinearCombineTransform Example
InputTable
See Input table and Output table sections of TD_NonLinearCombineFit Example


SQL Call

SELECT * FROM TD_NonLinearCombineTransform (
ON nonLinearCombineFit_input AS InputTable
ON nonLinearCombineFit_output AS FitTable DIMENSION
USING
Accumulate('Passenger')
) as dt order by 1;

Output Table

passenger TotalCost
--------- -------------
1 14.50000
2 213.84000
3 7.93000
4 106.20000
5 16.10000

TD_OneHotEncodingFit
TD_OneHotEncodingFit outputs a table of attributes and categorical values to input to
TD_OneHotEncodingTransform, which encodes them as one-hot numeric vectors.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_OneHotEncodingFit Syntax
CREATE TABLE fit_table AS (
SELECT * FROM TD_OneHotEncodingFit (
ON { table | view | (query) } AS InputTable
[ PARTITION BY { ANY [ ORDER BY order_column ] | attribute_column } ]
USING
IsInputDense ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
{ for_dense_input | for_sparse_input }
) AS alias
) WITH DATA;

for_dense_input

TargetColumn ('target_column')
CategoricalValues ('categorical_value' [,...])
[ OtherColumnName ('other_column') ]

for_sparse_input

AttributeColumn ('attribute_column')
ValueColumn ('value_column')
TargetAttributes ('target_attribute' [,...])
[ OtherAttributeNames ('other_attribute' [,...]) ]

TD_OneHotEncodingFit Syntax Elements


IsInputDense
Specify whether the input is in dense format.

TargetColumn
[Required with IsInputDense ('true'), disallowed otherwise.] Specify the name of the
InputTable column of categorical values.

CategoricalValues
[Required with IsInputDense ('true'), disallowed otherwise.] Specify one or more categorical
values in target_column to encode in one-hot form.

OtherColumnName
[Optional with IsInputDense ('true'), disallowed otherwise.] Specify a category name for
values that CategoricalValues does not specify (categorical values not to encode in one-
hot form).
Default: 'other'

AttributeColumn
[Required with IsInputDense ('false'), disallowed otherwise.] Specify the name of the
InputTable column of attributes.

Teradata Vantage™ - Analytics Database Analytic Functions - 17.20,


Release 17.20 141
4: Feature Engineering Transform Functions

ValueColumn
[Required with IsInputDense ('false'), disallowed otherwise.] Specify the name of the
InputTable column of attribute values.

TargetAttributes
[Required with IsInputDense ('false'), disallowed otherwise.] Specify one or more attributes
to encode in one-hot form. Every target_attribute must be in attribute_column.

OtherAttributeNames
[Optional with IsInputDense ('false'), disallowed otherwise.] For each target_attribute,
specify a category name (other_attribute) for attributes that TargetAttributes does not specify.
The nth other_attribute corresponds to the nth target_attribute.
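A minimal sparse-input sketch, assuming a hypothetical table sparse_input with an attribute column attr and a value column val:

SELECT * FROM TD_OneHotEncodingFit (
ON sparse_input AS InputTable PARTITION BY attr
USING
IsInputDense ('false')
AttributeColumn ('attr')
ValueColumn ('val')
TargetAttributes ('sex')
OtherAttributeNames ('sex_other')
) AS dt;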

TD_OneHotEncodingFit Input
InputTable Schema for Dense Input
Column Data Type Description

target_column CHAR or VARCHAR Categorical values.


(CHARACTER SET LATIN or UNICODE)

InputTable Schema for Sparse Input


Column Data Type Description

attribute_column CHAR or VARCHAR Attribute names.


(CHARACTER SET LATIN or UNICODE)

value_column CHAR or VARCHAR Attribute values.


(CHARACTER SET LATIN or UNICODE)

TD_OneHotEncodingFit Output
Output Table Schema for Dense Input
Column                          Data Type               Description
------------------------------- ----------------------- -------------------------------------------------
target_column                   INTEGER                 Preserves target column name for
                                                        TD_OneHotEncodingTransform. Contains only
                                                        NULL values.
target_column_categorical_value VARCHAR                 Preserves target column definition for
                                (CHARACTER SET UNICODE) TD_OneHotEncodingTransform.
target_column_other_column      VARCHAR                 Categorical values not to encode in one-hot form.
                                (CHARACTER SET UNICODE)

Output Table Schema for Sparse Input

Column                          Data Type               Description
------------------------------- ----------------------- ------------------------------------------------------
TD_attribute_column_type_OHEFIT INTEGER                 1 if row has attribute_column-target_attribute pair.
                                                        0 if row has attribute_column-other_attribute pair.
attribute_column                VARCHAR                 Preserves attribute column name for
                                (CHARACTER SET UNICODE) TD_OneHotEncodingTransform. Contains only NULL values.
value_column                    VARCHAR                 Preserves value column name for
                                (CHARACTER SET UNICODE) TD_OneHotEncodingTransform. Contains only NULL values.

TD_OneHotEncodingFit Example
InputTable: onehotencodingfit_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

Passenger Survived Pclass Name Age Sex


--------- -------- ------ ------------------ --- ------
1 0 3 Mr. Owen Harris 22 Male
2 1 1 Mrs. John Bradley 38 Female
3 1 3 Mrs. Laina 26 Female
4 0 3 Mrs. Jacques Heath 35 Female

SQL Call

CREATE TABLE onehotencodingfit_output AS (
SELECT * FROM TD_OneHotEncodingFit (
ON onehotencodingfit_input AS InputTable
USING
TargetColumn ('Sex')
OtherColumnName ('other')
CategoricalValues ('male', 'female')
IsInputDense ('true')
) AS dt
) WITH DATA;

Output

Sex Sex_male Sex_female Sex_other


---- -------- ---------- ---------
null male female null

TD_OneHotEncodingTransform
TD_OneHotEncodingTransform encodes specified attributes and categorical values as one-hot numeric
vectors, using TD_OneHotEncodingFit output.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_OneHotEncodingTransform Syntax
SELECT * FROM TD_OneHotEncodingTransform (
ON { table | view | (query) } AS InputTable
[ PARTITION BY { ANY [ ORDER BY order_column ] | attribute_column } ]
ON { table | view | (query) } AS FitTable
{ DIMENSION | PARTITION BY attribute_column }
USING
IsInputDense ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
) AS alias;

TD_OneHotEncodingTransform Syntax Elements


IsInputDense
Specify whether the input is in dense format.


TD_OneHotEncodingTransform Input
InputTable Schema
See TD_OneHotEncodingFit Input.

FitTable Schema
See TD_OneHotEncodingFit Output.

TD_OneHotEncodingTransform Output
Output Table Schema for Dense Input
Column Data Type Description

input_column Same as in InputTable Column copied from InputTable.

target_column_categorical_value INTEGER One-hot-encoded categorical values.

Output Table Schema for Sparse Input


Column Data Type Description

input_column Same as in InputTable Column copied from InputTable.

attribute_column CHAR or VARCHAR Attribute names.


(CHARACTER SET UNICODE)

value_column CHAR or VARCHAR One-hot-encoded attribute values.


(CHARACTER SET UNICODE)

TD_OneHotEncodingTransform Example
• InputTable: onehotencoding_input, as in TD_OneHotEncodingFit Example
• FitTable: onehotencodingfit_output, created by TD_OneHotEncodingFit Example

SQL Call

SELECT * FROM TD_OneHotEncodingTransform (
ON onehotencoding_input AS InputTable PARTITION BY ANY
ON onehotencodingfit_output AS FitTable DIMENSION
USING
IsInputDense ('true')
) AS dt;


Output

Passenger Survived Pclass Name Age Sex_male Sex_female Sex_other


--------- -------- ------ ------------------ --- -------- ---------- ---------
1 0 3 Mr. Owen Harris 22 1 0 0
4 0 3 Mrs. Jacques Heath 35 0 1 0
3 1 3 Mrs. Laina 26 0 1 0
2 1 1 Mrs. John Bradley 38 0 1 0

TD_OrdinalEncodingFit
The TD_OrdinalEncodingFit function identifies distinct categorical values from the input table or a user-
defined list and returns the distinct categorical values along with the ordinal value for each category.

TD_OrdinalEncodingFit Syntax
SELECT * FROM TD_ORDINALENCODINGFIT (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable (output_table_name) ]
USING
TargetColumn ('target_column')
[{
Approach ('LIST')
Categories ('category'[,...])
[ OrdinalValues (ordinal_value[,...]) ]
}|
{
[ Approach ('AUTO') ]
}]
[ StartValue (start_value) ]
[ DefaultValue (default_value) ]
)as alias;

TD_OrdinalEncodingFit Syntax Elements


TargetColumn
[Required] Specify the categorical column name of the input table.

Note:
Only one column is supported.


Approach
[Optional] Specify AUTO to obtain categories from the input table or specify LIST to obtain
categories from the user.
Default value: AUTO

Categories
[Required, when you use the LIST approach] Specify the list of categories for encoding in the
required order.

Note:
The maximum length supported for the categorical value is 128 characters.

OrdinalValues
[Optional] Specify the custom ordinal values when you use the LIST approach for encoding
the categorical values.
If you do not provide the ordinal value and the start value, then by default, the first category
contains the default start value 0, and the last category is assigned a value that is one lesser
than the total number of categories.
For example, if there are three categories, then the categories contain the values 0, 1,
2 respectively.
However, if you only specify the ordinal values, then each ordinal value is associated with a
categorical value. For example, if there are three categories and the ordinal values are 3, 4,
5 then the ordinal values are assigned to the respective categories.
The TD_OrdinalEncodingFit function returns an error when the ordinal value count does
not match the categorical value count or if both the ordinal values and the start value
are provided.

Note:
You can either use the OrdinalValues or the StartValue argument in the syntax.
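For example, a LIST-approach sketch (assuming the titanic_dataset table from the example later in this section, whose embarked column holds values such as C, Q, and S):

SELECT * FROM TD_OrdinalEncodingFit (
ON titanic_dataset AS InputTable
USING
TargetColumn ('embarked')
Approach ('LIST')
Categories ('C', 'Q', 'S')
OrdinalValues (10, 20, 30)
) as dt;

This maps C to 10, Q to 20, and S to 30.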

StartValue
[Optional] Specify the starting value for the ordinal values list.
Default value: 0

DefaultValue
[Optional] Specify the ordinal value to use when the categorical value is not found.

The TD_OrdinalEncodingFit function adds the TD_OTHER_CATEGORY row and assigns the specified
value to the row in the FIT table output.
If you specify the default value in the TD_OrdinalEncodingFit function and if the
TD_OrdinalEncodingTransform function does not find the categorical value in the FIT table,
then the function assigns the TD_OTHER_CATEGORY value to the missing category.
If you do not specify the default value in the TD_OrdinalEncodingFit function and if the
TD_OrdinalEncodingTransform function does not find the categorical value in the FIT table,
then the TD_OrdinalEncodingTransform function returns an error.

TD_OrdinalEncodingFit Input
Input Table Schema
Column Data Type Description

TargetColumn CHAR or VARCHAR CHARACTER SET The input table column name for encoding
LATIN/UNICODE the categorical values.

TD_OrdinalEncodingFit Output
Output Table Schema
Column Data Type Description

TargetColumn VARCHAR CHARACTER The distinct categorical values from the input table or
SET UNICODE the user-defined list.

TD_VALUE_ORDFIT INTEGER The corresponding ordinal value for each category.

TD_OrdinalEncodingFit Example
Input: titanic_dataset

passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ------ --- ----- ----- ------------ ----- --------
1 0 3 male 22 1 0 7.250000000 null S
2 1 1 female 38 1 0 71.283300000 C85 C
3 1 3 female 26 0 0 7.925000000 null S
4 1 1 female 35 1 0 53.100000000 C123 S
5 0 3 male 35 0 0 8.050000000 null S


SQL Call

SELECT * FROM TD_OrdinalEncodingFit (
ON titanic_dataset AS InputTable
OUT table outputtable (titanic_fit_output)
USING
TargetColumn ('sex')
DefaultValue (-1)
) as dt;

Output Table

sex TD_VALUE_ORDFIT
----------------- ---------------
female 0
male 1
TD_OTHER_CATEGORY -1

TD_OrdinalEncodingTransform
The TD_OrdinalEncodingTransform function maps the categorical value to a specified ordinal value using
the TD_OrdinalEncodingFit output.

TD_OrdinalEncodingTransform Syntax
SELECT * FROM TD_OrdinalEncodingTransform (
ON { table | view | (query) } as InputTable
ON { table | view | (query) } as FitTable DIMENSION
USING
[ Accumulate ({'accumulate_column' | 'accumulate_column_range'}[,...]) ]
) as alias;

TD_OrdinalEncodingTransform Syntax Elements


Accumulate
[Optional] Specify the input table column names to copy to the output table.


TD_OrdinalEncodingTransform Input
Input Table Schema
Column Data Type Description

TargetColumn CHAR or VARCHAR The input table or user-defined column names for
CHARACTER SET LATIN encoding the categorical values.
/UNICODE

Accumulate Any The input table column names that you want to copy to
the output table.

FitTable Schema
Column Data Type Description

TargetColumn VARCHAR The column that has the distinct categories obtained
CHARACTER using the AUTO or the LIST approach.
SET UNICODE

TD_VALUE_ORDFIT INTEGER The ordinal value associated with the


categorical value.

TD_OrdinalEncodingTransform Output
Output Table Schema
Column Data Type Description

TargetColumn INTEGER The target column with encoded ordinal values from the FIT table. Categories
not found in the FIT table receive the TD_OTHER_CATEGORY ordinal value
specified in the FIT table.

Accumulate Any The specified column names in the Accumulate element are copied to the
output table.

TD_OrdinalEncodingTransform Example
InputTable Schema
See Input table and Output table sections of TD_OrdinalEncodingFit Example.

SQL Call

SELECT * FROM TD_OrdinalEncodingTransform (
ON titanic_dataset AS InputTable
ON titanic_fit_output AS FitTable DIMENSION
USING
Accumulate ('passenger')
) as dt order by passenger;

Output Table

passenger sex
--------- ---
1 1
2 0
3 0
4 0
5 1

TD_PolynomialFeaturesFit
The TD_PolynomialFeaturesFit function stores all the specified argument values in a tabular format.
All polynomial combinations of the features with degrees less than or equal to the specified degree are
generated. For example, for a 2-D input sample [x, y], the degree-2 polynomial features are [x, y, x^2,
xy, y^2, 1].

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_PolynomialFeaturesFit Syntax
SELECT * FROM TD_PolynomialFeaturesFit (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable (output_table) ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ IncludeBias ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ InteractionOnly ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Degree (degree) ]
) AS alias;


TD_PolynomialFeaturesFit Syntax Elements


OutputTable
[Optional] Specify a name for the output table.
If you omit OutputTable, you must create the output table for
TD_PolynomialFeaturesTransform with a CREATE TABLE AS statement:

CREATE TABLE output_table AS (
SELECT * FROM TD_PolynomialFeaturesFit ( ... ) AS alias
) WITH DATA;

TargetColumns
Specify the names of the InputTable columns for which to output polynomial combinations for
features (no more than five).

IncludeBias
[Optional] Specify whether the output table is to include a bias column for the feature in which
all polynomial powers are zero (that is, a column of ones). A bias column acts as an intercept
term in a linear model.
Default: true

InteractionOnly
[Optional] Specify whether to output polynomial combinations only for interaction features
(features that are products of at most degree distinct input features).
Default: false
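For example, with InteractionOnly ('true'), Degree (2), and IncludeBias ('true'), a 2-D sample [x, y] yields only [1, x, y, xy]; the pure powers x^2 and y^2 are omitted.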

Degree
[Optional] Specify the maximum degree of the input features for which to output polynomial
combinations: 1, 2, or 3.
Default: 2

TD_PolynomialFeaturesFit Input
InputTable Schema
Column Data Type Description

target_column NUMERIC Column for which to output polynomial combinations for features.


TD_PolynomialFeaturesFit Output
OutputTable Schema
Column Data Type Description

TD_IncludeBias_POLFIT:boolean     INTEGER 1 if boolean is 'True', 0 if it is 'False'.
                                          boolean is the IncludeBias value, 'True' or 'False'.

TD_InteractionOnly_POLFIT:boolean INTEGER 1 if boolean is 'True', 0 if it is 'False'.
                                          boolean is the InteractionOnly value, 'True' or 'False'.

TD_Degree_POLFIT:degree           INTEGER degree

target_column                     NUMERIC Preserves target column name for
                                          TD_PolynomialFeaturesTransform. Contains only
                                          NULL values.

TD_PolynomialFeaturesFit Example
InputTable: polynomialFeaturesFit_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

id col1 col2 col3


-- ---- ---- ----
1 2 3 4
2 5 6 7
3 1 2 4
4 5 3 5
5 3 2 6

SQL Call

CREATE TABLE polynomialFit AS (
  SELECT * FROM TD_PolynomialFeaturesFit (
    ON polynomialFeaturesFit_input AS InputTable
    USING
    TargetColumns ('[1:2]')
    Degree (2)
  ) AS dt
) WITH DATA;

Output

TD_INCLUDEBIAS_POLFIT:TRUE TD_INTERACTIONONLY_POLFIT:FALSE TD_DEGREE_POLFIT:2 col1 col2
-------------------------- ------------------------------- ------------------ ---- ----
                         1                               0                  2 null null

TD_PolynomialFeaturesTransform
The TD_PolynomialFeaturesTransform function extracts the values of the TargetColumns, Degree,
IncludeBias, and InteractionOnly arguments from the output of the TD_PolynomialFeaturesFit function
and generates a feature matrix of all polynomial combinations of the features.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_PolynomialFeaturesTransform Syntax
SELECT * FROM TD_PolynomialFeaturesTransform (
ON { table | view | (query) } AS InputTable PARTITION BY ANY
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

TD_PolynomialFeaturesTransform Syntax Elements


Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.


Note:
If two or more column names are concatenated and the column name exceeds 128
characters, then the function replaces the actual column names with names such as
col1, col2, col3, col4, col5 in the output.

TD_PolynomialFeaturesTransform Input
InputTable Schema
See TD_PolynomialFeaturesFit Input.

FitTable Schema
See TD_PolynomialFeaturesFit Output.

TD_PolynomialFeaturesTransform Output
Output Table Schema
Column Data Type Description

accumulate_column Same as          Column copied from InputTable.
                  in InputTable

One               DOUBLE PRECISION [Column appears only with IncludeBias ('true').] Column for
                                   the feature in which all polynomial powers are zero (column
                                   of ones).

output_column     DOUBLE PRECISION [Column appears once for each polynomial combination of
                                   degree or less.] Polynomial combination.
                                   For example, for two-dimensional input feature [x, y], output
                                   columns are x, y, x_square, y_square, x_y, where x and y are
                                   InputTable target columns.

TD_PolynomialFeaturesTransform Example
• InputTable: polynomialFeaturesFit_input, as in TD_PolynomialFeaturesFit Example
• FitTable: polynomialFit, created by TD_PolynomialFeaturesFit Example

SQL Call

SELECT * FROM TD_PolynomialFeaturesTransform (
  ON polynomialFeaturesFit_input AS InputTable PARTITION BY ANY
  ON polynomialFit AS FitTable DIMENSION
  USING
  Accumulate ('[0:0]')
) AS dt;

Output

id ONE col1 col1_col2 col1_SQUARE col2 col2_SQUARE


-- ----------- ----------- ------------ ------------ ----------- ------------
1 1.000000000 2.000000000 6.000000000 4.000000000 3.000000000 9.000000000
2 1.000000000 5.000000000 30.000000000 25.000000000 6.000000000 36.000000000
3 1.000000000 1.000000000 2.000000000 1.000000000 2.000000000 4.000000000
4 1.000000000 5.000000000 15.000000000 25.000000000 3.000000000 9.000000000
5 1.000000000 3.000000000 6.000000000 9.000000000 2.000000000 4.000000000

TD_RandomProjectionMinComponents
The TD_RandomProjectionMinComponents function calculates the minimum number of components
required for applying random projection to the given dataset for the specified Epsilon (distortion)
parameter value. That is, it estimates the minimum value of the NumComponents argument in the
TD_RandomProjectionFit function for a given dataset. The function uses the Johnson-Lindenstrauss
lemma to calculate the value.
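
A widely used closed form of the Johnson-Lindenstrauss bound (the form implemented, for example,
by scikit-learn's johnson_lindenstrauss_min_dim; that this function uses the same form is an
assumption, not a documented fact) is:

minComponents >= 4 * ln(N) / (epsilon²/2 - epsilon³/3)

where N is the number of rows in the dataset. For the example that follows, N = 10 and
epsilon = 0.25 give 4 * ln(10) / (0.03125 - 0.0052083) ≈ 353.7, which truncates to the 353 shown
in the example output.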

TD_RandomProjectionMinComponents Syntax
SELECT * FROM TD_RandomProjectionMinComponents(
ON {table | view | (query)} as InputTable
USING
TargetColumns({'target_column' | 'target_column_range'} [,...])
[ Epsilon(epsilon_value) ]
) as dt;

TD_RandomProjectionMinComponents Syntax Elements


TargetColumns
[Required]: Specify the input columns for random projection.

Epsilon
[Optional]: Specify a value to control distortion introduced while projecting the data to a lower
dimension. The amount of distortion increases if you increase the value.
Default Value: 0.1
Allowed Values: Between 0 and 1


TD_RandomProjectionMinComponents Input
Input Table Schema
Column Data Type Description

Target_column BYTEINT, SMALLINT, INTEGER, BIGINT, The input columns for
              DECIMAL/NUMERIC, FLOAT, REAL,       random projection.
              DOUBLE PRECISION

TD_RandomProjectionMinComponents Output
Output Table Schema
Column Data Type Description

RandomProjection_ INTEGER The minimum number of components computed using the
MinComponents             Epsilon value to apply the RandomProjection function to
                          the dataset.

TD_RandomProjectionMinComponents Example
Input Table
Each of the input data points in the following input table consists of 963 columns and a unique identifier of
the data point:

company_id date_2010_01_04 date_2010_01_05 .... date_2013_10_28 date_2013_10_29
         1        0.58           -0.220005 ....        0.840019      -19.589981
         2       -0.640002       -0.65     ....       -0.400002        0.66
         3       -2.350006        1.260009 ....       -1.760009        3.740021
         4        0.109997        0        ....        0.040001        0.540001
         5        0.459999        1.77     ....        1.130005        0.309998
         6        0.45            0.460001 ....       -0.06           -0.11
         7        0.18            0.220001 ....        0.330002        1.150001
         8        0.73            0.369999 ....        0.090001       -0.110001
         9        0.899997        0.700001 ....       -0.220001        0.159996
        10        0.36            0.909996 ....        1.070003        1.050003

SQL Call

SELECT * FROM TD_RandomProjectionMinComponents(
  ON stock_movement as InputTable
  USING
  TargetColumns('[1:]')
  Epsilon(0.25)
) as dt;

Output table

randomprojection_mincomponents
------------------------------
353

TD_RandomProjectionFit
The TD_RandomProjectionFit function returns a random projection matrix based on the
specified arguments.
The function also returns the required parameters for transforming the input data into lower-dimensional
data. The TD_RandomProjectionTransform function uses the TD_RandomProjectionFit output to reduce
the dimensionality of the input data.

TD_RandomProjectionFit Syntax
SELECT * FROM TD_RandomProjectionFit
(ON {table | view | (query)} as InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable(out_table_name) ]
USING
TargetColumns({'target_column' | 'target_column_range'} [,...])
NumComponents(num_components)
[ Seed(seed_value) ]
[ Epsilon(epsilon_value) ]


[ ProjectionMethod({'GAUSSIAN' | 'SPARSE'}) ]
[ Density(density_value) ]
[ OutputFeatureNamesPrefix('output_feature_names_prefix') ])
as dt;

TD_RandomProjectionFit Syntax Elements


TargetColumns
[Required]: Specify the input table columns for dimensionality reduction.

NumComponents
[Required]: Specify the target dimension (number of features) on which the data points from
the original dimension are projected.
The NumComponents value cannot be greater than the original dimension (number
of features) and must satisfy the Johnson-Lindenstrauss Lemma result. The
minimum value allowed for the NumComponents argument is calculated using the
TD_RandomProjectionMinComponents function.

Seed
[Optional]: Specify the random seed the algorithm uses for repeatable results. The algorithm
uses the seed to generate a random projection matrix. The seed must be a non-negative
integer value.
Default Value: If you do not specify a seed, a random seed is used to generate the random
projection matrix, and the output is non-deterministic.

Epsilon
[Optional]: Specify a value to control distortion introduced while projecting the data to a lower
dimension. The amount of distortion increases if you increase the value.
Default Value: 0.1
Allowed Values: Between 0 and 1

ProjectionMethod
[Optional]: Specify the method name for generating the random projection matrix.
Default Value: GAUSSIAN
Allowed Values: [GAUSSIAN, SPARSE]


Density
[Optional]: Specify the approximate ratio of non-zero elements in the random projection
matrix when SPARSE is used as the projection method.
Default Value: 0.33333333
Allowed Values: 0 < Density <= 1
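
For orientation, a common sparse construction in the literature (Achlioptas; Li, Hastie, and
Church) draws each matrix entry as +sqrt(1/Density) with probability Density/2, as 0 with
probability 1 - Density, and as -sqrt(1/Density) with probability Density/2, scaled by
1/sqrt(NumComponents). Whether TD_RandomProjectionFit uses exactly this construction is an
assumption; note only that the default Density of 0.33333333 matches the classic 1/3 of the
Achlioptas scheme.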

OutputFeatureNamesPrefix
[Optional]: Specify the prefix for the output column names.
Default Value: td_rpj_feature

TD_RandomProjectionFit Input
Input Table Schema
Column Data Type Description

Target_column BYTEINT, SMALLINT, INTEGER, BIGINT, The input table columns for
              DECIMAL/NUMERIC, FLOAT, REAL,       dimensionality reduction.
              DOUBLE PRECISION

TD_RandomProjectionFit Output
Output Table Schema
Column Data Type Description

OutputFeatureNamesPrefix_ INTEGER The combination of the OutputFeatureNamesPrefix and
NumComponents_0                   NumComponents values is used for the column name. The
                                  column contains a unique identifier of the rows in the
                                  random projection matrix.

Target_column             REAL    The columns that have the elements of the random
                                  projection matrix.

TD_RandomProjectionFit Example
Input Table
Each of the input data points in the following input table consists of 963 columns and a unique identifier of
the data point:

company_id date_2010_01_04 date_2010_01_05 .... date_2013_10_28 date_2013_10_29
         1        0.58           -0.220005 ....        0.840019      -19.589981
         2       -0.640002       -0.65     ....       -0.400002        0.66
         3       -2.350006        1.260009 ....       -1.760009        3.740021
         4        0.109997        0        ....        0.040001        0.540001
         5        0.459999        1.77     ....        1.130005        0.309998
         6        0.45            0.460001 ....       -0.06           -0.11
         7        0.18            0.220001 ....        0.330002        1.150001
         8        0.73            0.369999 ....        0.090001       -0.110001
         9        0.899997        0.700001 ....       -0.220001        0.159996
        10        0.36            0.909996 ....        1.070003        1.050003

SQL Call

SELECT * FROM TD_RandomProjectionFit(
  ON stock_movement as InputTable
  OUT PERMANENT TABLE OutputTable(rand_proj_fit_tbl_ex)
  USING
  TargetColumns('[1:]')
  NumComponents(353)
  Seed(0)
  Epsilon(0.25)
  ProjectionMethod('GAUSSIAN')
) as dt ORDER BY 1,2;

Output Table

td_rpj_feature_353_0 date_2010_01_04 date_2010_01_05 .... date_2013_10_28 date_2013_10_29
                   0    -0.068021509    -0.021542237 ....     0.031534748    -0.021109446
                   1     0.076606803     0.033381927 ....     0.047328448     0.039908753
                   2    -0.100981365     0.026625478 ....    -0.035816093    -0.009469693
                ....            ....            .... ....            ....            ....
                 351     0.039755885    -0.040469189 ....     0.046338113     0.018347439
                 352     0.040278453     0.072748211 ....    -0.04456492     -0.031728421

TD_RandomProjectionTransform
The TD_RandomProjectionTransform function converts the high-dimensional input data to a lower-
dimensional space using the TD_RandomProjectionFit function output.
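
Conceptually, if the FitTable holds a k x d projection matrix R (k = NumComponents rows, d =
number of original feature columns), each d-dimensional input row x is mapped to the
k-dimensional row x' = x * transpose(R). In the example that follows, d = 963 and k = 353, so
each 963-value row becomes a 353-value row. Treat this as a dimensional sketch; any additional
scaling the function applies is not documented here.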

TD_RandomProjectionTransform Syntax
SELECT * FROM TD_RandomProjectionTransform(
ON {table | view | (query)} as InputTable
ON {table | view | (query)} as FitTable DIMENSION
USING
[ Accumulate({'accumulate_column' | 'accumulate_column_range'} [,...]) ]
) as dt;

TD_RandomProjectionTransform Syntax Elements


Accumulate
[Optional]: Specify the input table columns to copy to the output table.
Default: Only transformed columns are present in the output.

TD_RandomProjectionTransform Input
Input Table Schema
Column            Data Type                           Description

Target_column     BYTEINT, SMALLINT, INTEGER, BIGINT, The input table columns for
                  DECIMAL/NUMERIC, FLOAT, REAL,       dimensionality reduction.
                  DOUBLE PRECISION

Accumulate_column ANY                                 The input table columns that you want to
                                                      copy to the output table.

TD_RandomProjectionTransform Output
Output Table Schema
Column Data Type Description

Accumulate                 ANY  The columns specified in the Accumulate element are copied
                                to the output table.

OutputFeatureNamesPrefix_i REAL The generated columns containing the data points converted
                                to the lower-dimensional space, where i is the sequence
                                number of the generated column.

TD_RandomProjectionTransform Example
Input Table
Each of the input data points in the following input table consists of 963 columns and a unique identifier of
the data point:

company_id date_2010_01_04 date_2010_01_05 .... date_2013_10_28 date_2013_10_29
         1        0.58           -0.220005 ....        0.840019      -19.589981
         2       -0.640002       -0.65     ....       -0.400002        0.66
         3       -2.350006        1.260009 ....       -1.760009        3.740021
         4        0.109997        0        ....        0.040001        0.540001
         5        0.459999        1.77     ....        1.130005        0.309998
         6        0.45            0.460001 ....       -0.06           -0.11
         7        0.18            0.220001 ....        0.330002        1.150001
         8        0.73            0.369999 ....        0.090001       -0.110001
         9        0.899997        0.700001 ....       -0.220001        0.159996
        10        0.36            0.909996 ....        1.070003        1.050003

FITTable (Generated from TD_RandomProjectionFit)

td_rpj_feature_353_0 date_2010_01_04 date_2010_01_05 .... date_2013_10_28 date_2013_10_29
                   0    -0.068021509    -0.021542237 ....     0.031534748    -0.021109446
                   1     0.076606803     0.033381927 ....     0.047328448     0.039908753
                   2    -0.100981365     0.026625478 ....    -0.035816093    -0.009469693
                ....            ....            .... ....            ....            ....
                 351     0.039755885    -0.040469189 ....     0.046338113     0.018347439
                 352     0.040278453     0.072748211 ....    -0.04456492     -0.031728421

SQL Call

SELECT * FROM TD_RandomProjectionTransform(
  ON stock_movement as InputTable
  ON rand_proj_fit_tbl_ex as FitTable DIMENSION
  USING
  Accumulate('company_id')
) as dt ORDER BY 1,2;

Output Table

company_id td_rpj_feature_0 td_rpj_feature_1 .... td_rpj_feature_351 td_rpj_feature_352
         1      5.545038445     -18.76927716 ....       -4.260967918        1.729951186
         2       0.38584145     -1.436265087 ....       -0.666327415       -0.820435756
         3     -2.173481389     -1.528033248 ....        2.975176363       -4.719261735
         4     -0.193832387     -1.093648166 ....         0.07893702        0.923864022
         5     -1.542301006     -1.794068037 ....       -0.795435421        0.797221691
         6     -0.344011761     -0.344862717 ....         0.13398249        0.088007141
         7      0.899506286     -1.592144274 ....       -1.225987822        0.192506954
         8      0.446963102     -0.736407073 ....        0.133092795        0.192294245
         9      -3.02747734     -3.933652254 ....       -0.086266401       -2.545520414
        10      1.937290638      0.498908831 ....       -0.393961865        -0.88549645

TD_RowNormalizeFit
TD_RowNormalizeFit outputs a table of parameters and specified input columns to input to
TD_RowNormalizeTransform, which normalizes the input columns row-wise.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_RowNormalizeFit Syntax
SELECT * FROM TD_RowNormalizeFit (
  ON { table | view | (query) } AS InputTable
  [ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable (output_table_name) ]
  USING
  TargetColumns ({ 'target_column' | target_column_range }[,...])
  [ Approach ({ 'UNITVECTOR' | 'FRACTION' | 'PERCENTAGE' | 'INDEX' }) ]
  [ BaseColumn ('base_column')
    BaseValue (base_value) ]
) AS alias;

TD_RowNormalizeFit Syntax Elements


OutputTable
[Optional] Specify a name for the output table.
If you omit OutputTable, you must create the output table for TD_RowNormalizeTransform
with a CREATE TABLE AS statement:

CREATE TABLE output_table AS (
  SELECT * FROM TD_RowNormalizeFit ( ... ) AS alias
) WITH DATA;

TargetColumns
Specify the names of the InputTable columns to normalize row-wise.

Approach
[Optional] Specify the normalization method:

Option               Normalizing Formula

UNITVECTOR (Default) X' = X / sqrt(Σi ∈ [1, n] Xi²)

FRACTION             X' = X / (Σi ∈ [1, n] Xi)

PERCENTAGE           X' = (X * 100) / (Σi ∈ [1, n] Xi)

INDEX                X' = V + ((X - B) / B) * 100

In the normalizing formulas:

X' is the normalized value.
X is the original value.
B is the value in the base column.
V is the base value.
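
For example, with Approach ('UNITVECTOR') and a row with values x = 3 and y = 4:
x' = 3 / sqrt(3² + 4²) = 3/5 = 0.6 and y' = 4 / sqrt(3² + 4²) = 4/5 = 0.8.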


BaseColumn
[Required with Approach ('INDEX'), ignored otherwise.] Specify the name of the InputTable
column that has the B values to use in the normalizing formula.

BaseValue
[Required with Approach ('INDEX'), ignored otherwise.] Specify the V value to use in the
normalizing formula.

TD_RowNormalizeFit Input
InputTable Schema
Column Data Type Description

target_column BYTEINT, SMALLINT, INTEGER, BIGINT, Column to normalize row-wise.
              DECIMAL/NUMERIC, FLOAT, REAL,
              DOUBLE PRECISION

base_column   BYTEINT, SMALLINT, INTEGER, BIGINT, [Column appears only with Approach ('INDEX').]
              DECIMAL/NUMERIC, FLOAT, REAL,       B value to use in normalizing formula.
              DOUBLE PRECISION

TD_RowNormalizeFit Output
OutputTable Schema
Column Data Type Description

TD_KEY_ROWFIT   VARCHAR                 Parameter key.
                (CHARACTER SET LATIN)

TD_VALUE_ROWFIT VARCHAR                 Parameter value.
                (CHARACTER SET UNICODE)

target_column   Same as in InputTable   [Column appears once for each specified
                                        target_column.] NULL value.

TD_RowNormalizeFit Example
InputTable: rowNormalizeFit_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.


id x y
-- - --
1 0 1
2 3 4
3 5 12
4 7 24

SQL Call

CREATE TABLE rowNormalizeFit_output AS (
  SELECT * FROM TD_RowNormalizeFit (
    ON rowNormalizeFit_input AS InputTable
    USING
    TargetColumns ('[1:2]')
    Approach ('INDEX')
    BaseColumn ('y')
    BaseValue (100)
  ) AS dt
) WITH DATA;

Output

SELECT * FROM rowNormalizeFit_output;

TD_KEY_ROWFIT TD_VALUE_ROWFIT x y
------------- --------------- ---- ----
Approach INDEX null null
BaseColumn y null null
BaseValue 100 null null

TD_RowNormalizeTransform
TD_RowNormalizeTransform normalizes input columns row-wise, using TD_RowNormalizeFit output.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.


TD_RowNormalizeTransform Syntax
SELECT * FROM TD_RowNormalizeTransform (
ON { table | view | (query) } AS InputTable [ PARTITION BY ANY [ ORDER BY
order_column ] ]
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

TD_RowNormalizeTransform Syntax Elements


Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.

TD_RowNormalizeTransform Input
InputTable Schema

Column Data Type Description

target_column     BYTEINT, SMALLINT, INTEGER, BIGINT, Column to normalize row-wise.
                  DECIMAL/NUMERIC, FLOAT, REAL,
                  DOUBLE PRECISION

base_column       BYTEINT, SMALLINT, INTEGER, BIGINT, [Column appears only with Approach ('INDEX').]
                  DECIMAL/NUMERIC, FLOAT, REAL,       B value to use in normalizing formula.
                  DOUBLE PRECISION

accumulate_column Any                                 The input table column names copied to the
                                                      output table.

FitTable Schema
See TD_RowNormalizeFit Output.

TD_RowNormalizeTransform Output
Column Data Type Description

accumulate_column Same as in InputTable Column copied from InputTable.

target_column DOUBLE PRECISION Row-normalized values.


TD_RowNormalizeTransform Example
• InputTable: rowNormalizeFit_input, as in TD_RowNormalizeFit Example
• FitTable: rowNormalizeFit_output, output by TD_RowNormalizeFit Example

SQL Call

SELECT * FROM TD_RowNormalizeTransform (
  ON rowNormalizeFit_input AS InputTable
  ON rowNormalizeFit_output AS FitTable DIMENSION
  USING
  Accumulate ('id')
) AS dt;

Output

id x y
-- ------- ------
1 0.00 100.00
2 75.00 100.00
3 41.66 100.00
4 29.16 100.00
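
As a check against the INDEX formula X' = V + ((X - B)/B) * 100 with BaseColumn y and
BaseValue 100: for id 2 (x = 3, y = 4), x' = 100 + ((3 - 4)/4) * 100 = 75.00 and
y' = 100 + ((4 - 4)/4) * 100 = 100.00, matching the output. (The 41.66 for id 3 is
100 - (7/12) * 100 ≈ 41.67, truncated in the display.)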

TD_ScaleFit
TD_ScaleFit outputs a table of statistics to input to TD_ScaleTransform, which scales specified input
table columns.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_ScaleFit Syntax
SELECT * FROM TD_ScaleFit (
  ON { table | view | (query) } AS InputTable
    [ PARTITION BY ANY [ ORDER BY order_column ] ]
  [ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable (output_table) ]
  USING
  TargetColumns ( { 'target_column' | target_column_range }[,...] )
  ScaleMethod ('scale_method' [,...])
  [ Multiplier ('multiplier' [,...]) ]
  [ Intercept ('intercept' [,...]) ]
  [ GlobalScale ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
  [ MissValue ({ 'KEEP' | 'ZERO' | 'LOCATION' }) ]
) AS alias;

TD_ScaleFit Syntax Elements


OutputTable
[Optional] Specify a name for the output table.

TargetColumns
Specify the names of the InputTable columns for which to output statistics. The columns must
contain numeric data in the range (-1e308, 1e308).

ScaleMethod
Specify either one scale_method for all target columns or one scale_method for each
target_column. The nth scale_method applies to the nth target_column.
The following table lists each possible scale_method and its location and scale values.
The TD_ScaleTransform function uses the location and scale values in the following formula
to scale target column value X to scaled value X':
X' = intercept + multiplier * ((X - location)/scale)
Intercept and Multiplier determine intercept and multiplier.
In the table, Xmin, Xmax, and XMean are the minimum, maximum, and mean values
of target_column.
scale_method Description                  location        scale

MAXABS       Maximum absolute value.      0               Maximum |X|

MEAN         Mean.                        XMean           1

MIDRANGE     Midrange.                    (Xmax+Xmin)/2   (Xmax-Xmin)/2

RANGE        Range.                       Xmin            Xmax-Xmin

RESCALE      Rescale using specified      See table after See table after
             lower bound, upper bound,    RESCALE syntax. RESCALE syntax.
             or both. See syntax after
             this table.

STD          Standard deviation.          XMean           √(Σ((Xi - XMean)²)/N),
                                                          where N is count of
                                                          valid values.

SUM          Sum.                         0               ΣX

USTD         Unbiased standard deviation. XMean           √(Σ((Xi - XMean)²)/(N - 1)),
                                                          where N is count of
                                                          valid values.

RESCALE ({ lb=lower_bound | ub=upper_bound | lb=lower_bound, ub=upper_bound })

RESCALE location and scale:

                       location                                         scale

Lower bound only       Xmin - lower_bound                               1

Upper bound only       Xmax - upper_bound                               1

Lower and upper bounds Xmin - (lower_bound/(upper_bound - lower_bound)) (Xmax - Xmin)/(upper_bound - lower_bound)

Intercept
[Optional] Specify either one intercept for all target columns or one intercept for each
target_column. The function uses the nth intercept for the nth target_column.
Default: '0'

Multiplier
[Optional] Specify either one multiplier for all target columns or one multiplier for each
target_column. The function uses the nth multiplier for the nth target_column.
Default: '1'

GlobalScale
[Optional] Specify whether to scale all target columns to the same location and scale.
Default: 'false' (scale each target column separately)


MissValue
[Optional] Specify how to handle NULL values:
Option Description

KEEP (Default) Keep NULL values.

ZERO Replace each NULL value with 0.

LOCATION Replace each NULL value with its location value.

TD_ScaleFit Input
InputTable Schema
Column Data Type Description

target_column BYTEINT, SMALLINT, INTEGER, BIGINT, Column for which to
              DECIMAL/NUMERIC, FLOAT, REAL,       output statistics.
              DOUBLE PRECISION

TD_ScaleFit Output
Output Table Schema
Column Data Type Description

TD_STATTYPE_SCLFIT VARCHAR               Statistic names and parameters. See the
                   (CHARACTER SET LATIN) following table.

target_column      DOUBLE PRECISION      Statistics values for target_column.

TD_STATTYPE_SCLFIT Statistic Names


Statistic Name Description

min Minimum value in target_column.

max Maximum value in target_column.

sum Sum of non-NULL values in target_column.

count Count of non-NULL values in target_column.

null Count of NULL values in target_column.

avg Average of non-NULL values in target_column.

variance Variance of non-NULL values in target_column, calculated according to
N-1 degrees of freedom. [N is count of non-NULL values in target_column.]

ustd Unbiased standard deviation of non-NULL values in target_column.

std Standard deviation of non-NULL values in target_column.

multiplier multiplier for target_column.

intercept intercept for target_column.

methodnumbermapping Identifier of scale_method for target_column:


methodnumbermapping scale_method

0 MEAN

1 SUM

2 USTD

3 STD

4 RANGE

5 MIDRANGE

6 MAXABS

7 RESCALE

global_true or global_false GlobalScale value.

missvalue_keep, missvalue_ MissValue value.


zero, or missvalue_location

TD_ScaleFit Example
InputTable: scale_input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass name                                 sex    age sibsp parch ticket   fare    cabin       embarked
--------- -------- ------ ------------------------------------ ------ --- ----- ----- -------- ------- ----------- --------
       97        0      1 Goldschmidt; Mr. George B            male    71     0     0 PC 17754 34.6542 A5          C
      488        0      1 Kent; Mr. Edward Austin              male    58     0     0 11771    29.7    B37         C
      505        1      1 Maioni; Miss. Roberta                female  16     0     0 110152   86.5    B79         S
      631        1      1 Barkworth; Mr. Algernon Henry Wilson male    80     0     0 27042    30      A23         S
      873        0      1 Carlsson; Mr. Frans Olof             male    33     0     0 695      5       B51 B53 B55 S

SQL Call

SELECT * FROM TD_ScaleFit (
  ON scale_input_table AS InputTable
  OUT PERMANENT TABLE OutputTable (scaleFitOut)
  USING
  TargetColumns ('fare')
  MissValue ('keep')
  ScaleMethod ('range')
  GlobalScale ('f')
) AS dt;

Output

TD_STATTYPE_SCLFIT                                                 fare
------------------------------------------------------------------ -------------
min                                                                  5.000000000
max                                                                 86.500000000
sum                                                                185.854200000
count                                                                5.000000000
null                                                                 0.000000000
avg                                                                 37.170840000
multiplier                                                           1.000000000
intercept                                                            0.000000000
location                                                             5.000000000
scale                                                               81.500000000
globalscale_false                                                           null
ScaleMethodNumberMapping:                                            4.000000000
[0:mean,1:sum,2:ustd,3:std,4:range,5:midrange,6:maxabs,7:rescale]
missvalue_KEEP                                                              null

TD_ScaleTransform
TD_ScaleTransform scales specified input table columns, using TD_ScaleFit output.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_ScaleTransform Syntax
SELECT * FROM TD_ScaleTransform (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ ORDER BY order_column ] ]
ON { table | view | (query) } AS FitTable DIMENSION
[ USING
Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...])
]
) AS alias;

TD_ScaleTransform Syntax Elements


Accumulate
[Optional] Specify InputTable columns to copy to the output table.


TD_ScaleTransform Input
InputTable Schema
See TD_ScaleFit Input.

FitTable Schema
See TD_ScaleFit Output.

TD_ScaleTransform Output
Output Table Schema
Column Data Type Description

accumulate_column Same as in InputTable Column copied from InputTable.

target_column DOUBLE PRECISION Scaled values.

TD_ScaleTransform Example
• InputTable: scale_input_table, as in TD_ScaleFit Example
• FitTable: scaleFitOut, output by TD_ScaleFit Example

SQL Call

SELECT * FROM TD_ScaleTransform (
  ON scale_input_table AS InputTable
  ON scaleFitOut AS FitTable DIMENSION
  USING
  Accumulate ('passenger')
) AS dt ORDER BY 1;

Output

passenger fare
--------- ------------------
97 0.363855214723926
488 0.303067484662577
505 1
631 0.306748466257669
873 0
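
As a check against the scaling formula X' = intercept + multiplier * ((X - location)/scale),
using the RANGE statistics from the Fit table (location = 5, scale = 81.5): for passenger 97,
fare 34.6542 gives X' = 0 + 1 * ((34.6542 - 5)/81.5) ≈ 0.363855, matching the first output row.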

5: Feature Engineering Utility Functions

Feature engineering utility functions for analyzing and extracting features of the input dataset.

TD_FillRowID
TD_FillRowID adds a column of unique row identifiers to the input table.

Note:
This function may not return the same RowIds if the function is run multiple times.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_FillRowID Syntax
SELECT * FROM TD_FillRowID (
ON { table | view | (query) } AS InputTable [ PARTITION BY ANY [ ORDER BY
order_column ] ]
USING
[ RowIDColumnName ('row_id_column') ]
) AS alias;

TD_FillRowID Syntax Elements


RowIDColumnName
[Optional] Specify a name for the output column of row identifiers.
Default: row_id


TD_FillRowID Input
InputTable Schema
InputTable can have any schema.

TD_FillRowID Output
Output Table Schema
Column Data Type Description

row_id_column BIGINT Column of row identifiers.

input_column Same as in InputTable Column copied from InputTable.

TD_FillRowID Example
InputTable: fillrowid_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

Survived Pclass Name Age


-------- ------ ------------------ ---
0 3 Mrs. Jacques Heath 35
0 3 Mr. Owen Harris 22
1 3 Mrs. Laina 26
1 1 Mrs. John Bradley 38

SQL Call

SELECT * FROM TD_FillRowID (
  ON fillrowid_input AS InputTable
  USING
  RowIDColumnName ('PassengerId')
) AS dt;

Output

Survived Pclass Name Age PassengerId


-------- ------ ------------------ --- -----------
1 1 Mrs. John Bradley 38 0


0 3 Mr. Owen Harris 22 7


1 3 Mrs. Laina 26 8
0 3 Mrs. Jacques Heath 35 15

TD_NumApply
TD_NumApply applies a specified numeric operator to the specified input table columns. For the list of
numeric operators, see TD_NumApply Syntax Elements.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

Related Information:
TD_StrApply
TD_FunctionFit
TD_FunctionTransform

TD_NumApply Syntax
SELECT * FROM TD_NumApply (
ON { table | view | (query) } AS InputTable [ PARTITION BY ANY [ ORDER BY
order_column ] ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ OutputColumns ('output_column' [,...]) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
ApplyMethod ('num_operator')
[ SigmoidStyle ({ 'logit' | 'modifiedlogit' | 'tanh' }) ]
[ InPlace ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
) AS alias;

TD_NumApply Syntax Elements


TargetColumns
Specify the names of the InputTable columns to which to apply the numeric operator.


OutputColumns
[Ignored with Inplace ('true'), otherwise optional.] Specify names for the output columns. An
output_column cannot exceed 128 characters.
Default: With InPlace ('false'), target_column_operator; otherwise target_column

Note:
If any target_column_operator exceeds 128 characters, specify an output_column for
each target_column.

Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.
With InPlace ('true'), no target_column can be an accumulate_column.

ApplyMethod
Specify one of these numeric operators:
num_operator Description

EXP Raises e (base of natural logarithms) to power of value, where e =


2.71828182845905.

LOG Computes base 10 logarithm of value.

SIGMOID Applies sigmoid function to value. See SigmoidStyle.

SININV Computes inverse hyperbolic sine of value.

TANH Computes hyperbolic tangent of value.

SigmoidStyle
[Required with ApplyMethod ('sigmoid'), otherwise ignored.] Specify the sigmoid style.
Default: logit
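
The style names correspond to the standard definitions below (stated as an assumption for
orientation; this document does not spell them out, so confirm against your release):

logit(x) = 1 / (1 + e^(-x))
modifiedlogit(x) = 2 / (1 + e^(-x)) - 1
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))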

InPlace
[Optional] Specify whether the output columns have the same names as the target columns.
InPlace ('true') effectively replaces each value in each target column with the result of
applying num_operator to it.
InPlace ('false') copies the target columns to the output table and adds output columns whose
values are the result of applying num_operator to each value.
With InPlace ('true'), no target_column can be an accumulate_column.


Default: true

TD_NumApply Input
InputTable Schema
Column Data Type Description

target_column NUMERIC Column to which to apply num_operator.

accumulate_column Any Column to copy to output table.

TD_NumApply Output
Output Table Schema
Column Data Type Description

accumulate_column Same as in InputTable Column copied from InputTable.

output_column     Same as in InputTable Column to which num_operator was applied.
                                        With InPlace ('true'), output_column is target_column.
                                        With InPlace ('false'), OutputColumns determines
                                        output_column.

TD_NumApply Example
InputTable: numApply_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

passenger survived pclass sex    age sibsp parch fare         cabin embarked
--------- -------- ------ ------ --- ----- ----- ------------ ----- --------
        1        0      3 male    22     1     0  7.250000000 null  S
        2        1      1 female  38     1     0 71.280000000 C85   C
        3        1      3 female  26     0     0  7.930000000 null  S
        4        1      1 female  35     1     0 53.100000000 C123  S
        5        0      3 male    35     0     0  8.050000000 null  S

SQL Call

SELECT * FROM TD_NumApply (
  ON numApply_input AS InputTable PARTITION BY ANY
  USING
  TargetColumns ('Age','Fare')
  ApplyMethod ('log')
  Accumulate ('Passenger')
  InPlace ('true')
) AS dt;

Output

passenger age fare


--------- ----------- -----------
5 3.555348061 2.085672091
4 3.555348061 3.972176928
3 3.258096538 2.070653036
1 3.091042453 1.981001469
2 3.637586160 4.266615783
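
As a check on this output: 3.091042453 = ln(22) and 1.981001469 = ln(7.25), so the values shown
are natural logarithms rather than base-10 logarithms. Verify the base that LOG uses on your
release before relying on it.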

TD_RoundColumns
TD_RoundColumns rounds the values of each specified input table column to a specified number of
decimal places.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_RoundColumns Syntax
SELECT * FROM TD_RoundColumns (
  ON { table | view | (query) } AS InputTable
  USING
  TargetColumns ({ 'target_column' | target_column_range }[,...])
  [ PrecisionDigit (precision) ]
  [ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

TD_RoundColumns Syntax Elements


TargetColumns
Specify the names of the InputTable columns in which to round every value to precision digits.

PrecisionDigit
[Optional] Specify the number of decimal places to which to round values.
If precision is positive, the function rounds values to the right of the decimal point.
If precision is negative, the function rounds values to the left of the decimal point.
Default: If the PrecisionDigit value is not provided, the function rounds the column values to
0 places.

Note:
If the column values have the DECIMAL/NUMERIC data type with a precision less than
38, then the function increases the precision by 1. For example, when a DECIMAL (4,2)
value of 99.99 is rounded to 0 places, the function returns a DECIMAL (5,2) value,
100.00. However, if the precision is 38, then the function only reduces the scale value
by 1 unless the scale is 0. For example, the function returns a DECIMAL (38, 36) value
of 99.999999999 as a DECIMAL (38, 35) value, 100.00.
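
For example (values chosen for illustration), PrecisionDigit (2) would round 71.2833 to 71.28,
and PrecisionDigit (-1) would round it to 70.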

Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.

TD_RoundColumns Input
InputTable Schema
Column Data Type Description

target_column NUMERIC Column in which to round every value to precision digits.

accumulate_column Any Column to copy to output table.


TD_RoundColumns Output
Output Table Schema
Column Data Type Description

target_column     NUMERIC               Column in which every value is rounded to
                                        precision digits.

accumulate_column Same as in InputTable Column copied from InputTable.

TD_RoundColumns Example
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

InputTable: titanic

passenger pclass fare survived


--------- ------ --------- --------
1 3 7.25 0
2 1 71.2833 1
3 3 7.925 1
4 1 53.100 1
5 3 8.050 0

SQL Call

SELECT * FROM TD_RoundColumns (
  ON titanic AS InputTable
  USING
  TargetColumns ('Fare')
  PrecisionDigit (1)
  Accumulate ('[0:1]','Survived')
) AS dt;

Output

passenger pclass survived fare


--------- ------ -------- ---------
1 3 0 7.30
2 1 1 71.30
3 3 1 7.90


4 1 1 53.10
5 3 0 8.10

TD_StrApply
TD_StrApply applies a specified string operator to the specified input table columns. For the list of string
operators, see TD_StrApply Syntax Elements.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

Related Information:
TD_NumApply

TD_StrApply Syntax
SELECT * FROM TD_strApply (
ON { table | view | (query) } AS InputTable PARTITION BY ANY
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ OutputColumns ('output_column' [,...]) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
StringOperation (str_operator)
[ String ('string')]
[ StringLength ('length') ]
[ OperatingSide ({ 'Left' | 'Right' })]
[ IsCaseSpecific ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ EscapeString ('escape_string')]
[ IgnoreTrailingBlank ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ StartIndex ('start_index')]
[ InPlace ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
) AS alias;


TD_StrApply Syntax Elements


TargetColumns
Specify the names of the input table columns to which to apply the string operator.

OutputColumns
[Ignored with Inplace ('true'), otherwise optional.] Specify names for the output columns. An
output_column cannot exceed 128 characters.
Default: With InPlace ('false'), target_column_operator; otherwise target_column

Note:
If any target_column_operator exceeds 128 characters, specify an output_column for
each target_column.

Accumulate
[Optional] Specify the names of the input table columns to copy to the output table.
With InPlace ('true'), no target_column can be an accumulate_column.

StringOperation
Specify a str_operator from the following table. If str_operator requires string, length, or
start_index, specify that value with String, StringLength, or StartIndex.
str_operator Description

CHARTOHEXINT Converts value to its hexadecimal representation.

GETNCHARS Returns length characters from value.


Option: See OperatingSide.

INITCAP Capitalizes first letter of value.

STRINGCON Concatenates string to value.

STRINGINDEX Returns index of first character of string in value.


Options: See IsCaseSpecific.

STRINGLIKE Returns first string that matches specified pattern if one exists in value.
Options: See EscapeString, IsCaseSpecific, IgnoreTrailingBlank.

STRINGPAD Pads value with string to length.


Option: See OperatingSide.

STRINGREVERSE Reverses order of characters in value.

STRINGTRIM If value contains string, trim string from value.

Option: See OperatingSide.

SUBSTRING Returns substring starting at start_index with length from value.

TOLOWER Replaces uppercase letters in value with lowercase equivalents.

TOUPPER Replaces lowercase letters in value with uppercase equivalents.

TRIMSPACES Trims leading and trailing space characters from value.

UNICODESTRING Converts LATIN value to UNICODE.

String
[Required when str_operator needs string argument, ignored otherwise.] Specify string
argument for str_operator:
str_operator string

STRINGCON String to concatenate to value.

STRINGINDEX String for which to return index of its first character in value.

STRINGLIKE Pattern that describes string to find in value.

STRINGPAD String with which to pad value.

STRINGTRIM String to trim from value.

StringLength
[Required when str_operator needs length argument, ignored otherwise.] Specify the
length argument for str_operator:
str_operator length

GETNCHARS Number of characters to return from value.

STRINGPAD Length to which to pad value.

SUBSTRING Substring length.

OperatingSide
[Optional] Applies only when str_operator is GETNCHARS, STRINGPAD, or STRINGTRIM.
Specifies side of value on which to apply str_operator.
Default: left


IsCaseSpecific
[Optional] Applies only when str_operator is STRINGINDEX or STRINGLIKE. Specify
whether search for string is case-specific.
Default: true (search is case-specific)

EscapeString
[Optional] Applies only when str_operator is STRINGLIKE. Specify the escape characters.

IgnoreTrailingBlank
[Optional] Applies only when str_operator is STRINGLIKE. Specify whether to ignore trailing
space characters.

StartIndex
[Optional] Applies only when str_operator is SUBSTRING. Specify the index of the character
at which the substring starts.

InPlace
[Optional] Specify whether the output columns have the same names as the target columns.
InPlace ('true') effectively replaces each value in each target column with the result of
applying str_operator to it.
InPlace ('false') copies the target columns to the output table and adds output columns whose
values are the result of applying str_operator to each value.
With InPlace ('true'), no target_column can be an accumulate_column.
Default: true

TD_StrApply Input
Input Table Schema
Column Data Type Description

target_column CHAR, VARCHAR Column to which to apply str_operator.


(CHARACTER SET LATIN or
UNICODE.)

accumulate_column Any Column to copy to output table.


TD_StrApply Output
Output Table Schema
Column Data Type Description

accumulate_column Same as in input table Column copied from input table.

output_column     Same as in input table Column to which str_operator was applied.
                                         With InPlace ('true'), output_column is target_column.
                                         With InPlace ('false'), OutputColumns determines
                                         output_column.

TD_StrApply Example
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

Input Table: input_table

passenger survived pclass name                                                 sex    age sibsp parch ticket           fare         cabin embarked
--------- -------- ------ ---------------------------------------------------- ------ --- ----- ----- ---------------- ------------ ----- --------
        5        0      3 Allen; Mr. William Henry                             male    35     0     0 373450            8.050000000 null  S
        4        1      1 Futrelle; Mrs. Jacques Heath (Lily May Peel)         female  35     1     0 113803           53.100000000 C123  S
        3        1      3 Heikkinen; Miss. Laina                               female  26     0     0 STON/O2. 3101282  7.925000000 null  S
        1        0      3 Braund; Mr. Owen Harris                              male    22     1     0 A/5 21171         7.250000000 null  S
        2        1      1 Cumings; Mrs. John Bradley (Florence Briggs Thayer)  female  38     1     0 PC 17599         71.283300000 C85   C

SQL Call

SELECT * FROM TD_strApply (
  ON strApply_input_table as InputTable PARTITION BY ANY
  USING
  TargetColumns ('Sex')
  StringOperation ('toUpper')
  Accumulate ('Passenger')
  InPlace ('True')
) as dt order by 1;

Output

passenger sex
--------- ------
1 MALE
2 FEMALE
3 FEMALE
4 FEMALE
5 MALE

6: Model Training Functions

TD_DecisionForest
The function is an ensemble algorithm used for classification and regression predictive modeling problems.
It is an extension of bootstrap aggregation (bagging) of decision trees. Typically, constructing a decision tree
involves evaluating the value for each input feature in the data to select a split point.
The function reduces the features to a random subset (that can be considered at each split point); the
algorithm can force each decision tree in the forest to be very different to improve prediction accuracy.
The function uses a training dataset to create a predictive model. The TD_DecisionForestPredict
function uses the model created by the TD_DecisionForest function for making predictions.
The function supports regression, binary, and multi-class classification.
Consider the following points:
• All input features must be numeric. Convert categorical columns to numerical columns as a
preprocessing step.
• For classification, class labels (ResponseColumn values) can only be integers.
• Any observation with a missing value in an input column is skipped and not used for training. You can
use the TD_SimpleImpute function to assign missing values.
The number of trees built by TD_DecisionForest depends on the NumTrees, TreeSize, CoverageFactor
values, and the data distribution in the cluster. The trees are constructed in parallel by all the AMPs, which
have a non-empty partition of data.
• When you specify the NumTrees value, the number of trees built by the function is adjusted as
follows (see the worked example after this list):

Number_of_trees = Num_AMPs_with_data * (NumTrees/Num_AMPs_with_data)

• For Num_AMPs_with_data value, use the SQL command SELECT HASHAMP()+1;.


• When you do not specify the NumTrees value, the number of trees built by an AMP is calculated as:

Number_of_AMP_trees = CoverageFactor * Num_Rows_AMP / TreeSize

The number of trees built by the function is the sum of Number_of_AMP_trees.


• The TreeSize value determines the sample size used to build a tree in the forest and depends on the
memory available to the AMP. By default, this value is computed internally by the function. The function
reserves approximately 40% of its available memory to store the input sample, while the rest is used to
build the tree.
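
Worked example for the NumTrees adjustment (assuming the division is integer division, which the
formula implies but the document does not state): with NumTrees (10) on a system where 4 AMPs hold
data, Number_of_trees = 4 * (10/4) = 4 * 2 = 8 trees.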


TD_DecisionForest Syntax
SELECT * FROM TD_DecisionForest (
ON { table | view | (query) } PARTITION BY ANY
USING
InputColumns ({'input_column'|input_column_range }[,…])
ResponseColumn('response_column')
[ MaxDepth (maxdepth) ]
[ MinNodeSize (minnodesize) ]
[ NumTrees (numtrees) ]
[ Treetype ('regression'|'classification') ]
[ TreeSize (treesize) ]
[ CoverageFactor (coveragefactor) ]
[ Seed (seed) ]
[ Mtry (mtry) ]
[ MtrySeed (mtryseed) ]
[ MinImpurity (minimpurity) ]
)
as dt;

TD_DecisionForest Syntax Elements


InputColumns
Specify the input table column names for training the model (predictors, features, or
independent variables).

ResponseColumn
Specify the column name that contains the classification label or target value (dependent
variable) for regression.

MaxDepth
[Optional] Specify the maximum depth of a tree. The algorithm stops splitting a node beyond
this depth. Decision trees can grow to 2(max_depth+1)-1 nodes. The default value is 5. You must
specify a non-negative integer value.

NumTrees
[Optional] Specify the number of trees for the forest model. You must specify a value greater
than or equal to the number of data AMPs. By default, the function builds the minimum
number of trees that provides the specified coverage level in the CoverageFactor argument
for the input dataset. The default value is -1.


MinNodeSize
[Optional] Specify the minimum number of observations in a tree node. The algorithm stops
splitting a node if the number of observations in the node is equal to or smaller than this value.
You must specify a non-negative integer value. The default value is 1.

Mtry
[Optional] Specify the number of features from input columns for evaluating the best split
of a node. A higher value improves the splitting and performance of a tree. A smaller value
improves the robustness of the forest and prevents it from overfitting. When the value is -1,
all variables are used for each split. The default value is -1.

MtrySeed
[Optional] Specify the random seed that the algorithm uses for the Mtry argument. The
default value is 1.

Seed
[Optional] Specify the random seed the algorithm uses for repeatable results. The default
value is 1.

TreeType
[Optional] Specify the modeling type.
Allowed Values: Regression, Classification. The default value is Regression.

TreeSize
[Optional] Specify the number of rows that each tree uses as its input dataset. The function
builds a tree using either the number of rows on an AMP, the number of rows that fit into the
AMP’s memory (whichever is less), or the number of rows given by the TreeSize argument.
By default, this value is the minimum number of rows on an AMP and the number of rows that
fit into the AMP’s memory. The default value is -1.

CoverageFactor
[Optional] Specify the level of coverage for the dataset in the forest. The default value is
1.0 (100% coverage).

MinImpurity
[Optional] Specify the minimum impurity of a tree node. The algorithm stops splitting a node
if the value is equal to or smaller than the specified value. The default value is 0.0.


TD_DecisionForest Input
Column Name     Data Type           Description

input_column    INTEGER, BIGINT,    The columns that the function uses to train the
                SMALLINT, BYTEINT,  decision forest model.
                FLOAT, DECIMAL,
                or NUMBER

response_column INTEGER, BIGINT,    The column that contains the response value for an
                SMALLINT, BYTEINT,  observation. For regression, all numeric data types
                FLOAT, DECIMAL,     are supported. For classification, the INTEGER,
                or NUMBER           BIGINT, and SMALLINT data types are supported.

TD_DecisionForest Output
The function produces a model and a JSON representation of the decision tree. The model output is
as follows:
Column Name Data Type Description

task_index SMALLINT The AMP that produces the decision tree.

tree_num SMALLINT The identified decision tree within an AMP.

tree CLOB The trained decision tree model represented in JSON format.

The JSON representation of the decision tree has the following elements:
JSON Type Description

id_ The unique identifier of the node.

sum_ [Regression trees] The sum of response variable values in the node.

sumSq_ [Regression trees] The sum of squared values of the response variable in the node.

responseCounts_ [Classification trees] The number of observations in each class of the node.

size_ The total number of observations in the node.

maxDepth_ The maximum possible depth of the tree starting from the current node. For the root
node, the value is max_depth. For leaf nodes, the value is 0. For other nodes, the value
is the maximum possible depth of the tree, starting from that node.

split_ The start of JSON item that describes a split in the node.

score_ The GINI score of the node.

attr_ The attribute on which the algorithm splits the node.

type_ Type of tree and split. Possible values:


• CLASSIFICATION_NUMERIC_SPLIT

• REGRESSION_NUMERIC_SPLIT

leftNodeSize_ The number of observations assigned to the left node of the split.

rightNodeSize_ The number of observations assigned to the right node of the split.

leftChild_ The start of the JSON item that describes the left child of the node.

rightChild_ The start of the JSON item that describes the right child of the node.

nodeType_ The node type. Possible values:


• CLASSIFICATION_NODE
• CLASSIFICATION_LEAF
• REGRESSION_NODE
• REGRESSION_LEAF
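
Because the tree column is a CLOB containing JSON, you can inspect a trained tree directly in SQL.
The following is a minimal sketch, assuming the model output was saved to a table named df_model
(a placeholder name); it uses the Analytics Database JSON data type and its JSONExtractValue method:

SELECT task_index, tree_num,
       CAST(tree AS JSON).JSONExtractValue('$.split_.attr_') AS root_split_attr,
       CAST(tree AS JSON).JSONExtractValue('$.size_') AS root_size
FROM df_model;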

TD_DecisionForest Examples
Example: TD_DecisionForest Regression
The following is a sample of housing data taken from the Boston housing dataset.
CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT  medv

0.05188  0.0   4.49   0.0   0.449  6.015  45.1  4.4272  3.0  247  18.5     396.99  12.86  22.5
0.30347  0.0   7.38   0.0   0.493  6.312  28.9  5.4159  5.0  287  19.6     396.90  6.15   23.0
0.6147   0.0   6.20   0.0   0.507  6.618  80.8  3.2721  8.0  307  17.4     396.90  7.6    30.1
0.04527  0.0   11.93  0.0   0.537  6.120  76.7  2.2875  1.0  273  21.0     396.90  9.08   20.6
0.12816  12.5  6.07   0.0   0.409  5.885  33    6.4890  4.0  345  18.9     396.90  8.79   20.9
...      ...   ...    ...   ...    ...    ...   ...     ...  ...  ...      ...     ...    ...

SQL Call

SELECT * FROM TD_DecisionForest (
  ON housing_sample PARTITION BY ANY
  USING
  ResponseColumn('medv')
  InputColumns('[0:12]')
  MaxDepth(12)
  MinNodeSize(1)
  NumTrees(4)
  TreeType('REGRESSION')
  Seed(1)
  Mtry(3)
  MtrySeed(1)
) as dt;

TD_DecisionForest Output
task_index  tree_num  tree

0 0 {"id_":1,"sum_":201.700000,"sumSq_":6781.890000,"size_":6,"maxDepth_":12,
"nodeType_":"REGRESSION_NODE","split_":{"splitValue_":7.091500,"attr_":"rm",
"type_":"REGRESSION_NUMERIC_SPLIT","score_":32984.915253,"scoreImprove_
":32984.915253,"leftNodeSize_":5,"rightNodeSize_":1},"leftChild_":{"id_":2,"sum_":
167.000000,"sumSq_":5577.800000,"size_":5,"maxDepth_":11,"value_":33.400000,
"nodeType_":"REGRESSION_LEAF"},"rightChild_":{"id_":3,"sum_":34.700000,
"sumSq_":1204.090000,"size_":1,"maxDepth_":11,"value_":34.700000,"nodeType_":
"REGRESSION_LEAF"}}

2 0 {"id_":1,"sum_":208.800000,"sumSq_":4905.980000,"size_":9,"maxDepth_":12,
"nodeType_":"REGRESSION_NODE","split_":{"splitValue_":6.465000,"attr_":"rm",
"type_":"REGRESSION_NUMERIC_SPLIT","score_":37076.368050,"scoreImprove_
":37076.368050,"leftNodeSize_":8,"rightNodeSize_":1},"leftChild_":{"id_":2,"sum_":
178.700000,"sumSq_":3999.970000,"size_":8,"maxDepth_":11,"value_":22.337500,
"nodeType_":"REGRESSION_LEAF"},"rightChild_":{"id_":3,"sum_":30.100000,
"sumSq_":906.010000,"size_":1,"maxDepth_":11,"value_":30.100000,"nodeType_":
"REGRESSION_LEAF"}}

3 0 {"id_":1,"sum_":93.600000,"sumSq_":2194.560000,"size_":4,"maxDepth_":12,
"nodeType_":"REGRESSION_NODE","split_":{"splitValue_":7.060000,"attr_":"lstat",
"type_":"REGRESSION_NUMERIC_SPLIT","score_":6272.052528,"scoreImprove_
":6272.052528,"leftNodeSize_":3,"rightNodeSize_":1},"leftChild_":{"id_":2,"sum_":
72.000000,"sumSq_":1728.000000,"size_":3,"maxDepth_":11,"value_":24.000000,
"nodeType_":"REGRESSION_LEAF"},"rightChild_":{"id_":3,"sum_":21.600000,
"sumSq_":466.560000,"size_":1,"maxDepth_":11,"value_":21.600000,"nodeType_":
"REGRESSION_LEAF"}}

Example: TD_DecisionForest Classification


The input table is a sample of the diabetes dataset, with feature columns and a target (response) column,
Outcome. It is a binary classification problem with classes 0 and 1.
ID  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome

0   15           136      70             32             110      37.1  0.153                     43   1
1   0.0          97       64             36             100      36.8  0.6                       25   0
2   1.0          116      70             28             0        27.4  0.204                     21   0
3   2.0          106      64             35             119      30.5  1.4                       34   0
4   0            123      84             37             0        35.2  0.197                     29   0
... ...          ...      ...            ...            ...      ...   ...                       ...  ...

SQL Call

SELECT * FROM TD_DecisionForest (
  ON diabetes_sample AS inputtable PARTITION BY ANY
  USING
  ResponseColumn('outcome')
  InputColumns('[1:8]')
  MaxDepth(3)
  MinNodeSize(1)
  NumTrees(4)
  TreeType('CLASSIFICATION')
  Seed(2)
  Mtry(3)
  MtrySeed(1)
  isDebug('f')
) AS dt;

TD_DecisionForest Output
amp_id  tree_num  tree

0 0 {"id_":1,"size_":11,"maxDepth_":3,"responseCounts_":{"0":8,"1":3},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":35.500000,"attr_":"skinthickness",
"type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.396694,"scoreImprove_
":0.178512,"leftNodeSize_":6,"rightNodeSize_":5},"leftChild_":{"id_":2,"size_":6,
"maxDepth_":2,"label_":"0","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_
":{"id_":3,"size_":5,"maxDepth_":2,"responseCounts_":{"0":2,"1":3},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":135.000000,"attr_":"glucose",
"type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.480000,"scoreImprove_
":0.218182,"leftNodeSize_":2,"rightNodeSize_":3},"leftChild_":{"id_":6,"size_":2,

Teradata Vantage™ - Analytics Database Analytic Functions - 17.20,


Release 17.20 198
6: Model Training Functions

amp_ tree_
tree
id num

"maxDepth_":1,"label_":"0","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_":
{"id_":7,"size_":3,"maxDepth_":1,"label_":"1","nodeType_":"CLASSIFICATION_LEAF"

1 0 {"id_":1,"size_":9,"maxDepth_":3,"responseCounts_":{"0":5,"1":4},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":32.500000,"attr_":"age","type_
":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.493827,"scoreImprove_":
0.316049,"leftNodeSize_":4,"rightNodeSize_":5},"leftChild_":{"id_":2,"size_":4,
"maxDepth_":2,"label_":"0","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_
":{"id_":3,"size_":5,"maxDepth_":2,"responseCounts_":{"0":1,"1":4},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":36.500000,"attr_":"age","type_
":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.320000,"scoreImprove_":
0.066667,"leftNodeSize_":3,"rightNodeSize_":2},"leftChild_":{"id_":6,"size_":3,
"maxDepth_":1,"label_":"1","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_
":{"id_":7,"size_":2,"maxDepth_":1,"label_":"0","nodeType_":"CLASSIFICATION_
LEAF"}}}

2 0 {"id_":1,"size_":5,"maxDepth_":3,"label_":"1","nodeType_":
"CLASSIFICATION_LEAF"}

3 0 {"id_":1,"size_":10,"maxDepth_":3,"responseCounts_":{"0":9,"1":1},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":37.000000,"attr_":"age","type_
":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.180000,"scoreImprove_":
0.080000,"leftNodeSize_":8,"rightNodeSize_":2},"leftChild_":{"id_":2,"size_":8,
"maxDepth_":2,"label_":"0","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_
":{"id_":3,"size_":2,"maxDepth_":2,"label_":"0","nodeType_":"CLASSIFICATION_
LEAF"}}

TD_KMeans
The K-means algorithm groups a set of observations into k clusters in which each observation belongs to
the cluster with the nearest mean (cluster center or centroid). The algorithm minimizes the objective
function, that is, the total squared Euclidean distance of all data points from their cluster centers, as follows:
1. Specify or randomly select k initial cluster centroids.
2. Assign each data point to the cluster that has the closest centroid.
3. Recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
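Expressed as a formula, the quantity being minimized is the within-cluster sum of squared distances,
where $\mu_k$ is the centroid of cluster $C_k$:

$$J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$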
The algorithm doesn't necessarily find the optimal configuration as it depends significantly on the initial
randomly selected cluster centers. You can run the function multiple times to reduce the effect of
this limitation.
Also, this function returns the within-cluster-squared-sum, which you can use to determine an optimal
number of clusters using the Elbow method.
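For example, one way to apply the Elbow method is to run the function for several NumClusters values and
compare the reported Total_WithinSS rows. The following is a minimal sketch; the table elbow_input and the
columns id, f1, and f2 are placeholders:

SELECT 2 AS k, td_modelinfo_kmeans FROM TD_KMeans (
  ON elbow_input AS InputTable
  USING
  IdColumn('id')
  TargetColumns('f1','f2')
  NumClusters(2)
  Seed(0)
) AS dt WHERE td_modelinfo_kmeans LIKE '%Total_WithinSS%'
UNION ALL
SELECT 3 AS k, td_modelinfo_kmeans FROM TD_KMeans (
  ON elbow_input AS InputTable
  USING
  IdColumn('id')
  TargetColumns('f1','f2')
  NumClusters(3)
  Seed(0)
) AS dt WHERE td_modelinfo_kmeans LIKE '%Total_WithinSS%';

The k at which Total_WithinSS stops decreasing sharply is a reasonable choice for the number of clusters.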


Note:

• This function doesn't consider the InputTable and InitialCentroidsTable Input rows that have a
NULL entry in the specified TargetColumns.
• The function can produce deterministic output across different machine configurations if you
provide the InitialCentroidsTable in the query.
• The function randomly samples the initial centroids from the InputTable, if you don't provide the
InitialCentroidsTable in the query. In this case, you can use the Seed element to make the function
output deterministic on a machine with an assigned configuration. However, using the Seed
argument won't guarantee deterministic output across machines with different configurations.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_KMeans Syntax
SELECT * FROM TD_KMeans (
ON {table | view | query} as InputTable
[ ON {table | view | query} as InitialCentroidsTable DIMENSION ]
[ OUT [ PERMANENT | VOLATILE ] TABLE ModelTable(model_output_table_name) ]
USING
IdColumn('id_column')
TargetColumns({'target_column'|'target_column_range'}[,...])
[ NumClusters(number_of_clusters) ]
[ Seed(seed_value) ]
[ StopThreshold(threshold_value) ]
[ MaxIterNum(number_of_iterations) ]
[ NumInit(num_init) ]
[ OutputClusterAssignment({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
) as alias;

Teradata Vantage™ - Analytics Database Analytic Functions - 17.20,


Release 17.20 200
6: Model Training Functions

TD_KMeans Syntax Elements


IdColumn
[Required]: Specify the input table column name that has the unique identifier for each input
table row.

TargetColumns
[Required]: Specify the input table columns for clustering.

ModelTable
[Optional]: Specify the ModelTable name to save the clustering data model. If specified, then
a model containing centroids of clusters is saved in the specified ModelTable name.

NumClusters
[Optional]: Specify the number of clusters to create from the clustering data. Not required if
the InitialCentroidsTable is specified.

Seed
[Optional]: Specify a non-negative integer value to randomly select the initial cluster centroid
positions from the input table rows. Not required if the InitialCentroidsTable is specified.

StopThreshold
[Optional]: The algorithm converges if the distance between the centroids from the previous
iteration and the current iteration is less than the specified value.
Default Value: 0.0395

MaxIterNum
[Optional]: Specify the maximum number of iterations for the K-means algorithm. The
algorithm stops after performing the specified number of iterations even if the convergence
criterion is not met.
Default Value: 10

NumInit
[Optional]: Specify the number of times to repeat clustering with different initial centroid
seeds. The function returns the model having the least value of Total Within Cluster
Squared Sum.
Not required if the InitialCentroidsTable is specified.
Default Value: 1


OutputClusterAssignment
[Optional]: Specify whether to return the Cluster Assignment information.
Default Value: False

TD_KMeans Input
Input Table Schema

Column         Data Type                      Description

IdColumn       Any                            The InputTable column name that has the unique
                                              identifier for each input table row.

TargetColumns  BYTEINT, SMALLINT, INTEGER,    The input table columns for clustering.
               BIGINT, DECIMAL/NUMERIC,
               FLOAT, REAL, DOUBLE PRECISION

InitialCentroidsTable Schema

Column                    Data Type                      Description

Initial_Clusterid_Column  BYTEINT, SMALLINT,             The column that contains the unique
                          INTEGER, BIGINT                identifiers of initial centroids.

TargetColumns             BYTEINT, SMALLINT, INTEGER,    The columns that contain the initial
                          BIGINT, DECIMAL/NUMERIC,       centroid values.
                          FLOAT, REAL, DOUBLE PRECISION

TD_KMeans Output
Output Table Schema
If the OutputClusterAssignment value is set to False:

Column               Data Type     Description

TD_CLUSTERID_KMEANS  BIGINT        The unique identifier of the cluster.

TargetColumns        REAL          The columns that contain the centroid value for each feature.

TD_SIZE_KMEANS       BIGINT        The number of points in the cluster.

TD_WITHINSS_KMEANS   REAL          The within-cluster-sum-of-squares, that is, the sum of squared
                                   differences of each point from its cluster centroid.

Id_Column            BYTEINT       The unique identifier column name copied from the InputTable.
                                   This column contains only NULL values in the output.

TD_MODELINFO_KMEANS  VARCHAR(128)  The following information related to the model is saved:
                     CHARACTER     • Converged: True or False
                     SET LATIN     • Number of Iterations: The number of iterations performed
                                     by the function.
                                   • Number of Clusters: The number of clusters produced.
                                   • Total_WithinSS: The total within cluster sum of squares.
                                   • Between_SS: Between sum of squares, that is, the sum of
                                     squared distances of the centroids to the global mean, where
                                     the squared distance of each mean to the global mean is
                                     multiplied by the number of data points it represents.

If the OutputClusterAssignment value is set to True:

Column               Data Type  Description

Id_Column            Any        The unique identifier of input rows copied from the input table.

TD_CLUSTERID_KMEANS  BIGINT     The ClusterId assigned to the input row.

TD_KMeans Example
Input Table

id C1 C2
-- -- --
1 1 1
2 2 2
3 8 8
4 9 9

Initial Centroids Table

TD_CLUSTERID_KMEANS C1 C2
------------------- -- --
2 2 2
4 9 9


InitialCentroidsTable is not provided

SELECT * FROM TD_KMeans (
  ON kmeans_input_table AS InputTable
  USING
  IdColumn('id')
  TargetColumns('c1','c2')
  NumClusters(2)
  Seed(0)
  StopThreshold(0.0395)
  MaxIterNum(3)
) AS dt;

td_clusterid_kmeans  C1    C2    td_size_kmeans  td_withinss_kmeans  id    td_modelinfo_kmeans
-------------------  ----  ----  --------------  ------------------  ----  -------------------------------------
0                    1.5   1.5   2               1                   NULL  NULL
1                    8.5   8.5   2               1                   NULL  NULL
NULL                 NULL  NULL  NULL            NULL                NULL  Converged : True
NULL                 NULL  NULL  NULL            NULL                NULL  Number of Iterations : 2
NULL                 NULL  NULL  NULL            NULL                NULL  Number of Clusters : 2
NULL                 NULL  NULL  NULL            NULL                NULL  Total_WithinSS : 2.00000000000000E+00
NULL                 NULL  NULL  NULL            NULL                NULL  Between_SS : 9.80000000000000E+01

InitialCentroidsTable is provided

SELECT * FROM TD_KMeans (
  ON kmeans_input_table AS InputTable
  ON kmeans_initial_centroids_table AS InitialCentroidsTable DIMENSION
  USING
  IdColumn('id')
  TargetColumns('c1','c2')
  StopThreshold(0.0395)
  MaxIterNum(3)
) AS dt;

td_clusterid_kmeans  C1    C2    td_size_kmeans  td_withinss_kmeans  id    td_modelinfo_kmeans
-------------------  ----  ----  --------------  ------------------  ----  -------------------------------------
0                    1.5   1.5   2               1                   NULL  NULL
1                    8.5   8.5   2               1                   NULL  NULL
NULL                 NULL  NULL  NULL            NULL                NULL  Converged : True
NULL                 NULL  NULL  NULL            NULL                NULL  Number of Iterations : 2
NULL                 NULL  NULL  NULL            NULL                NULL  Number of Clusters : 2
NULL                 NULL  NULL  NULL            NULL                NULL  Total_WithinSS : 2.00000000000000E+00
NULL                 NULL  NULL  NULL            NULL                NULL  Between_SS : 9.80000000000000E+01

OutputClusterAssignment is set to true

SELECT * FROM TD_KMeans (
  ON kmeans_input_table AS InputTable
  ON kmeans_initial_centroids_table AS InitialCentroidsTable DIMENSION
  USING
  IdColumn('id')
  TargetColumns('c1','c2')
  StopThreshold(0.0395)
  MaxIterNum(3)
  OutputClusterAssignment('true')
) AS dt ORDER BY 1;


id td_clusterid_kmeans
----------- --------------------
1 0
2 0
3 1
4 1

OutputClusterAssignment is set to true and ModelTable (OUT clause) is provided

SELECT * FROM TD_KMeans (
  ON kmeans_input_table AS InputTable
  ON kmeans_initial_centroids_table AS InitialCentroidsTable DIMENSION
  OUT TABLE ModelTable(kmeans_model)
  USING
  IdColumn('id')
  TargetColumns('c1','c2')
  StopThreshold(0.0395)
  MaxIterNum(3)
  OutputClusterAssignment('true')
) AS dt;

id td_clusterid_kmeans
----------- --------------------
1 0
2 0
3 1
4 1

KMeans ModelTable Output

td_clusterid_kmeans  c1           c2           td_size_kmeans  td_withinss_kmeans  id    td_modelinfo_kmeans
-------------------  -----------  -----------  --------------  ------------------  ----  -------------------------------------
NULL                 NULL         NULL         NULL            NULL                NULL  Converged : True
NULL                 NULL         NULL         NULL            NULL                NULL  Number of Clusters : 2
NULL                 NULL         NULL         NULL            NULL                NULL  Total_WithinSS : 2.00000000000000E+00
NULL                 NULL         NULL         NULL            NULL                NULL  Between_SS : 9.80000000000000E+01
NULL                 NULL         NULL         NULL            NULL                NULL  Number of Iterations : 2
0                    1.500000000  1.500000000  2               1.000000000         NULL  NULL
1                    8.500000000  8.500000000  2               1.000000000         NULL  NULL

TD_GLM
The TD_GLM function is a generalized linear model (GLM) that performs regression and classification
analysis on data sets, where the response follows an exponential family distribution and supports the
following models:
• Regression (Gaussian family): The loss function is squared error.
• Binary Classification (Binomial family): The loss function is logistic and implements logistic regression.
The only response values are 0 or 1.
The function uses the minibatch stochastic gradient descent (SGD) algorithm, which is highly scalable
for large datasets. The algorithm estimates the gradient of the loss in minibatches, whose size is defined
by the BatchSize argument, and updates the model with a learning rate controlled by the LearningRate argument.
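In the standard minibatch SGD form that this corresponds to, each iteration draws a minibatch $B_t$ and
updates the weights $w$ with learning rate $\eta_t$ (a generic sketch; the internal update also folds in the
regularization and momentum terms described below):

$$w_{t+1} = w_t - \eta_t \cdot \frac{1}{\lvert B_t \rvert} \sum_{i \in B_t} \nabla_w \, \ell\bigl(y_i, f(x_i; w_t)\bigr)$$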
The function also supports the following approaches:
• L1, L2, and Elastic Net Regularization for shrinking model parameters
• Accelerated learning using Momentum and Nesterov approaches
The function uses a combination of IterNumNoChange and Tolerance arguments to define the convergence
criterion and runs multiple iterations (up to the specified value in the MaxIterNum argument) until the
algorithm meets the criterion.
The function also supports LocalSGD, a variant of SGD, that uses LocalSGDIterations on each AMP to run
multiple batch iterations locally followed by a global iteration.
The weights from all mappers are aggregated in a reduce phase and are used to compute the gradient
and loss in the next iteration. LocalSGD lowers communication costs and can result in faster learning and
convergence in fewer iterations, especially when there is a large cluster size and many features.
Due to gradient-based learning, the function is highly sensitive to feature scaling. Before using
the features in the function, you must standardize the input features using the TD_ScaleFit and
TD_ScaleTransform functions.
The function only accepts numeric features. Therefore, before training, you must convert the categorical
features to numeric values.
The function skips the rows with missing (null) values during training.
The function output is a trained GLM model that is used as input to the TD_GLMPredict function.
The model also contains the model statistics MSE, Loglikelihood, AIC, and BIC. You can use the
TD_RegressionEvaluator, TD_ClassificationEvaluator, and TD_ROC functions to perform model evaluation
as a post-processing step.
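
As a sketch of that downstream flow, scoring new rows might look like the following; the table names and
the argument names here are assumptions, so see the TD_GLMPredict reference for the exact syntax:

SELECT * FROM TD_GLMPredict (
  ON new_observations AS InputTable PARTITION BY ANY
  ON td_glm_model AS ModelTable DIMENSION
  USING
  IDColumn ('id')
  Accumulate ('id')
) AS dt;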

Note:
When an unsupported data type is passed in InputColumns or ResponseColumn, the following error
message is displayed:

Unsupported data type for column index n in argument InputColumns.

In the message, n refers to the column index based on an input to the function comprising InputColumns
and ResponseColumn only. The function does not need the rest of the columns, and the Teradata Vantage
optimizer does not project them to the function. Due to this, n might be different from the actual index in the
input table.

TD_GLM Syntax
SELECT * FROM TD_GLM (
  ON { table | view | (query) } PARTITION BY ANY
  [ OUT TABLE MetaInformationTable (meta_table) ]
  USING
  InputColumns ({ 'input_column' | input_column_range }[,...])
  ResponseColumn ('response_column')
  [ Family ('Gaussian' | 'Binomial') ]
  [ BatchSize (batchsize) ]
  [ MaxIterNum (max_iter) ]
  [ RegularizationLambda (lambda) ]
  [ Alpha (alpha) ]
  [ IterNumNoChange (n_iter_no_change) ]
  [ Tolerance (tolerance) ]
  [ Intercept ('true' | 'false') ]
  [ ClassWeights ('class:weight,...') ]
  [ LearningRate ('constant' | 'optimal' | 'invtime' | 'adaptive') ]
  [ InitialEta (eta0) ]
  [ DecayRate (gamma) ]
  [ DecaySteps (decay_steps) ]
  [ Momentum (momentum) ]
  [ Nesterov ('true' | 'false') ]
  [ LocalSGDIterations (local_iterations) ]
) as dt;


TD_GLM Syntax Elements


InputColumns
Specify the input table column names for training the model (predictors, features or
independent variables).

ResponseColumn
Specify the column name that contains the class label for classification or target value
(dependent variable) for regression.

Family
[Optional] Specify the distribution exponential family. Options are Gaussian and Binomial.
Default value is Gaussian.

MaxIterNum
[Optional] Specify the maximum number of iterations (minibatches) over the training data
batches. Value is a positive integer less than 10,000,000. Default value is 300.

BatchSize
[Optional] Specify the number of observations (training samples) processed in a single
minibatch per AMP. A value of 0, or a value higher than the number of rows on an AMP,
processes all rows on the AMP, so the entire dataset is processed in a single iteration and the
algorithm becomes gradient descent. Specify a positive integer value. The default value
is 10.

RegularizationLambda
[Optional] Specify the regularization amount. The higher the value, the stronger the
regularization. The value is also used to compute the learning rate when the learning rate is
set to optimal. Must be a non-negative float value. A value of 0 means no regularization.
Default value is 0.02.

Alpha
[Optional] Specify the Elasticnet parameter for penalty computation. It is only effective when
RegularizationLambda is greater than 0. The value represents the contribution ratio of L1 in
the penalty. A value of 1.0 indicates L1 (LASSO) only, a value of 0 indicates L2 (Ridge) only,
and a value between is a combination of L1 and L2. Value is a float value between 0 and 1.
The default value is 0.15 (15% L1, 85% L2).

IterNumNoChange
[Optional] Specify the number of iterations (minibatches) with no loss improvement beyond
the tolerance after which training stops early. A value of 0 indicates no early stopping, and the
algorithm continues until MaxIterNum iterations are reached. Specify a positive integer value.
The default value is 50.

Tolerance
[Optional] Specify the stopping criterion in terms of loss function improvement. Applicable
when IterNumNoChange is greater than 0. Specify a positive value. The default value
is 0.001.

Intercept
[Optional] Specify whether to estimate intercept based on whether the data is already
centered. The default value is true.

ClassWeights
[Optional] Specify weights associated with classes. Only applicable for Binomial Family.
The format is 0:weight,1:weight. For example, 0:1.0,1:0.5 gives twice the weight to each
observation in class 0. If the weight of a class is omitted, it is assumed to be 1.0. The default
value is 0:1.0,1:1.0.

LearningRate
[Optional] Specify one of the learning rate algorithms:
• Constant
• InvTime
• Optimal
• Adaptive
The default value is invtime for Gaussian and optimal for Binomial.

InitialEta
[Optional] Specify the initial learning rate eta value. If you specify the learning rate as
constant, the eta value is applicable for all iterations. The default value is 0.05.

DecayRate
[Optional] Specify the decay rate for the learning rate. Only applicable for invtime and
adaptive learning rates. The default value is 0.25.

DecaySteps
[Optional] Specify the number of iterations without decay for the adaptive learning rate. The
learning rate changes by the decay rate after the specified number of iterations are completed.
The default value is 5.


Momentum
[Optional] Specify the value to use for the momentum learning rate optimizer. A larger value
indicates a higher momentum contribution. A value of 0 means the momentum optimizer is
disabled. For a good momentum contribution, a value between 0.6 and 0.95 is recommended.
Value is a non-negative float between 0 and 1. The default value is 0.

Nesterov
[Optional] Specify whether to use Nesterov optimization for the Momentum optimizer. Only
applicable when the Momentum optimizer value is greater than 0. The default value is True.

LocalSGDIterations
[Optional] Specify the number of local iterations for the Local SGD algorithm. A value
of 0 implies that the algorithm is disabled. A value greater than 0 enables the algorithm
and specifies the number of iterations for the algorithm. The recommended values for the
arguments are as follows:
• LocalSGDIterations: 10
• MaxIterNum: 100
• BatchSize: 50
• IterNumNoChange: 5
The default value is 0.
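
Put together as a call, those recommended Local SGD settings would look like the following
sketch, where the table training_data and the columns x1, x2, x3, and y are placeholders:

SELECT * FROM TD_GLM (
  ON training_data PARTITION BY ANY
  USING
  InputColumns ('x1', 'x2', 'x3')
  ResponseColumn ('y')
  Family ('Gaussian')
  LocalSGDIterations (10)
  MaxIterNum (100)
  BatchSize (50)
  IterNumNoChange (5)
) AS dt;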

TD_GLM Input
Column           Data Type                    Description

input_column     INTEGER, BIGINT, SMALLINT,   The input table columns used to train the
                 BYTEINT, FLOAT, DECIMAL,     GLM model.
                 NUMBER

response_column  INTEGER, BIGINT, SMALLINT,   The column that contains the response value
                 BYTEINT, FLOAT, DECIMAL,     for an observation.
                 NUMBER

TD_GLM Output
TD_GLM produces the following outputs:
• Model (Primary output): Contains the trained model with model statistics. The following model
statistics are stored in the model:
◦ Loss Function
◦ MSE (Gaussian)
◦ Loglikelihood (Logistic)

◦ Number of Observations
◦ AIC
◦ BIC
◦ Number of Iterations
◦ Regularization
◦ Alpha (L1/L2/Elasticnet)
◦ Learning Rate (initial)
◦ Learning Rate (Final)
◦ Momentum
◦ Nesterov
◦ LocalSGD Iterations
• [Optional] MetaInformationTable (Secondary Output): Contains training progress information for
each iteration.
The model output schema is as follows:
Column Data Type Description

attribute SMALLINT The numeric index of the predictor or model metric. The intercept uses
index 0, the predictors take positive indices, and model metrics take negative indices.

predictor VARCHAR The name of the predictor or model metric.

estimate FLOAT The predictor weights and numeric-based metric values.

value VARCHAR The string-based metric values such as SQUARED_ERROR for LossFunction,
L2 for Regularization, and so on.

The MetaInformationTable output schema is as follows:


Column Data Type Description

iteration INTEGER The iteration number.

num_rows BIGINT The total number of rows processed.

eta FLOAT The learning rate for the iteration.

loss FLOAT The loss in the iteration.

best_loss FLOAT The best loss until the specified iteration.

TD_GLM Example
TD_GLM Example for Credit Data Set
The following credit data set is used in this example:


ID   A1         A2          A7          A10        A13        A14        A0_b  A0_a  A3_y  A3_u  A4_p  A4_g

61   0.218228   2.17724     0.142986    1.25309    0.238997   -0.130455  1     0     0     1     0     1
297  -0.453325  -0.0247979  -0.404925   -0.77231   0.886307   -0.225685  0     1     0     1     0     1
631  -1.13175   -0.77234    -0.199458   -0.547266  -0.761392  -0.225817  0     1     0     1     0     1
122  -0.657541  1.20223     0.00600835  0.802998   -0.54366   -0.120063  1     0     0     1     0     1
...  ...        ...         ...         ...        ...        ...        ...   ...   ...   ...   ...   ...

TD_GLM Call for Credit Data

CREATE VOLATILE TABLE td_glm_output_credit_ex AS (


SELECT * FROM td_glm (
ON credit_ex_merged
USING
InputColumns('a1', 'a2', 'a7', 'a10', 'a13', 'a14', 'a0_b', 'a0_a',
'a3_y', 'a3_u', 'a4_p', 'a4_g', 'a5_k', 'a5_cc', 'a5_d', 'a5_c', 'a5_aa',
'a5_m', 'a5_q', 'a5_w', 'a5_e', 'a5_ff', 'a5_j', 'a5_x', 'a5_i', 'a6_v',
'a6_h', 'a6_bb', 'a6_z', 'a6_ff', 'a6_j', 'a8_t', 'a8_f', 'a9_t', 'a9_f',
'a11_t', 'a11_f', 'a12_g', 'a12_s')
ResponseColumn('Outcome')
Family('Binomial')
BatchSize(10)
MaxIterNum(300)
RegularizationLambda(0.02)
Alpha(0.15)
IterNumNoChange(50)
Tolerance(0.001)
Intercept('true')
LearningRate('optimal')
InitialEta(0.001)
Momentum(0.0)
LocalSGDIterations(0)
) AS dt
) WITH DATA
ON COMMIT PRESERVE ROWS
;


TD_GLM Output for Credit Data


Attribute Predictor Estimate Value

-13 LocalSGD Iterations 0

-12 Nesterov FALSE

-11 Momentum 0

-10 Learning Rate (Final) 0.287682

-9 Learning Rate (Initial) 0.001

-8 Number of Iterations 156 CONVERGED

-7 Alpha 0.15 Elasticnet

-6 Regularization 0.02 ENABLED

-5 BIC 151.787

-4 AIC 80.4189

-3 Number of Observations 44

-2 Loglik -0.209461

-1 Loss Function LOG

0 (Intercept) 0.146566

1 A1 0.732289

2 A2 0

3 A7 0.717899

4 A10 0.682358

5 A13 0.822302

6 A14 0.176791

7 A0_b 0.172178

8 A0_a -0.12165

9 A3_y -0.285135

10 A3_u 0.335663

11 A4_p -0.285135

12 A4_g 0.335663

13 A5_k 0.0358046

14 A5_cc 0


Attribute Predictor Estimate Value

15 A5_d 0.0480538

16 A5_c 0.430725

17 A5_aa 0

18 A5_m -0.332389

19 A5_q 0.524153

20 A5_w -0.599829

21 A5_e 0.0257252

22 A5_ff -0.359011

23 A5_j -0.119432

24 A5_x 0.0494795

25 A5_i 0

26 A6_v -0.387493

27 A6_h 0.0415697

28 A6_bb 0.0604108

29 A6_z 0

30 A6_ff -0.372217

31 A6_j 0.538139

32 A8_t 0.9259

33 A8_f -0.875372

34 A9_t 0

35 A9_f 0

36 A11_t 0.197957

37 A11_f -0.146928

38 A12_g -0.221414

39 A12_s 0.272506

TD_GLM Example for Housing Data


This example takes raw housing data and does the following:
1. Uses TD_ScaleFit to standardize the data.

2. Uses TD_ScaleTransform to transform the data.


3. Uses TD_GLM to get a model.

Raw Housing Data


The following is a sample of housing data taken from the California housing dataset.

ID     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal

2833   1.3527  30        2.24754   0.742574   169         1.67327   35.39     -119.02    0.…
5328   2.7679  23        3.03868   1.06446    2031        1.63658   34.04     -118.45    2.…
5300   1.583   19        3.14815   1.04548    3751        2.4373    34.07     -118.45    3.…
12433  1.7344  24        3.2984    1.05856    4042        4.4663    33.51     -116.01    0.…
...    ...     ...       ...       ...        ...         ...       ...       ...        ...

TD_ScaleFit Call for Housing Data

SELECT * FROM TD_ScaleFit(


ON cal_housing_ex_raw as InputTable
OUT VOLATILE TABLE OutputTable(scaleFitOut_cal_ex)
USING
TargetColumns('medinc', 'houseage', 'averooms', 'avebedrms', 'population',
'aveoccup', 'latitude', 'longitude')
ScaleMethod('STD')
) as dt2;

TD_ScaleFit Output for Housing Data


TD_STATTYPE_SCLFIT           MedInc   HouseAge  AveRooms  AveBedrms  Population  AveOccup  ...

min                          1.0472   7         2.24752   0.742574   47          1.63658   ...

max                          10.7721  52        8.89305   2.56522    4145        5.45536   ...

sum                          261.401  2134      349.412   74.0563    91566       200.592   ...

count                        69       69        69        69         69          69        ...

null                         0        0         0         0          0           0         ...

avg                          3.78842  30.9275   5.06394   1.07328    1327.04     2.90714   ...

variance                     5.40242  142.803   1.55253   0.0458083  861633      0.618171  ...

std                          2.30741  11.8631   1.23694   0.212472   921.491     0.780521  ...

ustd                         2.32431  11.95     1.246     0.214029   928.242     0.786239  ...

multiplier                   1        1         1         1          1           1         ...

intercept                    0        0         0         0          0           0         ...

location                     3.78842  30.9275   5.06394   1.07328    1327.04     2.90714   ...

scale                        2.30741  11.8631   1.23694   0.212472   921.491     0.780521  ...

globalscale_false            nan      nan       nan       nan        nan         nan       ...

ScaleMethodNumberMapping:    3        3         3         3          3           3         ...
[0:mean, 1:sum, 2:ustd, 3:
std, 4:range, 5:midrange, 6:
maxabs, 7:rescale]

missvalue_keep               nan      nan       nan       nan        nan         nan       ...

TD_ScaleTransform Call for Housing Data

CREATE MULTISET TABLE cal_housing_ex_scaled AS (


SELECT * FROM TD_ScaleTransform(
ON cal_housing_ex_raw AS InputTable
ON scaleFitOut_cal_ex AS FitTable DIMENSION
USING
accumulate('id', 'MedHouseVal')
) AS dt1
) WITH data;

TD_ScaleTransform Output for Housing Data


ID    MedHouseVal  MedInc       HouseAge    AveRooms   AveBedrms   Population  AveOccup    Latitude  ...

244   1.117        -0.605796    1.10194     -0.160367  0.426688    1.02221     1.04102     0.946201  ...

670   1.922        -0.00308521  0.427582    -0.129699  -0.530672   -0.761856   -0.21256    0.906345  ...

686   1.578        -0.152084    -0.0781865  -0.625426  -0.513581   -0.685892   -0.533101   1.01705   ...

1754  1.651        -0.0263147   0.596172    0.454207   -0.0272726  0.0683203   -0.0827654  1.01705   ...

...   ...          ...          ...         ...        ...         ...         ...         ...       ...

TD_GLM Call for Housing Data

CREATE VOLATILE TABLE td_glm_cal_ex AS (


SELECT * from TD_GLM (
ON cal_housing_ex_scaled
USING
InputColumns('medinc', 'houseage', 'averooms', 'avebedrms',
'population', 'aveoccup', 'latitude', 'longitude')
ResponseColumn('MedHouseVal')
Family('Gaussian')
BatchSize(10)
MaxIterNum(300)
RegularizationLambda(0.02)
Alpha(0.15)
IterNumNoChange(50)
Intercept('true')
LearningRate('invtime')
InitialEta(0.05)
Momentum(0)
Nesterov('false')
LocalSGDIterations(0)
) as dt
) WITH DATA
ON COMMIT PRESERVE ROWS
;

TD_GLM Output for Housing Data


Attribute Predictor Estimate Value

-13 LocalSGD Iterations 0

-12 Nesterov

-11 Momentum 0

-10 Learning Rate (Final) 0.0133974

-9 Learning Rate (Initial) 0.05

-8 Number of Iterations 194 CONVERGED

-7 Alpha 0.15 Elasticnet


Attribute Predictor Estimate Value

-6 Regularization 0.02 ENABLED

-5 BIC -67.6236

-4 AIC -87.7305

-3 Number of Observations 69

-2 MSE 0.216033

-1 Loss Function SQUARED_ERROR

0 (Intercept) 2.07174

1 MedInc 0.782883

2 HouseAge 0.231914

3 AveRooms 0.0619822

4 AveBedrms -0.113656

5 Population 0.211336

6 AveOccup -0.388201

7 Latitude -0.195511

8 Longitude -0.193884

TD_VectorDistance
The TD_VectorDistance function accepts a table of target vectors and a table of reference vectors and
returns a table that contains the distance between target-reference pairs.
The function computes the distance between the target pair and the reference pair from the same table if
you provide only one table as the input.
You must have the same column order in the TargetFeatureColumns argument and the RefFeatureColumns
argument. The function ignores the feature values during distance computation if the value is either NULL,
NAN, or INF.

Important:
The function returns N² output rows if you use the TopK value as -1 because the function includes all
reference vectors in the output table.


Note:
The algorithm used in this function is of the order of N² (where N is the number of rows). Hence, expect
the query to run significantly longer as the number of rows increases in either the target table or the
reference table. Also, because the reference table is a DIMENSION input, it is copied to the spool of each
AMP before the query runs, so the user's spool limit constrains the size and scalability of the input.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_VectorDistance Syntax
SELECT * FROM TD_VectorDistance (
ON { table | view | (query) } AS TARGETTABLE PARTITION BY ANY
[ ON { table | view | (query) } AS REFERENCETABLE DIMENSION ]
USING
TargetIDColumn ('target_id_column')
TargetFeatureColumns ({ 'target_feature_column' | target_feature_Column_range }[,...])
[ RefIDColumn ('ref_id_column') ]
[ RefFeatureColumns ({ 'ref_feature_column' | ref_feature_Column_range }[,...]) ]
[ DistanceMeasure ({'Cosine' | 'Euclidean' | 'Manhattan' }[,...])]
[ TopK(integer_value)]
) AS alias;

TD_VectorDistance Syntax Elements


TargetIDColumn
Specify the target table column name that contains identifiers of the target table vectors.

TargetFeatureColumns
Specify the target table column names that contain features of the target table vectors.

Note:
You can specify up to 2018 feature columns.


RefIDColumn
[Optional] Specify the reference table column name that contains identifiers of the reference
table vectors.

RefFeatureColumns
[Optional] Specify the reference table column names that contain features of the reference
table vectors.

Note:
You can specify up to 2018 feature columns.

DistanceMeasure
[Optional] Specify the distance type to compute between the target and the reference vector:
• Cosine: Cosine distance between the target vector and the reference vector.
• Euclidean: Euclidean distance between the target vector and the reference vector.
• Manhattan: Manhattan distance between the target vector and the reference vector.

TopK
[Optional] Specify the maximum number of closest reference vectors to include in the output
table for each target vector. The value k is an integer between 1 and 100. The default value
is 10.
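
For reference, these measures follow the standard definitions for a target vector u and a reference
vector v, which you can verify against the example output later in this section:

$$\mathrm{Cosine}(u,v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}, \qquad \mathrm{Euclidean}(u,v) = \sqrt{\sum_i (u_i - v_i)^2}, \qquad \mathrm{Manhattan}(u,v) = \sum_i \lvert u_i - v_i \rvert$$

For example, the Manhattan distance between target vector 2 (0.5, 0.4, 0.4) and reference vector 7
(0.73, 0.5, 0.7) is |0.5-0.73| + |0.4-0.5| + |0.4-0.7| = 0.63, matching the result table.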

TD_VectorDistance Input
Target Table Schema:

Column                 Data Type                      Description

target_id_column       BYTEINT, SMALLINT,             The target table column name that contains
                       BIGINT, INTEGER                target table vector identifiers.

target_feature_column  BYTEINT, SMALLINT, BIGINT,     The target table column names that contain
                       INTEGER, DECIMAL, NUMBER,      features of the target table vectors.
                       FLOAT, REAL, DOUBLE PRECISION

Reference Table Schema:

Column              Data Type                      Description

ref_id_column       BYTEINT, SMALLINT,             The reference table column name that contains
                    BIGINT, INTEGER                identifiers of the reference table vectors.

ref_feature_column  BYTEINT, SMALLINT, BIGINT,     The reference table column names that contain
                    INTEGER, DECIMAL, NUMBER,      features of the reference table vectors.
                    FLOAT, REAL, DOUBLE PRECISION

TD_VectorDistance Output
The function produces a table with the distances between the target and reference vectors.
Column Data Type Description

Target_ID BIGINT The ID of the specified target table column name.

Reference_ID BIGINT The ID of the specified reference vector.

DistanceType VARCHAR The specified distance type.

Distance FLOAT The distance between the target and the reference vectors.

TD_VectorDistance Example
Target Table:

Userid CallDuration DataCounter SMS

1 0.0000333 0.2 0.1

2 0.5 0.4 0.4

3 1 0.8 0.9

4 0.01 0.4 0.2

Reference Table:

Userid CallDuration DataCounter SMS

5 0.93 0.4 0.7

6 0.83 0.3 0.6

7 0.73 0.5 0.7

SQL Call

SELECT target_id, reference_id, distancetype, CAST(distance AS DECIMAL(36,8)) AS distance
FROM TD_VECTORDISTANCE (
  ON target_mobile_data_dense AS TargetTable
  ON ref_mobile_data_dense AS ReferenceTable DIMENSION
  USING
  TargetIDColumn('userid')
  TargetFeatureColumns('CallDuration','DataCounter','SMS')
  RefIDColumn('userid')
  RefFeatureColumns('CallDuration','DataCounter','SMS')
  DistanceMeasure('euclidean','cosine','manhattan')
  TopK(2)
) AS dt ORDER BY 3,1,2,4;

TD_VectorDistance Result

Target_ID Reference_ID DistanceType Distance


1 5 cosine 0.45486518
1 7 cosine 0.32604815
2 5 cosine 0.02608923
2 7 cosine 0.00797609
3 5 cosine 0.02415054
3 7 cosine 0.00337338
4 5 cosine 0.43822243
4 7 cosine 0.31184844
1 6 euclidean 0.97408661
1 7 euclidean 0.99138861
2 6 euclidean 0.39862262
2 7 euclidean 0.39102429
3 5 euclidean 0.45265881
3 7 euclidean 0.45044423
4 6 euclidean 0.91782351
4 7 euclidean 0.88226980
1 6 manhattan 1.42996670
1 7 manhattan 1.62996670
2 6 manhattan 0.63000000
2 7 manhattan 0.63000000
3 5 manhattan 0.67000000
3 7 manhattan 0.77000000
4 6 manhattan 1.32000000
4 7 manhattan 1.32000000

7
Model Scoring Functions

GLMPredict
Note:
The GLMPredict function uses the model output by the ML Engine GLM function to analyze the input
data and make predictions.

If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.

GLMPredict Syntax
SELECT * FROM GLMPredict (
ON { table | ( query ) } [ PARTITION BY ANY ]
ON { table | view | (query) } AS Model DIMENSION
[ USING
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ Family ('family') ]
[ LinkFunction ('link') ]
]
) AS alias;

You must specify the keyword USING to use any function syntax element.
Related Information:
Column Specification Syntax Elements

GLMPredict Syntax Elements


Accumulate
[Optional] Specify the names of input table columns to copy to the output table.

Family
[Optional] Specify the distribution exponential family.
If you specify this syntax element, you must give it the same value that you used for the
Family syntax element of ML Engine GLM function when you created the model table.


Default: Read from the model table

LinkFunction
[Optional] Specify the link function. For the canonical link functions (default link functions)
and the link functions allowed for each exponential family, see the GLM function description
in Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003.
If you specify this syntax element, you must give it the same value that you used for the
LinkFunction syntax element of ML Engine GLM function when you created the model table.
Default: 'CANONICAL'

GLMPredict Input
Table Description

Input Contains new data.

Model Model output by ML Engine GLM function. For schema, see Teradata Vantage™ Machine
Learning Engine Analytic Function Reference, B700-4003.
If the GLM call that created the model table specified the Step syntax element, include the optional
ORDER BY clause in the GLMPredict call; otherwise, the GLMPredict result is nondeterministic.

Input Table Schema

Column                     Data Type             Description

accumulate_column          Any                   Column to copy to output table.

dependent_variable_column  INTEGER, SMALLINT,    Dependent/response variable.
                           BIGINT, NUMERIC,      Cannot be NULL.
                           DOUBLE PRECISION,
                           VARCHAR(n), CHAR(n)

predictor_variable_column  INTEGER, SMALLINT,    Independent/predictor variable.
                           BIGINT, NUMERIC,      Cannot be NULL.
                           DOUBLE PRECISION

Any numeric dependent_variable_column or predictor_variable_column that is expected to be
categorical must be cast to VARCHAR.

Model Table Schema

For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.

Column        Data Type         Description

attribute     INTEGER           Numeric index of predictor.

predictor     CHARACTER         Predictor name.
              or VARCHAR

category      CHARACTER         For categorical predictor, its level. For numeric
              or VARCHAR        predictor, NULL.

estimate      DOUBLE PRECISION  Estimated coefficient.

std_error     DOUBLE PRECISION  Standard error of coefficient.

t_score       DOUBLE PRECISION  [Column appears only with Family ('GAUSSIAN').] The t_score
                                follows a t(N-p-1) distribution.

z_score       DOUBLE PRECISION  [Column appears only without Family ('GAUSSIAN').] The
                                z_score follows the N(0,1) distribution.

p_value       DOUBLE PRECISION  p-value for z_score. (p-value represents significance of each
                                coefficient.)

significance  CHARACTER         Significance code for p_value.
              or VARCHAR

family        CHARACTER         Distribution exponential family, specified by Family
              or VARCHAR        syntax element.

GLMPredict Output
Output Table Schema

Column             Data Type         Description

accumulate_column  Same as in        Column copied from input table.
                   input_table

fitted_value       DOUBLE PRECISION  Score of the input data, given by the equation g⁻¹(Xβ), where
                                     g⁻¹ is the inverse link function, X the predictors, and β the
                                     vector of coefficients estimated by the GLM function. For the
                                     binomial (logistic) family, the score is the probability that
                                     the response is 1; for other values of Family, the scores are
                                     the expected values of the dependent/response variable,
                                     conditional on the predictors.
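
For example, with Family ('LOGISTIC') and LinkFunction ('LOGIT'), the inverse link is the logistic
function, so the score reduces to a probability:

$$\text{fitted\_value} = g^{-1}(X\beta) = \frac{1}{1 + e^{-X\beta}}$$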

GLMPredict Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.


GLMPredict Example: Logistic Distribution Prediction

This example predicts the admission status of students.

Input
• Input table: admissions_test, which has admissions information for 20 students
• Model: glm_admissions_model, output by "GLM Example: Logistic Regression Analysis with
Intercept" in Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003

Input Table Column Descriptions


Column Description

id Student identifier (unique)

masters Whether student has a masters degree—yes or no (categorical)

gpa Grade point average on a 4.0 scale (numerical)

stats Statistical skills—Novice, Beginner, or Advanced (categorical)

programming Programming skills—Novice, Beginner, or Advanced (categorical)

admitted Whether student was admitted—1 (yes) or 0 (no)

admissions_test
id masters gpa stats programming admitted

50 yes 3.95 Beginner Beginner 0

51 yes 3.76 Beginner Beginner 0

52 no 3.7 Novice Beginner 1

53 yes 3.5 Beginner Novice 1

54 yes 3.5 Beginner Advanced 1

55 no 3.6 Beginner Advanced 1

56 no 3.82 Advanced Advanced 1

57 no 3.71 Advanced Advanced 1

58 no 3.13 Advanced Advanced 1

59 no 3.65 Novice Novice 1

60 no 4 Advanced Novice 1

61 yes 4 Advanced Advanced 1

62 no 3.7 Advanced Advanced 1


id masters gpa stats programming admitted

63 no 3.83 Advanced Advanced 1

64 yes 3.81 Advanced Advanced 1

65 yes 3.9 Advanced Advanced 1

66 no 3.87 Novice Beginner 1

67 yes 3.46 Novice Beginner 0

68 no 1.87 Advanced Novice 1

69 no 3.96 Advanced Advanced 1

SQL Call

CREATE MULTISET TABLE glmpredict_admissions AS (


SELECT * FROM GLMPredict (
ON admissions_test PARTITION BY ANY
ON glm_admissions_model AS Model DIMENSION
USING
Accumulate ('id','masters','gpa','stats','programming','admitted')
Family ('LOGISTIC')
LinkFunction ('LOGIT')
) AS dt
) WITH DATA;

Output
This query returns the following table:

SELECT * FROM glmpredict_admissions ORDER BY 1;

Fitted values can vary in precision, because they depend on the model table output by ML Engine GLM
function and fetched to Analytics Database.

glmpredict_admissions
id masters gpa stats programming admitted fitted_value

50 yes 3.95000000000000E 000 Beginner Beginner 0 3.50763408888030E-001

51 yes 3.76000000000000E 000 Beginner Beginner 0 3.55708978581653E-001

52 no 3.70000000000000E 000 Novice Beginner 1 7.58306140231079E-001

53 yes 3.50000000000000E 000 Beginner Novice 1 5.56012779663342E-001

54 yes 3.50000000000000E 000 Beginner Advanced 1 7.69474352959112E-001

55 no 3.60000000000000E 000 Beginner Advanced 1 9.68031141480050E-001

56 no 3.82000000000000E 000 Advanced Advanced 1 9.45772725968165E-001

57 no 3.71000000000000E 000 Advanced Advanced 1 9.46411914806798E-001

58 no 3.13000000000000E 000 Advanced Advanced 1 9.49666186386367E-001

59 no 3.65000000000000E 000 Novice Novice 1 8.74189685344822E-001

60 no 4.00000000000000E 000 Advanced Novice 1 8.65058992199339E-001

61 yes 4.00000000000000E 000 Advanced Advanced 1 6.50618727191735E-001

62 no 3.70000000000000E 000 Advanced Advanced 1 9.46469669158547E-001

63 no 3.83000000000000E 000 Advanced Advanced 1 9.45714262806374E-001

64 yes 3.81000000000000E 000 Advanced Advanced 1 6.55523357798207E-001

65 yes 3.90000000000000E 000 Advanced Advanced 1 6.53204164468356E-001

66 no 3.87000000000000E 000 Novice Beginner 1 7.54738501228429E-001

67 yes 3.46000000000000E 000 Novice Beginner 0 2.60034297764359E-001

68 no 1.87000000000000E 000 Advanced Novice 1 8.90965518431337E-001

69 no 3.96000000000000E 000 Advanced Advanced 1 9.44948816395031E-001

Categorizing fitted_value Column


The fitted_value column gives the probability that a student belongs to one of the output classes. The
following figure shows a typical logistic regression graph, mapping the input x-axis against a y probability
value between [0,1].

A fitted_value probability greater than or equal to 0.5 implies class 1 (student admitted); a probability less
than 0.5 implies class 0 (student rejected).
The following code adds a fitted_category column to glmpredict_admissions and populates it:

ALTER TABLE glmpredict_admissions
ADD fitted_category INT;
UPDATE glmpredict_admissions SET fitted_category = 1
WHERE fitted_value >= 0.5;
UPDATE glmpredict_admissions SET fitted_category = 0
WHERE fitted_value < 0.5;


This query returns the following table:

SELECT * FROM glmpredict_admissions ORDER BY 1;

id  masters  gpa                    stats     programming  admitted  fitted_value           fitted_category

50  yes      3.95000000000000E 000  Beginner  Beginner     0         3.50763408888030E-001  0
51  yes      3.76000000000000E 000  Beginner  Beginner     0         3.55708978581653E-001  0
52  no       3.70000000000000E 000  Novice    Beginner     1         7.58306140231079E-001  1
53  yes      3.50000000000000E 000  Beginner  Novice       1         5.56012779663342E-001  1
54  yes      3.50000000000000E 000  Beginner  Advanced     1         7.69474352959112E-001  1
55  no       3.60000000000000E 000  Beginner  Advanced     1         9.68031141480050E-001  1
56  no       3.82000000000000E 000  Advanced  Advanced     1         9.45772725968165E-001  1
57  no       3.71000000000000E 000  Advanced  Advanced     1         9.46411914806798E-001  1
58  no       3.13000000000000E 000  Advanced  Advanced     1         9.49666186386367E-001  1
59  no       3.65000000000000E 000  Novice    Novice       1         8.74189685344822E-001  1
60  no       4.00000000000000E 000  Advanced  Novice       1         8.65058992199339E-001  1
61  yes      4.00000000000000E 000  Advanced  Advanced     1         6.50618727191735E-001  1
62  no       3.70000000000000E 000  Advanced  Advanced     1         9.46469669158547E-001  1
63  no       3.83000000000000E 000  Advanced  Advanced     1         9.45714262806374E-001  1
64  yes      3.81000000000000E 000  Advanced  Advanced     1         6.55523357798207E-001  1
65  yes      3.90000000000000E 000  Advanced  Advanced     1         6.53204164468356E-001  1
66  no       3.87000000000000E 000  Novice    Beginner     1         7.54738501228429E-001  1
67  yes      3.46000000000000E 000  Novice    Beginner     0         2.60034297764359E-001  0
68  no       1.87000000000000E 000  Advanced  Novice       1         8.90965518431337E-001  1
69  no       3.96000000000000E 000  Advanced  Advanced     1         9.44948816395031E-001  1

Prediction Accuracy
This query returns the prediction accuracy:

SELECT (SELECT COUNT(id) FROM glmpredict_admissions
        WHERE admitted = fitted_category)
     / (SELECT COUNT(id) FROM glmpredict_admissions) AS prediction_accuracy;

prediction_accuracy

1.00000000000000000000


GLMPredict Example: Gaussian Distribution Prediction

This example predicts prices for new houses and evaluates the predictions against the original
prices using the root mean square error (RMSE).

Input
• Input table: housing_test, as in DecisionForestPredict Example: Specify Column Names
• Model: glm_housing_model, output by "GLM Example: Gaussian Distribution Analysis" in Teradata
Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003

SQL Call
The canonical link specifies the default family link, which is "identity" for the Gaussian distribution.

DROP TABLE glmpredict_housing;

CREATE MULTISET TABLE glmpredict_housing AS (


SELECT * FROM GLMPredict (
ON housing_test PARTITION BY ANY
ON glm_housing_model AS Model DIMENSION
USING
Accumulate ('sn', 'price')
Family ('GAUSSIAN')
LinkFunction ('CANONICAL')
) AS dt
) WITH DATA;

Output
This query returns the following table:

SELECT * FROM glmpredict_housing ORDER BY 1;

sn price fitted_value

13 27000 3.73458440000000E 004

16 37900 4.36871317500000E 004

25 42000 4.09020280000000E 004

38 67000 7.24876705000000E 004

53 68000 7.92386937000000E 004

104 132000 1.11528007000000E 005

111 43000 3.91028812000000E 004

117 93000 6.69369510000000E 004

132 44500 4.18198865000000E 004

140 43000 4.16117915000000E 004

142 40000 4.43941465000000E 004

157 60000 6.65712643500000E 004

161 63900 6.49009829000000E 004

The fitted_value column gives the predicted house price.

Root Mean Square Error Evaluation


This query returns the root mean square error evaluation (RMSE):

SELECT SQRT(AVG(POWER(glmpredict_housing.price -
glmpredict_housing.fitted_value, 2))) AS RMSE FROM glmpredict_housing;

rmse

1.06854695738768E 004

GLMPredict Example: Casting Input Column to VARCHAR

Like GLMPredict Example: Logistic Distribution Prediction, this example predicts the admission status
of students. In both examples, the input column masters is categorical—the value can be yes or no. In
the other example, the value is 'yes' or 'no'. In this example, the value is numerical—1 for yes or 0 for
no—therefore, it must be cast to VARCHAR.

Input
• Input table: admissions_test_2, which has admissions information for 20 students
• Model: glm_admissions_model, output by "GLM Example: Logistic Regression Analysis with
Intercept" in Teradata Vantage™ Machine Learning Engine Analytic Function Reference,
B700-4003, with the category column modified as follows:

attribute predictor category

1 masters '1'

2 masters '0'


admissions_test_2
id masters gpa stats programming admitted

50 1 3.95000000000000E 000 Beginner Beginner 0

51 1 3.76000000000000E 000 Beginner Beginner 0

52 0 3.70000000000000E 000 Novice Beginner 1

53 1 3.50000000000000E 000 Beginner Novice 1

54 1 3.50000000000000E 000 Beginner Advanced 1

55 0 3.60000000000000E 000 Beginner Advanced 1

56 0 3.82000000000000E 000 Advanced Advanced 1

57 0 3.71000000000000E 000 Advanced Advanced 1

58 0 3.13000000000000E 000 Advanced Advanced 1

59 0 3.65000000000000E 000 Novice Novice 1

60 0 4.00000000000000E 000 Advanced Novice 1

61 1 4.00000000000000E 000 Advanced Advanced 1

62 0 3.70000000000000E 000 Advanced Advanced 1

63 0 3.83000000000000E 000 Advanced Advanced 1

64 1 3.81000000000000E 000 Advanced Advanced 1

65 1 3.90000000000000E 000 Advanced Advanced 1

66 0 3.87000000000000E 000 Novice Beginner 1

67 1 3.46000000000000E 000 Novice Beginner 0

68 0 1.87000000000000E 000 Advanced Novice 1

69 0 3.96000000000000E 000 Advanced Advanced 1

SQL Call

CREATE MULTISET TABLE glmpredict_admissions_2 AS (


SELECT * FROM GLMPredict (
ON (
SELECT id, CAST(masters AS varchar(10)) AS masters,
gpa, stats, programming, admitted
FROM admissions_test
) PARTITION BY ANY
ON glm_admissions_model AS Model DIMENSION
USING

Accumulate ('id','masters','gpa','stats','programming','admitted')
Family ('LOGISTIC')
LinkFunction ('LOGIT')
) AS dt
) WITH DATA;

Output
This query returns the following table:

SELECT * FROM glmpredict_admissions_2 ORDER BY 1;

Fitted values can vary in precision, because they depend on the model table output by ML Engine GLM
function and fetched to Analytics Database.

glmpredict_admissions_2
id masters gpa stats programming admitted fitted_value

50 1 3.95000000000000E+000 Beginner Beginner 0 3.50763408888030E-001
51 1 3.76000000000000E+000 Beginner Beginner 0 3.55708978581653E-001
52 0 3.70000000000000E+000 Novice Beginner 1 7.58306140231079E-001
53 1 3.50000000000000E+000 Beginner Novice 1 5.56012779663342E-001
54 1 3.50000000000000E+000 Beginner Advanced 1 7.69474352959112E-001
55 0 3.60000000000000E+000 Beginner Advanced 1 9.68031141480050E-001
56 0 3.82000000000000E+000 Advanced Advanced 1 9.45772725968165E-001
57 0 3.71000000000000E+000 Advanced Advanced 1 9.46411914806798E-001
58 0 3.13000000000000E+000 Advanced Advanced 1 9.49666186386367E-001
59 0 3.65000000000000E+000 Novice Novice 1 8.74189685344822E-001
60 0 4.00000000000000E+000 Advanced Novice 1 8.65058992199339E-001
61 1 4.00000000000000E+000 Advanced Advanced 1 6.50618727191735E-001
62 0 3.70000000000000E+000 Advanced Advanced 1 9.46469669158547E-001
63 0 3.83000000000000E+000 Advanced Advanced 1 9.45714262806374E-001
64 1 3.81000000000000E+000 Advanced Advanced 1 6.55523357798207E-001
65 1 3.90000000000000E+000 Advanced Advanced 1 6.53204164468356E-001
66 0 3.87000000000000E+000 Novice Beginner 1 7.54738501228429E-001
67 1 3.46000000000000E+000 Novice Beginner 0 2.60034297764359E-001
68 0 1.87000000000000E+000 Advanced Novice 1 8.90965518431337E-001
69 0 3.96000000000000E+000 Advanced Advanced 1 9.44948816395031E-001

Categorizing fitted_value Column


See GLMPredict Example: Logistic Distribution Prediction.
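
For a quick inline version against this example's output table, a CASE expression can bin the fitted
probability into a predicted class. This is a minimal sketch, assuming a 0.5 cutoff and the 1/0 labels
used by the admitted column; adjust both to match your model:

SELECT id, admitted, fitted_value,
  CASE WHEN fitted_value > 0.5 THEN 1 ELSE 0 END AS predicted_admitted
FROM glmpredict_admissions_2
ORDER BY id;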

Prediction Accuracy
See GLMPredict Example: Logistic Distribution Prediction.

SVMSparsePredict
Note:
This SVMSparsePredict function uses the model output by the ML Engine SVMSparse function to analyze
the input data and make predictions.

If the SVMSparse call that created the model specified HashProjection ('true'), SVMSparsePredict does not
support UNICODE data.
If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.

SVMSparsePredict Syntax
SELECT * FROM SVMSparsePredict (
ON { table | view | (query) } AS InputTable PARTITION BY id_column
ON { table | view | (query) } AS Model DIMENSION
USING
IDColumn ('id_column')
AttributeNameColumn ('attribute_name_column')
[ AttributeValueColumn ('attribute_value_column') ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ TopK ({ output_class_number | 'output_class_number' }) ]
[ OutputProb ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Responses ('response' [,...]) ]
) AS alias;

Related Information:
Column Specification Syntax Elements

SVMSparsePredict Syntax Elements


IDColumn
Specify the name of the InputTable column that contains the identifiers of the test samples.
The InputTable must be partitioned by this column.

AttributeNameColumn
Specify the name of the InputTable column that contains the attributes of the test samples.

AttributeValueColumn
[Optional] Specify the name of the InputTable column that contains the attribute values.
Default behavior: Each attribute has the value 1.

Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.

TopK
[Disallowed with Responses, otherwise optional] Specify the number of class labels to
appear in the output table. For each observation, the output table has n rows, corresponding
to the n most likely classes. To see the probability of each class, use OutputProb ('true').

OutputProb
[Required to be 'true' with Responses, optional otherwise.] Specify whether to output the
probability for each response. If you omit Responses, the function outputs only the probability
of the predicted class.
Default: 'true'

Responses
[Optional] Specify the classes for which to output probabilities.
Default behavior: Output only the probability of the predicted class.

SVMSparsePredict Input
Table Description

InputTable Contains test data.

Model Output by ML Engine SVMSparse function. Model is in binary format. To display its readable
content, use ML Engine SVMSparseSummary function.

InputTable Schema
Column Data Type Description

id_column BYTEINT, INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), VARCHAR, or VARCHAR(n) Test sample identifier.

attribute_name_column BYTEINT, INTEGER, SMALLINT, BIGINT, VARCHAR, or VARCHAR(n) Test sample attribute.

attribute_value_column BYTEINT, INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), or NUMERIC(p,a) Attribute value.

accumulate_column Any Column to copy to output table.

Model Table Schema


Column Data Type Description

classid BYTEINT, INTEGER, SMALLINT, or BIGINT Identifier of class of model attribute.

weights BYTE, VARBYTE, or BLOB Weight of model attribute.

SVMSparsePredict Output
Output Table Schema
The table has the predicted class of each test sample.
If you specify TopK (n), the output table has n rows for each observation.

Column Data Type Description

id_column BYTEINT, INTEGER, SMALLINT, BIGINT, NUMERIC, NUMERIC(p), NUMERIC(p,a), VARCHAR, or VARCHAR(n) Test sample identifier.

predict_value VARCHAR Predicted class of test sample.

predict_confidence DOUBLE PRECISION [Column appears only with OutputProb ('true') and without Responses syntax element.] Probability that observation belongs to class in predict_value column.

prob_response DOUBLE PRECISION [Column appears only with Responses syntax element; appears once for each specified response.] Probability that observation belongs to category response.

accumulate_column Any Column copied from InputTable.

Calculation of prob_response and predict_confidence

The function calculates the values of prob_response and predict_confidence with the following formulas.

This is the formula for the value of class r:

value_r = W_r · X

where:
• X is the vector of predictor values corresponding to an observation.
• W_r is the vector of predictor weights calculated by the model for class r, where r is a class specified
by the Responses syntax element.

prob_response
For binary classification, the formula for the probability that a response belongs to class r is:

prob_response = sigmoid(value_r) = 1 / (1 + exp(-value_r))

For multiple-class classification, the formula for the probability that a response belongs to class r is:

prob_response = softmax_r(values) = exp(value_r) / Σs exp(value_s)

predict_confidence
The column predict_confidence, which appears only if you omit the Responses syntax element, displays
the probability that the observation belongs to the class in the column predict_value. This value is the
maximum value of prob_response over all responses r.

SVMSparsePredict Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
• InputTable: svm_iris_input_test
• Model: svm_iris_model, output by ML Engine SVMSparse function
The model is in binary format. To display its readable content, use ML Engine
SVMSparseSummary function.

svm_iris_input_test
id species attribute value

5 setosa sepal_length 5.0

5 setosa sepal_width 3.6

5 setosa petal_length 1.4

5 setosa petal_width 0.2

10 setosa sepal_length 4.9

10 setosa sepal_width 3.1

10 setosa petal_length 1.5

10 setosa petal_width 0.1

15 setosa sepal_length 5.8

15 setosa sepal_width 4.0

15 setosa petal_length 1.2

15 setosa petal_width 0.2

... ... ... ...

svm_iris_model
classid weights

-3 757365686173683A66616C736500636F73743A312E300073616D706C656E756D6265723A313230007365656

-2 7365746F7361007665727369636F6C6F720076697267696E696361

-1 706574616C5F6C656E67746800706574616C5F776964746800736570616C5F6C656E67746800736570616C5

0 BFF134DF08DD751EBFE204E07599DDE03FD93A9DDED02C8A3FD57FC7A69871810000000000000000

1 3FE4F1C5871DE4A0C000B12D7E8C18FE3FE558515B291C5DBFF6F2558D9050370000000000000000

2 3FF9424250696DEA4003BA4B98AB24FCBFF5FBB2667D07A7BFF1D36766E2FE0A0000000000000000

SQL Call

CREATE MULTISET TABLE svm_iris_predict_out AS (
SELECT * FROM SVMSparsePredict (
ON svm_iris_input_test AS InputTable PARTITION BY id
ON svm_iris_model AS Model DIMENSION
USING
IDColumn ('id')
AttributeNameColumn ('attribute')
AttributeValueColumn ('value')
Accumulate ('species')
) AS dt
) WITH DATA;

Output
This query returns the following table:

SELECT * FROM svm_iris_predict_out ORDER BY id;

id predict_value predict_confidence species

5 setosa 9.47345262648877E-001 setosa

10 setosa 8.46460012681134E-001 setosa

15 setosa 9.76489773801754E-001 setosa

... ... ... ...

Prediction Accuracy
This query returns the prediction accuracy:

SELECT (SELECT count(id)


FROM svm_iris_predict_out
WHERE predict_value = species)/(1.00*(
SELECT count(id) FROM svm_iris_predict_out)) AS prediction_accuracy;

prediction_accuracy

0.83
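
To see per-class probabilities instead of only the winning class, the same call can add the OutputProb
and Responses syntax elements. A hedged variant sketch (the response labels here are the three species
in the iris data; with Responses, OutputProb must be 'true', and the output gains one prob_response
column per listed class):

SELECT * FROM SVMSparsePredict (
ON svm_iris_input_test AS InputTable PARTITION BY id
ON svm_iris_model AS Model DIMENSION
USING
IDColumn ('id')
AttributeNameColumn ('attribute')
AttributeValueColumn ('value')
OutputProb ('true')
Responses ('setosa','versicolor','virginica')
Accumulate ('species')
) AS dt;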

DecisionForestPredict
This function can use models from the TD_DecisionForest and ML Engine DecisionForest functions to
analyze the input data and make predictions.
If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.
DecisionForestPredict outputs the probability that each observation is in the predicted class. To use
DecisionForestPredict output as input to ML Engine ROC function, you must first transform it to show the
probability that each observation is in the positive class. One way to do this is to change the probability to
(1- current probability) when the predicted class is negative.
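
A minimal sketch of that transformation, assuming a binary model whose negative class is labeled 'no'
and a prediction table (hypothetically named df_predict_out) holding the prediction and prob columns
this function outputs under OutputProb ('true'):

SELECT sn,
  CASE WHEN prediction = 'no' THEN 1 - prob ELSE prob END AS prob_positive
FROM df_predict_out;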
The prediction algorithm compares floating-point numbers. Due to possible inherent data type differences
between ML Engine and Analytics Database executions, predictions can differ. Before calling the function,
compute the relative error, using this formula:

relative_error = (abs(mle_prediction - td_prediction)/mle_prediction)*100

where mle_prediction is the ML Engine prediction value and td_prediction is the Analytics Database
prediction value. Errors (e) follow a Gaussian law; with high confidence, an error in the range
0 < e < 3% is a negligible difference.
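
As a hedged sketch, assuming the two prediction sets have been joined into one table (hypothetically
named predict_compare, with columns mle_prediction and td_prediction), the per-row relative error can
be computed directly:

SELECT id,
  (ABS(mle_prediction - td_prediction) / mle_prediction) * 100 AS relative_error
FROM predict_compare;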

DecisionForestPredict Syntax
SELECT * FROM DecisionForestPredict (
ON { table | view | (query) } PARTITION BY ANY
ON { table | view | (query) } AS Model DIMENSION
USING
IDColumn ('id_column')
[ NumericInputs ({ 'numeric_input_column' | numeric_input_column_range }[,...]) ]
[ CategoricalInputs ({ 'categorical_input_column' | categorical_input_column_range
}[,...]) ]
[ Detailed ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Responses ('response' [,...]) ]
[ OutputProb ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

Related Information:
Column Specification Syntax Elements

DecisionForestPredict Syntax Elements


IDColumn
Specify the column that contains a unique identifier for each test point in the test set.

NumericInputs
[Optional] Specify the names of the columns that contain the numeric predictor variables.
Default behavior: The function gets these variables from the model output by DecisionForest
only if you omit both NumericInputs and CategoricalInputs. If you specify this syntax element,
you must specify it exactly as you specified it in the DecisionForest call that created
the model.

CategoricalInputs
[Optional] Specify the names of the columns that contain the categorical predictor variables.
Default behavior: The function gets these variables from the model output by DecisionForest
only if you omit both NumericInputs and CategoricalInputs. If you specify this syntax element,
you must specify it exactly as you specified it in the DecisionForest call that created
the model.

Detailed
[Optional] Specify whether to output detailed information about the forest trees; that is,
the decision tree and the specific tree information, including task index and tree index for
each tree.
Default: 'false'

Responses
[Optional] Specify the classes for which to output probabilities.

Note:
Responses works only with a classification model.

Default behavior: Output only the probability of the predicted class.

OutputProb
[Required to be 'true' with Responses, optional otherwise.] Specify whether to output the
probability for each response. If you omit Responses, the function outputs only the probability
of the predicted class.

Note:
OutputProb works only with a classification model.

Default: 'false'

Accumulate
[Optional] Specify the names of the input columns to copy to the output table.

DecisionForestPredict Input
Table Description

Input Contains test data.

Model Has same schema as OutputTable of ML Engine DecisionForest function.

Input Table Schema


Column Data Type Description

id_column Any Unique test point identifier. Cannot be NULL.

numeric_column NUMERIC, INTEGER, BIGINT, or DOUBLE PRECISION Numeric predictor variable. Cannot be NULL.

category_column INTEGER, BIGINT, or VARCHAR Categorical predictor variable. Cannot be NULL.

accumulate_column Any Column to copy to output table.

Model Schema
For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.

Column Data Type Description

worker_ip VARCHAR IP address of worker that produced decision tree.

task_index INTEGER, BIGINT, or SMALLINT Identifier of worker that produced decision tree.

tree_num INTEGER, BIGINT, or SMALLINT Decision tree identifier.

tree VARCHAR, CLOB, or JSON JSON representation of decision tree.

DecisionForestPredict Output
Output Table Schema
The table has a set of predictions for each test point.

Column Data Type Description

accumulate_column Same as in input table Column copied from input table.

id_column Same as in input table Column copied from input table. Unique row identifier.

prediction VARCHAR Predicted test point value, predicted by model.

confidence_lower DOUBLE PRECISION [Appears with OutputProb ('false').] Lower bound of confidence interval. For classification tree, confidence_lower and confidence_upper have the same value, which is the probability of the predicted class.

confidence_upper DOUBLE PRECISION [Appears with OutputProb ('false').] Upper bound of confidence interval. For classification tree, confidence_lower and confidence_upper have the same value, which is the probability of the predicted class.

tree_num VARCHAR Either the concatenation of task_index and tree_num from the
model table, to show which tree created the prediction, or 'final'
to show the overall prediction. This column appears only if you
specify Detailed ('true').

prob DOUBLE [Column appears only with OutputProb ('true') and without
PRECISION Responses syntax element.] Probability that observation
belongs to class prediction.

prob_response DOUBLE PRECISION [Column appears only with OutputProb ('true') and Responses syntax
element; appears once for each specified response.] Probability that observation belongs to
category response.

DecisionForestPredict Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

DecisionForestPredict Example: Specify Column Names

Input
• Input table: housing_test, which has 54 observations of 14 variables
• Model: rft_model, output by "DecisionForest Example: TreeType ('classification') and OutOfBag
('false')" in Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003

Input Table Column Descriptions


Column Description

sn Sale number (unique identifier of observation)

price Sale price in U. S. dollars (numeric)

lotsize Lot size in square feet (numeric)

bedrooms Number of bedrooms (numeric)

bathrms Number of full bathrooms (numeric)

stories Number of stories, excluding basement (numeric)

driveway Whether the house has a driveway—yes or no (categorical)

recroom Whether the house has a recreation room—yes or no (categorical)

fullbase Whether the house has a full finished basement—yes or no (categorical)

gashw Whether the house uses gas to heat water—yes or no (categorical)

airco Whether the house has central air conditioning—yes or no (categorical)

garagepl Number of garage places (numeric)

prefarea Whether the house is in a preferred neighborhood—yes or no (categorical)

homestyle Style of house (response variable)

housing_test
sn price lotsize bedrooms bathrms stories driveway recroom fullbase gashw airco

13 27000 1700 3 1 2 yes no no no no

16 37900 3185 2 1 1 yes no no no yes

25 42000 4960 2 1 1 yes no no no no

38 67000 5170 3 1 4 yes no no no yes

53 68000 9166 2 1 1 yes no yes no yes

104 132000 3500 4 2 2 yes no no yes no

111 43000 5076 3 1 1 no no no no no

117 93000 3760 3 1 2 yes no no yes no

132 44500 3850 3 1 2 yes no no no no

140 43000 3750 3 1 2 yes no no no no

142 40000 2650 3 1 2 yes no yes no no

157 60000 2953 3 1 2 yes no yes no yes

... ... ... ... ... ... ... ... ... ... ...

rft_model
worker_ip task_index tree_num CAST(tree AS VARCHAR(50))

xx.xx.xx.xx 0 0 {"responseCounts_":{"Eclectic":148,"bungalow":30,"

xx.xx.xx.xx 0 1 {"responseCounts_":{"Eclectic":158,"bungalow":26,"

xx.xx.xx.xx 0 2 {"responseCounts_":{"Eclectic":120,"bungalow":38,"

xx.xx.xx.xx 0 3 {"responseCounts_":{"Eclectic":166,"bungalow":29,"

xx.xx.xx.xx 0 4 {"responseCounts_":{"Eclectic":138,"bungalow":32,"

xx.xx.xx.xx 0 5 {"responseCounts_":{"Eclectic":158,"bungalow":34,"

xx.xx.xx.xx 0 6 {"responseCounts_":{"Eclectic":168,"bungalow":32,"

xx.xx.xx.xx 0 7 {"responseCounts_":{"Eclectic":145,"bungalow":40,"

xx.xx.xx.xx 0 8 {"responseCounts_":{"Eclectic":150,"bungalow":34,"

xx.xx.xx.xx 0 9 {"responseCounts_":{"Eclectic":156,"bungalow":42,"

xx.xx.xx.xx 0 10 {"responseCounts_":{"Eclectic":148,"bungalow":18,"

xx.xx.xx.xx 0 11 {"responseCounts_":{"Eclectic":147,"bungalow":20,"

xx.xx.xx.xx 0 12 {"responseCounts_":{"Eclectic":150,"bungalow":31,"

xx.xx.xx.xx 0 13 {"responseCounts_":{"Eclectic":135,"bungalow":32,"

xx.xx.xx.xx 0 14 {"responseCounts_":{"Eclectic":139,"bungalow":24,"

xx.xx.xx.xx 0 15 {"responseCounts_":{"Eclectic":146,"bungalow":27,"

xx.xx.xx.xx 0 16 {"responseCounts_":{"Eclectic":152,"bungalow":23,"

xx.xx.xx.xx 0 17 {"responseCounts_":{"Eclectic":135,"bungalow":23,"

xx.xx.xx.xx 0 18 {"responseCounts_":{"Eclectic":148,"bungalow":29,"

xx.xx.xx.xx 0 19 {"responseCounts_":{"Eclectic":166,"bungalow":33,"

xx.xx.xx.xx 0 20 {"responseCounts_":{"Eclectic":142,"bungalow":28,"

xx.xx.xx.xx 0 21 {"responseCounts_":{"Eclectic":172,"bungalow":27,"

xx.xx.xx.xx 0 22 {"responseCounts_":{"Eclectic":147,"bungalow":37,"

xx.xx.xx.xx 0 23 {"responseCounts_":{"Eclectic":158,"bungalow":31,"

xx.xx.xx.xx 0 24 {"responseCounts_":{"Eclectic":158,"bungalow":33,"

xx.xx.xx.xx 1 0 {"responseCounts_":{"Eclectic":140,"bungalow":44,"

xx.xx.xx.xx 1 1 {"responseCounts_":{"Eclectic":161,"bungalow":28,"

xx.xx.xx.xx 1 2 {"responseCounts_":{"Eclectic":131,"bungalow":25,"

xx.xx.xx.xx 1 3 {"responseCounts_":{"Eclectic":167,"bungalow":28,"

xx.xx.xx.xx 1 4 {"responseCounts_":{"Eclectic":150,"bungalow":19,"

xx.xx.xx.xx 1 5 {"responseCounts_":{"Eclectic":158,"bungalow":24,"

xx.xx.xx.xx 1 6 {"responseCounts_":{"Eclectic":177,"bungalow":32,"

xx.xx.xx.xx 1 7 {"responseCounts_":{"Eclectic":156,"bungalow":24,"

xx.xx.xx.xx 1 8 {"responseCounts_":{"Eclectic":156,"bungalow":37,"

xx.xx.xx.xx 1 9 {"responseCounts_":{"Eclectic":165,"bungalow":24,"

xx.xx.xx.xx 1 10 {"responseCounts_":{"Eclectic":135,"bungalow":29,"

xx.xx.xx.xx 1 11 {"responseCounts_":{"Eclectic":140,"bungalow":20,"

xx.xx.xx.xx 1 12 {"responseCounts_":{"Eclectic":156,"bungalow":24,"

xx.xx.xx.xx 1 13 {"responseCounts_":{"Eclectic":147,"bungalow":34,"

xx.xx.xx.xx 1 14 {"responseCounts_":{"Eclectic":151,"bungalow":22,"

xx.xx.xx.xx 1 15 {"responseCounts_":{"Eclectic":161,"bungalow":18,"

xx.xx.xx.xx 1 16 {"responseCounts_":{"Eclectic":156,"bungalow":19,"

xx.xx.xx.xx 1 17 {"responseCounts_":{"Eclectic":126,"bungalow":29,"

xx.xx.xx.xx 1 18 {"responseCounts_":{"Eclectic":148,"bungalow":26,"

xx.xx.xx.xx 1 19 {"responseCounts_":{"Eclectic":177,"bungalow":21,"

xx.xx.xx.xx 1 20 {"responseCounts_":{"Eclectic":137,"bungalow":31,"

xx.xx.xx.xx 1 21 {"responseCounts_":{"Eclectic":171,"bungalow":28,"

xx.xx.xx.xx 1 22 {"responseCounts_":{"Eclectic":146,"bungalow":30,"

xx.xx.xx.xx 1 23 {"responseCounts_":{"Eclectic":149,"bungalow":21,"

xx.xx.xx.xx 1 24 {"responseCounts_":{"Eclectic":158,"bungalow":18,"

SQL Call
Use the Accumulate syntax element to pass the homestyle variable, to easily compare the actual and
predicted response for each observation.

CREATE MULTISET TABLE rf_housing_predict AS (


SELECT * FROM DecisionForestPredict (
ON housing_test PARTITION BY ANY
ON rft_model AS Model DIMENSION
USING
NumericInputs ('price','lotsize','bedrooms','bathrms','stories','garagepl')
CategoricalInputs ('driveway','recroom','fullbase','gashw','airco','prefarea')
IdColumn ('sn')
Accumulate ('homestyle')
Detailed ('false')
) AS dt
) WITH DATA;

Output
This query returns the following table:

SELECT * FROM rf_housing_predict ORDER BY 2;

homestyle  sn  prediction  confidence_lower       confidence_upper
---------  --- ----------  ---------------------- ----------------------
classic    13  classic     8.88888888888889E-001  8.88888888888889E-001
classic    16  classic     8.88888888888889E-001  8.88888888888889E-001
classic    25  classic     1.00000000000000E+000  1.00000000000000E+000
eclectic   38  eclectic    7.77777777777778E-001  7.77777777777778E-001
eclectic   53  eclectic    7.77777777777778E-001  7.77777777777778E-001
bungalow   104 eclectic    7.77777777777778E-001  7.77777777777778E-001
classic    111 classic     1.00000000000000E+000  1.00000000000000E+000
eclectic   117 eclectic    1.00000000000000E+000  1.00000000000000E+000
classic    132 classic     8.88888888888889E-001  8.88888888888889E-001
classic    140 classic     8.88888888888889E-001  8.88888888888889E-001
classic    142 classic     8.88888888888889E-001  8.88888888888889E-001
eclectic   157 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   161 eclectic    1.00000000000000E+000  1.00000000000000E+000
bungalow   162 bungalow    5.55555555555556E-001  5.55555555555556E-001
eclectic   176 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   177 eclectic    1.00000000000000E+000  1.00000000000000E+000
classic    195 classic     1.00000000000000E+000  1.00000000000000E+000
classic    198 classic     8.88888888888889E-001  8.88888888888889E-001
eclectic   224 eclectic    1.00000000000000E+000  1.00000000000000E+000
classic    234 classic     1.00000000000000E+000  1.00000000000000E+000
classic    237 classic     8.88888888888889E-001  8.88888888888889E-001
classic    239 classic     1.00000000000000E+000  1.00000000000000E+000
classic    249 classic     1.00000000000000E+000  1.00000000000000E+000
classic    251 classic     8.88888888888889E-001  8.88888888888889E-001
eclectic   254 eclectic    8.88888888888889E-001  8.88888888888889E-001
eclectic   255 eclectic    1.00000000000000E+000  1.00000000000000E+000
classic    260 classic     1.00000000000000E+000  1.00000000000000E+000
eclectic   274 eclectic    8.88888888888889E-001  8.88888888888889E-001
classic    294 classic     1.00000000000000E+000  1.00000000000000E+000
eclectic   301 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   306 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   317 eclectic    7.77777777777778E-001  7.77777777777778E-001
bungalow   329 bungalow    8.88888888888889E-001  8.88888888888889E-001
bungalow   339 bungalow    5.55555555555556E-001  5.55555555555556E-001
eclectic   340 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   353 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   355 eclectic    8.88888888888889E-001  8.88888888888889E-001
eclectic   364 eclectic    1.00000000000000E+000  1.00000000000000E+000
bungalow   367 bungalow    7.77777777777778E-001  7.77777777777778E-001
bungalow   377 bungalow    7.77777777777778E-001  7.77777777777778E-001
eclectic   401 eclectic    8.88888888888889E-001  8.88888888888889E-001
eclectic   403 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   408 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   411 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   440 eclectic    8.88888888888889E-001  8.88888888888889E-001
eclectic   441 eclectic    1.00000000000000E+000  1.00000000000000E+000
eclectic   443 eclectic    1.00000000000000E+000  1.00000000000000E+000
classic    459 classic     8.88888888888889E-001  8.88888888888889E-001
classic    463 classic     7.77777777777778E-001  7.77777777777778E-001
eclectic   469 eclectic    7.77777777777778E-001  7.77777777777778E-001
eclectic   472 eclectic    1.00000000000000E+000  1.00000000000000E+000
bungalow   527 bungalow    7.77777777777778E-001  7.77777777777778E-001
bungalow   530 eclectic    6.66666666666667E-001  6.66666666666667E-001
eclectic   540 eclectic    8.88888888888889E-001  8.88888888888889E-001

Prediction Accuracy
This query returns the prediction accuracy:

SELECT (SELECT count(sn) FROM rf_housing_predict


WHERE homestyle = prediction) / (SELECT count(sn)
FROM rf_housing_predict) AS PA;

pa

0.77777777777777777778


DecisionForestPredict Example: Specify Column Range, OutputProb, Responses

Input
• Input table: housing_test_sample

sn  price                 lotsize               bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea homestyle
--- --------------------- --------------------- -------- ------- ------- -------- ------- -------- ----- ----- -------- -------- ---------
329 1.15442000000000E+005 7.00000000000000E+003 3        2       4       yes      no      no       no    yes   2        no       bungalow
224 7.85000000000000E+004 2.81700000000000E+003 4        2       2       no       yes     yes      no    no    1        no       Eclectic
198 4.05000000000000E+004 4.35000000000000E+003 3        1       2       no       no      no       yes   no    1        no       Classic
162 1.30000000000000E+005 6.00000000000000E+003 4        1       2       yes      no      yes      no    no    2        no       bungalow
339 1.41000000000000E+005 8.10000000000000E+003 4        1       2       yes      yes     yes      no    yes   2        yes      bungalow

• Model: rft_model_classification

worker_ip     task_index tree_num tree
------------- ---------- -------- ----
172.24.57.209  1          0        {"responseCounts_":{"classic":48,"bungalow":21,"eclectic":97},
172.24.96.213  0          0        {"responseCounts_":{"classic":48,"bungalow":17,"eclectic":99},
172.24.106.80  2          0        {"responseCounts_":{"classic":40,"bungalow":16,"eclectic":108},
172.24.57.209  1          1        {"responseCounts_":{"classic":73,"bungalow":30,"eclectic":107},
172.24.96.213  0          1        {"responseCounts_":{"classic":66,"bungalow":23,"eclectic":114},
172.24.106.80  2          1        {"responseCounts_":{"classic":65,"bungalow":24,"eclectic":115},
172.24.57.209  1          2        {"responseCounts_":{"classic":61,"bungalow":19,"eclectic":80},
172.24.96.213  0          2        {"responseCounts_":{"classic":47,"bungalow":18,"eclectic":93},
172.24.106.80  2          2        {"responseCounts_":{"classic":53,"bungalow":8,"eclectic":97},

(In the preceding table, the tree column is truncated on the right.)

SQL Call

SELECT * FROM DecisionForestPredict (


ON housing_test_sample PARTITION BY ANY
ON rft_model_classification AS Model DIMENSION
USING
NumericInputs ('[1:5]','garagepl')
CategoricalInputs ('[6:10]','prefarea')
IdColumn ('sn')
Accumulate ('homestyle')
Detailed ('false')
OutputProb ('t')
Responses ('classic','eclectic','bungalow')

) AS dt;

Output

homestyle sn prediction prob_classic prob_eclectic prob_bungalow


------------ --- ---------- ------------ ------------- -------------
bungalow 329 bungalow 0.00 0.12 0.88
Eclectic 224 eclectic 0.00 1.00 0.00
Classic 198 classic 0.89 0.11 0.00
bungalow 162 bungalow 0.00 0.44 0.56
bungalow 339 bungalow 0.00 0.44 0.56

DecisionTreePredict
Note:
This DecisionTreePredict function uses the model output by the ML Engine DecisionTree function to
analyze the input data and make predictions.

If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.

DecisionTreePredict Syntax
SELECT * FROM DecisionTreePredict (
ON { table | view | (query) } AS AttributeTable
PARTITION BY pid_col [,...]
ON { table | view | (query) } AS Model DIMENSION
USING
AttrTableGroupbyColumns ({ 'gcol' | gcol_range }[,...])
AttrTablePIDColumns ({ 'pid_col' | pid_col_range }[,...])
AttrTableValColumn ('value_column')
[ OutputProb ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
[ Responses ('response'[,...]) ]
]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

Related Information:
Column Specification Syntax Elements

DecisionTreePredict Syntax Elements


AttrTableGroupByColumns
Specify the names of the columns on which AttributeTable is partitioned. Each partition
contains one attribute of the input data.

AttrTablePIDColumns
Specify the names of the columns that define the data point identifiers.

AttrTableValColumn
Specify the name of the AttributeTable column that contains the input values.

OutputProb
[Required to be 'true' with Responses, optional otherwise.] Specify whether to
output probabilities.
Default: 'false'

Responses
[Optional with OutputProb, disallowed otherwise.] Specify the labels for which to
output probabilities.

Default behavior: Output only the probability of the predicted class.

Accumulate
[Optional] Specify the names of the input columns to copy to the output table.
If you are using this function to create input for ML Engine ROC function, this syntax element
must specify actual_label.

DecisionTreePredict Input
Table Description

AttributeTable Contains test data. Has same schema as ML Engine DecisionTree InputTable.

Model Model output by ML Engine DecisionTree function.

AttributeTable Schema
See Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003.

Model Schema
For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.
Double quotation marks around some column names are required because the names contain
special characters.

Column Data Type Description

node_id INTEGER, SMALLINT, or BIGINT Node identifier.

node_size INTEGER, SMALLINT, or BIGINT Number of objects in node.

"node_gini(p)" or node_gini INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION GINI impurity value for information in node. For ImpurityMeasurement ('gini'), column name is node_gini(p); otherwise, it is node_gini.

"node_entropy(p)" or node_entropy INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION Entropy impurity value for the information in the node. For ImpurityMeasurement ('entropy'), column name is node_entropy(p); otherwise, it is node_entropy.

"node_chisq_pv(p)" or node_chisq_pv INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION Chi-square impurity value for the information in the node. For ImpurityMeasurement ('chisquare'), column name is node_chisq_pv(p); otherwise, it is node_chisq_pv.

node_label CHARACTER or VARCHAR Output category for node.

node_majorvotes INTEGER, SMALLINT, or BIGINT Number of objects that belong to category identified by node_label.

split_value INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION Numeric split value.

"split_gini(p)" or split_gini INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION GINI impurity measurement for information in node after splitting. For ImpurityMeasurement ('gini'), column name is split_gini(p); otherwise, it is split_gini.

"split_entropy(p)" or split_entropy INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION Entropy impurity measurement for the information in node after splitting. For ImpurityMeasurement ('entropy'), column name is split_entropy(p); otherwise, it is split_entropy.

"split_chisq_pv(p)" or split_chisq_pv INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION Chi-square impurity measurement for information in node after splitting. For ImpurityMeasurement ('chisquare'), column name is split_chisq_pv(p); otherwise, it is split_chisq_pv.

left_id INTEGER, SMALLINT, or BIGINT Identifier of left child of node.

left_size INTEGER, SMALLINT, or BIGINT Number of objects in left child of node.

left_label CHARACTER or VARCHAR Output category for left child of node.

left_majorvotes INTEGER, SMALLINT, or BIGINT Number of objects that belong to category identified by left_label.

right_id INTEGER, SMALLINT, or BIGINT Identifier of right child of node.

right_size INTEGER, SMALLINT, or BIGINT Number of objects in right child of node.

right_label CHARACTER or VARCHAR Output category for right child of node.

right_majorvotes INTEGER, SMALLINT, or BIGINT Number of objects that belong to category identified by right_label.

left_bucket CHARACTER or VARCHAR When split value is categorical attribute, value in left child of node.

right_bucket CHARACTER or VARCHAR When split value is categorical attribute, value in right child of node.

left_label_prob_list CHARACTER or VARCHAR [Column appears only with OutputResponseProbList ('true').] Probability of each label for left child of node.

right_label_prob_list CHARACTER or VARCHAR [Column appears only with OutputResponseProbList ('true').] Probability of each label for right child of node.

prob_label_order CHARACTER or VARCHAR [Column appears only with OutputResponseProbList ('true').] Label order probability for left and right children of node.

attribute CHARACTER or VARCHAR Split attribute.

node_majorfreq INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION [Column appears only with Weighted ('true').] Weighted objects that belong to category identified by node_label.

left_majorfreq INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION [Column appears only with Weighted ('true').] Weighted objects that belong to category identified by left_label.

right_majorfreq INTEGER, SMALLINT, BIGINT, NUMBER, or DOUBLE PRECISION [Column appears only with Weighted ('true').] Weighted objects that belong to category identified by right_label.

DecisionTreePredict Output
Output Table Schema
Column Data Type Description

id_column Any Data point identifier from attribute_table (DecisionTree InputTable).

pred_label VARCHAR Predicted response value for data point.

prob DOUBLE [Column appears only with OutputProb ('true') and without
PRECISION Responses syntax element.] Probability that observation belongs to
class pred_label, which depends on value of DecisionTree syntax
element ResponseProbDistType used to create model:
ResponseProbDistType Probability Formula

'laplace' (TP + 1) / (TP + FP + C)

'frequency' or 'rawcount' Lc / L

Where:
Operand Description

TP Number of true positives at leaf.

FP Number of false positives at leaf.

C Number of trees.

Lc Number of leaf nodes for class.

L Number of leaf nodes in tree.

prob_for_label_response DOUBLE PRECISION [Column appears only with Responses syntax element.] Probability that observation belongs to category response, calculated as in the description of column prob.

accumulate_column Same as in AttributeTable Column copied from AttributeTable.

DecisionTreePredict Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

DecisionTreePredict Examples Input

• AttributeTable: iris_attribute_test
• Model: iris_attribute_output
Both tables are created in "DecisionTree Example: Create Model" in Teradata Vantage™ Machine
Learning Engine Analytic Function Reference, B700-4003.
For input table column descriptions, see NaiveBayesPredict Example.

iris_attribute_test
pid attribute attrvalue

5 petal_length 1.4

5 petal_width 0.2

5 sepal_length 5

5 sepal_width 3.6

10 petal_length 1.5

10 petal_width 0.1

10 sepal_length 4.9

10 sepal_width 3.1

15 petal_length 1.2

15 petal_width 0.2

15 sepal_length 5.8

15 sepal_width 4

... ... ...

iris_attribute_output
node_id node_size node_gini(p)       node_entropy      node_chisq_pv node_label node_majorvotes split_value
0       120       0.666666666666667  1.58496250072116  1             1          40              3
2       80        0.5                1                 1             2          40              1.70000004768372
5       39        0.0499671268902038 0.172036949353113 1             2          38              4.90000009536743
6       41        0.0928019036287924 0.281193796432043 1             3          39              4.90000009536743
14      37        0.0525931336742148 0.179256066928321 1             3          36              2.90000009536743
30      24        0.0798611111111112 0.249882292833186 1             3          23              3.20000004768372
61      14        0.13265306122449   0.371232326640875 1             3          13              6.30000019073486

DecisionTreePredict Example: Apply Model to Test Data

Input
See DecisionTreePredict Examples Input.

SQL Call

CREATE MULTISET TABLE decisiontree_predict AS (


SELECT * FROM DecisionTreePredict (
ON iris_attribute_test AS AttributeTable PARTITION BY pid
ON iris_attribute_output as Model DIMENSION
USING
AttrTableGroupbyColumns ('attribute')
AttrTablePIDColumns ('pid')
AttrTableValColumn ('attrvalue')
) AS dt
) WITH DATA;

Output
This query returns the following table:

SELECT * FROM decisiontree_predict ORDER BY 1;

The predicted labels 1, 2, and 3 correspond to the species setosa, versicolor, and virginica, respectively.

pid pred_label

5 1

10 1

15 1

20 1

25 1

30 1

35 1

40 1

45 1

50 1

55 2

60 2

65 2

70 2

75 2

80 2

85 2

90 2

95 2

100 2

105 3

110 3

115 3

120 2

125 3

130 2

135 2

140 3

145 3

150 3

DecisionTreePredict Example: OutputProb, Responses

Input
See DecisionTreePredict Examples Input.

SQL Call

SELECT * FROM DecisionTreePredict (


ON iris_attribute_test AS AttributeTable PARTITION BY pid
ON iris_attribute_output AS Model DIMENSION
USING
AttrTableGroupByColumns ('attribute')
AttrTablePIDColumns ('pid')
AttrTableValColumn ('attrvalue')
OutputProb ('true')
Responses ('1','2','3')
) AS dt ORDER BY pid;

Output
pid pred_label prob_for_label_1 prob_for_label_2 prob_for_label_3

5 1 0.95348 0.02326 0.02326

10 1 0.95348 0.02326 0.02326

15 1 0.95348 0.02326 0.02326

20 1 0.95348 0.02326 0.02326

25 1 0.95348 0.02326 0.02326

30 1 0.95348 0.02326 0.02326

35 1 0.95348 0.02326 0.02326

40 1 0.95348 0.02326 0.02326

45 1 0.95348 0.02326 0.02326

50 1 0.95348 0.02326 0.02326

55 2 0.02632 0.94736 0.02632

60 2 0.02632 0.94736 0.02632

65 2 0.02632 0.94736 0.02632

70 2 0.02632 0.94736 0.02632

75 2 0.02632 0.94736 0.02632

80 2 0.02632 0.94736 0.02632

85 2 0.02632 0.94736 0.02632

90 2 0.02632 0.94736 0.02632

95 2 0.02632 0.94736 0.02632

100 2 0.02632 0.94736 0.02632

105 3 0.06250 0.12500 0.81250

110 3 0.07692 0.07692 0.84616

115 3 0.06250 0.06250 0.87500

120 3 0.14286 0.57143 0.28571

125 3 0.07692 0.07692 0.84616

130 3 0.14286 0.57143 0.28571

135 3 0.14286 0.57143 0.28571

140 3 0.06250 0.12500 0.81250

145 3 0.07692 0.07692 0.84616

150 3 0.25000 0.25000 0.50000
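
As an illustrative arithmetic check of the 'laplace' probability formula from DecisionTreePredict
Output: assuming the setosa leaf holds all 40 setosa training observations with no false positives and
a denominator constant C of 3, (TP + 1)/(TP + FP + C) = (40 + 1)/(40 + 0 + 3) ≈ 0.95348, which matches
prob_for_label_1 in the setosa rows above; the minority probabilities 0.02326 are then 1/43.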

TD_KMeansPredict
The TD_KMeansPredict function uses the cluster centroids in the TD_KMeans function output to assign the
input data points to the cluster centroids.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_KMeansPredict Syntax
SELECT * FROM TD_KMeansPredict (
ON { table | view | (query) } as InputTable
ON { table | view | (query) } as ModelTable DIMENSION
USING
[ OutputDistance ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Accumulate({'accumulate_column' | 'accumulate_column_range'}[,...]) ]
) as alias;

TD_KMeansPredict Syntax Elements


OutputDistance
[Optional]: Specify whether to return the distance between each data point and the
nearest cluster.
Default Value: False

Accumulate
[Optional]: Specify the input table column names to copy to the output table.

TD_KMeansPredict Input
Input Table Schema
Column Data Type Description

IdColumn Any The InputTable column name that has the unique identifier for each input table row.

TargetColumns BYTEINT, SMALLINT, INTEGER, BIGINT, DECIMAL/NUMERIC, FLOAT, REAL, or DOUBLE PRECISION The input table column names used for clustering.

TD_KMeansPredict Output
Output Table Schema
Column Data Type Description

Id_Column ANY The unique identifier of input rows copied from the input table.

TD_CLUSTERID_KMEANS BIGINT The unique identifier of the cluster.

TD_DISTANCE_KMEANS REAL The distance between a data point and the center of the assigned cluster. The column appears only if the OutputDistance element is set to 'true'.

Accumulate_Columns ANY The specified input table column names copied to the output table.

TD_KMeansPredict Example
Input Table

id C1 C2
-- -- --
1 1 1
2 2 2
3 8 8
4 9 9

KMeans_Model (generated using TD_KMeans)

td_clusterid_kmeans C1   C2   td_size_kmeans td_withinss_kmeans id   td_modelinfo_kmeans
------------------- ---- ---- -------------- ------------------ ---- -----------------------------------------
0                   1.5  1.5  2              1                  NULL NULL
1                   8.5  8.5  2              1                  NULL NULL
NULL                NULL NULL NULL           NULL               NULL Converged : True
NULL                NULL NULL NULL           NULL               NULL Number of Iterations : 2
NULL                NULL NULL NULL           NULL               NULL Number of Clusters : 2
NULL                NULL NULL NULL           NULL               NULL Total_WithinSS : 2.00000000000000E+00
NULL                NULL NULL NULL           NULL               NULL Between_SS : 9.80000000000000E+01

SELECT * FROM TD_KMeansPredict (
ON kmeans_input_table AS InputTable
ON kmeans_model AS ModelTable DIMENSION
USING
OutputDistance ('true')
Accumulate ('c1','c2')
) AS dt ORDER BY 1,2,3;

Output

id td_clusterid_kmeans td_distance_kmeans C1 C2
-- ------------------- ------------------ -- --
1  0                   0.707106781        1  1
2  0                   0.707106781        2  2
3  1                   0.707106781        8  8
4  1                   0.707106781        9  9
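
Each reported distance is the Euclidean distance from the point to its assigned centroid; for example,
point (1,1) against centroid (1.5,1.5) gives SQRT(0.25 + 0.25) ≈ 0.707106781. A one-line arithmetic
check of that value:

SELECT SQRT(POWER(1 - 1.5, 2) + POWER(1 - 1.5, 2)) AS distance_to_centroid;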

NaiveBayesPredict
Note:
This NaiveBayesPredict function uses the model output by the ML Engine Naive Bayes Classifier function
to analyze the input data and make predictions.

If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.

NaiveBayesPredict Syntax
SELECT * FROM NaiveBayesPredict (
ON { table | view | (query) } PARTITION BY ANY
ON { table | view | (query) } AS Model DIMENSION
USING
IDColumn ('test_point_id_col')
NumericInputs ('numeric_input_column'[,...] )
CategoricalInputs ('categorical_input_column'[,...] )
Responses ('response'[,...])
) AS alias;

NaiveBayesPredict Syntax Elements


IDColumn
Specify the name of the input table column that contains the ID that uniquely identifies the
test input data.

NumericInputs
[Required if CategoricalInputs is omitted.] Specify the same numeric_input_columns that you
specified when you used the NaiveBayesMap and NaiveBayesReduce functions to create
the model table from the training data.

CategoricalInputs
[Required if NumericInputs is omitted.] Specify the same categorical_input_columns that you
specified when you used the NaiveBayesMap and NaiveBayesReduce functions to create
the model table from the training data.

Responses
Specify the responses to output.

NaiveBayesPredict Input
Table Description

Input Contains test data. Has same schema as ML Engine Naive Bayes Classifier input table.

Model Model output by ML Engine Naive Bayes Classifier function.

Input Table Schema


See Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003.

Model Schema
For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.
Double quotation marks around some column names are required because the names are either Analytics
Database reserved keywords or are camel-case.

Column Data Type Description

"class" or class_nb VARCHAR Response.

"variable" or VARCHAR Input variable (name of input column).


variable_nb

"type" or type_nb VARCHAR Input variable types ('NUMERIC' or 'CATEGORICAL').

category VARCHAR For categorical predictor, its level. For numeric


predictor, NULL.

cnt INTEGER or BIGINT Count of observations with this class, variable,


and category.

"sum" or sum_nb INTEGER or For numerical predictor, sum of variable values for
DOUBLE_ observations with this class, variable, and category. For
PRECISION categorical predictor, NULL.

"sumSq" or sum_sq INTEGER or For numerical predictor, sum of square of variable values
DOUBLE_ for observations with this class, variable, and category.
PRECISION For categorical predictor, NULL.

"totalCnt" or total_cnt INTEGER or BIGINT Total count of observations.

NaiveBayesPredict Output
Output Table Schema
Each row of the table represents one observation.

Column Data Type Description

id INTEGER Row (observation) identifier.

prediction VARCHAR Prediction for observation.

loglik_response_i DOUBLE PRECISION Loglikelihood (natural logarithm of probability) that observation has the corresponding response.

NaiveBayesPredict Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
• Input table: nb_iris_input_test
• Model: nb_iris_model
The model is created in the Naive Bayes example in Teradata Vantage™ Machine Learning Engine
Analytic Function Reference, B700-4003.

Input Table Column Descriptions


Column Description

id Unique identifier of observation

sepal_length Numeric

sepal_width Numeric

petal_length Numeric

petal_width Numeric

species Setosa, versicolor, or virginica

nb_iris_input_test
id sepal_length sepal_width petal_length petal_width species

5 5 3.6 1.4 0.2 setosa

10 4.9 3.1 1.5 0.1 setosa

15 5.8 4 1.2 0.2 setosa

20 5.1 3.8 1.5 0.3 setosa

25 4.8 3.4 1.9 0.2 setosa

30 4.7 3.2 1.6 0.2 setosa

35 4.9 3.1 1.5 0.2 setosa

40 5.1 3.4 1.5 0.2 setosa

45 5.1 3.8 1.9 0.4 setosa

50 5 3.3 1.4 0.2 setosa

55 6.5 2.8 4.6 1.5 versicolor

60 5.2 2.7 3.9 1.4 versicolor

65 5.6 2.9 3.6 1.3 versicolor

70 5.6 2.5 3.9 1.1 versicolor

75 6.4 2.9 4.3 1.3 versicolor

80 5.7 2.6 3.5 1 versicolor

85 5.4 3 4.5 1.5 versicolor

90 5.5 2.5 4 1.3 versicolor

95 5.6 2.7 4.2 1.3 versicolor

100 5.7 2.8 4.1 1.3 versicolor

105 6.5 3 5.8 2.2 virginica

110 7.2 3.6 6.1 2.5 virginica

115 5.8 2.8 5.1 2.4 virginica

120 6 2.2 5 1.5 virginica

125 6.7 3.3 5.7 2.1 virginica

130 7.2 3 5.8 1.6 virginica

135 6.1 2.6 5.6 1.4 virginica

140 6.9 3.1 5.4 2.1 virginica

145 6.7 3.3 5.7 2.5 virginica

150 5.9 3 5.1 1.8 virginica

nb_iris_model
class variable type category cnt sum sumSq totalcnt

setosa sepal_width NUMERIC ? 40 136.700000524521 473.290003499985 40

setosa petal_width NUMERIC ? 40 10.1000002026558 3.03000012755394 40

setosa sepal_length NUMERIC ? 40 199.900000095367 1004.27000005722 40

setosa petal_length NUMERIC ? 40 57.6999998092651 84.2099996709824 40

versicolor sepal_width NUMERIC ? 40 111.10000038147 313.130002088547 40

versicolor petal_width NUMERIC ? 40 53.299999833107 72.7099995040894 40

versicolor sepal_length NUMERIC ? 40 239.599999427795 1446.13999296188 40

versicolor petal_length NUMERIC ? 40 172.399999141693 752.219992570878 40

virginica sepal_width NUMERIC ? 40 118.799999952316 356.539999780655 40

virginica petal_width NUMERIC ? 40 81.1999989748001 166.999995970726 40

virginica sepal_length NUMERIC ? 40 264.400000572205 1764.92000530243 40

virginica petal_length NUMERIC ? 40 222.299999713898 1249.1499958992 40

SQL Call

DROP TABLE nb_iris_predict;

CREATE MULTISET TABLE nb_iris_predict AS (


SELECT * FROM NaiveBayesPredict (
ON nb_iris_input_test PARTITION BY ANY
ON nb_iris_model AS Model DIMENSION
USING
IDColumn ('id')
NumericInputs ('sepal_length','sepal_width','petal_length','petal_width')
Responses ('virginica','setosa','versicolor')
) AS dt
) WITH DATA;

Output
This query returns the following table:

SELECT * FROM nb_iris_predict ORDER BY 1;

The output provides a prediction for each row in the test data set and specifies the log likelihood values that
were used to make the predictions for each category.

id prediction loglik_virginica loglik_setosa loglik_versicolor

5 setosa -60.9907330174083 0.940424559067427 -38.2319825308929

10 setosa -61.5861966261907 -0.173043897170957 -37.6660830556247

15 setosa -64.7169548001753 -3.55476375390931 -42.613272284101

20 setosa -57.7992844148636 0.531796840642284 -35.7613053354934

25 setosa -55.0939143017897 -3.23703029869347 -32.1179858509341

30 setosa -58.0673073752287 0.109611164911179 -34.9285997859276

35 setosa -58.1980267787658 0.660202577013632 -34.9335988704833

40 setosa -58.3538858459019 0.976840811041703 -35.4425587940391

45 setosa -50.3847602463201 -4.36921429673761 -29.0537478266948

50 setosa -59.4745348026195 1.00257959230347 -36.5026022674224

55 versicolor -5.22108005914589 -270.465431908161 -1.7396367893394

60 versicolor -11.3356467465064 -174.565470791378 -2.31925264962004

65 versicolor -12.6496488706934 -138.435722453706 -2.1898005756116

70 versicolor -15.236843619572 -152.47255627778 -2.3538459106499

75 versicolor -8.34632493685681 -214.383653794905 -1.14727508911532

80 versicolor -18.455946984498 -109.900955754698 -3.72743011721095

85 versicolor -7.00283150694931 -249.656488976769 -2.00455589365379

90 versicolor -12.0279925543069 -177.470336291088 -1.74539749109463

95 versicolor -10.1802450220293 -198.037109900803 -1.10567314638237

100 versicolor -10.1315405651018 -187.294956922171 -1.02885306444447

105 virginica -1.58321671192447 -540.56351949849 -14.859643718252

110 virginica -6.11301966870239 -654.801984259278 -28.8385135092999

115 virginica -3.64635253153959 -456.647579953406 -15.3298808321577

120 versicolor -7.73615017754911 -322.909009762056 -3.53629430321742

125 virginica -1.87627054598219 -509.817023097936 -13.7515396871732

130 virginica -3.36908052149115 -469.802937074554 -9.13832860900173

135 versicolor -5.81482980902253 -403.678170868448 -4.51644862072851

140 virginica -1.48430911768034 -463.610989255182 -12.0238603485835

145 virginica -3.82266629516761 -576.395460020916 -22.6942168473031

150 virginica -2.57004648415525 -366.506113945482 -4.84887216455807
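
Mirroring the prediction-accuracy checks in the other examples, this hedged sketch joins the
predictions back to the test table (the join on id, and the species column as ground truth, are taken
from the inputs above):

SELECT (SELECT COUNT(*)
FROM nb_iris_predict p, nb_iris_input_test t
WHERE p.id = t.id AND p.prediction = t.species) / (1.00 * (
SELECT COUNT(*) FROM nb_iris_predict)) AS prediction_accuracy;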

TD_GLMPredict
The TD_GLMPredict function predicts target values (regression) and class labels (classification) for test
data using a GLM model of the TD_GLM function.
Before scoring, you must standardize the input features using the TD_ScaleFit and
TD_ScaleTransform functions.
The function only accepts numeric features. Therefore, you must convert the categorical features to numeric
values before prediction.
The function skips the rows with missing (null) values during prediction.
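
A minimal sketch of that standardization pipeline follows, under stated assumptions: the table names
raw_input and scale_fit_out and the column lists are hypothetical, and the TD_ScaleFit and
TD_ScaleTransform syntax elements shown here should be verified against their reference pages for
your release:

CREATE MULTISET TABLE scale_fit_out AS (
SELECT * FROM TD_ScaleFit (
ON raw_input AS InputTable
USING
TargetColumns ('feature1','feature2')
ScaleMethod ('STD')
) AS dt
) WITH DATA;

SELECT * FROM TD_ScaleTransform (
ON raw_input AS InputTable
ON scale_fit_out AS FitTable DIMENSION
USING
Accumulate ('id','target')
) AS dt;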

You can use the TD_RegressionEvaluator, TD_ClassificationEvaluator, or TD_ROC function as a
post-processing step to evaluate prediction results.

TD_GLMPredict Syntax
SELECT * from TD_GLMPredict (
ON { table | view | (query) } AS InputTable PARTITION BY ANY
ON { table | view | (query) } AS Model DIMENSION
USING
IDColumn ('id_column')
[Accumulate({'accumulate_column'|accumulate_column_range}[,...])]
[OutputProb ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
[Responses ('response' [,...])]
) AS dt;

TD_GLMPredict Syntax Elements


IDColumn
Specify the column name that uniquely identifies an observation in test table.

Accumulate
[Optional] Specify the input table column names to copy to the output table.

OutputProb
[Optional] Specify whether the function returns the probability for each response. Only
applicable if family of probability distribution is BINOMIAL. The default value is false.

Responses
[Optional] Specify the class labels if the function returns probabilities for each response.
Only applicable if the OutputProb element is True. A class label has the value 0 or 1. If not
specified, the function returns the probability of the predicted response.

TD_GLMPredict Input
The input table schema is as follows:
Column Name  Data Type  Description
id_column  Any  The column that uniquely identifies an observation in the test table.
input_column  BYTEINT, SMALLINT, INTEGER, BIGINT, FLOAT, DECIMAL, NUMBER  The columns used as predictors (features) for model training.
accumulate_column  Any  The input table column names to copy to the output table.

The model schema is as follows (see TD_GLM for a model example):

Column Name  Data Type  Description
attribute  SMALLINT  A numeric index that represents predictors and model metrics: model metrics have negative values, predictors take positive values, and the intercept is specified using index 0.
predictor  VARCHAR  The predictor or model metric name. The maximum length is 32000 characters.
estimate  FLOAT  The predictor weights and numeric values of metrics.
value  VARCHAR  The string values of metrics. The maximum length is 30 characters.

TD_GLMPredict Output
The output table schema is as follows:
Column Name  Data Type  Description
id_column  Same as input table  The specified column name that uniquely identifies an observation in the test table.
prediction  FLOAT  The predicted value of the test observation.
prob  FLOAT  The probability that the observation belongs to the predicted class. Only appears if the OutputProb element is set to True and the Responses element is not specified.
prob_0  FLOAT  The probability that the observation belongs to class 0. Only appears if the Responses element is specified.
prob_1  FLOAT  The probability that the observation belongs to class 1. Only appears if the Responses element is specified.
accumulate_column  Any  The specified column names in the Accumulate element copied to the output table.


TD_GLMPredict Example
TD_GLMPredict Example for Credit Data
This example takes credit data and uses TD_GLM function to get a model. You can view the input and
output in the TD_GLM example.

TD_GLMPredict Call for Credit Data

CREATE VOLATILE TABLE vt_glm_predict_credit_ex AS (
SELECT * from TD_GLMPredict (
ON credit_ex_merged AS INPUTTABLE
ON td_glm_output_credit_ex AS Model DIMENSION
USING
IDColumn ('ID')
Accumulate('Outcome')
) AS dt
) WITH DATA
ON COMMIT PRESERVE ROWS
;

TD_GLMPredict Output for Credit Data


ID Prediction Outcome

61 1 1

297 0 0

631 0 0

122 1 1

... ... ...

TD_GLMPredict Example for Housing Data


This example takes raw housing data, and does the following:
1. Uses TD_ScaleFit to standardize the data.
2. Uses TD_ScaleTransform to transform the data.
3. Uses TD_GLM to get a model.
4. Uses TD_GLMPredict to predict target values.
You can view the input and output of steps 1 through 3 in the TD_GLM example.


TD_GLMPredict Call for Housing Data

CREATE VOLATILE TABLE vt_predict_cal_ex AS (
SELECT * from TD_GLMPredict (
ON cal_housing_ex_scaled AS INPUTTABLE
ON td_glm_cal_ex AS Model DIMENSION
USING
IDColumn ('ID')
Accumulate('MedHouseVal')
) AS dt
) WITH DATA
ON COMMIT PRESERVE ROWS
;

TD_GLMPredict Output for Housing Data


ID Prediction MedHouseVal

2833 1.5762 0.6

5328 2.29801 2.775

5300 1.82705 3.5

12433 0.863867 0.664

... ... ...

8: Model Evaluation Functions

TD_Silhouette
The TD_Silhouette function implements the silhouette method of interpreting and validating consistency within clusters of data. The function determines how well the data is clustered.
The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters, and thus provides a way to assess parameters like the optimal number of clusters.
The silhouette scores and their meanings are as follows:
• 1: Data is appropriately clustered
• -1: Data is not appropriately clustered
• 0: Datum is on the border of two natural clusters
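For reference, the per-point silhouette value reported in the SAMPLE_SCORES output follows the standard definition. With a(i) the mean distance from point i to the other points in its own cluster and b(i) the minimum mean distance from point i to the points of any other cluster (the A_i and B_i columns of the SAMPLE_SCORES output schema):

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}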

Note:
The algorithm used in this function is of the order of N^2 (where N is the number of rows). Hence, expect
the query to run significantly longer as the number of rows increases in the input table.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_Silhouette Syntax
SELECT * FROM TD_Silhouette (
ON { table | view | (query) } as InputTable
USING
IdColumn('id_column')
ClusterIdColumn('clusterid_column')
TargetColumns({'target_column'|'target_column_range'}[,...])
[{


[ OutputType({'SCORE' | 'CLUSTER_SCORES'}) ]
}
|
{
OutputType('SAMPLE_SCORES')
[ Accumulate({'accumulate_column' | 'accumulate_column_range'}[,...]) ]
}]
) AS alias;

TD_Silhouette Syntax Elements


IdColumn
[Required]: Specify the unique identifier column name.

ClusterIdColumn
[Required]: Specify the column name that contains the assigned clusterIds for the input
data points.

TargetColumns
[Required]: Specify the features or columns for clustering.

OutputType
[Optional]: Specify the output type or format.
• SCORE: Returns average silhouette score of all input samples.
• SAMPLE_SCORES: Returns silhouette score for each input sample.
• CLUSTER_SCORES: Returns average silhouette scores of input samples for
each cluster.
Allowed Values: ['SCORE','SAMPLE_SCORES','CLUSTER_SCORES']
Default Value: SCORE

Accumulate
[Optional]: Specify the input table columns to copy to the output table.

Note:
Only applicable for 'SAMPLE_SCORES' output type.


TD_Silhouette Input
Input Table Schema
Column  Data Type  Description
id_column  ANY  The unique identifier column of the input table.
clusterid_column  BYTEINT, SMALLINT, INTEGER, BIGINT  The column that contains the assigned clusterIds for the input data points.
target_column  BYTEINT, SMALLINT, INTEGER, BIGINT, DECIMAL/NUMERIC, FLOAT, REAL, DOUBLE PRECISION  The columns used for clustering.

TD_Silhouette Output
Output Table Schema
If the Output type is set to Score:

Column Data Type Description

Silhouette_Score REAL Silhouette Coefficient (that is, the Mean Silhouette Score)

If the Output type is set to Sample_Scores:

Column  Data Type  Description
Id_Column  Same as in Input Table  The unique identifier of input rows copied from the input table.
clusterid_Column  Same as in Input Table  The clusterIds of the input data points copied from the input table.
A_i  REAL  The mean distance of a data point to other data points in the same cluster.
B_i  REAL  The minimum mean dissimilarity to other clusters.
Silhouette_Score  REAL  The silhouette score for the data point.
accumulate_columns  ANY  The specified columns in the Accumulate element are copied from the input table to the output table.

If the Output type is set to Cluster_Scores:

Column  Data Type  Description
clusterid_Column  BYTEINT, SMALLINT, INTEGER, BIGINT  The clusterId of the cluster.

Silhouette_Score REAL The mean silhouette score for the clusterId.

TD_Silhouette Example
InputTable

id clusterid c1 c2
-- --------- -- --
1 1 1 1
2 1 2 2
3 2 8 8
4 2 9 9

Output Table
Output when the Output type is set to 'Score'.

SELECT * FROM TD_Silhouette(
ON input_tbl as InputTable
USING
IdColumn('id')
ClusterIdColumn('clusterid')
TargetColumns('c1','c2')
OutputType('SCORE')
) as dt;

silhouette_score
----------------
0.856410256

Output when the Output type is set to 'Sample_Scores'.

SELECT * FROM TD_Silhouette(
ON input_tbl as InputTable
USING
IdColumn('id')
ClusterIdColumn('clusterid')
TargetColumns('c1','c2')
OutputType('SAMPLE_SCORES')
) as dt;


id clusterid a_i b_i silhouette_score
-- --------- ----------- ------------ ----------------
1 1 1.414213562 10.606601718 0.866666667
4 2 1.414213562 10.606601718 0.866666667
3 2 1.414213562 9.192388155 0.846153846
2 1 1.414213562 9.192388155 0.846153846

Output when the Output type is set to 'Cluster_Scores'.

SELECT * FROM TD_Silhouette(
ON input_tbl as InputTable
USING
IdColumn('id')
ClusterIdColumn('clusterid')
TargetColumns('c1','c2')
OutputType('CLUSTER_SCORES')
) as dt;

clusterid silhouette_score
--------- ----------------
1 0.856410256
2 0.856410256

TD_ClassificationEvaluator
In classification problems, a confusion matrix is used to visualize the performance of a classifier. The
confusion matrix contains predicted labels represented across the row-axis and actual labels represented
across the column-axis. Each cell in the confusion matrix corresponds to the count of occurrences of labels
in the test data.

Note:
The function works for multi-class scenarios as well. In both the binary and multi-class cases, the primary output table contains class-level metrics, whereas the secondary output table contains metrics that apply across classes.

Apart from accuracy, the secondary output table returns micro, macro, and weighted-averaged metrics of
precision, recall, and F1-score values.
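For reference, these averages follow the standard definitions. With C classes, per-class precision P_c, support n_c, total sample count N, and per-class true-positive and false-positive counts TP_c and FP_c (recall and F1 are averaged analogously):

Macro-Precision = \frac{1}{C} \sum_{c=1}^{C} P_c
Weighted-Precision = \sum_{c=1}^{C} \frac{n_c}{N} P_c
Micro-Precision = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}

In single-label multi-class classification, micro-averaged precision, recall, and F1 all reduce to overall accuracy, which is why those four rows show the same value in the example output below.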

TD_ClassificationEvaluator Syntax
SELECT * FROM TD_ClassificationEvaluator(
ON { input_table | view | (query) }
[ OUT [VOLATILE| PERMANENT] TABLE OutputTable(output_table_name) ]


USING
ObservationColumn('ObservationColumn')
PredictionColumn('PredictionColumn')
{
Labels({'Label1','Label2'}[,...]) | NumLabels('label_count')
}
) AS dt1;

TD_ClassificationEvaluator Syntax Elements


ObservationColumn
[Required]: Specify the column name that has observed labels.

PredictionColumn
[Required]: Specify the column name that has predicted labels.

Labels
[Required if NumLabels is not specified]: Specify the list of predicted labels. Specify either Labels or NumLabels, not both.

NumLabels
[Optional]: Specify the total count of labels.

TD_ClassificationEvaluator Input
Input Table Schema
Column  Data Type  Description
ObservationColumn  BYTEINT, SMALLINT, INTEGER, CHAR, VARCHAR  The InputTable column name that has observed labels.
PredictionColumn  BYTEINT, SMALLINT, INTEGER, CHAR, VARCHAR  The InputTable column name that has predicted labels.

TD_ClassificationEvaluator Output
Output Table Schema
The Primary Output table is as follows:

Column Data Type Description

SeqNum Integer The sequence number of the row.


Prediction VARCHAR The predicted label for the row.

Mapping VARCHAR The mapping used for the label.

ColumnN BIGINT The N columns denoting N labels.

Precision REAL The positive predictive value. Refers to the fraction of relevant instances among
the total retrieved instances.

Recall REAL Refers to the fraction of relevant instances retrieved over the total amount of
relevant instances.

F1 REAL F1 score, defined as the harmonic mean of the precision and recall.

Support BIGINT The number of times a label displays in the ObservationColumn.

Output Table Schema
The Secondary Output table is as follows:

Column Data Type Description

SeqNum Integer The sequence number of the row.

Metric VARCHAR The metric name.

MetricValue REAL The value for the corresponding metric.

TD_ClassificationEvaluator Example
Input Table

id observed_value predicted_value
--- -------------- ---------------
5 setosa setosa
5 setosa setosa
5 setosa setosa
5 setosa setosa
5 setosa setosa
10 setosa setosa
10 setosa setosa
10 setosa setosa
10 setosa setosa
10 setosa setosa
15 setosa setosa
15 setosa setosa

15 setosa setosa
15 setosa setosa
15 setosa setosa
20 setosa setosa
20 setosa setosa
20 setosa setosa
20 setosa setosa
20 setosa setosa
25 setosa setosa
25 setosa setosa
25 setosa setosa
25 setosa setosa
25 setosa setosa
30 setosa setosa
30 setosa setosa
30 setosa setosa
30 setosa setosa
30 setosa setosa
35 setosa setosa
35 setosa setosa
35 setosa setosa
35 setosa setosa
35 setosa setosa
40 setosa setosa
40 setosa setosa
40 setosa setosa
40 setosa setosa
40 setosa setosa
45 setosa setosa
45 setosa setosa
45 setosa setosa
45 setosa setosa
50 setosa setosa
50 setosa setosa
50 setosa setosa
50 setosa setosa
55 versicolor versicolor
55 versicolor versicolor
55 versicolor versicolor
55 versicolor versicolor
60 versicolor versicolor
60 versicolor versicolor
60 versicolor versicolor
60 versicolor versicolor


65 versicolor versicolor
65 versicolor versicolor
65 versicolor versicolor
65 versicolor versicolor
70 versicolor versicolor
70 versicolor versicolor
70 versicolor versicolor
75 versicolor versicolor
75 versicolor versicolor
75 versicolor versicolor
80 versicolor versicolor
80 versicolor versicolor
80 versicolor versicolor
85 virginica versicolor
85 virginica versicolor
85 virginica versicolor
90 versicolor versicolor
90 versicolor versicolor
90 versicolor versicolor
95 versicolor versicolor
95 versicolor versicolor
95 versicolor versicolor
100 versicolor versicolor
100 versicolor versicolor
100 versicolor versicolor
105 virginica virginica
105 virginica virginica
105 virginica virginica
110 virginica virginica
110 virginica virginica
110 virginica virginica
115 virginica virginica
115 virginica virginica
115 virginica virginica
120 versicolor virginica
120 versicolor virginica
120 versicolor virginica
125 virginica virginica
125 virginica virginica
125 virginica virginica
130 versicolor virginica
130 versicolor virginica
130 versicolor virginica
135 versicolor virginica


135 versicolor virginica
135 versicolor virginica
140 virginica virginica
140 virginica virginica
140 virginica virginica
145 virginica virginica
145 virginica virginica
150 virginica virginica
150 virginica virginica

SQL Call

SELECT * from TD_CLASSIFICATIONEVALUATOR(
ON iris_pred AS InputTable
OUT TABLE OutputTable(additional_metrics)
USING
Labels('setosa','versicolor','virginica' )
ObservationColumn('observed_value')
PredictionColumn ('predicted_value')
) as dt1 order by 1,2,3;

Output Table
Primary Output table:

SeqNum Prediction Mapping CLASS_1 CLASS_2 CLASS_3 Precision Recall F1 Support
------ ---------- ------- ------- ------- ------- ----------- ----------- ----------- -------
0 setosa CLASS_1 48 0 0 1.000000000 1.000000000 1.000000000 48
1 versicolor CLASS_2 0 30 3 0.909090909 0.769230769 0.833333333 39
2 virginica CLASS_3 0 9 19 0.678571429 0.863636364 0.760000000 22

Output Table
Secondary Output table:

SeqNum Metric MetricValue
------ -------------------------------------------------- -----------
1 Accuracy 0.889908257
2 Micro-Precision 0.889908257
3 Micro-Recall 0.889908257
4 Micro-F1 0.889908257
5 Macro-Precision 0.862554113
6 Macro-Recall 0.877622378
7 Macro-F1 0.864444444
8 Weighted-Precision 0.902597403
9 Weighted-Recall 0.889908257
10 Weighted-F1 0.891926606
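As a quick sanity check against the primary output table, accuracy is the sum of the diagonal of the confusion matrix divided by the total support: (48 + 30 + 19) / (48 + 39 + 22) = 97 / 109 ≈ 0.889908257, which matches the Accuracy row above.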

TD_RegressionEvaluator
The TD_RegressionEvaluator function computes metrics to evaluate and compare multiple models and
summarizes how close predictions are to their expected values.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_RegressionEvaluator Syntax
SELECT * FROM TD_RegressionEvaluator(
ON { table | view | (query) } as InputTable
USING
ObservationColumn('observation_column')
PredictionColumn('prediction_column')
[ Metrics('metric_1'[,'metric_2',...,'metric_n']) ]
[ NumOfIndependentVariables(value) ]
[ DegreesOfFreedom(df1,df2) ]
) AS alias;

TD_RegressionEvaluator Syntax Elements


ObservationColumn
[Required]: Specify the column name that has observation values.


PredictionColumn
[Required]: Specify the column name that has prediction values.

NumOfIndependentVariables
[Optional]: Specify the number of independent variables in the model. Required with
Adjusted R Squared metric, otherwise ignored.

DegreesOfFreedom
[Optional]: Specify the numerator degrees of freedom (df1) and denominator degrees of
freedom (df2). Required with fstat metric, else ignored.

Metrics
[Optional]: Specify the list of evaluation metrics. The function returns the following metrics if
the list is not provided:
• MAE: Mean absolute error (MAE) is the arithmetic average of the absolute errors
between observed values and predicted values.
• MSE: Mean squared error (MSE) is the average of the squares of the errors between
observed values and predicted values.
• MSLE: Mean Square Log Error (MSLE) is the relative difference between the log-
transformed observed values and predicted values.
• MAPE: Mean Absolute Percentage Error (MAPE) is the mean or average of the absolute
percentage errors of forecasts.
• MPE: Mean percentage error (MPE) is the computed average of percentage errors by
which predicted values differ from observed values.
• RMSE: Root mean squared error (RMSE) is the square root of the average of the
squares of the errors between observed values and predicted values.
• RMSLE: Root mean square log error (RMSLE) is the square root of the relative
difference between the log-transformed observed values and predicted values.
• R2: R Squared (R2) is the proportion of the variation in the dependent variable that is
predictable from the independent variable(s).
• AR2: Adjusted R-squared (AR2) is a modified version of R-squared that has been
adjusted for the independent variable(s) in the model.
• EV: Explained variation (EV) measures the proportion to which a mathematical model
accounts for the variation (dispersion) of a given data set.
• ME: Max-Error (ME) is the worst-case error between observed values and
predicted values.
• MPD: Mean Poisson Deviance (MPD) is equivalent to Tweedie Deviances when the
power parameter value is 1.
• MGD: Mean Gamma Deviance (MGD) is equivalent to Tweedie Deviances when the
power parameter value is 2.


• FSTAT: F-statistics (FSTAT) conducts an F-test. An F-test is any statistical test in which
the test statistic has an F-distribution under the null hypothesis.
◦ F_score = F_score value from the F-test.
◦ F_CriticalValue = F critical value from the F-test (alpha, df1, df2, UPPER_TAILED), alpha = 95%.
◦ p_value = Probability value associated with the F_score value (F_score, df1, df2, UPPER_TAILED).
◦ F_conclusion = F-test result, either 'reject null hypothesis' or 'fail to reject null
hypothesis'. If F_score > F_CriticalValue, then 'reject null hypothesis'; else 'fail to
reject null hypothesis'.
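For reference, with observed values y_i, predictions \hat{y}_i, their mean \bar{y}, and n observations, the most common of these metrics have the standard textbook forms:

MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
RMSE = \sqrt{MSE}
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}

These are the conventional definitions; if you depend on a particular variant (for example, whether MAPE is expressed as a fraction or a percentage), verify the function output against a small known dataset.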

TD_RegressionEvaluator Input
Input Table Schema
Column  Data Type  Description
Observation_Column  BYTEINT, SMALLINT, INTEGER, BIGINT, DECIMAL/NUMERIC, FLOAT, REAL, DOUBLE PRECISION  The column that has observation values.
Prediction_Column  BYTEINT, SMALLINT, INTEGER, BIGINT, DECIMAL/NUMERIC, FLOAT, REAL, DOUBLE PRECISION  The column that has prediction values.

TD_RegressionEvaluator Output
Output Table Schema
Column  Data Type  Description
metric_i  FLOAT  The metrics specified in the Metrics syntax element are displayed. For FSTAT, the following columns are displayed:
• F_score
• F_CriticalValue
• p_value
• F_conclusion

TD_RegressionEvaluator Example
InputTable

sn price prediction
--- ---------------- ----------------


13 27000.000000000 40446.918834842
16 37900.000000000 40510.148673279
25 42000.000000000 43449.453484331
38 67000.000000000 76624.879832478
53 68000.000000000 71463.482418863
104 132000.000000000 116919.270833333
111 43000.000000000 44914.331354282
117 93000.000000000 65017.025152392
132 44500.000000000 40953.263035303
140 43000.000000000 43084.061169765
142 40000.000000000 40842.383578431
157 60000.000000000 63601.429679229
161 63900.000000000 63577.865086289
162 130000.000000000 118893.154761905
176 57500.000000000 65472.830594775
177 70000.000000000 62739.489325450
195 33000.000000000 39967.151210673
198 40500.000000000 44205.358401854
224 78500.000000000 66951.540759118
234 32500.000000000 42075.221979656
237 43000.000000000 42838.368042767
239 26000.000000000 40172.789343484
249 44500.000000000 40931.339168183
251 48500.000000000 43288.879830816
254 60000.000000000 71441.950774676
255 61000.000000000 62427.945104114
260 41000.000000000 45264.892185064
274 64900.000000000 64333.059596141
294 47000.000000000 42006.077797518
301 55000.000000000 59668.624729461
306 64000.000000000 64594.501483399
317 80000.000000000 69883.134938113
329 115442.000000000 116388.318452381
339 141000.000000000 131657.638888889
340 62500.000000000 60979.553793302
353 78500.000000000 69278.445119583
355 86900.000000000 64204.176931452
364 72000.000000000 75421.353748405
367 114000.000000000 126319.444444444
377 140000.000000000 110247.569444444
401 92500.000000000 80670.206257863
403 77500.000000000 80768.002205787
408 87500.000000000 79691.581621186
411 90000.000000000 77262.550218560


440 69000.000000000 75592.404463299
441 51900.000000000 60775.316629141
443 65000.000000000 58709.087267676
459 44555.000000000 38949.583765397
463 49000.000000000 41389.644529286
469 55000.000000000 62925.830957098
472 60500.000000000 59982.047344907
527 105000.000000000 113294.568452381
530 108000.000000000 112760.937500000
540 85000.000000000 77329.868236111

SQL Call

SELECT * FROM TD_RegressionEvaluator(
ON decision_predict_output as InputTable
USING
ObservationColumn('price')
PredictionColumn('prediction')
Metrics('RMSE','R2','FSTAT')
DegreesOfFreedom(5,48)
NUMOFINDEPENDENTVARIABLES(5)
) as dt;

Output Table

RMSE R2 F_SCORE F_CRITICALVALUE P_VALUE F_CONCLUSION
-------------- ----------- ------------ --------------- ----------- ----------------------
9604.405925830 0.888090785 66.047306996 2.408514119 0.000000000 Reject null hypothesis

TD_ROC
The Receiver Operating Characteristic (ROC) function accepts a set of prediction-actual pairs for a binary
classification model and calculates the following values for a range of discrimination thresholds:
• True-positive rate (TPR)
• False-positive rate (FPR)
• The area under the ROC curve (AUC)
• Gini coefficient


A receiver operating characteristic (ROC) curve shows the performance of a binary classification model as
its discrimination threshold varies. For a range of thresholds, the curve plots the true positive rate against
the false-positive rate.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_ROC Syntax
select * from TD_ROC(
on { input_table | view | (query) } as InputTable
[ OUT [VOLATILE| PERMANENT] TABLE OutputTable(output_table_name) ]
Using
[ ModelIDColumn ('model_id_column') ]
ProbabilityColumn ('probability_column')
ObservationColumn ('observation_column')
PositiveLabel ('positive_class_label')
[ NumThresholds (num_thresholds) ]
[ AUC ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Gini ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
) AS alias;

TD_ROC Syntax Elements


ModelIDColumn
[Optional]: Specify the InputTable column name that contains the model or partition identifiers
for the ROC curves.

ProbabilityColumn
[Required]: Specify the InputTable column name that contains the probability values
for predictions.

ObservationColumn
[Required]: Specify the InputTable column name that contains the actual classes.


PositiveLabel
[Required]: Specify the label of the positive class.

NumThresholds
[Optional]: Specify the number of thresholds for this function. The value must be in the range
[1, 10000]. Default value: 50 (The function uniformly distributes the thresholds between 0
and 1.)

AUC
[Optional]: Specify whether the function displays the AUC calculated from the ROC values
(thresholds, false positive rates, and true positive rates).

GINI
[Optional]: Specify whether the function displays the Gini coefficient calculated from the
ROC values. The Gini coefficient is an inequality measure among the values of a frequency
distribution. A Gini coefficient of 0 indicates that all values are the same. The closer the Gini
coefficient is to 1, the more unequal are the values in the distribution.
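For a ROC curve, the reported Gini coefficient is commonly related to the AUC by Gini = 2 \times AUC - 1, which is consistent with the example output below (AUC values of 1.0, 0.5, and 0.0 correspond to Gini values of 1.0, 0.0, and -1.0).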

TD_ROC Input
Input Table Schema
Column  Data Type  Description
model_id_column  VARCHAR, CHAR, SMALLINT, BIGINT, INTEGER  The model identifier or partition for the ROC curve associated with the observation.
probability_column  Numeric  The predicted probability that the observation is in the positive class.
observation_column  VARCHAR, CHAR, SMALLINT, BIGINT, INTEGER  The actual class of the observation.

TD_ROC Output
Output Table Schema
If the OutputTable is given in the OUT clause:

Column  Data Type  Description
Model_id  VARCHAR, CHAR, SMALLINT, BIGINT, INTEGER  The model identifier or partition for the ROC curve associated with the observation. The column is not displayed if you do not provide the ModelIDColumn syntax element.
Threshold  REAL  The threshold at which the function classifies an observation as positive.


TPR  REAL  The TPR (true positive rate) for the threshold. Calculated as: the number of observations correctly predicted as positive based on the threshold divided by the number of positive observations.
FPR  REAL  The FPR (false positive rate) for the threshold. Calculated as: the number of observations incorrectly predicted as positive based on the threshold divided by the number of negative observations.

Output Table Schema
If the GINI or AUC syntax element is set to 'True':

Column  Data Type  Description
AUC  REAL  The area under the ROC curve for data in the partition. The column is not displayed if the AUC syntax element is False.
GINI  REAL  The Gini coefficient calculated from the ROC values for data in the partition. The column is not displayed if the GINI syntax element is False.

TD_ROC Example
Input Table

model id observation probability
----- -- ----------- -----------
1 2 1 0.500000000
1 1 0 0.150000000
2 1 1 0.250000000
2 2 0 0.550000000
3 2 1 0.250000000
3 1 0 0.450000000

SQL Call

select * from TD_ROC(
on roc_input as InputTable
OUT PERMANENT TABLE OutputTable(RocTable)
Using
ModelIDColumn ('model')
ProbabilityColumn ('probability')
ObservationColumn ('observation')
PositiveLabel ('1')
NumThresholds (5)


AUC('true')
GINI('true')
)As dt;

Output Table

model AUC GINI
----- ----------- ------------
1 1.000000000 1.000000000
3 0.500000000 0.000000000
2 0.000000000 -1.000000000

ROC Output Table

model threshold_value tpr fpr
----- --------------- ----------- -----------
1 0.500000000 1.000000000 0.000000000
1 1.000000000 0.000000000 0.000000000
1 0.250000000 1.000000000 0.000000000
1 0.750000000 0.000000000 0.000000000
1 0.000000000 1.000000000 1.000000000
2 0.250000000 1.000000000 1.000000000
2 0.000000000 1.000000000 1.000000000
2 1.000000000 0.000000000 0.000000000
2 0.750000000 0.000000000 0.000000000
2 0.500000000 0.000000000 1.000000000
3 0.500000000 0.000000000 0.000000000
3 1.000000000 0.000000000 0.000000000
3 0.750000000 0.000000000 0.000000000
3 0.250000000 1.000000000 1.000000000
3 0.000000000 1.000000000 1.000000000

9: Text Analytic Functions

NaiveBayesTextClassifierPredict
This function uses the model output by the TD_NaiveBayesTextClassifierTrainer function to analyze the input
data and make predictions.

NaiveBayesTextClassifierPredict Syntax
SELECT * FROM NaiveBayesTextClassifierPredict (
ON { table | view | (query) } AS PredictorValues PARTITION BY doc_id_column [,...]
ON { table | view | (query) } AS Model DIMENSION
USING
InputTokenColumn ('input_token_column')
[ ModelType ({ 'Multinomial' | 'Bernoulli' }) ]
DocIDColumns ({ 'doc_id_column' | 'doc_id_column_range' }[,...])
[ ModelTokenColumn ('model_token_column')
ModelCategoryColumn ('model_category_column')
ModelProbColumn ('model_probability_column') ]
[ TopK ({ num_of_top_k_predictions | 'num_of_top_k_predictions' }) |
Responses ('response' [,...]) ]
[ OutputProb {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;

NaiveBayesTextClassifierPredict Syntax Elements


InputTokenColumn
Specify the name of the PredictorValues column that contains the tokens.

ModelType
[Optional] Specify the model type of the text classifier.
Default: 'Multinomial'

DocIDColumns
Specify the names of the PredictorValues columns that contain the document identifier.


ModelTokenColumn
[Optional] Specify the name of the Model table column that contains the tokens.
Default: First column of Model table

ModelCategoryColumn
[Optional] Specify the name of the Model table column that contains the
prediction categories.
Default: Second column of Model table

ModelProbColumn
[Optional] Specify the name of the Model table column that contains the probability values.
Default: Third column of Model table

TopK
[Disallowed with Responses, otherwise optional.] Specify the number of most likely
prediction categories to output with their loglikelihood values (for example, the top 10 most
likely prediction categories). To see the probability of each class, use OutputProb ('true').
Default: All prediction categories

Responses
[Disallowed with TopK, otherwise optional.] Specify the labels for which to output
loglikelihood values and probabilities (with OutputProb ('true')).

OutputProb
Specify whether to output the calculated probability for each observation.
Default: 'false'

Accumulate
Specify the names of the PredictorValues table columns to copy to the output table.

Note:
• Specify either all or none of the syntax elements ModelTokenColumn, ModelCategoryColumn, and ModelProbColumn.
• Specifying neither TopK nor Responses is equivalent to specifying TopK.


NaiveBayesTextClassifierPredict Input
Table  Description
PredictorValues  Contains test data, for which to predict outcomes, in document-token pairs. To transform the input document into this form, input it to TD_TextParser or an ML Engine function, TextTokenizer or TextParser. TextTokenizer and TextParser have language-processing limitations that might limit support for Unicode input data (see Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003).
Model  Model output by TD_NaiveBayesTextClassifierTrainer or the ML Engine NaiveBayesTextClassifierTrainer2 function. If the latter, then for schema, see Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003.

PredictorValues Schema
Column  Data Type  Description
doc_id_column  CHARACTER, VARCHAR, INTEGER, or SMALLINT  Identifier of document that contains classified testing tokens.
token_column  CHARACTER or VARCHAR  Testing token.
accumulate_column  Any  Column to copy to output table.

Model Schema
For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.

Column Data Type Description

token CHARACTER or VARCHAR Classified training token.

category CHARACTER or VARCHAR Prediction category for token.

prob DOUBLE PRECISION Probability that token is in category.

NaiveBayesTextClassifierPredict Output
Output Table Schema
Column  Data Type  Description
doc_id  CHARACTER, VARCHAR, INTEGER, or SMALLINT  Single- or multiple-column document identifier.
prediction  VARCHAR  Prediction category.
loglik  DOUBLE PRECISION  Loglikelihood that document belongs to category.
loglik_response  DOUBLE PRECISION  [Column appears only with Responses syntax element.] Loglikelihood that document belongs to class label response.
prob  DOUBLE PRECISION  [Column appears only when you both specify OutputProb ('true') and omit Responses.] Probability that document belongs to class label in prediction column, which is max(softmax(loglik)).
prob_response  DOUBLE PRECISION  [Column appears only when you specify both OutputProb ('true') and Responses. Column appears once for each specified response.] Probability that document belongs to class label response, which is softmax(loglik_response).
accumulate_column  Same as in PredictorValues table  Column copied from PredictorValues table.
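For reference, the softmax transform named above is the standard one. For per-class loglikelihood values x_1, ..., x_K:

softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}

so the reported probabilities sum to 1 across the classes (or across the specified responses).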

NaiveBayesTextClassifierPredict Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

NaiveBayesTextClassifierPredict Example: TopK Specified

Input
• PredictorValues: complaints_test_tokenized, created by applying ML Engine TextTokenizer function
to the table complaints_test, a log of vehicle complaints, as follows:

CREATE MULTISET TABLE complaints_test_tokenized AS (
SELECT doc_id, doc_name, lower(cast(token AS VARCHAR(20))) AS token
FROM TextTokenizer (
ON complaints_test PARTITION BY ANY
USING
TextColumn ('text_data')
OutputByWord ('true')
Accumulate ('doc_id', 'doc_name')
) AS dt
) WITH DATA;

• Model: Use the following query to create the complaints_tokens_model:


CREATE TABLE complaints_tokens_model(
token VARCHAR(100),
category VARCHAR(100),
prob DOUBLE PRECISION
);

complaints_test

doc_id doc_name text_data
1 A ELECTRICAL CONTROL MODULE IS SHORTENING OUT, CAUSING THE VEHICLE TO STALL. ENGINE WILL BECOME TOTALLY INOPERATIVE. CONSUMER HAD TO CHANGE ALTERNATOR/ BATTERY AND STARTER, AND MODULE REPLACED 4 TIMES, BUT DEFECT STILL OCCURRING CANNOT DETERMINE WHAT IS CAUSING THE PROBLEM.
2 B ABS BRAKES FAIL TO OPERATE PROPERLY, AND AIR BAGS FAILED TO DEPLOY DURING A CRASH AT APPROX. 28 MPH IMPACT. MANUFACTURER NOTIFIED.
3 C WHILE DRIVING AT 60 MPH GAS PEDAL GOT STUCK DUE TO THE RUBBER THAT IS AROUND THE GAS PEDAL.
4 D THERE IS A KNOCKING NOISE COMING FROM THE CATALYITC CONVERTER , AND THE VEHICLE IS STALLING. ALSO, HAS PROBLEM WITH THE STEERING.
5 E CONSUMER WAS MAKING A TURN ,DRIVING AT APPROX 5- 10 MPH WHEN CONSUMER HIT ANOTHER VEHICLE. UPON IMPACT, DUAL AIRBAGS DID NOT DEPLOY . ALL DAMAGE WAS DONE FROM ENGINE TO TRANSMISSION,TO THE FRONT OF VEHICLE, AND THE VEHICLE CONSIDERED A TOTAL LOSS.
6 F WHEEL BEARING AND HUBS CRACKED, CAUSING THE METAL TO GRIND WHEN MAKING A RIGHT TURN. ALSO WHEN APPLYING THE BRAKES, PEDAL GOES TO THE FLOOR, CAUSE UNKNOWN. WAS ADVISED BY MIDAS NOT TO DRIVE VEHICLE- WHEELE COULD COME OFF.
7 G DRIVING ABOUT 5-10 MPH, THE VEHICLE HAD A LOW FRONTAL IMPACT IN WHICH THE OTHER VEHICLE HAD NO DAMAGES. UPON IMPACT, DRIVER'S AND THE PASSENGER'S AIR BAGS DID NOT DEPLOY, RESULTING IN INJURIES. PLEASE PROVIDE FURTHER INFORMATION AND VIN#.
8 H THE AIR BAG WARNING LIGHT HAS COME ON. INDICATING AIRBAGS ARE INOPERATIVE.THEY WERE FIXED ONE AT THE TIME, BUT PROBLEM HAS REOCCURRED.
9 I CONSUMER WAS DRIVING WEST WHEN THE OTHER CAR WAS GOING EAST. THE OTHER CAR TURNED IN FRONT OF CONSUMER'S VEHICLE, CONSUMER HIT OTHER VEHICLE AND STARTED TO SPIN AROUND , COULDN'T STOP, RESULTING IN A CRASH. UPON IMPACT, AIRBAGS DIDN'T DEPLOY.
10 J WHILE DRIVING ABOUT 65 MPH AND THE TRANSMISISON MADE A STRANGE NOISE, AND THE LEFT FRONT AXLE LOCKED UP. THE DEALER HAS REPAIRED THE VEHICLE.


SQL Call

SELECT * FROM NaiveBayesTextClassifierPredict (
ON complaints_test_tokenized AS PredictorValues PARTITION BY doc_id
ON complaints_tokens_model AS Model DIMENSION
USING
ModelType ('Bernoulli')
InputTokenColumn ('token')
DocIDColumns ('doc_id')
OutputProb ('true')
Accumulate ('doc_name')
TopK ('2')
) AS dt ORDER BY doc_id;

Output

doc_id prediction loglik prob doc_name
------ ---------- ---------------------- ---------------------- --------
1 crash -1.38044220625651E 002 1.41243173571687E-009 A
1 no_crash -1.17666267644292E 002 9.99999998587568E-001 A
2 crash -1.04652470718918E 002 1.70704288519507E-003 B
2 no_crash -9.82811865081127E 001 9.98292957114805E-001 B
3 crash -1.03026451289745E 002 2.26862573862878E-012 C
3 no_crash -7.62146044204976E 001 9.99999999997731E-001 C
4 crash -1.10830711173169E 002 1.42026355157382E-011 D
4 no_crash -8.58531176043404E 001 9.99999999985797E-001 D
5 no_crash -1.23936921216052E 002 3.43620138383542E-002 E
5 crash -1.20601083912966E 002 9.65637986161646E-001 E
6 crash -1.30310015371040E 002 2.61074198636704E-006 F
6 no_crash -1.17454141890718E 002 9.99997389258014E-001 F
7 no_crash -1.23123774759574E 002 4.23603936872661E-002 G
7 crash -1.20005517060745E 002 9.57639606312734E-001 G
8 crash -1.08617321658980E 002 8.92398441816595E-009 H
8 no_crash -9.00827983614664E 001 9.99999991076016E-001 H
9 crash -1.19919230739025E 002 2.24857954852037E-002 I
9 no_crash -1.16147101713878E 002 9.77514204514796E-001 I
10 crash -1.06104244132225E 002 7.57068462691010E-008 J
10 no_crash -8.97078469668254E 001 9.99999924293154E-001 J


NaiveBayesTextClassifierPredict Example: Responses Specified

Input
As in NaiveBayesTextClassifierPredict Example: TopK Specified
• PredictorValues: complaints_test_tokenized
• Model: complaints_tokens_model

SQL Call

SELECT * FROM NaiveBayesTextClassifierPredict (
ON complaints_test_tokenized AS PredictorValues PARTITION BY doc_id
ON complaints_tokens_model AS Model DIMENSION
USING
ModelType ('Bernoulli')
InputTokenColumn ('token')
DocIDColumns ('doc_id')
OutputProb ('true')
Accumulate ('doc_name')
Responses ('crash', 'no_crash')
) AS dt ORDER BY doc_id;

Output
doc_id prediction loglik_crash loglik_no_crash prob_crash prob_no_crash doc_name
------ ---------- ---------------------- ---------------------- ---------------------- ---------------------- --------
1 no_crash -1.38044220625651E 002 -1.17666267644292E 002 1.41243173571687E-009 9.99999998587568E-001 A
2 no_crash -1.04652470718918E 002 -9.82811865081127E 001 1.70704288519507E-003 9.98292957114805E-001 B
3 no_crash -1.03026451289745E 002 -7.62146044204976E 001 2.26862573862878E-012 9.99999999997731E-001 C
4 no_crash -1.10830711173169E 002 -8.58531176043404E 001 1.42026355157382E-011 9.99999999985797E-001 D
5 crash -1.20601083912966E 002 -1.23936921216052E 002 9.65637986161646E-001 3.43620138383542E-002 E
6 no_crash -1.30310015371040E 002 -1.17454141890718E 002 2.61074198636704E-006 9.99997389258014E-001 F
7 crash -1.20005517060745E 002 -1.23123774759574E 002 9.57639606312734E-001 4.23603936872661E-002 G
8 no_crash -1.08617321658980E 002 -9.00827983614664E 001 8.92398441816595E-009 9.99999991076016E-001 H
9 no_crash -1.19919230739025E 002 -1.16147101713878E 002 2.24857954852037E-002 9.77514204514796E-001 I
10 no_crash -1.06104244132225E 002 -8.97078469668254E 001 7.57068462691010E-008 9.99999924293154E-001 J

NGramSplitter
The NGramSplitter function tokenizes (splits) an input stream of text and outputs word multigrams (called
n-grams) based on the specified Reset, Punctuation, and Delimiter syntax elements. NGramSplitter first
splits the text into sentences, next removes punctuation characters, and finally splits the words into n-grams.
NGramSplitter provides more flexibility than standard tokenization when performing text analysis. Many
two-word phrases carry important meaning (for example, "machine learning") that single-word tokens do
not capture. This, combined with additional analytical techniques, can be useful for performing sentiment
analysis, topic identification, and document classification.
NGramSplitter considers each input row to be one document, and returns a row for each unique n-gram in
each document. NGramSplitter also returns, for each document, the counts of each n-gram and the total
number of n-grams.
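For example, with Grams (2) and the default settings, the illustrative sentence "the quick brown fox" (a hypothetical input, not one of the example tables below) yields the overlapping bigrams "the quick", "quick brown", and "brown fox", each with frequency 1, for a total bigram count of 3 in the document.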


Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.

NGramSplitter Syntax
SELECT * FROM NGramSplitter (
ON { table | view | (query) }
USING
TextColumn ('text_column')
Grams ({ gram_number | 'value_range' }[,...])
[ OverLapping({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ ConvertToLowerCase ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Reset ('reset_character...') ]
[ Punctuation ('punctuation_character...') ]
[ Delimiter ('delimiter_character...') ]
[ OutputTotalGramCount ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ TotalCountColName ('total_count_column') ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ NGramColName ('ngram_column') ]
[ GramLengthColName ('gram_length_column') ]
[ FrequencyColName ('frequency_column') ]
) AS alias;

Related Information:
Column Specification Syntax Elements

NGramSplitter Syntax Elements


TextColumn
Specify the name of the column that contains the input text. This column must have a SQL
string data type.

Grams
Specify the length, in words, of each n-gram (that is, the value of n). A value_range has the
syntax integer1-integer2, where integer1 <= integer2. The values of n, integer1, and integer2
must be positive.


OverLapping
[Optional] Specify whether the function allows overlapping n-grams.
Default: 'true' (Each word in each sentence starts an n-gram, if enough words follow it in the
same sentence to form a whole n-gram of the specified size. For information on sentences,
see the Reset syntax element description.)

ConvertToLowerCase
[Optional] Specify whether the function converts all letters in the input text to lowercase.
Default: 'true'

Reset
[Optional] Specify, in a string, the characters that can end a sentence. At the end of a
sentence, the function discards any partial n-grams and searches for the next n-gram at the
beginning of the next sentence. An n-gram cannot span sentences.
Default: '.,?!'

Punctuation
[Optional] Specify, in a string, the punctuation characters for the function to remove before
evaluating the input text.
Punctuation characters can be from both Unicode and Latin character sets.
Default: '`~#^&*()-'

Delimiter
[Optional] Specify the character or string that separates words in the input text.
Default: ' ' (space)

OutputTotalGramCount
[Optional] Specify whether the function returns the total number of n-grams in the document
(that is, in the row) for each length n specified in the Grams syntax element. If you specify
'true', the TotalCountColName syntax element determines the name of the output table
column that contains these totals.
The total number of n-grams is not necessarily the number of unique n-grams.
Default: 'false'

TotalCountColName
[Optional] Specify the name of the output table column that appears if the value of the
OutputTotalGramCount syntax element is 'true'.


Default: 'totalcnt'

Accumulate
[Optional] Specify the names of the input table columns to copy to the output table for each
n-gram. These columns cannot have the same names as those specified by the syntax
elements NGramColName, GramLengthColName, and TotalCountColName.
Default: All input columns for each n-gram

NGramColName
[Optional] Specify the name of the output table column that is to contain the created n-grams.
Default: 'ngram'

GramLengthColName
[Optional] Specify the name of the output table column that is to contain the length of n-gram
(in words).
Default: 'n'

FrequencyColName
[Optional] Specify the name of the output table column that is to contain the count of
each unique n-gram (that is, the number of times that each unique n-gram appears in
the document).
Default: 'frequency'

NGramSplitter Input
Input Table Schema
Each row of the table has a document to tokenize.

Column Data Type Description

text_column VARCHAR or CLOB Document to tokenize.

accumulate_column Any Column to copy to output table.

NGramSplitter Output
Output Table Schema
The table has a row for each unique n-gram in each input document.


Column  Data Type  Description
total_count_column  INTEGER  [Column appears only with OutputTotalGramCount ('true').] Total number of n-grams in document for each length n specified in Grams syntax element.
accumulate_column  Any  Column copied from input table.
ngram_column  VARCHAR  Created n-gram.
gram_length_column  INTEGER  Length of n-gram in words (value n).
frequency_column  INTEGER  Count of each unique n-gram in document.

NGramSplitter Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

NGramSplitter Example: Omit Accumulate

Input
The input table, paragraphs_input, contains sentences about commonly used machine
learning techniques.

paragraphs_input
paraid paratopic paratext
1 Decision Trees Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the items target value. It is one of the predictive modeling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
2 Simple Regression In statistics, simple linear regression is the least squares estimator of a linear regression model with a single explanatory variable. In other words, simple linear regression fits a straight line through the set of n points in such a way that makes the sum of squared residuals of the model (that is, vertical distances between the points of the data set and the fitted line) as small as possible
... ... ...


SQL Call

SELECT * FROM NGramSplitter (
ON paragraphs_input
USING
TextColumn ('paratext')
Grams ('4-6')
OutputTotalGramCount ('true')
) AS dt;

Output
paraid paratopic ngram n frequency totalcnt

1 Decision Trees decision tree learning uses 4 1 73

1 Decision Trees decision tree learning uses a 5 1 66

1 Decision Trees decision tree learning uses a decision 6 1 60

1 Decision Trees tree learning uses a 4 1 73

1 Decision Trees tree learning uses a decision 5 1 66

1 Decision Trees tree learning uses a decision tree 6 1 60

1 Decision Trees learning uses a decision 4 1 73

1 Decision Trees learning uses a decision tree 5 1 66

1 Decision Trees learning uses a decision tree as 6 1 60

... ... ... ... ... ...

NGramSplitter Example: Specify Accumulate

Input
The input table is paragraphs_input, as in NGramSplitter Example: Omit Accumulate.

SQL Call

SELECT * FROM NGramSplitter (
ON paragraphs_input
USING
TextColumn ('paratext')
Grams ('4-6')
OutputTotalGramCount ('true')
Accumulate ('[0:1]')
) AS dt;

Output

TD_NaiveBayesTextClassifierTrainer
The TD_NaiveBayesTextClassifierTrainer function calculates the conditional probabilities for token-category
pairs, the prior probabilities, and the missing-token probabilities for all categories. The trainer function
trains the model with the probability values, and the predict function uses these values to classify documents
into categories.
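For reference, a Naive Bayes text classifier scores a document d with tokens t_1, ..., t_n against each category c using the trained probabilities; this is the standard formulation, and the worked numbers after the example below suggest how this function estimates the probabilities:

P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(t_i \mid c)

In practice the predict function works with loglikelihoods, \log P(c) + \sum_{i} \log P(t_i \mid c), and tokens absent from the model fall back to the NAIVE_BAYES_MISSING_TOKEN_PROBABILITY entry.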

TD_NaiveBayesTextClassifierTrainer Syntax
SELECT * FROM TD_NaiveBayesTextClassifierTrainer (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE ModelTable (model_table_name) ]
USING
TokenColumn ('token_column')
DocCategoryColumn ('doc_category_column')
[{
[ ModelType ('Multinomial') ]
}
|
{
ModelType ('Bernoulli')
DocIDColumn ('doc_id_column')
}]
) AS alias;

TD_NaiveBayesTextClassifierTrainer Syntax Elements


TokenColumn
[Required]: Specify the InputTable column name that contains the classified tokens.

DocCategoryColumn
[Required]: Specify the InputTable column name that contains the document category.


DocIDColumn
[Required for Bernoulli model type]: Specify the InputTable column name that contains the
document identifier.

ModelType
[Optional]: Specify the model type of the text classifier.
Supported Model Types: Bernoulli and Multinomial
Default: Multinomial

TD_NaiveBayesTextClassifierTrainer Input
Input Table Schema
Column  Data Type  Description
doc_id_column  BYTEINT, SMALLINT, INTEGER, BIGINT, CHAR, or VARCHAR  The InputTable column name that contains the document identifier. Note: This column is required for the 'Bernoulli' model type.
token_column  CHAR or VARCHAR  The column name that contains the classified training tokens from a tokenization function.
doc_category_column  CHAR or VARCHAR  The category of the document.

Note:
The following vocabulary token names are reserved:
• NAIVE_BAYES_TEXT_MODEL_TYPE
• NAIVE_BAYES_PRIOR_PROBABILITY
• NAIVE_BAYES_MISSING_TOKEN_PROBABILITY

TD_NaiveBayesTextClassifierTrainer Output
Output Table Schema
Column Data Type Description

token VARCHAR (UNICODE) The classified training tokens.

category VARCHAR (UNICODE) The category of the token.

prob DOUBLE PRECISION The probability of the token in the category.


TD_NaiveBayesTextClassifierTrainer Example
InputTable

doc_id category token
------ -------- -------
1 no_crash vehicl
1 no_crash motor
1 no_crash separ
1 no_crash from
1 no_crash frame
2 crash deploy
2 crash anoth
2 crash end
2 crash vehicl
2 crash 70mph
2 crash airbag
2 crash rear
3 no_crash sunroof
3 no_crash leak
4 crash driver
4 crash sustain
4 crash injuri

SQL Call

SELECT * FROM TD_NaiveBayesTextClassifierTrainer (
ON complaints_tokenized AS InputTable
USING
TokenColumn ('token')
DocCategoryColumn ('category')
ModelType ('Multinomial')
) AS dt;

Output Table

token category prob
------------------------------------- ----------- -----------
driver crash 0.076923077
vehicl no_crash 0.086956522
leak no_crash 0.086956522
anoth crash 0.076923077

Teradata Vantage™ - Analytics Database Analytic Functions - 17.20,


Release 17.20 309
9: Text Analytic Functions

deploy crash 0.076923077


airbag crash 0.076923077
from no_crash 0.086956522
70mph crash 0.076923077
NAIVE_BAYES_PRIOR_PROBABILITY crash 0.588235294
separ no_crash 0.086956522
end crash 0.076923077
sunroof no_crash 0.086956522
injuri crash 0.076923077
NAIVE_BAYES_MISSING_TOKEN_PROBABILITY crash 0.038461538
rear crash 0.076923077
vehicl crash 0.076923077
NAIVE_BAYES_PRIOR_PROBABILITY no_crash 0.411764706
NAIVE_BAYES_MISSING_TOKEN_PROBABILITY no_crash 0.043478261
NAIVE_BAYES_TEXT_MODEL_TYPE MULTINOMIAL 1.000000000
frame no_crash 0.086956522
motor no_crash 0.086956522
sustain crash 0.076923077
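These values appear consistent with add-one (Laplace) smoothing over the training tokens, although the documentation does not state the smoothing scheme explicitly. The crash category has 10 training tokens, no_crash has 7, and the vocabulary has 16 distinct tokens, so, for example:

P(driver | crash) = (1 + 1) / (10 + 16) = 2/26 ≈ 0.076923077
P(vehicl | no_crash) = (1 + 1) / (7 + 16) = 2/23 ≈ 0.086956522
NAIVE_BAYES_MISSING_TOKEN_PROBABILITY (crash) = 1/26 ≈ 0.038461538
NAIVE_BAYES_PRIOR_PROBABILITY (crash) = 10/17 ≈ 0.588235294 (the fraction of training tokens in the category)

each of which matches the corresponding output row above.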

TD_SentimentExtractor
The TD_SentimentExtractor function uses a dictionary model to extract the sentiment (positive, negative, or
neutral) of each input document or sentence.
The dictionary model consists of WordNet, a lexical database of the English language, and negation
words (no, not, neither, never, and similar).
The function handles negated sentiments as follows (the sentiment polarity is multiplied by the listed factor):
• -1 if the sentiment is negated (for example, "I am not happy")
• -1 if the sentiment and a negation word are separated by one word (for example, "I am not very happy")
• +1 if the sentiment and a negation word are separated by two or more words (for example, "I am not
saying I am happy")


Note:

• You can omit the dimension ON clause(s) of the dictionary tables from the query if you want to use
the default sentiment dictionary.
• You can use your dictionary table and provide it as a CUSTOMDICTIONARYTABLE ON clause.
• You can provide additional dictionary entries through the ADDITIONALDICTIONARYTABLE
ON clause if you want to add more entries to either the CUSTOMDICTIONARY table or
default dictionary.
• You can access the dictionary through the OUTPUTDICTIONARYTABLE OUT clause if you want
to check the dictionary contents used during sentiment analysis.
• Only the English language is supported.
• The maximum length supported for a sentiment word in the dictionary table is 128 characters.
• The maximum length of the sentiment_words output column is 32000 characters. If the
sentiment_words output column value exceeds this limit, then an ellipsis (...) displays at the
end of the string.
• The maximum length of the content output column is 32000 characters; that is, the supported maximum
length of a sentence is 32000 characters.
• You can have up to 10 words in a sentiment phrase.

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_SentimentExtractor Syntax
SELECT * FROM TD_SentimentExtractor (
ON { table | view | (query) } AS INPUTTABLE PARTITION BY ANY
[ ON { table | view | (query) } AS CUSTOMDICTIONARYTABLE DIMENSION ]
[ ON { table | view | (query) } AS ADDITIONALDICTIONARYTABLE DIMENSION ]
[ OUT PERMANENT TABLE OUTPUTDICTIONARYTABLE (output_table_name) ]
USING
TextColumn ('text_column')
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ AnalysisType ({ 'DOCUMENT' | 'SENTENCE' }) ]
[ Priority ({ 'NONE' | 'NEGATIVE_RECALL' | 'NEGATIVE_PRECISION' |
'POSITIVE_RECALL' | 'POSITIVE_PRECISION'}) ]


  [ OutputType ({ 'ALL' | 'POSITIVE' | 'NEGATIVE' | 'NEUTRAL' }) ]
) AS alias;

TD_SentimentExtractor Syntax Elements


TextColumn
[Required]: Specify the input column name that contains text for sentiment analysis.

Accumulate
[Optional]: Specify the input table column names to copy to the output table.

AnalysisType
[Optional]: Specify the analysis level: DOCUMENT (default) analyzes each document; SENTENCE
analyzes each sentence.

Priority
[Optional]: Specify one of the following priorities for results:
• NONE (Default): Provide all results the same priority.
• NEGATIVE_RECALL: Provide the highest priority to negative results, including those
with lower-confidence sentiment classifications (maximizes the number of negative
results returned).
• NEGATIVE_PRECISION: Provide the highest priority to negative results with high-confidence
sentiment classifications.
• POSITIVE_RECALL: Provide the highest priority to positive results, including those
with lower-confidence sentiment classifications (maximizes the number of positive
results returned).
• POSITIVE_PRECISION: Provide the highest priority to positive results with high-confidence
sentiment classifications.

OutputType
[Optional]: Specify one of the following result types:
• ALL (Default): Returns all results.
• POSITIVE: Returns only results with positive sentiments.
• NEGATIVE: Returns only results with negative sentiments.
• NEUTRAL: Returns only results with neutral sentiments.
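
The examples later in this section do not use Priority or OutputType. As an illustrative sketch
(reusing the sentiment_extract_input table from those examples), a call that favors
high-confidence negative classifications and returns only negative results might look like this:

SELECT * FROM TD_SentimentExtractor (
    ON sentiment_extract_input AS INPUTTABLE PARTITION BY ANY
    USING
    TextColumn ('review')
    Accumulate ('id', 'product')
    Priority ('NEGATIVE_PRECISION')
    OutputType ('NEGATIVE')
) AS dt;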


TD_SentimentExtractor Input
Input Table Schema

Column             Data Type            Description
text_column        CHAR, VARCHAR, CLOB  The InputTable column that contains the text for sentiment analysis.
accumulate_column  ANY                  The input table column names to copy to the output table.

Custom/Additional Dictionary Table Schema

Column             Data Type                   Description
sentiment_word     CHAR, VARCHAR               The column that contains the sentiment word.
polarity_strength  BYTEINT, SMALLINT, INTEGER  The column that contains the strength of the sentiment word.

TD_SentimentExtractor Output
Output Table Schema

Column             Data Type  Description
AccumulateColumns  ANY        The specified input table column names copied to the output table.
content            VARCHAR    The sentence extracted from the document. The column displays only if you use SENTENCE as the AnalysisType.
polarity           VARCHAR    The sentiment value of the result. Possible values are POS (positive), NEG (negative), or NEU (neutral).
sentiment_score    INTEGER    The sentiment score of the polarity. Possible values are 0 (neutral), 1 (higher than neutral), or 2 (higher than 1).
sentiment_words    VARCHAR    The string that contains the total positive score, the total negative score, and the sentiment words with their polarity_strength and frequency enclosed in parentheses.

Output Dictionary Table Schema

Column             Data Type  Description
sentiment_word     VARCHAR    The column that contains the sentiment word.
polarity_strength  INTEGER    The column that contains the strength of the sentiment word.


TD_SentimentExtractor Example
Input Table

id product      category review
-- ------------ -------- ------------------------------------------------------------------------
 1 camera       POS      we primarily bought this camera for high image quality and excellent video capability without paying the price for a dslr .it has excelled in what we expected of it , and consequently represented excellent value for me .all my friends want my camera for their vacations . i would recommend this camera to anybody .definitely worth the price .plus , when you buy some accessories , it becomes even more powerful
 2 office suite POS      it is the best office suite i have used to date . it is launched before office 2010 and it is ages ahead of it already . the fact that i could comfortable import xls , doc , ppt and modify them , and then export them back to the doc , xls , ppt is terrific . i needed the compatibility .it is a very intuitive suite and the drag drop functionality is terrific .
 3 camera       POS      this is a nice camera , delivering good quality video images decent photos . light small , using easily obtainable , high quality minidv i love it . minor irritations include touchscreen based menu only digital photos can only be trensferred via usb , requiring ilink and usb if you use ilink .
 4 gps          POS      it is a fine gps . outstanding performance , works great . you can even get incredible coordinate accuracy from streets and trips to compare .
 5 gps          POS      nice graphs and map route info .i would not run outside again without this unique gadget . great job. big display , good backlight , really watertight , training assistant .i use in trail running and it worked well through out the race
 6 gps          NEG      most of the complaints i have seen in here are from a lack of rtfm. i have never seen so many mistakes do to what i think has to be none update of data to the system . i wish i could make all the rating stars be empty .
 7 gps          NEG      this machine is all screwed up . on my way home from a friends house it told me there is no possible route . i found their website support difficult to navigate . i am is so disapointed and just returned it and now looking for another one
 8 camera       NEG      i hate my camera , and im stuck with it . this camera sucks so bad , even the dealers on ebay have difficulty selling it. horrible indoors , does not capture fast action, screwy software , no suprise , and screwy audio/video codec that does not work with hardly any app
 9 television   NEG      $3k is way too much money to drop onto a piece of crap .poor customer support . after about 1 and a half years and hardly using the tv , a big yellow pixilated stain appeared. product is very inferior and subject to several lawsuits . i expressed my dissatifaction with the situation as this is a known issue
10 camera       NEG      i returned my camera to the vendor as i will not tolerate a sub standard product that is a known issue especially from vendor who will not admt that this needs to be removed from the shelf due to failing parts updated . due to the constant need for repair , i would never recommend this product .

SQL Call

SELECT * FROM TD_SentimentExtractor (


ON sentiment_extract_input AS INPUTTABLE PARTITION BY ANY
USING
TextColumn ('review')
Accumulate ('id', 'product')


AnalysisType ('DOCUMENT')
) AS dt ORDER BY id;

Output
Example 1: Default Dictionary

id product      polarity sentiment_score sentiment_words
-- ------------ -------- --------------- ------------------------------------------------------------
 1 camera       POS      2               In total, positive score:7 negative score:0. excelled 1 (1), excellent 1 (2), powerful 1 (1), worth 1 (1), recommend 1 (1), capability 1 (1).
 2 office suite POS      2               In total, positive score:5 negative score:-1. drag -1 (1), intuitive 1 (1), best 1 (1), comfortable 1 (1), terrific 1 (2).
 3 camera       POS      2               In total, positive score:5 negative score:-1. decent 1 (1), good 1 (1), irritations -1 (1), nice 1 (1), love 1 (1), obtainable 1 (1).
 4 gps          POS      2               In total, positive score:5 negative score:0. incredible 1 (1), outstanding 1 (1), fine 1 (1), great 1 (1), works 1 (1).
 5 gps          POS      2               In total, positive score:5 negative score:0. good 1 (1), worked 1 (1), nice 1 (1), great 1 (1), well 1 (1).
 6 gps          NEG      2               In total, positive score:0 negative score:-3. lack -1 (1), complaints -1 (1), mistakes -1 (1).
 7 gps          NEG      2               In total, positive score:1 negative score:-3. disapointed -1 (1), screwed -1 (1), difficult -1 (1), support 1 (1).
 8 camera       NEG      2               In total, positive score:0 negative score:-10. stuck -1 (1), sucks -1 (1), screwy -1 (2), not fast -1 (1), bad -1 (1), difficulty -1 (1), horrible -1 (1), not work -1 (1), hate -1 (1).
 9 television   NEG      2               In total, positive score:1 negative score:-5. crap -1 (1), issue -1 (1), stain -1 (1), inferior -1 (1), poor -1 (1), support 1 (1).
10 camera       NEG      2               In total, positive score:0 negative score:-3. failing -1 (1), issue -1 (1), never recommend -1 (1).


CustomDictionaryTable: sentiment_word

sentiment_word polarity_strength
-------------- -----------------
big 0
constant 0
crap -2
difficulty -1
disappointed -1
excellent 2
fun 1
incredible 2
love 1
mistake -1
nice 1
not tolerate -1
outstanding 2
screwed 2
small 0
stuck -1
terrific 2
terrrible -2
update 0

SQL Call

SELECT * FROM TD_SentimentExtractor (


ON sentiment_extract_input AS INPUTTABLE PARTITION BY ANY
ON sentiment_word AS CustomDictionaryTable DIMENSION
USING
TextColumn ('review')
Accumulate ('id', 'product')
AnalysisType ('SENTENCE')
) AS dt ORDER BY id;

Output
Example 2: With Custom Dictionary

id product      content  polarity  sentiment_score  sentiment_words
-- ------------ -------- --------- ---------------- ------------------------------------------------
 1 camera       i would recommend this camera to anybody .definitely worth the price .plus , when you buy some accessories , it becomes even more powerful  NEU  0
 1 camera       we primarily bought this camera for high image quality and excellent video capability without paying the price for a dslr .it has excelled in what we expected of it , and consequently represented excellent value for me .all my friends want my camera for their vacations .  POS  2  In total, positive score:4 negative score:0. excellent 2 (2).
 2 office suite the fact that i could comfortable import xls , doc , ppt and modify them , and then export them back to the doc , xls , ppt is terrific .  POS  2  In total, positive score:2 negative score:0. terrific 2 (1).
 2 office suite i needed the compatibility .it is a very intuitive suite and the drag drop functionality is terrific .  POS  2  In total, positive score:2 negative score:0. terrific 2 (1).
 2 office suite it is launched before office 2010 and it is ages ahead of it already .  NEU  0
 2 office suite it is the best office suite i have used to date .  NEU  0
 3 camera       minor irritations include touchscreen based menu only digital photos can only be trensferred via usb , requiring ilink and usb if you use ilink .  NEU  0
 3 camera       light small , using easily obtainable , high quality minidv i love it .  POS  2  In total, positive score:1 negative score:0. small 0 (1), love 1 (1).
 3 camera       this is a nice camera , delivering good quality video images decent photos .  POS  2  In total, positive score:1 negative score:0. nice 1 (1).
 4 gps          it is a fine gps .  NEU  0
 4 gps          outstanding performance , works great .  POS  2  In total, positive score:2 negative score:0. outstanding 2 (1).
 4 gps          you can even get incredible coordinate accuracy from streets and trips to compare .  POS  2  In total, positive score:2 negative score:0. incredible 2 (1).
 5 gps          big display , good backlight , really watertight , training assistant .i use in trail running and it worked well through out the race  NEU  0  In total, positive score:0 negative score:0. big 0 (1).
 5 gps          great job.  NEU  0
 5 gps          nice graphs and map route info .i would not run outside again without this unique gadget .  POS  2  In total, positive score:1 negative score:0. nice 1 (1).
 6 gps          i wish i could make all the rating stars be empty .  NEU  0
 6 gps          i have never seen so many mistakes do to what i think has to be none update of data to the system .  NEU  0  In total, positive score:0 negative score:0. update 0 (1).
 6 gps          most of the complaints i have seen in here are from a lack of rtfm.  NEU  0
 7 gps          i found their website support difficult to navigate .  NEU  0
 7 gps          i am is so disapointed and just returned it and now looking for another one  NEU  0
 7 gps          on my way home from a friends house it told me there is no possible route .  NEU  0
 7 gps          this machine is all screwed up .  POS  2  In total, positive score:2 negative score:0. screwed 2 (1).
 8 camera       this camera sucks so bad , even the dealers on ebay have difficulty selling it.  NEG  2  In total, positive score:0 negative score:-1. difficulty -1 (1).
 8 camera       i hate my camera , and im stuck with it .  NEG  2  In total, positive score:0 negative score:-1. stuck -1 (1).
 8 camera       horrible indoors , does not capture fast action, screwy software , no suprise , and screwy audio/video codec that does not work with hardly any app  NEU  0
 9 television   product is very inferior and subject to several lawsuits .  NEU  0
 9 television   i expressed my dissatifaction with the situation as this is a known issue  NEU  0
 9 television   after about 1 and a half years and hardly using the tv , a big yellow pixilated stain appeared.  NEU  0  In total, positive score:0 negative score:0. big 0 (1).
 9 television   $3k is way too much money to drop onto a piece of crap .poor customer support .  NEG  2  In total, positive score:0 negative score:-2. crap -2 (1).
10 camera       due to the constant need for repair , i would never recommend this product .  NEU  0  In total, positive score:0 negative score:0. constant 0 (1).
10 camera       i returned my camera to the vendor as i will not tolerate a sub standard product that is a known issue especially from vendor who will not admt that this needs to be removed from the shelf due to failing parts updated .  NEG  2  In total, positive score:0 negative score:-1. not tolerate -1 (1).

AdditionalDictionaryTable: sentiment_word_add

sentiment_word polarity_strength
--------------- -----------------
love 2
need for repair -2
repair -1

SQL Call

SELECT * FROM TD_SentimentExtractor (


ON sentiment_extract_input AS INPUTTABLE PARTITION BY ANY
ON sentiment_word_add AS ADDITIONALDICTIONARYTABLE DIMENSION
USING
TextColumn ('review')
Accumulate ('id', 'product')
AnalysisType ('DOCUMENT')
) AS dt ORDER BY id;

Output
Example 3: With Default Dictionary and Additional Dictionary

id product      polarity sentiment_score sentiment_words
-- ------------ -------- --------------- ------------------------------------------------------------
 1 camera       POS      2               In total, positive score:7 negative score:0. excelled 1 (1), excellent 1 (2), powerful 1 (1), worth 1 (1), recommend 1 (1), capability 1 (1).
 2 office suite POS      2               In total, positive score:5 negative score:-1. drag -1 (1), intuitive 1 (1), best 1 (1), comfortable 1 (1), terrific 1 (2).
 3 camera       POS      2               In total, positive score:6 negative score:-1. decent 1 (1), good 1 (1), irritations -1 (1), nice 1 (1), love 2 (1), obtainable 1 (1).
 4 gps          POS      2               In total, positive score:5 negative score:0. incredible 1 (1), outstanding 1 (1), fine 1 (1), great 1 (1), works 1 (1).
 5 gps          POS      2               In total, positive score:5 negative score:0. good 1 (1), worked 1 (1), nice 1 (1), great 1 (1), well 1 (1).
 6 gps          NEG      2               In total, positive score:0 negative score:-3. lack -1 (1), complaints -1 (1), mistakes -1 (1).
 7 gps          NEG      2               In total, positive score:1 negative score:-3. disapointed -1 (1), screwed -1 (1), difficult -1 (1), support 1 (1).
 8 camera       NEG      2               In total, positive score:0 negative score:-10. stuck -1 (1), sucks -1 (1), screwy -1 (2), not fast -1 (1), bad -1 (1), difficulty -1 (1), horrible -1 (1), not work -1 (1), hate -1 (1).
 9 television   NEG      2               In total, positive score:1 negative score:-5. crap -1 (1), issue -1 (1), stain -1 (1), inferior -1 (1), poor -1 (1), support 1 (1).
10 camera       NEG      2               In total, positive score:0 negative score:-5. failing -1 (1), need for repair -2 (1), issue -1 (1), never recommend -1 (1).

TD_TextParser
The TD_TextParser function performs the following operations:
• Tokenizes the text in the specified column
• Removes punctuation from the text and converts the text to lowercase
• Removes stop words from the text and converts the words to their root forms
• Creates a row for each word in the output table
• Performs stemming; that is, the function identifies the common root form of a word by removing or
replacing word suffixes


Note:

• The stems resulting from stemming may not be actual words. For example, the stem for
'communicate' is 'commun' and the stem for 'early' is 'earli' (trailing 'y' is replaced by 'i').
• This function requires the UTF8 client character set.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.

TD_TextParser Syntax

SELECT * FROM TD_TextParser (
  ON { table | view | (query) } AS InputTable
  [ ON { table | view | (query) } AS StopWordsTable DIMENSION ]
  USING
  TextColumn ('text_column')
  [ ConvertToLowerCase ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
  [ StemTokens ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
  [ RemoveStopWords ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
  [ Delimiter ('delimiter_expression') ]
  [ Punctuation ('punctuation_expression') ]
  [ TokenColName ('token_column') ]
  [ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS dt;

TD_TextParser Syntax Elements


TextColumn
[Required] Specify the name of the input table column that contains the text to parse.

ConvertToLowerCase
[Optional] Specify whether to convert the text in the specified column to lowercase.
Default value: true

StemTokens
[Optional] Specify whether to convert the words in the specified column to their root forms.
Default value: true

Delimiter
[Optional] Specify the single-character delimiter values used to tokenize the text in the
column specified in the TextColumn element.
Default values: '\t\n\f\r'

RemoveStopWords
[Optional] Specify whether to remove stop words before parsing the text in the column
specified in the TextColumn element.
Default value: true

Punctuation
[Optional] Specify the punctuation characters to replace with spaces in the text of the
column specified in the TextColumn element.
Default values: '!#$%&()*+,-./:;?@\^_`{|}~'

TokenColName
[Optional] Specify a name for the output column that contains the individual words from the
text of the column specified in the TextColumn element.
Default value: token

Accumulate
[Optional] Specify the input table column names to copy to the output table.

TD_TextParser Input
Input Table Schema

Column             Data Type                                                Description
text_column        CHAR, VARCHAR, or CLOB (CHARACTER SET LATIN or UNICODE)  The column that contains the text to parse.
accumulate_column  Any                                                      The input table column names to copy to the output table.


TD_TextParser Output
Output Table Schema

Column             Data Type                               Description
token              VARCHAR CHARACTER SET LATIN or UNICODE  Contains one row for each individual word. The default column name, token, can be changed with the TokenColName element.
AccumulateColumns  Any                                     The column names specified in the Accumulate element (for example, id), copied to the output table.

TD_TextParser Example
Input table: test_table

id paragraph
-- ----------------------------------------------
1 Programmers program with programming languages
2 The quick brown fox jumps over the lazy dog

Also, create a stopwords table with a word column that contains the stop words, with the VARCHAR or
CHAR (CHARACTER SET LATIN or UNICODE) data type.

word
----
a
an
the

SQL Call

SELECT * FROM TD_TEXTPARSER (


ON test_table AS InputTable
ON stopwords As StopWordsTable DIMENSION
USING
TextColumn('paragraph')
StemTokens('true')
RemoveStopWords('true')
Accumulate ('id')
) as dt ORDER BY id,token;

The query performs the following operations:


• Removes the stopwords from the text in the Paragraph column


• Splits the text in the Paragraph column and creates a row for each word in the output table
• Copies the ID column from the input table to the output table

Output Table

id token
-- --------
1 languag
1 program
1 program
1 programm
1 with
2 brown
2 dog
2 fox
2 jump
2 lazi
2 over
2 quick

Path and Pattern Analysis Functions

Terminology
This document uses the following terms.

Term Description

Path An ordered, start-to-finish series of actions, for example, page views, for which sequences
and sub-sequences can be created.

Sequence A sequence is the path prefixed with a caret (^), which indicates the start of a path. For
example, if a user visited page a, page b, and page c, in that order, the session sequence
is ^,a,b,c.

Subsequence For a given sequence of actions, a subsequence is one possible subset of the steps that
begins with the initial action. For example, the path a,b,c creates three subsequences: ^,a;
^,a,b; and ^,a,b,c.

Attribution
The Attribution function is used in web page analysis, where it lets companies assign weights to pages
before certain events, such as buying a product.
The function takes data and parameters from multiple tables and outputs attributions.

ML Engine function Attribution_MLE has two versions:


• Multiple-input: Accepts one or more input tables and gets many parameters from other
dimension tables.
• Single-input: Accepts only one input table and gets all parameters from syntax elements.


The Analytics Database Attribution function corresponds to the multiple-input version. Unlike
Attribution_MLE, Attribution does not support Unicode.
If a query runs longer than 3 seconds before displaying output, the syntax elements supplied
to the function are likely incorrect.

Attribution Syntax
SELECT * FROM Attribution (
ON { table | view | (query) } [ AS InputTable1 ]
PARTITION BY user_id
ORDER BY time_column
[ ON { table | view | (query) } [ AS InputTable2 ]
PARTITION BY user_id
ORDER BY time_column [,...] ]
ON conversion_event_table AS ConversionEventTable DIMENSION
[ ON excluding_event_table AS ExcludedEventTable DIMENSION ]
[ ON optional_event_table AS OptionalEventTable DIMENSION ]
ON model1_table AS FirstModelTable DIMENSION
[ ON model2_table AS SecondModelTable DIMENSION ]
USING
EventColumn ('event_column')
TimeColumn ('time_column')
WindowSize ({'rows:K' | 'seconds:K' | 'rows:K&seconds:K2'})
) AS alias ORDER BY user_id,time_stamp;

Attribution Syntax Elements


EventColumn
Specify the name of the input column that contains the clickstream events.

TimeColumn
Specify the name of the input column that contains the timestamps of the clickstream events.

WindowSize
Specify how to determine the maximum window size for the attribution calculation:

Option             Description
rows:K             Assign attributions to at most K events before the conversion event,
                   excluding events of types specified in ExcludedEventTable.
seconds:K          Assign attributions only to rows not more than K seconds before the
                   conversion event.
rows:K&seconds:K2  Apply both constraints and comply with the stricter one.

Attribution Input
Required
Table Description

Input tables (maximum of five) Contain clickstream data for computing attributions.

ConversionEventTable Contains conversion events.

FirstModelTable Defines type and distributions of first model.

Optional
Table Description

ExcludedEventTable Contains events to exclude from attribution.

OptionalEventTable Contains optional events.

SecondModelTable Defines type and distributions of second model.

Input Table Schema


Column Data Type Description

userid_column INTEGER or VARCHAR User identifier.

event_column INTEGER or VARCHAR Event from clickstream.

time_column INTEGER, SMALLINT, BIGINT, TIMESTAMP, or TIME Event timestamp.

ConversionEventTable Schema
Column Data Type Description

conversion_event VARCHAR Conversion event value (string or integer).

FirstModelTable and SecondModelTable Schema

Column  Data Type  Description
id      INTEGER    Row identifier. Rows are numbered 0, 1, 2, and so on.
model   VARCHAR    Row 0: Model type.
                   Row 1, ..., n: Distribution model definition.
                   SIMPLE model: Model table has a single row that specifies the model type
                   and parameters.
                   Other model types: n is the number of rows or events included in the model.
                   For model type and specification definitions, see Model Specification.

ExcludedEventTable Schema
Column Data Type Description

excluding_event VARCHAR Excluded event (string or integer). Cannot be a conversion event.

OptionalEventTable Schema
Column Data Type Description

optional_event VARCHAR Optional event (string or integer). Cannot be a conversion or excluded
event. Function attributes a conversion event to an optional event only if it
cannot attribute it to a regular event.

Model Specification

Model Types and Specification Definitions

In the following definitions, Row 0 of the model table is the model type and Rows 1, ..., n are
the distribution model specifications.

SIMPLE
  Rows 1, ..., n: MODEL:PARAMETERS
  Distribution model for all events. For MODEL and PARAMETER definitions, see the following table.

EVENT_REGULAR
  Rows 1, ..., n: EVENT:WEIGHT:MODEL:PARAMETERS
  Distribution model for a regular event.
  EVENT cannot be a conversion, excluded, or optional event.
  For MODEL and PARAMETER definitions, see the following table.
  Sum of WEIGHT values must be 1.0.
  For example, suppose that the model table has these specifications:
    email:0.19:LAST_CLICK:NA
    impression:0.81:UNIFORM:NA
  Within WindowSize of a conversion event, 19% of the conversion event is attributed to the last
  email event and 81% is attributed uniformly to all impression events.

EVENT_OPTIONAL
  Rows 1, ..., n: EVENT:WEIGHT:MODEL:PARAMETERS
  Distribution model for an optional event.
  EVENT must be in the optional event table.
  For MODEL and PARAMETER definitions, see the following table.
  Sum of WEIGHT values must be 1.0.

SEGMENT_ROWS
  Rows 1, ..., n: Ki:WEIGHT:MODEL:PARAMETERS
  Distribution model by row. The sum of the Ki values must be the value K specified by 'rows:K'
  in the WindowSize syntax element.
  Function considers rows from most to least recent. For example, suppose that the function call
  has these syntax elements:
    WindowSize ('rows:10')
    Model1 ('SEGMENT_ROWS',
      '3:0.5:UNIFORM:NA',
      '4:0.3:LAST_CLICK:NA',
      '3:0.2:FIRST_CLICK:NA')
  Attribution for a conversion event is divided among attributable events in the 10 rows
  immediately preceding the conversion event. If the conversion event is in row 11, the first
  model specification applies to rows 10, 9, and 8; the second applies to rows 7, 6, 5, and 4;
  and the third applies to rows 3, 2, and 1.
  Half the attribution (5/10) is uniformly divided among rows 10, 9, and 8; 3/10 goes to the last
  click in rows 7, 6, 5, and 4 (that is, in row 7); and 2/10 goes to the first click in rows 3,
  2, and 1 (that is, in row 1).

SEGMENT_SECONDS
  Rows 1, ..., n: Ki:WEIGHT:MODEL:PARAMETERS
  Distribution model by time in seconds. The sum of the Ki values must be the value K specified
  by 'seconds:K' in the WindowSize syntax element.
  Function considers rows from most to least recent. For example, suppose that the function call
  has these syntax elements:
    WindowSize ('seconds:20')
    Model1 ('SEGMENT_SECONDS',
      '6:0.5:UNIFORM:NA',
      '8:0.3:LAST_CLICK:NA',
      '6:0.2:FIRST_CLICK:NA')
  Attribution for a conversion event is divided among attributable events in the 20 seconds
  immediately preceding the conversion event. If the conversion event is at second 21, the first
  model specification applies to seconds 20-15 (counting backward); the second applies to
  seconds 14-7; and the third applies to seconds 6-1.
  Half the attribution (5/10) is uniformly divided among seconds 20-15; 3/10 goes to the last
  click in seconds 14-7; and 2/10 goes to the first click in seconds 6-1.

MODEL Values and Corresponding PARAMETER Values

MODEL values are case-sensitive. Attributable events are those whose types are not specified in
the excluding events table.

'LAST_CLICK'
  Conversion event is attributed entirely to the most recent attributable event.
  PARAMETERS: 'NA'

'FIRST_CLICK'
  Conversion event is attributed entirely to the first attributable event.
  PARAMETERS: 'NA'

'UNIFORM'
  Conversion event is attributed uniformly to the preceding attributable events.
  PARAMETERS: 'NA'

'EXPONENTIAL'
  Conversion event is attributed exponentially to the preceding attributable events (the more
  recent the event, the higher the attribution).
  PARAMETERS: 'alpha,type', where alpha is a decay factor in the range (0, 1) and type is ROW,
  MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, or YEAR. When alpha is in the range (0, 1),
  the sum of the series wi = (1-alpha)*alpha^i is 1. The function uses wi as the
  exponential weights.

'WEIGHTED'
  Conversion event is attributed to the preceding attributable events with the weights
  specified by PARAMETERS.
  PARAMETERS: You can specify any number of weights. If there are more attributable events than
  weights, the extra (least recent) events are assigned zero weight. If there are more weights
  than attributable events, the function renormalizes the weights.

Allowed FirstModelTable/SecondModelTable Combinations

FirstModelTable Type  SecondModelTable Type
SIMPLE                Not allowed
EVENT_REGULAR         EVENT_REGULAR, or EVENT_OPTIONAL (when you specify optional events table)
SEGMENT_ROWS          SEGMENT_ROWS, or SEGMENT_SECONDS (when you specify 'rows:K&seconds:K' in
                      WindowSize syntax element)
SEGMENT_SECONDS       Not allowed
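
The model tables themselves are ordinary tables that follow the FirstModelTable/SecondModelTable
schema shown earlier. As a sketch (the table name matches the example later in this section; the
VARCHAR length is an assumption), the SEGMENT_ROWS model table used there could be created
as follows:

CREATE TABLE model1_table (id INTEGER, model VARCHAR(100));

INSERT INTO model1_table VALUES (0, 'SEGMENT_ROWS');
INSERT INTO model1_table VALUES (1, '3:0.5:EXPONENTIAL:0.5,SECOND');
INSERT INTO model1_table VALUES (2, '4:0.3:WEIGHTED:0.4,0.3,0.2,0.1');
INSERT INTO model1_table VALUES (3, '3:0.2:FIRST_CLICK:NA');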

Attribution Output
Attribution Output Table Schema

Column              Data Type           Description
user_id             INTEGER or VARCHAR  User identifier from input table.
event               VARCHAR             Clickstream event from input table.
time_stamp          TIMESTAMP           Event timestamp from input table.
attribution         DOUBLE PRECISION    Fraction of attribution for conversion event that is attributed to this event.
time_to_conversion  INTEGER             Elapsed time between attributable event and conversion event.


Attribution Example: Model Assigns Weights to Events and Channels
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

This example uses models to assign attribution weights to these events and channels.

Event Type  Channels
conversion  SocialNetwork, PaidSearch
excluding   Email
optional    Direct, Referral, OrganicSearch

Input
InputTable1: attribution_sample_table1
user_id event time_stamp

1 impression 2001-09-27 23:00:01

1 impression 2001-09-27 23:00:05

1 Email 2001-09-27 23:00:15

2 impression 2001-09-27 23:00:31

2 impression 2001-09-27 23:00:51

InputTable2: attribution_sample_table2
user_id event time_stamp

1 impression 2001-09-27 23:00:19

1 SocialNetwork 2001-09-27 23:00:20

1 Direct 2001-09-27 23:00:21

1 Referral 2001-09-27 23:00:22

1 PaidSearch 2001-09-27 23:00:23

2 impression 2001-09-27 23:00:29

2 impression 2001-09-27 23:00:31


2 impression 2001-09-27 23:00:33

2 impression 2001-09-27 23:00:36

2 impression 2001-09-27 23:00:38

ConversionEventTable: conversion_event_table
conversion_events

PaidSearch

SocialNetwork

ExcludedEventTable: excluding_event_table
excluding_events

Email

OptionalEventTable: optional_event_table
optional_events

Direct

OrganicSearch

Referral

The following two model tables apply the distribution models by rows and by seconds, respectively.

FirstModelTable: model1_table
id model

0 SEGMENT_ROWS

1 3:0.5:EXPONENTIAL:0.5,SECOND

2 4:0.3:WEIGHTED:0.4,0.3,0.2,0.1

3 3:0.2:FIRST_CLICK:NA

SecondModelTable: model2_table
id model

0 SEGMENT_SECONDS

1 6:0.5:UNIFORM:NA

2 8:0.3:LAST_CLICK:NA

3 6:0.2:FIRST_CLICK:NA

SQL Call

SELECT * FROM Attribution (


ON attribution_sample_table1 AS InputTable1
PARTITION BY user_id ORDER BY time_stamp
ON attribution_sample_table2 AS InputTable2
PARTITION BY user_id ORDER BY time_stamp
ON conversion_event_table AS ConversionEventTable DIMENSION
ON excluding_event_table AS ExcludedEventTable DIMENSION
ON optional_event_table AS OptionalEventTable DIMENSION
ON model1_table AS FirstModelTable DIMENSION
ON model2_table AS SecondModelTable DIMENSION
USING
EventColumn ('event')
TimeColumn ('time_stamp')
WindowSize ('rows:10&seconds:20')
) AS dt ORDER BY user_id, time_stamp;

Output
user_id event time_stamp attribution time_to_conversion

1 impression 2001-09-27 23:00:01 0.285714 -19

1 impression 2001-09-27 23:00:05 0 ?

1 impression 2001-09-27 23:00:19 0.714286 -1

1 SocialNetwork 2001-09-27 23:00:20 ? ?

1 Direct 2001-09-27 23:00:21 0.5 -2

1 Referral 2001-09-27 23:00:22 0.5 -1

1 PaidSearch 2001-09-27 23:00:23 ? ?

Sessionize
The Sessionize function maps each click in a session to a unique session identifier. A session is a sequence
of clicks by one user that are separated by at most n seconds.


The function is useful for both sessionization and detecting web crawler ("bot") activity. A typical use is to
understand user browsing behavior on a web site.

Sessionize Syntax
SELECT * FROM Sessionize (
ON { table | view | (query) }
PARTITION BY expression [,...]
ORDER BY order_column [,...]
USING
TimeColumn ('time_column')
TimeOut (session_timeout)
[ ClickLag (min_human_click_lag) ]
[ EmitNull ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
) AS alias;

Sessionize Syntax Elements


TimeColumn
Specify the name of the input column that contains the click times.
The time_column must also be an order_column.

TimeOut
Specify the number of seconds at which the session times out. If session_timeout seconds
elapse after a click, the next click starts a new session. The data type of session_timeout is
DOUBLE PRECISION.

ClickLag
[Optional] Specify the minimum number of seconds between clicks for the session user to be
considered human. If clicks are more frequent, indicating that the user is a bot, the function
ignores the session. The min_human_click_lag must be less than session_timeout. The data
type of min_human_click_lag is DOUBLE PRECISION.
Default behavior: The function ignores no session, regardless of click frequency.

EmitNull
[Optional] Specify whether to output rows whose time_column value is NULL; such rows have
NULL values in their sessionid and clicklag columns.
Default: 'false'
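
The example below uses TimeOut and ClickLag only. As a sketch, a call that also emits rows whose
clicktime is NULL (with NULL sessionid and clicklag values) could add EmitNull:

SELECT * FROM Sessionize (
    ON sessionize_table PARTITION BY partition_id ORDER BY clicktime
    USING
    TimeColumn ('clicktime')
    TimeOut (60)
    ClickLag (0.2)
    EmitNull ('true')
) AS dt;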


Sessionize Input
Input Table Schema

Column            Data Type                                            Description
time_column       TIME, TIMESTAMP, INTEGER, BIGINT, SMALLINT, or DATE  Click times (in milliseconds if data type is INTEGER, BIGINT, or SMALLINT).
partition_column  Any                                                  Column by which input data is partitioned. Input data must be partitioned such that each partition contains all rows of an entity.
order_column      Any                                                  Column by which input data is ordered.

No input table column can have the name 'sessionid' or 'clicklag', because these are output table
column names.

Tip:
To create a single timestamp column from separate date and time columns:

SELECT CAST(datecolumn || ' ' || timecolumn AS TIMESTAMP) AS mytimestamp
FROM table;

Sessionize Output
Output Table Schema

Column        Data Type               Description
input_column  Same as in input table  Column copied from input table. Function copies every input_column to output table.
sessionid     INTEGER or BIGINT       Identifier that function assigned to session.
clicklag      BYTEINT                 '1' if the click arrived within min_human_click_lag of the preceding click (rapid-fire), '0' otherwise.

Sessionize Example
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.


Input
sessionize_table
partition_id clicktime userid productname pagetype referrer productprice

1 1110000 333 Home www.yahoo.com

1 1112000 333 Ipod Checkout www.yahoo.com 200.2

1 1160000 333 Bose Checkout 340

1 1200000 333 Home www.google.com

1 1203000 67403 Home www.google.com

1 1300000 67403 Home www.google.com

1 1301000 67403 Home

1 1302000 67403 Home

1 1340000 67403 Iphone Checkout 650

1 1450000 67403 Bose Checkout 750

1 1450200 80000 Home godaddy.com

1 1450600 80000 Bose Checkout 340

1 1450800 80000 Itrip Checkout 450

1 1452000 880000 Iphone Checkout 650

SQL Call

SELECT * FROM Sessionize (


ON sessionize_table PARTITION BY partition_id ORDER BY clicktime
USING
TimeColumn ('clicktime')
TimeOut (60)
ClickLag (0.2)
) ORDER BY partition_id, clicktime;

Output
partition_id clicktime userid  productname pagetype referrer       productprice SESSIONID CLICKLAG

1            1110000   333     ?           Home     www.yahoo.com  ?            0         f
1            1112000   333     Ipod        Checkout www.yahoo.com  200.2        0         f
1            1160000   333     Bose        Checkout ?              340          0         f
1            1200000   333     ?           Home     www.google.com ?            0         f
1            1203000   67403   ?           Home     www.google.com ?            0         f
1            1300000   67403   ?           Home     www.google.com ?            1         f
1            1301000   67403   ?           Home     ?              ?            1         f
1            1302000   67403   ?           Home     ?              ?            1         f
1            1340000   67403   Iphone      Checkout ?              650          1         f
1            1450000   67403   Bose        Checkout ?              750          2         f
1            1450200   80000   ?           Home     godaddy.com    ?            2         t
1            1450600   80000   Bose        Checkout ?              340          2         f
1            1450800   80000   Itrip       Checkout ?              450          2         t
1            1452000   880000  Iphone      Checkout ?              650          2         f

nPath
The nPath function scans a set of rows, looking for patterns that you specify. For each set of input rows that
matches the pattern, nPath produces a single output row. The function provides a flexible pattern-matching
capability that lets you specify complex patterns in the input data and define the values that are output for
each matched input set.

nPath is useful when your goal is to identify the paths that lead to an outcome. For example, you can use
nPath to analyze:
• Web site click data, to identify paths that lead to sales over a specified amount
• Sensor data from industrial processes, to identify paths to poor product quality
• Healthcare records of individual patients, to identify paths that indicate that patients are at risk of
developing conditions such as heart disease or diabetes
• Financial data for individuals, to identify paths that provide information about credit or fraud risks
The output from the nPath function can be input to other ML Engine functions or to a visualization tool such
as Teradata® AppCenter.


Sankey Diagram of Analytics Database nPath Output

An nPath call specifies:


• Mode (overlapping or nonoverlapping)
• Pattern to match
• Symbols to use
• [Optional] Filters to apply
• Results to output

Note:

• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
• When used with this function, the ORDER BY clause supports only ASCII collation.
• When used with this function, the PARTITION BY clause assumes column names are in
Normalization Form C (NFC).

nPath Syntax
SELECT * FROM nPath (
ON { table | view | (query) }
PARTITION BY partition_column
ORDER BY order_column [ ASC | DESC ][...]
[ ON { table | view | (query) }
[ PARTITION BY partition_column | DIMENSION ]
ORDER BY order_column [ ASC | DESC ]


][...]
USING
Mode ({ OVERLAPPING | NONOVERLAPPING })
Pattern ('pattern')
Symbols ({ col_expr = symbol_predicate AS symbol}[,...])
[ Filter (filter_expression[,...]) ]
Result ({ aggregate_function (expression OF [ANY] symbol [,...]) AS alias_1 }[,...])
) AS alias_2;

nPath Syntax Elements


Mode
Specify the pattern-matching mode:
Option          Description
OVERLAPPING     Find every occurrence of pattern in partition, regardless of whether it is
                part of a previously found match. One row can match multiple symbols
                in a given matched pattern.
NONOVERLAPPING  Start next pattern search at row that follows last pattern match.

Pattern
Specify the pattern for which the function searches. You compose pattern with the symbols
(which you define in the Symbols syntax element), operators, and parentheses.
When patterns have multiple operators, the function applies them in order of precedence,
and applies operators of equal precedence from left to right. To force the function to evaluate
a subpattern first, enclose it in parentheses. For more information, see nPath Patterns.

Symbols
Defines the symbols that appear in the values of the Pattern and Result syntax elements. The
col_expr is an expression whose value is a column name, symbol is any valid identifier, and
symbol_predicate is a SQL predicate (often a column name).
Each col_expr = symbol_predicate must satisfy the SQL syntax of the Analytics Database
when nPath is invoked. Otherwise, it is a syntax error.
For example, this Symbols syntax element is for analyzing website visits:

Symbols (
pagetype = 'homepage' AS H,
pagetype <> 'homepage' AND pagetype <> 'checkout' AS PP,
pagetype = 'checkout' AS CO
)

Teradata Vantage™ - Analytics Database Analytic Functions - 17.20,


Release 17.20 342
10: Path and Pattern Analysis Functions

The symbol is case-insensitive; however, a symbol of one or two uppercase letters is easy
to identify in patterns.
If col_expr represents a column that appears in multiple input tables, you must qualify the
ambiguous column name with its table name. For example:

Symbols (
weblog.pagetype = 'homepage' AS H,
weblog.pagetype = 'thankyou' AS T,
ads.adname = 'xmaspromo' AS X,
ads.adname = 'realtorpromo' AS R
)

For more information about symbols that appear in the Pattern syntax element value, see
nPath Symbols. For more information about symbols that appear in the Result syntax
element value, see nPath Results.

Filter
[Optional] Specify filters to impose on the matched rows. The function combines the filter
expressions using the AND operator.
This is the filter_expression syntax:

symbol_expression comparison_operator symbol_expression

The two symbol expressions must be type-compatible. This is the


symbol_expression syntax:

{ FIRST | LAST }(column_with_expression OF [ANY](symbol[,...]))

The column_with_expression cannot contain the operator AND or OR, and all its columns
must come from the same input. If the function has multiple inputs, column_with_expression
and symbol must come from the same input.
The comparison_operator is either <, >, <=, >=, =, or <>.
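
For example, assuming symbols H (home page) and CO (checkout) defined in the Symbols element
and a clicktime column in the input, a hypothetical filter that keeps only matches in which
the first home-page click precedes the first checkout click could be written as:

Filter (FIRST (clicktime OF H) < FIRST (clicktime OF CO))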

Result
Defines the output columns. The col_expr is an expression whose value is a column
name; it specifies the values to retrieve from the matched rows. The function applies
aggregate_function to these values. For details, see nPath Results.
The function evaluates this syntax element once for every matched pattern in the partition
(that is, it outputs one row for each pattern match).


nPath Input
The function requires at least one partitioned input table, and can have additional input tables that are either
partitioned or DIMENSION tables.

Note:
If the input to nPath is nondeterministic, the results are nondeterministic.

Input Table Schema

Column            Data Type           Description
partition_column  INTEGER or VARCHAR  Column by which every partitioned input table is partitioned.
order_column      INTEGER or VARCHAR  Column by which every input table is ordered.
input_column      INTEGER or VARCHAR  Contains data to search for patterns.

nPath Output
The Result syntax element determines the output—see nPath Results.

nPath Symbols
A symbol identifies a row in the Pattern and Result syntax elements. A symbol can be any valid identifier
(that is, a sequence of characters and digits that begins with a character) but is typically one or two
uppercase letters. Symbols are case-insensitive; that is, 'SU' is identical to 'su', and the system reports an
error if you use both.
For example, suppose that you have this input table:

record city temp rh cloudcover windspeed winddirection rained_next_day

1 ? 81 30 0.0 5 NW 1

2 Tempe 76 40 0.2 15 NE 0

3 ? 70 70 0.4 10 N 0

4 Tusayan 75 50 0.4 5 NW 0

This table has examples of symbol definitions and the rows of the table that they match in
NONOVERLAPPING mode:

Symbol Definition                             Rows Matched
temp >= 80 AS H                               1
winddirection = 'NW' AS NW                    1, 4
winddirection = 'NW' OR windspeed > 12 AS W   1, 2, 4
cloudcover <> 0.0 AND rh > 35 AS C            2, 3, 4
TRUE AS A                                     1, 2, 3, 4
                                              (This symbol definition matches all rows, for any
                                              input table.)
city like 'tu%' AS TU                         The like operator depends on Teradata session mode:
                                              BTET mode matches 1, 3, 4; ANSI mode matches none.
city not like 'tu%' AS TU                     2
city not like 'Tu%' AS N                      2
city like 'Tu%n' AS T                         1, 3, 4
                                              (The % operator matches any number of characters.)
city like 'Tu___n' AS T                       1, 3
                                              (The underscore (_) operator matches any single
                                              character. The pattern 'Tu___n' has three underscores,
                                              so it matches 'Tucson' but not 'Tusayan'.)

Rows with NULL values do not match any symbol. That is, the function ignores rows with missing values.

LAG and LEAD Expressions in Symbol Predicates

You can create symbol predicates that compare a row to a previous or subsequent row, using a LAG or
LEAD operator.

LAG Expression Syntax

{ current_expr operator LAG (previous_expr, lag_rows [, default]) |
  LAG (previous_expr, lag_rows [, default]) operator current_expr }

where:
• current_expr is the name of a column from the current row (or an expression operating on
this column).


• operator is either >, >=, <, <=, =, or <>


• previous_expr is the name of a column from a previous row (or an expression operating on
this column).
• lag_rows is the number of rows to count backward from the current row to reach the previous row.
For example, if lag_rows is 1, the previous row is the immediately preceding row.
• default is the value to use for previous_expr when there is no previous row (that is, when the current
row is the first row or there is no row that is lag_rows before the current row).

LAG and LEAD Expression Rules

• A symbol definition can have multiple LAG and LEAD expressions.


• A symbol definition that has a LAG or LEAD expression cannot have an OR operator.
• If a symbol definition has a LAG or LEAD expression and the input is not a table, you must create
an alias of the input query, as in LAG and LEAD Expressions Example: Alias for Input Query.

LAG and LEAD Expressions Example: Alias for Input Query

Input
bank_web_clicks
customer_id session_id page datestamp

529 0 ACCOUNT SUMMARY 2004-03-17 16:35:00

529 0 FAQ 2004-03-17 16:38:00

529 0 ACCOUNT HISTORY 2004-03-17 16:42:00

529 0 FUNDS TRANSFER 2004-03-17 16:45:00

529 0 ONLINE STATEMENT ENROLLMENT 2004-03-17 16:49:00

529 0 PROFILE UPDATE 2004-03-17 16:50:00

529 0 ACCOUNT SUMMARY 2004-03-17 16:51:00

529 0 CUSTOMER SUPPORT 2004-03-17 16:53:00

529 0 VIEW DEPOSIT DETAILS 2004-03-17 16:57:00

529 1 ACCOUNT SUMMARY 2004-03-18 01:16:00

529 1 ACCOUNT SUMMARY 2004-03-18 01:18:00

529 1 FAQ 2004-03-18 01:20:00

... ... ... ...


SQL Call

SELECT * FROM nPath (


ON (SELECT customer_id, session_id, datestamp, page FROM bank_web_clicks)
AS dt1
PARTITION BY customer_id, session_id
ORDER BY datestamp
USING
Mode (NONOVERLAPPING)
Pattern ('(DUP|A)*')
Symbols (
TRUE AS A,
page = LAG (page,1) AS DUP
)
Result (
FIRST (customer_id OF any (A)) AS customer_id,
FIRST (session_id OF A) AS session_id,
FIRST (datestamp OF A) AS first_date,
LAST (datestamp OF ANY(A,DUP)) AS last_date,
ACCUMULATE (page OF A) AS page_path,
ACCUMULATE (page of DUP) AS dup_path
)
) AS dt2;

Output
Columns 1-4
customer_id session_id first_date last_date

529 0 2004-03-17 16:35:00 2004-03-17 16:57:00

529 1 2004-03-18 01:16:00 2004-03-18 01:28:00

529 2 2004-03-18 09:22:00 2004-03-18 09:36:00

529 3 2004-03-18 22:41:00 2004-03-18 22:55:00

529 4 2004-03-19 08:33:00 2004-03-19 08:41:00

529 5 2004-03-19 10:06:00 2004-03-19 10:14:00

... ... ... ...


Columns 5-6
page_path   dup_path

[ACCOUNT SUMMARY, FAQ, ACCOUNT HISTORY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT, PROFILE UPDATE, ACCOUNT SUMMARY, CUSTOMER SUPPORT, VIEW DEPOSIT DETAILS]   []
[ACCOUNT SUMMARY, FAQ, ACCOUNT SUMMARY, FUNDS TRANSFER, ACCOUNT HISTORY, VIEW DEPOSIT DETAILS, ACCOUNT SUMMARY, ACCOUNT HISTORY]   [ACCOUNT SUMMARY]
[ACCOUNT SUMMARY, ACCOUNT HISTORY, FUNDS TRANSFER, ACCOUNT SUMMARY, FAQ]   [ACCOUNT SUMMARY, ACCOUNT SUMMARY, FAQ]
[ACCOUNT SUMMARY, ACCOUNT HISTORY, ACCOUNT SUMMARY, ACCOUNT HISTORY, FAQ, ACCOUNT SUMMARY]   [ACCOUNT SUMMARY]
[ACCOUNT SUMMARY, FAQ, VIEW DEPOSIT DETAILS, FAQ]   []
[ACCOUNT SUMMARY, FUNDS TRANSFER, VIEW DEPOSIT DETAILS, ACCOUNT HISTORY]   [VIEW DEPOSIT DETAILS]
...   ...

LAG and LEAD Expressions Example: No Alias for Input Query

Input
aggregate_clicks
userid sessionid productname pagetype clicktime referrer productprice

1039 1 sneakers home 2009-07-29 20:17:59 Company1 100

1039 2 books home 2009-04-21 13:17:59 Company4 300

1039 3 television home 2009-05-23 13:17:59 Company2 500

1039 4 envelopes home 2009-07-16 11:17:59 Company3 10

1039 4 envelopes home1 2009-07-16 11:18:16 Company3 10

1039 4 envelopes page1 2009-07-16 11:18:18 Company3 10

1039 5 bookcases home 2009-08-19 22:17:59 Company5 150

1039 5 bookcases home1 2009-08-19 22:18:02 Company5 150

1039 5 bookcases page1 2009-08-19 22:18:05 Company5 150

1039 5 bookcases page2 2009-08-22 04:20:05 Company5 150

1039 5 bookcases checkout 2009-08-24 14:30:05 Company5 150

1039 5 bookcases page2 2009-08-27 23:03:05 Company5 150

1040 1 tables home 2009-07-29 20:17:59 Company5 250


1040 2 Appliances home 2009-04-21 13:17:59 Company6 1500

1040 3 laptops home 2009-05-23 13:17:59 Company7 800

1040 4 chairs home 2009-07-16 11:17:59 Company3 400

1040 4 chairs home1 2009-07-16 11:18:16 Company3 400

1040 4 chairs page1 2009-07-16 11:18:18 Company3 400

1040 5 cellphones home 2009-08-19 22:17:59 Company8 600

1040 5 cellphones home1 2009-08-19 22:18:02 Company8 600

1040 5 cellphones page1 2009-08-19 22:18:05 Company8 600

1040 5 cellphones page2 2009-08-22 04:20:05 Company8 600

1040 5 cellphones checkout 2009-08-24 14:30:05 Company8 600

1040 5 cellphones page2 2009-08-27 23:03:05 Company8 600

... ... ... ... ... ... ...

SQL Call

SELECT * FROM nPath (


ON aggregate_clicks PARTITION BY sessionid ORDER BY clicktime ASC
USING
Mode (NONOVERLAPPING)
Pattern ('H+.D*.X*.P1.P2+')
Symbols (
TRUE AS X,
pagetype = 'home' AS H,
pagetype <> 'home' AND pagetype <> 'checkout' AS D,
pagetype = 'checkout' AS P1,
pagetype = 'checkout' AND
productprice > 100 AND
productprice > LAG (productprice, 1, 100) AS P2
)
Result (
FIRST (productname OF P1) AS first_product,
MAX_CHOOSE (productprice, productname OF P2) AS max_product,
FIRST (sessionid OF P2) AS sessionid
)
) AS dt ORDER BY sessionid;


Output
first_product max_product sessionid

bookcases cellphones 5

nPath Patterns
The value of the Pattern syntax element specifies the sequence of rows for which the function searches.
You compose the pattern definition, pattern, with symbols (which you define in the Symbols syntax
element), operators, and parentheses. In the pattern definition, symbols represent rows. You can combine
symbols with pattern operators to define simple or complex patterns of rows for which to search.
The following table lists and describes the basic pattern operators, in decreasing order of precedence. In
the table, A and B are symbols that have been defined in the Symbols syntax element.

Basic Pattern Operators

Operator  Description                                                                 Precedence
A         Matches one row that meets the definition of A.                             1 (highest)
A.        Matches one row that meets the definition of A.                             1
A?        Matches 0 or 1 rows that satisfy the definition of A.                       1
A*        Matches 0 or more rows that satisfy the definition of A (greedy operator).  1
A+        Matches 1 or more rows that satisfy the definition of A (greedy operator).  1
A.B       Matches two rows, where the first row meets the definition of A and the     2
          second row meets the definition of B.
A|B       Matches one row that meets the definition of either A or B.                 3

The nPath function uses greedy pattern matching. That is, it finds the longest available match
when matching patterns specified by nongreedy operators. For more information, see nPath Greedy
Pattern Matching.
These examples show the pattern operator precedence rules:
• A.B+ is the same as A.(B+)
• A|B* is the same as A|(B*)
• A.B|C is the same as (A.B)|C


Example:

A.(B|C)+.D?.X*.A

The preceding pattern definition matches any set of rows whose first row meets the definition of symbol
A, followed by a nonempty sequence of rows, each of which meets the definition of either symbol B or C,
optionally followed by one row that meets the definition of symbol D, followed by any number of rows that
meet the definition of symbol X, and ending with a row that meets the definition of symbol A.
You can use parentheses to define precedence rules. Parentheses are recommended for clarity, even
where not strictly required.
To indicate that a sequence of rows must start or end with a row that matches a certain symbol, use the
start anchor (^) or end anchor ($) operator.

Start Anchor and End Anchor Pattern Operators

Operator   Description

^A         Appears only at the beginning of a pattern. Indicates that a set of rows must start with a row that
           meets the definition of A.

A$         Appears only at the end of a pattern. Indicates that a set of rows must end with a row that meets
           the definition of A.
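
For example, combining both anchors with clickstream symbols like those used later in this chapter (H for home pages, A for any row, C for checkout pages), the following pattern matches only sequences that start with one or more home-page rows and end with one or more checkout rows:

Pattern ('^H+.A*.C+$')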

Subpattern operators let you specify how often a subpattern must appear in a match. You can specify
a minimum number, exact number, or range. In the following table, X represents any pattern definition
composed of symbols and any of the previously described pattern operators.

Subpattern Operators

Operator    Description

(X){a}      Matches exactly a occurrences of the pattern X.

(X){a,}     Matches at least a occurrences of the pattern X.

(X){a,b}    Matches at least a and no more than b occurrences of the pattern X.
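
For example, the following pattern (used in the range-matching examples later in this chapter) matches one or more home-page rows, then any number of rows that are neither home nor checkout pages, then three to six checkout rows, and then one more non-home, non-checkout row:

Pattern ('H+.D*.C{3,6}.D')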

nPath Greedy Pattern Matching

The nPath function uses greedy pattern matching, finding the longest available match despite any
nongreedy operators in the pattern.
For example, consider the input table link2:


nPath Greedy Pattern Matching Examples Input Table link2


userid job_title startdate enddate

21 Chief Exec Officer 1994-10-01 2005-02-28

21 Software Engineer 1996-10-01 2001-06-30

21 Software Engineer 1998-10-01 2001-06-30

21 Chief Exec Officer 2005-03-01 2007-03-31

21 Chief Exec Officer 2007-06-01 ?

The following query returns the following table:

SELECT job_transition_path, count(*) AS path_count FROM nPath (


ON link2 PARTITION BY userid ORDER BY startdate
USING
Mode (NONOVERLAPPING)
Pattern ('CEO.ENGR.OTHER*')
Symbols (
job_title like '%Software Eng%' AS ENGR,
TRUE AS OTHER,
job_title like 'Chief Exec Officer' AS CEO
)
Result (accumulate(job_title OF ANY(ENGR,OTHER,CEO)) AS job_transition_path)
) AS dt GROUP BY 1 ORDER BY 2 DESC;

job_transition_path path_count

[Chief Exec Officer, Software Engineer, Software Engineer, Chief Exec Officer, Chief Exec Officer] 1

In the pattern, CEO matches the first row, ENGR matches the second row, and OTHER* matches the remaining rows.

The following query returns the following table:

SELECT job_transition_path , count(*) AS path_count FROM nPath (


ON link2 PARTITION BY userid ORDER BY startdate


USING
Mode (NONOVERLAPPING)
Pattern ('CEO.ENGR.OTHER*.CEO')
Symbols (
job_title like '%Software Eng%' AS ENGR,
TRUE AS OTHER,
job_title like 'Chief Exec Officer' AS CEO
)
Result (accumulate(job_title OF ANY(ENGR,OTHER,CEO)) AS job_transition_path)
) AS dt GROUP BY 1 ORDER BY 2 DESC;

job_transition_path path_count

[Chief Exec Officer, Software Engineer, Software Engineer, Chief Exec Officer, Chief Exec Officer] 1

In the pattern, CEO matches the first row, ENGR matches the second row, OTHER* matches the next two rows, and CEO matches the last row.

nPath Filters
The Filter syntax element specifies filters to impose on the matched rows.

nPath Filters Example

Using clickstream data from an online store, this example finds the sessions where the user visited the
checkout page within 10 minutes of visiting the home page. Because there is no way to know in advance
how many rows might appear between the home page and the checkout page, the example cannot use
a LAG or LEAD expression. Therefore, it uses the Filter syntax element.

Input
clickstream
userid sessionid clicktime pagetype

1 1 10-10-2012 10:15 home

1 1 10-10-2012 10:16 view


1 1 10-10-2012 10:17 view

1 1 10-10-2012 10:20 checkout

1 1 10-10-2012 10:30 checkout

1 1 10-10-2012 10:35 view

1 1 10-10-2012 10:45 view

2 2 10-10-2012 13:15 home

2 2 10-10-2012 13:16 view

2 2 10-10-2012 13:43 checkout

2 2 10-10-2012 13:35 view

2 2 10-10-2012 13:45 view

SQL Call

SELECT * FROM nPath (


ON clickstream PARTITION BY userid ORDER BY clicktime
USING
Symbols (
pagetype='home' AS home,
pagetype <> 'home' AND pagetype <> 'checkout' AS clickview,
pagetype='checkout' AS checkout
)
Pattern ('home.clickview*.checkout')
Result (
FIRST(userid of ANY(home, checkout, clickview)) AS userid,
FIRST (sessionid of ANY(home, checkout, clickview)) AS sessionid,
COUNT (* of any(home, checkout, clickview)) AS cnt,
FIRST (clicktime of ANY(home)) AS firsthome,
LAST (clicktime of ANY(checkout)) AS lastcheckout
)
Filter (
FIRST (clicktime + interval '10' minute OF ANY (home)) >
FIRST (clicktime of any(checkout))
)
Mode (NONOVERLAPPING)
);


Output
userid sessionid cnt firsthome lastcheckout

1 1 4 2012-10-10 10:15:00 2012-10-10 10:20:00

nPath Results
The Result syntax element defines the output columns, specifying the values to retrieve from the matched
rows and the aggregate function to apply to these values.
For each pattern, the nPath function can apply one or more aggregate functions to the matched rows and
output the aggregated results. These are the supported aggregate functions:
• SQL aggregate functions AVG, COUNT, MAX, MIN, and SUM, described in Teradata Vantage™ - SQL
Functions, Expressions, and Predicates, B035-1145
• ML Engine nPath sequence aggregate functions described in the following table
In the following table, col_expr is an expression whose value is a column name, symbol is defined by the
Symbols syntax element, and symbol_list has this syntax:

{ symbol | ANY (symbol[,...]) }

COUNT ( { * | [DISTINCT] col_expr } OF symbol_list )
Returns either the total number of matched rows (*) or the number (or distinct number) of col_expr values in the matched rows.

FIRST ( col_expr OF symbol_list )
Returns the col_expr value of the first matched row.

LAST ( col_expr OF symbol_list )
Returns the col_expr value of the last matched row.

NTH ( col_expr, n OF symbol_list )
Returns the col_expr value of the nth matched row, where n is a nonzero value of the data type SMALLINT, INTEGER, or BIGINT. The sign of n determines whether the nth matched row is counted from the first or last matched row. For example, if n is 1, the nth matched row is the first matched row, and if n is -1, the nth matched row is the last matched row. If n is greater than the number of matched rows, the NTH function returns NULL.

FIRST_NOTNULL ( col_expr OF symbol_list )
Returns the first non-null col_expr value in the matched rows.

LAST_NOTNULL ( col_expr OF symbol_list )
Returns the last non-null col_expr value in the matched rows.

MAX_CHOOSE ( quantifying_col_expr, descriptive_col_expr OF symbol_list )
Returns the descriptive_col_expr value of the matched row with the highest-sorted quantifying_col_expr value. For example, MAX_CHOOSE (product_price, product_name OF B) returns the product_name of the most expensive product in the rows that map to B. The descriptive_col_expr can have any data type. The quantifying_col_expr must have a sortable data type (SMALLINT, INTEGER, BIGINT, DOUBLE PRECISION, DATE, TIME, TIMESTAMP, VARCHAR, or CHARACTER).

MIN_CHOOSE ( quantifying_col_expr, descriptive_col_expr OF symbol_list )
Returns the descriptive_col_expr value of the matched row with the lowest-sorted quantifying_col_expr value. For example, MIN_CHOOSE (product_price, product_name OF B) returns the product_name of the least expensive product in the rows that map to B. The descriptive_col_expr can have any data type. The quantifying_col_expr must have a sortable data type (SMALLINT, INTEGER, BIGINT, DOUBLE PRECISION, DATE, TIME, TIMESTAMP, VARCHAR, or CHARACTER).

DUPCOUNT ( col_expr OF symbol_list )
Returns the duplicate count for col_expr in the matched rows. That is, for each matched row, the function returns the number of occurrences of the current value of col_expr in the immediately preceding matched row. When col_expr is also the ORDER BY col_expr, this function returns the equivalent of ROW_NUMBER()-RANK().

DUPCOUNTCUM ( col_expr OF symbol_list )
Returns the cumulative duplicate count for col_expr in the matched rows. That is, for each matched row, the function returns the number of occurrences of the current value of col_expr in all preceding matched rows. When col_expr is also the ORDER BY col_expr, this function returns the equivalent of ROW_NUMBER()-DENSE_RANK().

ACCUMULATE ( [ DISTINCT | CDISTINCT ] col_expr OF symbol_list [ DELIMITER 'delimiter' ] )
Returns, for each matched row, the concatenated values in col_expr, separated by delimiter. delimiter is a string of LATIN characters; its default value is ', ' (a comma followed by a space). DISTINCT limits the concatenated values to distinct values. CDISTINCT limits the concatenated values to consecutive distinct values. The accumulated string can have at most 32000 UNICODE characters or 64000 LATIN characters. If longer, the function truncates the string to the maximum number of characters allowed.


You can compute an aggregate over more than one symbol. For example, SUM (val OF ANY (A,B))
computes the sum of the values of the attribute val across all rows in the matched segment that map to A
or B.
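
For example, the following Result syntax element (a minimal sketch that assumes a numeric input column named val, such as the one in the clicks1 table later in this chapter) combines a sequence aggregate and a SQL aggregate over the rows that map to either A or B:

Result (
  FIRST (sessionid OF ANY (A, B)) AS sessionid,
  SUM (val OF ANY (A, B)) AS total_val
)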

nPath Results Examples

nPath Results Example: FIRST, LAST_NOTNULL, MAX_CHOOSE, and MIN_CHOOSE

Input
trans1
userid gender ts productname productamt

1 M 2012-01-01 00:00:00 shoes 100

1 M 2012-02-01 00:00:00 books 300

1 M 2012-03-01 00:00:00 television 500

1 M 2012-04-01 00:00:00 envelopes 10

2 2012-01-01 00:00:00 bookcases 150

2 2012-02-01 00:00:00 tables 250

2 F 2012-03-01 00:00:00 appliances 1500

3 F 2012-01-01 00:00:00 chairs 400

3 F 2012-02-01 00:00:00 cellphones 600

3 F 2012-03-01 00:00:00 dvds 50

SQL Call

SELECT * FROM nPath (


ON trans1 PARTITION BY userid ORDER BY ts
USING
Mode (NONOVERLAPPING)
Pattern ('A+')
Symbols (TRUE AS A)
Result (
FIRST (userid OF A) AS Userid,
LAST_NOTNULL (gender OF A) AS Gender,
MAX_CHOOSE (productamt, productname OF A) AS Max_prod,
MIN_CHOOSE (productamt, productname OF A) AS Min_prod


)
) ORDER BY 1;

Output
userid gender max_prod min_prod

1 M television envelopes

2 F appliances bookcases

3 F cellphones dvds

nPath Results Example: FIRST and Three Forms of ACCUMULATE

Input
clicks
userid sessionid productname pagetype clicktime referrer productprice

1039 1 ? home 06:59:13 Company1 100

1039 1 ? home 07:00:10 Company2 300

1039 1 television checkout 07:00:12 Company2 500

1039 1 television checkout 07:00:18 Company2 10

1039 1 envelopes checkout 07:01:00 Company3 10

1039 1 ? checkout 07:01:10 Company3 10

SQL Call

SELECT * FROM nPath (


ON clicks PARTITION BY sessionid ORDER BY clicktime
USING
Mode (NONOVERLAPPING)
Symbols (
pagetype='home' AS H,
pagetype='checkout' AS C,
pagetype <> 'home' AND pagetype <>'checkout' AS A
)
Pattern ('^H+.A*.C+$')
Result (
FIRST (sessionid OF ANY (H, A, C)) AS sessionid,
FIRST (clicktime OF H) AS firsthome,


FIRST (clicktime OF C) AS firstcheckout,


ACCUMULATE (productname OF ANY (H,A,C) DELIMITER '*')
AS products_accumulate,
ACCUMULATE (CDISTINCT productname OF ANY (H,A,C) DELIMITER '$$')
AS cde_dup_products,
ACCUMULATE (DISTINCT productname OF ANY (H,A,C)) AS de_dup_products
)
) ORDER BY sessionid;


Output

sessionid: 1
firsthome: 06:59:13
firstcheckout: 07:00:12
products_accumulate: [null*null*television*television*envelopes*null]
cde_dup_products: [null$$television$$envelopes$$null]
de_dup_products: [null, television, envelopes]


nPath Results Example: FIRST, Three Forms of ACCUMULATE, COUNT, and NTH

Input
The input table for this example is clicks, as in nPath Results Example: FIRST and Three Forms
of ACCUMULATE.

SQL Call

SELECT * FROM nPath (


ON clicks PARTITION BY sessionid ORDER BY clicktime
USING
Mode (NONOVERLAPPING)
Symbols (
pagetype='home' AS H,
pagetype='checkout' AS C,
pagetype <> 'home' AND pagetype <>'checkout' AS A
)
Pattern ('^H+.A*.C+$')
Result (
FIRST (sessionid OF ANY (H, A, C)) AS sessionid,
FIRST (clicktime OF H) AS firsthome,
FIRST (clicktime OF C) AS firstcheckout,
ACCUMULATE (productname OF ANY (H,A,C)) AS products_accumulate,
COUNT (DISTINCT productname OF ANY(H,A,C)) AS count_distinct_products,
ACCUMULATE (CDISTINCT productname OF ANY
(H,A,C)) AS consecutive_distinct_products,
ACCUMULATE (DISTINCT productname OF ANY (H,A,C)) AS distinct_products,
NTH (productname, -1 OF ANY(H,A,C)) AS nth
)
) ORDER BY sessionid;


Output

sessionid: 1
firsthome: 06:59:13
firstcheckout: 07:00:12
products_accumulate: [null, null, television, television, envelopes, null]
count_distinct_products: 2
consecutive_distinct_products: [null, television, envelopes, null]
distinct_products: [null, television, envelopes]
nth: ?


nPath Results Example: Combine Values from One Row with Values from the Next Row

Input
The input table is clickstream, as in nPath Filters Example.

SQL Call

SELECT * FROM nPath (


ON clickstream PARTITION BY userid ORDER BY userid, sessionid, clicktime
USING
Mode (OVERLAPPING)
Pattern ('A.B')
Symbols (TRUE AS A, TRUE AS B)
Result (
FIRST (sessionid OF A) AS sessionid,
FIRST (pagetype OF A) AS pageid,
FIRST (pagetype OF B) AS next_pageid
)
) ORDER BY sessionid;

Output
sessionid pageid next_pageid

1 home view

1 view view

1 checkout view

1 checkout checkout

1 view checkout

1 view view

2 checkout view

2 home view

2 view view

2 view checkout


nPath Results Example: Hindi Input, ACCUMULATE FIRST_NOTNULL

Input
The example has two input tables that include Hindi characters.

हिंदी टेबल
सत्रआईडी क्लिककरें token उत्पादकानाम पेजकाप्रकार रेफरर

1 06:59:13.000000 1 घर गूगल डॉट कॉम

13 15:35:08.000000 15 घर गूगल डॉट कॉम

400 10:00:00.000000 300 घर

9000 05:30:15.000000 ? घर

1 07:00:10.000000 11 घर गूगल डॉट कॉम

9001 05:30:15.000000 ? लॉग इन

400 10:05:04.000000 12 आकाशगंगा s4 चेक आउट

9000 05:30:20.000000 ? लेनोवो g580 चेक आउट

1 07:00:12.000000 111 ipod चेक आउट गूगल डॉट कॉम

9001 05:30:15.000000 ? घर

400 10:05:03.000000 12 कागज एक

9000 05:50:44.000000 ? लैपटॉप case चेक आउट

1 07:01:00.000000 1111 बोस चेक आउट

9001 05:30:20.000000 ? prod4 चेक आउट

400 09:59:55.000000 12 लॉग इन

9000 12:50:55.000000 ? लॉग आउट

1 18:00:00.000000 10 लॉग इन

9001 12:50:55.000000 ? लॉग आउट

400 10:05:55.000000 18 लॉग आउट

14 13:18:30.000000 2 घर गूगल डॉट कॉम

1 18:00:10.000000 10 घर

8000 16:00:00.000000 8001 लॉग इन

400 10:05:02.000000 18 घर

14 13:18:31.000000 8 कागज एक


1 18:00:15.000000 10 आई - फ़ोन USB cable चेक आउट

8000 16:00:10.000000 8002 घर

400 18:00:10.000000 18 घर

14 13:18:32.000000 8 page2

666 12:50:15.000000 40 घर

8000 16:00:20.000000 8003 nexus7 चेक आउट

400 18:05:10.000000 18 prod1 चेक आउट

14 13:18:40.000000 20 आई - फ़ोन चेक आउट

666 12:50:20.000000 41 लेनोवो g580 चेक आउट

8000 16:00:40.000000 8004 लॉग आउट

400 18:08:10.000000 100 prod2 चेक आउट

14 13:19:00.000000 20 बोस चेक आउट

666 12:50:44.000000 42 लैपटॉप case चेक आउट

400 18:10:10.000000 150 लॉग आउट

14 13:20:00.000000 20 सैमसंग चेक आउट

666 12:50:55.000000 50 लॉग आउट

400 09:59:45.000000 150 लॉग इन

500 08:15:12.000000 12 लॉग इन

10000 16:00:00.000000 1 लॉग इन

400 09:59:40.000000 210 लॉग इन

500 08:15:15.000000 31 घर

10000 16:00:10.000000 1 घर

400 09:59:55.000000 220 लॉग इन

500 08:15:20.000000 123 कागज एक

10000 16:00:20.000000 2 nexus7 चेक आउट

400 09:59:55.000000 220 लॉग इन

500 08:16:00.000000 1231 आकाशगंगा चार्जर चेक आउट

10000 16:00:40.000000 4 लॉग आउट

500 08:16:30.000000 1232 हेडफोन चेक आउट


2 15:34:25.000000 333 घर गूगल डॉट कॉम

500 08:12:12.000000 1233 लॉग इन

2 15:34:25.000000 333 लॉग आउट

250 20:00:01.000000 8 घर गो डैडी डॉट कॉम

250 20:02:00.000000 80 बोस चेक आउट

250 20:02:50.000000 81 itrip चेक आउट

250 20:03:00.000000 82 आई - फ़ोन चेक आउट

कवज्ञापन
ररलेसमय channel कवज्ञापन duration

20:02:01.000000 सीएनबीसी 13 1000

15:35:06.000000 सीएनबीसी 14 1000

15:34:26.000000 food network 11 1000

13:18:42.000000 espn 12 1000

07:00:20.000000 सीएनबीसी 10 1000

SQL Call

SELECT * FROM nPath (


ON "हिंदी टेबल" PARTITION BY "सत्र आईडी" ORDER BY "क्लिक करें"
ON "कवज्ञापन" DIMENSION ORDER BY "ररले समय"
USING
Mode (NONOVERLAPPING)
Pattern ('^(X|A)+.C')
Symbols ( "पेज का प्रकार" IS NULL OR "पेज का प्रकार" IS NOT NULL AS X,
"पेज का प्रकार" = 'चेक आउट' AS C,
"कवज्ञापन" IS NULL OR "कवज्ञापन" IS NOT NULL AS A
)
Result (
ACCUMULATE ("रेफरर" of ANY(X)) AS "रेफरल पथ" ,
FIRST_NOTNULL ("सत्र आईडी"of ANY(X)) AS "सत्र आईडी"
)
) AS dt;


Output
रेफरल पथ सत्रआईडी

[गूगल डॉट कॉम, याहू डॉट कॉम, याहू डॉट कॉम, , , ] 1

[, ] 8000

[गूगल डॉट कॉम, , , , ] 14

[गो डैडी डॉट कॉम, , ] 250

[, , , , , , , , , , , ] 400

[, , , , ] 500

[, ] 666

[, ] 9001

[, ] 9000

[, ] 10000

nPath Results Example: Hindi Input, ACCUMULATE DISTINCT and CDISTINCT

Input
unicode_path
id price event

2 2.20000000000000E 000 ఈవట1

2 5.13300000000000E 001 ఈవట2

2 9.88123000000000E 002 ఈవట3

2 -1.20000000000000E-001 ఈవట4

2 1.10000000000000E 000 ఈవట5

2 1.12000000000000E 001 ఈవట5

2 1.22000000000000E 001 ఈవట5

2 1.20000000000000E 000 ఈవట1

SQL Call

SELECT * FROM nPath (


ON unicode_path PARTITION BY id
USING


Mode (NONOVERLAPPING)
Pattern ('A*')
Symbols (true AS A)
Result (
ACCUMULATE (DISTINCT event OF A DELIMITER ', ' ) AS acc_result_distinct,
ACCUMULATE (CDISTINCT event OF A DELIMITER ', ' ) AS acc_result_cdistinct
)
) AS dt;

Output
acc_result_distinct acc_result_cdistinct

[ఈవట3, ఈవట5, ఈవట1, ఈవట4, ఈవట2] [ఈవట3, ఈవట5, ఈవట1, ఈవట4, ఈవట2, ఈవట1]

nPath Differences on Aster Database and Analytics Database


• Aster Database nPath: In a symbol, the Boolean expression TRUE, NOT TRUE, or integer can be enclosed in parentheses or quotation marks.
  Analytics Database nPath: In a symbol, the Boolean expression TRUE, NOT TRUE, or integer cannot be enclosed in parentheses or quotation marks.

• Aster Database nPath: Aggregate functions compare strings using the Unicode value of each character (lexicographic order), ignoring CHARACTER SET.
  Analytics Database nPath: Aggregate functions compare strings using sort order, based on CHARACTER SET, CASESPECIFIC, and COLLATION.

AVG
Database Syntax Element Data Type Return Data Type

Aster SMALLINT, INTEGER, or BIGINT NUMERIC

DOUBLE PRECISION, NUMERIC, or INTERVAL Same as syntax element data type

Teradata NUMERIC DOUBLE PRECISION

INTERVAL or DATE without TIME or TIMESTAMP Same as syntax element data type

COUNT

Database   Syntax Element Data Type   Return Data Type

Aster      Any                        BIGINT

Teradata   Any                        TD mode: INTEGER
                                      ANSI mode: Depends on MaxDecimal value in DBSControl—see following table.

MaxDecimal Value Result Data Type Result Data Type Format

0 or 15 NUMERIC(15,0) -(15)9

18 NUMERIC(18,0) -(18)9

38 NUMERIC(38,0) -(38)9

MAX and MIN


Database Syntax Element Data Type Return Data Type

Aster Any numeric, string, or DateTime type Same as syntax element data type

Teradata Any numeric, character, DateTime, or Interval data type, or BYTE If not UDT: Same as syntax element data type. UDT: Data type to which UDT is implicitly cast.

SUM
Database Syntax Element Data Type Return Data Type

Aster SMALLINT or INTEGER BIGINT

BIGINT NUMERIC

DOUBLE PRECISION DOUBLE PRECISION

NUMERIC or INTERVAL Same as syntax element data type

Teradata NUMERIC, INTERVAL, or DATE without TIME or TIMESTAMP Same as syntax element data type, except for NUMERIC(n,m), which returns NUMERIC(p,m), where p depends on MaxDecimal value in DBSControl—see following table.

CHARACTER or VARCHAR DOUBLE PRECISION

MaxDecimal Value n p

0 or 15 n ≤ 15 15

15 < n ≤ 18 18

n > 18 38

18 n ≤ 18 18


n > 18 38

38 Any value 38

nPath Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

Symbols and Symbol Predicates That Examples Use

Symbol Symbol Predicate

A pageid IN (10, 25)

B category = 10 OR (category = 20 AND pageid <> 33)

C category IN (SELECT pageid FROM clicks1 GROUP BY userid HAVING COUNT(*) > 10)

D referrer LIKE '%Amazon%'

X TRUE
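
For reference, a Symbols syntax element that defines the simple predicates from this table looks like the following sketch (symbol C, whose predicate uses a subquery, would follow the same predicate AS symbol form):

Symbols (
  pageid IN (10, 25) AS A,
  category = 10 OR (category = 20 AND pageid <> 33) AS B,
  referrer LIKE '%Amazon%' AS D,
  TRUE AS X
)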

nPath ClickStream Data Examples

Input
This statement creates the input table of clickstream data that the examples use:

CREATE MULTISET TABLE clicks1 (


userid INTEGER,
sessionid INTEGER,
pageid INTEGER,
category INTEGER,
ts TIMESTAMP FORMAT 'YYYY-MM-DDbHH:MI:SS',
referrer VARCHAR (256),
val FLOAT
) PRIMARY INDEX ( userid );

This statement gets the pageid for each row and the pageid for the next row in sequence:


SELECT dt.sessionid, dt.pageid, dt.next_pageid FROM nPath (


ON clicks1 PARTITION BY sessionid ORDER BY ts
USING
Mode (OVERLAPPING)
Pattern ('A.B')
Symbols (TRUE AS A, TRUE AS B)
Result (
FIRST(sessionid OF A) AS sessionid,
FIRST (pageid OF A) AS pageid,
FIRST (pageid OF B) AS next_pageid
)
) AS dt;

Example: Counting Preceding Rows in a Sequence


For each row, this invocation counts the number of preceding rows in a given sequence (including the
current row). The ORDER BY clause specifies DESC because the pattern must be matched over the rows
preceding the start row, while the semantics dictate that the pattern be matched over the rows following
the start row.

SELECT dt.sessionid, dt.pageid, dt.countrank FROM nPath (


ON clicks1 PARTITION BY sessionid ORDER BY ts DESC
USING
Mode (OVERLAPPING)
Pattern ('A*')
Symbols (TRUE AS A)
Result (
FIRST (sessionid OF A) AS sessionid,
FIRST (pageid OF A) AS pageid,
COUNT (* OF A) AS countrank
)
) AS dt;

Example: Complex Path Query


This query finds the user click-paths that start at pageid 50 and proceed either to pageid 80 or to pages in
category 9 or category 10, finds the pageid of the last page in the path, counts the visits to page 80, and
returns the maximum count for each last page, by which it sorts the output. The query ignores paths of
fewer than five pages and pages for which category is less than zero.

SELECT dt.last_pageid, MAX(dt.count_page80) FROM nPath (


ON (SELECT * FROM clicks1 WHERE category >= 0)
PARTITION BY sessionid ORDER BY ts
USING
Pattern ('A.(B|C)*')


Mode (OVERLAPPING)
Symbols (
pageid = 50 AS A,
pageid = 80 AS B,
pageid <> 80 AND category IN (9,10) AS C
)
Result (
LAST(pageid OF ANY (A,B,C)) AS last_pageid,
COUNT (* OF B) AS count_page80,
COUNT (* OF ANY (A,B,C)) AS count_any
)
) AS dt WHERE dt.count_any >= 5
GROUP BY dt.last_pageid
ORDER BY MAX(dt.count_page80);

nPath Range-Matching Examples

Whenever a user visits the home page and then visits checkout pages and buys increasingly expensive
products, the nPath query returns the first purchase and the most expensive purchase.

nPath Example Input Table: aggregate_clicks


userid sessionid productname pagetype clicktime referrer productprice

1039 1 sneakers home 2009-07-29 20:17:59 Company1 100

1039 2 books home 2009-04-21 13:17:59 Company4 300

1039 3 television home 2009-05-23 13:17:59 Company2 500

1039 4 envelopes home 2009-07-16 11:17:59 Company3 10

1039 4 envelopes home1 2009-07-16 11:18:16 Company3 10

1039 4 envelopes page1 2009-07-16 11:18:18 Company3 10

1039 5 bookcases home 2009-08-19 22:17:59 Company5 150

1039 5 bookcases home1 2009-08-19 22:18:02 Company5 150

1039 5 bookcases page1 2009-08-19 22:18:05 Company5 150

1039 5 bookcases page2 2009-08-22 04:20:05 Company5 150

1039 5 bookcases checkout 2009-08-24 14:30:05 Company5 150

1039 5 bookcases page2 2009-08-27 23:03:05 Company5 150

1040 1 tables home 2009-07-29 20:17:59 Company5 250

1040 2 Appliances home 2009-04-21 13:17:59 Company6 1500

1040 3 laptops home 2009-05-23 13:17:59 Company7 800

1040 4 chairs home 2009-07-16 11:17:59 Company3 400

1040 4 chairs home1 2009-07-16 11:18:16 Company3 400

1040 4 chairs page1 2009-07-16 11:18:18 Company3 400

1040 5 cellphones home 2009-08-19 22:17:59 Company8 600

1040 5 cellphones home1 2009-08-19 22:18:02 Company8 600

1040 5 cellphones page1 2009-08-19 22:18:05 Company8 600

1040 5 cellphones page2 2009-08-22 04:20:05 Company8 600

1040 5 cellphones checkout 2009-08-24 14:30:05 Company8 600

1040 5 cellphones page2 2009-08-27 23:03:05 Company8 600

... ... ... ... ... ... ...


nPath Range-Matching Example: Accumulate Pages Visited in Each Session

Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.

SQL-MapReduce Call

SELECT * FROM nPath (


ON aggregate_clicks PARTITION BY sessionid ORDER BY clicktime
USING
Mode (NONOVERLAPPING)
Pattern ('A*')
Symbols (TRUE AS A)
Result (
FIRST (sessionid OF A) AS sessionid,
ACCUMULATE (pagetype OF A) AS path
)
) AS dt ORDER BY dt.sessionid;

Output
sessionid path

1 [home, home1, page1, home, home1, page1, home, home, home, home1, page1,
checkout, home, home, home, home, home, home, home, home, home]

2 [home, home, home, home, home, home, home, home, home, home1, page1, checkout,
checkout, home, home]

3 [home, home, home, home, home, home, home, home, home1, page1, home, home1,
page1, home]

4 [home, home, home, home, home, home, home1, home1, home1, page1, page1, page1]

5 [home, home, home, home, home1, home1, home1, page1, page1, page1, page2, page2,
page2, checkout, checkout, checkout, page2, page2, page2]

nPath Range-Matching Example: Find Sessions That Start at Home Page and Visit Page1

Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.


SQL-MapReduce Call

SELECT * FROM nPath (


ON aggregate_clicks PARTITION BY sessionid ORDER BY clicktime
USING
Mode (NONOVERLAPPING)
Pattern ('^H.A*.P1.A*')
Symbols (pagetype='home' AS H, pagetype='page1' AS P1, TRUE AS A)
Result (
FIRST (sessionid OF A) AS sessionid,
ACCUMULATE (pagetype OF ANY(H,P1,A)) AS path
)
) AS dt ORDER BY dt.sessionid;

Output
sessionid path

1 [home, home1, page1, home, home1, page1, home, home, home, home1, page1,
checkout, home, home, home, home, home, home, home, home, home]

2 [home, home, home, home, home, home, home, home, home, home1, page1, checkout,
checkout, home, home]

3 [home, home, home, home, home, home, home, home, home1, page1, home, home1,
page1, home]

4 [home, home, home, home, home, home, home1, home1, home1, page1, page1, page1]

5 [home, home, home, home, home1, home1, home1, page1, page1, page1, page2, page2,
page2, checkout, checkout, checkout, page2, page2, page2]

nPath Range-Matching Example: Find Paths to Checkout Page for Purchases Over $200

Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.

SQL-MapReduce Call

SELECT * FROM nPath (


ON aggregate_clicks PARTITION BY sessionid ORDER BY clicktime
USING
Mode (NONOVERLAPPING)
Pattern ('A*.C+.A*')
Symbols (


productprice > 200 AND


pagetype='checkout' AS C, TRUE AS A
)
Result (
FIRST(sessionid OF A) AS sessionid,
ACCUMULATE (pagetype OF ANY(A,C)) AS path,
AVG (productprice OF ANY(A,C)) AS totalsum
)
) AS dt ORDER BY dt.sessionid;

Output
sessionid path totalsum

1 [home, home1, page1, home, home1, page1, home, home, home, 602.857142857143
home1, page1, checkout, home, home, home, home, home, home,
home, home, home]

5 [home, home, home, home, home1, home1, home1, page1, page1, 363.157894736842
page1, page2, page2, page2, checkout, checkout, checkout,
page2, page2, page2]

nPath Range-Matching Example: Use OVERLAPPING Mode

Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.

SQL-MapReduce Call

SELECT * FROM nPath (


ON aggregate_clicks PARTITION BY sessionid ORDER BY clicktime
USING
Mode (OVERLAPPING)
Pattern ('A.A')
Symbols (TRUE AS A)
Result (
FIRST (sessionid OF A) AS sessionid,
ACCUMULATE (pagetype OF A) AS path
)
) AS dt ORDER BY dt.sessionid;


nPath Range-Matching Example Output


sessionid path

1 [home, home]

1 [home, home]

1 [home, home]

1 [home, home]

1 [home, home]

1 [home, home]

1 [home, home]

1 [home, home]

1 [checkout, home]

1 [page1, checkout]

1 [home1, page1]

1 [home, home1]

1 [home, home]

1 [home, home]

1 [page1, home]

1 [home1, page1]

1 [home, home1]

1 [page1, home]

1 [home1, page1]

1 [home, home1]

2 [home, home]

2 [checkout, home]

2 [checkout, checkout]

... ...


nPath Range-Matching Example: Find First Product with Multiple Referrers in Any Session

Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.

SQL-MapReduce Call

SELECT * FROM nPath (


ON aggregate_clicks PARTITION BY sessionid ORDER BY clicktime
USING
Mode (NONOVERLAPPING)
Pattern ('REFERRER{2,}')
Symbols (referrer IS NOT NULL AS REFERRER)
Result (
FIRST (sessionid OF REFERRER) AS sessionid,
FIRST (productname OF REFERRER) AS product
)
) AS dt ORDER BY dt.sessionid;

Output
sessionid product

1 envelopes

2 tables

3 bookcases

4 tables

5 Appliances

nPath Range-Matching Example: Find Data for Sessions That Checked Out 3-6 Products

Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.

SQL-MapReduce Call

SELECT * FROM nPath (


ON aggregate_clicks PARTITION BY sessionid ORDER BY clicktime


USING
Mode (NONOVERLAPPING)
Pattern ('H+.D*.C{3,6}.D')
Symbols (
pagetype = 'home' AS H,
pagetype='checkout' AS C,
pagetype<>'home' AND pagetype<>'checkout' AS D
)
Result (
FIRST (sessionid OF C) AS sessionid,
max_choose (productprice, productname OF C) AS
most_expensive_product,
MAX (productprice OF C) AS max_price,
min_choose (productprice, productname of C) AS
least_expensive_product,
MIN (productprice OF C) AS min_price)
) AS dt ORDER BY dt.sessionid;

Output
sessionid most_expensive_product max_price least_expensive_product min_price

5 cellphones 600 bookcases 150

nPath Range-Matching Example: Find Data for Sessions That Checked Out at Least 3 Products

Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.
Modify the previous query call in nPath Range-Matching Example: Find Data for Sessions That Checked
Out 3-6 Products to find sessions where the user checked out at least three products by changing the
Pattern syntax element to:

Pattern ('H+.D*.C{3,}.D')

SQL-MapReduce Call

SELECT * FROM nPath (


ON aggregate_clicks PARTITION BY sessionid ORDER BY clicktime
USING
Mode (NONOVERLAPPING)
Pattern ('H+.D*.C{3,}.D')
Symbols (


pagetype = 'home' AS H,
pagetype='checkout' AS C,
pagetype<>'home' AND pagetype<>'checkout' AS D
)
Result (
FIRST(sessionid OF C) AS sessionid,
max_choose(productprice, productname OF C) AS
most_expensive_product,
MAX (productprice OF C) AS max_price,
min_choose (productprice, productname OF C) AS
least_expensive_product,
MIN (productprice OF C) AS min_price
)
) AS dt ORDER BY dt.sessionid;

Output
sessionid most_expensive_product max_price least_expensive_product min_price

5 cellphones 600 bookcases 150

nPath Range-Matching Example: Multiple Partitioned Input Tables and Dimension Input Table

An e-commerce store wants to count the advertising impressions that lead to a user clicking an online
advertisement. The example counts the online advertisements that the user viewed and the television
advertisements that the user might have viewed.

Input
impressions
userid ts imp

1 2012-01-01 ad1

1 2012-01-02 ad1

1 2012-01-03 ad1

1 2012-01-04 ad1

1 2012-01-05 ad1

1 2012-01-06 ad1

1 2012-01-07 ad1

2 2012-01-08 ad2


2 2012-01-09 ad2

2 2012-01-10 ad2

2 2012-01-11 ad2

... ... ...

clicks2
userid ts click

1 2012-01-01 ad1

2 2012-01-08 ad2

3 2012-01-16 ad3

4 2012-01-23 ad4

5 2012-02-01 ad5

6 2012-02-08 ad6

7 2012-02-14 ad7

8 2012-02-24 ad8

9 2012-03-02 ad9

10 2012-03-10 ad10

11 2012-03-18 ad11

12 2012-03-25 ad12

13 2012-03-30 ad13

14 2012-04-02 ad14

15 2012-04-06 ad15

tv_spots
ts tv_imp

2012-01-01 ad2

2012-01-02 ad2

2012-01-03 ad3

2012-01-04 ad4

2012-01-05 ad5


2012-01-06 ad6

2012-01-07 ad7

2012-01-08 ad8

2012-01-09 ad9

2012-01-10 ad10

2012-01-11 ad11

2012-01-12 ad12

2012-01-13 ad13

2012-01-14 ad14

2012-01-15 ad15

SQL-MapReduce Call
The tables impressions and clicks2 have a userid column, but the table tv_spots is only a record of
television advertisements shown, which any user might have seen. Therefore, tv_spots must be a
dimension table.

SELECT * FROM nPath (


ON impressions PARTITION BY userid ORDER BY ts
ON clicks2 PARTITION BY userid ORDER BY ts
ON tv_spots DIMENSION ORDER BY ts
USING
Mode (NONOVERLAPPING)
Symbols (TRUE AS imp, TRUE AS click, TRUE AS tv_imp)
Pattern ('(imp|tv_imp)*.click')
Result (
COUNT(* of imp) AS imp_cnt,
COUNT (* of tv_imp) AS tv_imp_cnt
)
) AS dt ORDER BY dt.imp_cnt;

Output
dt.imp_cnt tv_imp_cnt

18 0

19 0


19 0

20 0

21 0

22 0

22 0

22 0

22 0

22 0

23 0

23 0

23 0

24 0

25 0

11: Hypothesis Testing Functions

Hypothesis testing functions find the relative likelihood of hypotheses. You can accept the most likely
hypotheses and reject the least likely.

Hypothesis Test Components


All hypothesis tests have the following components:

Component Description

Null hypothesis (H0) The null hypothesis is known as a hypothesis of no difference.


Example: Experimental drug is no better than placebo.
The null hypothesis is accepted or rejected based on a statistical test of
the hypothesis.

Alternate Hypothesis accepted if null hypothesis is rejected.


hypothesis (H1) Example: Experimental drug is more effective than placebo.

Alpha (α) The Null Hypothesis is rejected if the P-value is smaller than the specified Alpha
(Also called value (where Alpha is the probability of rejecting the null hypothesis when it is true).
significance level or Most common α values are 0.01, 0.05, and 0.10, corresponding to 99%, 95%, and
Type I error.) 90% confidence, respectively.
Results are "statistically significant at α."

Test statistic Value to which data set is reduced, used in hypothesis test. Its sampling
distribution under null hypothesis must be calculable (exactly or approximately),
making p_values calculable.

Degrees of freedom Number of independent pieces of information needed to estimate a population


parameter (for example, μ or σ2) for sample of specified size.

Critical value Quantile of distribution of test statistic under null hypothesis. Used to determine
rejection region.

p_value Probability of test results at least as extreme as test statistic results observed
under assumption that null hypothesis is true.
The smaller the p_value, the stronger the evidence against the null hypothesis.

Hypothesis Acceptance or rejection of null hypothesis.


test conclusion

Hypothesis Test Types


A hypothesis test is either:
• One-tailed or two-tailed


A one-tailed test can be either lower-tailed or upper-tailed.


• One-sample or two-sample
• Paired or unpaired

Hypothesis Test Term Definitions


Term Description

One-sample test Uses one test sample.

One-tailed test Rejection region is the lower tail or the upper tail of the sampling distribution under the
null hypothesis H0.

Lower-tailed test Alternate hypothesis (H1): μ < μ0

Upper-tailed test Alternate hypothesis (H1): μ > μ0

Two-tailed test The null hypothesis assumes that μ = μ0 where μ0 is a specified value.
Two-tailed test considers both lower and upper tails of distribution of test statistic.
Alternate hypothesis (H1): μ ≠ μ0

Two-sample test Uses two test samples.

Paired test Compares study subjects at two different times.


The null and alternative hypotheses are the same as one sample test.
The paired test becomes a one-sample test because the test considers the differences
between sample values before and after the subjects are exposed to treatment.

Unpaired test Compares different subjects drawn from two independent populations.
Null hypothesis (H0): μ1 = μ2
The alternate hypotheses are as follows:
• Alternate hypothesis for upper-tailed test (H1): μ1 > μ2
• Alternate hypothesis for lower-tailed test (H1): μ1 < μ2
• Alternate hypothesis for two-tailed test (H1): μ1 ≠ μ2

TD_ANOVA
Analysis of variance (ANOVA) is a statistical test that analyzes the difference between the means of more
than two groups.
The null hypothesis (H0) of ANOVA is that there is no difference among group means. However, if any one
of the group means is significantly different from the overall mean, then the null hypothesis is rejected.
You can use one-way ANOVA when you have data on an independent variable with at least three levels and
a dependent variable.
For example, assume that your independent variable is insect spray type, and you have data on spray types
A, B, C, D, E, and F. You can use one-way ANOVA to determine whether there is any difference in the
dependent variable, insect count, based on the spray type used.


TD_ANOVA Syntax
SELECT * FROM TD_ANOVA (
ON { table | view | (query) } as InputTable
USING
[GroupColumns ('group_column1' |'group_column2'[,...]| group_column_range[,...])]
[Alpha (alpha)]
) AS dt;

TD_ANOVA Syntax Elements


GroupColumns
[optional]: Specify the input table column names or a column range.

Alpha
[optional]: Specify the probability of rejecting the null hypothesis when the null hypothesis
is true.
Default value: 0.05
Valid range: [0,1]

TD_ANOVA Input
Input Table Schema
Column                    Data Type                          Description

groupA, groupB, groupC,   INTEGER, BYTEINT, SMALLINT,        The columns that contain the insect count
groupD, groupE, groupF    BIGINT, DECIMAL, FLOAT, NUMBER     for each insect spray type.

TD_ANOVA Output
Output Table Schema
Column Data Type Description

sum_of_squares(between groups) and sum_of_squares(within groups)   DOUBLE   The between-groups and
within-groups sums of squares (that is, the variation between groups and within groups).

df(between groups) and df(within groups)   INTEGER   The degrees of freedom corresponding to the
between-groups sum of squares and the within-groups sum of squares.


mean_square(between groups) and mean_square(within groups)   DOUBLE   The mean of the sum of squares,
which is calculated by dividing the sum of squares by the corresponding degrees of freedom.

F_statistic   DOUBLE   The computed F-statistic, calculated as the ratio of the between-groups mean
square to the within-groups mean square, that is, [sum_of_squares(between groups)/(k-1)] /
[sum_of_squares(within groups)/(N-k)].
A high F-statistic value leads to rejection of the null hypothesis.

alpha DOUBLE The level of significance of the test.

critical_f   DOUBLE   The critical value of F, denoted by F(k-1, N-k), where:

• k-1 is the degrees of freedom between groups
• N-k is the degrees of freedom within groups
• k is the number of groups
• N is the total number of observations

p_value DOUBLE The probability value associated with the F-statistic value.
The low p-value indicates that the insect spray type has a
significant impact on the insect count.

conclusion VARCHAR The result of the test.

TD_ANOVA Example
Input: Insect_sprays

groupA groupB groupC groupD groupE groupF


------ ------ ------ ------ ------ ------
7 17 1 5 6 9
10 17 2 5 1 13
10 11 0 3 3 11
12 14 1 3 6 16
13 13 4 4 4 13
14 11 2 6 5 22
14 16 3 4 3 15
14 7 1 2 6 24
17 19 3 5 3 26
20 21 0 5 2 26
20 21 7 12 3 15
23 17 1 5 1 10


SQL Call

SELECT cast("sum_of_squares(between groups)"


as decimal(20,6)),cast("sum_of_squares(within groups)"
as decimal(20,6)),"df(between groups)","df(within
groups)",cast("mean_square(between groups)" as
decimal(20,6)),cast("mean_square(within groups)" as
decimal(20,6)),cast(f_statistic as decimal(20,6)),cast(alpha as
decimal(20,6)),cast(critical_f as decimal(20,6)),cast(p_value as
decimal(20,6)),conclusion from
TD_ANOVA (
ON insect_sprays AS InputTable
USING
ALPHA (0.05)
) AS dt;

Output Table

sum_of_squares(between groups)   2656.902778000
sum_of_squares(within groups)    1019.083333000
df(between groups)               5
df(within groups)                66
mean_square(between groups)      531.380556000
mean_square(within groups)       15.440657000
f_statistic                      34.414376000
alpha                            0.050000000
critical_f                       2.353809000
p_value                          0.000000000
conclusion                       Reject Null hypothesis
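
You can verify the relationships among these values directly: mean_square(between groups) = 2656.902778/5 = 531.380556, mean_square(within groups) = 1019.083333/66 = 15.440657, and f_statistic = 531.380556/15.440657 ≈ 34.414376. Because the F-statistic exceeds critical_f (2.353809), the function rejects the null hypothesis.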

TD_ChiSq
TD_ChiSq performs Pearson's chi-squared (χ2) test for independence, which determines if there is a
statistically significant difference between the expected and observed frequencies in one or more categories
of a contingency table (also called a cross tabulation).

Test Type
• One-tailed, upper-tailed
• One-sample
• Unpaired


Computational Method
The Chi-Square test finds statistically significant associations between categorical variables. The test
determines if the categorical variables are statistically independent or not.
The data for analysis is organized in a table known as contingency tables. A two-way contingency table
consists of r rows and c columns wherein:
• The rows correspond to variable 1 that consists of r categories
• The columns correspond to variable 2 that consists of c categories
Each cell of the contingency table is the count of the joint occurrence of particular levels of variable 1 and
variable 2.
For example, the following two-way contingency table shows the categorical variable Gender with two levels
(Male, Female) and the categorical variable Affiliation with two levels (Smokers, Non-smokers).

Gender Affiliation table


Affiliation
Gender
Smokers Non-Smokers

Male n11 n12

Female n21 n22

The cell counts nij, i = 1, 2; j = 1, 2 are the numbers of joint occurrences of Gender and Affiliation at their
ith and jth levels respectively. The Null and alternative hypotheses H0 and H1 corresponding to a χ2 test of
independence are as follows:
H0: The two categorical variables are independent

vs
H1: The two categorical variables are not independent

Using the previous table, the expected cell counts are calculated as the product of the corresponding row
and column totals divided by the grand total n = n11 + n12 + n21 + n22:

e11 = (n11 + n12)(n11 + n21) / n

e12 = (n11 + n12)(n12 + n22) / n

e21 = (n21 + n22)(n11 + n21) / n

e22 = (n21 + n22)(n12 + n22) / n

The χ2 test statistic is calculated as:

χ2stat = Σi Σj (nij - eij)2 / eij

The χ2 statistic follows a Chi-Square distribution with (r-1)(c-1) degrees of freedom. In the Gender
Affiliation table, r=2 and c=2. The Null hypothesis H0 is rejected if χ2stat > χ2(r-1)(c-1),α where α ϵ {0.10,
0.05, 0.01}.
The Cramer's V statistic is calculated using the following formula:

V = √( (χ2 / n) / min(c-1, r-1) ) = √( φ2 / min(c-1, r-1) )

where:
• φ is the phi coefficient
• χ2 is derived from the Pearson's chi-squared test
• n is the grand total of observations
• c is the number of columns
• r is the number of rows
The following rules are used to compute the hypothesis conclusion:
• If the chi-square statistic is greater than the critical value, then the function rejects the Null hypothesis.
• If the chi-square statistic is lesser than or equal to the critical value, then the function fails to reject the
Null hypothesis.
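
For example, applying these formulas to the contingency table used in TD_ChiSq Example later in this section (female: 6 smokers, 9 non-smokers; male: 8 smokers, 5 non-smokers; n = 28):

e11 = (6+9)(6+8)/28 = 7.5
e12 = (6+9)(9+5)/28 = 7.5
e21 = (8+5)(6+8)/28 = 6.5
e22 = (8+5)(9+5)/28 = 6.5

χ2stat = (6-7.5)2/7.5 + (9-7.5)2/7.5 + (8-6.5)2/6.5 + (5-6.5)2/6.5 ≈ 1.2923

These values match the chi_square value and the expected-values table (exptable1) in that example.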

TD_ChiSq Syntax
SELECT * from TD_CHISQ (
ON { table | view | (query) } AS CONTINGENCY
[ OUT [ PERMANENT | VOLATILE ] TABLE EXPCOUNTS (expected_values_table) ]
USING
[ Alpha (alpha) ]
) AS alias;

TD_ChiSq Syntax Elements


expected_values_table
[Optional] Specify a name for the table of expected values. expected_values_table cannot be
the name of an existing table.


Default behavior: Function does not output this table.

alpha
[Optional] The Null Hypothesis is rejected if the P-value is smaller than the specified Alpha
value (where Alpha is the probability of rejecting the null hypothesis when it is true). alpha
must be a numeric value in the range [0, 1].
Default: 0.05

TD_ChiSq Input
A contingency table also known as a two-way frequency table is a tabular mechanism with at least two rows
and two columns used in statistics to present categorical data in terms of frequency counts.
A contingency table shows the observed frequency of two variables arranged into rows and columns. The
intersection of a row and a column of a contingency table is called a cell.
For example, a cell count nij represents a joint occurrence of row i and column j, where i is a value between
1 and r (the total number of rows) and j is a value between 2 and c (the total number of columns).
You can interpret the contingency table in the example as follows:
• The first column represents the first category, gender, and has two labels, female and male, which are
represented by two rows.
• The second category, habits, has two labels, smokers and non-smokers, which are represented by the
second and third columns.
The second category can have at most 2046 unique labels. The function ignores NULL values in the table.
The maximum label length is 64000 characters for category_1 and 128 for all other columns.
For a valid test output, the value of each observed frequency in the CONTINGENCY table must be at
least 5.

CONTINGENCY Table Schema


Column Data Type Description

Name of categorical Any Columns can have one or multiple labels. Can either
column 1 be an integer, LATIN, or UTF8 code.

category_2_label_1 INTEGER, Joint frequency of category 1 label i and category 2


SMALLINT, label 1, where i has a value between 1 to r.
BYTEINT,
or BIGINT

category_2_label_2 INTEGER, Joint frequency of category 1 label i and category 2


SMALLINT, label 2, where i has a value between 1 to r.
BYTEINT,
or BIGINT


...

category_2_label_c INTEGER, [Column appears zero or more times.]


SMALLINT, Joint frequency of category 1 label i and category 2
BYTEINT, label c, where i has a value between 1 to r.
or BIGINT

TD_ChiSq Output
Output Table Schema
Column Data Type Description

chi_square DOUBLE PRECISION Chi-squared statistic.

cramers_v DOUBLE PRECISION Cramer's V statistic.

df INTEGER Degrees of freedom.

alpha DOUBLE PRECISION alpha (see TD_ChiSq Syntax Elements).

p_value DOUBLE PRECISION Probability associated with chi-squared statistic.

criticalvalue DOUBLE PRECISION Critical value calculated using Alpha for test.

conclusion VARCHAR Chi-squared test result, either 'reject null hypothesis' or 'fail to
reject null hypothesis'.

Table of Expected Values


The function outputs this table only if you include the OUT clause in the function call. The OUT clause
specifies its name.
This table contains the expected frequencies calculated under the assumption that the null hypothesis
is true.
This table has the same schema as the CONTINGENCY table that contains the observed frequencies (see
TD_ChiSq Input), except that all columns but the first have the data type DOUBLE PRECISION.

TD_ChiSq Example
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.


This example tests whether gender influences smoking habits affiliation. The null hypothesis is that
gender and smoking habits affiliation are independent. TD_ChiSq compares the null hypothesis (expected
frequencies) to the contingency table (observed frequencies).

Input: contingency1
The contingency table contains the frequencies of men and women affiliated with each smoking habit.
category_1, gender, has labels "female" and "male". category_2, habits, has labels "smokers" and "non-
smokers".
This example illustrates a two-way contingency table with two categories, category_1 and
category_2 respectively.
Each row has a label i, which has a value between 1 and r, and each column has a label j, which has a value
between 2 and c. The values of c and r are 3 and 2 respectively.
Here, category_1_label1 corresponds to females and category_1_label2 corresponds to males. Similarly,
category_2_label1 corresponds to smokers and category_2_label2 corresponds to non-smokers.
Query to create the contingency table is as follows:

--Query to construct table contingency1


CREATE MULTISET TABLE contingency1 as
(
select gender as gender
, sum((case when smokinghabit = 'smoker' then 1 else 0 end)) as smoker_cnt
, sum((case when smokinghabit= 'nonsmoker' then 1 else 0 end)) as nonsmoker_cnt
from mytesttable
group by gender
) with data;

--Alternate Query to construct table contingency1 using PIVOT


CREATE MULTISET TABLE contingency1 as
(
SELECT *
FROM (select gender as gender, smokinghabit, count(smokinghabit)
as smokinghabit_count
from mytesttable group by gender, smokinghabit) as mytesttable
PIVOT ( SUM(smokinghabit_count) as smokinghabit_cnt
FOR smokinghabit
IN ('smoker' AS smoker, 'nonsmoker' AS nonsmoker))Tmp
) with data;

Habits
gender smokers non-smokers
---------- --------- -------------


female 6 9
male 8 5

SQL Call

SELECT * from TD_CHISQ (


ON contingency1 AS CONTINGENCY
OUT TABLE EXPCOUNTS (exptable1)
USING
Alpha (0.05)
) AS dt;

Output Table

chi_square     1.29230769230769E 000
cramers_v      2.14834462211830E-001
df             1
alpha          5.00000000000000E-002
p_value        2.55623107546413E-001
criticalvalue  3.84145882069412E 000
conclusion     Fail to reject Null hypothesis

exptable1

gender smokers non-smokers


---------- ---------------------- ----------------------
female 7.50000000000000E 000 7.50000000000000E 000
male 6.50000000000000E 000 6.50000000000000E 000

TD_FTest
TD_FTest performs an F-test, for which the test statistic follows an F-distribution under the Null hypothesis.
TD_FTest compares the variances of two independent populations. If the variances are significantly
different, TD_FTest rejects the Null hypothesis, indicating that the variances may not come from the same
underlying population.
Use TD_FTest to compare statistical models that have been fitted to a dataset, to identify the model that best
fits the population from which the data were sampled.

Assumptions
• Populations from which samples are drawn are normally distributed.
• Populations are independent of each other.
• Data is numeric.


Test Type
• One-tailed (lower and upper-tailed) or two-tailed (your choice)
• Two-sample
• Unpaired

Computational Method

The F-test is used to test the Null hypothesis σ2 = in various applications. For example, you might
need to test the variability in the measurement of the thickness of a manufactured part in a factory. If the

thickness is not equal to a certain thickness ( ) then you can conclude that the manufacturing process
is uncontrolled. The types of hypothesis are as follows:

H0: σ2 =

versus

H1: σ2 > (upper-tailed)

or

H1: σ2 < (lower-tailed)

or

H1: σ2 ≠ (two-tailed)

Let x1, x2,....xn be a random sample. To test the hypotheses, the test statistic is calculated as:

χ² = (n − 1)s² / σ₀²

where

s² = (1 / (n − 1)) Σ (xᵢ − x̄)² and x̄ is the sample mean.

The statistic χ² follows a chi-squared distribution with n−1 degrees of freedom. (The F-distribution arises
in the two-sample comparison that follows.)

For the one-sided upper-tailed test σ² > σ₀², the Null hypothesis H0 is rejected if χ² > χ²(α; n−1), the
upper-α critical value.

For the one-sided lower-tailed test σ² < σ₀², the Null hypothesis H0 is rejected if χ² < χ²(1−α; n−1).

For the two-sided alternative σ² ≠ σ₀², the Null hypothesis H0 is rejected if χ² > χ²(α/2; n−1) or
χ² < χ²(1−α/2; n−1).
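
TD_FTest does not expose this single-sample form directly, but the statistic is straightforward to compute with standard aggregates. The following sketch assumes a numeric column col1 in a table named example_table and a hypothesized variance of 100; all three values are illustrative:

-- Single-sample variance statistic (n-1)*s^2 / sigma0^2,
-- with an assumed hypothesized variance sigma0^2 = 100:
SELECT (COUNT(col1) - 1) * VAR_SAMP(col1) / 100.0 AS chi_sq_stat
FROM example_table;

Compare the result to the chi-squared critical value for n−1 degrees of freedom at your chosen α.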

Also, the F-test is used to test if the variances of two populations are equal. The F-test can have the
following tests:
• One-tailed test: The test is used to determine if the variance of one population is either greater than
(upper-tailed) or less than (lower-tailed) the variance of another population.
• Two-tailed test: The test is used to determine significant differences in variances of the two populations
and tests the Null hypothesis (H0) against the alternative hypothesis (H1) to find out if the variances are
not equal.

Let x1, x2,....xn1 ~ Ɲ(µ1, σ1²) and y1, y2,....yn2 ~ Ɲ(µ2, σ2²) be random samples from two independent
populations. The corresponding sample means and variances are as follows:

• Sample means: x̄ = (1/n1) Σ xᵢ and ȳ = (1/n2) Σ yᵢ
• Sample variances: s1² = (1/(n1 − 1)) Σ (xᵢ − x̄)² and s2² = (1/(n2 − 1)) Σ (yᵢ − ȳ)²

In the following calculation, assume that sample 1 has a larger variance than sample 2. If sample 2 has a
larger variance than sample 1, switch the samples and apply the same formula.

H0: σ1² = σ2²

versus

H1: σ1² > σ2²

or

H1: σ1² < σ2²

The test statistic for the one-sided upper-tailed test (σ1² > σ2²) is calculated as:

F = s1² / s2²

where n1−1 and n2−1 are the degrees of freedom corresponding to sample 1 and sample 2.

The Null hypothesis H0 is rejected if F > F(α; n1−1, n2−1), where F(α; n1−1, n2−1) is the upper-α
critical value of the F-distribution with those degrees of freedom.

The test statistic for the one-sided lower-tailed test (σ1² < σ2²) is the same ratio F = s1² / s2².

The Null hypothesis H0 is rejected if F < F(1−α; n1−1, n2−1).

For the two-sided hypothesis test:

H0: σ1² = σ2²

versus

H1: σ1² ≠ σ2²

The Null hypothesis H0 is rejected if:

F > F(α/2; n1−1, n2−1)

Because the larger variance is placed in the numerator, the two-tailed test is based on the upper tail of
the F-distribution.

TD_FTest Syntax
SELECT * from TD_FTEST (
[ ON { table | view | (query) } AS InputTable ]
USING
first_sample_specifier
second_sample_specifier
[ AlternativeHypothesis ({ 'lower-tailed' | 'upper-tailed' | 'two-tailed' }) ]
[ Alpha (alpha) ]
) AS alias;

first_sample_specifier

{ FirstSampleColumn ('sample_column_1') |
FirstSampleVariance (variance_1)
DF1 (degrees_of_freedom_first_sample)
}

second_sample_specifier

{ SecondSampleColumn ('sample_column_2') |
SecondSampleVariance (variance_2)
DF2 (degrees_of_freedom_second_sample)
}

TD_FTest Syntax Elements


FirstSampleColumn
[Required if you omit FirstSampleVariance, disallowed otherwise.] Specify the name of the
input column that contains the data for the first sample population.

FirstSampleVariance
[Required if you omit FirstSampleColumn, disallowed otherwise.] Specify the variance of the
first sample population.

DF1
[Required if you omit FirstSampleColumn, disallowed otherwise.] Specify the degrees of
freedom of the first sample.

SecondSampleColumn
[Required if you omit SecondSampleVariance, disallowed otherwise.] Specify the name of
the input column that contains the data for the second sample population.


SecondSampleVariance
[Required if you omit SecondSampleColumn, disallowed otherwise.] Specify the variance of
the second sample population.

DF2
[Required if you omit SecondSampleColumn, disallowed otherwise.] Specify the degrees of
freedom of the second sample.

AlternativeHypothesis
[Optional] Specify the alternative hypothesis:
Option          Description

'lower-tailed'  Alternate hypothesis (H1): σ1² < σ2²

'upper-tailed'  Alternate hypothesis (H1): σ1² > σ2²

'two-tailed'    (Default) Rejection region is on both sides of the sampling
                distribution of the test statistic. A two-tailed test considers both
                the lower and upper tails of the distribution of the test statistic.
                Alternate hypothesis (H1): σ1² ≠ σ2²

Alpha
[Optional] The Null Hypothesis is rejected if the P-value is smaller than the specified Alpha
value (where Alpha is the probability of rejecting the null hypothesis when it is true). alpha
must be a numeric value in the range [0, 1].
Default: 0.05

TD_FTest Input
InputTable is required only if you specify either FirstSampleColumn or SecondSampleColumn. If you
specify FirstSampleVariance, SecondSampleVariance, DF1, and DF2, the function ignores InputTable.

InputTable Schema
Column Data Type Description

sample_column_1 Numeric Data for first sample population.

sample_column_2 Numeric Data for second sample population.


TD_FTest Output
Output Table Schema
Column                Data Type         Description

FirstSampleVariance   DOUBLE PRECISION  Variance of first sample population.

SecondSampleVariance  DOUBLE PRECISION  Variance of second sample population.

VarianceRatio         DOUBLE PRECISION  FirstSampleVariance/SecondSampleVariance.

DF1                   INTEGER           Degrees of freedom of first sample.

DF2                   INTEGER           Degrees of freedom of second sample.

CriticalValue         DOUBLE PRECISION  Critical value calculated using Alpha for test.

Alpha                 DOUBLE PRECISION  alpha (see TD_FTest Syntax Elements).

p_value               DOUBLE PRECISION  Probability associated with F-test statistic.

Conclusion            VARCHAR           F-test result, either 'reject null hypothesis' or
                                        'fail to reject null hypothesis'.

TD_FTest Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

TD_FTest Example: Specify Two Sample Columns

Input
SQL Call

SELECT * FROM td_ftest (
ON example_table AS InputTable
USING
FirstSampleColumn ('A')
SecondSampleColumn ('B')
) AS dt;


Output

firstsamplevariance    secondsamplevariance   varianceratio          df1  df2
---------------------- ---------------------- ---------------------- ---- ----
1.38561111111111E+003  5.21221052631579E+002  2.65839436859915E+000  9    19

CriticalValue          Alpha                  p_value                Conclusion
---------------------- ---------------------- ---------------------- ------------------------------
2.88005204672380E+000  5.00000000000000E-002  6.96767848508086E-002  Fail to reject Null hypothesis
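
As a sanity check, you can reproduce the reported variances, variance ratio, and degrees of freedom with standard aggregates. This sketch assumes the same example_table and columns A and B used in the call above, and that TD_FTest uses the sample (n−1) variance, which the output values suggest:

-- Cross-check of the F-test inputs using standard aggregates:
SELECT VAR_SAMP(A) AS s1_sq,                 -- compare to firstsamplevariance
       VAR_SAMP(B) AS s2_sq,                 -- compare to secondsamplevariance
       VAR_SAMP(A) / VAR_SAMP(B) AS f_stat,  -- compare to varianceratio
       COUNT(A) - 1 AS df1,                  -- COUNT ignores NULLs
       COUNT(B) - 1 AS df2
FROM example_table;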

TD_FTest Example: Specify Two Sample Variances

Input
With two sample variances instead of two sample columns, you do not need InputTable.

SQL Call

SELECT * FROM td_ftest (
USING
FirstSampleVariance (1385.61)
SecondSampleVariance (521.22)
DF1 (9)
DF2 (19)
) AS dt;

Output

firstsamplevariance    secondsamplevariance   varianceratio          df1  df2
---------------------- ---------------------- ---------------------- ---- ----
1.38561000000000E+003  5.21220000000000E+002  2.65839760561759E+000  9    19

CriticalValue          Alpha                  p_value                Conclusion
---------------------- ---------------------- ---------------------- ------------------------------
2.88005204672380E+000  5.00000000000000E-002  6.96764431913391E-002  Fail to reject Null hypothesis


TD_ZTest
TD_ZTest performs a Z-test, for which the distribution of the test statistic under the Null hypothesis can be
approximated by normal distribution.
TD_ZTest tests the equality of two means under the assumption that the population variances are known
(rarely true). For large samples, sample variances approximate population variances, so TD_ZTest uses
sample variances instead of population variances in the test statistic.

Assumptions
• Sample distribution is normal.
• Data is numeric, not categorical.

Test Type
• One-tailed or two-tailed (your choice)
• One-sample or two-sample (your choice)
Use one-sample to test whether the mean of a population is greater than, less than, or not equal to a
specific value. TD_ZTest finds the answer by comparing the critical values of the normal distribution at
levels of significance (alpha = 0.01, 0.05, 0.10) to the Z-test statistic.
• Unpaired

Computational Method
A test of hypothesis (ToH) involves the following framework:
• A Null hypothesis H0 and an alternative hypothesis H1
• A random sample x1, x2,....xn in the case of a one-sample test
• Two random samples x1, x2,....xn and y1, y2,....yn in the case of a two-sample test
• A test statistic Zstat
• A level of significance α ϵ {0.10, 0.05, 0.01}
• A comparison of the sample-based Zstat with the percentage point of the normal distribution, ᴢ α or ᴢ α/2
• Computation of the p-value
• A conclusion

One-Sample Z-Tests

Let x1, x2,....xn be a random sample drawn from a population with mean µ and variance σ2. Also, assume
that the data follows a normal distribution Ɲ (µ, σ2).
H0: µ ≤ µ0

versus

H1: µ > µ0

or

H0: µ ≥ µ0

versus

H1: µ < µ0

or

H0: µ = µ0

versus

H1: µ ≠ µ0

The test statistic for testing the previous hypotheses is the Z-stat. The validity of the Z-stat is predicated on
the assumption that the population variance σ2 is known.
The assumption of known variance is not practical because if the variance is known, then the mean µ is
known. So, if the mean µ is known, the test is not required.

However, for large sample sizes (which are common in Big Data applications), the sample variance s² is
approximately equal to the unknown variance σ². Therefore, a scenario that involves a large sample size
validates the application of the Z-statistic.
The z-statistic is calculated as:

Zstat = (x̄ − µ0) / (σ / √n)

where the unknown standard deviation σ is replaced by the sample standard deviation

s = √( (1 / (n − 1)) Σ (xᵢ − x̄)² )

as n → ∞ (sample size is very large). Therefore, the z-statistic is rewritten as:

Zstat = (x̄ − µ0) / (s / √n)

where x̄ = (1/n) Σ xᵢ is the sample mean.

In case I of the upper-tailed hypothesis test, the Null hypothesis is rejected if Zstat > ᴢ α where α ϵ {0.10, 0.05,
0.01}. In case II of the lower-tailed hypothesis test, the Null hypothesis is rejected if Zstat < −ᴢ α where
α ϵ {0.10, 0.05, 0.01}. In case III of the two-tailed test, the Null hypothesis is rejected if Zstat > ᴢ α/2 or
Zstat < −ᴢ α/2, α ϵ {0.10, 0.05, 0.01}.
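
For a large sample, you can compute the one-sample statistic directly with standard aggregates. The following sketch assumes a numeric column col1 in a table named example_table and µ0 = 0; the names and value are illustrative:

-- One-sample z-statistic: (x_bar - mu0) / (s / sqrt(n)), with mu0 = 0:
SELECT (AVG(col1) - 0) /
       (STDDEV_SAMP(col1) / SQRT(CAST(COUNT(col1) AS FLOAT))) AS z_stat
FROM example_table;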


Two-Sample Z-Tests

The two-sample z-test is used for testing equality of means of two populations. Let x1, x2,....xn1 ~ Ɲ(µ1, σ1²)
and y1, y2,....yn2 ~ Ɲ(µ2, σ2²) be random samples from two independent populations. For large samples,
the test statistic is calculated as:

Zstat = (x̄ − ȳ) / √( s1²/n1 + s2²/n2 )

where x̄ and ȳ are the sample means and s1² and s2² are the sample variances.

The Null hypothesis H0 and the alternative hypothesis H1 for a one-sided lower-tailed test are given as:

H0: µ1 ≥ µ2

versus

H1: µ1 < µ2

The Null hypothesis is rejected if Zstat < −ᴢ α where α ϵ {0.10, 0.05, 0.01}. Also, note that −ᴢ α is the
percentile of the normal distribution with α × 100% of the area to its left.

The hypotheses for a one-sided upper-tailed test are:

H0: µ1 ≤ µ2

versus

H1: µ1 > µ2

The Null hypothesis is rejected if Zstat > ᴢ α with α ϵ {0.10, 0.05, 0.01}. Also, note that ᴢ α is the percentile
of the normal distribution with (1 − α) × 100% of the area to its left, so −ᴢ α puts α × 100% of the area to
its left.

For the two-sided hypothesis test:

H0: µ1 = µ2

versus

H1: µ1 ≠ µ2


The Null hypothesis is rejected if Zstat > ᴢ 1−α/2 or Zstat < −ᴢ 1−α/2 with α ϵ {0.10, 0.05, 0.01}. Also, note
that ᴢ 1−α/2 is the percentile of the normal distribution with (1 − α/2) × 100% of the area to its left, so
−ᴢ 1−α/2 puts (α/2) × 100% of the area to its left. Note that under H0, Zstat ~ Ɲ(0,1).
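
When both samples are large, the two-sample statistic can likewise be computed with standard aggregates. This sketch assumes numeric columns col1 and col2 in a table named example_table; the names are illustrative:

-- Two-sample z-statistic: (x_bar - y_bar) / sqrt(s1^2/n1 + s2^2/n2):
SELECT (AVG(col1) - AVG(col2)) /
       SQRT( VAR_SAMP(col1) / CAST(COUNT(col1) AS FLOAT)
           + VAR_SAMP(col2) / CAST(COUNT(col2) AS FLOAT) ) AS z_stat
FROM example_table;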

TD_ZTest Syntax
SELECT * FROM TD_ZTest (
ON { table | view | (query) }
USING
FirstSampleColumn (sample_column_1)
[ FirstSampleVariance (variance_1) ]
[ SecondSampleColumn (sample_column_2) ]
[ SecondSampleVariance (variance_2) ]
[ AlternativeHypothesis ({ 'upper-tailed' | 'lower-tailed' | 'two-tailed' }) ]
[ MeanUnderH0 (mean_under_H0) ]
[ Alpha (alpha) ]
) AS dt;

TD_ZTest Syntax Elements


FirstSampleColumn
[Required] Specify the name of the input column that contains the data for the first sample population.

FirstSampleVariance
[Required if first sample size is less than 30, optional otherwise.] Specify the variance of the
first sample population. variance_1 is a numeric value in the range (0,1.79769e+308).
Default behavior: If sample size is greater than 30, the function approximates the variance.

SecondSampleColumn
[Optional] Specify the name of the input column that contains the data for the second
sample population.


SecondSampleVariance
[Required if you specify SecondSampleColumn and second sample size is less than 30,
optional otherwise.] Specify the variance of the second sample population. variance_2 is a
numeric value in the range (0, 1.79769e+308).
Default behavior: If sample size is greater than 30, the function approximates the variance.

AlternativeHypothesis
[Optional] Specify the alternative hypothesis:
Option          Description

'lower-tailed'  Alternate hypothesis (H1): μ < μ0

'upper-tailed'  Alternate hypothesis (H1): μ > μ0

'two-tailed'    Rejection region is on both sides of the sampling distribution of the
                test statistic. A two-tailed test considers both the lower and upper
                tails of the distribution of the test statistic.
                Alternate hypothesis (H1): μ ≠ μ0

Default: 'two-tailed'

MeanUnderH0
[Optional] Specify the mean under the null hypothesis (H0). mean_under_H0 is a numeric
value in the range (-1.79769e+308, 1.79769e+308).
Default: 0

Alpha
[Optional] The Null Hypothesis is rejected if the P-value is smaller than the specified Alpha
value (where Alpha is the probability of rejecting the null hypothesis when it is true). alpha
must be a numeric value in the range [0, 1].
The null hypothesis is rejected if p_value < alpha. (For a description of p_value, see
TD_ZTest Output.) If the null hypothesis is rejected, the rejection confidence level is 1-alpha.
Default: 0.05


TD_ZTest Input
Input Table Schema
Column           Data Type                    Description

sample_column_1  INTEGER, BYTEINT, SMALLINT,  Data for first sample population.
                 BIGINT, DOUBLE PRECISION,
                 NUMERIC, NUMBER

sample_column_2  INTEGER, BYTEINT, SMALLINT,  [Optional] Data for second
                 BIGINT, DOUBLE PRECISION,    sample population.
                 NUMERIC, NUMBER

TD_ZTest Output
Output Table Schema
Column                 Data Type         Description

sample_column_1        VARCHAR           Data for first sample population.

sample_column_2        VARCHAR           [Column appears only if you specify SecondSampleColumn.]
                                         Data for second sample population.

N1                     INTEGER           Size of first sample.

N2                     INTEGER           [Column appears only if you specify SecondSampleColumn.]
                                         Size of second sample.

mean1                  DOUBLE PRECISION  Mean of first sample.

mean2                  DOUBLE PRECISION  [Column appears only if you specify SecondSampleColumn.]
                                         Mean of second sample.

AlternativeHypothesis  VARCHAR           Hypothesis accepted if null hypothesis is rejected (H1).

z_score                DOUBLE PRECISION  Test statistic z score.

Alpha                  DOUBLE PRECISION  alpha (see TD_ZTest Syntax Elements).

CriticalValue          DOUBLE PRECISION  Critical value calculated using Alpha for test (z_α).

p_value                DOUBLE PRECISION  Probability associated with Z-test statistic:
                                         • One-tailed, lower-tailed test: P(z < z_score)
                                           This is the lower-tail probability under the normal distribution.
                                         • One-tailed, upper-tailed test: P(z > z_score)
                                           This is the upper-tail probability under the normal distribution.
                                         • Two-tailed test: P(z > |z_score|) + P(z < −|z_score|)
                                         If p_value < alpha, reject null hypothesis.

Conclusion             VARCHAR           Z-test result, either 'reject null hypothesis' or 'fail to
                                         reject null hypothesis'. If Conclusion is 'reject null
                                         hypothesis', rejection confidence level is 1−alpha.

TD_ZTest Example
Input: example_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.

col1 col2
----------- -----------
93 12
? 12
22 4
? 87
1 10
? 43
92 31
? 23
2 3
? 52
21 65
? 49
? 17
? 17
? 14
? 24
53 20
85 9
50 11
86 1


SQL Call

SELECT * FROM TD_ZTest (
ON example_table AS InputTable
USING
FirstSampleColumn ('col1')
SecondSampleColumn ('col2')
FirstSampleVariance (0.5)
SecondSampleVariance (0.7)
AlternativeHypothesis ('two-tailed')
MeanUnderH0 (-20)
Alpha (0.05)
) AS dt;

Output

firstsamplecolumn  secondsamplecolumn  N1  N2  mean1                  mean2
-----------------  ------------------  --  --  ---------------------  ---------------------
col1               col2                10  20  5.05000000000000E+001  2.52000000000000E+001

AlternativeHypothesis  z_score                Alpha                  CriticalValue          p_value                Conclusion
---------------------  ---------------------  ---------------------  ---------------------  ---------------------  ----------------------
TWO-TAILED             1.55377718139113E+002  5.00000000000000E-002  1.95996398454005E+000  0.00000000000000E+000  Reject Null hypothesis

The same output, displayed with sidetitles formatting:


firstsamplecolumn      col1
secondsamplecolumn     col2
N1                     10
N2                     20
mean1                  5.05000000000000E+001
mean2                  2.52000000000000E+001
AlternativeHypothesis  TWO-TAILED
z_score                1.55377718139113E+002
Alpha                  5.00000000000000E-002
CriticalValue          1.95996398454005E+000
p_value                0.00000000000000E+000
Conclusion             Reject Null hypothesis
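
The reported z_score is consistent with the two-sample statistic described earlier, with the supplied variances used in place of the sample variances and MeanUnderH0 as the hypothesized difference: z = (mean1 − mean2 − µ0) / √(variance_1/N1 + variance_2/N2) = (50.5 − 25.2 − (−20)) / √(0.5/10 + 0.7/20) = 45.3 / √0.085 ≈ 155.38. Because the statistic lies far beyond the critical value 1.96, the p-value underflows to 0 and the Null hypothesis is rejected.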

12: Aster Compatibility Functions

TD_BYONE
The TD_BYONE function sends all table operator rows to a single AMP (access module processor) for
processing. The function is a deterministic scalar system function that takes no input parameters and
returns an integer associated with a given query. The integer is based on the combined logical host
identifier, session identifier, and request identifier associated with the query.
When using the TD_BYONE function in a table operator, note the following:
• Best practice is to use the function when the number of processed rows is relatively small. Sending a
lot of rows to a single AMP could cause spooling space issues.
• Entities appearing in a PARTITION BY clause must be referenced in the SELECT list.
• The call to TD_BYONE() must be referenced in the SELECT statement.

TD_BYONE Syntax
TD_SYSFNLIB.TD_BYONE()

TD_BYONE Syntax Elements


TD_SYSFNLIB
The name of the database where the function is located.

TD_BYONE Examples
TD_BYONE as a Scalar Function

SELECT TD_SYSFNLIB.TD_BYONE();

Query completed. One row found. One column returned.
Total elapsed time was 1 second.

TD_BYONE()
-----------
       2028


TD_BYONE Function in a Table Operator Call

SELECT * FROM tblop(ON (SELECT table1.*, TD_BYONE()
                        FROM table1) AS table2
                    PARTITION BY TD_BYONE()) AS D;

A: How to Read Syntax

This document uses the following syntax conventions.

Syntax Convention  Meaning

KEYWORD            Keyword. Spell exactly as shown.
                   Many environments are case-insensitive. Syntax shows keywords in uppercase
                   unless operating system restrictions require them to be lowercase or mixed-case.

variable           Variable. Replace with actual value.

number             String of one or more digits. Do not use commas in numbers with more than
                   three digits.
                   Example: 10045

[ x ]              x is optional.

[ x | y ]          You can specify x, y, or nothing.

{ x | y }          You must specify either x or y.

x [...]            You can repeat x, separating occurrences with spaces.
                   Example: x x x
                   See note after table.

x [,...]           You can repeat x, separating occurrences with commas.
                   Example: x, x, x
                   See note after table.

x [delimiter...]   You can repeat x, separating occurrences with specified delimiter.
                   Examples:
                   • If delimiter is semicolon:
                     x; x; x
                   • If delimiter is {,|OR}, you can do either of the following:
                     ◦ x, x, x
                     ◦ x OR x OR x
                   See note after table.


Note:
You can repeat only the immediately preceding item. For example, if the syntax is:

KEYWORD x [...]

You can repeat x. Do not repeat KEYWORD.


If there is no white space between x and the delimiter, the repeatable item is x and the delimiter. For
example, if the syntax is:

[ x, [...] ] y

• You can omit x: y


• You can specify x once: x, y
• You can repeat x and the delimiter: x, x, x, y

B: Additional Information

Changes and Additions


Date       Release  Description

June 2022  17.20    New Features:
                    • TD_ANOVA
                    • TD_TextParser
                    • TD_SentimentExtractor
                    • TD_NaiveBayesTextClassifierTrainer
                    • TD_ROC
                    • TD_Regression_Evaluator
                    • TD_ClassificationEvaluator
                    • TD_Silhouette
                    • TD_GLMPredict
                    • TD_KMeansPredict
                    • TD_VectorDistance
                    • TD_GLM
                    • TD_KMeans
                    • TD_DecisionForest
                    • TD_ColumnTransformer
                    • TD_NonLinearCombineFit
                    • TD_NonLinearCombineTransform
                    • TD_OrdinalEncodingFit
                    • TD_OrdinalEncodingTransform
                    • TD_RandomProjectionMinComponents
                    • TD_RandomProjectionFit
                    • TD_RandomProjectionTransform
                    • TD_GetFutileColumns

July 2021  17.10    New Features:
                    • TD_ConvertTo
                    • TD_GetRowsWithoutMissingValues
                    • TD_OutlierFilterFit
                    • TD_OutlierFilterTransform
                    • TD_SimpleImputeFit
                    • TD_SimpleImputeTransform
                    • TD_CategoricalSummary
                    • TD_ColumnSummary
                    • TD_GetRowsWithMissingValues
                    • TD_Histogram
                    • TD_QQNorm
                    • TD_UnivariateStatistics
                    • TD_WhichMax
                    • TD_WhichMin
                    • TD_BinCodeFit
                    • TD_BinCodeTransform
                    • TD_FunctionFit
                    • TD_FunctionTransform
                    • TD_OneHotEncodingFit
                    • TD_OneHotEncodingTransform
                    • TD_PolynomialFeaturesFit
                    • TD_PolynomialFeaturesTransform
                    • TD_RowNormalizeFit
                    • TD_RowNormalizeTransform
                    • TD_ScaleFit
                    • TD_ScaleTransform
                    • TD_FillRowID
                    • TD_NumApply
                    • TD_RoundColumns
                    • TD_StrApply
                    • TD_ChiSq
                    • TD_FTest
                    • TD_ZTest

                    Enhancements:
                    • DecisionForestPredict function: Changed syntax.
                    • DecisionTreePredict function: Changed syntax and output table schema.
                    • GLMPredict function: Added column range support for syntax element Accumulate.
                    • NaiveBayesTextClassifierPredict function: Changed syntax.
                    • nPath function: Added UNICODE support.
                    • Pack function:
                      ◦ Added column range support for syntax element TargetColumns.
                      ◦ Added syntax elements Accumulate and ColCast.
                    • StringSimilarity function: Added column range support for syntax
                      element Accumulate.
                    • SVMSparsePredict function: Changed syntax, input data types, and output
                      table schema.
                    • Unpack function:
                      ◦ Added column range support for syntax element TargetColumns.
                      ◦ Added syntax element Accumulate.

June 2020  17.00    Initial release.


Teradata Links
Link                                          Description

https://docs.teradata.com/                    Search Teradata Documentation, customize content to your
                                              needs, and download PDFs.
                                              Customers: Log in to access Orange Books.

https://support.teradata.com                  Helpful resources in one place:
                                              • Support requests
                                              • Account management and software downloads
                                              • Knowledge base, community, and support policies
                                              • Product documentation
                                              • Learning resources, including Teradata University

https://www.teradata.com/University/Overview  Teradata education network

https://support.teradata.com/community        Link to Teradata community

Related Documentation
Title Publication ID

Teradata Vantage™ - Analytics Database Release Summary B035-1098

Teradata Vantage™ - SQL Fundamentals B035-1141

Teradata Vantage™ - SQL Functions, Expressions, and Predicates B035-1145

Teradata Vantage™ - SQL Data Manipulation Language B035-1146

Teradata Vantage™ - Analytics Database International Character Set Support B035-1125

Teradata Vantage™ Machine Learning Engine Analytic Function Reference B700-4003

Teradata Aster® Database User Guide
