Teradata Vantage™ - Analytics Database Analytic Functions B035-1206-172K
B035-1206-172K
DOCS.TERADATA.COM
Copyright and Trademarks
Copyright © 2017 - 2022 by Teradata. All Rights Reserved.
All copyrights and trademarks used in Teradata documentation are the property of their respective owners. For more information, see
Trademark Information.
Product Safety
NOTICE
Indicates a situation which, if not avoided, could result in damage to property, such as to equipment or data.
⚠ CAUTION
Indicates a hazardous situation which, if not avoided, could result in minor or moderate personal injury.
⚠ WARNING
Indicates a hazardous situation which, if not avoided, could result in death or serious personal injury.
Third-Party Materials
Non-Teradata (i.e., third-party) sites, documents or communications (“Third-party Materials”) may be accessed or accessible (e.g., linked or
posted) in or in connection with a Teradata site, document or communication. Such Third-party Materials are provided for your convenience only
and do not imply any endorsement of any third party by Teradata or any endorsement of Teradata by such third party. Teradata is not responsible
for the accuracy of any content contained within such Third-party Materials, which are provided on an “AS IS” basis by Teradata. Such third party
is solely and directly responsible for its sites, documents and communications and any harm they may cause you or others.
Warranty Disclaimer
Except as may be provided in a separate written agreement with Teradata or required by applicable laws, all designs, specifications,
statements, information, recommendations and content (collectively, "content") available from the Teradata Documentation website
or contained in Teradata information products is presented "as is" and without any express or implied warranties, including, but not
limited to, the implied warranties of merchantability, fitness for a particular purpose, or noninfringement, which are hereby disclaimed.
In no event shall Teradata Corporation, its suppliers or partners be liable for any direct, indirect, incidental, special, exemplary, or
consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or
business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or
otherwise) arising in any way out of the use of content, even if advised of the possibility of such damage.
The Content available from the Teradata Documentation website or contained in Teradata information products may contain references or
cross-references to features, functions, products, or services that are not announced or available in your country. Such references do not imply
that Teradata Corporation intends to announce such features, functions, products, or services in your country. Please consult your local Teradata
Corporation representative for those features, functions, products, or services available in your country.
The Content available from the Teradata Documentation website or contained in Teradata information products may be changed or updated
by Teradata at any time without notice. Teradata may also make changes in the products or services described in the Content at any time
without notice.
The Content is subject to change without notice. Users are solely responsible for their application of the Content. The Content does not constitute
the technical or other professional advice of Teradata, its suppliers or partners. Users should consult their own technical advisors before
implementing any Content. Results may vary depending on factors not tested by Teradata.
Machine-Assisted Translation
Certain materials on this website have been translated using machine-assisted translation software/tools. Machine-assisted translations of any
materials into languages other than English are intended solely as a convenience to the non-English-reading users and are not legally binding.
Anybody relying on such information does so at his or her own risk. No automated translation is perfect nor is it intended to replace human
translators. Teradata does not make any promises, assurances, or guarantees as to the accuracy of the machine-assisted translations provided.
Teradata accepts no responsibility and shall not be liable for any damage or issues that may result from using such translations. Users are reminded
to use the English contents.
Feedback
To maintain the quality of our products and services, e-mail your comments on the accuracy, clarity, organization, and value of this document
to: [email protected].
Any comments or materials (collectively referred to as "Feedback") sent to Teradata Corporation will be deemed nonconfidential. Without any
payment or other obligation of any kind and without any restriction of any kind, Teradata and its affiliates are hereby free to (1) reproduce, distribute,
provide access to, publish, transmit, publicly display, publicly perform, and create derivative works of, the Feedback, (2) use any ideas, concepts,
know-how, and techniques contained in such Feedback for any purpose whatsoever, including developing, manufacturing, and marketing products
and services incorporating the Feedback, and (3) authorize others to do any or all of the above.
Confidential Information
Confidential Information means any and all confidential knowledge, data or information of Teradata, including, but not limited to, copyrights, patent
rights, trade secret rights, trademark rights and all other intellectual property rights of any sort.
The Content available from the Teradata Documentation website or contained in Teradata information products may include Confidential
Information and as such, the use of such Content is subject to the non-use and confidentiality obligations and protections of a non-disclosure
agreement or other such agreements to protect Confidential Information that you have executed with Teradata.
Contents
TD_RowNormalizeFit
TD_RowNormalizeTransform
TD_ScaleFit
TD_ScaleTransform
Installing model files output by ML Engine functions on Analytics Database: Teradata Vantage™ User Guide, B700-4002
TD_GetRowsWithoutMissingValues Displays the rows that have non-NULL values in the specified input
table columns.
TD_SimpleImputeTransform Substitutes specified values for missing values in the input table.
TD_ConvertTo Converts the specified input table columns to specified data types.
StringSimilarity Calculates the similarity between two strings, using the specified
comparison method.
TD_CategoricalSummary Displays the distinct values and their counts for each specified input
table column.
TD_GetRowsWithMissingValues Displays the rows that have NULL values in the specified input
table columns.
TD_QQNorm Checks whether the values in the specified input table columns are
normally distributed.
TD_WhichMax Displays all rows that have the maximum value in a specified input
table column.
TD_WhichMin Displays all rows that have the minimum value in a specified input
table column.
TD_NonLinearCombineFit Returns the target columns and a specified formula which uses
the non-linear combination of existing features.
TD_NonLinearCombineTransform Generates the values of the new feature using the specified
formula from the TD_NonLinearCombineFit function output.
TD_OrdinalEncodingTransform Maps the categorical value to a specified ordinal value using the
TD_OrdinalEncodingFit output.
TD_PolynomialFeaturesFit Stores all the specified values in the argument in a tabular format.
TD_RowNormalizeTransform Normalizes the input columns row-wise using the output of the
TD_RowNormalizeFit function.
TD_ScaleTransform Scales the specified input table columns using the output of the
TD_ScaleFit function.
TD_NumApply Applies a specified numeric operator to the specified input table columns.
TD_RoundColumns Rounds the values of each specified input table column to a specified number of
decimal places.
TD_StrApply Applies a specified string operator to the specified input table columns.
TD_KMeans Groups a set of observations into k clusters in which each observation belongs to the
cluster with the nearest mean (cluster centers or cluster centroid).
TD_GLM Performs regression analysis on data sets where the response follows an exponential
family distribution.
TD_VectorDistance Accepts a table of target vectors and a table of reference vectors and returns a table
that contains the distance between target-reference pairs.
GLMPredict Uses the model file output by ML Engine GLM function to analyze the input data and
make predictions.
SVMSparsePredict Uses the model file output by ML Engine SVMSparse function to analyze the input
data and make predictions.
DecisionForestPredict Uses the model file output by Machine Learning Engine (ML Engine)
DecisionForest function to analyze the input data and make predictions.
DecisionTreePredict Uses the model file output by ML Engine DecisionTree function to analyze the input
data and make predictions.
TD_KMeansPredict Uses the cluster centroids in the TD_KMeans function output to assign the input
data points to the cluster centroids.
NaiveBayesPredict Uses the model file output by ML Engine Naive Bayes Classifier function to analyze
the input data and make predictions.
TD_GLMPredict Predicts target values (regression) and class labels (classification) for test data
using a GLM model of the TD_GLM function.
TD_ClassificationEvaluator Computes the Confusion matrix, precision, recall, and F1-score based on the
observed labels (true labels) and the predicted labels.
TD_RegressionEvaluator Computes metrics to evaluate and compare multiple models and summarizes
how close predictions are to their expected values.
TD_TextParser Tokenizes an input stream of words and creates a row for each word
in the output table.
Attribution Calculates attributions with a wide range of distribution models. Often used in web-
page analysis.
nPath Performs regular pattern matching over a sequence of rows from one or more inputs.
TD_ANOVA Performs analysis of variance (ANOVA) test to analyze the difference between
the means.
TD_FTest Performs an F-test, for which the test statistic has an F-distribution under the
null hypothesis.
TD_ZTest Performs a Z-test, for which the distribution of the test statistic under the null hypothesis
can be approximated by normal distribution.
Usage Notes
These usage notes apply to every function in this document.
the entire result set per AMP forms a single group or partition. The HASH BY expression is always
supported with the PARTITION BY ANY clause.
Related Information:
How to Read Syntax
'start_column:end_column' [, '-exclude_column' ]
• GLMPredict
• DecisionForestPredict
• StringSimilarity
• NgramSplitter
• TD_GetRowsWithMissingValues
• TD_GetRowsWithoutMissingValues
• TD_ConvertTo
• TD_QQNorm
• TD_TextParser
• TD_NumApply
• TD_StrApply
• TD_RoundColumns
• TD_BincodeTransform
• TD_NonLinearCombineTransform
• TD_OrdinalEncodingTransform
• TD_PolynomialFeaturesTransform
• TD_RowNormalizeTransform
• TD_ScaleTransform
• TD_RandomProjectionTransform
• TD_SentimentExtractor
If you specify the target columns in the Accumulate argument, then the data type of the target columns
in the output can be REAL or FLOAT.
• TD_Silhouette
If you specify the target columns in the Accumulate argument, then the data type of the target columns
in the output can be REAL or FLOAT.
• TD_RandomProjectionTransform
If you specify the target columns in the Accumulate argument, then the data type of the target columns
in the output can be REAL or FLOAT.
• TD_SentimentExtractor
If you specify the Text column in the Accumulate argument, then the data type of the Text column in
the output is VARCHAR (UNICODE).
• TD_FunctionTransform
If a numeric column is not specified in the IDColumns argument, then the data type of the numeric
column in the output can be REAL or FLOAT.
• Pack
If you set the Colcast argument as True and specify the target columns in the Accumulate argument,
then the data type of target columns in the output can be VARCHAR.
• TD_QQNorm and TD_PolynomialFeaturesTransform
The target columns are by default included in the output. The data type of the target columns included
in the output can change to DOUBLE PRECISION or FLOAT.
• TD_NBTCT
If you specify the data type for the token or doccategory columns as CHAR or VARCHAR, then the data
type of token or category in the output can be VARCHAR UNICODE.
• TD_Histogram
If the data type of the label column in the MinMaxTable schema is BYTEINT, SMALLINT,
INTEGER, or BIGINT, then the data type of the label in the output can be BIGINT. If the label data
type is CHAR or VARCHAR, then the data type of the label in the output can be
VARCHAR UNICODE.
BC/BCE Timestamps
Analytics Database functions do not support Before the Common Era (BCE) timestamps. BCE is an
alternative to Before Christ (BC). These are examples of BC/BCE timestamps:
4713-01-01 11:07:11-07:52:58 BC
4713-01-01 11:07:11 BC
• NaiveBayesTextClassifierPredict
• SVMSparsePredict
AA 7.00 Limitations
• The minimum supported version of Aster Analytics is AA 7.00.
Models created using an earlier version of Aster Analytics must be recreated after upgrading to a
supported version of Aster Analytics.
The following table summarizes the differences between analytic functions on Analytics Database and
Aster Database.
PARTITION BY clause lets you specify a column by its position, an integer. PARTITION BY 1 partitions rows by column 1. | PARTITION BY clause accepts only column names. PARTITION BY 1 causes the function to process all rows on a single worker node. See TD_BYONE.
For table operator output, an alias is required. | For function output, an alias is optional.
To specify function syntax elements, you must use a USING clause. | Function syntax does not include USING clause.
Function syntax elements do not support column ranges. | Function syntax elements support column ranges.
For load_to_teradata instructions, see Teradata Aster® Database User Guide and the following
usage notes.
target_table ('glm_housing_model')
);
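The full call is not reproduced above. A representative load_to_teradata call, run on Aster Database, looks like the following sketch; the host, credentials, and source table are placeholders, and it is assumed that the Aster model table is also named glm_housing_model:
SELECT * FROM load_to_teradata (
    ON (SELECT * FROM glm_housing_model)
    tdpid ('td_host_name_or_ip')
    username ('td_user')
    password ('td_user_password')
    target_table ('glm_housing_model')
);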
• If a model table column name contains Analytics Database reserved keywords or special characters
— characters other than letters, digits, or underscore (_)—enclose it in double quotation marks.
This rule applies to the following model column names:
Single_Tree_Drive: node_gini(p), node_entropy(p), node_chisq_pv(p), split_gini(p), split_entropy(p), split_chisq_pv(p)
NaiveBayesReduce: class, variable, type, sum, sumSq, totalCnt
For example:
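The following is a minimal illustration; the model table name single_tree_drive_model is a hypothetical stand-in:
SELECT "node_gini(p)", "split_chisq_pv(p)"
FROM single_tree_drive_model;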
Related Information:
Loading Aster Tables to Analytics Database Using ODBC
The ODBC instructions follow. To follow them, you must have an account on
https://downloads.teradata.com.
1. Install Teradata Parallel Transporter Base.
2. Set up the Aster driver on the client machine.
3. If the table does not exist on Aster Database, create and populate it there.
4. On Analytics Database, do the following:
a. If the user who is to own the table does not exist, create it.
b. Write the tpt script.
c. Write the JobVariablesFile.
d. Use the tbuild command to run the tpt script.
Related Information:
Loading Aster Tables to Analytics Database Using load_to_teradata
Example: Loading Aster Table to Analytics Database Using ODBC
1. Go to https://support.teradata.com.
2. Log in.
3. Download the package TTU 16.20.04.00 Windows - Base.
1. Go to https://support.teradata.com.
2. Log in.
3. Download AsterClients__windows_x8664.version.zip, where version is the version of Aster
Analytics on your client machine; for example:
AsterClients__windows_x8664.06.20.00.00.zip
Downloading the packages takes several minutes.
4. On your client machine, go to the folder where the package was downloaded and unzip it.
5. Go to the subfolder \stage\home\beehive\clients-winnn, where nn is 32, 64, or 86, depending
on your Windows machine. For example:
\stage\home\beehive\clients-win64
6. Install nClusterODBCInstaller_xnn.
If the installer requests a dependency package, install it from the web.
7. Open ODBC Data Sources (nn-bit).
8. On the System DSN tab, select Add.
If Aster ODBC driver installation succeeded, the window Aster ODBC Driver appears.
9. In the window Aster ODBC Driver, select Finish.
10. In the DSN Setup form that appears, enter the following values and select OK:
Field Value
Port 2406
• Write the tpt script by substituting values for variables in the following script:
STEP STEP_CREATE_DDL
(
APPLY
('DROP TABLE '||@TargetTable||' ;'),
('CREATE MULTISET TABLE
'||@TargetTable||'(TD_compatible_table_definition);')
TO OPERATOR ($DDL() [1]);
);
Step Insert_Tables
(
APPLY
('Ins '||@TargetTable||'(
:"column_name_1"
,:"column_name_2"
[...,:"column_name_k"]
);'
)
TO OPERATOR ($LOAD()[1])
• Write the JobVariablesFile by substituting values for variables in the following script:
DDLTdpId = 'td_host_name_or_ip'
,DDLUserName = 'td_user'
,DDLUserPassword = 'td_user_password'
,DDLErrorList = ['3807']
,DDLPrivateLogName = 'DDL001S1'
,TargetTable = 'td_table_name'
,ODBCPrivateLogName = 'ODB039P1'
,ODBCDSNName = 'Data_Source_Name_specified_in_DSN_Setup_form'
,TruncateData = {'Y' | 'N'}
,ODBCUserName = 'aster_user'
,ODBCUserPassword = 'aster_user_password'
,LOADPrivateLogName = 'ODB039C1'
,LOADTDPID = 'td_host_name_or_ip'
,LOADUserName = 'td_user'
,LOADUserPassword = 'td_user_password'
,SelectStmt = 'SELECT * FROM aster_table_name;'
,LOADTargetTable = 'td_table_name'
For TruncateData, 'Y' trims unused space. The default is 'N'. Specify 'Y' when the MaxLenVarchar
field of the DSN Setup form (in Setting Up the Aster Driver on the Client) exceeds the maximum
VARCHAR length specified in TD_compatible_table_definition in the tpt script; otherwise, the ODBC
operator cannot load the data.
You are on Analytics Database, where a folder contains the tpt script and JobVariablesFile that
you wrote.
1. Open the command prompt.
2. Go to the folder where the tpt script and JobVariablesFile are.
3. Run the tpt script with this command:
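A typical invocation has this form (exact options may vary with your TPT version):
tbuild -f tptfile -v JobVariablesFile jobid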
where tptfile and JobVariablesFile are the names of the tpt script and JobVariablesFile that you wrote
and jobid is the name you are giving to this tbuild job.
)
);
STEP STEP_CREATE_DDL
(
APPLY
('DROP TABLE '||@TargetTable||';'),
('CREATE MULTISET TABLE '||@TargetTable||'(
class_nb VARCHAR(128),
variable_nb VARCHAR(128),
type_nb VARCHAR(128),
category VARCHAR(32),
cnt BIGINT,
sum_nb FLOAT,
sum_sq FLOAT,
total_cnt BIGINT) NO PRIMARY INDEX;
')
TO OPERATOR ($DDL()[1]);
);
Step Insert_Tables
(
APPLY
('Ins '||@TargetTable||'(
:"class_nb"
,:"variable_nb"
,:"type_nb"
,:"category"
,:"cnt"
,:"sum_nb"
,:"sum_sq"
,:"total_cnt");'
)
TO OPERATOR ($LOAD()[1])
JobVariablesFile, attr.txt
DDLTdpid = 'td_host_name_or_ip'
,DDLUserName = 'alice'
,DDLUserPassword = 'alice'
,DDLErrorList = ['3807']
,DDLPrivateLogName = 'DDL001S1'
,TargetTable = 'td_nb_modelsc'
,ODBCPrivateLogName = 'ODBC039P1'
,ODBCDSNName = 'shruti'
,TruncateData = 'Y'
,ODBCUserName = 'db_superuser'
,ODBCUserPassword = 'db_superuser'
,LOADPrivateLogName = 'ODBC039P1'
,LOADTDPID = 'td_host_name_or_ip'
,LOADUserName = 'alice'
,LOADUserPassword = 'alice'
,SelectStmt = 'SELECT * FROM aster_nb_modelsc;'
,LOADTargetTable = 'td_nb_modelsc'
The model table has the column t_score if created with Family ('GAUSSIAN'), otherwise it has the
column z_score.
)
DISTRIBUTE BY REPLICATION
STORAGE ROW;
Data cleaning functions prepare the input data set for the next set of transformations.
Handling Outliers
TD_GetFutileColumns
TD_GetFutileColumns function returns the futile column names if any of these conditions is met:
• All values in the column are unique
• All values in the column are the same
• The count of distinct values in the column divided by the count of the total number of rows in the input
table is greater than or equal to the threshold value
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_GetFutileColumns Syntax
CategoricalSummaryColumn
[Required]: Specify the column name from the CategoricalSummaryTable generated using
the TD_CategoricalSummary function.
ThresholdValue
[Optional]: Specify the threshold value for the input table columns. A column is futile
if the count of distinct values in the column divided by the total number of rows in
the input table is greater than or equal to the threshold value.
Note:
This function works only for categorical data.
TD_GetFutileColumns Input
Target_Column Varchar The input table columns from the Category Summary table.
TD_GetFutileColumns Output
FutileColumns Varchar Character Set Unicode The column names that are futile.
TD_GetFutileColumns Example
InputTable
CategorySummary table
SQL Call
ThresholdValue(0.7)
) AS dt;
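The complete call is not shown above. A representative form follows; the table names titanic and catsummary_out (the TD_CategoricalSummary output) and the CategoryTable input name are assumptions for illustration:
SELECT * FROM TD_GetFutileColumns (
    ON titanic AS InputTable PARTITION BY ANY
    ON catsummary_out AS CategoryTable DIMENSION
    USING
    CategoricalSummaryColumn ('ColumnName')
    ThresholdValue (0.7)
) AS dt;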
Output Table
ColumnName
----------
ticket
cabin
TD_OutlierFilterFit
TD_OutlierFilterFit function calculates the lower_percentile, upper_percentile, count of rows, and
median for the specified input table columns. The calculated values for each column help the
TD_OutlierFilterTransform function detect outliers in the input table.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_OutlierFilterFit Syntax
) AS alias
) WITH DATA;
TargetColumns
Specify the names of the numeric InputTable columns for which to compute metrics.
GroupColumns
[Optional] Specify the name of the InputTable column by which to group the input data.
Default behavior: Function does not group input data.
OutlierMethod
[Optional] Specify one of these methods for filtering outliers:
carling Q2 ± c*(Q3-Q1)
where:
Q2 = median of data
Q1 = 25th quartile of data
Q3 = 75th quartile of data
c = (17.63*r - 23.64) / (7.74*r - 3.71)
r = count of rows in group_column if you specify GroupColumns, otherwise
count of rows in InputTable
LowerPercentile
[Optional] Specify the lower percentile used to detect whether a value is an outlier.
Supported values: 0 to 1. For Tukey and Carling, use 0.25 as the lower percentile. The
default value is 0.05.
UpperPercentile
[Optional] Specify the upper percentile used to detect whether a value is an outlier.
Supported values: 0 to 1. For Tukey and Carling, use 0.75 as the upper percentile. The
default value is 0.95.
IQRMultiplier
[Optional] Specify interquartile range multiplier (IQR), k, for Tukey filtering.
The IQR is an estimate of the spread (dispersion) of the data in the target columns (IQR =
|Q3-Q1|).
Use k = 1.5 for moderate outliers and k = 3.0 for serious outliers.
Default: 1.5
ReplacementValue
[Optional] Specify how to handle outliers:
Option Description
null Copy row to output table, replacing each outlier with NULL.
median Copy row to output table, replacing each outlier with median value
for its group.
RemoveTail
[Optional] Specify whether to remove the upper tail, the lower tail, or both.
Default: both
PercentileMethod
[Optional] Specify either the PercentileCont or the PercentileDISC method for calculating
the upper and lower percentiles of the input data values. The default value
is PercentileDISC.
TD_OutlierFilterFit Input
InputTable Schema
Column Data Type Description
target_column NUMERIC The input table column names for computing metrics and filtering outliers
using the TD_OutlierFilterTransform function.
TD_OutlierFilterFit Output
TD_OutlierFilterFit Example
InputTable: titanic
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
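A representative call; the target column fare and the method settings are illustrative choices, not necessarily the exact values used to produce the output below:
CREATE TABLE outlier_fit AS (
    SELECT * FROM TD_OutlierFilterFit (
        ON titanic AS InputTable
        USING
        TargetColumns ('fare')
        LowerPercentile (0.25)
        UpperPercentile (0.75)
        OutlierMethod ('carling')
        ReplacementValue ('median')
    ) AS dt
) WITH DATA;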
Output
TD_OutlierFilterTransform
TD_OutlierFilterTransform filters outliers from the input table. The metrics for determining outliers come
from TD_OutlierFilterFit output.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_OutlierFilterTransform Syntax
TD_OutlierFilterTransform Input
InputTable Schema
See TD_OutlierFilterFit Input.
FitTable Schema
See TD_OutlierFilterFit Output.
TD_OutlierFilterTransform Output
OtherColumns Any The columns from the input table excluding the target columns
are displayed.
TD_OutlierFilterTransform Example
Input
• InputTable: titanic, as in TD_OutlierFilterFit Example
• FitTable: outlier_fit, created by TD_OutlierFilterFit Example
SQL Call
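A representative call, reusing the titanic input and the outlier_fit table created by the TD_OutlierFilterFit example:
SELECT * FROM TD_OutlierFilterTransform (
    ON titanic AS InputTable PARTITION BY ANY
    ON outlier_fit AS FitTable DIMENSION
) AS dt;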
Output
4 1 53.100000000 1
5 3 8.050000000 0
TD_GetRowsWithoutMissingValues
TD_GetRowsWithoutMissingValues displays the rows that have non-NULL values in the specified input
table columns.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
Related Information:
TD_GetRowsWithMissingValues
TD_GetRowsWithoutMissingValues Syntax
TargetColumns
[Optional] Specify the target column names to check for non-NULL values.
Default: If omitted, the function considers all columns of the input table.
Accumulate
[Optional]: Specify the input table column names to copy to the output table.
TD_GetRowsWithoutMissingValues Input
InputTable Schema
Column Data Type Description
accumulate_column Any The input table column names to copy to the output table.
TD_GetRowsWithoutMissingValues Output
TD_GetRowsWithoutMissingValues Example
InputTable: input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
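A representative call; the column range is an illustrative choice:
SELECT * FROM TD_GetRowsWithoutMissingValues (
    ON input_table AS InputTable
    USING
    TargetColumns ('[0:3]')
) AS dt;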
Output
TD_SimpleImputeFit
Imputation is the process of replacing missing values with substitute values.
TD_SimpleImputeFit outputs a table of values to substitute for missing values in the input table. The output
table is input to TD_SimpleImputeTransform, which makes the substitutions.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_SimpleImputeFit Syntax
literal_specification
stats_specification
}
) AS alias;
literal_specification
stats_specification
OutputTable
[Optional] Specify a name for the output table.
If you omit OutputTable, you must create the output table for TD_SimpleImputeTransform
with a CREATE TABLE AS statement:
ColsForLiterals
[Optional] Specify the names of the InputTable columns in which to find missing values to
replace with specified literal values.
Literals
[Optional] Specify the literal values to substitute for missing values in the columns specified
by ColsForLiterals. A literal must not exceed 128 characters.
The function maps each literal to the column in the same position in ColsForLiterals. For
example, ColsForLiterals ('[1:5]', '-[2:4]', '[3]') specifies the column with index 3 last, so the
function maps the last specified literal to it.
ColsForStats
[Optional] Specify the names of the InputTable columns in which to find missing values to
replace with specified statistics.
Stats
[Optional] Specify the statistics to substitute for missing values in the columns specified
by ColsForStats.
For numeric columns, the value of the Stats argument must be one of the following values:
• MIN
• MAX
• MEAN
• MEDIAN
For columns with the following data types, the value of the Stats argument can be MODE:
• CHARACTER
• VARCHAR
• BYTEINT
• SMALLINT
• INTEGER
CHARACTER and VARCHAR values must not exceed 128 characters.
In case of a tie, MODE returns the value that comes last in alphabetical order.
The function maps the value of each Stats argument to the column in the same position
in ColsForStats. For example, ColsForStats ('[1:5]', '-[2:4]', '[3]') specifies the column with
index 3 last, so the function maps the last specified Stats argument to it.
PartitionColumn
[Optional] Specify the name of the InputTable column on which to partition the input.
Default behavior: The function treats all rows as a single partition.
TD_SimpleImputeFit Input
InputTable Schema
Column Data Type Description
TD_SimpleImputeFit Output
FitTable Schema
Column Data Type Description
TD_SimpleImputeFit Example
InputTable: simpleimputefit_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
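A representative call that writes the fit table used by the TD_SimpleImputeTransform example; the column names and the literal value are illustrative assumptions:
CREATE TABLE fit_table AS (
    SELECT * FROM TD_SimpleImputeFit (
        ON simpleimputefit_input AS InputTable
        USING
        ColsForLiterals ('cabin')
        Literals ('General')
        ColsForStats ('age')
        Stats ('MEDIAN')
    ) AS dt
) WITH DATA;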
Output
TD_SimpleImputeTransform
TD_SimpleImputeTransform substitutes specified values for missing values in the input table. The specified
values come from TD_SimpleImputeFit output.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_SimpleImputeTransform Syntax
TD_SimpleImputeTransform syntax depends on whether the TD_SimpleImputeFit call that output
FitTable omitted or specified PartitionColumn.
TD_SimpleImputeTransform Input
InputTable Schema
See TD_SimpleImputeFit Input.
FitTable Schema
See TD_SimpleImputeFit Output.
TD_SimpleImputeTransform Output
target_column Same as in InputTable Column in which missing values have been replaced.
TD_SimpleImputeTransform Example
Input
FitTable: fit_table created by TD_SimpleImputeFit Example
SQL Call
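A representative call for a fit table created without PartitionColumn; if the fit call specified PartitionColumn, both ON clauses are instead partitioned by that column:
SELECT * FROM TD_SimpleImputeTransform (
    ON simpleimputefit_input AS InputTable PARTITION BY ANY
    ON fit_table AS FitTable DIMENSION
) AS dt;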
Output
Parsing Data
TD_ConvertTo
TD_ConvertTo converts the specified input table columns to specified data types.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_ConvertTo Syntax
TargetColumns
Specify the names of the InputTable columns to convert to another data type.
Accumulate
[Optional]: Specify the input table column names to copy to the output table.
TargetDataType
Specify either a single target data type for all target columns or a target data type for each
target column. If you specify multiple target data types, the function assigns the nth target
data type to the nth target column.
BYTEINT BYTEINT
SMALLINT SMALLINT
INTEGER INTEGER
BIGINT BIGINT
REAL REAL
Input Data Type Output Data Type
TIME TIME(6)
TIMESTAMP TIMESTAMP(6)
BYTE BYTE(32000)
BYTE(charlen=len) BYTE(len)
VARBYTE VARBYTE(32000)
VARBYTE(charlen=len) VARBYTE(len)
BLOB BLOB(2097088000)
BLOB(charlen=len) BLOB(len)
TD_ConvertTo Input
InputTable Schema
Column Data Type Description
TD_ConvertTo Output
TD_ConvertTo Example
InputTable: input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
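A representative call; the column names and target data types are illustrative assumptions:
SELECT * FROM TD_ConvertTo (
    ON input_table AS InputTable
    USING
    TargetColumns ('id_col', 'measure_col')
    TargetDataType ('BIGINT', 'REAL')
    Accumulate ('label_col')
) AS dt;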
Output
Pack
The Pack function packs data from multiple input columns into a single column. The packed column has a
virtual column for each input column. By default, virtual columns are separated by commas and each virtual
column value is labeled with its column name.
Pack complements the function Unpack, but you can use it on any columns that meet the
input requirements.
Note:
To use Pack and Unpack together, you must run both on Analytics Database. Pack and Unpack are
incompatible with ML Engine functions Pack_MLE and Unpack_MLE.
Before packing columns, note their data types—you need them if you want to unpack the packed column.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support locale-based formatting with the SDF file.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
Pack Syntax
Related Information:
Column Specification Syntax Elements
TargetColumns
[Optional] Specify the names of the input table columns to pack into a single output column.
Column names must be valid object names, which are defined in Teradata Vantage™ - SQL
Fundamentals, B035-1141.
These names become the column names of the virtual columns. If you specify this syntax
element, but do not specify all input table columns, the function copies the unspecified input
table columns to the output table.
Default behavior: All input table columns are packed into a single output column.
Delimiter
[Optional] Specify the delimiter—a single Unicode character in Normalization Form C (NFC)
—that separates the virtual columns in the packed data. The delimiter is case-sensitive.
IncludeColumnName
[Optional] Specify whether to label each virtual column value with its column name (making
the virtual column target_column:value).
Default: 'true'
OutputColumn
Specify the name to give to the packed output column. The name must be a valid object
name, as defined in Teradata Vantage™ - SQL Fundamentals, B035-1141.
Accumulate
[Optional] Specify the input columns to copy to the output table.
ColCast
[Optional] Specify whether to cast each numeric target_column to VARCHAR.
Specifying 'true' decreases run time for queries with numeric target columns.
Default: false
Pack Input
target_column Any Column to pack, with other input columns, into single
output column.
accumulate_column or other_input_column Any Column to copy to output table. Typically, one such
column contains row identifiers.
Pack Output
accumulate_column Same as in input table for a nonnumeric column or a numeric column with ColCast ('false'); VARCHAR for a numeric column with ColCast ('true') Column copied from input table.
Pack Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
The input table, ville_temperature, contains temperature readings for the cities Nashville and Knoxville,
in the state of Tennessee.
ville_temperature
sn city state period temp_f
This example specifies the default options for Delimiter and IncludeColumnName.
Input
See Pack Examples Input.
SQL Call
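A call of roughly this form produces the output shown below (default Delimiter and IncludeColumnName):
SELECT * FROM Pack (
    ON ville_temperature
    USING
    TargetColumns ('city', 'state', 'period', 'temp_f')
    OutputColumn ('packed_data')
) AS dt ORDER BY sn;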
Output
The columns specified by TargetColumns are packed in the column packed_data. Virtual columns are
separated by commas, and each virtual column value is labeled with its column name. The input column
sn, which was not specified by TargetColumns, is unchanged in the output table.
packed_data sn
city:Nashville,state:Tennessee,period:2010-01-01 00:00:00,temp_f:35.1 1
city:Nashville,state:Tennessee,period:2010-01-01 01:00:00,temp_f:36.2 2
city:Nashville,state:Tennessee,period:2010-01-01 02:00:00,temp_f:34.5 3
city:Nashville,state:Tennessee,period:2010-01-01 03:00:00,temp_f:33.6 4
city:Nashville,state:Tennessee,period:2010-01-01 04:00:00,temp_f:33.1 5
city:Knoxville,state:Tennessee,period:2010-01-01 03:00:00,temp_f:33.2 6
city:Knoxville,state:Tennessee,period:2010-01-01 04:00:00,temp_f:32.8 7
city:Knoxville,state:Tennessee,period:2010-01-01 05:00:00,temp_f:32.4 8
city:Knoxville,state:Tennessee,period:2010-01-01 06:00:00,temp_f:32.2 9
city:Knoxville,state:Tennessee,period:2010-01-01 07:00:00,temp_f:32.4 10
This example specifies the pipe character (|) for Delimiter and 'false' for IncludeColumnName.
Input
See Pack Examples Input.
SQL Call
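A sketch of the corresponding call:
SELECT * FROM Pack (
    ON ville_temperature
    USING
    TargetColumns ('city', 'state', 'period', 'temp_f')
    Delimiter ('|')
    IncludeColumnName ('false')
    OutputColumn ('packed_data')
) AS dt ORDER BY sn;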
Output
Virtual columns are separated by pipe characters and not labeled with their column names.
packed_data sn
Nashville|Tennessee|2010-01-01 00:00:00|35.1 1
Nashville|Tennessee|2010-01-01 01:00:00|36.2 2
Nashville|Tennessee|2010-01-01 02:00:00|34.5 3
Nashville|Tennessee|2010-01-01 03:00:00|33.6 4
Nashville|Tennessee|2010-01-01 04:00:00|33.1 5
Knoxville|Tennessee|2010-01-01 03:00:00|33.2 6
Knoxville|Tennessee|2010-01-01 04:00:00|32.8 7
Knoxville|Tennessee|2010-01-01 05:00:00|32.4 8
Knoxville|Tennessee|2010-01-01 06:00:00|32.2 9
Knoxville|Tennessee|2010-01-01 07:00:00|32.4 10
Unpack
The Unpack function unpacks data from a single packed column into multiple columns. The packed
column is composed of multiple virtual columns, which become the output columns. To determine the
virtual columns, the function must have either the delimiter that separates them in the packed column or
their lengths.
Unpack complements the function Pack, but you can use it on any packed column that meets the
input requirements.
Note:
• To use Pack and Unpack together, you must run both on Analytics Database. Pack and Unpack
are incompatible with ML Engine functions Pack_MLE and Unpack_MLE.
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support the following:
◦ Locale-based parsing with the SDF file
◦ Pass Through Characters (PTCs)
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
◦ KanjiSJIS or Graphic data types
Unpack Syntax
Related Information:
Column Specification Syntax Elements
TargetColumn
Specify the name of the input column that contains the packed data.
OutputColumns
Specify the names to give to the output columns, in the order in which the corresponding
virtual columns appear in target_column. The names must be valid object names, as
defined in Teradata Vantage™ - SQL Fundamentals, B035-1141.
If you specify fewer output column names than there are virtual input columns, the function
ignores the extra virtual input columns. That is, if the packed data contains x+y virtual
columns and the OutputColumns syntax element specifies x output column names, the
function assigns the names to the first x virtual columns and ignores the remaining y
virtual columns.
OutputDataTypes
Specify the datatypes of the unpacked output columns. Supported data types are
VARCHAR, INTEGER, DOUBLE PRECISION, TIME, DATE, and TIMESTAMP.
If OutputDataTypes specifies only one value and OutputColumns specifies multiple
columns, the specified value applies to every output_column.
If OutputDataTypes specifies multiple values, it must specify a value for each
output_column. The nth datatype corresponds to the nth output_column.
The function can output only 16 VARCHAR columns.
Delimiter
[Optional] Specify the delimiter—a single Unicode character in Normalization Form C (NFC)
—that separates the virtual columns in the packed data. The delimiter is case-sensitive.
Do not specify both this syntax element and the ColumnLength syntax element. If the
virtual columns are separated by a delimiter, specify the delimiter with this syntax element;
otherwise, specify the ColumnLength syntax element.
Default: ',' (comma)
ColumnLength
[Optional] Specify the lengths of the virtual columns. To use this syntax element,
you must know the length of each virtual column.
If ColumnLength specifies only one value and OutputColumns specifies multiple columns,
the specified value applies to every output_column.
If ColumnLength specifies multiple values, it must specify a value for each output_column.
The nth value corresponds to the nth output_column. However, the last value
can be an asterisk (*), which represents a single virtual column that contains the
remaining data. For example, if the first three virtual columns have the lengths 2, 1,
and 3, and all remaining data belongs to the fourth virtual column, you can specify
ColumnLength ('2', '1', '3', *).
Do not specify both this syntax element and the Delimiter syntax element.
Regex
[Optional] Specify a regular expression that describes a row of packed data, enabling the
function to find the data values.
A row of packed data contains a data value for each virtual column, but the row might also
contain other information (such as the virtual column name). In the regular_expression,
each data value is enclosed in parentheses.
For example, suppose that the packed data has two virtual columns, age and sex, and that
one row of packed data is age:34,sex:male. The regular_expression that describes the
row is '.*:(.*)'. The '.*:' matches the virtual column names, age and sex, and the
'(.*)' matches the values, 34 and male.
To represent multiple data groups in regular_expression, use multiple pairs of parentheses.
Without parentheses, the last data group in regular_expression represents the data value
(other data groups are assumed to be virtual column names or unwanted data). If a
different data group represents the data value, specify its group number with the RegexSet
syntax element.
Default: '(.*)', which matches the whole string (between delimiters, if any). When applied
to the preceding sample row, the default regular_expression causes the function to return
'age:34' and 'sex:male' as data values.
RegexSet
[Optional] Specify the ordinal number of the data group in regular_expression that
represents the data value in a virtual column.
Default behavior: The last data group in regular_expression represents the data value. For
example, suppose that regular_expression is '([a-zA-Z]*):(.*)'. If group_number is
'1', '([a-zA-Z]*)' represents the data value. If group_number is '2', '(.*)' represents
the data value.
Maximum: 30
IgnoreInvalid
[Optional] Specify whether the function ignores rows that contain invalid data.
IgnoreInvalid may not behave as you expect if an item in a virtual column has trailing special
characters. See Unpack Example: IgnoreInvalid ('true') with Trailing Special Characters.
Default: 'false' (The function fails if it encounters a row with invalid data.)
Accumulate
[Optional] Specify the input columns to copy to the output table.
Unpack Input
Unpack Output
Unpack Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
The input table, ville_tempdata, is a collection of temperature readings for two cities, Nashville and
Knoxville, in the state of Tennessee. In the column of packed data, the delimiter comma (,) separates the
virtual columns. The last row contains invalid data.
ville_tempdata
sn packed_temp_data
10 Nashville,Tennessee,35.1
11 Nashville,Tennessee,36.2
12 Nashville,Tennessee,34.5
13 Nashville,Tennessee,33.6
14 Nashville,Tennessee,33.1
15 Nashville,Tennessee,33.2
16 Nashville,Tennessee,32.8
17 Nashville,Tennessee,32.4
18 Nashville,Tennessee,32.2
19 Nashville,Tennessee,32.4
20 Thisisbaddata
SQL Call
Because comma is the default delimiter, the Delimiter syntax element is optional.
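A sketch of the call; the output column names and types are inferred from the packed data:
SELECT * FROM Unpack (
    ON ville_tempdata
    USING
    TargetColumn ('packed_temp_data')
    OutputColumns ('city', 'state', 'temp_f')
    OutputDataTypes ('VARCHAR', 'VARCHAR', 'DOUBLE PRECISION')
    IgnoreInvalid ('true')
    Accumulate ('sn')
) AS dt ORDER BY sn;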
Output
Because of IgnoreInvalid ('true'), the function did not fail when it encountered the row with invalid data,
but it did not output that row.
Input
The input table, ville_tempdata1, is like the input table for the previous example, except that no delimiter
separates the virtual columns in the packed data. To enable the function to determine the virtual columns,
the function call specifies the column lengths.
ville_tempdata1
sn packed_temp_data
10 NashvilleTennessee35.1
11 NashvilleTennessee36.2
sn packed_temp_data
12 NashvilleTennessee34.5
13 NashvilleTennessee33.6
14 NashvilleTennessee33.1
15 NashvilleTennessee33.2
16 NashvilleTennessee32.8
17 NashvilleTennessee32.4
18 NashvilleTennessee32.2
19 NashvilleTennessee32.4
20 Thisisbaddata
SQL Call
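A sketch of the call; the lengths 9, 9, and 4 match the city, state, and temperature values in this input:
SELECT * FROM Unpack (
    ON ville_tempdata1
    USING
    TargetColumn ('packed_temp_data')
    OutputColumns ('city', 'state', 'temp_f')
    OutputDataTypes ('VARCHAR', 'VARCHAR', 'DOUBLE PRECISION')
    ColumnLength ('9', '9', '4')
    IgnoreInvalid ('true')
    Accumulate ('sn')
) AS dt ORDER BY sn;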
Output
city state temp_f sn
Input
The input table is ville_tempdata1, as in Unpack Example: No Delimiter Separates Virtual Columns. Its
packed_temp_data column has three virtual columns.
SQL Call
The OutputColumns syntax element specifies only two output column names.
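A sketch of the call, with only two output column names:
SELECT * FROM Unpack (
    ON ville_tempdata1
    USING
    TargetColumn ('packed_temp_data')
    OutputColumns ('city', 'state')
    OutputDataTypes ('VARCHAR')
    ColumnLength ('9', '9')
    IgnoreInvalid ('true')
    Accumulate ('sn')
) AS dt ORDER BY sn;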
Output
The output table has columns for the first two virtual input columns, but not for the third.
city state sn
Nashville Tennessee 10
Nashville Tennessee 11
Nashville Tennessee 12
Nashville Tennessee 13
Nashville Tennessee 14
Nashville Tennessee 15
city state sn
Nashville Tennessee 16
Nashville Tennessee 17
Nashville Tennessee 18
Nashville Tennessee 19
In this example, the items in the first virtual input column have trailing special characters. No delimiter
separates the virtual columns. ColumnLength is 2. The call to Unpack includes IgnoreInvalid ('true'), but
the output is unexpected.
Input
t2
c1
1,1919-04-05
1.1919-04-05
5,.1919-04-05
2,2019/04/05
4.,.1919-04-05
32019/04/05
SQL Call
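A sketch of the call consistent with the explanation that follows (length '2' for the first virtual column, an asterisk for the remainder):
SELECT * FROM Unpack (
    ON t2
    USING
    TargetColumn ('c1')
    OutputColumns ('a', 'b')
    OutputDataTypes ('INTEGER', 'DATE')
    ColumnLength ('2', '*')
    IgnoreInvalid ('true')
) AS dt;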
Output
a b
1 19/04/05
1 19/04/05
2 19/04/05
The reason for the unexpected output is the behavior of an internal library that Unpack uses, which is
as follows:
1,1919-04-05 Library prunes trailing comma, converts "1" to integer and "1919-04-05" to date. (Output
row 1.)
1.1919-04-05 Library prunes trailing period, converts "1" to integer and "1919-04-05" to date. (Output
row 2.)
5,.1919-04-05 Library prunes trailing comma, converts 5 to integer, but cannot convert ".1919-04-05"
to date. (No output row.)
With ColumnLength ('3','*'), library prunes trailing comma and period, converts "5" to
integer and "1919-04-05" to date, and outputs a row for this input row.
2,2019/04/05 Library prunes trailing comma, converts "2" to integer and "2019/04/05" to date. (Output
row 3.)
4.,.1919-04-05 Library converts "4." to integer, but cannot convert ",.1919-04-05" to date. (No output
row.)
32019/04/05 Library converts "32" to integer, but cannot convert "019/04/05" to date. (No output row.)
StringSimilarity
The StringSimilarity function calculates the similarity between two strings, using the specified comparison
method. The similarity is a value in the range [0, 1].
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• When comparing strings, the function assumes that they are in the same Unicode script in
Normalization Form C (NFC).
• When used with this function, the ORDER BY clause supports only ASCII collation.
StringSimilarity Syntax
Related Information:
Column Specification Syntax Elements
ComparisonColumnPairs
Specify the names of the input table columns that contain strings to compare (column1 and
column2), how to compare them (comparison_type), and (optionally) a constant and the
name of the output column for their similarity (output_column). The similarity is a value in
the range [0, 1].
For column1 and column2:
• If column1 or column2 includes any special characters (that is, characters other than
letters, digits, or underscore (_)), surround the column name with double quotation
marks. For example, if column1 and column2 are c(col1) and c(col2), respectively,
specify them as "c(col1)" and "c(col2)".
If column1 or column2 includes double quotation marks, replace each double quotation
mark with a pair of double quotation marks. For example, if column1 and column2 are
c1"c and c2"c, respectively, specify them as "c1""c" and "c2""c".
Note:
These rules do not apply to output_column. For example, this is valid syntax:
ComparisonColumnPairs ('jaro ("c1""c", "c2""c") AS out"col')
• If column1 or column2 supports more than 200 characters, you can cast it to
VARCHAR(200), as in the following example; however, the string may be truncated.
For information about the CAST operation, see Teradata Vantage™ - SQL Functions,
Expressions, and Predicates, B035-1145.
'n_gram' N-gram similarity. If you specify this comparison type, you can specify
the value of N with constant. Default: N = 2
'soundexcode' Only for English strings: -1 if either string has a non-English character;
otherwise, 1 if their soundex codes are the same and 0 otherwise.
CaseSensitive
[Optional] Specify whether string comparison is case-sensitive. You can specify either one
value for all pairs or one value for each pair. If you specify one value for each pair, the ith
value applies to the ith pair.
Default: 'false'
Accumulate
[Optional] Specify the names of input table columns to copy to the output table.
StringSimilarity Input
If any column1 or column2 in the input table schema supports more than 200 characters, you must cast
it to VARCHAR(200). See example in StringSimilarity Syntax Elements.
StringSimilarity Output
StringSimilarity Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
strsimilarity_input
id src_text1 src_text2 tar_text
SQL Call
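A sketch of the call; the four comparison types are chosen to match the output columns jaro1_sim, ld1_sim, ngram1_sim, and jw1_sim:
SELECT * FROM StringSimilarity (
    ON strsimilarity_input PARTITION BY ANY
    USING
    ComparisonColumnPairs (
        'jaro (src_text1, tar_text) AS jaro1_sim',
        'LD (src_text1, tar_text) AS ld1_sim',
        'n_gram (src_text1, tar_text) AS ngram1_sim',
        'jaro_winkler (src_text1, tar_text) AS jw1_sim'
    )
    Accumulate ('id', 'src_text1', 'tar_text')
) AS dt ORDER BY id;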
Output
Columns 1-3
id src_text1 tar_text
1 astre aster
2 hone phone
3 acqiese acquiesce
4 AAAACCCCCGGGGA CCAGGGAAACCCAC
5 alice allies
6 angela angels
7 senter centre
8 chef chief
9 circus circuit
10 debt debris
11 deal lead
12 bare bear
Columns 4-7
jaro1_sim ld1_sim ngram1_sim jw1_sim
Input
The input table is strsimilarity_input, as in StringSimilarity Example: Specify Column Names.
SQL Call
Output
Data exploration functions help you learn about the variables (columns) of the input data set.
MovingAverage
The MovingAverage function computes average values in a series, using the specified moving
average type.
Modified Moving Average Computes first value as simple moving average. Computes subsequent
values by adding new value and subtracting last average from resulting sum.
Weighted Moving Average Computes average of points in series, applying weights to older values.
Weights for older values decrease arithmetically.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• The ORDER BY clause supports only ASCII collation.
• The PARTITION BY clause assumes column names are in Normalization Form C (NFC).
The value n—the number of old values to use when calculating the new weighted moving average—is
specified by the WindowSize syntax element.
The function calculates SMA_i on the ith window of the target column from the start of the row.
3. Compute the triangular moving average by computing the simple moving average with window size N
on the values obtained in step 2, using this formula:
TMA = (SMA_1 + SMA_2 + … + SMA_N)/N
The function writes the cumulative moving average values computed for the first n rows, where n is less
than N, to the output table.
The value alpha is specified by the Alpha syntax element. V is the new value.
With MAvgType ('C'), the MovingAverage function computes the arithmetic average of all the rows from the
beginning of the series with this formula:
CMA = (V_1 + V_2 + ... + V_N)/N
V_i is a value. N is the number of rows from the beginning of the data set.
MovingAverage Syntax
SELECT * FROM MovingAverage (
ON { table | view | (query) }
[ PARTITION BY partition_column [,...] ]
[ ORDER BY order_column [,...] ]
[ USING
[ MAvgType ({ 'C' | 'E' | 'M' | 'S' | 'T' | 'W' }) ]
[ TargetColumns ({'target_column'| 'target_column_range'}[,...])]
[ Alpha (alpha) ]
[ StartRows (n) ]
[ IncludeFirst ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ WindowSize (window_size) ]
]
) AS alias;
Note:
If the ON clause does not include the PARTITION BY and ORDER BY clauses, results
are nondeterministic.
Type Description
TargetColumns
[Optional] Specify the input column names for which to compute the moving average.
Default behavior: The function copies every input column to the output table but does not
compute any moving averages.
Alpha
[Optional with MAvgType E, otherwise ignored.] Specify the damping factor, a value in the
range [0, 1], which represents a percentage in the range [0, 100]. For example, if alpha is 0.2,
the damping factor is 20%. A higher alpha discounts older observations faster.
Default: 0.1
StartRows
[Optional with MAvgType E, otherwise ignored.] Specify the number of rows to skip before
calculating the exponential moving average. The function uses the arithmetic average of
these rows as the initial value of the exponential moving average. The value n must be
an integer.
Default: 2
IncludeFirst
[Ignored with MAvgType C, otherwise optional.] Specify whether to include the starting rows
in the output table. If you specify 'true', the output columns for the starting rows contain NULL,
because their moving average is undefined.
Default: 'false'
WindowSize
[Optional with MAvgType M, S, T, and W; otherwise ignored.] Specify the number of previous
values to consider when computing the new moving average. The data type of window_size
must be BYTEINT, SMALLINT, or INTEGER.
Minimum value: 3
Default: '10'
MovingAverage Input
Input Table Schema
Column Data Type Description
partition_column Any Column by which input data is partitioned. This column must
contain all rows of an entity. For example, if function is to
compute moving average of a particular stock share price, all
transactions of that stock must be in one partition.
PARTITION BY clause assumes column names are in
Normalization Form C (NFC).
MovingAverage Output
Output Table Schema
Column Data Type Description
MovingAverage Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
The input table, company1_stock, contains 25 observations of common stock closing prices.
company1_stock
id name period stockprice
This example computes a cumulative moving average for the price of stock.
Input
See MovingAverage Examples Input.
SQL Call
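A sketch of the call; partitioning by name and ordering by period follow the input table description:
SELECT * FROM MovingAverage (
    ON company1_stock PARTITION BY name ORDER BY period
    USING
    MAvgType ('C')
    TargetColumns ('stockprice')
) AS dt ORDER BY id;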
Output
id name period stockprice stockprice_cmavg
This example computes an exponential moving average for the price of stock.
Input
See MovingAverage Examples Input.
SQL Call
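A sketch of the call, with Alpha and StartRows shown at their default values. The modified, simple, triangular, and weighted examples that follow differ only in the MAvgType value ('M', 'S', 'T', 'W') and, where applicable, WindowSize:
SELECT * FROM MovingAverage (
    ON company1_stock PARTITION BY name ORDER BY period
    USING
    MAvgType ('E')
    TargetColumns ('stockprice')
    Alpha (0.1)
    StartRows (2)
    IncludeFirst ('true')
) AS dt ORDER BY id;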
Output
id name period stockprice stockprice_emavg
This example computes the modified moving average for the price of stock.
Input
See MovingAverage Examples Input.
SQL Call
Output
id name period stockprice stockprice_mmavg
This example computes a simple moving average for the price of stock.
Input
See MovingAverage Examples Input.
SQL Call
Output
id name period stockprice stockprice_smavg
This example computes the triangular moving average for the price of stock.
Input
See MovingAverage Examples Input.
SQL Call
IncludeFirst ('true')
) AS dt ORDER BY id;
Output
id name period stockprice stockprice_tmavg
This example computes the weighted moving average for the price of stock.
Input
See MovingAverage Examples Input.
SQL Call
Output
id name period stockprice stockprice_wmavg
TD_CategoricalSummary
TD_CategoricalSummary displays the distinct values and their counts for each specified input table column.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_CategoricalSummary Syntax
SELECT * FROM TD_CategoricalSummary (
ON { table | view | (query) } AS InputTable
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
) AS alias;
TD_CategoricalSummary Input
InputTable Schema
Column Data Type Description
TD_CategoricalSummary Output
Output Table Schema
Column Data Type Description
TD_CategoricalSummary Example
InputTable: cat_titanic_train
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
0 0 PC 17754 34.6542 A5 C
488 0 1 Kent; Mr. Edward Austin male 58 0 0 11771 29.7 B37 C
505 1 1 Maioni; Miss. Roberta female 16 0 0 110152 86.5 B79 S
631 1 1 Barkworth; Mr. Algernon Henry Wilson male 80 0 0 27042 30 A23 S
873 0 1 Carlsson; Mr. Frans Olof male 33 0 0 695 5 B51 B53 B55 S
SQL Call
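A minimal sketch of the call, based on the TD_CategoricalSummary syntax above; the choice of target columns is an assumption:
SELECT * FROM TD_CategoricalSummary (
    ON cat_titanic_train AS InputTable
    USING
    TargetColumns ('sex', 'embarked')
) AS dt;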
Output
TD_ColumnSummary
TD_ColumnSummary displays the following for each specified input table column:
• Column name
• Column data type
• Count of these values:
◦ Non-NULL
◦ NULL
◦ Blank (all space characters) (NULL for numeric data type)
◦ Zero (NULL for nonnumeric data type)
◦ Positive (NULL for nonnumeric data type)
◦ Negative (NULL for nonnumeric data type)
• Percentage of NULL values
• Percentage of non-NULL values
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_ColumnSummary Syntax
SELECT * FROM TD_ColumnSummary (
ON { table | view | (query) } AS InputTable
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
) AS alias;
TD_ColumnSummary Input
InputTable Schema
Column Data Type Description
TD_ColumnSummary Output
Output Table Schema
Column Data Type Description
TD_ColumnSummary Example
InputTable: col_titanic_train
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
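A minimal sketch of the call, based on the TD_ColumnSummary syntax above; the choice of target columns is an assumption:
SELECT * FROM TD_ColumnSummary (
    ON col_titanic_train AS InputTable
    USING
    TargetColumns ('age', 'fare', 'cabin')
) AS dt;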
Output
TD_GetRowsWithMissingValues
TD_GetRowsWithMissingValues displays the rows that have NULL values in the specified input
table columns.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
Related Information:
TD_GetRowsWithoutMissingValues
TD_GetRowsWithMissingValues Syntax
SELECT * FROM TD_GetRowsWithMissingValues (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ORDER BY order_column ] ]
[ USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
]
) AS alias;
Accumulate
[Optional]: Specify the input table column names to copy to the output table.
TD_GetRowsWithMissingValues Input
InputTable Schema
Column Data Type Description
accumulate_column Any The input table column names to copy to the output table.
TD_GetRowsWithMissingValues Output
Output Table Schema
Same as Input Table schema
TD_GetRowsWithMissingValues Example
InputTable: input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
Output
TD_Histogram
TD_Histogram calculates the frequency distribution of a data set using your choice of these methods:
• Sturges
• Scott
• Variable-width
• Equal-width
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_Histogram Syntax
SELECT * FROM TD_Histogram (
ON { table | view | (query) } AS InputTable
[ ON { table | view | (query) } AS MinMax DIMENSION ]
USING
MethodType ({ 'Sturges' | 'Scott' | 'Variable-Width' | 'Equal-Width' })
TargetColumn ('target_column')
[ NBins ('number_of_bins') ]
[ Inclusion ({ 'left' | 'right' }) ]
) AS alias;
Available Methods Description
Sturges w = r/(1 + log2(n))
where:
w = bin width
r = data value range
n = number of elements in data set
Sturges algorithm performs best if data is normally distributed and n is at least 30.
Variable-Width Requires MinMax table, which specifies the minimum value and the maximum value of the bin in column1 and column2 respectively, and the label of the bin in column3.
Maximum number of bins cannot exceed 3500.
Equal-Width Requires MinMax table, which specifies the minimum value of the bins in column1 and the maximum value of the bins in column2.
Algorithm for calculating bin width, w:
w = (max - min)/k
where:
min = minimum value of the bins
max = maximum value of the bins
k = number of intervals into which algorithm divides data set
Interval boundaries: min+w, min+2w, …, min+(k-1)w
TargetColumn
Specify the name of the InputTable column that contains the data set.
NBins
[Required with methods Variable-Width and Equal-Width, otherwise ignored.] Specify the
number of bins (number of data value ranges).
Inclusion
[Optional] Specify where to put data points that are on bin boundaries: in the bin to the left of the boundary or the bin to the right of the boundary.
Default: left
TD_Histogram Input
InputTable Schema
Column Data Type Description
TD_Histogram Output
Output Table Schema
Column Data Type Description
MinValue DOUBLE PRECISION Minimum values for bins or data value ranges.
MaxValue DOUBLE PRECISION Maximum values for bins or data value ranges.
TD_Histogram Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
• InputTable: hist_titanic_train
• MinMax: hist_titanic_train_dim
SQL Call
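A minimal sketch of the call, based on the TD_Histogram syntax above; the target column, method, and bin count are assumptions:
SELECT * FROM TD_Histogram (
    ON hist_titanic_train AS InputTable
    ON hist_titanic_train_dim AS MinMax DIMENSION
    USING
    MethodType ('Variable-Width')
    TargetColumn ('fare')
    NBins (5)
) AS dt;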
Output
TD_QQNorm
TD_QQNorm checks whether the values in the specified input table columns are normally distributed. The
function returns the quantiles of the column values and corresponding theoretical quantile values from a
normal distribution. If the column values are normally distributed, then the quantiles of column values and
normal quantile values appear in a straight line when plotted on a 2D graph.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_QQNorm Syntax
SELECT * FROM TD_QQNorm (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ ORDER BY order_column ] ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
RankColumns ({ 'rank_column' | rank_column_range }[,...])
[ OutputColumns ('output_column' [,...]) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;
RankColumns
Specify the names of the InputTable columns that contain the ranks for the target columns.
OutputColumns
[Optional] Specify names for the output table columns that contain the theoretical quantiles
of the target columns.
Default: target_column_theoretical_quantiles
Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.
TD_QQNorm Input
InputTable Schema
Column Data Type Description
TD_QQNorm Output
Output Table Schema
Column Data Type Description
TD_QQNorm Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
Output
TD_UnivariateStatistics
TD_UnivariateStatistics displays descriptive statistics for each specified numeric input table column.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_UnivariateStatistics Syntax
SELECT * FROM TD_UnivariateStatistics (
ON { table | view | (query) }
AS InputTable [ PARTITION BY ANY ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ PartitionColumns ('partition_column' [,...]) ]
PartitionColumns
[Optional] Specify the names of the InputTable columns on which to partition the input. The
function copies these columns to the output table.
Default behavior: The function treats all rows as a single partition.
Stats
[Optional] Specify the statistics to calculate. statistic is one of these:
• SUM
• COUNT or CNT
• MAXIMUM or MAX
• MINIMUM or MIN
• MEAN
• UNCORRECTED SUM OF SQUARES or USS
• NULL COUNT or NLC
• POSITIVE VALUES COUNT or PVC
• NEGATIVE VALUES COUNT or NVC
• ZERO VALUES COUNT or ZVC
• TOP5 or TOP
• BOTTOM5 or BTM
• RANGE or RNG
• GEOMETRIC MEAN or GM
• HARMONIC MEAN or HM
• VARIANCE or VAR
• STANDARD DEVIATION or STD
• STANDARD ERROR or SE
• SKEWNESS or SKW
• KURTOSIS or KUR
• COEFFICIENT OF VARIATION or CV
Centiles
[Optional] Specify the centile to calculate. percentile is an INTEGER in the range [1, 100].
The function ignores Centiles unless Stats specifies PERCENTILES, PRC, or ALL.
Default: 1, 5, 10, 25, 50, 75, 90, 95, 99
TrimPercentile
[Optional] Specify the trimmed lower percentile, an integer value in the range [1, 50].
The function calculates the mean of the values between the trimmed lower percentile
(trimmed_percentile) and trimmed upper percentile (1-trimmed_percentile).
The function ignores TrimPercentile unless Stats specifies TRIMMED MEAN, TM, or ALL.
Default: 20
TD_UnivariateStatistics Input
InputTable Schema
Column Data Type Description
TD_UnivariateStatistics Output
Output Table Schema
Column Data Type Description
partition_column Same as in InputTable Column copied from InputTable. Defines a partition for
statistics calculation.
StatsName VARCHAR [Column appears once for each specified statistic.] Statistic.
StatsValue DOUBLE PRECISION [Column appears once for each specified statistic.]
Statistic value.
TD_UnivariateStatistics Example
InputTable: titanic_train
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
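A minimal sketch of the call, based on the TD_UnivariateStatistics syntax elements above; the target columns and statistics are assumptions:
SELECT * FROM TD_UnivariateStatistics (
    ON titanic_train AS InputTable
    USING
    TargetColumns ('age', 'fare')
    Stats ('MEAN', 'STD', 'MIN', 'MAX')
) AS dt;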
Output
TD_WhichMax
TD_WhichMax displays all rows that have the maximum value in a specified input table column.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_WhichMax Syntax
SELECT * FROM TD_WhichMax (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ORDER BY order_by_column ] ]
USING
TargetColumn ('target_column')
) AS alias;
TD_WhichMax Input
InputTable Schema
Column Data Type Description
target_column Any except BLOB, CLOB, and UDT. Columns for which maximum values
are checked.
TD_WhichMax Output
Output Table Schema
Same as InputTable schema
TD_WhichMax Example
InputTable: titanic_dataset
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ------ --- ----- ----- ------- ----- --------
1 0 3 male 22 1 0 7.25 null S
2 1 1 female 38 1 0 71.28 C85 C
3 1 3 female 26 0 0 7.93 null S
4 1 1 female 35 1 0 53.10 C123 S
5 0 3 male 35 0 0 8.05 null S
SQL Call
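A minimal sketch of the call; TargetColumn ('fare') is inferred from the output, which returns the row with the highest fare:
SELECT * FROM TD_WhichMax (
    ON titanic_dataset AS InputTable
    USING
    TargetColumn ('fare')
) AS dt;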
Output
passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ------ --- ----- ----- ------- ----- --------
2 1 1 female 38 1 0 71.28 C85 C
TD_WhichMin
TD_WhichMin displays all rows that have the minimum value in the specified input table column.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_WhichMin Syntax
SELECT * FROM TD_WhichMin (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ORDER BY order_by_column ] ]
USING
TargetColumn ('target_column')
) AS alias;
TD_WhichMin Input
InputTable Schema
Column Data Type Description
target_column Any except BLOB, CLOB, and UDT. Columns for which minimum values are checked.
TD_WhichMin Output
Output Table Schema
Same as InputTable schema
TD_WhichMin Example
InputTable: titanic_dataset
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ------ --- ----- ----- ------- ----- --------
1 0 3 male 22 1 0 7.25 null S
2 1 1 female 38 1 0 71.28 C85 C
3 1 3 female 26 0 0 7.93 null S
4 1 1 female 35 1 0 53.10 C123 S
5 0 3 male 35 0 0 8.05 null S
SQL Call
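A minimal sketch of the call; TargetColumn ('fare') is inferred from the output, which returns the row with the lowest fare:
SELECT * FROM TD_WhichMin (
    ON titanic_dataset AS InputTable
    USING
    TargetColumn ('fare')
) AS dt;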
Output
passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ---- --- ----- ----- --------- ----- --------
1 0 3 male 22 1 0 7.25 null S
Feature engineering transform functions encapsulate variable transformations during the training phase so
you can chain them to create a pipeline for operationalization.
Each TD_nameFit function outputs a table to input to the TD_nameTransform function as FitTable. For
example, TD_BinCodeFit outputs a FitTable for TD_BinCodeTransform.
Antiselect
Antiselect returns all columns except those specified in the Exclude syntax element.
Note:
• This function requires the UTF8 client character set for UNICODE data.
Antiselect Syntax
SELECT * FROM Antiselect (
ON { table | view | (query) }
USING
Exclude ({ 'exclude_column' | exclude_column_range }[,...])
) AS alias;
'start_column:end_column' [, '-exclude_in-range_column' ]
name with double quotation marks. For example, if the column name is a*b, specify it as
"a*b". A column name cannot contain a double quotation mark.
• Nonnegative integers that represent the indexes of columns in the table (for
example, '[0:4]')
The first column has index 0; therefore, '[0:4]' specifies the first five columns in
the table.
• Empty. For example:
◦ '[:4]' specifies all columns up to and including the column with index 4.
◦ '[4:]' specifies the column with index 4 and all columns after it.
◦ '[:]' specifies all columns in the table.
The exclude_in-range_column is a column in the specified range, represented by either its
name or its index (for example, '[0:99]', '-[50]', '-column10' specifies the columns
with indexes 0 through 99, except the column with index 50 and column10).
Column ranges cannot overlap, and cannot include any specified exclude_column.
Antiselect Input
The input table can have any schema.
Antiselect Output
The output table has all input table columns except those specified by the Exclude syntax element.
Antiselect Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
The input table, antiselect_test, is a sample set of sales data containing 13 columns.
antiselect_test
sno id orderdate priority qty sales disct dmode custname province region cust
49 293 2012-10-01 00:00:00 High 49 10123 0.07 Delivery Truck Barry French Nunavut Nunavut Con
50 293 2012-10-01 00:00:00 High 27 244.57 0.01 Regular Air Barry French Nunavut Nunavut Con
80 483 2011-07-10 00:00:00 High 30 4965.76 0.08 Regular Air Clay Rozendal Nunavut Nunavut Corp
85 515 2010-08-28 00:00:00 Not specified 19 394.27 0.08 Regular Air Carlos Soltero Nunavut Nunavut Con
86 515 2010-08-28 00:00:00 Not specified 21 146.69 0.05 Regular Air Carlos Soltero Nunavut Nunavut Con
97 613 2011-06-17 00:00:00 High 12 93.54 0.03 Regular Air Carl Jackson Nunavut Nunavut Corp
SQL Call
Output
sno priority qty sales dmode custname region prodcat
Input
The input table is antiselect_test, as in Antiselect Example: No Column Ranges.
SQL Call
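A minimal sketch of the call; the Exclude ranges are inferred from the output columns (sno, qty, sales, disct, dmode) and the column indexes of antiselect_test:
SELECT * FROM Antiselect (
    ON antiselect_test
    USING
    Exclude ('[1:3]', '[8:]')
) AS dt;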
Output
sno qty sales disct dmode
TD_BinCodeFit
TD_BinCodeFit outputs a table of information to input to TD_BinCodeTransform, which bin-codes the
specified input table columns.
Bin-coding is typically used to convert numeric data to categorical data by binning the numeric data into
multiple numeric bins (intervals). The bins can have a fixed-width with auto-generated labels or can have
specified variable widths and labels.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_BinCodeFit Syntax
For Equal-Width Bins with Generated Labels
) AS alias
) WITH DATA;
TargetColumns
Specify the names of the InputTable columns to bin-code.
The maximum number of target columns is 2018.
MethodType
Specify the bin-coding method:
MethodType Description
equal-width Bins have fixed, equal widths with auto-generated labels.
variable-width Bins have specified variable widths and specified labels, provided by FitInput table.
Maximum number of bins is 3000.
NBins
[MethodType ('equal-width') only.] Specify either a single bin value for all the target columns
or separate bin values for each of the target columns.
LabelPrefix
[MethodType ('equal-width') only.][Optional] Specify either a prefix for all the target columns
or a separate prefix for each of the target columns.
Lower Bin Boundary Upper Bin Boundary prefix-Generated Bin Label prefix_count-Generated Bin Label
MinValueColumn
[MethodType ('variable-width') only.][Optional] Specify the name of the FitInput column that
has the minimum value of the bin (lower bin boundaries).
Default: MinValue
MaxValueColumn
[MethodType ('variable-width') only.][Optional] Specify the name of the FitInput column that
has the maximum value of the bin (upper bin boundaries).
Default: MaxValue
LabelColumn
[MethodType ('variable-width') only.][Optional] Specify the name of the FitInput column that
has the bin labels.
Default: Label
TargetColNames
[MethodType ('variable-width') only.][Optional] Specify the name of the FitInput column that
has the target column names.
Note:
Column range is not supported
TD_BinCodeFit Input
InputTable Schema
Column Data Type Description
FitTable Schema
Required if you specify MethodType ('variable-width'); ignored otherwise.
TD_BinCodeFit Output
Output Table Schema
Column Data Type Description
OutputTable Schema
The function outputs this secondary output table only if you specify MethodType ('equal-width').
TD_BinCodeFit Example
InputTable: bin_titanic_train
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
Output
TD_BinCodeTransform
TD_BinCodeTransform bin-codes input table columns, using TD_BinCodeFit output.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_BinCodeTransform Syntax
SELECT * FROM TD_BincodeTransform (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;
TD_BinCodeTransform Input
InputTable Schema
See TD_BinCodeFit Input.
FitTable Schema
See TD_BinCodeFit Output.
TD_BinCodeTransform Output
Output Table Schema
Column Data Type Description
TD_BinCodeTransform Example
• InputTable: bin_titanic_train, as in TD_BinCodeFit Example
• FitTable: FitOutputTable, created by TD_BinCodeFit Example
SQL Call
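A minimal sketch of the call, using the tables listed above; Accumulate ('passenger') is inferred from the output:
SELECT * FROM TD_BinCodeTransform (
    ON bin_titanic_train AS InputTable
    ON FitOutputTable AS FitTable DIMENSION
    USING
    Accumulate ('passenger')
) AS dt ORDER BY passenger;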
Output
passenger age
--------- ----------
873 Middle Age
631 Old Age
505 Young Age
488 Old Age
97 Old Age
TD_ColumnTransformer
The TD_ColumnTransformer function transforms the input table columns in a single operation. You provide only the FIT tables to the function, and the function runs all the transformations that you require in one pass.
The function performs the following transformations:
• TD_Scale Transform
• TD_Bincode Transform
• TD_Function Transform
• TD_NonLinearCombine Transform
• TD_OutlierFilter Transform
• TD_PolynomialFeatures Transform
• TD_RowNormalize Transform
• TD_OrdinalEncoding Transform
• TD_OneHotEncoding Transform
• TD_SimpleImpute Transform
You must create the FIT tables before using the function, and you must provide the FIT tables in the same order as in the training data sequence to transform the dataset. A FIT table can have a maximum of 128 columns.
Note:
The TD_BincodeFit function has a maximum of 5 columns when using the variable-width method.
TD_ColumnTransformer Syntax
SELECT * FROM TD_ColumnTransformer (
ON { table | view | (query) } AS InputTable
[ ON { table | view | (query) } AS BincodeFitTable DIMENSION ]
[ ON { table | view | (query) } AS FunctionFitTable DIMENSION ]
[ ON { table | view | (query) } AS NonLinearCombineFitTable DIMENSION ]
[ ON { table | view | (query) } AS OneHotEncodingFitTable DIMENSION ]
[ ON { table | view | (query) } AS OrdinalEncodingFitTable DIMENSION ]
[ ON { table | view | (query) } AS OutlierFilterFitTable DIMENSION ]
[ ON { table | view | (query) } AS PolynomialFeaturesFitTable DIMENSION ]
[ ON { table | view | (query) } AS RowNormalizeFitTable DIMENSION ]
[ ON { table | view | (query) } AS ScaleFitTable DIMENSION ]
[ ON { table | view | (query) } AS SimpleImputeFitTable DIMENSION ]
USING
[ FillRowIDColumnName ('output_column_name') ]
) AS dt;
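As an illustration only, a call that applies a previously created scale fit table and one-hot encoding fit table in one pass might look like the following; the input and fit table names are hypothetical:
SELECT * FROM TD_ColumnTransformer (
    ON titanic_train AS InputTable
    ON scale_fit_tbl AS ScaleFitTable DIMENSION
    ON onehot_fit_tbl AS OneHotEncodingFitTable DIMENSION
) AS dt;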
TD_ColumnTransformer Input
Column Data Type Description
TargetColumn CHAR or VARCHAR for categorical columns; INTEGER, REAL, DECIMAL, or NUMBER for numeric columns The input table columns that require transformation based on the FIT table.
Functions with categorical columns:
• TD_OrdinalEncoding Fit
• TD_OneHotEncoding Fit
Functions with numeric columns:
• TD_Scale Fit
• TD_Bincode Fit
• TD_Function Fit
• TD_NonLinearCombine Fit
• TD_OutlierFilter Fit
• TD_PolynomialFeatures Fit
• TD_RowNormalize Fit
• TD_SimpleImpute Fit
TD_ColumnTransformer Output
Column Data Type Description
otherColumns CHAR or VARCHAR for categorical columns; INTEGER, REAL, DECIMAL, or NUMBER for numeric columns The default columns copied from input to output.
TD_ColumnTransformer Example
Input Table: titanic_train
PassengerID Pclass Name Sex Age SibSp Parch Fare Cabin Embarked
SQL Call
Output
5 888 1 1 2
19 0 0 112053 5.85561002574126E-002
2 B 1.00000000000000E 000 0 1 0
0
5 889 0 3 2
28 1 2 W./C. 6607 4.57713517012109E-002
2 ? 4.00000000000000E 000 0 0 0
1
TD_FunctionFit
TD_FunctionFit determines whether specified numeric transformations can be applied to specified input
columns and outputs a table to use as input to TD_FunctionTransform, which does the transformations.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
Related Information:
TD_NumApply
TD_FunctionFit Syntax
CREATE TABLE output_table AS (
SELECT * FROM TD_FunctionFit (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS TransformationTable DIMENSION
) AS alias
) WITH DATA;
TD_FunctionFit Input
InputTable Schema
Column Data Type Description
input_column VARCHAR (CHARACTER SET LATIN or UNICODE) or NUMERIC Column whose name can appear as TargetColumn in TransformationTable.
TransformationTable Schema
Column Data Type Description
Transformations
Transformation Parameter Operation on TargetColumn Value x
EXP None e^x (e = 2.718)
TD_FunctionFit Output
Output Table Schema
Same as TransformationTable schema (see TD_FunctionFit Input).
TD_FunctionFit Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
InputTable: function_input_table
SQL Call
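A minimal sketch of the call, following the CREATE TABLE pattern in the syntax above; the TransformationTable name is hypothetical, and fit_out is the FitTable name referenced by the TD_FunctionTransform example:
CREATE TABLE fit_out AS (
    SELECT * FROM TD_FunctionFit (
        ON function_input_table AS InputTable
        ON function_transformation_table AS TransformationTable DIMENSION
    ) AS dt
) WITH DATA;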
Output
TD_FunctionTransform
TD_FunctionTransform applies numeric transformations to input columns, using TD_FunctionFit output.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_FunctionTransform Syntax
SELECT * FROM TD_FunctionTransform (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ IDColumns ({ 'id_column' | id_column_range }[,...])]
) AS alias;
TD_FunctionTransform Input
InputTable Schema
See TD_FunctionFit Input.
FitTable Schema
See TD_FunctionFit Output.
TD_FunctionTransform Output
Output Table Schema
Column Data Type Description
TD_FunctionTransform Example
Input
• InputTable: titanic_data, as in TD_FunctionFit Example
• FitTable: fit_out, created by TD_FunctionFit Example
SQL Call
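A minimal sketch of the call, using the tables listed above; IDColumns ('passenger') is an assumption:
SELECT * FROM TD_FunctionTransform (
    ON titanic_data AS InputTable
    ON fit_out AS FitTable DIMENSION
    USING
    IDColumns ('passenger')
) AS dt;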
Output
TD_NonLinearCombineFit
The TD_NonLinearCombineFit function returns the target columns and a specified formula that defines a nonlinear combination of the existing features.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_NonLinearCombineFit Syntax
SELECT * FROM TD_NonLinearCombineFit (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable(output_table_name) ]
USING
TargetColumns ({'target_column' | 'target_column_range'}[,...])
Formula ('Y = <expression>')
ResultColumn ('result_column')
) as alias;
Formula
[Required] Specify the formula. See the Arithmetic, Trigonometric, Hyperbolic Operators/
Functions section in the SQL Functions, Operators, Expressions, and Predicates Guide.
ResultColumn
[Required] Specify the name of the new feature column generated by the Transform function.
The Fit function saves the specified formula in this column.
TD_NonLinearCombineFit Input
Input Table Schema
Column Data Type Description
TD_NonLinearCombineFit Output
Output Table Schema
Column Data Type Description
ResultColumn VARCHAR CHARACTER SET UNICODE The Fit function saves the specified formula in this column.
TD_NonLinearCombineFit Example
Input table
SQL Call
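A sketch of the call for illustration only; the target columns and formula are assumptions (assuming the formula references the nth target column as Xn) and do not necessarily reproduce the output values shown below, and the output table name is hypothetical:
SELECT * FROM TD_NonLinearCombineFit (
    ON titanic_dataset AS InputTable
    OUT VOLATILE TABLE OutputTable (nonlinearcombine_fit)
    USING
    TargetColumns ('sibsp', 'parch', 'fare')
    Formula ('Y = (X0 + X1 + 1) * X2')
    ResultColumn ('TotalCost')
) AS dt;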
Output Table
TD_NonLinearCombineTransform
TD_NonLinearCombineTransform generates the values of the new feature using the specified formula from
the TD_NonLinearCombineFit function output.
TD_NonLinearCombineTransform Syntax
SELECT * FROM TD_NonLinearCombineTransform (
ON { table | view | (query) } AS InputTable
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ Accumulate ({'accumulate_column' | 'accumulate_column_range'}[,...]) ]
) as alias;
TD_NonLinearCombineTransform Input
Input Table Schema
Column Data Type Description
TD_NonLinearCombineTransform Output
Output Table Schema
Column Data Type Description
AccumulateColumns Same as Input The specified column names in the Accumulate element copied
to the output table.
ResultColumn REAL The values calculated using the specified formula are displayed.
TD_NonLinearCombineTransform Example
InputTable
See Input table and Output table sections of TD_NonLinearCombineFit Example
SQL Call
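A minimal sketch of the call; the FitTable name follows the hypothetical name used in the TD_NonLinearCombineFit sketch, and Accumulate ('passenger') is inferred from the output:
SELECT * FROM TD_NonLinearCombineTransform (
    ON titanic_dataset AS InputTable
    ON nonlinearcombine_fit AS FitTable DIMENSION
    USING
    Accumulate ('passenger')
) AS dt ORDER BY passenger;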
Output Table
passenger TotalCost
--------- -------------
1 14.50000
2 213.84000
3 7.93000
4 106.20000
5 16.10000
TD_OneHotEncodingFit
TD_OneHotEncodingFit outputs a table of attributes and categorical values to input to
TD_OneHotEncodingTransform, which encodes them as one-hot numeric vectors.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_OneHotEncodingFit Syntax
CREATE TABLE fit_table AS (
SELECT * FROM TD_OneHotEncodingFit (
ON { table | view | (query) } AS InputTable
[ PARTITION BY { ANY [ ORDER BY order_column ] | attribute_column } ]
USING
IsInputDense ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
{ for_dense_input | for_sparse_input }
) AS alias
) WITH DATA;
for_dense_input
TargetColumn ('target_column')
CategoricalValues ('categorical_value' [,...])
[ OtherColumnName ('other_column') ]
for_sparse_input
AttributeColumn ('attribute_column')
ValueColumn ('value_column')
TargetAttributes ('target_attribute' [,...])
[ OtherAttributeNames ('other_attribute' [,...]) ]
TargetColumn
[Required with IsInputDense ('true'), disallowed otherwise.] Specify the name of the
InputTable column of categorical values.
CategoricalValues
[Required with IsInputDense ('true'), disallowed otherwise.] Specify one or more categorical
values in target_column to encode in one-hot form.
OtherColumnName
[Optional with IsInputDense ('true'), disallowed otherwise.] Specify a category name for
values that CategoricalValues does not specify (categorical values not to encode in one-
hot form).
Default: 'other'
AttributeColumn
[Required with IsInputDense ('false'), disallowed otherwise.] Specify the name of the
InputTable column of attributes.
ValueColumn
[Required with IsInputDense ('false'), disallowed otherwise.] Specify the name of the
InputTable column of attribute values.
TargetAttributes
[Required with IsInputDense ('false'), disallowed otherwise.] Specify one or more attributes
to encode in one-hot form. Every target_attribute must be in attribute_column.
OtherAttributeNames
[Optional with IsInputDense ('false'), disallowed otherwise.] For each target_attribute,
specify a category name (other_attribute) for attributes that TargetAttributes does not specify.
The nth other_attribute corresponds to the nth target_attribute.
TD_OneHotEncodingFit Input
InputTable Schema for Dense Input
Column Data Type Description
TD_OneHotEncodingFit Output
Output Table Schema for Dense Input
Column Data Type Description
TD_OneHotEncodingFit Example
InputTable: input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
IsInputDense ('true')
) AS dt
) WITH DATA;
Output
TD_OneHotEncodingTransform
TD_OneHotEncodingTransform encodes specified attributes and categorical values as one-hot numeric
vectors, using TD_OneHotEncodingFit output.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_OneHotEncodingTransform Syntax
SELECT * FROM TD_OneHotEncodingTransform (
ON { table | view | (query) } AS InputTable
[ PARTITION BY { ANY [ ORDER BY order_column ] | attribute_column } ]
ON { table | view | (query) } AS FitTable
{ DIMENSION | PARTITION BY attribute_column }
USING
IsInputDense ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
) AS alias;
TD_OneHotEncodingTransform Input
InputTable Schema
See TD_OneHotEncodingFit Input.
FitTable Schema
See TD_OneHotEncodingFit Output.
TD_OneHotEncodingTransform Output
Output Table Schema for Dense Input
Column Data Type Description
TD_OneHotEncodingTransform Example
• InputTable: onehotencoding_input, as in TD_OneHotEncodingFit Example
• FitTable: onehotencodingfit_output, created by TD_OneHotEncodingFit Example
SQL Call
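A minimal sketch of the call, using the tables listed above and the dense-input setting from the fit example:
SELECT * FROM TD_OneHotEncodingTransform (
    ON onehotencoding_input AS InputTable
    ON onehotencodingfit_output AS FitTable DIMENSION
    USING
    IsInputDense ('true')
) AS dt;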
Output
TD_OrdinalEncodingFit
The TD_OrdinalEncodingFit function identifies distinct categorical values from the input table or a user-
defined list and returns the distinct categorical values along with the ordinal value for each category.
TD_OrdinalEncodingFit Syntax
SELECT * FROM TD_ORDINALENCODINGFIT (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable (output_table_name) ]
USING
TargetColumn ('target_column')
[{
Approach ('LIST')
Categories ('category'[,...])
[ OrdinalValues (ordinal_value[,...]) ]
}|
{
[ Approach ('AUTO') ]
}]
[ StartValue (start_value) ]
[ DefaultValue (default_value) ]
) as alias;
Note:
Only one column is supported.
Approach
[Optional] Specify AUTO to obtain categories from the input table or specify LIST to obtain
categories from the user.
Default value: AUTO
Categories
[Required, when you use the LIST approach] Specify the list of categories for encoding in the
required order.
Note:
The maximum length supported for the categorical value is 128 characters.
OrdinalValues
[Optional] Specify the custom ordinal values when you use the LIST approach for encoding
the categorical values.
If you provide neither the ordinal values nor the start value, then by default the first category is assigned the value 0 and the last category is assigned a value one less than the total number of categories. For example, if there are three categories, the categories are assigned the values 0, 1, and 2 respectively.
However, if you only specify the ordinal values, then each ordinal value is associated with a
categorical value. For example, if there are three categories and the ordinal values are 3, 4,
5 then the ordinal values are assigned to the respective categories.
The TD_OrdinalEncodingFit function returns an error when the ordinal value count does
not match the categorical value count or if both the ordinal values and the start value
are provided.
Note:
You can use either the OrdinalValues argument or the StartValue argument, but not both.
StartValue
[Optional] Specify the starting value for the ordinal values list.
Default value: 0
DefaultValue
[Optional] Specify the ordinal value to use when the categorical value is not found.
TD_OrdinalEncodingFit Input
InputTable Schema
Column Data Type Description
TargetColumn CHAR or VARCHAR CHARACTER SET LATIN/UNICODE The input table column name for encoding the categorical values.
TD_OrdinalEncodingFit Output
Output Table Schema
Column Data Type Description
TargetColumn VARCHAR CHARACTER SET UNICODE The distinct categorical values from the input table or the user-defined list.
TD_OrdinalEncodingFit Example
Input: titanic_dataset
passenger survived pclass sex age sibsp parch fare cabin embarked
--------- -------- ------ ------ --- ----- ----- ------------ ----- --------
1 0 3 male 22 1 0 7.250000000 null S
2 1 1 female 38 1 0 71.283300000 C85 C
3 1 3 female 26 0 0 7.925000000 null S
4 1 1 female 35 1 0 53.100000000 C123 S
5 0 3 male 35 0 0 8.050000000 null S
SQL Call
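A minimal sketch of the call; TargetColumn ('sex') and DefaultValue (-1) are inferred from the output table, and the output table name is hypothetical:
SELECT * FROM TD_OrdinalEncodingFit (
    ON titanic_dataset AS InputTable
    OUT VOLATILE TABLE OutputTable (ordinalfit_out)
    USING
    TargetColumn ('sex')
    DefaultValue (-1)
) AS dt;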
Output Table
sex TD_VALUE_ORDFIT
----------------- ---------------
female 0
male 1
TD_OTHER_CATEGORY -1
TD_OrdinalEncodingTransform
The TD_OrdinalEncodingTransform function maps the categorical value to a specified ordinal value using
the TD_OrdinalEncodingFit output.
TD_OrdinalEncodingTransform Syntax
SELECT * FROM TD_OrdinalEncodingTransform (
ON { table | view | (query) } as InputTable
ON { table | view | (query) } as FitTable DIMENSION
USING
[ Accumulate ({'accumulate_column' | 'accumulate_column_range'}[,...]) ]
) as alias;
TD_OrdinalEncodingTransform Input
Input Table Schema
Column Data Type Description
TargetColumn CHAR or VARCHAR CHARACTER SET LATIN/UNICODE The input table or user-defined column names for encoding the categorical values.
Accumulate Any The input table column names that you want to copy to the output table.
FitTable Schema
Column Data Type Description
TargetColumn VARCHAR CHARACTER SET UNICODE The column that has the distinct categories obtained using the AUTO or the LIST approach.
TD_OrdinalEncodingTransform Output
Output Table Schema
Column Data Type Description
TargetColumn INTEGER The target column with encoded ordinal values from the Fit table. The TD_OTHER_CATEGORY row with the specified ordinal value from the Fit table is also displayed.
Accumulate Any The specified column names in the Accumulate element are copied to the output table.
TD_OrdinalEncodingTransform Example
InputTable Schema
See Input table and Output table sections of TD_OrdinalEncodingFit Example.
SQL Call
USING
Accumulate ('passenger')
) as dt order by passenger;
Output Table
passenger sex
--------- ---
1 1
2 0
3 0
4 0
5 1
TD_PolynomialFeaturesFit
The TD_PolynomialFeaturesFit function stores all the specified argument values in tabular format. All polynomial combinations of the features with degree less than or equal to the specified degree are generated. For example, for a 2-D input sample [x, y], the degree-2 polynomial features are [1, x, y, x-squared, xy, y-squared].
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_PolynomialFeaturesFit Syntax
SELECT * FROM TD_PolynomialFeaturesFit (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable (output_table) ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ IncludeBias ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ InteractionOnly ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Degree (degree) ]
) AS alias;
TargetColumns
Specify the names of the InputTable columns for which to output polynomial combinations for
features (no more than five).
IncludeBias
[Optional] Specify whether the output table is to include a bias column for the feature in which
all polynomial powers are zero (that is, a column of ones). A bias column acts as an intercept
term in a linear model.
Default: true
InteractionOnly
[Optional] Specify whether to output polynomial combinations only for interaction features
(features that are products of at most degree distinct input features).
Default: false
Degree
[Optional] Specify the maximum degree of the input features for which to output polynomial combinations: 1, 2, or 3.
Default: 2
TD_PolynomialFeaturesFit Input
InputTable Schema
Column Data Type Description
target_column NUMERIC Column for which to output polynomial combinations for features.
TD_PolynomialFeaturesFit Output
OutputTable Schema
Column Data Type Description
TD_PolynomialFeaturesFit Example
InputTable: polynomialFeaturesFit_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
) AS dt
) WITH DATA;
Output
TD_PolynomialFeaturesTransform
The TD_PolynomialFeaturesTransform function extracts the values of the arguments TargetColumns, Degree, IncludeBias, and InteractionOnly from the output of the TD_PolynomialFeaturesFit function and generates a feature matrix of all polynomial combinations of the features.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_PolynomialFeaturesTransform Syntax
SELECT * FROM TD_PolynomialFeaturesTransform (
ON { table | view | (query) } AS InputTable PARTITION BY ANY
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;
Note:
If two or more column names are concatenated and the column name exceeds 128
characters, then the function replaces the actual column names with names such as
col1, col2, col3, col4, col5 in the output.
TD_PolynomialFeaturesTransform Input
InputTable Schema
See TD_PolynomialFeaturesFit Input.
FitTable Schema
See TD_PolynomialFeaturesFit Output.
TD_PolynomialFeaturesTransform Output
Output Table Schema
Column Data Type Description
One DOUBLE [Column appears only with IncludeBias ('true').] Column for
PRECISION feature in which all polynomial powers are zero (column of
ones).
TD_PolynomialFeaturesTransform Example
• InputTable: polynomialFeatures, as in TD_PolynomialFeaturesFit Example
• FitTable: polynomialFit, created by TD_PolynomialFeaturesFit Example
SQL Call
Accumulate ('[0:0]')
) AS dt;
Output
TD_RandomProjectionMinComponents
The TD_RandomProjectionMinComponents function calculates the minimum number of components required for applying RandomProjection to the given dataset for the specified Epsilon (distortion) parameter value. The function estimates the minimum value of the NumComponents argument in the TD_RandomProjectionFit function for a given dataset. The function uses the Johnson-Lindenstrauss lemma to calculate the value.
TD_RandomProjectionMinComponents Syntax
SELECT * FROM TD_RandomProjectionMinComponents(
ON {table | view | (query)} as InputTable
USING
TargetColumns({'target_column' | 'target_column_range'} [,...])
[ Epsilon(epsilon_value) ]
) as dt;
Epsilon
[Optional]: Specify a value to control distortion introduced while projecting the data to a lower
dimension. The amount of distortion increases if you increase the value.
Default Value: 0.1
Allowed Values: Between 0 and 1
TD_RandomProjectionMinComponents Input
Input Table Schema
Column Data Type Description
TD_RandomProjectionMinComponents Output
Output Table Schema
Column Data Type Description
TD_RandomProjectionMinComponents Example
Input Table
Each of the input data points in the following input table consists of 963 columns and a unique identifier of
the data point:
SQL Call
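A minimal sketch of the call; the input table name and the target column range are assumptions (an id column followed by 963 feature columns):
SELECT * FROM TD_RandomProjectionMinComponents (
    ON randomprojection_input AS InputTable
    USING
    TargetColumns ('[1:963]')
    Epsilon (0.1)
) AS dt;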
Output table
randomprojection_mincomponents
------------------------------
353
TD_RandomProjectionFit
The TD_RandomProjectionFit function returns a random projection matrix based on the
specified arguments.
The function also returns the required parameters for transforming the input data into lower-dimensional
data. The TD_RandomProjectionTransform function uses the TD_RandomProjectionFit output to reduce
the dimensionality of the input data.
TD_RandomProjectionFit Syntax
SELECT * FROM TD_RandomProjectionFit
(ON {table | view | (query)} as InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE OutputTable(out_table_name) ]
USING
TargetColumns({'target_column' | 'target_column_range'} [,...])
NumComponents(num_components)
[ Seed(seed_value) ]
[ Epsilon(epsilon_value) ]
[ ProjectionMethod({'GAUSSIAN' | 'SPARSE'}) ]
[ Density(density_value) ]
[ OutputFeatureNamesPrefix('output_feature_names_prefix') ])
as dt;
NumComponents
[Required]: Specify the target dimension (number of features) on which the data points from
the original dimension are projected.
The NumComponents value cannot be greater than the original dimension (number
of features) and must satisfy the Johnson-Lindenstrauss Lemma result. The
minimum value allowed for the NumComponents argument is calculated using the
TD_RandomProjectionMinComponents function.
Seed
[Optional]: Specify the random seed the algorithm uses for repeatable results. The algorithm
uses the seed to generate a random projection matrix. The seed must be a non-negative
integer value.
Default behavior: A random seed is used to generate the random projection matrix, so the output is nondeterministic.
Epsilon
[Optional]: Specify a value to control distortion introduced while projecting the data to a lower
dimension. The amount of distortion increases if you increase the value.
Default Value: 0.1
Allowed Values: Between 0 and 1
ProjectionMethod
[Optional]: Specify the method name for generating the random projection matrix.
Default Value: GAUSSIAN
Allowed Values: [GAUSSIAN, SPARSE]
Density
[Optional]: Specify the approximate ratio of non-zero elements in the random projection
matrix when SPARSE is used as the projection method.
Default Value: 0.33333333
Allowed Values: 0 < Density <= 1
OutputFeatureNamesPrefix
[Optional]: Specify the prefix for the output column names.
Default Value: td_rpj_feature
TD_RandomProjectionFit Input
Input Table Schema
Column Data Type Description
TD_RandomProjectionFit Output
Output Table Schema
Column Data Type Description
Target_column REAL The columns that have the elements of the Random
Projection Matrix.
TD_RandomProjectionFit Example
Input Table
Each of the input data points in the following input table consists of 963 columns and a unique identifier of
the data point:
0.840019 -19.589981
2 -0.640002 -0.65
-0.400002 0.66
3 -2.350006
1.260009 .... -1.760009
3.740021
4 0.109997 0
0.040001 0.540001
5 0.459999
1.77 .... 1.130005
0.309998
6 0.45
0.460001 -0.06
-0.11
7 0.18
0.220001 .... 0.330002
1.150001
8 0.73 0.369999
0.090001 -0.110001
9 0.899997
0.700001 .... -0.220001
0.159996
10 0.36 0.909996
1.070003 1.050003
SQL Call
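A minimal sketch of the call; the table names, column range, and seed are assumptions, and NumComponents (353) follows the TD_RandomProjectionMinComponents result above:
SELECT * FROM TD_RandomProjectionFit (
    ON randomprojection_input AS InputTable
    OUT VOLATILE TABLE OutputTable (randomprojection_fit)
    USING
    TargetColumns ('[1:963]')
    NumComponents (353)
    Seed (10)
    Epsilon (0.1)
) AS dt;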
Output Table
1 0.076606803
0.033381927 .... 0.047328448 0.039908753
2 -0.100981365
0.026625478 .... -0.035816093 -0.009469693
.... .... ....
.... .... ....
.... .... ....
.... .... ....
351
0.039755885 -0.040469189 .... 0.046338113
0.018347439
352 0.040278453
0.072748211 .... -0.04456492 -0.031728421
TD_RandomProjectionTransform
The TD_RandomProjectionTransform function converts the high-dimensional input data to a lower-
dimensional space using the TD_RandomProjectionFit function output.
TD_RandomProjectionTransform Syntax
SELECT * FROM TD_RandomProjectionTransform(
ON {table | view | (query)} as InputTable
ON {table | view | (query)} as FitTable DIMENSION
USING
[ Accumulate({'accumulate_column' | 'accumulate_column_range'} [,...]) ]
) as dt;
TD_RandomProjectionTransform Input
Input Table Schema
Column Data Type Description
Numeric, Float, Real, Double precision
accumulate_column ANY The input table columns that you want to copy to the output table.
TD_RandomProjectionTransform Output
Output Table Schema
Column Data Type Description
OutputFeatureNamesPrefix_i REAL The rendered columns after converting the data points to
lower-dimensional space wherein i is the sequence number
of the generated column.
TD_RandomProjectionTransform Example
Input Table
Each of the input data points in the following input table consists of 963 columns and a unique identifier of
the data point:
SQL Call
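A minimal sketch of the call; the table names follow the hypothetical names used in the TD_RandomProjectionFit sketch, and Accumulate ('id') is an assumption:
SELECT * FROM TD_RandomProjectionTransform (
    ON randomprojection_input AS InputTable
    ON randomprojection_fit AS FitTable DIMENSION
    USING
    Accumulate ('id')
) AS dt;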
Output Table
5.545038445 -18.76927716
-4.260967918 1.729951186
2
0.38584145 -1.436265087 .... -0.66632
7415 -0.820435756
3 -2.173481389 -1.528033248
2.975176363 -4.719261735
4 -0.193832387 -1.093648166 ....
0.07893702 0.923864022
5 -1.542301006 -1.794068037
-0.795435421 0.797221691
6 -0.344011761 -0.344862717 ....
0.13398249 0.088007141
7
0.899506286 -1.592144274
-1.225987822 0.192506954
8
0.446963102 -0.736407073 ....
0.133092795 0.192294245
9 -3.02747734 -3.933652254
-0.086266401 -2.545520414
10 1.937290638
0.498908831 .... -0.393961865 -0.88549
645
TD_RowNormalizeFit
TD_RowNormalizeFit outputs a table of parameters and specified input columns to input to
TD_RowNormalizeTransform, which normalizes the input columns row-wise.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_RowNormalizeFit Syntax
SELECT * FROM TD_RowNormalizeFit (
ON { table | view | (query) } AS InputTable
TargetColumns
Specify the names of the InputTable columns to normalize row-wise.
Approach
[Optional] Specify the normalization method:
Option Normalizing Formula
BaseColumn
[Required with Approach ('INDEX'), ignored otherwise.] Specify the name of the InputTable
column that has the B values to use in the normalizing formula.
BaseValue
[Required with Approach ('INDEX'), ignored otherwise.] Specify the V value to use in the
normalizing formula.
TD_RowNormalizeFit Input
InputTable Schema
Column Data Type Description
TD_RowNormalizeFit Output
OutputTable Schema
Column Data Type Description
TD_RowNormalizeFit Example
InputTable: rowNormalizeFit_input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
id x y
-- - --
1 0 1
2 3 4
3 5 12
4 7 24
SQL Call
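A minimal sketch of the call; the syntax element values are taken from the FitTable output below (Approach INDEX, BaseColumn y, BaseValue 100):
SELECT * FROM TD_RowNormalizeFit (
    ON rowNormalizeFit_input AS InputTable
    USING
    TargetColumns ('x', 'y')
    Approach ('INDEX')
    BaseColumn ('y')
    BaseValue (100)
) AS dt;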
Output
TD_KEY_ROWFIT TD_VALUE_ROWFIT x y
------------- --------------- ---- ----
Approach INDEX null null
BaseColumn y null null
BaseValue 100 null null
TD_RowNormalizeTransform
TD_RowNormalizeTransform normalizes input columns row-wise, using TD_RowNormalizeFit output.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_RowNormalizeTransform Syntax
SELECT * FROM TD_RowNormalizeTransform (
ON { table | view | (query) } AS InputTable [ PARTITION BY ANY [ ORDER BY
order_column ] ]
ON { table | view | (query) } AS FitTable DIMENSION
USING
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;
TD_RowNormalizeTransform Input
InputTable Schema
FitTable Schema
See TD_RowNormalizeFit Output.
TD_RowNormalizeTransform Output
Column Data Type Description
TD_RowNormalizeTransform Example
• InputTable: rowNormalizeFit_input, as in TD_RowNormalizeFit Example
• FitTable: rowNormalizeFit_output, output by TD_RowNormalizeFit Example
SQL Call
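A minimal sketch of the call, using the tables listed above; Accumulate ('id') is inferred from the output:
SELECT * FROM TD_RowNormalizeTransform (
    ON rowNormalizeFit_input AS InputTable
    ON rowNormalizeFit_output AS FitTable DIMENSION
    USING
    Accumulate ('id')
) AS dt ORDER BY id;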
Output
id x y
-- ------- ------
1 0.00 100.00
2 75.00 100.00
3 41.66 100.00
4 29.16 100.00
TD_ScaleFit
TD_ScaleFit outputs a table of statistics to input to TD_ScaleTransform, which scales specified input
table columns.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_ScaleFit Syntax
SELECT * FROM TD_ScaleFit (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ ORDER BY order_column ] ]
TargetColumns
Specify the names of the InputTable columns for which to output statistics. The columns must
contain numeric data in the range (-1e308, 1e308).
ScaleMethod
Specify either one scale_method for all target columns or one scale_method for each
target_column. The nth scale_method applies to the nth target_column.
The following table lists each possible scale_method and its location and scale values.
The TD_ScaleTransform function uses the location and scale values in the following formula
to scale target column value X to scaled value X':
X' = intercept + multiplier * ((X - location)/scale)
Intercept and Multiplier determine intercept and multiplier.
In the table, Xmin, Xmax, and XMean are the minimum, maximum, and mean values
of target_column.
scale_method Description location scale
SUM Sum. 0 ΣX
Intercept
[Optional] Specify either one intercept for all target columns or one intercept for each
target_column. The function uses the nth intercept for the nth target_column.
Default: '0'
Multiplier
[Optional] Specify either one multiplier for all target columns or one multiplier for each
target_column. The function uses the nth multiplier for the nth target_column.
Default: '1'
GlobalScale
[Optional] Specify whether to scale all target columns to the same location and scale.
Default: 'false' (scale each target column separately)
MissValue
[Optional] Specify how to handle NULL values:
Option Description
TD_ScaleFit Input
InputTable Schema
Column Data Type Description
TD_ScaleFit Output
Output Table Schema
Column Data Type Description
0 MEAN
1 SUM
2 USTD
3 STD
4 RANGE
5 MIDRANGE
6 MAXABS
7 RESCALE
TD_ScaleFit Example
InputTable: scale_input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
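A minimal sketch of the call; the target column and scale method are inferred from the output below (fare, with location = min and scale = range), while the CREATE TABLE pattern and the MissValue setting are assumptions:
CREATE TABLE scaleFitOut AS (
    SELECT * FROM TD_ScaleFit (
        ON scale_input_table AS InputTable
        USING
        TargetColumns ('fare')
        ScaleMethod ('RANGE')
        MissValue ('KEEP')
    ) AS dt
) WITH DATA;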
Output
TD_STATTYPE_SCLFIT                                                                          fare
------------------------------------------------------------------------------------------ -------------
min                                                                                         5.000000000
max                                                                                         86.500000000
sum                                                                                         185.854200000
count                                                                                       5.000000000
null                                                                                        0.000000000
avg                                                                                         37.170840000
multiplier                                                                                  1.000000000
intercept                                                                                   0.000000000
location                                                                                    5.000000000
scale                                                                                       81.500000000
globalscale_false                                                                           null
ScaleMethodNumberMapping:[0:mean,1:sum,2:ustd,3:std,4:range,5:midrange,6:maxabs,7:rescale] 4.000000000
missvalue_KEEP                                                                              null
TD_ScaleTransform
TD_ScaleTransform scales specified input table columns, using TD_ScaleFit output.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_ScaleTransform Syntax
SELECT * FROM TD_ScaleTransform (
ON { table | view | (query) } AS InputTable
[ PARTITION BY ANY [ ORDER BY order_column ] ]
ON { table | view | (query) } AS FitTable DIMENSION
[ USING
Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...])
]
) AS alias;
TD_ScaleTransform Input
InputTable Schema
See TD_ScaleFit Input.
FitTable Schema
See TD_ScaleFit Output.
TD_ScaleTransform Output
Output Table Schema
Column Data Type Description
TD_ScaleTransform Example
• InputTable: input_table, as in TD_ScaleFit Example
• FitTable: scaleFitOut, output by TD_ScaleFit Example
SQL Call
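A call of the following form, which accumulates the passenger column shown in the output, is consistent with the example; the exact original call may differ:
SELECT * FROM TD_ScaleTransform (
  ON input_table AS InputTable
  ON scaleFitOut AS FitTable DIMENSION
  USING
  Accumulate ('passenger')
) AS dt ORDER BY 1;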
Output
passenger fare
--------- ------------------
97 0.363855214723926
488 0.303067484662577
505 1
631 0.306748466257669
873 0
The following are feature engineering utility functions for analyzing and extracting features of the input dataset.
TD_FillRowID
TD_FillRowID adds a column of unique row identifiers to the input table.
Note:
This function may not return the same RowIds if the function is run multiple times.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_FillRowID Syntax
SELECT * FROM TD_FillRowID (
ON { table | view | (query) } AS InputTable [ PARTITION BY ANY [ ORDER BY
order_column ] ]
USING
[ RowIDColumnName ('row_id_column') ]
) AS alias;
TD_FillRowID Input
InputTable Schema
InputTable can have any schema.
TD_FillRowID Output
Output Table Schema
Column Data Type Description
TD_FillRowID Example
InputTable: titanic
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
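A minimal call of the following form adds a row identifier column to the titanic table; the output column name row_id is an assumption:
SELECT * FROM TD_FillRowID (
  ON titanic AS InputTable
  USING
  RowIDColumnName ('row_id')
) AS dt;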
Output
TD_NumApply
TD_NumApply applies a specified numeric operator to the specified input table columns. For the list of
numeric operators, see TD_NumApply Syntax Elements.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
Related Information:
TD_StrApply
TD_FunctionFit
TD_FunctionTransform
TD_NumApply Syntax
SELECT * FROM TD_NumApply (
ON { table | view | (query) } AS InputTable [ PARTITION BY ANY [ ORDER BY
order_column ] ]
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ OutputColumns ('output_column' [,...]) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
ApplyMethod ('num_operator')
[ SigmoidStyle ({ 'logit' | 'modifiedlogit' | 'tanh' }) ]
[ InPlace ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
) AS alias;
OutputColumns
[Ignored with InPlace ('true'), otherwise optional.] Specify names for the output columns. An
output_column cannot exceed 128 characters.
Default: With InPlace ('false'), target_column_operator; otherwise target_column
Note:
If any target_column_operator exceeds 128 characters, specify an output_column for
each target_column.
Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.
With InPlace ('true'), no target_column can be an accumulate_column.
ApplyMethod
Specify one of these numeric operators:
num_operator Description
SigmoidStyle
[Required with ApplyMethod ('sigmoid'), otherwise ignored.] Specify the sigmoid style.
Default: logit
InPlace
[Optional] Specify whether the output columns have the same names as the target columns.
InPlace ('true') effectively replaces each value in each target column with the result of
applying num_operator to it.
InPlace ('false') copies the target columns to the output table and adds output columns whose
values are the result of applying num_operator to each value.
With InPlace ('true'), no target_column can be an accumulate_column.
Default: true
TD_NumApply Input
InputTable Schema
Column Data Type Description
TD_NumApply Output
Output Table Schema
Column Data Type Description
TD_NumApply Example
InputTable: input_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
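A call of the following form applies the sigmoid operator to a numeric column; the input_table and fare names are assumptions used only for illustration, and other num_operator values can be substituted for ApplyMethod:
SELECT * FROM TD_NumApply (
  ON input_table AS InputTable
  USING
  TargetColumns ('fare')
  ApplyMethod ('sigmoid')
  SigmoidStyle ('logit')
  InPlace ('true')
) AS dt;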
Output
TD_RoundColumns
TD_RoundColumns rounds the values of each specified input table column to a specified number of
decimal places.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_RoundColumns Syntax
SELECT * FROM TD_RoundColumns (
ON { table | view | (query) } AS InputTable
USING
PrecisionDigit
[Optional] Specify the number of decimal places to which to round values.
If precision is positive, the function rounds values to the right of the decimal point.
If precision is negative, the function rounds values to the left of the decimal point.
Default: If the PrecisionDigit value is not provided, the function rounds the column values to
0 places.
Note:
If the column values have the DECIMAL/NUMERIC data type with a precision less than
38, then the function increases the precision by 1. For example, when a DECIMAL (4,2)
value of 99.99 is rounded to 0 places, the function returns a DECIMAL (5,2) value,
100.00. However, if the precision is 38, then the function only reduces the scale value
by 1 unless the scale is 0. For example, the function returns a DECIMAL (38, 36) value
of 99.999999999 as a DECIMAL (38, 35) value, 100.00.
Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.
TD_RoundColumns Input
InputTable Schema
Column Data Type Description
TD_RoundColumns Output
Output Table Schema
Column Data Type Description
TD_RoundColumns Example
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
InputTable: titanic
SQL Call
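A call of the following form, rounding fare to one decimal place and accumulating passenger, pclass, and survived, is consistent with the partial output below; the column list of the original call is an assumption:
SELECT * FROM TD_RoundColumns (
  ON titanic AS InputTable
  USING
  TargetColumns ('fare')
  PrecisionDigit (1)
  Accumulate ('passenger','pclass','survived')
) AS dt ORDER BY 1;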
Output
4 1 1 53.10
5 3 0 8.10
TD_StrApply
TD_StrApply applies a specified string operator to the specified input table columns. For the list of string
operators, see TD_StrApply Syntax Elements.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
Related Information:
TD_NumApply
TD_StrApply Syntax
SELECT * FROM TD_StrApply (
ON { table | view | (query) } AS InputTable PARTITION BY ANY
USING
TargetColumns ({ 'target_column' | target_column_range }[,...])
[ OutputColumns ('output_column' [,...]) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
StringOperation (str_operator)
[ String ('string')]
[ StringLength ('length') ]
[ OperatingSide ({ 'Left' | 'Right' })]
[ IsCaseSpecific ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ EscapeString ('escape_string')]
[ IgnoreTrailingBlank ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ StartIndex ('start_index')]
[ InPlace ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
) AS alias;
OutputColumns
[Ignored with InPlace ('true'), otherwise optional.] Specify names for the output columns. An
output_column cannot exceed 128 characters.
Default: With InPlace ('false'), target_column_operator; otherwise target_column
Note:
If any target_column_operator exceeds 128 characters, specify an output_column for
each target_column.
Accumulate
[Optional] Specify the names of the input table columns to copy to the output table.
With InPlace ('true'), no target_column can be an accumulate_column.
StringOperation
Specify a str_operator from the following table. If str_operator requires string, length, or
start_index, specify that value with String, StringLength, or StartIndex.
str_operator Description
STRINGLIKE Returns first string that matches specified pattern if one exists in value.
Options: See EscapeString, IsCaseSpecific, IgnoreTrailingBlank.
String
[Required when str_operator needs string argument, ignored otherwise.] Specify string
argument for str_operator:
str_operator string
STRINGINDEX String for which to return index of its first character in value.
StringLength
[Required when str_operator needs length argument, ignored otherwise.] Specify
length argument for str_operator:
str_operator length
OperatingSide
[Optional] Applies only when str_operator is GETNCHARS, STRINGPAD, or STRINGTRIM.
Specifies side of value on which to apply str_operator.
Default: left
IsCaseSpecific
[Optional] Applies only when str_operator is STRINGINDEX or STRINGLIKE. Specify
whether search for string is case-specific.
Default: true (search is case-specific)
EscapeString
[Optional] Applies only when str_operator is STRINGLIKE. Specify the escape characters.
IgnoreTrailingBlank
[Optional] Applies only when str_operator is STRINGLIKE. Specify whether to ignore trailing
space characters.
StartIndex
[Optional] Applies only when str_operator is SUBSTRING. Specify the index of the character
at which the substring starts.
InPlace
[Optional] Specify whether the output columns have the same names as the target columns.
InPlace ('true') effectively replaces each value in each target column with the result of
applying str_operator to it.
InPlace ('false') copies the target columns to the output table and adds output columns whose
values are the result of applying str_operator to each value.
With InPlace ('true'), no target_column can be an accumulate_column.
Default: true
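For example, a call of the following form uses the GETNCHARS operator to take 3 characters from the left side of each value; the titanic table and the name column are assumptions used only for illustration:
SELECT * FROM TD_StrApply (
  ON titanic AS InputTable PARTITION BY ANY
  USING
  TargetColumns ('name')
  StringOperation ('GETNCHARS')
  StringLength ('3')
  OperatingSide ('Left')
  InPlace ('false')
  Accumulate ('passenger')
) AS dt;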
TD_StrApply Input
Input Table Schema
Column Data Type Description
TD_StrApply Output
Output Table Schema
Column Data Type Description
TD_StrApply Example
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
SQL Call
InPlace('True')
) as dt order by 1;
Output
passenger sex
--------- ------
1 MALE
2 FEMALE
3 FEMALE
4 FEMALE
5 MALE
TD_DecisionForest
The function is an ensemble algorithm used for classification and regression predictive modeling problems.
It is an extension of bootstrap aggregation (bagging) of decision trees. Typically, constructing a decision tree
involves evaluating the value for each input feature in the data to select a split point.
The function reduces the features to a random subset that can be considered at each split point; this forces each decision tree in the forest to be different, which improves prediction accuracy.
The function uses a training dataset to create a predictive model. The DecisionForestPredict function uses
the model created by the TD_DecisionForest function for making predictions.
The function supports regression, binary, and multi-class classification.
Consider the following points:
• All input features must be numeric. Convert categorical columns to numeric columns as a
preprocessing step.
• For classification, class labels (ResponseColumn values) can only be integers.
• Any observation with a missing value in an input column is skipped and not used for training. You can
use the TD_SimpleImpute function to assign missing values.
The number of trees built by TD_DecisionForest depends on the NumTrees, TreeSize, CoverageFactor
values, and the data distribution in the cluster. The trees are constructed in parallel by all the AMPs, which
have a non-empty partition of data.
• When you specify the NumTrees value, the number of trees built by the function is adjusted as:
TD_DecisionForest Syntax
SELECT * FROM TD_DecisionForest (
ON { table | view | (query) } PARTITION BY ANY
USING
InputColumns ({'input_column'|input_column_range }[,...])
ResponseColumn('response_column')
[ MaxDepth (maxdepth) ]
[ MinNodeSize (minnodesize) ]
[ NumTrees (numtrees) ]
[ Treetype ('regression'|'classification') ]
[ TreeSize (treesize) ]
[ CoverageFactor (coveragefactor) ]
[ Seed (seed) ]
[ Mtry (mtry) ]
[ MtrySeed (mtryseed) ]
[ MinImpurity (minimpurity) ]
)
as dt;
ResponseColumn
Specify the column name that contains the classification label or target value (dependent
variable) for regression.
MaxDepth
[Optional] Specify the maximum depth of a tree. The algorithm stops splitting a node beyond
this depth. Decision trees can grow to 2^(max_depth+1) - 1 nodes. The default value is 5. You must
specify a non-negative integer value.
NumTrees
[Optional] Specify the number of trees for the forest model. You must specify a value greater
than or equal to the number of data AMPs. By default, the function builds the minimum
number of trees that provides the specified coverage level in the CoverageFactor argument
for the input dataset. The default value is -1.
MinNodeSize
[Optional] Specify the minimum number of observations in a tree node. The algorithm stops
splitting a node if the number of observations in the node is equal to or smaller than this value.
You must specify a non-negative integer value. The default value is 1.
Mtry
[Optional] Specify the number of features from input columns for evaluating the best split
of a node. A higher value improves the splitting and performance of a tree. A smaller value
improves the robustness of the forest and prevents it from overfitting. When the value is -1,
all variables are used for each split. The default value is -1.
MtrySeed
[Optional] Specify the random seed that the algorithm uses for the Mtry argument. The
default value is 1.
Seed
[Optional] Specify the random seed the algorithm uses for repeatable results. The default
value is 1.
TreeType
[Optional] Specify the modeling type.
Allowed Values: Regression, Classification. The default value is Regression.
TreeSize
[Optional] Specify the number of rows that each tree uses as its input dataset. By default, each
tree uses the smaller of the number of rows on an AMP and the number of rows that fit into the
AMP’s memory; otherwise, each tree uses the number of rows given by the TreeSize argument.
The default value is -1.
CoverageFactor
Specify the level of coverage for the dataset in the forest. The value is specified in
percentage. The default coverage value is 1.0 (100%).
MinImpurity
[Optional] Specify the minimum impurity of a tree node. The algorithm stops splitting a node
if the value is equal to or smaller than the specified value. The default value is 0.0.
TD_DecisionForest Input
Column Name Data Type Description
input_column      INTEGER, BIGINT, SMALLINT, BYTEINT, FLOAT, DECIMAL, or NUMBER    The columns that the function uses to train the DecisionForest model.
response_column   INTEGER, BIGINT, SMALLINT, BYTEINT, FLOAT, DECIMAL, or NUMBER    The column that contains the response value for an observation. For regression, all numeric data types are supported. For classification, INTEGER, BIGINT, and SMALLINT data types are supported.
TD_DecisionForest Output
The function produces a model and a JSON representation of the decision tree. The model output is
as follows:
Column Name Data Type Description
tree CLOB The trained decision tree model represented in JSON format.
The JSON representation of the decision tree has the following elements:
JSON Type Description
sum_ [Regression trees] The sum of response variable values in the node.
sumSq_ [Regression trees] The sum of squared values of the response variable in the node.
responseCounts_ [Classification trees] The number of observations in each class of the node.
maxDepth_ The maximum possible depth of the tree starting from the current node. For the root
node, the value is max_depth. For leaf nodes, the value is 0. For other nodes, the value
is the maximum possible depth of the tree, starting from that node.
split_ The start of JSON item that describes a split in the node.
• REGRESSION_NUMERIC_SPLIT
leftNodeSize_ The number of observations assigned to the left node of the split.
rightNodeSize_ The number of observations assigned to the right node of the split.
leftChild_ The start of the JSON item that describes the left child of the node.
rightChild_ The start of the JSON item that describes the right child of the node.
TD_DecisionForest Examples
Example: TD_DecisionForest Regression
The following is a sample of housing data taken from Boston housing dataset.
CRIM     ZN   INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT  medv
0.05188  0.0  4.49   0.0   0.449  6.015  45.1  4.4272  3.0  247  18.5     396.99  12.86  22.5
0.30347  0.0  7.38   0.0   0.493  6.312  28.9  5.4159  5.0  287  19.6     396.90  6.15   23.0
0.6147   0.0  6.20   0.0   0.507  6.618  80.8  3.2721  8.0  307  17.4     396.90  7.6    30.1
0.04527  0.0  11.93  0.0   0.537  6.120  76.7  2.2875  1.0  273  21.0     396.90  9.08   20.6
...      ...  ...    ...   ...    ...    ...   ...     ...  ...  ...      ...     ...    ...
SQL Call
SELECT * FROM TD_DecisionForest (
    ON boston_train PARTITION BY ANY  -- input table name assumed
    USING
    InputColumns ('crim','zn','indus','chas','nox','rm','age','dis','rad','tax','ptratio','b','lstat')  -- assumed: all housing features
    ResponseColumn ('medv')
    MaxDepth (12)
    MinNodeSize (1)
    NumTrees (4)
    TreeType ('REGRESSION')
    Seed (1)
    Mtry (3)
    MtrySeed (1)
) as dt;
TD_DecisionForest Output
task_index  tree_num  tree
0 0 {"id_":1,"sum_":201.700000,"sumSq_":6781.890000,"size_":6,"maxDepth_":12,
"nodeType_":"REGRESSION_NODE","split_":{"splitValue_":7.091500,"attr_":"rm",
"type_":"REGRESSION_NUMERIC_SPLIT","score_":32984.915253,"scoreImprove_
":32984.915253,"leftNodeSize_":5,"rightNodeSize_":1},"leftChild_":{"id_":2,"sum_":
167.000000,"sumSq_":5577.800000,"size_":5,"maxDepth_":11,"value_":33.400000,
"nodeType_":"REGRESSION_LEAF"},"rightChild_":{"id_":3,"sum_":34.700000,
"sumSq_":1204.090000,"size_":1,"maxDepth_":11,"value_":34.700000,"nodeType_":
"REGRESSION_LEAF"}}
2 0 {"id_":1,"sum_":208.800000,"sumSq_":4905.980000,"size_":9,"maxDepth_":12,
"nodeType_":"REGRESSION_NODE","split_":{"splitValue_":6.465000,"attr_":"rm",
"type_":"REGRESSION_NUMERIC_SPLIT","score_":37076.368050,"scoreImprove_
":37076.368050,"leftNodeSize_":8,"rightNodeSize_":1},"leftChild_":{"id_":2,"sum_":
178.700000,"sumSq_":3999.970000,"size_":8,"maxDepth_":11,"value_":22.337500,
"nodeType_":"REGRESSION_LEAF"},"rightChild_":{"id_":3,"sum_":30.100000,
"sumSq_":906.010000,"size_":1,"maxDepth_":11,"value_":30.100000,"nodeType_":
"REGRESSION_LEAF"}}
3 0 {"id_":1,"sum_":93.600000,"sumSq_":2194.560000,"size_":4,"maxDepth_":12,
"nodeType_":"REGRESSION_NODE","split_":{"splitValue_":7.060000,"attr_":"lstat",
"type_":"REGRESSION_NUMERIC_SPLIT","score_":6272.052528,"scoreImprove_
":6272.052528,"leftNodeSize_":3,"rightNodeSize_":1},"leftChild_":{"id_":2,"sum_":
72.000000,"sumSq_":1728.000000,"size_":3,"maxDepth_":11,"value_":24.000000,
"nodeType_":"REGRESSION_LEAF"},"rightChild_":{"id_":3,"sum_":21.600000,
"sumSq_":466.560000,"size_":1,"maxDepth_":11,"value_":21.600000,"nodeType_":
"REGRESSION_LEAF"}}
Example: TD_DecisionForest Classification
The following is a sample of diabetes data.
ID  Pregnancies  Glucose  Blood Pressure  Skin Thickness  Insulin  BMI   Diabetes Pedigree Function  Age  Outcome
4   0            123      84              37              0        3.52  0.197                       29   0
... ... ... ... ... ... ... ... ... ...
SQL Call
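A call of the following form is consistent with the classification output that follows: MaxDepth (3) matches the root maxDepth_ value and the attribute names match the attr_ values in the trees. The input table name and the remaining argument values are assumptions:
SELECT * FROM TD_DecisionForest (
    ON diabetes_train PARTITION BY ANY
    USING
    InputColumns ('pregnancies','glucose','bloodpressure','skinthickness','insulin','bmi','diabetespedigreefunction','age')
    ResponseColumn ('outcome')
    MaxDepth (3)
    NumTrees (4)
    TreeType ('CLASSIFICATION')
    Seed (1)
) as dt;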
TD_DecisionForest Output
amp_id  tree_num  tree
0 0 {"id_":1,"size_":11,"maxDepth_":3,"responseCounts_":{"0":8,"1":3},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":35.500000,"attr_":"skinthickness",
"type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.396694,"scoreImprove_
":0.178512,"leftNodeSize_":6,"rightNodeSize_":5},"leftChild_":{"id_":2,"size_":6,
"maxDepth_":2,"label_":"0","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_
":{"id_":3,"size_":5,"maxDepth_":2,"responseCounts_":{"0":2,"1":3},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":135.000000,"attr_":"glucose",
"type_":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.480000,"scoreImprove_
":0.218182,"leftNodeSize_":2,"rightNodeSize_":3},"leftChild_":{"id_":6,"size_":2,
"maxDepth_":1,"label_":"0","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_":
{"id_":7,"size_":3,"maxDepth_":1,"label_":"1","nodeType_":"CLASSIFICATION_LEAF"
1 0 {"id_":1,"size_":9,"maxDepth_":3,"responseCounts_":{"0":5,"1":4},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":32.500000,"attr_":"age","type_
":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.493827,"scoreImprove_":
0.316049,"leftNodeSize_":4,"rightNodeSize_":5},"leftChild_":{"id_":2,"size_":4,
"maxDepth_":2,"label_":"0","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_
":{"id_":3,"size_":5,"maxDepth_":2,"responseCounts_":{"0":1,"1":4},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":36.500000,"attr_":"age","type_
":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.320000,"scoreImprove_":
0.066667,"leftNodeSize_":3,"rightNodeSize_":2},"leftChild_":{"id_":6,"size_":3,
"maxDepth_":1,"label_":"1","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_
":{"id_":7,"size_":2,"maxDepth_":1,"label_":"0","nodeType_":"CLASSIFICATION_
LEAF"}}}
2 0 {"id_":1,"size_":5,"maxDepth_":3,"label_":"1","nodeType_":
"CLASSIFICATION_LEAF"}
3 0 {"id_":1,"size_":10,"maxDepth_":3,"responseCounts_":{"0":9,"1":1},"nodeType_":
"CLASSIFICATION_NODE","split_":{"splitValue_":37.000000,"attr_":"age","type_
":"CLASSIFICATION_NUMERIC_SPLIT","score_":0.180000,"scoreImprove_":
0.080000,"leftNodeSize_":8,"rightNodeSize_":2},"leftChild_":{"id_":2,"size_":8,
"maxDepth_":2,"label_":"0","nodeType_":"CLASSIFICATION_LEAF"},"rightChild_
":{"id_":3,"size_":2,"maxDepth_":2,"label_":"0","nodeType_":"CLASSIFICATION_
LEAF"}}
TD_KMeans
The K-means algorithm groups a set of observations into k clusters in which each observation belongs to
the cluster with the nearest mean (cluster centers or cluster centroid). This algorithm minimizes the objective
function, that is, the total Euclidean distance of all data points from the center of the cluster as follows:
1. Specify or randomly select k initial cluster centroids.
2. Assign each data point to the cluster that has the closest centroid.
3. Recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
The algorithm doesn't necessarily find the optimal configuration as it depends significantly on the initial
randomly selected cluster centers. You can run the function multiple times to reduce the effect of
this limitation.
Also, this function returns the within-cluster-squared-sum, which you can use to determine an optimal
number of clusters using the Elbow method.
Note:
• This function doesn't consider the InputTable and InitialCentroidsTable Input rows that have a
NULL entry in the specified TargetColumns.
• The function can produce deterministic output across different machine configurations if you
provide the InitialCentroidsTable in the query.
• The function randomly samples the initial centroids from the InputTable, if you don't provide the
InitialCentroidsTable in the query. In this case, you can use the Seed element to make the function
output deterministic on a machine with an assigned configuration. However, using the Seed
argument won't guarantee deterministic output across machines with different configurations.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_KMeans Syntax
SELECT * FROM TD_KMeans (
ON {table | view | query} as InputTable
[ ON {table | view | query} as InitialCentroidsTable DIMENSION ]
[ OUT [ PERMANENT | VOLATILE ] TABLE ModelTable(model_output_table_name) ]
USING
IdColumn('id_column')
TargetColumns({'target_column'|'target_column_range'}[,...])
[ NumClusters(number_of_clusters) ]
[ Seed(seed_value) ]
[ StopThreshold(threshold_value) ]
[ MaxIterNum(number_of_iterations) ]
[ NumInit(num_init) ]
[ OutputClusterAssignment({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
) as alias;
TargetColumns
[Required]: Specify the input table columns for clustering.
ModelTable
[Optional]: Specify the ModelTable name to save the clustering data model. If specified, then
a model containing centroids of clusters is saved in the specified ModelTable name.
NumClusters
[Optional]: Specify the number of clusters to create from the clustering data. Not required, if
the InitialCentroidsTable is specified.
Seed
[Optional]: Specify a non-negative integer value to randomly select the initial cluster centroid
positions from the input table rows. Not required, if the InitialCentroidsTable is specified.
StopThreshold
[Optional]: The algorithm converges if the distance between the centroids from the previous
iteration and the current iteration is less than the specified value.
Default Value: 0.0395
MaxIterNum
[Optional]: Specify the maximum number of iterations for the K-means algorithm. The
algorithm stops after performing the specified number of iterations even if the convergence
criterion is not met.
Default Value: 10
NumInit
[Optional]: Specify the number of times to repeat clustering with different initial centroid
seeds. The function returns the model having the least value of Total Within Cluster
Squared Sum.
Not required, if the InitialCentroidsTable is specified.
Default Value: 1
OutputClusterAssignment
[Optional]: Specify whether to return the Cluster Assignment information.
Default Value: False
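A call of the following form groups the example input that follows into two clusters; the input table name and the threshold and iteration values are illustrative assumptions:
SELECT * FROM TD_KMeans (
  ON kmeans_input AS InputTable
  USING
  IdColumn ('id')
  TargetColumns ('C1','C2')
  NumClusters (2)
  StopThreshold (0.0395)
  MaxIterNum (10)
  OutputClusterAssignment ('false')
) AS dt;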
TD_KMeans Input
Input Table Schema
Column Data Type Description
IdColumn Any The InputTable column name that has the unique
identifier for each input table row.
TD_KMeans Output
Output Table Schema
If the OutputClusterAssignment value is set to False:
TargetColumns REAL The columns that contain the centroid value for each feature.
Id_Column BYTEINT The unique identifier column name copied from the InputTable. This column contains only NULL values in the output.
If the OutputClusterAssignment value is set to True:
Id_Column Any The unique identifier of input rows copied from the input table.
TD_KMeans Example
Input Table
id C1 C2
-- -- --
1 1 1
2 2 2
3 8 8
4 9 9
TD_CLUSTERID_KMEANS C1 C2
------------------- -- --
2 2 2
4 9 9
Model output (OutputClusterAssignment ('false')):
td_clusterid_kmeans  C1    C2    td_size_kmeans  td_withinss_kmeans  id    td_modelinfo_kmeans
-------------------  ----  ----  --------------  ------------------  ----  ----------------------------------------
0                    1.5   1.5   2               1                   NULL  NULL
1                    8.5   8.5   2               1                   NULL  NULL
NULL                 NULL  NULL  NULL            NULL                NULL  Converged : True
NULL                 NULL  NULL  NULL            NULL                NULL  Number of Iterations : 2
NULL                 NULL  NULL  NULL            NULL                NULL  Number of Clusters : 2
NULL                 NULL  NULL  NULL            NULL                NULL  Total_WithinSS : 2.00000000000000E+00
NULL                 NULL  NULL  NULL            NULL                NULL  Between_SS : 9.80000000000000E+01
Cluster assignment output (OutputClusterAssignment ('true')):
id           td_clusterid_kmeans
-----------  -------------------
1            0
2            0
3            1
4            1
TD_GLM
The TD_GLM function is a generalized linear model (GLM) that performs regression and classification
analysis on data sets, where the response follows an exponential family distribution and supports the
following models:
• Regression (Gaussian family): The loss function is squared error.
• Binary Classification (Binomial family): The loss function is logistic and implements logistic regression.
The only response values are 0 or 1.
The function uses the Minibatch Stochastic Gradient Descent (SGD) algorithm that is highly scalable
for large datasets. The algorithm estimates the gradient of loss in minibatches, which is defined by the
Batchsize argument and updates the model with a learning rate using the LearningRate argument.
The function also supports the following approaches:
• L1, L2, and Elastic Net Regularization for shrinking model parameters
• Accelerated learning using Momentum and Nesterov approaches
The function uses a combination of IterNumNoChange and Tolerance arguments to define the convergence
criterion and runs multiple iterations (up to the specified value in the MaxIterNum argument) until the
algorithm meets the criterion.
The function also supports LocalSGD, a variant of SGD, that uses LocalSGDIterations on each AMP to run
multiple batch iterations locally followed by a global iteration.
The weights from all mappers are aggregated in a reduce phase and are used to compute the gradient
and loss in the next iteration. LocalSGD lowers communication costs and can result in faster learning and
convergence in fewer iterations, especially when there is a large cluster size and many features.
Due to gradient-based learning, the function is highly sensitive to feature scaling. Before using
the features in the function, you must standardize the input features using the TD_ScaleFit and
TD_ScaleTransform functions.
The function only accepts numeric features. Therefore, before training, you must convert the categorical
features to numeric values.
The function skips the rows with missing (null) values during training.
The function output is a trained GLM model that is used as an input to the TD_GLMPredict
function. The model also contains the model statistics MSE, Loglikelihood, AIC, and BIC.
Note:
When an unsupported data type is passed in InputColumns or ResponseColumn, the following error
message is displayed:
In the message, n refers to the column index based on an input to the function comprising InputColumns
and ResponseColumn only. The function does not need the rest of the columns, and the Teradata Vantage
optimizer does not project them to the function. Due to this, n might be different from the actual index in the
input table.
TD_GLM Syntax
SELECT * FROM TD_GLM (
ON { table | view | (query) } PARTITION BY ANY
[ OUT TABLE MetaInformationTable (meta_table) ]
USING
InputColumns ({'input_column'|input_column_range }[,...])
ResponseColumn('response_column')
[ Family ('Gaussian' | 'Binomial') ]
[ BatchSize (batchsize) ]
[ MaxIterNum (max_iter) ]
[ RegularizationLambda (lambda) ]
[ Alpha (alpha) ]
[ IterNumNoChange (n_iter_no_change) ]
[ Tolerance (tolerance) ]
[ Intercept ('true' | 'false') ]
[ ClassWeights ('class:weight,...') ]
[ LearningRate ('constant'|'optimal'|'invtime'|'adaptive') ]
[ InitialEta (eta0) ]
[ DecayRate (gamma) ]
[ DecaySteps (decay_steps) ]
[ Momentum (momentum) ]
[ Nesterov ('true'|'false') ]
[ LocalSGDIterations(local_iterations) ]
) as dt;
ResponseColumn
Specify the column name that contains the class label for classification or target value
(dependent variable) for regression.
Family
[Optional] Specify the distribution exponential family. Options are Gaussian and Binomial.
Default value is Gaussian.
MaxIterNum
[Optional] Specify the maximum number of iterations (minibatches) over the training data
batches. Value is a positive integer less than 10,000,000. Default value is 300.
BatchSize
[Optional] Specify the number of observations (training samples) processed in a single
minibatch per AMP. A value of 0 or higher than the number of rows on an AMP processes
all rows on the AMP, such that the entire dataset is processed in a single iteration, and the
algorithm becomes Gradient Descent. Specify a positive integer value. The default value
is 10.
RegularizationLambda
[Optional] Specify the regularization amount. The higher the value, stronger the
regularization. It is also used to compute learning rate when learning rate is set to optimal.
Must be a non-negative float value. A value of 0 means no regularization. Default value
is 0.02.
Alpha
[Optional] Specify the Elasticnet parameter for penalty computation. It is only effective when
RegularizationLambda is greater than 0. The value represents the contribution ratio of L1 in
the penalty. A value of 1.0 indicates L1 (LASSO) only, a value of 0 indicates L2 (Ridge) only,
and a value between is a combination of L1 and L2. Value is a float value between 0 and 1.
The default value is 0.15 (15% L1, 85% L2).
IterNumNoChange
[Optional] Specify the number of iterations (minibatches) with no improvement in loss
including the tolerance to stop training. A value of 0 indicates no early stopping and the
algorithm continues until MaxIterNum iterations are reached. Specify a positive integer value.
The default value is 50.
Tolerance
[Optional] Specify the stopping criterion in terms of loss function improvement. Applicable
when IterNumNoChange is greater than 0. Specify a positive float value. The default value
is 0.001.
Intercept
[Optional] Specify whether to estimate intercept based on whether the data is already
centered. The default value is true.
ClassWeights
[Optional] Specify weights associated with classes. Only applicable for Binomial Family.
The format is 0:weight,1:weight. For example, 0:1.0,1:0.5 gives twice the weight to each
observation in class 0. If the weight of a class is omitted, it is assumed to be 1.0. The default
value is 0:1.0,1:1.0.
LearningRate
[Optional] Specify one of the learning rate algorithms:
• Constant
• InvTime
• Optimal
• Adaptive
The default value is invtime for Gaussian, and optimal for Binomial.
InitialEta
[Optional] Specify the initial learning rate eta value. If you specify the learning rate as
constant, the eta value is applicable for all iterations. The default value is 0.05.
DecayRate
[Optional] Specify the decay rate for the learning rate. Only applicable for invtime and
adaptive learning rates. The default value is 0.25.
DecaySteps
Specify the number of iterations without decay for the adaptive learning rate. The learning
rate changes by decay rate after the specified number of iterations are completed. The
default value is 5.
Momentum
[Optional] Specify the value to use for the momentum learning rate optimizer. A larger value
indicates a higher momentum contribution. A value of 0 means the momentum optimizer is
disabled. For a good momentum contribution, a value between 0.6 and 0.95 is recommended.
Value is a non-negative float between 0 and 1. The default value is 0.
Nesterov
[Optional] Specify whether to use Nesterov optimization for the Momentum optimizer. Only
applicable when the Momentum optimizer value is greater than 0. The default value is True.
LocalSGDIterations
[Optional] Specify the number of local iterations for the Local SGD algorithm. A value
of 0 implies that the algorithm is disabled. A value greater than 0 enables the algorithm
and specifies the number of iterations for the algorithm. The recommended values for the
arguments are as follows:
• LocalSGDIterations: 10
• MaxIterNum: 100
• BatchSize: 50
• IterNumNoChange: 5
The default value is 0.
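A minimal binary-classification call might look like the following sketch; the table name, the input column list, and the response column are assumptions, and the remaining arguments repeat the documented defaults:
SELECT * FROM TD_GLM (
  ON credit_train PARTITION BY ANY
  USING
  InputColumns ('A1','A2','A7','A10','A13','A14')
  ResponseColumn ('approved')
  Family ('Binomial')
  BatchSize (10)
  MaxIterNum (300)
  RegularizationLambda (0.02)
  Alpha (0.15)
  LearningRate ('optimal')
) as dt;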
TD_GLM Input
Column Data Type Description
input_column      INTEGER, BIGINT, SMALLINT, BYTEINT, FLOAT, DECIMAL, NUMBER    The input table columns used to train the GLM model.
response_column   INTEGER, BIGINT, SMALLINT, BYTEINT, FLOAT, DECIMAL, NUMBER    The column that contains the response value for an observation.
TD_GLM Output
TD_GLM produces the following outputs:
• Model (Primary output): Contains the trained model with model statistics. The following model
statistics are stored in the model:
◦ Loss Function
◦ MSE (Gaussian)
◦ Loglikelihood (Logistic)
◦ Number of Observations
◦ AIC
◦ BIC
◦ Number of Iterations
◦ Regularization
◦ Alpha (L1/L2/Elasticnet)
◦ Learning Rate (initial)
◦ Learning Rate (Final)
◦ Momentum
◦ Nesterov
◦ LocalSGD Iterations
• [Optional] MetaInformationTable (Secondary Output): Contains training progress information for
each iteration.
The model output schema is as follows:
Column Data Type Description
attribute SMALLINT The column contains the numeric index of predictor and model metrics. Intercept
is specified using index 0, and the rest of the predictors take positive values.
Model metrics take negative indices.
value VARCHAR The string-based metric values such as SQUARED_ERROR for LossFunction,
L2 for Regularization, and so on.
TD_GLM Example
TD_GLM Example for Credit Data Set
The following credit data set is used in this example:
... ... ... ... ... ... ... ... ... ... ... .. ...
-11 Momentum 0
-5 BIC 151.787
-4 AIC 80.4189
-3 Number of Observations 44
-2 Loglik -0.209461
0 (Intercept) 0.146566
1 A1 0.732289
2 A2 0
3 A7 0.717899
4 A10 0.682358
5 A13 0.822302
6 A14 0.176791
7 A0_b 0.172178
8 A0_a -0.12165
9 A3_y -0.285135
10 A3_u 0.335663
11 A4_p -0.285135
12 A4_g 0.335663
13 A5_k 0.0358046
14 A5_cc 0
15 A5_d 0.0480538
16 A5_c 0.430725
17 A5_aa 0
18 A5_m -0.332389
19 A5_q 0.524153
20 A5_w -0.599829
21 A5_e 0.0257252
22 A5_ff -0.359011
23 A5_j -0.119432
24 A5_x 0.0494795
25 A5_i 0
26 A6_v -0.387493
27 A6_h 0.0415697
28 A6_bb 0.0604108
29 A6_z 0
30 A6_ff -0.372217
31 A6_j 0.538139
32 A8_t 0.9259
33 A8-f -0.875372
34 A9_t 0
35 A9_f 0
36 A11_t 0.197957
37 A11_f -0.146928
38 A12_g -0.221414
39 A12_s 0.272506
... ... ... ... ... ... ... ... ... ...
count 69 69 69 69 69 69 69
null 0 0 0 0 0 0 0
multiplier 1 1 1 1 1 1 1
intercept 0 0 0 0 0 0 0
ScaleMethodNumberMapping: [0:mean, 1:sum, 2:ustd, 3:std, 4:range, 5:midrange, 6:maxabs, 7:rescale]  3 3 3 3 3 3 3
-12 Nesterov
-11 Momentum 0
-5 BIC -67.6236
-4 AIC -87.7305
-3 Number of Observations 69
-2 MSE 0.216033
0 (Intercept) 2.07174
1 MedInc 0.782883
2 HouseAge 0.231914
3 AveRooms 0.0619822
4 AveBedrms -0.113656
5 Population 0.211336
6 AveOccup -0.388201
7 Latitude -0.195511
8 Longitude -0.193884
TD_VectorDistance
The TD_VectorDistance function accepts a table of target vectors and a table of reference vectors and
returns a table that contains the distance between target-reference pairs.
The function computes the distance between the target pair and the reference pair from the same table if
you provide only one table as the input.
You must have the same column order in the TargetFeatureColumns argument and the RefFeatureColumns
argument. The function ignores the feature values during distance computation if the value is either NULL,
NAN, or INF.
Important:
The function returns N² output rows if you use the TopK value as -1 because the function includes all
reference vectors in the output table.
Note:
The algorithm used in this function is of the order of N² (where N is the number of rows). Hence, expect
the query to run significantly longer as the number of rows increases in either the target table or the
reference table. Also, because the Reference table is a DIMENSION input, it is copied to the spool for
each AMP before running the query. The user spool limits the size/scalability of the input.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_VectorDistance Syntax
SELECT * FROM TD_VectorDistance (
ON { table | view | (query) } AS TARGETTABLE PARTITION BY ANY
[ ON { table | view | (query) } AS REFERENCETABLE DIMENSION ]
USING
TargetIDColumn ('target_id_column')
TargetFeatureColumns ({ 'target_feature_column' | target_feature_Column_range }[,...])
[ RefIDColumn ('ref_id_column') ]
[ RefFeatureColumns ({ 'ref_feature_column' | ref_feature_Column_range }[,...]) ]
[ DistanceMeasure ({'Cosine' | 'Euclidean' | 'Manhattan' }[,...])]
[ TopK(integer_value)]
) AS alias;
TargetFeatureColumns
Specify the target table column names that contain features of the target table vectors.
Note:
You can specify up to 2018 feature columns.
RefIDColumn
[Optional] Specify the reference table column name that contains identifiers of the reference
table vectors.
RefFeatureColumns
[Optional] Specify the reference table column names that contain features of the reference
table vectors.
Note:
You can specify up to 2018 feature columns.
DistanceMeasure
[Optional] Specify the distance type to compute between the target and the reference vector:
• Cosine: Cosine distance between the target vector and the reference vector.
• Euclidean: Euclidean distance between the target vector and the reference vector.
• Manhattan: Manhattan distance between the target vector and the reference vector.
TopK
[Optional] Specify the maximum number of closest reference vectors to include in the output
table for each target vector. The value k is an integer between 1 and 100. The default value
is 10.
TD_VectorDistance Input
Target Table Schema:
Column Data Type Description
target_id_column        BYTEINT, SMALLINT, BIGINT, INTEGER    The target table column name that contains target table vector identifiers.
target_feature_column   BYTEINT, SMALLINT, BIGINT, INTEGER, DECIMAL, NUMBER, FLOAT, REAL, DOUBLE PRECISION    The target table column names that contain features of the target table vectors.
ref_id_column           BYTEINT, SMALLINT, BIGINT, INTEGER    The reference table column name that contains identifiers of the reference table vectors.
ref_feature_column      BYTEINT, SMALLINT, BIGINT, INTEGER, DECIMAL, NUMBER, FLOAT, REAL, DOUBLE PRECISION    The reference table column names that contain features of the reference table vectors.
TD_VectorDistance Output
The function produces a table with the distances between the target and reference vectors.
Column Data Type Description
Distance FLOAT The distance between the target and the reference vectors.
TD_VectorDistance Example
Target Table:
3 1 0.8 0.9
Reference Table:
SQL Call
SELECT * FROM TD_VectorDistance (
ON target_mobile_data_dense as TargetTable
ON ref_mobile_data_dense as ReferenceTable Dimension
USING
TargetIDColumn('userid')
TargetFeatureColumns('CallDuration','DataCounter','SMS')
RefIDColumn('userid')
RefFeatureColumns('CallDuration','DataCounter','SMS')
DistanceMeasure('euclidean','cosine','manhattan')
topk(2)
) as dt order by 3,1,2,4;
TD_VectorDistance Result
GLMPredict
Note:
GLMPredict uses the model output by the ML Engine GLM function to analyze the input
data and make predictions.
If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.
GLMPredict Syntax
SELECT * FROM GLMPredict (
ON { table | ( query ) } [ PARTITION BY ANY ]
ON { table | view | (query) } AS Model DIMENSION
[ USING
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ Family ('family') ]
[ LinkFunction ('link') ]
]
) AS alias;
You must specify the keyword USING to use any function syntax element.
Related Information:
Column Specification Syntax Elements
Family
[Optional] Specify the distribution exponential family.
If you specify this syntax element, you must give it the same value that you used for the
Family syntax element of ML Engine GLM function when you created the model table.
LinkFunction
[Optional] Specify the link function. For the canonical link functions (default link functions)
and the link functions allowed for each exponential family, see the GLM function description
in Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003.
If you specify this syntax element, you must give it the same value that you used for the
LinkFunction syntax element of ML Engine GLM function when you created the model table.
Default: 'CANONICAL'
GLMPredict Input
Table Description
Model Model output by ML Engine GLM function. For schema, see Teradata Vantage™ Machine
Learning Engine Analytic Function Reference, B700-4003.
If the GLM call that created the model table specified the Step syntax element, include the optional
ORDER BY clause in the GLMPredict call; otherwise, the GLMPredict result is nondeterministic.
t_score   DOUBLE PRECISION   [Column appears only with Family ('GAUSSIAN').] The t_score follows a t(N-p-1) distribution.
z_score   DOUBLE PRECISION   [Column appears only without Family ('GAUSSIAN').] The z-score follows the N(0,1) distribution.
p_value   DOUBLE PRECISION   p-value for z_score. (p-value represents significance of each coefficient.)
GLMPredict Output
Output Table Schema
Column Data Type Description
fitted_value   DOUBLE PRECISION   Score of the input data, given by the equation g⁻¹(Xβ), where g⁻¹ is the inverse link function, X the predictors, and β is the vector of coefficients estimated by the GLM function. For other values of Family, the scores are the expected values of the dependent/response variable, conditional on the predictors.
GLMPredict Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
• Input table: admissions_test, which has admissions information for 20 students
• Model: glm_admissions_model, output by "GLM Example: Logistic Regression Analysis with
Intercept" in Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003
admissions_test
id masters gpa stats programming admitted
60 no 4 Advanced Novice 1
SQL Call
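A call of the following form, mirroring the surviving call fragment of the later admissions example, produces a table like glmpredict_admissions below; the Accumulate list is inferred from the output columns and may differ from the original call:
CREATE TABLE glmpredict_admissions AS (
  SELECT * FROM GLMPredict (
    ON admissions_test PARTITION BY ANY
    ON glm_admissions_model AS Model DIMENSION
    USING
    Accumulate ('id','masters','gpa','stats','programming','admitted')
    Family ('LOGISTIC')
    LinkFunction ('LOGIT')
  ) AS dt
) WITH DATA;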
Output
This query returns the following table:
Fitted values can vary in precision, because they depend on the model table output by ML Engine GLM
function and fetched to Analytics Database.
glmpredict_admissions
id masters gpa stats programming admitted fitted_value
A fitted_value probability greater than or equal to 0.5 implies class 1 (student admitted); a probability less
than 0.5 implies class 0 (student rejected).
The following code adds a fitted_category column to glmpredict_admissions and populates it:
id  masters  gpa                    stats     programming  admitted  fitted_value            fitted_category
52  no       3.70000000000000E 000  Novice    Beginner     1         7.58306140231079E-001   1
55  no       3.60000000000000E 000  Beginner  Advanced     1         9.68031141480050E-001   1
56  no       3.82000000000000E 000  Advanced  Advanced     1         9.45772725968165E-001   1
57  no       3.71000000000000E 000  Advanced  Advanced     1         9.46411914806798E-001   1
58  no       3.13000000000000E 000  Advanced  Advanced     1         9.49666186386367E-001   1
59  no       3.65000000000000E 000  Novice    Novice       1         8.74189685344822E-001   1
60  no       4.00000000000000E 000  Advanced  Novice       1         8.65058992199339E-001   1
62  no       3.70000000000000E 000  Advanced  Advanced     1         9.46469669158547E-001   1
63  no       3.83000000000000E 000  Advanced  Advanced     1         9.45714262806374E-001   1
66  no       3.87000000000000E 000  Novice    Beginner     1         7.54738501228429E-001   1
68  no       1.87000000000000E 000  Advanced  Novice       1         8.90965518431337E-001   1
69  no       3.96000000000000E 000  Advanced  Advanced     1         9.44948816395031E-001   1
Prediction Accuracy
This query returns the prediction accuracy:
prediction_accuracy
1.00000000000000000000
This example evaluates the predictions for new houses, comparing them with the original price
information using root mean square error (RMSE).
Input
• Input table: housing_test, as in DecisionForestPredict Example: Specify Column Names
• Model: glm_housing_model, output by "GLM Example: Gaussian Distribution Analysis" in Teradata
Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003
SQL Call
The canonical link specifies the default family link, which is "identity" for the Gaussian distribution.
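A call of the following form, with Family ('GAUSSIAN') and the canonical link, produces a table like glmpredict_housing used in the RMSE query below; the Accumulate list is inferred from the output columns:
CREATE TABLE glmpredict_housing AS (
  SELECT * FROM GLMPredict (
    ON housing_test PARTITION BY ANY
    ON glm_housing_model AS Model DIMENSION
    USING
    Accumulate ('sn','price')
    Family ('GAUSSIAN')
    LinkFunction ('CANONICAL')
  ) AS dt
) WITH DATA;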
Output
This query returns the following table:
sn price fitted_value
SELECT SQRT(AVG(POWER(glmpredict_housing.price -
glmpredict_housing.fitted_value, 2))) AS RMSE FROM glmpredict_housing;
rmse
1.06854695738768E 004
Like GLMPredict Example: Logistic Distribution Prediction, this example predicts the admission status
of students. In both examples, the input column masters is categorical—the value can be yes or no. In
the other example, the value is 'yes' or 'no'. In this example, the value is numerical—1 for yes or 0 for
no—therefore, it must be cast to VARCHAR.
Input
• Input table: admissions_test_2, which has admissions information for 20 students
• Model: glm_admissions_model, output by "GLM Example: Logistic Regression Analysis with
Intercept" in Teradata Vantage™ Machine Learning Engine Analytic Function Reference,
B700-4003, with the category column modified as follows:
1 masters '1'
2 masters '0'
admissions_test_2
id masters gpa stats programming admitted
SQL Call
CREATE TABLE glmpredict_admissions_2 AS (
  SELECT * FROM GLMPredict (
    ON (SELECT id, CAST(masters AS VARCHAR(5)) AS masters, gpa, stats, programming, admitted FROM admissions_test_2) PARTITION BY ANY  -- casts masters to VARCHAR as described above; length assumed
    ON glm_admissions_model AS Model DIMENSION
    USING
    Accumulate ('id','masters','gpa','stats','programming','admitted')
    Family ('LOGISTIC')
    LinkFunction ('LOGIT')
  ) AS dt
) WITH DATA;
Output
This query returns the following table:
Fitted values can vary in precision, because they depend on the model table output by ML Engine GLM
function and fetched to Analytics Database.
glmpredict_admissions_2
id masters gpa stats programming admitted fitted_value
Prediction Accuracy
See GLMPredict Example: Logistic Distribution Prediction.
SVMSparsePredict
Note:
SVMSparsePredict uses the model output by the ML Engine SVMSparse function to analyze the
input data and make predictions.
If the SVMSparse call that created the model specified HashProjection ('true'), SVMSparsePredict does not
support UNICODE data.
If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.
SVMSparsePredict Syntax
SELECT * FROM SVMSparsePredict (
ON { table | view | (query) } AS InputTable PARTITION BY id_column
ON { table | view | (query) } AS Model DIMENSION
USING
IDColumn ('id_column')
AttributeNameColumn ('attribute_name_column')
[ AttributeValueColumn ('attribute_value_column') ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ TopK ({ output_class_number | 'output_class_number' }) ]
[ OutputProb ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Responses ('response' [,...]) ]
) AS alias;
Related Information:
Column Specification Syntax Elements
AttributeNameColumn
Specify the name of the InputTable column that contains the attributes of the test samples.
AttributeValueColumn
[Optional] Specify the name of the InputTable column that contains the attribute values.
Default behavior: Each attribute has the value 1.
Accumulate
[Optional] Specify the names of the InputTable columns to copy to the output table.
TopK
[Disallowed with Responses, otherwise optional] Specify the number of class labels to
appear in the output table. For each observation, the output table has n rows, corresponding
to the n most likely classes. To see the probability of each class, use OutputProb ('true').
OutputProb
[Required to be 'true' with Responses, optional otherwise.] Specify whether to output the
probability for each response. If you omit Responses, the function outputs only the probability
of the predicted class.
Default: 'true'
Responses
[Optional] Specify the classes for which to output probabilities.
Default behavior: Output only the probability of the predicted class.
SVMSparsePredict Input
Table Description
Model Output by ML Engine SVMSparse function. Model is in binary format. To display its readable
content, use ML Engine SVMSparseSummary function.
InputTable Schema
Column Data Type Description
SVMSparsePredict Output
Output Table Schema
The table has the predicted class of each test sample.
If you specify TopK (n), the output table has n rows for each observation.
predict_confidence   DOUBLE PRECISION   [Column appears only with OutputProb ('true') and without Responses syntax element.] Probability that observation belongs to class in predict_value column.
The function calculates the values of prob_response and predict_confidence with the following formulas.
value_r = W_r · X
where:
• X is the vector of predictor values corresponding to an observation.
• W_r is the vector of predictor weights calculated by the model for class r, where r is a class specified
by the Responses syntax element.
prob_response
For binary classification, the formula for the probability that a response belongs to class r is:
For multiple-class classification, the formula for the probability that a response belongs to class r is:
predict_confidence
The column predict_confidence, which appears only if you omit the Responses syntax element, displays
the probability that the observation belongs to the class in the column predict_value. This value is the
maximum value of prob_response over all responses r.
SVMSparsePredict Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
• InputTable: svm_iris_input_test
• Model: svm_iris_model, output by ML Engine SVMSparse function
The model is in binary format. To display its readable content, use ML Engine
SVMSparseSummary function.
svm_iris_input_test
id species attribute value
svm_iris_model
classid weights
-3 757365686173683A66616C736500636F73743A312E300073616D706C656E756D6265723A313230007365656
-2 7365746F7361007665727369636F6C6F720076697267696E696361
-1 706574616C5F6C656E67746800706574616C5F776964746800736570616C5F6C656E67746800736570616C5
0 BFF134DF08DD751EBFE204E07599DDE03FD93A9DDED02C8A3FD57FC7A69871810000000000000000
1 3FE4F1C5871DE4A0C000B12D7E8C18FE3FE558515B291C5DBFF6F2558D9050370000000000000000
2 3FF9424250696DEA4003BA4B98AB24FCBFF5FBB2667D07A7BFF1D36766E2FE0A0000000000000000
SQL Call
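A call of the following form, using the id, attribute, value, and species columns of svm_iris_input_test, is consistent with the example; the Accumulate and OutputProb choices are illustrative:
SELECT * FROM SVMSparsePredict (
  ON svm_iris_input_test AS InputTable PARTITION BY id
  ON svm_iris_model AS Model DIMENSION
  USING
  IDColumn ('id')
  AttributeNameColumn ('attribute')
  AttributeValueColumn ('value')
  Accumulate ('species')
  OutputProb ('true')
) AS alias;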
Output
This query returns the following table:
Prediction Accuracy
This query returns the prediction accuracy:
prediction_accuracy
0.83
DecisionForestPredict
This function can use models from the TD_DecisionForest and ML Engine DecisionForest functions to
analyze the input data and make predictions.
If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.
DecisionForestPredict outputs the probability that each observation is in the predicted class. To use
DecisionForestPredict output as input to ML Engine ROC function, you must first transform it to show the
probability that each observation is in the positive class. One way to do this is to change the probability to
(1- current probability) when the predicted class is negative.
The prediction algorithm compares floating-point numbers. Due to possible inherent data type differences
between ML Engine and Analytics Database executions, predictions can differ. Before calling the function,
compute the relative error, using this formula:
where mle_prediction is ML Engine prediction value and td_prediction is Analytics Database prediction
value. Errors (e) follow Gaussian law; 0 < e < 3% is a negligible difference, with high confidence.
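For example, assuming the standard relative-error definition |mle_prediction - td_prediction| / |mle_prediction| and a hypothetical table that pairs the two sets of predictions, the check can be written as:
SELECT AVG(CASE
             WHEN ABS(mle_prediction - td_prediction)
                  / NULLIF(ABS(mle_prediction), 0) < 0.03
             THEN 1.0 ELSE 0.0
           END) AS fraction_within_3_percent
FROM prediction_comparison;  -- hypothetical table joining ML Engine and Analytics Database predictions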
DecisionForestPredict Syntax
SELECT * FROM DecisionForestPredict (
ON { table | view | (query) } PARTITION BY ANY
ON { table | view | (query) } AS Model DIMENSION
USING
IDColumn ('id_column')
[ NumericInputs ({ 'numeric_input_column' | numeric_input_column_range }[,...]) ]
[ CategoricalInputs ({ 'categorical_input_column' | categorical_input_column_range
}[,...]) ]
[ Detailed ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Responses ('response' [,...]) ]
[ OutputProb ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;
Related Information:
Column Specification Syntax Elements
NumericInputs
[Optional] Specify the names of the columns that contain the numeric predictor variables.
Default behavior: The function gets these variables from the model output by DecisionForest
only if you omit both NumericInputs and CategoricalInputs. If you specify this syntax element,
you must specify it exactly as you specified it in the DecisionForest call that created
the model.
CategoricalInputs
[Optional] Specify the names of the columns that contain the categorical predictor variables.
Default behavior: The function gets these variables from the model output by DecisionForest
only if you omit both NumericInputs and CategoricalInputs. If you specify this syntax element,
you must specify it exactly as you specified it in the DecisionForest call that created
the model.
Detailed
[Optional] Specify whether to output detailed information about the forest trees; that is,
the decision tree and the specific tree information, including task index and tree index for
each tree.
Default: 'false'
Responses
[Optional] Specify the classes for which to output probabilities.
Note:
Responses works only with a classification model.
OutputProb
[Required to be 'true' with Responses, optional otherwise.] Specify whether to output the
probability for each response. If you omit Responses, the function outputs only the probability
of the predicted class.
Note:
OutputProb works only with a classification model.
Default: 'false'
Accumulate
[Optional] Specify the names of the input columns to copy to the output table.
DecisionForestPredict Input
Table Description
Model Schema
For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.
task_index INTEGER, BIGINT, or SMALLINT Identifier of worker that produced decision tree.
DecisionForestPredict Output
Output Table Schema
The table has a set of predictions for each test point.
id_column Same as in input table Column copied from input table. Unique row identifier.
tree_num VARCHAR Either the concatenation of task_index and tree_num from the model table, to show which tree created the prediction, or 'final' to show the overall prediction. This column appears only if you specify Detailed ('true').
prob DOUBLE PRECISION [Column appears only with OutputProb ('true') and without Responses syntax element.] Probability that observation belongs to class prediction.
prob_response DOUBLE PRECISION [Column appears only with OutputProb ('true') and Responses syntax element. Appears once for each specified response.] Probability that observation belongs to category response.
DecisionForestPredict Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
• Input table: housing_test, which has 54 observations of 14 variables
• Model: rft_model, output by "DecisionForest Example: TreeType ('classification') and OutOfBag
('false')" in Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003
Column Description
housing_test
sn price lotsize bedrooms bathrms stories driveway recroom fullbase gashw airco
... ... ... ... ... ... ... ... ... ... ...
rft_model
worker_ip task_index tree_num CAST(tree AS VARCHAR(50))
xx.xx.xx.xx 0 0 {"responseCounts_":{"Eclectic":148,"bungalow":30,"
xx.xx.xx.xx 0 1 {"responseCounts_":{"Eclectic":158,"bungalow":26,"
xx.xx.xx.xx 0 2 {"responseCounts_":{"Eclectic":120,"bungalow":38,"
xx.xx.xx.xx 0 3 {"responseCounts_":{"Eclectic":166,"bungalow":29,"
xx.xx.xx.xx 0 4 {"responseCounts_":{"Eclectic":138,"bungalow":32,"
xx.xx.xx.xx 0 5 {"responseCounts_":{"Eclectic":158,"bungalow":34,"
xx.xx.xx.xx 0 6 {"responseCounts_":{"Eclectic":168,"bungalow":32,"
xx.xx.xx.xx 0 7 {"responseCounts_":{"Eclectic":145,"bungalow":40,"
xx.xx.xx.xx 0 8 {"responseCounts_":{"Eclectic":150,"bungalow":34,"
xx.xx.xx.xx 0 9 {"responseCounts_":{"Eclectic":156,"bungalow":42,"
xx.xx.xx.xx 0 10 {"responseCounts_":{"Eclectic":148,"bungalow":18,"
xx.xx.xx.xx 0 11 {"responseCounts_":{"Eclectic":147,"bungalow":20,"
xx.xx.xx.xx 0 12 {"responseCounts_":{"Eclectic":150,"bungalow":31,"
xx.xx.xx.xx 0 13 {"responseCounts_":{"Eclectic":135,"bungalow":32,"
xx.xx.xx.xx 0 14 {"responseCounts_":{"Eclectic":139,"bungalow":24,"
xx.xx.xx.xx 0 15 {"responseCounts_":{"Eclectic":146,"bungalow":27,"
xx.xx.xx.xx 0 16 {"responseCounts_":{"Eclectic":152,"bungalow":23,"
xx.xx.xx.xx 0 17 {"responseCounts_":{"Eclectic":135,"bungalow":23,"
xx.xx.xx.xx 0 18 {"responseCounts_":{"Eclectic":148,"bungalow":29,"
xx.xx.xx.xx 0 19 {"responseCounts_":{"Eclectic":166,"bungalow":33,"
xx.xx.xx.xx 0 20 {"responseCounts_":{"Eclectic":142,"bungalow":28,"
xx.xx.xx.xx 0 21 {"responseCounts_":{"Eclectic":172,"bungalow":27,"
xx.xx.xx.xx 0 22 {"responseCounts_":{"Eclectic":147,"bungalow":37,"
xx.xx.xx.xx 0 23 {"responseCounts_":{"Eclectic":158,"bungalow":31,"
xx.xx.xx.xx 0 24 {"responseCounts_":{"Eclectic":158,"bungalow":33,"
xx.xx.xx.xx 1 0 {"responseCounts_":{"Eclectic":140,"bungalow":44,"
xx.xx.xx.xx 1 1 {"responseCounts_":{"Eclectic":161,"bungalow":28,"
xx.xx.xx.xx 1 2 {"responseCounts_":{"Eclectic":131,"bungalow":25,"
xx.xx.xx.xx 1 3 {"responseCounts_":{"Eclectic":167,"bungalow":28,"
xx.xx.xx.xx 1 4 {"responseCounts_":{"Eclectic":150,"bungalow":19,"
xx.xx.xx.xx 1 5 {"responseCounts_":{"Eclectic":158,"bungalow":24,"
xx.xx.xx.xx 1 6 {"responseCounts_":{"Eclectic":177,"bungalow":32,"
xx.xx.xx.xx 1 7 {"responseCounts_":{"Eclectic":156,"bungalow":24,"
xx.xx.xx.xx 1 8 {"responseCounts_":{"Eclectic":156,"bungalow":37,"
xx.xx.xx.xx 1 9 {"responseCounts_":{"Eclectic":165,"bungalow":24,"
xx.xx.xx.xx 1 10 {"responseCounts_":{"Eclectic":135,"bungalow":29,"
xx.xx.xx.xx 1 11 {"responseCounts_":{"Eclectic":140,"bungalow":20,"
xx.xx.xx.xx 1 12 {"responseCounts_":{"Eclectic":156,"bungalow":24,"
xx.xx.xx.xx 1 13 {"responseCounts_":{"Eclectic":147,"bungalow":34,"
xx.xx.xx.xx 1 14 {"responseCounts_":{"Eclectic":151,"bungalow":22,"
xx.xx.xx.xx 1 15 {"responseCounts_":{"Eclectic":161,"bungalow":18,"
xx.xx.xx.xx 1 16 {"responseCounts_":{"Eclectic":156,"bungalow":19,"
xx.xx.xx.xx 1 17 {"responseCounts_":{"Eclectic":126,"bungalow":29,"
xx.xx.xx.xx 1 18 {"responseCounts_":{"Eclectic":148,"bungalow":26,"
xx.xx.xx.xx 1 19 {"responseCounts_":{"Eclectic":177,"bungalow":21,"
xx.xx.xx.xx 1 20 {"responseCounts_":{"Eclectic":137,"bungalow":31,"
xx.xx.xx.xx 1 21 {"responseCounts_":{"Eclectic":171,"bungalow":28,"
xx.xx.xx.xx 1 22 {"responseCounts_":{"Eclectic":146,"bungalow":30,"
xx.xx.xx.xx 1 23 {"responseCounts_":{"Eclectic":149,"bungalow":21,"
xx.xx.xx.xx 1 24 {"responseCounts_":{"Eclectic":158,"bungalow":18,"
SQL Call
Use the Accumulate syntax element to pass the homestyle variable, to easily compare the actual and
predicted response for each observation.
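A call of roughly this form, using IDColumn ('sn') and Accumulate ('homestyle') as described above, produces the output that follows. The NumericInputs and CategoricalInputs elements are omitted so that the function takes those variables from the model.
SELECT * FROM DecisionForestPredict (
ON housing_test PARTITION BY ANY
ON rft_model AS Model DIMENSION
USING
IDColumn ('sn')
Accumulate ('homestyle')
) AS dt;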
Output
This query returns the following table:
homestyle sn prediction confidence_lower confidence_upper
--------- --- ---------- ---------------------- ----------------------
classic 13 classic 8.88888888888889E-001 8.88888888888889E-001
classic 16 classic 8.88888888888889E-001 8.88888888888889E-001
classic 25 classic 1.00000000000000E+000 1.00000000000000E+000
eclectic 38 eclectic 7.77777777777778E-001 7.77777777777778E-001
eclectic 53 eclectic 7.77777777777778E-001 7.77777777777778E-001
bungalow 104 eclectic 7.77777777777778E-001 7.77777777777778E-001
classic 111 classic 1.00000000000000E+000 1.00000000000000E+000
eclectic 117 eclectic 1.00000000000000E+000 1.00000000000000E+000
classic 132 classic 8.88888888888889E-001 8.88888888888889E-001
classic 140 classic 8.88888888888889E-001 8.88888888888889E-001
classic 142 classic 8.88888888888889E-001 8.88888888888889E-001
eclectic 157 eclectic 1.00000000000000E+000 1.00000000000000E+000
eclectic 161 eclectic 1.00000000000000E+000 1.00000000000000E+000
bungalow 162 bungalow 5.55555555555556E-001 5.55555555555556E-001
eclectic 176 eclectic 1.00000000000000E+000 1.00000000000000E+000
eclectic 177 eclectic 1.00000000000000E+000 1.00000000000000E+000
classic 195 classic 1.00000000000000E+000 1.00000000000000E+000
Prediction Accuracy
This query returns the prediction accuracy:
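A query of roughly this form computes it from the prediction output shown above; df_predict_out is a placeholder name for a table holding that output:
SELECT CAST(SUM(CASE WHEN homestyle = prediction THEN 1 ELSE 0 END) AS FLOAT)
       / COUNT(*) AS pa
FROM df_predict_out;    -- placeholder table holding the DecisionForestPredict output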
pa
0.77777777777777777778
Input
• Input table: housing_test_sample
• Model: rft_model_classification
172.24.106.80 2 2 {"responseCounts_":{"classic":53,"bungalow":8,"eclectic":97},
SQL Call
SELECT * FROM DecisionForestPredict (
ON housing_test_sample PARTITION BY ANY
ON rft_model_classification AS Model DIMENSION
USING
IDColumn ('sn')              -- syntax element values assumed; not shown in source
Accumulate ('homestyle')     -- assumed
) AS dt;
Output
DecisionTreePredict
Note:
The DecisionTreePredict function uses the model output by the ML Engine DecisionTree function to analyze the input
data and make predictions.
If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.
DecisionTreePredict Syntax
SELECT * FROM DecisionTreePredict (
ON { table | view | (query) } AS AttributeTable
PARTITION BY pid_col [,...]
ON { table | view | (query) } AS Model DIMENSION
USING
AttrTableGroupbyColumns ({ 'gcol' | gcol_range }[,...])
AttrTablePIDColumns ({ 'pid_col' | pid_col_range }[,...])
AttrTableValColumn ('value_column')
[ OutputProb ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})
[ Responses ('response'[,...]) ]
]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;
Related Information:
Column Specification Syntax Elements
AttrTablePIDColumns
Specify the names of the columns that define the data point identifiers.
AttrTableValColumn
Specify the name of the AttributeTable column that contains the input values.
OutputProb
[Required to be 'true' with Responses, optional otherwise.] Specify whether to
output probabilities.
Default: 'false'
Responses
[Optional with OutputProb, disallowed otherwise.] Specify the labels for which to
output probabilities.
Accumulate
[Optional] Specify the names of the input columns to copy to the output table.
If you are using this function to create input for ML Engine ROC function, this syntax element
must specify actual_label.
DecisionTreePredict Input
Table Description
AttributeTable Contains test data. Has same schema as ML Engine DecisionTree InputTable.
AttributeTable Schema
See Teradata Vantage™ Machine Learning Engine Analytic Function Reference, B700-4003.
Model Schema
For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.
Double quotation marks around some column names are required because the names contain
special characters.
"node_gini(p)" or INTEGER, SMALLINT, GINI impurity value for information in node. For
node_gini BIGINT, NUMBER, or ImpurityMeasurement ('gini'), column name is node_
DOUBLE PRECISION gini(p); otherwise, it is node_gini.
"node_entropy(p) INTEGER, SMALLINT, Entropy impurity value for the information in the node. For
" or node_entropy BIGINT, NUMBER, or ImpurityMeasurement ('entropy'), column name is node_
DOUBLE PRECISION entropy(p); otherwise, it is node_entropy.
"node_chisq_ INTEGER, SMALLINT, Chi-square impurity value for the information in the node.
pv(p)" or node_ BIGINT, NUMBER, or For ImpurityMeasurement ('chisquare'), column name is
chisq_pv DOUBLE PRECISION node_chisq_pv(p); otherwise, it is node_chisq_pv.
"split_gini(p)" or INTEGER, SMALLINT, GINI impurity measurement for information in node after
split_gini BIGINT, NUMBER, or splitting. For ImpurityMeasurement ('gini'), column name
DOUBLE PRECISION is split_gini(p); otherwise, it is split_gini.
left_bucket CHARACTER When split value is categorical attribute, value in left child
or VARCHAR of node.
node_majorfreq INTEGER, SMALLINT, [Column appears only with Weighted ('true').] Weighted
BIGINT, NUMBER, or objects that belong to category identified by node_label.
DOUBLE PRECISION
left_majorfreq INTEGER, SMALLINT, [Column appears only with Weighted ('true').] Weighted
BIGINT, NUMBER, or objects that belong to category identified by left_label.
DOUBLE PRECISION
right_majorfreq INTEGER, SMALLINT, [Column appears only with Weighted ('true').] Weighted
BIGINT, NUMBER, or objects that belong to category identified by right_label.
DOUBLE PRECISION
DecisionTreePredict Output
Output Table Schema
Column Data Type Description
prob DOUBLE PRECISION [Column appears only with OutputProb ('true') and without Responses syntax element.] Probability that observation belongs to class pred_label, which depends on the value of the DecisionTree syntax element ResponseProbDistType used to create the model:
ResponseProbDistType Probability Formula
'frequency' or 'rawcount' Lc / L
Where:
Operand Description
C Number of trees.
prob_for_label_response DOUBLE PRECISION [Column appears only with Responses syntax element.] Probability that observation belongs to category response, calculated as in the description of column prob.
DecisionTreePredict Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
• AttributeTable: iris_attribute_test
• Model: iris_attribute_output
Both tables are created in "DecisionTree Example: Create Model" in Teradata Vantage™ Machine
Learning Engine Analytic Function Reference, B700-4003.
For input table column descriptions, see NaiveBayesPredict Example.
iris_attribute_test
pid attribute attrvalue
5 petal_length 1.4
5 petal_width 0.2
5 sepal_length 5
5 sepal_width 3.6
10 petal_length 1.5
10 petal_width 0.1
10 sepal_length 4.9
10 sepal_width 3.1
15 petal_length 1.2
15 petal_width 0.2
15 sepal_length 5.8
15 sepal_width 4
iris_attribute_output
node_id node_size node_gini(p) node_entropy node_chisq_pv node_label node_majorvotes split_value
0 120 0.666666666666667 1.58496250072116 1 1 40 3
2 80 0.5 1 1 2 40 1.70000004768372
5 39 0.0499671268902038 0.172036949353113 1 2 38 4.90000009536743
6 41 0.0928019036287924 0.281193796432043 1 3 39 4.90000009536743
14 37 0.0525931336742148 0.179256066928321 1 3 36 2.90000009536743
30 24 0.0798611111111112 0.249882292833186 1 3 23 3.20000004768372
61 14 0.13265306122449 0.371232326640875 1 3 13 6.30000019073486
Input
See DecisionTreePredict Examples Input.
SQL Call
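A minimal sketch of the call, with column names taken from the iris_attribute_test table; the AttrTableGroupbyColumns value is illustrative:
SELECT * FROM DecisionTreePredict (
ON iris_attribute_test AS AttributeTable PARTITION BY pid
ON iris_attribute_output AS Model DIMENSION
USING
AttrTableGroupbyColumns ('attribute')
AttrTablePIDColumns ('pid')
AttrTableValColumn ('attrvalue')
) AS dt ORDER BY pid;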
Output
This query returns the following table:
The predict labels 1, 2, and 3 correspond to species setosa, versicolor, and virginica.
pid pred_label
5 1
10 1
15 1
20 1
25 1
30 1
35 1
40 1
45 1
50 1
55 2
60 2
65 2
70 2
75 2
80 2
85 2
90 2
95 2
100 2
105 3
110 3
115 3
120 2
pid pred_label
125 3
130 2
135 2
140 3
145 3
150 3
Input
See DecisionTreePredict Examples Input.
SQL Call
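Given the probability columns in the output that follows, the call presumably specified OutputProb and Responses; a sketch under that assumption:
SELECT * FROM DecisionTreePredict (
ON iris_attribute_test AS AttributeTable PARTITION BY pid
ON iris_attribute_output AS Model DIMENSION
USING
AttrTableGroupbyColumns ('attribute')
AttrTablePIDColumns ('pid')
AttrTableValColumn ('attrvalue')
OutputProb ('true')
Responses ('1', '2', '3')
) AS dt ORDER BY pid;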
Output
pid pred_label prob_for_label_1 prob_for_label_2 prob_for_label_3
TD_KMeansPredict
The TD_KMeansPredict function uses the cluster centroids in the TD_KMeans function output to assign input
data points to clusters.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_KMeansPredict Syntax
SELECT * FROM TD_KMeansPredict (
ON { table | view | (query) } as InputTable
ON { table | view | (query) } as ModelTable DIMENSION
USING
[ OutputDistance({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Accumulate({'accumulate_column' | 'accumulate_column_range'}[,...]) ]
) as alias;
Accumulate
[Optional]: Specify the input table column names to copy to the output table.
TD_KMeansPredict Input
Input Table Schema
Column Data Type Description
IdColumn Any The InputTable column name that has the unique
identifier for each input table row.
TargetColumns BYTEINT, SMALLINT, INTEGER, BIGINT, DECIMAL/NUMERIC, FLOAT, REAL, or DOUBLE PRECISION The input table column names used for clustering.
TD_KMeansPredict Output
Output Table Schema
Column Data Type Description
Id_Column ANY The unique identifier of input rows copied from the input table.
TD_DISTANCE_KMEANS REAL The distance between a data point and the center of the assigned cluster.
Note:
The column is shown if the OutputDistance element is set to 'True'.
Accumulate_Columns ANY The specified input table column names copied to the
output table.
TD_KMeansPredict Example
Input Table
id C1 C2
-- -- --
1 1 1
2 2 2
3 8 8
4 9 9
Model Table (output of TD_KMeans)
td_clusterid_kmeans C1 C2 td_size_kmeans td_withinss_kmeans id td_modelinfo_kmeans
------------------- --- --- -------------- ------------------ ---- -----------------------------------------
0 1.5 1.5 2 1 NULL NULL
1 8.5 8.5 2 1 NULL NULL
NULL NULL NULL NULL NULL NULL Converged : True
NULL NULL NULL NULL NULL NULL Number of Iterations : 2
NULL NULL NULL NULL NULL NULL Number of Clusters : 2
NULL NULL NULL NULL NULL NULL Total_WithinSS : 2.00000000000000E+00
NULL NULL NULL NULL NULL NULL Between_SS : 9.80000000000000E+01
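SQL Call
A minimal sketch of the call; the input and model table names (kmeans_input_table, kmeans_model_table) are placeholders for the tables shown above:
SELECT * FROM TD_KMeansPredict (
ON kmeans_input_table AS InputTable
ON kmeans_model_table AS ModelTable DIMENSION
USING
OutputDistance ('true')
Accumulate ('C1', 'C2')
) AS dt;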
Output
id td_clusterid_kmeans td_distance_kmeans C1 C2
--- -------------------- ------------------ --- ---
1 0 0.707106781 1 1
2 0 0.707106781 2 2
3 1 0.707106781 8 8
4 1 0.707106781 9 9
NaiveBayesPredict
Note:
The NaiveBayesPredict function uses the model output by the ML Engine NaiveBayesMap and NaiveBayesReduce
functions to analyze the input data and make predictions.
If your model table was created using a supported version of Aster Analytics on Aster Database, see AA 7.00
Usage Notes.
NaiveBayesPredict Syntax
SELECT * FROM NaiveBayesPredict (
ON { table | view | (query) } PARTITION BY ANY
ON { table | view | (query) } AS Model DIMENSION
USING
IDColumn ('test_point_id_col')
NumericInputs ('numeric_input_column'[,...] )
CategoricalInputs ('categorical_input_column'[,...] )
Responses ('response'[,...])
) AS alias;
NumericInputs
[Required if CategoricalInputs is omitted.] Specify the same numeric_input_columns that you
specified when you used the NaiveBayesMap and NaiveBayesReduce functions to create
the model table from the training data.
CategoricalInputs
[Required if NumericInputs is omitted.] Specify the same categorical_input_columns that you
specified when you used the NaiveBayesMap and NaiveBayesReduce functions to create
the model table from the training data.
Responses
Specify the responses to output.
NaiveBayesPredict Input
Table Description
Input Contains test data. Has same schema as ML Engine Naive Bayes Classifier input table.
Model Schema
For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.
Double quotation marks around some column names are required because the names are either Analytics
Database reserved keywords or are camel-case.
"sum" or sum_nb INTEGER or For numerical predictor, sum of variable values for
DOUBLE_ observations with this class, variable, and category. For
PRECISION categorical predictor, NULL.
"sumSq" or sum_sq INTEGER or For numerical predictor, sum of square of variable values
DOUBLE_ for observations with this class, variable, and category.
PRECISION For categorical predictor, NULL.
NaiveBayesPredict Output
Output Table Schema
Each row of the table represents one observation.
NaiveBayesPredict Example
Input
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
• Input table: nb_iris_input_test
• Model: nb_iris_model
The model is created in the Naive Bayes example in Teradata Vantage™ Machine Learning Engine
Analytic Function Reference, B700-4003.
sepal_length Numeric
sepal_width Numeric
petal_length Numeric
petal_width Numeric
nb_iris_input_test
id sepal_length sepal_width petal_length petal_width species
nb_iris_model
class variable type category cnt sum sumSq totalcnt
SQL Call
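A minimal sketch of the call, with column names taken from nb_iris_input_test and the iris species used as responses:
SELECT * FROM NaiveBayesPredict (
ON nb_iris_input_test PARTITION BY ANY
ON nb_iris_model AS Model DIMENSION
USING
IDColumn ('id')
NumericInputs ('sepal_length', 'sepal_width', 'petal_length', 'petal_width')
Responses ('setosa', 'versicolor', 'virginica')
) AS dt;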
Output
This query returns the following table:
The output provides a prediction for each row in the test data set and specifies the log likelihood values that
were used to make the predictions for each category.
TD_GLMPredict
The TD_GLMPredict function predicts target values (regression) and class labels (classification) for test
data using a GLM model of the TD_GLM function.
Before using the function, you must standardize the input features using the TD_ScaleFit and
TD_ScaleTransform functions.
The function accepts only numeric features. Therefore, you must convert categorical features to numeric
values before prediction.
The function skips the rows with missing (null) values during prediction.
TD_GLMPredict Syntax
SELECT * from TD_GLMPredict (
ON { table | view | (query) } AS InputTable PARTITION BY ANY
ON { table | view | (query) } AS Model DIMENSION
USING
IDColumn ('id_column')
[Accumulate({'accumulate_column'|accumulate_column_range}[,...])]
[OutputProb ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
[Responses ('response' [,...])]
) AS dt;
Accumulate
[Optional] Specify the input table column names to copy to the output table.
OutputProb
[Optional] Specify whether the function returns the probability for each response. Only
applicable if family of probability distribution is BINOMIAL. The default value is false.
Responses
[Optional] Specify the class labels if the function returns probabilities for each response.
Only applicable if the OutputProb element is True. A class label has the value 0 or 1. If not
specified, the function returns the probability of the predicted response.
TD_GLMPredict Input
The input table schema is as follows:
Column Name Data Type Description
attribute SMALLINT A numeric index that represents predictor and model metrics wherein
model metrics have negative values and predictors take positive values.
Intercept is specified using index 0.
predictor VARCHAR The predictor or model metric name. The maximum length is
32000 characters.
value VARCHAR The values of metric string. The maximum length is 30 characters.
TD_GLMPredict Output
The output table schema is as follows:
Column Name Data Type Description
id_column Same as input table The specified column name that uniquely identifies an observation in the test table.
prob FLOAT The probability that the observation belongs to the predicted class. Only appears if the OutputProb element is set to True and the Responses element is not specified.
prob_0 FLOAT The probability that the observation belongs to class 0. Only appears if the Responses element is specified.
prob_1 FLOAT The probability that the observation belongs to class 1. Only appears if the Responses element is specified.
accumulate_column Any The specified column names in the Accumulate element copied to the output table.
TD_GLMPredict Example
TD_GLMPredict Example for Credit Data
This example takes credit data and uses TD_GLM function to get a model. You can view the input and
output in the TD_GLM example.
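A minimal sketch of the call; the table and column names (credit_test, td_glm_credit_model, id) are hypothetical and stand in for the objects created in the TD_GLM example:
SELECT * FROM TD_GLMPredict (
ON credit_test AS InputTable PARTITION BY ANY      -- hypothetical test table
ON td_glm_credit_model AS Model DIMENSION          -- hypothetical model table
USING
IDColumn ('id')                                    -- hypothetical identifier column
) AS dt;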
61 1 1
297 0 0
631 0 0
122 1 1
TD_Silhouette
The TD_Silhouette function implements the silhouette method of interpreting and validating consistency within
clusters of data. The function determines how well the data is clustered.
The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other
clusters (separation). The silhouette plot displays a measure of how close each point in one cluster is to the
points in the neighboring clusters, and thus provides a way to assess parameters like the optimal number
of clusters.
The silhouette scores and their definitions are as follows:
• 1: Data is appropriately clustered
• -1: Data is not appropriately clustered
• 0: Datum is on the border of two natural clusters
Note:
The algorithm used in this function is of the order of N² (where N is the number of rows). Hence, expect
the query to run significantly longer as the number of rows in the input table increases.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_Silhouette Syntax
SELECT * FROM TD_Silhouette (
ON { table | view | (query) } as InputTable
USING
IdColumn('id_column')
ClusterIdColumn('clusterid_column')
TargetColumns({'target_column'|'target_column_range'}[,...])
[{
[ OutputType({'SCORE' | 'CLUSTER_SCORES'}) ]
}
|
{
OutputType('SAMPLE_SCORES')
[ Accumulate({'accumulate_column' | 'accumulate_column_range'}[,...]) ]
}]
)as alias;
ClusterIdColumn
[Required]: Specify the column name that contains the assigned clusterIds for the input
data points.
TargetColumns
[Required]: Specify the features or columns for clustering.
OutputType
[Optional]: Specify the output type or format.
• SCORE: Returns average silhouette score of all input samples.
• SAMPLE_SCORES: Returns silhouette score for each input sample.
• CLUSTER_SCORES: Returns average silhouette scores of input samples for
each cluster.
Allowed Values: ['SCORE','SAMPLE_SCORES','CLUSTER_SCORES']
Default Value: SCORE
Accumulate
[Optional]: Specify the input table columns to copy to the output table.
Note:
Only applicable for 'SAMPLE_SCORES' output type.
TD_Silhouette Input
Input Table Schema
Column Data Type Description
clusterid_column BYTEINT, SMALLINT, INTEGER, or BIGINT The column that contains the assigned clusterIds for the input data points.
TD_Silhouette Output
Output Table Schema
If the Output type is set to Score:
Silhouette_Score REAL Silhouette Coefficient (that is, the Mean Silhouette Score)
Id_Column Same as in Input Table The unique identifier of input rows copied from the input table.
clusterid_Column Same as in Input Table The ClusterIds of the input data points copied from the input table.
A_i REAL The mean distance of a data point to other data points in the same cluster.
accumulate_columns ANY The specified columns in the Accumulate element are copied from the input table to the output table.
TD_Silhouette Example
InputTable
id clusterid c1 c2
-- --------- -- --
1 1 1 1
2 1 2 2
3 2 8 8
4 2 9 9
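SQL Call
A minimal sketch of the call; the input table name (silhouette_input) is a placeholder for the table shown above:
SELECT * FROM TD_Silhouette (
ON silhouette_input AS InputTable
USING
IdColumn ('id')
ClusterIdColumn ('clusterid')
TargetColumns ('c1', 'c2')
OutputType ('SCORE')
) AS dt;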
Output Table
Output when the Output type is set to 'Score'.
silhouette_score
----------------
0.856410256
Output when the Output type is set to 'CLUSTER_SCORES'.
clusterid silhouette_score
--------- ----------------
1 0.856410256
2 0.856410256
TD_ClassificationEvaluator
In classification problems, a confusion matrix is used to visualize the performance of a classifier. The
confusion matrix contains predicted labels represented across the row-axis and actual labels represented
across the column-axis. Each cell in the confusion matrix corresponds to the count of occurrences of labels
in the test data.
Note:
The function also works for multi-class scenarios. In either case, the primary output table contains
class-level metrics, whereas the secondary output table contains metrics that are applicable
across classes.
Apart from accuracy, the secondary output table returns micro, macro, and weighted-averaged metrics of
precision, recall, and F1-score values.
TD_ClassificationEvaluator Syntax
SELECT * FROM TD_ClassificationEvaluator(
ON { input_table | view | (query) }
[ OUT [VOLATILE| PERMANENT] TABLE OutputTable(output_table_name) ]
USING
ObservationColumn('ObservationColumn')
PredictionColumn('PredictionColumn')
{
Labels({'Label1','Label2'} [,…]) | NumLabels('label_count')
}
)AS dt1;
PredictionColumn
[Required]: Specify the column name that has predicted labels.
Labels
[Required]: Specify the list of predicted labels.
NumLabels
[Optional]: Specify the total count of labels.
TD_ClassificationEvaluator Input
Input Table Schema
Column Data Type Description
TD_ClassificationEvaluator Output
Output Table Schema
The Primary Output table is as follows:
Precision REAL The positive predictive value. Refers to the fraction of relevant instances among
the total retrieved instances.
Recall REAL Refers to the fraction of relevant instances retrieved over the total amount of
relevant instances.
F1 REAL F1 score, defined as the harmonic mean of the precision and recall.
TD_ClassificationEvaluator Example
Input Table
id observed_value predicted_value
--- -------------- ---------------
5 setosa setosa
5 setosa setosa
5 setosa setosa
5 setosa setosa
5 setosa setosa
10 setosa setosa
10 setosa setosa
10 setosa setosa
10 setosa setosa
10 setosa setosa
15 setosa setosa
15 setosa setosa
15 setosa setosa
15 setosa setosa
15 setosa setosa
20 setosa setosa
20 setosa setosa
20 setosa setosa
20 setosa setosa
20 setosa setosa
25 setosa setosa
25 setosa setosa
25 setosa setosa
25 setosa setosa
25 setosa setosa
30 setosa setosa
30 setosa setosa
30 setosa setosa
30 setosa setosa
30 setosa setosa
35 setosa setosa
35 setosa setosa
35 setosa setosa
35 setosa setosa
35 setosa setosa
40 setosa setosa
40 setosa setosa
40 setosa setosa
40 setosa setosa
40 setosa setosa
45 setosa setosa
45 setosa setosa
45 setosa setosa
45 setosa setosa
50 setosa setosa
50 setosa setosa
50 setosa setosa
50 setosa setosa
55 versicolor versicolor
55 versicolor versicolor
55 versicolor versicolor
55 versicolor versicolor
60 versicolor versicolor
60 versicolor versicolor
60 versicolor versicolor
60 versicolor versicolor
65 versicolor versicolor
65 versicolor versicolor
65 versicolor versicolor
65 versicolor versicolor
70 versicolor versicolor
70 versicolor versicolor
70 versicolor versicolor
75 versicolor versicolor
75 versicolor versicolor
75 versicolor versicolor
80 versicolor versicolor
80 versicolor versicolor
80 versicolor versicolor
85 virginica versicolor
85 virginica versicolor
85 virginica versicolor
90 versicolor versicolor
90 versicolor versicolor
90 versicolor versicolor
95 versicolor versicolor
95 versicolor versicolor
95 versicolor versicolor
100 versicolor versicolor
100 versicolor versicolor
100 versicolor versicolor
105 virginica virginica
105 virginica virginica
105 virginica virginica
110 virginica virginica
110 virginica virginica
110 virginica virginica
115 virginica virginica
115 virginica virginica
115 virginica virginica
120 versicolor virginica
120 versicolor virginica
120 versicolor virginica
125 virginica virginica
125 virginica virginica
125 virginica virginica
130 versicolor virginica
130 versicolor virginica
130 versicolor virginica
135 versicolor virginica
SQL Call
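A minimal sketch of the call; the input table name (classification_input) is a placeholder for the table shown above:
SELECT * FROM TD_ClassificationEvaluator (
ON classification_input
USING
ObservationColumn ('observed_value')
PredictionColumn ('predicted_value')
Labels ('setosa', 'versicolor', 'virginica')
) AS dt1;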
Output Table
Primary Output table:
Output Table
Secondary Output table:
3 Micro-Recall 0.889908257
4 Micro-F1 0.889908257
5 Macro-Precision 0.862554113
6 Macro-Recall 0.877622378
7 Macro-F1 0.864444444
8 Weighted-Precision 0.902597403
9 Weighted-Recall 0.889908257
10 Weighted-F1 0.891926606
TD_RegressionEvaluator
The TD_RegressionEvaluator function computes metrics to evaluate and compare multiple models and
summarizes how close predictions are to their expected values.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_RegressionEvaluator Syntax
SELECT * FROM TD_RegressionEvaluator(
ON { table | view | (query) } as InputTable
USING
ObservationColumn('observation_column')
PredictionColumn('prediction_column')
[ Metrics('metric_1',['metric_2',.....'metric_n']) ]
[ NumOfIndependentVariables(value) ]
[ DegreesOfFreedom(df1,df2) ]
) AS alias;
PredictionColumn
[Required]: Specify the column name that has prediction values.
NumOfIndependentVariables
[Optional]: Specify the number of independent variables in the model. Required with
Adjusted R Squared metric, otherwise ignored.
DegreesOfFreedom
[Optional]: Specify the numerator degrees of freedom (df1) and denominator degrees of
freedom (df2). Required with fstat metric, else ignored.
Metrics
[Optional]: Specify the list of evaluation metrics. The function returns the following metrics if
the list is not provided:
• MAE: Mean absolute error (MAE) is the arithmetic average of the absolute errors
between observed values and predicted values.
• MSE: Mean squared error (MSE) is the average of the squares of the errors between
observed values and predicted values.
• MSLE: Mean Square Log Error (MSLE) is the relative difference between the log-
transformed observed values and predicted values.
• MAPE: Mean Absolute Percentage Error (MAPE) is the mean or average of the absolute
percentage errors of forecasts.
• MPE: Mean percentage error (MPE) is the computed average of percentage errors by
which predicted values differ from observed values.
• RMSE: Root mean squared error (RMSE) is the square root of the average of the
squares of the errors between observed values and predicted values.
• RMSLE: Root mean squared log error (RMSLE) is the square root of the relative
difference between the log-transformed observed values and predicted values.
• R2: R Squared (R2) is the proportion of the variation in the dependent variable that is
predictable from the independent variable(s).
• AR2: Adjusted R-squared (AR2) is a modified version of R-squared that has been
adjusted for the independent variable(s) in the model.
• EV: Explained variation (EV) measures the proportion to which a mathematical model
accounts for the variation (dispersion) of a given data set.
• ME: Max-Error (ME) is the worst-case error between observed values and
predicted values.
• MPD: Mean Poisson Deviance (MPD) is equivalent to Tweedie Deviances when the
power parameter value is 1.
• MGD: Mean Gamma Deviance (MGD) is equivalent to Tweedie Deviances when the
power parameter value is 2.
• FSTAT: F-statistics (FSTAT) conducts an F-test. An F-test is any statistical test in which
the test statistic has an F-distribution under the null hypothesis.
◦ F_score = F_score value from the F-test.
◦ F_Critcialvalue = F critical value from the F-test (alpha, df1, df2, UPPER_TAILED),
where alpha = 95%
◦ p_value = Probability value associated with the F_score value (F_score, df1,
df2, UPPER_TAILED)
◦ F_conclusion = F-test result, either 'reject null hypothesis' or 'fail to reject null
hypothesis'. If F_score > F_Critcialvalue, then 'reject null hypothesis'; otherwise,
'fail to reject null hypothesis'.
TD_RegressionEvaluator Input
Input Table Schema
Column Data Type Description
TD_RegressionEvaluator Output
Output Table Schema
Column Data Type Description
Metricsi FLOAT The metrics specified in the Metrics syntax element are displayed. For FSTAT, the
following columns are displayed:
• F_score
• F_Critcialvalue
• p_value
• F_Conclusion
TD_RegressionEvaluator Example
InputTable
sn price prediction
--- ---------------- ----------------
13 27000.000000000 40446.918834842
16 37900.000000000 40510.148673279
25 42000.000000000 43449.453484331
38 67000.000000000 76624.879832478
53 68000.000000000 71463.482418863
104 132000.000000000 116919.270833333
111 43000.000000000 44914.331354282
117 93000.000000000 65017.025152392
132 44500.000000000 40953.263035303
140 43000.000000000 43084.061169765
142 40000.000000000 40842.383578431
157 60000.000000000 63601.429679229
161 63900.000000000 63577.865086289
162 130000.000000000 118893.154761905
176 57500.000000000 65472.830594775
177 70000.000000000 62739.489325450
195 33000.000000000 39967.151210673
198 40500.000000000 44205.358401854
224 78500.000000000 66951.540759118
234 32500.000000000 42075.221979656
237 43000.000000000 42838.368042767
239 26000.000000000 40172.789343484
249 44500.000000000 40931.339168183
251 48500.000000000 43288.879830816
254 60000.000000000 71441.950774676
255 61000.000000000 62427.945104114
260 41000.000000000 45264.892185064
274 64900.000000000 64333.059596141
294 47000.000000000 42006.077797518
301 55000.000000000 59668.624729461
306 64000.000000000 64594.501483399
317 80000.000000000 69883.134938113
329 115442.000000000 116388.318452381
339 141000.000000000 131657.638888889
340 62500.000000000 60979.553793302
353 78500.000000000 69278.445119583
355 86900.000000000 64204.176931452
364 72000.000000000 75421.353748405
367 114000.000000000 126319.444444444
377 140000.000000000 110247.569444444
401 92500.000000000 80670.206257863
403 77500.000000000 80768.002205787
408 87500.000000000 79691.581621186
411 90000.000000000 77262.550218560
SQL Call
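A minimal sketch of the call; the input table name (regression_input) is a placeholder for the table shown above, and the metric list is illustrative:
SELECT * FROM TD_RegressionEvaluator (
ON regression_input AS InputTable
USING
ObservationColumn ('price')
PredictionColumn ('prediction')
Metrics ('MAE', 'MSE', 'RMSE', 'R2')
) AS dt;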
Output Table
TD_ROC
The Receiver Operating Characteristic (ROC) function accepts a set of prediction-actual pairs for a binary
classification model and calculates the following values for a range of discrimination thresholds:
• True-positive rate (TPR)
• False-positive rate (FPR)
• The area under the ROC curve (AUC)
• Gini coefficient
A receiver operating characteristic (ROC) curve shows the performance of a binary classification model as
its discrimination threshold varies. For a range of thresholds, the curve plots the true positive rate against
the false-positive rate.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_ROC Syntax
select * from TD_ROC(
on { input_table | view | (query) } as InputTable
[ OUT [VOLATILE| PERMANENT] TABLE OutputTable(output_table_name) ]
Using
[ ModelIDColumn ('model_id_column') ]
ProbabilityColumn ('probability_column')
ObservationColumn ('observation_column')
PositiveLabel ('positive_class_label')
[ NumThresholds (num_thresholds) ]
[ AUC ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Gini ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
)As alias;
ProbabilityColumn
[Required]: Specify the InputTable column name that contains the probability values
for predictions.
ObservationColumn
[Required]: Specify the InputTable column name that contains the actual classes.
PositiveLabel
[Required]: Specify the label of the positive class.
NumThresholds
[Optional]: Specify the number of thresholds for this function. The value must be in the range
[1, 10000]. Default value: 50 (The function uniformly distributes the thresholds between 0
and 1.)
AUC
[Optional]: Specify whether the function displays the AUC calculated from the ROC values
(thresholds, false positive rates, and true positive rates).
GINI
[Optional]: Specify whether the function displays the Gini coefficient calculated from the
ROC values. The Gini coefficient is an inequality measure among the values of a frequency
distribution. A Gini coefficient of 0 indicates that all values are the same. The closer the Gini
coefficient is to 1, the more unequal are the values in the distribution.
TD_ROC Input
Input Table Schema
Column Data Type Description
model_id_column Varchar, Char, SmallInt, BigInt, Integer The Model identifier or partition for ROC curve associated with observation.
TD_ROC Output
Output Table Schema
If the OutputTable is given in OUT clause:
Model_id Varchar, Char, SmallInt, BigInt, Integer The Model identifier or partition for ROC curve associated with observation. The column is not displayed if you do not provide the ModelIdColumn syntax element.
TPR REAL The TPR (True Positive Rate) for the threshold. Calculated as: the number of observations correctly predicted as positive based on the threshold divided by the number of positive observations.
FPR REAL The FPR (False Positive Rate) for the threshold. Calculated as: the number of observations incorrectly predicted as positive based on the threshold divided by the number of negative observations.
AUC REAL The area under the ROC curve for data in the partition. The column is not displayed if the AUC syntax element is False.
GINI REAL The Gini coefficient calculated from the ROC values. The column is not displayed if the GINI syntax element is False.
TD_ROC Example
Input Table
SQL Call
SELECT * FROM TD_ROC (
ON roc_input AS InputTable              -- input table name assumed; not shown in source
USING
ProbabilityColumn ('probability')       -- column names and positive label assumed
ObservationColumn ('observation')
PositiveLabel ('1')
AUC ('true')
GINI ('true')
) AS dt;
Output Table
NaiveBayesTextClassifierPredict
This function uses the model output by TD_NaiveBayesTextClassifierTrainer function to analyze the input
data and make predictions.
NaiveBayesTextClassifierPredict Syntax
SELECT * FROM NaiveBayesTextClassifierPredict (
ON { table | view | (query) } AS PredictorValues PARTITION BY doc_id_column [,...]
ON { table | view | (query) } AS Model DIMENSION
USING
InputTokenColumn ('input_token_column')
[ ModelType ({ 'Multinomial' | 'Bernoulli' }) ]
DocIDColumns ({ 'doc_id_column' | 'doc_id_column_range' }[,...])
[ ModelTokenColumn ('model_token_column')
ModelCategoryColumn ('model_category_column')
ModelProbColumn ('model_probability_column') ]
[ TopK ({ num_of_top_k_predictions | 'num_of_top_k_predictions' }) |
Responses ('response' [,...]) ]
[ OutputProb {'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'} ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
) AS alias;
ModelType
[Optional] Specify the model type of the text classifier.
Default: 'Multinomial'
DocIDColumns
Specify the names of the PredictorValues columns that contain the document identifier.
ModelTokenColumn
[Optional] Specify the name of the Model table column that contains the tokens.
Default: First column of Model table
ModelCategoryColumn
[Optional] Specify the name of the Model table column that contains the
prediction categories.
Default: Second column of Model table
ModelProbColumn
[Optional] Specify the name of the Model table column that contains the probability values.
Default: Third column of Model table
TopK
[Disallowed with Responses, otherwise optional.] Specify the number of most likely
prediction categories to output with their loglikelihood values (for example, the top 10 most
likely prediction categories). To see the probability of each class, use OutputProb ('true').
Default: All prediction categories
Responses
[Disallowed with TopK, otherwise optional.] Specify the labels for which to output
loglikelihood values and probabilities (with OutputProb ('true')).
OutputProb
Specify whether to output the calculated probability for each observation.
Default: 'false'
Accumulate
Specify the names of the PredictorValues table columns to copy to the output table.
NaiveBayesTextClassifierPredict Input
Table Description
PredictorValues Contains test data, for which to predict outcomes, in document-token pairs. To transform
the input document into this form, input it to TD_TextParser, or to the ML Engine functions
TextTokenizer or TextParser.
TextTokenizer and TextParser have language-processing limitations that might limit
support for Unicode input data (see Teradata Vantage™ Machine Learning Engine
Analytic Function Reference, B700-4003).
PredictorValues Schema
Column Data Type Description
Model Schema
For CHARACTER and VARCHAR columns, CHARACTER SET must be either UNICODE or LATIN.
NaiveBayesTextClassifierPredict Output
Output Table Schema
Column Data Type Description
prob DOUBLE PRECISION [Column appears only when you both specify OutputProb
('true') and omit Responses.] Probability that document
belongs to class label in prediction column, which is
max(softmax(loglik)).
prob_response DOUBLE PRECISION [Column appears only when you specify both OutputProb
('true') and Responses. Column appears once for each
specified response.] Probability that document belongs to
class label response, which is softmax(loglik_response).
NaiveBayesTextClassifierPredict Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
• PredictorValues: complaints_test_tokenized, created by applying ML Engine TextTokenizer function
to the table complaints_test, a log of vehicle complaints, as follows:
complaints_test
doc_
doc_id text_data
name
3 C WHILE DRIVING AT 60 MPH GAS PEDAL GOT STUCK DUE TO THE RUBBER
THAT IS AROUND THE GAS PEDAL.
7 G DRIVING ABOUT 5-10 MPH, THE VEHICLE HAD A LOW FRONTAL IMPACT IN
WHICH THE OTHER VEHICLE HAD NO DAMAGES. UPON IMPACT, DRIVER'S
AND THE PASSENGER'S AIR BAGS DID NOT DEPLOY, RESULTING IN
INJURIES. PLEASE PROVIDE FURTHER INFORMATION AND VIN#.
8 H THE AIR BAG WARNING LIGHT HAS COME ON. INDICATING AIRBAGS
ARE INOPERATIVE.THEY WERE FIXED ONE AT THE TIME, BUT PROBLEM
HAS REOCCURRED.
9 I CONSUMER WAS DRIVING WEST WHEN THE OTHER CAR WAS GOING
EAST. THE OTHER CAR TURNED IN FRONT OF CONSUMER'S VEHICLE,
CONSUMER HIT OTHER VEHICLE AND STARTED TO SPIN AROUND ,
COULDN'T STOP, RESULTING IN A CRASH. UPON IMPACT, AIRBAGS
DIDN'T DEPLOY.
SQL Call
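A minimal sketch of the call; the token and document-identifier column names (token, doc_id) and the TopK value are assumed, and the model table is taken to be complaints_tokens_model as in the next example:
SELECT * FROM NaiveBayesTextClassifierPredict (
ON complaints_test_tokenized AS PredictorValues PARTITION BY doc_id
ON complaints_tokens_model AS Model DIMENSION
USING
InputTokenColumn ('token')      -- column name assumed
DocIDColumns ('doc_id')         -- column name assumed
TopK (1)                        -- value assumed for illustration
Accumulate ('doc_name')
) AS dt;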
Output
Input
As in NaiveBayesTextClassifierPredict Example: TopK Specified
• PredictorValues: complaints_test_tokenized
• Model: complaints_tokens_model
SQL Call
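Given the loglik and prob columns in the output that follows, the call presumably specified Responses and OutputProb; a sketch under that assumption (token and document-identifier column names are assumed):
SELECT * FROM NaiveBayesTextClassifierPredict (
ON complaints_test_tokenized AS PredictorValues PARTITION BY doc_id
ON complaints_tokens_model AS Model DIMENSION
USING
InputTokenColumn ('token')      -- column name assumed
DocIDColumns ('doc_id')         -- column name assumed
Responses ('crash', 'no_crash')
OutputProb ('true')
Accumulate ('doc_name')
) AS dt;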
Output
doc_id prediction loglik_crash loglik_no_crash prob_crash prob_no_crash doc_name
------ ---------- ---------------------- ---------------------- ---------------------- ---------------------- --------
1 no_crash -1.38044220625651E+002 -1.17666267644292E+002 1.41243173571687E-009 9.99999998587568E-001 A
2 no_crash -1.04652470718918E+002 -9.82811865081127E+001 1.70704288519507E-003 9.98292957114805E-001 B
3 no_crash -1.03026451289745E+002 -7.62146044204976E+001 2.26862573862878E-012 9.99999999997731E-001 C
4 no_crash -1.10830711173169E+002 -8.58531176043404E+001 1.42026355157382E-011 9.99999999985797E-001 D
5 crash -1.20601083912966E+002 -1.23936921216052E+002 9.65637986161646E-001 3.43620138383542E-002 E
6 no_crash -1.30310015371040E+002 -1.17454141890718E+002 2.61074198636704E-006 9.99997389258014E-001 F
7 crash -1.20005517060745E+002 -1.23123774759574E+002 9.57639606312734E-001 4.23603936872661E-002 G
8 no_crash -1.08617321658980E+002 -9.00827983614664E+001 8.92398441816595E-009 9.99999991076016E-001 H
9 no_crash -1.19919230739025E+002 -1.16147101713878E+002 2.24857954852037E-002 9.77514204514796E-001 I
10 no_crash -1.06104244132225E+002 -8.97078469668254E+001 7.57068462691010E-008 9.99999924293154E-001 J
NGramSplitter
The NGramSplitter function tokenizes (splits) an input stream of text and outputs n multigrams (called
n-grams) based on the specified Reset, Punctuation, and Delimiter syntax elements. NGramSplitter first
splits sentences, next removes punctuation characters from them, and finally splits the words into n-grams.
NGramSplitter provides more flexibility than standard tokenization when performing text analysis. Many
two-word phrases carry important meaning (for example, "machine learning") that single-word tokens do
not capture. This, combined with additional analytical techniques, can be useful for performing sentiment
analysis, topic identification, and document classification.
NGramSplitter considers each input row to be one document, and returns a row for each unique n-gram in
each document. NGramSplitter also returns, for each document, the counts of each n-gram and the total
number of n-grams.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
NGramSplitter Syntax
SELECT * FROM NGramSplitter (
ON { table | view | (query) }
USING
TextColumn ('text_column')
Grams ({ gram_number | 'value_range' }[,...])
[ OverLapping({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ ConvertToLowerCase ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Reset ('reset_character...') ]
[ Punctuation ('punctuation_character...') ]
[ Delimiter ('delimiter_character...') ]
[ OutputTotalGramCount ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ TotalCountColName ('total_count_column') ]
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ NGramColName ('ngram_column') ]
[ GramLengthColName ('gram_length_column') ]
[ FrequencyColName ('frequency_column') ]
) AS alias;
Related Information:
Column Specification Syntax Elements
Grams
Specify the length, in words, of each n-gram (that is, the value of n). A value_range has the
syntax integer1-integer2, where integer1 <= integer2. The values of n, integer1, and integer2
must be positive.
OverLapping
[Optional] Specify whether the function allows overlapping n-grams.
Default: 'true' (Each word in each sentence starts an n-gram, if enough words follow it in the
same sentence to form a whole n-gram of the specified size. For information on sentences,
see the Reset syntax element description.)
ConvertToLowerCase
[Optional] Specify whether the function converts all letters in the input text to lowercase.
Default: 'true'
Reset
[Optional] Specify, in a string, the characters that can end a sentence. At the end of a
sentence, the function discards any partial n-grams and searches for the next n-gram at the
beginning of the next sentence. An n-gram cannot span sentences.
Default: '.,?!'
Punctuation
[Optional] Specify, in a string, the punctuation characters for the function to remove before
evaluating the input text.
Punctuation characters can be from both Unicode and Latin character sets.
Default: '`~#^&*()-'
Delimiter
[Optional] Specify the character or string that separates words in the input text.
Default: ' ' (space)
OutputTotalGramCount
[Optional] Specify whether the function returns the total number of n-grams in the document
(that is, in the row) for each length n specified in the Grams syntax element. If you specify
'true', the TotalCountColName syntax element determines the name of the output table
column that contains these totals.
The total number of n-grams is not necessarily the number of unique n-grams.
Default: 'false'
TotalCountColName
[Optional] Specify the name of the output table column that appears if the value of the
OutputTotalGramCount syntax element is 'true'.
Default: 'totalcnt'
Accumulate
[Optional] Specify the names of the input table columns to copy to the output table for each
n-gram. These columns cannot have the same names as those specified by the syntax
elements NGramColName, GramLengthColName, and TotalCountColName.
Default: All input columns for each n-gram
NGramColName
[Optional] Specify the name of the output table column that is to contain the created n-grams.
Default: 'ngram'
GramLengthColName
[Optional] Specify the name of the output table column that is to contain the length of n-gram
(in words).
Default: 'n'
FrequencyColName
[Optional] Specify the name of the output table column that is to contain the count of
each unique n-gram (that is, the number of times that each unique n-gram appears in
the document).
Default: 'frequency'
NGramSplitter Input
Input Table Schema
Each row of the table has a document to tokenize.
NGramSplitter Output
Output Table Schema
The table has a row for each unique n-gram in each input document.
NGramSplitter Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
The input table, paragraphs_input, contains sentences about commonly used machine
learning techniques.
paragraphs_input
paraid paratopic paratext
1 Decision Trees Decision tree learning uses a decision tree as a predictive model which maps
observations about an item to conclusions about the items target value. It
is one of the predictive modeling approaches used in statistics, data mining
and machine learning. Tree models where the target variable can take a finite
set of values are called classification trees. In these tree structures, leaves
represent class labels and branches represent conjunctions of features that
lead to those class labels. Decision trees where the target variable can take
continuous values (typically real numbers) are called regression trees.
2 Simple In statistics, simple linear regression is the least squares estimator of a linear
Regression regression model with a single explanatory variable. In other words, simple
linear regression fits a straight line through the set of n points in such a
way that makes the sum of squared residuals of the model (that is, vertical
distances between the points of the data set and the fitted line) as small
as possible
SQL Call
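A minimal sketch of the call; the Grams value is illustrative, and OutputTotalGramCount is set to 'true' because the output that follows includes a totalcnt column:
SELECT * FROM NGramSplitter (
ON paragraphs_input
USING
TextColumn ('paratext')
Grams ('2')                      -- gram length assumed for illustration
OutputTotalGramCount ('true')
) AS dt;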
Output
paraid paratopic ngram n frequency totalcnt
Input
The input table is paragraphs_input, as in NGramSplitter Example: Omit Accumulate.
SQL Call
SELECT * FROM NGramSplitter (
ON paragraphs_input
USING
TextColumn ('paratext')
Grams ('2')                      -- gram length assumed; not shown in source
OutputTotalGramCount ('true')    -- assumed
Accumulate ('[0:1]')
) AS dt;
Output
TD_NaiveBayesTextClassifierTrainer
The TD_NaiveBayesTextClassifierTrainer function calculates the conditional probabilities for token-category
pairs, the prior probabilities, and the missing token probabilities for all categories. The trainer function
trains the model with the probability values, and the predict function uses the values to classify documents
into categories.
TD_NaiveBayesTextClassifierTrainer Syntax
SELECT * FROM TD_NaiveBayesTextClassifierTrainer (
ON { table | view | (query) } AS InputTable
[ OUT [ PERMANENT | VOLATILE ] TABLE ModelTable (model_table_name) ]
USING
TokenColumn ('token_column')
DocCategoryColumn ('doc_category_column')
[{
[ ModelType ('Multinomial') ]
}
|
{
ModelType ('Bernoulli')
DocIDColumn ('doc_id_column')
}]
) AS alias;
DocCategoryColumn
[Required]: Specify the InputTable column name that contains the document category.
DocIDColumn
[Required for Bernoulli model type]: Specify the InputTable column name that contains the
document identifier.
ModelType
[Optional]: Specify the model type of the text classifier.
Supported Model Types: Bernoulli and Multinomial
Default: Multinomial
TD_NaiveBayesTextClassifierTrainer Input
Input Table Schema
Column Data Type Description
token_column CHAR or VARCHAR The column name that contains the classified
training tokens from a tokenization function.
Note:
The following vocabulary token names are reserved:
• NAIVE_BAYES_TEXT_MODEL_TYPE
• NAIVE_BAYES_PRIOR_PROBABILITY
• NAIVE_BAYES_MISSING_TOKEN_PROBABILITY
TD_NaiveBayesTextClassifierTrainer Output
Output Table Schema
Column Data Type Description
TD_NaiveBayesTextClassifierTrainer Example
InputTable
SQL Call
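A minimal sketch of the call; the input table, output model table, and column names are hypothetical:
SELECT * FROM TD_NaiveBayesTextClassifierTrainer (
ON token_category_train AS InputTable                 -- hypothetical input table
OUT PERMANENT TABLE ModelTable (nbtc_model)           -- hypothetical model table name
USING
TokenColumn ('token')                                 -- hypothetical column names
DocCategoryColumn ('category')
ModelType ('Multinomial')
) AS dt;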
Output Table
TD_SentimentExtractor
The TD_SentimentExtractor function uses a dictionary model to extract the sentiment (positive, negative, or
neutral) of each input document or sentence.
The dictionary model consists of WordNet, a lexical database of the English language, and these negation
words (no, not, neither, never, and similar negation words).
The function handles negated sentiments as follows:
• -1 if the sentiment is negated (for example, "I am not happy")
• -1 if the sentiment and a negation word are separated by one word (for example, "I am not very happy")
• +1 if the sentiment and a negation word are separated by two or more words (for example, "I am not
saying I am happy")
Note:
• You can omit the dimension ON clause(s) of the dictionary tables from the query if you want to use
the default sentiment dictionary.
• You can use your dictionary table and provide it as a CUSTOMDICTIONARYTABLE ON clause.
• You can provide additional dictionary entries through the ADDITIONALDICTIONARYTABLE
ON clause if you want to add more entries to either the CUSTOMDICTIONARY table or
default dictionary.
• You can access the dictionary through the OUTPUTDICTIONARYTABLE OUT clause if you want
to check the dictionary contents used during sentiment analysis.
• Only the English language is supported.
• The maximum length supported for a sentiment word in the dictionary table is 128 characters.
• The maximum length of the sentiment_words output column is 32000 characters. If the
sentiment_words output column value exceeds this limit, an ellipsis (...) is displayed at the
end of the string.
• The maximum length of the content output column is 32000 characters; that is, the supported maximum
length of a sentence is 32000 characters.
• You can have up to 10 words in a sentiment phrase.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_SentimentExtractor Syntax
SELECT * FROM TD_SentimentExtractor (
ON { table | view | (query) } AS INPUTTABLE PARTITION BY ANY
[ ON { table | view | (query) } AS CUSTOMDICTIONARYTABLE DIMENSION ]
[ ON { table | view | (query) } AS ADDITIONALDICTIONARYTABLE DIMENSION ]
[ OUT PERMANENT TABLE OUTPUTDICTIONARYTABLE (output_table_name) ]
USING
TextColumn ('text_column')
[ Accumulate ({ 'accumulate_column' | accumulate_column_range }[,...]) ]
[ AnalysisType ({ 'DOCUMENT' | 'SENTENCE' }) ]
[ Priority ({ 'NONE' | 'NEGATIVE_RECALL' | 'NEGATIVE_PRECISION' |
'POSITIVE_RECALL' | 'POSITIVE_PRECISION' }) ]
) AS alias;
Accumulate
[Optional]: Specify the input table column names to copy to the output table.
AnalysisType
[Optional]: Specify the analysis level - whether you want to analyze each document (default
level) or sentence.
Priority
[Optional]: Specify one of the following priorities for results:
• None (Default): Provide all results the same priority.
• Negative_Recall: Provide the highest priority to negative results, including those
with lower-confidence sentiment classifications (maximizes number of negative
results returned)
• Negative_Precision: Provide the highest priority to negative results with high-confidence
sentiment classifications
• Positive_Recall: Provide the highest priority to positive results, including those
with lower-confidence sentiment classifications (maximizes number of positive
results returned).
• Positive_Precision: Provide the highest priority to positive results with high confidence
sentiment classifications.
OutputType
[Optional]: Specify one of the following result types:
• All (Default): Returns all results.
• Positive: Returns only results with positive sentiments.
• Negative: Returns only results with negative sentiments.
• Neutral: Returns only results with neutral sentiments.
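As a usage sketch of how these syntax elements fit together (the table name product_reviews is an assumption; the column names match the example input later in this section):
SELECT * FROM TD_SentimentExtractor (
    ON product_reviews AS INPUTTABLE PARTITION BY ANY  -- hypothetical table with id, product, category, review
    USING
    TextColumn ('review')
    Accumulate ('id', 'product')
    AnalysisType ('DOCUMENT')          -- analyze each document as a whole
    OutputType ('ALL')                 -- return positive, negative, and neutral results
) AS dt ORDER BY id;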
TD_SentimentExtractor Input
Input Table Schema
Column Data Type Description
text_column CHAR, VARCHAR, or CLOB The InputTable column that contains the text for sentiment analysis.
accumulate_column ANY The input table columns to copy to the output table.
sentiment_word CHAR or VARCHAR The column that contains the sentiment word.
polarity_strength BYTEINT, SMALLINT, or INTEGER The column that contains the strength of the sentiment word.
TD_SentimentExtractor Output
Output Table Schema
Column Data Type Description
AccumulateColumns ANY The specified input table column names copied to the output table.
content VARCHAR The column contains the sentence extracted from the document. The
column displays if you use Sentence as the AnalysisType.
polarity VARCHAR The sentiment value of the result. Possible values are POS
(positive), NEG (negative), or NEU (neutral).
sentiment_score INTEGER The sentiment score of polarity. Possible values are 0 (neutral), 1
(higher than neutral), or 2 (higher than 1).
sentiment_words VARCHAR The string that contains the total positive score, total negative score,
and sentiment words with their polarity_strength and frequency
enclosed in parentheses.
polarity_strength INTEGER The column that contains the strength of the sentiment word.
TD_SentimentExtractor Example
Input Table
id product category review
-- ------------ -------- ------------------------------------------------------------
1 camera POS we primarily bought this camera for high image quality
and excellent video capability without paying the price for a dslr .it has
excelled in what we expected of it , and consequently represented excellent value
for me .all my friends want my camera for their vacations . i would recommend
this camera to anybody .definitely worth the price .plus , when you buy some
accessories , it becomes even more powerful
2 office suite POS it is the best office suite i have used to date . it
is launched before office 2010 and it is ages ahead of it already . the fact that
i could comfortable import xls , doc , ppt and modify them , and then export them
back to the doc , xls , ppt is terrific . i needed the compatibility .it is a
very intuitive suite and the drag drop functionality is
terrific .
3 camera POS this is a nice camera , delivering good quality video
images decent photos .
light small , using easily obtainable , high quality minidv i love it .
minor irritations include touchscreen based menu only digital photos can only be
trensferred via usb , requiring ilink and usb if you use ilink .
5 gps POS nice graphs and map route info .i would not run outside
again without this unique gadget . great job. big display , good backlight ,
really watertight , training assistant .i use in trail running and it worked well
through out the
race
6 gps NEG most of the complaints i have seen in here are from a
lack of rtfm. i have never seen so many mistakes do to what i think has to be none
update of data to the system . i wish i could make all the rating stars be
empty .
9 television NEG $3k is way too much money to drop onto a piece of
crap .poor customer support . after about 1 and a half years and hardly using the
tv , a big yellow pixilated stain appeared. product is very inferior and subject
to several lawsuits . i expressed my dissatifaction with the situation as this
is a known
issue
SQL Call
AnalysisType ('DOCUMENT')
) AS dt ORDER BY id;
Output
Example 1: Default Dictionary
sentiment_word polarity_strength
-------------- -----------------
big 0
constant 0
crap -2
difficulty -1
disappointed -1
excellent 2
fun 1
incredible 2
love 1
mistake -1
nice 1
not tolerate -1
outstanding 2
screwed 2
small 0
stuck -1
terrific 2
terrrible -2
update 0
SQL Call
Output
Example 2: With Custom Dictionary
id product content polarity sentiment_score sentiment_words
-- ------------ ------------------------------------------------ -------- --------------- ---------------
1 camera i would recommend this camera to anybody .definitely worth the
price .plus , when you buy some accessories , it becomes even more
powerful
NEU
0
1 camera we primarily bought this camera for high image quality and
excellent video capability without paying the price for a dslr .it has excelled
in what we expected of it , and consequently represented excellent value for
me .all my friends want my camera for their vacations . POS
2 In total, positive score:4 negative score:0. excellent 2 (2).
2 office suite the fact that i could comfortable import xls , doc , ppt and
modify them , and then export them back to the doc , xls , ppt is
terrific .
POS
2 In total, positive score:2 negative score:0. terrific 2 (1).
2 office suite i needed the compatibility .it is a very intuitive suite and
the drag drop functionality is
terrific .
NEU
0
2 office suite it is the best office suite i have used to
date .
NEU
0
3 camera minor irritations include touchscreen based menu only digital
photos can only be trensferred via usb , requiring ilink and usb if you use
ilink .
NEU
0
POS 2 In total,
positive score:1 negative score:0. small 0 (1), love 1 (1).
3 camera this is a nice camera , delivering good quality video images
decent
photos .
POS 2 In total,
positive score:1 negative score:0. nice 1 (1).
4 gps it is a fine
gps .
NEU
0
4 gps outstanding performance , works
great .
NEU
0
5 gps nice graphs and map route info .i would not run outside again
without this unique
gadget .
NEU
0
6 gps i have never seen so many mistakes do to what i think has to be
none update of data to the
system .
NEU
0
7 gps i found their website support difficult to
navigate .
NEU
0
7 gps i am is so disapointed and just returned it and now looking for
another
one
NEU
0
7 gps on my way home from a friends house it told me there is no possible
route .
NEU
0
7 gps this machine is all screwed
up .
it.
NEU
0
9 television i expressed my dissatifaction with the situation as this is a
known
issue
NEU
0
9 television after about 1 and a half years and hardly using the tv , a big
yellow pixilated stain
appeared.
NEG 2 In total,
positive score:0 negative score:-2. crap -2 (1).
10 camera due to the constant need for repair , i would never recommend
this
product .
NEU 0 In total,
positive score:0 negative score:0. constant 0 (1).
InputTable: Sentiment_Word_Add
sentiment_word polarity_strength
--------------- -----------------
love 2
need for repair -2
repair -1
SQL Call
Output
Example 3: With Default Dictionary and Additional Dictionary
score:-1. decent 1 (1), good 1 (1), irritations -1 (1), nice 1 (1), love 2 (1),
obtainable 1 (1).
4 gps POS 2 In total, positive score:5 negative
score:0. incredible 1 (1), outstanding 1 (1), fine 1 (1), great 1 (1), works 1
(1).
5 gps POS 2 In total, positive score:5 negative
score:0. good 1 (1), worked 1 (1), nice 1 (1), great 1 (1), well 1
(1).
6 gps NEG 2 In total, positive score:0 negative
score:-3. lack -1 (1), complaints -1 (1), mistakes -1
(1).
7 gps NEG 2 In total, positive score:1 negative
score:-3. disapointed -1 (1), screwed -1 (1), difficult -1 (1), support 1
(1).
8 camera NEG 2 In total, positive score:0 negative
score:-10. stuck -1 (1), sucks -1 (1), screwy -1 (2), not fast -1 (1), bad -1 (1),
difficulty -1 (1), horrible -1 (1), not work -1 (1), hate -1 (1).
9 television NEG 2 In total, positive score:1 negative
score:-5. crap -1 (1), issue -1 (1), stain -1 (1), inferior -1 (1), poor -1 (1),
support 1 (1).
10 camera NEG 2 In total, positive score:0 negative
score:-5. failing -1 (1), need for repair -2 (1), issue -1 (1), never recommend
-1 (1).
TD_TextParser
The TD_TextParser function performs the following operations:
• Tokenizes the text in the specified column
• Removes punctuation from the text and converts the text to lowercase
• Removes stop words from the text and converts the words to their root forms
• Creates a row for each word in the output table
• Performs stemming; that is, the function identifies the common root form of a word by removing or
replacing word suffixes
Note:
• The stems resulting from stemming may not be actual words. For example, the stem for
'communicate' is 'commun' and the stem for 'early' is 'earli' (trailing 'y' is replaced by 'i').
• This function requires the UTF8 client character set.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
TD_TextParser Syntax
SELECT * FROM TD_TextParser (
ON { table | view | (query) } AS InputTable
USING
TextColumn ('text_column')
[ ConvertToLowerCase ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ StemTokens ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ RemoveStopWords ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'}) ]
[ Delimiter ('delimiter_expression') ]
[ Punctuation ('punctuation_expression') ]
[ TokenColName ('token_column') ]
[ Accumulate ({'accumulate_column' | accumulate_column_range}[,...]) ]
) AS dt;
ConvertToLowerCase
[Optional] Specify whether to convert the text in the input table column to lowercase.
Default value: true
StemTokens
[Optional] Specify whether to convert the words in the input table column to their
root forms.
Default value: true
Delimiter
[Optional] Specify single-character delimiter values to apply to the text in the specified
column in the TextColumn element.
Default values: '\t\n\f\r'
RemoveStopWords
[Optional] Specify whether to remove stop words before parsing the text in the
specified column in the TextColumn element.
Default value: true
Punctuation
[Optional] Specify the punctuation characters to replace with a space in the text of the
specified column in the TextColumn element.
Default values: '!#$%&()*+,-./:;?@\^_`{|}~'
TokenColName
[Optional] Specify a name for the output column that contains the individual words from the
text of the specified column in the TextColumn element.
Default value: token
Accumulate
[Optional] Specify the input table column names to copy to the output table.
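As a usage sketch (not necessarily the exact call used in the example that follows), the elements above might be combined like this; the stop-words DIMENSION clause name and the exact table names are assumptions:
SELECT * FROM TD_TextParser (
    ON test_table AS InputTable                  -- table from the example below
    ON stop_words AS StopWordsTable DIMENSION    -- assumed clause name for the stopwords table
    USING
    TextColumn ('paragraph')
    ConvertToLowerCase ('true')
    StemTokens ('true')
    RemoveStopWords ('true')
    TokenColName ('token')
    Accumulate ('id')
) AS dt ORDER BY id, token;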
TD_TextParser Input
Input Table Schema
Column Data Type Description
TD_TextParser Output
Output Table Schema
Column Data Type Description
token VARCHAR CHARACTER SET LATIN/UNICODE The output column (default name: token, set with the
TokenColName element) that contains a row for each individual word.
AccumulateColumns Any The columns specified in the Accumulate element (for example, id), copied to
the output table.
TD_TextParser Example
Input table: test_table
id paragraph
-- ----------------------------------------------
1 Programmers program with programming languages
2 The quick brown fox jumps over the lazy dog
Also, create a stopwords table with a word column that contains the stop words, with the VARCHAR/
CHAR CHARACTER SET LATIN/UNICODE data type.
word
----
a
an
the
SQL Call
Output Table
id token
-- --------
1 languag
1 program
1 program
1 programm
1 with
2 brown
2 dog
2 fox
2 jump
2 lazi
2 over
2 quick
Terminology
This document uses the following terms.
Term Description
Path An ordered, start-to-finish series of actions, for example, page views, for which sequences
and sub-sequences can be created.
Sequence A sequence is the path prefixed with a carat (^), which indicates the start of a path. For
example, if a user visited page a, page b, and page c, in that order, the session sequence
is ^,a,b,c.
Subsequence For a given sequence of actions, a sub-sequence is one possible subset of the steps that
begins with the initial action. For example, the path a,b,creates three subsequences: ^,a; ^,
a,b; and ^,a,b,c.
Attribution
The Attribution function is used in web page analysis, where it lets companies assign weights to pages
before certain events, such as buying a product.
The function takes data and parameters from multiple tables and outputs attributions.
The Analytics Database Attribution function corresponds to the multiple-input version. Unlike Attribution_MLE,
Attribution does not support Unicode.
A query that runs longer than 3 seconds before displaying output indicates that the syntax elements supplied
to the function are incorrect.
Attribution Syntax
SELECT * FROM Attribution (
ON { table | view | (query) } [ AS InputTable1 ]
PARTITION BY user_id
ORDER BY times_column
[ ON { table | view | (query) } [ AS InputTable2 ]
PARTITION BY user_id
ORDER BY time_column [,...] ]
ON conversion_event_table AS ConversionEventTable DIMENSION
[ ON excluding_event_table AS ExcludedEventTable DIMENSION ]
[ ON optional_event_table AS OptionalEventTable DIMENSION ]
ON model1_table AS FirstModelTable DIMENSION
[ ON model2_table AS SecondModelTable DIMENSION ]
USING
EventColumn ('event_column')
TimeColumn ('time_column')
WindowSize ({'rows:K' | 'seconds:K' | 'rows:K&seconds:K2'})
) AS alias ORDER BY user_id,time_stamp;
TimeColumn
Specify the name of the input column that contains the timestamps of the clickstream events.
WindowSize
Specify how to determine the maximum window size for the attribution calculation:
Option Description
rows:K Assign attributions only to the K rows immediately preceding the conversion event.
seconds:K Assign attributions only to rows not more than K seconds before the
conversion event.
rows:K&seconds:K2 Apply both constraints and comply with the stricter one.
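As an illustration of how these clauses fit together, the following sketch uses the sample table and column names from the example later in this section; the WindowSize value shown is an assumption:
SELECT * FROM Attribution (
    ON attribution_sample_table1 AS InputTable1
    PARTITION BY user_id ORDER BY time_stamp
    ON attribution_sample_table2 AS InputTable2
    PARTITION BY user_id ORDER BY time_stamp
    ON conversion_event_table AS ConversionEventTable DIMENSION
    ON excluding_event_table AS ExcludedEventTable DIMENSION
    ON optional_event_table AS OptionalEventTable DIMENSION
    ON model1_table AS FirstModelTable DIMENSION
    ON model2_table AS SecondModelTable DIMENSION
    USING
    EventColumn ('event')
    TimeColumn ('time_stamp')
    WindowSize ('rows:10&seconds:20')   -- assumed window; values match the model discussion below
) AS dt ORDER BY user_id, time_stamp;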
Attribution Input
Required
Table Description
Input tables (maximum of five) Contain clickstream data for computing attributions.
Optional
Table Description
ConversionEventTable Schema
Column Data Type Description
ExcludedEventTable Schema
Column Data Type Description
OptionalEventTable Schema
Column Data Type Description
Model Specification
SIMPLE MODEL: Distribution model for all events. For MODEL and PARAMETER
PARAMETERS definitions, see following table.
Row 1, ..., n:
Row 0: Model Distribution
Additional Information
Type Model
Specification
SEGMENT_ROWS Ki:WEIGHT:MODEL:PARAMETERS Distribution model by row. Sum of Ki values must be value K specified by
'rows:K' in WindowSize syntax element.
Function considers rows from most to least recent. For example, suppose
that function call has these syntax elements:
WindowSize ('rows:10')
Model1 ('SEGMENT_ROWS',
'3:0.5:UNIFORM:NA',
'4:0.3:LAST_CLICK:NA',
'3:0.2:FIRST_CLICK:NA')
Attribution for a conversion event is divided among attributable events in 10
rows immediately preceding conversion event. If conversion event is in row
11, first model specification applies to rows 10, 9, and 8; second applies to
rows 7, 6, 5, and 4; and third applies to rows 3, 2, and 1.
Half attribution (5/10) is uniformly divided among rows 10, 9, and 8; 3/10 to
last click in rows 7, 6, 5, and 4 (that is, in row 7), and 2/10 to first click in rows
3, 2, and 1 (that is, in row 1).
SEGMENT_SECONDS Ki:WEIGHT:MODEL:PARAMETERS Distribution model by time in seconds. Sum of Ki values must be value K
specified by 'seconds:K' in WindowSize syntax element.
Function considers rows from most to least recent. For example, suppose
that function call has these syntax elements:
WindowSize ('seconds:20')
Model1 ('SEGMENT_SECONDS',
'6:0.5:UNIFORM:NA',
'8:0.3:LAST_CLICK:NA',
'6:0.2:FIRST_CLICK:NA')
Attribution for a conversion event is divided among attributable events
in 20 seconds immediately preceding conversion event. If conversion
event is at second 21, first model specification applies to seconds 20-15
(counting backward); second applies to seconds 14-7; and third applies to
seconds 6-1.
Half attribution (5/10) is uniformly divided among seconds 20-15; 3/10 to last
click in seconds 14-7, and 2/10 to first click in seconds 6-1.
'WEIGHTED' Conversion event is attributed to preceding attributable events with weights
specified by PARAMETERS. You can specify any number of weights. If there are more
attributable events than weights, extra (least recent) events are assigned zero weight.
If there are more weights than attributable events, then function renormalizes weights.
Applies to: EVENT_REGULAR, SEGMENT_ROWS, SEGMENT_SECONDS (when you specify
'rows:K&seconds:K' in WindowSize syntax element)
Attribution Output
Attribution Output Table Schema
Column Data Type Description
excluding Email
Input
InputTable1: attribution_sample_table1
user_id event time_stamp
InputTable2: attribution_sample_table2
user_id event time_stamp
ConversionEventTable: conversion_event_table
conversion_events
PaidSearch
SocialNetwork
ExcludedEventTable: excluding_event_table
excluding_events
OptionalEventTable: optional_event_table
optional_events
Direct
OrganicSearch
Referral
The following two model tables apply the distribution models by rows and by seconds, respectively.
FirstModelTable: model1_table
id model
0 SEGMENT_ROWS
1 3:0.5:EXPONENTIAL:0.5,SECOND
2 4:0.3:WEIGHTED:0.4,0.3,0.2,0.1
3 3:0.2:FIRST_CLICK:NA
SecondModelTable: model2_table
id model
0 SEGMENT_SECONDS
1 6:0.5:UNIFORM:NA
id model
2 8:0.3:LAST_CLICK:NA
3 6:0.2:FIRST_CLICK:NA
SQL Call
Output
user_id event time_stamp attribution time_to_conversion
Sessionize
The Sessionize function maps each click in a session to a unique session identifier. A session is a sequence
of clicks by one user that are separated by at most n seconds.
The function is useful for both sessionization and detecting web crawler ("bot") activity. A typical use is to
understand user browsing behavior on a web site.
Sessionize Syntax
SELECT * FROM Sessionize (
ON { table | view | (query) }
PARTITION BY expression [,...]
ORDER BY order_column [,...]
USING
TimeColumn ('time_column')
TimeOut (session_timeout)
[ ClickLag (min_human_click_lag) ]
[ EmitNull ({'true'|'t'|'yes'|'y'|'1'|'false'|'f'|'no'|'n'|'0'})]
) AS alias;
TimeOut
Specify the number of seconds at which the session times out. If session_timeout seconds
elapse after a click, the next click starts a new session. The data type of session_timeout is
DOUBLE PRECISION.
ClickLag
[Optional] Specify the minimum number of seconds between clicks for the session user to be
considered human. If clicks are more frequent, indicating that the user is a bot, the function
ignores the session. The min_human_click_lag must be less than session_timeout. The data
type of min_human_click_lag is DOUBLE PRECISION.
Default behavior: The function ignores no session, regardless of click frequency.
EmitNull
[Optional] Specify whether to output rows that have NULL values in their session id and rapid
fire columns, even if their time_column has a NULL value.
Default: 'false'
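A minimal call following this syntax, using the example table that appears later in this section, might look like the following sketch; the TimeOut and ClickLag values are assumptions for illustration:
SELECT * FROM Sessionize (
    ON sessionize_table                 -- example input table shown below
    PARTITION BY partition_id
    ORDER BY clicktime
    USING
    TimeColumn ('clicktime')
    TimeOut (60)                        -- assumed: a session ends after 60 idle seconds
    ClickLag (0.2)                      -- assumed: clicks closer than 0.2 seconds suggest a bot
) AS dt;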
Sessionize Input
Input Table Schema
Column Data Type Description
time_column TIME, TIMESTAMP, INTEGER, BIGINT, SMALLINT, or DATE Click times (in milliseconds if data
type is INTEGER, BIGINT, or SMALLINT).
partition_column Any Column by which input data is partitioned. Input data must be partitioned such
that each partition contains all rows of an entity.
No input table column can have the name 'sessionid' or 'clicklag', because these are output table
column names.
Tip:
To create a single timestamp column from separate date and time columns:
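For example, assuming hypothetical click_date (DATE) and click_time (TIME) columns in a raw_clicks table, one possible sketch is to cast the date to a timestamp and add the time of day as an interval:
SELECT
    CAST(click_date AS TIMESTAMP(0))
      + ((click_time - TIME '00:00:00') HOUR TO SECOND) AS clicktime  -- combined timestamp
FROM raw_clicks;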
Sessionize Output
Output Table Schema
Column Data Type Description
input_column Same as in input table Column copied from input table. The function copies every input
column to the output table.
Sessionize Example
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
sessionize_table
partition_id clicktime userid productname pagetype referrer productprice
SQL Call
Output
partition_id clicktime userid productname pagetype referrer productprice SESSIONID CLICKLAG
nPath
The nPath function scans a set of rows, looking for patterns that you specify. For each set of input rows that
matches the pattern, nPath produces a single output row. The function provides a flexible pattern-matching
capability that lets you specify complex patterns in the input data and define the values that are output for
each matched input set.
nPath is useful when your goal is to identify the paths that lead to an outcome. For example, you can use
nPath to analyze:
• Web site click data, to identify paths that lead to sales over a specified amount
• Sensor data from industrial processes, to identify paths to poor product quality
• Healthcare records of individual patients, to identify paths that indicate that patients are at risk of
developing conditions such as heart disease or diabetes
• Financial data for individuals, to identify paths that provide information about credit or fraud risks
The output from the nPath function can be input to other ML Engine functions or to a visualization tool such
as Teradata® AppCenter.
Note:
• This function requires the UTF8 client character set for UNICODE data.
• This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International
Character Set Support, B035-1125.
• This function does not support KanjiSJIS or Graphic data types.
• When used with this function, the ORDER BY clause supports only ASCII collation.
• When used with this function, the PARTITION BY clause assumes column names are in
Normalization Form C (NFC).
nPath Syntax
SELECT * FROM nPath (
ON { table | view | (query) }
PARTITION BY partition_column
ORDER BY order_column [ ASC | DESC ][...]
[ ON { table | view | (query) }
[ PARTITION BY partition_column | DIMENSION ]
ORDER BY order_column [ ASC | DESC ]
][...]
USING
Mode ({ OVERLAPPING | NONOVERLAPPING })
Pattern ('pattern')
Symbols ({ col_expr = symbol_predicate AS symbol}[,...])
[ Filter (filter_expression[,...]) ]
Result ({ aggregate_function (expression OF [ANY] symbol [,...]) AS alias_1 }[,...])
) AS alias_2;
Mode
Specify the pattern-matching mode:
OVERLAPPING Find every occurrence of the pattern in the partition, regardless of whether it is
part of a previous match. (A row can match more than one pattern.)
NONOVERLAPPING Start next pattern search at row that follows last pattern match.
Pattern
Specify the pattern for which the function searches. You compose pattern with the symbols
(which you define in the Symbols syntax element), operators, and parentheses.
When patterns have multiple operators, the function applies them in order of precedence,
and applies operators of equal precedence from left to right. To force the function to evaluate
a subpattern first, enclose it in parentheses. For more information, see nPath Patterns.
Symbols
Defines the symbols that appear in the values of the Pattern and Result syntax elements. The
col_expr is an expression whose value is a column name, symbol is any valid identifier, and
symbol_predicate is a SQL predicate (often a column name).
Each col_expr = symbol_predicate must satisfy the SQL syntax of the Analytics Database
when nPath is invoked. Otherwise, it is a syntax error.
For example, this Symbols syntax element is for analyzing website visits:
Symbols (
pagetype = 'homepage' AS H,
pagetype <> 'homepage' AND pagetype <> 'checkout' AS PP,
pagetype = 'checkout' AS CO
)
The symbol is case-insensitive; however, a symbol of one or two uppercase letters is easy
to identify in patterns.
If col_expr represents a column that appears in multiple input tables, you must qualify the
ambiguous column name with its table name. For example:
Symbols (
weblog.pagetype = 'homepage' AS H,
weblog.pagetype = 'thankyou' AS T,
ads.adname = 'xmaspromo' AS X,
ads.adname = 'realtorpromo' AS R
)
For more information about symbols that appear in the Pattern syntax element value, see
nPath Symbols. For more information about symbols that appear in the Result syntax
element value, see nPath Results.
Filter
[Optional] Specify filters to impose on the matched rows. The function combines the filter
expressions using the AND operator.
This is the filter_expression syntax:
{ FIRST | LAST } (column_with_expression OF [ANY] (symbol[,...]))
comparison_operator
{ FIRST | LAST } (column_with_expression OF [ANY] (symbol[,...]))
The column_with_expression cannot contain the operator AND or OR, and all its columns
must come from the same input. If the function has multiple inputs, column_with_expression
and symbol must come from the same input.
The comparison_operator is either <, >, <=, >=, =, or <>.
Result
Defines the output columns. The col_expr is an expression whose value is a column
name; it specifies the values to retrieve from the matched rows. The function applies
aggregate_function to these values. For details, see nPath Results.
The function evaluates this syntax element once for every matched pattern in the partition
(that is, it outputs one row for each pattern match).
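Putting these syntax elements together, a complete call might look like the following sketch, which reuses the website-visit symbols from the Symbols discussion above; the table and column names (clickstream, userid, clicktime, sessionid, pagetype) match the filter example later in this chapter, while the pattern and page-type values are assumptions:
SELECT * FROM nPath (
    ON clickstream PARTITION BY userid ORDER BY clicktime
    USING
    Mode (NONOVERLAPPING)
    Pattern ('H.PP*.CO')                -- home page, any non-checkout pages, then checkout (assumed pattern)
    Symbols (
        pagetype = 'homepage' AS H,
        pagetype <> 'homepage' AND pagetype <> 'checkout' AS PP,
        pagetype = 'checkout' AS CO
    )
    Result (
        FIRST (sessionid OF H) AS sessionid,
        COUNT (* OF PP) AS pages_between,
        ACCUMULATE (pagetype OF ANY (H, PP, CO)) AS path
    )
) AS dt;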
nPath Input
The function requires at least one partitioned input table, and can have additional input tables that are either
partitioned or DIMENSION tables.
Note:
If the input to nPath is nondeterministic, the results are nondeterministic.
nPath Output
The Result syntax element determines the output—see nPath Results.
nPath Symbols
A symbol identifies a row in the Pattern and Result syntax elements. A symbol can be any valid identifier
(that is, a sequence of characters and digits that begins with a character) but is typically one or two
uppercase letters. Symbols are case-insensitive; that is, 'SU' is identical to 'su', and the system reports an
error if you use both.
For example, suppose that you have this input table:
1 ? 81 30 0.0 5 NW 1
2 Tempe 76 40 0.2 15 NE 0
3 ? 70 70 0.4 10 N 0
4 Tusayan 75 50 0.4 5 NW 0
This table has examples of symbol definitions and the rows of the table that they match in
NONOVERLAPPING mode:
temp >= 80 AS H 1
winddirection = 'NW' AS NW 1, 4
TRUE AS A 1, 2, 3, 4
This symbol definition matches all rows, for any
input table.
city like 'tu%' AS TU The like operator depends on Teradata Session mode:
Mode Match
BTET 1, 3, 4
ANSI None
Rows with NULL values do not match any symbol. That is, the function ignores rows with missing values.
You can create symbol predicates that compare a row to a previous or subsequent row, using a LAG or
LEAD operator.
where:
• current_expr is the name of a column from the current row (or an expression operating on
this column).
Input
bank_web_clicks
customer_id session_id page datestamp
SQL Call
Output
Columns 1-4
customer_id session_id first_date last_date
Columns 5-6
page_path dup_path
... ...
Input
aggregate_clicks
userid sessionid productname pagetype clicktime referrer productprice
SQL Call
Output
first_product max_product sessionid
bookcases cellphones 5
nPath Patterns
The value of the Pattern syntax element specifies the sequence of rows for which the function searches.
You compose the pattern definition, pattern, with symbols (which you define in the Symbols syntax
element), operators, and parentheses. In the pattern definition, symbols represent rows. You can combine
symbols with pattern operators to define simple or complex patterns of rows for which to search.
The following table lists and describes the basic pattern operators, in decreasing order of precedence. In
the table, A and B are symbols that have been defined in the Symbols syntax element.
A.B Matches two rows, where the first row meets the definition of A and the second
row meets the definition of B. (Precedence: 2)
The nPath function uses greedy pattern matching. That is, it finds the longest available match
when matching patterns specified by nongreedy operators. For more information, see nPath Greedy
Pattern Matching.
These examples show the pattern operator precedence rules:
• A.B+ is the same as A.(B+)
• A|B* is the same as A|(B*)
• A.B|C is the same as (A.B)|C
Example:
A.(B|C)+.D?.X*.A
The preceding pattern definition matches any set of rows whose first row meets the definition of symbol
A, followed by a nonempty sequence of rows, each of which meets the definition of either symbol B or C,
optionally followed by one row that meets the definition of symbol D, followed by any number of rows that
meet the definition of symbol X, and ending with a row that meets the definition of symbol A.
You can use parentheses to define precedence rules. Parentheses are recommended for clarity, even
where not strictly required.
To indicate that a sequence of rows must start or end with a row that matches a certain symbol, use the
start anchor (^) or end anchor ($) operator.
^A Appears only at the beginning of a pattern. Indicates that a set of rows must start with a row
that meets the definition of A.
A$ Appears only at the end of a pattern. Indicates that a set of rows must end with a row that
meets the definition of A.
Subpattern operators let you specify how often a subpattern must appear in a match. You can specify
a minimum number, exact number, or range. In the following table, X represents any pattern definition
composed of symbols and any of the previously described pattern operators.
Subpattern Operators
Operator Description
The nPath function uses greedy pattern matching, finding the longest available match despite any
nongreedy operators in the pattern.
For example, consider the input table link2:
job_transition_path path_count
[Chief Exec Officer, Software Engineer, Software Engineer, Chief Exec Officer, Chief Exec Officer] 1
In the pattern, CEO matches the first row, ENGR matches the second row, and OTHER* matches the
remaining rows:
USING
Mode (NONOVERLAPPING)
Pattern ('CEO.ENGR.OTHER*.CEO')
Symbols (
job_title like '%Software Eng%' AS ENGR,
TRUE AS OTHER,
job_title like 'Chief Exec Officer' AS CEO
)
Result (accumulate(job_title OF ANY(ENGR,OTHER,CEO)) AS job_transition_path)
) AS dt GROUP BY 1 ORDER BY 2 DESC;
job_transition_path path_count
[Chief Exec Officer, Software Engineer, Software Engineer, Chief Exec Officer, Chief Exec Officer] 1
In the pattern, CEO matches the first row, ENGR matches the second row, OTHER* matches the next two
rows, and CEO matches the last row:
nPath Filters
The Filter syntax element specifies filters to impose on the matched rows.
Using clickstream data from an online store, this example finds the sessions where the user visited the
checkout page within 10 minutes of visiting the home page. Because there is no way to know in advance
how many rows might appear between the home page and the checkout page, the example cannot use
a LAG or LEAD expression. Therefore, it uses the Filter syntax element.
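A Filter element expressing the 10-minute constraint described above might look like the following fragment; the symbol names H and CO and the clicktime column are assumptions consistent with the rest of this chapter:
Filter (FIRST (clicktime OF ANY (H)) + INTERVAL '10' MINUTE >= FIRST (clicktime OF ANY (CO)))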
Input
clickstream
userid sessionid clicktime pagetype
SQL Call
Output
userid sessionid cnt firsthome lastcheckout
nPath Results
The Result syntax element defines the output columns, specifying the values to retrieve from the matched
rows and the aggregate function to apply to these values.
For each pattern, the nPath function can apply one or more aggregate functions to the matched rows and
output the aggregated results. These are the supported aggregate functions:
• SQL aggregate functions AVG, COUNT, MAX, MIN, and SUM, described in Teradata Vantage™ - SQL
Functions, Expressions, and Predicates, B035-1145
• ML Engine nPath sequence aggregate functions described in the following table
In the following table, col_expr is an expression whose value is a column name, symbol is defined by the
Symbols syntax element, and symbol_list has this syntax:
Function Description
COUNT ( { * | [DISTINCT] col_expr } OF symbol_list )
Returns either the total number of matched rows (*) or the number (or distinct
number) of col_expr values in the matched rows.
Function Description
You can compute an aggregate over more than one symbol. For example, SUM (val OF ANY (A,B))
computes the sum of the values of the attribute val across all rows in the matched segment that map to A
or B.
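For example, a Result element that combines a SQL aggregate with nPath sequence aggregates over several symbols could look like the following fragment; the column names val and pagetype and the symbols A and B are placeholders:
Result (
    SUM (val OF ANY (A, B)) AS total_val,                 -- sum over rows matching A or B
    COUNT (DISTINCT pagetype OF ANY (A, B)) AS distinct_pages,
    FIRST (val OF A) AS first_val
)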
Input
trans1
userid gender ts productname productamt
SQL Call
)
) ORDER BY 1;
Output
userid gender max_prod min_prod
1 M television envelopes
2 F appliances bookcases
3 F cellphones dvds
Input
clicks
userid sessionid productname pagetype clicktime referrer productprice
SQL Call
Output
sessionid firsthome firstcheckout products_accumulate cde_dup_products de_dup_products
nPath Results Example: FIRST, Three Forms of ACCUMULATE, COUNT, and NTH
Input
The input table for this example is clicks, as in nPath Results Example: FIRST and Three Forms
of ACCUMULATE.
SQL Call
Output
count_distinct_ consecutive_
sessionid firsthome firstcheckout products_accumulate distinct_products nth
products distinct_products
1 06:59:13 07:00:12 [null, null, television, television, 2 [null, television, envelopes, null] [null, ?
envelopes, null] television, envelopes]
nPath Results Example: Combine Values from One Row with Values from the
Next Row
Input
The input table is clickstream, as in nPath Filters Example.
SQL Call
Output
sessionid pageid next_pageid
1 home view
1 view view
1 checkout view
1 checkout checkout
1 view checkout
1 view view
2 checkout view
2 home view
2 view view
2 view checkout
Input
The example has two input tables that include Hindi characters.
हिंदी टेबल
सत्रआईडी क्लिककरें token उत्पादकानाम पेजकाप्रकार रेफरर
9000 05:30:15.000000 ? घर
9001 05:30:15.000000 ? घर
1 18:00:00.000000 10 लॉग इन
1 18:00:10.000000 10 घर
400 10:05:02.000000 18 घर
14 13:18:31.000000 8 कागज एक
400 18:00:10.000000 18 घर
14 13:18:32.000000 8 page2
666 12:50:15.000000 40 घर
500 08:15:15.000000 31 घर
10000 16:00:10.000000 1 घर
कवज्ञापन
ररलेसमय channel कवज्ञापन duration
SQL Call
Output
रेफरल पथ सत्रआईडी
[, ] 8000
[, , , , , , , , , , , ] 400
[, , , , ] 500
[, ] 666
[, ] 9001
[, ] 9000
[, ] 10000
Input
unicode_path
id price event
2 -1.20000000000000E-001 ఈవట4
SQL Call
Mode (NONOVERLAPPING)
Pattern ('A*')
Symbols (true AS A)
Result (
ACCUMULATE (DISTINCT event OF A DELIMITER ', ' ) AS acc_result_distinct,
ACCUMULATE (CDISTINCT event OF A DELIMITER ', ' ) AS acc_result_cdistinct
)
) AS dt;
Output
acc_result_distinct acc_result_cdistinct
[ఈవట3, ఈవట5, ఈవట1, ఈవట4, ఈవట2] [ఈవట3, ఈవట5, ఈవట1, ఈవట4, ఈవట2, ఈవట1]
• In a symbol, the Boolean expression TRUE, NOT TRUE, or integer can be enclosed in parentheses
or quotation marks. Aggregate functions compare strings using the Unicode value of each character
(lexicographic order), ignoring CHARACTER SET.
• In a symbol, the Boolean expression TRUE, NOT TRUE, or integer cannot be enclosed in parentheses
or quotation marks. Aggregate functions compare strings using sort order, based on CHARACTER SET,
CASESPECIFIC, and COLLATION.
AVG
Database Syntax Element Data Type Return Data Type
INTERVAL, or DATE without TIME or TIMESTAMP Same as syntax element data type
COUNT
Database Syntax Element Data Type Return Data Type
0 or 15 NUMERIC(15,0) -(15)9
18 NUMERIC(18,0) -(18)9
38 NUMERIC(38,0) -(38)9
Aster Any numeric, string, or DateTime type Same as syntax element data type
Teradata Any numeric, character, DateTime or If not UDT: Same as syntax element data type
Interval data type, or BYTE UDT: Data type to which UDT is implicitly cast
SUM
Database Syntax Element Data Type Return Data Type
BIGINT NUMERIC
Teradata NUMERIC, INTERVAL, or DATE Same as syntax element data type, except
without TIME or TIMESTAMP for NUMERIC(n,m), which returns NUMERIC(p,m),
where p depends on MaxDecimal value in DBSControl
—see following table.
MaxDecimal Value n p
0 or 15 n ≤ 15 15
        15 < n ≤ 18 18
        n > 18 38
18      n ≤ 18 18
        n > 18 38
38      Any value 38
nPath Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
C category IN (SELECT pageid FROM clicks1 GROUP BY userid HAVING COUNT(*) > 10)
X TRUE
Input
This statement creates the input table of clickstream data that the examples use:
This statement gets the pageid for each row and the pageid for the next row in sequence:
Mode (OVERLAPPING)
Symbols (
pageid = 50 AS A,
pageid = 80 AS B,
pageid <> 80 AND category IN (9,10) AS C
)
Result (
LAST(pageid OF ANY (A,B,C)) AS last_pageid,
COUNT (* OF B) AS count_page80,
COUNT (* OF ANY (A,B,C)) AS count_any
)
) AS dt WHERE dt.count_any >= 5
GROUP BY dt.last_pageid
ORDER BY MAX(dt.count_page80);
Whenever a user visits the home page and then visits checkout pages and buys increasingly expensive
products, the nPath query returns the first purchase and the most expensive purchase.
Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.
SQL-MapReduce Call
Output
sessionid path
1 [home, home1, page1, home, home1, page1, home, home, home, home1, page1,
checkout, home, home, home, home, home, home, home, home, home]
2 [home, home, home, home, home, home, home, home, home, home1, page1, checkout,
checkout, home, home]
3 [home, home, home, home, home, home, home, home, home1, page1, home, home1,
page1, home]
4 [home, home, home, home, home, home, home1, home1, home1, page1, page1, page1]
5 [home, home, home, home, home1, home1, home1, page1, page1, page1, page2, page2,
page2, checkout, checkout, checkout, page2, page2, page2]
nPath Range-Matching Example: Find Sessions That Start at Home Page and
Visit Page1
Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.
SQL-MapReduce Call
Output
sessionid path
1 [home, home1, page1, home, home1, page1, home, home, home, home1, page1,
checkout, home, home, home, home, home, home, home, home, home]
2 [home, home, home, home, home, home, home, home, home, home1, page1, checkout,
checkout, home, home]
3 [home, home, home, home, home, home, home, home, home1, page1, home, home1,
page1, home]
4 [home, home, home, home, home, home, home1, home1, home1, page1, page1, page1]
5 [home, home, home, home, home1, home1, home1, page1, page1, page1, page2, page2,
page2, checkout, checkout, checkout, page2, page2, page2]
Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.
SQL-MapReduce Call
Output
sessionid path totalsum
1 [home, home1, page1, home, home1, page1, home, home, home, 602.857142857143
home1, page1, checkout, home, home, home, home, home, home,
home, home, home]
5 [home, home, home, home, home1, home1, home1, page1, page1, 363.157894736842
page1, page2, page2, page2, checkout, checkout, checkout,
page2, page2, page2]
Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.
SQL-MapReduce Call
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [home, home]
1 [checkout, home]
1 [page1, checkout]
1 [home1, page1]
1 [home, home1]
1 [home, home]
1 [home, home]
1 [page1, home]
1 [home1, page1]
1 [home, home1]
1 [page1, home]
1 [home1, page1]
1 [home, home1]
2 [home, home]
2 [checkout, home]
2 [checkout, checkout]
... ...
Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.
SQL-MapReduce Call
Output
sessionid product
1 envelopes
2 tables
3 bookcases
4 tables
5 Appliances
nPath Range-Matching Example: Find Data for Sessions That Checked Out
3-6 Products
Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.
SQL-MapReduce Call
USING
Mode (NONOVERLAPPING)
Pattern ('H+.D*.C{3,6}.D')
Symbols (
pagetype = 'home' AS H,
pagetype='checkout' AS C,
pagetype<>'home' AND pagetype<>'checkout' AS D
)
Result (
FIRST (sessionid OF C) AS sessionid,
max_choose (productprice, productname OF C) AS
most_expensive_product,
MAX (productprice OF C) AS max_price,
min_choose (productprice, productname of C) AS
least_expensive_product,
MIN (productprice OF C) AS min_price)
) AS dt ORDER BY dt.sessionid;
Output
sessionid most_expensive_product max_price least_expensive_product min_price
nPath Range-Matching Example: Find Data for Sessions That Checked Out at Least
3 Products
Input
The input table is aggregate_clicks, from LAG and LEAD Expressions Example: No Alias for Input Query.
Modify the previous query call in nPath Range-Matching Example: Find Data for Sessions That Checked
Out 3-6 Products to find sessions where the user checked out at least three products by changing the
Pattern syntax element to:
Pattern ('H+.D*.C{3,}.D')
SQL-MapReduce Call
pagetype = 'home' AS H,
pagetype='checkout' AS C,
pagetype<>'home' AND pagetype<>'checkout' AS D
)
Result (
FIRST(sessionid OF C) AS sessionid,
max_choose(productprice, productname OF C) AS
most_expensive_product,
MAX (productprice OF C) AS max_price,
min_choose (productprice, productname OF C) AS
least_expensive_product,
MIN (productprice OF C) AS min_price
)
) AS dt ORDER BY dt.sessionid;
Output
sessionid most_expensive_product max_price least_expensive_product min_price
An e-commerce store wants to count the advertising impressions that lead to a user clicking an online
advertisement. The example counts the online advertisements that the user viewed and the television
advertisements that the user might have viewed.
Input
impressions
userid ts imp
1 2012-01-01 ad1
1 2012-01-02 ad1
1 2012-01-03 ad1
1 2012-01-04 ad1
1 2012-01-05 ad1
1 2012-01-06 ad1
1 2012-01-07 ad1
2 2012-01-08 ad2
userid ts imp
2 2012-01-09 ad2
2 2012-01-10 ad2
2 2012-01-11 ad2
clicks2
userid ts click
1 2012-01-01 ad1
2 2012-01-08 ad2
3 2012-01-16 ad3
4 2012-01-23 ad4
5 2012-02-01 ad5
6 2012-02-08 ad6
7 2012-02-14 ad7
8 2012-02-24 ad8
9 2012-03-02 ad9
10 2012-03-10 ad10
11 2012-03-18 ad11
12 2012-03-25 ad12
13 2012-03-30 ad13
14 2012-04-02 ad14
15 2012-04-06 ad15
tv_spots
ts tv_imp
2012-01-01 ad2
2012-01-02 ad2
2012-01-03 ad3
2012-01-04 ad4
2012-01-05 ad5
ts tv_imp
2012-01-06 ad6
2012-01-07 ad7
2012-01-08 ad8
2012-01-09 ad9
2012-01-10 ad10
2012-01-11 ad11
2012-01-12 ad12
2012-01-13 ad13
2012-01-14 ad14
2012-01-15 ad15
SQL-MapReduce Call
The tables impressions and clicks have a user_id column, but the table tv_spots is only a record of
television advertisements shown, which any user might have seen. Therefore, tv_spots must be a
dimension table.
Output
dt.imp_cnt tv_imp_cnt
18 0
19 0
dt.imp_cnt tv_imp_cnt
19 0
20 0
21 0
22 0
22 0
22 0
22 0
22 0
23 0
23 0
23 0
24 0
25 0
Hypothesis testing functions find the relative likelihood of hypotheses. You can accept the most likely
hypotheses and reject the least likely.
Component Description
Alpha (α) The Null Hypothesis is rejected if the P-value is smaller than the specified Alpha
(Also called value (where Alpha is the probability of rejecting the null hypothesis when it is true).
significance level or Most common α values are 0.01, 0.05, and 0.10, corresponding to 99%, 95%, and
Type I error.) 90% confidence, respectively.
Results are "statistically significant at α."
Test statistic Value to which data set is reduced, used in hypothesis test. Its sampling
distribution under null hypothesis must be calculable (exactly or approximately),
making p_values calculable.
Critical value Quantile of distribution of test statistic under null hypothesis. Used to determine
rejection region.
p_value Probability of test results at least as extreme as test statistic results observed
under assumption that null hypothesis is true.
The smaller the p_value, the stronger the evidence against the null hypothesis.
One-tailed test Rejection region is the lower tail or the upper tail of the sampling distribution under the
null hypothesis H0.
Two-tailed test The null hypothesis assumes that μ = μ0 where μ0 is a specified value.
Two-tailed test considers both lower and upper tails of distribution of test statistic.
Alternate hypothesis (H1): μ ≠ μ0
Unpaired test Compares different subjects drawn from two independent populations.
Null hypothesis (H0): μ1 = μ2
The alternate hypotheses are as follows:
• Alternate hypothesis for upper-tailed test (H1): μ1 > μ2
• Alternate hypothesis for lower-tailed test (H1): μ1 < μ2
• Alternate hypothesis for two-tailed test (H1): μ1 ≠ μ2
TD_ANOVA
Analysis of variance (ANOVA) is a statistical test that analyzes the difference between the means of more
than two groups.
The null hypothesis (H0) of ANOVA is that there is no difference among group means. However, if any one
of the group means is significantly different from the overall mean, then the null hypothesis is rejected.
You can use one-way ANOVA when you have data on an independent variable with at least three levels and
a dependent variable.
For example, assume that your independent variable is insect spray type, and you have data on spray types
A, B, C, D, E, and F. You can use one-way ANOVA to determine whether there is any difference in the
dependent variable, insect count, based on the spray type used.
TD_ANOVA Syntax
SELECT * FROM TD_ANOVA (
ON { table | view | (query) } as InputTable
USING
[GroupColumns ('group_column1' |'group_column2'[,...]| group_column_range[,...])]
[Alpha (alpha)]
) AS dt;
Alpha
[optional]: Specify the probability of rejecting the null hypothesis when the null hypothesis
is true.
Default value: 0.05
Valid range: [0,1]
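A call that follows this syntax for the insect spray example below might look like the following sketch; the table name insect_sprays and the explicit column list are assumptions about the example input:
SELECT * FROM TD_ANOVA (
    ON insect_sprays AS InputTable                 -- assumed name of the Insect_sprays example table
    USING
    GroupColumns ('A', 'B', 'C', 'D', 'E', 'F')    -- one column per spray type
    Alpha (0.05)
) AS dt;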
TD_ANOVA Input
Input Table Schema
Column Data Type Description
Column names with groups A, B, C, D, E, F INTEGER, BYTEINT, SMALLINT, BIGINT, DECIMAL, FLOAT, NUMBER
The columns that contain the data about the insect count for each insect spray type.
TD_ANOVA Output
Output Table Schema
Column Data Type Description
sum_of_squares (between groups) and (within groups) DOUBLE The between-groups and within-groups sum of
squares (that is, variation).
df (between groups) and (within groups) INTEGER The degrees of freedom corresponding to the between-groups
sum of squares and the within-groups sum of squares.
mean_square (between groups) and mean_square (within groups) DOUBLE The mean of the sum of squares, which is
calculated by dividing the sum of squares by the degrees of freedom.
p_value DOUBLE The probability value associated with the F-statistic value.
A low p-value indicates that the insect spray type has a
significant impact on the insect count.
TD_ANOVA Example
Input: Insect_sprays
SQL Call
Output Table
TD_ChiSq
TD_ChiSq performs Pearson's chi-squared (χ2) test for independence, which determines if there is a
statistically significant difference between the expected and observed frequencies in one or more categories
of a contingency table (also called a cross tabulation).
Test Type
• One-tailed, upper-tailed
• One-sample
• Unpaired
Computational Method
The Chi-Square test finds statistically significant associations between categorical variables. The test
determines if the categorical variables are statistically independent or not.
The data for analysis is organized in a table known as contingency tables. A two-way contingency table
consists of r rows and c columns wherein:
• The rows correspond to variable 1 that consists of r categories
• The columns correspond to variable 2 that consists of c categories
Each cell of the contingency table is the count of the joint occurrence of particular levels of variable 1 and
variable 2.
For example, the following two-way contingency table shows the categorical variable Gender with two levels
(Male, Female) and the categorical variable Affiliation with two levels (Smokers, Non-smokers).
The cell counts nij , i = 1, 2; j = 1, 2 are number of joint occurrences of Gender and Affiliation at their ith
and the jth levels respectively. The Null and alternative hypotheses H0 and H1 corresponding to a χ2 test of
independence is as follows:
H0: The two categorical variables are independent
vs
H1: The two categorical variables are not independent
Using the previous table, the expected cell counts are calculated as eij = (row i total × column j total) / n;
for example, e11 = (n11 + n12)(n11 + n21) / n.
The χ2 statistic follows a Chi-Square distribution with (r - 1)(c - 1) degrees of freedom. In the Gender
Affiliation table, r=2 and c=2. The Null hypothesis H0 is rejected if χ2stat > χ2(r-1)(c-1),α where α ϵ {0.10,
0.05, 0.01}.
The Cramer's V statistic is calculated using the formula V = √( (χ2 / n) / min(c - 1, r - 1) ),
where:
• φ is the phi coefficient
• χ2 is derived from the Pearson's chi-squared test
• n is the grand total of observations
• c is the number of columns
• r is the number of rows
The following rules are used to compute the hypothesis conclusion:
• If the chi-square statistic is greater than the critical value, then the function rejects the Null hypothesis.
• If the chi-square statistic is less than or equal to the critical value, then the function fails to reject the
Null hypothesis.
TD_ChiSq Syntax
SELECT * from TD_CHISQ (
ON { table | view | (query) } AS CONTINGENCY
[ OUT [ PERMANENT | VOLATILE ] TABLE EXPCOUNTS (expected_values_table) ]
USING
[ Alpha (alpha) ]
) AS alias;
alpha
[Optional] The Null Hypothesis is rejected if the P-value is smaller than the specified Alpha
value (where Alpha is the probability of rejecting the null hypothesis when it is true). alpha
must be a numeric value in the range [0, 1].
Default: 0.05
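A call following this syntax for the smoking-habits example below might look like the following sketch; the output table name exptable1 is taken from the example, and the use of a VOLATILE table is an assumption:
SELECT * FROM TD_CHISQ (
    ON contingency1 AS CONTINGENCY               -- example contingency table
    OUT VOLATILE TABLE EXPCOUNTS (exptable1)     -- expected-counts table (assumed VOLATILE)
    USING
    Alpha (0.05)
) AS dt;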
TD_ChiSq Input
A contingency table also known as a two-way frequency table is a tabular mechanism with at least two rows
and two columns used in statistics to present categorical data in terms of frequency counts.
A contingency table shows the observed frequency of two variables arranged into rows and columns. The
intersection of a row and a column of a contingency table is called a cell.
For example, a cell count nij represents a joint occurrence of row i and column j, where i is a value between
1 and r (total number of rows) and j is a value between 2 and c (total number of columns).
You can interpret the contingency table in the example as follows:
• The first column represents the first category, gender, and has two labels, female and male, which are
represented by two rows.
• The second category, habits, has two labels, smokers and non-smokers, which are represented by the
second and third columns.
The second category can have at most 2046 unique labels. The function ignores NULL values in the table.
Maximum label length is 64000 for category_1, 128 for all other columns.
For a valid test output, the value of each observed frequency in the CONTINGENCY table must be at
least 5.
Name of categorical column 1 Any Columns can have one or multiple labels. Can either be an integer, LATIN,
or UTF8 code.
...
TD_ChiSq Output
Output Table Schema
Column Data Type Description
criticalvalue DOUBLE PRECISION Critical value calculated using Alpha for test.
conclusion VARCHAR Chi-squared test result, either 'reject null hypothesis' or 'fail to
reject null hypothesis'.
TD_ChiSq Example
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
This example tests whether gender influences smoking habits affiliation. The null hypothesis is that
gender and smoking habits affiliation are independent. TD_ChiSq compares the null hypothesis (expected
frequencies) to the contingency table (observed frequencies).
Input: contingency1
The contingency table contains the frequencies of men and women affiliated with each smoking habit.
category_1, gender, has labels "female" and "male". category_2, habits, has labels "smokers" and "non-
smokers".
This example illustrates a two-way contingency table with two categories, category_1 and
category_2 respectively.
Each row has a label, i, which has a value between 1 and r, and each column has a label, j, which has a value
between 2 and c. The values of c and r are 3 and 2 respectively.
Here, category_1_label1 corresponds to females and category_1_label2 corresponds to males. Similarly,
category_2_label1 corresponds to smokers and category_2_label2 corresponds to non-smokers.
Query to create the contingency table is as follows:
Habits
gender smokers non-smokers
---------- --------- -------------
female 6 9
male 8 5
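A minimal sketch that would produce an equivalent contingency table, assuming the column names gender, smokers, and non_smokers:
CREATE TABLE contingency1 (
    gender      VARCHAR(10),
    smokers     INTEGER,
    non_smokers INTEGER    -- assumed column name; the example displays it as "non-smokers"
);
INSERT INTO contingency1 VALUES ('female', 6, 9);
INSERT INTO contingency1 VALUES ('male', 8, 5);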
SQL Call
Output Table
exptable1
TD_FTest
TD_FTest performs an F-test, for which the test statistic follows an F-distribution under the Null hypothesis.
TD_FTest compares the variances of two independent populations. If the variances are significantly
different, TD_FTest rejects the Null hypothesis, indicating that the variances may not come from the same
underlying population.
Use TD_FTest to compare statistical models that have been fitted to a dataset, to identify the model that best
fits the population from which the data were sampled.
Assumptions
• Populations from which samples are drawn are normally distributed.
• Populations are independent of each other.
• Data is numeric.
Test Type
• One-tailed (lower and upper-tailed) or two-tailed (your choice)
• Two-sample
• Unpaired
Computational Method
The F-test is used to test the Null hypothesis σ2 = σ02 in various applications. For example, you might
need to test the variability in the measurement of the thickness of a manufactured part in a factory. If the
variance of the thickness is not equal to a certain value (σ02), then you can conclude that the manufacturing
process is uncontrolled. The types of hypothesis are as follows:
H0: σ2 = σ02
versus
H1: σ2 > σ02 (upper-tailed)
or
H1: σ2 < σ02 (lower-tailed)
or
H1: σ2 ≠ σ02 (two-tailed)
Let x1, x2,....xn be a random sample with sample variance s2. To test the hypotheses, the test statistic is
calculated as the variance ratio Fstat = s2 / σ02.
For the one-sided upper-tailed test (σ2 > σ02), the Null hypothesis H0 is rejected if Fstat is greater than the
upper critical value of the test distribution at significance level α.
For the one-sided lower-tailed test (σ2 < σ02), the Null hypothesis H0 is rejected
if Fstat is less than the lower critical value of the test distribution at significance level α.
Also, the F-test is used to test if the variances of two populations are equal. The F-test can have the
following tests:
• One-tailed test: The test is used to determine if the variance of one population is either greater than
(upper-tailed) or less than (lower-tailed) the variance of another population.
• Two-tailed test: The test is used to determine significant differences in variances of the two populations
and tests the Null hypothesis (H0) against the alternative hypothesis (H1) to find out if the variances are
not equal.
Let x1, x2,....xn1 ~ Ɲ (µ1, σ12) and y1, y2,....yn2 ~ Ɲ (µ2, σ22) be random samples from two independent
populations. The corresponding sample means and variances are x̄, ȳ and s12, s22 respectively.
In the following calculation, assume that sample 1 has a larger variance than sample 2. If sample 2 has a
larger variance than sample 1, switch the samples and apply the same formula.
H0: σ12 = σ22
versus
H1: σ12 > σ22
or
H1: σ12 < σ22
The test statistic for the one-sided upper-tailed test (σ12 > σ22) is calculated as:
Fstat = s12 / s22
where n1-1 and n2-1 are degrees of freedom corresponding to sample 1 and sample 2.
The test statistic for the one-sided lower-tailed test (σ12 < σ22) is the same ratio, compared against the
lower critical value of the F-distribution with n1-1 and n2-1 degrees of freedom.
For the two-tailed test:
H0: σ12 = σ22
versus
H1: σ12 ≠ σ22
TD_FTest Syntax
SELECT * from TD_FTEST (
[ ON { table | view | (query) } AS InputTable ]
USING
first_sample_specifier
second_sample_specifier
[ AlternativeHypothesis ({ 'lower-tailed' | 'upper-tailed' | 'two-tailed' })
[ Alpha (alpha) ]
) AS alias;
first_sample_specifier
{ FirstSampleColumn ('sample_column_1') |
FirstSampleVariance (variance_1)
DF1 (degrees_of_freedom_first_sample)
}
second_sample_specifier
{ SecondSampleColumn ('sample_column_2') |
SecondSampleVariance (variance_2)
DF2 (degrees_of_freedom_second_sample)
}
FirstSampleVariance
[Required if you omit FirstSampleColumn, disallowed otherwise.] Specify the variance of the
first sample population.
DF1
[Required if you omit FirstSampleColumn, disallowed otherwise.] Specify the degrees of
freedom of the first sample.
SecondSampleColumn
[Required if you omit SecondSampleVariance, disallowed otherwise.] Specify the name of
the input column that contains the data for the second sample population.
SecondSampleVariance
[Required if you omit SecondSampleColumn, disallowed otherwise.] Specify the variance of
the second sample population.
DF2
[Required if you omit SecondSampleColumn, disallowed otherwise.] Specify the degrees of
freedom of the second sample.
AlternativeHypothesis
[Optional] Specify the alternative hypothesis:
Option          Description
'lower-tailed'  The variance of the first sample is less than the variance of the second sample.
'upper-tailed'  The variance of the first sample is greater than the variance of the second sample.
'two-tailed'    The variances of the two samples are not equal.
Alpha
[Optional] The Null Hypothesis is rejected if the P-value is smaller than the specified Alpha
value (where Alpha is the probability of rejecting the null hypothesis when it is true). alpha
must be a numeric value in the range [0, 1].
Default: 0.05
TD_FTest Input
InputTable is required only if you specify either FirstSampleColumn or SecondSampleColumn. If you
specify FirstSampleVariance, SecondSampleVariance, DF1, and DF2, the function ignores InputTable.
InputTable Schema
Column Data Type Description
TD_FTest Output
Output Table Schema
Column Data Type Description
TD_FTest Examples
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
Input
SQL Call
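The example's SQL statement is provided in the downloadable script. A call of the following general form
runs the column-based test; the table name ftest_input and the column names sample1 and sample2 are
hypothetical placeholders:
SELECT * FROM TD_FTEST (
  ON ftest_input AS InputTable          -- ftest_input, sample1, sample2 are hypothetical names
  USING
  FirstSampleColumn ('sample1')
  SecondSampleColumn ('sample2')
  AlternativeHypothesis ('two-tailed')
  Alpha (0.05)
) AS dt;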
Output
Input
This example specifies two sample variances instead of two sample columns, so you do not need InputTable.
SQL Call
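The exact statement is in the downloadable script. Based on the variances, degrees of freedom, and Alpha
shown in the output that follows, the call has approximately this form (the alternative hypothesis shown here
is an assumption):
SELECT * FROM TD_FTEST (
  USING
  FirstSampleVariance (1385.61)         -- values taken from the output below
  DF1 (9)
  SecondSampleVariance (521.22)
  DF2 (19)
  AlternativeHypothesis ('two-tailed')  -- assumed; the scripted call may differ
  Alpha (0.05)
) AS dt;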
Output
firstsamplevariance   1.38561000000000E+003
secondsamplevariance  5.21220000000000E+002
varianceratio         2.65839760561759E+000
df1                   9
df2                   19
CriticalValue         2.88005204672380E+000
Alpha                 5.00000000000000E-002
p_value               6.96764431913391E-002
Conclusion            Fail to reject Null hypothesis
TD_ZTest
TD_ZTest performs a Z-test, for which the distribution of the test statistic under the Null hypothesis can be
approximated by a normal distribution.
TD_ZTest tests the equality of two means under the assumption that the population variances are known
(rarely true). For large samples, sample variances approximate population variances, so TD_ZTest uses
sample variances instead of population variances in the test statistic.
Assumptions
• Sample distribution is normal.
• Data is numeric, not categorical.
Test Type
• One-tailed or two-tailed (your choice)
• One-sample or two-sample (your choice)
Use one-sample to test whether the mean of a population is greater than, less than, or not equal to a
specific value. TD_ZTest finds the answer by comparing the critical values of the normal distribution at
levels of significance (alpha = 0.01, 0.05, 0.10) to the Z-test statistic.
• Unpaired
Computational Method
A test of the hypothesis (ToH) involves the following framework:
• A Null hypothesis H0 and an alternative hypothesis H1
• A random sample x1, x2,....xn in the case of a one sample test
• Two random samples x1, x2,....xn and y1, y2,....yn in the case of a two sample test
• A test statistic Zstat
• A level of significance α ϵ {0.10, 0.05, 0.01}
• A comparison of the sample-based Zstat with the percentage point of the normal distribution, ᴢα or ᴢα/2
• Compute the p-value
• Conclusion
Let x1, x2,....xn be a random sample drawn from a population with mean µ and variance σ². Also, assume
that the data follows a normal distribution Ɲ (µ, σ²).
Case I (upper-tailed):
H0: µ ≤ µ0
versus
H1: µ > µ0
Case II (lower-tailed):
H0: µ ≥ µ0
versus
H1: µ < µ0
Case III (two-tailed):
H0: µ = µ0
versus
H1: µ ≠ µ0
The test statistic for testing the previous hypotheses is the Z-stat. The validity of the Z-stat is predicated on
the assumption that the population variance σ² is known.
The assumption of known variance is rarely practical, because if the variance is known, then the mean µ is
also known, and if the mean µ is known, the test is not required.
However, for large sample sizes (which are common in big data applications), the sample variance s² is
approximately equal to the unknown variance σ². Therefore, a large sample size validates the application
of the Z-statistic.
The z-statistic is calculated as:
Zstat = (x̄ - µ0) / (σ / √n)
where the unknown standard deviation σ is replaced by the sample standard deviation
s = √( Σ(xi - x̄)² / (n-1) )
where
x̄ = (1/n) Σ xi
In case I of the upper-tailed hypothesis test, the Null hypothesis is rejected if Zstat > ᴢα where α ϵ {0.10, 0.05,
0.01}. In case II of the lower-tailed hypothesis test, the Null hypothesis is rejected if Zstat < -ᴢα where α ϵ {0.10,
0.05, 0.01}. In case III of the two-tailed test, the Null hypothesis is rejected if Zstat > ᴢα/2 or Zstat < -ᴢα/2,
α ϵ {0.10, 0.05, 0.01}.
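As an illustration only, the one-sample Zstat can be assembled from standard SQL aggregates. This minimal
sketch uses example_table and column col1 from the example later in this section, with a hypothetical mean
under H0 of 50; it is not a substitute for calling TD_ZTest:
SELECT
  (AVG(CAST(col1 AS FLOAT)) - 50) * SQRT(COUNT(col1)) / STDDEV_SAMP(col1) AS z_stat
  -- 50 is a hypothetical mean under H0; the aggregates ignore NULL rows in col1
FROM example_table;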
The two-sample z-test is used for testing equality of the means of two populations. Let x1, x2,....xn1 ~ Ɲ (µ1, σ1²)
and y1, y2,....yn2 ~ Ɲ (µ2, σ2²) be random samples from two independent populations. The test statistic, with the
sample variances s1² and s2² replacing the unknown population variances, is:
Zstat = (x̄ - ȳ) / √(s1²/n1 + s2²/n2)
The Null hypothesis H0 and the alternative hypothesis H1 for a one-sided lower-tailed test are given as:
H0: µ1 ≥ µ2
versus
H1: µ1 < µ2
The Null hypothesis is rejected if Zstat < -ᴢα where α ϵ {0.10, 0.05, 0.01}. Also, note that -ᴢα is a percentile
of the normal distribution with α x 100% of the area to its left.
The hypotheses for a one-sided upper-tailed test are:
H0: µ1 ≤ µ2
versus
H1: µ1 > µ2
The Null hypothesis is rejected if Zstat > ᴢα with α ϵ {0.10, 0.05, 0.01}. Also, note that ᴢα is a percentile of
the normal distribution with (1-α) x 100% of the area to its left, so -ᴢα puts α x 100% of the area to its left.
The hypotheses for a two-tailed test are:
H0: µ1 = µ2
versus
H1: µ1 ≠ µ2
The Null hypothesis is rejected if Zstat > ᴢα/2 or Zstat < -ᴢα/2 with α ϵ {0.10, 0.05, 0.01}, where ᴢα/2 is a
percentile of the normal distribution with (1-α/2) x 100% of the area to its left. Under H0, Zstat ~ Ɲ (0,1).
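The same two-sample statistic can be checked with standard SQL aggregates. This is a minimal sketch against
example_table from the example below, using the large-sample approximation with sample variances; because
the documented example may supply explicit variances to TD_ZTest, the value need not match the output
shown there:
SELECT
  (AVG(CAST(col1 AS FLOAT)) - AVG(CAST(col2 AS FLOAT)))
  / SQRT(VAR_SAMP(col1) / COUNT(col1) + VAR_SAMP(col2) / COUNT(col2)) AS z_stat
  -- sample means, sample variances, and counts; NULL rows are ignored by the aggregates
FROM example_table;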
TD_ZTest Syntax
SELECT * FROM TD_ZTest (
ON { table | view | (query) }
USING
FirstSampleColumn (sample_column_1)
[ FirstSampleVariance (variance_1) ]
[ SecondSampleColumn (sample_column_2) ]
[ SecondSampleVariance (variance_2) ]
[ AlternativeHypothesis ({ 'upper-tailed' | 'lower-tailed' | 'two-tailed' }) ]
[ MeanUnderH0 (mean_under_H0) ]
[ Alpha (alpha) ]
) AS dt;
FirstSampleColumn
[Required] Specify the name of the input column that contains the data for the first
sample population.
FirstSampleVariance
[Required if first sample size is less than 30, optional otherwise.] Specify the variance of the
first sample population. variance_1 is a numeric value in the range (0, 1.79769e+308).
Default behavior: If the sample size is greater than 30, the function approximates the variance
from the sample.
SecondSampleColumn
[Optional] Specify the name of the input column that contains the data for the second
sample population.
SecondSampleVariance
[Required if you specify SecondSampleColumn and second sample size is less than 30,
optional otherwise.] Specify the variance of the second sample population. variance_2 is a
numeric value in the range (0, 1.79769e+308).
Default behavior: If sample size is greater than 30, the function approximates the variance.
AlternativeHypothesis
[Optional] Specify the alternative hypothesis:
Option          Description
'upper-tailed'  The first sample mean is greater than the second sample mean (or than MeanUnderH0 for a one-sample test).
'lower-tailed'  The first sample mean is less than the second sample mean (or than MeanUnderH0 for a one-sample test).
'two-tailed'    The means are not equal.
Default: 'two-tailed'
MeanUnderH0
[Optional] Specify the mean under the null hypothesis (H0). mean_under_H0 is a numeric
value in the range (-1.79769e+308, 1.79769e+308).
Default: 0
Alpha
[Optional] The Null Hypothesis is rejected if the P-value is smaller than the specified Alpha
value (where Alpha is the probability of rejecting the null hypothesis when it is true). alpha
must be a numeric value in the range [0, 1].
The null hypothesis is rejected if p_value < alpha. (For a description of p_value, see
TD_ZTest Output.) If the null hypothesis is rejected, the rejection confidence level is 1-alpha.
Default: 0.05
TD_ZTest Input
Input Table Schema
Column Data Type Description
TD_ZTest Output
Output Table Schema
Column Data Type Description
CriticalValue  DOUBLE PRECISION  Critical value calculated using Alpha for the test (z_α).
Conclusion     VARCHAR           Z-test result, either 'reject null hypothesis' or 'fail to
                                 reject null hypothesis'.
                                 If Conclusion is 'reject null hypothesis', the rejection
                                 confidence level is 1-alpha.
TD_ZTest Example
Input: example_table
Every complete example in this document is available in a zip file that you can download. The zip file
includes a SQL script file that creates the input tables for the examples. If you are reading this document
on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
col1 col2
----------- -----------
93 12
? 12
22 4
? 87
1 10
? 43
92 31
? 23
2 3
? 52
21 65
? 49
? 17
? 17
? 14
? 24
53 20
85 9
50 11
86 1
SQL Call
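The exact SQL statement for this example is in the downloadable script. A call of the following shape is
consistent with the syntax above; because both sample sizes are under 30, the variances must be supplied,
and the values shown here are placeholders, so the z_score and p_value in the output below come from the
scripted call rather than from this sketch:
SELECT * FROM TD_ZTest (
  ON example_table
  USING
  FirstSampleColumn (col1)
  FirstSampleVariance (0.1)        -- placeholder value, not the value used by the scripted example
  SecondSampleColumn (col2)
  SecondSampleVariance (0.1)       -- placeholder value
  AlternativeHypothesis ('two-tailed')
  Alpha (0.05)
) AS dt;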
Output
(shown with BTEQ side titles for readability)
firstsamplecolumn      col1
secondsamplecolumn     col2
N1                     10
N2                     20
mean1                  5.05000000000000E+001
mean2                  2.52000000000000E+001
AlternativeHypothesis  TWO-TAILED
z_score                1.55377718139113E+002
Alpha                  5.00000000000000E-002
CriticalValue          1.95996398454005E+000
p_value                0.00000000000000E+000
Conclusion             Reject Null hypothesis
TD_BYONE
The TD_BYONE function sends all table operator rows to a single AMP (access module processor) for
processing. The function is a deterministic scalar system function that takes no input parameters, and
returns an integer associated with a given query. The integer is based on the combined logical host identifier,
session identifier and the request identifier associated with the query.
When using the TD_BYONE function in a table operator, note the following:
• Best practice is to use the function when the number of processed rows is relatively small. Sending a
large number of rows to a single AMP can cause spool space issues.
• Entities appearing in a PARTITION BY clause must be referenced in the SELECT list.
• The call to TD_BYONE() must be referenced in the SELECT statement.
TD_BYONE Syntax
TD_SYSFNLIB.TD_BYONE()
TD_BYONE Examples
TD_BYONE as a Scalar Function
SELECT TD_SYSFNLIB.TD_BYONE();
TD_BYONE()
-----------
2028
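TD_BYONE in a PARTITION BY Clause (sketch)
The following sketch shows the intended usage pattern: compute TD_BYONE() in the SELECT list of a
subquery and partition the table operator input by that value so that all rows are processed on one AMP.
The table operator name my_operator and the table small_table are hypothetical placeholders, not objects
documented here.
SELECT * FROM my_operator (
  ON (SELECT TD_SYSFNLIB.TD_BYONE() AS one_amp,   -- TD_BYONE() is referenced in the SELECT list
             t.*
      FROM small_table AS t)                      -- small_table is a hypothetical input table
  PARTITION BY one_amp                            -- routes all rows to a single AMP
) AS dt;                                          -- my_operator is a hypothetical table operator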
number   String of one or more digits. Do not use commas in numbers with more than
         three digits.
         Example: 10045
[ x ]    x is optional.
Note:
You can repeat only the immediately preceding item. For example, if the syntax is:
KEYWORD x [...]
[ x, [...] ] y
• TD_QQNorm
• TD_UnivariateStatistics
• TD_WhichMax
• TD_WhichMin
• TD_BinCodeFit
• TD_BinCodeTransform
• TD_FunctionFit
• TD_FunctionTransform
• TD_OneHotEncodingFit
• TD_OneHotEncodingTransform
• TD_PolynomialFeaturesFit
• TD_PolynomialFeaturesTransform
• TD_RowNormalizeFit
• TD_RowNormalizeTransform
• TD_ScaleFit
• TD_ScaleTransform
• TD_FillRowID
• TD_NumApply
• TD_RoundColumns
• TD_StrApply
• TD_ChiSq
• TD_FTest
• TD_ZTest
Enhancements:
• DecisionForestPredict function: Changed syntax.
• DecisionTreePredict function: Changed syntax and output table schema.
• GLMPredict function: Added column range support for syntax element Accumulate.
• NaiveBayesTextClassifierPredict function: Changed syntax.
• nPath function: Added UNICODE support.
• Pack function:
◦ Added column range support for syntax element TargetColumns.
◦ Added syntax elements Accumulate and ColCast.
• StringSimilarity function: Added column range support for syntax
element Accumulate.
• SVMSparsePredict function: Changed syntax, input data types, and output
table schema.
• Unpack function:
◦ Added column range support for syntax element TargetColumns.
◦ Added syntax element Accumulate.
Teradata Links
Link Description
Related Documentation
Title Publication ID