Integration With Teradata ISV Partner Technical Guide
June 2015
PREFACE
Revision History:
Audience:
Contents
Contents 3
1. Teradata Partners Program .............................................................................................. 5
Section 1.1 Teradata Database -- Introduction .......................................................................... 5
Section 1.2 Teradata Support ..................................................................................................... 6
Section 1.3 Teradata Partner Intelligence Network and the Teradata Education Network ....... 6
2. Teradata Basics ............................................................................................................... 7
Section 2.1 Unified Data Architecture ....................................................................................... 7
Section 2.2 Data Types .............................................................................................................. 8
Section 2.3 Primary Index......................................................................................................... 14
Section 2.4 NoPI Objects .......................................................................................................... 15
Section 2.5 Secondary Indexes ................................................................................................ 16
Section 2.6 Intermediate/Temporary Tables ............................................................................. 17
Section 2.7 Locking .................................................................................................................. 17
Section 2.8 Statistics ................................................................................................................. 19
2.8.1 Random AMP Sampling .............................................................................................. 20
2.8.2 Full Statistics Collection .............................................................................................. 22
2.8.3 Collection with the USING SAMPLE option .............................................................. 23
2.8.4 Collect Statistics Summary Option .............................................................................. 24
2.8.5 Summary: Teradata Statistics Collection ..................................................................... 25
2.8.6 New opportunities for statistics collection in Teradata 14.0[1] .................................... 26
2.8.7 Recommended Reading ............................................................................................... 28
Section 2.9 Stored Procedures .................................................................................................. 29
Section 2.10 User Defined Functions (UDF) ............................................................................ 32
Section 2.11 Table Operators .................................................................................................... 35
Section 2.12. QueryGrid ........................................................................................................... 37
Section 2.13 DBQL ................................................................................................................... 39
Section 2.14 Administrative Tips ............................................................................................. 43
3. Workload Management ................................................................................................. 45
Section 3.1 Workload Administration ...................................................................................... 45
4. Migrating to Teradata ................................................................................................... 46
Section 4.1 Utilities and Client Access ..................................................................................... 46
Section 4.1.1 Teradata Load/Unload Protocols & Products ..................................................... 46
Section 4.1.2 Input Data Sources with Scripting Tools ............................................................ 48
Section 4.1.3 Teradata Parallel Transporter .............................................................................. 48
Section 4.1.4 Restrictions & Other Techniques ........................................................................ 58
Section 4.2 Load Strategies & Architectural Options ............................................................... 61
Section 4.2.1 ETL Architectural Options ................................................................................. 61
Section 4.2.2 ISV ETL Tool Advantages vs. Teradata Tools ................................................... 62
Section 4.2.3 Load Strategies.................................................................................................... 62
Section 4.2.4 ETL Tool Integration .......................................................................................... 63
Section 4.3 Concurrency of Load and Unload Jobs .................................................................. 64
Section 4.4 Load comparisons ................................................................................................. 64
5. References ..................................................................................................................... 64
Section 5.1 SQL Examples ....................................................................................................... 64
Derived Tables ...................................................................................................................... 65
Recursive SQL ...................................................................................................................... 65
Sub Queries ........................................................................................................................... 66
Case Statement ...................................................................................................................... 67
Sum (SQL-99 Window Function) ......................................................................................... 69
Rank (SQL-99 Window Function)........................................................................................ 70
Fast Path Insert ...................................................................................................................... 71
Fast Path Delete .................................................................................................................... 72
Section 5.2 Set vs. Procedural................................................................................................... 72
Section 5.3 Statistics collection “cheat sheet” .......................................................................... 75
Section 5.4 Reserved words ...................................................................................................... 78
Section 5.5 Orange Books and Suggested Reading .................................................................. 78
1. Teradata Partners Program
Section 1.1 Teradata Database -- Introduction
This document is intended for ISV partners new to the Teradata partner program. There is a
wealth of documentation available on the database and client/utility software; in fact, most of the
information in this document comes from those sources.
Although Teradata is an ANSI-standard relational database management system, it differs from
other RDBMSs in important ways, and ISVs need to understand those differences in order to
integrate effectively and to leverage Teradata's strengths. This document highlights those key
differences and strengths; it is intended as a quick-start guide, not a replacement for training or
for the extensive user documentation provided for Teradata.
The test environment Teradata normally recommends for partners is the Teradata Partner
Engineering lab in San Diego. For testing in this environment, you connect to Teradata servers in
the lab via high-speed internet connections.
For some partners, the Teradata Partner Engineering lab will not be sufficient because it does not
cover all requirements; for example, the lab is not an appropriate environment for performance
testing. While Teradata can accommodate some requests on a case-by-case basis, performance
testing is not a function of the lab.
If you decide to execute your testing on your own premises, the following test environments are
supported:
Teradata Database client applications run on many operating systems. See Teradata Tools and
Utilities Supported Platforms and Product Versions, at: http://www.info.teradata.com
Section 1.2 Teradata Support
Partners that have an active partner support contract with us can submit incidents via T@YS
(http://tays.teradata.com). In addition, T@YS provides access to driver and patch downloads and
to the knowledge repositories.
All incidents are submitted on-line via T@YS to our CS organization. The CS organization
assumes the partner has a working level of Teradata knowledge. It is not in the scope of the CS
organization to educate the partner, to walk a partner through an installation of Teradata software
or through the resolution of an issue, or to provide general Teradata consulting advice.
To sign up for T@YS, submit your name, title, address, phone number and e-mail to your global
alliance support partner.
Section 1.3 Teradata Partner Intelligence Network and the Teradata Education Network

All partners should receive an orientation session for the Teradata Partner Intelligence Network
as part of their Teradata Partner benefit package. This network is the one source for all the tools
and resources partners need to develop and promote their integrated solutions for Teradata.
Partners that have an active partner support contract with us also have access to the Teradata
Education Network (TEN) (http://www.teradata.com/ten). In order to sign up, submit your name,
title, address, phone number and e-mail to your global alliance support partner.
Access to TEN is free; depending on the type of education selected, there may be an additional
cost.
Of particular interest is “Teradata Essentials for Partners.” The primary focus of this four-day
technical class is to provide a foundational understanding of Teradata’s design and
implementation to Alliance Partners. The class is given in a lecture format and provides a
technical and detailed description of the following topics:
Data Warehousing
Teradata concepts, features, and functions
Teradata physical database design – make the correct index selections by understanding
distribution, access, and join techniques.
EXPLAIN plans, space utilization of tables and databases, join indexes, and hash indexes are
discussed in relation to physical database design
Teradata Application Utilities – BTEQ and TPT (Load, Update, Export and Stream
adapters); details on when and how to use
Key Teradata features and utilities (up to Teradata 14.0) are included
For more training and education information, including a Partner Curriculum Map with
descriptions of the courses as well as additional recommended training, visit the “Education”
webpage on the Teradata PartnerIntelligence website.
http://partnerintelligence.teradata.com
2. Teradata Basics
Section 2.1 Unified Data Architecture

In the UDA, data is intended to flow where it is most efficiently processed. In the case of
unstructured data with multiple formats, it can be quickly loaded in Hadoop as a low-cost
platform for landing, staging, and refining raw data in batch. Teradata’s partnership with
Hortonworks enables enterprises to use Hadoop to capture extensive volumes of historical data
and perform massive processing to refine the data with no need for expensive and specialized
knowledge and resources.
The data can then be leveraged in the Teradata Aster analytics discovery platform for real-time,
deep statistical analysis and discovery, allowing users to identify patterns and trends in the data.
The patented Aster SQL-MapReduce® parallel programming framework combines the analytic
power of MapReduce with the familiarity of SQL.
After analysis in the Aster environment, the relevant data can be routed to the Teradata Database,
integrating the data discovered by the Teradata Aster discovery platform with all of the existing
operational data, resulting in intelligence that can be leveraged across the enterprise.
• Use massively parallel storage in Hadoop to efficiently retain data
For more information on UDA, contact your Partner Integration Lab consultant, or visit
the PartnerIntelligence website at http://partnerintelligence.teradata.com.
Section 2.2 Data Types

Every data value belongs to an SQL data type. For example, when you define a column in a
CREATE TABLE statement, you must specify the data type of the column. The set of data
values that a column defines can belong to one of the following groups of data types:
• Array/VARRAY
• Byte and BLOB
• Character and CLOB
• DateTime
• Geospatial
• Interval
• JSON
• Numeric, including Number
• Parameter
• Period
• UDT
• XML
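As a brief illustration of specifying data types in a CREATE TABLE statement (the table and
column names below are hypothetical, and the JSON column assumes Teradata 15.0 or later):

CREATE TABLE Sales_DB.Web_Event
  ( event_id   BIGINT NOT NULL
  , event_ts   TIMESTAMP(6)
  , duration   INTERVAL HOUR TO SECOND
  , payload    JSON(100000)
  , note_txt   VARCHAR(500) CHARACTER SET UNICODE
  )
PRIMARY INDEX (event_id);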
Array/VARRAY Data Types

The one-dimensional (1-D) ARRAY type is defined as a variable-length ordered list of values of
the same data type. It has a maximum number of values that you specify when you create the
ARRAY type. You can access each element value in a 1-D ARRAY type using a numeric index
value. You can also create an ARRAY data type using the VARRAY keyword and syntax for
Oracle compatibility.
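A sketch of the two ways to define such a type (type names are hypothetical; see the SQL Data
Types and Literals manual for the complete syntax):

CREATE TYPE phone_list    AS VARCHAR(12) ARRAY[5];         -- Teradata 1-D ARRAY syntax
CREATE TYPE phone_list_va AS VARRAY(5) OF VARCHAR(12);     -- Oracle-compatible VARRAY syntax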
Byte and BLOB Data Types

The BYTE, VARBYTE, and BLOB data types are stored in the client system format – they are
never translated by Teradata Database. They store raw data as logical bit streams, which makes
them suitable for digitized images and similar content. For any machine, BYTE, VARBYTE, and
BLOB data is transmitted directly from the memory of the client system. The sort order is logical,
and values are compared as if they were n-byte, unsigned binary integers.

Teradata does not provide a built-in function that converts a character directly to its underlying
ASCII integer value. The CHAR2HEXINT function returns the hexadecimal representation of a
character, and an ASCII() UDF is also available in the Oracle UDF library.
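For example (the returned hexadecimal value depends on the character set in effect, and ASCII()
is available only if the optional Oracle UDF library is installed):

SELECT CHAR2HEXINT('A');   -- hex representation of the character, e.g. '0041' for UNICODE
SELECT ASCII('A');         -- 65, via the optional Oracle UDF library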
DateTime Data Types

Date and Time DML syntax can be tricky in Teradata. Below are a few common date/time
queries that have proven useful. Note that some of the output shown depends on the “Date/Time
Format” parameter selected in the ODBC Setup Options.
a) Current Date and Time:
• SELECT Current_Date; Retrieves the System Date (mm/dd/yyyy)
• SELECT Current_Time; Retrieves the System Time (hh:mm:ss)
• SELECT Current_TimeStamp; Retrieves the System Timestamp (date and time with fractional seconds)
b) Timestamps:
• SELECT CAST (Current_Timestamp as DATE); Extracts the Date (mm/dd/yyyy)
• SELECT CAST (Current_Timestamp as TIME(6)); Extracts the Time (hh:mm:ss)
• SELECT Day_Of_Week
FROM Sys_Calendar.Calendar
WHERE Calendar_Date = CAST(Current_Timestamp AS DATE); Uses the TD System Calendar to compute the Day of Week (#)
• SELECT EXTRACT (DAY FROM Current_Timestamp); Extracts Day of Month
• SELECT EXTRACT (MONTH FROM Current_Timestamp); Extracts Month
• SELECT EXTRACT (YEAR FROM Current_Timestamp); Extracts Year
• SELECT CAST(Current_Timestamp as DATE) - CAST (AnyTimestampCol AS DATE)
Computes the # of days between timestamps
• SELECT (Current_Timestamp - AnyTimestampColumn) DAY(4) TO SECOND(6);
Computes the length of time between timestamps
c) Month:
• SELECT CURRENT_DATE - EXTRACT (DAY FROM CURRENT_DATE) + 1
Computes the first day of the month
• SELECT Add_Months ((CURRENT_DATE - EXTRACT (DAY FROM
CURRENT_DATE) + 1),1)-1 Computes the last day of the month
• SELECT Add_Months (Current_Date, 3); Adds 3 months to the current date
Geospatial Data Types

• ST_MultiPoint: 0-dimensional geometry collection where the elements are restricted to
ST_Point values.
• ST_MultiLineString: 1-dimensional geometry collection where the elements are restricted to
ST_LineString values.
• ST_MultiPolygon: 2-dimensional geometry collection where the elements are restricted to
ST_Polygon values.
• GeoSequence: Extension of ST_LineString that can contain tracking information, such as time
stamps, in addition to geospatial information.

MBR -- Teradata Database also provides a UDT called MBR that provides a way to obtain the
minimum bounding rectangle (MBR) of a geometry for tessellation purposes. ST_Geometry
defines a method called ST_MBR that returns the MBR of a geometry. The ST_Geometry and
MBR UDTs are defined in the SYSUDTLIB database.
Interval Data Types

Day-Time intervals represent a time span that can include a number of days, hours, minutes, or
seconds:
• INTERVAL DAY
• INTERVAL DAY TO HOUR
• INTERVAL DAY TO MINUTE
• INTERVAL DAY TO SECOND
• INTERVAL HOUR
• INTERVAL HOUR TO MINUTE
• INTERVAL HOUR TO SECOND
• INTERVAL MINUTE
• INTERVAL MINUTE TO SECOND
• INTERVAL SECOND
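Interval values are most often used in date/time arithmetic, for example (table and column names
hypothetical):

SELECT CURRENT_TIMESTAMP + INTERVAL '3' DAY;
SELECT (CURRENT_TIMESTAMP - Order_TS) DAY(4) TO SECOND(6)
FROM Sales_DB.Orders;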
JSON Data Type

Teradata Database can store JSON records as a JSON document or store JSON records in
relational format. Teradata Database provides the following support for JSON data:
• Methods, functions, and stored procedures that operate on the JSON data type, such as parsing
and validation.
• Shredding functionality that allows you to extract values from JSON documents up to 16MB in
size and store the extracted data in relational format.
• Publishing functionality that allows you to publish the results of SQL queries in JSON format.
• Schema-less or dynamic schema with the ability to add a new attribute without changing the
schema. Data with new attributes is immediately available for querying. Rows without the new
column can be filtered out.
• Use existing join indexing structures on extracted portions of the JSON data type.
• Apply advanced analytics to JSON data.
• Functionality to convert an ST_Geometry object into a GeoJSON value and a GeoJSON value
into an ST_Geometry object.
• Allows JSON data of varying maximum length and JSON data can be internally compressed.
• Collect statistics on extracted portions of the JSON data type.
• Use standard SQL to query JSON data.
• JSONPath provides simple traversal and regular expressions with wildcards to filter and
navigate complex JSON documents.
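A minimal sketch of storing and querying JSON data (table, column, and attribute names are
hypothetical; see the Teradata JSON manual for size limits and the full set of methods):

CREATE TABLE Sales_DB.Clickstream
  ( click_id  BIGINT NOT NULL
  , click_doc JSON(1000000)
  )
PRIMARY INDEX (click_id);

SELECT click_doc.sessionId, click_doc.page.name    /* dot-notation entity references */
FROM Sales_DB.Clickstream
WHERE click_doc.page.name = 'checkout';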
If the default data type used to aggregate numeric values causes an overflow, casting to BIGINT
may resolve the problem. For the differences between NUMBER, DECIMAL, FLOAT, and
BIGINT, see the SQL Data Types and Literals manual.
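For example, SUM over an INTEGER column returns an INTEGER by default and can raise a
numeric overflow error on large tables; widening the input avoids it (names hypothetical):

SELECT SUM(CAST(qty_sold AS BIGINT)) AS total_qty
FROM Sales_DB.Daily_Sales;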
VARIANT_TYPE - An input parameter data type that can be used to package
and pass in a varying number of parameters of varying data types to a UDF as a single
UDF input parameter.
Related Topics
For detailed information on data types, see the SQL Data Types and Literals, SQL Geospatial
Types, and Teradata XML manuals. Also refer to the SQL Functions, Operators,
Expressions, and Predicates manual for a list of data type conversion functions, including
support of Oracle data type conversion functions.
Section 2.3 Primary Index
Unless you explicitly create a NoPI table (see Section 2.4), every Teradata Database table has a
primary index. If you do not assign a primary index explicitly when you create a table, the system
assigns one automatically according to the following rules:
Stage 1 -- WHEN a primary key column is defined but a primary index is not, THEN the system
selects the primary key column set to be the primary index and defines it as a UPI.

Stage 2 -- WHEN neither a primary key nor a primary index is defined, THEN the system selects
the first column having a UNIQUE constraint to be the primary index and defines it as a UPI.

Stage 3 -- WHEN no primary key, primary index, or uniquely constrained column is defined,
THEN the system selects the first column defined for the table to be the primary index. If the
first column defined in the table has a LOB data type, the CREATE TABLE operation aborts and
the system returns an error message. In this case, WHEN the table has only one column and its
table kind is SET, the index is a UPI; otherwise it is a NUPI.
Use the CREATE TABLE statement to create primary indexes. Data accessed using a primary
index is always a one-AMP operation because a row and its primary index are stored together in
the same structure. This is true whether the primary index is unique or non-unique, and whether
it is partitioned or non-partitioned.
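A minimal sketch (table and column names hypothetical): a unique primary index (UPI) gives
even row distribution and single-AMP access by the index value.

CREATE TABLE Sales_DB.Customer
  ( cust_id    INTEGER NOT NULL
  , cust_name  VARCHAR(100)
  , region_cd  CHAR(2)
  )
UNIQUE PRIMARY INDEX (cust_id);

SELECT cust_name
FROM Sales_DB.Customer
WHERE cust_id = 12345;   /* single-AMP retrieval: full PI value specified */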
With the exception of NoPI tables and column-partitioned tables and join indexes,
Teradata Database distributes table rows across the AMPs on the hash of their primary
index value. The determination of which hash bucket, and hence which AMP the row is
to be stored on, is made solely on the value of the primary index.
The choice of columns for the primary index affects how even this distribution is. An
even distribution of rows to the AMPs is usually of critical importance in picking a
primary index column set.
The primary index also serves the following access and join purposes:
• To provide access to rows more efficiently than with a full-table scan. If the values for all the
primary index columns are specified in the constraint clause of a DML statement, single-AMP
access can be made to the rows using that primary index value.
• With a partitioned primary index, faster access is also possible when all the values of the
partitioning columns are specified or if there is a constraint on the partitioning columns.
• Other retrievals might use a secondary index, a hash or join index, a full-table scan, or a mix of
several different index types.
• If there is an equality join constraint on the primary index of a table, it may be possible to do a
direct join to the table (that is, rows of the table might not have to be redistributed, spooled, and
sorted prior to the join).
• If the GROUP BY key is on the primary index of a table, it is often possible to perform a more
efficient aggregation.
Section 2.4 NoPI Objects

A NoPI table is a table that does not have a primary index and always has a table kind of
MULTISET. Without a PI, the hash value as well as the AMP ownership of a row is arbitrary. A
NoPI table is internally treated as a hash table; it is just that typically all the rows on one AMP
will have the same hash bucket value.
The chief purpose of NoPI tables is as staging tables. FastLoad can efficiently load data into
empty nonpartitioned NoPI staging tables because NoPI tables do not have the overhead of row
distribution among the AMPs and sorting the rows on the AMPs by rowhash.
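A minimal staging-table sketch (names hypothetical):

CREATE MULTISET TABLE Stage_DB.Sales_Stg
  ( sale_id  BIGINT
  , sale_dt  DATE
  , amount   DECIMAL(18,2)
  )
NO PRIMARY INDEX;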
Nonpartitioned NoPI tables are also critical to support Extended MultiLoad Protocol
(MLOADX). A nonpartitioned NoPI staging table is used for each MLOADX job.
NoPI tables are not intended to be queried directly by end users. They exist primarily as a landing
place for FastLoad or MultiLoad staging data.
For more information, please refer to the Teradata Database Design manual.
Section 2.5 Secondary Indexes

While secondary indexes are exceedingly useful for optimizing repetitive and standardized
queries, the Teradata Database is also highly optimized to perform full-table scans in parallel.
Because of the strength of full-table scan optimization in the Teradata Database, there is little
reason to be heavy-handed about assigning multiple secondary indexes to a table.
Secondary indexes are less frequently included in query plans by the Optimizer than the primary
index for the table being accessed.
You can create secondary indexes when you create the table via the CREATE TABLE statement,
or you can add them later using the CREATE INDEX statement.
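For example (index, table, and column names hypothetical):

CREATE INDEX Cust_Region_NUSI (region_cd) ON Sales_DB.Customer;   -- non-unique secondary index (NUSI)
CREATE UNIQUE INDEX (cust_email) ON Sales_DB.Customer;            -- unique secondary index (USI)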
Data access using a secondary index varies depending on whether the index is unique or non-
unique.
Teradata Database tables can have up to a total of 32 secondary, hash, and join indexes.
• No more than 64 columns can be included in a secondary index definition.
• You can include UDT columns in a secondary index definition.
• You cannot include columns having XML, BLOB, CLOB, BLOB-based UDT, CLOB-based
UDT, XML-based UDT, Period, JSON, ARRAY, or VARRAY data types in any secondary index
definition.
• You can define a simple NUSI on a geospatial column, but you cannot include a column having
a geospatial data type in a composite NUSI definition or in a USI definition.
• You can include row-level security columns in a secondary index definition.
• You cannot include the system-derived PARTITION or PARTITION#Ln columns in any
secondary index definition.
Creating a secondary index causes the system to build a subtable to contain its index rows, thus
adding another set of rows that requires updating each time a table row is inserted, deleted, or
updated.
Secondary index subtables are also duplicated whenever a table is defined with FALLBACK, so
the maintenance overhead is effectively doubled.
When compression at the data block level is enabled for their primary table, secondary index
subtables are not compressed.
Section 2.6 Intermediate/Temporary Tables

Specify a primary index (PI) for a temporary table when access or joins by the PI are anticipated.
If you do not specify a PI, the table defaults either to a NoPI table or to a PI on the first column of
the table, regardless of the data demographics. If you do not know in advance what the best PI
candidate will be, specify a NoPI table to ensure even distribution of the rows across all AMPs.
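For example, a volatile intermediate table with an explicit PI (names hypothetical):

CREATE VOLATILE TABLE VT_Region_Totals
  ( region_cd  CHAR(2)
  , total_amt  DECIMAL(18,2)
  )
PRIMARY INDEX (region_cd)
ON COMMIT PRESERVE ROWS;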
Section 2.7 Locking

Most locks used on Teradata resources are obtained automatically. Users can override some
locks by making certain lock specifications, but cannot downgrade the severity of a lock --
Teradata Database only allows overrides when it can assure data integrity. The data integrity
requirement of a request determines the type of lock that the system uses.
A request for a locked resource by another user is queued until the process using the resource
releases its lock on that resource.
Lock Levels:

Requested lock   Held: None   Access    Read      Write     Exclusive
Access                Granted  Granted   Granted   Granted   Queued
Read                  Granted  Granted   Granted   Queued    Queued
Write                 Granted  Granted   Queued    Queued    Queued
Exclusive             Granted  Queued    Queued    Queued    Queued

The Teradata Database applies most of its locks automatically, choosing the lock severity based
on the type of SQL statement being processed.
It is generally recommended that queries that do not require READ-lock consistency use the SQL
modifier “LOCKING ROW FOR ACCESS.” This allows non-blocking access to tables that are
being updated.
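For example (table and column names hypothetical):

LOCKING ROW FOR ACCESS
SELECT cust_id, cust_name
FROM Sales_DB.Customer
WHERE region_cd = 'NW';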
In addition to the information listed here, refer to the orange book, “Understanding Oracle and
Teradata Transactions and Isolation Levels for Oracle Migrations” for a further understanding of
the Teradata database differences.
Section 2.8 Statistics

Over the last two decades, Teradata software releases have consistently improved the way
statistics are collected and then used by the cost-based Teradata Optimizer. The Optimizer does
not perform a detailed evaluation of every possible query plan (multiple joins could produce
billions of possibilities). Instead, it uses sophisticated algorithms to identify and select the most
promising candidates for detailed evaluation, then picks what it perceives as the best plan among
those. The essential task of the Optimizer is to produce the optimal execution plan (the one with
the lowest cost) from many possible plans. Plans are compared on cost, which is derived from
estimates of the cardinalities of the temporary or intermediate relations produced by operations
such as selections, joins, and projections. In Teradata, these estimates are derived primarily from
statistics and random AMP samples; accurate estimates are crucial for optimal plans.
Providing statistical information is critical to producing optimal query plans, but collecting
statistics can prove difficult due to the demands it places on time and system resources.
Without full or all-AMP sampled statistics, query optimization must rely on extrapolation and on
dynamic AMP sample estimates of table cardinality. Dynamic AMP samples do collect a few
statistics besides estimated cardinalities, but far fewer than a COLLECT STATISTICS request
does.
Statistics and demographics provide the Optimizer with information it uses to reformulate
queries in ways that permit it to produce the least costly access and join plans. The critical issues
you must evaluate when deciding whether to collect statistics are not whether query optimization
can or cannot occur in the face of inaccurate statistics, but the following pair of probing
questions.
• How accurate must the available statistics be in order to generate the best possible query plan?
• How poor a query plan are you willing to accept?
Different strategies can be used to attain the right balance between the need for statistics and the
demands of time and resources.
The main strategies for collecting statistics are random AMP sampling, full statistics collection,
and sampled (USING SAMPLE) statistics collection.

2.8.1 Random AMP Sampling

By default, the Optimizer performs single-AMP sampling to produce random AMP sample
demographics, with some exceptions (volatile tables, sparse single-table join indexes, and
aggregate join indexes). By changing an internal field in the DBS Control record called
RandomAMPSampling, you can request that sampling be performed on 2 AMPs, 5 AMPs, all
AMPs on a node, or all AMPs in the system.
When using these options, random sampling uses the same techniques as single-AMP random
AMP sampling, but more AMPs participate. Touching more AMPs may improve the quality of
the statistical information available during plan generation, particularly if rows are not evenly
distributed.
In Teradata Database 12.0 and later releases, all-AMP sampling was enhanced with a more
efficient “last done channel” mechanism that considerably reduces the messaging overhead. This
technique is used when all-AMP sampling is enabled in DBS Control or in the cost profile but the
internal DBS Control flag RowsSampling5 is set to 0 (the default). If set to greater than 0, this
flag causes the sampling logic to read the specified percentage of rows to determine the number
of distinct values for the primary index.
Best Use
• Good for cardinality estimates when there is little or no skew and the table has significantly
more rows than the number of AMPs in the system.
• Collects reliable statistics for NUSI columns when there is limited skew and the table has
significantly more rows than the number of AMPs in the system.
• Useful as a temporary fallback measure for columns and indexes on which you have not yet
decided whether to collect statistics or not. Dynamic AMP sampling provides a reasonable
fallback mechanism for supporting the optimization of newly devised ad hoc queries until you
understand where collected statistics are needed to support query plans for them. Teradata
Database stores cardinality estimates from dynamic AMP samples in the interval histogram for
estimating table growth even when complete, fresh statistics are available.
Pros:
• This operation is performed automatically.

Cons:
• Works only with indexed columns.
• Single-AMP sampling may not be good enough for small tables and tables with a non-uniform
distribution on the primary index.
• Does not provide the following information:
  - Number of nulls
  - Skew information
  - Value range
• For NUSIs, the estimated number of distinct values on a single AMP is assumed to be the total
number of distinct values. This holds for highly non-unique columns but can cause distinct-value
underestimation for fairly unique columns. On the other hand, it can cause overestimation for
highly non-unique columns because of rowid spill-over.
• Cannot estimate the number of distinct values for non-unique primary indexes.
• Single-table estimations can use this information only for equality conditions, assuming
uniform distribution.
It is strongly recommended to contact Teradata Global Support Center (GSC) to assess the
impact of enabling all-AMP sampling on your configuration and to help change the
internal dbscontrol settings.
2.8.2 Full Statistics Collection

The greater the number of intervals in a histogram, the more accurately it can describe the
distribution of data, because each interval characterizes a smaller percentage of the data. Each
interval histogram in the system is composed of a number of intervals (the default is 250 and the
maximum is 500). A 500-interval histogram permits each interval to characterize roughly 0.2%
of the data.

Because these statistics are kept in a persistent state, it is up to the administrator to keep collected
statistics fresh. It is common for many Teradata warehouse sites to re-collect statistics on the
majority of their tables weekly, and on particularly volatile tables daily, if deemed necessary.
• Time consuming.
• Most accurate of the three methods of collecting statistics.
• Stored in interval histograms in the Data Dictionary.
Best Use
• Best choice for columns or indexes with highly skewed data values.
• Recommended for tables with fewer than 1,000 rows per AMP.
• Recommended for selection columns having a moderate to low number of distinct values.
• Recommended for most NUSIs, PARTITION columns, and other selection columns because
collection time on NUSIs is very fast.
• Recommended for all column sets or indexes where full statistics add value, and where
sampling does not provide satisfactory statistical estimates.
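A basic full-statistics collection takes the following form (table and column names hypothetical);
the PARTITION column is the system-derived column recommended above:

COLLECT STATISTICS COLUMN (region_cd)  ON Sales_DB.Customer;
COLLECT STATISTICS COLUMN (PARTITION)  ON Sales_DB.Customer;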
2.8.3 Collection with the USING SAMPLE Option

Collecting statistics on a sample of the data reduces the resources required and the time needed to
perform statistics collection. However, the USING SAMPLE alternative was not designed to
replace full statistics collection; it requires careful analysis and planning to determine under
which conditions it adds benefit.
The quality of the statistics collected with full-table sampling is not guaranteed to be as good as
the quality of statistics collected on an entire table without sampling. Do not think of sampled
statistics as an alternative to collecting full-table statistics, but as an alternative to never, or rarely,
collecting statistics.
When you use sampled statistics rather than full-table statistics, you are trading time in exchange
for what are likely to be less accurate statistics. The underlying premise for using sampled
statistics is usually that sampled statistics are better than no statistics.
Do not confuse statistical sampling with the dynamic AMP samples (system default) that the
Optimizer collects when it has no statistics on which to base a query plan. Statistical samples
taken across all AMPs are likely to be much more accurate than dynamic AMP samples.
Sampled statistics are different from dynamic AMP samples in that you specify the percentage of
rows you want to sample explicitly in a COLLECT STATISTICS (Optimizer Form) request to
collect sampled statistics, while the number of AMPs from which dynamic AMP samples are
collected and the time when those samples are collected is determined by Teradata Database, not
by user choice. Sampled statistics produce a full set of collected statistics, while dynamic AMP
samples collect only a subset of the statistics that are stored in interval histograms.
• Collects all statistics for the data, but not by accessing all rows in the table.
• Significantly faster collection time than full statistics.
• Stored in interval histograms in the Data Dictionary.
Best Use
• Acceptable for columns or indexes that are highly singular; meaning that their number of
distinct values approaches the cardinality of the table.
• Recommended for unique columns, unique indexes, and for columns or indexes that are
highly singular. Experience suggests that sampled statistics are useful for very large
tables; meaning tables with tens of billions of rows.
• Not recommended for tables whose cardinality is less than 20 times the number of AMPs in the
system.
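Using the Teradata 14.0 USING syntax described in Section 2.8.6, a sampled collection might
look like the following (table and column names hypothetical):

COLLECT STATISTICS
  USING SAMPLE 10 PERCENT
  COLUMN (o_custkey)
ON Orders;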
2.8.4 Collect Statistics Summary Option

You can request summary statistics for a table, but even if you never do so, each individual
statistics collection statement causes summary statistics to be gathered. For this reason, it is
recommended that you group your statistics collections against the same table into one statement,
to avoid even the small overhead involved in building summary statistics repeatedly for the same
table within the same script.
There are several benefits in having summary statistics. One critical advantage is that the
optimizer now uses summary stats to get the most up-to-date row count from the table in order
to provide more accurate extrapolations. It no longer needs to depend on primary index or
PARTITION stats, as was the case in earlier releases, to perform good extrapolations when it
finds statistics on a table to be stale.
Here’s an example of what the most recent summary statistic for the Items table looks like:
COLLECT SUMMARY STATISTICS
ON CAB.Items
VALUES
(
/** TableLevelSummary **/
/* Version */ 5,
/* NumOfRecords */ 50,
/* Reserved1 */ 0.000000,
/* Reserved2 */ 0.000000,
/* SummaryRecord[1] */
/* Temperature */ 0,
/* TimeStamp */ TIMESTAMP '2011-12-29 13:30:46',
/* NumOfAMPs */ 160,
/* OneAMPSampleEst */ 5761783680,
/* AllAMPSampleEst */ 5759927040,
/* RowCount */ 5759985050,
/* DelRowCount */ 0,
/* PhyRowCount */ 5759927040,
/* AvgRowsPerBlock */ 81921.871617,
/* AvgBlockSize */ 65024.000000,
/* BLCPctCompressed */ 0.00,
/* BLCBlkUcpuCost */ 0.000000,
/* BLCBlkURatio */ 0.000000,
/* RowSizeSampleEst */ 148.000000,
/* Reserved2 */ 0.000000,
/* Reserved3 */ 0.000000,
/* Reserved4 */ 0.000000
);
2.8.5 Summary: Teradata Statistics Collection

Full statistics are the most accurate of the collection methods, but the decision is not so easily
made in a production environment. Other factors must be taken into consideration, including the
length of time required to collect the statistics and the resource consumption that collecting
full-table statistics incurs while other workloads are running on the system.
To resolve this, the benefits and drawbacks of each method must be considered. An excellent
information table comparing the three methods (Full Statistics, Sampled Statistics, Dynamic
AMP Samples) is provided in Chapter 2 of the SQL Request and Transaction Processing Release
14.0 manual, under the heading Relative Benefits of Collecting Full-Table and Sampled Statistics.
2.8.6 New opportunities for statistics collection in Teradata 14.0[1]
Teradata 14.0 offers some very helpful enhancements to the statistics collection process. This
section discusses a few of the key ones, with an explanation of how these enhancements can be
used to streamline your statistics collection process and help your statistics be more effective.
For more detail on these and other statistics collection enhancements, please read the orange
book titled Teradata 14.0 Statistics Enhancements, authored by Rama Korlapati, Teradata Labs.
In Teradata 14.0 you may optionally specify a USING clause within the COLLECT STATISTICS
statement. As an example, here are the three new USING options available in 14.0, with
illustrative parameters you might use:

. . . USING MAXINTERVALS 500
. . . USING MAXVALUELENGTH 50
. . . USING SAMPLE 10 PERCENT
MAXINTERVALS allows you to increase or decrease the number of intervals one statistic at a
time in the new version 5 statistics histogram. The default maximum number of intervals is
250. The valid range is 0 to 500. A larger number of intervals can be useful if you have
widespread skew on a column or index you are collecting statistics on, and you want
more individual high-row-count values to be represented in the histogram. Each statistics
interval highlights its single most popular value, which is designates as its “mode value” and lists
the number of rows that carry that value. By increasing the number of intervals, you will be
providing the optimizer an accurate row count for a greater number of popular values.
MAXVALUELENGTH lets you expand the length of the values contained in the histogram for
that statistic. The new default length is 25 bytes, when previously it was 16. If needed, you can
specify well over 1000 bytes for a maximum value length. No padding is done to the values in
the histogram, so only values that actually need that length will incur the space (which is why the
parameter is named MAXVALUELENGTH instead of VALUELENGTH). The 16-byte limit on
value sizes in earlier releases was always padded to full size: even if your statistics value was one
character, the full 16 bytes were used to represent it.
Another improvement around value lengths stored in the histogram has to do with multicolumn
statistics. In earlier releases the 16-byte limit for values in the intervals was taken from the
beginning of the combined value string. In 14.0, each column within the statistic can represent its
first 25 bytes in the histogram by default, so no column will go without representation in a
multicolumn statistics histogram.
SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection
level, rather than at the system level. This allows you to easily apply different levels of statistics
sampling to different columns and indexes.
COLLECT STATISTICS
USING MAXVALUELENGTH 50
COLUMN ( P_NAME )
ON CAB.product;
Statistics collection statements for the same table that share the same USING options, and that
request full statistics (as opposed to sampled), can now be grouped syntactically. In fact, it is
recommended that once you are on 14.0 you collect all such statistics on a table as one group.
The optimizer will then look for opportunities to overlap the collections wherever possible,
reducing the time to perform the statistics collection and the resources it uses.
Here is an example:
COLLECT STATISTICS
COLUMN (o_orderdatetime,o_orderID)
, COLUMN (o_orderdatetime)
, COLUMN (o_orderID)
ON Orders;
This is particularly useful when the same column appears in both single-column and multicolumn
statistics, as in the example above. In those cases the optimizer will perform the most inclusive
collection first (o_orderdatetime, o_orderID), and then re-use the spool built for that step to
derive the statistics for the other two columns. Only a single table scan is required, instead of the
three table scans the old approach needed.
Sometimes the optimizer will choose to perform separate collections (scans of the table) the first
time it sees a set of bundled statistics. But based on demographics it has available from the first
collection, it may come to understand that it can group future collections and use pre-aggregation
and rollup enhancements to satisfy them all in one scan.
Remember to re-code your statistics collection statements when you move to 14.0 in order to
realize these savings.
Note: With Teradata Software Release 15.0 and above, the Teradata Statistics Wizard is no
longer supported.
2.8.7 Recommended Reading

• SQL Request and Transaction Processing, Release 14.0: Chapter 2 provides excellent,
technically detailed information on different statistics collection strategies, along with good
explanations of how the Optimizer uses statistics.
• Anything written by Carrie Ballinger on the subject of Teradata statistics. Check out her
contributions to the Teradata Developer Exchange at http://developer.teradata.com/, including
“New opportunities for statistics collection in Teradata 14.0” on Carrie’s blog:
http://developer.teradata.com/blog/carrie/2012/08/new-opportunities-for-statistics-collection-in-teradata-14-0
• Also on http://developer.teradata.com/: “When is the right time to refresh statistics? – Part I”
(and Part II) by Marcio Moura:
http://developer.teradata.com/blog/mtmoura/2009/12/when-is-the-right-time-to-refresh-statistics-part-i
• Others:
http://developer.teradata.com/tools/articles/easy-statistics-recommendations-statistics-wizard-feature
http://developer.teradata.com/database/articles/statistics-collection-recommendations-for-teradata-12
http://developer.teradata.com/blog/carrie/2012/04/teradata-13-10-statistics-collection-recommendations
Section 2.9 Stored Procedures

Stored procedures can be internal (SQL and/or SPL) or external (C, C++, or Java in Teradata 12
and beyond) and are considered database objects. Internal and protected-mode external stored
procedures are run by the Parsing Engines (in other words, governed internally by Teradata).
Internal stored procedures are written in SQL and SPL, whereas external stored procedures
cannot execute SQL statements directly. External stored procedures can, however, execute other
stored procedures, providing an indirect method of executing SQL statements.
External stored procedures can also execute either as a separate process (outside of a Teradata
Parsing Engine) or in-process, depending on the protection mode used when the stored procedure
was created. Protected mode runs the procedure in its own isolated process under Teradata's
control, while unprotected mode runs it directly within the database. The tradeoff is that protected mode
will ensure that memory and other resources don’t conflict with Teradata but can negatively
affect performance. Running in unprotected mode can provide better performance but there is
risk of a potential resource conflict (memory usage/fault, using processing resources that would
be used by Teradata). If you are attempting to run a stored procedure and it’s very slow, one of
the first items to check is the protection mode that was selected when the procedure was created.
Note: If creating stored procedures using SQL Assistant, make sure that ‘Allow the Use of
ODBC SQL Extensions’ is checked (Menu – Tools/Options, Query tab). SQL Assistant will not
recognize the CREATE/REPLACE commands if this option is not checked.
Teradata 12.0 stored procedures can return result sets. Prior to 12.0, result sets had to be stored in
a table (permanent or temporary) for access outside the stored procedure.
Error handling is also built into Teradata stored procedures through messaging facilities
(SIGNAL, RESIGNAL) and a host of available standard diagnostic variables. External stored
procedures can also use a debug trace facility that provides a means to store tracing information
in a global temporary trace table.
It is important to find a balance when using stored procedures, especially when porting existing
stored procedures from another database. Teradata's strength lies in its ability to process large
sets of data very quickly, and row-at-a-time processing such as cursors can cause slower
performance. Starting with Teradata 13, recursive queries are allowed in stored procedures,
enabling a set-based approach to many of the problems that cursors have traditionally been used
to solve.
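A minimal SPL sketch and its invocation (database, procedure, table, and column names are
hypothetical):

REPLACE PROCEDURE Sales_DB.Add_Customer
  ( IN p_cust_id   INTEGER
  , IN p_cust_name VARCHAR(100) )
BEGIN
  INSERT INTO Sales_DB.Customer (cust_id, cust_name)
  VALUES (:p_cust_id, :p_cust_name);
END;

CALL Sales_DB.Add_Customer (12345, 'Acme Ltd');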
SP Commands

SHOW PROCEDURE
Displays the procedure statements and comments.
HELP PROCEDURE
Shows a procedure's parameter names and types.
ALTER PROCEDURE
Changes attributes such as protected mode and storing of SPL; compiles/recompiles stored procedures.
EXECUTE PROTECTED / EXECUTE NOT PROTECTED
Provides an execution option for fault isolation and performance.
ATTRIBUTES clause
Displays the transaction mode and platform the SP was created on.
DROP PROCEDURE
Removes unwanted SPs. For an XSP, it is removed from the available SP library with a re-link of
the library.
• Prior to Migration
> The Pre_upgrade_prep.pl script identifies SPs that will not recompile automatically.
• During Migration
> Qualified SPs and XSPs are recompiled automatically.
> SPs with no SPL and those that fail recompilation are identified; these must be manually
recreated or recompiled, respectively.
• SPs must be recompiled
> on new major releases of Teradata
> after cross-platform migration
Section 2.10 User Defined Functions (UDF)

Scalar Functions

Scalar functions accept inputs from an argument list and return a single result value. Some
examples of built-in Teradata scalar functions include SUBSTR, ABS, and SQRT.

Scalar UDFs can also be written using SQL constructs. These are called SQL UDFs and they are
very limited: SQL commands cannot be issued in SQL UDFs, so they are basically limited to
single statements using SQL functions. They have two main advantages: they can simplify SQL
DML statements by replacing long, convoluted logic with a function call, and they run faster
than C-language UDFs.
Example:
REPLACE FUNCTION Power( M Float, E Float )
RETURNS FLOAT
LANGUAGE SQL
CONTAINS SQL
RETURNS NULL ON NULL INPUT
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN
CASE M WHEN 0 then 0 ELSE EXP ( LN ( M ) * E ) END
;
SELECT POWER(cast(2 as decimal(17,0))-cast(1 as decimal(17,0)),2);
SELECT POWER(cast(1234567890098765432 as bigint)-1,2);
These are not the equivalent of PL/SQL Functions because they can't really do SQL. For ideas
on translating PL/SQL Functions to Teradata see
http://developer.teradata.com/blog/georgecoleman/2014/01/ordered-analytical-functions-
translating-sql-functions-to-set-sql
Aggregate Functions

Aggregate functions are similar to scalar functions except that they work on sets of data (created
by GROUP BY clauses in a SQL statement), processing one row at a time and returning a single
result per group. SUM, MIN, MAX, and AVG are examples of Teradata built-in aggregate
functions.
Table Functions

Table functions return a table one row at a time and, unlike scalar and aggregate UDFs, cannot be
called in the places where system functions are called. Table functions require a different syntax
and are invoked similarly to a derived table:
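(Representative sketch only; the returned column names are hypothetical.)

SELECT s.store_id, s.item_id, s.sales_amt
FROM TABLE (Sales_Retrieve(9005)) AS s;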
The table function, Sales_Retrieve is passed the parameter of 9005. The results of
Sales_Retrieve will be packaged to match the 3 columns in the SELECT clause.
Typical uses for UDFs include:
1. Additional SQL functions to enhance existing Teradata-supplied functions. For example,
certain string manipulations may be common for a given application. It may make sense to
create UDFs for those string manipulations; the rule is created once and can be used by many.
UDFs can also make porting from other databases easier (e.g., the Oracle DECODE function) by
coding a UDF to match the other database's function. Recreating the function (one code change)
reduces the amount of SQL rewrite for an application that may use the function many times
(many code changes).
2. Complex algorithms for data mining, forecasting, business rules, and encryption
3. Analysis of non-traditional data types (e.g., image, text, audio)
4. XML string processing
5. ELT (Extract, Load, Transform) data validation
UDFs are invoked by qualifying the database name where they are stored, e.g.
DBName.UDFname(), or, if stored in the special database called SYSLIB, without database name
qualification, e.g. UDFname().
Differences between UDFs and stored procedures:
• UDFs are invoked within a SQL DML statement, whereas stored procedures must be invoked
with an explicit CALL statement.
• UDFs cannot modify data with INSERT, UPDATE, or DELETE statements and can only work
with data that is passed in as parameters.
• UDFs are written in C/C++ (Java in Teradata 13); internal stored procedures are written in SPL
(Stored Procedure Language). External stored procedures are similar to UDFs in that they are
written in C/C++ or Java; however, a UDF cannot CALL a stored procedure, while a stored
procedure can invoke a UDF as part of a DML statement.
• UDFs run on the AMPs, while stored procedures run under control of the Parsing Engines
(PEs). Starting with Teradata V2R6.0, some UDFs can run on the PEs.
• UDFs can only return a single value (except for table UDFs).
• Stored procedures can handle multiple SQL exceptions, whereas UDFs can only catch and pass
one value to the caller.
Developing and distributing UDFs: UDFs support the concept of packages, which allow
developers to create function suites or libraries (e.g., .DLLs on Windows and .SOs on UNIX) that
are easily deployable across systems.
The following query lists the UDFs and external routines defined on the system:

SELECT
dbase.databasename (Format 'x(20)') (Title 'DB Name')
, UDFInfo.FunctionName (Format 'x(20)') (Title 'Function')
,case
when UDFInfo.FunctionType='A' then 'Aggregate'
when UDFInfo.FunctionType='B' then 'Aggr Ordered Anal'
when UDFInfo.FunctionType='C' then 'Contract Func'
when UDFInfo.FunctionType='E' then 'Ext Stored Proc'
when UDFInfo.FunctionType='F' then 'Scalar'
when UDFInfo.FunctionType='H' then 'Method'
when UDFInfo.FunctionType='I' then 'Internal'
when UDFInfo.FunctionType='L' then 'Table Op'
when UDFInfo.FunctionType='R' then 'Table Function'
when UDFInfo.FunctionType='S' then 'Ordered Anal'
else 'Unknown'
end (varchar(17), Title 'Function//Type')
, CAST(TVM.LastAlterTimeStamp AS DATE) (FORMAT 'MMM-DD',Title 'Altered')
FROM DBC.UDFInfo, DBC.DBase, DBC.TVM
WHERE DBC.UDFInfo.DatabaseId = DBC.DBase.DatabaseId
AND DBC.UDFInfo.FunctionId = DBC.TVM.TVMId
ORDER BY 1,2,3,4;
Archive/Restore/Copy/Migration Considerations

Only database-level BAR (Backup, Archive, and Restore) operations act on UDFs; there is no
way to selectively archive or restore an individual UDF (the same is true for stored procedures).
Section 2.11 Table Operators

Differences Between Table Functions and Table Operators
• The inputs and outputs for table operators are a set of rows (a table) and not columns. The
default format of a row is IndicData.
• In a table function, the row iterator is outside of the function and the iterator calls the function
for each input row. In the table operators, the operator writer is responsible for iterating over the
input and producing the output rows for further consumption. The table operator itself is called
only once. This reduces per row costs and provides more flexible read/write patterns.
A table operator can be system defined or user defined. Teradata release 14.10 introduces three
new system defined table operators:
• TD_UNPIVOT, which transforms columns into rows based on the syntax of the unpivot
expression.
• CalcMatrix, which calculates a correlation, covariance, or sums-of-squares-and-cross-products
matrix.
• LOAD_FROM_HCATALOG, which is used for accessing the Hadoop file system.
Use Options

The table operator is always executed on the AMP within a return step (stpret in DBQL). This
implies that it can read from spool, a base table, a PPI partition, an index structure, and so on, and
it always writes its output to spool. Some concepts related to operator execution follow.

If a HASH BY and/or a LOCAL ORDER BY is specified, the input data will always be spooled
to enforce the HASH BY geography and the LOCAL ORDER BY ordering within the AMP.
HASH BY can be used to assign rows to an AMP, and LOCAL ORDER BY can be used to order
the rows within an AMP. You can specify either or both of the clauses independently.
If a PARTITION BY and ORDER BY is specified, the input data will always be spooled to
enforce the PARTITION BY grouping and the ORDER BY ordering within the partition. You
can specify a PARTITION BY without an ORDER BY, but you cannot have an ORDER BY
without a PARTITION BY. Further, the table operator will be called once for each partition, and
the row iterator will cover only the rows within the partition. In summary, a PARTITION is a
logical concept, and one or more partitions may be assigned to an AMP, the same behavior as
ordered analytic partitions.
The USING clause values are modeled as key-value pairs. You can define multiple key-value
pairs, and a single key can have multiple values. The USING clause accepts literal values only
and cannot contain expressions, DML, and so on. Further, the values are handled by the syntaxer
in a similar manner to regular SQL literals. For example, {1, 1.0, '1'} are respectively passed to
the table operator as BYTEINT, DECIMAL(2,1), and VARCHAR(1) CHARACTER SET
UNICODE values.
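The general invocation shape is sketched below (the operator name, table, columns, and USING
parameter are all hypothetical):

SELECT *
FROM score_rows (
       ON Sales_DB.Daily_Sales
       HASH BY region_cd
       LOCAL ORDER BY sale_dt
       USING Threshold('0.8')
     ) AS t;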
Section 2.12. QueryGrid
Teradata Database provides a means to connect to a remote system and retrieve or insert data
using SQL. This enables easy access to Hadoop data for the SQL user, without replicating the
data in the warehouse.
Note: With Teradata Software Release 15.0, SQL-H has been rebranded as Teradata QueryGrid:
Teradata Database to Hadoop. The existing connectors on Teradata 14.10 to TDH/HDP will
continue to be called SQL-H. The 14.10 SQL-H was released with Hortonworks Hadoop and is
certified to work with TDH 1.1.0.17, TDH/HDP 1.3.2, and TDH/HDP 2.1 (TDH = Teradata
Distribution for Hadoop, HDP = Hortonworks Data Platform).
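As an illustration only (the foreign server and table names are hypothetical, the foreign server
object is defined separately by the DBA, and the exact syntax depends on the connector and
release):

SELECT w.page_key, COUNT(*) AS click_cnt
FROM web_clicks@hdp_server w
GROUP BY 1;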
The goal and vision of Teradata® QueryGrid™ is to make specialized processing engines,
including those in the Teradata Unified Data Architecture™, act as one solution from the user’s
perspective. Teradata QueryGrid is the core enabling software, engineered to tightly link with
these processing engines to provide intelligent, transparent, and seamless access to data and
processing. This family of intelligent connectors delivers bi-directional data movement and push-
down processing to enable the Teradata Database or the Teradata Aster Database systems to
work as a powerful orchestration layer.
As the role of analytics within organizations continues to grow, along with the number and types
of data sources and processing requirements, companies face increasing IT complexity. Much of
the complexity arises from the proliferation of non-integrated systems from different vendors,
each of which is designed for a specific analytic task.
This challenge is best addressed by the Teradata® Unified Data Architecture™, which enables
businesses to take advantage of new data sources, data types and processing requirements across
the Teradata Database, Teradata Aster Database, and open-source Apache™ Hadoop®. Teradata
QueryGrid™ optimizes and simplifies access to the systems and data within the Unified Data
Architecture and beyond to other source systems, including Oracle Database, delivering seamless
multi-system analytics to end-users.
This enabling solution orchestrates processing to present a unified analytical environment to the
business. It also provides fast, intelligent links between the systems to enhance processing and
data movement while leveraging the unique capabilities of each platform. Teradata Database
15.0 brings new capabilities to enable this virtual computing, building on existing features and
laying the groundwork for future enhancements.
Teradata QueryGrid is a powerful enabler of technologies within and beyond the Unified Data
Architecture that delivers seamless data access and localized processing. The QueryGrid adds a
single execution layer that orchestrates analyses across Teradata, Teradata Aster, Hadoop, and, in
the future, other databases and platforms. The analysis options include SQL queries, as well as
graph, MapReduce, R-based analytics, and other applications. Offering two-way InfiniBand
connectivity among data sources, the QueryGrid can execute sophisticated, multi-part analyses.
It empowers users to immediately and automatically access and benefit from all their data along
with a wide range of processing capabilities, all without IT intervention. This solution raises the
bar for enterprise analytics and gives companies a clear competitive advantage.
The vision, simply said, is that a business person connected to the Teradata Database or Aster
Database can submit a single SQL query that joins data together from one or more systems for
analysis. There is no need to depend upon IT to extract data and load it into another machine. The
business person does not have to care where the data is; they can simply combine relational
tables in Teradata with tables or flat files found in Hadoop on demand.
The Teradata approach to fabric-based computing also leverages these elements wherever
possible for seamlessly accessing data across the Teradata® Unified Data Architecture™:
• The scalability and integrity of InfiniBand to load-balance multiple fabrics, seamlessly
handling failover in the event of an interconnect failure.
• InfiniBand technology, a Teradata fabric, gains much of its resiliency from the Mellanox-
supplied InfiniBand switches, adapters, and cables that are recognized as industry-leading
products for high-quality, fully interoperable enterprise switching systems.
Section 2.13 DBQL
The recommended leading practice DBQL configuration for Teradata environments is that all usage
should be logged at the Detail level with SQL and Objects. The only exception is database usage
consisting strictly of subsecond, known work, i.e. tactical applications. This subsecond, known
work is logged at the Summary level.
Logging with DBQL is best accomplished by one logging statement for each “accountstring” on
a Teradata system. A database user session is always associated with an account. The account
information is set with an “accountstring.” Accountstrings typically carry a workload group
name (“$M1$”), identifying information for an application (“WXYZ”), and expansion variables
for activity tracking (“&S&D&H” for session number, date, and hour).
An example of the recommended DBQL Detail logging statement using this example
accountstring is:
BEGIN QUERY LOGGING with SQL, OBJECTS LIMIT sqltext=0 ON ALL ACCOUNT =
'$M1$WXYZ&S&D&H';
This statement writes rows to the DBQLogTbl table. This table contains detailed information
including but not limited to CPU seconds used, I/O count, result row count, system clock times
for various portions of a query, and other query identifying information. This logging statement
writes query SQL to the DBQLSqlTbl table with no SQL text kept in the DBQLogTbl. Database
object access counts for a query are written to the DBQLObjTbl table.
An example of the recommended DBQL Summary logging statement for tactical applications
(only known subsecond work) using the example accountstring on Teradata V2R6 and later is:
BEGIN QUERY LOGGING LIMIT SUMMARY=10,50,100 CPUTIME ON ALL ACCOUNT =
'$M1$WXYZ&S&D&H';
This statement results in up to four rows written to the DBQLSummaryTbl in each 10 minute DBQL
logging interval. These rows summarize query logging information in four buckets: queries using
between 0 and 0.10 CPU seconds, 0.10 to 0.50 CPU seconds, 0.50 to 1 CPU second, and over 1 CPU
second (the SUMMARY thresholds are expressed in hundredths of a CPU second).
For a well-behaved workload with occasional performance outliers, DBQL threshold logging can
be used. Threshold logging is not typically recommended but is available. An example of a
DBQL Threshold logging statement using the example accountstring is:
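Such a statement would take roughly the following form to match the behavior described below (a
sketch, using the same example accountstring):

BEGIN QUERY LOGGING LIMIT THRESHOLD=100 CPUTIME AND SQLTEXT=10000 ON ALL ACCOUNT =
'$M1$WXYZ&S&D&H';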
The CPUTIME threshold is expressed in hundredths of a CPU second. This statement logs all
queries over 1 CPU second to the DBQL Detail table, with the first 10,000 characters of the SQL
statement also logged to the DBQL Detail table. For queries under 1 CPU second, this DBQL
Threshold logging statement writes a cumulative query count, by CPU seconds consumed, for
each session as a separate row in the DBQLSummaryTbl every 10 minutes. The
DBQLSummaryTbl will also contain I/O use and other query identifying information.
IMPORTANT NOTE: DBQL Threshold and Summary logging cause you to lose the ability to
log SQL and OBJECTS. Threshold and Summary logging are recommended only after the
workload is appropriately profiled using DBQL Detail logging data. Further, threshold logging
should be used in limited circumstances. Detail logging data, with SQL and OBJECTS, is
typically desired to ensure a full picture is gathered for analysis of queries performing outside
expected norms.
DBQL can be enabled on Teradata user names in place of accountstrings. This is typically done
when measuring performance tests run under a specific set of known user logins. DBQL logging
by accountstrings is the more flexible, production-oriented approach.
If a maintenance process is required for storing DBQL data historically, the Download Center on
Teradata.com provides a DBQL Setup and Maintenance document with the table definitions,
macros, etc. to accomplish the data copy and the historical storage process. This is generally
implemented when it is decided that DBQL data should not be in the main or root Teradata DBC
database for more than a day. The DBQL data maintenance process from Teradata.com can be
implemented in under an hour.
Two additional logging statements are recommended in addition to the recommended Detail and
Summary logging statements previously mentioned. These additional DBQL logging statements
are used temporarily on a database user such as SYSTEMFE to dump the DBQL data buffers
before query analysis or during the maintenance process. Examples of this statement are:
END QUERY LOGGING ON SYSTEMFE;
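The paired statement, issued beforehand to turn logging on for that user (either statement is a
configuration change and therefore flushes the DBQL buffers), would presumably be:

BEGIN QUERY LOGGING ON SYSTEMFE;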
DBQL buffers can retain data for up to 10 minutes after a workload runs. On a lightly used Teradata
system, DBQL buffers may not flush for hours or days. Any DBQL configuration change causes
all buffered DBQL data to be written to the DBQL tables. Use of these two additional statements
ensures that all data is available before an analysis of DBQL data is performed.
DBQL can also capture the parameter values supplied with parameterized requests (ParamInfo
logging). This feature can be enabled for all users with the following statement.
Begin Query Logging with ParamInfo on all;
Now that the parameter values are captured, the same query can be replayed with the values in
place of the parameters. This helps customers isolate problem values supplied to parameters.
DBQL Queries
Query Analysis:
select
queryid,
FirstStepTime,
EstResultRows (format 'zzz,zzz,zzz,zz9'),
NumResultRows(format 'zzz,zzz,zzz,zz9'),
EstProcTime (format 'zzz,zzz,zz9.999'),
AMPCPUTime (format 'zzz,zzz,zz9.999'),
TotalIOCount(format 'zzz,zzz,zzz,zz9'),
NumOfActiveAMPs(format 'zzz,zz9'),
MaxAmpCPUTime(format 'zzz,zzz,zz9.999'),
MinAmpCPUTime(format 'zzz,zzz,zz9.999'),
MaxAmpIO (format 'zzz,zzz,zzz,zz9'),
MinAmpIO (format 'zzz,zzz,zzz,zz9')
from dbc.dbqlogtbl
where username ='xxxx'
and cast(FirstStepTime as date) = '2013-02-13'
and NumOfActiveAMPs <> 0
and AMPCPUTime > 100
order by Firststeptime;
Top 10 (CPU):
SELECT StartTime
,QueryID
,Username
,StatementType
,AMPCPUTime
,rank () over (
order by AMPCPUTime DESC) as ranking
from DBC.QryLog
where cast(CollectTimeStamp as date) = date
qualify ranking <=10;
To list all the tables for these databases that you can access:
select databasename,tablename from dbc.tablesX order by 1,2;
Collations can be defined at the USER (MODIFY USER ... COLLATION = x) or SESSION (SET SESSION
COLLATION x) levels. SET SESSION takes precedence. Each type of collation can be discovered by
using a query such as
SELECT CharType FROM DBC.ColumnsX WHERE ...
If Referential Integrity is defined on the tables, you can get a list of the relationships with a query
like this:
select
trim(ParentDB) || '.' || trim(ParentTable) || '.' || trim(ParentKeyColumn) (char(32)) "Parent"
, trim(ChildDB) || '.' || trim(ChildTable) || '.' || trim(ChildKeyColumn) (char(32)) "Child"
from DBC.All_RI_ParentsX
order by IndexName
;
If you want to start with a particular parent table and build a hierarchical list, you might try this
recursive query:
WITH RECURSIVE RI_Hier (ParentDB, ParentTable, ChildDB, ChildTable, level) AS
(
select ParentDB, ParentTable, ChildDB, ChildTable, 0
from DBC.All_RI_ParentsX
where ParentDB = <Parent-Database-Name> and ParentTable = <Parent-Table-Name>
union all
select child.ParentDB, child.ParentTable, child.ChildDB, child.ChildTable, RI_Hier.level+1
from RI_Hier
,DBC.All_RI_ParentsX child
where RI_Hier.ChildDB = child.ParentDB
and RI_Hier.ChildTable = child.ParentTable
)
select
trim(ParentDB) || '.' || trim(ParentTable) "Parent"
,trim(ChildDB) || '.' || trim(ChildTable) "Child"
,level
from RI_Hier
order by level, Parent, Child
;
There are a number of session-specific diagnostic features which can be very helpful under
specific situations. Use these by executing the statement(s) below in a query window prior to
diagnosing the query of interest. Note that these settings are only active for the current session.
When the session is terminated, the session parameter is cleared.
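For the statistics recommendations discussed next, the relevant session setting is:

DIAGNOSTIC HELPSTATS ON FOR SESSION;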
Using the EXPLAIN feature on a query in conjunction with the above session parameter
provides the user with statistics recommendations for the given query. While the list can be very
beneficial in helping identify missing statistics, not all of the recommendations may be
appropriate. Each of the recommended statistics should be evaluated individually for usefulness.
Using the example below, note that the optimizer is suggesting that statistics on columns
DatabaseName and TVMName would likely result in higher confidence factors in the query
plan. This is due to those columns being used in the WHERE condition of the query.
Query:
SELECT DatabaseName,TVMName,TableKind
FROM dbc.TVM T
,dbc.dbase D
WHERE D.DatabaseId=T.DatabaseId
AND DatabaseName='?DBName'
AND TVMName='?TableName'
ORDER BY 1,2;
Results (truncated):
• "COLLECT STATISTICS dbc.dbase COLUMN (DATABASENAME)". (HighConf)
• "COLLECT STATISTICS dbc.TVM COLUMN (TVMNAME)". (HighConf)
<- END RECOMMENDED STATS
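Similarly, Verbose Explain output is enabled at the session level with:

DIAGNOSTIC VERBOSEEXPLAIN ON FOR SESSION;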
Teradata's Verbose Explain feature provides additional query plan information above and
beyond that shown when using the regular Explain function. Specifically, more detailed
information regarding spool usage, hash distribution, and join criteria is presented. Use the
above session parameter in conjunction with the EXPLAIN feature on a query of interest.
3. Workload Management
Section 3.1 Workload Administration
Workload Management in Teradata is used to control system resource allocation to the various
workloads on the system. After installation of a Teradata partner application at a customer site,
Teradata has a number of tools used for workload administration. Teradata Active System
Management (TASM) is a grouping of products, including system tables and logs, that interact
with each other and a common data source. TASM consists of a number of products: Teradata
Workload Analyzer, Viewpoint Workload Monitor/Health, and Viewpoint Workload Designer.
TASM also includes features to capture and analyze Resource Usage and Teradata Database
Query Log (DBQL) statistics. There are a number of orange books, Teradata magazine articles,
and white papers addressing the capabilities of TASM.
Perhaps the best source of information for TASM is the Teradata University website at
https://university.teradata.com. There are a number of online courses and webcasts available on
the Teradata University site which offer a wealth of information on TASM and its component
products.
Portlets enable users across an enterprise to customize tasks and display options to their
specific business needs. You can view current data, run queries, and make timely business
decisions, reducing the database administrator workload by allowing you to manage your
work independently. Portlets are added to a portal page using the Add Content screen. The
Teradata Viewpoint Administrator configures access to portlets based on your role.
4. Migrating to Teradata
Section 4.1 Utilities and Client Access
Teradata offers a complete set of tools and utilities that exploit the power of Teradata for
building, accessing, managing, and protecting the data warehouse. Teradata’s data acquisition
and integration (load and unload) tools are typically used by partners in the ETL space while
partners in the Business Intelligence and EAI spaces use the connectivity and interface tools
(ODBC, JDBC, CLIv2, .NET, OLE DB). For this discussion, we will focus primarily on the
“Load & Unload tools” and the “Connectivity and Interface Tools.”
One common reason that Partners want to integrate their products with Teradata is Teradata’s
ability to more efficiently work with, and process, larger amounts of data than the other
databases that the Partner is accustomed to working with. With this in mind, Partners should
consider the advantages and flexibility offered by the Teradata Parallel Transporter (and the TPT
API), which provides the greatest flexibility and throughput capability of all the Load/Unload
products.
Another relatively new option, which should be considered as an ELT architecture approach, is the
ANSI MERGE in combination with NoPI tables, available as of Teradata 13.0.
The ANSI MERGE offers many of the capabilities of the Teradata utilities (error tables, block-at-a-
time optimization, etc.), and a FastLoad into a NoPI target table is up to 50% faster than a
FastLoad into a target table with a Primary Index.
Many other options are presented here, and it is important to consider all of them to find the
methods that are best for a particular Partner’s needs.
> TPump protocol
– TPump product or Parallel Transporter Stream Operator.
– TPump product - script-driven batch tool.
– Parallel Transporter Stream – use with TPT API or scripts.
– Optimizations for statement and data buffering, along with reduced
row-locking contention and SQL statement cache reuse on the
Teradata Database.
– Indexes are retained and updated
> BTEQ
– Batch SQL and report writer tool that has basic import/export.
> Preprocessor2
– Used for embedding SQL in C application programs.
> CLIv2
– Lowest level API but not recommended due to added complexity
without much performance gain over using higher level interface
– Note: CLIv2 is used by Teradata tools like BTEQ, Teradata load tools
etc. One can run different protocols using CLIv2 (e.g., SQL, ARC for
backup, FastLoad, MultiLoad, FastExport, etc.) Only SQL protocol is
published.
– Teradata Parallel Transporter (improved tools).
• Execute all load/unload protocols in one product with one scripting
language.
• Plug-in Operators: Load, Update, Stream, Export.
• Provides C++ API to protocols for ISV partners.
• If you already know the legacy Stand-alone load tools, everything you have learned about
the four load tools still applies as far as features, parameters, limitations (e.g., number of
concurrent load jobs), etc. There is no learning curve for the protocols, just learn the new
language and the new features.
• Why was the client code rewritten? Benefits are one tool with one language, performance
improvement on large loads with parallel load streams when I/O is a bottleneck, ease of
use, & the TPT API (for better integration with partner load tools, no landing of
data, and parallel load streams).
• Most everything about load tools still applies
> Similar basic features, parameters, limitations (e.g., number of concurrent load
jobs), when to use, etc.
• Parallel Transporter performs the tasks of FastLoad, MultiLoad, Tpump, and Fast Export. In
addition to these functions it also provides:
– Common scripting language across all processes. This simplifies the writing of scripts and
makes it easier to do tasks such as “Export from this database, load into this database” in a
single script.
– Full parallelism on the client side (where the FastLoad and MultiLoad run). We now create
and use multiple threads on the client side allowing ETL tools to fan out in parallel on the
multi-CPU SMP box they run on.
– API for connecting ETL tools straight to the TPT Load/Unload Operators. This will simplify
integration with these tools and improve performance in some cases.
• Wizard that generates scripts from prompts for learning the script language –
supports only small subset of Parallel Transporter features.
• Command line interface (known as Easy Loader interface) that allows one to
create load jobs with a single command line. Supports a subset of features.
• With Teradata release 14.10, the MultiLoad protocol on the Teradata Database has been
extended. This extension is known as Extended MultiLoad, or MLOADX. The new
extension is implemented only for the Parallel Transporter Update Operator. At
execution time, the extension converts the TPT Update Operator script into an ELT
process if the target table of the utility has any of the following objects: unique
secondary indexes, join indexes, referential integrity, a hash index, or a trigger; it also
supports temporal tables. This extension eliminates the need for the system administrator
to drop and re-create the aforementioned objects just so that MultiLoad may be used.
Conversion from ETL to ELT happens automatically; no change to the utility or its
script is necessary.
Note: The names -- FastLoad, MultiLoad, FastExport, TPump – refer to load protocols and
not products. Anywhere those load protocols are mentioned, Teradata Parallel Transporter
(TPT) can be substituted to run the load/unload protocols.
Parallel Transporter Architecture
[Architecture diagram: the Parallel Transporter client infrastructure and its operators feeding the
Teradata Database.]
Increased Throughput
[Diagram: the legacy stand-alone load utilities process one job per source, or one source at a time,
while a single Parallel Transporter job uses multiple Read Operator and Load Operator instances in
parallel.]
A single TPT job can read multiple input streams, can have parallel load processes that communicate
with the Teradata Database through parallel sessions (scaling across available network bandwidth), and
parallel PEs read data while parallel AMPs apply data to the target tables.
TPT Operators
Much of the functionality is provided by scalable, re-usable components called Operators:
• Operators can be combined to perform desired operations (load, update, filter, etc.).
• Operators & infrastructure are controlled through metadata.
• Scalable Data Connector – reads/writes data in parallel from/to files and other sources
(e.g. Named Pipes, TPT Filter Operator, etc.).
• Export, SQL Selector - read data from Teradata tables (Export uses Fast Export protocol
& Selector is similar to BTEQ Export).
• Load - inserts data into empty Teradata tables (FastLoad protocol).
• SQL Inserter - inserts data into existing tables (similar to BTEQ Import using SQL).
• Update - inserts, updates, deletes data in existing tables (uses MultiLoad protocol).
Starting with Teradata release 15.0, the TPT Update Operator can load LOBs.
• Stream - inserts, updates, deletes data in existing tables in continuous mode.
• Infrastructure reads script, parses, creates parallel plan, optimizes, launches, monitors,
and handles checkpoint/restart.
FastLoad, MultiLoad, and FastExport protocols are client/server protocols with a program that
runs on the client box talking with a proprietary interface to a program on the Teradata Database.
These closed, undocumented interfaces have been opened up with the Teradata Parallel
Transporter API.
With the FastLoad & MultiLoad protocols, there are two phases, acquisition and apply. In the
acquisition phase the client reads data as fast as it can and sends it to the database which puts the
data into temp tables. When the data read is exhausted, the client signals the database to start the
apply phase where the data in the temp tables is redistributed to the target tables in parallel.
Teradata Active System Management tools can throttle the Teradata load tools (ML, FL, & FE
protocols); TPump must be treated like any other SQL application.
TPT API
TPT API is the interface for programmatically running the Teradata load protocols.
Specifically: FastLoad, MultiLoad, FastExport, and TPump. It enhances partnering with 3rd
party tools by:
– Proprietary load protocols become open
– Partners integrate faster and easier
– Partner tool has more control over entire load process
– Increased performance
TPT API is used by more than ETL vendors – BI vendors use Export Operator to pull data
from TDAT into their tool.
Not all vendors use API. Script interface has its place (e.g., TDE) & non-parallel tools can
create parallel streams using named pipes and the script version of TPT which has a parallel
infrastructure to read the multiple input files in parallel.
Parallel Transporter - Using API
[Diagram: data sources (Oracle, flat files, etc.) feed an application or ETL program, which can pass
data directly through the TPT API to the Load, Update, Export, and Stream operators and on to the
Teradata Database. The numbered annotations (write data to a named pipe or file, read data, write and
read messages via a message and statistics file) correspond to the script-based flow in the steps below.]
1. Vendor ETL tool creates Teradata utility script & writes to file.
2. Vendor ETL tool reads source data & writes to intermediate file (lowers performance to
land data in intermediate file).
3. Vendor invokes Teradata utility (Teradata tool doesn’t know vendor tool called it).
4. Teradata tool reads script, reads file, loads data, writes messages to file.
5. Vendor ETL tool reads messages file and searches for errors, etc.
• Before the API, it was necessary to generate multiple script languages depending on
the tool, land data in file or named pipe, and post-process error messages from a file.
– Using the TPT API, the ETL tool reads data into a buffer and passes the buffer to the Operator
through the API (no landing of the data).
– Return codes and statistics are obtained through function calls in the API.
– There is no need to generate multiple script languages.
– The ETL tool has control over the load process (e.g., the ETL tool determines when a
checkpoint is kicked off instead of it being specified in the load script).
– The ETL tool can dynamically choose an operator (e.g., UPDATE vs. STREAM).
– Simpler, faster, higher performance.
> No landing the data with TPT API – in memory buffers.
• Ease of Use – When Using Scripts
Ease of use features apply when you are writing scripts. If you use an ETL vendor, then
the ETL tool will either call the TPT API or generate the appropriate script.
• 1. Parameters based on the load protocol
• 2. Data
• 3. Get messages
Only the four load/unload protocols are available through TPT API. No access modules
are available since ETL vendors have the functionality of the Teradata access modules in
their products.
The following two diagrams depict the advantages of TPT over Stand-alone tools:
[Diagram: with the stand-alone utilities, the ETL tool must bring parallel streams back to one input
source (a file or pipe), because FastLoad can only read one input stream into Teradata.]
• Stand-alone load tools can only read one input source
Parallel Input Streams Using API
[Diagram: Parallel Transporter reads parallel streams (e.g., from Oracle) with multiple TPT Load
instances launched through the API, all loading the Teradata Database in parallel.]
• Application program (e.g., ETL tool) reads the source data and calls the API to pass:
• 1. Parameters to the load Operator based on the load protocol (e.g., number of
sessions to connect to Teradata, etc.)
• 2. Data
• 3. Function calls to get messages and statistics
ETL tool can flow parallel streams of data through TPT API to gain throughput for
large data loads.
> Use of UDFs (e.g., table functions), stored procedures, triggers, etc.
> Data Mover product is a shell on top of TPT API, ARC, and JDBC
The following table maps each product to the load protocols it supports. The protocol columns, from
left to right, are: SQL (Insert, Update, etc.), FastLoad, MultiLoad, TPump, FastExport, ARC, and CDC
(a changed data capture API, for pulling data out only).
Product
TPT X X X X X
BTEQ X
ODBC driver X
JDBC driver X X X
OLE DB provider X
.NET Data Provider X
TPump, TPT Stream X
FastLoad, TPT Load X
MultiLoad, TPT Update X
FastExport, TPT Export X
BAR & ARC scripts X
Replication Services
(GoldenGate) X X
TDM (calls TPT API &
ARC) X X X X X
Teradata Unity X X X X X X X
Section 4.2 Load Strategies & Architectural Options
The ELT approach can also take advantage of the SQL bulk load operations that are available
within the Teradata Database. These operations not only support MERGE-INTO but also
enhance INSERT-SELECT and UPDATE-FROM. This enables primary, fallback and index data
processing with block-at-a-time optimization.
The Teradata bulk load operations also allow users to define their own error tables to handle
errors from operations on target tables. These are separate and different from the Update
operator’s error tables. Furthermore, the no primary index (NoPI) table feature also extends the
bulk load capabilities. By allowing NoPI tables, Teradata can load a staging table faster and
more efficiently.
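For example, a NoPI staging table (hypothetical names) is declared simply by omitting the primary
index:

CREATE TABLE Sales_Stage (
sale_id INTEGER,
sale_amt DECIMAL(12,2)
) NO PRIMARY INDEX;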
Merge is ANSI-standard SQL syntax that can perform bulk operations on tables using the
extract, load, and transform (ELT) approach. These operations merge data from a source table
into a target table to perform massive inserts, updates, and upserts. So why use the MERGE
function instead of an INSERT-SELECT, or when an update join will suffice? Better performance and
the added functionality of executing a bulk, SQL-based upsert.
Beginning with Teradata 13, the FastLoad target table can be a “No-PI” table. This type of table
will load data faster because it avoids the redistribution and sorting steps, but this only postpones
what eventually must be done during the merge process. The “merge target table” can have a
predefined error table assigned to it for trapping certain kinds of failures during the merge
process. There can be up to a 50% performance improvement when NoPI tables are used in a
FastLoad loading scenario.
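A minimal sketch of this pattern, assuming a hypothetical target table Sales with sale_id as its primary
index and Sales_Stage as the FastLoaded NoPI staging table:

CREATE ERROR TABLE FOR Sales;

MERGE INTO Sales AS tgt
USING Sales_Stage AS src
ON tgt.sale_id = src.sale_id
WHEN MATCHED THEN UPDATE SET sale_amt = src.sale_amt
WHEN NOT MATCHED THEN INSERT (sale_id, sale_amt)
VALUES (src.sale_id, src.sale_amt)
LOGGING ALL ERRORS WITH NO LIMIT;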
> ISV tools - CDC data from other databases & load Teradata.
Section 4.3 Concurrency of Load and Unload Jobs
If you do not use the Teradata Viewpoint Workload Designer portlet to throttle concurrent load
and unload jobs, the MaxLoadTasks and the MaxLoadAWT fields of the DBSControl determine
the combined number of FastLoad, MultiLoad, and FastExport jobs that the system allows to run
concurrently. The default is 5 concurrent jobs.
If you have the System Throttle (Category 2) rule enabled, even if there is no Utility Throttle
defined, the maximum number of jobs is controlled by the Teradata dynamic workload
management software and the value in MaxLoadTasks field is ignored.
For more information on changing the concurrent job limit from the default value, see
"MaxLoadTasks" and "MaxLoadAWT" in the chapter on DBS Control in the Utilities manual.
There can be as much as a 10x difference between TPT Load/Update and ODBC parameter array
inserts. Not utilizing ODBC parameter arrays (single row at a time) can make this as much as a
100x difference.
5. References
Teradata is a powerful relational database engine that can perform complex processing against
large volumes of data. The preferred data processing architecture for a Teradata solution is one
that would have business questions/problems passed to the database via complex SQL statements
as opposed to selecting data from the database for processing elsewhere.
The following examples are meant to provide food for thought in how the ISV would approach
the integration with Teradata.
Derived Tables
Description
A derived table is obtained from one or more other tables through the results of a query. How
derived tables are implemented determines whether, and how much, performance is enhanced. For
example, one way of optimizing a query is to use derived tables to control how data from
different tables is accessed in joins. The use of a derived table in a SELECT forces the subquery
to create a spool file, which then becomes the derived table. Derived tables can then be treated in
the same way as base tables. Using derived tables avoids CREATE and DROP TABLE
statements for storing retrieved information and can assist in optimizing joins. A
derived table is visible only to the SELECT statement that calls the subquery.
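For illustration, a sketch that aggregates in a derived table and then filters the result (using the ITEM
table and columns from the examples later in this section):

SELECT dt.L_PARTKEY, dt.total_qty
FROM (SELECT L_PARTKEY, SUM(L_QUANTITY) AS total_qty
FROM ITEM
GROUP BY L_PARTKEY) AS dt
WHERE dt.total_qty > 100
ORDER BY dt.total_qty DESC;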
Recursive SQL
Description
Recursive queries are used for hierarchies of data, such as Bill of Materials, organizational
structures (department, sub-department, etc.), routes, forums of discussions (posting, response
and response to response) and document hierarchies.
Example
The following selects a row for each child-parent, child-grandparent, etc. relationship in a
recursive table
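A sketch of such a query against a hypothetical EMPLOYEE table with EmpNo and MgrNo columns
(MgrNo holding each employee's parent, i.e. manager):

WITH RECURSIVE EmpHier (EmpNo, MgrNo, depth) AS (
SELECT EmpNo, MgrNo, 0
FROM EMPLOYEE
WHERE MgrNo IS NULL /* start at the top of the hierarchy */
UNION ALL
SELECT e.EmpNo, e.MgrNo, h.depth + 1
FROM EmpHier h, EMPLOYEE e
WHERE e.MgrNo = h.EmpNo
)
SELECT EmpNo, MgrNo, depth
FROM EmpHier
ORDER BY depth, EmpNo;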
Sub Queries
Description
Subqueries permit a more sophisticated and detailed query of a database through the use of nested
SELECT statements, eliminating the return of intermediate result sets to the client. There are
subqueries used in search conditions, and correlated subqueries, in which the inner query references
columns of tables in the enclosing, or containing, outer query. The expression 'correlated subquery'
comes from the explicit requirement for the use of correlation names (table aliases) in any correlated
subquery in which the same table is referenced in both the internal and external query.
Example (Complex – w/ OLAP function and Having clause)
The following SELECT statement uses a nested OLAP function and a HAVING clause to display
those Partkeys that appear in the top 10 percent of profitability in more than 10 Orderkeys.
(result: 486 rows)
Correlated Subqueries
SELECT *
FROM EMPLOYEE AS T1
WHERE SALARY =
(SELECT MAX(SALARY)
FROM EMPLOYEE AS T2
WHERE T1.DEPTNO = T2.DEPTNO);
Case Statement
Description
The CASE expression is used to return alternative values based on search conditions. There are
two forms of the CASE Expression:
Valued CASE Expression
- Specify a SINGLE expression to test (equality)
- List the possible values for the test expression that return different results
- CASE value_expression_1 WHEN value_expression_n THEN
scalar_expression_n ELSE scalar_expression_m END (the result is either
scalar_expression_n or scalar_expression_m)
Searched CASE Expression
- You do not specify an expression to test. You specify multiple, arbitrary search
conditions that can return different results.
- CASE WHEN search_condition_n THEN scalar_expression_n ELSE
scalar_expression_m END
SELECT T1.P_TYPE,
SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
THEN (T2.L_QUANTITY) ELSE (0) END) AS "RETURNED",
SUM(CASE WHEN (T2.L_RETURNFLAG <> 'R')
THEN (T2.L_QUANTITY) ELSE (0) END) AS "NOT RETURNED",
(SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
THEN (T2.L_QUANTITY) ELSE (0) END)) /
(
(SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
THEN (T2.L_QUANTITY) ELSE (0) END)) +
(SUM(CASE WHEN (T2.L_RETURNFLAG <> 'R')
THEN (T2.L_QUANTITY) ELSE (0) END))
) * 100 AS "% RETURNED"
FROM PRODUCT T1, ITEM T2
WHERE T2.L_PARTKEY = T1.P_PARTKEY
GROUP BY 1
HAVING ("RETURNED" / ("RETURNED" + "NOT RETURNED")) >= .25
ORDER BY 4 DESC, 1 ASC
SELECT CURRENT_DATE,
SUM(CASE WHEN L_SHIPDATE >
(((((CURRENT_DATE (FORMAT 'YYYY'))(CHAR(4))) || '-' ||
TRIM((EXTRACT(MONTH FROM
CURRENT_DATE)) (FORMAT '99')) || '-01')))(DATE) AND
L_SHIPDATE < CURRENT_DATE
THEN L_EXTENDEDPRICE ELSE 0 END) AS MTD,
SUM(CASE WHEN L_SHIPDATE >
(((((CURRENT_DATE (FORMAT 'YYYY'))(CHAR(4))) || (('-01-01')))
(DATE))) AND L_SHIPDATE < CURRENT_DATE
THEN L_EXTENDEDPRICE ELSE 0 END) AS YTD,
SUM(CASE WHEN L_SHIPDATE >
CURRENT_DATE-365
THEN L_EXTENDEDPRICE ELSE 0 END) AS ROLLING365,
SUM(CASE WHEN L_SHIPDATE <
CURRENT_DATE
THEN L_EXTENDEDPRICE ELSE 0 END) AS ITD
FROM ITEM;
The following SELECT statement displays projected Monthly and YTD Sums for the Extended
Price attribute based on Shipdate. (result: 84 rows)
The following SELECT statement ranks the Monthly Extended Price in descending order within
each year using Monthly and YTD Sums for the Extended Price attribute query above. (result: 84
rows)
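A sketch of this ranking pattern, computing the monthly sums in a derived table from the ITEM table
used above:

SELECT yr, mon, monthly_price,
RANK() OVER (PARTITION BY yr ORDER BY monthly_price DESC) AS price_rank
FROM (SELECT EXTRACT(YEAR FROM L_SHIPDATE) AS yr,
EXTRACT(MONTH FROM L_SHIPDATE) AS mon,
SUM(L_EXTENDEDPRICE) AS monthly_price
FROM ITEM
GROUP BY 1, 2) AS m
ORDER BY yr, price_rank;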
Example
The following inserts, submitted as one multi-statement request, all participate in the fast path insert
into an empty target table:
INSERT INTO Table_A
SELECT *
FROM Table_A1
;INSERT INTO Table_A
SELECT *
FROM Table_A2
;
Actually, all the DELETE ALL does is re-chain the internal data blocks onto the "data block free chain,"
something that is not affected by the number of rows in the table.
If a transaction is created with a succession of delete statements, it will not be able to take advantage of
this feature except (possibly) for the last delete. The reason for this is that, by definition, all statements
within a transaction must either succeed or fail. Teradata must be able to recover or undo the changes.
To do this, the transient journal must record the "before image" of all data modified by the transaction.
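For example, in an explicit transaction containing several unconstrained deletes (hypothetical tables),
only the final DELETE is a candidate for the fast path; the earlier ones must be journaled:

BT;
DELETE FROM Order_History_2013; /* before-images recorded in the transient journal */
DELETE FROM Order_History_2014; /* last delete in the transaction */
ET;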
Statements executed in parallel (F9) within SQL Assistant operate as a transaction. Note that this is
not the case for the other execute option, F5: "Execute the statements one step at a time".
Within a transaction, you can take advantage of the fast path delete if the following requirements are met:
Teradata can process set expressions (SQL) in parallel, but procedural statements result in serial,
row-at-a-time processing.
As an example, the following is what we would typically encounter in working with an Oracle
based application:
where :rundate < current_date
;
A nested loop is usually a join, so why not do it in SQL? Sub-processes or nested processes must
be analyzed as nested loops and brought into SQL. Examples in code are multiple subroutines
within a package or program, functions in C or PL/SQL or whatever, and nested loops.
Another example:
UPDATE Accts
FROM (
SELECT AcctNum, SUM(Amt) from Acct_Transactions
GROUP BY AcctNum
) Txn_Sum ( acct, total_amt )
SET acct_balance = acct_balance + Txn_Sum.total_amt
WHERE Accts.AcctNum = Txn_Sum.acct ;
When all else fails,
– You might be able to use Teradata SPL (not parallel)
– Or create an aggregate table.
Nearly-unique (any column which is over 95% unique is considered as a nearly-unique
column) columns or indexes
Groups of columns that often appear together with equality predicates, if the first 16
bytes of the concatenated column values are sufficiently distinct. These statistics are used
for single-table estimates.
Groups of columns used for joins or aggregations, where there is either a dependency or
some degree of correlation among them. With no multicolumn statistics collected, the
optimizer assumes complete independence among the column values. The more that the
combination of actual values are correlated, the greater the value of collecting
multicolumn statistics.
Other Considerations
Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination
are not chosen unless statistics have been collected on the relevant columns.
NUPIs that are used in join steps in the absence of collected statistics are assumed to be
75% unique, and the number of distinct values in the table is derived from that. A NUPI
that is far off from being 75% unique (for example, it’s 90% unique, or on the other side,
it’s 60% unique or less) benefits from having statistics collected, including a NUPI
composed of multiple columns regardless of the length of the concatenated
values. However, if it is close to being 75% unique, dynamic AMP samples are
adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you
can issue this SQL statement:
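A query along these lines (with a hypothetical table MyTable and NUPI column nupi_col) returns the
percent uniqueness:

SELECT CAST(COUNT(DISTINCT nupi_col) AS FLOAT) / COUNT(*) * 100 AS pct_unique
FROM MyTable;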
For a partitioned primary index table, it is recommended that you always collect statistics
on:
o PARTITION. This tells the optimizer how many partitions are empty, and how
many rows are in each partition. This statistic is used for optimizer costing.
o The partitioning column. This provides cardinality estimates to the optimizer
when the partitioning column is part of a query’s selection criteria.
For a partitioned primary index table, consider collecting these statistics if the
partitioning column is not part of the table’s primary index (PI):
o (PARTITION, PI). This statistic is most important when a given PI value may
exist in multiple partitions, and can be skipped if a PI value only goes to one
partition. It provides the optimizer with the distribution of primary index values
across the partitions. It helps in costing the sliding-window and rowkey-based
merge join, as well as dynamic partition elimination.
o (PARTITION, PI, partitioning column). This statistic provides the combined
number of distinct values for the combination of PI and partitioning columns after
partition elimination. It is used in rowkey-based merge join costing.
Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than
from a single AMP (the default). For small tables, with less than 25 rows per AMP, all-
AMP sampling is done automatically. It is also the default for volatile tables and sparse
join indexes. All-AMP sampling comes with these tradeoffs:
o Dynamic all-AMP sampling provides a more accurate row count estimate for a
table with a NUPI. This benefit becomes important when NUPI statistics have
not been collected (as might be the case if the table is extraordinarily large), and
the NUPI has an uneven distribution of values.
o Statistics extrapolation for any column in a table is triggered only when the
optimizer detects that the table has grown. The growth is computed by comparing
the current row count with the last known row count to the optimizer. If the
default single-AMP dynamic sampling estimate of the current row count is not
accurate (which can happen if the primary index is skewed), it is recommended to
enable all-AMP sampling or re-collect PARTITION statistics.
o Parsing times for queries may increase when all AMPs are involved, as the
queries that perform dynamic AMP sampling will have slightly more work to do.
Note that dynamic AMP samples will stay in the dictionary cache until the
periodic cache flush, or unless they are purged from the cache for some
reason. Because they can be retrieved once and re-used multiple times, it is not
expected that dynamic all-AMP sampling will cause additional overhead for
all query executions.
For temporal tables, follow all collection recommendations made above. However,
statistics are currently not supported on BEGIN and END period types. That capability is
planned for a future release.
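For example, for a hypothetical partitioned primary index table Sales_PPI with primary index store_id
and partitioning column sale_date, the collections recommended above would be:

COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION);
COLLECT STATISTICS ON Sales_PPI COLUMN (sale_date);
COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION, store_id);
COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION, store_id, sale_date);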
These recommendations were compiled by: Carrie Ballinger, Rama Krishna Korlapati, Paul
Sinclair
Section 5.4 Reserved words
Please refer to Appendix B of the Teradata Database SQL Fundamentals manual for a list of
Restricted Words.
Here is a list of orange books that are of particular interest for migrating and tuning applications on
the Teradata database. In addition, there are a number of orange books that address controlling
and administering mixed workload environments, and other subjects, that are not listed here.
“Understanding Oracle and Teradata Transactions and Isolation Levels for Oracle Migrations.”
When migrating applications from Oracle to Teradata, the reduced isolation levels used by the
Oracle applications need to be understood before the applications can be ported or redesigned to
run on Teradata. This Orange Book will describe transaction boundaries, scheduling, and the
isolation levels available in Oracle and in Teradata. It will suggest possible solutions for coping
with incompatible isolation levels when migrating from Oracle to Teradata.
“ANSI MERGE Enhancements.” This is an overview of the support of full ANSI MERGE
syntax for set inserts, updates and upserts into tables. This includes an overview of the batch
error handling capabilities.
"Implementing Tactical Queries: the Basics, Teradata Database V2R6.1." Tactical queries support
decision-making of an immediate nature within an active data warehouse environment. These
response time-sensitive queries often come with clear service level expectations. This orange book
addresses supporting tactical queries with the Teradata database. Note: Although the title
specifies Teradata V2R6.1, this orange book is also applicable to Teradata V12 and beyond.
"Feeding the Active Data Warehouse." Active ingest is one of the first steps that needs to be
considered in evolving towards an active data warehouse. There are several proven approaches to
active ingest into a Teradata database. This orange book reviews the approaches, their pros and cons,
and implementation guidelines.
"Reserved QueryBand Names for Use by Teradata, Customer and Partner Applications." The
Teradata 12 feature known as Query Bands provides a means to set Name/Value pairs across
individual database connections at a Session or Transaction level to provide the database with
significant information about the connection's originating source. This provides a mechanism for
applications to collaborate with the underlying Teradata Database in order to provide for better
Workload Management, Prioritization, and Accounting.
"Teradata Active System Management." As DBAs and other support engineers attempt to
analyze, tune, and manage their environment's performance, these new features will greatly ease
that effort through centralizing management tasks under one domain, providing automation of
certain management tasks, improving visibility into management-related details, and by
introducing management and monitoring by business-driven, workload-centric goals.
"Stored Procedures Guide." This Orange Book provides an overview of stored procedures and
some basic examples for getting started.
"User Defined Functions" and "Teradata Java User Defined Functions User's Guide" are two
Orange Books for digging deeper into C/C++ and Java UDFs. Both guides explain the UDF
architecture and how a UDF is created and packaged for use by Teradata.
K-means clustering and Teradata 14.10 table operators. Article by Watzke on 17 Sep 2013.
Teradata Developer Exchange. http://developer.teradata.com/extensibility/articles/k-means-
clustering-and-teradata-14-10-table-operators-0
Here are a few white papers and Teradata manuals that are of particular interest for migrating and
tuning applications on the Teradata database. A list of available white papers can be found at
http://www.teradata.com/t/resources.aspx?TaxonomyID=4533. The Teradata manuals are found
in the latest documentation set available at www.info.teradata.com.
“Implementing AJIs for ROLAP.” This white paper describes how to build and implement
ROLAP cubes on the Teradata database.
http://www.teradata.com/t/article.aspx?id=1644
“Teradata Database Queue Tables.” This white paper describes the Queue Tables feature that
was introduced in Teradata® Database V2R6. This feature enables new and improved ways of
building applications to support event processing use cases.
http://www.teradata.com/t/article.aspx?id=1660
“Oracle to Teradata Migration Technical Info.” This document is based on a true practical
experience in migrating data between Oracle and Teradata. This document contains a description
of the technicalities involved in the actual porting of the software and data; as well as some
templates and tools provided that are useful for projects of this nature. This is available on the
Teradata Partner Portal.
[1] Blog entry by Carrie on 27 Aug 2012.