Born To Be Parallel and Beyond - DA015152

WHITE PAPER

Born to Be Parallel,
and Beyond

Original work by Carrie Ballinger - Cloud Performance Architect


Revised and updated by Douglas Ebel - Director, Technical Product Marketing
09.22 / TERADATA VANTAGE

Table of Contents

• Teradata's Enduring Performance Advantage
• Multidimensional Parallel Capabilities
• Parallel-Aware Optimizer
• BYNET's Considerable Contribution
• A Flexible, Fast Way to Find and Store Data
• Workflow Self-Regulation
• Workload Management
• Conclusion
• About Teradata

Teradata's Enduring Performance Advantage

Teradata's ability to solve the most complex analytics problems of our day is unmatched. Available today in an array of deployment options, Teradata evolved from an on-prem analytics appliance, where limited resources drove deep expertise in workload management and query optimization. In the cloud, this same expertise translates into superior execution performance without the risk of costs escalating out of control.

Teradata customers from every industry and around the world depend on Teradata for predictable price performance and mission-critical reliability to run their business. The performance offered by Vantage empowers them to run analytics that would otherwise be too expensive or time-consuming: for example, continuously running predictive models on every household in their markets rather than only simple models on a subset of customers.

To ensure that we continue to deliver on the promise of Teradata, we have continuously invested in the architecture of our core analytics engine. As we have done so, we continue to follow key tenets that arose from being born at a time of scarcity, because we believe that although resources in the cloud may be infinite, budgets are not.

The enduring performance advantage of the Advanced Analytics Engine is a direct result of early, somewhat unconventional design decisions made by a handful of imaginative architects. Intended for a more technical audience, this paper describes and illustrates some of these key fundamental components of the Advanced Analytics Engine that are as critical to performance now as they were then, and upon which today's features and capabilities rest.


Discussions of these specific areas are included in this paper:

• Multidimensional parallel capabilities
• A parallel-aware query optimizer
• The BYNET's considerable contribution
• A flexible and fast way to find and store data
• Internal self-regulation of the flow of work
• Managing the flow of work externally with Workload Management

It is important to note that the scope of this whitepaper is limited to important, foundational components of database performance. It is not a comprehensive discussion of all aspects of the Teradata Advanced Analytics Engine or the platform.

Figure 1. Inside a unit of parallelism. (Each AMP performs row locking, reading/writing, sorting, aggregating, index building, transaction journaling, loading, and backup and recovery for its own data.)

Multidimensional Parallel Capabilities

Everything in Vantage is parallelized—from the entry of SQL statements to the smallest detail of their execution—to weed out any possible single point of control and to effectively eliminate the conditions that can breed gridlock in a system. It is this foundational architecture, dating back to our humble beginnings, that continues to deliver unmatched performance and the best price per query in the market today.

The basic Teradata unit of parallelism is the AMP (Access Module Processor), a virtual processing unit that manages all database operations for its portion of a table's data. Many AMPs are typically configured on a given node (20 to 40 or more are common). Everything that happens in a Teradata system is distributed across a pre-defined number of AMPs, with each AMP acting like a microcosm of the database, supporting such things as data loading, reading, writing, journaling, and recovery for all the data that it owns (see Figure 1). Importantly, parallel units work cooperatively together behind the scenes—an unusual strength that drives higher performance with minimal overhead.

Types of Query Parallelism

While the AMP is the fundamental unit of parallelism, there are two additional parallel dimensions woven into the Advanced Analytics Engine, specifically for query performance. These are referred to here as "within-a-step" parallelism and "multi-step" parallelism. The following sections describe these three dimensions of parallelism.

Parallel Execution Across AMPs

Parallel execution across AMPs involves breaking the request into subdivisions and working on each subdivision at the same time, with one single answer delivered. Parallel execution can incorporate all or part of the operations within a query and can significantly reduce the response time of a request, particularly if the query or function reads and analyzes a large amount of data.

Parallel execution is usually enabled in Teradata by hash-partitioning the data across all the AMPs defined in the system. Once data is assigned to an AMP, the AMP provides all the database services on its allocation of data blocks. All relational operations such as table scans, index scans, projections, selections, joins, aggregations, and sorts execute in parallel across the AMPs simultaneously. Each operation is performed on an AMP's data independently of the data associated with the other AMPs.
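The per-AMP scan pattern described above can be sketched in miniature (illustrative Python; the four-AMP layout, the rows, and the region predicate are all invented for this example, not Teradata internals):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data already hash-distributed across 4 AMPs:
# each AMP owns its own list of rows and scans it independently.
amps = [
    [{"id": 1, "region": "WEST"}, {"id": 5, "region": "EAST"}],
    [{"id": 2, "region": "WEST"}],
    [{"id": 3, "region": "EAST"}, {"id": 7, "region": "WEST"}],
    [{"id": 4, "region": "EAST"}],
]

def scan(amp_rows, predicate):
    # Each AMP applies the same relational operation to only its rows.
    return [r for r in amp_rows if predicate(r)]

pred = lambda r: r["region"] == "WEST"
with ThreadPoolExecutor(max_workers=len(amps)) as pool:
    partials = pool.map(scan, amps, [pred] * len(amps))

# One single answer is delivered by combining the per-AMP results.
answer = [r for part in partials for r in part]
print(sorted(r["id"] for r in answer))  # → [1, 2, 7]
```

Each unit of parallelism touches only the rows it owns, which is why no coordination is needed while the scans are in flight.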


Within-a-Step Parallelism

Within-a-step parallelism is when the optimizer carefully splits a request into a small number of high-level database operations and dispatches these distinct operations for execution in a process called pipelining. Here each operation can continue without waiting for the completion of the full results from the first operation. The relational-operator mix of a step is carefully chosen by the Teradata optimizer to avoid stalls within the pipeline (see Figure 2).

Figure 2. Pipelining of 4 operations within one query step. (Select and project the Product table, select and project the Inventory table, join the Product and Inventory tables, and redistribute the joined rows to other AMPs, with each operation starting before the previous one finishes.)

Multi-Step Parallelism

Multi-step parallelism is enabled by executing multiple "steps" of a query simultaneously, across all the participating units of parallelism. One or more tasks are invoked for each step on each AMP to perform the actual database operation. Multiple steps for the same query can be executing at the same time to the extent that they are not dependent on results of previous steps.

Figure 3. Multiple types of parallelism combined.

Figure 3 shows four AMPs supporting a single query's execution, where the query has been optimized into 7 steps. Step 1.2 and Step 2.2 each demonstrate within-a-step parallelism, where two different tables are scanned and joined together (three different operations are performed). The result of those three operations is pipelined into a sort and then a redistribution, all in one step. Steps 1.1 and 1.2 together (as well as 2.1 and 2.2 together) demonstrate multi-step parallelism, as two distinct steps are chosen to execute at the same time, within each AMP.

This automated, multifaceted parallelism is not easy to choreograph unless it is planned for in the early stages of product evolution. In addition to the three dimensions of parallelism for each query described here, we will see additional elements below that ensure that Teradata customers get maximum value from every system. It is important to note that the Advanced Analytics Engine applies these multiple dimensions of parallelism automatically, without user intervention, hints, or special setup.

Multi-Statement Requests

In addition to the three dimensions of parallelism shown in Figure 3, Multi-Statement Requests allow several distinct SQL statements to be bundled together and sent to the optimizer as if they were one unit. These will be run in parallel as long as there are no dependencies among the statements. More importantly, any sub-expressions that the different statements have in common will be executed once, and the results shared among them (see Figure 4).
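A rough sketch of how two independent steps can run concurrently while a dependent join waits on both (illustrative Python; the table contents and the simple hash join are invented for the example, not the engine's actual join code):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tables; names mirror Figure 3 but the data is invented.
product = [("p1", "widget"), ("p2", "gadget")]
inventory = [("p1", 10), ("p2", 3)]

def scan(table):
    # Stand-in for a full step: read and qualify rows.
    return list(table)

def join(left, right):
    # Simple hash join on the first column of each row.
    lookup = {key: rest for key, *rest in right}
    return [(key, *rest, *lookup[key]) for key, *rest in left if key in lookup]

with ThreadPoolExecutor() as pool:
    # Two independent scan steps execute at the same time
    # (multi-step parallelism); the join consumes both results.
    f_product = pool.submit(scan, product)
    f_inventory = pool.submit(scan, inventory)
    joined = join(f_product.result(), f_inventory.result())

print(joined)  # → [('p1', 'widget', 10), ('p2', 'gadget', 3)]
```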


Figure 4. A multi-statement request. (Instead of accessing pricing, customer, and store data one statement at a time, a multi-statement request performs the individual SQL statements in parallel and returns all three answer sets.)

Parallel Use of I/O

In addition, the Advanced Analytics Engine supports synchronized scanning of large tables. This permits a new full-table scan to begin at the current point of an ongoing scan of the same large table in another session, thus reducing the I/O load and supporting higher concurrency.

Parallel-Aware Optimizer

Having an array of parallel techniques can turn into a disadvantage if they are not carefully applied around the needs of each request. Orchestration of the different parallelization techniques is driven by our optimizer, which takes on a number of tasks to ensure optimal use of resources. The optimizer lives within a component called the "Parsing Engine," or PE. The default configuration uses two PEs per node, each capable of handling query planning, coordination of execution, and the return of results for 120 sessions. A four-node system would thus have 8 PEs supporting 960 sessions, and a 24-node system would have 48 PEs supporting 5,760 sessions. This is both scalable and fault tolerant, eliminating single points of congestion or failure.

Join Planning

Joining tables in a linear fashion (join table1 to table2, then join their result to table3, and so on) can have a negative impact on query time. Instead, the Teradata optimizer accesses and joins multiple tables simultaneously and also leverages different types of joins (e.g., indexed access, table scan) to build a more intelligent query plan.

The Teradata optimizer seeks out tables within the query that have logical relationships between them, and also groups tables that can be accessed and joined independently from the other subsets of tables. Those are often candidates to execute within parallel steps. Figure 5 illustrates the differences when optimizing a six-table join between a plan that is restricted to linear joins and one that has the option of performing some of the joins in parallel.

Sizing up the Environment

In addition to the parallelism methods described above, the optimizer takes into account numerous other factors, including the profile of the data itself, the number of AMPs on each of the nodes, and the processing power of the underlying hardware. Putting all this information together, the optimizer comes up with a price, in terms of resources expected to be used, for each of several candidate query plans, then picks the least costly candidate. Considering many factors, including movement of data, the lowest-cost plan is the one that will take the least system resources to execute on the wide variety of platforms that we support.

Hiding Complexity

Unlike other solutions, Teradata's optimizer completely automates the complexity behind query planning. Users have complete freedom to submit everything from simple tactical queries to very complex ad hoc analytic queries, and the optimizer will ensure that all requests are executed in the most efficient manner. This allows customers to build complex data models with dozens of joins, which provides a richer dataset for analytics.

Evolution

Although the fundamentals have remained the same, the Advanced Analytics Engine has continued to evolve over time to meet customer needs. This includes everything from the ability to support tables with no primary index, used to stage data for in-database transformation or push-down processing by client tools, to new types of joins.


Figure 5. A bushy query plan vs. a serial plan. (A serial plan joins the six tables one at a time; a bushy plan performs several of the joins in parallel before combining the results.)

There are now nearly 20 join strategies that are chosen automatically by the optimizer. It will incrementally plan and execute when there is uncertainty about the size of an intermediate result set, and it will re-write queries to eliminate redundant logic. The goal is always the same: ensuring that our customers enjoy the lowest cost per query in the industry.

Being Parallel in the Ecosystem

In today's environment, data may reside in other file systems or data management systems. Files in cloud storage may be defined as foreign tables. The optimizer will assign the task of reading and interpreting CSV, Parquet, or JSON files to AMPs. As with everything else, the files making up a foreign table in cloud storage will be assigned across the AMPs to be read in parallel.

Teradata's QueryGrid can be used to access data in other data management systems. The Advanced Analytics Engine's optimizer can decide whether to select the raw data or push down some of the selection and aggregation processing to the other platform to reduce the size of the data to be retrieved. Meanwhile, the optimizer may have the AMPs performing other parts of the query processing until the data is retrieved from the other DBMS.

BYNET's Considerable Contribution

Another important component of the Teradata architecture is referred to as the BYNET. This acts as the interconnection between all of the independent parallel components (see Figure 6). Originally implemented within the hardware of our on-premises systems, this functionality is now implemented directly on cloud network facilities. Beyond just passing messages, the BYNET is a bundle of intelligence and low-level functions that aid efficient processing at practically every point in a query's life. It offers coordination as well as oversight and control to every optimized query step.

In short, the BYNET acts as a flight coordinator, ensuring that the entire system is working in concert and managing situations as they arise. This can include everything from ordering results from across parallel units to adjusting to hardware failures or monitoring for points of congestion.
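The bushy-plan idea from Figure 5 can be sketched as two independent joins executing concurrently (illustrative Python; the four toy tables and the simplified equi-join are invented for the example):

```python
from concurrent.futures import ThreadPoolExecutor

# A toy "bushy" plan over four invented tables: the two lower joins are
# independent, so they can run as parallel steps whose intermediate
# spools are then available for the rest of the plan.
store   = {10: "Berlin", 20: "Oslo"}       # store_id -> city
sales   = {10: 500, 20: 700}               # store_id -> revenue
product = {"p1": "kettle", "p2": "lamp"}   # product_id -> name
stock   = {"p1": 3, "p2": 8}               # product_id -> on_hand

def join(left, right):
    # Toy equi-join on the shared key of two key->value tables.
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}

with ThreadPoolExecutor() as pool:
    spool_1 = pool.submit(join, store, sales)    # store joined to sales
    spool_2 = pool.submit(join, product, stock)  # product joined to stock, concurrently
    left, right = spool_1.result(), spool_2.result()

print(left[10], right["p1"])  # → ('Berlin', 500) ('kettle', 3)
```

A serial plan would have to finish one of these joins before starting the other; the bushy shape lets both intermediate results be produced in one elapsed-time window.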


Figure 6. AMPs and PEs communicate using messages.

Messaging

A key role of the BYNET is to support communication between the PE and AMPs and from AMPs to other AMPs. These simple message-passing requirements are performed using a low-level messaging approach, bypassing more heavyweight protocols for communication. Examples include:

• Sending a step from the PE to the AMPs to initiate a query step
• Redistributing rows from one AMP to another to support different join geographies
• Sort/merging a final answer set from multiple AMPs

Even though message protocols are low-cost, the Advanced Analytics Engine goes further by minimizing interconnect traffic wherever possible. Same-AMP, localized activity is encouraged wherever possible. AMP-based ownership of data keeps activities such as locking and some of the simple data processing local to the AMP. Hash partitioning that supports co-location of to-be-joined rows reduces data transport prior to a join. All aggregations are ground down to the smallest possible set of sub-totals at the local (AMP) level first, before being brought together globally via messaging.

BYNET Groups

Without the BYNET's ability to combine and consolidate information from across all units of parallelism, each AMP would have to independently talk to each other AMP in the system about each query step that is underway. As the configuration grew, such a distributed approach to coordinating query work would quickly become a bottleneck.

Instead, BYNET groups create a dynamic relationship between the AMPs that are working on a specific step, which keeps the number of AMPs that must exchange messages down to the bare minimum. As a step begins to execute, one or more channels are established that loosely associate all AMPs in the dynamic BYNET group that is executing the step. The channels use monitoring and signaling semaphores to communicate things like the completion or the success/failure of each participating AMP. If tight coordination did not exist among AMPs in the same BYNET group, the problem-free AMPs would continue to work on a doomed query step, eating up resources in unproductive ways (Figure 7). In general, the only message that is sent back to the PE is the final completion message, whether the dynamic BYNET group is composed of three or 3,000 AMPs.

Final Answer Set Sort/Merge

Never needing to materialize a query's final answer set inside the database has long been a Teradata differentiator. The final sort/merge of a query takes place within the BYNET as the answer set rows are funneled up to the client as needed. This happens at the AMP, node, and finally PE level, with only the highest values being processed until the client needs more. The final answer set never has to be brought together, saving considerable resources. A potential "big sort" penalty has been eliminated—or actually, never existed.

It is notable that another side effect of this extremely efficient coordination of AMPs is the ability to offer significantly faster performance for tactical queries than other vendors.

Figure 7. A completion semaphore. (The BYNET begins Step 1 across 3 AMPs and establishes a completion semaphore with a count of 3; as each AMP finishes its work, the count decrements, and when it reaches 0 the semaphore is disbanded and a message is sent to the dispatcher for the next step.)
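The counting behavior in Figure 7 can be sketched as follows (illustrative Python; the lock/event structure is an invented stand-in for the BYNET's semaphores):

```python
import threading

# Sketch of a completion semaphore for one step across 3 AMPs: the
# count starts at the number of participants, and the dispatcher is
# signalled only once, when the count reaches zero.
NUM_AMPS = 3
completions = []

count_lock = threading.Lock()
remaining = [NUM_AMPS]          # the "semaphore count"
dispatcher_signal = threading.Event()

def amp_finishes_step(amp_id):
    with count_lock:
        remaining[0] -= 1
        completions.append(amp_id)
        if remaining[0] == 0:
            # Single message back for the next step, no matter
            # how many AMPs participated.
            dispatcher_signal.set()

threads = [threading.Thread(target=amp_finishes_step, args=(i,))
           for i in range(NUM_AMPS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

dispatcher_signal.wait()
print(len(completions), remaining[0])  # → 3 0
```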

A Flexible, Fast Way to Find and Store Data

Another very important factor behind Teradata's enduring performance is how space is managed, which is done by a sub-system simply referred to as the "file system." The file system is responsible for the logical organization and management of the rows, along with their reliable storage and retrieval.

The file system in Teradata was architected to be extremely adaptable: simple on the outside but surprisingly inventive on the inside. It was designed from day one to be fluid and open to change. The file system's built-in flexibility is achieved by means of:

• Logical addressing, which allows blocks of data to be dynamically shifted to different physical locations when needed, with minimal impact to active work.

• The ability for data blocks to expand and contract on demand, as a table matures.

• An array of unobtrusive background tasks that do continuous space adjustments and clean-up.

Teradata was architected so that no space is allocated or set aside for a table until it is needed. Rows are stored in variable-length data blocks that are only as big as they need to be. These data blocks can dynamically change size and can be moved to different locations on the cylinder, or even to a different cylinder, without manual intervention or end-user knowledge. With the development of Teradata Virtual Storage (TVS), the database will assess the frequency of access of data and can move it between storage media of different speeds to optimize response time for the end user.

This section takes a close look at how the file system frees the administrator from mundane data placement tasks while providing an environment that is friendly to change.

How Data is Organized

For data stored inside the database, Teradata permanently assigns data rows to AMPs using a simple scheme that lends itself to an even distribution of data: hash partitioning (Figure 8). In addition to being a distribution technique, this hash approach to data placement serves as an indexing strategy.
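The hash placement scheme might be sketched as follows (illustrative Python; the bucket count, AMP count, and use of MD5 are invented stand-ins for Teradata's actual hashing):

```python
import hashlib

# Illustrative hash placement: a primary-index value is hashed into a
# bucket, and the bucket maps to an owning AMP.
NUM_BUCKETS = 1024
NUM_AMPS = 16

# A static "hash map": each bucket is permanently assigned to one AMP.
bucket_to_amp = {b: b % NUM_AMPS for b in range(NUM_BUCKETS)}

def owning_amp(primary_index_value):
    # Hash the primary-index value to a bucket, then look up the AMP.
    digest = hashlib.md5(repr(primary_index_value).encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % NUM_BUCKETS
    return bucket_to_amp[bucket]

# The same value always hashes to the same AMP, so placement doubles
# as an index: no separate structure is searched to locate the owner.
print(owning_amp(12345) == owning_amp(12345))  # → True
```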


This is incredibly important, especially for the tactical queries that are often leveraged by business applications.

Figure 8. A row's primary index hash bucket points to the AMP that owns it. (A customer row is inserted; the hashing algorithm produces a hash bucket and a hash-ID, and the hash bucket points to one AMP among all the AMPs on all the nodes.)

To retrieve a row, the primary index data value is passed to the hashing algorithm, which generates the two hash outputs: 1) the hash bucket, which points to the AMP; and 2) the hash-ID, which helps to locate the row within the file system structure on that AMP. There is no space or processing overhead involved in either building a primary index or accessing a row through its primary index value, as no special index structure needs to be built.

Hashed data placement is very easy to use and requires no setup. The only effort a DBA makes is the selection of the columns that will comprise the primary index of the table, such as customer number, order number, or product key. From that point on, the process is completely automated. No files need to be allocated, sized, monitored, or named. No DDL needs to be created beyond specifying the primary index in the original CREATE TABLE statement. No unload-reload activity is ever required.

Once the owning AMP is identified by means of the hash bucket, the hash-ID is used to look up the physical location of the row on disk. Which virtual cylinder and sector holds the row is determined by means of a tree-like, three-level indexing structure (as shown in Figure 9). It is enough to say here that the data is automatically and dynamically indexed down to the exact data block for exceptional retrieval speed.

Figure 9. A three-level indexing structure identifies a row's location on an AMP. (One master index per AMP holds a sorted list of cylinder indexes; each cylinder index, of which there are many per AMP, holds a sorted list of data blocks; each data block holds rows sorted by Row-ID.)

Easy Accommodation of Data Growth

The Advanced Analytics Engine is built using a logical addressing model as a low-impact way to adjust to data growth. Data for each table in a Teradata system is stored in flexibly-sized data blocks that are assigned to logical cylinders. The block assignment of a row is based on its hash value. If a block grows beyond a DBA-specified maximum size, it is automatically split to make room for more rows and the cylinder index is updated. If a logical cylinder gets full, blocks can be moved to a different logical cylinder and the cylinder indexes are updated. On retrieving a row, the hash of the primary index identifies the AMP, the AMP's master index of cylinders points to the cylinder, and the cylinder index points to the block to be read. Figure 10 illustrates this behavior visually.

This adaptable behavior delivers numerous benefits. Random growth is accommodated at the time it happens. Rows can easily be moved from one location to another without affecting in-flight work or any other data objects that reference that row. There is never a need to stop activity and re-organize the physical data blocks or adjust pointers.
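The block-split behavior described above can be sketched in miniature (illustrative Python; the four-row block limit and bare integer row-IDs are invented for the example):

```python
# Sketch of a data block splitting when it exceeds a maximum size.
MAX_ROWS_PER_BLOCK = 4

def insert_row(blocks, row_id):
    # Simplification: append to the last block and keep rows sorted by
    # row-ID, splitting the block when it outgrows the limit (in the
    # real engine the owning block is chosen by hash and the cylinder
    # index is updated on a split).
    blocks[-1].append(row_id)
    blocks[-1].sort()
    if len(blocks[-1]) > MAX_ROWS_PER_BLOCK:
        full = blocks.pop()
        mid = len(full) // 2
        blocks.extend([full[:mid], full[mid:]])
    return blocks

blocks = [[]]
for row_id in [5, 1, 9, 3, 7]:
    blocks = insert_row(blocks, row_id)

print(blocks)  # → [[1, 3], [5, 7, 9]]
```

Growth is absorbed at the moment it happens; nothing outside the split block needs to be reorganized.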


Figure 10. A new row is inserted into an existing data block. (If there is space on the cylinder, the data block simply expands; if there is no free space, the block splits across two cylinders and the cylinder indexes are updated.)

This flexibility to consolidate or expand data blocks at any time allows the Advanced Analytics Engine to do many space-related housekeeping tasks in the background and avoid the table unloads and reloads common to fixed-size page databases. This advantage increases database availability and translates to fewer maintenance tasks for the DBA.

Multi-level Row Partitioning

Added to this storage architecture is the ability to partition the table by one or more columns, making it faster to access data without the need for full-table scans or the costly maintenance of secondary indexes. For example, a transaction table might be partitioned on transaction date, week, or month. If a query constrains on a period of time for those transactions, the optimizer will figure out which partitions need to be read, whether the table was partitioned on day, week, month, or other time-period ranges. You could also add additional partitioning columns like country, district, or brand. A query with a constraint on either partitioning column, or both, will reduce the amount of data to be read to satisfy the query. The hashed cylinder and row access is accomplished within the defined partitions.

Column Partitioning

Tables can also be stored with columns in separate partitions. This has the advantage of focusing I/O on just the columns of data needed in a query instead of the entire row. It also supports vertical compression techniques, where a value is stored once for use in consecutive rows. Column partitioning can be combined with row partitioning to further reduce the amount of I/O needed to satisfy a query.

Indexes

The primary index for a table takes no space, and by calculating the hash value of a constraint on that column, its row can usually be retrieved in a single I/O. Partitioning also requires no space and allows for a significant reduction in I/O and improvement in response time. The Advanced Analytics Engine also supports traditional secondary indexes. These are valuable when a frequently used, high-cardinality column exists, such as customer number on an Orders table whose primary index is Order_ID.

Also supported are Join Indexes, which are transparent to users and their BI tools but are leveraged by the optimizer to eliminate join and aggregation processing. As the base tables are maintained, these join indexes are automatically maintained. If one join index is a more aggressive aggregation of another, then after the base table is updated, the lower-level aggregation is re-calculated and those values are aggregated to maintain the more aggressive aggregation. If analysis of usage in the query logging indicates that a join index is not being used, it can be dropped with no impact to the syntax of users' queries.

Work Flow Self-Regulation

A shared-nothing parallel database has a special challenge when it comes to knowing how much new work it can accept, and how to identify congestion that is starting to build up inside one or more of the parallel units. With the optimizer attempting to apply multiple dimensions of parallelism to each query that it sees, it is easy to reach very high resource utilization within a Teradata system, even with just a handful of active queries.
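Row-partition elimination, as described above, can be sketched as follows (illustrative Python; the monthly partitions and rows are invented):

```python
# Sketch of partition elimination: a date constraint lets the planner
# read only the partitions that can contain qualifying rows.
partitions = {
    "2024-01": [("t1", "2024-01-15"), ("t2", "2024-01-20")],
    "2024-02": [("t3", "2024-02-02")],
    "2024-03": [("t4", "2024-03-09"), ("t5", "2024-03-30")],
}

def query(partitions, month_lo, month_hi):
    # Only partitions overlapping the constraint are scanned at all;
    # the others are never touched.
    chosen = [m for m in partitions if month_lo <= m <= month_hi]
    rows = [row for m in chosen for row in partitions[m]]
    return chosen, rows

chosen, rows = query(partitions, "2024-02", "2024-03")
print(chosen)      # → ['2024-02', '2024-03']
print(len(rows))   # → 3
```

With a second partitioning level (say, country), the same elimination applies along each level independently, shrinking the scanned set further.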


Designed for stress, the Advanced Analytics Engine mindfulness is the cornerstone of the database’s ability
is able to function with large numbers of users, a very to accept impromptu swings of very high and very low
diverse mix of work, and a fully-loaded system. Being able to keep on functioning full throttle under conditions of extreme stress relies on internal techniques that were built inside the database to automatically and transparently manage the flow of work, while the system stays up and productive.

Even though the data placement conventions in use with the Advanced Analytics Engine lend themselves to even placement of the data across AMPs, the data is not always accessed by queries in a perfectly even way. During the execution of a multi-step query, there will be occasions when some AMPs require more resources for certain steps than do other AMPs. For example, if a query from an airline company site is executing a join based on airport codes, you can expect whichever AMP is performing the join for rows with Atlanta (ATL) to need more resources than does the AMP that is joining rows with Anchorage (ANC). Some of this uneven processing demand has been reduced by the optimizer splitting the data into separate spool files and applying different join strategies for the busy airports and the less busy ones. However, some unevenness of processing demands will remain.

AMP-Level Control

The Advanced Analytics Engine manages the flow of work that enters the system in a highly-decentralized manner, in keeping with its shared-nothing architecture. There is no centralized coordinator to become a bottleneck. There is no message-passing between AMPs to determine if it's time to hold back new requests. Rather, each AMP evaluates its own ability to take on more work, and temporarily pushes back when it experiences a heavier load than it can efficiently process. And when an AMP does have to push back, it does that for the briefest moments of time, often measured in milliseconds.

This bottom-up control over the flow of work was fundamental to the original architecture of the database as designed. All-AMP step messages come down to the AMPs, and each AMP will decide whether to begin working on it, put it on hold, or ignore it. This AMP-level autonomy allows the system to adapt to changes in demand, and gracefully and unobtrusively manage whatever comes its way.

AMP Worker Tasks

AWTs are the tasks inside of each AMP that get the database work done. This database work may be initiated by the internal database software routines, such as dead-lock detection or other background tasks. Or the work may originate from a user-submitted query. These pre-allocated AWTs are assigned to each AMP at startup and, like taxi cabs queued up for fares at the airport, they wait for work to arrive, do the work, and come back for more work.

Because of their stateless condition, AWTs respond quickly to a variety of database execution needs. There is a fixed number of AWTs on each AMP. For a task to start running it must acquire an available AWT. Having an upper limit on the number of AWTs per AMP keeps the number of activities performing database work within each AMP at a reasonable level. AWTs play the role of both expeditor and governor.

As part of the optimization process, a query is broken into one or many AMP execution steps. An AMP step may be simple, such as read one row using a unique primary index or apply a table level lock. Or an AMP step may be a very large block of work, such as scanning a table, applying selection criteria on the rows read, redistributing the rows that are selected, and sorting the redistributed rows.

The Message Queue

When all AMP worker tasks on an AMP are busy servicing other query steps, arriving work messages are placed in a message queue that resides in the AMP's memory. This is a holding area until an AWT frees up and can service the message. This queue is sequenced first by message work type, which is a category indicating the importance of the work message. Within work type the queue is sequenced by the priority of the request the message is coming from.
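Taken together, the AWT pool and the message queue behave like a fixed set of pre-allocated workers fed by a queue ordered first by work type and then by request priority. The sketch below is a simplified illustration of that idea in Python; the class, method names, and numbers are invented for illustration and are not Teradata internals.

```python
import heapq
import itertools

class AMP:
    """Illustrative model of an AMP: a fixed pool of AMP Worker Tasks (AWTs)
    plus a message queue ordered by work type, then request priority."""

    def __init__(self, num_awts=80):          # 80 is the default AWT count per AMP
        self.free_awts = num_awts             # stateless workers waiting for work
        self.queue = []                       # held messages when no AWT is free
        self._seq = itertools.count()         # tie-breaker preserves arrival order

    def receive(self, work_type, priority, step):
        """Give the step an AWT immediately if one is free; otherwise queue it.
        A lower work_type number models a more important category of work."""
        if self.free_awts > 0:
            self.free_awts -= 1
            return f"running {step}"
        heapq.heappush(self.queue, (work_type, priority, next(self._seq), step))
        return f"queued {step}"

    def awt_finished(self):
        """An AWT completes its step and, like a taxi returning for fares,
        immediately picks up the most important waiting message."""
        if self.queue:
            work_type, priority, _, step = heapq.heappop(self.queue)
            return f"running {step}"
        self.free_awts += 1
        return None
```

With two AWTs in the sketch, a third arriving step is queued rather than started, and is dispatched as soon as a worker finishes, mirroring the taxi-queue behavior described above.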

11 TERADATA.COM
WHITE PAPER    BORN TO BE PARALLEL, AND BEYOND

Messages representing a new query step are broadcast to all participating AMPs by the PE. In such a case, some AMPs may provide an AWT immediately, while other AMPs may have to queue the message. Some AMPs may dequeue their message and start working on the step sooner than others. This is typical behavior on a busy system where each AMP is managing its own flow of work.

Once a message has either acquired an AWT or been accepted onto the message queue across each AMP in the dynamic BYNET group, then it is assumed that each AMP will eventually process it, even if some AMPs take longer than others. The sync point for the parallel processing of each step is at step completion, when each AMP signals across the completion semaphore that it has completed its part. The BYNET channels set up for this purpose are discussed more fully in the BYNET section of this paper.

Turning Away New Messages

Each AMP has flow control gates that monitor and manage messages arriving from senders. There are separate flow control gates for each different message work type.7 New work messages will have their own flow control gates, as will spawned work messages. The flow control gates keep a count of the active AWTs of that work type as well as how many messages are queued up waiting for an AWT.

Once the queue of messages of a certain work type grows to a specified length, new messages of that type are no longer accepted and that AMP is said to be in a state of flow control, as shown in Figure 15. The flow control gate will temporarily close, pulling in the welcome mat, and arriving messages will be returned to the sender. The sender, often the PE, continues to retry the message, until that message can be received on that AMP's message queue.

Because the acceptance and rejection of work messages happens at the lowest level, in the AMP, there are no layers to go through when the AMP can get back to normal message delivery and processing. The impact of turning on and turning off the flow of messages is kept local—only the AMP hit by an over-abundance of messages at that point in time throttles back temporarily.

Riding the Wave of Full Usage

Teradata was designed as a throughput engine, able to exploit parallelism to maximize resource usage of each request when only a few queries are active, while at the same time able to continue churning out answer sets in high demand situations. To protect overall system health under extreme usage conditions, highly-decentralized internal controls were put into the foundation, as discussed in this section.

[Figure: two flow control gates on an AMP. The gate for broadcast spawned messages is open, with 3 spawned messages queued; the gate for broadcast new messages is closed, with 20 new messages queued, so newly arriving messages are rejected and retried later.]

Figure 15. Flow control gates close when a threshold of messages is reached.
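The gate behavior pictured above, a per-work-type queue that closes at a threshold and turns arriving messages back to the sender for retry, can be sketched as follows. All names and the threshold value are illustrative assumptions, not Teradata internals.

```python
import collections

class FlowControlGate:
    """Illustrative per-work-type flow control gate on an AMP: once the queue
    of waiting messages reaches a threshold, the gate closes and arriving
    messages are returned to the sender, which must retry later."""

    def __init__(self, threshold=20):          # threshold value is illustrative
        self.threshold = threshold
        self.waiting = collections.deque()

    def in_flow_control(self):
        return len(self.waiting) >= self.threshold

    def offer(self, message):
        """Accept the message, or reject it while the gate is closed."""
        if self.in_flow_control():
            return "rejected"                  # sender (often the PE) retries
        self.waiting.append(message)
        return "accepted"

    def drain_one(self):
        """An AWT frees up and services the oldest waiting message,
        which may reopen the gate."""
        return self.waiting.popleft() if self.waiting else None
```

Note that the gate reopens as soon as the AMP drains its queue below the threshold; the sender's retried message is then accepted with no other layer of the system involved, which is the locality the text describes.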


The original architecture related to flow control and AMP worker tasks has needed very little improvement or even tweaking over the years. 80 AWTs per AMP is still the default setting for new Teradata systems. The number can be increased for more powerful platforms that aren't achieving full utilization, or platforms with a large number of active queries with diverse response time expectations. Message work types, the work message queue, and retry logic all work the same as they always did.

There have been a few extensions in regard to AMP worker tasks that have emerged over time, including:

• Setting up reserve pools of AWTs exclusively for use by tactical queries, protecting high priority work from being impacted when there is a shortage of AWTs.

• Automatic reserve pools of AWTs just for load utilities that become available when the number of AWTs per AMP is increased to a very high level, intended to reduce resource contention between queries and load jobs for enterprise platforms with especially high concurrency.

Workload Management

The second section in this whitepaper called attention to the multifaceted parallelism available for queries on the Advanced Analytics Engine. The subsequent section discussed how the optimizer uses those parallel opportunities in smart ways to improve performance on a query-by-query basis. And the previous section illustrated internal AMP-level controls to keep high levels of user demand and an over-abundance of parallelism from bringing the system to its knees.

In addition to those automatic controls at the AMP level, Teradata has always had some type of system-level workload management, mainly priority differences, that are used by the internal database routines.

The Original Four Priorities

One of the challenges faced by the original architects of the Teradata Database was how to support maximum levels of resource usage on the platform, and still get critical pieces of internal database code to run quickly when they needed to. For example, if there is a rollback taking place due to an aborted transaction, it benefits the entire system if the reversal of updates to clean up the failure can be executed quickly.

It was also important to ensure that background tasks running inside the database didn't lag too far behind. If city streets are so congested with automobile traffic that the weekly garbage truck can't get through and is delayed for weeks at a time, a health crisis could arise.

The solution the original architects found was a simple priority scheme that applied priorities to all tasks running on the system. This rudimentary approach offered four priority buckets, each with a greater weight than the one that came before: L for Low, M for Medium, H for High and R for Rush. The default priority was medium, and indeed most work ran at medium, and was considered equally important to other medium priority work that was active.

However, database routines and even small pieces of code could assign themselves one of the other three priorities, based on the importance of the work. Developers, for example, decided to give all END TRANSACTION activity the rush priority, because finishing almost-completed work at top speed frees up valuable resources sooner, and was seen as critical within the database. In addition, if the administrator wanted to give a favored user a higher priority, all that was involved was manually adding one of the priority identifiers into the user's account string.

Background tasks discussed in the section about space management were designed to use priorities as well. Some of these tasks, like the task that deletes transient journal rows that are no longer needed, were designed to start out at the low priority, but increase their priority over time if the system was so busy that they were not able to get their work accomplished. This approach kept such tasks in the background most of the time, except when their need to complete became critical.
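The four-bucket scheme and the priority escalation used by background tasks can be illustrated with a short sketch. The numeric weights and the one-bucket-per-interval aging rule below are invented for illustration; the paper states only that each bucket outweighs the one before it and that some background tasks raise their own priority when starved.

```python
# Illustrative sketch of the original four-bucket priority scheme.
# The relative weights are invented; the paper says only that each
# bucket carries a greater weight than the one before it.
WEIGHTS = {"L": 1, "M": 2, "H": 4, "R": 8}
ORDER = ["L", "M", "H", "R"]

class BackgroundTask:
    """A background task (e.g., transient journal cleanup) that starts at
    Low priority but escalates if it cannot get its work done."""

    def __init__(self, name, priority="L"):
        self.name = name
        self.priority = priority

    def starved_for_one_interval(self):
        """Bump the task to the next bucket when it has been unable to run."""
        idx = ORDER.index(self.priority)
        if idx < len(ORDER) - 1:
            self.priority = ORDER[idx + 1]

def share_of_resources(tasks):
    """Each task's share of resources under simple weighted scheduling."""
    total = sum(WEIGHTS[t.priority] for t in tasks)
    return {t.name: WEIGHTS[t.priority] / total for t in tasks}
```

In this sketch, a cleanup task left starved long enough climbs from Low all the way to Rush and then stays there, which mirrors how such tasks remain in the background most of the time but become urgent when they must complete.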


Impact of Mixed Workloads

The simple approach to priorities was all the internal database tasks required. And early users of the database were satisfied running all their queries at the default medium priority. But requirements shifted over time as Teradata users began to supplement their traditional decision support queries with new types of more varied workloads.

In the late 1990's, a few Teradata sites began to issue direct look-up queries against entities like their Inventory tables or their customer databases, at the same time as their standard decision support queries were running. Call centers started using data in their Teradata Database to validate customer accounts and recent interactions. Tactical queries and online applications blossomed, at the same time as more sites turned to continuous loading to supplement their batch windows, giving their end users more timely access to recent activity. Service level goals reared their head. Stronger, more flexible workload management was required. Today it is typical for 90% of the queries to execute in less than one second.

Evolution of Workload Management

While the internal management of the flow of work has changed little, the capabilities within system-level workload management have expanded dramatically over the years. As the first step beyond the original four priorities, Teradata engineering developed a more extensive priority scheduler composed of multiple resource partitions and performance groups, and the flexibility of assigning your own customized weighting values. These custom weightings and additional enhancements make it easier to match controls to business workloads and priorities than the original capabilities, which were designed more for controlling internal system work.

Additional workload management features and options that have evolved over the years include:

• Ability to define workloads by username, client logon ID, profile, the application they are using, the database objects they are referencing, or the optimizer's assessment of the query characteristics.

• Concurrency control mechanisms, called throttles, that can be placed at multiple levels and tailored to specific types of queries or users.

• An improved and more effective priority scheduler to accompany the Linux SLES 11 operating system that can protect short, critical work more effectively from more resource-intensive lower-priority jobs.

• Rules to reject queries that are poorly written or that are inappropriate to run at certain times of the day.

• Ability to automatically change workload settings by time of day or system conditions.

• Ability to automatically reduce the priority of a running query which exceeds the threshold of resources consumed for its current priority.

• Ability to give a percentage of resources to a workload, either as a maximum percentage or an "at least" percentage.

• A user-friendly front-end GUI called Viewpoint Workload Designer that supports ease of setup and tuning.

Workload management in Teradata has proven to be a rapidly expanding area, indispensable to customers that are running a wide variety of work on their Teradata platform. While internal background tasks and subsets of the database code continue to run at the four different priority levels initially defined for them, many Teradata sites have discovered that their end users' experiences are better and they can get more work through the system when taking advantage of the wider workload management choices available today. And many do just that.

Managing Workload Management

To know whether the system is meeting required performance or is being impacted by new, unplanned, or poorly constructed workloads, it is critical to have logging of system activity. The query logging in 18 tables and 993 columns records everything about query execution including use of system resources, SQL, steps, objects, and a textual description of the query execution plan. The Resource Usage logging in 12 tables and 1878 columns records everything happening at the system level including node, AMP, AWT, and device.


The logging levels are optional and may be combined with the Performance Data Capture Routines (PDCR) for historical analysis and capacity planning. No other DBMS matches the logging maturity of the Vantage Advanced Analytics Engine.

Conclusion

Foundations are important. Teradata's ability to grow in new directions and continue to sustain its core competencies is a direct result of its strong, tried-and-true foundation. As our engine has matured, the same fundamentals have been adapted to new technology advances. For example, in initial releases, the AMP was a physical computer which owned its own disk drives and directly managed how data was located on its disks. Today an AMP is a software virtual processor that co-exists with other such virtual processors on the same node, all of which share the node resources. Yet each AMP maintains its shared-nothing characteristics, same as in the first release.

The natural evolution towards the virtualization of key database functionality is significant because it broadens the usefulness of the Advanced Analytics Engine. For much of its history, Teradata database software has run on purpose-built hardware, where the underlying platform has been optimized to support high throughput, critical SLAs, and solid reliability. While those benefits remain well-suited for enterprise platforms, this virtualization opens the door for the Advanced Analytics Engine to participate in more portable, less demanding solutions. Public or private cloud architectures, as well as as-a-service offerings, can now enjoy the core Advanced Analytics Engine capabilities as described in this white paper.

This white paper attempts to familiarize you with a few of the features that make up important building blocks of the Advanced Analytics Engine, so you can see for yourself the elegance and the durability of the architecture. This paper points out recent enhancements that have grown out of this original foundation, building on it rather than replacing it.

These foundational components have such a widespread consequence that they simply cannot be tacked on as an afterthought. The database must be born with them.

About Teradata

Teradata is the connected multi-cloud data platform company. Our enterprise analytics solve business challenges from start to scale. Only Teradata gives you the flexibility to handle the massive and mixed data workloads of the future, today. Learn more at Teradata.com.

17095 Via Del Campo, San Diego, CA 92127     Teradata.com

The Teradata logo is a trademark, and Teradata is a registered trademark of Teradata Corporation and/or its affiliates in the U.S. and worldwide. Teradata continually
improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features,
functions and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.

© 2022 Teradata Corporation    All Rights Reserved.    Produced in U.S.A.    09.22
