Born to Be Parallel, and Beyond
Within-a-Step Parallelism

Within-a-step parallelism is when the optimizer carefully splits a request into a small number of high-level database operations and dispatches these distinct operations for execution in a process called pipelining. Here each operation can continue on without waiting for the completion of the full results from the first operation. The relational-operator mix of a step is carefully chosen by the Teradata optimizer to avoid stalls within the pipeline (see Figure 2).

Figure 2. Three dimensions of parallelism. Within-a-step parallelism: multiple operations are pipelined (1. scan Product; 2. scan Inventory; 3. join Product and Inventory; 4. redistribute joined rows). Multi-step parallelism: independent steps such as 1.1 and 1.2, and 2.1 and 2.2, execute simultaneously. Query execution parallelism: four AMPs perform each step on their own data blocks at the same time.
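To make the pipelining idea concrete, here is a minimal sketch using Python generators (illustrative only, not Teradata internals; the tables, rows, and operator implementations are invented). Each operation consumes rows from the one before it as they are produced, so the join and the redistribution never wait for a full intermediate result.

```python
# Illustrative sketch of within-a-step pipelining (not Teradata internals):
# scan, join, and redistribute are generators, so each joined row flows to
# redistribution as soon as it is produced.

def scan(table):
    """Stream rows from a table, one at a time."""
    for row in table:
        yield row

def join(left_rows, right_table, key):
    """Emit each joined row as soon as a match is found."""
    lookup = {}
    for r in right_table:                      # build a lookup on one input
        lookup.setdefault(r[key], []).append(r)
    for l in left_rows:                        # probe with the streamed input
        for r in lookup.get(l[key], []):
            yield {**l, **r}

def redistribute(rows, num_amps, key):
    """Hash each joined row to a target AMP as it arrives."""
    for row in rows:
        yield hash(row[key]) % num_amps, row

# Invented sample data standing in for the Product and Inventory tables.
product   = [{"product_id": 1, "name": "widget"}, {"product_id": 2, "name": "gadget"}]
inventory = [{"product_id": 1, "qty": 40}, {"product_id": 2, "qty": 7}]

pipeline = redistribute(join(scan(product), inventory, "product_id"),
                        num_amps=4, key="product_id")
for amp, row in pipeline:
    print(f"row sent to AMP {amp}: {row}")
```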
There are now nearly 20 join strategies that are chosen automatically by the optimizer. It will incrementally plan and execute when there is uncertainty about the size of an intermediate result set, and it will re-write queries to eliminate redundant logic. The goal is always the same: ensuring that our customers enjoy the lowest cost per query in the industry.
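As a toy illustration of why strategy choice depends on input sizes, and why incremental planning pays off when an estimate is uncertain, consider the sketch below. The thresholds and strategy names are simplified stand-ins, not the optimizer's actual rules.

```python
# Toy cost-based join strategy choice (not Teradata's optimizer logic).
# With ~20 real strategies the principle is the same: the cheapest plan
# depends on input sizes, so when an intermediate result's size is
# uncertain, it pays to execute the prior step and plan with actual sizes.

def choose_join(left_rows, right_rows):
    if right_rows < 1_000:
        return "duplicate the small table to all AMPs"
    if left_rows < 0.01 * right_rows:
        return "redistribute the smaller input and hash join"
    return "sort both inputs and merge join"

estimate, actual = 100_000_000, 400       # a poor estimate of a spool's size
print("plan from the estimate:", choose_join(estimate, 5_000_000))
print("plan after executing the prior step:", choose_join(actual, 5_000_000))
```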
Being Parallel in the Ecosystem

In today's environment, data may reside in other file systems or data management systems. Files in cloud storage may be defined as foreign tables. The optimizer will assign the task of reading and interpreting CSV, Parquet, or JSON files to AMPs. As with everything else, the files making up a foreign table in cloud storage will be assigned across the AMPs to be read in parallel (a short sketch appears below). Data held in another data management system can be retrieved with the same parallel spirit, with the AMPs performing other parts of the query processing until the data is retrieved from the other DBMS.

BYNET's Considerable Contribution

Another important component of the Teradata architecture is referred to as the BYNET. It acts as the interconnection between all of the independent parallel components (see Figure 6). Originally implemented within the hardware of our on-premises systems, this functionality is now implemented directly on top of the cloud network facilities. Beyond just passing messages, the BYNET is a bundle of intelligence and low-level functions that aid in efficient processing at practically every point in a query's life. It offers coordination as well as oversight and control to every optimized query step.
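Here is the promised sketch of parallel foreign-table reading. The file names and the round-robin assignment scheme are invented for illustration; Teradata's actual assignment logic is internal.

```python
# Conceptual sketch: the files that make up a foreign table in cloud
# storage are divided among the AMPs so they can be read and interpreted
# in parallel. File list and assignment scheme are invented.

NUM_AMPS = 4
files = [f"s3://bucket/sales/part-{i:04}.parquet" for i in range(10)]

# Round-robin assignment of foreign-table files to AMPs.
assignments = {amp: files[amp::NUM_AMPS] for amp in range(NUM_AMPS)}

for amp, amp_files in assignments.items():
    print(f"AMP {amp} reads {len(amp_files)} files: {amp_files}")
```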
BYNET Groups

Without the BYNET's ability to combine and consolidate information from across all units of parallelism, each AMP would have to independently talk to each other AMP in the system about each query step that is underway. As the configuration grew, such a distributed approach to coordinating query work would quickly become a bottleneck.
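A back-of-the-envelope comparison shows why: pairwise coordination grows quadratically with the number of AMPs, while consolidation through the BYNET grows linearly.

```python
# Back-of-the-envelope arithmetic: pairwise AMP-to-AMP coordination grows
# quadratically with the AMP count, while consolidating through the BYNET
# grows linearly, so the gap widens as the configuration scales.

for n_amps in (10, 100, 1000):
    pairwise = n_amps * (n_amps - 1)   # every AMP informs every other AMP
    consolidated = n_amps              # every AMP reports once via the BYNET
    print(f"{n_amps:>5} AMPs: {pairwise:>9,} pairwise messages vs {consolidated:,} consolidated")
```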
[Figure: the software BYNET begins Step 1 across three AMPs. Each AMP works on the step against its own data and signals Done on a completion semaphore; once all three AMPs are done, a message goes to the dispatcher for the next step.]
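The coordination pattern in the figure can be sketched in a few lines of Python. This is a conceptual analogy, not the BYNET protocol: every AMP signals a completion semaphore, and the dispatcher releases the next step only after all AMPs have reported in.

```python
# Conceptual sketch of step-completion coordination (an analogy, not the
# actual BYNET protocol): the dispatcher sends a step to every AMP, each
# AMP signals a completion semaphore when its part is done, and the next
# step is dispatched only after all AMPs have reported in.

import threading

NUM_AMPS = 3
step_done = threading.Semaphore(0)      # the completion semaphore

def amp_worker(amp_id, step):
    # Each AMP performs the step on its own rows, then signals "done."
    print(f"AMP {amp_id}: finished {step}")
    step_done.release()

def run_step(step):
    for amp_id in range(NUM_AMPS):
        threading.Thread(target=amp_worker, args=(amp_id, step)).start()
    for _ in range(NUM_AMPS):           # dispatcher waits for every AMP
        step_done.acquire()
    print(f"{step} complete on all AMPs; dispatcher releases the next step")

run_step("Step 1")
run_step("Step 2")
```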
A Flexible, Fast Way to Find and Store Data

Another very important factor behind the enduring Teradata performance is how space is managed, which is done by a sub-system that is simply referred to as the "file system." The file system is responsible for the logical organization and management of the rows, along with their reliable storage and retrieval.

The file system in Teradata was architected to be extremely adaptable: simple on the outside but surprisingly inventive on the inside. It was designed from Day One to be fluid and open to change, and its built-in flexibility shows up in several ways.

Teradata was architected in such a way that no space is allocated or set aside for a table until such time as it is needed. Rows are stored in variable-length data blocks that are only as big as they need to be. These data blocks can dynamically change size and can be moved to different locations on the cylinder, or even to a different cylinder, without manual intervention or end-user knowledge. With the development of Teradata Virtual Storage (TVS), the database will assess the frequency of access of data and can move it between different-speed storage media to optimize response time for the end user.

This section takes a close look at how the file system frees the administrator from mundane data placement tasks, and at the same time provides an environment that is friendly to change.
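As a rough sketch of the TVS idea (the threshold, storage tiers, and access counts below are invented for illustration):

```python
# Rough sketch of temperature-based placement in the spirit of Teradata
# Virtual Storage: frequently accessed data migrates to faster media.
# The threshold, tiers, and access counts are invented.

HOT_THRESHOLD = 100          # accesses per measurement interval (illustrative)

access_counts = {"cylinder-1": 512, "cylinder-2": 3, "cylinder-3": 147}

def choose_tier(count):
    return "fast storage (e.g., SSD)" if count >= HOT_THRESHOLD else "slower storage"

for cylinder, count in access_counts.items():
    print(f"{cylinder}: {count} accesses -> {choose_tier(count)}")
```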
When a table is created, the designer simply chooses one or more of the columns that will comprise the primary index of the table, such as customer number, order number, or product key. From that point on, the process is completely automated. No files need to be allocated, sized, or monitored.

To retrieve a row, the primary index data value is passed to the hashing algorithm, which generates the two hash outputs: 1) the hash bucket, which points to the AMP; and 2) the hash-ID, which helps to locate the row within the file system structure on that AMP. There is no space or processing overhead involved in either building a primary index or accessing a row through its primary index value, as no special index structure needs to be built.

Once a row arrives at the AMP identified by its hash bucket, the hash-ID is used to look up the physical location of the row on disk. Which virtual cylinder and sector holds the row is determined by means of a tree-like index structure, pictured below.

[Figure: one sorted list of cylinder indexes per AMP points to cylinder indexes (many per cylinder), which in turn point to data blocks whose rows are sorted by Row-ID.]

This adaptable behavior delivers numerous benefits. Random growth is accommodated at the time it happens. Rows can easily be moved from one location to another without affecting in-flight work or any other data objects that reference that row. There is never a need to stop activity and re-organize the physical data blocks or adjust pointers.
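A minimal sketch of the two-output hashing scheme follows. The hash function, bucket count, and bucket-to-AMP map here are invented; Teradata's actual hashing algorithm is internal.

```python
# Conceptual sketch of primary index hashing. The real algorithm and map
# sizes are internal to Teradata; everything below is illustrative.

import hashlib

NUM_AMPS = 4
NUM_HASH_BUCKETS = 65536      # assumed bucket count, for illustration only

def row_hash(primary_index_value):
    """Deterministic 32-bit hash of the primary index value (a stand-in
    for Teradata's internal hashing algorithm)."""
    digest = hashlib.md5(str(primary_index_value).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def locate(primary_index_value):
    h = row_hash(primary_index_value)
    bucket = h % NUM_HASH_BUCKETS    # output 1: hash bucket -> owning AMP
    amp = bucket % NUM_AMPS          # via an assumed bucket-to-AMP map
    return amp, bucket, h            # output 2: hash-ID locates the row on that AMP

# Inserts and retrievals run the same computation, so no separate index
# structure is ever built or maintained.
customer_number = 10034
amp, bucket, hash_id = locate(customer_number)
print(f"customer {customer_number} -> bucket {bucket}, AMP {amp}, hash-ID {hash_id:#010x}")
```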
Figure 10. A new row is inserted into an existing data block: if there is space on the cylinder, the data block simply expands; if there is no free space on the cylinder, the data block splits across two cylinders.

This flexibility to consolidate or expand data blocks anytime allows the Advanced Analytics Engine to do many space-related housekeeping tasks in the background and avoid the table unloads and reloads common to fixed-size page databases. This advantage increases database availability and translates to fewer maintenance tasks for the DBA.

Multi-level Row Partitioning

Added to this storage architecture is the ability to partition a table by one or more columns, making it faster to access data without the need for full-table scans or the costly maintenance of secondary indexes. For example, a transaction table might be partitioned on transaction date, week, or month. If a query constrains on a period of time for those transactions, the optimizer will figure out which partitions need to be read, whether the table was partitioned on day, week, month, or other time-period ranges. You could also add additional partitioning columns such as country, district, or brand. A query with a constraint on either partitioning column, or both, will reduce the amount of data that must be read to satisfy the query. The hashed cylinder and row access is accomplished within the defined partitions.

Column Partitioning

Tables can also be stored with columns in separate partitions. This has the advantage of focusing I/O on just the columns of data needed in a query instead of the entire row. It also supports vertical compression techniques, where a value is stored once for use in consecutive rows. Column partitioning can be combined with row partitioning to further reduce the amount of I/O needed to satisfy a query.
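A small sketch of partition elimination under multi-level partitioning (the partition layout and data are invented for illustration):

```python
# Conceptual sketch of multi-level partition elimination (the layout is
# invented; Teradata's file system organizes this beneath the hashing).
# Rows are grouped by (month, country); a query constraint on either
# partitioning column, or both, lets whole partitions be skipped.

partitions = {
    ("2024-01", "US"): [("txn-1", 100), ("txn-2", 40)],
    ("2024-01", "DE"): [("txn-3", 250)],
    ("2024-02", "US"): [("txn-4", 75)],
    ("2024-02", "DE"): [("txn-5", 30)],
}

def scan_with_elimination(month=None, country=None):
    """Read only the partitions that can satisfy the constraints."""
    for (p_month, p_country), rows in partitions.items():
        if month is not None and p_month != month:
            continue            # eliminated by the time constraint
        if country is not None and p_country != country:
            continue            # eliminated by the country constraint
        yield from rows         # hashed row access proceeds inside the partition

# Constraining on both columns reads one partition instead of four.
print(list(scan_with_elimination(month="2024-02", country="US")))
```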
Indexes

The primary index for a table takes no space, and by calculating the hash value of a constraint on that column, its row can usually be retrieved in a single I/O. Partitioning also requires no space and allows for a significant reduction in I/O and improvement in response time. The Advanced Analytics Engine also supports traditional secondary indexes. These are valuable when a frequently used, high-cardinality column exists, such as customer number on a table such as Orders, where the logical primary index for the Orders table is the Order_ID.

Also supported are Join Indexes, which are transparent to users and their BI tools but are leveraged by the optimizer to eliminate join and aggregation processing. As the base tables are maintained, these join indexes are automatically maintained. If one join index is a more aggressive aggregation of another, then after the base table is updated, the lower-level aggregation is re-calculated and those values are aggregated to maintain the more aggressive aggregation. If analysis of usage in the query logging indicates that a join index is not being used, it can be dropped with no impact on the syntax of users' queries.

Work Flow Self-Regulation

A shared-nothing parallel database has a special challenge when it comes to knowing how much new work it can accept, and how to identify congestion that is starting to build up inside one or more of the parallel units. With the optimizer attempting to apply multiple dimensions of parallelism to each query that it sees, it is easy to reach very high resource utilization within a Teradata system, even with just a handful of active queries.
Designed for stress, the Advanced Analytics Engine is able to function with large numbers of users, a very diverse mix of work, and a fully loaded system. Being able to keep on functioning full throttle under conditions of extreme stress relies on internal techniques that were built inside the database to automatically and transparently manage the flow of work while the system stays up and productive.

Even though the data placement conventions in use with the Advanced Analytics Engine lend themselves to even placement of the data across AMPs, the data is not always accessed by queries in a perfectly even way. During the execution of a multi-step query, there will be occasions when some AMPs require more resources for certain steps than do other AMPs. For example, if a query from an airline company site is executing a join based on airport codes, you can expect whichever AMP is performing the join for rows with Atlanta (ATL) to need more resources than does the AMP that is joining rows with Anchorage (ANC). Some of this uneven processing demand has been reduced by the optimizer splitting the data into separate spool files and applying different join strategies for the busy airports and the less busy ones. However, some unevenness of processing demands will remain.

AMP-Level Control

The Advanced Analytics Engine manages the flow of work that enters the system in a highly decentralized manner, in keeping with its shared-nothing architecture. There is no centralized coordinator to become a bottleneck. There is no message-passing between AMPs to determine if it is time to hold back new requests. Rather, each AMP evaluates its own ability to take on more work, and temporarily pushes back when it experiences a heavier load than it can efficiently process. And when an AMP does have to push back, it does so for the briefest moments of time, often measured in milliseconds.

This bottom-up control over the flow of work was fundamental to the original architecture of the database as designed. All-AMP step messages come down to the AMPs, and each AMP will decide whether to begin working on a message, put it on hold, or ignore it. This AMP-level mindfulness is the cornerstone of the database's ability to accept impromptu swings of very high and very low demand, and gracefully and unobtrusively manage whatever comes its way.

AMP Worker Tasks

AWTs are the tasks inside of each AMP that get the database work done. This database work may be initiated by internal database software routines, such as deadlock detection or other background tasks. Or the work may originate from a user-submitted query. These pre-allocated AWTs are assigned to each AMP at startup and, like taxi cabs queued up for fares at the airport, they wait for work to arrive, do the work, and come back for more work.

Because of their stateless condition, AWTs respond quickly to a variety of database execution needs. There is a fixed number of AWTs on each AMP. For a task to start running, it must acquire an available AWT. Having an upper limit on the number of AWTs per AMP keeps the number of activities performing database work within each AMP at a reasonable level. AWTs play the role of both expeditor and governor.

As part of the optimization process, a query is broken into one or many AMP execution steps. An AMP step may be simple, such as reading one row using a unique primary index or applying a table-level lock. Or an AMP step may be a very large block of work, such as scanning a table, applying selection criteria to the rows read, redistributing the rows that are selected, and sorting the redistributed rows.

The Message Queue

When all AMP worker tasks on an AMP are busy servicing other query steps, arriving work messages are placed in a message queue that resides in the AMP's memory. This is a holding area until an AWT frees up and can service the message. This queue is sequenced first by message work type, a category indicating the importance of the work message. Within work type, the queue is sequenced by the priority of the request the message is coming from.
Messages representing a new query step are broadcast to all participating AMPs by the PE. In such a case, some AMPs may provide an AWT immediately, while other AMPs may have to queue the message. Some AMPs may dequeue their message and start working on the step sooner than others. This is typical behavior on a busy system where each AMP is managing its own flow of work.

Once a message has either acquired an AWT or been accepted onto the message queue across each AMP in the dynamic BYNET group, it is assumed that each AMP will eventually process it, even if some AMPs take longer than others. The sync point for the parallel processing of each step is at step completion, when each AMP signals across the completion semaphore that it has completed its part. The BYNET channels set up for this purpose are discussed more fully in the BYNET section of this paper.

Turning Away New Messages

Each AMP has flow control gates that monitor and manage messages arriving from senders. There are separate flow control gates for each different message work type. New work messages have their own flow control gates, as do spawned work messages. The flow control gates keep a count of the active AWTs of that work type as well as how many messages are queued up waiting for an AWT.

Once the queue of messages of a certain work type grows to a specified length, new messages of that type are no longer accepted, and that AMP is said to be in a state of flow control, as shown in Figure 15. The flow control gate will temporarily close, pulling in the welcome mat, and arriving messages will be returned to the sender. The sender, often the PE, continues to retry the message until that message can be received on that AMP's message queue.

Because the acceptance and rejection of work messages happens at the lowest level, in the AMP, there are no layers to go through when the AMP can get back to normal message delivery and processing. The impact of turning on and turning off the flow of messages is kept local: only the AMP hit by an over-abundance of messages at that point in time throttles back temporarily.

Riding the Wave of Full Usage

Teradata was designed as a throughput engine, able to exploit parallelism to maximize resource usage of each request when only a few queries are active, while at the same time able to continue churning out answer sets in high-demand situations. To protect overall system health under extreme usage conditions, highly decentralized internal controls were put into the foundation, as discussed in this section.
[Figure 15: an AMP in flow control. The gate for new messages (20 queued) rejects arriving messages now, and the sender retries them later; the gate for spawned messages (3 queued) continues to accept work.]
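Pulling the pieces together, here is a conceptual sketch of one AMP's worker pool, message queue, and flow control gate. The queue limit and data structures are invented; only the 80-AWT default comes from the text.

```python
# Conceptual sketch of an AMP's worker-task pool and flow control gate:
# a fixed pool of AWTs services a queue ordered by work type and then by
# request priority (lower number = more important here, illustrative);
# when a work type's queue passes its limit, new messages are rejected
# and the sender retries.

import heapq

MAX_AWTS = 80        # default AWTs per AMP, per the text
QUEUE_LIMIT = 20     # assumed per-work-type queue limit, for illustration

class AMP:
    def __init__(self):
        self.available_awts = MAX_AWTS
        self.queue = []                      # entries: (work_type, priority, msg)
        self.queued_by_type = {}

    def receive(self, work_type, priority, msg):
        if self.available_awts > 0:
            self.available_awts -= 1         # an AWT picks the message up at once
            return "running"
        if self.queued_by_type.get(work_type, 0) >= QUEUE_LIMIT:
            return "rejected (flow control), sender will retry"
        heapq.heappush(self.queue, (work_type, priority, msg))
        self.queued_by_type[work_type] = self.queued_by_type.get(work_type, 0) + 1
        return "queued"

    def awt_free(self):
        """Called when an AWT finishes; it services the next queued message."""
        self.available_awts += 1
        if self.queue:                       # next message by work type, then priority
            work_type, priority, msg = heapq.heappop(self.queue)
            self.queued_by_type[work_type] -= 1
            self.available_awts -= 1
            return msg
        return None

amp = AMP()
for i in range(85):                          # 80 run immediately, 5 queue
    amp.receive(work_type=1, priority=2, msg=f"step {i}")
print(amp.receive(work_type=1, priority=1, msg="urgent step"))  # queued ahead
print(amp.awt_free())                        # a freed AWT takes the urgent step
```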
The original architecture related to flow control and AMP worker tasks has needed very little improvement or even tweaking over the years. 80 AWTs per AMP is still the default setting for new Teradata systems. The number can be increased for more powerful platforms that aren't achieving full utilization, or for platforms with a large number of active queries with diverse response-time expectations. Message work types, the work message queue, and retry logic all work the same as they always did.

There have been a few extensions in regard to AMP worker tasks that have emerged over time, including:

• Setting up reserve pools of AWTs exclusively for use by tactical queries, protecting high-priority work from being impacted when there is a shortage of AWTs.

• Automatic reserve pools of AWTs just for load utilities that become available when the number of AWTs per AMP is increased to a very high level, intended to reduce resource contention between queries and load jobs for enterprise platforms with especially high concurrency.

Workload Management

The second section in this whitepaper called attention to the multifaceted parallelism available for queries on the Advanced Analytics Engine. The subsequent section discussed how the optimizer uses those parallel opportunities in smart ways to improve performance on a query-by-query basis. And the previous section illustrated internal AMP-level controls that keep high levels of user demand and an over-abundance of parallelism from bringing the system to its knees.

In addition to those automatic controls at the AMP level, Teradata has always had some type of system-level workload management, mainly priority differences, that are used by the internal database routines.

The Original Four Priorities

One of the challenges faced by the original architects of the Teradata Database was how to support maximum levels of resource usage on the platform and still get critical pieces of internal database code to run quickly when needed. For example, if there is a rollback taking place due to an aborted transaction, it benefits the entire system if the reversal of updates to clean up the failure can be executed quickly.

It was also important to ensure that background tasks running inside the database didn't lag too far behind. If city streets are so congested with automobile traffic that the weekly garbage truck can't get through and is delayed for weeks at a time, a health crisis could arise.

The solution the original architects found was a simple priority scheme that applied priorities to all tasks running on the system. This rudimentary approach offered four priority buckets, each with a greater weight than the one that came before: L for Low, M for Medium, H for High, and R for Rush. The default priority was medium, and indeed most work ran at medium and was considered equally important to other medium-priority work that was active.

However, database routines and even small pieces of code could assign themselves one of the other three priorities, based on the importance of the work. Developers, for example, decided to give all END TRANSACTION activity the rush priority, because finishing almost-completed work at top speed frees up valuable resources sooner and was seen as critical within the database. In addition, if the administrator wanted to give a favored user a higher priority, all that was involved was manually adding one of the priority identifiers into the user's account string.

Background tasks discussed in the section about space management were designed to use priorities as well. Some of these tasks, like the task that deletes transient journal rows that are no longer needed, were designed to start out at the low priority but increase their priority over time if the system was so busy that they were not able to get their work accomplished. This approach kept such tasks in the background most of the time, except when their need to complete becomes critical.
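A compact sketch of the four-bucket scheme with priority aging for background tasks follows. The bucket weights and the escalation rule are invented for illustration; only the L/M/H/R buckets come from the text.

```python
# Compact sketch of the original four-bucket priority scheme with aging
# for starved background tasks. Weights and escalation rule are invented.

WEIGHTS = {"L": 1, "M": 2, "H": 4, "R": 8}    # Low, Medium, High, Rush
ESCALATION = ["L", "M", "H", "R"]

class Task:
    def __init__(self, name, priority="M"):    # medium is the default
        self.name = name
        self.priority = priority

    def escalate(self):
        """A starved background task raises its own priority one bucket."""
        idx = ESCALATION.index(self.priority)
        if idx < len(ESCALATION) - 1:
            self.priority = ESCALATION[idx + 1]

def pick_next(tasks):
    # Heavier buckets win; e.g., END TRANSACTION work runs at Rush.
    return max(tasks, key=lambda t: WEIGHTS[t.priority])

tasks = [Task("user query"),
         Task("transient journal purge", "L"),
         Task("END TRANSACTION", "R")]
print(pick_next(tasks).name)     # END TRANSACTION: rush work goes first
tasks[1].escalate()              # the starved purge task moves L -> M
```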
The logging levels are optional and may be combined with the Performance Data Capture Routines (PDCR) for historical analysis and capacity planning. No other DBMS has the maturity of logging of the Vantage Advanced Analytics Engine.

Conclusion

Foundations are important. Teradata's ability to grow in new directions and continue to sustain its core competencies is a direct result of its strong, tried-and-true foundation. As our engine has matured, the same fundamentals have been adapted to new technology advances. For example, in initial releases, the AMP was a physical computer that owned its own disk drive and directly managed how data was located on its disks. Today an AMP is a software virtual processor that co-exists with other such virtual processors on the same node, all of which share the node's resources. Yet each AMP maintains its shared-nothing characteristics, the same as in the first release.

This white paper attempts to familiarize you with a few of the features that make up important building blocks of the Advanced Analytics Engine, so you can see for yourself the elegance and the durability of the architecture. This paper points out recent enhancements that have grown out of this original foundation, building on it rather than replacing it.

These foundational components have such a widespread consequence that they simply cannot be tacked on as an afterthought. The database must be born with them.

About Teradata

Teradata is the connected multi-cloud data platform company. Our enterprise analytics solve business challenges from start to scale. Only Teradata gives you the flexibility to handle the massive and mixed data workloads of the future, today. Learn more at Teradata.com.
The Teradata logo is a trademark, and Teradata is a registered trademark of Teradata Corporation and/or its affiliates in the U.S. and worldwide. Teradata continually
improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features,
functions and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.