Training Material - Teradata Basics Certification
07/09/2013
Confidentiality Statement
Confidentiality and Non-Disclosure Notice
The information contained in this document is confidential and proprietary to TATA
Consultancy Services. This information may not be disclosed, duplicated or used for any
other purposes. The information contained in this document may not be released in
whole or in part outside TCS for any purpose without the express written permission of
TATA Consultancy Services.
Table of Contents
1. Teradata Architecture
2. Space Management
3. Application Development
4. Data Distribution
5. Partitioning
6. Access Methods
7. Join Index
8.
1. Teradata Architecture
1.3 BYNET
It is the channel of communication between the Parsing Engine and the Access Module Processor. It is also called the
message-passing layer in Teradata. There are 2 BYNET systems: BYNET 0 and BYNET 1. If one of the connections
fails, the second is completely independent and can continue to function. Therefore, communications continue
between all nodes.
1.4 Node
It is the basic building block of a Teradata system. It contains a large number of hardware and software components.
Database processing occurs in the node.
Space Management
1.10 Perm Space
It is the amount of data storage allowed for a specific user or database. Upon a new installation of Teradata, all Perm
space in the system is owned by the system master account, DBC. The total Perm space on a Teradata system is the
sum of the Perm space available on all AMPs; the Perm limit of a database or user is divided evenly across the AMPs,
so the per-AMP limit is the total limit divided by the number of AMPs.
Databases, users, tables, etc. are created and stored in Perm space.
Spool space and Temp space are drawn from unused Perm space.
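The relationship between a Perm allocation and the AMPs can be illustrated with a little arithmetic. The Python sketch below assumes that an allocation is split evenly across all AMPs; the function names and the figures are hypothetical and are not part of any Teradata interface.

```python
def per_amp_perm_limit(perm_bytes, num_amps):
    """Share of a Perm allocation that falls on each AMP,
    assuming the allocation is split evenly across all AMPs."""
    if num_amps <= 0:
        raise ValueError("num_amps must be positive")
    return perm_bytes / num_amps


def total_perm(per_amp_bytes):
    """Total Perm space as the sum of the Perm space on every AMP."""
    return sum(per_amp_bytes)


# Hypothetical figures: a 100 GB allocation on a 4-AMP system.
GB = 10**9
share = per_amp_perm_limit(100 * GB, 4)      # Perm limit on each AMP
system_total = total_perm([25 * GB] * 4)     # sum across the 4 AMPs
print(share / GB, system_total / GB)
```

Under this assumption, a badly skewed table can exhaust the limit on one AMP even though the database as a whole still has free Perm space.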
1.11 Spool Space
It is unused Perm space that is used to temporarily build answer sets while users run queries. Spool space
accumulates the result rows and holds them until the query completes. Once the query is completed, all
rows are returned by the AMPs through the BYNET.
1.12 Temp Space
It is unused Perm space that is used to materialize global temporary tables (volatile tables, by contrast, are built in
the user's spool space). The data remains active only for the current session, although tables created in Temp space
will survive a restart. Temp space is available to the user until the session is terminated.
Exam Tips:
If the DBA created the maximum of 32 secondary indexes on a table, then 32 subtables would be
created, each taking up PERM space.
A database may have PERM space allocated to it. This PERM space establishes the maximum amount of disk
space for storing user data rows in any table located in the database. However, if no tables are stored
within a database, it is not required to have PERM space.
Although a database without PERM space cannot store tables, it can store views and macros because they
are physically stored in the Data Dictionary (DD) PERM space and require no user storage space. The DD is
in a database called DBC.
A database or user with no PERM space can still own views, macros, and triggers and execute queries.
A password is what differentiates a user from a database.
Global temporary tables survive system restarts.
2. Application Development
Application development for Teradata RDBMS falls into one of the following categories:
Explicit SQL
Implicit SQL
Under explicit SQL application development you have the following tools:
Embedded SQL
Macros
Stored Procedures
BTEQ
CLI
ODBC
Queryman
Third-party products that package and submit SQL
EXPLAIN statement
Under implicit SQL application development, you have tools such as Teradata and third-party products that permit
various fourth-generation languages and application generators to be translated into SQL.
Here we will discuss only some of the most important topics required for certification.
2.1 Macro
Teradata macros are SQL statements, which the server stores and executes. The advantages of using macros
include the generation of less channel traffic and easy execution of frequently used SQL operations. Macros are
particularly useful for enforcing data integrity rules, providing data security, and improving performance.
3. Data Distribution
3.1 Primary Key
It is a logical concept in Teradata. A primary key cannot be NULL, and defining one is not mandatory. It is the
designated column (or columns) whose unique values are used to identify each row in the table. Only one primary
key can exist in a table. The best choice for a PK is one small column that appropriately represents the data in the row.
3.8 NoPI
A NoPI table is a MULTISET nontemporal table (a nontemporal table is a table that does not support
TransactionTime) that does not have a primary index. The chief purpose of nonpartitioned NoPI tables is to
enhance the performance of FastLoad and Teradata Parallel Data Pump Array INSERT data loading operations.
4. Partitioning
4.1 Partitioned Primary Index
The default PI of Teradata Database is a non-partitioned PI, though both UPIs and NUPIs can be partitioned. A
partitioned primary index still provides a path, using PI values, to rows in base tables, global temporary tables,
volatile tables, and noncompressed join indexes. When a PPI is used to create a table or join index, rows are
hashed to AMPs based on the PI columns and assigned to the appropriate partitions. Within a partition, rows are
stored in row hash order.
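The placement rule described above (hash to an AMP on the PI, assign a partition, then order by row hash within the partition) can be sketched in Python. This is only an illustration under stated assumptions: CRC32 stands in for Teradata's hashing algorithm, a simple modulo stands in for the mapping of hash values to AMPs, and the month-based partitioning expression, NUM_AMPS, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_AMPS = 4  # hypothetical system size


def row_hash(pi_value):
    # Assumption: CRC32 is a stand-in for the real hashing algorithm.
    return zlib.crc32(str(pi_value).encode("utf-8"))


def amp_for(pi_value):
    # Assumption: modulo replaces the real hash-to-AMP mapping.
    return row_hash(pi_value) % NUM_AMPS


def partition_for(order_month):
    # Hypothetical partitioning expression: one partition per month (1..12).
    return order_month


def place_rows(rows):
    """rows: iterable of (pi_value, order_month) pairs.
    Returns {amp: [(partition, row_hash, pi_value), ...]} with each AMP's
    rows ordered by partition first and by row hash within a partition."""
    amps = defaultdict(list)
    for pi_value, month in rows:
        amps[amp_for(pi_value)].append(
            (partition_for(month), row_hash(pi_value), pi_value)
        )
    for stored in amps.values():
        stored.sort(key=lambda entry: entry[:2])
    return dict(amps)


placement = place_rows((order_id, (order_id % 12) + 1) for order_id in range(50))
```

Note that the partitioning column plays no part in choosing the AMP; only the PI value does. That is why a query constrained on the partitioning column can skip partitions on every AMP rather than skip AMPs.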
5. Access Methods
The following are the basic data access methods in Teradata in order of best-to-worst performance:
If an SI exists on a table and the PE determines that the PI cannot be used, then the secondary index will be used to
retrieve data.
6. Join Index
Join Indexes are file structures for permitting queries to be resolved by accessing an index rather than its base
table. Join indexes can be defined on one or more tables.
Multitable join indexes will store and maintain joined rows of two or more tables and will aggregate selected
columns. These are used for join queries that are performed with high frequency.
An aggregate join index is a cost-effective, highly efficient method of resolving queries that frequently specify
the same aggregate operations on the same column or set of columns. As a result, the aggregates do not have to be
recalculated for every query.
Single-table join indexes are used to resolve joins on large tables without redistributing the joined rows across the
AMPs. These types of join indexes hash a frequently joined subset of base table columns to the same AMP. As
a result, BYNET traffic is eliminated.
Exclusive: The requester has exclusive rights to the locked resource. No other process can read from, write to, or
access the locked resource in any way.
Write: The requester has exclusive rights to the locked resource, except for readers not concerned with data
consistency.
Read: Several users can hold Read locks on a resource, during which the system permits no modification of that
resource.
Access: The requester is willing to accept minor inconsistencies of the data while accessing the database.
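The four lock definitions above imply a compatibility rule: a new lock can be granted only if it is compatible with every lock already held on the resource. The Python sketch below encodes that rule as a lookup table derived from those definitions; the names COMPATIBLE and can_grant are hypothetical and only illustrate the logic, not how the lock manager is actually implemented.

```python
# For each requested lock type, the held lock types it can coexist with.
COMPATIBLE = {
    "ACCESS": {"ACCESS", "READ", "WRITE"},  # blocked only by EXCLUSIVE
    "READ": {"ACCESS", "READ"},             # no modification while reading
    "WRITE": {"ACCESS"},                    # only "dirty" readers allowed
    "EXCLUSIVE": set(),                     # nothing else may hold a lock
}


def can_grant(requested, held):
    """True if `requested` is compatible with every lock in `held`."""
    return all(lock in COMPATIBLE[requested] for lock in held)


print(can_grant("READ", ["READ", "ACCESS"]))   # readers share the resource
print(can_grant("WRITE", ["READ"]))            # a writer waits for readers
print(can_grant("ACCESS", ["WRITE"]))          # dirty read during a write
```

A request that cannot be granted would have to wait until the incompatible locks are released.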
Exam tips:
OLTP
DSS
OLCP
OLAP
Data mining
Active Data Warehousing combines the best features of all above so that enterprise can utilize each processing
type to run their businesses better.
Logical Data Marts are not separated from the detail data in a separate computer system. Many companies use a
best practice of keeping the detail data and the dependent data marts on the same data warehouse platform. This
allows users to query the data mart for summarized or aggregated information while still being able to ask
questions about the detail data. A dependent data mart has nothing to do with a Logical Data Mart. You can take a
dependent data mart from Teradata and place the summarized information on Oracle or you can choose to keep
the dependent data mart with the detailed Teradata data warehouse.
It is important to have dependent data marts. Dependent data marts are extracted directly from the detail data.
They are always one extract away from the detail. Independent data marts are extracted from other data marts or
directly from operational systems.
Exam tips:
Row v/s Set processing:
Row processing is the type of processing in which rows are fetched one at a time; after calculations are
performed on a row, it is inserted or updated. Then the next row is fetched and the process repeats.
Because rows are fetched one by one, processing is very slow, although there is less lock
contention when row processing is used.
Set Processing is built on the concept of handling groups of rows at one time. The biggest advantage of Set
Processing over Row processing is performance and the benefit of using row processing over set
processing is less lock contention.
Throughput v/s Response time:
Response Time is the elapsed time per query and Throughput is the number of queries executed in an
hour. While throughput measures the amount of work processed, response time is a measure of process
completion.
Macros and stored procedures are two methods used by Teradata RDBMS to limit data access.
The typical purpose of Semantic layer in an Enterprise data warehouse architecture is data views.
An inverted list database is built around both set processing and row-at-a-time processing.
X-views in data dictionary:
The views present in a data dictionary including DBC.AllTempTablesV, DBC.TablesV and DBC.UsersV can be
accessed using an EXPLAIN modifier preceding a DDL or DCL statement.
Some views are user-restricted and apply only to the user submitting the query that acts upon the
view. These views are identified by an X appended to the system view name and are sometimes called X
views. These views report only a subset of the available information. The only difference between an X
view and a non-X view is the existence of a WHERE clause to ensure that a user can view only those objects the
user owns, is associated with, has been granted privileges on, or has been assigned a role with privileges on.
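Returning to the row versus set processing tip above, the difference can be made concrete with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in database (it is not Teradata), and the employee table, its columns, and the function names are hypothetical. Both functions apply the same salary raise: one issues a separate UPDATE per row, the other a single UPDATE for the whole set.

```python
import sqlite3


def make_table(conn):
    # Hypothetical table with five employees.
    conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, salary REAL)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [(i, 1000.0 * i) for i in range(1, 6)],
    )


def raise_row_at_a_time(conn, pct):
    """Row processing: handle each row individually, one UPDATE per row."""
    rows = conn.execute("SELECT emp_id, salary FROM employee").fetchall()
    for emp_id, salary in rows:
        conn.execute(
            "UPDATE employee SET salary = ? WHERE emp_id = ?",
            (salary * (1 + pct / 100), emp_id),
        )


def raise_set_based(conn, pct):
    """Set processing: one statement operates on all qualifying rows."""
    conn.execute("UPDATE employee SET salary = salary * (1 + ? / 100.0)", (pct,))


row_conn = sqlite3.connect(":memory:")
make_table(row_conn)
raise_row_at_a_time(row_conn, 10)

set_conn = sqlite3.connect(":memory:")
make_table(set_conn)
raise_set_based(set_conn, 10)
```

The results are the same; what differs is the number of statements the database has to process, which is where the performance advantage of set processing comes from.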
9. Data Protection
9.1 RAID1:
Using RAID1, data is mirrored across paired disks. It provides the highest level of protection, although half of the
raw disk capacity (50%) is consumed by the mirror copy.
9.2 RAID5:
Data and parity are stored by striping across multiple disks. It is not mirrored.
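Why parity is enough to survive the loss of one disk can be shown with a short sketch. RAID 5 parity is computed with XOR, so any single missing block in a stripe can be rebuilt from the surviving blocks and the parity block. The Python below models a single stripe only (a real array also rotates the parity block across the disks), and the function names and block contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Rebuild the one missing data block of a stripe from the rest."""
    return xor_blocks(list(surviving_blocks) + [parity])


# One stripe spread over three data disks plus one parity block.
stripe = [b"DSK1", b"DSK2", b"DSK3"]
parity = xor_blocks(stripe)

# Simulate losing the second disk and rebuilding its block.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
print(recovered == stripe[1])
```

The trade-off against RAID1 is that a rebuild must read every surviving disk in the stripe, whereas a mirror simply copies from the partner disk.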
Exam Tips
The three interfaces that enable access to the Teradata Database from a network-attached client are:
1. Call level Interface V2 (CLIv2)
2. Java Database Connectivity (JDBC)
3. Open Database Connectivity (ODBC)
10.
Both hardware and software provide fault tolerance, some of which is mandatory and some of which is optional.
Teradata RDBMS facilities for software fault tolerance are:
Vproc migration
Fallback tables
AMP clusters
Journaling
Archive/Recovery
Table Rebuild utility
10.1 Vproc Migration:
Because the Parsing Engine (PE) and Access Module Processor (AMP) are software, they can migrate from their home
node to another node within the same hardware clique if the home node fails for any reason.
Although the system normally determines which vprocs migrate to which nodes, a user can configure preferred
migratory destinations.
10.2 Fallback:
A fallback table is a duplicate copy of a primary table. Each row in a fallback table is stored on an AMP different
from the one to which the primary row hashes. This reduces the likelihood of loss of data due to simultaneous
losses of the 2 AMPs or their associated disk storage.
The disadvantage of this method is that it requires twice the storage space and twice the I/O (on inserts, updates,
and deletes) of tables maintained without fallback. The advantage is that data is almost never lost because of a
down AMP. Data is fully available during an AMP or disk outage, and recovery is automatic after repairs have been
made.
10.3 AMP Clusters:
Clustering is a means of logically grouping AMPs to minimize (or eliminate) data loss that might occur from losing
an AMP. AMP clusters are used only for fallback data. The fallback copy of any row is always located on an AMP
different from the AMP that holds the primary copy. This is an entry-level fault tolerance strategy.
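The rule that a fallback copy never shares an AMP with its primary copy can be sketched as follows. This Python example makes several labeled assumptions: CRC32 with a modulo stands in for the real hashing, clusters are taken to be consecutive blocks of AMP numbers, the fallback copy is placed in the same cluster as the primary, and it goes to the next AMP in that cluster. The actual placement within a cluster is decided by the system, and all names here are hypothetical.

```python
import zlib


def primary_amp(pi_value, num_amps):
    # Assumption: CRC32 modulo the AMP count stands in for the real hashing.
    return zlib.crc32(str(pi_value).encode("utf-8")) % num_amps


def cluster_of(amp, cluster_size):
    # Assumption: clusters are consecutive blocks of AMP numbers.
    start = (amp // cluster_size) * cluster_size
    return list(range(start, start + cluster_size))


def fallback_amp(pi_value, num_amps, cluster_size):
    """Pick a different AMP in the primary AMP's cluster for the fallback row."""
    if cluster_size < 2 or num_amps % cluster_size != 0:
        raise ValueError("need clusters of at least 2 AMPs that divide num_amps")
    primary = primary_amp(pi_value, num_amps)
    members = cluster_of(primary, cluster_size)
    # Assumption: use the next AMP in the cluster, wrapping around.
    return members[(members.index(primary) + 1) % len(members)]


for key in ("order-1", "order-2", "order-3"):
    print(key, primary_amp(key, 8), fallback_amp(key, 8, 4))
```

With this layout, losing one AMP leaves every row reachable through its fallback copy; data becomes unavailable only if a second AMP in the same cluster is lost at the same time.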
10.4 Journaling:
Transient Journal:
Permanent Journal:
Is active continuously
Is available for tables or databases
Provides rollforward for hardware failure recovery
Provides rollback for software failure recovery
Provides full recovery of nonfallback tables
Reduces need for frequent, full-table archives
10.5 Archive/Recovery:
10.6 Table Rebuild:
Table Rebuild provides for rebuilds of entire databases and all tables in a database, including:
The primary portion of a table
The fallback portion of a table
The entire table (both primary and fallback portions).
All fallback tables that reside on an AMP
All tables that reside on an AMP
Some of the Teradata RDBMS facilities for hardware fault tolerance are:
Dual BYNETs:
In a Teradata system there are two BYNETs, called BYNET-0 and BYNET-1. If one BYNET fails, the other one is
still available, so interprocessor traffic is not hindered unless both of them fail.
RAID disk units:
These are used to protect against disk failure. The most common level of RAID is RAID-1 (transparent mirroring).
Each primary disk has an exact copy of all its data on another disk. This provides the highest level of
protection, although it doubles the disk space needed for the data (a 100% overhead relative to the data, or
50% of the raw capacity).
With RAID-5, data and parity are striped across a rank of disks.
Cliques:
They are a group of nodes that share access to the same disk arrays. Cliques support the migration of vprocs when
nodes fail. If a node in a clique fails, then the vprocs in that node migrate to other nodes in the same clique. This
migration minimizes the performance impact on the system.
Exam Tips:
The following can be archived by the Teradata ARC:
Database, table, partition
The ARC statement COPY restores a copy of an archived file to a specified Teradata database system.
One of the reasons to use a USI rather than a NUSI when creating a table is that a USI is needed for journaling and
ensures that only unique data is inserted into the table.
Tables defined with FALLBACK create a Down AMP Recovery Journal in the event of an AMP failure.
The Transient Journal (TJ) ensures data integrity by keeping a before-image copy of changed rows in
memory. Upon transaction failure, the changes are rolled back.
TJ is maintained automatically. It provides rollback of changed rows for transactions that are not
completed.
Two of the reasons why a customer would choose table partitioning are:
1. To reduce the I/O for range constrained queries
2. For the ability to archive specific partitions in a table.
The 3 ways by which Teradata protects data are:
1. Archive to tape
2. Archive to disk
3. RAID technology
When using RAID1, 50% space is used as overhead.
Archive to tape, archive to disk and RAID technology are some of the ways by which Teradata protects data.
Hot Standby Node is a member of a clique.
When a node fails, all LAN PEs and AMPs will migrate to another node in the clique.
Hot standby nodes:
In Teradata, in case of a node failure, Teradata will reset. When it does, the AMPs and PEs in the down node
will be instructed to migrate to the hot standby node.
11.
It basically involves preventing concurrently running processes from improperly inserting, deleting, or updating
the same data. There are two mechanisms to achieve this:
Transactions
Locks
A transaction is the unit of work and the unit of recovery. A partial transaction cannot exist: either all statements
execute or none of them do. A set of transactions is said to be serializable if and only if it produces the
same result as some arbitrary serial execution of those same transactions for arbitrary input. A set of transactions is
correct only if it is serializable.
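The all-or-nothing behavior of a transaction can be sketched with a before-image journal, the same idea this material attributes to the Transient Journal. The Python below is a minimal illustration, not the Teradata implementation: a dictionary stands in for a table, and the class name, function name, and the rule that negative values make a statement fail are all hypothetical.

```python
_MISSING = object()  # marks "row did not exist before the change"


class BeforeImageJournal:
    """Remembers the before-image of every row a transaction changes."""

    def __init__(self, table):
        self.table = table
        self.before_images = []

    def update(self, key, new_value):
        # Save the before-image first, then apply the change.
        self.before_images.append((key, self.table.get(key, _MISSING)))
        self.table[key] = new_value

    def commit(self):
        # The work is complete, so the before-images are no longer needed.
        self.before_images.clear()

    def rollback(self):
        # Undo the changes in reverse order using the saved before-images.
        for key, old_value in reversed(self.before_images):
            if old_value is _MISSING:
                self.table.pop(key, None)
            else:
                self.table[key] = old_value
        self.before_images.clear()


def run_transaction(table, updates):
    """Apply all updates or none. Returns True on commit, False on rollback."""
    journal = BeforeImageJournal(table)
    try:
        for key, value in updates:
            if value < 0:  # hypothetical rule that makes a statement fail
                raise ValueError("negative value rejected")
            journal.update(key, value)
    except ValueError:
        journal.rollback()
        return False
    journal.commit()
    return True


accounts = {"a": 1, "b": 2}
print(run_transaction(accounts, [("a", 10), ("c", 3)]), accounts)
print(run_transaction(accounts, [("a", 99), ("d", 4), ("b", -1)]), accounts)
```

In the second call the first two updates are applied and then undone when the third one fails, so the table ends up exactly as it was before the transaction started.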
A lock is a means of claiming usage rights to some resource. A user can lock the following resource types in a
Teradata database:
1. Database 2. Table 3. View 4. Row Hash
11.1 System restarts:
When such an unscheduled restart occurs, 2 types of automatic transaction recovery can occur:
1. Single transaction recovery:
It aborts a single transaction for reasons such as a user error, a user-initiated abort command, or a transaction
deadlock timeout.
2. RDBMS recovery:
It is caused by a hardware or software failure or by a user command.
11.2
If an AMP fails to come online during system recovery, the RDBMS continues to process the transaction with the
help of fallback data. When it comes back online, down AMP recovery procedures make the data for the AMP up to
date. If there are a large number of rows to be processed, then the AMP recovers offline. If only a few are there, it
recovers online.
Exam Tips:
For a table-level lock, the following lock type combinations are compatible:
Access-Access, Access-Read, Access-Write
Read-Access, Read-Read
Write-Access
The purpose of a lock is to serialize access in situations where concurrent access would compromise data consistency.
Two reasons an access lock is used:
To perform a dirty read
To be able to read data that is currently being written by another process
The impact on a database when a node failure occurs is that the database is restarted.
12. System Monitoring
Teradata Manager is a monitoring system for production and performance, which is used to control one or more
Teradata servers. It is mainly used for reviewing historical workload.
Some of the monitoring tools in Teradata Analyst Pack to help the users and the DBA are:
12.1 Index Wizard:
It is used to analyze a workload of SQL. Index wizard creates a series of reports and index recommendations
describing the costs and statistics associated with those recommendations.
12.2 Statistics Wizard:
It is primarily used to help with collecting statistics. It recommends the collection of new statistics and the
recollection of existing statistics.
12.3 TSET:
It allows the user to capture cost parameters, statistics, etc. It does not export user data. It allows you to quickly
emulate a larger production system in a smaller test or development environment.
This reduces the cost of query plan analysis and your overall development efforts.
12.4 Visual Explain:
It depicts the PE optimizer's execution plan to access data in a visual format with easy-to-read icons. It visually
displays the query execution plan generated by the Teradata optimizer.
12.5
Teradata Analyst Pack, in conjunction with other Teradata tools and utilities, enhances users' ability to analyze and
understand the detailed steps involved in query plans, along with the influences of the system configuration, data
demographics, and secondary index structure.
13. Teradata Utilities
13.1
13.2
Teradata Parallel Transporter (TPT) provides scalable, high-speed, parallel data extraction, loading and updating by
using and expanding on the traditional Teradata extraction and load utilities such as FastLoad, MultiLoad,
FastExport and TPump.
TPump:
Used for row-level locking
Used for continuous updates to rows in a table
Does not support Multi-SET tables
Loads data to Teradata from a Mainframe or LAN flat file
SQL operations can be performed on many tables simultaneously
FastLoad:
Used for table-level locking
Only one table may be loaded at a time and that table should be empty
Loads (INSERTs) data to Teradata from a Mainframe or LAN flat file
Runs in two operating modes: Interactive mode and Batch mode
Duplicate rows will not be loaded
Uses two error tables: the first error table pertains to the errors when the format of the data is not correct and
the second error table takes errors when there is violation of UPI.
MultiLoad:
Used for table-level locking
Up to 20 inserts, updates or deletes can be done on up to 5 tables
Loads data to Teradata from a Mainframe or LAN flat file
Extremely fast
Duplicate rows allowed
Usually the receiving tables are populated
Uses two error tables: Acquisition phase error table is related to data error and Application phase error table is
related to UPI violations
FastExport:
TPT combines all the above Teradata utilities into one comprehensive language. It can insert data into
tables, export data from tables, update tables, etc., with in-line filtering.
Exam tips:
BTEQ allows import/export across all supported platforms.
Teradata allows you to set multiple sessions in a BTEQ script. However, this will work only if your SQL is
using the Primary Index or a Unique Secondary Index. So UPI, NUPI and USI are the only options that will
utilize multiple sessions.
BTEQ connects to the database by means of CLIv2
BTEQ enables users on a workstation to easily access one or more Teradata Database systems for ad hoc
queries, report generation, data movement (suitable for small volumes) and database administration.
All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content /
information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced,
republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS.
Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws,
and could result in criminal or civil penalties. Copyright 2011 Tata Consultancy Services Limited