TD Join Strategies 1
The Teradata Optimizer chooses among several join strategies to find the lowest-cost plan
and the best performance. It bases the choice on the information available to it, such as
table size, Primary Index (PI) definitions, and collected statistics.
The inner join above returns rows only when there is a match between the two
tables. The ON clause is extremely important because it establishes the join
(equality) condition. Think of this as repairing a deck that requires a certain
kind of wood. You would go down to the local home improvement store to locate
the right pieces of wood for the repair. Cedar would definitely not be a match
if your deck required pressure-treated pine.
To investigate further, each matching row is joined where Emp = Emp, as stated in
the ON clause of the JOIN. Note that EMP is the Primary Index of both tables.
This first merge join type is extremely efficient because both columns in the ON clause are
the Primary Indexes of their respective tables. Rows with the same EMP value hash to the
same AMP, so NO data has to be moved into spool and the join can be performed AMP-local.
Teradata can perform this join with rapid speed. Remember that the less data that has to be
moved to complete a join, the better the performance. The end result of this join is
that the rows from each table are physically joined together to form one row.
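A minimal sketch of a PI = PI join. It assumes two hypothetical tables, EmployeeTable and PayTable, that both use EMP as their Primary Index. The table names, the EmpName and Salary columns, and the data types are illustrative assumptions, not taken from the example above.

```sql
-- Hypothetical tables: both are distributed across the AMPs by the hash of Emp.
CREATE TABLE EmployeeTable
( Emp      INTEGER NOT NULL
, Dept     INTEGER
, EmpName  VARCHAR(30)
) UNIQUE PRIMARY INDEX (Emp);

CREATE TABLE PayTable
( Emp     INTEGER NOT NULL
, Salary  DECIMAL(10,2)
) UNIQUE PRIMARY INDEX (Emp);

-- PI = PI join: matching rows already sit on the same AMP.
EXPLAIN
SELECT e.Emp, e.EmpName, p.Salary
FROM EmployeeTable e
INNER JOIN PayTable p
   ON e.Emp = p.Emp;
```

Prefixing the query with EXPLAIN shows the plan without running the query. For this case the plan would be expected to contain no redistribution or duplication step.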
In this example, two tables are being joined based on the DEPT column. In the department
table, the Primary Index column is DEPT. As we know, this is a good match based on the
equality (ON) condition. However, the employee table has EMP as the Primary Index
column. Which table do you think will move based on this join?
Regardless of the equality condition, the primary objective is to bring the matching rows
from each table together on the same AMPs. The Teradata Optimizer has several options for
completing this task. The first option is to duplicate the smaller table
on all AMPs. The second option is to leave the Department table, whose Primary Index column
matches the equality condition, stationary on its AMPs and to move the rows from the
Employee table into spool. This is accomplished by hashing the DEPT value of each Employee
row and sending the row into spool on the AMP where the Department rows with the same
row hash reside. The stationary table is already sorted by row hash, and the spool is
then sorted by row hash on each AMP so that the two can be matched.
Clearly the second option is an excellent choice for the Teradata Optimizer because it
moves only one table, which reduces the resources needed to complete the join and improves
performance.
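A sketch of this case, assuming the EMPLOYEETABLE and DEPTTABLE names used in the product join example later in this document. The DeptName column is a hypothetical addition for illustration.

```sql
-- Assumed layout: EmployeeTable has PRIMARY INDEX (Emp),
--                 DeptTable     has PRIMARY INDEX (Dept).
-- The join column is the PI of DeptTable only, so DeptTable can stay in place
-- while the EmployeeTable rows are rehashed on Dept into spool.
EXPLAIN
SELECT e.Emp, e.Dept, d.DeptName   -- DeptName is a hypothetical column
FROM EmployeeTable e
INNER JOIN DeptTable d
   ON e.Dept = d.Dept;
```

In the EXPLAIN output, a redistribution step is typically reported with wording along the lines of "redistributed by the hash code of", and a duplication step with "duplicated on all AMPs". Which one appears depends on the table sizes and the statistics available.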
In the previous example, the join equality is MgrEmp = MgrNo. The
Primary Index of the Department table is DEPT and the Primary Index of the Manager table
is LOC. In this case, neither column used in the join equality is part of a
Primary Index. So what strategy will Teradata take to resolve this join operation?
Basically, rows from both tables will need to be rehashed and redistributed into spool,
because neither column in the ON clause is the Primary Index of
its table. Both tables are therefore redistributed by the hash of the ON clause
columns.
The next step in this process is to redistribute the rows to the matching
AMPs. When this is completed, the rows from the two tables are held in two separate
spools. Lastly, the rows in each spool are joined together to return the matching
rows. This type of join strategy is extremely inefficient. It consumes a great deal of
resources and time because both tables must be moved before the join can start.
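A sketch of the non-PI = non-PI case. The MgrEmp, MgrNo, DEPT and LOC columns come from the description above. The DeptTable and ManagerTable names are assumptions.

```sql
-- Assumed layout: DeptTable    has PRIMARY INDEX (Dept),
--                 ManagerTable has PRIMARY INDEX (Loc).
-- Neither join column is a Primary Index, so both tables are expected
-- to be rehashed on the join columns into spool before the join.
EXPLAIN
SELECT d.Dept, d.MgrEmp, m.Loc
FROM DeptTable d
INNER JOIN ManagerTable m
   ON d.MgrEmp = m.MgrNo;
```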
In the inner join above, the two tables involved are the Employee table and the
Department table. The join equality is on the DEPT column, which is the Primary Index
column of the Department table. The Employee table has the EMP column as its Primary
Index. The key observation is that the Department table is small, which makes it a good
candidate for duplication.
In order to join these two tables, the first step is to get the rows together on the
same AMP. Since the Department table is small, Teradata will choose to
duplicate the entire Department table into spool on each AMP. The next step is for each
AMP to join its base Employee rows with the duplicated Department rows.
Instead of redistributing the larger Employee table, whose join column is not its Primary
Index, Teradata will choose a more efficient strategy: duplicating the smaller table
across all the AMPs (a Big Table - Small Table Join). This merge join strategy consumes
minimal resources and allows Teradata to excel.
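The Optimizer can only treat the Department table as small if it has size information to work with. This is a hedged sketch of collecting statistics on the join column before checking the plan. It assumes the EMPLOYEETABLE and DEPTTABLE names used elsewhere in this document.

```sql
-- Give the Optimizer row counts and value distributions for the join column.
COLLECT STATISTICS COLUMN (Dept) ON DeptTable;
COLLECT STATISTICS COLUMN (Dept) ON EmployeeTable;

-- With a small DeptTable, the plan is expected to duplicate DeptTable
-- on all AMPs and leave the EmployeeTable rows where they are.
EXPLAIN
SELECT e.Emp, d.Dept
FROM EmployeeTable e
INNER JOIN DeptTable d
   ON e.Dept = d.Dept;
```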
Nested Join
A nested join strategy is probably the most precise join available. This join is designed to
use a unique index (either a Unique Primary Index or a Unique Secondary Index) on
one of the tables in the join statement in order to retrieve a single row. It then matches
that row to one or more rows in the other table used in the join.
In the example above, the nested join has its join equality (ON) condition on
the DEPT column. The DEPT column is the Primary Index column of the Department table
and a Secondary Index column of the Employee table. Based
on this information, which rows will move?
Keep in mind that the nested join prides itself on being able to move a single row into spool
and then match that row against another table that contains several matches. How is this
done? Analysis of this join statement shows that a new clause has been added: the
WHERE clause. Here the WHERE clause restricts one table to
a single row. A nested join will always use a unique
index to isolate that single row and then join it to the other table. The other table
may use an index or it may not. However, the best practice is to join on columns that
have indexes. Teradata knows more about indexed columns and
can use this information to choose an aggressive strategy to complete a join. Using
indexes in join statements improves performance and uses fewer resources, as the
diagram below illustrates.
Since only one row in the Department table matches department = 10,
which is specified by the AND condition in the join statement, the Teradata Optimizer will
choose to move that Department row into spool and duplicate it across all the
AMPs.
Once this is completed, that single record (10 and
SALES) is matched against the second table, which did not move from its base AMPs. Nested
Joins are great in an OLTP environment because they use both unique and non-unique
indexes. In addition, a nested join can reduce the resources necessary to complete the
join. Finally, nested joins, like all the join strategies discussed so far, must have an
equality condition in the ON clause (d.dept = e.dept).
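A sketch of a query of this shape. The d.dept = e.dept condition and the department = 10 restriction come from the description above. The table names, the DeptName column, and the assumption that DEPT is a Unique Primary Index on the Department table are illustrative.

```sql
-- Assumed layout: DeptTable     has UNIQUE PRIMARY INDEX (Dept),
--                 EmployeeTable has a secondary index on Dept.
-- The AND condition isolates the single Department row (Dept = 10);
-- that row is then matched to the Employee rows through the index on Dept.
EXPLAIN
SELECT d.Dept, d.DeptName, e.Emp   -- DeptName is a hypothetical column
FROM DeptTable d
INNER JOIN EmployeeTable e
   ON d.Dept = e.Dept
  AND d.Dept = 10;
```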
Hash Join
The Hash Join is part of the Merge Join family. Remember that a Merge Join is
based on an equality condition such as E.Dept = D.Dept in the ON clause of the join
statement. A Hash Join can only take place if one or both of the tables on each AMP can fit
completely inside the AMP’s memory.
Hash Join Strategy
In this example, the Hash Join has a join equality (ON) condition based on the EMP and
MGREMP columns. The key point here is that the columns do not have to have the same
name in a join operation. The column names can differ, but the values must come from the
same domain for the match to work. Both the EMP and MGREMP
columns hold the same type of information, so a join on these columns
will be successful. In addition, the EMP column is the Primary Index column of the
Employee table. However, the MGREMP column is not an index column in the Department
table. Based on this information, which rows will move?
Remember that the key to determining a Hash Join is if the SMALLER TABLE can be held
completely in each AMP’s MEMORY.
In the Hash Join process, the smaller table is sorted by row hash and duplicated on
every AMP. The key here is that the smaller table must fit completely in each
AMP’s memory. Teradata then uses the join column of the larger table to search for a
match. The row hash join is extremely efficient because it eliminates the sorting,
redistribution, or copying of the larger table into spool. In addition, the rows that are
duplicated into the AMP’s memory yield increased performance because they never go
into spool. Rows that go into spool always involve disk activity. AMP memory does
not, which increases performance. Hash Joins and
Nested Joins are both great in an OLTP environment for these reasons.
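A sketch of a join of this shape, using the EMP and MGREMP columns described above. The table names follow the EMPLOYEETABLE and DEPTTABLE names used elsewhere in this document and are an assumption here.

```sql
-- Assumed layout: EmployeeTable has PRIMARY INDEX (Emp);
--                 DeptTable.MgrEmp is not indexed.
-- The column names differ, but both hold the same kind of value.
-- If the smaller table fits in each AMP's memory, the plan may report a hash join.
EXPLAIN
SELECT e.Emp, d.Dept
FROM EmployeeTable e
INNER JOIN DeptTable d
   ON e.Emp = d.MgrEmp;
```

Whether the Optimizer actually picks a hash join depends on table sizes, available memory and statistics, so the EXPLAIN output is the place to confirm it.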
Exclusion Join
All of the joins that we have reviewed up to this point find matching rows
based on a join equality condition. They compare
rows from both tables and return the rows that match. In other words, these
joins are inclusive. With exclusion joins, the thought process has to be
reversed. Exclusion Joins have one primary function: they exclude rows during a join. The
best example here is when I was out with my best friend and his two sons (Blake and
Zach). They were telling a story that sounded believable about their dog Freddie who
allegedly chased a neighbor’s cat down the street. When they completed the story, I asked
them if this was true…they said in unison “NOT”! Well the NOT statement works exactly the
same in Teradata. When you put a NOT in front of a statement it will give you the opposite
answer.
As you can see in the above example, this type of join
uses the NOT IN operator. Exclusion joins are used for finding rows that don’t have a
matching row in the other table. Queries with the NOT IN operator are the type of query
that always results in an exclusion join. In this case, the query finds all the employees
who belong to department 10 and who are NOT managers.
These joins will always involve a Full Table Scan because Teradata needs to compare
every record to identify the rows that must be excluded. That said, this
type of join can be resource intensive if the two tables in the comparison are large.
In addition, the biggest problem with Exclusion Joins arises when the NOT IN operator is
used on columns that can contain NULLs. NULLs are treated as unknown, so if the NOT IN
list contains a NULL, the comparison is unknown for every row and the query returns no
rows. There are two ways to correct this:
- Define the NOT IN columns as NOT NULL in the CREATE TABLE.
- Add a “Column IS NOT NULL” condition (with WHERE or AND, as appropriate) to the end of
the join, as seen in the above example.
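A sketch of an exclusion query of the kind described, finding the employees in department 10 who are not managers. The table and column names are assumed from the earlier examples (EMP and DEPT in the Employee table, MGREMP in the Department table).

```sql
-- Assumed columns: EmployeeTable(Emp, Dept), DeptTable(Dept, MgrEmp).
-- Employees in department 10 who are NOT managers.
EXPLAIN
SELECT e.Emp
FROM EmployeeTable e
WHERE e.Dept = 10
  AND e.Emp NOT IN
      ( SELECT d.MgrEmp
        FROM DeptTable d
        WHERE d.MgrEmp IS NOT NULL );  -- guard against NULLs in the NOT IN list
```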
Product Joins
What Makes Product Joins Different
Product Joins compare every row of one table to every row of another table. They are called
product joins because the number of comparisons is the product of the number of rows in
table one and the number of rows in table two. For example, if one table had five rows and
the other table had five rows, then the Product Join would make 5 x 5 = 25 comparisons,
with a potential of 25 rows coming back.
SELECT E.EMP,D.DEPT
FROM EMPLOYEETABLE E,DEPTTABLE D
WHERE
EMP LIKE '_b%'
About 99% of the time, product joins are major mistakes. The recommendation is to avoid
these types of queries whenever possible, because all rows in both tables will
be compared. Remember, Teradata tables have the potential to contain millions of rows. If a
user accidentally writes a product join against two tables that have 1 million rows each,
the result set could contain one trillion rows (1,000,000,000,000)! Needless to say, this
is a mistake you do not want to make. So how do you avoid writing a product join?
To avoid a product join, check your syntax to ensure that the join is based on an EQUALITY
condition. In the join syntax example above, the only condition reads “WHERE EMP LIKE
‘_b%’”. Because this clause is not a condition on a common domain between the two
tables (e.g., e.dept = d.dept), the result is a product join. Another cause of a product
join is establishing an alias for a table and then not using it in the rest of the query.
Finally, check your join syntax to ensure the WHERE clause is not missing.
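A sketch of the product join example above, rewritten with an equality condition on the common DEPT column. It assumes EMPLOYEETABLE carries a DEPT column, as the e.dept = d.dept condition implies.

```sql
-- Same tables and filter as the product join example, plus a join condition.
SELECT E.EMP, D.DEPT
FROM EMPLOYEETABLE E, DEPTTABLE D
WHERE E.DEPT = D.DEPT        -- equality on the common column: no product join
  AND E.EMP LIKE '_b%';      -- the original filter, qualified with the alias
```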
When two or more tables are joined on a column or set of columns, the result is the data
from the matching records in both tables. This universal concept is the same in all
databases.
In Teradata, the Optimizer (a very smart interpreter) determines the type of join strategy
to use for a user's query, keeping performance in mind.
When a user submits a join query, the Optimizer comes up with join plans to perform the
join. These join strategies include:
- Merge Join
- Nested Join
- Hash Join
- Product join
- Exclusion Join
Merge Join
--------------------
A Merge Join requires that the rows to be joined are present on the same AMP. If they
are not, Teradata will either redistribute the data or duplicate the data in spool to make
that happen, based on the row hash of the columns involved in the join's WHERE clause.
If the two tables to be joined have the same Primary Index, then the matching records are
already on the same AMP and redistribution of records is not required.
There are four scenarios to consider for redistribution in a Merge Join:
Case 1: If the joining columns are UPI = UPI, the records to be joined are already on the
same AMP and redistribution is not required. This is the most efficient and fastest join
strategy.
Case 2: If the joining columns are UPI = non-index column, the records of the second table
have to be redistributed across the AMPs by the row hash of the join column, so that they
land on the AMPs that hold the matching rows of the first table.
Case 3: If the joining columns are non-index column = non-index column, both tables have
to be redistributed so that matching data lies on the same AMP, and the join then happens
on the redistributed data. This strategy is time-consuming since both tables are
completely redistributed across all the AMPs.
Case 4: For a join on the Primary Index, if the referenced table (the second table in the
join) is very small, then this table is duplicated (copied) onto every AMP.
Nested Join
-------------------
Nested Join is one of the most precise join plans suggested by the Optimizer. A Nested Join
uses a UPI/USI referenced in the join statement to retrieve a single row from the first
table. It then looks for one or more matching rows in the second table, using an index
(primary or secondary) on the join column, and returns the matching results.
Example:
SELECT EMP.Ename, DEP.Deptno, EMP.Salary
FROM EMPLOYEE EMP,
     DEPARTMENT DEP
WHERE EMP.Enum = DEP.Enum
  AND EMP.Enum = 2345; -- this results in a nested join
Hash join
----------------
Hash join is one of the plans suggested by the Optimizer based on the joining conditions.
Hash Join can be considered a close relative of Merge Join in its functionality. In a merge
join, the joining happens on the same AMP. In a Hash Join, one or both of the tables on an
AMP must fit completely inside the AMP's memory. The AMP holds the small table in its
memory and joins on the row hash.
1. Hash joins are faster than Merge joins since the large table doesn’t need to be sorted.
2. The join happens between a table in AMP memory and a table in unsorted spool, so it completes quickly.
Exclusion Join
-------------------------
These types of joins are suggested by the Optimizer when any of the following is used in a query:
- NOT IN
- EXCEPT
- MINUS
SELECT EMP.Ename, EMP.Salary  -- DEP columns cannot be selected here: DEP exists only inside the subquery
FROM EMPLOYEE EMP
WHERE EMP.Enum NOT IN
      ( SELECT DEP.Enum
        FROM DEPARTMENT DEP
        WHERE DEP.Enum IS NOT NULL );
Make sure to add the filter “<column> IS NOT NULL” to the subquery, since a
NULL in the NOT IN list causes the query to return no results.
Case 1: A match in the "NOT IN" subquery disqualifies the row.
Case 2: No match in the "NOT IN" subquery qualifies the row.
Case 3: Any unknown result in the "NOT IN" comparison disqualifies the row (a NULL is the
typical example of this scenario).