Redshift Best Practices
Shaheer Anjum
Senior Data Engineer
Data Platforms
AGENDA
● Overview of AWS Redshift
○ What is AWS Redshift
○ AWS Redshift Architecture
DATA DISTRIBUTION STRATEGIES:
● Key Distribution:
○ Key distribution is achieved by selecting a single column as the distribution key.
The distribution key determines how rows are distributed across the compute nodes.
○ Rows with the same distribution key value are hashed to the same node slice, which
makes queries on that key efficient. This can improve performance for join
operations involving the distribution key.
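As a minimal sketch of KEY distribution, the DDL below distributes two tables on the column they are joined on, so that matching rows land on the same slice. The table and column names (sales_demo, listing_demo, listid, and so on) are hypothetical, loosely modeled on the TICKIT sample schema, and are not taken from the slides.

```sql
-- Hypothetical fact table: rows are hashed on listid.
create table sales_demo (
    salesid   integer   not null,
    listid    integer   not null,
    sellerid  integer   not null,
    qtysold   smallint  not null,
    saletime  timestamp
)
diststyle key
distkey (listid);

-- Hypothetical table that is joined to sales_demo on the same column.
create table listing_demo (
    listid    integer   not null,
    sellerid  integer   not null,
    listtime  timestamp
)
diststyle key
distkey (listid);

-- Both tables share the distribution key, so rows that join with each
-- other are already collocated and need not move between nodes.
select l.sellerid, sum(s.qtysold) as total_sold
from sales_demo s
join listing_demo l on s.listid = l.listid
group by l.sellerid;
```

Running EXPLAIN on the join is one way to check the effect: a DS_DIST_NONE label on the join step indicates that no redistribution was required.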
● Even Distribution:
○ Even distribution (or distribution style EVEN) spreads rows across all nodes in
round-robin fashion, without hashing on any particular column.
○ Redshift automatically divides the data evenly across the compute nodes, providing a
balanced workload distribution.
○ Even distribution is useful when there is no clear natural key for distribution, or when
the workload is evenly spread across the entire dataset.
● ALL Distribution:
○ All distribution (or distribution style ALL) involves replicating the entire table on each
node in the cluster.
○ Each compute node holds a full copy of the table, eliminating the need for inter-node
data movement during query execution.
○ ALL distribution is advantageous for small dimension tables that are joined very
frequently and take up minimal storage.
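The EVEN and ALL styles described above can be sketched in DDL as follows. Both tables are hypothetical illustrations, and their names and columns are assumptions rather than part of the original deck.

```sql
-- Hypothetical staging table with no natural join column:
-- rows are spread across the slices regardless of column values.
create table clickstream_staging (
    event_id    bigint,
    event_time  timestamp,
    payload     varchar(1024)
)
diststyle even;

-- Hypothetical small dimension table: every node keeps a full copy,
-- so joins against it need no inter-node data movement.
create table date_dim (
    dateid   smallint  not null,
    caldate  date      not null,
    holiday  boolean
)
diststyle all;
```

Because an ALL table is stored once per node, it is best kept to small tables that change rarely.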
● Auto Distribution:
○ With AUTO distribution, Amazon Redshift assigns an optimal distribution style based
on the size of the table data.
○ For example, if the AUTO distribution style is specified, Amazon Redshift initially
assigns the ALL distribution style to a small table.
○ When the table grows larger, Amazon Redshift might change the distribution style to
KEY, choosing the primary key (or a column of the composite primary key) as the
distribution key.
○ If the table grows larger and none of the columns are suitable to be the distribution
key, Amazon Redshift changes the distribution style to EVEN. The change in
distribution style occurs in the background with minimal impact to user queries.
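A short sketch of AUTO distribution, assuming a hypothetical orders_demo table: the DDL leaves the choice of style to Redshift, and the SVV_TABLE_INFO system view can be queried later to see which style is currently in effect.

```sql
-- Hypothetical table; the primary key is informational in Redshift
-- but gives AUTO a candidate column if it later switches to KEY.
create table orders_demo (
    order_id     integer   not null,
    customer_id  integer   not null,
    order_ts     timestamp,
    primary key (order_id)
)
diststyle auto;

-- Inspect the style Redshift has currently assigned.
-- Note: svv_table_info returns no row for an empty table.
select "table", diststyle, tbl_rows
from svv_table_info
where "table" = 'orders_demo';
```

For a small AUTO table the diststyle column typically reads AUTO(ALL), and the reported value changes if Redshift later moves the table to KEY or EVEN.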
REDSHIFT SORT KEYS:
● A table can define multiple columns as sort keys. Rows are stored in the table
in sort key order, and the query optimizer uses that order when
determining optimal query plans.
● Amazon Redshift supports two kinds of Sort Keys.
○ Compound Sort Keys
○ Interleaved Sort Keys
● COMPOUND SORT KEYS:
○ A compound sort key is made up of all the columns listed in the sort key
definition at table creation, in the order they are listed. It is therefore
advisable to put the most frequently used column first in the list.
COMPOUND is the default sort type. Compound sort keys might speed up joins,
GROUP BY and ORDER BY operations, and window functions that use PARTITION BY.
○ Use a compound sort key when the sort key has more than one column, when
queries include JOIN, GROUP BY, ORDER BY, or PARTITION BY, and when the table
is small.
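A compound sort key can be declared as in the sketch below. The table sales_sorted_demo and its columns are hypothetical, and the date literal is an arbitrary illustration.

```sql
-- Hypothetical table sorted first by saletime, then by listid.
-- Column order matters: saletime is the leading sort column.
create table sales_sorted_demo (
    salesid   integer   not null,
    listid    integer   not null,
    saletime  timestamp not null,
    qtysold   smallint
)
compound sortkey (saletime, listid);

-- A range filter on the leading sort column lets Redshift skip
-- blocks whose saletime range falls outside the predicate.
select listid, sum(qtysold) as total_sold
from sales_sorted_demo
where saletime >= '2008-12-01'
group by listid;
```

A filter only on listid would benefit much less, since listid is not the leading column of the compound key.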
AMAZON REDSHIFT BEST PRACTICES FOR WRITING QUERIES:
○ Add predicates to filter tables that participate in joins, even if the predicates apply the
same filters. The query returns the same result set, but Amazon Redshift is able to
filter the join tables before the scan step and can then efficiently skip scanning blocks
from those tables. Redundant filters aren't needed if you filter on a column that's
used in the join condition.
○ For example, suppose that you want to join SALES and LISTING to find ticket sales for
tickets listed after December, grouped by seller. Both tables are sorted by date. The
following query joins the tables on their common key and filters for listing.listtime
values greater than December 1.
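A query matching that description could look like the sketch below. It assumes the TICKIT sample schema: the join column listid, the columns sellerid and qtysold, and the year in the date literal are assumptions, since the text names only listing.listtime and sales.saletime.

```sql
-- Join SALES and LISTING on their common key (assumed to be listid)
-- and keep only listings created after December 1.
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.listid = listing.listid
  and listing.listtime > '2008-12-01'
group by 1
order by 1;
```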
○ The WHERE clause doesn't include a predicate for sales.saletime, so the execution
engine is forced to scan the entire SALES table. If you know the filter would result in
fewer rows participating in the join, then add that filter as well. The following example
cuts execution time significantly.
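A sketch of the query with the additional predicate on sales.saletime follows. As before, the TICKIT-style join column listid, the columns sellerid and qtysold, and the year in the date literals are assumptions.

```sql
-- Same join, with a matching filter on SALES so that blocks of the
-- sorted SALES table outside the date range can be skipped.
select listing.sellerid, sum(sales.qtysold)
from sales, listing
where sales.listid = listing.listid
  and listing.listtime > '2008-12-01'
  and sales.saletime > '2008-12-01'
group by 1
order by 1;
```

The extra predicate is only safe to add when it cannot change the result, here because a ticket is not sold before it is listed.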
○ Use sort keys in the GROUP BY clause so the query planner can use more efficient
aggregation.
○ If you use both GROUP BY and ORDER BY clauses, make sure that you put the
columns in the same order in both. That is, use the following approach.
■ group by a, b, c
■ order by a, b, c;
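Both recommendations can be combined in one sketch: a table whose sort key leads with the grouping columns, and a query that lists those columns in the same order in GROUP BY and ORDER BY. The table sales_agg_demo and its columns are hypothetical.

```sql
-- Hypothetical table: the sort key columns are the grouping columns,
-- and the first of them is also the distribution key.
create table sales_agg_demo (
    sellerid  integer  not null,
    eventid   integer  not null,
    dateid    smallint not null,
    qtysold   smallint
)
diststyle key
distkey (sellerid)
compound sortkey (sellerid, eventid, dateid);

-- GROUP BY and ORDER BY list the columns in the same order,
-- which is also the sort key order.
select sellerid, eventid, dateid, sum(qtysold) as total_sold
from sales_agg_demo
group by sellerid, eventid, dateid
order by sellerid, eventid, dateid;
```

Running EXPLAIN on the query shows which aggregation step the planner chose, which is a reasonable way to check whether the sort key is being used.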