
Redshift Best Practices

The document provides an overview of AWS Redshift, detailing its architecture and key features such as columnar storage and massively parallel processing. It discusses best practices for sorting and distribution styles, query writing, and performance optimization, emphasizing the importance of selecting appropriate sort keys and distribution strategies. Additionally, it outlines specific recommendations for writing efficient queries to maximize performance.

Uploaded by

Saad Durrani
Copyright
© All Rights Reserved

Optimizing Performance:

Best Practices in Redshift

Shaheer Anjum
Senior Data Engineer
Data Platforms
AGENDA
● Overview of AWS Redshift
○ What is AWS Redshift
○ AWS Redshift Architecture

● Understanding the Sorting & Distribution Styles
○ Best Distribution Styles
○ Best Sort Keys
● Query Writing Best Practices & Performance Optimization
○ How to write an optimized query
○ How to reduce execution time
● Code Refactoring
○ Converting SQL Server queries to Redshift
AWS REDSHIFT OVERVIEW:
● Amazon Redshift is a fully managed, petabyte-scale data warehouse service on
the AWS cloud, built for high-performance analysis.
● Key Features:
○ Columnar storage: stores data in columns rather than rows for faster query
performance.
○ Massively Parallel Processing (MPP): distributes queries across multiple nodes for
parallel execution.
○ Easily scales up or down based on workload and data volume.
○ Ideal for data warehousing, analytics, and business intelligence.
● Architecture:
○ Leader Node: Coordinates queries, manages query optimization, and distributes
queries to compute nodes.
○ Compute Nodes: Store and process data in parallel, providing scalable storage and
processing power.
DATA DISTRIBUTION STRATEGIES:
● Amazon Redshift supports four data distribution styles:
○ Key Distribution
○ Even Distribution
○ All Distribution
○ AUTO Distribution

● Key Distribution:
○ Key distribution is achieved by selecting a single column as the distribution key.
The distribution key determines how rows are distributed across compute nodes.
○ Rows with the same distribution key value are hashed to the same node slice, so
matching rows from joined tables are stored together. This can improve performance
for join operations on the distribution key.
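
As a minimal sketch of key distribution, the statement below declares a distribution key when creating a table. The table and column names (`sales_fact`, `customer_id`, and so on) are hypothetical and chosen only for illustration.

```sql
-- Hypothetical fact table; all names are illustrative assumptions.
-- Rows with the same customer_id value are hashed to the same slice,
-- so joins on customer_id can avoid moving this table's rows between nodes.
CREATE TABLE sales_fact (
    sale_id      BIGINT        NOT NULL,
    customer_id  INTEGER       NOT NULL,
    sale_date    DATE          NOT NULL,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);
```

A column with many distinct, evenly spread values that is also used in joins is usually the safer choice here, since a skewed key concentrates rows on a few slices.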
● Even Distribution:
○ Even distribution (distribution style EVEN) spreads rows across all node slices in a
round-robin fashion, without relying on a specific column for hashing.
○ Because rows are divided evenly, the workload is balanced across the compute nodes.
○ Even distribution is useful when there is no clear natural key for distribution, or when
the workload is evenly spread across the entire dataset.
● ALL Distribution:
○ All distribution (distribution style ALL) replicates the entire table on every
node in the cluster.
○ Each compute node holds a full copy of the table, eliminating the need for inter-node
data movement during query execution.
○ All distribution is advantageous for small dimension tables that are joined very
frequently and take up little storage.
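
The two styles above might be declared as in the following sketch. Both tables (`clickstream_staging`, `country_dim`) are hypothetical examples, not tables from the deck.

```sql
-- Hypothetical staging table with no natural join key:
-- rows are spread across slices without hashing any column.
CREATE TABLE clickstream_staging (
    event_id    BIGINT,
    event_time  TIMESTAMP,
    payload     VARCHAR(1000)
)
DISTSTYLE EVEN;

-- Hypothetical small dimension table that is joined frequently:
-- a full copy is kept on every node.
CREATE TABLE country_dim (
    country_code  CHAR(2)      NOT NULL,
    country_name  VARCHAR(100)
)
DISTSTYLE ALL;
```

Because ALL multiplies the storage and load cost by the number of nodes, it tends to suit only tables that are both small and slowly changing.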
● Auto Distribution:
○ With AUTO distribution, Amazon Redshift assigns an optimal distribution style based
on the size of the table data.
○ For example, if the AUTO distribution style is specified, Amazon Redshift initially
assigns the ALL distribution style to a small table.
○ When the table grows larger, Amazon Redshift might change the distribution style to
KEY, choosing the primary key (or a column of the composite primary key) as the
distribution key.
○ If the table grows larger and none of the columns are suitable to be the distribution
key, Amazon Redshift changes the distribution style to EVEN. The change in
distribution style occurs in the background with minimal impact to user queries.
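
A short sketch of AUTO distribution, assuming a hypothetical `orders_auto` table. The second statement reads the style Redshift is currently using from the `SVV_TABLE_INFO` system view.

```sql
-- Hypothetical table; Redshift chooses and later adjusts the distribution style.
CREATE TABLE orders_auto (
    order_id     BIGINT   NOT NULL,
    customer_id  INTEGER,
    order_date   DATE,
    PRIMARY KEY (order_id)
)
DISTSTYLE AUTO;

-- The diststyle column reports values such as AUTO(ALL), AUTO(EVEN),
-- or AUTO(KEY(column_name)). Empty tables are not listed in this view.
SELECT "table", diststyle
FROM svv_table_info
WHERE "table" = 'orders_auto';
```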
REDSHIFT SORT KEYS:
● One or more columns can be defined as sort keys. Redshift stores the table's rows on
disk in the order of these columns, and the query optimizer uses this sort order
when determining optimal query plans.
● Amazon Redshift supports two kinds of Sort Keys.
○ Compound Sort Keys
○ Interleaved Sort Keys
● COMPOUND SORT KEYS:
○ A compound sort key is made up of all the columns listed in the sort key
definition when the table is created, in the order they are listed. It is therefore
advisable to put the most frequently used column first in the list.
COMPOUND is the default sort type. Compound sort keys might speed up joins,
GROUP BY and ORDER BY operations, and window functions that use PARTITION BY.

● INTERLEAVED SORT KEYS:
○ An interleaved sort gives equal weight to each column in the sort key. As a
result, it can significantly improve query performance when queries use
restrictive predicates (such as an equality operator in the WHERE clause) on
secondary sort columns.
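
The two sort key types might be declared as follows. This is a sketch only: the tables (`events_compound`, `events_interleaved`) and their columns are hypothetical.

```sql
-- Compound sort key: rows are ordered by event_date first, then by region.
-- Filters on event_date (or on event_date and region) benefit the most.
CREATE TABLE events_compound (
    event_date  DATE         NOT NULL,
    region      VARCHAR(32),
    user_id     INTEGER
)
COMPOUND SORTKEY (event_date, region);

-- Interleaved sort key: event_type and region carry equal weight,
-- so a filter on either column alone can still skip blocks.
CREATE TABLE events_interleaved (
    event_type  VARCHAR(32),
    region      VARCHAR(32),
    user_id     INTEGER
)
INTERLEAVED SORTKEY (event_type, region);
```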
● Selecting the right kind requires knowledge of the queries that will run.
○ Use an interleaved sort key when different queries filter on different sort key
columns, when WHERE clauses have highly selective restrictive predicates, or when
the tables are very large.
○ Use a compound sort key when you have more than one column in the sort key and
your queries include JOIN, GROUP BY, ORDER BY, or PARTITION BY operations, or when
the table is small.
○ Don't use an interleaved sort key on columns with monotonically increasing
attributes, such as identity columns, dates, or timestamps.
DIST KEY EXAMPLES:
● Look at the schema of the USERS table in the TICKIT database. USERID is defined as
the SORTKEY column and the DISTKEY column:
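
One way to inspect that schema, assuming the TICKIT sample data is loaded and its schema is on the `search_path`, is to query the `PG_TABLE_DEF` catalog table:

```sql
-- Lists each column of USERS with its type, encoding, and key flags.
-- PG_TABLE_DEF only returns tables in schemas that are on the search_path.
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'users';
```

In the result, the `userid` row should show `distkey` as true and a nonzero `sortkey` position, matching the description above.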
● USERID is a good choice for the distribution column in this table. If you query the
SVV_DISKUSAGE system view, you can see that the table is very evenly distributed.
Column numbers are zero-based, so USERID is column 0.
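
A query along the following lines can be used to check the distribution. It is a sketch that assumes the TICKIT `users` table exists and that the session has superuser access, since `SVV_DISKUSAGE` is visible only to superusers.

```sql
-- One row per slice for column 0 (USERID): similar row counts on each
-- slice indicate that the table is evenly distributed.
SELECT slice, col, num_values AS row_count, minvalue, maxvalue
FROM svv_diskusage
WHERE name = 'users'
  AND col = 0
  AND num_values > 0
ORDER BY slice, col;
```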
● CREATE [ [ LOCAL ] { TEMPORARY | TEMP } ] TABLE [ IF NOT EXISTS ] table_name (
    { column_name data_type [ column_attributes ] [ column_constraints ]
    | table_constraints
    | LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ] }
    [, ... ] )
  [ BACKUP { YES | NO } ]
  [ table_attributes ]

  and table_attributes are:
    [ DISTSTYLE { AUTO | EVEN | KEY | ALL } ]
    [ DISTKEY ( column_name ) ]
    [ [ COMPOUND | INTERLEAVED ] SORTKEY ( column_name [, ...] ) | [ SORTKEY AUTO ] ]
    [ ENCODE AUTO ]

● ALTER TABLE tablename ALTER DISTSTYLE ALL, ALTER SORTKEY (column_list);
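
As a concrete sketch of the ALTER TABLE form, the statements below change the distribution style and the sort key of a hypothetical `product_dim` table. The table and column names are assumptions, and the two changes are issued as separate statements here for clarity.

```sql
-- Replicate the (hypothetical) dimension table to every node.
ALTER TABLE product_dim ALTER DISTSTYLE ALL;

-- Replace the existing sort key with a new compound sort key.
ALTER TABLE product_dim ALTER SORTKEY (category, product_name);
```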


AMAZON REDSHIFT BEST PRACTICES FOR WRITING QUERIES:

● To maximize query performance, follow these recommendations when creating queries:
○ Avoid using select *. Include only the columns you specifically need.
○ Use a CASE conditional expression to perform complex aggregations instead of
selecting from the same table multiple times.
○ Don't use cross-joins unless necessary. Cross joins without a join condition result in
the Cartesian product of two tables. Cross-joins are typically run as nested-loop joins,
which are the slowest of the possible join types.
○ Use subqueries in cases where one table in the query is used only for predicate
conditions and the subquery returns a small number of rows (less than about 200).
The following example uses a subquery to avoid joining the LISTING table.
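
The CASE and subquery recommendations above can be sketched as follows against the TICKIT sample tables SALES and LISTING. The column aliases and date literal are illustrative, and the subquery form follows the pattern in the Amazon Redshift documentation.

```sql
-- CASE aggregation: one scan of SALES produces both totals,
-- instead of selecting from the table once per condition.
SELECT
    SUM(CASE WHEN qtysold = 1 THEN pricepaid ELSE 0 END) AS single_ticket_revenue,
    SUM(CASE WHEN qtysold > 1 THEN pricepaid ELSE 0 END) AS multi_ticket_revenue
FROM sales;

-- Subquery: LISTING is used only for the predicate, so it is not joined.
SELECT SUM(sales.qtysold)
FROM sales
WHERE salesid IN (SELECT listid FROM listing WHERE listtime > '2008-12-26');
```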

○ Join larger tables first.
○ Use predicates to restrict the dataset as much as possible.
○ In the predicate, use the least expensive operators that you can.
■ Comparison operators (=, <>, <, >) are preferable to the LIKE operator.
■ LIKE operators are still better than SIMILAR TO.
○ Avoid using functions in query predicates. Using them can drive up the cost of the
query by requiring large numbers of rows to resolve the intermediate steps of the
query.
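
To illustrate the point about functions in predicates, the sketch below filters the TICKIT SALES table for a single day in two ways. The date is arbitrary, and the two queries are intended to select the same rows.

```sql
-- Less efficient: the function must be evaluated for every row
-- before the comparison can be applied.
SELECT COUNT(*)
FROM sales
WHERE DATE_TRUNC('day', saletime) = '2008-12-01';

-- Preferable: plain comparison operators on the raw column.
SELECT COUNT(*)
FROM sales
WHERE saletime >= '2008-12-01'
  AND saletime <  '2008-12-02';
```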

○ Add predicates to filter tables that participate in joins, even if the predicates apply the
same filters. The query returns the same result set, but Amazon Redshift is able to
filter the join tables before the scan step and can then efficiently skip scanning blocks
from those tables. Redundant filters aren't needed if you filter on a column that's
used in the join condition.

○ For example, suppose that you want to join SALES and LISTING to find ticket sales for
tickets listed after December, grouped by seller. Both tables are sorted by date. The
following query joins the tables on their common key and filters for listing.listtime
values greater than December 1.
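
A query of that shape, sketched against the TICKIT sample schema and following the form used in the Amazon Redshift documentation, might look like this (the year in the date literal is illustrative):

```sql
-- Joins SALES and LISTING on their common key and filters only on LISTING.
SELECT listing.sellerid, SUM(sales.qtysold)
FROM sales, listing
WHERE sales.salesid = listing.listid
  AND listing.listtime > '2008-12-01'
GROUP BY 1
ORDER BY 1;
```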

○ The WHERE clause doesn't include a predicate for sales.saletime, so the execution
engine is forced to scan the entire SALES table. If you know the filter would result in
fewer rows participating in the join, then add that filter as well. The following example
cuts execution time significantly.
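
A sketch of the query with the redundant filter added, again using the TICKIT sample schema and an illustrative date literal:

```sql
-- The extra predicate on sales.saletime lets Redshift skip SALES blocks
-- that fall outside the date range, because SALES is sorted by date.
SELECT listing.sellerid, SUM(sales.qtysold)
FROM sales, listing
WHERE sales.salesid = listing.listid
  AND listing.listtime > '2008-12-01'
  AND sales.saletime > '2008-12-01'
GROUP BY 1
ORDER BY 1;
```

The added filter is only safe when it cannot exclude rows that the join would otherwise return, so it relies on knowing how the two date columns relate.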

○ Use sort keys in the GROUP BY clause so the query planner can use more efficient
aggregation.
○ If you use both GROUP BY and ORDER BY clauses, make sure that you put the
columns in the same order in both. That is, use the following approach.
■ group by a, b, c
■ order by a, b, c
○ Don't use the following approach.
■ group by b, c, a
■ order by a, b, c
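
Applied to a full query, the matching-order rule might look like the sketch below. It uses columns from the TICKIT SALES table, and the choice of grouping columns is only an illustration.

```sql
-- GROUP BY and ORDER BY list the same columns in the same order.
SELECT sellerid, eventid, dateid, SUM(qtysold) AS tickets_sold
FROM sales
GROUP BY sellerid, eventid, dateid
ORDER BY sellerid, eventid, dateid;
```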
