Query Optimization
Imagine you're a chef in a big kitchen preparing a delicious meal. You have a
recipe that involves using different ingredients and cooking techniques. Now,
let's say you have a list of tasks to complete to make the meal: chopping
vegetables, marinating meat, boiling pasta, and so on.
In the world of databases, like Snowflake, when you run a query (which is like
asking the database a question), it's like giving the kitchen chef a set of cooking
instructions. However, just like the chef can find more efficient ways to
complete cooking tasks to save time and effort, databases can also find smarter
ways to execute queries.
Query optimization in Snowflake is like the chef trying to cook the meal in the
fastest and most efficient way. The database looks at your query and figures
out the best path to retrieve and combine the data you want. It considers
things like which parts of the data to fetch first, how to sort and filter them,
and whether it can use special techniques to make things quicker.
For instance, if your query involves looking at data from different tables, the
database might choose to combine the tables in a particular way to avoid
unnecessary work. It's like the chef deciding to chop all the vegetables at once
instead of doing it separately for each dish.
The goal of query optimization is to minimize the time it takes for your query to
give you the results you want. Just like the chef aiming to serve a delicious
meal quickly, Snowflake's query optimization helps you get your data faster and
more efficiently, making your work smoother and more enjoyable.
Query optimization is the process of finding the most efficient way to run a
query, so that performance improves while system resources are used
sensibly, as judged by performance metrics.
The purpose of query tuning is to find a way to decrease the response time of
the query, prevent the excessive consumption of resources, and identify poor
query performance.
WHY QUERY OPTIMIZATION?
Reduce query run time: Some queries have an SLA that they need to meet.
Missing those targets may delay the delivery of data, which could
negatively impact the business. Query optimization can help improve the
performance of queries by finding the optimal way to access and
manipulate the data.
User Experience: If you're sharing data insights or reports with others, the
speed at which they receive the results matters. Optimized queries mean
quicker response times for your end-users, leading to a better overall
experience and higher user satisfaction.
Scalability: As your data grows, the performance of unoptimized queries can
degrade significantly. Query optimization ensures that your queries continue to
run efficiently as your dataset expands, maintaining consistent performance
levels.
Complex Analytics: Data analysts often deal with complex queries involving
joins, aggregations, and filtering. Optimizing these intricate queries becomes
essential to ensure that the results are accurate and generated in a timely
manner.
Before going deeper into the techniques for optimizing queries, first
understand the architecture of Snowflake.
In simple words:
Database Storage
Query Processing
Cloud Services
Database Storage Layer:
This is where the raw data is stored.
Snowflake stores data in a structured and compressed format, using a
combination of columnar storage and metadata kept for each micro-partition
(such as the range of values in each column), which improves query
performance.
The data is divided into micro-partitions, which are small, immutable units of
data that allow for efficient storage and querying.
Frequently Queried Columns: The columns being frequently queried are not
the table's clustering key columns. In Snowflake, the clustering key
determines how rows are grouped into micro-partitions; it is not the same
thing as a primary key. When queries predominantly use other columns for
filtering and searching, fewer micro-partitions can be skipped, which
reduces the efficiency of query execution.
Business users need fast response times for critical dashboards with
highly selective filters.
Data scientists who are exploring large data volumes and looking for
specific subsets of data.
NOTE:
To use search optimization, you must enable it either for specific
columns (or for fields within columns) or for the entire table.
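A minimal sketch of both options follows. The table name sales_table and the column product_category are placeholders chosen for illustration, and the statements assume an edition and role that permit search optimization (it is an Enterprise Edition feature).

```sql
-- Option 1: enable search optimization for the whole table
-- (sales_table is a hypothetical name).
ALTER TABLE sales_table ADD SEARCH OPTIMIZATION;

-- Option 2: enable it only for equality lookups on one column
-- (product_category is a hypothetical column).
ALTER TABLE sales_table ADD SEARCH OPTIMIZATION ON EQUALITY(product_category);

-- Inspect which search methods are configured on the table.
DESCRIBE SEARCH OPTIMIZATION ON sales_table;
```

Pick one of the two options rather than running both. The column-level form keeps the maintenance cost limited to the columns that actually appear in selective filters.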
For example, suppose you have a large table of sales data organized by
date, with each micro-partition containing sales data for a specific date
range. If you run a query that filters on a specific date range, Snowflake
can use partition pruning to skip all the micro-partitions that do not
contain any sales data for that date range, reducing the amount of data
that needs to be scanned.
HOW TO IMPLEMENT PARTITION PRUNING?
For example, if you have a table organized on the “date” column, you
can use the following query to scan only the partitions for the month
of January:
-- The upper bound is the first day of February, so January 31 is included.
SELECT *
FROM <TABLE_NAME>
WHERE date >= '2020-01-01' AND date < '2020-02-01';
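Pruning works best when rows with similar dates sit in the same micro-partitions. The sketch below assumes a hypothetical table sales_table with a date column. It checks how well the table is clustered on that column and then defines a clustering key.

```sql
-- Report clustering statistics for the date column (returns a JSON string).
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_table', '(date)');

-- Define a clustering key so Snowflake keeps rows organized by date.
ALTER TABLE sales_table CLUSTER BY (date);
```

Automatic reclustering consumes credits, so a clustering key is usually worth defining only on large tables that are filtered on the same column again and again.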
When a query requires sorting, you can use the ORDER BY clause to specify
the sort order. This can help Snowflake minimize data movement by
sorting the data on each node before sending it to the query execution
node.
SELECT *
FROM <TABLE_NAME>
ORDER BY name;
3. USE APPROPRIATE DATA TYPES
Choosing the right data type for your database columns is a critical
decision that affects storage efficiency, query performance, and data
integrity. Each database system provides a range of data types designed
to accommodate different kinds of data and to optimize storage and
processing.
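As an illustration only, the sketch below declares a hypothetical table orders_typed where each column uses a type that matches its content. The table and column names and the chosen precisions are assumptions, not part of the original material.

```sql
-- Hypothetical table; names and sizes are illustrative only.
CREATE TABLE orders_typed (
    order_id     NUMBER(38, 0),   -- whole-number identifier
    order_date   DATE,            -- a real date, not a 'YYYY-MM-DD' string
    amount       NUMBER(12, 2),   -- exact decimal, e.g. for currency
    is_returned  BOOLEAN          -- true/false flag instead of 'Y'/'N' text
);

-- Filtering on a DATE column compares dates directly.
SELECT order_id, amount
FROM orders_typed
WHERE order_date >= '2020-01-01'::DATE;
```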
For example, if you have a sales table with millions of rows, you can
create a materialized view that aggregates the data at the monthly level.
-- Creating a table
CREATE TABLE sales_table (
    date DATE,
    product_category VARCHAR,
    sales_amount FLOAT
);
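A monthly aggregate over this table could be sketched as below. The view name monthly_sales_mv and the output column names are hypothetical, and materialized views assume Enterprise Edition or higher.

```sql
-- Pre-aggregate sales by month and category (view name is hypothetical).
CREATE MATERIALIZED VIEW monthly_sales_mv AS
SELECT
    DATE_TRUNC('MONTH', date) AS sales_month,
    product_category,
    SUM(sales_amount)         AS total_sales
FROM sales_table
GROUP BY DATE_TRUNC('MONTH', date), product_category;

-- Reports can then read the small aggregate instead of the raw rows.
SELECT sales_month, product_category, total_sales
FROM monthly_sales_mv
WHERE sales_month = '2020-01-01'::DATE;
```

Snowflake keeps the view up to date as sales_table changes, which costs credits, so this tends to pay off when the aggregate is read far more often than the base table is modified.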
From this figure you can see that Finance, Data Science, Data Loading,
and Marketing each have a separate warehouse. This gives each team
fast and responsive queries, which ultimately improves
performance, scalability, and cost efficiency.
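A minimal sketch of such a setup follows. The warehouse names, sizes, and the 60-second suspend time are illustrative assumptions, not recommendations.

```sql
-- One warehouse per team; names and sizes are illustrative assumptions.
CREATE WAREHOUSE IF NOT EXISTS finance_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND   = 60       -- suspend after 60 seconds of inactivity
    AUTO_RESUME    = TRUE;    -- wake up when the next query arrives

CREATE WAREHOUSE IF NOT EXISTS data_loading_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND   = 60
    AUTO_RESUME    = TRUE;

-- Each session selects the warehouse that matches its workload.
USE WAREHOUSE finance_wh;
```

Because each warehouse has its own compute, a heavy load job on data_loading_wh does not slow down queries running on finance_wh.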
9. USE LIMIT ROWS OR TOP CLAUSE
The LIMIT or TOP clause also improves queries with an ORDER BY clause.
For example, the first query below returns in just over two minutes on an
X-SMALL warehouse, whereas the second query, which returns the top
ten entries, is twelve times faster, taking just six seconds to complete.
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS
order by o_totalprice desc;
select top 10 *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS
order by o_totalprice desc;
10. BREAKING DOWN COMPLEX JOIN OPERATIONS
JOIN operations enable you to combine data from multiple related tables
into a single result set. Instead of retrieving data from each table
separately and then manually combining them, JOINs allow you to get
the desired data in a single query.
This significantly reduces the amount of data transferred over the
network and minimizes the need for additional processing on the client
side, making querying more reliable.
Here we can see that the process is too complex, and this directly
impacts query performance.
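One way to break a large join into smaller steps is to reduce each input first and join the reduced results afterwards. The sketch below uses the TPC-H sample tables that ship with Snowflake. The date filter and the choice of columns are arbitrary assumptions made for illustration.

```sql
WITH recent_orders AS (
    -- Step 1: cut ORDERS down to the rows and columns that are needed.
    SELECT o_orderkey, o_custkey, o_totalprice
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS
    WHERE o_orderdate >= '1998-01-01'::DATE
),
order_revenue AS (
    -- Step 2: join LINEITEM to the reduced order set, one row per order.
    SELECT l.l_orderkey, SUM(l.l_extendedprice) AS revenue
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.LINEITEM l
    JOIN recent_orders r ON r.o_orderkey = l.l_orderkey
    GROUP BY l.l_orderkey
)
-- Step 3: join the small intermediate results to CUSTOMER.
SELECT c.c_name, r.o_orderkey, r.o_totalprice, v.revenue
FROM recent_orders r
JOIN order_revenue v ON v.l_orderkey = r.o_orderkey
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER c ON c.c_custkey = r.o_custkey;
```

Whether the staged form is actually faster depends on the data and on what the optimizer already does, so it is worth comparing both versions in the query profile.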
Once you have run a query, open the query profile view and look for
the following red flags:
1. Analyze the query profile to identify the slowest stages of the query
plan. Look for stages with high “elapsed time” or “execution time”
values. These stages are likely to be the bottlenecks in your query.
2. Once you have identified the slowest stages, you can try to optimize
them. Here are some tips:
o If a stage involves a large number of rows, consider adding a
filter to reduce the number of rows processed.
o If a stage involves a large amount of data movement, consider
using a more efficient join strategy or partitioning the data.
o If a stage involves a complex computation, consider simplifying
the computation or using a more efficient algorithm.
o Look for long-running or high-cost operations in the query
plan. These are likely areas where the query is spending a lot of
time or scanning a large amount of data.
o Examine the number of rows processed by each step in the
query plan. This can help you identify areas where the query is
performing unnecessary processing.
o Look for opportunities to minimize data movement by using
appropriate sorting and partitioning.
o Review the query code to identify areas for optimization, such
as using appropriate data types, and minimizing the use of
subqueries and joins.
3. After you have made changes to your query, rerun it to see if the
changes have improved the query performance.
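The same per-operator figures shown in the query profile can also be read with SQL. The sketch below assumes the query to inspect was the last one run in the current session. The output column aliases are hypothetical names.

```sql
-- Run the query you want to inspect first, then in the same session:
SELECT
    operator_id,
    operator_type,
    execution_time_breakdown:overall_percentage::FLOAT AS pct_of_query_time,
    operator_statistics:output_rows::NUMBER            AS output_rows
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()))
ORDER BY pct_of_query_time DESC NULLS LAST;
```

The operators at the top of the result account for the largest share of execution time and are the first candidates for the tips listed above.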
12. AVOID USING SELECT *
Avoid using SELECT * when you only require data from a few
columns.
SELECT * usually reads every column of the table.
So when you only require data from a few columns, it is better
to specify the column names instead of using SELECT *.
When the columns are named, only the specified columns are
read, which is more optimized.
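As a small illustration using the same sample table as the LIMIT example, compare the two queries below. The three columns picked here are an arbitrary choice.

```sql
-- Reads every column of the table.
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS;

-- Reads only the three columns named here.
select o_orderkey, o_orderdate, o_totalprice
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.ORDERS;
```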
Keep Learning 😊