BigQuery

Legacy vs Standard SQL

Legacy SQL – tables are referenced with square brackets, and UDFs are available in the web console. Project and dataset are separated with ":" (e.g. [project:dataset.table]).

Standard SQL – table names are quoted with backticks, and the separator is ".". It does not support TABLE_DATE_RANGE and TABLE_QUERY, but these can be replaced with a table wildcard plus _TABLE_SUFFIX. It supports querying nested and repeated data.
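A minimal sketch of the difference, using placeholder project/dataset/table names (`myproject.mydataset.events`) that stand in for your own:

```sql
-- Legacy SQL: square brackets, ":" between project and dataset
SELECT COUNT(*) FROM [myproject:mydataset.events20240101];

-- Legacy SQL: TABLE_DATE_RANGE over daily sharded tables
SELECT COUNT(*)
FROM TABLE_DATE_RANGE([myproject:mydataset.events],
                      TIMESTAMP('2024-01-01'), TIMESTAMP('2024-01-31'));

-- Standard SQL: backticks, "." as separator; the wildcard + _TABLE_SUFFIX
-- filter covers the same date range without TABLE_DATE_RANGE
SELECT COUNT(*)
FROM `myproject.mydataset.events*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131';
```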

Standard SQL advantages:

▪ Composability using WITH clauses and SQL functions.

▪ Subqueries in the SELECT list and WHERE clause.

▪ Correlated subqueries

▪ ARRAY and STRUCT data types (legacy SQL had REPEATED and RECORD data types)

▪ Inserts, updates, and deletes (DML)

▪ COUNT(DISTINCT <expr>) is exact and scalable, providing the accuracy of EXACT_COUNT_DISTINCT without its limitations

▪ Automatic predicate push-down through JOINs

▪ Complex JOIN predicates, including arbitrary expressions

▪ Table wildcards and _TABLE_SUFFIX

▪ Stricter timestamp checking
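Several of these advantages can be shown in one query. This is an illustrative sketch only; the table and column names (`myproject.mydataset.events`, `user_id`, `event_ts`) are placeholders:

```sql
-- WITH clause for composability
WITH daily AS (
  SELECT user_id, DATE(event_ts) AS d, COUNT(*) AS n
  FROM `myproject.mydataset.events`
  GROUP BY user_id, d
)
SELECT user_id,
       -- ARRAY of STRUCTs: each user's daily counts as nested, repeated data
       ARRAY_AGG(STRUCT(d, n) ORDER BY d) AS history,
       -- exact, scalable distinct count in Standard SQL
       COUNT(DISTINCT d) AS active_days
FROM daily
GROUP BY user_id;
```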

Best practices / Performance

▪ Avoid self-joins; use window functions instead

▪ If data is skewed (some partitions are much larger than others), filter as early as possible. Use APPROX_TOP_COUNT to detect skew

▪ Avoid joins that produce more output rows than input rows

▪ Avoid point-specific DML; batch DML statements instead

▪ Sub-queries are often more efficient than joins

▪ Select only the columns that are needed

▪ Filter with a WHERE clause so that as few rows as possible are processed

▪ With multiple joins, do the biggest join first. The left side of a join should be the bigger table

▪ Low-cardinality GROUP BYs are faster. Low cardinality means that the column contains a lot of repeated values in its data range

▪ LIMIT doesn't affect cost, as it only controls how many rows are displayed; the full scan still happens

▪ Built-in functions are faster than JavaScript UDFs


▪ Exact functions are slower than approximate built-in functions; use an approximate built-in if possible. For example, instead of COUNT(DISTINCT), use APPROX_COUNT_DISTINCT()

▪ Order only in the outermost query, not in inner queries. The outer query is performed last, so put complex operations at the end, after all filtering is done

▪ Wildcards – be as specific as possible; a longer table prefix matches fewer tables

▪ Performance – query time is split between stages; this can also be seen in Stackdriver

▪ Each stage has four phases: wait, read, compute, write

▪ Tail skew – the maximum time spent in a stage is significantly more than the average, because some partitions are much bigger than others. Tail skew can be detected using an approximate aggregate function such as APPROX_TOP_COUNT

▪ To avoid tail skew, filter as early as possible

▪ Batch loading is free; streaming has a cost. Unless data is needed in real time, use batch loading when possible

▪ Denormalize when possible, but still use STRUCTs and ARRAYs to keep related data nested

▪ External data sources are slow; use them only when needed

▪ Monitor query performance using the "Details" page, which shows whether there is read, compute, or write latency. The query plan shows the different stages and the breakdown of time between the different activities within each stage
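Two of the practices above sketched as queries. All table and column names (`myproject.mydataset.daily_counts`, `user_id`, `country`) are placeholders, not real datasets:

```sql
-- Window function instead of a self-join: compare each row with the
-- previous day's value in a single scan, no join needed
SELECT d, n,
       LAG(n) OVER (ORDER BY d) AS prev_n
FROM `myproject.mydataset.daily_counts`;

-- Approximate aggregates: cheaper than their exact equivalents, and
-- APPROX_TOP_COUNT also reveals which keys dominate (i.e. skew)
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users,
       APPROX_TOP_COUNT(country, 5) AS top_countries
FROM `myproject.mydataset.events`;
```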
