BigQuery
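Legacy SQL – table names are enclosed in square brackets []; UDFs are available in the web
console. Table names use “:” as the separator after the project, e.g. [project:dataset.table].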
Standard SQL – table names are enclosed in backticks and the separator is “.”, e.g.
`project.dataset.table`. Does not support TABLE_DATE_RANGE and TABLE_QUERY; the same effect is
achieved with a wildcard table and _TABLE_SUFFIX. Supports querying nested and repeated data,
as well as:
▪ Correlated subqueries
▪ ARRAY and STRUCT data types (legacy had repeated and record data types)
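The wildcard replacement for TABLE_DATE_RANGE can be sketched as follows. This is a minimal Python sketch that only builds the query text and does not call BigQuery. The helper name wildcard_query and the table prefix my-project.logs.events_ are hypothetical, and it assumes date-sharded tables whose names end in a YYYYMMDD suffix.

```python
from datetime import date


def wildcard_query(table_prefix: str, start: date, end: date) -> str:
    """Build a standard SQL query over date-sharded tables.

    table_prefix is a hypothetical fully qualified prefix such as
    "my-project.logs.events_"; shards are assumed to end in YYYYMMDD.
    """
    if start > end:
        raise ValueError("start must not be after end")
    # Backticks quote the table name; the trailing "*" makes it a wildcard table.
    # _TABLE_SUFFIX holds whatever the "*" matched, so the BETWEEN filter
    # restricts the scan to the shards inside the date range.
    return (
        f"SELECT *\n"
        f"FROM `{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )


query = wildcard_query("my-project.logs.events_", date(2017, 1, 1), date(2017, 1, 31))
print(query)
```

The suffix comparison is a plain string comparison, which works here because YYYYMMDD strings sort in date order.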
Best practices/Performance
▪ If data is skewed, i.e. some partitions are much larger than others, filter early. Use
APPROX_TOP_COUNT to determine skew
▪ With joins, do the bigger joins first. The left side of the join should be the bigger table
▪ Low-cardinality GROUP BYs are faster. Low cardinality means that the column has few distinct
values, i.e. a lot of “repeats” in its data
▪ Apply ORDER BY in the outermost query, not in inner queries. The outer query is performed
last, so put complex operations at the end, after all filtering is done.
▪ Performance – query time is split between stages; this can be seen using Stackdriver as well.
▪ Tail skew – the maximum time spent is significantly more than the average, because some
partitions are much bigger than others. Tail skew can be detected using an approximate
aggregate function such as APPROX_TOP_COUNT
▪ Batch load is free, streaming has a cost. Unless data is needed in real-time, use batch when
possible.
▪ Monitor query performance using the “details” page, which shows whether there is read,
compute or write latency. The query plan shows the different stages and the breakdown of time
between the different activities in each stage
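The skew checks mentioned in the list above can be illustrated offline. The following is a minimal Python sketch, not BigQuery code: top_counts mimics what APPROX_TOP_COUNT reports (exactly rather than approximately), and tail_skew_ratio compares the maximum and the average time spent by the workers of a stage. Both function names, the sample data, and the use of a max/avg ratio as the skew measure are assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Sequence, Tuple


def top_counts(values: Iterable[Hashable], number: int) -> List[Tuple[Hashable, int]]:
    """Return the `number` most frequent values with their counts."""
    return Counter(values).most_common(number)


def tail_skew_ratio(timings_ms: Sequence[float]) -> float:
    """Ratio of the slowest worker's time to the average worker time."""
    if not timings_ms:
        raise ValueError("timings_ms must not be empty")
    avg = sum(timings_ms) / len(timings_ms)
    if avg == 0:
        return 1.0  # every worker took no time, so there is no skew
    return max(timings_ms) / avg


# Hypothetical key column: one value dominates, so its partition is huge.
keys = ["US"] * 8 + ["DE", "FR"]
print(top_counts(keys, 1))

# Hypothetical per-worker timings for one stage, in milliseconds.
print(tail_skew_ratio([100, 110, 90, 900]))
```

A ratio close to 1 means the work is evenly spread. A ratio well above 1 suggests that a few partitions dominate the stage, which is the case where filtering early pays off.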