Big Query Content

Google BigQuery is a serverless, highly scalable data warehouse that comes with a built-in query engine.

It is a cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time. The query engine is capable of running SQL queries on terabytes of data in a matter of seconds, and on petabytes in only minutes. You get this performance without having to manage any infrastructure.
------------------------------------------------------------------------------

BigQuery offers scalable, flexible pricing options to meet your technical needs and your budget.

Mainly you are charged for storage, i.e. the amount of data you store in tables, and for query costs. Other than that, most operations, such as loading data, copying data, and exporting data, are free of cost.

There are two pricing models you can opt for: on-demand pricing or flat-rate pricing.

On-demand pricing, as the name suggests, charges you only when you run a query. There is no lump sum or monthly cost; you just pay for the queries you run. Charges for queries are decided by a single metric: the number of bytes processed (also referred to as bytes read). You are charged for the number of bytes processed no matter where the data is stored; it can be in BigQuery or in an external data source such as Cloud Storage, Drive, or Cloud Bigtable.

On-demand pricing is the default model your project is attached to, but you can change this billing model to flat-rate billing, or even mix and match the two billing models per project and location.
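Since on-demand billing is driven purely by bytes read, selecting only the columns you need directly lowers the bill. A quick sketch against the F_ALARM table used later in these notes (BigQuery's columnar storage means unselected columns are never scanned):

```sql
-- Scans (and bills for) every column in the table:
-- SELECT * FROM `bmas-eu-mbnl-data-prod.ONEFM_SEMANTIC.F_ALARM`;

-- Scans only two columns, so far fewer bytes are processed:
SELECT NODE, SEVERITY
FROM `bmas-eu-mbnl-data-prod.ONEFM_SEMANTIC.F_ALARM`;
```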


------------------------------------------------------------------------------

Let’s move on to the next pricing model: the flat-rate pricing model.

This pricing option is best for customers who want a stable cost for queries. Flat-rate customers purchase dedicated resources for query processing and are not charged on demand for individual queries. It is like a subscription-based model.

When you enroll in flat-rate pricing, you basically purchase slot commitments, or you can say a dedicated query-processing capacity. You can fire any number of queries of any data size within the allotted processing capacity, and you are not charged for the bytes processed.

This model is quite flexible as well, because once your capacity is allocated, you can distribute it across your organization by reserving pools of capacity for different projects or different parts of your organization.

In cases where your demand exceeds your committed capacity, you still will not be charged additional fees. No additional slots are given to you; rather, BigQuery queues up the excess work. It is like queuing up the tasks until the running tasks finish and free up some slots for the queued-up tasks.

Moving on, we have Flex Slots, also known as short-term commitments, where the commitment duration is only 60 seconds. After 60 seconds you can keep the Flex Slots for as long as you want, or cancel them at any time, and you will be charged only for the seconds your commitment was deployed.

Now you might be wondering why someone would need a commitment of only 60 seconds. Actually, Flex Slots are a good way to test how your workloads are going to perform with flat-rate billing before purchasing a long-term commitment. They are also useful for handling seasonal demand, such as a big sale event on an e-commerce site.
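As a sketch of how this looks in practice, slot commitments, reservations, and assignments can be managed with BigQuery's reservation DDL. The admin project, region, and identifiers below are placeholders, and the exact DDL syntax is worth verifying against the current BigQuery reservations documentation:

```sql
-- Purchase a Flex commitment of 100 slots (60-second minimum duration).
-- `my-admin-project` and the identifiers are placeholders.
CREATE CAPACITY `my-admin-project.region-us.my-commitment`
OPTIONS (slot_count = 100, plan = 'FLEX');

-- Carve a reservation out of the committed capacity.
CREATE RESERVATION `my-admin-project.region-us.my-reservation`
OPTIONS (slot_capacity = 100);

-- Assign a workload project's queries to that reservation.
CREATE ASSIGNMENT `my-admin-project.region-us.my-reservation.my-assignment`
OPTIONS (assignee = 'projects/my-workload-project', job_type = 'QUERY');
```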
------------------------------------------------------------------------------
One practice that can control communication between slots is to reduce the amount of data that is processed before a JOIN clause.

A join operation lets a query jump from one table to another and comes with a lot of shuffling, so it is good practice to trim the data in the query as early as possible, before the JOIN clause. Less data going into the join means less shuffling, which in turn means better performance.

To avoid shuffling, BigQuery broadcasts the small tables in a join query to every processing node. To allow proper broadcasting, always write the join query with the tables in decreasing order of size: the heaviest table at the left extreme and the lightest at the right.
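An illustrative sketch of both practices, reusing the F_ALARM table from later in these notes joined to a hypothetical small dimension table D_SITE (that table, its columns, and the severity filter are assumptions for the example):

```sql
-- Largest table on the left; trim both sides before the JOIN.
SELECT a.IDENTIFIER, a.SEVERITY, s.SITE_NAME
FROM (
  SELECT IDENTIFIER, SEVERITY, SITEID
  FROM `bmas-eu-mbnl-data-prod.ONEFM_SEMANTIC.F_ALARM`
  WHERE SEVERITY >= 4              -- filter early, before the join
) AS a
JOIN (
  SELECT SITEID, SITE_NAME
  FROM `bmas-eu-mbnl-data-prod.ONEFM_SEMANTIC.D_SITE`  -- hypothetical table
) AS s
ON a.SITEID = s.SITEID;
```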

So yeah, that is about all you can do from the data-scanning and shuffling perspective to get better performance and control query cost.

-----------------------------------------------------------------------------
It is true that whatever functions, aggregations, and transformations you apply within a query all directly impact the CPU time needed. The more transformations there are, the more computation there will be and the more time the query will take to produce output.

It is a common use case to use SQL to perform ETL, where you have to write a number of functions to transform the data. Obviously these transformations are inevitable, as they are part of the business requirements, but as a best practice you can write the transformed data into another table and perform all further aggregations on that new table.

For example, if your query contains TRIM statements, regular expressions, or even some UDFs, it is more performant to write the transformed results into a new table and then do the aggregations or other operations on the new table. The aggregations on the new table are done much more efficiently, because this time there is no overhead of doing those transformations. So basically it is a sort of intermediate table.
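A minimal sketch of the intermediate-table pattern; the project, dataset, table names, and transformations below are illustrative assumptions:

```sql
-- Step 1: materialize the expensive transformations once.
CREATE OR REPLACE TABLE `my-project.my_dataset.alarms_clean` AS
SELECT
  TRIM(IDENTIFIER) AS identifier,
  REGEXP_EXTRACT(NODE, r'^[A-Z]+') AS node_prefix,
  SEVERITY
FROM `my-project.my_dataset.alarms_raw`;

-- Step 2: aggregate on the clean table; no transformation overhead here.
SELECT node_prefix, COUNT(*) AS alarm_count, MAX(SEVERITY) AS max_severity
FROM `my-project.my_dataset.alarms_clean`
GROUP BY node_prefix;
```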

Then, the ORDER BY clause is also a costly operation, as it requires sorting at the whole-data level, so you need to use it very carefully. Use ORDER BY only in the outermost query or within a window clause. In the outermost query, the final data on which the ordering is performed has already been filtered and reduced, so you sort only a subset of the data and not data that is filtered out anyway.

Actually, not only ORDER BY: whatever complex operations are there, such as regular expressions or other mathematical functions, try to push them to the end of the query. Your query will perform better for doing so.

And yes, it is also a good practice to use LIMIT whenever you use an ORDER BY clause. Since ORDER BY means sorting the whole data, it must be done on a single slot, and if you are attempting to order a very large result set, the final sort can overwhelm the slot that is processing the data, sometimes throwing a "resources exceeded" error. FYI, "resources exceeded" is returned when your query uses too many resources.
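A sketch of the pattern: filter and aggregate in the inner query, then sort the already-reduced result in the outermost query with a LIMIT, so a single slot never has to sort the full table (columns reuse the F_ALARM schema shown later in these notes):

```sql
SELECT NODE, alarm_count
FROM (
  SELECT NODE, COUNT(*) AS alarm_count
  FROM `bmas-eu-mbnl-data-prod.ONEFM_SEMANTIC.F_ALARM`
  WHERE SEVERITY = 5     -- reduce the data first
  GROUP BY NODE
)
ORDER BY alarm_count DESC  -- sort only the reduced outer result
LIMIT 100;                 -- cap what the sorting slot must handle
```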

Going next: in what order should we place the tables in a join query? Even though BigQuery's optimizer can determine which table should be on which side of the join while creating its execution plan, it is still recommended to order your joined tables appropriately. The best practice is to place the largest table first, followed by the rest in decreasing order of size. You do this to enable broadcasting: in a broadcast join, the whole data of a small table on the right side can be broadcast to each slot that processes the larger table, which results in fewer I/O requests out of the processing slot.

When evaluating the output data, you should consider how many bytes are written for your result set, and whether you are properly limiting the amount of data being written. It should not happen that you want to see just a few rows of output yet do not include a LIMIT clause in your query. A LIMIT clause might not restrict the data being read, but it can definitely restrict the amount of data being written, and the amount of data written by a query does take its time. Also, if you are writing results to a permanent (destination) table, the amount of data written will have a cost.

------------------------------------------------------------------------------
Partitioning:

select
  X733SPECIFICPROB,
  IDENTIFIER,
  NODE,
  SEVERITY,
  FIRSTOCCURRENCE,
  LASTOCCURRENCE,
  SITEID,
  CONTROLNE,
  NODETYPE,
  CLEARTIME,
  EMS_NAME
from `bmas-eu-mbnl-data-prod.ONEFM_SEMANTIC.F_ALARM`
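Assuming F_ALARM is partitioned on a timestamp column such as FIRSTOCCURRENCE (an assumption; check the table's actual partitioning column), adding a filter on that column lets BigQuery prune partitions and scan far fewer bytes:

```sql
-- Partition pruning: only the partitions for the last 7 days are scanned.
-- FIRSTOCCURRENCE as the partitioning column is an assumption.
SELECT IDENTIFIER, NODE, SEVERITY, FIRSTOCCURRENCE
FROM `bmas-eu-mbnl-data-prod.ONEFM_SEMANTIC.F_ALARM`
WHERE FIRSTOCCURRENCE >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);
```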

------------------------------------------------------------------------------

Cache:

SELECT Priority
FROM `bmas-eu-mbnl-data-prod.HEALTH_CHECK.LOOKUP_MAJOR_ALARMS`
LIMIT 1000
