
Cost optimization best practices for BigQuery
TecHub | Data Analytics
Google BigQuery
Google Cloud Platform's enterprise data warehouse for analytics

● Fully managed and serverless, for maximum agility and scale
● Real-time insights from streaming data
● Exabyte-scale data warehousing
● Built-in ML and geospatial analysis for predictive insights
● Encrypted, durable, secure, and highly available
● High-speed, in-memory BI Engine for faster reporting and analysis
BigQuery | Architectural Advantage
Decoupled storage and compute for maximum flexibility

● Replicated, distributed storage (99.9999999999% durability)
● Highly available compute cluster (Dremel) with a distributed in-memory shuffle tier
● Storage and compute connected by a petabit network
● SQL:2011 compliant
● Ingest via streaming or free bulk loading
● Access through the REST API, Web UI, CLI, and client libraries in 7 languages
BigQuery | Managed storage
Durable and persistent storage with automatic backup

● Tables are stored in an optimized columnar format
● Each table is compressed and encrypted on disk
● Storage is durable, and each table is replicated across data centers (zones within a region)
● You can time travel on data within 7 days

BigQuery | Large stateless compute
Modern architecture for scalability and performance

● Superlinear horizontal scalability
● Immune to node/rack downtime
● Seamless maintenance
● Pipelined execution, dynamic work repartitioning, speculative execution
Cost optimization techniques

Query processing
● On-demand pricing
  ○ Query the data you need
  ○ Query cost controls
  ○ Partition and cluster your tables (includes zero-maintenance auto-reclustering)
● Flat-rate pricing

Storage
● Data retention
● Long-term storage
● Avoid duplicate storage - use the federated data access model
● Streaming inserts
● Backup and recovery
01 Optimize querying
Query the data you need

● Avoid SELECT * (use the preview option to explore your data - it's free!)
● Denormalize your data (nested fields). Bear in mind: BigQuery is a data warehouse.
● Filter your query as early and as often as possible to improve performance and reduce cost (see the sketch after this list).
● Check how much your query is going to be charged before running it.
● Avoid SQL anti-patterns.
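A minimal sketch of the SELECT * and early-filter habits, assuming a hypothetical `my_project.shop.orders` table: name only the columns you need and push the filter down so BigQuery scans less data.

-- Avoid: SELECT * scans every column in the table.
-- Better: project only what you need and filter as early as possible.
SELECT
  order_id,
  customer_id,
  total_usd
FROM `my_project.shop.orders`
WHERE order_date >= '2024-01-01'  -- if order_date is the partitioning column,
  AND status = 'SHIPPED';         -- this filter also cuts the bytes billed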
Avoid human errors

● Enforce MAX limits on bytes processed at the query, user, and project level.
● Cancelling a query may still cost money: you are billed for the work done before the job stops.
● Use caching intelligently - repeated identical queries are served from the free results cache (see the sketch below).
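A minimal sketch of the caching point, reusing the hypothetical `my_project.shop.orders` table: the 24-hour results cache only serves byte-for-byte identical, deterministic queries, so non-deterministic functions typically disable it.

-- Served from the results cache on identical re-runs (no charge):
SELECT customer_id, SUM(total_usd) AS spend
FROM `my_project.shop.orders`
WHERE order_date = '2024-01-01'
GROUP BY customer_id;

-- Typically not cached: CURRENT_DATE() is non-deterministic, so re-runs are billed.
SELECT customer_id, SUM(total_usd) AS spend
FROM `my_project.shop.orders`
WHERE order_date = CURRENT_DATE()
GROUP BY customer_id;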
Partition & cluster your data

● Partition your table to reduce the data scanned.
  ○ Enable the required partition filter option.
● Cluster to further prune the data blocks that are read (both are combined in the DDL sketch below).
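A minimal DDL sketch, assuming a hypothetical `my_project.shop.events` table: partition by day, cluster by a frequently filtered column, and require a partition filter so unscoped full scans are rejected.

CREATE TABLE `my_project.shop.events`
(
  event_ts    TIMESTAMP,
  customer_id STRING,
  payload     STRING
)
PARTITION BY DATE(event_ts)                 -- queries pay only for the days they touch
CLUSTER BY customer_id                      -- prunes blocks within each partition
OPTIONS (require_partition_filter = TRUE);  -- rejects queries that omit a date filter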
Flat-rate & reservations

● Think about flat-rate pricing once your BigQuery processing cost exceeds ~$10K a month.
  ○ Familiarize yourself with BigQuery costs using our pricing calculator.
● How many slots should you buy? Visualize slot utilization in Stackdriver, or estimate it from the jobs metadata as sketched below.
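A minimal sizing sketch over the jobs metadata, assuming the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view is available for your region: average slot usage per day is total slot-milliseconds divided by the milliseconds in a day.

SELECT
  DATE(creation_time) AS day,
  SUM(total_slot_ms) / (1000 * 60 * 60 * 24) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY day
ORDER BY day;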
02 Optimizing storage
How long are you keeping your data?

● Set a TTL (expiration) on your data at the dataset level or at the table level.
● Similar to dataset-level and table-level, you can also set expiration at the partition level. Do check out our public documentation for the default behaviors. Both dataset- and table-level settings are sketched below.
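A minimal sketch of both levels, assuming a hypothetical `my_project.staging` dataset: new tables default to a 30-day TTL, and one existing table gets an explicit expiration.

-- Dataset level: every new table expires 30 days after creation by default.
ALTER SCHEMA `my_project.staging`
SET OPTIONS (default_table_expiration_days = 30);

-- Table level: this specific table expires at a fixed point in time.
ALTER TABLE `my_project.staging.daily_extract`
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);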
Be wary of how you edit your data

● If your table or partition has not been edited for 90 days, its storage price drops by 50% (long-term storage). A query for spotting long-term bytes follows.
  ○ Watch out for any action that edits the table: loading into BQ, DML operations, streaming inserts, etc.
● For long-term archives accessed at most once a year, leverage the Coldline storage class in GCS.
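A minimal sketch for spotting long-term storage, assuming the `INFORMATION_SCHEMA.TABLE_STORAGE` view is available in your region:

SELECT
  table_schema,
  table_name,
  active_logical_bytes / POW(1024, 3)    AS active_gib,    -- billed at the full rate
  long_term_logical_bytes / POW(1024, 3) AS long_term_gib  -- billed at ~50% of it
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY long_term_gib DESC;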
Avoid duplicate copies of data

Leverage BigQuery's federated data access model for your data stored on:
● Google Drive
● Cloud Bigtable
● Cloud Storage
● Cloud SQL

Use cases:
● Frequently changing small side inputs
● Ingestion with cleanup that needs to be archived
● Querying of large archives (sketched below)

Gotcha: querying external data is less performant than querying native BigQuery storage.
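A minimal federated sketch over Cloud Storage, assuming hypothetical Parquet files under `gs://my-archive-bucket/events/`: the data stays in GCS, so no duplicate copy is billed in BigQuery storage.

CREATE EXTERNAL TABLE `my_project.archive.events_ext`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-archive-bucket/events/*.parquet']
);

-- Query it like any native table; bytes are read from GCS at query time.
SELECT COUNT(*) FROM `my_project.archive.events_ext`;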
Loading the data

● Batch loading is free. Use streaming inserts only if the data is consumed by downstream processes in real time.

Understanding DR and backup processes

● By default, a 7-day history of your tables is tracked by BigQuery at the service level.
  ○ You can find examples of point-in-time restores in our public documentation; one is sketched below.
● If you delete your table, you cannot restore it after 2 days.
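A minimal point-in-time sketch using time travel, reusing the hypothetical `my_project.shop.orders` table: read the table as it was an hour ago, for example to recover from a bad DML statement.

SELECT *  -- SELECT * is fine here: the goal is a full snapshot, not a cheap scan
FROM `my_project.shop.orders`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);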
BigQuery Materialized Views

● Zero maintenance: refreshes are automatically synchronized with data changes in the base tables. No user input required.
● Always fresh: always consistent with the base table. Querying an MV will never return stale data.
● Self tuning: when you query the base table directly, BigQuery will rewrite the query to use the MV for better performance and/or efficiency.
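A minimal sketch, again assuming the hypothetical `my_project.shop.orders` table: a materialized view that pre-aggregates daily revenue, which BigQuery can also substitute into matching queries against the base table.

CREATE MATERIALIZED VIEW `my_project.shop.daily_revenue_mv` AS
SELECT
  order_date,
  SUM(total_usd) AS revenue_usd
FROM `my_project.shop.orders`
GROUP BY order_date;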
Flexibility and choice across the BI process

Introducing BigQuery BI Engine
● Sub-second queries
● Simplified architecture
● Smart tuning
Visualize cost

● Create your own dashboard (example)
● Analyze spending trends & query trends over time
● Break down cost per project and per user (a per-user sketch follows)
● Be proactive about tracking your expensive queries and optimizing them
● Resources: BQ audit logs queries repository (GitHub), blog post
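A minimal per-user breakdown, again assuming `INFORMATION_SCHEMA.JOBS_BY_PROJECT` is available; the $5/TiB figure is an assumed on-demand rate - substitute the rate on your own contract.

SELECT
  user_email,
  SUM(total_bytes_billed) / POW(1024, 4)     AS tib_billed,
  SUM(total_bytes_billed) / POW(1024, 4) * 5 AS est_cost_usd  -- assumed $5/TiB
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY est_cost_usd DESC;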
For more details: bit.ly/gcp-co-bq
Thank you
Appendix
Ingestion formats

Relative loading speed into BigQuery, from faster to slower:
● Avro (compressed)
● Avro (uncompressed)
● Parquet / ORC
● CSV
● JSON
● CSV (compressed)
● JSON (compressed)
Introducing BigQuery Omni

A flexible, fully managed, multi-cloud analytics solution that lets you analyze data across public clouds without leaving the familiar BigQuery user interface.
Data integration partners

● SaaS data sources
● Databases
● Data warehouses
● B2B, EDI data

Resource Optimizations
● BigQuery Partitioning & Clustering
● Federation: Avoid duplication of data
● Data retention and clean up for active storage
● BigQuery Caching

Pricing Efficiency
● Flex Slots
● BigQuery slot recommendations
