Framework for Migrating Your Data Warehouse to Google BigQuery Whitepaper
The purpose of this document is to provide a framework and help guide you through
the process of migrating a data warehouse to Google BigQuery. It highlights many
of the areas you should consider when planning for and implementing a migration
of this nature, and includes an example of a migration from another cloud data
warehouse to BigQuery. It also outlines some of the important differences between
BigQuery and other database solutions, and addresses the most frequently asked
questions about migrating to this powerful database.
Note: This document does not claim to provide any benchmarking or performance
comparisons between platforms, but rather highlights areas and scenarios where
BigQuery is the best option.
MIGRATION
STAGING REDSHIFT DATA IN S3
REDSHIFT CLI CLIENT INSTALL
TRANSFERRING DATA TO GOOGLE CLOUD STORAGE
DATA LOAD
ORIGINAL STAR SCHEMA MODEL
BIG FAT TABLE
DATE PARTITIONED FACT TABLE
DIMENSION TABLES
SLOWLY CHANGING DIMENSIONS (SCD)
FAST CHANGING DIMENSIONS (FCD)
POST-MIGRATION
MONITORING
STACKDRIVER
AUDIT LOGS
QUERY EXPLAIN PLAN
MAIN MOTIVATORS
Having a clear understanding of the main motivators for the migration will help
structure the project, set priorities and migration goals, and provide a basis for
assessing the success of the project at the end.
Here are some of the main reasons that users find migrating to BigQuery an
attractive option:
• BigQuery is a fully managed, no-operations data warehouse. The concept of
hardware is completely abstracted away from the user.
• BigQuery enables extremely fast analytics on a petabyte scale through its
unique architecture and capabilities.
• BigQuery eliminates the need to forecast and provision storage and
compute resources in advance. All the resources are allocated dynamically
based on usage.
• BigQuery provides a unique ‘pay as you go’ model for your data warehouse
and allows you to move away from a CAPEX-based model.
• BigQuery charges separately for data storage and query processing
enabling an optimal cost model, unlike solutions where processing capacity
is allocated (and charged) as a function of allocated storage.
• BigQuery employs a columnar data store, which enables the highest data
compression and minimizes data scanning in common data warehouse
deployments.
• BigQuery provides support for streaming data ingestion directly through an
API or by using Google Cloud Dataflow.
• BigQuery has native integrations with many third-party reporting and BI
providers such as Tableau, MicroStrategy, Looker, and so on.
PERFORMANCE
In an analytical context, performance has a direct effect on productivity: a query
that runs for hours turns each iteration on a business question into days of waiting.
It is important to build a solution that scales well not only in terms of data volume
but also in the quantity and complexity of the queries performed.
For queries, BigQuery uses the notion of ‘slots’ to allocate query resources during
query execution. A ‘slot’ is simply a unit of analytical computation (i.e. a chunk of
infrastructure) pertaining to a certain amount of CPU and RAM. All projects, by
default, are allocated 2,000 slots on a best effort basis. As the use of BigQuery
and the consumption of the service goes up, the allocation limit is dynamically
raised. This method serves most customers very well, but in rare cases where a
higher number of slots is required, and/or reserved slots are needed due to strict
SLAs, the flat-rate billing model might be a better option as it allows you to
choose the specific slot allocation required.
To monitor slot usage per project, use BigQuery integration with Stackdriver. This
captures slot availability and allocations over time for each project. However, there
is no easy way to look at resource usage on a per query basis. To measure per
query usage and identify problematic loads, you might need to cross-reference
slot usage metrics with the query history and query execution.
BigQuery supports standard SQL, which is compliant with the SQL:2011 standard
and has extensions for querying nested and repeated data.
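As an illustration only, the following sketch shows how a repeated RECORD field can be
queried with UNNEST in standard SQL; the mydataset.orders table and its items field are
hypothetical and not part of the SSB schema used later in this document:
-- Sketch: total revenue per year from a hypothetical table with a repeated RECORD column.
SELECT
o.order_year,
SUM(item.price * item.quantity) AS revenue
FROM
mydataset.orders AS o,
UNNEST(o.items) AS item
GROUP BY
o.order_year;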
In addition, BigQuery supports querying data directly from Google Cloud
Storage and Google Drive. It supports Avro, newline-delimited JSON, and CSV files,
as well as Google Cloud Datastore backups and Google Sheets (first tab only).
Federated data sources should be considered in the following cases (a sketch
follows the list):
• Loading and cleaning your data in one pass by querying the data from a
federated data source (a location external to BigQuery) and writing the
cleaned result into BigQuery storage.
• Having a small amount of frequently changing data that you join with other
tables. As a federated data source, the frequently changing data does not
need to be reloaded every time it is updated.
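As an example of the first case, the following sketch reads from a federated table and
writes a cleaned, typed result into native BigQuery storage. It assumes an external table
ext.raw_sales has already been defined over CSV files in Cloud Storage and that the query
job is configured with a destination table; both names are hypothetical:
-- Sketch: clean and type-cast data from a federated (external) table in one pass.
-- ext.raw_sales is a hypothetical external table defined over files in Cloud Storage;
-- run the query with a destination table to persist the result in BigQuery storage.
SELECT
SAFE_CAST(order_id AS INT64) AS order_id,
PARSE_DATE('%Y%m%d', order_date) AS order_date,
SAFE_CAST(amount AS FLOAT64) AS amount
FROM
ext.raw_sales
WHERE
order_id IS NOT NULL;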
DATA TRANSFORMATION
Exporting data from the source system into a flat file and then loading it into
BigQuery might not always work. In some cases the transformation should take
place before loading data into BigQuery because of data type constraints, changes
in the data model, and so on. For example, the web UI does not allow any parsing
(such as of date string formats) or type casting, so the source data must be fully
compatible with the target schema.
When considering loading data to BigQuery, there are a few options available:
• Bulk loading from Avro/JSON/CSV files on Google Cloud Storage. There are some
differences in which format works best for which case; more information is
available at https://cloud.google.com/bigquery/loading-data
• Using the BigQuery API (https://cloud.google.com/bigquery/loading-data-post-request)
• Streaming data into BigQuery using one of the API client libraries or
Google Cloud Dataflow
CAPACITY PLANNING
It is important to identify the queries, concurrent users, data volumes processed,
and usage patterns. Migrations normally start by identifying the most demanding
area, where the impact will be most visible, and building a PoC around it.
As access to big data is democratized through better user interfaces, standard
SQL support, and faster queries, reporting and data mining become self-service
and the variety of analyses much wider. This shifts attention away from
administrative and maintenance tasks to what really matters to the business:
generating insights and intelligence from the data. As a result, organizations can
focus on expanding their capacity for business analysis and developing expertise
in data modeling, data science, machine learning, statistics, and so on.
BigQuery removes the need for capacity planning in most cases. Storage is
virtually infinitely scalable and data processing compute capacity allocation is
done automatically by the platform as data volumes and usage grow.
MIGRATION
To test a simple “Star Schema” to BigQuery migration scenario, we will load
an SSB (Star Schema Benchmark) sample schema into Redshift and then perform
ETL into BigQuery. This schema was generated from the TPC-H benchmark with
some data warehouse specific adjustments.
The client requires the following environment variables to store AWS access
credentials:
• export ACCESS_KEY=
• export SECRET_ACCESS_KEY=
• export USERNAME=
• export PASSWORD=
Once the Redshift CLI client is installed and access credentials are specified, we need to:
• create the source schema in Redshift
• load the SSB data into Redshift
• export the data to S3
After executing the scripts above, you will have an export of your
Redshift tables in S3.
Cloud Storage Transfer Service has options that make data transfers and
synchronization between data sources and data sinks easier.
For example, you can:
• Schedule one-time transfers or recurring transfers.
• Schedule periodic synchronization from a data source to a data sink, with
advanced filters based on file creation dates, file name filters, and the times of
day you prefer to import data.
For subsequent data synchronizations, data should ideally be pushed directly from
the source into GCP. If, for legacy reasons, data must be staged in AWS S3, the
following approaches can be used to load data into BigQuery on an ongoing basis:
• Set a policy on the S3 bucket to create an SQS message on object creation.
In GCP, we can then use Dataflow with a custom unbounded source that
subscribes to the SQS messages and reads and streams the newly created
objects into BigQuery.
• Alternatively, a policy that executes an AWS Lambda function on each new
S3 object can be used to read the newly created object and load it
into BigQuery. Here (https://github.com/vstoyak/GCP/blob/master/RedShift2BigQuery_lambda.js)
is a simple Node.js Lambda function implementing automatic data transfer into
BigQuery, triggered on every new Redshift dump saved to S3. It implements a
pipeline with the following steps:
• Triggered on a new Redshift dump saved to the S3 bucket
• Initiates an asynchronous Google Cloud Storage Transfer process to move
the new data file from S3 to Google Cloud Storage
• Initiates an asynchronous BigQuery load job for the transferred data files
DATA LOAD
Once we have data files transferred to Google Cloud Storage they are ready to be
loaded into BigQuery.
Unless you are loading Avro files, BigQuery currently does not derive a schema
from a JSON data file, so we need to create the tables in BigQuery using the
following schemas: https://gist.github.com/anonymous/8c0ece715c6a72ff6258e25cb84045c8
One caveat here is that the BigQuery data load interfaces are rather strict in the
formats they accept, and unless all types are fully compatible, properly quoted,
and formatted, the load job will fail with very little indication of where the
problematic row or column is in the source file.
There is always the option of loading all fields as STRING and then doing the
transformation into the target tables using built-in or user-defined functions.
Another approach is to rely on Google Cloud Dataflow
(https://cloud.google.com/dataflow/) for the transformation, or on ETL tools such
as Talend or Informatica.
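For the first approach, the transformation can be expressed directly in SQL. The sketch
below assumes a hypothetical staging table ssb.lineorder_staging in which every column
was loaded as STRING (only a subset of columns is shown), with the query run against a
destination table to materialize the typed result:
-- Sketch: cast all-STRING staging columns into properly typed target columns.
-- SAFE_CAST returns NULL instead of failing the query on malformed values.
SELECT
SAFE_CAST(lo_orderkey AS INT64) AS lo_orderkey,
SAFE_CAST(lo_orderdate AS INT64) AS lo_orderdate,
SAFE_CAST(lo_quantity AS INT64) AS lo_quantity,
SAFE_CAST(lo_extendedprice AS FLOAT64) AS lo_extendedprice,
SAFE_CAST(lo_discount AS INT64) AS lo_discount
FROM
ssb.lineorder_staging;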
For help with automating data migration from Redshift to BigQuery, you can
also look at the BigShift utility (https://github.com/iconara/bigshift). BigShift follows
the same steps: dumping Redshift tables into S3, moving them over to Google Cloud
Storage, and then loading them into BigQuery. In addition, it implements automatic
schema creation on the BigQuery side and performs type conversion and cleansing
where the source data does not fit the formats required by BigQuery (data type
translation, timestamp reformatting, escaping of quotes, and so on). You should
review and evaluate the limitations of BigShift before using the software and make
sure it fits your needs.
QUERY 1
Query complete (3.3s, 17.9 GB processed) , Cost: ~8.74¢
SELECT
sum(lo_extendedprice*lo_discount) as revenue
FROM
ssb.lineorder,ssb.dwdate
WHERE
lo_orderdate = d_datekey
AND d_year = 1997
AND lo_discount between 1 and 3
AND lo_quantity < 24;
QUERY 3
Query complete (10.7s, 18.0 GB processed) , Cost: ~8.79¢
SELECT
c_city,
s_city,
d_year,
sum(lo_revenue) as revenue
FROM
ssb.customer,
ssb.lineorder,
ssb.supplier,
ssb.dwdate
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND lo_orderdate = d_datekey
AND (c_city='UNITED KI1' OR c_city='UNITED KI5')
AND (s_city='UNITED KI1' OR s_city='UNITED KI5')
AND d_yearmonth = 'Dec1997'
GROUP BY
c_city,
s_city,
d_year
ORDER BY
d_year ASC,
revenue DESC;
To do this, we will flatten the “Star Schema” into a big fat table using a
denormalization query, setting it as a batch job and materializing the query results
into a new table (lineorder_denorm_records). Note that rather than denormalizing
joined tables into flat columns, we rely on the RECORD types and ARRAYs supported
by BigQuery. A query along the following lines collapses all dimension tables into
the fact table.
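A minimal sketch of such a query, using the SSB column names that appear elsewhere in
this document (only a few columns per dimension are shown; the full query would carry
all of them, and the result would be written to ssb.lineorder_denorm_records as the
destination table):
-- Sketch: nest each joined dimension row as a RECORD (STRUCT) on the fact row.
SELECT
l.*,
STRUCT(c.c_city, c.c_nation, c.c_region) AS customer,
STRUCT(s.s_city, s.s_nation, s.s_region) AS supplier,
STRUCT(p.p_category, p.p_brand1) AS part,
STRUCT(d.d_datekey, d.d_year, d.d_yearmonth) AS dwdate
FROM ssb.lineorder l
JOIN ssb.customer c ON l.lo_custkey = c.c_custkey
JOIN ssb.supplier s ON l.lo_suppkey = s.s_suppkey
JOIN ssb.part p ON l.lo_partkey = p.p_partkey
JOIN ssb.dwdate d ON l.lo_orderdate = d.d_datekey;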
To get an idea of table sizes in BigQuery before and after the changes, the following
query calculates the space used by each table in a dataset:
select
table_id,
ROUND(sum(size_bytes)/pow(10,9),2) as size_gb
from ssb.__TABLES__
group by table_id;
Once the “Star Schema” has been denormalized into a big fat table, we can repeat our
queries to see how they perform compared with the “lift & shift” approach
and how much the processing cost changes.
QUERY 1
Query complete (3.8s, 17.9 GB processed) , Cost: 8.74¢
SELECT
sum(lo_extendedprice*lo_discount) as revenue
FROM
ssb.lineorder_denorm_records
WHERE
dwdate.d_year = 1997
AND lo_discount between 1 and 3
AND lo_quantity < 24;
QUERY 2
Query complete (6.6s, 24.9 GB processed) , Cost: 12.16¢
SELECT
sum(lo_revenue),
dwdate.d_year,
part.p_brand1
FROM
ssb.lineorder_denorm_records
WHERE
part.p_category = 'MFGR#12'
AND supplier.s_region = 'AMERICA'
GROUP BY
dwdate.d_year,
part.p_brand1
ORDER BY
dwdate.d_year,
part.p_brand1;
QUERY 3
Query complete (8.5s, 27.4 GB processed) , Cost: 13.23¢
SELECT
customer.c_city,
supplier.s_city,
dwdate.d_year,
SUM(lo_revenue) AS revenue
FROM
ssb.lineorder_denorm_records
WHERE
(customer.c_city='UNITED KI1'
OR customer.c_city='UNITED KI5')
AND (supplier.s_city='UNITED KI1'
OR supplier.s_city='UNITED KI5')
AND dwdate.d_yearmonth = 'Dec1997'
GROUP BY
customer.c_city,
supplier.s_city,
dwdate.d_year
ORDER BY
dwdate.d_year ASC,
revenue DESC;
We can see some improvement in response time after denormalization, but it
might not be enough to justify the significant increase in stored data size (~80 GB
for the “star schema” vs. ~330 GB for the big fat table) and in data processed per
query (for Q3, 18.0 GB processed at a cost of ~8.79¢ for the “star schema” vs.
27.4 GB processed at a cost of ~13.23¢ for the big fat table).
We can also run a count across all partitions to see how the data is distributed:
SELECT
dwdate.d_datekey,
COUNT(*) AS num_rows
FROM
ssb.lineorder_denorm_records_p
GROUP BY
dwdate.d_datekey
ORDER BY
dwdate.d_datekey;
QUERY 3
Query complete (4.9s, 312 MB processed), Cost: 0.15¢
SELECT
c_city,
s_city,
dwdate.d_year,
SUM(lo_revenue) AS revenue
FROM
ssb.lineorder_denorm_records_p,
ssb.supplier,
ssb.customer
WHERE
lo_custkey = c_custkey
AND lo_suppkey = s_suppkey
AND (c_city='UNITED KI1'
OR c_city='UNITED KI5')
AND (s_city='UNITED KI1'
OR s_city='UNITED KI5')
AND lineorder_denorm_records_p._PARTITIONTIME
BETWEEN TIMESTAMP(“1992-01-01”)
AND TIMESTAMP(“1992-01-31”)
GROUP BY
c_city,
s_city,
dwdate.d_year
ORDER BY
dwdate.d_year ASC,
revenue DESC;
Now, with the use of date-based partitions, we see that query response time is
improved and the amount of data processed is reduced significantly.
DIMENSION TABLES
SLOWLY CHANGING DIMENSIONS (SCD)
SCDs are dimensions that evolve over time. In some cases only the most recent
state is important (Type 1); in other situations both the before and after states are
kept (Type 3); and in many cases the whole history must be preserved (Type 2) to
report accurately on historical data.
For this exercise, let's look at how to implement the most widely used type, the
Type 2 SCD. Type 2, for which the entire history of updates should be preserved,
is easy to implement in BigQuery. All you need to do is add one SCD metadata
field called “startDate” to the table schema and implement SCD updates as straight
additions to the existing table. The most recent version of each supplier record
can then be selected as follows:
SELECT
s_suppkey,
s_name,
s_phone,
startDate
FROM
ssb.supplier_scd s1
WHERE
startDate = (
SELECT
MAX(startDate)
FROM
ssb.supplier_scd s2
WHERE
s1.s_suppkey=s2.s_suppkey)
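A Type 2 update itself is then just an appended row. The following is a minimal sketch
using a DML INSERT (a streaming insert or load job works equally well); the values are
purely illustrative and startDate is assumed to be a TIMESTAMP column:
-- Hypothetical Type 2 update: append the new version of the supplier row together
-- with its effective start date; older versions of the row remain untouched.
INSERT INTO ssb.supplier_scd (s_suppkey, s_name, s_phone, startDate)
VALUES (1234, 'Supplier#001234', '25-989-741-2988', CURRENT_TIMESTAMP());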
POST-MIGRATION
MONITORING
STACKDRIVER
Google recently added BigQuery support as one of the resources in Stackdriver.
It allows you to report on the most important project- and dataset-level metrics,
summarized below. In addition, Stackdriver allows setting threshold, change-rate,
and metric-presence alerts on any of these metrics, with different forms of
notification.
AUDIT LOGS
Stackdriver integration is extremely valuable for understanding per-project
BigQuery usage, but many times we need to dig deeper and understand platform
usage at a more granular level, such as individual departments or users.
The BigQuery audit log (https://cloud.google.com/bigquery/audit-logs) is a valuable
source of such information and can answer many questions, such as:
• who is running the queries
• how much data is processed per query, or by an individual user in a
day or a month
• what the cost of processing data is per department
• which queries run long or process too much data
To make this type of analysis easier, GCP Cloud Logging comes with the ability to
save audit log data into BigQuery. Once configured, entries are persisted into a
day-partitioned table in BigQuery, and the following query can be used to get the
per-user number of queries and estimated processing charges (replace the
AuditLogs table below with the destination of your own log export):
SELECT
protoPayload.authenticationInfo.principalEmail User,
ROUND((total_bytes*5)/1000000000000, 2) Total_Cost_For_User,
Query_Count
FROM (
SELECT
protoPayload.authenticationInfo.principalEmail,
SUM(protoPayload.serviceData.jobCompletedEvent.job.jobStatistics.totalBilledBytes) AS total_bytes,
COUNT(protoPayload.authenticationInfo.principalEmail) AS Query_Count
FROM
[AuditLogs.cloudaudit_googleapis_com_data_access_20160901]
WHERE
protoPayload.serviceData.jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY
protoPayload.authenticationInfo.principalEmail)
ORDER BY
Total_Cost_For_User DESC
QUERY EXPLAIN PLAN
In addition to the information available from the audit logs on who ran a query,
how long it took, and the volume of data scanned, the query explain plan goes
deeper and makes the following useful information available:
• the priority of the query (batch vs. interactive)
• whether the cache was used
• which columns and tables were referenced, without the need to parse the query string
• the stages of query execution, the records IN/OUT for each stage, as well as
wait/read/compute/write ratios
Unfortunately, this information is only available through the web UI on a per-query
basis, which makes it hard to analyze across multiple queries.
One option is to use the API, with which it is relatively easy to write a simple agent
that periodically retrieves the query history and saves it to BigQuery for subsequent
analysis. Here is a simple Node.js script that does this:
https://gist.github.com/anonymous/ca36a117e418b88fe2cf1f28e628def8
ABOUT PYTHIAN
Pythian is a global IT services company that helps businesses become more competitive by using technology to reach their business goals.
We design, implement, and manage systems that directly contribute to revenue and business success. Our services deliver increased agility
and business velocity through IT transformation, and high system availability and performance through operational excellence. Our highly
skilled technical teams work as an integrated extension of our clients’ organizations to deliver continuous transformation and uninterrupted
operational excellence using our expertise in databases, cloud, DevOps, big data, advanced analytics, and infrastructure management.
Pythian, The Pythian Group, “love your data”, pythian.com, and Adminiscope are trademarks of The Pythian Group Inc. Other product and company
names mentioned herein may be trademarks or registered trademarks of their respective owners. The information presented is subject to change
without notice. Copyright © <year> The Pythian Group Inc. All rights reserved.
V01-092016-NA