August 2021
Build a modern, unified analytics data platform with Google Cloud
Firat Tekiner & Susan Pierce
There is no shortage of data being created. IDC research indicates that worldwide data will grow to 175 zettabytes by 2025¹. The volume of data being generated every day is staggering, and it is increasingly difficult for companies to collect, store, and organize it in a way that is accessible and usable. In fact, 90% of data professionals say their work has […]

[…]sions contribute to the gap that companies have between aggregating data and making it work for them. Companies want to move to the Cloud to modernize their data analytics systems, but that alone doesn't solve the underlying issues around siloed data sources and brittle processing pipelines. Strategic decisions around data ownership and technical […] holistic way to make a data platform more successful for your organization.

In this paper, we will discuss the decision points necessary in creating a modern, unified analytics data platform built on Google Cloud Platform.

[…] their data⁴. The first issue is data freshness. The second issue stems from the difficulty in integrating disparate and legacy systems across silos. Organizations are migrating to the Cloud, but that does not solve the real problem of older legacy systems that might have been vertically structured to meet the needs of a single business unit.
1 https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf
2 https://www.zdnet.com/article/data-analysts-stretched-lack-engineering-resource-current-data-says-survey/
3 ibid
4 https://www.accenture.com/gr-en/insights/technology/closing-data-value-gap
In planning out organizational data needs, it's easy to overgeneralize and consider a single, simplified structure where there is one set of consistent data sources, one enterprise data warehouse, one set of semantics, and one tool for business intelligence. That might work for a very small, highly centralized organization, and could even work for a single business unit with its own integrated IT and data engineering team. In practice, though, no organization is that simple and there are always surprise complexities around data ingestion, processing, and/or usage that complicate matters […] architecture or set of software components to purchase; it requires companies to take stock of their overall data maturity and make systemic, organizational changes in addition to technical upgrades.

By the end of 2024, 75% of enterprises will shift from piloting to operationalizing AI, driving a 5X increase in streaming data and analytics infrastructures⁵. It's easy enough to pilot AI with an arms-length data science team, working in a siloed environment. But the fundamental challenge that […]
5 https://emtemp.gcom.cloud/ngw/globalassets/en/doc/documents/721868-100-data-and-analytics-predictions-through-2024.pdf
Data work is rarely done by a single individual; there are many data-related users in an organization who play important roles in the data lifecycle. Each has a different perspective on data governance, freshness, discoverability, metadata, processing timelines, queryability, etc. In most cases, they are all using different systems and software to operate on the same data, at different stages of processing.

Let's look at a machine learning lifecycle, for example. A data engineer may be responsible for ensuring fresh data is available for the data science team, with appropriate security and privacy constraints in place. A data scientist may create training and test datasets based on a golden set of pre-aggregated data sources from the data engineer, build and test models, and make insights available for another team. An ML engineer may be responsible for packaging up the model for deployment into production systems, in a way that is non-disruptive to other data processing pipelines. A product manager or business analyst may be checking in on derived insights, using Data QnA (a natural language interface for analytics on BigQuery data), visualization software, or might be querying the result set directly through an IDE or a command-line interface. There are countless users with different needs, and we have built a comprehensive platform to serve them all. Google Cloud meets customers where they are with tools to meet the needs of the business.
If you know what datasets you need to analyze, have a clear understanding of their structure, and have a known set of questions you need answered, then you are likely looking at a data warehouse.
But there’s more to the decision, so let’s talk through some of the organizational chal-
lenges of each.
Data warehouses are often difficult to manage. The legacy systems that have worked well over the past 40 years have proven to be expensive and pose a lot of challenges around data freshness and scaling. Furthermore, they cannot easily provide
AI or real-time capabilities without bolting that functionality on after the fact. These
issues are not just present in on-premise legacy data warehouses; we even see this with
the newly created cloud-based data warehouses as well. Many do not offer integrated
AI capabilities, despite their claims. These new data warehouses are essentially the same
legacy environments but ported over to the Cloud.
Data warehouse users tend to be analysts, often embedded within a specific business
unit. They may have ideas about additional datasets that would be useful to enrich their
understanding of the business. They may have ideas for improvements in the analysis,
data processing, and requirements for business intelligence functionality.
However, in a traditional organization, they often don't have direct access to the data owners, nor can they easily influence the technical decision makers who decide datasets and tooling. In addition, because they are kept separate from raw data, they are unable to test hypotheses or drive a deeper understanding of the underlying data.

Data lakes have their own challenges. In theory, they are low cost and easy to scale, but many of our customers have seen a different reality in their on-premise data lakes. Planning for and provisioning sufficient storage can be expensive and difficult, especially for organizations that produce highly variable amounts of data. On-premise data lakes can be brittle, and maintenance of existing systems takes time. In many cases, the engineers who would otherwise be developing new features are relegated to the care and feeding of data clusters. Said more bluntly, they are maintaining value as opposed to creating new value. Overall, the total cost of ownership is higher than expected for many companies. Not only that, governance is not easily solved across systems, especially when different parts of the organization use different security models. As a result, the data lakes become siloed and segmented, making it difficult to share data and models across teams.

Data lake users typically are closer to the raw data sources and are equipped with tools and capabilities to explore the data. In traditional organizations, these users tend to focus on the data itself and are frequently held at arm's length from the rest of the business. This disconnect means that business units miss out on the opportunity to find insights that would drive their business objectives forward to higher revenues, lower costs, lower risk, and new opportunities.

Given these tradeoffs, many companies end up with a hybrid approach, where a data lake is set up to graduate some data into a data warehouse, or a data warehouse has a side data lake for additional testing and analysis. But with multiple teams fabricating their own data architectures to suit their individual needs, data sharing and fidelity get even more complicated for a central IT team.

Instead of having separate teams with separate goals, where one explores the business and another understands the business, you can unite these functions and their data systems to create a virtuous cycle where a deeper understanding of the business drives directed exploration, and that exploration drives a better understanding of the business.
Data type and access:
• Data warehouse: structured data; SQL access and manipulation
• Data lake: unstructured (raw) and structured data; code-involved access and exploration
This requires convergence in both the technology and the approach to understanding and
discovering the value in your data.
[Figure: query federation and the BigQuery Storage API connect BigQuery compute (ANSI SQL, BQML, UDFs, BQ GIS) and storage to external engines, languages and formats such as Parquet & ORC in GCS, Python, Pandas, Beam, Go, Scikit, Spark, Java, Keras, MapReduce, Cloud Dataflow, Cloud Dataproc and Spanner.]
BigQuery Storage API provides the capability to use BigQuery storage like Google Cloud Storage (GCS) for a number of other systems such as Dataflow and Dataproc. This breaks down the data warehouse storage wall and enables running high-performance dataframes on BigQuery. In other words, the BigQuery Storage API allows your BigQuery data warehouse to act like a data lake. So what are some of the practical uses for it? For one, we built a series of connectors (MapReduce, Hive and Spark, for example) so that you can run your Hadoop and Spark workloads directly on your data in BigQuery. You no longer need a data lake in addition to your data warehouse! Dataflow is incredibly powerful for batch and stream processing. Today, you can run Dataflow jobs on top of BigQuery data, enriching it with data from Pub/Sub, Spanner or any other data source.

BigQuery can independently scale both storage and compute, and each is serverless, allowing for limitless scaling to meet demand no matter the usage by different teams, tools and access patterns. All of the above applications can run without impacting the performance of any other jobs accessing BigQuery at the same time. In addition, the BigQuery Storage API provides a petabit-level network, moving data between nodes to fulfill a query request and effectively delivering performance similar to an in-memory operation. It also allows federating with popular Hadoop data formats such as Parquet and ORC directly, as well as NoSQL and OLTP databases. You can go a step further with the capabilities provided by Dataflow SQL, which is embedded in BigQuery. This allows you to join streams with BigQuery tables or data residing in files, effectively creating a lambda architecture: you can ingest large amounts of batch and streaming data while also providing a serving layer to respond to queries. BigQuery BI Engine and Materialized Views make it even easier to increase efficiency and performance in this multi-use architecture.
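As a hedged illustration of running Spark directly on warehouse data, the sketch below uses the spark-bigquery connector (which reads through the BigQuery Storage API); the project, dataset, table and bucket names are placeholders rather than part of any reference architecture.

    from pyspark.sql import SparkSession

    # Minimal sketch: assumes a Dataproc cluster with the spark-bigquery connector
    # available; all project/dataset/table/bucket names are placeholders.
    spark = SparkSession.builder.appName("bigquery-as-data-lake").getOrCreate()

    # Read directly from BigQuery storage via the Storage API (no export to GCS).
    orders = (
        spark.read.format("bigquery")
        .option("table", "my-project.sales.orders")  # hypothetical table
        .load()
    )

    # Run an ordinary Spark aggregation on the warehouse data.
    daily_revenue = orders.groupBy("order_date").sum("amount")

    # Write the result back to BigQuery (stages temporarily through a GCS bucket).
    (
        daily_revenue.write.format("bigquery")
        .option("temporaryGcsBucket", "my-staging-bucket")  # hypothetical bucket
        .mode("overwrite")
        .save("my-project.sales.daily_revenue")
    )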
All these services connect transparently to each other due to clear design and clean implementation.
Change management is often one of the hardest aspects of incorporating any new technology into an organization. Google
Cloud seeks to meet our customers where they are by providing familiar tools, platforms and integrations for developers and
business users alike. Our mission is to accelerate your organization’s ability to digitally transform and reimagine your busi-
ness through data-powered innovation, together. Instead of creating vendor lock-in, Google Cloud provides companies with
options for simple, streamlined integrations with on-premise environments, other Cloud offerings and even the Edge to form a
truly hybrid Cloud:
• BigQuery Omni removes the need for data to be ported from one environment to
another and instead takes the analytics to the data regardless of the environment.
• Apache Beam, the SDK leveraged on Cloud Dataflow, provides portability to runners like Apache Spark and Apache Flink (see the sketch after this list).
• For organizations looking to run Apache Spark or Apache Hadoop, Google Cloud
provides Dataproc.
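To make the portability point concrete, here is a minimal Apache Beam sketch in Python; the same pipeline runs locally, on Dataflow, or on Spark or Flink simply by changing the runner option. The project, bucket paths and field names are placeholders.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def run(runner: str = "DirectRunner") -> None:
        # Swap "DirectRunner" for "DataflowRunner", "SparkRunner" or "FlinkRunner"
        # without touching the pipeline code itself.
        options = PipelineOptions(
            runner=runner,
            project="my-project",                # placeholder project
            temp_location="gs://my-bucket/tmp",  # placeholder bucket
            region="us-central1",
        )
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadEvents" >> beam.io.ReadFromText("gs://my-bucket/events/*.json")
                | "ParseJson" >> beam.Map(json.loads)
                | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], 1))
                | "CountPerUser" >> beam.CombinePerKey(sum)
                | "Format" >> beam.MapTuple(lambda user, n: f"{user},{n}")
                | "WriteCounts" >> beam.io.WriteToText("gs://my-bucket/output/user_counts")
            )


    if __name__ == "__main__":
        run()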
Most data users care about what data they have, not which
system it resides in. Having access to the data they need
when they need it is the most important thing. So for the
most part, the type of platform does not matter for users,
so long as they are able to access fresh, usable data with
familiar tools - whether they are exploring datasets, manag-
ing sources across data stores, running ad hoc queries or
developing internal business intelligence tools for executive
stakeholders.
[Figure: democratized services such as Vertex AI, SQL and BI tools, Data QnA and Connected Sheets sit alongside DLP and shared security controls over internal databases and external/public data sources.]
Emerging Trends
Continuing on this idea of the convergence of a data lake and a data warehouse into a unified analytics data platform, there
are some additional data solutions that are gaining traction. We have been seeing a lot of concepts emerging around Lakehouse and Data Mesh, for example. You may have heard some of these terms before. Some are not new and have been around
in different shapes and formats for years. However, they work very nicely within the Google Cloud environment. Let’s take
a closer look into what a Data Mesh and a Lakehouse would look like in Google Cloud and what they mean for data sharing
within an organization.
Lakehouse and Data Mesh are not mutually exclusive, but they help solve different challenges within an organization: one favors enabling data, while the other enables teams. Data Mesh empowers people to avoid being bottlenecked by one team,
and therefore enables the entire data stack. It breaks silos into smaller organizational units in an architecture that provides
access to data in a federated manner. Lakehouse brings the data warehouse and data lake together, allowing different types
and higher volumes of data. This effectively leads to schema-on-read instead of schema-on-write, a feature of data lakes that
was thought to close some of the performance gaps in enterprise data warehouses. As an added benefit, this architecture
also borrows more rigorous data governance, something that data lakes typically lack.
Lakehouse:
• Removes the overhead of Data Lakes and Data Warehouses
• Data warehouse gets the capabilities of a data lake
• Data Lake gets the capabilities of the Data Warehouse
• Benefits:
  • Multimodal data access with higher volumes of data
  • Schema on read
  • The governance that Data Lakes lack but DWHs provide
  • Enables unified access to batch and real-time data

Data Mesh:
• Removes the organizational barriers becoming the bottleneck
• Federates data ownership
• Focuses on data as a product
• Allows for the creation of agile teams and shorter time to insights
• Teams own their data & technology
• Provides API / access to other teams
• Decentralized raw and processed data
• Benefits:
  • Well defined, governed and secure
  • Ability to leverage several domains with no data movement
  • Leverages DataOps methodologies (builds on lessons learned in DevOps)
Lakehouse
As mentioned above, BigQuery's Storage API lets you treat your data warehouse like a data lake. Spark jobs running on Dataproc or similar Hadoop environments can use the data stored in BigQuery directly, taking storage out of the data warehouse rather than requiring a separate storage medium.
BigQuery's compute power, decoupled from storage, enables SQL-based transformations and the use of views across the different layers of those transformations. This leads to an ELT-type approach and a more agile data processing platform: leveraging ELT over ETL, BigQuery lets SQL-based transformations be stored as logical views.
While dumping all of the raw data into data warehouse storage may be expensive with a traditional data warehouse, there is no
premium charge for BigQuery storage. Its cost is fairly comparable to blob storage in GCS.
When performing ETL, the transformations are taking place outside of BigQuery, potentially in a tool that does not scale as
well. It might end up transforming the data line-by-line rather than parallelizing the queries. There may be instances where
Spark or other ETL processes are already codified and changing them for the sake of new technology might not make sense.
If, however, there are transformations that can be written in SQL, BigQuery is likely a great place to do them.
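As a minimal sketch of this ELT pattern (the project, dataset and table names are hypothetical), a transformation can be stored as a logical view over raw data already loaded into BigQuery, using the google-cloud-bigquery client:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application default credentials

    # Raw events are already loaded (the "EL" in ELT); the "T" lives in SQL as a
    # logical view rather than in an external ETL tool.
    view = bigquery.Table("my-project.analytics.daily_active_users")  # hypothetical
    view.view_query = """
        SELECT
          DATE(event_timestamp) AS activity_date,
          COUNT(DISTINCT user_id) AS daily_active_users
        FROM `my-project.raw.events`
        GROUP BY activity_date
    """
    client.create_table(view, exists_ok=True)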
In addition, this architecture is supported by all the GCP components like Composer,
Data Catalog or Data Fusion. It provides an end-to-end layer for different user personas.
Another important aspect of reducing operational overhead is leveraging the capabilities of the underlying infrastructure. Consider Dataflow and BigQuery: both run on containers and let us manage the uptime and the mechanics behind the scenes. Once this is extended to third-party and partner tools, and they start exploiting similar capabilities such as Kubernetes, the platform becomes much simpler to manage and more portable. In turn, this reduces resource and operational overheads. Furthermore, this can be complemented with better observability, using monitoring dashboards alongside Cloud Composer to drive operational excellence.
Not only can you build a data lake by bringing together data stored in GCS and BigQuery,
without any data movement or duplication, but we are offering additional administra-
tive functionality to manage your data sources. Dataplex enables a Lakehouse by offer-
ing a centralized management layer to coordinate data in GCS and BigQuery. Doing this
enables you to organize your data based on your business needs, so you are no longer
restricted by how or where that data is stored.
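For example, here is a hedged sketch of registering Parquet files in GCS as a BigQuery external table, so lake data can be queried alongside native warehouse tables without moving it; the bucket, dataset and table names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Describe the Parquet files sitting in the data lake (GCS).
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = ["gs://my-lake-bucket/clickstream/*.parquet"]  # placeholder

    # Register them as an external table; the data itself stays in GCS.
    table = bigquery.Table("my-project.lake.clickstream")  # hypothetical table
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

    # Lake data can now be joined with native BigQuery tables in a single query.
    query = """
        SELECT c.page, COUNT(*) AS views
        FROM `my-project.lake.clickstream` AS c
        JOIN `my-project.sales.orders` AS o USING (session_id)
        GROUP BY c.page
    """
    for row in client.query(query).result():
        print(row.page, row.views)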
Dataplex is an intelligent data fabric that enables you to keep your data distributed for the right price/performance while making this data securely accessible to all your analytics tools. It provides metadata-led data management with built-in data quality and governance, so you spend less time wrestling with infrastructure boundaries and inefficiencies, can trust the data you have, and spend more time deriving value from it. Additionally, it provides an integrated analytics experience, bringing the best of GCP and open source together, to enable you to rapidly curate, secure, integrate and analyze your data at scale.
Finally, you can build an analytics strategy that augments existing architecture and meets
your financial governance goals.
Data Mesh
Data Mesh is built on a long history of innovation from across data warehouses and data lakes, combined with the unparalleled scalability, performance, pay models, APIs, DevOps and close integration of Google Cloud products. With this approach, you can effectively create an on-demand data solution. A Data Mesh decentralizes data ownership among domain data owners, each of whom is held accountable for providing their data as a product in a standard way. A Data Mesh also facilitates communication between different parts of the organization to distribute datasets across different locations.

In a Data Mesh, the responsibility for generating value from data is federated to the people who understand it best; in other words, the people who created the data or brought it into the organization must also be responsible for creating consumable data assets as products from the data they create.

In many organizations, establishing a "single source of truth" or "authoritative data source" is challenging due to the repeated extraction and transformation of data across the organization without clear ownership responsibilities over the newly-created data. In the Data Mesh, the authoritative data source is the Data Product published by the source domain, with a clearly assigned Data Owner and Steward who is responsible for that data.
[Figure: example Data Mesh in BigQuery, with domain-owned data products such as online orders, customer details and product referential data.]
In summary, the Data Mesh promises domain-oriented, decentralized data ownership and architecture. This is enabled by
having federated computation and access layers just like we provide in GCP. Furthermore, if your organization is looking to get
more functionality, you can use something like Looker, which can provide a unified layer to model and access the data. Look-
er’s platform offers a single pane UI to access the truest, most up-to-date version of your company’s data and business defi-
nitions. With this unified view into the business, you can choose or design data experiences that assure people and systems
get data delivered to them in a way that makes the most sense for their needs. It fits in perfectly as it allows data scientists,
analysts and even business users to access their data with a single semantic model. Data scientists are still accessing the raw
data, but without the data movement and duplication.
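As one hedged illustration of a domain publishing its data product to consumers (a sketch with placeholder dataset and group names, not the Analytics Hub API described below), a domain team could grant a consuming team read access to its curated BigQuery dataset without copying any data:

    from google.cloud import bigquery

    client = bigquery.Client()

    # The "orders" domain owns and curates this dataset as its data product.
    dataset = client.get_dataset("my-project.orders_data_product")  # hypothetical dataset

    # Grant a consuming team read access to the product; no data is moved or duplicated.
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="READER",
            entity_type="groupByEmail",
            entity_id="analytics-team@example.com",  # placeholder consumer group
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])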
We’re building additional functionality on top of our workhorse products like BigQuery, to make the creation and management
of datasets easier. Analytics Hub provides the ability to create private data exchanges, in which exchange administrators (a.k.a.
data curators) give permissions to publish and subscribe to data in the exchange to specific individuals or groups both inside
the company and externally to business partners or buyers.
[Figure: Analytics Hub exchanges shared assets between publisher and subscriber BigQuery projects.]
Publish, discover and subscribe to shared assets, including open source formats, powered by the scalability of BigQuery.
Publishers can view aggregated usage metrics. Data providers can reach enterprise BigQuery customers with data, insights,
ML models or visualizations and leverage Cloud marketplace to monetize their apps, insights or models. This is also similar to
how BigQuery public datasets are managed through a Google-managed exchange. Drive innovation with access to unique
Google datasets, commercial/industry datasets, public datasets or curated data exchanges from your organization or partner
ecosystem.
There are typically three categories of migration that we see among customers: Lift and Replatform, Lift and Rehome, and full Modernization. For most businesses, we suggest starting with the Lift and Replatform, as it offers a high-impact migration with as little disruption and risk as possible. With this strategy, you migrate your data into BigQuery or […]

The second migration strategy we see most often is a full modernization as the first step. This provides a clean break from the past because you are going all in with a Cloud-native approach. It is built natively on GCP, but because you are changing everything in one go, migration can be slower if you have multiple, large legacy environments. A clean legacy break requires rewriting jobs and changing different applications. However, it provides higher velocity and agility, as well as the lowest total cost of ownership in the long run compared to the other approaches. This is because of two main reasons: your applications are already optimized and don't need to be retrofitted, and once you migrate your data sources, you don't have to manage two environments at the same time. This approach is best suited for digital natives or engineering-driven organizations with few legacy environments.

Lastly, the most conservative approach is a Lift and Rehome, which we recommend as a short-term tactical solution to move your data estate onto the Cloud. You can lift and rehome your existing platforms and carry on using them as before, but in the GCP environment. This is applicable for environments such as Teradata and Databricks, for example, to reduce the initial risk and allow applications to keep running. However, this brings the existing siloed environment to the Cloud rather than transforming it, so you won't benefit from the performance of a platform built natively on GCP. From there, we can help you with a full migration into Google Cloud native products, so you can take advantage of interoperability and create a fully modern analytics data platform on Google Cloud.
Tactical or strategic?
We think the key differentiators of an analytics data platform built on GCP are that it is open, intelligent, flexible and tightly
integrated. There are a lot of solutions in the market that provide tactical solutions that may feel comfortable and familiar.
However, these generally provide a short-term fix and just compound organizational and technical issues over time.
Google Cloud significantly simplifies data analytics. You can unlock the potential hidden in your data with a cloud-native,
serverless approach that decouples storage from compute and lets you analyze gigabytes to petabytes of data in minutes.
This allows you to remove the traditional constraints of scale, performance and cost to ask any question of data and solve
business problems. As a result, it becomes easier to operationalize insights across the enterprise with a single, trusted data
fabric.
• Solves for every stage of the data analytics lifecycle, from ingestion to transformation and analysis, to business intelligence and more
• Enables you to leverage the best open-source technologies for your organization
• Scales to meet the needs of your enterprise, particularly as you increase your use of data in driving your business and through your digital transformation
A modern, unified analytics data platform built on GCP gives you the best capabilities of a data lake and a data warehouse,
but with closer integration into the AI platform. You can automatically process real-time data from billions of streaming events
and serve insights in milliseconds to respond to changing customer needs. Our industry-leading AI services can opti-
mize your organizational decision making and customer experiences, helping you to close the gap between descriptive and
prescriptive analytics without having to staff up a new team. You can augment your existing skills to scale the impact of AI with
automated, built-in intelligence.