
Intro To Data Science On Cloud

This document provides an overview of the six key areas of the data science workflow: data engineering, data analysis, model development, ML engineering, insights activation, and orchestration. It summarizes Google Cloud products and services that can aid each area, including Dataflow for data ingestion and preprocessing, BigQuery for data storage and analysis, Vertex AI Workbench for end-to-end data science and ML workflows, and Looker and Data Studio for insights activation and business intelligence. The document encourages readers to learn more about these tools and how they can accelerate their data science processes on Google Cloud.

Uploaded by

Nageswar Makala

Intro to data science on Google Cloud
February 1, 2022

Priyanka Vergadia
Lead Developer Advocate, Google

Polong Lin
Developer Advocate

While you likely know that data science is the practice of making data useful, you may not have a clear picture of the tools that can aid each stage of the data science workflow as you use machine learning to tackle your challenges.


Read on to discover the six broad areas that are critical to the process of making data useful, and some corresponding Google Cloud products and services for those areas.

The 6 steps of Data Sc…

Data engineering

Perhaps the greatest missed opportunities in data science stem from data that exists somewhere, but hasn't been made accessible for use in further analysis. Laying the critical foundation for downstream systems, data engineering involves the transporting, shaping, and enriching of data for the purposes of making it available and accessible.

Data ingestion and data preprocessing on Google Cloud

Here we consider data ingestion as moving data from one place to another, and data preparation as the process of transformation, augmentation, or enrichment prior to consumption. Global scalability, high throughput, real-time access, and robustness are common challenges in this stage. For scalable, real-time, and batch data processing, look into building data ingestion and preprocessing pipelines with Dataflow, a managed Apache Beam service. There's a reason why Dataflow is called the backbone of analytics on Google Cloud.

If you're looking for a scalable messaging system to help you ingest data, consider Cloud Pub/Sub, a global, horizontally scalable messaging infrastructure. Cloud Pub/Sub was built using the same infrastructure component that enabled Google products, including Ads, Search, and Gmail, to handle hundreds of millions of events per second.

If you want an easy way to automate data movement to BigQuery, a serverless data warehouse on Google Cloud, look into the BigQuery Data Transfer Service. For transferring data to Cloud Storage, take a look at the Storage Transfer Service. Or, for a no-code data ingestion and transformation tool, check out Data Fusion, which has over 150 preconfigured connectors and transformations. In addition to Dataflow and Data Fusion for data preparation, Spark users may want to look at related products and features for Spark on Google Cloud.
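The distinction drawn in this section between ingestion (moving records) and preparation (transforming and enriching them) can be illustrated with a small sketch. This is plain Python using only the standard library, not the Dataflow or Apache Beam API; the sample CSV, the country lookup table, and all function names are hypothetical and exist only for illustration.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List

# Hypothetical raw input. In practice this would arrive from files or a message stream.
RAW_CSV = """user_id,country_code,amount
1,US,19.99
2,DE,5.00
3,US,not_a_number
"""

# Hypothetical lookup table used for the enrichment step.
COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}


def ingest(source: str) -> Iterator[Dict[str, str]]:
    """Ingestion: move records from the source into the pipeline unchanged."""
    yield from csv.DictReader(io.StringIO(source))


def transform(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Transformation: cast fields to proper types and skip rows that fail to parse."""
    for row in rows:
        try:
            yield {
                "user_id": int(row["user_id"]),
                "country_code": row["country_code"],
                "amount": float(row["amount"]),
            }
        except ValueError:
            # A production pipeline would likely route bad rows to a separate output.
            continue


def enrich(rows: Iterable[Dict[str, object]]) -> Iterator[Dict[str, object]]:
    """Enrichment: add a human-readable country name from the lookup table."""
    for row in rows:
        yield {**row, "country": COUNTRY_NAMES.get(str(row["country_code"]), "unknown")}


def run_pipeline(source: str) -> List[Dict[str, object]]:
    """Chain the three stages; each stage consumes the previous one lazily."""
    return list(enrich(transform(ingest(source))))


if __name__ == "__main__":
    for record in run_pipeline(RAW_CSV):
        print(record)
```

Because each stage is a generator, records flow through one at a time, which loosely mirrors how a streaming pipeline processes elements; a managed service adds the scaling, fault tolerance, and connectors that this sketch leaves out.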

Data storage and data cataloging on Google Cloud

For structured data, consider a data warehouse like BigQuery, or any of the Cloud Databases (relational ones like Cloud SQL and NoSQL ones like Cloud Bigtable and Cloud Firestore). For unstructured data, you can always use Cloud Storage. You may also want to consider a data lake. For data discovery, cataloging, and metadata management, consider Data Catalog. For a unified solution, take a look at Dataplex, which combines unified data management with an integrated analytics experience.

Learn more about data engineering on Google Cloud

Explore the data engineering learning path

Discover reference patterns

Get certified by Google Cloud as a Professional Data Engineer


Data analysis

From descriptive statistics to visualizations, data analysis is where the value of data starts to appear.

Data exploration, data preprocessing, and data insights

Data exploration, a highly iterative process, involves slicing and dicing data via data preprocessing before data insights can start to manifest through visualizations or via simple group-by and order-by operations. One hallmark of this phase is that the data scientist may not yet know which questions to ask about the data. In this somewhat ephemeral phase, a data analyst or scientist has likely uncovered some aha-moments, but hasn't shared them yet. Once insights are shared, the flow enters the Insights Activation stage, where those insights are used to guide business decisions, influence consumer choices, or become embedded in other applications or services.
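As a minimal sketch of the "simple group-by, order-by" style of exploration mentioned above, the example below runs an aggregate query against an in-memory SQLite database from the Python standard library. SQLite is used here only as a local stand-in so the example runs anywhere; it is not BigQuery, and the table name, column names, and sample rows are hypothetical.

```python
import sqlite3
from typing import List, Sequence, Tuple

# Hypothetical sample data: (category, amount) pairs.
ORDERS: List[Tuple[str, float]] = [
    ("books", 12.50),
    ("games", 40.00),
    ("books", 7.50),
    ("music", 9.99),
    ("games", 20.00),
]


def revenue_by_category(rows: Sequence[Tuple[str, float]]) -> List[Tuple[str, int, float]]:
    """Group orders by category and sort the groups by total revenue, highest first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT category, COUNT(*) AS n_orders, SUM(amount) AS revenue
            FROM orders
            GROUP BY category
            ORDER BY revenue DESC
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for category, n_orders, revenue in revenue_by_category(ORDERS):
        print(f"{category}: {n_orders} orders, {revenue:.2f} revenue")
```

The same GROUP BY and ORDER BY clauses are standard SQL, so a query of this shape should carry over to a data warehouse with little change beyond the table reference.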

On Google Cloud, there are many ways to explore, preprocess, and uncover insights in your data. If you are looking for a notebook-based end-to-end data science environment, check out Vertex AI Workbench, which enables you to access, analyze, and visualize your entire data estate: from structured data at the petabyte-scale in SQL with BigQuery, to processing data with Spark on Google Cloud and its serverless, auto-scaling, and GPU acceleration capabilities. As a unified data science environment, Vertex AI Workbench also makes it easy to do machine learning with TensorFlow, PyTorch, and Spark, with built-in MLOps capabilities.

Finally, if your focus is on analyzing structured data from data warehouses and insight activation for business intelligence, you may want to also consider using Looker, with its rich interactive analytics, visualizations, dashboarding tools, and Looker Blocks to help you accelerate your time-to-insight.

Learn more about data analysis on Google Cloud

Learn about Vertex AI Workbench for a Jupyter-based fully managed notebook environment

Learn about how you can use BigQuery for petabyte-scale data analysis

Learn about Spark on Google Cloud

Discover the data analyst learning path

Explore reference patterns for common analytics use cases

Model development

From linear regression to XGBoost, from TensorFlow to PyTorch, the model development stage is where machine learning starts to provide new ways of unlocking value from your data. Experimentation is a strong theme here, with data scientists looking to accelerate iteration speed between models without worrying about infrastructure overhead or context-switching between tools for data analysis and tools for productionizing models with MLOps.

To address these challenges, Vertex AI Workbench again serves as a one-stop shop for data science: a Jupyter-based, fully managed, scalable, and enterprise-ready environment that combines analytics and machine learning, including Vertex AI services. Apache Spark, XGBoost, TensorFlow, and PyTorch are just some of the frameworks supported on Vertex AI Workbench. Vertex AI Workbench makes it easy to manage the underlying compute infrastructure needed for model training, with the ability to scale vertically and horizontally, and with idle timeouts and auto shutdown capabilities to reduce unnecessary costs. Notebooks themselves can be used for distributed training and hyperparameter optimization, and they include Git integration for version control. Due to the significant reduction in context switching required, data scientists can build and train models 5x faster using Vertex AI Workbench than when using traditional notebooks.

With Vertex AI, custom models can be trained and deployed using containers. You can take advantage of pre-built containers or custom containers to train and deploy your models.

For low-code model development, data analysts and data scientists can use SQL with BigQuery ML to train and deploy models (including XGBoost, deep neural networks, and PCA), directly using BigQuery's built-in serverless, autoscaling capabilities. Behind the scenes, BigQuery ML leverages Vertex AI to enable automated hyperparameter tuning and explainable AI. For no-code model development, Vertex AI Training provides a point-and-click interface to train powerful models using AutoML, which comes in multiple flavors: AutoML Tables, AutoML Image, AutoML Text, AutoML Video, and AutoML Translation.
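To make the "train a model with SQL" idea concrete, here is a hedged sketch that assembles a BigQuery ML `CREATE MODEL` statement as a string. It only builds the SQL text and does not connect to BigQuery. The helper function, the dataset, table, and column names are all hypothetical; `BOOSTED_TREE_CLASSIFIER` is used as the model type because it is the XGBoost-based classifier option in BigQuery ML.

```python
from typing import Sequence


def build_create_model_sql(
    model_name: str,
    model_type: str,
    label_column: str,
    feature_columns: Sequence[str],
    source_table: str,
) -> str:
    """Assemble a BigQuery ML CREATE MODEL statement (hypothetical helper).

    Identifiers are interpolated directly, so this is only suitable for
    trusted, hard-coded names, never for user-supplied input.
    """
    columns = ", ".join([*feature_columns, label_column])
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label_column}']) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


if __name__ == "__main__":
    # All names below are made up for illustration.
    print(
        build_create_model_sql(
            model_name="my_dataset.churn_model",
            model_type="BOOSTED_TREE_CLASSIFIER",
            label_column="churned",
            feature_columns=["tenure_months", "monthly_spend"],
            source_table="my_dataset.customers",
        )
    )
```

The resulting statement could then be submitted through whichever BigQuery client or console you already use; the training itself runs inside the warehouse rather than in the calling program.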

Learn more about model development on Google Cloud

Learn about Vertex AI Workbench for a Jupyter-based fully managed notebook environment

Learn more about Vertex AI

ML engineering

Once a satisfactory model is developed, the next step is to incorporate all the activities of a well-engineered application lifecycle, including testing, deployment, and monitoring. And all of those activities should be as automated and robust as possible.

Managed datasets and Feature Store on Vertex AI provide shared repositories for datasets and engineered features, respectively, which provide a single source of truth for data and promote reuse and collaboration within and across teams. Vertex AI's model serving capability enables deployment of models with multiple versions, automatic capacity scaling, and user-specified load balancing. Finally, Vertex AI Model Monitoring provides the ability to monitor prediction requests flowing into a deployed model and automatically alert model owners whenever the production traffic deviates from previous historical prediction requests beyond user-defined thresholds.
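The idea of alerting when production traffic deviates from historical requests beyond a user-defined threshold can be sketched in a few lines. The example below compares the distribution of one categorical feature between a baseline and recent serving traffic using the L-infinity distance (the largest absolute difference in any category's share). This metric and all function names are assumptions chosen for illustration; the sketch does not claim to reproduce how Vertex AI Model Monitoring is implemented.

```python
from collections import Counter
from typing import Dict, Iterable


def distribution(values: Iterable[str]) -> Dict[str, float]:
    """Return the share of each category among the observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute a distribution from zero values")
    return {category: count / total for category, count in counts.items()}


def linf_distance(baseline: Dict[str, float], current: Dict[str, float]) -> float:
    """Largest absolute difference in share across all categories seen in either set."""
    categories = set(baseline) | set(current)
    return max(abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in categories)


def has_drifted(
    baseline_values: Iterable[str],
    serving_values: Iterable[str],
    threshold: float,
) -> bool:
    """True when the serving distribution is further than `threshold` from the baseline."""
    distance = linf_distance(distribution(baseline_values), distribution(serving_values))
    return distance > threshold


if __name__ == "__main__":
    # Hypothetical feature values: historical traffic vs. recent prediction requests.
    baseline = ["a"] * 50 + ["b"] * 50
    serving = ["a"] * 90 + ["b"] * 10
    if has_drifted(baseline, serving, threshold=0.3):
        print("ALERT: feature distribution deviates beyond the configured threshold")
```

In a real deployment the check would run on a schedule over logged requests, and the alert would notify the model owner rather than print to the console.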

MLOps is the industry term for modern, well-engineered ML services, with scalability, monitoring, reliability, automated CI/CD, and many other characteristics and functions that are now taken for granted in the application domain. The ML engineering features provided by Vertex AI are informed by Google's extensive experience deploying and operating internal ML services. Our goal with Vertex AI is to provide everyone with easy access to essential MLOps services and best practices.

Learn more about ML engineering and MLOps on Google Cloud

Follow the guides, tutorials and documentation for Vertex AI

Watch this video to learn more about Vertex AI

Discover the data scientist/machine learning engineer learning path

Get certified as a Professional ML Engineer

Insights activation

The insights activation stage is where your data has now become useful to other teams and processes. You can use Looker and Data Studio to enable use cases in which data is used to influence business decisions with charts, reports, and alerts.

Data can also influence customer decisions and as a result increase usage or decrease churn, for example. Finally, the data can also be used by other services to drive insights; these services can run outside Google Cloud, inside Google Cloud on Cloud Run or Cloud Functions, and/or using Apigee API Management as an interface.

Learn more about insights activation on Google Cloud

Watch this video to learn about building interactive ML apps using Looker and Vertex AI

Learn about Looker, and Looker solutions for eCommerce, Digital Media and more

Discover a gallery of interactive dashboards created with Data Studio

Watch this video to understand the difference between Cloud Run and Cloud Functions

Orchestration

All of the capabilities discussed above provide the key building blocks to a modern data science solution, but a practical application of those capabilities requires orchestration to automatically manage the flow of data from one service to another. This is where a combination of data pipelines, ML pipelines, and MLOps comes into play. Effective orchestration reduces the amount of time that it takes to reliably go from data ingestion to deploying your model in production, in a way that lets you monitor and understand your ML system.

For data pipeline orchestration, Cloud Composer and Cloud Scheduler are both used to kick off and maintain the pipeline.

For ML pipeline orchestration, Vertex AI Pipelines is a managed machine learning service that enables you to increase the pace at which you experiment with and develop machine learning models and the pace at which you transition those models to production. Vertex AI Pipelines is serverless, which means that you don't need to deal with managing an underlying GKE cluster or infrastructure. It scales up when you need it to, and you pay only for what you use. In short, it lets you just focus on building your data science pipelines.
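At its core, orchestration means running steps in an order that respects their dependencies. The sketch below shows that idea with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). It is not the Cloud Composer, Airflow, or Vertex AI Pipelines API; the step names and the `run_in_dependency_order` helper are hypothetical and only illustrate the dependency-ordering concept.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical pipeline: each step maps to the set of steps it depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "preprocess": {"ingest"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_in_dependency_order(
    steps: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
) -> List[str]:
    """Run every step after all of its dependencies and return the execution order.

    TopologicalSorter raises graphlib.CycleError if the dependencies contain a cycle.
    """
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        steps[name]()
    return order


if __name__ == "__main__":
    step_names = ["ingest", "preprocess", "train", "evaluate", "deploy"]
    # Each placeholder step just reports that it ran.
    steps = {name: (lambda n=name: print(f"running {n}")) for name in step_names}
    run_in_dependency_order(steps, DEPENDS_ON)
```

A managed orchestrator layers scheduling, retries, logging, and artifact tracking on top of this ordering, which is the part you would otherwise have to build and operate yourself.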

Learn more about orchestration on Google Cloud

Read more about Cloud Composer for Airflow-based pipelines

Try some example notebooks on GitHub with Vertex AI Pipelines

Learn different ways to trigger Vertex AI Pipeline runs

Read the whitepaper on Practitioners Guide to MLOps: A framework for continuous delivery and automation of machine learning

Summary

Google Cloud offers a complete suite of data management, analytics, and machine learning tools to generate insights from data. Want to learn more? Check out the following resources:

Building the data science driven organization from the first principles

Special thanks to the following contributors to this blogpost: Alok Pattani, Brad Miro, Saeed Aghabozorgi, Diptiman Raichaudhuri, Reza Rokni.
