M1 - Introduction To Data Engineering Slides

The document discusses the role of data engineers and challenges they face. It introduces data lakes, data warehouses, BigQuery, and how they address issues like data access, quality, and availability of computational resources. The last sections cover transactional databases, partnering with other teams, and a case study.

Uploaded by

Écio Ferreira

Introduction to Data Engineering
Agenda
Explore the role of a data engineer

Analyze data engineering challenges

Intro to BigQuery

Data Lakes and Data Warehouses

Transactional Databases vs Data Warehouses

Partner effectively with other data teams


● Manage data access and governance
● Build production-ready pipelines

Review GCP customer case study

Lab: Analyzing Data with BigQuery


A data engineer builds data pipelines to enable
data-driven decisions

● Get the data to where it can be useful
● Get the data into a usable condition
● Add new value to the data
● Manage the data
● Productionize data processes

So… how do we get the raw data from multiple systems, and where can we store it durably?
A data lake brings together data from across the
enterprise into a single location

[Diagram: sources are replicated as raw data into the Data Lake]

● RDBMS
● Spreadsheets
● Offline files
● Other systems and apps
Key considerations when building a Data Lake

1. Can your data lake handle all the types of data you have?
2. Can it scale to meet the demand?
3. Does it support high-throughput ingestion?
4. Is there fine-grained access control to objects?
5. Can other tools connect easily?

We need an elastic data container that is flexible and durable to stage all our data …
Cloud Storage is designed for 99.999999999% annual durability

● Replace/decommission infrastructure
● Content storage and delivery
● Backup
● Analytics and ML

Quickly create buckets with Cloud Shell:

gsutil mb gs://your-project-name
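To put the durability figure in perspective, here is a small worked calculation. It is only a sketch: it assumes the 99.999999999% figure can be read as an independent per-object probability of surviving one year, which is a simplification for illustration, and the object count is a hypothetical example.

```python
from decimal import Decimal

# Durability figure quoted on the slide (eleven nines), written as a fraction.
ANNUAL_DURABILITY = Decimal("0.99999999999")


def expected_annual_losses(object_count: int) -> Decimal:
    """Expected number of objects lost per year.

    Assumption (not from the slide): each object independently survives
    a year with probability ANNUAL_DURABILITY.
    """
    loss_probability = Decimal(1) - ANNUAL_DURABILITY
    return loss_probability * object_count


stored_objects = 10_000_000  # hypothetical number of stored objects
losses = expected_annual_losses(stored_objects)
print(f"Expected objects lost per year: {losses}")
print(f"Roughly one object lost every {1 / losses:,.0f} years")
```

Decimal arithmetic is used here because a loss probability this small is awkward to represent exactly in binary floating point.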
What if your data is not usable in its original form?

"Some assembly required": ETL (Extract, Transform, and Load) data processing

● Cloud Dataproc
● Cloud Dataflow
What if your data arrives continuously and endlessly?

"This data does not wait": streaming data processing

● Cloud Pub/Sub
● Cloud Dataflow
● BigQuery
Common challenges encountered by data engineers

● Access to data
● Data accuracy and quality
● Availability of computational resources
● Query performance
Challenge: Consolidating disparate datasets and data
formats, and managing access at scale
Getting insights across multiple datasets is difficult
without a data lake

● Data is scattered across Google Analytics 360, CRM, and Campaign Manager products, among other sources.
● No common tool exists to analyze data and share results with the rest of the organization.
● Customer and sales data is stored in a CRM system.
● Some data is not in a queryable format.
Data is often siloed in many upstream source systems

Example query: Give me all the in-store promotions for recent orders and their inventory levels

Stored in a separate system, with restricted access
Challenge: Cleaning, formatting, and getting the data
ready for useful business insights in a data warehouse

Assume that any raw data from source systems needs to be
cleaned, transformed, and stored in a data warehouse

Query: Give me the best performing in-store promotions in France

● Missing data and all timestamps stored as UTC
● Promotion list stored as .csv files or manual spreadsheets
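The cleaning step described above can be sketched in a few lines. This is a minimal illustration, not the course's pipeline: the column names (promo_id, country, started_at, revenue) and the sample rows are invented, and dropping incomplete rows is just one possible policy for missing data.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical promotion export; column names and rows are invented for illustration.
RAW_CSV = """promo_id,country,started_at,revenue
P1,FR,2020-03-01 09:00:00,1200.50
P2,FR,,300
P3,DE,2020-03-02 14:30:00,
P4,FR,2020-03-05 18:15:00,950
"""


def clean_rows(raw_text: str) -> list[dict]:
    """Drop rows with missing fields and attach an explicit UTC timezone."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Skip records where any field is empty rather than guessing a value.
        if any(value is None or value.strip() == "" for value in row.values()):
            continue
        started = datetime.strptime(row["started_at"], "%Y-%m-%d %H:%M:%S")
        cleaned.append({
            "promo_id": row["promo_id"],
            "country": row["country"],
            # The source stores timestamps as UTC, so make that explicit.
            "started_at": started.replace(tzinfo=timezone.utc),
            "revenue": float(row["revenue"]),
        })
    return cleaned


def best_promotion(rows: list[dict], country: str) -> dict:
    """Return the highest-revenue promotion for one country."""
    candidates = [r for r in rows if r["country"] == country]
    return max(candidates, key=lambda r: r["revenue"])


rows = clean_rows(RAW_CSV)
print(best_promotion(rows, "FR")["promo_id"])
```

Only after the incomplete rows are handled and the timestamps carry a timezone does the "best promotion in France" question have a trustworthy answer.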
Challenge: Ensuring you have the compute capacity
to meet peak demand for your team
Challenge: Data engineers need to manage server and cluster
capacity when running on-premises

[Chart: resources over time, comparing on-premises compute capacity with consumption]

● Under-provisioned (wasting time): consumption exceeds capacity
● Under-utilized (wasting $$$): capacity exceeds consumption
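The two kinds of waste in the chart can be made concrete with a little arithmetic. The sketch below assumes a fixed on-premises capacity and an hourly demand series; the numbers and the "compute units" are hypothetical.

```python
def provisioning_gaps(capacity: int, demand: list[int]) -> tuple[int, int]:
    """Return (unmet_demand, idle_capacity) summed over all time steps."""
    # Under-provisioned: demand above capacity has to wait (wasted time).
    unmet = sum(max(d - capacity, 0) for d in demand)
    # Under-utilized: capacity above demand is paid for but unused (wasted money).
    idle = sum(max(capacity - d, 0) for d in demand)
    return unmet, idle


# Hypothetical hourly demand in "compute units"; the numbers are invented.
hourly_demand = [20, 35, 90, 120, 60, 25]
fixed_capacity = 80

unmet, idle = provisioning_gaps(fixed_capacity, hourly_demand)
print(f"Unmet demand: {unmet} unit-hours, idle capacity: {idle} unit-hours")
```

With a fixed capacity, any demand series that varies produces one gap or the other, and raising the capacity only trades unmet demand for idle hardware.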
Challenge: Queries need to be optimized for
performance (caching, parallel execution)

Challenge: Managing query performance
on-premises comes with added overhead

● Choosing a query engine
● Continually patching and updating query engine software
● Managing clusters and when to re-cluster
● Optimizing for concurrent queries and quota / demand between teams

Is there a better way to manage server overhead so we can focus on insights?
BigQuery is Google’s data warehouse solution

● Data warehouse: BigQuery replaces a typical data warehouse hardware setup
● Data mart: BigQuery organizes data tables into units called datasets
● Data lake: BigQuery defines schemas and issues queries directly on external data sources
● Tables and views: Function the same way as in a traditional data warehouse
● Grants: Cloud IAM grants permission to perform specific actions
Cloud allows data engineers to spend less time managing
hardware and enabling scale; let Google do that for you

Typical big data processing: insights, plus monitoring, performance tuning, resource provisioning, utilization improvements, handling growing scale, deployment & configuration, and reliability

With Google: insights
You don't need to provision resources before using BigQuery

[Chart: resources over time; with on-demand storage and compute, allocation follows consumption]
A data engineer gets data into a useable condition

A data warehouse stores transformed data in a
usable condition for business insights

[Diagram: Raw Data -> (Replicate) -> Data Lake -> (Extract, Transform, Load Pipeline) -> Data Warehouse]

What are the key considerations when deciding between data warehouse options?
Considerations when choosing a data warehouse
include:

● Can it serve as a sink for both batch and streaming data pipelines?
● Can the data warehouse scale to meet my needs?
● How is the data organized, cataloged, and access controlled?
● Is the warehouse designed for performance?
● What level of maintenance is required by our engineering team?
BigQuery is a modern data warehouse that changes
the conventional mode of data warehousing

Traditional DW                                 BigQuery
Complex ETL                                    Automate data delivery
Restricted to only a few users                 Make insights accessible
Optimized for legacy BI                        Build the foundation for AI
Optimized for batch data                       Tee up real-time insights
Needs continuous patching and updates          Fully managed
Needs an army of DBAs for operational tasks    Simplify data operations
You can simplify Data Warehouse ETL pipelines with external
connections to Cloud Storage and Cloud SQL

Federated Query sources:

● Cloud SQL (Postgres, MySQL, SQL Server)
● Cloud Storage

Demo: Federated Queries with BigQuery
Cloud SQL is fully managed SQL Server, Postgres, or MySQL
for your Relational Database (transactional RDBMS)

● Automatic encryption
● 30TB storage capacity
● 60,000 IOPS (read/write per second)
● Auto-scale and auto backup

Why not simply use Cloud SQL for reporting workflows?
RDBMS are optimized for data from a single source and
high-throughput writes vs high-read data warehouses

You will likely need and encounter both a database and data warehouse in your final architecture

Cloud SQL
● Scales to GB and TB
● Ideal for back-end database applications
● Record based storage

BigQuery
● Scales to PB
● Easily connect to external data sources for ingestion
● Column based storage
Relational database management systems (RDBMS)
are critical for managing new transactions

RDBMS are optimized for high throughput WRITES to RECORDS
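The difference between record based and column based storage can be illustrated with plain Python data structures. This is a toy sketch of the two layouts, not how Cloud SQL or BigQuery store data internally; the order fields and values are invented.

```python
# Hypothetical order records; field names and values are invented.
orders = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "bo", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]

# Record-based layout: one entry per row, so adding a transaction is a single append.
row_store = list(orders)
row_store.append({"order_id": 4, "customer": "cy", "amount": 50.0})


def to_column_store(rows: list[dict]) -> dict[str, list]:
    """Pivot rows into one list per column."""
    columns: dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-based layout: an aggregate reads only the column it needs.
column_store = to_column_store(row_store)
total_amount = sum(column_store["amount"])
print(total_amount)
```

Writing a new order touches one record in the row layout, while the total is computed from the amount column alone in the column layout, which mirrors the write-heavy versus read-heavy split described on these slides.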
The complete picture: Source data comes into the data lake, is
processed into the data warehouse and made available for insights

[Diagram: Raw Data -> Data Lake -> Data Warehouse, which feeds three pipelines]

● Feature pipeline -> ML Model
● Eng pipeline -> Other Team Data Warehouse
● BI pipeline -> Reporting Dashboards

Who leads these other teams that we will have to partner with?
A data engineer builds data pipelines to enable
data-driven decisions

What teams rely on these pipelines?
Many teams rely on partnerships with data
engineering to get value out of their data

● Machine Learning Engineer
● Data Analyst
● Data Engineer

How might each of these teams rely on data engineering?


Machine learning teams need data engineers to help
them capture new features in a stable pipeline
“Are all of these features available at production time?”

[Diagram: Raw Data -> Data Lake -> Data Warehouse -> Feature pipeline -> ML Model]

“Can you help us get more features (columns) of data for our machine learning model?”
Add value: Machine learning directly in BigQuery

1 Dataset

2 Create/train
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT

3 Evaluate
FROM
ML.EVALUATE(MODEL
`bqml_tutorial.sample_model`,
TABLE eval_table)

4 Predict/classify
FROM
ML.PREDICT(MODEL
`bqml_tutorial.sample_model`,
table game_to_predict) )
AS predict
Data analysis and business intelligence teams rely on
data engineering to showcase the latest insights
“What data is available in your warehouse for us to access?”

[Diagram: Raw Data -> Data Lake -> Data Warehouse -> BI pipeline -> Reporting Dashboards]

“Our dashboards are slow, can you help us re-engineer our BI tables for better performance?”
Add value: BI Engine for dashboard performance
[Diagram: Batch or Streaming -> BigQuery -> BigQuery BI Engine -> Data Studio, Sheets, Partner BI tools]

● No need to manage OLAP cubes or separate BI servers for dashboard performance
● Natively integrates with BigQuery streaming for real-time data refresh
● Column oriented in-memory BI execution engine
Other data engineering teams may rely on your
pipelines being timely and error free

“How can we both ensure dataset uptime and performance?”

[Diagram: Raw Data -> Data Lake -> Data Warehouse -> Data Pipeline -> Other Team Data Warehouse]

“We’re noticing high demand for your datasets -- be sure your warehouse can scale for many users”
Add value: Cloud Monitoring for performance

● View in-flight and completed queries.
● Track spending on BigQuery resources.
● Use Cloud Audit Logs to view actual job information (who executed it, what query was run).
● Create alerts and send notifications.
A data engineer manages data access and governance

Data engineering must set and communicate a
responsible data governance model

[Diagram: Raw Data -> Data Lake -> Data Warehouse, feeding the Feature pipeline (ML Model), the Eng pipeline (Other Team Data Warehouse), and the BI pipeline (Reporting Dashboards)]

● Who should have access?
● How is PII handled?
● How can we educate end-users on our data catalog?
Cloud Data Catalog is a managed data discovery service; the
Data Loss Prevention API helps guard PII

Data Catalog

● Simplify data discovery at any scale: Fully managed metadata management service with no infrastructure to set up or manage
● Unified view of all datasets: Central and secure data catalog across Google Cloud with metadata capture and tagging
● Data governance foundation: Security compliance with access level controls along with Cloud Data Loss Prevention integration for handling sensitive data
Demo: Finding PII in your dataset with DLP API
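The demo uses the Cloud DLP API. As a rough local illustration of the idea only, the sketch below scans text with two regular expressions. This is a swapped-in stand-in, not the DLP API and not how DLP works internally; the patterns, labels, sample text, and function names are assumptions, and patterns this simple would miss many real-world formats.

```python
import re

# Deliberately simple patterns; real PII detection needs far more than this.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs found in the text."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return findings


def redact(text: str) -> str:
    """Replace each match with its label in square brackets."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Hypothetical free-text field from a customer record.
sample = "Contact jane.doe@example.com or call +44 20 7946 0958 about order 1234."
print(find_pii(sample))
print(redact(sample))
```

A scan like this answers the governance question "How is PII handled?" only in the narrowest sense; a managed service adds many more detectors and integrates with the data catalog.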
A data engineer builds production data pipelines to
enable data-driven decisions

Data engineering owns the health and future of their
production data pipelines

[Diagram: Raw Data -> Data Lake -> Data Warehouse, feeding the Feature pipeline (ML Model), the Eng pipeline (Other Team Data Warehouse), and the BI pipeline (Reporting Dashboards)]

● How can we ensure pipeline health and data cleanliness?
● How do we productionalize these pipelines to minimize maintenance and maximize uptime?
● How do we respond and adapt to changing schemas and business needs?
● Are we using the latest data engineering tools and best practices?
Cloud Composer (managed Apache Airflow) is used
to orchestrate production workflows
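Airflow describes a workflow as a directed acyclic graph (DAG) of tasks. The sketch below is not Airflow code; it uses Python's standard-library graphlib module to show the underlying idea of running tasks in dependency order. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "load_raw_to_lake": set(),
    "transform": {"load_raw_to_lake"},
    "load_warehouse": {"transform"},
    "refresh_dashboards": {"load_warehouse"},
    "build_ml_features": {"load_warehouse"},
}


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; raises graphlib.CycleError on cycles."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
for step, task in enumerate(order, start=1):
    print(step, task)
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is what makes it suitable for production workflows.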
Ocado’s customer service department is
bombarded with messages
Can we use ML to prioritize these messages?

“My order is missing 12 steaks. I need them for a party I’m hosting in 6 hours.”

“I love Ocado! You’re the best.”

“I scheduled my delivery for 3 PM today. Something came up and I won’t be home at that time. Can I reschedule?”

“My delivery is five hours late. Can I get a full refund?”
Ocado’s GCP solution helps them respond to urgent customer
emails 4x faster with ML
Increased contact center efficiency enables representatives to spend extra time on high-priority tasks

AI Platform
http://www.multichannel-blog.co.uk/2017/05/03/google-the-future-of-cloud-conference-in-london-3-4th-may/
Twitter democratized data analysis using BigQuery
“We believe that users with a wide range of technical skills should be able to discover
data and have access to SQL-based analysis and visualization tools that perform well”
-- Twitter

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/democratizing-data-analysis-with-google-bigquery.html
Recap
● Data sources
● Data lakes
● Data warehouses
● Google Cloud solutions for
Data Engineering
Concept Review:

Data sources feed into a Data Lake and are processed into your Data Warehouse for analysis

[Diagram labels: Data stores, AI Platform, AI Platform Notebooks]
Here’s a useful guide
for “GCP products in
4 words or less”

https://github.com/gregsramblings/google-cloud-4-words
Updated continually. By Greg Wilson, Google DevRel
Lab: Using BigQuery to do Analysis

Objectives

● Execute interactive queries in the BigQuery console
● Combine and run analytics on multiple datasets
