
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Data Storage and Queries

Storage Abstractions

Week 2 Overview
• Storage Hierarchy
• Storage Abstractions

Storage Hierarchy
• Storage abstractions sit at the top of the hierarchy.
• Storage systems sit underneath (e.g. a cache).
• Raw ingredients form the base: physical components, processes, networking, serialization, CPU, compression, caching.
Data Warehouse + Data Lake = Data Lakehouse
• Data warehouse: a cloud data warehouse
• Data lake: supports growing storage needs
• Data lakehouse: combines the advantages of data warehouses and data lakes
Week 2 Lab - Simple Data Lake
• Source bucket → Glue ETL → Data Lake bucket → Crawler → Amazon Athena

Week 2 Lab - Building a Data Lakehouse
Storage Abstractions

Data Warehouse - Key Architectural Ideas

Data Warehouse:
"A subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions."
Bill Inmon, "father" of the data warehouse
Data Warehouse
• Subject-Oriented: Organizes and stores data around key business domains (models data to support decision making), e.g. Customers, Products, Sales, Finance.
• Integrated: Combines data from different sources into a consistent format.
• Nonvolatile: Data is read-only and cannot be deleted or updated; changes from the data sources arrive as new snapshots.
• Time-Variant: Stores current and historical data (unlike OLTP systems).
Data Warehouse-Centric Architecture
Extract-Transform-Load (ETL)
• Extract: Pull data from the sources (e.g. sales, marketing, finance) into a staging area.
• Transform: Clean and standardize the data, then model it into the warehouse's comprehensive schema.
• Load: Load the modeled data into the data warehouse; data marts with simple, denormalized schemas feed analytics & reports.
Change Data Capture
• Change data capture (CDC) tracks change events in the source systems (e.g. sales, marketing, finance) and feeds them through the same extract-transform-load path into the data warehouse.
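The flow of change events can be sketched in plain Python. This is a minimal sketch: the event shape (op/key/row) is a hypothetical illustration, not the format of any particular CDC tool.

```python
# Minimal sketch: applying change-data-capture (CDC) events to a target
# table held as a dict keyed by primary key. Event format is hypothetical.

def apply_cdc(table, events):
    """Apply insert/update/delete change events in order."""
    for event in events:
        if event["op"] in ("insert", "update"):
            table[event["key"]] = event["row"]
        elif event["op"] == "delete":
            table.pop(event["key"], None)
    return table

sales = {1: {"amount": 40}}
changes = [
    {"op": "insert", "key": 2, "row": {"amount": 23}},
    {"op": "update", "key": 1, "row": {"amount": 45}},
    {"op": "delete", "key": 2},
]
sales = apply_cdc(sales, changes)
```

Applying events in order is what keeps the warehouse copy consistent with the source: the same key can be inserted, updated, and deleted within one batch.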
Data Warehouse Implementation
• Early data warehouses: a big monolithic server.
• Data warehouses with Massively Parallel Processing (MPP): scan large amounts of data in parallel, but involve complex configurations and require effort to maintain.
• Modern cloud data warehouses (e.g. Amazon Redshift): separate compute from storage and expand the capability of MPP systems.
Storage Abstractions

Modern Cloud Data Warehouses

Massively Parallel Processing (MPP)
• Multiple processors work on large amounts of data in parallel.
• Cloud data warehouses such as Amazon Redshift are built on MPP.
MPP Architecture for Amazon Redshift
• A client application connects to the Redshift cluster over JDBC/ODBC.
• Leader node: parses the request, forms an execution plan, and compiles code.
• Compute nodes: each has its own CPU and memory space and is divided into slices that process data in parallel.
• To scale, you can upgrade the node type of the cluster.
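How an MPP system spreads work can be sketched as hash-distributing rows across slices. This is a conceptual illustration only, not Redshift's actual internal distribution algorithm; the table and slice count are made up.

```python
# Conceptual MPP sketch: rows are hashed on a distribution key and
# assigned to compute-node slices, so each slice can scan its share of
# the data in parallel. (Not Redshift's real algorithm.)
import hashlib

def slice_for(key, num_slices):
    # Stable hash so the same key always lands on the same slice.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_slices

NUM_SLICES = 4
rows = [{"order_id": i, "price": 10 * i} for i in range(1, 9)]
slices = {s: [] for s in range(NUM_SLICES)}
for row in rows:
    slices[slice_for(row["order_id"], NUM_SLICES)].append(row)

# Every row lands on exactly one slice.
total_rows = sum(len(v) for v in slices.values())
```

A stable hash matters here: co-locating rows that share a join key on the same slice is what lets the engine join them without shuffling data between nodes.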
Data Warehouse-Centric Architecture
Extract-Load-Transform (ELT)
• Extract: Pull data from the sources (e.g. sales, marketing, finance).
• Load: Load the raw data into a staging area inside the cloud data warehouse (with MPP).
• Transform: Transform and model the data inside the warehouse; data marts feed analytics & reports and machine learning.
Cloud Data Warehouse
Columnar architecture and the separation of compute and storage facilitate high-performance analytical queries.

Order ID | Price | Product SKU | Quantity | Customer ID
1        | 40    | 458650      | 10       | 67t
2        | 23    | 902348      | 14       | 56t
3        | 45    | 1255893     | 12       | 87q
4        | 50    | 456829      | 13       | 98q
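The benefit of a columnar layout can be sketched with the orders table above: an analytical query such as SUM(price) only has to read one column, not every row.

```python
# Columnar layout sketch: the same orders table stored row-wise and
# column-wise. Summing prices touches a single contiguous column.
rows = [
    {"order_id": 1, "price": 40, "sku": "458650",  "qty": 10, "customer": "67t"},
    {"order_id": 2, "price": 23, "sku": "902348",  "qty": 14, "customer": "56t"},
    {"order_id": 3, "price": 45, "sku": "1255893", "qty": 12, "customer": "87q"},
    {"order_id": 4, "price": 50, "sku": "456829",  "qty": 13, "customer": "98q"},
]

# Column-wise: one list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

total_price = sum(columns["price"])  # reads only the "price" column
```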


Cloud Data Warehouse
Traditional data warehouse:
• Stored data is highly structured
• Data modeled to enable analytical queries

Cloud data warehouse:
• Stored data is highly structured
• Data modeled to enable analytical queries
• High processing power from MPP
• Columnar storage
• Separation of storage and compute
• Efficiently stores and processes data for high-volume analytical workloads
Storage Abstractions

Data Lakes - Key Architectural Ideas

Data Lake
• Central repository for storing large volumes of data
• No fixed schema or predefined set of transformations
• Schema-on-read pattern: the reader determines the schema when reading the data
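Schema-on-read can be sketched in a few lines: the raw record is stored as-is, and each reader imposes its own schema at read time. The record and schema here are illustrative.

```python
# Schema-on-read sketch: raw JSON is stored untouched; a reader keeps
# only the fields it cares about and casts them to the types it needs.
import json

raw = '{"reviewerID": "A2SUAM1J3GNN3B", "overall": 5.0, "reviewText": "Great purchase!"}'

def read_with_schema(line, schema):
    """schema maps field name -> type to cast to at read time."""
    record = json.loads(line)
    return {field: cast(record[field]) for field, cast in schema.items()}

# One reader only cares about who rated what, as an integer rating.
rating_view = read_with_schema(raw, {"reviewerID": str, "overall": int})
```

Another reader of the same raw line could apply a completely different schema; nothing about the stored data changes.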

Data Lake 1.0
• Combined different storage and processing technologies
• Storage: e.g. Amazon S3
• Processing: separate processing tools
Shortcomings of Data Lake 1.0
Data swamp
• No proper data management
• No data cataloging
• No data discovery tools
• No guarantee on data integrity and quality

Write-only storage
• Data Manipulation Language (DML) operations were painful to implement: to delete or update rows, you had to create a new table.
• Difficult to comply with data regulations.

No schema management and data modeling
• Hard to process the stored data
• Data not optimized for query operations such as joins
Storage Abstractions

Next-Generation Data Lakes

Data Zones: used to organize data in a data lake, where each zone houses data that has been processed to varying degrees.
• Zone 1 - Landing / raw: data lands as-is.
• Zone 2 - Cleaned / transformed: clean, validate, standardize, remove PII information.
• Zone 3 - Curated / enriched: model the data and apply further transformations.
• Data is stored in open file formats (.parquet, .avro, .orc) and serves analytics and machine learning.
• Apply appropriate data governance policies on each zone and ensure data quality.
Data Partitioning: divide a dataset into smaller, more manageable parts based on a set of criteria (e.g. time, date, or location recorded in the data).
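Partitioning by time can be sketched with Hive-style year=/month= keys; queries filtered on time then only scan the matching partitions. The records and path layout are illustrative.

```python
# Hive-style partitioning sketch: records are routed into
# year=YYYY/month=MM paths so a time-filtered query reads only the
# partitions it needs.
from collections import defaultdict

records = [
    {"review": "a", "year": 2009, "month": 9},
    {"review": "b", "year": 2009, "month": 9},
    {"review": "c", "year": 2010, "month": 1},
]

partitions = defaultdict(list)
for r in records:
    partitions[f"year={r['year']}/month={r['month']:02d}"].append(r)

# A query for September 2009 reads a single partition.
sept_2009 = partitions["year=2009/month=09"]
```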
Data Catalog: a collection of metadata about the dataset (owner, source, partitions, schema, etc.).
Separate Data Lakes and Data Warehouses
• Data lake: low-cost storage; stores large amounts of data.
• Data warehouse: superior query performance for analytical use cases.
• An ETL pipeline copies a subset of the data from the lake into the warehouse for analytics.
• This is an expensive solution:
  • The pipeline can introduce bugs/failures.
  • It can cause issues with data quality, duplication, and consistency.
Lab Walkthrough

Simple Data Lake with AWS Glue (Part 1)

Lab Overview
• Raw data (reviews and product metadata) lands in the raw prefix of an S3 bucket.
• Glue ETL transforms both datasets into .parquet files under processed_data.
• A crawler populates the AWS Glue Data Catalog, and Amazon Athena queries the simple data lake.
• Optional part: explore the effects of compression and partitioning.
Processing the JSON Files
Complete two functions to process the raw JSON files.

Reviews:
{
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "reviewerName": "J. McDonald",
    "helpful": [2, 3],
    "reviewText": "Great purchase!",
    "overall": 5.0,
    "summary": "Heavenly Highway Hymns",
    "unixReviewTime": 1252800000,
    "reviewTime": "09 13, 2009"
}

• Import the review data into a tabular data frame with columns: reviewerID, asin, reviewerName, helpful, reviewText, overall, summary, unixReviewTime, reviewTime.
• Process the data: add year, month, and totalHelpful columns derived from the review time and the helpful field.
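A per-record version of this processing step might look like the sketch below. It assumes helpful = [helpful_votes, total_votes] and takes totalHelpful from the total-votes element; the actual lab code operates on a Spark DataFrame, so treat this as an illustration of the logic only.

```python
# Sketch of the review-processing step: derive year/month from
# unixReviewTime (UTC) and a total-votes column from the helpful pair.
# Assumption: helpful = [helpful_votes, total_votes].
from datetime import datetime, timezone

def process_review(review):
    processed = dict(review)
    ts = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)
    processed["year"] = ts.year
    processed["month"] = ts.month
    processed["totalHelpful"] = review["helpful"][1]
    return processed

review = {
    "reviewerID": "A2SUAM1J3GNN3B",
    "helpful": [2, 3],
    "overall": 5.0,
    "unixReviewTime": 1252800000,  # 09 13, 2009 (UTC)
}
row = process_review(review)
```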
Product Metadata:
{
    "asin": "0000031852",
    "description": "Girls Ballet Tutu Zebra Hot Pink",
    "title": "Girls Ballet Tutu Zebra Hot Pink",
    "price": 3.17,
    "related": {
        "also bought": ["B00JHONNS", "B002BZX8Z6", "B007R2RM8W"],
        "also viewed": ["B002BZX8Z6", "B00JHONN1S", "B00BFXLZ8M"],
        "bought together": ["B002BZX8Z6"]
    },
    "salesRank": {"Toys & Games": 211836},
    "brand": "Coxlures",
    "categories": [["Sports & Outdoors", "Dance"]]
}

• Import the metadata into a tabular data frame with columns: asin, description, title, price, related, salesRank, brand, categories.
• Process the data into columns asin, description, title, price, brand, sales_category, sales_rank:
  • Drop null entries from numerical columns.
  • Replace null values with empty strings in the other columns.
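The two null-handling rules can be sketched per record in plain Python. The lab itself works on a DataFrame; the records and the choice of numeric columns here are illustrative.

```python
# Sketch of the metadata-cleaning rules: drop records with nulls in
# numeric columns, replace nulls with empty strings in the others.
records = [
    {"asin": "0000031852", "title": "Ballet Tutu", "price": 3.17},
    {"asin": "0000031853", "title": None,          "price": 12.99},
    {"asin": "0000031854", "title": "Toy Car",     "price": None},
]
NUMERIC = {"price"}  # illustrative set of numeric columns

def clean(records):
    # Rule 1: drop records whose numeric columns contain nulls.
    kept = [r for r in records if all(r[c] is not None for c in NUMERIC)]
    # Rule 2: empty strings replace nulls everywhere else.
    for r in kept:
        for col, val in r.items():
            if col not in NUMERIC and val is None:
                r[col] = ""
    return kept

cleaned = clean(records)
```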
Once the two functions are complete, use the code to define the Glue ETL jobs in Terraform.
Lab Walkthrough

Simple Data Lake with AWS Glue (Part 2)

Glue Data Catalog
• AWS Glue Data Catalog: a central repository that stores metadata for all your data assets.
• You can create your own metadata databases.
• A catalog database can contain many tables, each storing:
  • Column names
  • Data types
  • Partition keys
Lab Walkthrough

Simple Data Lake with AWS Glue (Part 3 - Optional)

Lab Experiments
Perform 4 experiments to explore different configurations for the Glue jobs:
• Compression algorithm (e.g. Snappy)
• Partitioning for each dataset
Lab Experiments

Experiment 1 (Metadata dataset):
| Partitioning    | Compression Algorithm | Number of Objects | Total Size of the Processed Data |
| No partitioning | Uncompressed          | 4                 | 93.6 MiB                         |
| No partitioning | Snappy                | 4                 | 51 MiB                           |

Check the size of the processed data with:
!aws s3 ls --summarize --human-readable --recursive s3://{BUCKET_NAME}/processed_data/uncompressed/no_partition/toys_metadata
Experiment 2 (Metadata dataset):
| Partitioning    | Compression Algorithm | Number of Objects | Total Size of the Processed Data |
| No partitioning | Snappy                | 4                 | 51 MiB                           |
| No partitioning | Gzip                  | 4                 | 32.7 MiB                         |
Gzip produces smaller files, but Snappy is faster to compress and decompress.
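The size side of this trade-off can be illustrated with the standard library. Snappy is not in the Python standard library, so gzip stands in here; the sample data is a made-up repetitive CSV fragment, which is roughly how columnar data behaves.

```python
# Compression illustration with stdlib gzip: repetitive tabular data
# compresses well. (The lab uses Snappy and Gzip through Glue; gzip
# here is a stand-in since Snappy is not in the standard library.)
import gzip

data = ("0000031852,Girls Ballet Tutu,3.17\n" * 10_000).encode()
compressed = gzip.compress(data)
ratio = len(compressed) / len(data)  # fraction of original size
```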
Experiment 3 (Reviews dataset):
| Partitioning                | Compression Algorithm | Number of Objects | Total Size of the Processed Data |
| No partitioning             | Snappy                | 4                 | 556.6 MiB                        |
| Partitioning by year, month | Snappy                | 556               | 578.0 MiB                        |

• Choose an appropriate key that organizes the data into meaningful files aligned with your query pattern.
• A bad partitioning key divides the data into too many small files.
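A quick way to vet a candidate partitioning key is to count how many partitions it would produce and how many rows each would hold. The sample data below is illustrative: a key like asin yields one tiny file per product, while year/month yields a few well-sized partitions.

```python
# Vetting partitioning keys: count partitions and rows per partition
# for two candidate keys over illustrative review records.
from collections import Counter

reviews = [
    {"asin": f"A{i}", "year": 2009, "month": 1 + i % 3} for i in range(1000)
]

by_time = Counter((r["year"], r["month"]) for r in reviews)
by_asin = Counter(r["asin"] for r in reviews)

# Few partitions with many rows each vs. many single-row partitions.
time_partitions = len(by_time)
asin_partitions = len(by_asin)
```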
Experiment 4 (Reviews dataset):
| Partitioning                | Compression Algorithm | Number of Objects | Total Size of the Processed Data |
| Partitioning by year, month | Snappy                | 556               | 578.0 MiB                        |
| Partitioning by asin        | Snappy                | Timeout           |                                  |
Partitioning by asin produces so many small files that the Glue job times out.
What you'll run and what you'll compare: the four experiment configurations above, comparing the number of objects and the total size of the processed data for each.
Storage Abstractions

Data Lakehouse

Data Lake + Data Warehouse = Data Lakehouse
• From the data lake: flexibility and low-cost storage.
• From the data warehouse: superior query performance and robust data management.
Data Lake Features

Medallion Architecture: the zones are named Bronze (landing / raw), Silver (cleaned / transformed), and Gold (curated / enriched); data quality improves as data moves from Bronze to Gold.
• Silver processing: clean, validate, standardize, remove PII information.
• Gold processing: model the data and apply further transformations.
• Data is stored in open formats such as .parquet.
Data Lakehouse: Data Management from Data Warehouses
• Catalog: metadata and schema.
• Schema enforcement: writes that do not match the table schema (e.g. an [ID, Name] record against an [ID, Last, First] table) are rejected.
• ACID transactions: atomic, consistent, isolated, durable; concurrent read, insert, update, delete.
• Data governance & security: robust access controls, data auditing capabilities, data lineage.
• Incremental updates & deletions.
• Rollback to, or access, any version of your historical data.
• Connector APIs expose the storage layer (Bronze / Silver / Gold zones in .parquet) to different engines.
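Schema enforcement can be sketched as a check performed before a write is accepted. The schema and record shapes below are illustrative; real lakehouse formats enforce this in the table metadata layer.

```python
# Schema-enforcement sketch: reject writes whose records do not match
# the table schema instead of silently storing them.
SCHEMA = {"ID": int, "Last": str, "First": str}  # illustrative table schema

def enforce_schema(record, schema=SCHEMA):
    if set(record) != set(schema):
        raise ValueError(f"columns {sorted(record)} do not match schema {sorted(schema)}")
    for col, typ in schema.items():
        if not isinstance(record[col], typ):
            raise ValueError(f"{col} must be {typ.__name__}")
    return record

ok = enforce_schema({"ID": 1, "Last": "McDonald", "First": "J."})
try:
    enforce_schema({"ID": 2, "Name": "J. McDonald"})  # wrong columns
    rejected = False
except ValueError:
    rejected = True
```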
Storage Abstractions

Data Lakehouse Implementations

Open Table Formats
• Specialized storage formats that add transactional features to your data lakehouse.
• Allow you to update and delete records (e.g. Apache Hudi stands for Hadoop Upserts, Deletes, and Incrementals).
• Support ACID principles.
Open Table Formats
Open table formats track changes in data: each insert, update, or delete produces a new snapshot.
• Snapshot: reflects the state of the data at a given time.
• Time travel: query any previous version of a table.
• Schema & partition evolution: the ability to query the data even if you make changes to the schema or partitioning.
• Open source: different query engines can access the data.
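The snapshot/time-travel idea can be modeled in miniature. Real formats such as Iceberg track data files through metadata and manifest files; this sketch models only the concept of immutable snapshots and reading "as of" an earlier one.

```python
# Conceptual sketch of open-table-format snapshots: every commit
# produces a new immutable snapshot; time travel reads an older one.
class VersionedTable:
    def __init__(self):
        self.snapshots = [{}]  # snapshot 0: empty table

    def commit(self, **changes):
        """changes: key -> row (upsert) or key -> None (delete)."""
        new = dict(self.snapshots[-1])
        for key, row in changes.items():
            if row is None:
                new.pop(key, None)
            else:
                new[key] = row
        self.snapshots.append(new)
        return len(self.snapshots) - 1  # new snapshot id

    def read(self, snapshot_id=-1):
        return self.snapshots[snapshot_id]

t = VersionedTable()
s1 = t.commit(a={"qty": 10})
s2 = t.commit(a={"qty": 12}, b={"qty": 5})
s3 = t.commit(a=None)          # delete "a"
current = t.read()
as_of_s1 = t.read(s1)          # time travel to the first snapshot
```

Because snapshots are never mutated, rollback is just pointing the table back at an earlier snapshot id.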
Iceberg Catalog
• Catalog: holds the current metadata pointer for each table.
• Metadata layer: metadata files (table schemas, partitioning information, snapshots such as s0, s1), manifest lists, and manifest files.
• Storage layer: the data files themselves (Parquet).
Storage Options

Production database
• Process small amounts of structured data
• Use case: reporting, analytics

Data warehouse
• Bring together large volumes of structured / semi-structured data
• Query current and historical data
• Use case: reporting, analytics

Data lake
• Process large volumes of structured, semi-structured, and unstructured data
• Save on storage cost
• Use case: machine learning

Data lakehouse
• Data management & discoverability features
• Low latency queries
• Use case: machine learning, analytics, reporting
Storage Abstractions

Lakehouse Architecture on AWS

AWS Lake Formation
• Defining storage (open formats such as .parquet, .avro, .orc)
• Cataloging data
• Setting up access controls for groups and users
• Managing permissions

Setting up a Data Lake with AWS Lake Formation
• Identify data sources: Amazon S3, relational databases, NoSQL databases.
• Move data into the data lake.
• Catalog the data: a crawler populates the AWS Glue Data Catalog with metadata (via the Glue console).
• Manage permissions with AWS IAM.
AWS Data Lakehouse
• Ingestion: data sources feed the lakehouse through Amazon Kinesis Data Streams, Amazon Data Firehose, AWS DataSync, AWS Database Migration Service, and Amazon AppFlow.
• Storage: Amazon Redshift and Amazon S3.
• Processing: Amazon EMR, AWS Glue, Amazon Managed Service for Apache Flink, and SQL ELT with Amazon Redshift Spectrum.
• Catalog: AWS Lake Formation, AWS Glue Data Catalog, AWS IAM.
• Consumption: Amazon SageMaker, Amazon QuickSight, Amazon Athena, Amazon Redshift Spectrum.
Storage Abstractions

Implementing a Lakehouse on AWS
• Amazon Redshift stores structured and semi-structured data; Amazon S3 stores structured, semi-structured, and unstructured data.
• Native integration between Redshift and S3 means you don't have to write complex ETL jobs, avoiding the possible mistakes that come with them.
• Amazon Redshift Spectrum queries data in S3 directly from Redshift.
• A crawler and AWS Glue jobs populate the AWS Glue Data Catalog, governed by AWS Lake Formation and AWS IAM.
• Apache Iceberg provides schema & data versioning on top of S3.
• Redshift provides an MPP SQL interface; keep hot data in Redshift and historical data in S3.
• Amazon Athena provides a serverless, on-demand SQL interface and supports federated queries across sources.
• Consumption: Amazon SageMaker, Amazon QuickSight, Amazon Athena, Amazon Redshift Spectrum.
Lab Walkthrough

Building a Data Lakehouse with


AWS LakeFormation (Part 1)
• Schema evolution
• Time travel
Amazon S3
data_lake_bucket AWS Lake Formation

Establish governance and fine-grained permissions for the data

MySQL RDS
classic_models
Creating
Processing tables Analytics
to serve & ML users
data_lake data_lake data_lake

source_bucket Data Catalog Data Catalog Amazon


Customer ratings curated_zone presentation_zone Athena
of the products
(json)
• Each of the 8 tables is extracted to a CSV file at:
s3://{data_lake_bucket}/landing_zone/rds/{table_name}

classic_models schema:
• productlines: productLine, textDescription, htmlDescription, image
• products: productCode, productName, productLine, productScale, productVendor, productDescription, quantityInStock, buyPrice, MSRP
• orderdetails: orderNumber, productCode, quantityOrdered, priceEach, orderLineNumber
• orders: orderNumber, orderDate, requiredDate, shippedDate, status, comments, customerNumber
• customers: customerNumber, customerName, contactLastName, contactFirstName, phone, addressLine1, addressLine2, city, state, postalCode, country, salesRepEmployeeNumber, creditLimit
• payments: customerNumber, checkNumber, paymentDate, amount
• employees: employeeNumber, lastName, firstName, extension, email, officeCode, reportsTo, jobTitle
• offices: officeCode, city, phone, addressLine1, addressLine2, state, country, postalCode, territory

• The customer ratings JSON file lands at:
s3://{data_lake_bucket}/landing_zone/json/ratings

Landing zone contents: 8 CSV files and 1 JSON file.
Ratings schema: customerNumber, productCode, productRating, ingest_ts

• Each processed table is written as a Parquet file at:
s3://{data_lake_bucket}/curated_zone/{table_name}
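The landing-zone and curated-zone layout above can be sketched as a small helper; this is a minimal illustration, and the bucket name is a placeholder rather than the lab's real bucket.

```python
# Sketch of the key layout described above; "my-data-lake-bucket" is a
# placeholder, not the lab's actual bucket name.
RDS_TABLES = [
    "productlines", "products", "orderdetails", "orders",
    "customers", "payments", "employees", "offices",
]

def landing_key(table_name: str) -> str:
    """Landing-zone key for one extracted RDS table (CSV)."""
    return f"landing_zone/rds/{table_name}"

def curated_key(table_name: str) -> str:
    """Curated-zone key for one processed table (Parquet)."""
    return f"curated_zone/{table_name}"

bucket = "my-data-lake-bucket"
print(f"s3://{bucket}/{landing_key('products')}")
# s3://my-data-lake-bucket/landing_zone/rds/products
```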

Processing the CSV files:
• Extract the 8 tables into a DataFrame
• Add two metadata columns: ingest_ts and source
• Enforce the schema (you are provided with a schema for the processed tables)
Output: 8 Parquet files in the curated_zone, registered in the Data Catalog.
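The lab performs this step on a DataFrame; a minimal pure-Python sketch of the same transformation (toy CSV content, column values illustrative) looks like:

```python
import csv
import io
from datetime import datetime, timezone

def add_metadata(rows, source):
    """Append the two metadata columns described above to each record."""
    ingest_ts = datetime.now(timezone.utc).isoformat()
    return [{**row, "ingest_ts": ingest_ts, "source": source} for row in rows]

# Parse a tiny in-memory CSV the way one extracted table would be read.
raw = "productCode,productName\nS10_1678,1969 Harley Davidson\n"
rows = list(csv.DictReader(io.StringIO(raw)))
enriched = add_metadata(rows, source="rds/products")
print(enriched[0]["source"])  # rds/products
```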
Preparing data for ML (ratings_for_ML):
• Extract the CSV tables and the latest ratings into DataFrames
• Combine products, customers, and ratings
• Add a processing timestamp
Iceberg outputs:
s3://{data_lake_bucket}/curated_zone/ratings_for_ML/iceberg
s3://{data_lake_bucket}/curated_zone/ratings/iceberg
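The combine step for ratings_for_ML can be sketched in pure Python; joining on customerNumber and productCode is an assumption based on the schemas shown, and the row values are toy stand-ins.

```python
from datetime import datetime, timezone

# Toy stand-ins for the three DataFrames (values are illustrative only).
products = [{"productCode": "S10_1678", "productLine": "Motorcycles"}]
customers = [{"customerNumber": 103, "country": "France"}]
ratings = [{"customerNumber": 103, "productCode": "S10_1678", "productRating": 4.0}]

prod_by_code = {p["productCode"]: p for p in products}
cust_by_number = {c["customerNumber"]: c for c in customers}

processing_ts = datetime.now(timezone.utc).isoformat()
ratings_for_ml = [
    {**r,
     **prod_by_code[r["productCode"]],
     **cust_by_number[r["customerNumber"]],
     "processing_ts": processing_ts}
    for r in ratings
]
print(ratings_for_ml[0]["country"])  # France
```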

Processing the ratings:
• Extract the latest ratings
• Update the ratings table if it already exists
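"Update the ratings if it already exists" maps naturally onto an Iceberg MERGE. A hedged sketch of the statement one might run follows; "latest_ratings" is an assumed staging table, and the lab's actual job may express this differently.

```python
# Hypothetical Iceberg upsert for the ratings table; table/column names
# follow the slides, the staging table name is an assumption.
merge_sql = """
MERGE INTO curated_zone.ratings AS t
USING curated_zone.latest_ratings AS s
  ON  t.customerNumber = s.customerNumber
  AND t.productCode    = s.productCode
WHEN MATCHED THEN
  UPDATE SET productRating = s.productRating, ingest_ts = s.ingest_ts
WHEN NOT MATCHED THEN
  INSERT (customerNumber, productCode, productRating, ingest_ts)
  VALUES (s.customerNumber, s.productCode, s.productRating, s.ingest_ts)
"""
print("MERGE INTO" in merge_sql)  # True
```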
Terraform
The lab pipeline is deployed with Terraform.
[Pipeline diagram repeated: source_bucket → data_lake → curated_zone Data Catalog]
Creating tables to serve:
For analytics end users:
• Average sales grouped by year and month
• Average ratings per product
• Latest ratings
For ML end users:
• Ratings for ML table
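A hedged sketch of the "average sales grouped by year and month" serving query; the join and the amount formula (quantityOrdered * priceEach) are assumptions based on the classic_models schema, not the lab's exact SQL.

```python
# Illustrative Athena query for the first analytics serving table.
avg_sales_sql = """
SELECT year(o.orderDate)  AS order_year,
       month(o.orderDate) AS order_month,
       avg(d.quantityOrdered * d.priceEach) AS avg_sales
FROM curated_zone.orders AS o
JOIN curated_zone.orderdetails AS d
  ON o.orderNumber = d.orderNumber
GROUP BY year(o.orderDate), month(o.orderDate)
ORDER BY order_year, order_month
"""
print("GROUP BY" in avg_sales_sql)  # True
```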
[Serving requirements repeated; the serving tables are created and queried with Amazon Athena.]
Creating the ratings table using Athena.
The 4 serving tables are stored in Iceberg format in the presentation_zone, registered in its Data Catalog database, and queried by analytics & ML users through Amazon Athena.
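A hedged sketch of Athena DDL for an Iceberg ratings table; the column types are assumptions, and {data_lake_bucket} stays a placeholder exactly as in the slides.

```python
# Illustrative Athena CREATE TABLE for an Iceberg table; not the lab's
# verbatim DDL.
create_ratings_sql = """
CREATE TABLE curated_zone.ratings (
    customerNumber int,
    productCode    string,
    productRating  double,
    ingest_ts      timestamp
)
LOCATION 's3://{data_lake_bucket}/curated_zone/ratings/iceberg'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""
print("ICEBERG" in create_ratings_sql)  # True
```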
Lab Walkthrough
Building a Data Lakehouse with AWS Lake Formation (Part 2)
The Iceberg ratings table is laid out under two prefixes:
s3://{data_lake_bucket}/curated_zone/ratings/iceberg/metadata
s3://{data_lake_bucket}/curated_zone/ratings/iceberg/data
[Iceberg table layout]
Metadata layer (/metadata):
• Metadata file — records the table schema, the location of the table in S3, the date and time of the last update, and the UUID of the current snapshot
• Snapshots (s0, s1) point to manifest lists, which point to manifest files
Storage layer (/data):
• Data files, referenced by the manifest files
[Iceberg layout diagram repeated: each commit writes a new metadata file whose snapshot list grows from s0 to s0, s1.]
Iceberg Catalog — holds the current metadata pointer for each table.
Here the Glue Data Catalog plays that role:
• It contains a catalog table for each Iceberg table in the curated and presentation zones
• Catalog tables are organized into the curated_zone and presentation_zone databases
Schema evolution
• Add a new column to the ratings table at:
s3://{data_lake_bucket}/curated_zone/ratings/iceberg
• Apply the transformation in Terraform using the "alter_table" module
• Only the metadata file changes; the data does not need to be completely rewritten
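The schema change the "alter_table" module applies boils down to DDL like the following; the new column's name and type are illustrative assumptions, not the lab's actual column.

```python
# Illustrative Athena/Iceberg schema-evolution statement.
alter_sql = """
ALTER TABLE curated_zone.ratings
ADD COLUMNS (comment string)
"""
print("ADD COLUMNS" in alter_sql)  # True
```

Because Iceberg tracks the schema in the metadata file, this commit rewrites only metadata, not the data files.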
Time Travel
Query the new and old versions of the ratings table.
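Hedged examples of Athena time-travel reads against the Iceberg ratings table; the interval and the snapshot id are placeholders, not values from the lab.

```python
# Read the table as it was a day ago, or pin a specific snapshot.
old_version_sql = """
SELECT * FROM curated_zone.ratings
FOR TIMESTAMP AS OF (current_timestamp - interval '1' day)
"""
pinned_version_sql = """
SELECT * FROM curated_zone.ratings
FOR VERSION AS OF 949530903748831860
"""
print("FOR TIMESTAMP AS OF" in old_version_sql)  # True
```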
Apply fine-grained permissions to the data:
• Metadata-level (catalog resources)
• Storage access
The data lake administrator grants IAM users and roles permissions on databases, tables, columns, rows, and cells.
• No need for a detailed IAM policy for data lake users
• Lake Formation permissions are meant to augment regular IAM permissions
• Data lake users still need an IAM policy granting access to the AWS Glue service, the Lake Formation service, and Amazon Athena
• Use Lake Formation to grant users fine-grained permissions on specific resources
Access flow: a data lake user queries through Amazon Athena; AWS Lake Formation checks the grant and vends temporary credentials, giving the user access to the metadata in the Data Catalog and the data in S3.
Grants in this lab:
• The role assumed by the Cloud9 instance and the role assumed by the Glue resources can access all catalog tables and the underlying stored data
• A Machine Learning team member gets access to the ratings_for_ml table from the presentation zone
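The ML team member's grant could be expressed through the Lake Formation API; the role ARN below is a placeholder, and the boto3 call is shown but not executed here (it requires AWS credentials and the catalog table to exist).

```python
# Sketch of a Lake Formation grant for the ratings_for_ml table.
grant_params = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/ml-team"
    },
    "Resource": {
        "Table": {"DatabaseName": "presentation_zone", "Name": "ratings_for_ml"}
    },
    "Permissions": ["SELECT"],
}
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant_params)
print(grant_params["Permissions"])  # ['SELECT']
```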
Storage Abstractions

Summary
Data Warehouse
• Serves analytics and reporting
• Low-latency query performance
• High storage cost

Data Lake (storage + catalog)
• Holds large amounts of structured and unstructured data
• Supports applications requiring lots of data: machine learning, big data processing
• Can turn into a data swamp
First lab
Source bucket → Glue ETL (partitioned data) → Data Lake bucket → Crawler (create data catalog) → Amazon Athena

Data Lakehouse
• Scalable, low-cost, and flexible storage combined with structured query and data management
• Serves both analytics and machine learning
Lab Assignment
