DeepLearning.AI
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Storage Abstractions
Week 2 Overview
Storage Hierarchy
• Raw Ingredients: physical components (CPU, cache) and processes (networking, serialization, compression, caching)
• Storage Systems
• Storage Abstractions
Data Warehouse - Key Architectural Ideas
Data Warehouse
"A subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions."
Bill Inmon, “Father” of the Data Warehouse
Data Warehouse
• Subject-Oriented: organizes and stores data around key business domains, e.g. customers, products, sales, finance (models data to support decision making)
• Integrated: combines data from different sources into a consistent format
• Nonvolatile: data is read-only and cannot be deleted or updated; changes arrive as a new snapshot
• Time-variant: stores current and historical data (unlike OLTP systems)
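The nonvolatile and time-variant properties can be illustrated with a minimal sketch. It assumes a toy in-memory store; the `Warehouse` class, the `widget` product, and the dates are hypothetical and not part of the course material.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a stored row cannot be modified in place
class SalesSnapshotRow:
    snapshot_date: date
    product: str
    units_sold: int


class Warehouse:
    """Append-only store: rows are added as dated snapshots, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: list[SalesSnapshotRow] = []

    def load_snapshot(self, snapshot_date: date, units_by_product: dict[str, int]) -> None:
        for product, units in units_by_product.items():
            self._rows.append(SalesSnapshotRow(snapshot_date, product, units))

    def as_of(self, day: date) -> dict[str, int]:
        """Return the most recent snapshot taken on or before `day`."""
        dates = [r.snapshot_date for r in self._rows if r.snapshot_date <= day]
        if not dates:
            return {}
        latest = max(dates)
        return {r.product: r.units_sold for r in self._rows if r.snapshot_date == latest}


warehouse = Warehouse()
warehouse.load_snapshot(date(2024, 1, 1), {"widget": 10})
warehouse.load_snapshot(date(2024, 2, 1), {"widget": 15})  # new snapshot; the old one is kept

january = warehouse.as_of(date(2024, 1, 15))
february = warehouse.as_of(date(2024, 2, 15))
```

Because earlier snapshots are never overwritten, both the current and the historical view stay queryable.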
Data Warehouse-Centric Architecture
Extract-Transform-Load (ETL)
• Extract: pull data from the data sources into a staging area
• Transform: clean and standardize data, then model the data
• Load: write the transformed data into the data warehouse (comprehensive schema)
• Data Marts: simple denormalized schema (e.g. sales, marketing, finance)
• Analytics & Reports
Change Data Capture
• Change data capture (CDC): change events from the data sources feed the same Extract-Transform-Load (ETL) flow shown on the previous slide
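A rough sketch of how change events might be applied to a staging copy of a source table. The event shape (`op`, `key`, `row`) and the sample customers are assumptions made for illustration, not the format of any particular CDC tool.

```python
def apply_change_events(staging, events):
    """Apply ordered change events to a staging table keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            staging[key] = event["row"]
        elif op == "delete":
            staging.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return staging


# Hypothetical staging table and change events
staging = {1: {"customer": "Ana", "city": "Lima"}}
events = [
    {"op": "update", "key": 1, "row": {"customer": "Ana", "city": "Quito"}},
    {"op": "insert", "key": 2, "row": {"customer": "Ben", "city": "Oslo"}},
    {"op": "delete", "key": 2},
]
apply_change_events(staging, events)
```

Only the changed rows travel through the pipeline, instead of a full extract of the source on every run.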
Data Warehouse Implementation
• Early data warehouses: big monolithic server
• Data warehouses with Massively Parallel Processing (MPP): multiple processors, large amounts of data
• Amazon Redshift: separates compute from storage
Amazon Redshift
MPP Architecture for Amazon Redshift
• Client application: connects over JDBC/ODBC
• Redshift Cluster
• Leader Node
• Compute Nodes: each with CPU and memory space, divided into slices
• Scaling: upgrade the node type
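The leader/compute split can be sketched in plain Python. This is a conceptual illustration only, not how Redshift is implemented; `NUM_SLICES`, the function names, and the sample rows are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SLICES = 4  # hypothetical number of slices


def distribute(rows, key, num_slices=NUM_SLICES):
    """Assign each row to a slice by hashing its distribution key."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[hash(row[key]) % num_slices].append(row)
    return slices


def partial_sum(slice_rows):
    """Work done independently on one slice: sum amounts per customer."""
    totals = defaultdict(int)
    for row in slice_rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


def leader_query(rows):
    """Leader: distribute the rows, run slices in parallel, merge partial results."""
    slices = distribute(rows, key="customer_id")
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, slices))
    merged = defaultdict(int)
    for partial in partials:
        for customer_id, amount in partial.items():
            merged[customer_id] += amount
    return dict(merged)


rows = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 5},
    {"customer_id": 1, "amount": 20},
    {"customer_id": 3, "amount": 7},
]
totals = leader_query(rows)
```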
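Data Warehouse-Centric Architecture
Extract-Load-Transform (ELT)
• Extract and Load: move data from the data sources through a staging area into the cloud data warehouse
• Transform: run the transformations inside the cloud data warehouse, with MPP
• Data Marts (e.g. sales, marketing, finance)
• Analytics & Reports, Machine Learning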
Cloud Data Warehouse
• Separation of Compute and Storage
• Columnar Architecture
Order ID | Price | Product SKU | Quantity | Customer ID
3 | 45 | 1255893 | 12 | 87q
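To show why a columnar layout suits analytical queries, the sketch below pivots row-oriented records into columns. The first row is taken from the slide; the second row is hypothetical and added only so the aggregate has more than one value.

```python
rows = [
    {"order_id": 3, "price": 45, "product_sku": 1255893, "quantity": 12, "customer_id": "87q"},
    # hypothetical second row, not from the slide
    {"order_id": 4, "price": 20, "product_sku": 1255894, "quantity": 1, "customer_id": "12b"},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column reads only that column's values,
# not every field of every row.
total_price = sum(columns["price"])
```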
Data Lakes - Key Architectural Ideas
Data Lake
• Central repository for storing large volumes of data
• No fixed schema or predefined set of transformations
• Schema-on-read pattern:
• Reader determines the schema when reading the data
Storage: Amazon S3
Processing Tools
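The schema-on-read pattern can be sketched as follows. The raw records, field names, and the two reader schemas are assumptions made for illustration.

```python
import json

# Raw data is stored as-is; no schema was enforced at write time.
raw_lines = [
    '{"order_id": "3", "price": "45", "quantity": "12"}',
    '{"order_id": "4", "price": "20.5", "quantity": "1"}',
]


def read_with_schema(lines, schema):
    """Parse each raw line and cast only the fields the reader asks for."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}


# Two readers apply different schemas to the same stored data.
pricing_schema = {"order_id": int, "price": float}
inventory_schema = {"order_id": str, "quantity": int}

pricing_rows = list(read_with_schema(raw_lines, pricing_schema))
inventory_rows = list(read_with_schema(raw_lines, inventory_schema))
```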
Shortcomings of Data Lake 1.0
Data Swamp
• No proper data management
• No data cataloging
• No data discovery tools
• No guarantee on the data integrity and quality
Write-only storage
• Data Manipulation Language (DML) operations were painful to implement
Data Zones
Used to organize data in a data lake, where each zone houses data that has been processed to varying degrees
• Zone 1: Landing / raw
• Zone 2: Cleaned / Transformed. Processing: clean, validate, standardize, remove PII information
• Zone 3: Curated / enriched. Processing: model, apply further transformations. Serves Analytics and Machine Learning
• Open file formats: .parquet, .avro, .orc, …
Data Partitioning
Divide a dataset into smaller, more manageable parts based on a set of criteria (e.g. time, date, location recorded in the data)
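A small sketch of time-based partitioning. The `year=/month=` folder naming is a common convention assumed here, and the second and third records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(record):
    """Derive a year/month partition path from the record's Unix timestamp (UTC)."""
    ts = datetime.fromtimestamp(record["unixReviewTime"], tz=timezone.utc)
    return f"year={ts.year}/month={ts.month:02d}"


def partition(records):
    """Group records by partition key."""
    parts = defaultdict(list)
    for record in records:
        parts[partition_key(record)].append(record)
    return dict(parts)


records = [
    {"reviewerID": "A2SUAM1J3GNN3B", "unixReviewTime": 1252800000},
    {"reviewerID": "HYPOTHETICAL01", "unixReviewTime": 1252886400},  # one day later
    {"reviewerID": "HYPOTHETICAL02", "unixReviewTime": 1255392000},  # 30 days later
]
parts = partition(records)
```

A query that filters on year and month then only needs to read the matching folders.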
Data Catalog
Collection of metadata about the dataset (owner, source, partitions, etc.)
• Metadata
• Schema
Separate Data Lakes and Data Warehouses
• Data Lake: low-cost storage; store large amounts of data
• Data Warehouse: superior query performance; for analytical use cases
• An ETL pipeline moves a subset of data from the data lake into the data warehouse for analytics
Expensive Solution
• Can introduce bugs/failures
• Can cause issues with data quality, duplication, consistency
Lab Walkthrough
Simple Data Lake (S3 bucket)
• raw: Reviews, Product Metadata
• Transform: AWS Glue (Glue ETL)
• processed_data: Reviews (.parquet), Product Metadata (.parquet)
• Crawler, Data Catalog, Amazon Athena
• Optional part: explore the effects of compression and partitioning
Processing the JSON Files
Complete two functions to process the raw JSON files
Reviews
{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"helpful": [2, 3],
"reviewText": "Great purchase!",
"overall": 5.0,
"summary": "Heavenly Highway Hymns",
"unixReviewTime": 1252800000,
"reviewTime": "09 13, 2009"
}
Output columns: reviewerID, asin, reviewerName, helpful, reviewText, overall, summary, unixReviewTime, reviewTime, year, month, totalHelpful
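One way the extra review columns might be derived, as a sketch rather than the lab's solution. It assumes the `helpful` pair holds the helpful votes followed by the total votes, and that `year` and `month` come from `unixReviewTime` in UTC. The function name `process_review` is hypothetical.

```python
from datetime import datetime, timezone


def process_review(review: dict) -> dict:
    """Return a flat row with year, month, helpful, and totalHelpful columns."""
    row = dict(review)
    # Assumption: "helpful" is [helpful votes, total votes].
    helpful_votes, total_votes = review["helpful"]
    row["helpful"] = helpful_votes
    row["totalHelpful"] = total_votes
    reviewed_at = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)
    row["year"] = reviewed_at.year
    row["month"] = reviewed_at.month
    return row


review = {
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "reviewerName": "J. McDonald",
    "helpful": [2, 3],
    "reviewText": "Great purchase!",
    "overall": 5.0,
    "summary": "Heavenly Highway Hymns",
    "unixReviewTime": 1252800000,
    "reviewTime": "09 13, 2009",
}
row = process_review(review)
```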
Product Metadata
{
"asin": "0000031852",
"description": "Girls Ballet Tutu Zebra Hot Pink",
"title": "Girls Ballet Tutu Zebra Hot Pink",
"price": 3.17,
"related": {
"also bought": ["B00JHONNS", "B002BZX8Z6","B007R2RM8W"],
"also viewed": ["B002BZX8Z6", "B00JHONN1S","B00BFXLZ8M"],
"bought together": ["B002BZX8Z6"]
},
"salesRank":{"Toys & Games": 211836},
"brand": "Coxlures",
"categories": [["Sports & Outdoors", "Dance"]]
}
Import the metadata into a tabular data frame
Output columns: asin, description, title, price, brand, sales_category, sales_rank
• Drop null entries from numerical columns
• Replace null values with empty strings in the other columns
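The two cleaning rules can be sketched with plain dictionaries; the lab itself may use a data frame library. The function names and the two extra products are hypothetical, and taking the single `salesRank` entry as (`sales_category`, `sales_rank`) is an assumption.

```python
NUMERIC_COLUMNS = ("price", "sales_rank")
TEXT_COLUMNS = ("asin", "description", "title", "brand", "sales_category")


def flatten_product(product):
    """Flatten one metadata record into the tabular columns."""
    sales_rank = product.get("salesRank") or {}
    # Assumption: salesRank holds one {category: rank} entry.
    category, rank = next(iter(sales_rank.items()), (None, None))
    row = {name: product.get(name) for name in ("asin", "description", "title", "price", "brand")}
    row["sales_category"] = category
    row["sales_rank"] = rank
    return row


def clean_rows(rows):
    """Drop rows with null numeric values; blank out null text values."""
    cleaned = []
    for row in rows:
        if any(row[name] is None for name in NUMERIC_COLUMNS):
            continue
        cleaned.append({
            name: ("" if value is None and name in TEXT_COLUMNS else value)
            for name, value in row.items()
        })
    return cleaned


products = [
    {"asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17,
     "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures"},
    # hypothetical records that exercise the two rules
    {"asin": "HYPOTHETICAL1", "title": "No price listed", "salesRank": {"Toys & Games": 5}},
    {"asin": "HYPOTHETICAL2", "price": 9.99, "salesRank": {"Sports & Outdoors": 42}},
]
table = clean_rows([flatten_product(p) for p in products])
```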
Processing the JSON Files
Glue ETL
Lab Experiments
Perform 4 experiments to explore different configurations for the Glue jobs:
• compression
• partitioning
Experiment | Dataset | Partitioning | Compression Algorithm | Number of Objects | Total Size of the Processed Data
1 | Metadata | No partitioning | Uncompressed | 4 | 93.6 MiB
1 | Metadata | No partitioning | Snappy | 4 | 51 MiB
• Choose an appropriate key that organizes the data into meaningful files that are aligned with your query pattern.
• A bad partitioning key divides data into too many small files.
Data Lakehouse
Data Lake + Data Warehouse = Data Lakehouse
Medallion Architecture
• Bronze: Landing / raw
• Silver: Cleaned / Transformed
• Gold: Curated / enriched
Data Lake Features
• Processing between zones: clean, validate, standardize, remove PII information; then model and apply further transformations
• Data stored as .parquet files
ACID
• Atomic, Consistent, Isolated, Durable
• Concurrent read, insert, update, delete
Apache Hudi: Hadoop Update Delete Incremental
Open Table Formats
• Each insert, update, or delete on the data produces a new snapshot over time
• Time Travel: query any previous version of a table
• Snapshots reference manifest lists, which track the data files in the storage layer (Parquet)
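Time travel over snapshots can be sketched with immutable sets of data files. This is a conceptual model, not the API of any open table format; the class name and file names are hypothetical.

```python
class SnapshotTable:
    """Each commit creates a new immutable snapshot listing the table's data files."""

    def __init__(self):
        self.snapshots = [frozenset()]  # snapshot 0: empty table

    def commit(self, add=(), remove=()):
        current = self.snapshots[-1]
        self.snapshots.append((current - frozenset(remove)) | frozenset(add))
        return len(self.snapshots) - 1  # id of the new snapshot

    def data_files(self, snapshot_id=None):
        """List the data files of the latest snapshot, or of an earlier one."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return sorted(self.snapshots[snapshot_id])


table = SnapshotTable()
s1 = table.commit(add=["part-0.parquet"])  # insert
# An update rewrites the affected file; the old file stays reachable from s1.
s2 = table.commit(add=["part-1.parquet"], remove=["part-0.parquet"])

latest_files = table.data_files()
files_at_s1 = table.data_files(s1)  # time travel to the earlier version
```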
Storage Options
• Use case: reporting, analytics
• Use case: reporting, analytics
• Use case: machine learning
• Use case: machine learning, analytics, reporting
AWS Lake Formation
Setting up a Data Lake
• Defining storage (.parquet, .avro, .orc)
• Cataloging data
• Setting up access controls
• Managing permissions
AWS Lake Formation
✓ Identify data sources: Amazon S3, Relational DB, NoSQL DB (Crawler)
✓ Manage permissions: Groups, Users
Implementing a Lakehouse on AWS
AWS Data Lakehouse
• Data Sources
• Ingestion: Amazon Kinesis Data Streams, Amazon Data Firehose, AWS DataSync, AWS Database Migration Service, Amazon AppFlow
• Storage: Amazon Redshift (✓ structured, ✓ semi-structured) and Amazon S3 (✓ structured, ✓ semi-structured, ✓ unstructured)
• Native integration between Amazon Redshift and Amazon S3 through Amazon Redshift Spectrum (write ETL jobs?)
• Catalog: AWS Lake Formation, AWS Glue Data Catalog (metadata), Crawler, AWS IAM
• Schema & data versioning: Apache Iceberg
• Processing: AWS Glue
• Consumption: Amazon Athena (✓ federated query, SQL interface), Amazon Redshift Spectrum, Amazon SageMaker
Sources
• MySQL RDS: classic_models
• source_bucket: customer ratings of the products (json), with columns customerNumber, productCode, productRating, ingest_ts
data_lake: landing zone
• 8 csv files
• 1 json file: s3://{data_lake_bucket}/landing_zone/json/ratings
data_lake: curated zone
• Each parquet file: s3://{data_lake_bucket}/curated_zone/{table_name}
• Processing the ratings: extract the latest ratings; update the rating if it already exists
• Update the Data Catalog
• Ratings in Iceberg format: s3://{data_lake_bucket}/curated_zone/ratings/iceberg/data
• Tooling: Terraform
Creating tables to serve Analytics & ML users (Amazon Athena)
• For analytics end users: average sales grouped by year and month; average ratings per product; latest ratings
• For ML end users: ratings for ML table
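The "extract the latest ratings and update existing ones" step can be sketched as an upsert keyed on customer and product. The key choice, the ISO-formatted `ingest_ts` strings, the function name, and the sample values are assumptions for illustration.

```python
def upsert_latest(current, incoming):
    """Keep one rating per (customerNumber, productCode): the most recently ingested."""
    merged = dict(current)
    for rating in incoming:
        key = (rating["customerNumber"], rating["productCode"])
        existing = merged.get(key)
        # ISO-8601 timestamps in the same format compare correctly as strings.
        if existing is None or rating["ingest_ts"] > existing["ingest_ts"]:
            merged[key] = rating
    return merged


# Hypothetical batch of ratings
batch = [
    {"customerNumber": 103, "productCode": "S10_1678", "productRating": 3,
     "ingest_ts": "2024-01-01T00:00:00"},
    {"customerNumber": 103, "productCode": "S10_1678", "productRating": 5,
     "ingest_ts": "2024-02-01T00:00:00"},
    {"customerNumber": 112, "productCode": "S10_1678", "productRating": 4,
     "ingest_ts": "2024-01-15T00:00:00"},
]
ratings = upsert_latest({}, batch)
```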
Iceberg Catalog
• Current metadata pointer
Metadata layer: s3://{data_lake_bucket}/curated_zone/ratings/iceberg/metadata
• Metadata files: table schema, location of the table in S3, date and time (last updated), UUID of the current snapshot
• Snapshots (s0, s1)
• Manifest lists
Storage layer: s3://{data_lake_bucket}/curated_zone/ratings/iceberg/data
• Data files
Schema evolution
• Add a new column to the ratings table at: s3://{data_lake_bucket}/curated_zone/ratings/iceberg
• Apply the transformation in Terraform using the “alter_table” module
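A conceptual sketch of schema evolution as a metadata-only change: adding a column writes a new metadata file and moves the catalog pointer, while existing data files stay untouched. This is not the Iceberg API or the lab's Terraform module; the class, the column name `ratingSource`, and the sample row are hypothetical.

```python
import copy


class IcebergLikeTable:
    def __init__(self, location, schema):
        self.metadata_files = [{"location": location, "schema": list(schema)}]
        self.current = 0      # catalog pointer to the current metadata file
        self.data_files = []  # each data file is modelled as a list of row dicts

    def append(self, rows):
        self.data_files.append(list(rows))

    def add_column(self, name):
        """Write a new metadata file with the extended schema; rewrite no data."""
        new_metadata = copy.deepcopy(self.metadata_files[self.current])
        new_metadata["schema"].append(name)
        self.metadata_files.append(new_metadata)
        self.current = len(self.metadata_files) - 1

    def scan(self):
        """Read all rows using the current schema; missing columns read as None."""
        schema = self.metadata_files[self.current]["schema"]
        return [{col: row.get(col) for col in schema}
                for data_file in self.data_files for row in data_file]


table = IcebergLikeTable(
    "s3://{data_lake_bucket}/curated_zone/ratings/iceberg",
    ["customerNumber", "productCode", "productRating"],
)
table.append([{"customerNumber": 103, "productCode": "S10_1678", "productRating": 5}])
table.add_column("ratingSource")  # hypothetical new column
rows = table.scan()
```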
Apply fine-grained permissions to the data:
• Source: MySQL RDS (classic_models)
• 4 tables in Iceberg format in data_lake
• Amazon Athena serves Analytics & ML users
Summary
Data Warehouse
• Low-latency query performance
• High storage cost
• Use cases: analytics, reporting
Data Lake
• Source bucket, Glue ETL, Data Lake bucket (partitioned data)
• Crawler: create data catalog
• Amazon Athena
Data Lakehouse
• Scalable, low-cost, and flexible storage
• Structured query and data management
• Use cases: analytics + machine learning
Lab Assignment