
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Data Storage and Queries

Storage Abstractions

Week 2 Overview
• Storage Hierarchy
• Storage Abstractions

Storage Hierarchy
• Storage abstractions sit at the top of the hierarchy.
• Storage systems sit underneath (e.g. a cache).
• Raw ingredients form the base: physical components, processes, networking, serialization, CPU, compression, caching.
Data Warehouse + Data Lake = Data Lakehouse
• Data warehouse: a cloud data warehouse
• Data lake: supports growing storage needs
• Data lakehouse: combines the advantages of data warehouses and data lakes
Week 2 Lab - Simple Data Lake
• Source bucket → Glue ETL → Data Lake bucket → Crawler → Amazon Athena

Week 2 Lab - Building a Data Lakehouse
Storage Abstractions

Data Warehouse - Key Architectural Ideas

Data Warehouse:
"A subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions."
Bill Inmon, "father" of the data warehouse
Data Warehouse
• Subject-Oriented: Organizes and stores data around key business domains (models data to support decision making), e.g. Customers, Products, Sales, Finance.
• Integrated: Combines data from different sources into a consistent format.
• Nonvolatile: Data is read-only and cannot be deleted or updated; changes from the data sources arrive as new snapshots.
• Time-Variant: Stores current and historical data (unlike OLTP systems).
Data Warehouse-Centric Architecture
Extract-Transform-Load (ETL)
• Extract: Pull data from the sources (e.g. sales, marketing, finance) into a staging area.
• Transform: Clean and standardize the data, then model it into the warehouse's comprehensive schema.
• Load: Load the modeled data into the data warehouse; data marts with simple, denormalized schemas feed analytics & reports.
Change Data Capture
• Change data capture (CDC) tracks change events in the source systems (e.g. sales, marketing, finance) and feeds them through the same extract-transform-load path into the data warehouse.
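The flow of change events can be sketched in plain Python. This is a minimal sketch: the event shape (op/key/row) is a hypothetical illustration, not the format of any particular CDC tool.

```python
# Minimal sketch: applying change-data-capture (CDC) events to a target
# table held as a dict keyed by primary key. Event format is hypothetical.

def apply_cdc(table, events):
    """Apply insert/update/delete change events in order."""
    for event in events:
        if event["op"] in ("insert", "update"):
            table[event["key"]] = event["row"]
        elif event["op"] == "delete":
            table.pop(event["key"], None)
    return table

sales = {1: {"amount": 40}}
changes = [
    {"op": "insert", "key": 2, "row": {"amount": 23}},
    {"op": "update", "key": 1, "row": {"amount": 45}},
    {"op": "delete", "key": 2},
]
sales = apply_cdc(sales, changes)
```

Applying events in order is what keeps the warehouse copy consistent with the source: the same key can be inserted, updated, and deleted within one batch.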
Data Warehouse Implementation
• Early data warehouses: a big monolithic server.
• Data warehouses with Massively Parallel Processing (MPP): scan large amounts of data in parallel, but involve complex configurations and require effort to maintain.
• Modern cloud data warehouses (e.g. Amazon Redshift): separate compute from storage and expand the capability of MPP systems.
Storage Abstractions

Modern Cloud Data Warehouses

Massively Parallel Processing (MPP)
• Multiple processors work on large amounts of data in parallel.
• Cloud data warehouses such as Amazon Redshift are built on MPP.
MPP Architecture for Amazon Redshift
• A client application connects to the Redshift cluster over JDBC/ODBC.
• Leader node: parses the request, forms an execution plan, and compiles code.
• Compute nodes: each has its own CPU and memory space and is divided into slices that process data in parallel.
• To scale, you can upgrade the node type of the cluster.
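How an MPP system spreads work can be sketched as hash-distributing rows across slices. This is a conceptual illustration only, not Redshift's actual internal distribution algorithm; the table and slice count are made up.

```python
# Conceptual MPP sketch: rows are hashed on a distribution key and
# assigned to compute-node slices, so each slice can scan its share of
# the data in parallel. (Not Redshift's real algorithm.)
import hashlib

def slice_for(key, num_slices):
    # Stable hash so the same key always lands on the same slice.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_slices

NUM_SLICES = 4
rows = [{"order_id": i, "price": 10 * i} for i in range(1, 9)]
slices = {s: [] for s in range(NUM_SLICES)}
for row in rows:
    slices[slice_for(row["order_id"], NUM_SLICES)].append(row)

# Every row lands on exactly one slice.
total_rows = sum(len(v) for v in slices.values())
```

A stable hash matters here: co-locating rows that share a join key on the same slice is what lets the engine join them without shuffling data between nodes.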
Data Warehouse-Centric Architecture
Extract-Load-Transform (ELT)
• Extract: Pull data from the sources (e.g. sales, marketing, finance).
• Load: Load the raw data into a staging area inside the cloud data warehouse (with MPP).
• Transform: Transform and model the data inside the warehouse; data marts feed analytics & reports and machine learning.
Cloud Data Warehouse
Columnar architecture and the separation of compute and storage facilitate high-performance analytical queries.

Order ID | Price | Product SKU | Quantity | Customer ID
1        | 40    | 458650      | 10       | 67t
2        | 23    | 902348      | 14       | 56t
3        | 45    | 1255893     | 12       | 87q
4        | 50    | 456829      | 13       | 98q
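The benefit of a columnar layout can be sketched with the orders table above: an analytical query such as SUM(price) only has to read one column, not every row.

```python
# Columnar layout sketch: the same orders table stored row-wise and
# column-wise. Summing prices touches a single contiguous column.
rows = [
    {"order_id": 1, "price": 40, "sku": "458650",  "qty": 10, "customer": "67t"},
    {"order_id": 2, "price": 23, "sku": "902348",  "qty": 14, "customer": "56t"},
    {"order_id": 3, "price": 45, "sku": "1255893", "qty": 12, "customer": "87q"},
    {"order_id": 4, "price": 50, "sku": "456829",  "qty": 13, "customer": "98q"},
]

# Column-wise: one list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

total_price = sum(columns["price"])  # reads only the "price" column
```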


Cloud Data Warehouse
Traditional data warehouse:
• Stored data is highly structured
• Data modeled to enable analytical queries

Cloud data warehouse:
• Stored data is highly structured
• Data modeled to enable analytical queries
• High processing power from MPP
• Columnar storage
• Separation of storage and compute
• Efficiently stores and processes data for high-volume analytical workloads
Storage Abstractions

Data Lakes - Key Architectural Ideas

Data Lake
• Central repository for storing large volumes of data
• No fixed schema or predefined set of transformations
• Schema-on-read pattern: the reader determines the schema when reading the data
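Schema-on-read can be sketched in a few lines: the raw record is stored as-is, and each reader imposes its own schema at read time. The record and schema here are illustrative.

```python
# Schema-on-read sketch: raw JSON is stored untouched; a reader keeps
# only the fields it cares about and casts them to the types it needs.
import json

raw = '{"reviewerID": "A2SUAM1J3GNN3B", "overall": 5.0, "reviewText": "Great purchase!"}'

def read_with_schema(line, schema):
    """schema maps field name -> type to cast to at read time."""
    record = json.loads(line)
    return {field: cast(record[field]) for field, cast in schema.items()}

# One reader only cares about who rated what, as an integer rating.
rating_view = read_with_schema(raw, {"reviewerID": str, "overall": int})
```

Another reader of the same raw line could apply a completely different schema; nothing about the stored data changes.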

Data Lake 1.0
• Combined different storage and processing technologies
• Storage: e.g. Amazon S3
• Processing: separate processing tools
Shortcomings of Data Lake 1.0
Data swamp
• No proper data management
• No data cataloging
• No data discovery tools
• No guarantee on data integrity and quality

Write-only storage
• Data Manipulation Language (DML) operations were painful to implement: to delete or update rows, you had to create a new table.
• Difficult to comply with data regulations.

No schema management and data modeling
• Hard to process the stored data
• Data not optimized for query operations such as joins
Storage Abstractions

Next-Generation Data Lakes

Data Zones: used to organize data in a data lake, where each zone houses data that has been processed to varying degrees.
• Zone 1 - Landing / raw: data lands as-is.
• Zone 2 - Cleaned / transformed: clean, validate, standardize, remove PII information.
• Zone 3 - Curated / enriched: model the data and apply further transformations.
• Data is stored in open file formats (.parquet, .avro, .orc) and serves analytics and machine learning.
• Apply appropriate data governance policies on each zone and ensure data quality.
Data Partitioning: divide a dataset into smaller, more manageable parts based on a set of criteria (e.g. time, date, or location recorded in the data).
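Partitioning by time can be sketched with Hive-style year=/month= keys; queries filtered on time then only scan the matching partitions. The records and path layout are illustrative.

```python
# Hive-style partitioning sketch: records are routed into
# year=YYYY/month=MM paths so a time-filtered query reads only the
# partitions it needs.
from collections import defaultdict

records = [
    {"review": "a", "year": 2009, "month": 9},
    {"review": "b", "year": 2009, "month": 9},
    {"review": "c", "year": 2010, "month": 1},
]

partitions = defaultdict(list)
for r in records:
    partitions[f"year={r['year']}/month={r['month']:02d}"].append(r)

# A query for September 2009 reads a single partition.
sept_2009 = partitions["year=2009/month=09"]
```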
Data Catalog: a collection of metadata about the dataset (owner, source, partitions, schema, etc.).
Separate Data Lakes and Data Warehouses
• Data lake: low-cost storage; stores large amounts of data.
• Data warehouse: superior query performance for analytical use cases.
• An ETL pipeline copies a subset of the data from the lake into the warehouse for analytics.
• This is an expensive solution:
  • The pipeline can introduce bugs/failures.
  • It can cause issues with data quality, duplication, and consistency.
Lab Walkthrough

Simple Data Lake with AWS Glue (Part 1)

Lab Overview
• Raw data (reviews and product metadata) lands in the raw prefix of an S3 bucket.
• Glue ETL transforms both datasets into .parquet files under processed_data.
• A crawler populates the AWS Glue Data Catalog, and Amazon Athena queries the simple data lake.
• Optional part: explore the effects of compression and partitioning.
Processing the JSON Files
Complete two functions to process the raw JSON files.

Reviews:
{
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "reviewerName": "J. McDonald",
    "helpful": [2, 3],
    "reviewText": "Great purchase!",
    "overall": 5.0,
    "summary": "Heavenly Highway Hymns",
    "unixReviewTime": 1252800000,
    "reviewTime": "09 13, 2009"
}

• Import the review data into a tabular data frame with columns: reviewerID, asin, reviewerName, helpful, reviewText, overall, summary, unixReviewTime, reviewTime.
• Process the data: add year, month, and totalHelpful columns derived from the review time and the helpful field.
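A per-record version of this processing step might look like the sketch below. It assumes helpful = [helpful_votes, total_votes] and takes totalHelpful from the total-votes element; the actual lab code operates on a Spark DataFrame, so treat this as an illustration of the logic only.

```python
# Sketch of the review-processing step: derive year/month from
# unixReviewTime (UTC) and a total-votes column from the helpful pair.
# Assumption: helpful = [helpful_votes, total_votes].
from datetime import datetime, timezone

def process_review(review):
    processed = dict(review)
    ts = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)
    processed["year"] = ts.year
    processed["month"] = ts.month
    processed["totalHelpful"] = review["helpful"][1]
    return processed

review = {
    "reviewerID": "A2SUAM1J3GNN3B",
    "helpful": [2, 3],
    "overall": 5.0,
    "unixReviewTime": 1252800000,  # 09 13, 2009 (UTC)
}
row = process_review(review)
```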
Product Metadata:
{
    "asin": "0000031852",
    "description": "Girls Ballet Tutu Zebra Hot Pink",
    "title": "Girls Ballet Tutu Zebra Hot Pink",
    "price": 3.17,
    "related": {
        "also bought": ["B00JHONNS", "B002BZX8Z6", "B007R2RM8W"],
        "also viewed": ["B002BZX8Z6", "B00JHONN1S", "B00BFXLZ8M"],
        "bought together": ["B002BZX8Z6"]
    },
    "salesRank": {"Toys & Games": 211836},
    "brand": "Coxlures",
    "categories": [["Sports & Outdoors", "Dance"]]
}

• Import the metadata into a tabular data frame with columns: asin, description, title, price, related, salesRank, brand, categories.
• Process the data into columns asin, description, title, price, brand, sales_category, sales_rank:
  • Drop null entries from numerical columns.
  • Replace null values with empty strings in the other columns.
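The two null-handling rules can be sketched per record in plain Python. The lab itself works on a DataFrame; the records and the choice of numeric columns here are illustrative.

```python
# Sketch of the metadata-cleaning rules: drop records with nulls in
# numeric columns, replace nulls with empty strings in the others.
records = [
    {"asin": "0000031852", "title": "Ballet Tutu", "price": 3.17},
    {"asin": "0000031853", "title": None,          "price": 12.99},
    {"asin": "0000031854", "title": "Toy Car",     "price": None},
]
NUMERIC = {"price"}  # illustrative set of numeric columns

def clean(records):
    # Rule 1: drop records whose numeric columns contain nulls.
    kept = [r for r in records if all(r[c] is not None for c in NUMERIC)]
    # Rule 2: empty strings replace nulls everywhere else.
    for r in kept:
        for col, val in r.items():
            if col not in NUMERIC and val is None:
                r[col] = ""
    return kept

cleaned = clean(records)
```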
Once the two functions are complete, use the code to define the Glue ETL jobs in Terraform.
Lab Walkthrough

Simple Data Lake with AWS Glue (Part 2)

Glue Data Catalog
• AWS Glue Data Catalog: a central repository that stores metadata for all your data assets.
• You can create your own metadata databases.
• A catalog database can contain many tables, each storing:
  • Column names
  • Data types
  • Partition keys
Lab Walkthrough

Simple Data Lake with AWS Glue (Part 3 - Optional)

Lab Experiments
Perform 4 experiments to explore different configurations for the Glue jobs:
• Compression algorithm (e.g. Snappy)
• Partitioning for each dataset
Lab Experiments

Experiment 1 (Metadata dataset):
| Partitioning    | Compression Algorithm | Number of Objects | Total Size of the Processed Data |
| No partitioning | Uncompressed          | 4                 | 93.6 MiB                         |
| No partitioning | Snappy                | 4                 | 51 MiB                           |

Check the size of the processed data with:
!aws s3 ls --summarize --human-readable --recursive s3://{BUCKET_NAME}/processed_data/uncompressed/no_partition/toys_metadata
Experiment 2 (Metadata dataset):
| Partitioning    | Compression Algorithm | Number of Objects | Total Size of the Processed Data |
| No partitioning | Snappy                | 4                 | 51 MiB                           |
| No partitioning | Gzip                  | 4                 | 32.7 MiB                         |
Gzip produces smaller files, but Snappy is faster to compress and decompress.
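The size side of this trade-off can be illustrated with the standard library. Snappy is not in the Python standard library, so gzip stands in here; the sample data is a made-up repetitive CSV fragment, which is roughly how columnar data behaves.

```python
# Compression illustration with stdlib gzip: repetitive tabular data
# compresses well. (The lab uses Snappy and Gzip through Glue; gzip
# here is a stand-in since Snappy is not in the standard library.)
import gzip

data = ("0000031852,Girls Ballet Tutu,3.17\n" * 10_000).encode()
compressed = gzip.compress(data)
ratio = len(compressed) / len(data)  # fraction of original size
```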
Experiment 3 (Reviews dataset):
| Partitioning                | Compression Algorithm | Number of Objects | Total Size of the Processed Data |
| No partitioning             | Snappy                | 4                 | 556.6 MiB                        |
| Partitioning by year, month | Snappy                | 556               | 578.0 MiB                        |

• Choose an appropriate key that organizes the data into meaningful files aligned with your query pattern.
• A bad partitioning key divides the data into too many small files.
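A quick way to vet a candidate partitioning key is to count how many partitions it would produce and how many rows each would hold. The sample data below is illustrative: a key like asin yields one tiny file per product, while year/month yields a few well-sized partitions.

```python
# Vetting partitioning keys: count partitions and rows per partition
# for two candidate keys over illustrative review records.
from collections import Counter

reviews = [
    {"asin": f"A{i}", "year": 2009, "month": 1 + i % 3} for i in range(1000)
]

by_time = Counter((r["year"], r["month"]) for r in reviews)
by_asin = Counter(r["asin"] for r in reviews)

# Few partitions with many rows each vs. many single-row partitions.
time_partitions = len(by_time)
asin_partitions = len(by_asin)
```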
Experiment 4 (Reviews dataset):
| Partitioning                | Compression Algorithm | Number of Objects | Total Size of the Processed Data |
| Partitioning by year, month | Snappy                | 556               | 578.0 MiB                        |
| Partitioning by asin        | Snappy                | Timeout           |                                  |
Partitioning by asin produces so many small files that the Glue job times out.
What you'll run and what you'll compare: the four experiment configurations above, comparing the number of objects and the total size of the processed data for each.
Storage Abstractions

Data Lakehouse

Data Lake + Data Warehouse = Data Lakehouse
• From the data lake: flexibility and low-cost storage.
• From the data warehouse: superior query performance and robust data management.
Data Lake Features

Medallion Architecture: the zones are named Bronze (landing / raw), Silver (cleaned / transformed), and Gold (curated / enriched); data quality improves as data moves from Bronze to Gold.
• Silver processing: clean, validate, standardize, remove PII information.
• Gold processing: model the data and apply further transformations.
• Data is stored in open formats such as .parquet.
Data Lakehouse: Data Management from Data Warehouses
• Catalog: metadata and schema.
• Schema enforcement: writes that do not match the table schema (e.g. an [ID, Name] record against an [ID, Last, First] table) are rejected.
• ACID transactions: atomic, consistent, isolated, durable; concurrent read, insert, update, delete.
• Data governance & security: robust access controls, data auditing capabilities, data lineage.
• Incremental updates & deletions.
• Rollback to, or access, any version of your historical data.
• Connector APIs expose the storage layer (Bronze / Silver / Gold zones in .parquet) to different engines.
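Schema enforcement can be sketched as a check performed before a write is accepted. The schema and record shapes below are illustrative; real lakehouse formats enforce this in the table metadata layer.

```python
# Schema-enforcement sketch: reject writes whose records do not match
# the table schema instead of silently storing them.
SCHEMA = {"ID": int, "Last": str, "First": str}  # illustrative table schema

def enforce_schema(record, schema=SCHEMA):
    if set(record) != set(schema):
        raise ValueError(f"columns {sorted(record)} do not match schema {sorted(schema)}")
    for col, typ in schema.items():
        if not isinstance(record[col], typ):
            raise ValueError(f"{col} must be {typ.__name__}")
    return record

ok = enforce_schema({"ID": 1, "Last": "McDonald", "First": "J."})
try:
    enforce_schema({"ID": 2, "Name": "J. McDonald"})  # wrong columns
    rejected = False
except ValueError:
    rejected = True
```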
Storage Abstractions

Data Lakehouse Implementations

Open Table Formats
• Specialized storage formats that add transactional features to your data lakehouse.
• Allow you to update and delete records (e.g. Apache Hudi stands for Hadoop Upserts, Deletes, and Incrementals).
• Support ACID principles.
Open Table Formats
Open table formats track changes in data: each insert, update, or delete produces a new snapshot.
• Snapshot: reflects the state of the data at a given time.
• Time travel: query any previous version of a table.
• Schema & partition evolution: the ability to query the data even if you make changes to the schema or partitioning.
• Open source: different query engines can access the data.
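The snapshot/time-travel idea can be modeled in miniature. Real formats such as Iceberg track data files through metadata and manifest files; this sketch models only the concept of immutable snapshots and reading "as of" an earlier one.

```python
# Conceptual sketch of open-table-format snapshots: every commit
# produces a new immutable snapshot; time travel reads an older one.
class VersionedTable:
    def __init__(self):
        self.snapshots = [{}]  # snapshot 0: empty table

    def commit(self, **changes):
        """changes: key -> row (upsert) or key -> None (delete)."""
        new = dict(self.snapshots[-1])
        for key, row in changes.items():
            if row is None:
                new.pop(key, None)
            else:
                new[key] = row
        self.snapshots.append(new)
        return len(self.snapshots) - 1  # new snapshot id

    def read(self, snapshot_id=-1):
        return self.snapshots[snapshot_id]

t = VersionedTable()
s1 = t.commit(a={"qty": 10})
s2 = t.commit(a={"qty": 12}, b={"qty": 5})
s3 = t.commit(a=None)          # delete "a"
current = t.read()
as_of_s1 = t.read(s1)          # time travel to the first snapshot
```

Because snapshots are never mutated, rollback is just pointing the table back at an earlier snapshot id.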
Iceberg Catalog
• Catalog: holds the current metadata pointer for each table.
• Metadata layer: metadata files (table schemas, partitioning information, snapshots such as s0, s1), manifest lists, and manifest files.
• Storage layer: the data files themselves (Parquet).
Storage Options

Production database
• Process small amounts of structured data
• Use case: reporting, analytics

Data warehouse
• Bring together large volumes of structured / semi-structured data
• Query current and historical data
• Use case: reporting, analytics

Data lake
• Process large volumes of structured, semi-structured, and unstructured data
• Save on storage cost
• Use case: machine learning

Data lakehouse
• Data management & discoverability features
• Low latency queries
• Use case: machine learning, analytics, reporting
Storage Abstractions

Lakehouse Architecture on AWS

AWS Lake Formation
• Defining storage (open formats such as .parquet, .avro, .orc)
• Cataloging data
• Setting up access controls for groups and users
• Managing permissions

Setting up a Data Lake with AWS Lake Formation
• Identify data sources: Amazon S3, relational databases, NoSQL databases.
• Move data into the data lake.
• Catalog the data: a crawler populates the AWS Glue Data Catalog with metadata (via the Glue console).
• Manage permissions with AWS IAM.
AWS Data Lakehouse
• Ingestion: data sources feed the lakehouse through Amazon Kinesis Data Streams, Amazon Data Firehose, AWS DataSync, AWS Database Migration Service, and Amazon AppFlow.
• Storage: Amazon Redshift and Amazon S3.
• Processing: Amazon EMR, AWS Glue, Amazon Managed Service for Apache Flink, and SQL ELT with Amazon Redshift Spectrum.
• Catalog: AWS Lake Formation, AWS Glue Data Catalog, AWS IAM.
• Consumption: Amazon SageMaker, Amazon QuickSight, Amazon Athena, Amazon Redshift Spectrum.
Storage Abstractions

Implementing a Lakehouse on AWS
• Amazon Redshift stores structured and semi-structured data; Amazon S3 stores structured, semi-structured, and unstructured data.
• Native integration between Redshift and S3 means you don't have to write complex ETL jobs, avoiding the possible mistakes that come with them.
• Amazon Redshift Spectrum queries data in S3 directly from Redshift.
• A crawler and AWS Glue jobs populate the AWS Glue Data Catalog, governed by AWS Lake Formation and AWS IAM.
• Apache Iceberg provides schema & data versioning on top of S3.
• Redshift provides an MPP SQL interface; keep hot data in Redshift and historical data in S3.
• Amazon Athena provides a serverless, on-demand SQL interface and supports federated queries across sources.
• Consumption: Amazon SageMaker, Amazon QuickSight, Amazon Athena, Amazon Redshift Spectrum.
Lab Walkthrough

Building a Data Lakehouse with


AWS LakeFormation (Part 1)
• Schema evolution
• Time travel
Amazon S3
data_lake_bucket AWS Lake Formation

Establish governance and fine-grained permissions for the data

MySQL RDS
classic_models
Creating
Processing tables Analytics
to serve & ML users
data_lake data_lake data_lake

source_bucket Data Catalog Data Catalog Amazon


Customer ratings curated_zone presentation_zone Athena
of the products
(json)
• Each of the 8 tables is extracted to a CSV file at:
s3://{data_lake_bucket}/landing_zone/rds/{table_name}

classic_models schema:
• productlines: productLine, textDescription, htmlDescription, image
• products: productCode, productName, productLine, productScale, productVendor, productDescription, quantityInStock, buyPrice, MSRP
• orderdetails: orderNumber, productCode, quantityOrdered, priceEach, orderLineNumber
• orders: orderNumber, orderDate, requiredDate, shippedDate, status, comments, customerNumber
• customers: customerNumber, customerName, contactLastName, contactFirstName, phone, addressLine1, addressLine2, city, state, postalCode, country, salesRepEmployeeNumber, creditLimit
• payments: customerNumber, checkNumber, paymentDate, amount
• employees: employeeNumber, lastName, firstName, extension, email, officeCode, reportsTo, jobTitle
• offices: officeCode, city, phone, addressLine1, addressLine2, state, country, postalCode, territory

• The customer ratings JSON file lands at:
s3://{data_lake_bucket}/landing_zone/json/ratings

Landing zone contents: 8 CSV files and 1 JSON file.
Ratings schema: customerNumber, productCode, productRating, ingest_ts

• Each processed table is written as a Parquet file at:
s3://{data_lake_bucket}/curated_zone/{table_name}
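The landing-zone and curated-zone layout above can be sketched as a small helper; this is a minimal illustration, and the bucket name is a placeholder rather than the lab's real bucket.

```python
# Sketch of the key layout described above; "my-data-lake-bucket" is a
# placeholder, not the lab's actual bucket name.
RDS_TABLES = [
    "productlines", "products", "orderdetails", "orders",
    "customers", "payments", "employees", "offices",
]

def landing_key(table_name: str) -> str:
    """Landing-zone key for one extracted RDS table (CSV)."""
    return f"landing_zone/rds/{table_name}"

def curated_key(table_name: str) -> str:
    """Curated-zone key for one processed table (Parquet)."""
    return f"curated_zone/{table_name}"

bucket = "my-data-lake-bucket"
print(f"s3://{bucket}/{landing_key('products')}")
# s3://my-data-lake-bucket/landing_zone/rds/products
```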

Processing the CSV files:
• Extract the 8 tables into a DataFrame
• Add two metadata columns: ingest_ts and source
• Enforce the schema (you are provided with a schema for the processed tables)
Output: 8 Parquet files in the curated_zone, registered in the Data Catalog.
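The lab performs this step on a DataFrame; a minimal pure-Python sketch of the same transformation (toy CSV content, column values illustrative) looks like:

```python
import csv
import io
from datetime import datetime, timezone

def add_metadata(rows, source):
    """Append the two metadata columns described above to each record."""
    ingest_ts = datetime.now(timezone.utc).isoformat()
    return [{**row, "ingest_ts": ingest_ts, "source": source} for row in rows]

# Parse a tiny in-memory CSV the way one extracted table would be read.
raw = "productCode,productName\nS10_1678,1969 Harley Davidson\n"
rows = list(csv.DictReader(io.StringIO(raw)))
enriched = add_metadata(rows, source="rds/products")
print(enriched[0]["source"])  # rds/products
```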
Preparing data for ML (ratings_for_ML):
• Extract the CSV tables and the latest ratings into DataFrames
• Combine products, customers, and ratings
• Add a processing timestamp
Iceberg outputs:
s3://{data_lake_bucket}/curated_zone/ratings_for_ML/iceberg
s3://{data_lake_bucket}/curated_zone/ratings/iceberg
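The combine step for ratings_for_ML can be sketched in pure Python; joining on customerNumber and productCode is an assumption based on the schemas shown, and the row values are toy stand-ins.

```python
from datetime import datetime, timezone

# Toy stand-ins for the three DataFrames (values are illustrative only).
products = [{"productCode": "S10_1678", "productLine": "Motorcycles"}]
customers = [{"customerNumber": 103, "country": "France"}]
ratings = [{"customerNumber": 103, "productCode": "S10_1678", "productRating": 4.0}]

prod_by_code = {p["productCode"]: p for p in products}
cust_by_number = {c["customerNumber"]: c for c in customers}

processing_ts = datetime.now(timezone.utc).isoformat()
ratings_for_ml = [
    {**r,
     **prod_by_code[r["productCode"]],
     **cust_by_number[r["customerNumber"]],
     "processing_ts": processing_ts}
    for r in ratings
]
print(ratings_for_ml[0]["country"])  # France
```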

Processing the ratings:
• Extract the latest ratings
• Update the ratings table if it already exists
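"Update the ratings if it already exists" maps naturally onto an Iceberg MERGE. A hedged sketch of the statement one might run follows; "latest_ratings" is an assumed staging table, and the lab's actual job may express this differently.

```python
# Hypothetical Iceberg upsert for the ratings table; table/column names
# follow the slides, the staging table name is an assumption.
merge_sql = """
MERGE INTO curated_zone.ratings AS t
USING curated_zone.latest_ratings AS s
  ON  t.customerNumber = s.customerNumber
  AND t.productCode    = s.productCode
WHEN MATCHED THEN
  UPDATE SET productRating = s.productRating, ingest_ts = s.ingest_ts
WHEN NOT MATCHED THEN
  INSERT (customerNumber, productCode, productRating, ingest_ts)
  VALUES (s.customerNumber, s.productCode, s.productRating, s.ingest_ts)
"""
print("MERGE INTO" in merge_sql)  # True
```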
Terraform
The lab pipeline is deployed with Terraform.
[Pipeline diagram repeated: source_bucket → data_lake → curated_zone Data Catalog]
Creating tables to serve:
For analytics end users:
• Average sales grouped by year and month
• Average ratings per product
• Latest ratings
For ML end users:
• Ratings for ML table
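A hedged sketch of the "average sales grouped by year and month" serving query; the join and the amount formula (quantityOrdered * priceEach) are assumptions based on the classic_models schema, not the lab's exact SQL.

```python
# Illustrative Athena query for the first analytics serving table.
avg_sales_sql = """
SELECT year(o.orderDate)  AS order_year,
       month(o.orderDate) AS order_month,
       avg(d.quantityOrdered * d.priceEach) AS avg_sales
FROM curated_zone.orders AS o
JOIN curated_zone.orderdetails AS d
  ON o.orderNumber = d.orderNumber
GROUP BY year(o.orderDate), month(o.orderDate)
ORDER BY order_year, order_month
"""
print("GROUP BY" in avg_sales_sql)  # True
```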
[Serving requirements repeated; the serving tables are created and queried with Amazon Athena.]
Creating the ratings table using Athena.
The 4 serving tables are stored in Iceberg format in the presentation_zone, registered in its Data Catalog database, and queried by analytics & ML users through Amazon Athena.
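A hedged sketch of Athena DDL for an Iceberg ratings table; the column types are assumptions, and {data_lake_bucket} stays a placeholder exactly as in the slides.

```python
# Illustrative Athena CREATE TABLE for an Iceberg table; not the lab's
# verbatim DDL.
create_ratings_sql = """
CREATE TABLE curated_zone.ratings (
    customerNumber int,
    productCode    string,
    productRating  double,
    ingest_ts      timestamp
)
LOCATION 's3://{data_lake_bucket}/curated_zone/ratings/iceberg'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""
print("ICEBERG" in create_ratings_sql)  # True
```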
Lab Walkthrough
Building a Data Lakehouse with AWS Lake Formation (Part 2)
The Iceberg ratings table is laid out under two prefixes:
s3://{data_lake_bucket}/curated_zone/ratings/iceberg/metadata
s3://{data_lake_bucket}/curated_zone/ratings/iceberg/data
[Iceberg table layout]
Metadata layer (/metadata):
• Metadata file — records the table schema, the location of the table in S3, the date and time of the last update, and the UUID of the current snapshot
• Snapshots (s0, s1) point to manifest lists, which point to manifest files
Storage layer (/data):
• Data files, referenced by the manifest files
[Iceberg layout diagram repeated: each commit writes a new metadata file whose snapshot list grows from s0 to s0, s1.]
Iceberg Catalog — holds the current metadata pointer for each table.
Here the Glue Data Catalog plays that role:
• It contains a catalog table for each Iceberg table in the curated and presentation zones
• Catalog tables are organized into the curated_zone and presentation_zone databases
Schema evolution
• Add a new column to the ratings table at:
s3://{data_lake_bucket}/curated_zone/ratings/iceberg
• Apply the transformation in Terraform using the "alter_table" module
• Only the metadata file changes; the data does not need to be completely rewritten
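The schema change the "alter_table" module applies boils down to DDL like the following; the new column's name and type are illustrative assumptions, not the lab's actual column.

```python
# Illustrative Athena/Iceberg schema-evolution statement.
alter_sql = """
ALTER TABLE curated_zone.ratings
ADD COLUMNS (comment string)
"""
print("ADD COLUMNS" in alter_sql)  # True
```

Because Iceberg tracks the schema in the metadata file, this commit rewrites only metadata, not the data files.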
Time Travel
Query the new and old versions of the ratings table.
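Hedged examples of Athena time-travel reads against the Iceberg ratings table; the interval and the snapshot id are placeholders, not values from the lab.

```python
# Read the table as it was a day ago, or pin a specific snapshot.
old_version_sql = """
SELECT * FROM curated_zone.ratings
FOR TIMESTAMP AS OF (current_timestamp - interval '1' day)
"""
pinned_version_sql = """
SELECT * FROM curated_zone.ratings
FOR VERSION AS OF 949530903748831860
"""
print("FOR TIMESTAMP AS OF" in old_version_sql)  # True
```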
Apply fine-grained permissions to the data:
• Metadata-level (catalog resources)
• Storage access
The data lake administrator grants IAM users and roles permissions on databases, tables, columns, rows, and cells.
• No need for a detailed IAM policy for data lake users
• Lake Formation permissions are meant to augment regular IAM permissions
• Data lake users still need an IAM policy granting access to the AWS Glue service, the Lake Formation service, and Amazon Athena
• Use Lake Formation to grant users fine-grained permissions on specific resources
Access flow: a data lake user queries through Amazon Athena; AWS Lake Formation checks the grant and vends temporary credentials, giving the user access to the metadata in the Data Catalog and the data in S3.
Grants in this lab:
• The role assumed by the Cloud9 instance and the role assumed by the Glue resources can access all catalog tables and the underlying stored data
• A Machine Learning team member gets access to the ratings_for_ml table from the presentation zone
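The ML team member's grant could be expressed through the Lake Formation API; the role ARN below is a placeholder, and the boto3 call is shown but not executed here (it requires AWS credentials and the catalog table to exist).

```python
# Sketch of a Lake Formation grant for the ratings_for_ml table.
grant_params = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/ml-team"
    },
    "Resource": {
        "Table": {"DatabaseName": "presentation_zone", "Name": "ratings_for_ml"}
    },
    "Permissions": ["SELECT"],
}
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant_params)
print(grant_params["Permissions"])  # ['SELECT']
```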
Storage Abstractions

Summary
Data Warehouse
• Serves analytics and reporting
• Low-latency query performance
• High storage cost

Data Lake (storage + catalog)
• Holds large amounts of structured and unstructured data
• Supports applications requiring lots of data: machine learning, big data processing
• Can turn into a data swamp
First lab
Source bucket → Glue ETL (partitioned data) → Data Lake bucket → Crawler (create data catalog) → Amazon Athena

Data Lakehouse
• Scalable, low-cost, and flexible storage combined with structured query and data management
• Serves both analytics and machine learning
Lab Assignment
