Deeplearning - Ai Deeplearning - Ai

The document outlines the copyright notice and educational purpose of the slides provided by DeepLearning.AI, which are licensed under Creative Commons. It introduces the course on Source Systems, Data Ingestion, and Pipelines, detailing the data engineering lifecycle, types of source systems, and the course plan over four weeks. The content covers structured, semi-structured, and unstructured data, as well as relational databases and their management.


Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Source Systems, Data Ingestion, and Pipelines

Week 1

Welcome
Working with Source Systems

Course 2 Overview
Data Engineering Lifecycle

• How is data generated?


• Where and how is it stored?
• What are its characteristics?
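[Figure: the data engineering lifecycle. Generation (source systems) feeds Ingestion, Transformation, and Serving, all built on Storage. Serving supports Analytics, Machine Learning, and Reverse ETL. The data engineer works with the source system owner.]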
Course Plan

Week 1 Common source systems


• Databases, object storage, and streaming sources
• Working with source systems on AWS

Week 2 Setting up ingestion from source systems

Week 3 DataOps undercurrent


• Automating some of your pipeline tasks
• Monitoring data quality
Week 4 Orchestration, monitoring, and automating data pipelines
• Setting up directed acyclic graphs
• Working with infrastructure as code
Introduction to Source Systems

Different Types of Source Systems
Structured Data: data organized as tables of rows and columns

ID   Last    First     Card
14   Barry   John      XXX878
25   Goode   Cynthia   XXX980
14   Barry   John      XXX978
25   Goode   Cynthia   XXX990

Each line of the table is a row; ID, Last, First, and Card are the columns.
Semi-Structured Data: data that is not in tabular form but still has some structure

JavaScript Object Notation (JSON): a series of key-value pairs. A value can itself be a list or another object (nested JSON format).

{
  "firstName": "Joe",
  "lastName": "Reis",
  "age": 10,
  "languages": ["Python", "JavaScript", "SQL"],
  "address": {
    "city": "Los Angeles",
    "postalCode": 90024,
    "country": "USA"
  }
}

Here "firstName" is a key and "Joe" is its value. "city", "postalCode", and "country" are nested keys inside "address", each with its own value.
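As a small illustration of how the nested key-value structure is read in practice, the sketch below parses the JSON example with Python's standard json module. The variable names are chosen for this example only.

```python
import json

# The semi-structured record from the slide, as JSON text.
raw = """
{
  "firstName": "Joe",
  "lastName": "Reis",
  "age": 10,
  "languages": ["Python", "JavaScript", "SQL"],
  "address": {
    "city": "Los Angeles",
    "postalCode": 90024,
    "country": "USA"
  }
}
"""

person = json.loads(raw)            # JSON object -> Python dict

# Top-level keys map directly to values.
first_name = person["firstName"]

# Nested keys are reached by following the structure one level at a time.
city = person["address"]["city"]

# A JSON array becomes a Python list.
languages = person["languages"]

print(first_name, city, languages)
```

There is no fixed set of columns here: another record could omit "address" or add new keys, which is what makes the data semi-structured rather than structured.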
Structured Data: data organized as tables of rows and columns

Semi-Structured Data: data that is not in tabular form but still has some structure

Unstructured Data: data that does not have any predefined structure
• Text
• Video
• Audio
• Images (dimensions, pixel colors)
Source Systems: Databases, Files, and Streaming Systems

Databases: store data in an organized way
• Hold structured data and semi-structured data
• Support Create, Read, Update, and Delete (CRUD) operations
• A Database Management System (DBMS) sits between the database storage and the person or application using it
• Relational databases: tables with rows and columns
• Non-relational (NoSQL) databases: non-tabular data, such as key-value pairs (k1: value 1, k2: value 2, k3: value 3) or documents such as {"firstName": "Joe", "lastName": "Reis", "age": 10}

Files: a sequence of bytes representing information
• Example: a JSON file containing {"firstName": "Joe", "lastName": "Reis", "languages": ["R", "SQL"]}
• Example storage service: Amazon S3

Streaming Systems: a continuous flow of data
• Producers send messages to a message queue or streaming platform, and consumers read from it
• Example streaming platform: Amazon Kinesis
• Example producer: a smart thermostat

Source system: your ingestion pipeline starts here. From these systems you ingest:
• Structured data
• Semi-structured data
• Unstructured data
Introduction to Source Systems

Relational Databases

[Figure: three tables, Customers, Products, and Orders, linked to one another by keys.]
• Reduce redundancy
• Make data easier to manage
Relational Databases
One big table for everything!

name      address      phone     date_time   amount  brand  SKU  description
Jane Doe  74th Street  12345678  12/08/2024  700     ABC    B32  Blender
Jane Doe  74th Street  12345678  12/08/2024  99      XYZ    i56  Iron
Jane Doe  74th Street  12345678  12/08/2024  100     GHJ    k70  Kettle
Mary Ann  19th Avenue  98765432  13/08/2024  899     STU    w40  Washer
John Ken  1st Link     36891623  14/08/2024  899     STU    w40  Washer
Ivy Tan   67th Street  98639513  15/08/2024  899     STU    w40  Washer

Inconsistency: the same fact is stored in many rows, so an update can reach some rows and miss others.
• Jane Doe now lives on 11th Avenue. The address is changed in two of her rows, but the third still says 74th Street.
• The washer's SKU is now w31. It is changed in two rows, but the third still says w40.
Relational Databases
Database schema: a single customer is stored in a single row of Customers, and a single product in a single row of Products.

Customers
id  first_name  last_name  age  address
1   Jane        Doe        24   74th Ave.
2   Mary        Ann        65   19th Ave.
3   John        Ken        27   1st Link
4   Ivy         Tan        18   67th St.

Products
id  brand  SKU  description
1   ABC    b32  Blender
2   XYZ    i56  Iron
3   GHJ    k70  Kettle
4   STU    w40  Washer

Orders
id  customer_id  product_id  date_time   purchase_amount
1   1            1           12/08/2024  700
2   1            2           12/08/2024  99
3   1            3           12/08/2024  100
4   2            4           13/08/2024  899
5   3            4           14/08/2024  899
6   1            4           15/08/2024  899

An update now touches one row only: Jane Doe's address changes from 74th Ave. to 11th St. in Customers, and the washer's SKU changes from w40 to w31 in Products.

Keys
• Primary key: uniquely identifies each row in a table.
• Foreign key: references the primary key of another table. For example, customer_id in Orders references the primary key of the Customers table.

Each row in a table has to follow the same column structure: the same sequence of columns and data types (for example, string or integer).
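The schema above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the course's own setup: the lowercase table names, the column types, and the use of SQLite are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,           -- primary key: uniquely identifies each row
    first_name TEXT, last_name TEXT, age INTEGER, address TEXT
);
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    brand TEXT, sku TEXT, description TEXT
);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),  -- foreign key
    product_id  INTEGER NOT NULL REFERENCES products(id),   -- foreign key
    date_time TEXT,
    purchase_amount INTEGER
);
""")

conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", [
    (1, "Jane", "Doe", 24, "74th Ave."), (2, "Mary", "Ann", 65, "19th Ave."),
    (3, "John", "Ken", 27, "1st Link"), (4, "Ivy", "Tan", 18, "67th St."),
])
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    (1, "ABC", "b32", "Blender"), (2, "XYZ", "i56", "Iron"),
    (3, "GHJ", "k70", "Kettle"), (4, "STU", "w40", "Washer"),
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, "12/08/2024", 700), (2, 1, 2, "12/08/2024", 99),
    (3, 1, 3, "12/08/2024", 100), (4, 2, 4, "13/08/2024", 899),
    (5, 3, 4, "14/08/2024", 899), (6, 1, 4, "15/08/2024", 899),
])

# One UPDATE on one row changes the address seen by every order of customer 1.
conn.execute("UPDATE customers SET address = '11th St.' WHERE id = 1")
rows = conn.execute("""
    SELECT o.id, c.address
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE c.id = 1
    ORDER BY o.id
""").fetchall()

# An order that points at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (7, 99, 1, '16/08/2024', 700)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(rows, fk_rejected)
```

Because the address lives in exactly one row, the inconsistency seen in the one-big-table version cannot occur here.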
Relational Databases
One Big Table (OBT) approach: keeping everything in one big table, as in the earlier example, suits use cases that need faster processing.
Relational Databases

Relational Database Management System (RDBMS): software layer that sits on top of a relational database to manage and interact with the data.

Structured Query Language (SQL): the language used to query the data through the RDBMS.
Relational Databases
SQL Commands

Data Cleaning    Data Joining  Data Aggregating  Data Filtering
DROP             INNER JOIN    SUM               WHERE
TRUNCATE         LEFT JOIN     AVG               AND
TRIM             RIGHT JOIN    COUNT             OR
REPLACE          FULL JOIN     MAX               IS NULL
SELECT DISTINCT  UNION         MIN               IS NOT NULL
                               GROUP BY          IN
                                                 LIKE
Introduction to Source Systems

SQL Queries
The Relational Database
• Database for a fictitious DVD rental company called Rentio
• Database schema, shown as an entity relationship diagram
• SQL queries answer business questions

Tables and columns in the schema:
actor: actor_id, first_name, last_name, last_update
film_actor: actor_id, film_id, last_update
film: film_id, title, description, release_year, language_id, rental_duration, rental_rate, length, replacement_cost, rating, last_update, special_features
film_category: film_id, category_id, last_update
category: category_id, name, last_update
language: language_id, name, last_update
inventory: inventory_id, film_id, store_id, last_update
rental: rental_id, rental_date, inventory_id, customer_id, return_date, staff_id, last_update
payment: payment_id, customer_id, staff_id, rental_id, amount, payment_date
customer: customer_id, store_id, first_name, last_name, email, address_id, activebool, create_date, last_update, active
staff: staff_id, first_name, last_name, address_id, email, store_id, active, username, password, last_update, picture
store: store_id, manager_staff_id, address_id, last_update
address: address_id, address, address2, district, city_id, postal_code, phone, last_update
city: city_id, city, country_id, last_update
country: country_id, country, last_update
The queries in this section use three tables:

category: category_id, name, last_update
film_category: film_id, category_id, last_update
film: film_id, title, description, release_year, language_id, rental_duration, rental_rate, length, replacement_cost, last_update, special_features

A query is built up one clause at a time. The clauses are written in this order:

SELECT    (optionally with an aggregate such as COUNT)
FROM
JOIN
WHERE     (for example, exploring the films that are less than 60 minutes long)
GROUP BY
ORDER BY
LIMIT

JOIN (INNER JOIN): combine the records from both tables that have a matching column value specified in the ON statement.
• film has a row with film_id = 123
• film_category does not have a row with film_id = 123
• The row with film_id = 123 will not be in the join results

Other join types: LEFT JOIN, RIGHT JOIN, FULL JOIN
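The clause order above can be tried out on a toy version of the three tables. The sketch below uses Python's sqlite3 module; the rows, the reduced column lists, and the business question (how many films under 60 minutes each category has) are invented for illustration and are not the Rentio data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, length INTEGER);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);

-- Hypothetical rows. Film 123 has no row in film_category.
INSERT INTO film VALUES (1, 'Short A', 50), (2, 'Short B', 55),
                        (3, 'Long C', 120), (123, 'Uncategorized D', 45);
INSERT INTO category VALUES (1, 'Comedy'), (2, 'Drama');
INSERT INTO film_category VALUES (1, 1), (2, 1), (3, 2);
""")

query = """
SELECT category.name, COUNT(*) AS short_films
FROM film
JOIN film_category ON film.film_id = film_category.film_id
JOIN category ON film_category.category_id = category.category_id
WHERE film.length < 60
GROUP BY category.name
ORDER BY short_films DESC
LIMIT 5
"""
result = conn.execute(query).fetchall()
print(result)  # [('Comedy', 2)]
```

Film 123 is shorter than 60 minutes but has no matching row in film_category, so the inner join drops it, just as described above. Replacing the first JOIN with LEFT JOIN would keep it, with no category attached.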
Common SQL Commands: SELECT, COUNT, FROM, JOIN, WHERE, GROUP BY, ORDER BY, LIMIT

Data Manipulation Operations: CREATE, INSERT INTO, UPDATE, DELETE
Introduction to Source Systems

NoSQL Databases
NoSQL does not mean "No SQL". It stands for "Not Only SQL".
• Relational databases are one family; NoSQL databases are the non-relational databases.
• A NoSQL database can still support SQL or SQL-like query languages.
NoSQL Databases
Non-tabular structures:
• Key-Value: k1 maps to value 1, k2 to value 2, k3 to value 3
• Document: for example
  {
    "firstName": "Joe",
    "lastName": "Reis",
    "age": 10,
    "address": {
      "city": "Los Angeles",
      "postalCode": 90024,
      "country": "USA"
    }
  }
• Wide-Column: each row (Row A, Row B) holds its own columns (Column 1, Column 2, Column 3) and values (Value 1, Value 2, Value 3)
• Graph

• No predefined schemas
• More flexibility when storing your data
Horizontal Scaling

[Figure: user 1 writes to one node of the NoSQL database, which is updated. User 2 reads from another node that is not yet updated, so the data received is not up-to-date. This is eventual consistency.]
Consistency

NoSQL Databases: Eventual Consistency
• Speed is prioritized
• System availability and scalability are important

Relational Databases: Strong Consistency
• Read data only when all nodes have been updated
NoSQL Databases

Not all NoSQL databases guarantee ACID compliance:
• Atomicity
• Consistency
• Isolation
• Durability
Specialized Query Language

Example of NoSQL data:

{
  "id": 1,
  "key": "Blender",
  "qty": 6,
  "sku": "b32"
}

Query:

db.products.find({qty: {$gt: 4}})

Ref: AWS docs
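The query above is written in MongoDB-style syntax, where $gt means "greater than". To show what it selects without needing a running database, here is a plain-Python sketch of the same filter. The function name and the second and third documents are hypothetical, added only so that the filter has something to exclude.

```python
# A "collection" of documents held in memory. Only the first document
# comes from the slide; the other two are made up for this example.
products = [
    {"id": 1, "key": "Blender", "qty": 6, "sku": "b32"},
    {"id": 2, "key": "Iron", "qty": 3, "sku": "i56"},
    {"id": 3, "key": "Kettle", "qty": 5, "sku": "k70"},
]


def find_qty_greater_than(documents, threshold):
    """Return the documents whose 'qty' field exists and exceeds threshold.

    Mirrors the intent of find({qty: {$gt: threshold}}): documents
    without a 'qty' field do not match.
    """
    return [doc for doc in documents if "qty" in doc and doc["qty"] > threshold]


matches = find_qty_greater_than(products, 4)
print([doc["key"] for doc in matches])  # ['Blender', 'Kettle']
```

A document database evaluates this kind of filter on the server, usually with the help of indexes, rather than scanning a Python list.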


Types of NoSQL Databases: Key-Value and Document

Key-Value Database

key    value
84620  {"name": "blender", "sku": "b32", "quantity": 6}
64820  {"name": "iron", "sku": "i56", "quantity": 5}
46173  {"name": "washer", "sku": "w40", "quantity": 6}

• The key is a unique identifier for its value.
• Fast lookup: such as caching user session data, with user_session_id as the key and the session data as the values:
  • viewing different products
  • adding items to the shopping cart
  • checking out
Document Database

• Collection (like a table): "users" below
• Document (like a row): a single user
• Each document is made up of keys and their values

{
  "users": [
    {
      "id": 1234,
      "name": {
        "first": "Joe",
        "last": "Reis"
      },
      "favorite_bands": ["AC/DC", "Slayer", "WuTang Clan", "Action Bronson"]
    },
    {
      "id": 1235,
      "name": {
        "first": "Matt",
        "last": "Housley"
      },
      "favorite_bands": ["Dave Matthews Band", "Creed", "Nickelback"]
    }
  ]
}
Document Database

In a relational database, the data in the users collection would be spread over three tables:

user_id  first_name  last_name
1234     Joe         Reis
1235     Matt        Housley

band_id  band_name
1        AC/DC
2        Slayer
3        Creed
4        Nickelback
5        WuTang Clan
6        Action Bronson
7        Dave Matthews Band

user_id  band_id
1234     1
1234     2
1234     5
1234     6
1235     7
1235     3
1235     4

Document database compared with relational database:
• Easy to retrieve all the information about a user (locality)
• Document stores don't support joins
• Flexible schema (relational tables have a fixed schema)
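To make the comparison concrete, the sketch below flattens the users document into three relational-style tables and then rebuilds one user's bands with a manual join. The table variable names are hypothetical, and band ids are assigned in the order bands are first seen, so they differ from the ids shown on the slide.

```python
document = {
    "users": [
        {"id": 1234, "name": {"first": "Joe", "last": "Reis"},
         "favorite_bands": ["AC/DC", "Slayer", "WuTang Clan", "Action Bronson"]},
        {"id": 1235, "name": {"first": "Matt", "last": "Housley"},
         "favorite_bands": ["Dave Matthews Band", "Creed", "Nickelback"]},
    ]
}

users = []        # rows of (user_id, first_name, last_name)
bands = {}        # band_name -> band_id
user_bands = []   # rows of (user_id, band_id)

for user in document["users"]:
    users.append((user["id"], user["name"]["first"], user["name"]["last"]))
    for band_name in user["favorite_bands"]:
        # Reuse the id if the band was seen before, else assign the next one.
        band_id = bands.setdefault(band_name, len(bands) + 1)
        user_bands.append((user["id"], band_id))

# Document model: everything about Joe is in one place (locality).
joe_from_document = document["users"][0]["favorite_bands"]

# Relational model: the same answer needs a join across two tables.
band_name_by_id = {band_id: name for name, band_id in bands.items()}
joe_from_tables = [band_name_by_id[b] for (u, b) in user_bands if u == 1234]

print(joe_from_document == joe_from_tables)
```

The relational version avoids repeating band names and enforces one shape for every row, at the cost of the extra lookup; the document version keeps each user self-contained.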
Document Database

Use cases
{
"iot" : [
• Content management {
"id": 24,
• Catalogs "interaction": "some_interaction",
“device": "my_device",
"sensor_reading": 34
• Sensor readings }
]
}

Flexible Schema
Document Database

Document databases become absolute nightmares to manage and query.

[Figure: the source system owner makes a change in the data held in the NoSQL document database. The change flows through ingestion to downstream use.]
Introduction to Source Systems

Database ACID Compliance


OLTP Systems

OLTP: Online Transaction Processing
• Support very high transaction rates (bank account balances, online orders)
ACID Compliance

Relational Databases: ACID compliant
NoSQL Databases: not ACID compliant by default

• Atomicity
• Consistency
• Isolation
• Durability

These principles help ensure transactions are processed reliably and accurately in an OLTP system.
ACID Compliance

Example: Account A holds $50 and Account B holds $200. Transfer $200 from B to A.
• You'd be hoping for: A = $250, B = $0
• But not: A = $50, B = $0
Atomicity: ensures that transactions are atomic, treated as a single, indivisible unit.
• A transaction groups several operations (Operation 1, Operation 2, Operation 3, Operation 4). Either all of them take effect or none of them do.
• Example transaction, placing an order: deducting the total cost from the customer's account, and updating the inventory to reflect the purchased item. Both operations must happen as a single transaction.

Consistency: any changes to the data made within a transaction follow the set of rules or constraints defined by the database schema.
• Example rule: stock level ≥ 0. With 1 blender in stock, a transaction to buy 2 blenders would leave a quantity of -1, which breaks the rule.
• Consistency in ACID compliance is different from strong consistency, where all nodes provide the same up-to-date data.

Isolation: each transaction is executed independently in sequential order.
• Example: 10 blenders are in stock and two transactions each buy 5 blenders. Executed one after the other, the quantity goes from 10 to 5 to 0.
• Example: 10 blenders are in stock; one transaction buys 5 blenders and another buys 10. Once the first has run, the second sees a quantity of 5.

Durability: once a transaction is completed, its effects are permanent and will survive any subsequent system failures.
• Essential for maintaining the reliability of the database
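Atomicity and consistency can be seen working together in a small sketch of the account transfer example. It uses Python's sqlite3 module; the accounts table, the CHECK constraint standing in for the "balance cannot go below zero" rule, and the transfer function are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        name TEXT PRIMARY KEY,
        balance INTEGER CHECK (balance >= 0)   -- consistency rule
    )
""")
conn.execute("INSERT INTO accounts VALUES ('A', 50), ('B', 200)")
conn.commit()


def transfer(conn, source, target, amount):
    """Move amount from source to target as one transaction.

    Returns True if the transfer committed, False if it was rolled back.
    """
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


first_ok = transfer(conn, "B", "A", 200)    # B has 200, so this succeeds
after_first = balances(conn)

second_ok = transfer(conn, "B", "A", 1)     # B has 0, so the debit breaks the rule
after_second = balances(conn)

print(first_ok, after_first, second_ok, after_second)
```

In the second transfer the credit to A is executed first and then undone when the debit from B fails, so A never ends up with money that B did not give up. That is the "all or nothing" behavior the slides describe.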
ACID Compliance

The ACID principles guarantee that a database will maintain a consistent picture of the world.

Strong Consistency
• Data is consistent across the entire network
• Key feature of relational databases that ensures ACID

[Figure: data is partitioned or replicated across the network.]
Introduction to Source Systems

Lab Walkthrough - Interacting with Amazon DynamoDB NoSQL Database

Interacting with Amazon DynamoDB
• Apply some Create, Read, Update, and Delete (CRUD) operations on Amazon DynamoDB

In this video:
• Overview of DynamoDB features
• Data you will work on
• DynamoDB methods that you will use to apply CRUD operations
Amazon DynamoDB: Key-value Database

Key-value items: the key is PersonID and the value is the item's attributes.

PersonID  Attributes
101       {"FirstName": "Joe", "LastName": "Reis", "Phone": "111-222", "Country": "USA", "FavoriteBands": {"Action Bronson", "Slayer", "WuTang Clan"}}
102       {"FirstName": "Matt", "LastName": "Housley", "Phone": "222-333", "Country": "USA"}

Table
• Row: attributes of one item, uniquely identified by the item's key
• Simple primary key: partition key (PersonID above)
• Composite primary key: partition key & sort key

Composite primary key example (partition key: OrderID, sort key: ItemNum):

OrderID  ItemNum  Attributes
1234     Item1    Price: 10, Quantity: 1, ProductType: Book, ISBN: 45679, Title: Data
1234     Item2    Price: 50, Quantity: 1, ProductType: Bike, Brand: AZY, Color: Black
1235     Item1    Price: 23, Quantity: 4, ProductCode: 23697
1235     Item2    Price: 1200, Quantity: 2, ProductType: Laptop, Brand: XYZ

Schema-less: each item can have its own distinct attributes.
Interacting with Amazon DynamoDB
Interact with the tables using Python

Boto3: the AWS Software Development Kit (SDK) for Python. Allows you to create and configure AWS services using Python.

CRUD operation  Boto3 methods
Create          create_table
Read            scan, get_item, query
Update          put_item, batch_write_item, update_item
Delete          delete_item
Data

Four tables are loaded:
• Product Catalog table: information about some products sold on Amazon. ID: simple primary key.
• Forum table: information about some AWS forums. Each forum: total number of threads, messages, and views. Name: simple primary key.
• Thread table: information about each forum thread. Each thread: subject, message, total number of views and replies. ForumName & Subject: composite primary key.
• Reply table: information about each thread reply. Each reply: time, reply message, user, ID (forum and thread subject). ID & ReplyDateTime: composite primary key.
Data load format for the Forum table. Each attribute value is tagged with its type (S: String, N: Number):

{
  "Forum": [
    {
      "PutRequest": {
        "Item": {
          "Name": {"S": "Amazon DynamoDB"},
          "Category": {"S": "Amazon Web Services"},
          "Threads": {"N": "2"},
          "Messages": {"N": "4"},
          "Views": {"N": "1000"}
        }
      }
    },
    {
      "PutRequest": {
        "Item": {
          "Name": {"S": "Amazon S3"},
          "Category": {"S": "Amazon Web Services"}
        }
      }
    }
  ]
}
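The typed format above can be generated from ordinary Python dictionaries. The sketch below is one possible way to do it, not the lab's own code: the helper names are hypothetical, only the S and N types from the slide are handled, and the `load` function assumes it is given a Boto3 DynamoDB client (for example `boto3.client("dynamodb")`), whose `batch_write_item` method takes a `RequestItems` argument in this shape.

```python
def to_attribute_value(value):
    """Wrap a Python string or number in DynamoDB's typed form."""
    if isinstance(value, str):
        return {"S": value}
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return {"N": str(value)}  # numbers are sent as strings
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def build_request_items(table_name, items):
    """Build the {table: [PutRequest, ...]} structure shown above."""
    return {
        table_name: [
            {"PutRequest": {"Item": {k: to_attribute_value(v) for k, v in item.items()}}}
            for item in items
        ]
    }


def load(client, request_items):
    """Send the put requests in one batch call.

    `client` is assumed to be a Boto3 DynamoDB client. Creating one needs
    AWS credentials, so it is not done here.
    """
    return client.batch_write_item(RequestItems=request_items)


forums = [
    {"Name": "Amazon DynamoDB", "Category": "Amazon Web Services",
     "Threads": 2, "Messages": 4, "Views": 1000},
    {"Name": "Amazon S3", "Category": "Amazon Web Services"},
]
request_items = build_request_items("Forum", forums)
print(request_items["Forum"][0]["PutRequest"]["Item"]["Views"])  # {'N': '1000'}
```

The two forum items carry different attributes, which the table accepts because only the primary key (Name) is required on every item.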
Introduction to Source Systems

Object Storage
Object Storage

• A traditional file system organizes files in a hierarchy. Object storage has no hierarchy!
• Example: Amazon S3
• Use cases:
  • Storing semi-structured and unstructured data
  • Serving data for training machine learning models
• Objects are immutable: programs write an object as a whole rather than changing it in place.
• Versioning can be enabled to keep track of the versions of an object.

For each object:
• Universally Unique Identifier or UUID (key)
• Metadata: creation date, file type, owner, and version (when versioning is enabled)
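The ideas of a flat key space, immutable objects, per-object metadata, and versioning can be illustrated with a toy in-memory store. This is a teaching sketch only: the class and method names are hypothetical and it is not how Amazon S3 is implemented or accessed.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


class ToyObjectStore:
    """Flat key -> object mapping with versioning always enabled."""

    def __init__(self):
        self._versions = {}  # key -> list of (metadata, data), oldest first

    def put(self, data: bytes, file_type: str, owner: str,
            key: Optional[str] = None) -> str:
        """Write a whole object. Existing versions are never modified."""
        if key is None:
            key = str(uuid.uuid4())          # UUID used as the object's key
        versions = self._versions.setdefault(key, [])
        metadata = {
            "creation_date": datetime.now(timezone.utc).isoformat(),
            "file_type": file_type,
            "owner": owner,
            "version": len(versions) + 1,
        }
        versions.append((metadata, bytes(data)))
        return key

    def get(self, key: str, version: Optional[int] = None):
        """Return (data, metadata) for the latest or a specific version."""
        versions = self._versions[key]
        metadata, data = versions[-1] if version is None else versions[version - 1]
        return data, dict(metadata)


store = ToyObjectStore()
key = store.put(b'{"reading": 20}', file_type="json", owner="thermostat-team")
store.put(b'{"reading": 21}', file_type="json", owner="thermostat-team", key=key)

latest_data, latest_meta = store.get(key)
first_data, first_meta = store.get(key, version=1)
print(latest_meta["version"], first_data)
```

"Updating" an object here means writing a complete new version under the same key; the earlier bytes stay untouched and remain retrievable.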
Why Use Object Storage?

• Store files of various data formats without a specific file system structure
• Easily scale out to provide virtually limitless storage space
• Replicate data across several availability zones (Amazon S3: 99.999999999% data durability)
• Cheaper than other storage options
Introduction to Source Systems

Logs
Logs

01-01-2025:10.30 67945 success user added a product x to their cart
01-01-2025:10.32 38910 fail invalid values typed for product quantity
01-01-2025:10.38 17462 fail customer table corrupted

• Logs are software exhaust, a byproduct of a running application.
• They are used for monitoring or debugging a system.
• Events recorded in logs include:
  • User activity: signing in, navigating to a particular page
  • An update to a database
  • An error from a procedure
An append-only sequence of records ordered by time, capturing
Log
information about events that occur in systems.

Rich data source Downstream use cases

Web Server Logs Analysis of user


Detailed user activity data behavior patterns

Database System Track changes in


Logs source database

Security System Machine Learning


Logs anomaly detection
An append-only sequence of records ordered by time, capturing
Log
information about events that occur in systems.

timestamp user id status action


01-01-2025:10.30 67945 success user added a product x to their cart
01-01-2025:10.32 38910 fail invalid values typed for product quantity
01-01-2025:10.38 17462 fail customer table corrupted

Event Person, system, or Event &


timestamp account associated event metadata
with the event
JSON:
{
  "user id": 67945,
  "action": "user added a product x to their cart",
  "status": "success",
  "time-stamp": "01-01-2025:10.30"
}

Binary:
[00101011 11000101 11001001 11000101 110001001]

Table:
user id   action                                      status    timestamp
67945     user added a product x to their cart        success   01-01-2025:10.30
38910     invalid values typed for product quantity   fail      01-01-2025:10.32
17462     customer table corrupted                    fail      01-01-2025:10.38
Log Levels

A tag to categorize the event (log level)

• “debug”
• “info”
• “warn”
• “error”
• “fatal”

user id   action                                      status    timestamp          level
67945     user added a product x to their cart        success   01-01-2025:10.30   info
38910     invalid values typed for product quantity   fail      01-01-2025:10.32   error
17462     customer table corrupted                    fail      01-01-2025:10.38   fatal
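For illustration, Python's standard `logging` module uses a similar set of levels (it names them DEBUG, INFO, WARNING, ERROR, and CRITICAL, with `logging.FATAL` as an alias of CRITICAL). The sketch below assumes a hypothetical logger called `shop_app` and writes to an in-memory buffer so the result can be inspected.

```python
import io
import logging

# Capture log output in memory instead of printing to the console.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("shop_app")  # hypothetical application name
logger.setLevel(logging.INFO)           # anything below INFO is dropped
logger.addHandler(handler)
logger.propagate = False

logger.debug("cart lookup took 12 ms")  # filtered out by the INFO threshold
logger.info("user 67945 added a product x to their cart")
logger.error("user 38910 typed invalid values for product quantity")
logger.critical("customer table corrupted")

print(buffer.getvalue())
```

Setting the threshold to INFO is a design choice: debug messages stay available in development but do not flood production logs.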
Introduction to Source Systems

Streaming Systems
Terminology

Event: Something that happened in the world or a change to the state of a system.
• User clicking on a link
• Sensor measuring a temperature change

Message: A record of information about an event.
• Event Details
• Event Metadata
• Event Timestamp

Stream: A sequence of messages.

Event Producer -> Event, Event, Event, Event (Stream)
Data: record of events
Stream Processing

Batch processing: Event Producer -> Collector -> Batch Processing
Stream processing: Event Producer -> Streaming system, which processes each message as it is received
Streaming System

Event Producer -> (Message) -> Event Router / Streaming Broker -> (Message) -> Event Consumer(s)

• Event Producer: generates the messages (IoT Device, Mobile App, API, Website)
• Event Consumer: processes the messages
• Example: Order system (producer) -> Event Router / Streaming Broker -> Payment service, Inventory service (consumers)

Producer can send messages anytime

Event Router / Streaming Broker:
• Acts as a buffer to filter and distribute the messages
• Decouples producer from consumer
• Prevents messages from being lost
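The broker's role as a buffer that filters and distributes messages can be sketched in a few lines. This is a toy in-memory model, not a real broker; the `ToyBroker` class, the message types, and the service names used as consumer identifiers are hypothetical.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for an event router / streaming broker."""

    def __init__(self):
        self._wanted = {}   # consumer name -> set of message types it wants
        self._buffers = {}  # consumer name -> deque of undelivered messages

    def subscribe(self, consumer, message_types):
        self._wanted[consumer] = set(message_types)
        self._buffers[consumer] = deque()

    def publish(self, message):
        # Filter and distribute: buffer the message for every interested consumer.
        for consumer, wanted in self._wanted.items():
            if message["type"] in wanted:
                self._buffers[consumer].append(message)

    def poll(self, consumer):
        # Consumers pull when they are ready; None means nothing is pending.
        buffer = self._buffers[consumer]
        return buffer.popleft() if buffer else None


broker = ToyBroker()
broker.subscribe("payment_service", {"order_placed"})
broker.subscribe("inventory_service", {"order_placed", "order_cancelled"})

# The order system (producer) sends messages without waiting for consumers.
broker.publish({"type": "order_placed", "order_id": 1})
broker.publish({"type": "order_cancelled", "order_id": 1})

payment_first = broker.poll("payment_service")
payment_second = broker.poll("payment_service")
inventory_messages = [broker.poll("inventory_service"), broker.poll("inventory_service")]
print(payment_first, payment_second, len(inventory_messages))
```

Because the producer only talks to the broker, a consumer can be slow or temporarily offline without the producer noticing, which is the decoupling the slide describes.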
Your Data System

Event Producer -> Event Router / Streaming Broker -> Event Consumer(s)

Labels in the diagram:
• Source System
• The system you build
• Your ingestion pipeline starts here
Streaming System

Event Producer -> Event Router / Streaming Broker -> Event Consumer(s)

Event Router / Streaming Broker:
• Message Queue
• Event Streaming Platform
Message Queue

A queue/buffer that accumulates messages and delivers those messages to consumers asynchronously.

Event Producer -> Message Queue [4 3 2 1] -> Event Consumer

• First-in first-out (FIFO) basis
• Temporary storage: after message 1 is received by the consumer, the queue holds [5 4 3 2]
• Amazon Simple Queue Service (Amazon SQS)
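The first-in first-out behavior in the diagram can be reproduced with Python's `collections.deque`. This is a local sketch of the queue idea only and does not call Amazon SQS.

```python
from collections import deque

queue = deque()

# The producer enqueues messages 1 to 4 in order.
for message_id in (1, 2, 3, 4):
    queue.append(message_id)

# The consumer receives the oldest message, which then leaves the queue.
received = queue.popleft()

# The producer can keep sending while the consumer works.
queue.append(5)

print(received, list(queue))
```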
Event Streaming Platform

Log: Append-only record of events

Event Producer -> Streaming Platform [4 3 2 1] -> Event Consumer(s) (read)

• Possible to replay or reprocess any events in the log
• Amazon Kinesis Data Streams
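Unlike a queue, an append-only log keeps events after they are read, so each consumer only tracks its own position. The sketch below is a toy model of that idea; the `ToyEventLog` class and the offset bookkeeping are hypothetical and are not the Amazon Kinesis Data Streams API.

```python
class ToyEventLog:
    """Append-only list of events; reading never removes anything."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, offset):
        """Return every event from the given offset onward."""
        return self._events[offset:]


log = ToyEventLog()
for number in (1, 2, 3, 4):
    log.append(f"event-{number}")

# Each consumer keeps its own read position (offset) in the log.
offset_a = 0
batch_a = log.read(offset_a)
offset_a += len(batch_a)

# A later event is picked up from where consumer A left off.
log.append("event-5")
next_batch_a = log.read(offset_a)

# Replay: any consumer can start again from offset 0 and reprocess everything.
replayed = log.read(0)
print(next_batch_a, len(replayed))
```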


Interacting with Source Systems

Lesson Overview
Connecting to Source Systems

Data Engineer -> Data Sources

• Improper identity and access management (IAM) definitions
• Broken networking configurations
• Wrong set of access credentials
Lesson’s Plan
1 Ways of connecting to source systems
2 IAM roles and permissions (Role, Permissions)
  Key to controlling and managing access to cloud-based data sources
3 Basics of networking
  VPCs and Subnets, Gateways, Routing, Security groups
4 Lab exercise: put your skills to the test
  Real world scenario. Your job: troubleshoot and figure out the cause of the problem
Interacting with Source Systems

Connecting to Source Systems

boto3: AWS Software Development Kit (SDK) for Python
Running this command in Cloud9 IDE

Programmatic Way
• SDK (boto3), from an IDE (Cloud9) or a Jupyter Notebook
• API Connector: JDBC/ODBC (Application -> Connector -> API)
Interacting with Source Systems

Basics of IAM and Permissions


Security on the Cloud

• Encryption Methods
• Identity and Access Management (IAM)
• Networking Protocols

Mistakes
• Insecure storage of passwords
• IAM misconfigurations

Examples shown: access credentials on GitHub, confidential data in a public S3 bucket, admin access

IAM
IAM is a framework for managing permissions.
Permissions define which actions an identity can perform on a specific set of resources.

Identity: a person or an application
Principle of Least Privilege

Data Sources -> Ingestion -> Transformation -> Serving, with Storage underneath

Example: the ingestion system can only read from specific tables
AWS IAM
AWS Identity and Access Management (IAM)

Within an AWS account, policies connect identities to resources (Amazon S3, Amazon RDS, Amazon EC2).

Identities:
• Root User: has unrestricted access to all resources
• IAM User: has specific permissions to certain resources
  • Username & password
  • Access key
• IAM Group: a collection of users that inherit the same permissions from the group policy
• IAM Role: a user, application, or service that’s been granted temporary permissions
AWS IAM

Amazon EC2 -> Amazon S3: not allowed to read from or write to Amazon S3
Amazon EC2 + Role -> Amazon S3: allowed to read from or write to Amazon S3

More secure than storing long-term user credentials within the EC2 configurations

ACCESS DENIED: check if credentials have expired!
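Because role credentials are temporary, an expiry check is a cheap first step when an access-denied error appears. The sketch below only compares timestamps; the `credentials_expired` helper and the one-hour lifetime are illustrative assumptions, not values taken from AWS.

```python
from datetime import datetime, timedelta, timezone


def credentials_expired(expiration, now=None):
    """Return True when the temporary credentials are past their expiry time."""
    now = now or datetime.now(timezone.utc)
    return now >= expiration


# Hypothetical credentials issued at 10:30 UTC with an assumed one-hour lifetime.
issued_at = datetime(2025, 1, 1, 10, 30, tzinfo=timezone.utc)
expiration = issued_at + timedelta(hours=1)

still_valid = credentials_expired(expiration, now=issued_at + timedelta(minutes=30))
too_late = credentials_expired(expiration, now=issued_at + timedelta(hours=2))
print(still_valid, too_late)
```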
IAM Policies

• Statement "S3AccessDLAIBucket": permission to access the specified S3 buckets
• Statement "GlueMgmt": permission to access the AWS Glue job

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3AccessDLAIBucket",
      "Action": [
        "s3:List*",
        "s3:Get*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::dlai-data-engineering",
        "arn:aws:s3:::dlai-data-engineering/*"
      ]
    },
    {
      "Sid": "GlueMgmt",
      "Action": [
        "glue:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:*/de-c1w2*"
      ]
    }
  ]
}
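To see how the wildcards in a policy like this one match requests, here is a deliberately simplified checker. It is not AWS's actual evaluation logic (it ignores explicit denies, conditions, and case-insensitive action names), and the `is_allowed` helper, the region, the account ID, and the object paths are hypothetical.

```python
from fnmatch import fnmatchcase

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3AccessDLAIBucket",
            "Action": ["s3:List*", "s3:Get*"],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::dlai-data-engineering",
                "arn:aws:s3:::dlai-data-engineering/*",
            ],
        },
        {
            "Sid": "GlueMgmt",
            "Action": ["glue:*"],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:*/de-c1w2*",
            ],
        },
    ],
}


def is_allowed(policy, action, resource):
    """Simplified check: some Allow statement must match both action and resource."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, p) for p in statement["Action"])
        resource_ok = any(fnmatchcase(resource, p) for p in statement["Resource"])
        if action_ok and resource_ok:
            return True
    return False  # nothing matched, so the request is implicitly denied


read_ok = is_allowed(POLICY, "s3:GetObject", "arn:aws:s3:::dlai-data-engineering/raw/file.csv")
write_ok = is_allowed(POLICY, "s3:PutObject", "arn:aws:s3:::dlai-data-engineering/raw/file.csv")
print(read_ok, write_ok)
```

The policy grants list and get actions only, so the write request falls through to the implicit deny, which is the principle of least privilege in practice.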
Interacting with Source Systems

Basics of Networking
Network

Video by Adobe Stock (paid license)


AWS Cloud
What does cloud in “cloud computing” mean?
The “cloud” is made up of very real physical data centers that are
spread out around the world.

Each dot represents a region (screenshot from AWS Global Infrastructure, 2023)


AWS Cloud

Resources are replicated across availability zones to ensure that your systems keep working even if a data center goes down.
AWS Cloud

Region considerations:

• Legal compliance
• Latency : the closer your end users are to the region, the lower the latency

• Availability : the more availability zones, the better you will be able to recover from a disaster

• Cost
Virtual Private Cloud (VPC)

Smaller networks that span multiple availability zones
• Public subnet: for internet-facing resources
• Private subnet: for internal resources

[Diagram: AWS Cloud containing a VPC with a private subnet and a public subnet. Components shown: User, ALB, RDS, EC2 Instance, Athena, S3 Bucket, Glue ETL, Glue Crawler]
Interacting with Source Systems

AWS Networking -
VPCs & Subnets
Example Scenario
[Diagram: AWS Cloud with a VPC spanning Availability Zone 1 and Availability Zone 2. Each zone has a public subnet with a NAT Gateway and a private subnet with EC2 instances. A User and an ALB are also shown.]

AWS Networking - VPCs & Subnets
[Diagram: AWS Cloud with a VPC spanning Availability Zone 1 and Availability Zone 2]
Needs to be configured
AWS Networking - VPCs & Subnets

IPv4 address: 32 bits, written as four 8-bit numbers, each from 0 to 255

Example: 10.0.0.0/16
• /16 is the prefix length: how many bits are used for the network part of the address
• First 16 bits (10.0): define the network
• Last 16 bits: host addresses, 10.0._._ where each blank ranges from 0 to 255
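The prefix-length arithmetic can be checked with Python's standard `ipaddress` module. The sample host addresses and the choice to carve the range into /24 subnets are assumptions for illustration.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

# 16 network bits leave 32 - 16 = 16 host bits, so 2**16 addresses.
print(vpc.prefixlen, vpc.num_addresses)

# Any address of the form 10.0._._ belongs to the network.
inside = ipaddress.ip_address("10.0.42.7") in vpc
outside = ipaddress.ip_address("10.1.0.1") in vpc
print(inside, outside)

# Assumed layout: split the /16 into /24 subnets of 256 addresses each.
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets), subnets[0], subnets[1])
```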
AWS Networking - VPCs & Subnets

[Diagram: AWS Cloud with a VPC spanning Availability Zone 1 and Availability Zone 2. Each zone has a public subnet and a private subnet with EC2 instances.]

Closed network: no access to the internet
Interacting with Source Systems

AWS Networking -
Internet Gateways & NAT Gateways
Example Scenario

[Diagram: AWS Cloud with a VPC spanning Availability Zone 1 and Availability Zone 2. Each zone has a public subnet and a private subnet with EC2 instances. The following slides add an internet gateway on the VPC, a NAT Gateway in each public subnet, and an ALB that a User connects to.]

Considerations
1. Applications running on EC2 need to occasionally download updates from resources on the internet
   • Upgrades, patching
2. Need a way to submit requests to the application running on the EC2 instance

NAT Gateway (Network Address Translation Gateway):
• Allows resources in a private subnet to connect to the internet or other AWS services
• Prevents the internet from initiating connections with those resources

ALB (Application Load Balancer):
• Distributes incoming application traffic across multiple backend targets
• Handles the load and ensures the application remains responsive and available
• Keeps those EC2 instances private
Interacting with Source Systems

AWS Networking -
Route Tables
Route Tables

• Essential for directing network traffic within your VPC
• Default route table allows internal communication within the VPC

[Diagram: the same architecture (User, ALB, internet gateway, NAT Gateways, EC2 instances), now with a route table on each public and private subnet]
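A route table can be pictured as a list of destination ranges and targets, where the most specific matching range wins. The sketch below assumes a 10.0.0.0/16 VPC, a public subnet whose default route points at the internet gateway, and a private subnet whose default route points at a NAT gateway; the target names and the `route_target` helper are hypothetical.

```python
import ipaddress

# Assumed route tables: (destination CIDR, target).
PUBLIC_ROUTES = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "internet-gateway")]
PRIVATE_ROUTES = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")]


def route_target(route_table, destination_ip):
    """Pick the target of the most specific (longest prefix) matching route."""
    ip = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in route_table:
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route: the traffic cannot leave the subnet
    return max(matches)[1]


internal = route_target(PRIVATE_ROUTES, "10.0.3.25")
external_private = route_target(PRIVATE_ROUTES, "93.184.216.34")
external_public = route_target(PUBLIC_ROUTES, "93.184.216.34")
print(internal, external_private, external_public)
```

With only the local route present, a destination outside the VPC has no match at all, which corresponds to the closed network described earlier.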
Interacting with Source Systems

AWS Networking -
Network ACLs & Security Groups
[Diagram: the same architecture, now with Security Groups and Network Access Control Lists (ACL)]
Security Groups
Instance level virtual firewalls, controlling both inbound and outbound traffic

• Inbound traffic is denied by default; outbound traffic is allowed

Inbound Rules
• Determine what types of traffic you want to allow
• Where you want to allow that traffic to come from

Security Groups are stateful
• Allowing inbound traffic to an instance automatically allows the return traffic
• Example: allow inbound HTTP requests on port 80, and the outbound HTTP responses are allowed automatically
Security Group Chaining

Security group ID: sg-123 (ALB)
Source                  Protocol   Port
0.0.0.0/0 (internet)    HTTP       80
0.0.0.0/0 (internet)    HTTPS      443

Security group ID: sg-456 (EC2)
Source                  Protocol   Port
sg-123 (ALB)            HTTP       80
sg-123 (ALB)            HTTPS      443

Security group ID: sg-789
Source                  Protocol   Port
sg-456 (EC2)            TCP        3306
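The chaining in the three rule tables above can be modeled as a lookup in which a rule's source is either a CIDR range or another security group's ID. This is a toy model rather than AWS's implementation; the `inbound_allowed` helper and the sample client address are hypothetical.

```python
import ipaddress

# Inbound rules from the slide: (source, protocol, port) per security group.
SECURITY_GROUPS = {
    "sg-123": [("0.0.0.0/0", "HTTP", 80), ("0.0.0.0/0", "HTTPS", 443)],
    "sg-456": [("sg-123", "HTTP", 80), ("sg-123", "HTTPS", 443)],
    "sg-789": [("sg-456", "TCP", 3306)],
}


def inbound_allowed(target_sg, port, source_ip=None, source_sg=None):
    """Return True if some inbound rule of target_sg admits this traffic."""
    for source, _protocol, rule_port in SECURITY_GROUPS[target_sg]:
        if rule_port != port:
            continue
        if source.startswith("sg-"):
            # Chained rule: only traffic from members of that group is allowed.
            if source == source_sg:
                return True
        elif source_ip is not None:
            if ipaddress.ip_address(source_ip) in ipaddress.ip_network(source):
                return True
    return False


internet_to_alb = inbound_allowed("sg-123", 443, source_ip="203.0.113.9")
internet_to_ec2 = inbound_allowed("sg-456", 80, source_ip="203.0.113.9")
alb_to_ec2 = inbound_allowed("sg-456", 80, source_sg="sg-123")
alb_to_sg789 = inbound_allowed("sg-789", 3306, source_sg="sg-123")
print(internet_to_alb, internet_to_ec2, alb_to_ec2, alb_to_sg789)
```

Each layer accepts traffic only from the layer directly in front of it, so the internet can reach the ALB but nothing behind it.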


Network Access Control Lists (ACL)

• They provide an additional layer of security at the subnet level


• Network ACLs are stateless
• You need to define both inbound and outbound rules explicitly
• Useful for implementing security policies at the subnet level
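The practical difference between stateful and stateless filtering shows up in the reply traffic. The sketch below reduces both to sets of allowed ports; the function names and the client's ephemeral port number are hypothetical, and real rules also consider protocol and source.

```python
def security_group_allows_reply(inbound_ports, request_port):
    """Stateful: if the request was allowed in, the reply is allowed out."""
    return request_port in inbound_ports


def network_acl_allows_reply(inbound_ports, outbound_ports, request_port, reply_port):
    """Stateless: the request and the reply are checked independently."""
    return request_port in inbound_ports and reply_port in outbound_ports


CLIENT_EPHEMERAL_PORT = 50000  # hypothetical port the client listens on for the reply

sg_reply = security_group_allows_reply({80}, 80)

# A network ACL with an inbound rule only: the reply is dropped.
acl_reply_missing_rule = network_acl_allows_reply({80}, set(), 80, CLIENT_EPHEMERAL_PORT)

# Adding an explicit outbound rule for the ephemeral port range fixes it.
acl_reply_with_rule = network_acl_allows_reply(
    {80}, set(range(1024, 65536)), 80, CLIENT_EPHEMERAL_PORT
)
print(sg_reply, acl_reply_missing_rule, acl_reply_with_rule)
```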
VPCs and subnets
• Give you a way to define a private network on AWS.

Route Tables
• Direct traffic within the VPC and to the internet.

Internet Gateway
• Allows resources within public subnets to access the internet.

NAT Gateway
• Enables instances to initiate outbound connections securely.

Security Groups
• They act as virtual firewalls at the instance level
• They control both inbound and outbound traffic
• They are stateful

Network ACLs
• They provide an additional layer of security at the subnet level
• They are stateless, i.e., they require explicit rules for both inbound and outbound traffic
If you encounter connectivity issues:

1. Verify that your VPC has an internet gateway properly attached
2. Verify that the route tables have appropriate rules to direct traffic correctly
3. Verify that the route table associations with the subnets are configured correctly
4. Check security groups to make sure they have the needed rules in place
5. Review network ACLs to confirm they allow the necessary traffic
6. Double-check instance configurations to ensure they are associated with the correct security groups and subnets
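Before walking through the checklist, it can help to confirm whether a TCP connection to the target host and port succeeds at all. The sketch below uses Python's standard `socket` module and demonstrates the check against a throwaway listener on the local machine; the `can_connect` helper is hypothetical, and in practice the host and port would be those of the database or service being debugged.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Local demonstration: open a listener, test it, close it, test again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

reachable_while_open = can_connect("127.0.0.1", port)
listener.close()
reachable_after_close = can_connect("127.0.0.1", port)
print(reachable_while_open, reachable_after_close)
```

A timeout often points at routing or firewall rules (steps 1 to 5), while an immediate refusal usually means the network path works but nothing is listening on that port.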
Connecting to Source Systems

Lab Walkthrough - Database Connectivity and Troubleshooting on AWS

1. Connect to the database
2. Create a table & populate it with data

Along the way: fix connection issues and fix permission issues
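The two lab steps follow the usual database client pattern: open a connection, then run SQL through it. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in so that it runs anywhere; the lab's database on AWS would need its own driver, host, and credentials, and the `customers` table and its rows are made up.

```python
import sqlite3

# Step 1: connect to the database (an in-memory SQLite stand-in here).
connection = sqlite3.connect(":memory:")

# Step 2: create a table and populate it; the with-block commits on success.
with connection:
    connection.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    connection.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )

rows = connection.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(rows)
connection.close()
```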

• Skip this video, jump straight into the lab and go for it
• The lab instructions contain hints
• Or, start the lab and follow along with me as you go through this video.
• When an issue occurs, I’ll be inviting you to pause the video
• After that, I’ll show you how to fix it.
Working with Source Systems

Week 1 Summary
Understand how source systems work

• Relational databases
• NoSQL databases
  • Key Value: k1 -> value 1, k2 -> value 2, k3 -> value 3
  • Document:
    {
      "firstName": "Joe",
      "lastName": "Reis",
      "age": 10,
      "address": {
        "city": "Los Angeles",
        "postalCode": 90024,
        "country": "USA"
      }
    }
• Object Storage
• Logs
  user id   action                                      status    timestamp
  67945     user added a product x to their cart        success   01-01-2025:10.30
  38910     invalid values typed for product quantity   fail      01-01-2025:10.32
• Streaming Systems: Producer -> Event Router / Streaming Broker -> Consumer


Week 1 Summary

• How to connect to data sources

• Basics of networking

• Importance of IAM in ensuring security in source systems


Week 1 Summary

Data Engineer -> Data Sources

• Improper identity and access management (IAM) definitions
• Broken networking configurations
• Wrong set of access credentials
