Chinamaysir Notes

AWS Session-1

Cloud Basics:
1] Client, Network, Website:
2] Components Of A Server:
    Compute: Cores --> Handle Processing Needs
    Storage: HDD, SSD, Database
    Memory: RAM
    Network: Routers, Switches, DNS Server
On-Premise Infra/Server --> Traditional Approach
Disadvantages:
1] Scaling Limitations
2] High Costs --> Initial Investment, Human Resources
3] Maintenance
4] Threat Of Info Loss Due To Disasters

How Can Cloud Help In Overcoming These Disadvantages?
1] Managed Services --> Mainly Related To Infra/Server
2] Data Security And Loss Prevention
3] On-Demand Delivery As Per Needs
4] Data Maintained In Different Locations (Data Centers)

Cloud Computing:
On Demand Delivery Of Compute Power,Storage,Memory Or Any Other IT
Infra/Server
​ Pay As You Go Pricing
​ Instant Availability


Deployment Models:
1] Public Cloud: Azure, GCP, AWS
2] Private Cloud: Used By A Single Org
3] Hybrid Cloud: Combination Of Private And Public --> Sensitive And PII
Data

Types Of Cloud Computing: In Terms Of What Is Managed By The Cloud Provider
And What Is Managed By You
1] Infrastructure As A Service (IaaS) --> Includes Network, Data Storage, VMs
2] Platform As A Service (PaaS) --> Provider Manages DB, Middleware, OS,
Deployment And Application Management
3] Software As A Service (SaaS): Complete Product

Pricing Of Cloud:
1] Compute: Pay For Compute Time
2] Storage: Pay For Data Stored
3] Networking: Data Transfer In Is Free,
Data Transfer Out Is Charged
======================================================================

AWS Session-2
1] ChatGPT: On-Prem Infra Vs Cloud Infra Differences
2] Correction To Pricing For Network Component:
ChatGPT: Is Data Transfer Into The Cloud Charged Under Storage Or Network?
3] AWS Console Overview:
4] Free Tier Includes:
5] AWS Regions:
    -Compliance With Data Governance And Legal Requirements
    -Proximity To Customers And Reduced Latency
    -Services Vary From Region To Region
    -Pricing Variation
6] AWS Availability Zones:
    --Maintain Copies Of Data To Avoid Simultaneous Impact Of Disasters
    --Each AZ Has 1 Or More Data Centers With Redundant Power, Network And
Connectivity
    --2 To 6 AZs Per Region
    --Connected With High-Bandwidth, Ultra-Low-Latency Network
======================================================================
AWS Session-3
First Service: Identity & Access Management (IAM)
1] Why Is Access Management Necessary?
2] Multi-Factor Authentication (MFA)
3] User Groups:
    --For Providing Collective Privileges By Attaching A Policy
    --Groups Only Contain Users And Not Other Groups
    --One User Can Belong To Multiple Groups

4] Users:
    --Created To Avoid Use Of The Root User
    --Can Give Limited Permissions To A User, Complying With The Least
Privilege Principle
    --Have Permanent Credentials--> 1]Password
                                    2]Access Keys

5]Roles
    --Grant Permission For One Service To Access The
Functionality Of Another Service
    --Temporary Access

6]Policies
    --JSON Document Which Contains Permissions To Be Granted To
Users, Groups Or Roles
    --Types:
        1]AWS Managed
        2]Custom Policy
    --Policy Inheritance And Inline Policies

7]Access Reports:
    --Credential Report: Account-Level Report
    --Lists All Users In The Account And The Status Of Their Various
Credentials

    --Access Advisor: User-Level Report
    --Shows Service Permissions Granted To A User And When Those
Services Were Last Accessed

Demo: Creating 2 Users And Providing Different Levels Of Access To An S3 Object
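A minimal sketch of what the demo's permissions could look like, assuming a hypothetical bucket "my-demo-bucket" and user "demo-user-1" (names not from the notes); it uses boto3 to create a read-only S3 policy and attach it to a user:

    import json
    import boto3

    iam = boto3.client("iam")

    # Read-only access to a single (hypothetical) bucket
    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-demo-bucket", "arn:aws:s3:::my-demo-bucket/*"],
        }],
    }

    policy = iam.create_policy(
        PolicyName="DemoS3ReadOnly",
        PolicyDocument=json.dumps(policy_doc),
    )
    iam.attach_user_policy(UserName="demo-user-1", PolicyArn=policy["Policy"]["Arn"])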

# 3 Major Ways To Access AWS:

1]AWS Management Console--> Protected By Password
2]AWS CLI--> Protected By Access Keys
3]AWS SDK--> Protected By Access Keys
4]Boto3: SDK For Python--> Automation
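As a small illustration of SDK access (a sketch, not from the notes): once access keys are configured, boto3 picks them up automatically and signs every API call with them.

    import boto3

    # Uses credentials from environment variables, ~/.aws/credentials, or an IAM role
    sts = boto3.client("sts")
    print(sts.get_caller_identity()["Account"])   # which AWS account the keys belong to

    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])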
======================================================================
AWS Session-4
EC2(Elastic Compute Cloud):
1]Basics Of Ec2:
--> What are Virtual Machines?
​ A Virtual Machine (VM) is a software-based emulation of a physical
computer. It runs an operating system (OS) and applications just like a real
computer but is hosted on a physical machine using a hypervisor.
​ ​ A virtual machine is an isolated environment that behaves like a
separate computer, but it is created and run on a physical host system using
virtualization software.

2] Components Of VMs:

| AMI (Amazon Machine Image)       --> A pre-configured template for the OS and applications for your instance
| Instance Type                    --> Defines the hardware of the host (CPU, RAM, networking) – e.g., t2.micro, m5.large
| EBS (Elastic Block Store)        --> Persistent storage volumes used as VM hard drives
| Instance Store                   --> Temporary disk storage that is lost on stop/terminate
| Security Group                   --> Acts as a virtual firewall to control traffic to/from the instance
| Key Pair                         --> Used for SSH access to Linux instances or RDP to Windows
| Elastic IP Address               --> A static IP address for public access to the instance
| VPC (Virtual Private Cloud)      --> Network in which the instance runs, with subnets, routes, gateways
| IAM Role                         --> Used to grant the VM temporary credentials to access other AWS services
| Elastic Load Balancer (Optional) --> Distributes traffic across multiple EC2 instances
| Auto Scaling Group (Optional)    --> Manages scaling up/down of VMs based on demand
| Placement Group (Optional)       --> Determines how instances are placed on the hardware (clustered, spread)

3) Components Of VMs In Terms Of EC2

In AWS, EC2 (Elastic Compute Cloud) is the service that provides Virtual
Machines (VMs). These EC2 instances (VMs) are built from several essential
components that define their compute, storage, networking, and security
behavior.
-Hypervisor
-Virtual Hardware--> vCPU, vRAM, vDisk, Networking
-OS
-Instance Type

2]Demo: Creating An EC2 Instance:

1]SSH
    ssh -i <path-of-key-pair> user@public-dns
    Ex: ssh -i "FirstKeypair.pem" ec2-user@<public-dns>
2]PuTTY--> Technical Guftagu
3]EC2 Instance Connect

EC2 Configuration Options
    1]OS/AMI
    2]Instance Type
    3]Key Pair
    4]Security Group/Firewall
    5]Configure Storage
    6]Networking--> VPC, Subnet

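A hedged sketch of the same configuration choices made from code (region, AMI ID, key pair name and security group ID below are placeholders, not values from the notes):

    import boto3

    ec2 = boto3.client("ec2", region_name="ap-south-1")

    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxxxxxxxxxxx",      # 1] OS/AMI (placeholder ID)
        InstanceType="t2.micro",              # 2] instance type
        KeyName="FirstKeypair",               # 3] key pair (placeholder name)
        SecurityGroupIds=["sg-xxxxxxxx"],     # 4] security group (placeholder)
        MinCount=1,
        MaxCount=1,
    )
    print(resp["Instances"][0]["InstanceId"])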
3]Instance Types:
1]General Purpose: T, M Series
    --Balanced Between Compute, Memory And Networking Resources
    --Web Servers, Code Repos Etc

2]Compute Optimized: C Series
    --For Compute-Intensive Tasks Which Require High Computation Power
    --Use Case:
        --Batch Processing Workloads
        --Scientific Modelling And ML

3]Memory Optimized: R, X, Z Series
    --For Workloads That Process Large Data In Memory
    --Use Case: Real-Time Streaming Data (Stock Market Data)

4]Storage Optimized: I, D, H Series
    --Disk Storage Which Provides Better Hardware Configuration,
    Leading To Better I/O Operations
    --Use Case: Data Warehousing Applications, RDBMS
======================================================================
AWS Session-5
4]Security Groups:
    -Virtual Firewall That Controls Inbound And Outbound Traffic For EC2
Instances
    -Specific Rules To Allow Or Deny Traffic Based On IP Address,
Protocol And Port Number
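A small sketch of adding such a rule with boto3 (the security group ID and CIDR range are placeholders, assuming SSH should be allowed only from one address range):

    import boto3

    ec2 = boto3.client("ec2")

    # Inbound rule: allow SSH (TCP port 22) from a single CIDR block
    ec2.authorize_security_group_ingress(
        GroupId="sg-xxxxxxxx",
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "office network"}],
        }],
    )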


5] Instance Purchase Options:
1] On-Demand Instances:
    --Pay-As-You-Go Pricing
    --Per-Hour Or Per-Second Rate
    --Acquire And Release Instances Based On Our Need
    Use Case:
    --Suitable For Unpredictable And Fluctuating Workloads
    --Dev And Test Env

2]Reserved Instances:
    --Reserved For A Period Of 1 To 3 Years
    --One-Time Upfront Cost But At A Discounted Hourly Rate
    Use Case:
    --For Consistent And Predictable Usage Patterns
    --For Prod Env

3]Spot Instances:
    --One Of The Cheapest Among All
    --Acquired Based On Bidding Value
    --AWS Can Retract If The Current Spot Price Exceeds Your Bid ==> Terminates The
Instance
    Use Case:
    --For Applications Which Can Tolerate Interruptions
    --Batch Processing

4]Dedicated Instances:
    --Instances That Run On Dedicated Hardware For A Single
Customer
    Use Case:
    --Compliance And Regulatory Requirements
    --Data Privacy And Residency
​ ​
​ ​
6] Elastic Block Storage (EBS):
    --Persistent Block Storage Which Can Store Data
    --Network Attached ==> Hence A Bit Of Latency
    --A Volume Is Attached To A Single EC2 Instance At A Time (Multi-Attach Is
Possible Only For Certain io1/io2 Volumes)
    --Allows You To Persist Data Even After Termination Of The Instance

# EBS Snapshot:
    --Point-In-Time Backups Of EBS
    --Can Copy A Snapshot From One AZ Or Region To Another

7] Amazon Machine Image (AMI)

    --Preconfigured Virtual Machine Image That Can Be Used To Launch An
Instance
    Built From: 1]EC2 Image Builder
                2]From An EC2 Instance
    --Contains Info Related To OS, Software Dependencies Or Application Server

8]Launch Templates:

9]Scalability:
    --Vertical Scalability:
    --Upgrade The Hardware Configuration To Achieve Increased Server Needs
    --Number Of Machines Remains The Same

    --Horizontal Scalability:
    --Number Of Instances Is Added Or Reduced To Address Evolving Server
Needs
    --No Change At The Level Of Particular Instances

10] Elastic Load Balancers (ELB):

    --Server That Can Forward Traffic To Different EC2 Instances Downstream
    --Single Point Of Access To Your Application (By Creating A Common DNS At
The ELB Level)
    --High Availability
    --3 Types: Application Load Balancer
               Network Load Balancer
               Gateway Load Balancer

11]Auto Scaling Groups (ASG):

    --For Scaling Based On Changing Workload Needs
    Scale Out--> Increasing Instances To Address Increased Demand
    Scale In--> Decreasing Instances To Address Decreased Demand

======================================================================
AWS Session-6
1]Performing Operations In EC2 Instances:
--Package Managers: apt, yum
    sudo yum update
    mkdir demo
    cd demo
    touch demo_file.txt
    ls
    pwd
    nano <file-name>
    sudo yum install <package-name>
    scp (copy files to/from the instance)
2]Configuration Of AWS CLI:
    pip install awscli
    aws configure

======================================================================

AWS Session-7
Third Service: Simple Storage Service (S3)
1]What Is Block Storage And Object Storage?
| Definition     --> Block: stores data in fixed-size blocks (like a hard disk) | Object: stores data as objects (data + metadata + ID)
| Access         --> Block: acts like a physical disk; accessed via the OS      | Object: accessed via API or web interface
| Use Case       --> Block: databases, OS disks, boot volumes                   | Object: backups, media files, logs, static assets
| Performance    --> Block: low-latency, high-performance (good for IOPS)       | Object: scalable and cost-effective, but higher latency
| Example In AWS --> Block: Amazon EBS (Elastic Block Store)                    | Object: Amazon S3 (Simple Storage Service)

2]What Is Amazon S3:

--Cloud-Based Object Storage Provided By AWS
--Secure, Durable, Highly Scalable Object Storage
--AWS Claims It Is 'Infinitely Scaling Storage'
--Used For Storing Images, Videos, Word Files, PDF Files Etc
--Consists Of Buckets And Objects

3]Objects And Buckets:

--> Buckets:
    --Globally Unique Name
    --Defined At Region Level
    Naming Convention: --No Uppercase And No Underscores
                       -->3-63 Characters Long
                       -->Must Start With A Lowercase Letter Or Number
                       --Avoid Use Of Special Characters Except '-' And '.'

# Objects:
    --Consists Of Actual Data, Metadata And A Unique Identifier
    --Has An Object Key
    --Max Object Size: 5 TB (Single PUT Limit: 5 GB)
    -'Multi-Part Upload' Recommended For Files > 100 MB And Required For Files > 5 GB

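A quick sketch of putting objects into a bucket with boto3 (bucket and key names are hypothetical); upload_file automatically switches to multi-part upload for large files:

    import boto3

    s3 = boto3.client("s3")

    # Small object: single PUT
    s3.put_object(Bucket="my-demo-bucket", Key="notes/hello.txt", Body=b"hello s3")

    # Large file: boto3's transfer manager splits it into a multi-part upload as needed
    s3.upload_file(Filename="big_video.mp4", Bucket="my-demo-bucket", Key="videos/big_video.mp4")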
Key Features Of Amazon S3:
Scalability
Durability
Data Encryption
Versioning
Access Controls
Data Lifecycle Policies
Data Replication

Durability: How Often We Lose A File

--> More Durability => Less Likelihood Of Loss
    -99.999999999% => 11 9's
    =>Same For All Storage Classes

2] S3 Versioning:
--Best Practice To Version Your Buckets To Protect Them From Unintended
Changes => Prevent Data Loss By Reverting To A Previous Version
-Easy Rollback To A Previous Version
-Suspending Versioning Will Not Delete Previously Stored Versions
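Enabling versioning from code is a one-liner with boto3 (bucket name hypothetical):

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket="my-demo-bucket",
        VersioningConfiguration={"Status": "Enabled"},   # "Suspended" stops new versions but keeps old ones
    )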

3]S3 Access Logs:

4]S3 Replication
    CRR--> Cross-Region Replication
    SRR--> Same-Region Replication

5]S3 Storage Classes: 6 Types

Parameters To Consider: Availability, Durability, Speed Of Retrieval.
1] S3 Standard - General Purpose
    --99.99% Availability
    --Used For Frequently Accessed Data
    --Low Latency And High Throughput
    --Sustains 2 Concurrent Facility Failures

2]S3 Standard - Infrequent Access (IA)
    --Suitable For Less Frequently Accessed Data
    --But Requires Rapid Access To Data Whenever It Is Needed
    --Lower Storage Cost Than S3 Standard ==> But It Has A Retrieval Fee

3]S3 Intelligent-Tiering
    --3 Access Tiers: 1] Frequent Access Tier
                      2] Infrequent Access Tier
                      3] Rarely Accessed (Archive Instant Access) Tier
    --Cost Optimized By Automatically Moving Objects Between
Access Tiers
    --Same Latency And Throughput As S3 Standard

4]S3 One Zone - Infrequent Access:
    -Same As IA But Data Is Stored In A Single AZ
    -99.5% Availability
    --Use Case: As A Secondary Backup For Data In On-Prem Servers, Storing
Data Which You Can Recreate

5]Amazon Glacier:
    --Low-Cost Object Storage For Archival And Backup
    -Data Retained For The Long Term
    -3 Retrieval Options: Expedited, Standard, Bulk

6]Glacier Deep Archive ==> Cheapest Of All
    --2 Retrieval Options: Standard, Bulk

6) S3 Security:
1] S3 Object Lock:
    --Bucket Versioning Should Be Enabled
2] Encryption In Transit:
    -Secure Sockets Layer (SSL)
    -Transport Layer Security (TLS)
3]Bucket Policies: Restricting Access To A Bucket Using A Policy

4] Access Control List (ACL):

Bucket ACL
Object ACL

8]S3 Select:

9)Lifecycle Rules:
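A sketch of what a lifecycle rule could look like when set with boto3 (bucket name, prefix and day thresholds are illustrative assumptions, not values from the notes):

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-demo-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},   # move to Infrequent Access after 30 days
                    {"Days": 90, "StorageClass": "GLACIER"},       # archive after 90 days
                ],
                "Expiration": {"Days": 365},                       # delete after a year
            }]
        },
    )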

10]AWS Snow Family:--> Physical Way Of Data Migration

The Snow Family helps move large amounts of data into and out of AWS
when network transfer is too slow, expensive, or impractical. It also enables
edge computing in disconnected or harsh environments.

# Snowcone:
Use Case           --> Lightweight edge computing, data collection, or small-scale data transfer
Storage Capacity   --> 8 TB usable storage
Size & Portability --> Small, rugged, portable – can be carried in a backpack
Connectivity       --> USB-C, Wi-Fi, Ethernet
Edge Computing     --> Yes (supports EC2 & AWS IoT Greengrass)
Ideal For          --> Remote locations, drones, small field devices
Data Transfer      --> Shipped to AWS or transferred over the network

# Snowball:
Use Case           --> Medium to large data migration (10 TB to petabyte scale)
Storage Capacity   --> 80 TB (Snowball Edge Storage Optimized), 42 TB (Compute Optimized)
Type               --> Rugged physical device with tamper-proof protection
Edge Computing     --> Yes (EC2, Lambda functions, IoT Greengrass)
Data Encryption    --> 256-bit encryption
Transfer Speed     --> Fast local transfer and secure shipment to AWS
Variants           --> Snowball Edge Storage Optimized, Snowball Edge Compute Optimized
Ideal For          --> Data center migration, analytics in disconnected locations, military, research fieldwork

# Snowmobile:
Use Case           --> Massive data migration (exabyte scale)
Storage Capacity   --> Up to 100 petabytes per Snowmobile
Form Factor        --> A semi-truck container delivered to your data center
Security           --> 24/7 surveillance, GPS tracking, encryption
Data Transfer      --> Physical shipment to AWS data centers
Ideal For          --> Moving entire data centers, massive archives, media libraries
Deployment Time    --> Longer setup time, but practical for massive volumes

=>Also Used For Edge Computing

======================================================================
AWS Session-8
Fourth Service: AWS Athena
1]What Is Athena?
    --Serverless Query Service For Analyzing Data In Amazon S3 Using
Standard SQL

2]What Is Serverless:
    --Managed Service By AWS Where AWS Takes Care Of Cluster
Resources

Main Features Of AWS Athena:
    --Serverless Architecture
    --Integration With S3
    --Standard SQL Support
    -PySpark And Spark SQL Support As Well
    -Pay-Per-Query Pricing
    -Compatibility With Various Data Formats (JSON, Parquet, CSV, Etc)

3]Query Options: 1]SQL
                 2]PySpark And Spark SQL
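A minimal sketch of firing an Athena query from Python with boto3 (database, table and result bucket are hypothetical names):

    import boto3

    athena = boto3.client("athena")

    run = athena.start_query_execution(
        QueryString="SELECT * FROM demo_db.traffic_data LIMIT 10",
        QueryExecutionContext={"Database": "demo_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )

    # Athena is asynchronous: poll the execution, then fetch results
    state = athena.get_query_execution(QueryExecutionId=run["QueryExecutionId"])
    print(state["QueryExecution"]["Status"]["State"])   # QUEUED / RUNNING / SUCCEEDED / FAILED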

4]Types Of Tables In A Cluster Ecosystem:

HDFS And Hive Metastore
1]Internal Or Managed Table:
    --Full Ownership Of The Table, Including Data And Metadata, Is With The Hive Engine
    -When You Drop An Internal Table, Both Structure And Data Are
Erased/Deleted

2]External Table:
    --Ownership Of Metadata Is With The Hive Engine, But Not Of The Data Stored In HDFS
    --When You Drop The Table, The Table Schema Gets Deleted But The Data Remains In
HDFS

Location, TblProperties:
Partitioning And Bucketing

======================================================================
AWS Session-9
ETL Basics: (Extract, Transform, Load)
    -ETL Is A Data Integration Process Used To Collect, Transform, And Load
Data

Components Of ETL:
A)Extract
    -Data Extraction: Retrieving Data From Source Systems, Which Can
Include Databases, Flat Files, APIs, Or Other Repositories
    ==> Salesforce, Data Lake, Data Warehouse
    -Change Data Capture (CDC): Identifying And Capturing Only The
Changed Data Since The Last Extraction To Optimize Efficiency

B)Transform
    -Data Cleaning: Removing Or Correcting Errors,
Inconsistencies, Or Inaccuracies In The Source Data

    -Data Transformation: Restructuring And Converting Data Into A
Format Suitable For The Target System

    -Data Enrichment: Enhancing Data By Adding Additional
Information Or Attributes

C)Load:
    -Data Staging: Temporary Storage Of Transformed Data Before Loading It
Into The Target System

    -Data Loading: Inserting Or Updating Data Into The Destination
Database Or Data Warehouse

    -Error Handling: Managing And Logging Errors That May Occur
During The Loading Process
​ ​
ETL Process Flow:
A) Extraction Phase
    -Connect To Source Systems: Establishing Connections To Source
Databases, APIs, Or Files
    -JDBC/ODBC

    -Data Selection: Defining Criteria For Selecting Data To Be Extracted

B) Transformation Phase
    -Data Mapping: Creating A Mapping Between Source And Target Data
Structures

    -Data Cleansing: Identifying And Correcting Data Quality Issues

    -Data Validation: Ensuring Transformed Data Meets Specified Quality
Standards
    -ABC Validation Framework (Audit-Balance-Control)

    -Aggregation: Combining And Summarizing Data For Reporting Purposes

C) Loading Phase
    -Data Staging: Moving Transformed Data To A Staging Area For Further
Processing

    -Bulk Loading: Efficiently Inserting Large Volumes Of Data Into The
Target System

    -Indexing: Creating Indexes To Optimize Data Retrieval In The Target
Database

    -Post-Load Verification: Confirming That Data Has Been Loaded
Successfully

Popular ETL Tools
    Apache NiFi
    Talend
    Informatica
    Microsoft SSIS (SQL Server Integration Services)
    Apache Spark
    Cloud-Based Services--> AWS, AMS, EMR, GCP Dataproc

Other Data Processing Architectural Styles:

1) ELT ==> Modern Data Warehouses
2) EtLT ==> Small Transform Using Spark (t) => Load => Transform
======================================================================
AWS Session-10
Fifth Service: AWS Glue
-Serverless Data Integration And ETL Service - Makes Discovering, Preparing
And Combining Data For Data Analysis, ML And Application Development
Simple

Features:
1]Managed ETL Service

2]Data Catalog:
    -Centralized Metadata Repository For Storing Metadata About Data
Sources, Transformations, And Targets
    -Metadata Includes Information About Structure, Format And Location
Of Datasets, As Well As Details Of Transformations And Schema Definitions

    -Database: Logical
    -Tables: Logical

3]ETL Jobs:
    -Automate The Process Of Extracting, Transforming, And Loading Data
From Various Sources To Destinations

4)Crawlers:
    -Automatically Discover The Schema Of Your Data Stored In Various
Sources And Create Metadata Tables In The Data Catalog.

5)Triggers:
    -Scheduling Of ETL Jobs Based On Time Or Events

6)Connections:
    -Facilitate The Creation, Testing, And Management Of Database Connections
Used In ETL Processes

7)Workflows:
    -Orchestration And Automation Of ETL Workflows By Defining
Sequences Of Jobs, Triggers, And Crawlers

8)Serverless Architecture:
    -Eliminating The Need For Users To Provision Or Manage Underlying
Infrastructure

9) Scalability And Reliability
    -Fault Tolerance With A Number Of Retries
    -Data Durability

10) Development Endpoints:

    -For Interactive Development, Debugging, And Testing Of ETL Scripts
Using Tools Like Zeppelin Or Jupyter Notebooks
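A small sketch of driving these pieces from boto3 (crawler and job names are hypothetical placeholders):

    import boto3

    glue = boto3.client("glue")

    # Crawl a dataset so its schema lands in the Data Catalog
    glue.start_crawler(Name="customer-db-crawler")

    # Kick off an ETL job and check on its run
    run = glue.start_job_run(JobName="traffic-transform-job")
    status = glue.get_job_run(JobName="traffic-transform-job", RunId=run["JobRunId"])
    print(status["JobRun"]["JobRunState"])   # STARTING / RUNNING / SUCCEEDED / FAILED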

======================================================================
AWS Session-11
A) Glue Crawler:
-Data Sources: S3, JDBC, DynamoDB, Redshift, RDS, MongoDB
-Provide The S3 Path Of The Sub-Folder Whose Files Are To Be Crawled
-The Name Of The Table Created Is The Same As The Sub-Folder Name

-Subsequent Crawler Runs:

    -Crawl All Sub-Folders
    -Crawl New Sub-Folders Only
    -Crawl Based On Events
    -Exclude Files Matching A Pattern

-Custom Classifiers:
    -Provide The Structure Or Schema Definition Of The File To Be Crawled; Generally
Used For File Formats And Definitions Not Supported By The Built-In Classifiers Of
Glue

-Classifier Types And Properties: Grok, XML, JSON, CSV

    -CSV SerDe: OpenCSV And LazySimple
    -Column Delimiter
    -Column Headings
-Can Provide Multiple Custom Classifiers--> The Crawler Refers To All And Comes
Up With The Best Option (Certainty Value)
-IAM Role: To Give The Glue Crawler Access To Other Services Like S3
-Target Database:
-Table Name Prefix:
-Maximum Table Threshold
-Schedule: On-Demand, Time-Based Options, Custom Schedule (Cron
Expression)
======================================================================

AWS Session-12
Demo 1 - Crawling Customer DB (With Custom Classifier And Partition)
Demo 2 - Crawling Traffic Dataset And Performing A SQL Transform Using An ETL
Job In Glue

B)Glue ETL Job:

Multiple Ways To Write Glue Jobs:
    1]Visual
    2]Blank Canvas
    3]Spark Script Editor
    4]Python Shell Editor
    5]Jupyter Notebooks
    6]Ray Script Editor
#Visual:
    -3 Components:
    1]Source: Data Catalog, S3, DynamoDB, Redshift
    2]Transforms: SQL Query, Change Schema, Drop
Duplicates, Joins, Filter, Flattening, Aggregate Etc
    3]Target: Data Catalog, S3, DynamoDB, Redshift
    -A Transform Can Have Multiple Data Sources (Multiple Node Parents)
# Spark Script Editor:
    1]GlueContext: Wrapper For Glue Functionality

    2]DynamicFrame: Equivalent Of A Spark DataFrame In Glue, To Avail Glue
Functionalities
    -Functions: create_dynamic_frame / write_dynamic_frame

    3]Converting A DynamicFrame To A Spark DF:
    spark_df = dynamic_frame.toDF()
    Converting A Spark DF To A DynamicFrame:
    dynamic_frame = DynamicFrame.fromDF(df2, glueContext, "dynamic_frame")
    4]Reading From Different Sources:
    -From The Glue Data Catalog:
    create_dynamic_frame.from_catalog(database, table_name, transformation_ctx)
    -From Other Sources:
Points To Remember:
1]Need To Write Transformation Code Between job.init() And job.commit()
2]
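Putting those pieces together, a minimal Glue Spark-script sketch might look like this (database, table, column name and output path are hypothetical; the boilerplate around Job/GlueContext follows the usual generated template):

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Transformation code goes between job.init() and job.commit()
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="demo_db", table_name="traffic_data", transformation_ctx="src")

    df = dyf.toDF()                                              # DynamicFrame -> Spark DataFrame
    df = df.filter(df["vehicle_count"] > 0)                      # hypothetical column
    dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")    # back to DynamicFrame

    glueContext.write_dynamic_frame.from_options(
        frame=dyf_out,
        connection_type="s3",
        connection_options={"path": "s3://my-demo-bucket/traffic_clean/"},
        format="parquet")

    job.commit()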

C)Glue Triggers:
-Trigger Types:
    -Scheduled
    -Event-Driven: The Watched Event Is Either An ETL Job Or A Crawler
    -Conditional Logic: If There Are Multiple Watched Resources For A
Trigger, Then ALL Or ANY Applies
        -ALL: Only If All Watched Resources Achieve The Desired State Does The
Further Node Trigger
        -ANY: If Any One Of The Watched Resources Achieves The Desired State,
The Further Node Triggers

    -On-Demand: Manual Run
    -EventBridge Event
    -Watched Resources:
    The Trigger Monitors These Resources To Determine Whether To Initiate The
Target Node
    -Can Add Multiple Watched Resources
    -Target Resources:
    -Once The Status Of The Watched Resources Matches The
Condition Provided, The Target Resources Get Initiated
D) Glue Workflows:
-2 Components:
    1]Triggers
    2]Nodes: Jobs Or Crawlers

-Workflow Blueprint
======================================================================
AWS Session-13
Sixth Service: Elastic MapReduce (EMR)
1)EMR Architecture And Provisioning:
A]Application Bundle:
-Preconfigured Application Collection Provided By EMR To Be Launched While
Bootstrapping
-Ex: Spark, Core Hadoop, Flink, Custom

B]AWS Glue Data Catalog Settings:

-For Using The Glue Catalog As An External Metastore

C]Operating System Options:

-Configuring The AMI
D]Cluster Configuration:
1]Instance Groups
    -One Instance Type For Each Node Group (Like Primary, Core Etc)

2]Instance Fleets:
    -Multiple Instance Types Can Be Chosen For Each Node Group
    -A Maximum Of 5 Instance Types Can Be Configured
E]Cluster Scaling And Provisioning Options:

1]Set Cluster Size Manually:
    -Giving The Number Of Nodes Manually, No Auto Scaling

2]EMR Managed Scaling:

    -Providing Minimum And Maximum Number Of Nodes
    -EMR Takes Care Of Scaling Based On Its Algorithm

3]Custom Automatic Scaling:
    -Providing Minimum And Maximum Number Of Nodes
    -Also Providing Scaling Rules (Scale-Out Rules And Scale-In Rules)

F) Steps:
    -Jobs To Be Submitted To The EMR Cluster

G) Cluster Termination:
    1) Manual Termination
    2) Automatic Termination After Idle Time (Recommended)
       -Can Mention Idle Time In HH:MM:SS Format

# Termination Protection:
    -If Enabled, It Protects From Accidental Termination
    -To Terminate, We Need To Disable This Option First
    -If Termination Protection Is Enabled, It Overrides
Other Termination Attempts, Including Automatic Termination Due To Idle Time

H) Bootstrap Options:
    -To Install Software Dependencies During Cluster Provisioning

I) Cluster Logs:
    -Can Configure An S3 Location To Store Logs Of Cluster Activity

J) Identity And Access Management (IAM) Roles:
1) Amazon EMR Service Role:
    -To Provision Resources And Perform Service-Level Actions With Other
AWS Services

2) EC2 Instance Profile For Amazon EMR:
    -To Provide Access To Other AWS Services To Each EC2 Node Of The
EMR Cluster

3) Custom Automatic Scaling Role:
    -Role For Automatic Scaling

K)EBS Root Volume: To Provide Additional Storage
----------------------------------------------------------------------

AWS Session-14
1) Ways To Submit An Application To The Cluster:
1) AWS Management Console:
    -EMR Steps
    -Notebooks
2)AWS CLI
3)AWS SDK--> Boto3 For Python
Command-Runner.Jar:
-A Tool Or Utility That Facilitates The Execution Of Custom Commands And
Scripts During The Setup Of An EMR Cluster
-Helps Users Automate Certain Tasks And Configurations, Providing A
Convenient Way To Extend The Functionality Of An EMR Cluster
-A Component Within The Broader EMR Ecosystem, Aiding In The Execution Of
Custom Commands And Scripts As Part Of The Cluster Initialization Process
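For reference, a hedged sketch of submitting a step through command-runner.jar with boto3 (the cluster ID and script path are placeholders):

    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster ID
        Steps=[{
            "Name": "spark-etl-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-demo-bucket/scripts/etl_job.py"],
            },
        }],
    )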

#EMR Serverless:
-Simplifies The Process Of Running Big Data Workloads By Abstracting
Away The Complexity Of Managing And Provisioning Infrastructure
-Allows Users To Focus On Data Processing And Analytical Tasks Without
The Need To Worry About The Underlying Server Management

# EMR CLI:
    1)aws emr create-cluster
    2)aws emr terminate-clusters
    3)aws emr add-steps (Add A Step To The Cluster)
----------------------------------------------------------------------
AWS Session-15
BOTO3:
-AWS Software Development Kit(SDK) For Python
-Using Boto3 Scripting We Can Manage And Utilize AWS Resources For Data
Processing
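As a tiny illustration (the instance ID is a placeholder): a Boto3 script can start and stop compute resources on demand, e.g. an EC2 instance used for data processing.

    import boto3

    ec2 = boto3.client("ec2")
    instance_ids = ["i-0123456789abcdef0"]     # placeholder instance ID

    ec2.start_instances(InstanceIds=instance_ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)   # block until the instance is up

    # ... run your processing ...

    ec2.stop_instances(InstanceIds=instance_ids)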

What Is A Namespace (Redshift Serverless):
-A Namespace Is A Collection Of Database Objects And Users
-Storage-Related
-Groups Together Schemas, Tables, Users, Or AWS Key Management Service
Keys For Encrypting Data
----------------------------------------------------------------------

AWS Session-16
Seventh Service: Amazon Redshift
1) What Is Amazon Redshift And How Does It Work:
-Petabyte-Scale Data Warehouse Service
-It Is Used To Store And Analyze Large Amounts Of Data (Historic Data)
-Redshift Uses:
    -Parallel Query Execution To Provide Fast Query Performance

2)Architecture Of Redshift And Its Components:

# Cluster Nodes:
-Leader Node
-Compute Nodes:
    1]RA3
    2]Dense Compute 2 (DC2)
    3]Dense Storage 2 (DS2)
    Node Slices

#Redshift Managed Storage:

-Columnar Storage Modelling
-Advanced Compression Techniques

3)Performance Features:
1) Massively Parallel Processing == Parallel Query Engine
2)Columnar Data Storage
3)Data Compression
4)Query Optimizer
5)Result Caching:
    -Query Result Caching
    -SVL_QLOG--> The Source_Query Column Indicates Whether A Cached
Result Was Reused (Empty When The Cache Was Not Used)
    --Search About System Tables In Redshift
​ ​ ​

4) Data Security And Protection:

# Data Security:
-SSL (Secure Sockets Layer) Encryption
-TLS (Transport Layer Security) Encryption
#Data Protection:
Data Encryption:
    1)Server-Side Encryption
    2)Client-Side Encryption
Encryption At Rest: AES-256 Encryption
Encryption In Transit: SSL/TLS Encryption

What Is A Workgroup--> Container
-Collection Of Compute Resources From Which An Endpoint Is Created
-Compute-Related
-Groups Together Compute Resources Like RPUs, VPC Subnet
Groups, Security Groups

What Is A Namespace:
-A Namespace Is A Collection Of Database Objects And Users
-Groups Together Schemas, Tables, Users, Or AWS Key Management Service
Keys For Encrypting Data

==>When Using Redshift Serverless, We Need To Provision Workgroups For
Availing Compute Resources And A Namespace For Availing Storage
Resources

Evolution Of Data Processing Frameworks:

1) ETL:
    Transform--> Spark (Traditional DW)
2)ELT:
    Load Into DW
    Transform In Modern DW
3)EtLT:
    t--> Small Transform On Spark: Schema Conversion, Column Truncation; Then Load
Into The Warehouse
    -Transform In Modern DW: Aggregation-Related
    -UPSERT Operation--> Update And Insert

--> High Cardinality And Low Cardinality

Example: 100 Records With Only 10 Unique Values In A Column (e.g., A Product Column Of ~100 MB)

Columnar Storage--> Stores The 10 Distinct Values Once--> Stores A Respective Reference For Each Record

--> Helps In Storage Cost Optimization, But Also In Query Performance

======================================================================
AWS Session-17
Query Result Caching (Result Set Caching): Involves Storing The Actual Result
Sets Of Executed Queries For Later Retrieval

Query Caching:
-Involves Caching The Execution Plan Or Metadata Associated With A
Query, Rather Than The Actual Data
-Helps Save On Planning And Optimization Time

RPU: 2 Virtual CPUs And 16 GB Of RAM

----------------------------------------------------------------------

AWS Session-18

Ways To Load Data Into Redshift:

1) COPY Command:
    -Internal/Managed Table Created
    -Data Is Actually Moved To This Table (Redshift Storage Consumed)

2)Redshift Spectrum:
    -External Table Created
    -No Actual Movement Of Data (Data Resides In S3 Itself)
1)COPY Command:
    COPY table_name [column_list]
    FROM data_source
    [options]
Explanation:
-table_name: Name Of The Target Table Where You Want To Load Data

-column_list: (Optional)
    -A Comma-Separated List Of Columns In The Target Table
    -If Not Specified, Redshift Assumes That The Column Order In The Source
File Matches The Order Of Columns In The Target Table.
-data_source:
    -Specifies The Source Of The Data You Want To Copy
    -This Can Be An Amazon S3 Bucket, An Amazon EMR Cluster, A Data File On
Your File System, Or A Remote Host Using SSH

-options: Additional Configuration Options For The COPY Command

    -Specifying The File Format, Delimiter, Data Encoding, Error Handling Etc

Common Options:

1)FORMAT format_type: Specifies The File Format Of The Source
Data. Supported Formats Include CSV, JSON, AVRO, PARQUET, ORC, And
More

2)DELIMITER 'delimiter': Specifies The Field Delimiter Used In The Source
Data.

3)IGNOREHEADER: Specifies The Number Of Header Lines To Skip In The
Source Data.

4)FILLRECORD: Adds Null Columns To Match The Target Column Count If
The Source Data Is Missing Columns

5)ENCRYPTED: Indicates That The Source Data Is Encrypted

6)MAXERROR: Sets The Maximum Number Of Allowed Data Load Errors
Before The Copy Operation Fails

7)CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-access-key>':
Specifies AWS Access Credentials When Loading Data From Amazon S3

8)COMPUPDATE ON|OFF: Specifies Whether To Recalculate Table
Statistics After The Copy Operation

9)GZIP: Indicates That The Source Data Is In GZIP-Compressed Format

10)TRUNCATECOLUMNS: Truncates Data That Exceeds The Column Length
In The Target Table

11)REMOVEQUOTES: Removes Surrounding Quotation Marks From Data
Fields
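Tying the pieces together, a hedged sketch of issuing a COPY from Python via the Redshift Data API (the workgroup, database, table, S3 path and IAM role ARN are all placeholders, not values from the notes):

    import boto3

    rsd = boto3.client("redshift-data")

    copy_sql = """
        COPY sales (sale_id, product, amount)
        FROM 's3://my-demo-bucket/sales/2024/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        FORMAT AS CSV
        DELIMITER ','
        IGNOREHEADER 1
        MAXERROR 10;
    """

    resp = rsd.execute_statement(
        WorkgroupName="default-workgroup",   # Redshift Serverless workgroup (placeholder)
        Database="dev",
        Sql=copy_sql,
    )
    print(rsd.describe_statement(Id=resp["Id"])["Status"])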
----------------------------------------------------------------------
AWS Session-19
Redshift Spectrum:
-A Feature Of Amazon Redshift That Enables You To Run SQL Queries
Directly Against Data Stored In Your Amazon S3 Buckets
-Extends The Data Warehousing Capabilities Of Redshift By Allowing You
To Analyze And Join Data From Multiple Sources, Both Within Your Redshift
Data Warehouse And External Data Stored In Amazon S3

# Features:
1) Data In Amazon S3
2) External Tables
    -Not Stored In Your Redshift Cluster, But Act As Metadata For
Querying The Data In S3

3)Querying
    -Can Run SQL Queries That Join Your Internal Redshift Tables With Your
External S3 Data

4)Performance:
    -Optimizes Query Performance By Pushing Down Filters To The S3
Data, Minimizing Data Movement

# Benefits Of Using Redshift Spectrum:

1) Cost-Efficient
2) Scalability
3)Data Integration: Allows You To Integrate And Analyze Data From Various
Sources (Without The Need To Move Or Replicate The Data)
4)Separation Of Storage And Compute

# Snapshots And Backups

1)Automated Snapshots
2)Manual Snapshots

#Performance Optimization And Tuning:

1)Automatic Compression
2)Query Caching
3)Parallel Query Execution
4)Data Distribution Styles

1)Auto

    Create Table Your_Table(
        Column1 Int,
        Column2 Varchar(50))
    Diststyle Auto;
2)Even:
    -Appropriate When A Table Doesn't Participate In Joins

    Create Table Your_Table(
        Column1 Int,
        Column2 Varchar(50))
    Diststyle Even;

3)All
    Create Table Your_Table(
        Column1 Int,
        Column2 Varchar(50)
    )
    Diststyle All;

4)Key

    Create Table Your_Table(
        Column1 Int,
        Column2 Varchar(50),
        Distribution_Key_Column Int Distkey
    );

##System Table To Explore: RELEFFECTIVEDISTSTYLE Column In
PG_CLASS_INFO

# RDS:
-Managed Relational Database Service Provided By Amazon Web
Services
-Simplifies Setup, Operation, And Scaling Of Relational Databases
Without The Administrative Overhead Of Managing A Database Server

# Key Features:
1)Supports Popular Database
Engines: MySQL, PostgreSQL, Oracle, SQL Server, MariaDB, And Amazon
Aurora

2)Managed Service: RDS Takes Care Of Routine Database Tasks Such As
Hardware Provisioning, Database Setup, Patching, Backups And Scaling

3)Scalability

4)High Availability

5)Security:
    --Encryption At Rest And In Transit
    --Virtual Private Cloud (VPC) Integration
    --Database Authentication Options
6)Backup And Restore: Through Backups And Snapshots

7) DB Instance Types: Offers Various Instance Types Optimized For Different
Workloads, Providing A Good Balance Of CPU, Memory And Storage

8) Maintenance

9) Cost-Effective: You Pay For The Resources You Consume

DB Instance Classes:

1)Standard Classes (Includes M Classes)

2)Memory Optimized Classes (Includes R And X Classes)
3)Burstable Classes (Includes T Classes)
    -For Burstable Workloads--> Where CPU Usage Spikes Are Followed By
Periods Of Inactivity
    -Instances Accrue/Accumulate CPU Credits During Idle Periods, Which Can
Be Spent During Bursts Of Activity

--> Making RDS Connections In Glue:

1)Create An RDS Instance

2)Create A Glue Connection--> With Connection Type As Amazon RDS
3)Test The Connection--> You Will Get Some Error
4)Create A Routing Network:
    1)Create A Route Table--> Associate One Of The Subnets With The Route Table
    -->PS: Use The Same Subnet In The NAT Gateway Also
    2)Create A NAT Gateway
    3)Create An Endpoint--> Link The Route Table With The Endpoint
5)Sort Keys:
1)Compound:
    -Composed Of One Or More Columns
    -Data Is Initially Sorted Based On The First Column In The Sort Key, And
Then Within Each Of Those Groups It Is
Further Sorted Based On The Second Column, And So On
    -For Known Access Patterns That Frequently Filter, Join Or Aggregate Data
Based On Multiple Columns In
A Predictable Order

    -DDL:
    Create Table Your_Table(
        Column1 Int,
        Column2 Varchar(50),
        Column3 Date)
    Compound Sortkey(Column1, Column3);
    Ex: Date = 30/10

2)Interleaved:
    -Also Composed Of One Or More Columns
    -Doesn't Prioritize One Column Over The Others ==> It Interleaves
The Data Across All Columns In The Sort Key Evenly
    -Can Help Improve Query Performance For Tables With Unpredictable Query
Patterns (Varying Filtering And Grouping Conditions)

    -DDL:
    Create Table Your_Table(
        Column1 Int,
        Column2 Varchar(50),
        Column3 Date)
    Interleaved Sortkey(Column1, Column2, Column3);

6) Redshift Workload Management (WLM)

-WLM Helps You Manage And Prioritize Queries In Your Redshift Cluster
-Ensures That Different Workloads And Queries Can Coexist And Perform
Efficiently In A Multi-User Environment
-Enables You To Allocate Resources, Control Concurrency, And Manage Query
Performance By Defining Query Queues And Assigning Query Groups

# Explore: Query Queues, Query Prioritization, Concurrency Etc

7)Vacuum And Analyze Commands:

1)Vacuum Command
----------------------------------------------------------------------
AWS Session-20
AWS Lambda:
-Function As A Service (FaaS)
-Supports Multiple Programming Languages Like Python, Java, C# Etc
-Serverless Computing Service That Allows You To Run Code Without
Provisioning Or Managing Servers

## Features
1)Real-Time Data Processing:
    -Can Be Triggered In Near Real-Time In Response To Events Or Data
Streams
    -For Example: You Can Use It To Process Data From Sources Like Amazon
Kinesis (For Real-Time Data Streams)
2)Scalability:
    -Lambda Functions Can Scale Up Or Down To Handle Varying Workloads
3)Event-Driven Data Pipelines:
    -Can Chain Lambda Functions Together To Create Complex Data
Processing Workflows
4)Integration With Other AWS Services
5)Cost-Efficiency:
    -Charges Based On The Compute Time The Lambda Function Consumes
    -Only Pay For The Processing Time Required For Each Data Event
6)Scheduled Data Processing: Batch Data Processing
    -Can Schedule Lambda Functions To Run At Specific Intervals

Demo - Lambda As An Event Trigger--> Triggers A Glue Job When It Senses An S3 Event (See The Sketch Below)
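A hedged sketch of what that demo's Lambda handler could look like (the Glue job name is a placeholder; the event shape is the standard S3 notification structure):

    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # S3 event notification: one record per uploaded object
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            # Pass the new file's location to the Glue job as job arguments
            glue.start_job_run(
                JobName="traffic-transform-job",      # placeholder job name
                Arguments={"--input_path": f"s3://{bucket}/{key}"},
            )
        return {"status": "ok"}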

----------------------------------------------------------------------
AWS Session-21
--> Airflow

Apache Airflow Is Popularly Used To Manage The Automation Of
Tasks And Their Workflows
Primarily Used For Scheduling Various Tasks
Airflow Allows You To Easily Resolve The Issue Of Automating
Time-Consuming And Repeating Tasks
Workflows Are Primarily Written In Python (Often Wrapping SQL Tasks)

What Is A DAG:
A Series Of Tasks That You Want To Run As Part Of Your Workflow
Airflow Is A Workflow Engine, Which Means:
-Manages Scheduling And Running Jobs And Data Pipelines
-Ensures Jobs Are Ordered Correctly Based On Dependencies
-Manages The Allocation Of Scarce Resources
-Provides Mechanisms For Tracking The State Of Jobs And Recovering From
Failures

Basic Airflow Concepts:

DAG Run:
An Individual Execution/Run Of A DAG
Task:
A Defined Unit Of Work (These Are Described By Operators In Airflow)

Task Instance:

DAG: Order Of Execution
Directed Acyclic Graph: A Set Of Tasks With Explicit Execution
Order, Beginning, And End Point
The Vertices And Edges (The Arrows Linking The Nodes) Have An Order And
Direction Associated With Them
Each Node In A DAG Corresponds To A Task, Which In Turn Represents
Some Sort Of Data Processing

Dependencies:
Each Of The Edges Has A Particular Direction That Shows The Relationship
Between Certain Nodes.
-->Defined Using The Shift Operators (>> And <<) Or set_upstream()/set_downstream()

Idempotency:
It Is A Property Of Some Operations Such That No Matter How Many Times
You Execute Them, You Achieve The Same Result

There Are 4 Main Components To Apache Airflow:

1)Web Server:
A Flask App Where You Can Track The Status Of Your Jobs And Read Logs
From A Remote File Store
2)Scheduler:
Responsible For Scheduling Jobs
This Is A Multi-Threaded Python Process That Uses The DAG Objects To Decide What Needs
To Be Run, When And Where
3)Executor:
The Mechanism That Gets The Tasks Done
4)Metadata Database:
Powers How The Other Components Interact
Stores The Airflow State
All Processes Read And Write From Here

Defining Tasks:
Tasks Are Defined Based On The Abstraction Of Operators, Which
Represent A Single Idempotent Task

Types Of Executors In Airflow:
1)Celery - Scalable, But Its Setup Is Complicated;
It Requires Additional Dependencies Like Redis And RabbitMQ

2)Sequential - Runs One Task At A Time (Both Worker And Scheduler Use The
Same Machine)
    -Not Scalable And Not Used In Production

3)Kubernetes - Simple And Scalable (Provides Benefits Of Both Local
And Celery)

4)Local - Same As Sequential But Can Run Multiple Tasks At A Time

# What Are Operators?
    -Building Blocks Of Airflow DAGs
    -While DAGs Describe How To Run A Workflow, Operators
Determine What Actually Gets Done By A Task
    -They Contain The Logic Of How Data Is Processed In A
Pipeline
    -An Operator Describes A Single Task In A Workflow;
Each Task In A DAG Is Defined By Instantiating An Operator

    -The DAG Will Make Sure That Operators Run In The Correct
Order; Other Than Those Dependencies,
Operators Generally Run Independently - In Fact, They May Run
On Two Completely Different Machines
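A minimal DAG sketch tying these ideas together (the task logic and schedule are illustrative; import paths follow Airflow 2.x):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from source")

    def load():
        print("load data to target")

    with DAG(
        dag_id="demo_etl_dag",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> load_task    # dependency via the shift operator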
​ ​
External Task Sensor:
Airflow Provides A Feature Called The External Task Sensor, Which Checks On The State
Of A Task In Another DAG; If The State Is
Success, Then The DAG With The External Sensor Simply Goes Ahead
And Executes The Task(s) Which Come Next

Sensors:
Sensors Are A Special Type Of Operator That Are Designed To Do Exactly
One Thing - Wait For Something To Occur
It Can Be Time-Based, Or Waiting For A File Or An External Event, But All They
Do Is Wait Until Something Happens, And Then Succeed So Their Downstream
Tasks Can Run

S3 Key Sensor:
The S3 Key Sensor, As The Name Suggests, Checks The Availability Of
Files (A.K.A. Keys) Placed In An S3 Bucket
The Sensor Can Be Set To Check Every Few Seconds Or Minutes For A Key;
When A DAG Is Running, It Will Check Whether The Key Is Available Or Not.
If The Key Is Available, Then Control Will Be Passed To The Next Task
In The DAG And The Flow Will Continue
If The Key Is Not Available, It Will Fail Or Retry (Depending Upon The
Configuration)
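A hedged usage sketch (bucket, key and connection ID are placeholders; the import path is the one used by the Amazon provider package in Airflow 2.x):

    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

    wait_for_file = S3KeySensor(
        task_id="wait_for_daily_file",
        bucket_name="my-demo-bucket",
        bucket_key="incoming/sales_{{ ds }}.csv",   # templated key, one file per day
        aws_conn_id="aws_default",
        poke_interval=60,        # check every 60 seconds
        timeout=60 * 60 * 6,     # give up (fail) after 6 hours
    )

    # Inside a DAG: wait_for_file >> process_file_task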

----------------------------------------------------------------------

AWS Session-22
Sensor Plugins:
Sensor Plugins Add Custom Sensors To Airflow That Allow You To Wait
For Certain Conditions Or Events To Occur Before Executing Tasks
Sensors Are Useful For Tasks That Need To Wait For External Events
Like File Availability, Database Changes, API Responses Or Other
External Triggers

Hooks:
-Help To Create Connections With External Systems
-Ex: BaseHook, S3Hook

Hook Plugins:
Hook Plugins Enable You To Create Hooks That Define Connections And
Interact
With External Systems Or Services
Hooks Provide A Consistent Interface For Connecting To Various Systems
Like Databases, Cloud Services, Message Queues And More

Examples:
S3Hook
BigQueryHook
SparkSubmitHook
HiveServer2Hook
PostgresHook
MySQLHook
RedisHook
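For example, a small S3Hook sketch inside a PythonOperator callable (bucket name and connection ID are placeholders; import path from the Amazon provider package):

    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    def list_incoming_files():
        hook = S3Hook(aws_conn_id="aws_default")
        keys = hook.list_keys(bucket_name="my-demo-bucket", prefix="incoming/")
        for key in keys or []:
            print(key)

    # Used as: PythonOperator(task_id="list_files", python_callable=list_incoming_files)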