Chinamaysir Notes
Cloud Basics:
1] Client, Network Of A Website:
2] Components Of A Server:
Compute: Cores --> Handle Processing Needs
Storage: HDD, SSD, Database
Memory: RAM
Network: Router, Switches, DNS Server
On-Premise Infra/Server --> Traditional Approach
Disadvantages:
1] Scaling Limitations
2] Costs, Highly Expensive --> Initial Investment, Human Resources
3] Maintenance
4] Threat Of Info Loss Due To Disasters
How Can Cloud Help In Overcoming These Disadvantages?
1] Managed Services --> Mainly Related To Infra/Server
2] Data Security And Loss Prevention
3] On-Demand Delivery As Per Needs
4] Data Maintained In A Different Location (Data Center)
Cloud Computing:
On-Demand Delivery Of Compute Power, Storage, Memory Or Any Other IT Infra/Server
Pay-As-You-Go Pricing
Instant Availability
Deployment Models:
1] Public Cloud: Azure, GCP, AWS
2] Private Cloud: Used By A Single Org
3] Hybrid Cloud: Combination Of Private And Public --> Sensitive And PII Data
Types Of Cloud Computing: In Terms Of What Is Managed By The Cloud Provider And What Is Managed By You
1] Infrastructure As A Service (IaaS) --> Includes Network, Data Storage, VMs
2] Platform As A Service (PaaS) --> Provider Manages DB, Middleware, OS, Deployment And Application Management
3] Software As A Service (SaaS): Complete Product
Pricing Of Cloud:
1] Compute: Pay For Compute Time
2] Storage: Pay For Data Storage
3] Networking: Data Transfer In Is Free; Data Transfer Out Is Charged
=================================================================
AWS Session-2
1] ChatGPT: On-Prem Infra Vs Cloud Infra Differences
2] Correction To Pricing For The Network Component:
ChatGPT: Is Data Transfer Into The Cloud Charged Under Storage Or Network?
3] AWS Console Overview:
4]Free Tier Includes:
5]AWS Regions:
-Compliance With Data Governance And Legal Requirements
-Proximity To Customers And Reduced Latency
-Services Vary From Region To Region
-Pricing Variation
6] AWS Availability Zones:
-Maintain Copies Of Data To Avoid Simultaneous Impact Of Disasters
--Each AZ Has 1 Or More Data Centers With Redundant Power, Networking And Connectivity
--2 To 6 AZs Per Region
--Connected With High-Bandwidth, Ultra-Low-Latency Networking
=================================================================
AWS Session-3
First Service: Identity & Access Management (IAM)
1] Why Is Access Management Necessary?
2] Multi-Factor Authentication (MFA)
3] User Groups:
--For Providing Collective Privileges By Attaching A Policy
--Groups Contain Only Users, Not Other Groups
--One User Can Belong To Multiple Groups
4] Users:
--Created To Avoid Use Of The Root User
--Can Give Limited Permissions To A User, Complying With The Least Privilege Principle
--Have Permanent Credentials --> 1] Password  2] Access Keys
5] Roles
--Grant Permission For One Service To Access The Functionality Of Another Service
--Temporary Access
6] Policies
--JSON Document Which Contains The Permissions To Be Granted To Users, Groups Or Roles
--Types:
1] AWS Managed
2] Custom Policy
--Policy Inheritance And Inline Policies
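Example (a minimal sketch, assuming hypothetical bucket, policy and group names) of how a custom policy JSON could be created and attached with boto3:

import json
import boto3

# Hypothetical custom policy: read-only access to a single S3 bucket
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::demo-bucket",     # the bucket itself (for ListBucket)
                "arn:aws:s3:::demo-bucket/*"    # objects inside it (for GetObject)
            ],
        }
    ],
}

iam = boto3.client("iam")

# Create the customer-managed policy, then attach it to a user group
response = iam.create_policy(
    PolicyName="DemoS3ReadOnly",
    PolicyDocument=json.dumps(policy_document),
)
iam.attach_group_policy(
    GroupName="data-engineers",                 # hypothetical group
    PolicyArn=response["Policy"]["Arn"],
)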
7] Access Reports:
--Credential Report: Account-Level Report
--Lists All Users In The Account And The Status Of Their Various Credentials
--Access Advisor: User-Level Report
--Shows The Service Permissions Granted To A User And When Those Services Were Last Accessed
Demo: Creating 2 Users And Providing Different Levels Of Access To An S3 Object
=================================================================
AWS Session-4
Second Service: Elastic Compute Cloud (EC2)
2] Components Of VMs:
3] Instance Types:
1] General Purpose: T, M Series
--Balanced Between Compute, Memory And Networking Resources
--Web Servers, Code Repos Etc.
2] Compute Optimized: C Series
--For Compute-Intensive Tasks Which Require High Computation Power
--Use Cases:
--Batch Processing Workloads
--Scientific Modelling And ML
3] Memory Optimized: R, X, Z Series
--For Workloads That Process Large Data In Memory
--Use Case: Real-Time Streaming Data (Stock Market Data)
4] Storage Optimized: I, D, H Series
--Disk Storage Which Provides Better Hardware Configuration, Leading To Better I/O Operations
--Use Case: Data Warehousing Applications, RDBMS
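A hedged boto3 sketch of launching a general-purpose instance; the AMI ID, key pair and region below are placeholder assumptions:

import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")   # region is an assumption

# Launch one general-purpose instance; the AMI ID and key pair name are placeholders
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical Amazon Linux AMI ID
    InstanceType="t2.micro",           # general purpose (T series), free-tier eligible
    KeyName="my-key-pair",             # hypothetical key pair
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])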
=================================================================
AWS Session-5
4] Security Groups:
-Virtual Firewall That Controls Inbound And Outbound Traffic For EC2 Instances
-Rules Allow Traffic Based On IP Address, Protocol And Port Number (Anything Not Explicitly Allowed Is Denied)
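A rough boto3 sketch of creating a security group and allowing one inbound port; the group name, VPC ID and CIDR range are placeholders:

import boto3

ec2 = boto3.client("ec2")

# Create a security group and allow inbound SSH (port 22) from one IP range
sg = ec2.create_security_group(
    GroupName="demo-sg",
    Description="Allow SSH from office network",
    VpcId="vpc-0123456789abcdef0",     # hypothetical VPC ID
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.0/24"}],   # example CIDR range
    }],
)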
5] Instance Purchase Options:
1] On-Demand Instances:
--Pay-As-You-Go Pricing
--Per-Hour Or Per-Second Rate
--Acquire And Release Instances Based On Our Needs
Use Cases:
--Suitable For Unpredictable And Fluctuating Workloads
--Dev And Test Environments
2] Reserved Instances:
--Reserved For A Period Of 1 Or 3 Years
--One-Time Upfront Cost But At A Discounted Hourly Rate
Use Cases:
--For Consistent And Predictable Usage Patterns
--For Prod Environments
3] Spot Instances:
--One Of The Cheapest Among All
--Acquired Based On A Bid Value
--AWS Can Reclaim The Instance If The Spot Price Exceeds Your Bid ==> Instance Is Terminated
Use Cases:
--For Applications Which Can Tolerate Interruptions
--Batch Processing
4] Dedicated Instances:
--Instances That Run On Dedicated Hardware For A Single Customer
Use Cases:
--Compliance And Regulatory Requirements
--Data Privacy And Residency
6] Elastic Block Storage (EBS):
--Persistent Block Storage For Data
--Network-Attached ==> Hence A Bit Of Latency
--A Volume Can Be Attached To Only One EC2 Instance At A Time (One Instance Can Have Multiple Volumes)
--Allows You To Persist Data Even After Termination Of The Instance
# EBS Snapshots:
--Point-In-Time Backups Of EBS Volumes
--Can Copy Snapshots From One AZ Or Region To Another
8] Launch Templates:
--Vertical Scalability:
--Upgrade The Hardware Configuration To Achieve Increased Server Needs
--The Number Of Machines Remains The Same
--Horizontal Scalability:
--Instances Are Added Or Removed To Address Evolving Server Needs
--No Change At The Level Of A Particular Instance
=================================================================
AWS Session-6
1] Performing Operations On EC2 Instances:
--Package Managers: apt, yum
sudo yum update
mkdir demo
cd demo
touch demo_file.txt
ls
pwd
nano <file-name>
sudo yum install <package-name>
scp
2] Configuration Of The AWS CLI:
pip install awscli
aws configure
=================================================================
AWS Session-7
Third Service: Simple Storage Service (S3)
1] What Is Block Storage And Object Storage?
Definition --> Block: Stores Data In Fixed-Size Blocks (Like A Hard Disk) | Object: Stores Data As Objects (Data + Metadata + ID)
Access --> Block: Acts Like A Physical Disk; Accessed Via OS | Object: Accessed Via API Or Web Interface
Use Case --> Block: Databases, OS Disks, Boot Volumes | Object: Backups, Media Files, Logs, Static Assets
Performance --> Block: Low-Latency, High-Performance (Good For IOPS) | Object: Scalable And Cost-Effective, But Higher Latency
Example In AWS --> Block: Amazon EBS (Elastic Block Store) | Object: Amazon S3 (Simple Storage Service)
2] S3 Versioning:
--Best Practice To Version Your Buckets To Protect Them From Unintended Changes => Prevents Data Loss By Reverting To A Previous Version
-Easy Rollback To A Previous Version
-Suspending Versioning Will Not Delete Previously Stored Versions
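A small boto3 sketch of enabling versioning and listing object versions, assuming a hypothetical bucket name:

import boto3

s3 = boto3.client("s3")
bucket = "demo-versioned-bucket"   # hypothetical bucket name

# Turn versioning on (use Status="Suspended" to suspend it later;
# previously stored versions are kept either way)
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Every object version gets its own VersionId that can be listed and restored
versions = s3.list_object_versions(Bucket=bucket)
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["IsLatest"])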
4] S3 Replication:
CRR --> Cross-Region Replication
SRR --> Same-Region Replication
5] Amazon Glacier:
--Low-Cost Object Storage For Archival And Backup
-Data Retained For The Long Term
-3 Retrieval Options: Expedited, Standard, Bulk
6] Glacier Deep Archive ==> Cheapest Of All
--2 Retrieval Options: Standard, Bulk
7] S3 Object Lock:
--Bucket Versioning Should Be Enabled
-Secure Sockets Layer (SSL)
-Transport Layer Security (TLS)
8] Bucket Policies: Restricting Access To A Bucket Using A Policy
9)Lifecycle Rules:
# Snow Family:
The Snow Family helps move large amounts of data into and out of AWS when network transfer is too slow, expensive, or impractical. It also enables edge computing in disconnected or harsh environments.
# Snowcone:
# Snowball:
Use Case --> Medium To Large Data Migration (10 TB To Petabyte Scale)
Storage Capacity --> 80 TB (Snowball Edge Storage Optimized), 42 TB (Compute Optimized)
Type --> Rugged Physical Device With Tamper-Proof Protection
Edge Computing --> Yes (EC2, Lambda Functions, IoT Greengrass)
Data Encryption --> 256-Bit Encryption
Transfer Speed --> Fast Local Transfer And Secure Shipment To AWS
Variants --> Snowball Edge Storage Optimized, Snowball Edge Compute Optimized
Ideal For --> Data Center Migration, Analytics In Disconnected Locations, Military, Research Fieldwork
# SnowMobile:
=================================================================
AWS Session-8
Fourth Service: AWS Athena
1] What Is Athena?
--Serverless Query Service For Analyzing Data In Amazon S3 Using Standard SQL
2] What Is Serverless?
--A Managed Service Where AWS Takes Care Of The Cluster Resources
Main Features Of AWS Athena:
--Serverless Architecture
--Integration With S3
--Standard SQL Support
-PySpark And Spark SQL Support As Well
-Pay-Per-Query Pricing
-Compatibility With Various Data Formats (JSON, Parquet, CSV, Etc.)
3] Query Options: 1] SQL  2] PySpark And Spark SQL
LOCATION, TBLPROPERTIES:
Partitioning And Bucketing
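A hedged sketch of registering a partitioned external table from Python via boto3; the database, table, columns and S3 paths are made-up placeholders:

import boto3

athena = boto3.client("athena")

# DDL for a partitioned external table over CSV files in S3
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS demo_db.sales (
    order_id STRING,
    amount   DOUBLE
)
PARTITIONED BY (order_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://demo-bucket/sales/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "demo_db"},
    ResultConfiguration={"OutputLocation": "s3://demo-bucket/athena-results/"},
)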
=================================================================
AWS Session-9
ETL Basics: (Extract, Transform, Load)
-ETL Is A Data Integration Process Used To Collect, Transform, And Load Data
Components Of ETL:
A) Extract
-Data Extraction: Retrieving Data From Source Systems, Which Can Include Databases, Flat Files, APIs, Or Other Repositories
==> Salesforce, Data Lake, Data Warehouse
-Change Data Capture (CDC): Identifying And Capturing Only The Changed Data Since The Last Extraction To Optimize Efficiency
B) Transform
-Data Cleaning: Removing Or Correcting Errors, Inconsistencies, Or Inaccuracies In The Source Data
-Data Transformation: Restructuring And Converting Data Into A Format Suitable For The Target System
-Data Enrichment: Enhancing Data By Adding Additional Information Or Attributes
C) Load:
-Data Staging: Temporary Storage Of Data Before Loading It Into The Target System
-Data Loading: Inserting Or Updating Data Into The Destination Database Or Data Warehouse
-Error Handling: Managing And Logging Errors That May Occur During The Loading Process
ETL Process Flow:
A) Extraction Phase
-Connect To Source Systems: Establishing Connections To Source Databases, APIs, Or Files
-JDBC/ODBC
-Data Selection: Defining Criteria For Selecting The Data To Be Extracted
B) Transformation Phase
-Data Mapping: Creating A Mapping Between Source And Target Data Structures
-Data Cleansing: Identifying And Correcting Data Quality Issues
-Data Validation: Ensuring Transformed Data Meets Specified Quality Standards
-ABC Validation Framework (Audit-Balance-Control)
-Aggregation: Combining And Summarizing Data For Reporting Purposes
C) Loading Phase
-Data Staging: Moving Transformed Data To A Staging Area For Further Processing
-Bulk Loading: Efficiently Inserting Large Volumes Of Data Into The Target System
-Indexing: Creating Indexes To Optimize Data Retrieval In The Target Database
-Post-Load Verification: Confirming That Data Has Been Loaded Successfully
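To tie the three phases together, a minimal self-contained Python sketch of an extract-transform-load run; the file names and columns are hypothetical:

import csv
from collections import defaultdict

# Extract: read raw rows from a source file (path is a placeholder)
with open("source_orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean bad records and aggregate amount per customer
totals = defaultdict(float)
for row in rows:
    if not row.get("customer_id") or not row.get("amount"):
        continue                                          # data cleaning: drop incomplete records
    totals[row["customer_id"]] += float(row["amount"])    # aggregation

# Load: write the transformed result to a staging/target file
with open("customer_totals.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "total_amount"])
    for customer_id, total in totals.items():
        writer.writerow([customer_id, round(total, 2)])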
Popular ETL Tools:
Apache NiFi
Talend
Informatica
Microsoft SSIS (SQL Server Integration Services)
Apache Spark
Cloud-Based Services --> AWS Glue, AWS EMR, GCP Dataproc
Fifth Service: AWS Glue
Features:
1] Managed ETL Service
2] Data Catalog:
-Centralized Metadata Repository For Storing Metadata About Data Sources, Transformations, And Targets
-Metadata Includes Information About The Structure, Format And Location Of Datasets As Well As Details Of Transformations And Schema Definitions
-Databases: Logical
-Tables: Logical
3] ETL Jobs:
-Automate The Process Of Extracting, Transforming, And Loading Data From Various Sources To Destinations
4] Crawlers:
-Automatically Discover The Schema Of Your Data Stored In Various Sources And Create Metadata Tables In The Data Catalog
5] Triggers:
-Scheduling Of ETL Jobs Based On Time Or Events
6] Connections:
-Facilitate The Creation, Testing, And Management Of Database Connections Used In ETL Processes
7] Workflows:
-Orchestration And Automation Of ETL Workflows By Defining Sequences Of Jobs, Triggers, And Crawlers
8] Serverless Architecture:
-Eliminates The Need For Users To Provision Or Manage The Underlying Infrastructure
9] Scalability And Reliability
-Fault Tolerance With A Configurable Number Of Retries
-Data Durability
-Custom Classifiers:
-Provide The Structure Or Schema Definition Of A File To Be Crawled; Generally Used For File Formats And Definitions Not Supported By Glue's Built-In Classifiers
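A hedged boto3 sketch of creating and starting a crawler; the crawler name, role ARN, database and S3 path are placeholders:

import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix and writes tables into a catalog database
glue.create_crawler(
    Name="demo-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",   # hypothetical role ARN
    DatabaseName="demo_db",
    Targets={"S3Targets": [{"Path": "s3://demo-bucket/raw/customers/"}]},
    Schedule="cron(0 2 * * ? *)",    # optional: run daily at 02:00 UTC
)

glue.start_crawler(Name="demo-crawler")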
=================================================================
AWS Session-12
Demo 1 - Crawling A Customer DB (With Custom Classifier And Partitions)
Demo 2 - Crawling A Traffic Dataset And Performing A SQL Transform Using An ETL Job In Glue
C) Glue Triggers:
-Trigger Types:
-Scheduled:
-Event-Driven: The Event Is Either An ETL Job Or A Crawler
-Conditional Logic: If There Are Multiple Watched Resources For A Trigger, Then ALL Or ANY Applies
-ALL: Only If All Watched Resources Achieve The Desired State Does The Further Node Trigger
-ANY: If Any One Of The Watched Resources Achieves The Desired State, Then The Further Node Triggers
-On Demand: Manual Run
-EventBridge Event
-Watched Resources:
-The Trigger Monitors These Resources To Determine Whether To Initiate The Target Node
-Can Add Multiple Watched Resources
-Target Resources:
-Once The Status Of The Watched Resources Matches The Provided Condition, The Target Resources Get Initiated
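A rough boto3 sketch of a conditional trigger watching two crawlers (in the API, ALL is expressed as Logical="AND"); all names are placeholders:

import boto3

glue = boto3.client("glue")

# Conditional trigger: start a target job only when BOTH watched crawlers succeed
# (Logical="ANY" would fire when any one of them reaches the desired state)
glue.create_trigger(
    Name="start-transform-after-crawls",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Actions=[{"JobName": "traffic-transform-job"}],        # target resource
    Predicate={
        "Logical": "AND",                                  # ALL watched resources
        "Conditions": [
            {"LogicalOperator": "EQUALS",
             "CrawlerName": "customer-crawler", "CrawlState": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS",
             "CrawlerName": "traffic-crawler", "CrawlState": "SUCCEEDED"},
        ],
    },
)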
D) Glue Workflows:
-2 Components:
1] Triggers
2] Nodes: Jobs Or Crawlers
-Workflow Blueprint
=================================================================
AWS Session-13
Sixth Service: Elastic MapReduce (EMR)
1) EMR Architecture And Provisioning:
A] Application Bundle:
-Preconfigured Application Collection Provided By EMR To Be Launched While Bootstrapping
-Ex: Spark, Core Hadoop, Flink, Custom
2] Instance Fleets:
-Multiple Instance Types Can Be Chosen For Each Node Group
-A Maximum Of 5 Instance Types Can Be Configured
G) Cluster Termination:
1) Manual Termination
2) Automatic Termination After Idle Time (Recommended)
-Can Mention The Idle Time In HH:MM:SS Format
# Termination Protection:
-If Enabled, It Protects From Accidental Termination
-To Terminate, We Need To Disable This Option First
-If Termination Protection Is Enabled, It Should Override Other Termination Attempts, Including Automatic Termination Due To Idle Time
H) Bootstrap Options:
-To Install Software Dependencies During Cluster Provisioning
I) Cluster Logs:
-Can Configure An S3 Location To Store Logs Of Cluster Activity
J) Identity And Access Management (IAM) Roles:
1) Amazon EMR Service Role:
-To Provision Resources And Perform Service-Level Actions With Other AWS Services
2) EC2 Instance Profile For Amazon EMR:
-To Provide Access To Other AWS Services For Each EC2 Node Of The EMR Cluster
3) Custom Automatic Scaling Role:
-Role For Automatic Scaling
K) EBS Root Volume: To Provide Additional Storage
------------------------------------------------------------------
AWS Session-14
1) Ways To Submit An Application To The Cluster:
1) AWS Management Console:
-EMR Steps
-Notebooks
2) AWS CLI
3) AWS SDK --> Boto3 For Python
command-runner.jar:
-A Tool/Utility That Facilitates The Execution Of Custom Commands And Scripts During The Setup Of An EMR Cluster
-Helps Users Automate Certain Tasks And Configurations, Providing A Convenient Way To Extend The Functionality Of The EMR Cluster
-A Component Within The Broader EMR Ecosystem, Aiding In The Execution Of Custom Commands And Scripts As Part Of The Cluster Initialization Process
# EMR Serverless:
-Simplifies The Process Of Running Big Data Workloads By Abstracting Away The Complexity Of Managing And Provisioning Infrastructure
-Allows Users To Focus On Data Processing And Analytical Tasks Without Needing To Worry About The Underlying Server Management
# EMR CLI:
-create-cluster
-terminate-clusters
-add-steps: Add A Step To The Cluster
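The boto3 counterpart of add-steps, submitting a Spark step through command-runner.jar, might look roughly like this (cluster ID and script path are placeholders):

import boto3

emr = boto3.client("emr")

# Add a Spark step to an existing cluster through command-runner.jar
emr.add_job_flow_steps(
    JobFlowId="j-0123456789ABC",       # hypothetical cluster ID
    Steps=[{
        "Name": "spark-etl-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit",
                     "--deploy-mode", "cluster",
                     "s3://demo-bucket/scripts/etl_job.py"],
        },
    }],
)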
------------------------------------------------------------------
AWS Session-15
BOTO3:
-The AWS Software Development Kit (SDK) For Python
-Using Boto3 Scripting We Can Manage And Utilize AWS Resources For Data Processing
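A minimal example of a Boto3 script, assuming credentials were already set up with aws configure:

import boto3

# A session picks up the credentials configured with `aws configure`
session = boto3.Session()

# List S3 buckets in the account as a quick sanity check
s3 = session.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# The resource interface offers a higher-level, object-style API
ec2 = session.resource("ec2")
for instance in ec2.instances.all():
    print(instance.id, instance.state["Name"])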
------------------------------------------------------------------
AWS Session-16
Seventh Service: Amazon Redshift
1) What Is Amazon Redshift And How Does It Work:
-Petabyte-Scale Data Warehouse Service
-It Is Used To Store And Analyze Large Amounts Of Data (Historic Data)
-Redshift Uses:
-Parallel Query Execution To Provide Fast Query Performance
3) Performance Features:
1) Massively Parallel Processing == Parallel Query Engine
2) Columnar Data Storage
3) Data Compression
4) Query Optimizer
5) Result Caching:
-Query Result Caching
-SVL_QLOG --> The source_query Column Shows Whether The Result Cache Was Used (Populated When Cached, Empty When Not)
--Look Up System Tables In Redshift
What Is A Workgroup --> Container
-Collection Of Compute Resources From Which An Endpoint Is Created
-Compute-Related
-Groups Together Compute Resources Like RPUs, VPC Subnet Groups, Security Groups
What Is A Namespace:
-A Namespace Is A Collection Of Database Objects And Users
-Storage-Related
-Groups Together Schemas, Tables, Users, And AWS Key Management Service Keys For Encrypting Data
1) ETL:
-Transform --> Spark (Traditional DW)
2) ELT:
-Load Into The DW
-Transform In The Modern DW
3) EtLT:
-t --> Transform On Spark: Schema Conversion, Column Truncation; Then Load Into The Warehouse
-Transform In The Modern DW: Aggregation-Related
-UPSERT Operation --> Update And Insert
------------------------------------------------------------------
AWS Session-17
Query Result Caching (Result Set Caching): Involves Storing The Actual Result Sets Of Executed Queries For Later Retrieval
Query Caching:
-Involves Caching The Execution Plan Or Metadata Associated With A Query, Rather Than The Actual Data
-Helps Save On Planning And Optimization Time
------------------------------------------------------------------
AWS Session-18
# Redshift COPY Command:
-Column_List: (Optional)
-A Comma-Separated List Of Columns In The Target Table
-If Not Specified, Redshift Assumes That The Column Order In The Source File Matches The Order Of Columns In The Target Table
-Data_Source:
-Specifies The Source Of The Data You Want To Copy
-This Can Be An Amazon S3 Bucket, An Amazon EMR Cluster, A Data File On Your File System, Or A Remote Host Using SSH
Common Options:
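As a hedged illustration of the command and a few common options, a COPY could be issued from Python via the Redshift Data API; the workgroup, database, table, bucket and role ARN are placeholders:

import boto3

redshift_data = boto3.client("redshift-data")

# COPY from S3 with a column list and common options (format, IAM role, region)
copy_sql = """
COPY sales (order_id, order_date, amount)
FROM 's3://demo-bucket/sales/2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
REGION 'ap-south-1';
"""

redshift_data.execute_statement(
    WorkgroupName="demo-workgroup",   # Redshift Serverless workgroup (assumption)
    Database="dev",
    Sql=copy_sql,
)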
# Redshift Spectrum Features:
1) Data In Amazon S3
2) External Tables
-Not Stored In Your Redshift Cluster But Act As Metadata For Querying The Data In S3
3) Querying
-Can Run SQL Queries That Join Your Internal Redshift Tables With Your External S3 Data
4) Performance:
-Optimizes Query Performance By Pushing Down Filters To The S3 Data, Minimizing Data Movement
# Distribution Styles:
1) Auto
2) Even
3) All
CREATE TABLE your_table(
    column1 INT,
    column2 VARCHAR(50)
)
DISTSTYLE ALL;
4) Key
# RDS:
-Managed Relational Database Service Provided By Amazon Web Services
-Simplifies The Setup, Operation, And Scaling Of Relational Databases Without The Administrative Overhead Of Managing A Database Server
# Key Features:
1) Supports Popular Database Engines: MySQL, PostgreSQL, Oracle, SQL Server, MariaDB, And Amazon Aurora
2) Managed Service: RDS Takes Care Of Routine Database Tasks Such As Hardware Provisioning, Database Setup, Patching, Backups And Scaling
3) Scalability
4) High Availability
5) Security:
--Encryption At Rest And In Transit
--Virtual Private Cloud (VPC) Integration
--Database Authentication Options
6) Backup And Restore: Through Backups And Snapshots
8) Maintenance
DB Instance Classes:
# Compound Sort Key:
-DDL:
CREATE TABLE your_table(
    column1 INT,
    column2 VARCHAR(50),
    column3 DATE)
SORTKEY(column1, column3);
Ex: Date = 30/10
# Interleaved Sort Key:
-Also Composed Of One Or More Columns
-Doesn't Prioritize One Column Over The Others ==> It Interleaves The Data Across All Columns In The Sort Key Evenly
-Can Help Improve Query Performance For Tables With Unpredictable Query Patterns (Varying Filtering And Grouping Conditions)
-DDL:
CREATE TABLE your_table(
    column1 INT,
    column2 VARCHAR(50),
    column3 DATE)
INTERLEAVED SORTKEY(column1, column2, column3);
------------------------------------------------------------------
AWS Session-20
AWS Lambda:
-Function As A Service (FaaS)
-Supports Multiple Programming Languages Like Python, Java, C# Etc.
-Serverless Computing Service That Allows You To Run Code Without Provisioning Or Managing Servers
## Features
1) Real-Time Data Processing:
-Can Be Triggered In Near Real Time In Response To Events Or Data Streams
-For Example: You Can Use It To Process Data From Sources Like Amazon Kinesis (For Real-Time Data Streams)
2) Scalability:
-Lambda Functions Can Scale Up Or Down To Handle Varying Workloads
3) Event-Driven Data Pipelines:
-Can Chain Lambda Functions Together To Create Complex Data Processing Workflows
4) Integration With Other AWS Services
5) Cost-Efficiency:
-Charges Based On The Compute Time A Lambda Function Consumes
-Only Pay For The Processing Time Required For Each Data Event
6) Scheduled Data Processing: Batch Data Processing
-Can Schedule Lambda Functions To Run At Specific Intervals
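A minimal sketch of a Python Lambda handler for an S3 ObjectCreated trigger (the trigger wiring itself is assumed to be configured separately):

import json

# Triggered by an S3 event: log the bucket and key of each newly uploaded object
def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object uploaded: s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps("processed")}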
------------------------------------------------------------------
AWS Session-21
--> Airflow
What Is A DAG:
A Series Of Tasks That You Want To Run As Part Of Your Workflow
Airflow Is A Workflow Engine, Which Means It:
-Manages Scheduling And Running Of Jobs And Data Pipelines
-Ensures Jobs Are Ordered Correctly Based On Dependencies
-Manages The Allocation Of Scarce Resources
-Provides Mechanisms For Tracking The State Of Jobs And Recovering From Failure
DAG: Order Of Execution
A Directed Acyclic Graph Is A Set Of Tasks With An Explicit Execution Order, Beginning, And Endpoint
The Vertices And Edges (The Arrows Linking The Nodes) Have An Order And Direction Associated With Them
Each Node In A DAG Corresponds To A Task, Which In Turn Represents Some Sort Of Data Processing
Dependencies:
Each Of The Edges Has A Particular Direction That Shows The Relationship Between Certain Nodes
--> Defined Using The Shift Operators (>>, <<) Or set_upstream(), set_downstream()
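A minimal DAG sketch (assuming Airflow 2.x) with two tasks ordered using the shift operator; IDs and commands are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Two tasks with an explicit execution order set via the shift operator
with DAG(
    dag_id="demo_etl_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load        # same as extract.set_downstream(load)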
Idempotency:
It Is A Property Of Some Operations Such That No Matter How Many Times You Execute Them, You Achieve The Same Result
Sensors:
Sensors Are A Special Type Of Operator Designed To Do Exactly One Thing: Wait For Something To Occur
It Can Be Time-Based, Or Waiting For A File Or An External Event, But All They Do Is Wait Until Something Happens, And Then Succeed So Their Downstream Tasks Can Run
S3 Key Sensor:
The S3 Key Sensor, As The Name Suggests, Checks The Availability Of Files (A.K.A. Keys) Placed In An S3 Bucket
The Sensor Can Be Set To Check Every Few Seconds Or Minutes For A Key
When A DAG Is Running, It Will Check Whether The Key Is Available Or Not
If The Key Is Available, Control Passes To The Next Task In The DAG And The Flow Continues
If The Key Is Not Available, It Will Fail Or Retry (Depending Upon The Configuration)
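A hedged sketch using S3KeySensor (assuming the apache-airflow-providers-amazon package); bucket, key and connection ID are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Wait for a key to appear in S3, then run the downstream task
with DAG(
    dag_id="wait_for_s3_file",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="demo-bucket",
        bucket_key="incoming/data.csv",
        aws_conn_id="aws_default",
        poke_interval=60,     # check every 60 seconds
        timeout=60 * 60,      # fail (or retry) after one hour
    )
    process = BashOperator(task_id="process", bash_command="echo processing file")

    wait_for_file >> process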
------------------------------------------------------------------
AWS Session-22
Sensor Plugins:
Sensor Plugins Add Custom Sensors To Airflow That Allow You To Wait For Certain Conditions Or Events To Occur Before Executing Tasks
Sensors Are Useful For Tasks That Need To Wait For External Events Like File Availability, Database Changes, API Responses Or Other External Triggers
Hooks:
-Help To Create Connections With External Systems
-Ex: BaseHook, S3Hook
Hook Plugins:
Hook Plugins Enable You To Create Hooks That Define Connections And Interact With External Systems Or Services
Hooks Provide A Consistent Interface For Connecting To Various Systems Like Databases, Cloud Services, Message Queues And More
Examples:
S3Hook
BigQueryHook
SparkSubmitHook
HiveServer2Hook
PostgresHook
MySqlHook
RedisHook
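A hedged sketch of using S3Hook inside a PythonOperator task (amazon provider package assumed); bucket, prefix and connection ID are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Use S3Hook inside a Python task to list keys under a prefix
def list_incoming_files():
    hook = S3Hook(aws_conn_id="aws_default")
    keys = hook.list_keys(bucket_name="demo-bucket", prefix="incoming/")
    for key in keys or []:
        print(key)

with DAG(
    dag_id="s3_hook_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="list_incoming_files", python_callable=list_incoming_files)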