AWS Data Engineering Notes by Iusmanmaqbool
Scale vertically: add more power (RAM, CPU, storage) to an existing machine.
Scale horizontally: add more machines to the existing pool of resources.
Loosely coupled architecture: reduce dependencies between components so that a change or failure in one component does not cascade to the others.
Redshift is a SQL-based relational database used as a data warehouse.
Kinesis
Kinesis is a platform for streaming data: it is used to collect, process, and analyze real-time streaming data.
Logging
Producers continuously push data to a Kinesis stream, and consumers (consumer applications) read from the stream and store the data in S3, DynamoDB, Redshift, or EMR, or forward it to Firehose.
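A minimal sketch of the producer side with boto3 (the stream name, region, and payload are placeholders):

import json
import boto3

# Hypothetical stream name and region -- adjust to your own setup.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_event(user_id, payload):
    """Producer side: push one record into the stream.
    The partition key decides which shard the record lands in."""
    return kinesis.put_record(
        StreamName="my-clickstream",        # placeholder stream name
        Data=json.dumps(payload).encode(),  # the data blob, max 1 MB
        PartitionKey=str(user_id),
    )

put_event("user-42", {"page": "/home", "action": "view"})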
Two types of resharding: a shard split (divide one shard into two) and a shard merge (combine two shards into one).
A Kinesis data stream distributes the records belonging to the stream across multiple shards, using the partition key associated with each record to determine which shard a given data record belongs to.
Data blob: the actual data a producer adds to the stream; maximum size is 1 MB.
Retention period: the duration after which data added to the stream expires. The default retention period is 24 hours, and it can be increased to seven days if required.
Two types of batching.
Since multiple applications can consume data from the same stream, you might have a case where some of the data is sent to services like Redshift while the rest is archived in S3, using an object lifecycle policy to move it into Glacier.
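A rough sketch of such a lifecycle rule with boto3 (the bucket name, prefix, and the 30-day transition are assumptions):

import boto3

s3 = boto3.client("s3")

# Move objects under the "stream-archive/" prefix to Glacier after 30 days.
# Bucket name, prefix, and transition age are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-stream-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": "stream-archive/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)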
Use case
Elasticsearch: if you need to make data from the stream searchable, send this data to Elasticsearch using the KCL (Kinesis Client Library).
Agents
1) The Kinesis Agent is a standalone Java software application that offers an easy way to collect and send data to a data stream or a Firehose delivery stream; the agent continuously monitors a set of files and sends new data to the stream.
2) It allows you to collect and send data to Firehose.
3) It can be installed on Linux-based web servers, log servers, or database servers.
4) It also publishes CloudWatch metrics to help with monitoring.
5) It can preprocess records parsed from the monitored files before sending them: combining multiple lines into a single line, converting delimited records to JSON, and converting records from log format to JSON format. The agent configuration specifies which files to monitor and the destination stream for the data.
6) The agent can monitor multiple file directories and write to multiple streams; it also preprocesses and converts records before sending them to a data stream or delivery stream.
* Put records: the batch call (PutRecordBatch for a Firehose delivery stream) writes multiple records in a single call, which can achieve higher throughput than writing single records.
With Firehose you don't write any code; you just create a delivery stream and configure its destination.
Clients put data into the stream using AWS API calls (PutRecord / PutRecordBatch), and the data is automatically sent to the configured destination.
When the configured destination is Amazon S3, Firehose delivers the data directly to S3.
For an Amazon Redshift destination, the data is first delivered to Amazon S3 and then a Redshift COPY command is executed to load the data into Redshift.
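A minimal boto3 sketch of batching records into a Firehose delivery stream (the stream name and sample events are placeholders):

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

events = [{"order_id": i, "amount": i * 9.99} for i in range(100)]

# PutRecordBatch sends up to 500 records per call, which gives higher
# throughput than one PutRecord call per record.
response = firehose.put_record_batch(
    DeliveryStreamName="orders-delivery-stream",  # placeholder name
    Records=[{"Data": (json.dumps(e) + "\n").encode()} for e in events],
)
print("Failed records:", response["FailedPutCount"])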
Workout
Put and get records from a Kinesis data stream using the AWS CLI.
Step1
Step2
Step6:
Step7
Step8: decoding
Step9.1
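The workout uses the AWS CLI; the same put/get flow as a boto3 sketch (stream name and region are placeholders) looks roughly like this:

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "my-test-stream"  # placeholder stream name

# Put a record (CLI: aws kinesis put-record ...)
kinesis.put_record(StreamName=stream, Data=b"hello kinesis", PartitionKey="pk-1")

# Read it back: find a shard, get an iterator, then fetch records
shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest record
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]
for r in records:
    # The CLI returns base64-encoded data (hence the decoding step);
    # boto3 returns the raw bytes directly.
    print(r["Data"].decode())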
Lambda is an example of serverless computing, which allows you to run code without provisioning or managing servers.
With Lambda you can run code for virtually any type of application or backend service, all with zero administration.
Just upload your code and Lambda takes care of everything required to run and scale it with high availability.
You can set up your code to be triggered automatically by other AWS services, or call it directly from any web or mobile app.
Flow:
Firehose buffers incoming data up to 3 MB or the buffering size specified for the delivery stream, whichever comes first.
Firehose invokes the specified Lambda function synchronously with each buffered batch.
The transformed data is sent from Lambda back to Firehose for buffering.
The transformed data is delivered to the destination once the specified buffering size or buffering interval is reached.
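A minimal sketch of a Firehose transformation Lambda following this flow (the transformation itself is just an example):

import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler (illustrative sketch).
    Firehose invokes this synchronously with a buffered batch of records;
    each record must be echoed back with its recordId and a result of
    Ok, Dropped, or ProcessingFailed."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # example transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}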
Failure handling
If data transformation fails, the unsuccessfully processed records are delivered to your S3 bucket in a processing-failed folder.
Firehose retries the invocation three times and then skips that batch of records if the retries do not succeed.
The retry delivery period is 24 hours.
Invocation errors and records with a status of ProcessingFailed can be emitted to CloudWatch Logs.
Data flow:
For an S3 destination, streaming records are delivered to your S3 bucket. If data transformation is enabled, you can optionally back up the source data to another S3 bucket.
For a Redshift destination, streaming records are first kept in an S3 bucket; to load the data from the S3 bucket, Kinesis Data Firehose issues an Amazon Redshift COPY command against your Redshift cluster. If data transformation is enabled, you can again optionally back up the source data to another Amazon S3 bucket.
The Redshift delivery frequency depends on how fast your Redshift cluster finishes the COPY command; the retry duration for Redshift can be set from 0 to 7200 seconds.
Firehose automatically issues a new COPY command once the previous COPY command has finished running.
For an Elasticsearch destination, delivery also depends on the buffer size (1 MB to 100 MB) and the buffer interval (60 to 900 seconds).
Firehose can raise the buffer size dynamically if data delivery to Elasticsearch is falling behind.
You can also specify a retry duration from 0 to 7200 seconds.
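A hedged boto3 sketch showing where these buffering and retry settings live when creating a delivery stream with an Elasticsearch destination (all ARNs, names, and sizes are placeholders):

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="logs-to-es",
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/my-domain",
        "IndexName": "stream-logs",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "RetryOptions": {"DurationInSeconds": 300},  # 0-7200 seconds
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            "BucketARN": "arn:aws:s3:::my-backup-bucket",
        },
    },
)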
SQS (Simple Queue Service) is a reliable, scalable, hosted queue for sending, storing, and retrieving messages between servers.
It supports multiple readers and writers, which means many components can share a single queue.
SQS also offers FIFO (first in, first out) queues: the order in which messages are sent and received is preserved, and a message is delivered once and remains available until a consumer processes and deletes it.
Queues can be created in any region. A queue can retain messages for up to 14 days, and messages can be sent and read simultaneously.
SQS supports long polling, which reduces cost while still retrieving messages as quickly as possible: when the queue is empty, a long-poll request waits up to 20 seconds for the next message to arrive.
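A minimal boto3 sketch of long polling (the queue URL is a placeholder):

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

# Long polling: the call waits up to 20 seconds for a message to arrive,
# which cuts the number of empty responses (and therefore cost).
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)
for message in response.get("Messages", []):
    print(message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])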
A web application places an order into an SQS request queue; the order sits in the queue until it is picked up by a free processing server. The server processes the order and sends the result back through a queue to the customer.
The queue length can be used to determine the load on the application: CloudWatch integrates with SQS to collect and view queue metrics, and you can make several different decisions (such as scaling) based on that data.
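A small sketch of reading the queue length that such a scaling decision could be based on (the queue URL is a placeholder):

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

# The approximate queue length is the signal CloudWatch (or your own code)
# can use to scale the fleet of processing servers out or back in.
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages"],
)
backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
print("Messages waiting:", backlog)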
When the web application places an order in the SQS queue, the order is picked up by one of the free processing servers; the processed order is sent back via a response queue to the customer. The advantage of using queues is that the processing fleet can scale out and scale back automatically based on the number of items in the SQS queue.
Messages can be processed at two priorities, high and low, by using a separate queue for each priority.
Once a message is received from an IoT device, AWS IoT rules and actions can send the data to an Elasticsearch Service domain, push it through a Kinesis Firehose stream, write it into a DynamoDB table, or run it against an Amazon Machine Learning model. IoT rule actions can also change a CloudWatch alarm, capture CloudWatch metrics, write data to S3 buckets or an SQS queue, send SNS push notifications, and finally invoke a Lambda function.
A device sends data to AWS IoT, which must have permission to write the data into the Elasticsearch domain.
An AWS IoT certificate is created, registered, and copied to the device. When the device communicates with AWS IoT, it presents the certificate as its identity credential.
AWS IoT also uses IAM policies for users, groups, and roles, for example when mobile applications authenticate against AWS IoT.
Authorization therefore uses two types of policies: AWS IoT policies and IAM policies.
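A rough boto3 sketch of creating an AWS IoT policy and attaching it to a device certificate (the policy name, statement, and certificate ARN are illustrative assumptions):

import json
import boto3

iot = boto3.client("iot", region_name="us-east-1")

# Hypothetical AWS IoT policy letting a device connect and publish.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["iot:Connect", "iot:Publish"],
            "Resource": "*",  # scope down to specific client/topic ARNs in practice
        }
    ],
}

iot.create_policy(
    policyName="sensor-publish-policy",  # placeholder name
    policyDocument=json.dumps(policy_document),
)

# Attach the IoT policy to the certificate the device presents as its
# identity credential (certificate ARN is a placeholder).
iot.attach_policy(
    policyName="sensor-publish-policy",
    target="arn:aws:iot:us-east-1:123456789012:cert/EXAMPLE",
)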
The device gateway enables one-to-one and one-to-many communication with devices.
Data Pipeline allows you to create ETL workflows that automate the movement of data at scheduled intervals and terminate the ETL resources after the work is completed.
One feature of Data Pipeline is the ability to move data across regions: you can copy an entire DynamoDB table to another region, or copy the table incrementally to another region.
You can control how frequently the table is copied. Data Pipeline helps you reliably process and move data between different AWS compute and storage services.
One of the best benefits of Data Pipeline is that every component that moves data from one location to another can be scheduled. Pipelines can also run on premises: AWS provides a Task Runner package that you install on your own on-premises hosts.
As soon as the package is installed, it polls Data Pipeline for work to carry out. When the pipeline runs an activity that selects an on-premises source, for example executing a database stored procedure, Data Pipeline issues the task to the Task Runner.
SqlDataNode: a SQL database and query representing data for a pipeline activity to use.
RedshiftDataNode: an Amazon Redshift table containing data for a RedshiftCopyActivity to use.
S3DataNode: an Amazon S3 location containing one or more files for a pipeline activity to use (see the sketch after this list).
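A heavily simplified boto3 sketch of creating a pipeline and registering a definition that includes an S3DataNode (names, roles, and the S3 path are assumptions, not a complete working pipeline):

import boto3

datapipeline = boto3.client("datapipeline", region_name="us-east-1")

# Create an empty pipeline; name and uniqueId are placeholders.
pipeline_id = datapipeline.create_pipeline(
    name="nightly-export", uniqueId="nightly-export-001"
)["pipelineId"]

# Minimal definition: a default object plus one S3DataNode.
# Roles, schedule type, and the S3 path are assumptions for illustration.
objects = [
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ],
    },
    {
        "id": "OutputData",
        "name": "OutputData",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-export-bucket/exports/"},
        ],
    },
]

datapipeline.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
datapipeline.activate_pipeline(pipelineId=pipeline_id)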
You can use the COPY command to copy data directly from DynamoDB into Redshift.
To export DynamoDB data using Data Pipeline, an EMR cluster is launched; EMR reads the DynamoDB table and writes it into an S3 bucket. To load data from S3 into DynamoDB, Data Pipeline again uses an EMR cluster, which writes the data into DynamoDB.