AWS Data Engineering Notes by Iusmanmaqbool

Cloud computing introduction, advantages and types.

Scale vertically: add more power (CPU, RAM, storage) to an existing server.

Scale horizontally: add more machines to the existing pool of resources.

Loosely coupled architecture: reduce dependencies between components so that a change or failure in one component does not cascade to another.

Redshift is a SQL-based relational database service used as a data warehouse.
Kinesis

Kinesis is a platform for streaming data on AWS: it is used to collect, process, and analyze real-time streaming data.

Typical use cases:

Logging

Real-time metrics and reporting

Real-time data analytics and complex stream processing


Kinesis is real-time, fully managed, and scalable.

 Producer app: adds records to the stream.

 Consumer app: reads records from the stream, typically less than one second after they are added.

Producers continuously push records to the Kinesis stream, and consumer applications read them and store the output in S3, DynamoDB, Redshift, or EMR, or forward it through Firehose. A minimal consumer that archives records to S3 is sketched below.
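A rough sketch of such a consumer, assuming a stream named example-stream and a bucket named example-archive (both placeholders): it reads records from one shard and archives each data blob to S3.

import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

STREAM_NAME = "example-stream"   # assumed stream name
BUCKET_NAME = "example-archive"  # assumed bucket name

# Find the first shard of the stream and get an iterator for it.
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",   # start from the oldest available record
)["ShardIterator"]

# Read a batch of records and archive each data blob to S3.
records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
for record in records:
    s3.put_object(
        Bucket=BUCKET_NAME,
        Key=f"archive/{record['SequenceNumber']}",
        Body=record["Data"],
    )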
Two types of resharding:

Shard split: a single shard is divided into two shards.

Shard merge: two adjacent shards are merged into a single shard.

Both operations are sketched below.
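A minimal sketch with boto3; the stream name, shard IDs, and hash key are placeholders, not values from these notes.

import boto3

kinesis = boto3.client("kinesis")

# Shard split: one shard is divided into two at the chosen hash key.
kinesis.split_shard(
    StreamName="example-stream",
    ShardToSplit="shardId-000000000000",
    NewStartingHashKey=str(2**127),   # midpoint of the 128-bit hash-key space
)

# Shard merge: two adjacent shards are combined into a single shard.
kinesis.merge_shards(
    StreamName="example-stream",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)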

 Partition key (up to 256 characters): used to distribute a stream's data across shards. Kinesis Data Streams segregates the records belonging to a stream into multiple shards, using the partition key associated with each record to determine which shard a given record belongs to.

 Sequence number: increases over time and is unique within a shard; it is the unique identifier for a record inserted into a shard.

 Data blob: the actual data the producer adds to the stream; its size must be 1 MB or less.

For example, a stream with three shards holds records with different partition keys in each shard, and each record is identified within its shard by its sequence number.

 Retention period: the maximum duration after which data added to the stream expires.
 The default retention period is 24 hours, and it can be increased to seven days if required (see the sketch below).
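For example, raising the retention period from the 24-hour default to seven days (168 hours) for an assumed stream name:

import boto3

kinesis = boto3.client("kinesis")

# Raise retention from the default 24 hours to 7 days (168 hours).
kinesis.increase_stream_retention_period(
    StreamName="example-stream",   # placeholder name
    RetentionPeriodHours=168,
)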
Two types of batching:

 Aggregation: combining multiple records into a single stream record.

 Collection: using the PutRecords API operation to send multiple data stream records to one or more shards of a Kinesis data stream in a single call (sketched below).
Batching is turned on by default.
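A minimal sketch of the collection style of batching, assuming a stream named example-stream: one PutRecords call carries several records, which may land on different shards depending on their partition keys.

import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_records(
    StreamName="example-stream",
    Records=[
        {"Data": b'{"event": "click"}',    "PartitionKey": "user-1"},
        {"Data": b'{"event": "view"}',     "PartitionKey": "user-2"},
        {"Data": b'{"event": "purchase"}', "PartitionKey": "user-3"},
    ],
)

# Records that could not be written are reported here and should be retried.
print(response["FailedRecordCount"])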
Data from the stream can be delivered to Glacier, S3, Redshift, and the Elasticsearch Service, or processed with Kinesis Analytics, EMR, Lambda, or EC2, before finally going to visualization tools.


Data Consumers:

Consumers are used to pull data from the data stream and process it.

Consumers can emit data to S3, DynamoDB, Elasticsearch, Redshift, and EMR.

Since multiple applications can consume data from the same stream, you might have a case where some of the data is sent to a service like Redshift while the rest is archived in S3, with an object lifecycle policy then archiving it to Glacier.

Use cases

 DynamoDB: data in the stream could be sent to DynamoDB, for example for a gaming app whose dashboard reads the data.

 Redshift: if data needs to be added to a data warehouse, send it to Redshift.

 Elasticsearch: if you need to make data from the stream searchable, send it to Elasticsearch using the KCL.

Producers capture and send data to the Kinesis data stream, which ingests and stores data streams for processing.


Kinesis Data Firehose:

 No applications or resources to manage.
 You configure data producers to send source records to a Firehose delivery stream.
 Firehose automatically delivers the data to the configured destination and can also transform records before delivery.
 If the configured destination is S3, Firehose delivers the data to your S3 bucket.

Web servers (producers) generate logs and send the data to the Kinesis Firehose delivery stream in the form of records.

 Each record can be as large as 1,000 KB.


 For each delivery stream you set a buffer size or a buffer interval.
 Firehose buffers incoming streaming data up to a certain size or for a certain period before delivering it to S3 or Elasticsearch.
 The buffer size can be set from 1 MB to 128 MB, and the buffer interval from 60 to 900 seconds.
 Whichever condition is satisfied first triggers data delivery to S3 or Elasticsearch.
 For Redshift, the data from the Firehose delivery stream is first put into S3 and then loaded into Redshift using the COPY command.
 After the data has been pulled into Redshift, you can use an S3 object lifecycle policy to archive the data to Glacier.

A minimal example of configuring these buffering hints is sketched below.
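A sketch of setting the buffer hints when creating a delivery stream with an S3 destination; the stream name, role ARN, and bucket ARN are placeholders.

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="example-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::example-archive",                         # placeholder
        # Whichever hint is reached first (size or interval) triggers delivery.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
    },
)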
There are two ways to send data to Firehose:
Kinesis Agent
AWS SDK

Agents

1) A standalone Java software application that offers an easy way to collect and send data to a data stream or a Kinesis Firehose delivery stream; the agent continuously monitors a set of files and sends new data to the stream.
2) Allows you to collect and send data to Firehose.
3) Can be installed on Linux-based web servers, log servers, or database servers.
4) Also publishes CloudWatch metrics to help with monitoring.
5) Can pre-process records from the monitored files before sending them to the destination stream; this includes converting multiple lines into a single line, converting delimited records to JSON, and converting records from log format to JSON.
6) The agent can monitor multiple file directories and write to multiple streams; it can also pre-process and convert records before sending them to the stream or delivery stream.

Firehose has two types of write operations:

* PutRecord

Writes a single data record into an Amazon Kinesis Firehose delivery stream.

* PutRecordBatch

Writes multiple records into the delivery stream in a single call, which can achieve higher throughput than writing single records.

Both calls are sketched below.
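A minimal sketch with boto3 against an assumed delivery stream name:

import boto3

firehose = boto3.client("firehose")

# PutRecord: write a single record.
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",
    Record={"Data": b'{"event": "click"}\n'},   # newline keeps the S3 objects line-delimited
)

# PutRecordBatch: write several records in one call for higher throughput.
response = firehose.put_record_batch(
    DeliveryStreamName="example-delivery-stream",
    Records=[
        {"Data": b'{"event": "view"}\n'},
        {"Data": b'{"event": "purchase"}\n'},
    ],
)
print(response["FailedPutCount"])   # records to retry, if any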
 You don't write any code: just create a delivery stream and configure the destination.
 Clients send data into the stream using AWS API calls (PutRecord and PutRecordBatch), and the data is automatically sent to the destination.
 When the configured destination is Amazon S3, Firehose delivers the data directly to S3.
 For an Amazon Redshift destination, the data is first delivered to Amazon S3, and then a Redshift COPY command is executed to load the data into Redshift.

Workout
Put and get records from a Kinesis data stream
Go to the AWS CLI

Step 1

Step 2

Step 3: check that the stream is ACTIVE using the previous command

Step 4: check the stream name

Step 5: put a record using a partition key

Step 6

Step 7

Step 8: decode the record data (the CLI returns it base64-encoded)

Result: testdata (the data the user entered)

Step 9: delete the stream

Step 9.1
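A boto3 sketch of the put-and-get flow described in the steps above, with an assumed stream name and data value. Note that the CLI returns record data base64-encoded (hence the decoding step), while boto3 returns raw bytes.

import boto3

kinesis = boto3.client("kinesis")
STREAM = "test-stream"   # assumed stream name

# Steps 1-4: create the stream and wait until it is ACTIVE.
kinesis.create_stream(StreamName=STREAM, ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName=STREAM)

# Step 5: put a record using a partition key.
kinesis.put_record(StreamName=STREAM, Data=b"testdata", PartitionKey="key-1")

# Steps 6-8: read the record back and decode the data blob.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator)["Records"]
print(records[0]["Data"].decode())   # -> testdata

# Step 9: delete the stream.
kinesis.delete_stream(StreamName=STREAM)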

 Lambda is an example of serverless computing, which allows you to run code without provisioning or managing servers.
 With Lambda you can run code for virtually any type of application or backend service with zero administration.
 Just upload your code; Lambda takes care of everything required to run and scale your code with high availability.
 You can set up your code to be triggered automatically by other AWS services or call it directly from any web or mobile app.
Flow:

 Firehose buffers incoming data up to 3 MB or the buffering size specified for the delivery stream, whichever is reached first.
 Firehose invokes the specified Lambda function asynchronously with each buffered batch.
 The transformed data is sent from Lambda back to Firehose for buffering.
 The transformed data is then sent to the destination when the specified buffering size or buffering interval is reached.

A sketch of such a transformation function follows.
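This sketch follows the Firehose transformation record contract (recordId / result / data, with data base64-encoded); the transformation itself (upper-casing the payload) is only a placeholder.

import base64

def lambda_handler(event, context):
    """Transform each record in the buffered batch handed over by Firehose."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper() + "\n"   # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",   # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}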

Failure handling

 If data transformation fails, the unsuccessfully processed records are delivered to your S3 bucket in a processing-failed folder.
 Firehose retries the invocation three times and then discards the batch of records if the retries do not succeed.
 The retry delivery period is 24 hours.
 Invocation errors and records that have a status of ProcessingFailed can be emitted to CloudWatch Logs.

Data flow:

For the S3 destination, streaming records are delivered to your S3 bucket. If data transformation is enabled, you can optionally back up the source data to another S3 bucket.

Redshift destination:

Streaming records are first kept in an S3 bucket. To load the data from the S3 bucket, Kinesis Data Firehose issues an Amazon Redshift COPY command to your Redshift cluster.

If data transformation is enabled, you can optionally back up the source data to another Amazon S3 bucket.

Data delivery frequency to S3:

 Buffer size: 1 MB to 128 MB.
 Buffer interval: 60 to 900 seconds.
 Firehose can raise the buffer size dynamically to catch up in cases where data delivery to the destination is falling behind.
 Firehose retries delivery to your destination for up to 24 hours.

Data delivery frequency to Redshift

 The frequency depends on how fast the Redshift cluster finishes the COPY command.
 The retry duration for Redshift is 0 to 7,200 seconds.
 Firehose issues a new COPY command automatically once the previous COPY command has completed.

Data delivery frequency to Elasticsearch

 Delivery to Elasticsearch also depends on the buffer size (1 MB to 100 MB) and buffer interval (60 to 900 seconds).
 Firehose can raise the buffer size dynamically if data delivery to Elasticsearch is falling behind.
 You can also specify a retry duration from 0 to 7,200 seconds.

 SQS stands for Simple Queue Service: a reliable, scalable, hosted queue service for sending, storing, and retrieving messages between servers.

 SQS supports multiple readers and writers, which means many components can share a single queue.
 SQS also offers FIFO (first-in, first-out) queues, in which the order in which messages are sent and received is strictly preserved; a message is delivered once and remains available until a consumer processes and deletes it.
 Queues can be created in any region, and messages can be retained in a queue for up to 14 days; messages can be sent and read simultaneously.
 SQS supports long polling, which reduces cost while still retrieving messages as quickly as possible.
 When the queue is empty, a long-poll request waits up to 20 seconds for the next message to arrive (see the sketch below).
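A small sketch of sending a message and long-polling for it, with an assumed queue name:

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="example-queue")["QueueUrl"]

sqs.send_message(QueueUrl=queue_url, MessageBody="order-123")

# Long polling: wait up to 20 seconds for a message instead of returning at once.
response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
for message in response.get("Messages", []):
    print(message["Body"])
    # Delete the message once it has been processed.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])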

 A web application places an order in an SQS request queue; the order sits in the queue until it is picked up by a free processing server. The server processes the order and sends the result back through a response queue to the customer.
 The queue can be used to determine the load on the application: CloudWatch integrates with SQS to collect and view metrics such as queue length, and you can make several different decisions (for example, scaling) based on that data.

When the web application places an order in the SQS queue, the order is picked up by one of the free processing servers; the processed order is sent back to the priority order queue and on to the customer order queue. The advantage of using queues is that Auto Scaling can scale out and scale back based on the number of items in the SQS queue.

Messages are processed at two priorities, high and low, using separate queues (see the sketch below).
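One way to sketch the two-priority pattern (the queue URLs are placeholders): the worker drains the high-priority queue first and only falls back to the low-priority queue when it is empty.

import boto3

sqs = boto3.client("sqs")
HIGH_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-high"  # placeholder
LOW_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-low"    # placeholder

def next_message():
    """Return the next message, preferring the high-priority queue."""
    for url in (HIGH_URL, LOW_URL):
        response = sqs.receive_message(QueueUrl=url, MaxNumberOfMessages=1, WaitTimeSeconds=1)
        messages = response.get("Messages", [])
        if messages:
            return url, messages[0]
    return None, None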
Once a message is received from an IoT device, AWS IoT rules and actions can send the data to an Elasticsearch Service domain, send it to a Kinesis Firehose stream, write it to a DynamoDB table, run it through an Amazon Machine Learning model for predictions, change a CloudWatch alarm, capture CloudWatch metrics, write it to S3 buckets, send it to an SQS queue, publish an SNS push notification, and finally invoke a Lambda function. An example rule is sketched below.
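A hedged sketch of such a rule created through the SDK, forwarding device messages to a Kinesis Firehose delivery stream; the rule name, topic filter, role ARN, and delivery stream name are all assumptions.

import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="forward_sensor_data",                     # assumed rule name
    topicRulePayload={
        "sql": "SELECT * FROM 'sensors/+/telemetry'",   # assumed topic filter
        "ruleDisabled": False,
        "actions": [
            {
                "firehose": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-firehose-role",  # placeholder
                    "deliveryStreamName": "example-delivery-stream",
                }
            }
        ],
    },
)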

A device sends data to AWS IoT; the IoT rule must have permission to write the data into the Elasticsearch Service domain.
An AWS IoT certificate is created, registered, and copied to the device. When the device communicates with AWS IoT, it presents the certificate as an identity credential.

AWS IoT uses IAM policies for users, groups, and roles, and for mobile application authentication against AWS IoT.

Authorization uses two types of policy: AWS IoT policies and IAM policies.
The device gateway supports one-to-one and one-to-many communication with devices.
Data Pipeline allows you to create ETL workflows, automate the movement of data at scheduled intervals, and terminate ETL resources after the work has completed.

One feature of Data Pipeline is the ability to move data across regions: you can copy an entire DynamoDB table to another region, or incrementally copy a table to another region.

You can also control the time and frequency of the copies. Data Pipeline helps you reliably process and move data between different AWS compute and storage services.
One of the main benefits of Data Pipeline is that all components can be scheduled to move data from one location to another.

A pipeline can also run on premises: AWS provides a Task Runner package that you install on your on-premises hosts.

As soon as the package is installed, it polls the pipeline for work to carry out. When the pipeline runs an activity on an on-premises source, for example executing a database stored procedure, Data Pipeline issues the task to the Task Runner.

In this way Data Pipeline integrates with on-premises environments. A minimal SDK sketch of creating and activating a pipeline follows.
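This sketch creates a pipeline, uploads a definition, and activates it. The definition below contains only a default object with assumed roles and an on-demand schedule; a real definition would add the data nodes and activities described next.

import boto3

datapipeline = boto3.client("datapipeline")

pipeline_id = datapipeline.create_pipeline(
    name="example-etl", uniqueId="example-etl-001"
)["pipelineId"]

datapipeline.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},                  # assumed role
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},  # assumed role
            ],
        },
    ],
)

datapipeline.activate_pipeline(pipelineId=pipeline_id)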


DynamoDB data node: a DynamoDB table that contains data for HiveActivity or EmrActivity to use.

SQL data node: a SQL table and database query that represent data for a pipeline activity to use.

Redshift data node: an Amazon Redshift table that contains data for RedshiftCopyActivity to use.

S3 data node: an Amazon S3 location that contains one or more files for a pipeline activity to use.
Using the COPY command, data can be copied directly from DynamoDB into Redshift.

To export DynamoDB data using Data Pipeline, an EMR cluster is launched; EMR reads the DynamoDB table and writes the data into an S3 bucket. If you want to write data from S3 into DynamoDB, Data Pipeline again uses an EMR cluster, which writes the data into DynamoDB.
