AWS (S3, IAM, EC2, EMR and Redshift)
Amazon S3
• S3 is a simple storage service provided by AWS to store objects
• It allows people to store objects ("files") in S3 buckets
• Bucket names must be globally unique
• Buckets are defined at the region level
• The key is the full path: s3://bucket_name/……….
• The maximum object size is 5 TB; objects larger than 5 GB must be uploaded using multipart upload
• S3 supports versioning
• S3 supports encryption
• Security can be defined on S3 buckets
• By default, S3 buckets are private
• S3 can host static websites
• Bucket ownership cannot be transferred
Amazon IAM
• IAM (Identity and Access Management) is at the core of AWS security.
• AWS security is applied at the level of:
• Users
• Groups
• Roles
• One IAM user per physical person.
• One IAM role per application.
Amazon Redshift
• Redshift is a fully managed, petabyte-scale data warehouse service provided by AWS.
• It provides faster query performance through MPP, columnar storage, data compression, query optimization, result caching, and compiled code.
• It is mainly used for OLAP and analytics purposes.
Amazon Redshift Architecture
• The Redshift architecture is based on a master-slave concept and can broadly be described in terms of:
• Client Applications
• Connections
• Clusters
• Leader Node
• Compute Nodes
• Node Slices
• Internal Network
• Databases
Amazon Redshift Architecture
• Client Applications
• Based on PostgreSQL
• Most existing client applications, such as ETL and BI tools, can connect to it with minimal changes.

• Connections
• It uses industry-standard PostgreSQL ODBC and JDBC drivers.

• Clusters
• The cluster is the core component and is composed of one or more compute nodes.
• When two or more compute nodes are present, a leader node coordinates the compute nodes and handles external communication.
• Applications communicate with the leader node.
Amazon Redshift Architecture
• Leader Node
• Handles all communication with clients and compute nodes.
• Parses queries and develops the execution plan.
• Compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

• Compute Nodes
• The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
• Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type.
• As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Amazon Redshift Architecture
• Node Slices
• Each compute node is partitioned into slices.
• Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
• The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
• The number of slices per node is determined by the node size of the cluster.

• Internal Network
• Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication
protocols to provide private, very high-speed network communication between the leader node and compute nodes.
The compute nodes run on a separate, isolated network that client applications never access directly.
Amazon Redshift Architecture
• Databases
• A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates
with the leader node, which in turn coordinates query execution with the compute nodes.
• Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS
applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing
(OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and
reporting of very large datasets.
Amazon Redshift Performance
• Massively Parallel Processing
• Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.
• Columnar Storage
• Reduces the overall disk I/O requirements and the amount of data you need to load from disk.
• Since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O.
• Workload Management
• Workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries (see the sketch after this list).
• Data Compression
• Reduces storage requirements.
• Compressed data is read into memory and uncompressed during query execution.
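
For illustration, a minimal sketch of steering a session's queries to a WLM queue via a query group; the group label 'reporting' and the table sales are hypothetical and would have to exist in the cluster's WLM configuration and database:

set query_group to 'reporting';  -- route this session's queries to the matching WLM queue
select count(*) from sales;      -- runs in the 'reporting' queue
reset query_group;               -- return to the default queue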
Amazon Redshift Performance
• Query Optimizer
• The query optimizer is MPP-aware and takes advantage of columnar storage, significantly enhancing query execution.
• Result Caching
• The results of frequently executed queries are cached.
• Compiled Code
• Compiled code is shared across the compute nodes, so the overhead of an interpreter is removed.
Amazon Redshift Compression
• Compression encodings (see the DDL sketch after this list):
• Raw (RAW) – no compression – all data types
• Byte-dictionary (BYTEDICT) – all except BOOLEAN (a separate dictionary of unique values is created for each block of column values on disk)
• Delta (DELTA, DELTA32K) – compresses data by recording the differences between values – e.g. INT, TIMESTAMP
• LZO (LZO) – works well with CHAR and VARCHAR, provides a high compression ratio – all except BOOLEAN, REAL, and DOUBLE PRECISION
• Mostly (MOSTLY8, MOSTLY16, MOSTLY32) – compresses values of a column to a smaller standard storage size – INT
• Run-length (RUNLENGTH) – consists of a count of the number of times the value occurs – e.g. VARCHAR
• Text (TEXT255, TEXT32K) – a separate dictionary of unique words is created – useful when the same words recur often
• Zstandard (ZSTD) – supports all data types
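
For illustration, a minimal DDL sketch (table and column names are hypothetical) showing how a compression encoding is declared per column; in practice, ANALYZE COMPRESSION can be run against an existing table to get recommended encodings:

create table sales (
    sale_id     integer       encode raw,       -- no compression
    region      char(12)      encode bytedict,  -- small set of distinct values
    quantity    integer       encode delta,     -- stores differences between consecutive values
    status_code integer       encode mostly8,   -- most values fit in one byte
    flag        varchar(10)   encode runlength, -- long runs of repeated values
    notes       varchar(255)  encode lzo,       -- free-form text, high compression ratio
    description varchar(255)  encode text255,   -- recurring words
    amount      decimal(10,2) encode zstd       -- general purpose, works on all types
);

analyze compression sales;  -- reports suggested encodings per column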
Tuning Query Performance
• Distribution Style (see the DDL sketch after this list)
• Auto: the default if no distribution style is specified. Amazon Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger; the change happens automatically in the background.
• All: A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution place only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in.
• Even: The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
• Key: The rows are distributed according to the values in one column. The leader node places matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns. This way, matching values from the common columns are physically stored together.
• The number of nodes, processors, and slices also affects performance.
• Node types – dense storage and dense compute
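
For illustration, a minimal DDL sketch (hypothetical table and column names) showing the three explicit distribution styles:

create table dim_date (date_id integer, calendar_date date)
diststyle all;                        -- small dimension table copied to every node

create table staging_events (event_id integer, payload varchar(256))
diststyle even;                       -- round-robin; table not used in joins

create table fact_orders (order_id integer, customer_id integer, amount decimal(10,2))
diststyle key distkey (customer_id);  -- collocate rows joined on customer_id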
Tuning Query Performance
• Sort Keys (see the DDL sketch after this list)
• Data is stored in 1 MB disk blocks.
• The min and max values for each block are stored (the zone map).
• Compound Sort Keys
• A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed. A compound sort key is most useful when a query's filter applies conditions, such as filters and joins, that use a prefix of the sort keys. The performance benefits of compound sorting decrease when queries depend only on secondary sort columns, without referencing the primary columns. COMPOUND is the default sort type.
• Interleaved Sort Keys
• An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns, interleaved sorting significantly improves query performance as compared to compound sorting.
• Don't use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
• Interleaved sorting uses a Z-order curve for creating the zone map.
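
For illustration, a minimal DDL sketch (hypothetical tables and columns) contrasting the two sort styles:

-- compound: best when filters use a prefix of the listed columns (here event_date first)
create table events_compound (
    event_date date,
    user_id    integer,
    event_type varchar(20)
)
compound sortkey (event_date, user_id);

-- interleaved: equal weight per column; note the absence of monotonically
-- increasing columns such as dates or identity columns
create table events_interleaved (
    region     varchar(20),
    device     varchar(20),
    event_type varchar(20)
)
interleaved sortkey (region, device);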
Tuning Query Performance
• Vacuum (see the examples after this list)
• Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. VACUUM FULL is the default.
• Full
• A full vacuum doesn't perform a reindex for interleaved tables. To reindex interleaved tables followed by a full vacuum, use the VACUUM REINDEX option.
• Sort Only
• Sorts the specified table (or all tables in the current database) without reclaiming space freed by deleted rows. This option is useful when reclaiming disk space is not important but resorting new rows is important.
• Delete Only
• Amazon Redshift automatically performs a DELETE ONLY vacuum in the background, so you rarely, if ever, need to run a DELETE ONLY vacuum.
• Reindex
• Analyzes the distribution of the values in interleaved sort key columns, then performs a full VACUUM operation. If REINDEX is used, a table name is required.
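
For illustration (the table names events and events_interleaved are hypothetical):

vacuum full events;                 -- default: sort and reclaim space
vacuum sort only events;            -- resort rows without reclaiming space
vacuum delete only events;          -- reclaim space without sorting
vacuum reindex events_interleaved;  -- reanalyze interleaved sort keys, then run a full vacuum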
Upsert – the Redshift way
To perform a merge operation by replacing existing rows:

1. Create a staging table, and then populate it with the data to be merged, as shown in the following pseudocode.

create temp table stage (like target);
insert into stage
select * from source
where source.filter = 'filter_expression';

2. Use an inner join with the staging table to delete the rows from the target table that are being updated. Put the delete and insert operations in a single transaction block so that if there is a problem, everything will be rolled back.

begin transaction;
delete from target
using stage
where target.primarykey = stage.primarykey;

3. Insert all of the rows from the staging table.

insert into target
select * from stage;
end transaction;

4. Drop the staging table.

drop table stage;
