AWS (S3, IAM, EC2, EMR and Redshift)
Amazon S3
• S3 is a simple storage service provided by AWS to store objects
• It allows people to store objects ("files") in S3 buckets
• Bucket names must be globally unique
• Buckets are defined at the region level
• The key is the full path: s3://bucket_name/……….
• The maximum object size is 5 TB; objects larger than 5 GB must be uploaded using multipart upload
• S3 supports versioning
• S3 supports encryption
• Security can be defined on S3 buckets
• By default, S3 buckets are private
• S3 can host static websites
• Bucket ownership cannot be transferred
Amazon IAM
• IAM (Identity and Access Management) is at the core of AWS security.
• AWS security is applied at the level of:
• Users
• Groups
• Roles
• One IAM user per physical person.
• One IAM role per application.
Amazon Redshift
• Redshift is a fully managed, petabyte-scale data warehouse service provided by AWS.
• It provides faster query performance through MPP, columnar storage, data compression, query optimization, result caching, and compiled code.
• It is mainly used for OLAP and analytics purposes.
Amazon Redshift Architecture
• The Redshift architecture is based on a master-slave concept and can broadly be described in terms of:
• Client Applications
• Connections
• Clusters
• Leader Node
• Compute Nodes
• Node Slices
• Internal Network
• Databases
Amazon Redshift Architecture
• Client Applications
• Based on PostgreSQL
• Most existing client applications, such as ETL and BI tools, can connect to it with minimal changes.

• Connections
• It uses industry-standard PostgreSQL ODBC and JDBC drivers.

• Clusters
• The cluster is the core component and is composed of one or more compute nodes.
• When two or more compute nodes are present, a leader node coordinates the compute nodes and handles external communication.
• Applications communicate with the leader node.
Amazon Redshift Architecture
• Leader Node
• Handles all communication with clients and compute nodes.
• Parses queries and develops the execution plan.
• Compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

• Compute Nodes
• The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
• Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type.
• As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Amazon Redshift Architecture
• Node Slices
• Each compute node is partitioned into slices.
• Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
• The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
• The number of slices per node is determined by the node size of the cluster.

• Internal Network
• Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication
protocols to provide private, very high-speed network communication between the leader node and compute nodes.
The compute nodes run on a separate, isolated network that client applications never access directly.
Amazon Redshift Architecture
• Databases
• A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates
with the leader node, which in turn coordinates query execution with the compute nodes.
• Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS
applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing
(OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and
reporting of very large datasets.
Amazon Redshift Performance
• Massively Parallel Processing
• Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.
• Columnar Storage
• Reduces the overall disk I/O requirements and the amount of data you need to load from disk.
• Since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O.
• Workload Management
• Workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries (see the sketch after this list).
• Data Compression
• Reduces storage requirements.
• Compressed data is read into memory and uncompressed during query execution.
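
For illustration, a minimal sketch of steering a session's queries to a WLM queue via a query group; the group label 'reporting' and the table sales are hypothetical and would have to exist in the cluster's WLM configuration and database:

set query_group to 'reporting';  -- route this session's queries to the matching WLM queue
select count(*) from sales;      -- runs in the 'reporting' queue
reset query_group;               -- return to the default queue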
Amazon Redshift Performance
• Query Optimizer
• The query optimizer is MPP-aware and takes advantage of columnar storage, significantly enhancing query execution.
• Result Caching
• The results of frequently executed queries are cached.
• Compiled Code
• Compiled code is shared across the compute nodes, so the overhead of an interpreter is removed.
Amazon Redshift Compression
• Compression encodings (see the DDL sketch after this list):
• Raw (RAW) – no compression – all data types
• Byte-dictionary (BYTEDICT) – all except BOOLEAN (a separate dictionary of unique values is created for each block of column values on disk)
• Delta (DELTA, DELTA32K) – compresses data by recording the differences between values – e.g. INT, TIMESTAMP
• LZO (LZO) – works well with CHAR and VARCHAR, provides a high compression ratio – all except BOOLEAN, REAL, and DOUBLE PRECISION
• Mostly (MOSTLY8, MOSTLY16, MOSTLY32) – compresses values of a column to a smaller standard storage size – INT
• Run-length (RUNLENGTH) – consists of a count of the number of times the value occurs – e.g. VARCHAR
• Text (TEXT255, TEXT32K) – a separate dictionary of unique words is created – useful when the same words recur often
• Zstandard (ZSTD) – supports all data types
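
For illustration, a minimal DDL sketch (table and column names are hypothetical) showing how a compression encoding is declared per column; in practice, ANALYZE COMPRESSION can be run against an existing table to get recommended encodings:

create table sales (
    sale_id     integer       encode raw,       -- no compression
    region      char(12)      encode bytedict,  -- small set of distinct values
    quantity    integer       encode delta,     -- stores differences between consecutive values
    status_code integer       encode mostly8,   -- most values fit in one byte
    flag        varchar(10)   encode runlength, -- long runs of repeated values
    notes       varchar(255)  encode lzo,       -- free-form text, high compression ratio
    description varchar(255)  encode text255,   -- recurring words
    amount      decimal(10,2) encode zstd       -- general purpose, works on all types
);

analyze compression sales;  -- reports suggested encodings per column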
Tuning Query Performance
• Distribution Style (see the DDL sketch after this list)
• Auto: the default if no distribution style is specified. Amazon Redshift initially assigns ALL distribution to a small table, then changes to EVEN distribution when the table grows larger; the change happens automatically in the background.
• All: A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution place only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in.
• Even: The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.
• Key: The rows are distributed according to the values in one column. The leader node places matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns. This way, matching values from the common columns are physically stored together.
• The number of nodes, processors, and slices also affects performance.
• Node types – dense storage and dense compute
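
For illustration, a minimal DDL sketch (hypothetical table and column names) showing the three explicit distribution styles:

create table dim_date (date_id integer, calendar_date date)
diststyle all;                        -- small dimension table copied to every node

create table staging_events (event_id integer, payload varchar(256))
diststyle even;                       -- round-robin; table not used in joins

create table fact_orders (order_id integer, customer_id integer, amount decimal(10,2))
diststyle key distkey (customer_id);  -- collocate rows joined on customer_id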
Tuning Query Performance
• Sort Keys (see the DDL sketch after this list)
• Data is stored in 1 MB disk blocks.
• The min and max values for each block are stored (the zone map).
• Compound Sort Keys
• A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed. A compound sort key is most useful when a query's filter applies conditions, such as filters and joins, that use a prefix of the sort keys. The performance benefits of compound sorting decrease when queries depend only on secondary sort columns, without referencing the primary columns. COMPOUND is the default sort type.
• Interleaved Sort Keys
• An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns, interleaved sorting significantly improves query performance as compared to compound sorting.
• Don't use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
• Interleaved sorting uses a Z-order curve for creating the zone map.
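
For illustration, a minimal DDL sketch (hypothetical tables and columns) contrasting the two sort styles:

-- compound: best when filters use a prefix of the listed columns (here event_date first)
create table events_compound (
    event_date date,
    user_id    integer,
    event_type varchar(20)
)
compound sortkey (event_date, user_id);

-- interleaved: equal weight per column; note the absence of monotonically
-- increasing columns such as dates or identity columns
create table events_interleaved (
    region     varchar(20),
    device     varchar(20),
    event_type varchar(20)
)
interleaved sortkey (region, device);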
Tuning Query Performance
• Vacuum (see the examples after this list)
• Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. VACUUM FULL is the default.
• Full
• A full vacuum doesn't perform a reindex for interleaved tables. To reindex interleaved tables followed by a full vacuum, use the VACUUM REINDEX option.
• Sort Only
• Sorts the specified table (or all tables in the current database) without reclaiming space freed by deleted rows. This option is useful when reclaiming disk space is not important but resorting new rows is important.
• Delete Only
• Amazon Redshift automatically performs a DELETE ONLY vacuum in the background, so you rarely, if ever, need to run a DELETE ONLY vacuum.
• Reindex
• Analyzes the distribution of the values in interleaved sort key columns, then performs a full VACUUM operation. If REINDEX is used, a table name is required.
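
For illustration (the table names events and events_interleaved are hypothetical):

vacuum full events;                 -- default: sort and reclaim space
vacuum sort only events;            -- resort rows without reclaiming space
vacuum delete only events;          -- reclaim space without sorting
vacuum reindex events_interleaved;  -- reanalyze interleaved sort keys, then run a full vacuum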
Upsert – the Redshift way
To perform a merge operation by replacing existing rows:

1. Create a staging table, and then populate it with the data to be merged, as shown in the following pseudocode.

create temp table stage (like target);
insert into stage
select * from source
where source.filter = 'filter_expression';

2. Use an inner join with the staging table to delete the rows from the target table that are being updated. Put the delete and insert operations in a single transaction block so that if there is a problem, everything will be rolled back.

begin transaction;
delete from target
using stage
where target.primarykey = stage.primarykey;

3. Insert all of the rows from the staging table.

insert into target
select * from stage;
end transaction;

4. Drop the staging table.

drop table stage;
