Module 3: AWS Database Services
• AWS Lambda
• Amazon DynamoDB
• Amazon ECS (Elastic Container Service) & Amazon S3 Glacier
• Amazon Kinesis, Amazon Redshift
• Amazon EMR (Elastic MapReduce), AWS Disaster Recovery and Backup
Amazon EMR (Elastic MapReduce)
• Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
• Using these frameworks and related open-source projects,
you can process data for analytics purposes and business
intelligence workloads.
• Amazon EMR also lets you transform and move large
amounts of data into and out of other AWS data stores and
databases, such as Amazon Simple Storage Service
(Amazon S3) and Amazon DynamoDB.
Amazon EMR Use Cases
• Machine learning. EMR's built-in ML tools use the Hadoop framework to create a variety of
algorithms to support decision-making, including decision trees, random forests, support-vector
machines and logistic regression.
• Extract, transform and load. ETL is the process of moving data from one or more data stores to
another. Data transformations -- such as sorting, aggregating and joining -- can be done using EMR.
• Clickstream analysis. Clickstream data from Amazon S3 can be analyzed with Apache Spark and Apache Hive. Apache Spark is an open-source data processing tool that can help make data easy to manage and analyze. Spark uses a framework that enables jobs to run across large clusters of computers and can process data in parallel (a minimal PySpark sketch follows this list).
• Real-time streaming. Users can analyze events using streaming data sources in real time with
Apache Spark Streaming and Apache Flink. This enables streaming data pipelines to be created on
EMR.
• Interactive analytics. EMR Notebooks provide a managed, secure, scalable, and reliable environment for data analytics. Using Jupyter Notebook -- an open-source web application data scientists can use to create and share live code and equations -- data can be prepared and visualized for interactive analytics.
• Genomics. Organizations can use EMR to process genomic data, making data processing and analysis scalable for industries such as medicine and the life sciences.
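The ETL and clickstream bullets above amount to a read-transform-write pipeline on EMR. The sketch below is a minimal PySpark illustration of that flow; the bucket names, paths, and column names (user_id, page, ts) are hypothetical placeholders, and on EMR the s3:// paths are handled by EMRFS.

```python
# Minimal PySpark sketch of the ETL/clickstream use case above. Bucket names,
# paths, and column names (user_id, page, ts) are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-etl").getOrCreate()

# Extract: read raw clickstream events stored as JSON in Amazon S3.
clicks = spark.read.json("s3://example-raw-bucket/clickstream/2024/")

# Transform: aggregate page views per user per day (sorting/aggregating).
daily_views = (
    clicks
    .withColumn("day", F.to_date("ts"))
    .groupBy("user_id", "day", "page")
    .agg(F.count("*").alias("views"))
)

# Load: write the transformed data back to S3 in a columnar format.
daily_views.write.mode("overwrite").partitionBy("day").parquet(
    "s3://example-curated-bucket/clickstream_daily/"
)

spark.stop()
```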
Amazon EMR deployment options
• Amazon EMR on Amazon EC2. Amazon EMR can quickly process large amounts of data using Amazon EC2. Users can configure Amazon EMR to take advantage of On-Demand, Reserved, and Spot Instances (a cluster-launch sketch follows this list).
• Amazon EMR on Amazon Elastic Kubernetes Service (EKS). The
Amazon EMR console enables users to run Apache Spark applications
with other applications on the same EKS cluster. Organizations can
share compute and memory resources across all applications and use a
Kubernetes tool to monitor and manage the infrastructure.
• Amazon EMR on AWS Outposts. AWS Outposts enables
organizations to run EMR in their own data centers. This makes it easier
to set up, deploy, manage and scale EMR in on-premises environments.
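For the EMR-on-EC2 option, a cluster can also be launched programmatically with the AWS SDK. The sketch below uses boto3's run_job_flow; the cluster name, release label, instance types, Spot usage, and S3 URIs are illustrative assumptions, and the default EMR IAM roles are assumed to already exist in the account.

```python
# Minimal boto3 sketch for the "Amazon EMR on Amazon EC2" option above.
# Cluster name, release label, instance types, and S3 URIs are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-spark-cluster",
    ReleaseLabel="emr-6.15.0",              # pick a current EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2,
             "Market": "SPOT"},             # Spot Instances for the core nodes
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
    Steps=[{
        "Name": "clickstream-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/clickstream_etl.py"],
        },
    }],
    LogUri="s3://example-bucket/emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",      # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",          # default EMR service role
)

print("Started cluster:", response["JobFlowId"])
```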
Advantages of Amazon EMR
1. Scalability: EMR allows users to easily scale up or down the number of instances in a cluster to handle varying amounts of data processing and analysis tasks.
2. Cost Effectiveness: EMR allows users to pay for the resources they need, when they need them, making it a cost-effective solution for big data processing.
3. Integration With Other AWS Services: EMR can be easily integrated with other AWS services such as Amazon S3, Amazon DynamoDB, and Amazon Redshift for data storage and analysis.
4. Flexibility: EMR supports a wide range of open-source big data frameworks, including Hadoop, Spark, and Hive, giving users the flexibility to choose the tools that best fit their needs.
5. Easy To Use: EMR provides an easy-to-use web interface that allows users to launch and manage clusters, as well as monitor and troubleshoot performance issues.
Disadvantages of Amazon EMR
1. Limited Customization: EMR is pre-configured with popular big data frameworks such as Hadoop and Spark, so users may have limited options for customizing their cluster.
2. Latency: The latency of data processing tasks may increase as the size of the data set increases.
3. Cost: EMR can be expensive for users with large amounts of data or high-performance requirements, as costs are based on the number of instances and the amount of storage used.
4. Limited Control Over the Infrastructure: EMR is a managed service, which means that users have limited control over the underlying infrastructure. This can be a disadvantage for users who need more control over their big data environments.
5. Limited Support for Certain Framework Versions: Some open-source frameworks, or specific versions of them, are only available in particular EMR release labels, which may be a deal breaker for some organizations.
6. Limited Support for Certain Applications: EMR is not suitable for all types of applications; it mainly supports big data processing and analytics workloads.
AWS Disaster Recovery and Backup
Disaster recovery is the process of restoring and recovering an organization's critical systems, infrastructure, and data after a disruptive event. It aims to minimize downtime, recover lost data, and resume normal operations as quickly as possible.
AWS (Amazon Web Services) offers a full range of tools and services to assist organizations in establishing effective disaster recovery strategies. Services like Amazon S3 for secure and reliable object storage, Amazon EC2 for adaptable and scalable computing instances, and AWS Backup for automated backup and recovery procedures are all provided by AWS. AWS also offers options like cross-region replication, which lets companies duplicate data across AWS Regions.
AWS Services for Disaster Recovery
Amazon S3 (Simple Storage Service): S3 is well suited for safely storing backup data and enables quick retrieval during the recovery stage (a cross-region replication sketch follows this list).
Amazon EC2 (Elastic Compute Cloud): Organizations can build a failover
architecture by replicating their on-premises virtual machine (VM) environments.
High availability and seamless failover are made possible by EC2’s Auto Scaling
and Load Balancing features.
AWS Database Services: Database services from AWS include
Amazon RDS (Relational Database Service), Amazon DynamoDB, and Amazon
Aurora. The automated backups, point-in-time recovery, and cross-region
replication provided by these services ensure data durability and availability.
AWS Storage Gateway: AWS Storage Gateway acts as a bridge between on-
premises environments and AWS storage services. It allows for seamless
integration of existing infrastructure with AWS, enabling hybrid cloud
architectures for disaster recovery purposes.
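The overview above mentions cross-region replication as a way to duplicate data for recovery. The sketch below shows one way to enable S3 replication with boto3, assuming hypothetical bucket names and an existing IAM replication role; both buckets must already have versioning enabled.

```python
# Minimal sketch of enabling S3 cross-region replication for DR backups.
# Bucket names and the replication role ARN are hypothetical; both buckets
# must already exist with versioning enabled for replication to work.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="example-dr-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/example-s3-replication-role",
        "Rules": [{
            "ID": "replicate-backups",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                # Destination bucket lives in the recovery Region.
                "Bucket": "arn:aws:s3:::example-dr-destination-bucket",
            },
        }],
    },
)
```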
Steps to Set Up AWS Disaster Recovery
1. Define Recovery Objectives: establish the recovery time objective (RTO) and recovery point objective (RPO) for each workload.
2. Identify DR Architecture: choose a strategy such as backup and restore, pilot light, warm standby, or multi-site active-active.
3. Replicate Data and Applications: copy data and machine images to a recovery Region, for example with S3 cross-region replication or database replicas.
4. Set Up Automation: automate backup, failover, and recovery procedures, for example with AWS Backup plans (see the sketch after this list).
5. Establish Monitoring and Alerting: detect failures early with monitoring alarms and notifications.
6. Test and Validate: run regular DR drills to confirm that the recovery objectives can actually be met.
7. Document the DR Plan: record roles, runbooks, and escalation paths, and keep the documentation up to date.
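As a follow-up to step 4 (Set Up Automation), backups can be scheduled automatically with AWS Backup. The boto3 sketch below creates a daily backup plan and assigns resources by tag; the plan name, vault name, IAM role ARN, and tag key/value are hypothetical, and the vault and role are assumed to already exist.

```python
# Minimal sketch of automating backups with AWS Backup (step 4 above).
# Vault name, plan name, IAM role ARN, and the tag used for selection are
# hypothetical; the vault and role must already exist.
import boto3

backup = boto3.client("backup")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "example-daily-dr-plan",
        "Rules": [{
            "RuleName": "daily-backups",
            "TargetBackupVaultName": "example-backup-vault",
            "ScheduleExpression": "cron(0 5 * * ? *)",   # every day at 05:00 UTC
            "Lifecycle": {"DeleteAfterDays": 35},        # retain for 35 days
        }],
    },
)

# Assign resources to the plan by tag (e.g. every resource tagged dr=true).
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "tagged-resources",
        "IamRoleArn": "arn:aws:iam::123456789012:role/example-backup-role",
        "ListOfTags": [{
            "ConditionType": "STRINGEQUALS",
            "ConditionKey": "dr",
            "ConditionValue": "true",
        }],
    },
)
```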