Amazon EMR Migration Guide
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved.
Contents
Overview
Starting Your Journey
  Migration Approaches
  Prototyping
  Choosing a Team
  General Best Practices for Migration
Gathering Requirements
  Obtaining On-Premises Metrics
Cost Estimation and Optimization
  Optimizing Costs
  Storage Optimization
  Compute Optimization
  Cost Estimation Summary
  Optimizing Apache Hadoop YARN-based Applications
Amazon EMR Cluster Segmentation Schemes
  Cluster Characteristics
  Common Cluster Segmentation Schemes
  Additional Considerations for Segmentation
Securing your Resources on Amazon EMR
  EMR Security Best Practices
  Authentication
  Authorization
  Encryption
  Perimeter Security
  Network Security
  Auditing
  Software Patching
  Software Upgrades
  Common Customer Use Cases
Data Migration
  Using Amazon S3 as the Central Data Repository
  Large Quantities of Data on an Ongoing Basis
  Event and Streaming Data on a Continuous Basis
  Optimizing an Amazon S3-Based Central Data Repository
  Optimizing Cost and Performance
Data Catalog Migration
  Hive Metastore Deployment Patterns
  Hive Metastore Migration Options
Multitenancy on EMR
  Silo Mode
  Shared Mode
  Considerations for Implementing Multitenancy on Amazon EMR
Extract, Transform, Load (ETL) on Amazon EMR
  Orchestration on Amazon EMR
  Migrating Apache Spark
  Migrating Apache Hive
  Amazon EMR Notebooks
Incremental Data Processing
  Considerations for using Apache Hudi on Amazon EMR
  Sample Architecture
Providing Ad Hoc Query Capabilities
  Considerations for Presto
  HBase Workloads on Amazon EMR
  Migrating Apache Impala
Operational Excellence
  Upgrading Amazon EMR Versions
  General Best Practices for Operational Excellence
Testing and Validation
  Data Quality Overview
  Check your Ingestion Pipeline
  Overall Data Quality Policy
  Estimating Impact of Data Quality
  Tools to Help with Data Quality
Amazon EMR on AWS Outposts
  Limitations and Considerations
Support for Your Migration
  Amazon EMR Migration Program
  AWS Professional Services
  AWS Partners
  AWS Support
Contributors
Additional Resources
Document Revisions
Appendix A: Questionnaire for Requirements Gathering
  Security Requirements
  TCO Considerations
Appendix B: EMR Kerberos Workflow
  EMR Kerberos Cluster Startup Flow for KDC with One-Way Trust
  EMR Kerberos Flow Through Hue Access
  EMR Kerberos Flow for Directly Interacting with HiveServer2
  EMR Kerberos Cluster Startup Flow
Appendix C: Sample LDAP Configurations
  Example LDAP Configuration for Hadoop Group Mapping
  Example LDAP Configuration for Hue
Appendix D: Data Catalog Migration FAQs
About this Guide
For many customers, migrating to Amazon EMR raises many questions about assessment, planning,
architectural choices, and how to meet the many requirements of moving analytics applications like
Apache Spark and Apache Hadoop from on-premises data centers to a new AWS Cloud environment.
Many customers have concerns about the viability of distribution vendors or a purely open-source
software approach, and they need practical advice about making a change. This guide includes the
overall steps of migration and provides best practices that we have accumulated to help customers with
their migration journey.
Amazon Web Services Amazon EMR Migration Guide
Overview
Businesses worldwide are discovering the power of new big data processing and analytics frameworks
like Apache Hadoop and Apache Spark, but they are also discovering some of the challenges of
operating these technologies in on-premises data lake environments. Not least, many customers need a
safe long-term choice of platform as the big data industry is rapidly changing and some vendors are now
struggling.
Common problems include a lack of agility, excessive costs, and administrative headaches, as IT
organizations wrestle with the effort of provisioning resources, handling uneven workloads at large
scale, and keeping up with the pace of rapidly changing, community-driven, open-source software
innovation. Many big data initiatives suffer from the delay and burden of evaluating, selecting,
purchasing, receiving, deploying, integrating, provisioning, patching, maintaining, upgrading, and
supporting the underlying hardware and software infrastructure.
A subtler, if equally critical, problem is the way companies’ data center deployments of Apache Hadoop
and Apache Spark directly tie together the compute and storage resources in the same servers, creating
an inflexible model where they must scale in lock step. This means that almost any on-premises
environment pays for high amounts of under-used disk capacity, processing power, or system memory,
as each workload has different requirements for these components.
How can smart businesses find success with their big data initiatives?
Migrating big data (and machine learning) to the cloud offers many advantages. Cloud infrastructure
service providers, such as Amazon Web Services (AWS), offer a broad choice of on-demand and elastic
compute resources, resilient and inexpensive persistent storage, and managed services that provide up-
to-date, familiar environments to develop and operate big data applications. Data engineers,
developers, data scientists, and IT personnel can focus their efforts on preparing data and extracting
valuable insights.
Services like Amazon EMR, AWS Glue, and Amazon S3 enable you to decouple and scale your compute
and storage independently, while providing an integrated, well-managed, highly resilient environment
that immediately reduces many of the problems of on-premises approaches. This approach leads to
faster, more agile, easier-to-use, and more cost-efficient big data and data lake initiatives.
However, the conventional wisdom of traditional on-premises Apache Hadoop and Apache Spark isn’t
always the best strategy in cloud-based deployments. A simple lift and shift approach to running cluster
nodes in the cloud is conceptually easy but suboptimal in practice. Different design decisions go a long
way towards maximizing your gains as you migrate big data to a cloud architecture.
A lift and shift approach is usually simpler, with less ambiguity and risk. This approach is also
better when you are working against tight deadlines, such as when the lease on a data center is
expiring. However, a lift and shift is not always the most cost-effective option, and
the existing architecture may not readily map to a solution in the cloud.
A re-architecture unlocks many advantages, including optimization of costs and efficiencies. With
re-architecture, you move to the latest software, integrate better with native cloud
tools, and lower your operational burden by using native cloud products and services.
This paper provides advantages and disadvantages of each migration approach from the perspective of
the Apache Hadoop ecosystem. For a general resource on deciding which approach is ideal for your
workflow, see An E-Book of Cloud Best Practices for Your Enterprise, which outlines the best practices
for performing migrations to the cloud at a higher level.
Re-Architecting
Re-architecting is ideal when you want to maximize the benefits of moving to the cloud. Re-architecting
requires research, planning, experimentation, education, implementation, and deployment. These
efforts cost resources and time but generally provide the greatest rate of return through reduced
hardware and storage costs, reduced operational maintenance, and the most flexibility to meet future
business needs.
A re-architecture approach to migration includes the following benefits for your applications:
• Data accessibility: when using a data lake architecture, data is stored on a central storage
system that can be used by a wide variety of services and tools to ingest and process the data
for different use cases. For example, services such as AWS Glue and Amazon Athena can
greatly reduce operational burden and costs, but they can only be leveraged if data is
stored on Amazon S3.
• Ability to treat compute instances as transient resources, and only use as much as you need,
when you actively need it.
• Read the documentation found in this guide for reference architectures and approaches that
others have taken to successfully migrate.
• Reach out to an AWS representative early for a roadmap on architectures that would meet
your use case and goals.
• Fewer changes. Because the goal is to move applications to environments that are
similar to the existing environment, changes are limited to only those required to make the
application work in the cloud.
• Less risk, because fewer changes reduce the unknowns and unexpected work.
• Shorter time to market, because fewer changes reduce the amount of training
needed by engineers.
• Amazon EMR clusters are configured using defaults that depend on the instance types chosen.
See Task Configuration for default values at the Hadoop task level and Spark Defaults Set by
Amazon EMR for Apache Spark defaults. These defaults run for most workloads but some jobs
may require that you override these defaults.
• Amazon EMR clusters by default use the capacity scheduler as the Apache Hadoop resource
scheduler. Validate that this scheduler fits the use case that you are migrating.
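As a hedged illustration of overriding these defaults, the following sketch builds the list of configuration objects that Amazon EMR accepts when a cluster is created. The function name build_emr_configurations is hypothetical, and the property values (4g of executor memory, the fair scheduler class) are placeholder assumptions chosen for illustration, not recommendations from this guide.

```python
import json


def build_emr_configurations(spark_executor_memory="4g", use_fair_scheduler=False):
    """Return EMR configuration objects that override cluster defaults.

    "spark-defaults" and "yarn-site" are EMR configuration classifications.
    The property values are illustrative assumptions, not tuned settings.
    """
    configurations = [
        {
            "Classification": "spark-defaults",
            "Properties": {"spark.executor.memory": spark_executor_memory},
        }
    ]
    if use_fair_scheduler:
        # Replace the default capacity scheduler with the YARN fair scheduler.
        configurations.append(
            {
                "Classification": "yarn-site",
                "Properties": {
                    "yarn.resourcemanager.scheduler.class": (
                        "org.apache.hadoop.yarn.server.resourcemanager."
                        "scheduler.fair.FairScheduler"
                    )
                },
            }
        )
    return configurations


if __name__ == "__main__":
    print(json.dumps(build_emr_configurations(use_fair_scheduler=True), indent=2))
```

The resulting list can be supplied as the cluster's Configurations when it is launched, or saved as JSON for the --configurations option of aws emr create-cluster. Validate any override with a prototype workload before relying on it.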
Hybrid Architecture
Hybrid architectures leverage aspects of both lift and shift and re-architecting approaches. For existing
applications, a lift and shift approach is employed for a quick migration. New applications can then
use a re-architected design. This hybrid approach lets you experiment
and gain experience with cloud technologies and paradigms before moving fully to the cloud.
Prototyping
When moving to a new and unfamiliar product or service, there is always a period of learning. Usually,
the best way to learn is to prototype and learn from doing, rather than researching alone, to help
identify the unknowns early in the process so you can plan for them later. Make prototyping mandatory
to challenge assumptions. Common assumptions when working with new products and services include
the following:
A particular data format is the best data format for my use case. Many times, customers read
that a specific data format is the best. However, there is no best data format for all use cases. There are
data formats that perform better than most other data formats, such as Apache Parquet and Apache
ORC. Validating this assumption could be relatively quick but the impact could be large. Switching data
formats after moving to production is more expensive than running a quick prototype with real-world
data. This prototyping is highly recommended.
A particular instance type is the most cost-effective way to run a specific workflow. Many
times, another instance type performs better if it is tuned for the workflow. For example, C series EC2
instances with spill-to-disk enabled can perform better and cost less than R series EC2
instances. The instance type is easier to change later on, but prototyping is still recommended if cost
and performance are high-priority requirements.
A particular application running on-premises should work identically in the cloud. There are
many factors that contribute to running workloads, such as the instance type, storage type, application
version, infrastructure configuration, and so on. Running a wide variety of jobs with real data that you
expect to run in production provides the most validation.
In the cloud, several factors in the environment can affect how a workload runs. For example,
traffic to Amazon S3 could vary at different times of day, or caching could take effect when not
expected. Prototyping therefore reduces the number and severity of surprises during development and
deployment, and the rate of return can be large. Finally, finding issues earlier rather than later in the
development cycle is much more cost effective.
Choosing a Team
When starting a migration to the cloud, you must carefully choose your project team to research,
design, implement, and maintain the new system. We recommend that your team has individuals in the
following roles with the understanding that a person can play multiple roles:
• A project leader capable of managing the project end-to-end, and with enough technical
expertise and background on big data technology to make decisions.
• A big data applications engineer with a strong background in Apache Hadoop and other
technologies that are being migrated. This background is helpful to understand how the
underlying applications work, their intended uses, and their limitations.
• An infrastructure engineer well-versed in the automation of infrastructure provisioning on
AWS and familiar with tools like AWS CloudFormation, Terraform (by HashiCorp), and so on.
This person should also be familiar with test automation, including CI/CD approaches and
tools.
• A security engineer able to provide guidance on the security requirements that your company
mandates. This person should be knowledgeable enough to map the organization's security
requirements to security features offered by AWS. This person should also be flexible about
the security design as long as it meets the intended requirements.
• A group of engineers who are quick learners and are not afraid to dive into areas that they
may not be familiar with.
All members in the migration team must exhibit the following characteristics:
• They must have an open mind and ability to think differently. Cloud infrastructure requires a
paradigm shift on how resources are treated.
• They must be able to dive deep into the underlying frameworks, architectures, and issues.
• They must share an agreement on the project's mission and direction.
Consider using AWS Glue, Amazon Redshift, or Amazon Athena. Although Amazon EMR is
flexible and provides the greatest amount of customization and control, there is an associated cost of
managing Amazon EMR clusters, upgrades, and so on. Consider using other managed AWS services that
fulfill your requirements as they may have lower operational burden, and in some cases, lower costs. If
one of these services does not meet your use case requirements, then use Amazon EMR.
Use Amazon S3 for your storage (data lake infrastructure). A data lake is a centralized repository
that allows you to store all of your structured and unstructured data at any scale. Data can be stored as-
is, without having to first structure the data. You can execute different types of analytics on the data,
from dashboards and visualizations to big data processing, real-time analytics, and machine learning.
Amazon S3 is architected for high durability and high availability and supports lifecycle policies for tiered
storage. For more details, see Using Amazon S3 as the Central Data Repository.
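To make the idea of lifecycle policies for tiered storage concrete, the following sketch builds a lifecycle configuration document in the shape that the Amazon S3 API accepts. The function name, the prefix, and the 30-day and 90-day thresholds are assumptions made for illustration; choose values that match your own access patterns.

```python
import json


def build_lifecycle_configuration(prefix, days_to_infrequent_access=30, days_to_glacier=90):
    """Return an S3 lifecycle configuration that tiers objects under a prefix.

    Objects move to STANDARD_IA first and to GLACIER later. The day counts
    are illustrative assumptions, not values taken from this guide.
    """
    if days_to_glacier <= days_to_infrequent_access:
        raise ValueError("days_to_glacier must be greater than days_to_infrequent_access")
    return {
        "Rules": [
            {
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_infrequent_access, "StorageClass": "STANDARD_IA"},
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    # "raw/logs/" is a hypothetical prefix in a hypothetical data lake bucket.
    print(json.dumps(build_lifecycle_configuration("raw/logs/"), indent=2))
```

A document like this could be applied with the S3 PutBucketLifecycleConfiguration operation, for example through the AWS SDK for Python.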
Decouple compute from storage so that you can scale them independently. With your data
in Amazon S3, you can launch as much or as little compute capacity as you need using Amazon Elastic
Compute Cloud (EC2). For more information, see Benefits of using Amazon S3.
Use multiple Amazon EMR (transient and long running) clusters with the ability to spin
up and down on demand. Analyze the current workloads and assign the workloads to different
clusters based on usage patterns.
• Separate out batch jobs (extract, transform, load [ETL], aggregations, data cleansing, roll-up,
machine learning) and interactive queries (one-time analysis).
• Use Reserved and Spot Instances as applicable to reduce costs for baseline or variable
workloads, respectively.
• Use automatic scaling within clusters. Automatic scaling allows for programmatically scaling in
and out core nodes and task nodes based on Amazon CloudWatch metrics and other
parameters that are specified in a scaling policy.
• Right-size clusters (both instance types and number of instances). Multiple Amazon EC2
instance types are available—make sure to choose the correct instance type based on the
workload. For more details, see Cost Estimation and Optimization.
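The automatic scaling bullet above can be sketched as a scaling policy document. This is a minimal example under stated assumptions: the function name, the rule name, the 15 percent threshold, and the 300-second cooldown are hypothetical values picked for illustration, and only a scale-out rule is shown.

```python
import json


def build_scale_out_policy(min_nodes, max_nodes, memory_threshold_percent=15.0):
    """Return an EMR automatic scaling policy with a single scale-out rule.

    The rule adds one node when the available YARN memory percentage drops
    below the threshold. All numeric values are illustrative assumptions.
    """
    if not 0 < min_nodes <= max_nodes:
        raise ValueError("expected 0 < min_nodes <= max_nodes")
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            {
                "Name": "ScaleOutOnLowYarnMemory",
                "Description": "Add one node when available YARN memory is low",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": memory_threshold_percent,
                        "Unit": "PERCENT",
                        # EMR substitutes the cluster ID for this placeholder.
                        "Dimensions": [{"Key": "JobFlowId", "Value": "${emr.clusterId}"}],
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_scale_out_policy(min_nodes=2, max_nodes=10), indent=2))
```

A production policy would normally pair this with a matching scale-in rule so that capacity is released when the cluster is idle.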
Implement automation and continuous integration/continuous delivery (CI/CD) practices
to enable experimentation and efficiency. Automating the provisioning of EMR clusters along with
other resources like roles and security groups is an operational excellence best practice. Apply the same
engineering discipline to infrastructure that is typically used for application code. Check the
infrastructure code into a code repository and build CI/CD pipelines to test the code. Implementing
infrastructure as code also allows for the provisioning of EMR clusters in another Availability Zone or
AWS Region should problems arise in the one currently being used. For more details, see Operational
Excellence.
Involve security and compliance engineers as early in the migration process as possible
and make sure that the EMR environments are in line with the organization's security directives. Make
full use of multiple security-related services, such as AWS Identity and Access Management (IAM) and
AWS Key Management Service (KMS), and features, such as Security Configurations within EMR. Amazon
S3 also includes many security-related features. Make sure that all data is encrypted at rest and in
transit. Finally, make sure that authentication and authorization are enabled as appropriate. For more
details, see Securing your Resources on Amazon EMR.
Gathering Requirements
Obtaining On-Premises Metrics
The following list of metrics is useful to help with cost estimation, architecture planning, and instance
type selection.
Capture each of these metrics on your existing Hadoop clusters to help drive the decision-making
process during migration.
Optimizing Costs
Amazon EMR provides multiple features to help lower costs. To best utilize these features, consider the
following factors.
Workload Type
You can run different applications and workload types on Amazon EMR. For applications that only run
for a specific period, you can use a transient EMR cluster. For applications that run for a long period, you
can use a long-running cluster. The following image shows typical workload types and whether they're
classified as transient or long-running.
Instance Type
Most Amazon EMR clusters can run on general-purpose EC2 instance types/families such as m4.xlarge
and m5.xlarge. Compute-intensive clusters may benefit from running on high performance computing
(HPC) instances, such as the compute-optimized instance family (C5). Database and memory-caching
applications may benefit from running on high memory instances, such as the memory-optimized
instance family (R5). The primary node does not have large computational requirements. For most
clusters of 50 or fewer nodes, you can use a general-purpose instance type such as m5.xlarge.
Note: For clusters of more than 50 nodes, consider using a larger instance type, such as
m4.xlarge. The amount of data you can process depends on the capacity of your core nodes
and the size of your data as input, data during processing, and data as output.
Application Settings
Job performance also depends on application settings. There are different application settings for
different use cases. For example, by default, EMR clusters with Apache HBase installed allocate half of
the memory for HBase and allocate the other half of memory for Apache Hadoop YARN. If you use
applications such as Apache HBase and Apache Spark, we recommend that you don't use a single, larger
cluster for both applications. Instead run each application on a separate, smaller cluster.
Storage Optimization
When you use Amazon EMR, you have the ability to decouple your compute and storage by using
Amazon S3 as your persistent data store. By optimizing your storage, you can improve the performance
of your jobs. This approach enables you to use less hardware and run clusters for a shorter period. Here
are some strategies to help you optimize your cluster storage for Amazon S3:
Partition Data
When your data is partitioned and you read the data based on a partition column, your query only reads
the files that are required. This reduces the amount of data scanned during the query. For example, the
following image shows two queries executed on two datasets of the same size. One dataset is
partitioned, and the other dataset is not.
The query over the partitioned data (s3logsjsonpartitioned) took 20 seconds to complete and
it scanned 349 MB of data.
The query over the non-partitioned data (s3logsjsonnopartition) took 2 minutes and 48
seconds to complete and it scanned 5.13 GB of data.
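The effect of partitioning can be illustrated with a small, engine-independent sketch. It assumes Hive-style partition paths (name=value segments) and hypothetical object keys; it shows only the idea of skipping files whose partition values do not match the filter, not how any particular query engine implements pruning.

```python
def parse_partitions(key):
    """Extract Hive-style partition values (name=value path segments) from an object key."""
    partitions = {}
    # The last path segment is the file name, so it is skipped.
    for segment in key.split("/")[:-1]:
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every requested filter."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(name) == value for name, value in wanted.items())
    ]


if __name__ == "__main__":
    # Hypothetical keys for a log dataset partitioned by year and month.
    keys = [
        "logs/year=2019/month=12/part-0000.json",
        "logs/year=2020/month=01/part-0000.json",
        "logs/year=2020/month=02/part-0000.json",
    ]
    # Only one of the three files needs to be read for this filter.
    print(prune(keys, year="2020", month="02"))
```

A query that filters on the partition columns reads only the matching files, which is why the partitioned dataset above scanned far less data than the non-partitioned one.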
The query executed over the dataset containing 50 files (fewfilesjson) took 2 minutes and 31
seconds to complete. The query over the dataset with 25,000 files (manyfilesjson) took 3 minutes
and 47 seconds to complete.
The query over the JSON dataset took 56 seconds to complete, and it scanned 5.21 GB of data. The
query over the Parquet dataset took 2 seconds to complete, and in this example, it did not need to scan
any data.
Compute Optimization
In the previous section, we covered some of the strategies that you can use to optimize your Amazon S3
storage costs and performance. In this section, we cover some of the features and ways to optimize your
Amazon EC2 cluster’s compute resources.
Amazon EC2 provides various purchasing options. When you launch Amazon EMR clusters, you can
use On-Demand, Spot, or Reserved EC2 instances. Amazon EC2 Spot Instances offer spare
compute capacity at a discount compared to On-Demand Instances. Amazon EC2 Reserved
Instances enable you to reserve EC2 instances at a significant discount compared to On-Demand pricing.
For more detailed information, see Instance Purchasing Options in the Amazon EMR Management
Guide.
Spot Instances
Running EMR clusters with Spot instances can be useful for a number of scenarios. However, there are
a few things that you must consider before choosing Spot instances for a particular workload. For
example, if you're running a job that requires predictable completion time or has service level
agreement (SLA) requirements, then using Spot instances may not be the best fit. For workloads that
can be interrupted and resumed (interruption rates are extremely low), or workloads that can
exceed an SLA, you can use Spot instances for the entire cluster.
You can also use a combination of Spot and On-Demand instances for certain workloads. For example, if
cost is more important than the time to completion, but you cannot tolerate a partial loss of work (have
an entire cluster terminated), you can use Spot instances for the task nodes, and use On-
Demand/Reserved instances for the primary and core nodes.
Spot instances are also great for testing and development workloads. You can use Spot instances for an
entire testing cluster to help you reduce costs when testing new applications.
Reserved Instances
With Reserved Instances (RIs), you can reserve EC2 capacity at a lower price than On-Demand
Instances. Keep in mind that to reduce costs with RIs, your RI utilization over a year must be
higher than 70%. For example, if you use transient EMR clusters and your clusters only run for a
total of 12 hours per day, then your yearly utilization is 50%. This means that RIs might not help
you reduce costs for that workload. Reserved Instances may help you reduce costs for long-running
clusters and workloads.
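The utilization check above can be sketched in a few lines of Python. The 70% threshold and the 12-hours-per-day example come from the text; the function names are hypothetical, and the calculation assumes the same daily usage on every day of a 365-day year.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_utilization(hours_per_day: float) -> float:
    """Fraction of the year an instance runs, assuming identical usage every day."""
    return (hours_per_day * 365) / HOURS_PER_YEAR


def ri_likely_worthwhile(hours_per_day: float, break_even: float = 0.70) -> bool:
    """True when utilization exceeds the break-even threshold cited in the text."""
    return yearly_utilization(hours_per_day) > break_even


print(yearly_utilization(12))    # transient clusters running 12 hours per day
print(ri_likely_worthwhile(12))  # below the 70% threshold
print(ri_likely_worthwhile(24))  # long-running cluster
```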
Savings Plans
Savings Plans is a flexible discount model that provides you with the same discounts as Reserved
Instances, in exchange for a commitment to use a specific amount (measured in dollars per hour) of
compute power over a one- or three-year period. Every type of compute usage has an On-Demand price
and a (lower) Savings Plan price. After you commit to a specific amount of compute usage per hour, all
usage up to that amount is covered by the Savings Plan, and anything past it is billed at the
On-Demand rate. If you have Reserved Instances, the Savings Plan applies to any On-Demand usage that is
not covered by the RIs. Savings Plans are available in two options:
• Compute Savings Plans provide the most flexibility and help to reduce your costs by up to
66%. The plans automatically apply to any EC2 instance regardless of region, instance family,
operating system, or tenancy, including those that are part of your EMR clusters.
• EC2 Instance Savings Plans apply to a specific instance family within a region and provide the
largest discount (up to 72%, just like Standard RIs). Like RIs, your savings plan covers usage of
different sizes within the same instance family (such as a c5.4xlarge or c5.large) throughout a region.
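To make the commitment mechanics concrete, here is a simplified model of one billing hour. It assumes a single flat discount rate for all usage (real Savings Plan rates vary by instance type and Region) and that the hourly commitment is charged whether or not it is fully used. The function name and the numbers in the usage lines are hypothetical.

```python
def hourly_charge(on_demand_usage: float, commitment: float, discount: float) -> float:
    """Total charge for one hour under a simplified Savings Plan model.

    on_demand_usage: the hour's usage priced at On-Demand rates (dollars)
    commitment:      the hourly Savings Plan commitment (dollars), always charged
    discount:        assumed flat Savings Plan discount, as a fraction (0 <= d < 1)
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in the range [0, 1)")
    # Price the hour's usage at the discounted Savings Plan rate.
    usage_at_plan_rate = on_demand_usage * (1 - discount)
    if usage_at_plan_rate <= commitment:
        # All usage fits inside the commitment.
        return commitment
    # Usage beyond the commitment is billed at the On-Demand rate.
    covered_on_demand_value = commitment / (1 - discount)
    return commitment + (on_demand_usage - covered_on_demand_value)


print(hourly_charge(4.0, 5.0, 0.5))   # under-used commitment
print(hourly_charge(20.0, 5.0, 0.5))  # overflow billed at On-Demand
```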
Instance Fleets
Instance fleets is an Amazon EMR feature that provides you with a variety of options for provisioning EC2
instances. This approach enables you to easily provision an EMR cluster with Spot Instances, On-Demand
Instances, or a combination of both. When you launch a cluster with Instance Fleets, you can select the
target capacity for On-Demand and Spot Instances and specify a list of instance types and Availability
Zones. Instance fleets choose the instance type and the Availability Zone that is the best fit to fulfill your
launch request.
Instance fleets also provide many features for provisioning Spot Instances. This includes the ability for
you to specify a defined duration (Spot block) to keep the Spot Instances running, the maximum Spot
price that you’re willing to pay, and a timeout period for provisioning Spot Instances.
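As an illustrative sketch, the following Python builds a task instance fleet definition in the shape used by the `InstanceFleets` parameter of the Amazon EMR RunJobFlow API. The fleet name, instance types, capacities, and timeout are hypothetical values, and the example only constructs and prints the structure; it does not call AWS.

```python
import json


def task_fleet(instance_types, spot_capacity, on_demand_capacity=0, timeout_minutes=20):
    """Build a task fleet that prefers Spot capacity across several instance types."""
    return {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_capacity,
        "TargetSpotCapacity": spot_capacity,
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": 1,
                # Maximum Spot price, as a percentage of the On-Demand price.
                "BidPriceAsPercentageOfOnDemandPrice": 100,
            }
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # How long to wait for Spot capacity before applying the timeout action.
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


fleet = task_fleet(["m5.xlarge", "r5.xlarge", "c5.xlarge"], spot_capacity=4)
print(json.dumps(fleet, indent=2))
```

Listing several instance types gives the fleet more pools to draw Spot capacity from, which is the main reason to prefer a fleet over a single-type instance group in this scenario.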
A job requires resources to complete its computation. To parallelize the job, YARN runs subsets of work,
called tasks, within containers. The job requests from YARN the amount of memory and CPU expected
for each container. If not specified, then a default container size is allocated for each container. If the
container is not sized properly, the job may waste resources because it's not using everything allocated
to it, run slowly because it's constrained, or fail because its resources are too constrained.
To ensure that the underlying hardware is fully utilized, you must take into consideration both the
resources in YARN and the requests coming from a job. YARN manages virtual resources, which do not
necessarily map to the underlying hardware. In addition, the YARN configuration and the task scheduler
configuration both affect how the underlying hardware is used.
To ensure that you are using all of the physical resources, use Ganglia monitoring software to provide
information about the underlying hardware. If 100% of YARN resources (vCPU and Memory) are used,
but Ganglia is showing that actual CPU and memory usage is not crossing 80%, then you may want to
reduce container size so that the cluster can run more concurrent containers.
If Ganglia shows that either CPU or memory is at 100% but the other resource is not being
used significantly, then consider moving to another instance type that may provide better performance
at a lower cost. For example, if CPU is at 100% and memory usage is less than 50% on R4 or M5 series
instance types, then moving to a C4 series instance type may address the CPU bottleneck.
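The sizing guidance above can be sketched as a small rule-of-thumb helper. The 80% and 50% thresholds are taken from the text; the function names, the return strings, and the memory-bound branch (the mirror image of the CPU example) are assumptions added for illustration.

```python
def containers_per_node(node_memory_mb, node_vcores, container_memory_mb, container_vcores):
    """Number of containers that fit on one node; the scarcer resource is the limit."""
    return min(node_memory_mb // container_memory_mb, node_vcores // container_vcores)


def recommend(yarn_allocated_pct, cpu_pct, memory_pct):
    """Suggest a next step from YARN allocation and observed hardware usage."""
    if yarn_allocated_pct >= 100 and max(cpu_pct, memory_pct) < 80:
        # YARN is fully reserved but the hardware is idle: containers are oversized.
        return "reduce container size"
    if cpu_pct >= 100 and memory_pct < 50:
        return "consider a compute-optimized instance type"
    if memory_pct >= 100 and cpu_pct < 50:
        # Assumption: mirror of the CPU-bound case described in the text.
        return "consider a memory-optimized instance type"
    return "no change suggested"


print(containers_per_node(24576, 8, 4096, 1))
print(recommend(yarn_allocated_pct=100, cpu_pct=60, memory_pct=70))
print(recommend(yarn_allocated_pct=100, cpu_pct=100, memory_pct=40))
```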
The second control is the amount of virtual resources that is reservable from each node within YARN. To
change the amount of YARN memory or CPU available to be reserved on each node in your cluster, set
the yarn.nodemanager.resource.memory-mb and
yarn.nodemanager.resource.cpu-vcores configurations using the Amazon EMR
configuration API. For default values, see Hadoop Daemon Configuration Settings in the Amazon EMR
Release Guide.
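As a hedged example, the two properties can be supplied through a `yarn-site` configuration classification when the cluster is created. The memory and vCore values below are placeholders, not recommendations, and the helper name is hypothetical.

```python
import json


def yarn_site_classification(memory_mb: int, vcores: int):
    """Build an EMR configuration list that overrides per-node YARN resources."""
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                # Property values are passed as strings in EMR configurations.
                "yarn.nodemanager.resource.memory-mb": str(memory_mb),
                "yarn.nodemanager.resource.cpu-vcores": str(vcores),
            },
        }
    ]


configurations = yarn_site_classification(memory_mb=24576, vcores=8)
print(json.dumps(configurations, indent=2))
```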
The following decision graph provides a suggested approach to optimizing your jobs.
This section covers a few approaches to splitting a single, permanently running cluster into smaller
clusters and identifies the benefits these practices bring to your environments. The approach you
choose depends on your use case and existing workflows. AWS can work with you to help choose the
strategy that meets your goals.
Cluster Characteristics
You can approach the task of splitting up an existing cluster from different perspectives, depending on the
area or a set of characteristics that you want to tackle or address. The strategy you choose depends on
the scenarios you have and the goals you want to achieve. The following cluster characteristics are
typically considered.
Security
Security Controls
Depending on the use cases that a cluster serves, you can change the level of security for the cluster. For
example, if one cluster is used purely for batch processing that is submitted from a workflow engine or
via steps, securing this cluster at a user level may not be a top priority. In this case, forgoing Kerberos
configuration may be acceptable since no one is interacting with the cluster. For a cluster that serves
interactive use cases, you may need to configure Kerberos, restrict SSH access, and follow other security
practices. For information on security controls available in Amazon EMR, see Security in the Amazon
EMR Management Guide.
Network Controls
You can assign different clusters to different security groups or place them in different subnets that
control access from specified network sources.
Disaster recovery
Using more than one cluster provides redundancy to minimize impact if a single cluster goes down or is
taken offline. The following are a couple of use cases where having multiple clusters can help:
• A cluster becomes unhealthy due to a software bug, network issue, or other external
dependencies being unavailable.
• A maintenance operation needs to occur such as a software upgrade, patching that requires a
machine reboot, or bouncing of applications.
Lifecycle Stages
One of the typical approaches to deciding how to segregate clusters is based on having dedicated
clusters for separate stages in your lifecycle, such as testing, beta, and production. This way, jobs that
are not ready for production can run on their own dedicated cluster and do not interfere or compete
with production jobs for resources or writing of data results. Having different clusters for separate
stages also lets you test jobs on clusters that have newer versions of applications. This approach lets you
test upgrades before upgrading your beta or production environments. To further isolate your
workflows and scenarios, you can apply a separate instance role in Amazon EMR that prevents beta jobs
from writing their results to production S3 locations. This protects production data from accidental
deletions or modifications arising from your beta stage environment.
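One way to express that isolation is an IAM policy statement, attached to the beta cluster's instance role, that explicitly denies writes and deletes on the production location. The sketch below only builds the policy document; the bucket name, prefix, and statement ID are hypothetical.

```python
import json


def deny_writes_policy(bucket: str, prefix: str = ""):
    """Build an IAM policy that denies object writes and deletes under a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyWritesToProduction",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": ["s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


policy = deny_writes_policy("example-prod-bucket", "results/")
print(json.dumps(policy, indent=2))
```

An explicit Deny takes precedence over any Allow in IAM policy evaluation, so the beta role stays read-only on that location even if a broader policy grants S3 write access.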
Workload Types
Clusters that serve end users who submit ad hoc queries typically require stricter security controls. They
also typically run different applications, such as Apache Hue and Apache Zeppelin. These interactive
workload clusters usually have peak usage times during business hours and low usage at all other times.
Configuring automatic scaling policies for these clusters is beneficial, especially if you run them on Spot
Instances.
Clusters used for batch/ETL workloads also tend to have peak usage times that are usually different from
those of interactive workload clusters used for queries. Batch/ETL workload clusters can leverage
automatic scaling so that workloads scale independently of other clusters and can scale up and down.
For more information, see Using Automatic Scaling in Amazon EMR.
Time-Sensitive Jobs
Another common strategy for cluster segmentation is creating separate clusters based on whether their
jobs are time-sensitive. When a repeated job's completion time must be consistent, creating and
running the job on a separate cluster is a way to ensure that the job can obtain a predictable and
consistent amount of resources each time that job must run. In addition, you can use more powerful
and expensive hardware when running time-sensitive jobs.
Job Length
Running long jobs may consume available resources and take them away from other, shorter running
jobs. Running separate clusters for long-running and short-running jobs can help short-running jobs
complete faster, improve their SLA, and improve the SLA of workflows in general if they are in the
critical path. Clusters that run short-running jobs also have a higher chance of completing jobs when
Spot Instances are used because the chance of a job being interrupted by EC2 is lower.
Group/Organization Types
Some organizations create a cluster per group of users that share the same security requirements and
ownership of EMR resources. This approach helps with restricting access to those groups, and also
allows the administrators to allocate costs to the groups by using resource tagging. Having several
smaller clusters dedicated to separate groups also helps you isolate resource usage. With this
approach, one group within your organization does not exhaust the resources of other groups using the
same cluster. In addition, this approach reduces the risk that one group gains access to another group’s
data.
with idle checks and automatic resource termination using advanced Amazon CloudWatch metrics and
AWS Lambda on the AWS Big Data Blog.
Finally, using multiple clusters also reduces the efficiency of instances, because they are not being
shared by other jobs. However, this scenario can be offset by using automatic scaling.
Ensure that the supporting department is involved early in security architecture. Have the
department that reviews and approves architectures for security involved in the process as early as
possible, and keep them up-to-date with decisions related to security. They may be able to give you
advice earlier in the process to reduce or avoid design changes later in the process.
Understand the risks. Security is mainly about minimizing attack surfaces and minimizing impact
should a system become compromised. No system can be entirely secured.
Obtain security exceptions. Security departments may provide security exceptions for rules that
may no longer apply or where the risk of compromise is reduced. Getting security exceptions may
significantly reduce the risk and scope of work needed to get approvals from a security department. For
example, you may not need SELinux for Amazon EMR clusters that process batch jobs and in which there
is no interactivity with users.
Use different security setups for different use cases. Batch and ETL clusters that do not have
user interaction likely require a different security configuration than a cluster that is used in an
interactive way. Interactive clusters may have several users and processes that interact with the
cluster, each requiring a different level of access. Clusters that are used for
batch usually require far fewer security controls than an interactive cluster.
Protect from unintentional network exposure. Security departments may configure proper
security group rules to protect applications and data on the cluster. Misconfiguration of network
security rules can open a broad range of cluster ports to unrestricted traffic from the public internet and
expose cluster resources to outside threats. The Amazon EMR block public access feature allows you to
minimize misconfigurations by centrally managing public network access to EMR clusters in an AWS
Region. You can enable this configuration in an AWS Region and block your account users from
launching clusters that allow unrestricted inbound traffic from public IP addresses.
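The following sketch shows a block public access configuration in the shape accepted by the Amazon EMR PutBlockPublicAccessConfiguration API, together with a local helper that approximates how such a configuration treats a security group rule. The `is_blocked` helper and the rule dictionary format are hypothetical and exist only to illustrate the idea; the service performs its own evaluation.

```python
PUBLIC_CIDRS = {"0.0.0.0/0", "::/0"}

# Block public rules everywhere except SSH on port 22 (an example exception).
block_public_access = {
    "BlockPublicSecurityGroupRules": True,
    "PermittedPublicSecurityGroupRuleRanges": [{"MinRange": 22, "MaxRange": 22}],
}


def is_blocked(rule, config):
    """Approximate check: is this inbound rule public and outside permitted ports?"""
    if not config["BlockPublicSecurityGroupRules"]:
        return False
    if rule["cidr"] not in PUBLIC_CIDRS:
        return False
    for permitted in config["PermittedPublicSecurityGroupRuleRanges"]:
        within = (
            permitted["MinRange"] <= rule["from_port"]
            and rule["to_port"] <= permitted["MaxRange"]
        )
        if within:
            return False
    return True


print(is_blocked({"cidr": "0.0.0.0/0", "from_port": 8088, "to_port": 8088}, block_public_access))
print(is_blocked({"cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22}, block_public_access))
print(is_blocked({"cidr": "10.0.0.0/16", "from_port": 8088, "to_port": 8088}, block_public_access))
```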
To learn more about EMR security best practices, see Best Practices for Securing Amazon EMR on the
AWS Big Data Blog.
Authentication
Authentication is the process of an entity (a user or application) proving its identity to an application or
system. When logging into an application or system, a user generally provides a login and password to
prove that they are the user they are claiming to be. Other methods for authentication exist, such as
providing certificates or tickets.
There are several ways to authenticate a user to an EMR cluster and/or applications on the Amazon EMR
cluster.
This method is typically used when there are a small number of users or groups. Multiple EMR clusters
can be started with different permissions and SSH keys, and the keys are distributed to users that have
access to each cluster. Rotate the SSH keys periodically to ensure that the impact of leaked keys is
limited. This is not a recommended approach if Amazon EMR clusters have access to sensitive data.
After a user is validated to an application, there are two ways a job can be submitted. In the first
scenario, an application uses its own identity to submit a job for processing. For example, if Apache Hue
is the application, then the Hue user submits the job and that job is run by the defined user hue. The
second way to submit the job is through an application that impersonates the user to run that job. This
method allows administrators to track usage of users and makes it easier to identify misuse of the
system. By default, Apache Hue submits jobs impersonated by the user.
Apache Hadoop's default authentication method is simple, which means that there is
no user authentication. We recommend that you change this setting if the cluster is being used by
multiple users. If Apache Hadoop is set up to use Kerberos and a YARN/HDFS job is submitted with
Apache Spark or Apache Hive, then the end users must have an OS-level account on each node that runs
YARN/HDFS. In some cases, these accounts must be created manually and synced using something like
SSSD. Mappings from users to their group memberships are also required, for which LDAP or SSSD can be
used. If this is a requirement, we recommend that you use LDAP to authenticate users, and use Amazon
EMR with Kerberos to automate the OS account syncing. Alternatively, you can enable Hadoop to do
user-to-group mappings via LDAP. See Hadoop Groups Mappings in the Apache Hadoop documentation for
built-in Hadoop implementations for these mappings. For more information, refer to Appendix C:
Sample LDAP Configurations.
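As a rough illustration of the LDAP-based group mapping option, the sketch below builds a `core-site` configuration classification that switches Hadoop to its `LdapGroupsMapping` implementation. The LDAP URL, search base, and bind user are placeholder values, and a real setup also needs bind credentials and search filters that are omitted here.

```python
import json


def ldap_group_mapping(ldap_url: str, base_dn: str, bind_user: str):
    """Build a core-site classification that resolves user groups through LDAP."""
    return [
        {
            "Classification": "core-site",
            "Properties": {
                "hadoop.security.group.mapping": "org.apache.hadoop.security.LdapGroupsMapping",
                "hadoop.security.group.mapping.ldap.url": ldap_url,
                "hadoop.security.group.mapping.ldap.base": base_dn,
                "hadoop.security.group.mapping.ldap.bind.user": bind_user,
            },
        }
    ]


# Placeholder directory details, for illustration only.
core_site = ldap_group_mapping(
    ldap_url="ldaps://ldap.example.com:636",
    base_dn="dc=example,dc=com",
    bind_user="cn=hadoop,ou=services,dc=example,dc=com",
)
print(json.dumps(core_site, indent=2))
```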
The following table lists Amazon EMR supported applications and instructions on how to enable LDAP, if
supported.
Application      Supports LDAP?  Notes
Apache Hive      Yes             HiveServer2 can be used to authenticate against LDAP. Details on
                                 HiveServer2 setup are located on the Apache Hive Wiki. When LDAP is
                                 configured, Java Database Connectivity (JDBC) and Beeline connections
                                 require a user login and password.
                                 Note: We recommend that you use LDAPS to ensure that unencrypted
                                 credentials are not shared.
Presto           Yes             See LDAP Authentication in the Presto documentation for details on the
                                 different configuration parameters.
Apache Spark     No              Use Hue or Apache Zeppelin to authenticate the user and then submit
                                 the job to Spark. Or use Kerberos as an authentication mechanism
                                 when submitting applications.
Apache Hue       Yes             See Configure Hue For LDAP users in the Amazon EMR Release Guide
                                 for setup instructions, and Using LDAP via AWS Directory Service to
                                 Access and Administer Your Hadoop Environment on the AWS Big Data
                                 Blog for a step-by-step guide.
Apache Zeppelin  Yes             Apache Zeppelin uses Apache Shiro to configure authentication. For
                                 steps on enabling Shiro, see Shiro Authentication for Apache Zeppelin.
JupyterHub       Yes             See Using LDAP Authentication in the Amazon EMR Release Guide for
                                 setup instructions.
Kerberos
Kerberos is the most secure authentication and authorization mechanism available on Amazon EMR.
Kerberos works by having users provide their credentials and obtain a ticket to prove identity from a
central Authentication Server. That ticket can then be used to grant access to resources within your
cluster. For Kerberos to be used effectively for authentication, configure your Amazon EMR clusters to
create a one-way trust between the Kerberos server running on the Amazon EMR master node and the
Kerberos server that is on-premises. An example flow is provided in EMR Kerberos Flow for directly
interacting with HiveServer2.
Most applications within Amazon EMR support Kerberos as an authentication method. For a complete
list, see Supported Applications in the Amazon EMR Management Guide.
Kerberos
Kerberos is the recommended method for application-to-application authentication. When using
Kerberos authentication, applications authenticate themselves with a Key Distribution Center (KDC), and
authenticate other applications connecting to it. There are three options when using Kerberos for EMR
clusters: Cluster-Dedicated KDC, Cross-Realm Trust and External KDC. See Kerberos Architecture Options
for details about each option. For more information on Kerberos enabled workflows on Amazon EMR,
see Appendix B: EMR Kerberos Workflow.
Presto
The Amazon EMR version of Presto allows nodes within a cluster to authenticate using LDAPS. To set up
node-to-node authentication, set up Presto using LDAPS. See Using SSL/TLS and Configuring LDAPS with
Presto on Amazon EMR in the Amazon EMR Release Guide for details.
Authorization
Authorization is the act of allowing or denying an identity to perform an action. Using authorization to
determine what an identity can do first requires that the identity has validated who they are. This
section provides the various mechanisms available that can limit what an identity can do.
Application Authorization
For Apache Hadoop, there are several possible solutions for authorization, like having individual
applications control what users can do, or having central processes or applications that can manage
policies across several applications.
Apache Knox
Apache Knox provides perimeter security for your Hadoop resources. Users connect to Knox and can be
authenticated and authorized through a Knox Gateway. The gateway then forwards the user’s traffic to
Hadoop resources without having to directly connect to them. Knox can be configured to allow groups
to dictate the resources a user may access, such as Apache HBase UIs.
For more information, see Authentication in the Apache Knox Gateway Users Guide and How to
Implement Perimeter Security with Amazon EMR using Apache Knox.
A role mapping is required between a user, group, or S3 location, and the role that EMRFS assumes.
For example, consider two users, user A and user B, and two AWS IAM Roles, analyst and data_scientist.
An EMRFS mapping may specify that when user A is accessing S3, EMRFS assumes the analyst role.
When user B accesses S3, EMRFS assumes the data_scientist role.
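The user A and user B example can be sketched as an EMRFS role mapping in the shape used by Amazon EMR security configurations. The account ID, role ARNs, and user names are placeholders, and `role_for_user` is a hypothetical helper that only imitates a first-match lookup locally; it is not how EMRFS itself is invoked.

```python
security_configuration = {
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {
            "RoleMappings": [
                {
                    "Role": "arn:aws:iam::111122223333:role/analyst",
                    "IdentifierType": "User",
                    "Identifiers": ["userA"],
                },
                {
                    "Role": "arn:aws:iam::111122223333:role/data_scientist",
                    "IdentifierType": "User",
                    "Identifiers": ["userB"],
                },
            ]
        }
    }
}


def role_for_user(config, user, default_role=None):
    """Return the first mapped role for a user, or the default when none matches."""
    mappings = config["AuthorizationConfiguration"]["EmrFsConfiguration"]["RoleMappings"]
    for mapping in mappings:
        if mapping["IdentifierType"] == "User" and user in mapping["Identifiers"]:
            return mapping["Role"]
    return default_role


print(role_for_user(security_configuration, "userA"))
print(role_for_user(security_configuration, "userB"))
```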
If using Kerberos for Hadoop YARN, then it is required that the user’s accounts are registered on each of
the nodes on Hadoop and that there is a defined way to map users to groups. You can create this
mapping manually through bootstrap actions, or by having a mechanism to access users from an on-
premises Active Directory. To automate this process, you can use SSSD to read identity information,
including user-to-group mappings. AWS automates the setup of a one-way trust using Kerberos with
EMR. See Use Kerberos Authentication for more details.
You can update rules while the cluster is running by updating emrfs-site.xml on all nodes on the
cluster. The EMRFS will detect and apply the new mappings.
Note: IAM roles for EMRFS provide application-level isolation between users of the
application. This configuration does not provide host level isolation between users on
the host. Any user with access to the cluster can bypass the isolation to assume any of
the roles. This method is not recommended if users can SSH into your cluster to run
arbitrary code, such as running Apache Spark, Scala, or PySpark jobs.
Simply register your Amazon Simple Storage Service (Amazon S3) buckets and paths with AWS Lake
Formation and create data catalog resources using the AWS Lake Formation console, the API, the AWS
Command Line Interface (AWS CLI), or using AWS Glue Crawlers. See the AWS Lake Formation Developer
Guide for more details on how to set up the data lake.
Once you set up the data lake, you can grant permissions to access the data using the AWS Lake
Formation permissions model. This permissions model is similar to DBMS-style GRANT/REVOKE
commands, such as GRANT SELECT ON tableName TO userName. You can also limit the access to
specific columns with an optional inclusion or exclusion list. The inclusion list specifies the columns that
are granted access, and the exclusion list specifies the columns that are not granted access. In the
absence of an inclusion or exclusion list, access is granted to all columns in a table. See AWS Lake
Formation Access Control Overview for how to set up these permissions.
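The inclusion and exclusion semantics described above can be modeled in a few lines. This is a local illustration of the rules only, not a call to Lake Formation; the function name and column names are hypothetical, and treating the two lists as mutually exclusive is an assumption of this sketch.

```python
def accessible_columns(table_columns, include=None, exclude=None):
    """Return the columns a grant exposes under an inclusion or exclusion list."""
    if include is not None and exclude is not None:
        # Assumption: a single grant uses one list or the other, not both.
        raise ValueError("specify an inclusion list or an exclusion list, not both")
    if include is not None:
        return [column for column in table_columns if column in include]
    if exclude is not None:
        return [column for column in table_columns if column not in exclude]
    # No list supplied: every column in the table is accessible.
    return list(table_columns)


columns = ["order_id", "customer_email", "amount"]
print(accessible_columns(columns, include=["order_id", "amount"]))
print(accessible_columns(columns, exclude=["customer_email"]))
print(accessible_columns(columns))
```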
If you are already using AWS Glue identity-based policies or AWS Glue resource policies for access
control, you can manage these policies in the AWS Lake Formation Permission model by identifying
existing users and roles and setting up equivalent Lake Formation permissions. See Upgrading AWS Glue
Data Permissions to AWS Lake Formation Model for more details.
Currently, you can use either Amazon EMR Notebooks or Apache Zeppelin to read this S3 data using
Spark SQL with Apache Livy. The user first needs to authenticate using one of the third-party SAML
providers (Microsoft Active Directory Federation Services [AD FS], Auth0, and Okta). See Supported
Third-party Providers for SAML for how to set up trust relationships between SAML providers and AWS
Identity and Access Management (IAM).
See IAM Roles for Lake Formation for more details on the IAM roles needed for this integration, and Launch
an Amazon EMR cluster with Lake Formation for how to launch an Amazon EMR cluster.
Amazon EMR enables fine-grained access control with Lake Formation by using the following
components:
• Proxy agent: The proxy agent is based on Apache Knox. It receives SAML-authenticated
requests from users and translates SAML claims to temporary credentials. It also stores the
temporary credentials in the secret agent. The proxy agent runs on the master node.
• Secret agent: The secret agent securely stores secrets and distributes secrets to other EMR
components or applications. The secrets can include temporary user credentials, encryption
keys, or Kerberos tickets. The secret agent runs on every node in the cluster and uses Lake
Formation and AWS Glue APIs to retrieve temporary credentials and AWS Glue Data Catalog
metadata.
• Record server: The record server receives requests for accessing data. It then authorizes
requests based on temporary credentials and table access control policies distributed by the
secret agent. The record server reads data from Amazon S3 and returns the column-level data that
the user is authorized to access. The record server runs on the master node.
See Architecture of SAML-Enabled Single Sign-On and Fine-Grained Access Control for an illustration and
details of fine-grained access control.
Metadata Authorization
Hive, Presto, and Spark SQL all leverage a table metastore, also known as a data catalog, that provides
metadata information about the tables that are available to be queried. You can use metadata to
restrict users’ access to specific datasets.
Apache Ranger
Apache Ranger provides a set of plugins that you can install and configure on supported applications,
such as Hive, HDFS, and YARN. The plugins are invoked during user actions and allow or deny actions
based on policies you provide to the Ranger server. Ranger can set policies on databases and tables
within Hive to allow users and groups to access a subset of Hive tables and databases. Ranger also
provides the ability to do column-level authorization, column masking, and row-level filtering. The
policies are managed on the Ranger web console, and they can interact with your LDAP server.
For details on how to achieve this on Amazon EMR, see Implementing Authorization and Auditing using
Apache Ranger on Amazon EMR on the AWS Big Data Blog.
The Lake Formation model is implemented as a DBMS-style GRANT/REVOKE command set, whereas
IAM provides the IAM policy as a tool to define permissions on resources. For data catalog requests to
succeed, both IAM and Lake Formation must permit the access.
The recommended approach to securing Data Catalog resources is to enable coarse-grained permissions
on access to the underlying AWS Glue API actions and use Lake Formation to govern fine-grained access
control to the databases, tables, and columns in the Data Catalog.
Typically, Data Catalog administrator roles hold the following AWS managed IAM policies:
• AWSLakeFormationDataAdmin
• AWSGlueConsoleFullAccess (optional)
• CloudWatchLogsReadOnlyAccess (optional)
• AmazonAthenaFullAccess (optional)
For more information on the associated IAM policies for common Lake Formation personas, such as the
Data Engineer, Data Analyst and Workflow Roles, see Lake Formation Personas and IAM Permissions
Reference.
The following table summarizes the available Lake Formation permissions on Data Catalog resources.
Items with an asterisk (*) affect underlying data access but also limit search capabilities in the Lake
Formation console.
Resource   Permissions
Database   ALTER, DROP
Table      DROP, SELECT*, INSERT*, DELETE*
Administrators can grant permissions to create, alter, or drop databases or tables as well as permissions
that affect underlying data consumption such as permissions on select, insert, or delete operations at
the table or column level. These operations can be done either in the Lake Formation console or via the
Lake Formation CLI. The example command below grants the user datalake_user1 permission to create
tables in the retail database:
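As an illustrative sketch (an assumption, not the guide's own command), such a grant can be expressed as a request for the Lake Formation GrantPermissions API. The account ID in the principal ARN is a placeholder and `grant_request` is a hypothetical helper; the example only builds the request and does not call AWS.

```python
import json


def grant_request(principal_arn, database, permissions, grantable=()):
    """Build GrantPermissions parameters for a database-level grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": list(permissions),
        # Permissions the grantee may pass on to other principals (grant option).
        "PermissionsWithGrantOption": list(grantable),
    }


request = grant_request(
    principal_arn="arn:aws:iam::111122223333:user/datalake_user1",  # placeholder account ID
    database="retail",
    permissions=["CREATE_TABLE"],
)
print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, a request like this could be passed to the `grant_permissions` method of a `lakeformation` client as keyword arguments.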
Explicit Permissions
Lake Formation permission grants have the form:
With the grant option, you can allow the grantee to grant permissions to other principals. Table 2
highlights the explicit permissions that can be granted by the resource type. For example, when
CREATE_TABLE is granted to a principal on a database, the principal is allowed to create tables in that
database.
Permissions marked with an asterisk (*) in Table 1 are granted on Data Catalog resources but also apply
to the underlying data. For example, the DROP permission on a metadata table enables the principal to
drop the referenced table from the Data Catalog. However, the DELETE permission, when granted on
the same table, gives explicit permissions to the principal for deleting the table’s underlying data in
Amazon S3. As such, SELECT, INSERT, and DELETE permissions affect both Data Catalog and data
access permissions. For more information, see Lake Formation Access Control Overview.
Implicit Permissions
It’s important to keep in mind the following implicit metadata permissions that are granted by Lake
Formation:
• Database creators have all database permissions on databases that they create, and can grant
others permission to create tables in the database.
• Database creators do not gain implicit permissions on tables that others create in the
database.
• Table creators have full permissions on tables they create, can grant permissions on tables
they create and can view databases that contain tables that they create.
Be sure to remove any IAM users or roles in the IAMAllowedPrincipals group from the AWS Lake
Formation permissions if any exist. The IAMAllowedPrincipals group is automatically created and includes all
IAM users and roles that are permitted access to your Data Catalog resources by IAM. This mechanism is
provided for existing customers who are transitioning to AWS Lake Formation. Once onboarding
to AWS Lake Formation is complete, revoke permissions from the IAMAllowedPrincipals group,
leaving only AWS Lake Formation in charge of access control, as is recommended. For more information,
see Changing the Default Security Settings For Your Data Lake.
For more information regarding implicit permissions, see Implicit Lake Formation Permissions.
Outside of IAM, there are a few configuration requirements that must be met for Amazon EMR clusters
to integrate with AWS Lake Formation:
With the required IAM roles in place, data catalog administration can be controlled entirely via Lake
Formation, which enables secure, fine-grained access to data for federated EMR users. Lake Formation also
enables column-level permissions to AWS Glue Data Catalog resources. Access to other AWS services
(that is, those not managed by Lake Formation, such as DynamoDB) is controlled via the IAM Role for AWS
Services described below. Apache Spark job submission can be done through EMR notebooks, Apache
Zeppelin, or Apache Livy. See Supported Applications and Features for applications and features that
support the integration of AWS Lake Formation with Amazon EMR. See also Implementing SAML AuthN
for Amazon EMR Using Okta and Column-Level AuthZ with AWS Lake Formation for a blog post
36
Amazon Web Services Amazon EMR Migration Guide
that demonstrates applying column-level authorization with AWS Lake Formation while authenticating
with an independent identity provider (IdP).
Note: These resources should be accessed through Lake Formation using Spark SQL
only.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::BUCKET_AND_PREFIX_FOR_SAML_XML_METADATA_FILE"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::account-id:role/IAM_SAML_Role_For_Lake_Formation"
    },
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::account-id:role/IAM_Role_for_AWS_Services"
    },
    {
      "Effect": "Allow",
      "Action": "lakeformation:GetTemporaryUserCredentialsWithSAML",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:GetRole",
      "Resource": [
        "arn:aws:iam::account-id:role/IAM_SAML_Role_For_Lake_Formation",
        "arn:aws:iam::account-id:role/IAM_Role_for_AWS_Services"
      ]
    }
  ]
}
Encryption
You can implement data encryption to help protect data in transit, and data at rest in Amazon S3 and in
cluster instance storage. You can also use a custom Amazon Linux AMI to encrypt the Amazon Elastic
Block Store (Amazon EBS) root device volume of cluster instances.
The following diagram shows the different data encryption options that are available with security
configurations.
For an up-to-date list of encryption options available on Amazon EMR, see Encryption Options.
• Manually create PEM certificates, zip them in a file, and reference them from Amazon S3
• Implement a certificate custom provider in Java and specify the S3 path to the JAR
See Encrypt Data in Transit and At Rest in the Amazon EMR Management Guide for details on how to
use EMR security configuration to configure encryption in transit and which applications are
supported. We highly recommend that you encrypt all interaction with your cluster, including all UI-
based applications and endpoints.
Root Volume
Beginning with Amazon EMR version 5.24.0, you can use the Amazon EMR security configuration option
to encrypt the EBS root volume. For Amazon EMR versions prior to 5.24.0, you must use a Custom AMI
to encrypt the root volume. See Creating a Custom AMI with an Encrypted Amazon EBS Root Device
Volume in the Amazon EMR Management Guide for details.
EBS Volumes
There are two mechanisms that allow you to encrypt data on non-root volumes, which typically store
HDFS and other application data: HDFS Transparent Encryption for application-managed encryption and
LUKS encryption for OS-managed encryption. For documentation on both approaches, see At-rest
Encryption for Local Disks in the Amazon EMR Management Guide.
For Amazon EMR version 5.24.0 and later, you can natively encrypt EBS volumes attached to an EMR
cluster by using the Amazon EMR security configuration option. EBS encryption provides the following
benefits:
• Native End-to-End Encryption: Data on EBS volumes including intermediate data, I/O between
the EC2 instances and EBS volumes, and EBS snapshots are encrypted.
• Root Volumes Encryption: Root volumes can be encrypted without the need to create custom
Amazon Linux AMIs.
• Transparent Encryption: EBS encryption is transparent to any applications running on EMR and
does not require modifications.
• Simplified Encryption: With EBS encryption, you can check encryption status from the
Volumes page in the EC2 console or through an EC2 API call.
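As an illustration of the security configuration option described above, the following Python sketch builds a security configuration document that turns on local disk encryption with EBS encryption enabled. It is a sketch under stated assumptions rather than a complete configuration: the KMS key ARN is a placeholder, in-transit encryption is left off for brevity, and your environment may also need an Amazon S3 encryption section.

```python
import json


def build_security_configuration(kms_key_arn):
    """Return an EMR security configuration that enables EBS encryption.

    Only at-rest encryption for local disks is configured here.
    """
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                    "EnableEbsEncryption": True,
                }
            },
        }
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute the ARN of your own KMS key.
    config = build_security_configuration(
        "arn:aws:kms:us-east-1:account-id:key/KEY_ID"
    )
    document = json.dumps(config, indent=2)
    print(document)
    # To register it (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("emr").create_security_configuration(
    #       Name="ebs-encryption", SecurityConfiguration=document)
```

The resulting security configuration is then referenced by name when the cluster is created.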
Perimeter Security
Establishing perimeter security is foundational to Amazon EMR security. This process involves limiting
access to the EMR cluster itself and using an “edge” node (gateway) as a host. Users and systems
access this edge node directly, and all other actions are issued from this gateway into the cluster
itself.
Apache Knox
Apache Knox is an Application Gateway that provides a single endpoint to interact with multiple
applications and multiple clusters, as well as an authentication layer to the applications and clusters that
supports SAML, LDAP, and other authentication methods. Knox is useful for several use cases:
• If you require HTTPS between the client and Apache Knox: Apache Knox supports TLS, so clients
can connect to Knox over a secure channel regardless of whether the application on the cluster
supports TLS.
• If you require an authentication layer: Apache Knox provides an authentication layer that
supports several authentication mechanisms, which is useful when the cluster application does
not support them itself. For example, Knox supports single sign-on (SSO), which Hadoop
Resource Manager does not support.
• Apache Knox records actions that are executed per user request or those that are produced by
Knox internal events.
However, Apache Knox does have some limitations, and it does not provide fine-grained access control.
Network Security
This document does not provide information on how to properly set up a Virtual Private Cloud (VPC),
subnets, security groups, and network access control lists (ACLs). For information on setting up VPCs,
see Control Network Traffic with Security Groups.
• If you do not want unintentional public access to applications running on the cluster, enable
Block public access. Block public access can be configured and enabled for each AWS Region
for your account. The Block public access feature is supported only for Amazon EMR clusters
running in public subnets and is not applicable to Amazon EMR clusters running in private
subnets. Block public access prevents a cluster from launching when any security group
associated with the cluster has a rule that allows inbound traffic from IPv4 0.0.0.0/0 or IPv6
::/0 (public access) on a port, unless the port has been specified as an exception (Port 22 is an
exception by default). You can configure exceptions to allow public access on a port or range
of ports. You can also disable block public access, but we recommend that you keep it enabled. See
Using Amazon EMR Block Public Access for more information.
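The rule that block public access applies can be modeled in a few lines of Python. This is a simplified illustration of the behavior described above, not the service's actual implementation; the rule dictionaries and their field names (cidr, from_port, to_port) are hypothetical and chosen only for this sketch.

```python
PUBLIC_CIDRS = {"0.0.0.0/0", "::/0"}
DEFAULT_EXCEPTIONS = [(22, 22)]  # port 22 is an exception by default


def port_is_excepted(port, exceptions):
    """Return True if the port falls inside any excepted port range."""
    return any(low <= port <= high for low, high in exceptions)


def violating_rules(inbound_rules, exceptions=DEFAULT_EXCEPTIONS):
    """Return the inbound rules that would prevent a cluster launch.

    A rule violates the check when it allows traffic from a public CIDR
    on at least one port that is not covered by an exception.
    """
    violations = []
    for rule in inbound_rules:
        if rule["cidr"] not in PUBLIC_CIDRS:
            continue
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if not all(port_is_excepted(port, exceptions) for port in ports):
            violations.append(rule)
    return violations


if __name__ == "__main__":
    rules = [
        {"cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
        {"cidr": "0.0.0.0/0", "from_port": 8088, "to_port": 8088},
        {"cidr": "10.0.0.0/16", "from_port": 0, "to_port": 65535},
    ]
    for rule in violating_rules(rules):
        print("Would block launch:", rule)
```

In this model only the rule that opens port 8088 to 0.0.0.0/0 is reported, because port 22 is excepted by default and the third rule is not public.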
Auditing
The ability to audit compute environments and understand where the data in the cluster is coming from
and how it's being used is a key requirement for many customers. There are a variety of ways that you
can support this requirement within EMR.
Amazon S3
Amazon EMR can be configured to push all application, daemon, and provisioning logs to Amazon S3 by
enabling logging. See Configure Cluster Logging and Debugging for more information.
Apache Ranger
Apache Ranger plugins provide the ability to push auditing data into Apache Solr for future analysis.
Apache Ranger also pushes auditing information within the Ranger UI.
AWS CloudTrail
AWS CloudTrail records every API call made to AWS services and captures information such as the
caller's AWS access key, source IP address, and the agent used to make the call. When using Amazon
EMR, there are several groups of API calls that are stored. The first group is at the EMR service level,
where APIs are called for cluster management, such as creating a cluster, terminating clusters, and
running steps. Second, when a running cluster accesses a resource, such as Amazon S3, the AWS Glue
Data Catalog, or AWS KMS, the credentials that the cluster used are recorded in AWS CloudTrail. See
Logging Amazon EMR API Calls in AWS CloudTrail for more details on how to enable and use CloudTrail.
To help meet compliance requirements, CloudTrail logs can be exported to Amazon S3 and then
queried with Amazon Athena to detect unauthorized or suspicious activity. See Querying AWS
CloudTrail Logs for more information on how to query from Amazon Athena.
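As a lightweight alternative for spot checks, the fields mentioned above can be pulled out of an exported CloudTrail log file with a short Python sketch. It assumes the log file has already been downloaded and decompressed, and it keeps only service-level Amazon EMR events; the output field names are chosen for this example.

```python
import json

EMR_EVENT_SOURCE = "elasticmapreduce.amazonaws.com"


def emr_events(log_text, event_names=None):
    """Extract EMR service-level events from the text of a CloudTrail log file.

    event_names optionally restricts the result to specific API calls,
    for example {"RunJobFlow", "TerminateJobFlows"}.
    """
    records = json.loads(log_text).get("Records", [])
    selected = []
    for record in records:
        if record.get("eventSource") != EMR_EVENT_SOURCE:
            continue
        if event_names and record.get("eventName") not in event_names:
            continue
        selected.append({
            "time": record.get("eventTime"),
            "event": record.get("eventName"),
            "access_key": record.get("userIdentity", {}).get("accessKeyId"),
            "source_ip": record.get("sourceIPAddress"),
            "agent": record.get("userAgent"),
        })
    return selected
```

For example, calling emr_events(text, {"TerminateJobFlows"}) lists who terminated clusters and from which IP address. For large volumes of logs, the Athena approach described above is the better fit.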
To use this approach, you must install and configure the Amazon CloudWatch Agent, and configure it to
watch the Hadoop log directories. It then starts to push logs to Amazon CloudWatch Logs, where you
can set up log retention and export targets based on your auditing requirements.
Apache Knox
If you are using Apache Knox as a gateway for perimeter security or proxying, you can also configure it
to log all actions that a user performs on the resources that it interacts with.
Software Patching
Amazon EMR provides new releases on a regular basis, adding new features, new applications, and
general updates. We recommend that you use the latest release to launch your cluster whenever
possible.
OS Level Patching
When an EC2 instance starts, it applies all OS non-kernel patches. You can view installed patches by
running yum updateinfo list security installed from a shell on the
instance.
• Upgrade to a newer AMI: The latest patches, including Linux kernel patches, are applied to
newer AMIs. See the Release Notes for which versions have upgraded applications and check
the application release notes to see if there is a fix for your issue.
• If you require more flexibility over which patches are applied, use custom AMIs. This approach
allows you to apply new patches to the AMI and deploy them by restarting your clusters.
• Contact AWS Support and provide the issue and patches that you need. Support may be able
to provide workarounds.
If you have a long-running cluster, you can use EC2 Patch Manager. This approach requires that the
SSM Agent is running on your instances. Patching an EMR cluster requires that you either apply patches
in a maintenance window so that nodes, especially the primary node, can be restarted in bulk, or
apply patches to one node at a time so that HDFS and jobs are not interrupted. Make sure to
contact AWS Support before applying any patches. If instances must be rebooted during the patch
process, make sure to turn on termination protection before patching. See Using Termination
Protection for details.
For more information on how to push updates using EC2 Systems Manager, see the AWS Big Data Blog
post Create Custom AMIs and Push Updates to a Running Amazon EMR Cluster Using Amazon EC2
Systems Manager.
• Upgrade to a newer AMI: Newer AMIs include recent patches. See the Release Notes for
which versions have upgraded applications and check the application release notes to see if
there is a fix for your issue.
• Contact AWS Support and provide the issue and patches that you need. Depending on
severity, Support may provide other workarounds or in extreme cases, may be able to create a
bootstrap action that provides the patch.
Note: We recommend that you do not apply and install patches on your cluster unless
you have contacted AWS Support and the EMR service team. In certain software
upgrade scenarios, you may need to assume the operational burden of maintaining the
patches.
Software Upgrades
General recommendations for upgrading Amazon EMR AMIs are:
• Read the Amazon EMR Release Notes to see version changes for applications, such as Hive,
HBase, and so on.
• Read the release notes of the open source software packages, such as Hive, HBase, and so on,
and check for any known issues that could impact existing workflows.
• Use a test environment to test the upgraded software. This approach is important to isolate
testing workloads from production workloads. For example, some version upgrades of Hive
update your Hive Metastore schemas which may cause compatibility issues. If someone starts
and uses Hive with the newer version, it could impact production.
• Test performance and data quality between the old and new versions. When moving versions
in Hive, compatibility issues or bugs may occur that can impact data quality and performance.
Use Case
• Having an ETL-only use case may allow for reduced security requirements compared to
clusters that users access interactively.
• This design works well when Amazon EMR clusters are being used as transient clusters.
Implementation
Use Case
• Securing an EMR cluster is not necessary because it’s running in a private subnet with tightly
controlled access.
• You can control which AMI you want to use on the bastion host. For example, if users can only
access an internally managed Red Hat Linux, then the bastion host can run that AMI, while EMR
clusters continue to use Amazon Linux.
• Once a user is authenticated on a bastion host, then they should have access to all the
resources on the EMR cluster.
When to Use
• You want to control access to the EMR cluster and have all users interact with EMR through
the bastion host/edge node.
• You want to segregate job submission and EMR processes running on the master and data
nodes.
Implementation
1. Create EMR clusters within a private subnet with VPC endpoints that allow traffic to required
AWS services, like Amazon S3, AWS KMS, and Amazon DynamoDB.
2. Create a bastion host on an EC2 instance in a public subnet. Traffic from the public subnet
must be able to access the clusters in the private subnet.
3. Do not allow SSH access to EMR clusters.
4. Access the clusters through a UI, like Hue, Zeppelin, or Jupyter, or provide SSH access to the
bastion host/edge node.
5. If SSH is allowed, ensure that the bastion host/edge node uses SSSD or another mechanism to
allow users to log in with their on-premises credentials.
6. Use an elastic load balancer (ELB) that routes traffic to clusters for load balancing and disaster
recovery without needing to change configurations on the bastion host.
Use Case
• There is a small number of end user groups, all of whom share the same level of permissions to
the same data.
• The data being accessed is not highly sensitive data.
Implementation
1. Create a different EC2 instance role for each EMR cluster that represents the permissions you
want end users to have.
2. Enable LDAP authentication for the applications that the end users are using, like Hue and
Zeppelin. Configure Hue and Zeppelin so that they do not impersonate the user. If impersonation
is required, you need to either manually create users and groups, or set up SSSD on all nodes and
connect to your on-premises LDAP.
3. Disable SSH to the cluster.
Data Migration
Using Amazon S3 as the Central Data Repository
Many customers on AWS use Amazon Simple Storage Service (Amazon S3) as their central repository,
more commonly referred to as a data lake, to securely collect, store, and analyze their data at a massive
scale.2 Organizations with data warehouses are realizing the benefits of data lakes to enable diverse
query capabilities, data science use cases, and advanced capabilities for discovering new information
models. To build their data lakes on AWS, customers often use Amazon S3 as the central storage
repository instead of a Hadoop Distributed File System (HDFS).
Amazon S3 is built to store and retrieve any amount of data with high availability, and is designed
for 99.999999999% (11 9’s) of durability. Amazon S3 provides comprehensive security and
compliance capabilities that meet even the most stringent regulatory requirements.
The following figure shows a high-level overview of the data lake architectural pattern, with Amazon S3
as the central storage repository.
• Decoupling of storage from compute and data processing – In traditional Apache Hadoop and
data warehouse solutions, storage and compute resources are tightly coupled, making it
difficult to optimize costs and data processing workflows. With your data in Amazon S3, you
can launch as much or as little compute capacity as you need using Amazon Elastic Compute
Cloud (Amazon EC2), and you can use AWS analytics services to process your data. You can
optimize your EC2 instances to provide the right ratios of CPU, memory, and bandwidth for
best performance and even leverage Amazon EC2 Spot Instances.
• Centralized data architecture – Amazon S3 makes it easy to build a multi-tenant environment,
where many users can bring their own data analytics tools to a common set of data. This
approach improves both cost and data governance over that of traditional solutions, which
require multiple copies of data to be distributed across multiple processing platforms.
• Integration with AWS services – Use Amazon S3 with Amazon Athena, Amazon Redshift
Spectrum, Amazon Rekognition, and AWS Glue to query and process data. Amazon S3 also
integrates with AWS Lambda serverless computing to run code without provisioning or
managing servers. With all of these capabilities, you only pay for the data you process and for
the compute time that you consume.
• Standardized APIs – Amazon S3 RESTful APIs are simple, easy to use, and supported by most
major third-party independent software vendors (ISVs), including leading Apache Hadoop and
analytics tool vendors. This enables customers to bring the tools they are most comfortable
with and knowledgeable about to help them perform analytics on data in Amazon S3.
To help customers move their data from their data center to AWS, AWS provides the following options:
• The ability to move data between your on-premises storage and AWS storage services or
between AWS storage services using AWS DataSync when bandwidth is available.
• The ability to move petabytes to exabytes of data to AWS using AWS Snowball and AWS
Snowmobile appliances.
• The ability to move large quantities of data over a dedicated network connection between
your data center and AWS with AWS Direct Connect.
• The ability to move data in real time from relational databases using AWS Database Migration
Service (AWS DMS), and from internet sources such as websites and mobile apps using Amazon Kinesis.
AWS DataSync
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates the process
of moving data between your on-premises storage and AWS storage services or between AWS storage
services. DataSync supports Network File System (NFS) shares, Server Message Block (SMB) shares,
Hadoop Distributed File Systems (HDFS), self-managed object storage, AWS Snowcone, Amazon Simple
Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), Amazon FSx for Windows File
Server, Amazon FSx for Lustre, Amazon FSx for OpenZFS, and Amazon FSx for NetApp ONTAP as well as
Google Cloud Storage and Azure Files storage locations.
AWS Snowball Edge is a type of Snowball device with on-board storage and compute power for select
AWS capabilities. Snowball Edge can do local processing and edge-computing workloads in addition to
transferring data between your local environment and the AWS Cloud. For more information on
differences between AWS Snowball and AWS Snowball Edge, see AWS Snowball Use Case Differences.
For more information on best practices in migrating data using AWS Snow Family, see the AWS Big Data
Blog posts Migrate HDFS files to an Amazon S3 data lake with AWS Snowball Edge and Data Migration
Best Practices with AWS Snowball Edge.
Customers who are using Hadoop Distributed File System (HDFS) can also copy data directly from HDFS
to an AWS Snowball using the Snowball Client. To copy data from HDFS to Snowball, you must install
Snowball Client on an on-premises host that can access the HDFS cluster, and then copy files from HDFS
to S3 via Snowball. Additionally, the S3 SDK Adapter for Snowball provides an S3 compatible interface to
the Snowball Client for reading and writing data on a Snowball. Customers who prefer a tighter
integration can use the S3 Adapter to easily extend their existing applications and workflows to
seamlessly integrate with Snowball.
We recommend that you use AWS Snowmobile to migrate large datasets of 10 PB or more in a single
location. For datasets less than 10 PB or distributed in multiple locations, use AWS Snowball. Make sure
to evaluate the amount of available bandwidth in your network backbone. If you have a high-speed
backbone with hundreds of gigabytes per second of spare throughput, then you can use Snowmobile to
migrate the large datasets all at once. If you have limited bandwidth on your backbone, you should
consider using multiple Snowballs to migrate the data incrementally.
Note: As a general rule, if it takes more than one week to upload your data to AWS
using the spare capacity of your existing internet connection, then you should consider
using Snowball. For example, if you have a 100 Mbps connection that you can solely
dedicate to transferring your data and need to transfer 100 TB of data, it will take more
than 100 days to complete the data transfer over that connection. On the other hand, the
same amount of data can be transferred in about a week by using multiple Snowballs.
For more information on when to use Snowball for data transfer, see AWS Snowball FAQs.
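The arithmetic behind this rule of thumb can be checked with a short Python sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and adds a utilization factor, which is an assumption of this example, to account for protocol overhead and contention on the link.

```python
SECONDS_PER_DAY = 86400


def transfer_days(data_terabytes, link_megabits_per_second, utilization=1.0):
    """Estimate the days needed to upload a dataset over a network link.

    utilization is the fraction of the nominal link speed that is
    actually achieved (1.0 means the full line rate).
    """
    bits_to_send = data_terabytes * 1e12 * 8
    usable_bits_per_second = link_megabits_per_second * 1e6 * utilization
    return bits_to_send / usable_bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"Full line rate:  {transfer_days(100, 100):.1f} days")
    print(f"80% utilization: {transfer_days(100, 100, 0.8):.1f} days")
```

At the full line rate, 100 TB over a 100 Mbps link takes about 92.6 days. At an assumed 80% utilization it takes about 115.7 days, which is consistent with the "more than 100 days" figure in the note.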
In addition to AWS Direct Connect, you can also enable communication between your remote network
and your VPC by creating an encrypted connection known as a VPN connection. A VPN connection is
created using Internet Protocol security (IPsec) over the internet. You can create a VPN connection by
attaching a virtual private gateway to the VPC, creating a custom route table, updating your security
group rules, and creating an AWS managed VPN connection.
Note: We recommend that you use AWS Direct Connect for large ongoing data transfer
needs, since AWS Direct Connect can reduce costs, increase bandwidth, and provide a
more consistent network experience than internet-based VPN connections. VPN
connections are a good solution if you have an immediate need, have low to modest
bandwidth requirements, and can tolerate the inherent variability in internet-based
connectivity.
Over the connection established between your on-premises environment and AWS using either of these
methods, you can migrate your data into Amazon S3 on an ongoing basis, using any of the following
approaches.
Often, the reason for the migration is a lack of compute capacity in the on-premises cluster. Customers
in that situation leverage the S3DistCp tool provided by Amazon EMR to pull the data from HDFS onto
Amazon S3. For more information on best practices in this scenario, see the AWS Big Data Blog post
Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3.
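As a sketch of this approach, the following Python function builds an Amazon EMR step definition that runs S3DistCp through command-runner.jar. The NameNode address, HDFS path, and bucket name are hypothetical, and the EMR cluster must be able to reach the on-premises NameNode and DataNodes over the network for the copy to work.

```python
def build_s3distcp_step(src, dest, name="Copy HDFS to S3"):
    """Build an EMR step definition that copies src to dest with S3DistCp."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }


if __name__ == "__main__":
    # Hypothetical on-premises NameNode and destination bucket.
    step = build_s3distcp_step(
        "hdfs://namenode.example.internal:8020/data/events",
        "s3://example-bucket/data/events",
    )
    print(step)
    # To submit it (requires boto3, AWS credentials, and a running cluster):
    #   import boto3
    #   boto3.client("emr").add_job_flow_steps(
    #       JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

Because the step runs on the EMR cluster, the compute for the copy comes from AWS rather than from the capacity-constrained on-premises cluster.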
You can leverage a commercially available solution such as WANDisco Fusion or Cloudera BDR to move
data from HDFS onto Amazon S3.
You can also leverage Apache Hadoop and Amazon EMR integration with Amazon S3, and have the data
processing workflows write directly to Amazon S3. For example, you can run Apache Sqoop jobs on an
Amazon EMR cluster to extract data from a relational database and write it to Amazon S3.
AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare
and load your data for analytics. AWS Glue can also discover your data and store the associated
metadata (for example, a table definition and schema) in the AWS Glue Data Catalog. AWS Glue is
designed to simplify the tasks of moving and transforming your datasets for analysis. It’s a serverless,
fully managed service built on top of the popular Apache Spark execution framework.
AWS Glue can access on-premises relational databases via Java Database Connectivity (JDBC) to crawl a
data store and catalog its metadata in the AWS Glue Data Catalog. The connection can also be used by
any ETL job that uses the data store as a source or target, like writing the data back to Amazon S3. The
figure below illustrates the workflow to extract data from a relational database, transform the data, and
store the results in Amazon S3.
For more information, see the AWS Big Data Blog post How to extract, transform, and load data for
analytic processing using AWS Glue.
AWS DataSync
AWS DataSync supports Hadoop Distributed File Systems (HDFS). The most common way to get data
onto a cluster is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load
the data onto your cluster. To upload data from an on-premises Hadoop cluster to your S3 bucket, you
first deploy one or more DataSync agents in the same network as your on-premises storage. An agent is
a virtual machine (VM) that is used to read data from or write data to a self-managed location. You then
activate your agents in the AWS account and AWS Region where your S3 bucket is located.
After your agent is activated, you create a source location for your HDFS location, a destination location
for your S3 bucket, and a task. A task is a set of two locations (source and destination) and a set of
default options that you use to control the behavior of the task. Finally, you run your DataSync task to
transfer data from the source to the destination.
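The sequence above (source location, destination location, task, task execution) can be sketched in Python against the DataSync API. This is a minimal sketch under stated assumptions: the client argument is expected to behave like a boto3 DataSync client, simple (non-Kerberos) HDFS authentication is used, and the task name and default NameNode port are choices made for this example.

```python
def migrate_hdfs_to_s3(client, agent_arn, namenode_host, hdfs_user,
                       bucket_arn, bucket_role_arn, namenode_port=8020):
    """Create DataSync locations and a task, then start one task execution.

    Returns the ARN of the started task execution.
    """
    # Source: the on-premises HDFS cluster, read through the activated agent.
    source = client.create_location_hdfs(
        NameNodes=[{"Hostname": namenode_host, "Port": namenode_port}],
        AuthenticationType="SIMPLE",
        SimpleUser=hdfs_user,
        AgentArns=[agent_arn],
    )
    # Destination: the S3 bucket, accessed through an IAM role.
    destination = client.create_location_s3(
        S3BucketArn=bucket_arn,
        S3Config={"BucketAccessRoleArn": bucket_role_arn},
    )
    # Task: pairs the two locations; default options are used here.
    task = client.create_task(
        SourceLocationArn=source["LocationArn"],
        DestinationLocationArn=destination["LocationArn"],
        Name="hdfs-to-s3",
    )
    execution = client.start_task_execution(TaskArn=task["TaskArn"])
    return execution["TaskExecutionArn"]
```

With boto3 installed and credentials configured, the function could be called with boto3.client("datasync") as the client. Passing the client in also makes the sequence easy to exercise against a stub before touching real resources.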
Note: We recommend that you use AWS DataSync to rapidly copy data out from your
Hadoop Cluster into Amazon S3. You can use DataSync for fast transfer of existing data
to Amazon S3, and the File Gateway configuration of Storage Gateway for subsequent
low-latency access to this data from on-premises applications. Learn more about how to
use AWS DataSync to move from Hadoop to Amazon S3 in this blog.
You can use AWS DMS to migrate data from any of the supported database sources to Amazon S3.
When using Amazon S3 as a target in an AWS DMS task, both full load and change data capture (CDC)
data is written in comma-separated values (CSV) format. For more information on how to query this
data, see the AWS Database Blog post on Using AWS Database Migration Service and Amazon Athena to
Replicate and Run Ad Hoc Queries on a SQL Server Database.
Note: For use cases that require a database migration from on-premises to AWS or
database replication between on-premises sources and sources on AWS, we
recommend that you use AWS DMS. Once your data is in AWS, you can use AWS Glue to
move and transform the data.
Amazon Kinesis Data Firehose can also be configured to transform streaming data before it’s stored in
Amazon S3. Its transformation capabilities include compression, encryption, data batching, and Lambda
functions. Amazon Kinesis Data Firehose can compress data before it’s stored in Amazon S3. It currently
supports GZIP, ZIP, and SNAPPY compression formats. See the following diagram for an example
workflow.
In Figure 19, an agent writes data from the source into an Amazon Kinesis Data Firehose delivery
stream. An example is Amazon Kinesis Agent, which can write data from a set of files to an Amazon
Kinesis data stream. Once the event is streamed into the Kinesis Data Firehose delivery stream,
Kinesis Data Firehose delivers the data to a configured Amazon S3 bucket. You can also configure
Kinesis Data Firehose to deliver data in Apache Parquet format using the schema from AWS Glue Data
Catalog. For another example, see the AWS Database Blog post on Streaming Changes in a Database
with Amazon Kinesis.
actual need and usage. AWS and Amazon S3 have several features that can optimize the storage
footprint in your data lake to further reduce costs. By using columnar file formats, you can speed up
your queries while reducing costs. In this section, we look at a couple of different optimizations you
can apply to your Amazon S3-based central data repository.
Amazon S3 Intelligent-Tiering (S3 Intelligent-Tiering) offers the same low latency and high throughput
performance of S3 Standard, but enables you to further optimize storage costs by automatically moving
data between two tiers: one tier that is optimized for frequent access and another lower-cost tier that
is optimized for infrequent access. For a small monthly monitoring and automation fee per object,
Amazon S3 monitors access patterns of the objects in S3 Intelligent-Tiering, and moves the ones that
have not been accessed for 30 consecutive days to the infrequent access tier. It is the ideal storage class
for long-lived data with access patterns that are unknown or unpredictable.
Amazon S3 Standard-Infrequent Access (S3 Standard-IA) offers the high durability, high throughput,
and low latency of S3 Standard, with a low per GB storage price and per GB retrieval fee. S3 Standard-IA
is ideal for data that is accessed less frequently, but requires rapid access when needed.
Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA) stores data in a single Availability Zone (AZ)
and offers a lower-cost option (20% less than storing it in S3 Standard-IA) for infrequently accessed data
that does not require the availability and resilience of S3 Standard or S3 Standard-IA storage classes. S3
One Zone-IA is ideal for storing backup copies or easily re-creatable data.
might need to be retrieved in minutes. On the other hand, Amazon S3 Glacier Deep Archive is another
great long-term storage choice when archived data rarely needs to be accessed.
Another tiering approach you can use in a data lake is to keep processed data and results, as needed for
compliance and audit purposes, in an Amazon S3 Glacier vault. Amazon S3 Glacier Vault Lock allows data
lake administrators to easily deploy and enforce compliance controls on individual Glacier vaults via a
lockable policy. Administrators can specify controls such as “write once read many” (WORM) in a vault
lock policy and lock the policy from future edits. Once locked, the policy becomes immutable and
Amazon S3 Glacier enforces the prescribed controls to help achieve your compliance objectives, and
provide an audit trail for these assets using AWS CloudTrail.
One remedy for the small file problem is to use the S3DistCp utility on Amazon EMR. You can use
it to combine smaller files into larger objects. You can also use S3DistCp to move large amounts of data
in an optimized fashion from HDFS to Amazon S3, Amazon S3 to Amazon S3, and Amazon S3 to HDFS.
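S3DistCp combines files with its --groupBy option, which takes a regular expression whose capture groups decide which files end up in the same output object, and its --targetSize option, which sets the approximate output size in MiB. The Python sketch below imitates that grouping so you can preview a pattern before running a job. It is an illustration only: Python's regular expression engine stands in for the one S3DistCp uses, whole-path matching is an assumption of this sketch, and the paths are made up.

```python
import re
from collections import defaultdict


def group_small_files(paths, group_by):
    """Preview which files a --groupBy pattern would combine.

    Files whose capture groups produce the same text share one group.
    Files that do not match the pattern are left out of the result.
    """
    pattern = re.compile(group_by)
    groups = defaultdict(list)
    for path in paths:
        match = pattern.fullmatch(path)
        if match:
            key = "".join(part or "" for part in match.groups())
            groups[key].append(path)
    return dict(groups)


def build_combine_args(src, dest, group_by, target_size_mib):
    """Build the S3DistCp arguments for combining small files."""
    return [
        "s3-dist-cp",
        f"--src={src}",
        f"--dest={dest}",
        f"--groupBy={group_by}",
        f"--targetSize={target_size_mib}",
    ]
```

For example, with the pattern .*/(\d{4}-\d{2}-\d{2})/.* every file under the same date prefix falls into one group, so a day of small files would be combined into objects of roughly the target size.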
The following diagram shows how to transform ingested data in CSV format into Apache Parquet for
querying by Amazon Athena with improved performance and reduced cost. For more information on
this architecture, see Build a Data Lake Foundation with AWS Glue and Amazon S3.
with Amazon EMR). Although for most of the APIs, there is a direct mapping between the HDFS APIs and
Amazon S3, some of the file system APIs, such as rename and seek, do not have an equivalent API in
Amazon S3, and are mapped in an indirect manner. Further, semantic differences exist between Amazon
S3, an object store, and HDFS, a file system, such as prefixes versus directories. The overview section
of the Hadoop integration with AWS documentation provides a good introduction to such semantic
differences.
Because of these indirect mappings or semantic differences, when using the open source applications on
Amazon EMR, you may sometimes see performance issues or job failures. Many applications also
require specific customizations to work with Amazon S3, such as those in Hive Blobstore Optimizations
or Amazon S3 Optimized Committer, available with Spark on Amazon EMR. The application-specific
sections in this guide also provide you with some of these considerations and how to optimize them to
work with Amazon S3.
Figure 21: Using AWS Glue Data Catalog as the Hive metastore
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
This setting can be passed as part of application configuration when creating an Amazon EMR cluster.
See Configuring Applications for more details about how to pass application configurations.
Considerations
• You can enable encryption for an AWS Glue Data Catalog. For details, see Setting Up
Encryption in AWS Glue.
• Column statistics, Hive authorizations, and Hive constraints are not currently supported. To
see a list of AWS Glue Data Catalog's constraints, see Using the AWS Glue Data Catalog as the
Metastore for Hive.
• An AWS Glue Data Catalog has versions, which means a table can have multiple schema
versions. AWS Glue stores that information in AWS Glue Data Catalog, including the Hive
metastore data.
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql:\/\/hostname:3306\/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "username",
      "javax.jdo.option.ConnectionPassword": "password"
    }
  }
]
2. When using Amazon EMR in the console, pass this information as JSON from Amazon S3 or
embedded text:
3. When using AWS CLI, pass the hive-configuration.json configuration file as a local
file or from Amazon S3:
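One way the AWS CLI step above could look is sketched below. The cluster name, release label, and instance settings are hypothetical placeholders; only the --configurations option, which accepts either a local file:// path or an Amazon S3 URL, is the point of the example. The script prints the command by default instead of calling AWS.

```shell
#!/bin/sh
# Dry run by default: prints the command. Use RUN= (empty) to call AWS.
RUN="${RUN-echo}"

# Local file shown here; an S3 URL such as
# https://s3.amazonaws.com/amzn-s3-demo-bucket/hive-configuration.json also works.
CONFIG="file://hive-configuration.json"

create_cluster_with_metastore() {
  # Hypothetical name, release label, and sizing; adjust for your environment.
  $RUN aws emr create-cluster \
    --name "hive-external-metastore" \
    --release-label emr-5.30.0 \
    --applications Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations "$CONFIG"
}

create_cluster_with_metastore
```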
Considerations
A Hive metastore is a single point of failure. Amazon RDS doesn't automatically replicate databases, so
it's highly recommended that you enable replication when using Amazon RDS to avoid any failure. To
learn more about how to create a database replica in a different Availability Zone, refer to the following
sources:
AWS Database Migration Service is a data migration service that can also be used to create ongoing
replication. The blog post Replicating Amazon EC2 or On-Premises SQL Server to Amazon RDS for SQL
Server discusses how to achieve ongoing replication for SQL Server, but the same method applies to
other databases.
Multitenancy on EMR
Amazon EMR provides a comprehensive set of features for building a highly secure multitenant Hadoop
cluster that protects both resources and data. Multitenancy on Amazon EMR offers mechanisms to isolate
the data and resources of different tenants, and also provides controls to prevent a single application,
user, or queue from monopolizing cluster resources. Multitenancy comes with its own set of challenges. For
example, implementing multitenancy at all stages of a pipeline requires an understanding of the
nuances of the processes and tools involved; metering tenant usage of resources can be hard, especially if
the tenants share metadata and compute resources; scaling and the time involved can become difficult as
new tenants onboard; and applying robust security controls overall can be a daunting task.
This chapter discusses steps to implement multitenancy on Amazon EMR along with key dimensions,
such as user, data, and resource isolation followed by recommended best practices.
There are two different conceptual models for isolating users, data, and resources when building
multitenant analytics on Amazon EMR.
• Silo Mode
• Shared Mode
Silo Mode
In silo mode, each tenant gets their own Amazon EMR cluster with specific tools for processing and
analyzing their datasets. Data is stored in the tenant’s Amazon S3 bucket or in HDFS on the cluster. The
Hive metastore is typically on the cluster or stored externally on Amazon RDS.
In this model, you can configure your cluster to be automatically terminated after all of the steps of your
processing complete. This setup is referred to as a transient cluster. A transient cluster provides total
segregation per tenant, and can also decrease costs as the cluster is charged only for the duration of the
time it runs. The following table lists the advantages and disadvantages of using silo mode with Amazon
EMR.
Advantage | Disadvantage
Provides complete isolation of data and resources. | Sharing data across clusters (especially when using HDFS) can be difficult.
Can be cost effective when used with Spot Instances and transient clusters. | Launching individual clusters can be expensive.
Easy to measure usage of resources per tenant. |
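The transient cluster described in this section can be sketched with the AWS CLI as follows. The tenant name, bucket, script path, release label, and sizing are all hypothetical. The two options that make the cluster transient are --steps, which defines the work to run, and --auto-terminate, which shuts the cluster down once the steps finish. The script prints the command by default instead of calling AWS.

```shell
#!/bin/sh
# Dry run by default: prints the command. Use RUN= (empty) to call AWS.
RUN="${RUN-echo}"

# Hypothetical per-tenant bucket prefix.
TENANT_PREFIX="s3://amzn-s3-demo-bucket/tenant-a"

launch_transient_cluster() {
  # The cluster runs one Spark step and terminates afterwards,
  # so the tenant is charged only while the job runs.
  $RUN aws emr create-cluster \
    --name "tenant-a-nightly" \
    --release-label emr-5.30.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --log-uri "$TENANT_PREFIX/logs/" \
    --steps "Type=Spark,Name=NightlyJob,ActionOnFailure=TERMINATE_CLUSTER,Args=[--deploy-mode,cluster,$TENANT_PREFIX/jobs/nightly.py]" \
    --auto-terminate
}

launch_transient_cluster
```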
Shared Mode
In shared mode, tenants share a single Amazon EMR cluster that has the tools for processing, analysis,
and data science all installed on it. Datasets are stored in the tenant’s S3 bucket or
the tenant’s HDFS folder on the cluster. The Hive metastore can be on the cluster or externally on
Amazon RDS or AWS Glue Data Catalog. In many organizations, this shared scenario is more common.
Sharing clusters between organizations is a cost-effective way of running large Hadoop installations
since it enables them to derive benefits of economies of scale without creating private clusters.
A large multi-node cluster with all the tools and frameworks installed can support a variety of users. In
addition, this infrastructure can also be used by end users who can launch edge nodes to run their data
science platforms.
Even though it is cost effective, sharing a cluster can be a cause for concern because a tenant might
monopolize resources and cause the SLA to be missed for other tenants. For example, an analyst can
issue a long-running query on Presto or Hive and consume much of the cluster's resources. Or, a data
scientist might train a model over large amounts of data.
The following table lists the advantages and disadvantages of launching an Amazon EMR cluster in a
shared mode.
Advantage | Disadvantage
Less operational burden as there is one cluster to maintain. | Hard to measure usage and resources when you have many tenants.
Can be cost effective if the cluster is well-utilized. | Configuring the YARN scheduler can be difficult and complex.
User Isolation
Authentication
Authentication of users is a critical piece in securing the cluster resources and preventing unauthorized
access to data. On Amazon EMR, you can authenticate users through an LDAP server, set up Kerberos
to provide strong authentication through secret-key cryptography, or use SAML Identity Provider based
authentication through the AWS Lake Formation integration. When you use Kerberos authentication, Amazon
EMR configures Kerberos for the applications, components, and subsystems that it installs on the cluster
so that they are authenticated with each other.
When you use Kerberos with Amazon EMR, you can choose to set it up as a cluster-dedicated KDC or an
external KDC with different architecture options. Regardless of the architecture that you choose, you
configure Kerberos using the same steps:
1. Create a security configuration. When you create the cluster, you must specify the security
configuration and compatible cluster-specific Kerberos options. See Create Security
Configurations in the Amazon EMR Management Guide for more details.
2. Create HDFS directories for Linux users on the cluster that match user principals in the KDC.
Once completed, you can use an Amazon EC2 key pair to authorize SSH client connections to cluster
instances.
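The two Kerberos steps above can be sketched as follows for the cluster-dedicated KDC option. The configuration name, ticket lifetime, and user names are hypothetical. The script writes the security configuration JSON to the current directory and prints the commands by default instead of running them.

```shell
#!/bin/sh
# Dry run by default: prints the commands. Use RUN= (empty) to execute them.
RUN="${RUN-echo}"

# Step 1: security configuration for a cluster-dedicated KDC.
cat > kerberos-security-config.json <<'EOF'
{
  "AuthenticationConfiguration": {
    "KerberosConfiguration": {
      "Provider": "ClusterDedicatedKdc",
      "ClusterDedicatedKdcConfiguration": {
        "TicketLifetimeInHours": 24
      }
    }
  }
}
EOF

create_security_configuration() {
  $RUN aws emr create-security-configuration \
    --name "kerberos-dedicated-kdc" \
    --security-configuration file://kerberos-security-config.json
}

# Step 2: HDFS home directories for Linux users that match KDC principals.
# Run this part on the cluster as a user allowed to administer HDFS.
create_hdfs_home_dirs() {
  for user in "$@"; do
    $RUN hdfs dfs -mkdir -p "/user/$user"
    $RUN hdfs dfs -chown "$user:$user" "/user/$user"
  done
}

create_security_configuration
create_hdfs_home_dirs analyst1 engineer1
```

When the cluster is created, the security configuration is referenced by name with --security-configuration, and the realm and KDC admin password are supplied through --kerberos-attributes.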
You can also use SAML Identity Provider based authentication when you integrate Amazon EMR with
AWS Lake Formation. When a user is authenticated with SAML Identity Providers, Amazon EMR
automatically creates a Kerberos principal and a Linux user, and jobs submitted through Spark SQL/Livy
impersonate the authenticated user. For details, see Supported Applications and Features in the
Amazon EMR Management Guide.
For example, in the following figure, the user emr-analyst is authenticated using a SAML Identity Provider.
Once the user is authenticated, Amazon EMR initiates a Spark Livy session when the user submits jobs using
Zeppelin or an EMR notebook.
For more information about setting up a trust relationship between SAML providers and IAM, see
Configure Third-Party Providers for SAML. For more information about IAM Roles needed for this
integration, see IAM Roles for Lake Formation, and on how to launch an Amazon EMR cluster, see
Launch an Amazon EMR cluster with Lake Formation.
Data Isolation
After the users are authenticated, you must consider what data assets they are authorized to use. You
can choose to implement authorization on Amazon EMR at the storage layer or server layer. By default,
the policy attached to the EC2 role on your cluster determines the data that can be accessed in Amazon
S3. With EMR File System (EMRFS) authorization, you can specify the IAM role to assume when a user
or group uses EMRFS to access Amazon S3. Choosing the IAM role for each user or group enables
fine-grained access control for S3 on multiuser Amazon EMR clusters.
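As an illustration of EMRFS authorization, the following sketch writes a security configuration that maps one group and one user to different IAM roles. The account ID, role names, group, and user are hypothetical. The script prints the AWS CLI command by default instead of calling AWS.

```shell
#!/bin/sh
# Dry run by default: prints the command. Use RUN= (empty) to call AWS.
RUN="${RUN-echo}"

# Each mapping names the IAM role that EMRFS assumes for the listed identifiers.
cat > emrfs-role-mappings.json <<'EOF'
{
  "AuthorizationConfiguration": {
    "EmrFsConfiguration": {
      "RoleMappings": [
        {
          "Role": "arn:aws:iam::111122223333:role/emrfs-analysts-read",
          "IdentifierType": "Group",
          "Identifiers": ["analysts"]
        },
        {
          "Role": "arn:aws:iam::111122223333:role/emrfs-engineer-write",
          "IdentifierType": "User",
          "Identifiers": ["engineer"]
        }
      ]
    }
  }
}
EOF

create_emrfs_security_configuration() {
  $RUN aws emr create-security-configuration \
    --name "emrfs-role-mappings" \
    --security-configuration file://emrfs-role-mappings.json
}

create_emrfs_security_configuration
```

Requests from identities that match no mapping fall back to the policy attached to the cluster's EC2 role, so that role is usually kept as restrictive as possible.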
Note: EMRFS doesn’t prevent users that run Spark applications or users that access the
EMR cluster via SSH from bypassing EMRFS and assuming different IAM roles. For more
information, see AWS Lake Formation.
table and columns authorized in shared mode/multi-tenant cluster. The AWS Lake Formation admin can
define and manage access permissions for databases, tables, and columns in the AWS Glue Data Catalog in
AWS Lake Formation. For more information, see AWS Lake Formation.
Resource Isolation
On Amazon EMR, you can use different YARN queues to submit jobs. Each of the YARN queues may have
a different resource capacity and be associated with specific users and groups on the Amazon EMR
cluster. The YARN Resource Manager UI shows a list of available queues.
Note: YARN queues apply only to applications that run on YARN. For example, YARN
queues are not used by Presto applications.
For example, a user engineer from the engineers group can log in to the EMR primary node and submit
jobs to the engineers YARN queue:
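A submission of this kind might look like the following sketch, which uses the SparkPi example that ships with Spark on Amazon EMR. The queue name comes from the text; the jar path and argument are assumptions about a typical EMR installation. The script prints the command by default; run it on the primary node with RUN set to an empty string to submit the job.

```shell
#!/bin/sh
# Dry run by default: prints the command. Use RUN= (empty) on the primary node to submit.
RUN="${RUN-echo}"

QUEUE="engineers"

submit_to_queue() {
  # --queue selects the YARN queue that the application's containers are charged to.
  $RUN spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --queue "$QUEUE" \
    --class org.apache.spark.examples.SparkPi \
    /usr/lib/spark/examples/jars/spark-examples.jar 10
}

submit_to_queue
```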
In the previous example, the user engineer submits a Spark job and passes the --queue parameter
to specify which queue runs that job. The YARN ResourceManager UI shows
the same job running.
Oozie is included with Amazon EMR release version 5.0.0 and later. You can select Oozie in the Software
Configuration section of the Amazon EMR console.
You can also select Oozie by configuring options through the AWS CLI:
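A minimal sketch of such a command follows. Only the --applications list is essential here; the cluster name, release label, and instance settings are hypothetical. The script prints the command by default instead of calling AWS.

```shell
#!/bin/sh
# Dry run by default: prints the command. Use RUN= (empty) to call AWS.
RUN="${RUN-echo}"

create_cluster_with_oozie() {
  # Name=Oozie installs Oozie alongside the other listed applications.
  $RUN aws emr create-cluster \
    --name "oozie-cluster" \
    --release-label emr-5.30.0 \
    --applications Name=Hadoop Name=Oozie \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles
}

create_cluster_with_oozie
```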
When migrating Apache Oozie from an on-premises Hadoop cluster to Amazon EMR, first migrate the
Oozie database (optional), then migrate the Oozie jobs.
Apache Oozie comes with a dump and load utility that you can use to dump an existing Oozie database
to a local file. You then install and configure the empty database in which to load your Oozie data and,
finally, import the database from the backup file.
1. Log in to the host where the existing Oozie server is running and execute this command.
Note: In Amazon EMR, the oozie-setup.sh file is located in this file path:
/usr/lib/oozie/bin/
2. Using AWS CLI, upload the oozie_db.zip file into Amazon S3:
3. Log in to the Amazon EMR master node and download the oozie_db.zip file from Amazon S3.
4. Change the oozie-site.xml file to point to a new target database. Here is the default
configuration related to Oozie database settings:
For example, if you are planning to use MySQL on Amazon RDS, create the Amazon RDS
instance and then update the oozie-site.xml file to reflect the RDS configuration.
<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:mysql://<<rds-host>>:3306/oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>mysql-username</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>mysql-password</value>
</property>
5. Create the Oozie database in the new RDS instance and grant the required privileges to the
Oozie user:
$ mysql -u root -p
Enter password:
6. Load the previous oozie_db.zip database dump to this new database on AWS:
After importing, the CLI shows how many database rows you have imported and their respective table
names.
7. Restart the Oozie server to reflect the new configuration and database:
Note: Depending on the database, you may need to place the database driver's JAR file in the Oozie
classpath. On Amazon EMR, the location is /var/lib/oozie.
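The database migration steps above can be sketched end to end as follows. The bucket name, the database user, and its password are hypothetical, and the restart command assumes a release where services are managed by systemd; older releases use a different service manager. The oozie-setup.sh path is the Amazon EMR location given in the note to step 1 and may differ on the source host. The script prints each command by default instead of running it.

```shell
#!/bin/sh
# Dry run by default: prints the commands. Use RUN= (empty) to execute them.
RUN="${RUN-echo}"

DUMP="oozie_db.zip"
S3_URI="s3://amzn-s3-demo-bucket/oozie/$DUMP"
OOZIE_SETUP="/usr/lib/oozie/bin/oozie-setup.sh"

# Steps 1-2 (source Oozie host): export the database and upload the dump.
export_and_upload() {
  $RUN "$OOZIE_SETUP" export "/tmp/$DUMP"
  $RUN aws s3 cp "/tmp/$DUMP" "$S3_URI"
}

# Step 5: SQL to create the empty target database and an Oozie user.
# Pipe the output into the mysql client connected to the new RDS instance.
create_oozie_db_sql() {
  cat <<'EOF'
CREATE DATABASE oozie DEFAULT CHARACTER SET utf8;
CREATE USER 'oozie'@'%' IDENTIFIED BY 'change-me';
GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%';
EOF
}

# Steps 3, 6, and 7 (Amazon EMR node): download, import, and restart Oozie.
# Step 4 (editing oozie-site.xml) is done by hand before the import.
download_and_import() {
  $RUN aws s3 cp "$S3_URI" "/tmp/$DUMP"
  $RUN "$OOZIE_SETUP" import "/tmp/$DUMP"
  $RUN sudo systemctl restart oozie
}

export_and_upload
create_oozie_db_sql
download_and_import
```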
• job.properties
• workflow.xml
• coordinator.xml
• Bundle files
• external parameter files
• any dependent file
To move these files quickly, take these steps:
1. Compress all of the Oozie related files into a single archive file.
2. Upload that archive file to Amazon S3.
3. Download that file to an Amazon EMR primary node.
4. Extract the compressed file to appropriate folders.
5. Modify the individual files to reflect the Amazon EMR cluster settings.
6. Resubmit those Oozie jobs on Amazon EMR.
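The steps above can be sketched as two small functions, one for the on-premises host and one for the Amazon EMR primary node. The directories, archive name, and bucket are hypothetical, and the Oozie URL assumes the server listens on its default port on the same node. Step 5 remains a manual edit. The script prints each command by default instead of running it.

```shell
#!/bin/sh
# Dry run by default: prints the commands. Use RUN= (empty) to execute them.
RUN="${RUN-echo}"

ARCHIVE="oozie-jobs.tar.gz"
S3_URI="s3://amzn-s3-demo-bucket/oozie/$ARCHIVE"

# Steps 1-2 (on-premises host): archive the job files and upload the archive.
package_and_upload() {
  $RUN tar -czf "$ARCHIVE" -C "$1" .
  $RUN aws s3 cp "$ARCHIVE" "$S3_URI"
}

# Steps 3-4 and 6 (Amazon EMR primary node): download, extract, resubmit.
download_and_resubmit() {
  $RUN aws s3 cp "$S3_URI" "$ARCHIVE"
  $RUN mkdir -p "$1"
  $RUN tar -xzf "$ARCHIVE" -C "$1"
  # Step 5 is manual: update nameNode, resourceManager, and paths in job.properties first.
  $RUN oozie job -oozie http://localhost:11000/oozie -config "$1/job.properties" -run
}

package_and_upload /opt/oozie/jobs
download_and_resubmit /home/hadoop/oozie-jobs
```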
Considerations
Consider these issues when migrating the Oozie database and jobs.
• The Oozie native web interface is not supported on Amazon EMR. To use a frontend interface
for Oozie, use the Hue application running on Amazon EMR. For more information, see Hue.
• By default, Oozie on Amazon EMR is configured to use the embedded Derby database. We
recommend that you use an Amazon RDS instance to host the Oozie database.
• You can use Amazon S3 to store workflow XML files. To do so, you must place EMRFS library
files in the Oozie classpath. Follow these steps:
a. Open a command line and use this command to locate the emrfs-hadoop-assembly
file on the Amazon EMR primary node.
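One way to locate the file and make it available to Oozie is sketched below. Both directories are assumptions: the search path is where EMRFS libraries are typically installed, and the destination is an Oozie library directory that you should confirm for your release before copying. The jar file name passed to the copy function is a placeholder for whatever the search returns. The script prints the commands by default instead of running them.

```shell
#!/bin/sh
# Dry run by default: prints the commands. Use RUN= (empty) to execute them.
RUN="${RUN-echo}"

EMRFS_DIR="/usr/share/aws/emr/emrfs"   # assumed EMRFS install location
OOZIE_LIB="/usr/lib/oozie/lib"         # assumed Oozie classpath directory

locate_emrfs_jar() {
  $RUN find "$EMRFS_DIR" -name 'emrfs-hadoop-assembly*.jar'
}

copy_emrfs_jar() {
  # $1 is the jar path reported by locate_emrfs_jar.
  $RUN sudo cp "$1" "$OOZIE_LIB/"
}

locate_emrfs_jar
copy_emrfs_jar "$EMRFS_DIR/lib/emrfs-hadoop-assembly-VERSION.jar"
```

Oozie generally needs a restart before it picks up a newly added library.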
The following is a sample workflow.xml file that executes SparkPi using Oozie's Spark action. In this
sample, spark-examples-jar and workflow.xml are stored in an Amazon S3 bucket
(s3://tm-app-demos/oozie).
<action name='spark-node'>
    <spark xmlns="uri:oozie:spark-action:1.0">
        <resource-manager>${resourceManager}</resource-manager>
        <name-node>${nameNode}</name-node>
        <master>${master}</master>
        <name>SparkPi</name>
        <class>org.apache.spark.examples.SparkPi</class>
        <jar>s3://tm-app-demos/oozie/lib/spark-examples.jar</jar>
    </spark>
    <ok to="end" />
    <error to="fail" />
</action>
<kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
This sample is the corresponding job.properties file. See Figure 40 for the Amazon S3 bucket that is
created for the Oozie files.
nameNode=hdfs://ip-10-0-30-172.ec2.internal:8020
resourceManager=ip-10-0-30-172.ec2.internal:8032
master=yarn-cluster
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
#oozie.wf.application.path=${nameNode}/user/${user.name}/oozie/spark
oozie.wf.application.path=s3://tm-app-demos/oozie
You can connect to Amazon EMR from AWS Step Functions to build data processing and analytics
workflows. With minimal code, you can orchestrate Amazon EMR provisioning using Step Functions. The
integration between Amazon EMR and Step Functions depends on EMR Service Integration APIs. Using
those APIs, a Step Functions state machine can:
• Create or terminate your Amazon EMR cluster. You can reuse the same cluster in your
workflow or can create the cluster on-demand based on your workflow.
• Add or cancel an EMR step. Each step is a unit of work that contains instructions to manipulate
data for processing by software installed on the cluster. With this capability, you can submit Apache
Spark, Apache Hive, or Presto jobs to an Amazon EMR cluster. You can also create
dependencies between multiple steps or design them to run in parallel.
• Modify the size of an EMR cluster. This allows you to scale your EMR cluster programmatically
depending on the requirements of each step of your workflow.
The following image is an example of how AWS Step Functions orchestrates multiple Apache Spark jobs.
Figure 41: Multiple Apache Spark jobs orchestrated with AWS Step Functions
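As a rough sketch of the integration described above, the following script writes a small state machine definition that creates a cluster, runs one Spark step, and terminates the cluster. The state names, cluster settings, role names, and account ID are hypothetical; the three Resource values are the Step Functions service integration identifiers for Amazon EMR. The script prints the AWS CLI command by default instead of calling AWS.

```shell
#!/bin/sh
# Dry run by default: prints the command. Use RUN= (empty) to call AWS.
RUN="${RUN-echo}"

# The quoted heredoc keeps the "$." JSON paths from being expanded by the shell.
cat > emr-pipeline.asl.json <<'EOF'
{
  "Comment": "Create a cluster, run one Spark step, terminate the cluster",
  "StartAt": "CreateCluster",
  "States": {
    "CreateCluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
      "Parameters": {
        "Name": "sfn-spark-demo",
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": "Spark"}],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "Instances": {
          "KeepJobFlowAliveWhenNoSteps": true,
          "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2}
          ]
        }
      },
      "ResultPath": "$.cluster",
      "Next": "RunSparkStep"
    },
    "RunSparkStep": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.cluster.ClusterId",
        "Step": {
          "Name": "SparkPi",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "--class", "org.apache.spark.examples.SparkPi",
                     "/usr/lib/spark/examples/jars/spark-examples.jar", "10"]
          }
        }
      },
      "ResultPath": "$.step",
      "Next": "TerminateCluster"
    },
    "TerminateCluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
      "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
      "End": true
    }
  }
}
EOF

create_state_machine() {
  $RUN aws stepfunctions create-state-machine \
    --name "emr-spark-pipeline" \
    --definition file://emr-pipeline.asl.json \
    --role-arn "arn:aws:iam::111122223333:role/StepFunctionsEmrRole"
}

create_state_machine
```

The .sync suffix makes a state wait until the cluster is ready or the step finishes, which is what lets later states depend on earlier ones. Additional Spark jobs can be added as further addStep.sync states or placed in a Parallel state.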
For more information on how to integrate AWS Step Functions to create orchestration for Hadoop-
based jobs, see these blog posts:
Apache Airflow
Airflow is an open source task scheduler that helps manage ETL tasks. Apache Airflow workflows can
be scheduled and managed from one central location. With Airflow’s Configuration as Code approach,
automating the generation of workflows, ETL tasks, and dependencies is easy. It helps developers shift
their focus from building and debugging data pipelines to solving business problems.
Apache Airflow can be installed on an Amazon EC2 instance or on an Amazon EMR primary node
through bootstrap. It comes with a variety of connectors that help to integrate it with different AWS
services.
For more information on how Airflow can be used to build an orchestration pipeline and how it can be
integrated to run jobs on Amazon EMR, see the following posts on the AWS Big Data Blog:
• Build a Concurrent Data Orchestration Pipeline Using Amazon EMR and Apache Livy
• Orchestrate big data workflows with Apache Airflow, Genie, and Amazon EMR
Luigi
Luigi is another open source application you can use to build a complex pipeline of batch jobs. It
handles scheduling, dependency resolution, and workflow management, and includes command line tool
integration.
• Most of the orchestration applications use a default, embedded database to store job
metadata. For production workloads, we recommend that you use a separate database for
better performance and availability.
• When possible, use serverless or managed orchestration services to reduce ongoing manual
involvement.
• Integrate with notification services, such as Amazon SNS and Amazon CloudWatch, so that
appropriate parties are immediately notified upon failure and can be involved proactively.
• Make sure that the orchestration application can handle both asynchronous and synchronous
job and task execution for better performance and reduced overhead.
• The orchestration application should monitor job execution and management so that
developers can monitor everything centrally.
• The orchestration application should be able to handle failure gracefully.
• If you use Apache Airflow, make sure to use cluster-mode configuration for production
workloads.
Use the following table to determine the most appropriate orchestration application for your use case.
Serverless | Yes | No | No
Hybrid environment – AWS and non-AWS services | Only in the cloud | Only for Hadoop jobs | Broad coverage
Shared Cluster
In a shared cluster, be aware of how many concurrent jobs you expect to run at any given time. By
default, EMR configures executors to use the maximum number of resources possible on each node
through the usage of the maximizeResourceAllocation property. On a shared cluster, you may need to
manually configure Spark cores, memory, and executors. For details, see Best practices for successfully
managing memory for Apache Spark applications on Amazon EMR on the AWS Big Data Blog. The shared
cluster is appropriate for interactive use cases, such as when you are using Jupyter notebooks.
When using a shared cluster, we recommend that you use the dynamic allocation setting in Amazon
EMR to both automatically calculate the default executor size and to allow resources to be given back to
the cluster if they are no longer used. Dynamic allocation is enabled by default on Amazon EMR.
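For illustration, a job submitted to a shared cluster might bound its own resource use as in the following sketch. The executor counts, cores, and memory are hypothetical values, not recommendations; they should be derived from the node sizes and the number of concurrent jobs you expect. The script prints the command by default; run it on the cluster with RUN set to an empty string to submit the job.

```shell
#!/bin/sh
# Dry run by default: prints the command. Use RUN= (empty) on the cluster to submit.
RUN="${RUN-echo}"

submit_shared() {
  # Dynamic allocation lets idle executors be returned to the cluster;
  # maxExecutors caps how much of the shared cluster one job can take.
  $RUN spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --executor-cores 4 \
    --executor-memory 8g \
    "$@"
}

submit_shared --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/jars/spark-examples.jar 100
```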
To measure the impact of these improvements, we used TPC-DS benchmark queries with 3-TB scale
running on a 6-node c4.8xlarge EMR cluster with data in Amazon S3. We measured performance
improvements as the geometric mean of improvement in total query processing time, and total query
processing time across all queries. These improvements mean that your workloads run faster, saving
you compute costs without making any changes to your applications.
Unfortunately, Instance Fleets do not support multiple core or task groups or allow for automatic
scaling. If you have a dynamic workload that requires either of those features, you must use Instance
Groups. For more information on configuring your cluster, see Cluster Configuration Guidelines and Best
Practices in the Amazon EMR Management Guide.
retrieve only a subset of data from an object. As of EMR 5.17.0, S3 Select is supported with CSV and
JSON files. If your query filters out more than half of the original dataset and your network connection
between Amazon S3 and EMR has good transfer speed, S3 Select may be suitable for your application.
Note: Consistent View is intended for a set of chained jobs or applications that control
all reads and writes to S3, such as an Apache HBase on S3 EMR deployment. It is not
intended to be used for a globally consistent view of all objects in your S3 bucket.
• For Spark Web UIs, access the Spark History Server UI on port 18080 of the EMR
cluster's master node. For more information, see Accessing the Spark Web UI.
• For YARN applications, including Spark jobs, access the Application history tab in the Amazon
EMR console. Up to seven days of application history is retained, including details on task and
stage completion for Spark. For more information, see View Application History.
• By default, Amazon EMR clusters launched via the console automatically archive log files to
Amazon S3. You can find raw container logs in Amazon S3 and view them while the cluster is
active and after it has been terminated. For more information, see View Log Files.
Hive Metastore
By default, Amazon EMR clusters are configured to use a local instance of MySQL as the Hive metastore.
To allow for the most effective use of Amazon EMR, you should use a shared Hive metastore, such as
Amazon RDS, Amazon Aurora, or AWS Glue Data Catalog.4 If you require a persistent metastore, or if
you have a metastore shared by different clusters, services, applications, or AWS accounts, we
recommend that you use an AWS Glue Data Catalog as a metastore for Hive. For more information, see
Configuring an External Metastore for Hive.
Upgrading
Hive upgrades are coupled with Hive metastore updates. The Hive metadata database should be backed up
and isolated from production instances because Hive upgrades may change the Hive schema, which may
cause compatibility issues and problems in production. You can perform upgrades using the
-upgradeSchema option of the Hive Schema Tool. You can also use this tool to upgrade the
schema from an older version to the current version.
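A cautious upgrade sequence might look like the following sketch, which assumes a MySQL-compatible metastore database. The host name, database user, and schematool path are assumptions about a typical installation. The script prints the commands by default instead of running them.

```shell
#!/bin/sh
# Dry run by default: prints the commands. Use RUN= (empty) to execute them.
RUN="${RUN-echo}"

SCHEMATOOL="/usr/lib/hive/bin/schematool"   # assumed location on an EMR node
DB_HOST="metastore.example.internal"        # hypothetical metastore database host

upgrade_metastore_schema() {
  # Back up the metastore database before touching the schema.
  $RUN mysqldump -h "$DB_HOST" -u hive -p --result-file=hive-metastore-backup.sql hive
  # Report the current schema version.
  $RUN "$SCHEMATOOL" -dbType mysql -info
  # Preview the upgrade scripts, then apply them.
  $RUN "$SCHEMATOOL" -dbType mysql -upgradeSchema -dryRun
  $RUN "$SCHEMATOOL" -dbType mysql -upgradeSchema
}

upgrade_metastore_schema
```

Running the upgrade first against a restored copy of the backup, rather than the production database, keeps a failed upgrade from affecting running clusters.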
Hive 2.3.0 added support for Apache Spark as an execution engine, but this setup requires changes to the
underlying Hive jars in Spark and is not a supported configuration on EMR.
The desired value of the Tez container size depends upon the specifics of your job. It must be at least as
large as the value of the mapreduce.map.memory.mb setting.
If the Tez container runs out of memory, the following error message appears:
Container
[pid=1234,containerID=container_1111222233334444_0007_02_000001] is
running beyond physical memory limits. Current usage: 1.0 GB of 1
GB physical memory used; 1.9 GB of 5 GB virtual memory used.
Killing container.
To increase the memory, set the hive.tez.container.size to a value greater than what is
required for the job. The memory value required for the job can be found in the error message. In
addition to container size, increase the Tez Java Heap size. In general, the Tez Java Heap size should be
80% of the Tez Container size. You can adjust the value of the Tez Java Heap size with the setting
hive.tez.java.opts.
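The 80% guideline above can be turned into a small helper that prints matching Hive settings for a chosen container size. The 4096 MB value is only an example; pick a size larger than the memory reported in the error message.

```shell
#!/bin/sh
# Print Hive settings for a Tez container of the given size in MB,
# with the Java heap set to 80% of the container (integer arithmetic).
tez_memory_settings() {
  container_mb="$1"
  heap_mb=$((container_mb * 80 / 100))
  printf 'SET hive.tez.container.size=%s;\n' "$container_mb"
  printf 'SET hive.tez.java.opts=-Xmx%sm;\n' "$heap_mb"
}

tez_memory_settings 4096
# Prints:
#   SET hive.tez.container.size=4096;
#   SET hive.tez.java.opts=-Xmx3276m;
```

The two lines can be pasted at the top of a Hive script or passed with --hiveconf when launching the job.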
HDFS vs S3 Considerations
A benefit of EMR is the ability to separate storage and compute requirements through the use of
Amazon S3 as your primary data store. This approach allows you to save costs compared to HDFS by
scaling your storage and compute needs up or down independently. Amazon S3 provides infinite
scalability, high durability and availability, and additional functionality such as data encryption and
lifecycle management. That said, Hadoop was designed with the expectation that the underlying
filesystem would support atomic renames and be consistent. There are several options to consider if you
require immediate list and read-after-write consistency as part of your workflow.
Other ways that EMR customers have solved large-scale consistency issues include implementing a
custom manifest file approach to their jobs instead of using S3 list operations to retrieve data, or by
building their own metadata stores, such as Netflix Iceberg.
OVERWRITE or ALTER TABLE CONCATENATE, this setting can sometimes result in increased
execution times and missing data. This issue is caused by implementing that feature on queries that are
not multi-staged MapReduce jobs. This scenario results in Hive using Amazon S3 as the scratch directory
during the job. As a result, the number of renames on S3 increases as the job progresses through writing
scratch data, copying to another S3 temporary location, and finally copying to the final S3 location.
If you disable this setting, the scratch directory for the job is relocated to HDFS. If you prefer this
scenario, make sure to allocate enough space on the EMR cluster to accommodate this change. By
default, the distcp job that occurs at the end of the process is limited to a maximum of 20 mappers. If
you find this job is taking too long, particularly if you are processing terabytes of data, you can manually
set the number of max mappers using the following code in your Hive job:5
SET distcp.options.m=500
In some cases, the FairScheduler may be more desirable for clusters where it is acceptable for a job to
consume unused resources. In FairScheduler, resources are shared between queues.
Configure FairScheduler
You can configure FairScheduler in a couple of ways when creating an EMR cluster.
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
    }
  }
]
2. Modify the AWS CLI command to match the following code example and use this command to
start a cluster with the config file created above.
1. Sign in to the AWS Management Console and open the Amazon EMR console at
https://console.aws.amazon.com/elasticmapreduce/.
2. Choose Create cluster, Go to advanced options.
3. Choose Spark.
4. Under Edit software settings, leave Enter configuration selected and enter the following
configuration:
classification=yarn-site,properties=[yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler]
Warm Failover
In a warm failover scenario, a secondary, smaller cluster is kept running in addition to the primary
cluster. If a failure occurs, clients can be redirected to the new cluster, either manually or by updating an
entry in Amazon Route 53. You can configure the secondary cluster with a small number of nodes, and
then if it becomes the primary cluster, use automatic scaling to increase the number of nodes.
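The Route 53 redirection mentioned above could be scripted along these lines. The hosted zone ID, record name, and target host name are hypothetical; the sketch assumes clients reach the cluster through a CNAME record with a short TTL. The script writes the change batch to the current directory and prints the AWS CLI command by default instead of calling AWS.

```shell
#!/bin/sh
# Dry run by default: prints the command. Use RUN= (empty) to call AWS.
RUN="${RUN-echo}"

# UPSERT replaces the record so clients resolve to the secondary cluster.
cat > failover-change-batch.json <<'EOF'
{
  "Comment": "Point clients at the secondary EMR cluster",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "emr.example.internal",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          {"Value": "ip-10-0-1-20.ec2.internal"}
        ]
      }
    }
  ]
}
EOF

fail_over_to_secondary() {
  $RUN aws route53 change-resource-record-sets \
    --hosted-zone-id Z0000000EXAMPLE \
    --change-batch file://failover-change-batch.json
}

fail_over_to_secondary
```

A short TTL keeps the time clients spend on the failed cluster small, at the cost of more frequent DNS lookups.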
Multi-Cluster Configuration
Because all data is stored on S3, clients do not all need to go through the same cluster. You can
configure multiple clusters with a load balancer or expose a job submission framework to consumers of
the environment. These clusters can be shared among all clients or shared on a per-team basis
depending on internal requirements. One of the benefits of this approach is that in the event of a cluster
failure, the impact on your users is limited to just those queries executing on the single cluster that fails.
In addition, you can configure automatic scaling so that each cluster scales independently of the others.
If the clusters are segmented on a per-team basis, this approach ensures that any one team's jobs don’t
impact the performance of another team's jobs.
However, using multiple clusters means using multiple master nodes, one for each cluster. Therefore,
you need additional EC2 instances that you wouldn't have to pay for if you used only a single cluster.
On the other hand, because EC2 instances are billed per second with a one-minute minimum, with
multiple clusters you can save costs by activating only the cluster needed to perform
the tasks rather than running one single cluster all of the time. You can configure the logic for this setup
inside an AWS Lambda function that calls the activation on the pipeline. Then, you can start up or take
down a cluster without impacting another cluster's activities.
Transient Design
Transient clusters can mitigate the cost and operational requirements of having a single, long-running
cluster. This approach is useful if you have predictable short-lived jobs, but may not be appropriate if
you have consumers that are constantly querying data on an ongoing basis.
The Hive LLAP daemons are managed and run as a YARN service. Since a YARN service can be considered
a long-running YARN application, some of your cluster resources are dedicated to Hive LLAP and cannot
be used for other workloads. For more information, see LLAP and YARN Service API.
Several EMR notebooks can be attached to the same cluster, subject to the memory constraints of the
master node instance type. This feature helps provide a multi-tenant environment for several users to
have their own notebooks. You can configure AWS IAM roles so that the notebook of one IAM user
cannot be seen or accessed by another user in the same account. EMR Notebooks also provide seamless
Git and Bitbucket integration: users can link their GitHub or Bitbucket repositories to an
EMR notebook and check in their code to one or more linked repositories.
For more information, see Using Amazon EMR Notebooks in the Amazon EMR Management Guide,
Custom Packages
  JupyterHub: For Python kernel, custom packages need to be installed manually within the docker container the JupyterHub is running on. For PySpark kernel, from EMR version >= 5.26.0, custom packages can be installed through notebook-scoped libraries on the Spark context itself while running the notebook blocks. On EMR versions < 5.26.0, custom packages need to be installed on all nodes using EMR bootstrap action or at the AMI level.
  EMR Notebooks: For Python kernel, to use custom packages, tar installation file should be uploaded manually after which conda offline installation should be performed. For PySpark kernel, from EMR version >= 5.26.0, custom packages can be installed through notebook-scoped libraries on the Spark context itself while running the notebook blocks. On EMR versions < 5.26.0, custom packages need to be installed on all nodes using EMR bootstrap action or at the AMI level.
Git integration
  JupyterHub: Currently JupyterLab and Git integration is not supported. But JupyterLab plugin with git extension can be installed through docker commands.
  EMR Notebooks: Git and Bit Bucket integration is supported natively and can be used to check out code through JupyterLab.
Notebook storage
  JupyterHub: Notebooks are stored locally in the cluster. They need to be saved manually and uploaded to S3 or JupyterHub must be configured to automatically save files to S3 during cluster launch.
  EMR Notebooks: Notebooks are saved in S3 base location specified during notebook creation.
Multi-tenancy
  JupyterHub: All the users have same instance of JupyterHub. So, PAM, LDAP or SAML authentication must be set up to segregate notebooks for each user.
  EMR Notebooks: Many notebooks can be attached to a single cluster based on the master node instance type. So, a notebook can be created per-user and the access to another user’s notebook can be restricted using IAM policies.
Authentication
  JupyterHub: LDAP, Kerberos and SAML can be custom configured. Only one of these authentication mechanisms can be applied at a time.
  EMR Notebooks: Supports IAM users or SAML authentication through AWS Lake Formation. Kerberized EMR clusters without AWS Lake Formation are not supported.
You can export your Jupyter notebook using one of two methods:
• If you have a small number of notebooks, export them by manually downloading
each notebook (Notebook > File > Download As > Notebook [.ipynb]). Downloaded notebook
files are in the <filename>.ipynb file format. To copy these files to S3, you can use either the
AWS Management Console or the AWS CLI aws s3 cp command.
• If you have several notebooks, you can copy the notebooks directly from the Jupyter
installation to S3. To do so, SSH into the node that holds the Jupyter installation, go to the directory
where the notebooks are saved (for example, /mnt/var/lib/jupyter/home/jovyan/), and
then copy the files by running the following command:
aws s3 cp /mnt/var/lib/jupyter/home/jovyan/ \
s3://MyBucket/MyNotebooksFolder/e-12A3BCDEFJHIJKLMNO45PQRST/ \
--recursive
If your Python/PySpark programs are stored in .py format, you must first convert them to the .ipynb file
format before working with them in an EMR notebook. You can use an open source Python package
called jupytext, which can convert Python files into notebook files.
The following code is an example command that converts a .py file to .ipynb file which can then be
imported to EMR notebooks.
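As a rough illustration of what such a conversion produces, the following stdlib-only sketch wraps a Python script in a minimal nbformat 4 notebook with a single code cell. It is not jupytext itself and does not replicate its cell splitting or metadata handling; the function names are hypothetical. With jupytext installed, the equivalent step is typically a single command such as jupytext --to notebook my_script.py, where my_script.py is a placeholder file name.

```python
import json
from pathlib import Path


def py_source_to_notebook(source: str) -> dict:
    """Wrap Python source in a minimal nbformat-4 notebook with one code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                # nbformat stores cell source as a list of lines, newlines kept.
                "source": source.splitlines(keepends=True),
            }
        ],
        "metadata": {},
        "nbformat": 4,
        # Minor version 4 is used so that per-cell "id" fields are not required.
        "nbformat_minor": 4,
    }


def convert_file(py_path: str) -> Path:
    """Write <name>.ipynb next to <name>.py and return the new path."""
    src = Path(py_path)
    notebook = py_source_to_notebook(src.read_text(encoding="utf-8"))
    out = src.with_suffix(".ipynb")
    out.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return out
```

The resulting .ipynb file can then be uploaded to the S3 location that backs your EMR notebook. For scripts that already contain cell markers, jupytext is the better choice because it preserves cell boundaries.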
Hudi is integrated with Apache Spark, Apache Hive, and Presto. With Amazon EMR release version
5.28.0 and later, Amazon EMR installs Hudi components by default when Spark, Hive, or Presto are
installed. You can use Spark or the Hudi DeltaStreamer utility to create or update Hudi datasets. You can
use Hive, Spark, or Presto to query a Hudi dataset written on Amazon S3. Hudi maintains the metadata
of the actions performed on the Hudi dataset in index and data files. With the metadata written in files,
an application can load an existing Hudi dataset by reading its metadata from S3, which makes it easy to reuse
the same data for a variety of use cases.
• Copy on Write: Stores data in columnar format (Parquet) only. Each write operation on this table
type creates a new version of the affected files by rewriting them using a merge.
• Merge on Read: Stores data in both columnar (Parquet) and row-based (Apache Avro) formats.
Write operations store updates as delta files. Compactions run at a scheduled frequency
(synchronously or asynchronously) to produce new columnar files.
In addition to the ability to perform upserts (updates/inserts), Hudi also provides snapshot isolation for
readers (queries), atomic writes of batches of records, incremental pulls, and de-duplication of data.
• DeltaStreamer is a utility included with Hudi that simplifies the process of applying
changes to Hudi datasets. DeltaStreamer is a CLI tool that can operate against three sources:
Amazon S3, Apache Kafka, and Apache Hive incremental pull. (Incremental pull refers to the
ability to pull only the data that changed between two actions.)
• The Spark Datasource API allows you to write your own code to ingest data from a custom source
and use a Hudi datasource to write the result as a Hudi dataset.
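When writing through the Spark Datasource API, the table type and upsert behavior are selected through write options. The sketch below only assembles such an options dictionary; the helper name and the example table and field names are hypothetical, and the option keys follow recent Apache Hudi releases (older releases may use different key names, so check the documentation for the Hudi version shipped with your EMR release).

```python
VALID_TABLE_TYPES = ("COPY_ON_WRITE", "MERGE_ON_READ")


def hudi_upsert_options(table_name, record_key, partition_field,
                        precombine_field, table_type="COPY_ON_WRITE"):
    """Build Hudi datasource write options for an upsert (illustrative helper)."""
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"table_type must be one of {VALID_TABLE_TYPES}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        # Field that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to derive the partition path.
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # When two records share a key, the larger value of this field wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.table.type": table_type,
    }


# Typical use inside a Spark job (not executed here; df and base_path are
# assumed to exist in your application):
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
opts = hudi_upsert_options("trips", "trip_id", "trip_date", "updated_at",
                           table_type="MERGE_ON_READ")
```

Keeping the options in one helper makes it easier to apply the same keys consistently across the pipelines that write to a given dataset.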
The following summarizes when to use each of the two write mechanisms with Hudi:
Use DeltaStreamer when:
• You want a simple, self-managed ingestion tool that automates data compaction and provides
automated checkpointing without the need to write any code.
• You need to perform transformations on ingested data, for example dropping columns, casting, or
filtering data. DeltaStreamer supports passing a SQL query template for SQL-based transformations,
or allows you to plug in your own implementation for more advanced transformations.
• You are ingesting data from Kafka, or using AWS DMS to land files in S3, and you don’t want to
write any code to apply those updates to your dataset.
Use the DataSource API when:
• You are working with several varied data sources and want to create consolidated or derived
tables. For example, if you have existing Spark-based ETL pipelines, you can use Hudi to take
advantage of incremental pull, reading only the specific records that have changed.
• You have existing data pipelines that need to work with both Hudi-managed and non-Hudi-managed
datasets.
• You use Spark Structured Streaming and want to stream events into your Hudi dataset.
Compaction
A compaction activity merges the log files with the base files to generate new compacted files written as
a commit on Hudi’s timeline. Compaction applies to the Merge on Read table type. There are two ways
to perform the compaction – synchronous and asynchronous.
Synchronous (or inline) compaction is triggered at ingestion time after a commit or deltacommit as part
of insert/upsert/bulk_insert operation. Use this compaction type when you want to quickly compact the
recent ‘N’ partitions and can wait for delta logs to accumulate to merge into older partitions. As a result,
your data lake will have the most recent data which is likely to be queried often.
Asynchronous compaction is run as a separate job that is either scheduled by an orchestrator or run
manually. Use this compaction when you do not want to block the ingestion operation for compaction
and require the compaction to be run as part of your workflow.
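The difference between the two compaction modes can be illustrated with a toy in-memory model. This is a conceptual sketch only, not Hudi code: the class and its attributes are hypothetical, the "base" dictionary stands in for compacted columnar files, and the "delta log" list stands in for row-based log files.

```python
class ToyMergeOnReadTable:
    """Toy model of a Merge on Read table (illustrative, not Hudi's API)."""

    def __init__(self):
        self.base = {}        # compacted records: key -> record
        self.delta_log = []   # pending (key, record) updates, oldest first
        self.compactions = 0

    def upsert(self, key, record, inline_compact_every=None):
        """Append an update; optionally compact inline once the log is long enough."""
        self.delta_log.append((key, record))
        if inline_compact_every and len(self.delta_log) >= inline_compact_every:
            self.compact()  # synchronous: the writer pays the compaction cost

    def compact(self):
        """Merge the delta log into the base; later updates override earlier ones."""
        for key, record in self.delta_log:
            self.base[key] = record
        self.delta_log.clear()
        self.compactions += 1

    def snapshot(self):
        """Read the latest view by merging base and delta log on the fly."""
        merged = dict(self.base)
        for key, record in self.delta_log:
            merged[key] = record
        return merged


# Asynchronous style: writes never compact; a separate job calls compact() later.
table = ToyMergeOnReadTable()
table.upsert("a", {"v": 1})
table.upsert("a", {"v": 2})
latest_before_compaction = table.snapshot()
table.compact()
```

In the asynchronous style, readers still see the latest data through the merged snapshot, but read cost grows with the length of the log until the separate compaction job runs.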
Deletes
Apache Hudi supports two types of record-level deletes on data stored in Hudi datasets, selected by
specifying a different record payload implementation. Deletes can be performed using the Hudi RDD API,
the Datasource API, and DeltaStreamer.
• Soft Deletes: Soft deletes let you keep the record key and null the values for all other fields.
You can implement this by ensuring the appropriate fields are nullable in the dataset schema
and simply upserting the dataset after setting these fields to null.
• Hard Deletes: Hard deletes are a stronger form of delete to physically remove any trace of the
record from the dataset.
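The two delete styles can be contrasted on a plain dictionary keyed by record key. This is a conceptual sketch rather than Hudi code; the function names and sample fields are hypothetical. In Hudi, the soft delete corresponds to an ordinary upsert of the nulled record, while the hard delete removes the record entirely.

```python
def soft_delete_record(record, key_field):
    """Return a copy of the record with every field except the key set to None."""
    return {f: (v if f == key_field else None) for f, v in record.items()}


def apply_soft_delete(dataset, key, key_field):
    """Upsert the nulled record so the key survives but the values do not."""
    dataset[key] = soft_delete_record(dataset[key], key_field)


def apply_hard_delete(dataset, key):
    """Physically remove the record from the dataset."""
    dataset.pop(key, None)


users = {
    "u1": {"user_id": "u1", "email": "a@example.com", "city": "Oslo"},
    "u2": {"user_id": "u2", "email": "b@example.com", "city": "Lima"},
}
apply_soft_delete(users, "u1", key_field="user_id")
apply_hard_delete(users, "u2")
```

After these calls, u1 is still present with only its key populated, while u2 no longer appears in the dataset at all.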
• EMR Cluster Size: Use memory-intensive nodes, such as Amazon EC2 R5 instances. Consider
using Instance Fleets and Spot Instances to reduce costs. Aim to fit input data in memory if
possible.
• Input Parallelism: Determines the number of files in your table. Make the property
proportional to the number of target files desired. Hudi defaults to 1500, which may be too high in
certain instances.
• File Size: We recommend file sizes of 128-512 MB, set through the limitFileSize (128 MB)
property. The compactionSmallFileSize (100 MB) property defines the size, in
MB, that is considered a “small file size”. The default value should be appropriate for your
application. The parquetCompressionRatio (0.1) property specifies the expected
compression of Parquet data used by Hudi; adjust this property as needed. Note that with
this property, you are balancing ingest/write latency against read latency.
• Off-heap memory: Increase spark.yarn.executor.memoryOverhead or
spark.yarn.driver.memoryOverhead if needed.
• Spark Memory: If you face out-of-memory errors, reduce Spark memory (spark.memory.fraction,
spark.memory.storageFraction) conservatively, allowing data to spill.
• Bloom Filters: Hudi assumes that maxParquetFileSize is 128 MB and
averageRecordSize is 1024 B, and hence that a file holds approximately 130K records in total.
Set the bloomFilterNumEntries (60000) property to half the number of records in
a file. The trade-off is disk space for fewer false positives.
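The Bloom filter guidance above is simple arithmetic, worked through below. The constants mirror the assumed defaults quoted in the text (128 MB files, 1024 B records); the function names are hypothetical, and you would substitute your own file and record sizes.

```python
MAX_PARQUET_FILE_SIZE_BYTES = 128 * 1024 * 1024  # 128 MB, as assumed in the text
AVERAGE_RECORD_SIZE_BYTES = 1024                 # 1024 B, as assumed in the text


def records_per_file(file_size_bytes, avg_record_bytes):
    """Approximate number of records that fit in one file."""
    return file_size_bytes // avg_record_bytes


def suggested_bloom_entries(file_size_bytes, avg_record_bytes):
    """Half the records per file, following the guidance above."""
    return records_per_file(file_size_bytes, avg_record_bytes) // 2


per_file = records_per_file(MAX_PARQUET_FILE_SIZE_BYTES, AVERAGE_RECORD_SIZE_BYTES)
entries = suggested_bloom_entries(MAX_PARQUET_FILE_SIZE_BYTES, AVERAGE_RECORD_SIZE_BYTES)
print(per_file, entries)
```

With the assumed defaults this gives 131,072 records per file (the "approximately 130K" in the text) and 65,536 suggested entries, which is close to the 60000 default. If your records are much smaller or your files much larger, the same calculation yields a proportionally larger value.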
Table 9: Properties
Properties Description
Sample Architecture
The following architecture shows the workflow for ingesting upserts and deletes using Apache Hudi on
Amazon EMR.
1. Ingest full load and CDCs from OLTP databases using AWS Database Migration Service.
2. AWS DMS task deposits full load and CDCs to Amazon S3.
3. Ingest changes dropped into Amazon S3 incrementally.
4. Write Apache data formats to Amazon S3.
The Amazon EMR workload differs depending on your Hudi table type:
• Copy on Write: Amazon EMR uses a batch workload where transient EMR clusters (on Instance
Fleet and Spot instances) run periodically (e.g. every hour or end of day).
• Merge on Read: Amazon EMR uses a streaming workload where persistent EMR clusters run
Spark Streaming or Apache Hudi DeltaStreamer jobs continuously.
• Cost reduction: If cost reduction is your primary goal, we recommend that you estimate cost
based on both approaches. You may find that the load and query patterns are cheaper to run
using Presto on Amazon EMR. See if cost increases, if any, outweigh the benefits of running
and maintaining a Presto cluster on EMR that is able to scale and provides availability, versus
the features that Amazon Athena provides.
• Performance requirements: If your use case is highly sensitive to performance, you can
fine-tune a Presto cluster to meet the performance requirements.
• Critical features: If there are features that Amazon Athena does not currently provide, such as
the use of custom serializer/deserializers for custom data types, or connectors to data stores
other than those currently supported, then Presto on EMR may be a better fit.
For performance tips and best practices for Athena and Presto, see Top 10 Performance Tuning Tips for
Amazon Athena on the AWS Big Data Blog.
Metadata Management
When using AWS Glue Data Catalog with Presto on Amazon EMR, the authorization mechanism (such as
Hive SQL authorization) is replaced with AWS Glue Based Policies.
You are also required to separately secure the underlying data in Amazon S3. You can secure this data
by using an S3 bucket policy or AWS IAM policy. You may find it more efficient to use IAM policies as you
can centralize access to both Amazon S3 and the AWS Glue Data Catalog.
For more information on using the AWS Glue Data Catalog, see Easily manage table metadata for Presto
running on Amazon EMR using the AWS Glue Data Catalog on the AWS Big Data Blog. For existing
limitations on the interaction between Presto and AWS Glue Data Catalog, see Considerations When
Using AWS Glue Data Catalog.
Accessing data via EMRFS allows you to configure Amazon S3 encryption requirements in an EMR
Security Configuration. With EMRFS, you can also use separate IAM roles within your Presto Cluster to
control access to data according to a user, group, or Amazon S3 location.
For EMR release version 5.12.0 or later, you can switch from using EMRFS to using the
PrestoS3FileSystem. This approach may be beneficial if your organization still relies on Hive SQL-based
authorization and a Hive metastore running on an RDBMS. For additional details on configuring the
PrestoS3FileSystem on Amazon EMR, see EMRFS and PrestoS3FileSystem Configuration.
For details on tuning HBase for best performance on Amazon S3, see Migrate to Apache HBase on
Amazon S3 on Amazon EMR: Guidelines and Best Practices on the AWS Big Data Blog.
HBase Upgrades
For HBase upgrades, we recommend that you run the newer version of HBase alongside the previous
version until all testing is complete. Take a snapshot from HBase and use that snapshot to start a new
cluster with the upgraded version. You can run both clusters simultaneously and perform updates and
reads from both clusters. When testing is complete, you can shut down the old cluster. If you encounter
any issues that require rolling back the upgrade, you can move back to the old cluster.
Depending on your version of HBase and where your generated StoreFiles are, the command to perform
the bulk load is similar to the following code:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
<s3://bucket/storefileoutput/> <tablename>
As with any distributed system, the performance of this process depends on available CPUs and network
bandwidth.
Optimize S3 Uploads
You can optimize S3 uploads by adjusting size settings:
• fs.s3.threadpool.size: By default, this setting is 20, but you can increase it to increase the
number of parallel multipart uploads that are occurring. You may need to use an instance with
a higher number of CPUs to increase this setting.
• fs.s3n.multipart.uploads.split.size: Increase this setting when you're uploading large files to help
avoid reaching multipart limits.
To work around this issue, you can increase the RPC timeout by using the hbase.rpc.timeout
variable when starting your job.
This command shows setting the number of threads to 20 and the HBase RPC timeout to 10 minutes.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
-Dhbase.loadincremental.threads.max=20 \
-Dhbase.rpc.timeout=600000 \
<s3://bucket/storefileoutput/> \
<tablename>
If you create your cluster through the AWS CLI, the JSON file below shows a typical configuration file.
The file defines HBase on S3 settings and enables DEBUG and TRACE logging on two HBase classes.
[
  {
    "Classification": "hbase",
    "Properties": {
      "hbase.emr.storageMode": "s3"
    }
  },
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.rootdir": "s3://<bucket>/<hbaseroot>",
      "hbase.hregion.max.filesize": "21474836480"
    }
  },
  {
    "Classification": "hbase-log4j",
    "Properties": {
      "log4j.logger.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles": "DEBUG",
      "log4j.logger.org.apache.hadoop.hbase.ipc.RpcServer": "TRACE"
    }
  }
]
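One way to avoid hand-editing mistakes in a configuration file like this (for example, stray commas or keys broken across lines) is to generate it. The sketch below builds the HBase on S3 portion with Python's json module; the function name is hypothetical, the root directory is a placeholder, and the 20 GiB region size simply reproduces the value used above.

```python
import json


def hbase_on_s3_configuration(root_dir, max_region_gib=20):
    """Build the EMR configuration list for HBase on S3 (illustrative helper)."""
    return [
        {
            "Classification": "hbase",
            "Properties": {"hbase.emr.storageMode": "s3"},
        },
        {
            "Classification": "hbase-site",
            "Properties": {
                "hbase.rootdir": root_dir,
                # EMR configuration property values are strings, so the byte
                # count is converted explicitly.
                "hbase.hregion.max.filesize": str(max_region_gib * 1024 ** 3),
            },
        },
    ]


config = hbase_on_s3_configuration("s3://<bucket>/<hbaseroot>")
config_json = json.dumps(config, indent=2)  # serialized form, ready to write to a file
```

The serialized output can be saved to a file and passed to aws emr create-cluster with the --configurations option. Logging classifications such as hbase-log4j can be appended to the list in the same way.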
If you must use Impala due to a use case that is not covered by Presto or Athena, then you have three
options to install Impala:
Operational Excellence
Upgrading Amazon EMR Versions
One best practice is to upgrade your Amazon EMR releases on a regular cadence. Upgrading your
clusters’ software ensures that you are using the latest features from open source
applications. The following are a few benefits of staying up-to-date with software upgrades:
The following figure is a sample of Amazon EMR 5.x software releases and corresponding open source
application versions from July 2019 through January 2020. At the time of this document, Amazon EMR
releases a new version approximately every 4–6 weeks, which pulls the latest version of the software.
For a complete list of releases and release notes, see Amazon EMR 5.x Release Versions.
See Software Patching for recommendations on when it may be appropriate to patch software on your
Amazon EMR cluster.
Upgrade Process
When upgrading software, there is a risk of regressions in performance and data quality.
Upgrades may change APIs so that your code no longer runs as is on the new framework.
Upgrades can also introduce new bugs, which can cause applications to fail. AWS makes a best effort
to identify regressions in open source software before Amazon EMR releases by running a large suite of
integration tests, but some regressions may be difficult to identify. Therefore, it is imperative that you
test each release before making it available to your users. However, the more often you upgrade, the
smaller the number of changes between versions, which reduces both the upgrade effort and the risk of
regressions.
The following table lists common open source applications, their release notes, and issue tracking
systems. For Amazon EMR, see Amazon EMR 5.x Release Versions.
Table 10: Application Links for Release Notes and Issue Tracking
Fix Issues
If you find an issue when testing a version, follow these steps:
1. Check if a configuration value can fix the issue. For example, see if you can use a configuration
value to disable a problematic new feature or enhancement.
2. Check if the issue has already been identified and fixed in a later version of the open source
project. If there is a fix in a later version, notify an AWS Support engineer through an AWS Support
channel. AWS will evaluate whether it can be included in our next release.
3. Change the application or query to avoid the issue.
4. Contact AWS Support to see if any workarounds exist.
5. Abandon the upgrade if there is no workaround and wait for a release that has the required fix.
Complete Upgrade
Complete your upgrade by moving all of your Amazon EMR clusters to the new version. Finally,
discontinue use of the older version.
Ensure that all of your code, scripts, and other artifacts exist within source control
By ensuring all of your code, scripts, and other artifacts exist within source control, you can track what
has changed over time. In addition, you can track down the changes that may have impacted your
environment’s operation. This approach also allows you to deploy the artifacts in different stages, such
as a beta environment, and reduces the chances that a bad artifact moves from one stage to another.
Monitor your clusters for abnormal behavior and employ automatic scaling
Specifically, make sure to watch for long running or stuck jobs as well as disk or HDFS capacity issues.
Unit tests are usually written for code, but testing data quality is often overlooked. Incorrect or
malformed data can have an impact on production systems. Data quality issues include the following
scenarios:
• Missing values can lead to failures in the production system that require non-null values.
• Changes in the distribution of data can lead to unexpected outputs of machine learning
models.
• Aggregations of incorrect data can lead to ill-informed business decisions.
This chapter covers multiple methods of checking data quality, but the approaches covered here are not
exhaustive. There are many third-party vendors and partners that offer solutions around data profiling
and data quality, but these are out of scope for this document. The approaches and best practices
described here can and should be applied to address data quality.
To validate that data was migrated correctly from the source system to the target system, you should
first understand the existing mechanism that your tool uses for ensuring that data is valid. For example,
consider Apache Sqoop. Sqoop validates a data copy job through the use of the --validate flag,
which performs a row count comparison between the source and target. For example:
However, there are various limitations with how Sqoop performs this validation. Sqoop does not
support:
• An all-tables option
• A free-form query option
• Data imported into Hive or HBase
• A table import with --where argument
• Incremental imports
Although some of these issues can be mitigated, the underlying problem is that Sqoop fundamentally
only performs row count comparisons. To validate every single value transferred, you must reverse your
data flow with Sqoop and push the freshly copied data back to the source system. After this reverse
push, you can compare the old data against the new data with hashes or checksums. Because this is an
expensive operation, this scenario is one where you must consider risk acceptance and data policy.
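To show why a row count alone is a weak check, and what a hash-based comparison adds, the following sketch computes an order-independent fingerprint over rows from a source and a target. It is a minimal illustration under stated assumptions: rows are already available as tuples in both systems, values are compared by their string form, and the function names are hypothetical. It is not part of Sqoop.

```python
import hashlib


def row_digest(row):
    """SHA-256 digest of one row; fields are joined with a unit separator."""
    # Assumption: values are compared by their string form, so 1 and "1" are
    # treated as equal, and None is encoded as the marker \N.
    encoded = "\x1f".join("\\N" if v is None else str(v) for v in row)
    return hashlib.sha256(encoded.encode("utf-8")).digest()


def table_fingerprint(rows):
    """Return (row_count, hex_fingerprint), independent of row order."""
    count = 0
    acc = 0
    for row in rows:
        count += 1
        # Summing digests modulo 2**256 ignores ordering but still reflects
        # duplicate rows, which XOR-based combining would cancel out.
        acc = (acc + int.from_bytes(row_digest(row), "big")) % (1 << 256)
    return count, f"{acc:064x}"


source_rows = [(1, "alice", None), (2, "bob", 3.5)]
target_rows = [(2, "bob", 3.5), (1, "alice", None)]   # same data, different order
corrupt_rows = [(1, "alice", None), (2, "bob", 3.6)]  # same count, one value differs

same = table_fingerprint(source_rows) == table_fingerprint(target_rows)
detected = table_fingerprint(source_rows) != table_fingerprint(corrupt_rows)
```

Here the corrupted copy has the same row count as the source, so a count-only validation would pass, while the fingerprint comparison flags the difference. Computing such a fingerprint still requires reading every value on both sides, which is the cost the text refers to.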
As part of your migration, you must determine the level of certainty around data accuracy and have a
corresponding and quantifiable metric for it. The stringency of your requirements determines cost,
complexity, and performance implications of your migration strategy, and could possibly impact the
performance of your overall solution. In addition, depending on the level of acceptable risk, alternate
approaches, such as sampling or application-level validation, may be viable options.
Another example is with the AWS CLI, where questions around data quality (specifically, integrity) often
arise when using it to transfer data from source systems to Amazon S3. It is important to understand the
characteristics of your target destination as well as exactly how the AWS CLI and other tools help
validate data that is copied. This way, you can establish reasonable data quality goals and thoroughly
address and answer questions from your architecture teams and business owners.
The following are some common questions that arise when you use the AWS CLI and other data
movement tools:
• Does the AWS CLI validate objects as they land into Amazon S3?
• What happens in the case of multi-part uploads in Amazon S3? How do you calculate a
checksum for parts as well as the whole object?
• How can you validate that an uploaded object’s checksum is accurate without relying on the
Amazon S3 ETag?
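For the multipart question in particular, the sketch below shows the commonly described way a multipart ETag is derived: an MD5 of the concatenated binary MD5 digests of each part, followed by a hyphen and the part count. Treat this as an assumption to verify against current Amazon S3 documentation rather than a guarantee; it does not apply to all encryption modes, and the part size must match the one used for the upload. The function names are hypothetical. A whole-object SHA-256 is included as an ETag-independent check that you can compute before upload and again after download.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute a multipart-style ETag for data split into fixed-size parts."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # An empty payload is treated as a single empty part.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]
    part_digests = [hashlib.md5(part).digest() for part in parts]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(parts)}"


def whole_object_sha256(data: bytes) -> str:
    """Checksum of the complete object, independent of how it was uploaded."""
    return hashlib.sha256(data).hexdigest()


payload = b"a" * 10
etag = multipart_etag(payload, part_size=4)  # three parts: 4 + 4 + 2 bytes
digest = whole_object_sha256(payload)
```

Because the multipart value depends on the part size, two uploads of the same bytes with different part sizes produce different ETags, which is one reason to keep a separate whole-object checksum alongside the object.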
Also, create diagrams of the data pipeline and data catalog to be used in production. Make note of the
similarities and differences between the two pipelines.
• If they are the same, then they are likely subject to the same errors. Are these errors
important?
• If they are not the same, then different sources or different processing has been applied. Do
any of these differences impact your analytics jobs or ML models? How do you know?
Share the diagrams and summaries with subject matter experts and project sponsors. Discuss the
differences and get agreement that they appear reasonable to all stakeholders. If the gap between the
two pipelines is found to be too large (a subjective assessment), consider other approaches to sourcing
data that are more representative of the data in production circumstances.
The more data sources that are involved, the more disparate the data sources that are to be merged.
Similarly, the more data transformation steps that are involved, the more complex the data quality
challenge becomes.
• When merging data, data might be dropped if no direct key match is found.
• Records with null or extreme values might be dropped.
Frequently, there are many individual cleaning and transformation steps performed before the data is
used for analytics or ML training. However, when the job or model is used in production, the data used
is generally coming from a different source (i.e. production data), and comes to the model’s inference
endpoint through a different production-focused path. Your analytics job, ETL workflow, or ML model
was built to work well with cleaned data inputs from the development or test dataset. To ensure that
your job in production behaves the same as it did in a lower environment, consider comparing statistics
and validating the model against unclean data inputs.
Compare Statistics
To ensure that the analytics job or model performs well in production, add a formal checkpoint that
compares the source input data to the data that the job or model was actually built or trained on. Make
sure to evaluate the data from both a quantitative and qualitative perspective.
Quantitative Evaluation
In your quantitative evaluation, review counts, data durations, and the precision of the data inputs in
addition to any other quantifiable indicators that you have defined.
Compare counts to identify, track, and highlight data loss, and test against what seems reasonable.
What percentage of source data was used to actually build and test the model? Is there any potential
bias as a result of unintentionally dropped data from a merge? Is the data storage subsystem filtering,
averaging, or aggregating results after some time period (for example, for log messages)?
Review data duration and retention to determine what time period each dataset covers. Are all of the
potentially relevant business cycles included, especially during peak load?
Quantify precision by comparing the mean, median, and standard deviation of the data source and the
data used to train the model. Calculate the number or percentage of outliers. For lower dimensional
data or key variables, box plots can provide a quick visual assessment of reasonableness. Quantifying
precision in this way helps you to scope out what ‘expected’ values should be and build out alerts to
notify you for data that is outside of this scope.
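The comparison described above can be sketched with Python's statistics module. The sample values, the three-standard-deviation threshold, and the function names are illustrative assumptions; in practice you would run the equivalent aggregations in your processing engine over the full datasets.

```python
import statistics


def summarize(values):
    """Count, mean, median, and sample standard deviation of a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def outlier_fraction(values, reference, z=3.0):
    """Fraction of values more than z reference standard deviations from the reference mean."""
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    outliers = [v for v in values if abs(v - mean) > z * sd]
    return len(outliers) / len(values)


source = [10, 11, 9, 10, 12, 10, 11, 9]       # data source
training = [10, 11, 9, 10, 12, 10, 11, 100]   # data used to train, one bad value

source_stats = summarize(source)
training_stats = summarize(training)
fraction = outlier_fraction(training, reference=source)
```

A large gap between the two summaries, or an outlier fraction above a threshold you choose, is a signal to investigate before promoting the job or model. The same bounds can feed the alerts mentioned above.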
Qualitative Evaluation
Accuracy is as important as precision, but it can likely only be assessed qualitatively. For example,
based on experience and sample exploration, how confident are you that the data is accurate? Are there
sufficient anecdotes of errors? For example, do operators report that this sensor is always running high? The
actions to take based on this evaluation vary widely. A frequent result is to segment the data based on
some factor discovered during the analysis and take a different action on each segment. For example, in
a study of intelligent transport systems, a set of sensors were identified as misconfigured and routed to
be repaired, whereas another subset was used in traffic analysis.7
Apache Griffin
Apache Griffin is an open source data quality solution for big data that supports both batch and
streaming modes. It offers a unified process to measure your data quality from different perspectives,
helping you build trusted data assets and therefore boosting your confidence for your business.
You can combine Griffin with tools such as StreamSets or Apache Kafka to have an end-to-end streaming
workflow that performs data quality checks for you.
Deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data
quality in large datasets. It allows you to calculate data quality metrics on your dataset, define and verify
data quality constraints, and be informed about changes in the data distribution. Instead of
implementing checks and verification algorithms on your own, you can focus on describing how your
data should look. Deequ is implemented on top of Apache Spark and is designed to scale with large
datasets that typically live in a distributed filesystem or a data warehouse.
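Deequ's own API is Scala on Apache Spark and is not reproduced here. As a plain-Python stand-in for the idea of declaring constraints and verifying them against a dataset, the sketch below checks completeness and uniqueness on a small list of records. All function names, column names, and thresholds are hypothetical.

```python
def completeness(rows, column):
    """Fraction of rows in which the column is present and not None."""
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows, column):
    """True if no value of the column appears more than once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


def verify(rows, constraints):
    """Evaluate each named constraint and return a name -> pass/fail mapping."""
    return {name: bool(check(rows)) for name, check in constraints.items()}


orders = [
    {"order_id": 1, "customer": "a", "amount": 20.0},
    {"order_id": 2, "customer": None, "amount": 35.5},
    {"order_id": 3, "customer": "c", "amount": 12.0},
]

constraints = {
    "order_id is unique": lambda rows: is_unique(rows, "order_id"),
    "customer is at least 90% complete": lambda rows: completeness(rows, "customer") >= 0.9,
    "amount is never negative": lambda rows: all(r["amount"] >= 0 for r in rows),
}

report = verify(orders, constraints)
```

The value of the declarative style is that the constraints describe how the data should look, while the verification logic stays generic. A library such as Deequ applies the same pattern at scale on distributed datasets.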
Outposts are connected to the nearest AWS Region to provide the same management and control plane
services on-premises, and you can access the full range of AWS services available in the Region to build,
manage, and scale your on-premises applications. For more information about AWS Outposts, see
the AWS Outposts User Guide.
Amazon EMR on AWS Outposts is ideal for low latency workloads that need to be run in close proximity
to on-premises data and applications. Creating an Amazon EMR cluster on AWS Outposts is similar to
creating an Amazon EMR cluster in an AWS Region. Starting with Amazon EMR version 5.28.0, you can
create and run EMR clusters on AWS Outposts. You can seamlessly extend your VPC on-premises by
creating a subnet and associating it with an Outpost just as you associate subnets with an Availability
Zone in the cloud.
Amazon EMR is one of the largest Spark and Hadoop service providers in the world, enabling customers
to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale.
At the end of the session, participants will have an understanding of:
• Successful migration strategies followed by large customers who have migrated from
on-premises clusters.
• Design patterns, such as using Amazon S3 instead of HDFS, taking advantage of both long- and
short-lived clusters, using completely managed notebooks, and other best practices.
• Lowering cost and achieving cloud-scale elasticity with AWS Auto Scaling and Spot Instances.
• Security best practices with end-to-end encryption, authorization, and fine-grained access
control.
For more information or to schedule your EMR migration workshop, visit Apache Hadoop and Apache
Spark to Amazon EMR Migration Acceleration Program.
Specific to Amazon EMR, the domain expertise of the Data Analytics practice helps organizations derive
more value from their data assets using AWS services. Specific to Hadoop Migrations, AWS Professional
Services has a prescriptive and proven methodology. This methodology includes an alignment phase and
launch phase, which are both described in the following sections.
Hadoop Migration Alignment to Amazon EMR and Amazon S3 Data Lake (and AWS Stack)
AWS Professional Services partners with you to learn and document your current state environment and
the desired future state outcomes. Although the scope of this phase is mutually confirmed, the following
activities are suggested for the Hadoop migration alignment phase.
• AWS Lake Formation service introduction for future-proofing the architecture inclusive of
serverless capabilities, as interested. Services include Amazon QuickSight and AWS Lake
Formation plus serverless services AWS Glue and Amazon Redshift Athena.
• Provide recommendations on data ingestion architecture.
• Provide recommendation on data storage architecture approach that meets security, control
and access requirements.
• Provide recommendations on Data Catalog approach.
• Provide recommendation on data serving layer for downstream applications or users for
access to the catalog and perform data exploration for self-service.
• Provide recommendations for onboarding users to the platform to enable ease and
reusability.
• Provide recommendations on decoupling compute and storage for a cost optimized data lake.
Hadoop Migration Alignment and S3 data lake outcomes:
AWS Partners
In addition to the AWS Professional Services team, AWS has a robust network of firms included in the AWS
Service Delivery Program. The AWS Service Delivery Program is a validation program that identifies and
endorses APN Partners with customer experience and a deep understanding of specific AWS services.
Firms that meet the criteria are listed by geography and can be found on the Amazon EMR Partner site.
AWS Support
AWS Support plans are designed to give you the right mix of tools and access to expertise so that you
can be successful with AWS while optimizing cost and performance and managing risk.
• Troubleshooting issues with the AWS Management Console or other AWS tools
• Debugging problems detected by EC2 health checks
• Troubleshooting third-party applications such as OS, web servers, email, VPN, databases, and
storage configuration
AWS Premium Support does not include:
• Code development
• Debugging custom software
• Performing system administration tasks
• Tuning queries
Several AWS Marketplace products offer support, either directly in Marketplace or through the
originating vendor website. For more information, see New – Product Support Connection for AWS
Marketplace Customers.
Contributors
Contributors to this document include:
Additional Resources
For additional information, see these resources:
Blog Posts
• A Hybrid Cloud Architecture Ah-Ha! Moment
• 4 Dos and Don’ts When Using the Cloud to Experiment
• Top Performance Tuning Tips for Amazon Athena
• Untangling Hadoop YARN
• Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy
• Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda
• Top 10 Performance Tuning Tips for Amazon Athena
• Easily manage table metadata for Presto running on Amazon EMR using the AWS Glue Data
Catalog
Analyst Reports
• The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR
• The Forrester Wave: Cloud Hadoop/Spark Platforms, Q1 2019
Media
• AWS re:Invent Videos
• Amazon EMR on YouTube
Document Revisions
Date            Description
August 2022     Updated for AWS DataSync
December 2020   Updated for AWS Lake Formation Metadata Authorization, AWS Outposts,
                Amazon EMR Notebooks, Amazon EMR Migration Program, and
                Incremental Data Processing.
Cluster Use
• How much of the cluster is being used on average, during peak, during low times?
o How many users, what percentage of CPU and memory?
o Where are users located? One time zone or spread globally?
• How much of the data is being accessed regularly?
• How much new data is added on a monthly basis?
• What kind of data formats are the source, intermediate, and final outputs?
• Are workloads segregated in any manner (for example, with queues or schedulers)?
Maintenance
• How is the cluster being administered right now?
• How are upgrades being done?
• Are there separate development and production clusters?
• Is there a backup cluster or data backup procedure?
Use Cases
Batch Jobs
• How many jobs per day?
• What is the average time they run?
Security Requirements
• Are you using Kerberos?
• Are you using a directory service? Which one?
• Are there fine-grained access control requirements? How is read/write access restricted?
• Are users allowed to create their own tables? How do you track data lineage?
TCO Considerations
Growth expectations over time
• How much will the data grow over time?
The subsections in this appendix show how users interact with a Kerberos-enabled Amazon EMR cluster
and the flow of messages between components.
Figure 50: EMR Kerberos Cluster Startup Flow for KDC with One-Way Trust
In this flow, each node has a provisioning script that is responsible for:
• (Primary node only) Create the KDC and configure it for one-way trust.
• (All nodes) Start realmd to join Active Directory, which then configures SSSD to handle user
and group mapping requests, and create the keytab.
• (All nodes) Create application principals and keytabs for each application and sub-application.
• When a new node comes up in the cluster, its provisioning script performs the same
operations: it creates principals within the KDC on the primary node for all the applications
running locally, creates the keytab file, joins the node to Active Directory (if configured), and starts
the applications.
1. Users log into Hue (or Zeppelin) using their on-premises credentials.
2. Hue authenticates those credentials using LDAP(S) with the on-premises Active Directory.
3. Once authenticated, the user can submit Hive queries.
4. Hue submits those queries and tells HiveServer2 to run the job as the user (that is, impersonation).
5. HiveServer2 submits the job to Resource Manager for processing.
6. During the execution of the job, Hadoop authenticates and authorizes the user by using SSSD
to verify the user account on the local node.
Figure 52: EMR Kerberos flow for directly interacting with HiveServer2
1. User authenticates using credentials for the local KDC and receives a Kerberos ticket.
2. User submits a query and the Kerberos ticket to HiveServer2.
3. HiveServer2 requests the local KDC to validate the Kerberos ticket.
4. The local KDC requests the on-premises KDC to validate the Kerberos ticket.
5. HiveServer2 submits the job to Resource Manager for processing as the user.
6. During the execution of the job, Hadoop authenticates and authorizes the user by using SSSD
to verify the user account on the local node.
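As an illustration of step 2 in the flow above, a client that already holds a Kerberos ticket typically names the HiveServer2 service principal in its JDBC connection string. The following is a minimal sketch, assuming the standard Hive JDBC URL form with a `principal` parameter; the host name, realm, default port 10000, and the `hive` service name are placeholder assumptions, not values taken from this guide.

```python
def hiveserver2_jdbc_url(host, realm, port=10000, database="default", service="hive"):
    """Build a JDBC URL for a Kerberos-secured HiveServer2.

    All argument values are hypothetical; substitute the ones for your cluster.
    """
    # The service principal identifies HiveServer2 itself, not the end user.
    # The user's identity comes from the ticket already in the credential cache.
    principal = f"{service}/{host}@{realm}"
    return f"jdbc:hive2://{host}:{port}/{database};principal={principal}"


if __name__ == "__main__":
    print(hiveserver2_jdbc_url("ip-10-0-0-1.ec2.internal", "EC2.INTERNAL"))
```

The resulting string can be passed to a JDBC client such as Beeline after the user has obtained a ticket.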
1. User accesses the primary node through SSH and authenticates with user credentials.
2. If the user needs a Kerberos ticket, the user must run kinit (kinit -k -t <keytab> <principal>) to get a
Kerberos ticket from the local KDC using their credentials.
3. The local KDC uses on-premises KDC to authenticate the user and return a Kerberos ticket.
4. User submits the query and the Kerberos ticket to HiveServer2.
5. HiveServer2 requests that the local KDC validate the ticket.
6. The local KDC requests the on-premises KDC to validate the ticket.
7. HiveServer2 submits the job to Resource Manager for processing as the user.
8. During job execution, Hadoop uses SSSD to authenticate and authorize the user account on
the local node.
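Step 2 of the SSH flow above can be scripted. The sketch below assembles the `kinit` invocation quoted in that step (`kinit -k -t <keytab> <principal>`) and runs it with Python's `subprocess` module. The function names, principal, and keytab path are hypothetical, and the sketch assumes `kinit` is on the PATH of the node.

```python
import subprocess


def kinit_command(principal, keytab=None):
    """Return the argument list for kinit.

    With a keytab, kinit runs non-interactively (-k -t <keytab>);
    without one, kinit prompts for the user's password on the terminal.
    """
    cmd = ["kinit"]
    if keytab is not None:
        cmd += ["-k", "-t", keytab]
    cmd.append(principal)
    return cmd


def obtain_ticket(principal, keytab=None):
    """Run kinit and raise CalledProcessError if it exits non-zero."""
    subprocess.run(kinit_command(principal, keytab), check=True)


if __name__ == "__main__":
    # Hypothetical principal and keytab path, shown for illustration only.
    print(" ".join(kinit_command("alice@CORP.EMR.LOCAL", "/home/alice/alice.keytab")))
```

Passing the command as a list rather than a shell string avoids quoting problems when a principal or path contains special characters.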
[
{
"classification":"core-site",
"properties":{
"hadoop.security.group.mapping.ldap.search.attr.member":"member",
"hadoop.security.group.mapping.ldap.search.filter.user":"(objectclass=*)",
"hadoop.security.group.mapping.ldap.search.attr.group.name":"cn",
"hadoop.security.group.mapping.ldap.base":"dc=corp,dc=emr,dc=local",
"hadoop.security.group.mapping":"org.apache.hadoop.security.LdapGroupsMapping",
"hadoop.security.group.mapping.ldap.url":"ldap://172.31.93.167",
"hadoop.security.group.mapping.ldap.bind.password":"Bind@User123",
"hadoop.security.group.mapping.ldap.bind.user":"[email protected]",
"hadoop.security.group.mapping.ldap.search.filter.group":"(objectclass=*)"
},
"configurations":[
]
}
]
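A configuration like the one above can also be generated rather than written by hand, which keeps the bind password out of files checked into version control. The following is a minimal sketch that rebuilds the same core-site classification from parameters; the function name and the `LDAP_BIND_PASSWORD` environment variable are assumptions made for illustration, while the property keys are those shown in the sample.

```python
import json
import os


def ldap_group_mapping_config(ldap_url, base_dn, bind_user, bind_password):
    """Build the core-site classification for Hadoop LDAP group mapping."""
    prefix = "hadoop.security.group.mapping"
    return [
        {
            "classification": "core-site",
            "properties": {
                prefix: "org.apache.hadoop.security.LdapGroupsMapping",
                f"{prefix}.ldap.url": ldap_url,
                f"{prefix}.ldap.base": base_dn,
                f"{prefix}.ldap.bind.user": bind_user,
                f"{prefix}.ldap.bind.password": bind_password,
                f"{prefix}.ldap.search.attr.member": "member",
                f"{prefix}.ldap.search.attr.group.name": "cn",
                f"{prefix}.ldap.search.filter.user": "(objectclass=*)",
                f"{prefix}.ldap.search.filter.group": "(objectclass=*)",
            },
            "configurations": [],
        }
    ]


if __name__ == "__main__":
    config = ldap_group_mapping_config(
        ldap_url="ldap://172.31.93.167",
        base_dn="dc=corp,dc=emr,dc=local",
        bind_user="<bind-user>",  # placeholder; supply your own bind account
        # Hypothetical environment variable holding the bind password.
        bind_password=os.environ.get("LDAP_BIND_PASSWORD", ""),
    )
    print(json.dumps(config, indent=2))
```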
{
"classification":"ldap",
"properties":{
"bind_dn":"[email protected]",
"trace_level":"0",
"search_bind_authentication":"false",
"debug":"true",
"base_dn":"dc=corp,dc=emr,dc=local",
"bind_password":"Bind@User123",
"ignore_username_case":"true",
"create_users_on_login":"true",
"ldap_username_pattern":"uid=<username>,cn=users,dc=corp,dc=emr,dc=local",
"force_username_lowercase":"true",
"ldap_url":"ldap://172.31.93.167",
"nt_domain":"corp.emr.local"
},
"configurations":[
{
"classification":"groups",
"properties":{
"group_filter":"objectclass=*",
"group_name_attr":"cn"
},
"configurations": [] },
{
"classification":"users",
"properties":{
"user_name_attr":"sAMAccountName",
"user_filter":"objectclass=*"
},
"configurations":[]
}
]
}
]
}
]
}
]
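The closing brackets in the fragment above imply that the ldap classification sits inside two enclosing classifications whose opening lines are not shown. The sketch below assumes these are `hue-ini` and `desktop`, in that order, each with empty properties; that nesting is an assumption and should be checked against the Amazon EMR documentation for Hue. The helper name `nest_classification` is hypothetical.

```python
import json


def nest_classification(path, leaf):
    """Wrap `leaf` in one empty classification per name in `path`, outermost first."""
    node = leaf
    for name in reversed(path):
        node = {"classification": name, "properties": {}, "configurations": [node]}
    return [node]


if __name__ == "__main__":
    # Abbreviated ldap classification; the full property set appears in the sample above.
    ldap = {
        "classification": "ldap",
        "properties": {
            "ldap_url": "ldap://172.31.93.167",
            "base_dn": "dc=corp,dc=emr,dc=local",
        },
        "configurations": [],
    }
    # Assumed nesting: hue-ini > desktop > ldap.
    print(json.dumps(nest_classification(["hue-ini", "desktop"], ldap), indent=2))
```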
What types of security features are available for an AWS Glue Data Catalog?
You can enable encryption on an AWS Glue Data Catalog, and access to AWS Glue actions is
configurable through IAM policies. The default Amazon EMR EC2 role (EMR_EC2_DefaultRole) allows the
required AWS Glue actions. However, if you specify a custom EC2 instance profile and permissions when
you create a cluster, ensure that the appropriate AWS Glue actions are allowed. For a list of available
Glue IAM policies, see AWS Glue API Permissions: Actions and Resources Reference.
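When a custom instance profile is used, the required permissions can be expressed as an IAM policy document. The following is a minimal sketch that builds such a document. The action list is an illustrative subset chosen for this example, not the complete set Amazon EMR may need, and the wildcard resource is a simplification that you would normally narrow to specific catalog, database, and table ARNs.

```python
import json

# Illustrative subset of AWS Glue Data Catalog actions (an assumption, not a complete list).
GLUE_CATALOG_ACTIONS = (
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:CreateTable",
    "glue:UpdateTable",
)


def glue_catalog_policy(actions=GLUE_CATALOG_ACTIONS, resource="*"):
    """Return an IAM policy document allowing the given Glue actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(glue_catalog_policy(), indent=2))
```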
Can multiple Amazon EMR clusters use a single AWS Glue Data Catalog?
Yes, a single AWS Glue Data Catalog can be shared by multiple Amazon EMR clusters, as well as by Amazon
Athena and Amazon Redshift.
When should I use a Hive metastore on Amazon RDS over an AWS Glue Data
Catalog?
If you want full control of your Hive metastore and want to integrate with other open-source
applications such as Apache Ranger and Apache Atlas, then use Hive metastore on Amazon RDS. If you
are looking for a managed and serverless Hive metastore, then use AWS Glue Data Catalog.
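In configuration terms, the two options differ mainly in the hive-site properties passed to the cluster. The sketch below builds either variant. The Glue client factory class and the `javax.jdo.option.*` property names are standard Hive and Amazon EMR settings; the function name, the JDBC URL, the credentials, and the choice of the MariaDB driver class as default are placeholder assumptions.

```python
import json

GLUE_FACTORY = "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"


def hive_metastore_config(use_glue, jdbc_url=None, user=None, password=None,
                          driver="org.mariadb.jdbc.Driver"):
    """Build a hive-site classification for a Glue Data Catalog or an RDS-backed metastore."""
    if use_glue:
        props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    else:
        if not (jdbc_url and user and password):
            raise ValueError("jdbc_url, user, and password are required for an RDS metastore")
        props = {
            "javax.jdo.option.ConnectionURL": jdbc_url,
            "javax.jdo.option.ConnectionDriverName": driver,
            "javax.jdo.option.ConnectionUserName": user,
            "javax.jdo.option.ConnectionPassword": password,
        }
    return [{"classification": "hive-site", "properties": props, "configurations": []}]


if __name__ == "__main__":
    print(json.dumps(hive_metastore_config(use_glue=True), indent=2))
    # Hypothetical RDS endpoint and credentials, for illustration only.
    print(json.dumps(
        hive_metastore_config(
            use_glue=False,
            jdbc_url="jdbc:mysql://example-host:3306/hive?createDatabaseIfNotExist=true",
            user="hive_user",
            password="example-password",
        ),
        indent=2,
    ))
```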
Notes
1
For a step-by-step guide on how to set up an LDAP server and integrate Apache Hue with it, see Using
LDAP via AWS Directory Service to Access and Administer Your Hadoop Environment on the AWS Big
Data Blog.
2
Example customers that use Amazon S3 as their storage layer for their data lakes include NASDAQ,
Zillow, Yelp, iRobot, and FINRA.
3
For more information on these features, see Configuring Node Decommissioning Behavior.
4
For Amazon EMR versions 5.8.0 and later, you can configure Hive to use the AWS Glue Data Catalog as
its metastore. See Existing Hive Metastore to AWS Glue Data Catalog in Data Catalog Migration.
5
Applies to Amazon EMR software version 5.20 and later.
6
For example, FINRA migrated a 700-TB HBase environment to HBase on Amazon S3. For more
information, see Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase
on Amazon EMR with Amazon S3.
7
V.M. Megler, Kristin Tufte, and David Maier, Improving Data Quality in Intelligent Transportation
Systems, https://arxiv.org/abs/1602.03100 (Feb. 9, 2016)
8
For information on support, see Amazon EMR What's New and the Amazon EMR FAQs.