Building Data Lakes
Table of Contents
Storage Best Practices for Data and Analytics Applications
    Abstract
    Introduction
Central storage: Amazon S3 as the data lake storage platform
Data ingestion methods
    Amazon Kinesis Data Firehose
    AWS Snow Family
    AWS Glue
    AWS DataSync
    AWS Transfer Family
    Storage Gateway
    Apache Hadoop distributed copy command
    AWS Direct Connect
    AWS Database Migration Service
Data lake foundation
Catalog and search
    AWS Glue
    AWS Lake Formation
    Comprehensive data catalog
Transforming data assets
In-place querying
    Amazon Athena
    Amazon Redshift Spectrum
The broader analytics portfolio
    Amazon EMR
    Amazon SageMaker
    Amazon QuickSight
    Amazon OpenSearch Service
Securing, protecting, and managing data
    Access policy options
    Data Encryption with Amazon S3 and AWS KMS
    Protecting data with Amazon S3
    Managing data with object tagging
    AWS Lake Formation: Centralized governance and access control
Monitoring and optimizing the data lake environment
    Data lake monitoring
        Amazon CloudWatch
        Amazon Macie
        AWS CloudTrail
    Data lake optimization
        Amazon S3 Lifecycle management
        Amazon S3 Storage class analysis
        S3 Intelligent-Tiering
        S3 Storage Lens
        Amazon S3 Glacier
        Cost and performance optimization
Future proofing the data lake
Conclusion and contributors
    Conclusion
    Contributors
Resources
Document history
Abstract
Amazon Simple Storage Service (Amazon S3) and Amazon Simple Storage Service Glacier (Amazon
S3 Glacier) provide ideal storage solutions for data lakes. Data lakes, powered by Amazon S3, provide
you with the unmatched availability, agility, and flexibility required to combine different types of data and
analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses
cannot. In addition, data lakes built on Amazon S3 integrate with other analytical services for ingestion,
inventory, transformation, and security of your data in the data lake. This guide explains each of these
options and provides best practices for building, securing, managing, and scaling a data lake built on
Amazon S3.
Introduction
Because organizations are collecting and analyzing increasing amounts of data, traditional on-premises
solutions for data storage, data management, and analytics can no longer keep pace. Data silos that
aren’t built to work well together make it difficult to consolidate the storage for more comprehensive
and efficient analytics. This, in turn, limits an organization’s agility and ability to derive more insights
and value from its data. It also makes it harder to adopt more sophisticated analytics tools and processes, because doing so requires upskilling the workforce.
A data lake is an architectural approach that allows you to store all your data in a centralized repository,
so that it can be categorized, catalogued, secured, and analyzed by a diverse set of users and tools. In
a data lake you can ingest and store structured, semi-structured, and unstructured data, and transform
these raw data assets as needed. Using a cloud-based data lake you can easily decouple the compute
from storage, and scale each component independently, which is a huge advantage over an on-premises
or Hadoop-based data lake. You can use a complete portfolio of data exploration, analytics, machine
learning, reporting, and visualization tools on the data. A data lake makes data and the optimal analytics
tools available to more users, across more lines of business, enabling them to get all of the business
insights they need, whenever they need them.
More organizations are building data lakes for various use cases. To guide customers in their journey,
Amazon Web Services (AWS) has developed a data lake architecture that allows you to build scalable,
secure data lake solutions cost-effectively using Amazon S3 and other AWS services.
Using the data lake built on Amazon S3 architecture capabilities you can:
• Ingest and store data from a wide variety of sources into a centralized platform.
• Build a comprehensive data catalog to find and use data assets stored in the data lake.
• Secure, protect, and manage all of the data stored in the data lake.
• Use tools and policies to monitor, analyze, and optimize infrastructure and data.
• Transform raw data assets in place into optimized usable formats.
• Query data assets in place.
• Integrate the unstructured data assets from Amazon S3 with structured data assets in a data
warehouse solution to gather valuable business insights.
• Store the data assets in separate buckets as the data goes through the extract, transform, and load (ETL) process.
• Use a broad and deep portfolio of data analytics, data science, machine learning, and visualization
tools.
• Quickly integrate current and future third-party data-processing tools.
• Securely share processed datasets and results.
• Scale to virtually unlimited capacity.
The remainder of this paper provides more information about each of these capabilities. The following
figure illustrates a sample AWS data lake platform.
Central storage: Amazon S3 as the data lake storage platform

Key features of Amazon S3 that enable a data lake include the following:
• Decoupling of storage from compute and data processing – In traditional Hadoop and data
warehouse solutions, storage and compute are tightly coupled, making it difficult to optimize costs
and data processing workflows. With Amazon S3, you can cost-effectively store all data types in their
native formats. You can then launch as many or as few virtual servers as you need using Amazon
Elastic Compute Cloud (Amazon EC2) to run analytical tools, and use services in the AWS analytics
portfolio, such as Amazon Athena, AWS Lambda, Amazon EMR, and Amazon QuickSight, to process
your data.
• Centralized data architecture – Amazon S3 makes it easy to build a multi-tenant environment, where
multiple users can run different analytical tools against the same copy of the data. This improves both
cost and data governance over that of traditional solutions, which require multiple copies of data to be
distributed across multiple processing platforms.
• S3 Cross-Region Replication – You can use Cross-Region Replication to copy your objects across S3 buckets within the same account or across different accounts. Cross-Region Replication is
particularly useful in meeting compliance requirements, minimizing latency by storing the objects
closer to the user location, and improving operational efficiency.
• Integration with clusterless and serverless AWS services – You can use Amazon S3 with Athena,
Amazon Redshift Spectrum, and AWS Glue to query and process data. Amazon S3 also integrates
with AWS Lambda serverless computing to run code without provisioning or managing servers. You
can process event notifications from S3 through AWS Lambda, such as when an object is created or
deleted from a bucket. With all of these capabilities, you only pay for the actual amounts of data you
process or for the compute time that you consume. For machine learning use cases, you need to store
the model training data and the model artifacts generated during model training. Amazon SageMaker
integrates seamlessly with Amazon S3, so you can store the model training data and model artifacts
in the same S3 bucket or in different S3 buckets.
• Standardized APIs – Amazon S3 RESTful application programming interfaces (APIs) are simple, easy to
use, and supported by most major third-party independent software vendors (ISVs), including leading
Apache Hadoop and analytics tool vendors. This allows customers to bring the tools they are most
comfortable with and knowledgeable about to help them perform analytics on data in Amazon S3.
• Secure by default – Amazon S3 is secure by default. Amazon S3 supports user authentication to
control access to data. It provides access control mechanisms such as bucket policies and access-
control lists to provide fine-grained access to data stored in S3 buckets to specific users and groups
of users. You can also manage the access to shared data within Amazon S3 using S3 Access Points.
More details about S3 Access Points are included in the Securing, protecting, and managing data
section. You can also securely access data stored in S3 through SSL endpoints using HTTPS protocol.
An additional layer of security can be implemented by encrypting the data-in-transit and data-at-rest
using server-side encryption (SSE).
• Amazon S3 for storage of raw and iterative data sets – When working with a data lake, the data
undergoes various transformations. With extract, transform, load (ETL) processes and analytical
operations, various versions of the same data sets are created or required for advanced processing.
You can create different layers for storing the data based on the stage of the pipeline, such as raw,
transformed, curated, and logs. Within these layers you can also create additional tiers based on the
sensitivity of the data.
• Storage classes for cost savings, durability, and availability – Amazon S3 provides a range of storage
classes for various use cases.
• S3 Standard – General purpose storage for frequently accessed data.
• S3 Standard-Infrequent Access (S3 Standard-IA) and S3 One Zone-Infrequent Access (S3 One Zone-IA) – Infrequently accessed, long-lived data.
• S3 Glacier and S3 Glacier Deep Archive – Long-term archival of data.
Using an S3 Lifecycle policy, you can move the data across different storage classes for compliance and
cost optimization.
• Scalability and support for structured, semi-structured, and unstructured data – Amazon S3 is a
petabyte-scale object store that provides virtually unlimited scalability to store any type of data.
You can store structured data (such as relational data), semi-structured data (such as JSON, XML, and
CSV files), and unstructured data (such as images or media files). This feature makes Amazon S3 the
appropriate storage solution for your cloud data lake.
Data ingestion methods

A data lake built on Amazon S3 can ingest a wide variety of data, including:
• Real-time streaming data and bulk data assets, from on-premises storage platforms.
• Structured data generated and processed by legacy on-premises platforms - mainframes and data
warehouses.
• Unstructured and semi-structured data – images, text files, audio and video, and graphs.
AWS provides services and capabilities to ingest different types of data into your data lake built on
Amazon S3, depending on your use case. This section provides an overview of various ingestion services.

Amazon Kinesis Data Firehose
Kinesis Data Firehose can convert your input JSON data to Apache Parquet and Apache ORC before
storing the data in your data lake built on Amazon S3. Because Parquet and ORC are columnar data formats, they help save space and allow faster queries on the stored data compared to row-based formats such as
JSON. Kinesis Data Firehose can compress data before it’s stored in Amazon S3. It currently supports
GZIP, ZIP, and SNAPPY compression formats. GZIP is the preferred format because it can be used by
Amazon Athena, Amazon EMR, and Amazon Redshift.
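As an illustration, the following is a minimal sketch, using the AWS SDK for Python (Boto3), of a delivery stream that converts incoming JSON to Parquet and partitions the delivered objects by date. The stream, bucket, role, and AWS Glue table names are hypothetical, and the Glue table is assumed to already hold the target schema.

# Minimal sketch (Boto3); names and ARNs are hypothetical.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-datalake",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-datalake-raw",
        # Date-based prefixes partition the delivered objects for faster queries.
        "Prefix": "events/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        # Compression is handled by the Parquet serializer below.
        "CompressionFormat": "UNCOMPRESSED",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {
                "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
            },
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
                "DatabaseName": "datalake_raw",
                "TableName": "clickstream_events",
                "Region": "us-east-1",
            },
        },
    },
)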
Kinesis Data Firehose also allows you to invoke Lambda functions to perform transformations on the
input data. Using Lambda blueprints, you can first transform input comma-separated values (CSV) or structured text, such as Apache Log and Syslog formats, into JSON. You can optionally store the source data in another S3 bucket. The following figure illustrates the data flow between Kinesis Data
Firehose and different destinations.
Kinesis Data Firehose also provides the ability to group and partition the target files using custom
prefixes, such as dates, for S3 objects. This facilitates faster querying through partitioning, and also enables incremental processing of the delivered data.
Delivering real-time streaming data with Kinesis Data Firehose to different destinations with optional
backup
Kinesis Data Firehose also natively integrates with Amazon Kinesis Data Analytics which provides you
with an efficient way to analyze and transform streaming data using Apache Flink and SQL applications.
Apache Flink is an open-source framework and engine for processing streaming data using Java and
Scala. Using Kinesis Data Analytics, you can develop applications to perform time series analytics,
feed real-time dashboards, and create real-time metrics. You can also use Kinesis Data Analytics for
transforming the incoming stream and create a new data stream that can be written back into Kinesis
Data Firehose before it is delivered to a destination.
Finally, Kinesis Data Firehose encryption supports Amazon S3 server-side encryption with AWS Key
Management Service (AWS KMS) for encrypting delivered data in your data lake built on Amazon S3.
You can choose not to encrypt the data or to encrypt the data with a key from the list of AWS KMS keys
that you own (refer to the Data encryption with Amazon S3 and AWS KMS (p. 24) section of this
document). Kinesis Data Firehose can concatenate multiple incoming records, and then deliver them to
Amazon S3 as a single S3 object.
This is an important capability because it reduces the load of Amazon S3 transaction costs and
transactions per second. You can grant your application access to send data to Kinesis Data Firehose
using AWS Identity and Access Management (IAM). Using IAM, you can also grant Kinesis Data Firehose
access to S3 buckets, Amazon Redshift cluster, or Amazon OpenSearch Service cluster. You can also use
Kinesis Data Firehose with virtual private cloud (VPC) endpoints (AWS PrivateLink). AWS PrivateLink is
an AWS technology that enables private communication between AWS services using an elastic network
interface with private IPs in your Amazon VPC.
AWS Snow Family
AWS Snowcone and AWS Snowball Edge offer cloud computing capabilities at the edge for applications that need to perform computations closer to the source of the data. Using Snowcone, you can transfer data generated continuously from
sensors, IoT devices, and machines, to the AWS Cloud. Snowcone features 8 TB of storage. Snowball and
Snowmobile are used to transfer massive amounts of data up to 100 PB.
Snowball moves terabytes of data into your data lake built on Amazon S3. You can use it to transfer
databases, backups, archives, healthcare records, analytics datasets, historic logs, IoT sensor data, and
media content, especially in situations where network conditions hinder transfer of large amounts of
data both into and out of AWS.
AWS Snow Family uses physical storage devices to transfer large amounts of data between your on-
premises data centers and your data lake built on Amazon S3. You can use AWS Storage Optimized
Snowball to securely and efficiently migrate bulk data from on-premises storage platforms and Hadoop
clusters. Snowball devices encrypt data with 256-bit AES encryption. Encryption keys are never
shipped with the Snowball device, so the data transfer process is highly secure.
Data is transferred from the Snowball device to your data lake built on Amazon S3 and stored as S3
objects in their original or native format. Snowball also has a Hadoop Distributed File System (HDFS)
client, so data may be migrated directly from Hadoop clusters into an S3 bucket in its native format.
Snowball devices can be particularly useful for migrating terabytes of data from data centers and
locations with intermittent internet access.
AWS Glue
AWS Glue is a fully managed serverless ETL service that makes it easier to categorize, clean, transform,
and reliably transfer data between different data stores in a simple and cost-effective way. The core components of AWS Glue are the AWS Glue Data Catalog, a central metadata repository that is a drop-in replacement for an Apache Hive metastore (refer to the Catalog and search section of this document for more information), and an ETL job system that automatically generates Python or Scala code and manages ETL jobs. The following figure depicts the high-level architecture of
an AWS Glue environment.
To ETL the data from source to target, you create a job in AWS Glue, which involves the following steps (a minimal sketch of these steps follows the list):
1. Before you can run an ETL job, define a crawler and point it to the data source to identify the table
definition and the metadata required to run the ETL job. The metadata and the table definitions are
stored in the Data Catalog. The data source can be an AWS service, such as Amazon RDS, Amazon
S3, Amazon DynamoDB, or Kinesis Data Streams, as well as a third-party JDBC-accessible database.
Similarly, a data target can be an AWS service, such as Amazon S3, Amazon RDS, and Amazon
DocumentDB (with MongoDB compatibility), as well as a third-party JDBC-accessible database.
2. Either provide a script to perform the ETL job, or AWS Glue can generate the script automatically.
3. Run the job on demand, or use the scheduler component to initiate the job in response to an event or on a defined schedule.
4. When the job is run, the script extracts the data from the source, transforms the data, and finally loads
the data into the data target.
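The following is a minimal sketch of these steps with the AWS SDK for Python (Boto3). The names, S3 paths, and IAM role are hypothetical, and the ETL script is assumed to already exist in S3.

# Minimal sketch (Boto3); names, paths, and the role are hypothetical.
import boto3

glue = boto3.client("glue")

# Step 1: a crawler populates the Data Catalog from the raw S3 data.
glue.create_crawler(
    Name="raw-zone-crawler",
    Role="arn:aws:iam::111122223333:role/glue-service-role",
    DatabaseName="datalake_raw",
    Targets={"S3Targets": [{"Path": "s3://example-datalake-raw/events/"}]},
)
glue.start_crawler(Name="raw-zone-crawler")

# Step 2: register an ETL job that points at a generated or hand-written script.
glue.create_job(
    Name="raw-to-parquet",
    Role="arn:aws:iam::111122223333:role/glue-service-role",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-datalake-scripts/raw_to_parquet.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.1X",
    NumberOfWorkers=5,
)

# Steps 3 and 4: run the job on demand (a trigger or schedule could start it instead).
glue.start_job_run(JobName="raw-to-parquet")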
AWS DataSync
AWS DataSync is an online data transfer service that helps in moving data between on-premises storage
systems and AWS storage services, as well as between different AWS storage services. You can automate
the data movement between on-premises Network File Systems (NFS), Server Message Block (SMB), or
a self-managed object store to your data lake built on Amazon S3. DataSync allows data encryption and
data integrity validation to ensure safe and secure transfer of data. DataSync also has support for an
HDFS connector to read directly from on-premises Hadoop clusters and replicate your data to your data
lake built on Amazon S3.
Storage Gateway
Storage Gateway can be used to integrate legacy on-premises data processing platforms with a data
lake built on Amazon S3. The File Gateway configuration of Storage Gateway offers on-premises
devices and applications a network file share through an NFS connection. Files written to this mount
point are converted to objects stored in Amazon S3 in their original format without any proprietary
modification. This means that you can integrate applications and platforms that don’t have native
Amazon S3 capabilities—such as on-premises lab equipment, mainframe computers, databases, and data
warehouses—with S3 buckets, and then use analytical tools such as Amazon EMR or Amazon Athena to
process this data.
AWS Database Migration Service

AWS Database Migration Service (AWS DMS) is useful for migrating databases from on-premises environments, or across different AWS accounts, to your data lake built on Amazon S3, either during an initial data transfer or on a regular basis. You can also write the data in Apache Parquet format (.parquet) for more compact storage and faster query options. Both CSV and Parquet formats are favorable for in-place querying using services such as Amazon Athena and Amazon Redshift Spectrum (refer to the In-place querying section of this document for more information). As mentioned earlier, Parquet format is recommended for analytical querying.
Data lake foundation
The following figure depicts a sample data lake and the transformation journey data goes through in its
lifecycle.
This section provides a recommended bucket strategy for building a data lake foundation.
• Raw data – Data ingested from the data sources in the raw data format, which is the immutable
copy of the data. This can include structured, semi structured, and unstructured data objects such as
databases, backups, archives, JSON, CSV, XML, text files, or images.
• Transformed – This bucket consists of transformed data normalized to a specific use case for
performance improvement and cost reduction. In this stage, data can be transformed into columnar
data formats, such as Apache Parquet and Apache ORC, which can be used by Amazon Athena.
• Curated – The transformed data can be further enriched by blending it with other data sets to provide additional insights. This layer typically contains S3 objects that are optimized for analytics and reporting using Amazon Athena and Amazon Redshift Spectrum, and for loading into massively parallel processing data warehouses such as Amazon Redshift.
• Logs – This bucket stores process logs for Amazon S3, and other services in the data lake architecture.
The logs can include S3 access logs, CloudWatch logs, or CloudTrail logs.
The following table shows a recommended sample folder structure for your data lake per environment.
Each environment can have the same folder structure with tags to segregate across each environment.
Recommended sample folder structure for your data lake per environment
It is recommended you follow best practices when defining your bucket strategy for your data lake built
on Amazon S3:
• Bucket names must be unique within a partition. Currently there are three partitions (aws – Standard Regions, aws-cn – China Regions, aws-us-gov – AWS GovCloud (US) Regions); however, names can be reused after the buckets are deleted (with several exceptions).
• Deleted bucket names are not available to reuse immediately; hence, if users want to use a particular
bucket name, they should not delete the bucket.
• All bucket names across all AWS Regions should comply with DNS naming conventions.
• A bucket can store an unlimited number of objects without impacting performance. The objects can be stored in a single bucket or across multiple buckets; however, you cannot create a bucket from within another bucket.
• Production S3 buckets should be hosted under a different AWS account, separate from non-production
workloads.
• Build an automatic ingestion mechanism to catalog and create the multiple layers of data storage
including Raw, Transformed, and Curated.
• Consider building automatic data classification rules based on schema and data.
• Consider additional folders within the data lakes, such as reports, downstream applications, or user
folders.
• Enable versioning, if needed, for protection from accidental deletes (a minimal sketch of creating layered buckets with versioning enabled follows this list).
• Use separate buckets for S3 data which needs to be replicated.
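The following is a minimal Boto3 sketch of creating the layered buckets described above with versioning enabled. The bucket names, Region, and tags are hypothetical, and bucket names must be globally unique within the partition.

# Minimal sketch (Boto3); bucket names, Region, and tags are hypothetical.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

for layer in ("raw", "transformed", "curated", "logs"):
    bucket = f"example-corp-datalake-{layer}"
    s3.create_bucket(Bucket=bucket)
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    # Tag the bucket so environments and layers can be segregated and reported on.
    s3.put_bucket_tagging(
        Bucket=bucket,
        Tagging={"TagSet": [{"Key": "environment", "Value": "prod"},
                            {"Key": "layer", "Value": layer}]},
    )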
Catalog and search
AWS Glue
AWS Glue is a fully managed ETL service that makes it easier to categorize, clean, transform, and reliably
transfer data between different data stores. The AWS Glue Data Catalog, a component of AWS Glue, provides a unified metadata repository that can be used for analytical operations across various services, such as Amazon EMR, Amazon Athena, Amazon Redshift, and Amazon Redshift Spectrum, and any application that is compatible with a Hive metastore.
To create a Data Catalog, use AWS Glue crawlers that crawl the data from the data sources as registered
with AWS Glue in the AWS Management Console. The data source can be an S3 bucket, a table in
Amazon RDS, Amazon DynamoDB, or Amazon Redshift, or any external database that supports JDBC
connectivity.
You can define custom classifiers or use the built-in classifiers provided by AWS Glue to classify the data.
AWS Glue provides classifiers for common file types, such as CSV, JSON, AVRO, or XML. AWS Glue also
provides classifiers for common relational database management systems using a JDBC connection. You
can create a custom classifier using a grok pattern, an XML tag, JavaScript Object Notation (JSON), or
comma-separated values (CSV). The following figure depicts the working of AWS Glue in building a Data
Catalog.
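As an illustration of the custom classifiers mentioned above, the following Boto3 sketch registers a grok classifier for an application log format. The classifier name, classification, and pattern are hypothetical.

# Minimal sketch (Boto3); the classifier name, classification, and pattern are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_classifier(
    GrokClassifier={
        "Name": "app-log-classifier",
        "Classification": "application-logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:event_time} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)
# A crawler created with Classifiers=["app-log-classifier"] tries this custom
# classifier before falling back to the built-in ones.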
The Data Catalog is a database that stores metadata in tables, consisting of data schema, data location, and runtime metrics. The Data Catalog is also Apache Hive metastore compatible and can be used as a central repository for storing structural and operational metadata. AWS Glue also provides out-of-the-box integration with Amazon EMR that allows you to use the Data Catalog as an external Hive metastore.
Data Catalog is recommended, especially when you need a persistent metastore or a metastore shared
between different applications, services, clusters, or AWS accounts. Data Catalog can also be used to
create an external table for Athena or Amazon Redshift.
AWS Lake Formation

Lake Formation uses the AWS Glue Data Catalog to automatically classify data in data lakes, data sources,
transforms, and targets. Apart from the metadata, the Data Catalog also stores resource links to shared databases and tables in external accounts, enabling cross-account access to the data in a data lake built on S3.

Lake Formation provides you with a grant/revoke permission model to control access to Data Catalog resources (databases and metadata tables), S3 buckets, and the underlying data in these buckets.
Lake Formation permissions along with IAM policies provide granular access to the data stored in data
lakes built on S3. These permissions can be used to share Data Catalog resources with external AWS
accounts. The users from these accounts can run jobs and queries by combining data from multiple data
catalogs across multiple accounts.
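A minimal Boto3 sketch of this grant/revoke model follows, granting SELECT on a Data Catalog table to an analyst role. The role, database, and table names are hypothetical.

# Minimal sketch (Boto3); the principal, database, and table are hypothetical.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/data-analyst"
    },
    Resource={
        "Table": {"DatabaseName": "datalake_curated", "Name": "sales_orders"}
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)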
Comprehensive data catalog

Comprehensive data catalog using AWS Lambda, DynamoDB, and Amazon OpenSearch Service
Transforming data assets
The key to ‘democratizing’ the data and making the data lake available to the widest number of users
of varying skill sets and responsibilities is to transform data assets into a format that allows for efficient
one-time SQL querying. Data transformation or ETL process prepares the data to be consumed by
downstream systems for advanced analytics, visualizations and business reporting, or machine learning.
The first step involves extracting information from the different data sources which was discussed
in the Data ingestion methods section. The data is stored in a raw bucket (refer to the Data lake foundation section of this document). After you have the data, you need to transform it. Transformations might involve aggregating the data from different sources or changing the file format of the received data, such as converting CSV to Parquet to reduce the file size and optimize Athena queries.
You can compress the files or convert the data format from string to double, or number to date format.
After the transformation is complete, the data can be written into the transformed bucket (refer to the Data lake foundation section of this document), which is then used by the downstream
systems.
There are multiple ways to transform data assets, and the “best” way often comes down to the nature
of the analytics application, individual preference, skill sets, and the tools available. When a data lake
is built on AWS, there is a wide variety of tools and services available for data transformation, so you
can pick the methods and tools that best suit your purpose. Because the data lake is inherently multi-
tenant, multiple data transformation jobs using different tools can run concurrently.
There are two common and straightforward methods to transform data assets into Parquet in a data lake
built on S3 using Amazon EMR clusters. The first method involves creating an Amazon EMR cluster with
Hive installed using the raw data assets in Amazon S3 as input, transforming those data assets into Hive
tables, and then writing those Hive tables back out to Amazon S3 in Parquet format. The second method
is to use Spark on Amazon EMR. With this method, a typical transformation can be achieved with only 20
lines of PySpark code.
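The following is a minimal sketch of the second method: a short PySpark job that reads raw CSV from S3 and writes it back as partitioned Parquet. The S3 paths and the partition column are hypothetical.

# Minimal PySpark sketch for Spark on Amazon EMR; paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

# Read the raw CSV data assets from the raw layer of the data lake.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3://example-datalake-raw/events/"))

# Write the same data back to the transformed layer as partitioned Parquet.
(raw.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-datalake-transformed/events/"))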
A third data transformation method on a data lake built on S3 is to use AWS Glue. AWS Glue simplifies
and automates difficult and time-consuming data discovery, conversion, mapping, and job scheduling
tasks. AWS Glue guides you through the process of transforming and moving your data assets with an
easy-to-use console that helps you understand your data sources, transform and prepare these data
assets for analytics, and load them reliably from S3 data sources back into S3 destinations.
AWS Glue automatically crawls raw data assets in your data lake’s S3 buckets, identifies data formats,
and then suggests schemas and transformations so that you don’t have to spend time hand coding
data flows. You can then edit these transformations, if necessary, using the tools and technologies you
already know, such as Python, Spark, Git, and your favorite integrated development environment, and then
share them with other AWS Glue users of the data lake. AWS Glue’s flexible job scheduler can be set up
to run data transformation flows on a recurring basis, in response to triggers, or even in response to AWS
Lambda events.
AWS Glue automatically and transparently provisions hardware resources, and distributes ETL jobs
on Apache Spark nodes so that ETL run times remain consistent as data volume grows. AWS Glue
coordinates the implementation of data lake jobs in the right sequence, and automatically retries failed
jobs. With AWS Glue, there are no servers or clusters to manage, and you pay only for the resources
consumed by your ETL jobs.
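For illustration, the following is a minimal sketch of the kind of PySpark script an AWS Glue job runs, reading a table that a crawler registered in the Data Catalog and writing Parquet to the transformed layer. The database, table, and path names are hypothetical.

# Minimal AWS Glue job script sketch; database, table, and paths are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table that a crawler registered in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="clickstream_events"
)

# Write the data to the transformed layer in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-datalake-transformed/events/"},
    format="parquet",
)

job.commit()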
AWS Lambda functions can also be used for transforming the data stored in your data lake built on
Amazon S3. Lambda functions can respond to event notifications from Amazon S3 when an object is
created or deleted. You can configure a Lambda function to perform an action asynchronously, based on
the event.
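A minimal sketch of such a Lambda handler, triggered by an S3 object-created event notification, follows. The bucket names are hypothetical and the transformation itself is a placeholder.

# Minimal sketch of an S3-triggered Lambda handler; bucket names are hypothetical.
import urllib.parse

import boto3

s3 = boto3.client("s3")
TRANSFORMED_BUCKET = "example-datalake-transformed"

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Fetch the newly created object, transform it, and write the result
        # to the transformed layer of the data lake.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        s3.put_object(Bucket=TRANSFORMED_BUCKET,
                      Key=f"transformed/{key}",
                      Body=transform(body))

def transform(data: bytes) -> bytes:
    # Placeholder: normalize records, drop fields, or re-encode the payload.
    return data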
When data is stored in a data lake built on S3, you can share the data with multiple applications.
These applications can vary in nature and purpose (for example, ecommerce applications, analytics
applications, and marketing applications) and necessitate a different view for each application. S3
Object Lambda can be used to add your own code to process the data retrieved before returning it to
an application (for example, masking the personally identifiable information (PII) data or masking credit
card information before transferring it to the application).
S3 Object Lambda uses a Lambda function to automatically process and transform the data as it is being
retrieved from an S3 bucket. The following figure shows an example of an S3 Object Lambda.
S3 Object Lambda can be very useful in redacting PII data for analytics applications, format conversion,
enriching data from other data sources, resizing objects, or even implementing custom authorization
rules to access the data stored in your data lake built on Amazon S3.
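As an illustration, the following is a minimal sketch of an S3 Object Lambda handler that masks an email field before returning the object to the caller. The field name and masking rule are hypothetical.

# Minimal S3 Object Lambda handler sketch; the redacted field is hypothetical.
import json
import urllib.request

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    ctx = event["getObjectContext"]

    # Fetch the original object through the presigned URL supplied by S3.
    with urllib.request.urlopen(ctx["inputS3Url"]) as response:
        record = json.loads(response.read())

    # Redact PII before the data reaches the requesting application.
    if "email" in record:
        record["email"] = "***redacted***"

    s3.write_get_object_response(
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
        Body=json.dumps(record).encode("utf-8"),
    )
    return {"statusCode": 200}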
Another transformation feature that S3 supports natively is batch operations. Using S3 Batch
Operations, you can define an operation to be performed on exabytes of objects stored in your
bucket. S3 Batch Operations manages the entire lifecycle of the batch operation by tracking the
progress, sending notifications, and storing a detailed completion report for all the operations
performed on the objects.
By using S3 Batch Operations, you can copy objects from a bucket into another bucket in the same or
different Region, or invoke a Lambda function to transform the objects. S3 Batch Operations can also be used to manage the tags defined for the objects and to restore a large number of objects from S3 Glacier. S3 Batch Operations, along with S3 Object Lock, can be used to manage retention dates and legal holds to apply compliance and governance rules to multiple objects at the same time. S3 Object Lock
works on the write once read many (WORM) model that can help in prevention of accidental deletion of
objects. It can also help in adhering to regulatory controls.
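A minimal Boto3 sketch of an S3 Batch Operations job follows, copying the objects listed in a CSV manifest to another bucket and writing a completion report. The account ID, ARNs, and manifest ETag are hypothetical.

# Minimal sketch (Boto3); account ID, ARNs, and the manifest ETag are hypothetical.
import boto3

s3control = boto3.client("s3control")

s3control.create_job(
    AccountId="111122223333",
    ConfirmationRequired=False,
    RoleArn="arn:aws:iam::111122223333:role/s3-batch-ops-role",
    Priority=10,
    Operation={
        "S3PutObjectCopy": {
            "TargetResource": "arn:aws:s3:::example-datalake-replica"
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::example-datalake-manifests/copy-manifest.csv",
            "ETag": "example-manifest-etag",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::example-datalake-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "batch-reports",
        "ReportScope": "AllTasks",
    },
)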
In-place querying
One of the most important capabilities of a data lake on AWS is the ability to perform in-place
transformation and querying of data assets. This allows users to run sophisticated analytical queries
directly on their data stored in S3, without having to copy and load data into separate analytics
platforms or data warehouses.
There are various tools to perform in-place querying for data stored in a data lake, such as Presto on
Amazon EMR and various partner tools. This section provides an overview of serverless services that not only help perform in-place querying, but also avoid the procurement and management of servers.
Users can query S3 data without any additional infrastructure, and only pay for the queries that they
run. This makes the ability to analyze vast amounts of unstructured data accessible to any data lake
user who can use SQL, and makes it far more cost effective than the traditional method of performing
an ETL process, creating a Hadoop cluster or data warehouse, loading the transformed data into these
environments, and then running query jobs. AWS Glue, as described in the previous sections, provides
the data discovery and ETL capabilities, and Amazon Athena and Amazon Redshift Spectrum provide
the serverless in-place querying capabilities.
In addition to in-place querying using Athena and Redshift Spectrum, S3 also provides capabilities to
retrieve a subset of your data through S3 Select and S3 Glacier Select, which improves the performance of accessing large amounts of data from your data lake built on S3. Using S3 Select, users can run SQL statements to filter and retrieve only a subset of the data stored in their data lake. S3 Select operates on objects stored in CSV, JSON, or Apache Parquet format, including CSV and JSON objects compressed with GZIP or BZIP2. Users can also delimit the result set, reducing the latency to retrieve the data and optimizing
cost.
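As an illustration, the following Boto3 sketch uses S3 Select to retrieve a filtered subset of a GZIP-compressed CSV object. The bucket, key, and column names are hypothetical.

# Minimal S3 Select sketch (Boto3); bucket, key, and columns are hypothetical.
import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="example-datalake-raw",
    Key="events/2023/01/orders.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.amount FROM s3object s WHERE s.country = 'US'",
    InputSerialization={
        "CSV": {"FileHeaderInfo": "USE"},
        "CompressionType": "GZIP",
    },
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; concatenate the Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")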
Amazon Athena
Amazon Athena is an interactive query service that makes it easier for you to analyze data directly in S3
using standard SQL. With a few actions in the AWS Management Console, you can use Athena directly
against data assets stored in the data lake built on S3 and begin using standard SQL to run one-time
queries and get results in a matter of seconds.
Athena is serverless, so there is no infrastructure to set up or manage, and you only pay for the volume
of data assets scanned during the queries you run.
Athena scales automatically—running queries in parallel—so results are fast, even with large datasets
and complex queries. You can use Athena to process unstructured, semi-structured, and structured
data sets. Supported data asset formats include CSV, JSON, or columnar data formats such as Apache
Parquet and Apache ORC. You can also use Athena to run one-time queries using ANSI SQL without first
aggregating or loading the data into Athena. Athena integrates with Amazon QuickSight for easy data
visualization. It can also be used with third-party reporting and business intelligence tools by connecting
these tools to Athena with a JDBC driver.
Athena also natively integrates with AWS Glue Data Catalog which provides a persistent metadata
store for the data stored in S3. You can create the table and use Athena to query the data based on a
metadata store that integrates with the ETL and data discovery features of AWS Glue.
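A minimal Boto3 sketch of a one-time Athena query against a table registered in the Data Catalog follows. The database, table, and result location are hypothetical.

# Minimal Athena query sketch (Boto3); database, table, and locations are hypothetical.
import time

import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="SELECT event_date, count(*) AS events "
                "FROM clickstream_events GROUP BY event_date",
    QueryExecutionContext={"Database": "datalake_transformed"},
    ResultConfiguration={"OutputLocation": "s3://example-datalake-athena-results/"},
)

# Poll until the query finishes, then fetch the result set.
execution_id = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(f"Returned {len(rows) - 1} rows")  # the first row holds column headers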
When querying an existing table, Athena uses Presto, a distributed SQL engine, under the hood. Athena
can also be used to query S3 inventory using standard SQL. Amazon S3 Inventory is an S3 tool to help
manage storage. You can use the tool to audit and report on the replication and encryption status of the
objects for business, compliance, and regulatory needs. Athena supports querying S3 inventory files in
ORC, Parquet, or CSV format. AWS recommends using ORC-formatted or Parquet-formatted inventory
files because these formats provide faster query performance and lower query costs. ORC and Parquet
formats are columnar formats that allow the reader to read, decompress, and process only the columns that are required for the current query.
Because Amazon Athena and Amazon Redshift share a common data catalog and common data formats,
you can use both Athena and Amazon Redshift Spectrum against the same data assets. You would
typically use Athena for one-time data discovery and SQL querying, and then use Amazon Redshift
Spectrum for more complex queries and scenarios where a large number of data lake users want to run
concurrent business intelligence and reporting workloads.
The broader analytics portfolio

A data lake built on Amazon S3 integrates with a broad portfolio of AWS analytics services. This allows you to quickly solve a wide range of analytics business challenges on a single platform,
against common data assets, without having to worry about provisioning hardware and installing and
configuring complex software packages before loading data and performing analytics. Plus, you only pay
for what you consume. Some of the most common AWS services that can be used with data assets in a
data lake built on S3 are described in this section.
Amazon EMR
Amazon EMR is a highly distributed computing framework used to quickly and easily process data in
a cost-effective manner. Amazon EMR uses Apache Hadoop, an open-source framework, to distribute
data and processing across an elastically resizable cluster of EC2 instances and allows you to use all
the common Hadoop tools such as Hive, Pig, Spark, Flink, Hudi, Hue, Livy, MXNet, Presto, TensorFlow,
HBase, or Zeppelin. Amazon EMR does all the heavy lifting involved with provisioning, managing, and
maintaining the infrastructure and software of a Hadoop cluster, and is integrated directly with S3.
With Amazon EMR, you can launch a persistent cluster that stays up indefinitely or a temporary cluster
that ends after the analysis is complete. In either scenario, you only pay for the hours the cluster is
up. Amazon EMR supports a variety of EC2 instance types encompassing general purpose, compute,
memory, storage I/O optimized, and GPU instances (for example, M6g, C5, R5, Z1, I3, D2, G4, and P3),
and all Amazon EC2 pricing options (On-Demand, Reserved, and Spot). The latest version of EMR 6.1.0
(at the time of writing this document) also supports Arm-based instance types.
AWS Graviton processors are custom built by Amazon Web Services and use 64-bit Arm Neoverse cores
to deliver the best price performance. When you launch an Amazon EMR cluster (called a job flow), you
choose how many and what type of EC2 instances to provision. Organizations with many different lines
of business and a large number of users can build a single data lake solution, store their data assets in
S3, and then spin up multiple EMR clusters to share data assets in a multi-tenant fashion.
Amazon EMR integrates with S3 for input, output, and intermediate data storage. You can choose HDFS, which runs on a cluster consisting of primary and core nodes, for processing data that does not need to persist after the lifecycle of the cluster. You can use the EMR File System (EMRFS), which uses Amazon S3 as a data layer for applications running on your cluster. The data can be stored on EMRFS
to persist even after the cluster lifecycle. This mechanism helps separate compute and storage, thus
enabling you to independently scale the compute and storage depending on the workload. You can scale
your compute needs by resizing your cluster, and you can scale your storage needs by using Amazon S3.
Amazon SageMaker
Machine learning is another important data lake use case. Amazon SageMaker is a fully managed
machine learning service that provides a set of features to build, train, and deploy machine learning
models. SageMaker provides a suite of built-in algorithms and mechanisms to build custom algorithms
based on user requirements. SageMaker also provides an integrated Jupyter notebook instance to
prepare and transform the data, develop code for model training, and subsequently deploy, test, and
validate the models.
SageMaker provides visualization tools and wizards to guide you through the process of creating
machine learning models without having to learn complex algorithms and technology. SageMaker can
create machine learning models based on datasets stored in your data lake built on S3. SageMaker
greatly enhances machine learning capabilities by combining with your data lake built on S3. For training
machine learning models, a large amount of data is required, and a data lake backed by S3 and Amazon
S3 Glacier provides cost-effective storage.
Amazon QuickSight
Amazon QuickSight is a fast business analytics service that makes it easy for you to build visualizations,
perform targeted analysis, and quickly get business insights from your data assets stored in the data
lake, any time, on any device. You can use Amazon QuickSight to seamlessly discover AWS data sources
such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3, connect to
any of these data sources and data assets, and get insights from this data in minutes. Amazon QuickSight
allows organizations using the data lake to seamlessly scale their business analytics capabilities to
hundreds of thousands of users. It delivers fast and responsive query performance by using a Super-fast,
Parallel, In-memory Calculation Engine (SPICE).
Amazon OpenSearch Service

Amazon OpenSearch Service manages the infrastructure provisioning and maintenance, thus reducing
the overhead of managing the Amazon OpenSearch Service cluster yourself. Amazon OpenSearch
Service integrates with various open source and native AWS services to provide a seamless experience.
Amazon OpenSearch Service integrates with open source Logstash to conveniently upload bulk data into
Amazon OpenSearch Service domain and with open source Kibana for visualization.
Amazon OpenSearch Service also integrates with Kinesis Data Firehose for ingestion of streaming data, and with AWS CloudTrail and Amazon CloudWatch for logging and monitoring. You can also use your data
lake built on Amazon S3 to store the streaming data before it is ingested into the Amazon OpenSearch
Service domain. UltraWarm and cold storage, two Amazon OpenSearch Service storage tiers, use Amazon S3 for storing infrequently accessed, read-only data in a cost-efficient manner.
Securing, protecting, and managing data
Securing your data lake begins with implementing very fine-grained controls that allow authorized
users to view, access, process, and modify particular assets, and ensure that unauthorized users are
blocked from taking any actions that would compromise data confidentiality and security. A complicating
factor is that access roles may evolve over various stages of a data asset’s processing and lifecycle.
Fortunately, Amazon has a comprehensive and well-integrated set of security features, such as access
policies, resource-based policies, bucket policies, and data encryption, to secure a data lake built on S3.
Access policy options

When you are creating a centralized data lake, you often need to provide access to users from various different accounts. You can define a bucket policy to manage permissions for cross-account access, and manage permissions for users in another account. Bucket policies can be defined
based on various factors, such as S3 operations, type of requestor, type of resources, and the nature of
the request. A bucket policy defined for a bucket is applicable for all the objects in the bucket. This way
you can manage the permissions for all the objects centrally instead of managing access for individual
objects. Bucket policies can also be customized as per the user requirements, for example, a bucket
policy can be defined to allow access to a particular S3 bucket for a time interval when the request is
originated from a particular Classless Inter-Domain Routing (CIDR) block.
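As an illustration, the following sketch applies a bucket policy that grants read access to a role in another account only when the request originates from a given CIDR block. The account IDs, bucket name, and CIDR range are hypothetical.

# Minimal bucket policy sketch (Boto3); account IDs, bucket, and CIDR are hypothetical.
import json

import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CrossAccountReadFromCorporateNetwork",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::444455556666:role/partner-analytics"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-datalake-curated",
                "arn:aws:s3:::example-datalake-curated/*",
            ],
            # Only allow requests that originate from this CIDR block.
            "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
        }
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket="example-datalake-curated",
    Policy=json.dumps(policy),
)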
Similar to bucket policies, access-control lists are also used for managing permissions to a bucket or
an object. Access-control lists differ from bucket policies in multiple ways. Access-control lists can only
provide read/write permissions to an object or bucket. Additionally, access-control lists can be used when
granting bucket permission to other accounts or users in other accounts. Object access-control lists are
helpful when you want to manage access at an object level. The bucket owner can grant another account permission to upload objects by using object access-control lists. Bucket access-control lists, in contrast, are typically used to grant the S3 Log Delivery group permission to write access logs to the S3 bucket.
For most data lake environments, AWS recommends using user policies, so that permissions to access
data assets can also be linked to user roles and permissions for the data processing and analytics services
and tools that your data lake users will use. User policies are also recommended to be used if you want
to provide access to a user for objects in a bucket. User policies are associated with IAM, which allows you
to securely control access to AWS services and resources. With IAM, you can create IAM users, groups, and
roles in accounts and then attach access policies to them that grant access to AWS resources, including
S3. The model for user policies is shown inthe following figure. For more details and information on
securing S3 with userpolicies and IAM, refer to Amazon S3 security and Identity and access management
in Amazon S3.
Data Encryption with Amazon S3 and AWS KMS

AWS KMS helps scale and simplify management of encryption keys. AWS KMS gives you centralized
control over the encryption keys used to protect your data assets. You can create, import, rotate, disable,
delete, define usage policies for, and audit the use of encryption keys used to encrypt your data. AWS
KMS is integrated with several other AWS services, making it easy to encrypt the data stored in these
services with encryption keys. AWS KMS is integrated with AWS CloudTrail, which provides you with the
ability to audit who used which keys, on which resources, and when.
Data lakes built on AWS primarily use two types of encryption: server-side encryption and client-side
encryption. Server-side encryption provides data-at-rest encryption for data written to S3. With server-
side encryption, S3 encrypts user data assets at the object level, stores the encrypted objects, and then
decrypts them as they are accessed and retrieved. With client-side encryption, data objects are encrypted
before they are written into S3. For example, a data lake user can specify client-side encryption before
transferring data assets into S3 from the internet, or can specify that services such as Amazon EMR,
Amazon Athena, or Amazon Redshift use client-side encryption with S3.
Server-side encryption and client-side encryption can be combined for the highest levels of protection.
Given the intricacies of coordinating encryption key management in a complex environment, such as a
data lake, AWS strongly recommends using AWS KMS to coordinate keys across client-side and server-
side encryption and across multiple data processing and analytics services.
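A minimal Boto3 sketch follows that sets SSE-KMS as the bucket default encryption and also requests it explicitly on an individual upload. The bucket name and KMS key ARN are hypothetical.

# Minimal SSE-KMS sketch (Boto3); bucket name and key ARN are hypothetical.
import boto3

s3 = boto3.client("s3")
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/1234abcd-hypothetical"

# Make SSE-KMS the default for every object written to the bucket.
s3.put_bucket_encryption(
    Bucket="example-datalake-curated",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ARN,
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)

# Individual uploads can also request SSE-KMS explicitly.
s3.put_object(
    Bucket="example-datalake-curated",
    Key="sales/orders.parquet",
    Body=b"...",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId=KMS_KEY_ARN,
)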
For even greater levels of data lake protection, other services, such as Amazon API Gateway, Amazon
Cognito, and IAM, can be combined to create a “shopping cart” model for users to check in and check
out data lake assets. This approach is used by the Data Lake on AWS solution, a reference architecture for data lakes built on S3.
Depending on your compliance requirements, PII data may need to be handled separately and encrypted with keys held in dedicated hardware. However, using dedicated hardware hosted in a security account can introduce additional latency during data processing. You can use AWS CloudHSM or HashiCorp Vault to store the
keys. By using CloudHSM you can configure a CloudHSM cluster to store your custom keys and authorize
AWS KMS to use it as a dedicated key store. The following figure shows the primary components of
CloudHSM and a cluster of two CloudHSM instances connected to AWS KMS to create a customer-
controlled key store.
Customer-controlled key store using AWS CloudHSM cluster and AWS KMS
Protecting data with Amazon S3

Data protection rests on the inherent durability of the storage platform used. Durability is defined as
the ability to protect data assets against corruption and loss. Amazon S3 provides 99.999999999%
data durability, which is 4 to 6 orders of magnitude greater than that which most on-premises, single-
site storage platforms can provide. Put another way, the durability of Amazon S3 is designed so that if you store 10,000,000 data assets, you can expect to lose a single asset, on average, once every 10,000 years.
Amazon S3 achieves this durability in all 24 of its global Regions by using multiple Availability Zones.
Availability Zones consist of one or more discrete data centers, each with redundant power, networking,
and connectivity, housed in separate facilities. Availability Zones offer the ability to operate production
applications and analytics services, which are more highly available, fault tolerant, and scalable than
would be possible from a single data center. Data written to Amazon S3 is redundantly stored across
three Availability Zones and multiple devices within each Availability Zone to achieve 99.9999999%
durability. This means that even in the event of an entire data center failure, data would not be lost.
In addition to core durability, another key element of data protection is guarding data assets against unintentional and malicious deletion and corruption, whether through users accidentally deleting data assets, applications inadvertently deleting or corrupting data, or rogue actors trying to tamper with the data. This becomes especially important in a large multi-tenant data lake, which has a large number of users, many applications, and constant data processing and application development.
S3 provides versioning to protect data assets against these scenarios. When enabled, S3 versioning
will retain multiple copies of a data asset. When an asset is updated, prior versions of the asset will be
retained and can be retrieved at any time. If an asset is deleted, the last version of it can be retrieved.
Data asset versioning can be managed by policies to automate management at large scale. These policies can be combined with other Amazon S3 capabilities, such as lifecycle management, to retain prior versions long term on lower cost storage tiers such as Amazon S3 Glacier, and with Multi-Factor Authentication (MFA) Delete, which requires a second layer of authentication (typically an approved external authentication device) to delete data asset versions. Even though S3 provides 99.999999999% data durability within an AWS Region, many enterprise organizations may have compliance and risk models that require them to replicate their data assets to a second, geographically distant location and build disaster recovery architectures there.
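The following boto3 sketch enables versioning on a hypothetical bucket and adds a lifecycle rule that moves noncurrent versions to a lower cost tier and eventually expires them; the retention periods are illustrative and should be adjusted to your compliance requirements. MFA Delete must be enabled separately by the bucket owner's root user.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-datalake-curated"   # hypothetical bucket

    # Turn on versioning so prior versions of every object are retained.
    s3.put_bucket_versioning(
        Bucket=BUCKET,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Lifecycle rule: keep noncurrent versions, move them to an archive
    # tier after 30 days, and expire them after 365 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "retain-prior-versions",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},
                    "NoncurrentVersionTransitions": [
                        {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
                    ],
                    "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
                }
            ]
        },
    )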
Another Amazon S3 feature is S3 Object Lock, which allows you to write objects using a write-once-read-many (WORM) model. You can use Object Lock for scenarios in which it is imperative that data is not updated or deleted after it is written, to satisfy compliance regulations in the financial and healthcare sectors, or to retain the original copies of data for reconciliation and auditing purposes. Object Lock has been assessed for SEC Rule 17a-4(f), FINRA Rule 4511, and CFTC Regulation 1.31 by Cohasset Associates. You can download a copy of the Cohasset Associates Assessment report.
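A minimal sketch of configuring Object Lock with a default compliance-mode retention period follows; the bucket name and the seven-year retention are illustrative assumptions, and Object Lock can only be enabled when the bucket is created.

    import boto3

    s3 = boto3.client("s3")

    # Object Lock must be enabled at bucket creation time.
    s3.create_bucket(
        Bucket="example-datalake-records",        # hypothetical bucket
        ObjectLockEnabledForBucket=True,
    )

    # Default WORM retention of 7 years in compliance mode: objects cannot
    # be overwritten or deleted by any user until retention expires.
    s3.put_object_lock_configuration(
        Bucket="example-datalake-records",
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
        },
    )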
When creating a data lake on S3, you store millions of objects in the data lake, necessitating an
information asset register to keep track of the different objects, their replication status, updates, and
versions. You can use Amazon S3 Inventory to audit and report on the replication and encryption status
of the objects for business, compliance, and regulatory needs.
All of these S3 features and capabilities—when combined with other AWS services such as IAM, AWS
KMS, Amazon Cognito, and Amazon API Gateway—ensure that a data lake using Amazon S3 as its
core storage platform will meet the most stringent data security, compliance, privacy, and protection
requirements. Amazon S3 carries a broad range of compliance certifications, including PCI DSS, HIPAA/HITECH, FedRAMP, SEC Rule 17a-4, FISMA, the EU Data Protection Directive, and many other global agency certifications. These levels of compliance and protection allow organizations to build a data lake on AWS
that operates more securely and with less risk than one built in their own on-premises data centers.
Managing data with object tagging
In addition to being used for data classification, object tagging offers other important capabilities.
Object tags can be used in conjunction with IAM to enable fine-grained controls of access permissions.
For example, a particular data lake user can be granted permissions to only read objects with specific
tags.
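For example, the following boto3 sketch creates an IAM policy that allows reads only on objects carrying a hypothetical classification=public tag; the bucket name and tag key are assumptions for illustration.

    import json
    import boto3

    iam = boto3.client("iam")

    # Allow reads only on objects tagged classification=public in a
    # hypothetical data lake bucket.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::example-datalake-curated/*",
                "Condition": {
                    "StringEquals": {"s3:ExistingObjectTag/classification": "public"}
                },
            }
        ],
    }

    iam.create_policy(
        PolicyName="datalake-read-public-objects",
        PolicyDocument=json.dumps(policy_document),
    )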
Object tags can also be used to manage S3 data lifecycle policies, as discussed in the Monitoring and optimizing the data lake environment (p. 29) section of this document. A data lifecycle policy can contain tag-based filters. Finally, object tags can be combined with Amazon CloudWatch metrics and AWS CloudTrail logs (also discussed in that section) to display monitoring and action audit data filtered by specific data asset tags. In addition to object tags, you can enable cost allocation tags for your S3 buckets to track storage costs across projects. Cost allocation tags itemize costs by resource in the cost allocation report.
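The following boto3 sketch applies project and environment tags to a hypothetical bucket; activating those tag keys as cost allocation tags is done afterwards in the Billing console.

    import boto3

    s3 = boto3.client("s3")

    # Tag the bucket so storage costs can be broken out by project once
    # the tag keys are activated as cost allocation tags.
    s3.put_bucket_tagging(
        Bucket="example-datalake-raw",             # hypothetical bucket
        Tagging={
            "TagSet": [
                {"Key": "project", "Value": "clickstream-analytics"},
                {"Key": "environment", "Value": "production"},
            ]
        },
    )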
AWS Lake Formation: Centralized governance and access control
AWS Lake Formation centralizes governance of the data lake by managing two types of permissions:
• Metadata access control – Permissions for Data Catalog resources. These permissions enable principals
to create, read, update, and delete metadata databases and tables in the Data Catalog.
• Underlying data access control – Permissions on locations in Amazon S3. Data access permissions
enable principals to read and write data to underlying S3 locations. Data location permissions enable
principals to create metadata databases and tables that point to specific S3 locations.
Lake Formation uses the same metadata catalog as AWS Glue, and you can use AWS Glue crawlers to create data catalogs for the data stored in your data lakes. Because Lake Formation permissions work together with IAM permissions, principals need both the necessary IAM permissions and the corresponding Lake Formation permissions to use Lake Formation and its integrated services (AWS Glue for the Data Catalog and crawlers, Amazon Athena, Amazon Redshift, and Amazon EMR).
Lake Formation also controls access by integrated services to data stored in the data lake using fine-grained policies. With the named resource method, you grant permissions on specific databases or
tables by specifying database or table names. The recommended method to grant permissions is tag-
based access control (TBAC) because it simplifies access control while managing a large number of Data
Catalog resources and principals.
TBAC allows you to grant permissions based on one or more Lake Formation tags (LF-tags). You can create LF-tags; assign LF-tags to Data Catalog resources such as databases, tables, and columns; and grant LF-tag permissions to principals, optionally with the grant option. This approach scales well as the number of Data Catalog resources and principals grows.
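The following boto3 sketch illustrates the LF-tag workflow: create an LF-tag, assign it to a Data Catalog database, and grant permissions on resources that carry it. The tag key, database name, and role ARN are placeholders.

    import boto3

    lf = boto3.client("lakeformation")

    # Create an LF-tag (tag key and values are illustrative).
    lf.create_lf_tag(TagKey="classification", TagValues=["public", "restricted"])

    # Assign the LF-tag to an existing Data Catalog database.
    lf.add_lf_tags_to_resource(
        Resource={"Database": {"Name": "sales_db"}},
        LFTags=[{"TagKey": "classification", "TagValues": ["public"]}],
    )

    # Grant SELECT and DESCRIBE on all tables whose LF-tag expression matches.
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
        Resource={
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": "classification", "TagValues": ["public"]}],
            }
        },
        Permissions=["SELECT", "DESCRIBE"],
    )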
Data lake monitoring
Amazon CloudWatch
As an administrator, you need to look at the complete data lake environment holistically. This can be
achieved using Amazon CloudWatch. CloudWatch is a monitoring service for AWS Cloud resources and
the applications that run on AWS. You can use CloudWatch to collect and track metrics, collect and
monitor log files, set thresholds, and initiate alarms. This allows you to automatically react to changes in
your data lake built on S3.
You can also use CloudWatch metrics to understand and improve the performance of applications that use Amazon S3. CloudWatch generates daily storage metrics for your data lake built on S3 by collecting and processing storage data for your S3 buckets. You can also monitor the requests to your data lake built on S3 to identify and act on operational issues quickly. In addition, you can monitor S3 Replication metrics, such as the number and total size of objects pending replication and the maximum time required for replication to the destination Region.
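As a sketch, the following boto3 call creates an alarm on the daily BucketSizeBytes storage metric for a hypothetical bucket; the threshold and the SNS topic ARN are assumptions.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when the daily storage metric for a hypothetical bucket exceeds 50 TiB.
    cloudwatch.put_metric_alarm(
        AlarmName="datalake-raw-bucket-size",
        Namespace="AWS/S3",
        MetricName="BucketSizeBytes",
        Dimensions=[
            {"Name": "BucketName", "Value": "example-datalake-raw"},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        Statistic="Average",
        Period=86400,                     # daily storage metrics
        EvaluationPeriods=1,
        Threshold=50 * 1024 ** 4,         # 50 TiB in bytes
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:111122223333:datalake-alerts"],
    )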
Amazon Macie
Amazon Macie is a fully managed data security and data privacy service that uses machine learning and
pattern matching to discover, monitor, and protect your sensitive data stored in your data lake. Macie
can be used to scan your data lakes and discover sensitive information such as PII or financial data, and
identify and report overly permissive or unencrypted buckets.
Using Macie, you can run a sensitive data discovery job that identifies sensitive information using built-in criteria and techniques, such as machine learning and pattern matching, called managed data identifiers, to analyze objects in your data lake. You can also define your own criteria, called custom data identifiers, using a regular expression that defines a text pattern to match and, optionally, character sequences and a proximity rule that refine the results.
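The sketch below creates a custom data identifier for a hypothetical internal customer ID format and runs a one-time discovery job against an assumed bucket; Macie must already be enabled in the account.

    import boto3

    macie = boto3.client("macie2")

    # Custom data identifier: a regular expression for a hypothetical
    # internal customer ID format (CUST- followed by ten digits).
    cdi = macie.create_custom_data_identifier(
        name="internal-customer-id",
        regex=r"CUST-\d{10}",
    )

    # One-time sensitive data discovery job over a hypothetical bucket,
    # using managed data identifiers plus the custom identifier above.
    macie.create_classification_job(
        jobType="ONE_TIME",
        name="datalake-pii-scan",
        customDataIdentifierIds=[cdi["customDataIdentifierId"]],
        s3JobDefinition={
            "bucketDefinitions": [
                {"accountId": "111122223333", "buckets": ["example-datalake-raw"]}
            ]
        },
    )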
Additionally, after it is enabled, Macie maintains a complete inventory of your S3 buckets and evaluates and monitors them for security and access control. If Macie detects a potential issue, the service creates
a policy finding. Policy findings are generated when the policies or settings for an S3 bucket are changed in a way that reduces the security of the bucket and its objects. Macie integrates with other AWS services, such as Amazon EventBridge, a serverless event bus service that can send findings data to services such as AWS Lambda and Amazon Simple Notification Service (Amazon SNS) to act on the policy findings.
AWS CloudTrail
An operational data lake has many users and multiple administrators, and may be subject to compliance
and audit requirements, so it’s important to have a complete audit trail of actions taken and who has
performed these actions. AWS CloudTrail is an AWS service that enables governance, compliance,
operational auditing, and risk auditing of AWS accounts.
CloudTrail continuously monitors and retains events related to API calls across the AWS services that comprise a data lake. CloudTrail provides a history of AWS API calls for an account, including API calls made through the AWS Management Console, AWS SDKs, command line tools, and the services used in a data lake built on S3. You can identify which users and accounts made requests or took actions against AWS services that support CloudTrail, the source IP address the actions were made from, and when the actions occurred.
CloudTrail can be used to simplify data lake compliance audits by automatically recording and storing
activity logs for actions made within AWS accounts.
Integration with Amazon CloudWatch logs provides a convenient way to search through log data,
identify out-of-compliance events, accelerate incident investigations, and expedite responses to auditor
requests. CloudTrail logs are stored in a separate logs bucket within your data lake (refer to the Data lake
foundation (p. 11) section of this document) for durability and deeper analysis.
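The following boto3 sketch creates a multi-Region trail that delivers to a hypothetical logs bucket and records S3 object-level data events for a data lake bucket; the logs bucket must already have a bucket policy that allows CloudTrail to write to it.

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Multi-Region trail delivering to a dedicated, hypothetical logs bucket.
    cloudtrail.create_trail(
        Name="datalake-audit-trail",
        S3BucketName="example-datalake-logs",
        IsMultiRegionTrail=True,
    )

    # Record S3 data events (object-level reads and writes) for a data lake bucket.
    cloudtrail.put_event_selectors(
        TrailName="datalake-audit-trail",
        EventSelectors=[
            {
                "ReadWriteType": "All",
                "IncludeManagementEvents": True,
                "DataResources": [
                    {"Type": "AWS::S3::Object", "Values": ["arn:aws:s3:::example-datalake-raw/"]}
                ],
            }
        ],
    )

    cloudtrail.start_logging(Name="datalake-audit-trail")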
Cost and performance optimization
Data assets should be moved to lower cost storage classes based on how frequently they are accessed. Otherwise, data access charges associated with these more cost-effective storage classes can negate any potential savings. Amazon S3 provides storage class analysis to help you understand how data lake data assets are used. Storage class analysis uses machine learning algorithms on collected access data to help you develop lifecycle rules that optimize costs.
Seamlessly tiering to lower cost storage tiers is an important capability of a data lake, particularly as its users plan for, and move to, more advanced analytics and machine learning capabilities. Data lake users typically ingest raw data assets from many sources and transform those assets into harmonized formats that they can use for one-time querying and ongoing business intelligence querying through SQL. However, users also want to perform more advanced analytics using streaming analytics, machine learning, and artificial intelligence. These more advanced analytics capabilities consist of building data models, validating these data models with data assets, and then training and refining these models with historical data.
Keeping more historical data assets, particularly raw data assets, allows for better training and refinement of models. Additionally, as your organization’s analytics sophistication grows, you may want to go back and reprocess historical data to look for new insights and value. These historical data assets are infrequently accessed and consume a lot of capacity, so they are often well suited to be stored on an archival storage layer.
Another long-term data storage need for the data lake is to keep processed data assets and results
for long-term retention for compliance and audit purposes, to be accessed by auditors when needed.
Both of these use cases are well served by Amazon S3 Glacier storage class, which is an S3 storage class
optimized for infrequently used cold data, and for storing write once read many (WORM) data. You can
also use S3 Object Lock to adhere to regulatory compliance.
S3 Intelligent-Tiering
S3 Intelligent-Tiering is designed to optimize your storage costs by automatically moving data within your buckets based on observed access patterns. S3 Intelligent-Tiering uses two low-latency, high-throughput access tiers: one for frequently accessed data and another for infrequently accessed data. S3 Intelligent-Tiering monitors access patterns and automatically moves objects that have not been accessed for 30 consecutive days to the infrequent access tier, with no operational overhead or performance impact.
S3 Intelligent-Tiering is recommended for data with unpredictable access patterns, regardless of object type, size, or retention period, such as data lakes, data analytics applications, or new applications. If an object in the infrequent access tier is accessed, it is moved back to the frequent access tier. Additionally, you can configure S3 Intelligent-Tiering to automatically move objects that have not been accessed for 90 consecutive days to the Archive Access tier, and objects that have not been accessed for 180 consecutive days to the Deep Archive Access tier.
You can extend the last-access time used for archiving to a maximum of 730 days for both the Archive Access and Deep Archive Access tiers. The Archive Access tier provides the same performance as the Amazon S3 Glacier storage class, and the Deep Archive Access tier provides the same performance as the Amazon S3 Glacier Deep Archive storage class.
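The following boto3 sketch opts a hypothetical bucket in to the optional Archive Access and Deep Archive Access tiers using the 90- and 180-day thresholds described above.

    import boto3

    s3 = boto3.client("s3")

    # Enable the optional archive tiers for objects in S3 Intelligent-Tiering.
    s3.put_bucket_intelligent_tiering_configuration(
        Bucket="example-datalake-raw",             # hypothetical bucket
        Id="archive-cold-data",
        IntelligentTieringConfiguration={
            "Id": "archive-cold-data",
            "Status": "Enabled",
            "Tierings": [
                {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
                {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
            ],
        },
    )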
S3 Storage Lens
As your data lake becomes more popular and expands to accommodate data from multiple applications across multiple accounts and S3 buckets, it can become increasingly complicated to understand how data is used across the organization, optimize costs, and assess your security posture. S3 Storage Lens gives you visibility into your object storage across your organization with point-in-time metrics, trend lines, and actionable insights.
You can generate insights at the organization, account, Region, bucket, and prefix level. S3 aggregates your usage and metrics across all the accounts and provides an account snapshot on the S3 console (Buckets)
home page. You can use the Storage Lens dashboard to visualize insights and trends, identify outliers, receive recommendations for storage cost optimization, and so on.
You can use the Storage Lens dashboard to identify your largest buckets and take action to optimize their cost, because the rate charged depends on object size, storage duration, and storage class. When objects are uploaded using multipart upload, some uploads may fail or never complete; the incomplete parts remain in your buckets and continue to incur charges. You can identify these incomplete multipart uploads using the Storage Lens dashboard. Additionally, Storage Lens can help identify multiple versions of objects. Finally, you can use Storage Lens to uncover cold buckets, which are buckets that have not been accessed for a long period of time. All of these insights can be accessed from the Storage Lens dashboard.
Amazon S3 Glacier
Amazon S3 Glacier is a low-cost Amazon S3 storage class that provides durable storage with security
features for data archiving and backup. S3 Glacier has the same data durability (99.999999999%) and
supports lifecycle management on data assets stored in S3, so that data assets can seamlessly migrate
from Amazon S3 to S3 Glacier. S3 Glacier storage class is a great storage choice when low storage cost
is essential, data assets are rarely retrieved, and retrieval latency of several minutes to several hours is
acceptable.
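A minimal lifecycle sketch follows, transitioning assets under a hypothetical raw/ prefix to S3 Glacier after 180 days and to S3 Glacier Deep Archive after two years; the prefix and time thresholds are assumptions to adapt to your own retention requirements.

    import boto3

    s3 = boto3.client("s3")

    # Archive raw data assets to lower cost storage classes over time.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-datalake-raw",           # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-raw-assets",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "raw/"},
                    "Transitions": [
                        {"Days": 180, "StorageClass": "GLACIER"},
                        {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                }
            ]
        },
    )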
Different types of data lake assets may have different retrieval needs. For example, compliance data may
be infrequently accessed and be relatively small in size but needs to be made available in minutes when
auditors request data, whereas historical raw data assets may be very large but can be retrieved in bulk
over the course of a day when needed.
S3 Glacier allows data lake users to specify retrieval times when the data retrieval request is created, with longer retrieval times leading to lower retrieval costs. For processed data and records that need to be securely retained, Amazon S3 Glacier Vault Lock allows data lake administrators to deploy and enforce compliance controls on individual S3 Glacier vaults with a lockable policy. Administrators can specify controls such as WORM in a Vault Lock policy and lock the policy from future edits. After it is locked, the policy becomes immutable, and S3 Glacier enforces the prescribed controls to help achieve your compliance objectives and provide an audit trail for these assets using AWS CloudTrail.
Another area of optimization is to use optimal data formats when transforming raw data assets into normalized formats in preparation for querying and analytics. Optimal data formats can compress data, reducing the storage capacity required, and can substantially increase query performance for the analytics services commonly used with a data lake built on S3.
Data lake environments are designed to ingest and process many types of data, and store raw data assets
for future archival and reprocessing purposes, as well as store processed and normalized data assets for
active querying, analytics, and reporting. A key best practice to reduce storage and analytics processing
costs, and improve analytics querying performance, is to use an optimized data format, particularly a
format like Apache Parquet.
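As a small illustration, the following Python sketch converts a raw CSV extract to compressed Parquet with pyarrow; an AWS Glue job or pandas.DataFrame.to_parquet would work equally well, and the file paths are placeholders.

    # Minimal CSV-to-Parquet conversion sketch using pyarrow.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    table = pv.read_csv("raw/events-2021-11-16.csv")     # hypothetical local extract
    pq.write_table(
        table,
        "curated/events-2021-11-16.parquet",
        compression="snappy",          # columnar layout plus compression
    )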
Parquet is a columnar compressed storage file format that is designed for querying large amounts of
data, regardless of the data processing framework, data model, or programming language. Compared
to common raw data log formats such as CSV, JSON, or TXT format, Parquet can reduce the required
storage footprint, improve query performance significantly, and greatly reduce querying costs for AWS services that charge by the amount of data scanned.
Amazon tests comparing 1 TB of log data stored in CSV format with the same data converted to Parquet showed the following:
• Space savings of 87% with Parquet (1 TB of log data stored in CSV format compressed to 130 GB with
Parquet)
• A query time for a representative Athena query was 34x faster with Parquet (237 seconds for CSV
versus 5.13 seconds for Parquet), and the amount of data scanned for that Athena query was 99% less
(1.15 TB scanned for CSV versus 2.69 GB for Parquet)
• The cost to run that Athena query was 99.7% less ($5.75 for CSV versus $0.013 for Parquet)
Parquet has the additional benefit of being an open data format that can be used by multiple querying
and analytics tools in a data lake built on Amazon S3, particularly Amazon Athena, Amazon EMR,
Amazon Redshift, and Amazon Redshift Spectrum.
Additional options for performance optimization include rightsizing S3 objects to approximately 128 MB and partitioning data based on the business dates that are typically used when querying.
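The following pyarrow sketch writes a date-partitioned Parquet dataset so query engines can prune partitions by business date; the business_date column and paths are assumptions for illustration.

    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # Each distinct business_date value becomes a prefix such as
    # business_date=2021-11-16/ that engines like Athena can prune.
    table = pv.read_csv("raw/events.csv")                  # hypothetical extract
    pq.write_to_dataset(
        table,
        root_path="curated/events/",
        partition_cols=["business_date"],                  # assumed column name
    )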
Future proofing the data lake
AWS future proofs your data lake with a standardized storage solution that grows with your organization
by ingesting and storing all of your business’s data assets on a platform with virtually unlimited
scalability and well-defined APIs, and integrates with a wide variety of data processing tools. This allows
you to add new capabilities to your data lake as you need them without infrastructure limitations or
barriers. Additionally, you can perform agile analytics experiments against data lake assets to quickly
explore new processing methods and tools, and then scale the promising ones into production without the need to build new infrastructure, duplicate or migrate data, or move users to a new platform.
Conclusion
In closing, a data lake built on AWS allows you to evolve your business around your data assets, and to
use these data assets to quickly and agilely drive more business value and competitive differentiation
without limits.
Contributors
The following individuals and organizations contributed to this document:
Resources
• AWS Architecture Center
• AWS Whitepapers & Guides
• AWS Documentation
Document history
To be notified about updates to this whitepaper, subscribe to the RSS feed.
Updated for technical accuracy (November 16, 2021)
Note
To subscribe to RSS updates, you must have an RSS plug-in enabled for the browser you are
using.