Amazon Athena
User Guide
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not
Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or
discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may
or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
What is Amazon Athena? (p. 1)
    When should I use Athena? (p. 1)
    Accessing Athena (p. 1)
    Understanding Tables, Databases, and the Data Catalog (p. 2)
    AWS Service Integrations with Athena (p. 3)
Setting Up (p. 6)
    Sign Up for AWS (p. 6)
        To create an AWS account (p. 6)
    Create an IAM User (p. 6)
        To create a group for administrators (p. 6)
        To create an IAM user for yourself, add the user to the administrators group, and create a password for the user (p. 7)
    Attach Managed Policies for Using Athena (p. 7)
Getting Started (p. 8)
    Prerequisites (p. 8)
    Step 1: Create a Database (p. 8)
    Step 2: Create a Table (p. 9)
    Step 3: Query Data (p. 11)
    Connecting to Other Data Sources (p. 14)
Accessing Amazon Athena (p. 15)
    Using the Console (p. 15)
    Using the API (p. 15)
    Using the CLI (p. 15)
Connecting to Data Sources (p. 16)
    Integration with AWS Glue (p. 16)
        Using AWS Glue to Connect to Data Sources in Amazon S3 (p. 17)
        Best Practices When Using Athena with AWS Glue (p. 20)
        Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 29)
        FAQ: Upgrading to the AWS Glue Data Catalog (p. 32)
    Using a Hive Metastore (p. 34)
        Overview of Features (p. 35)
        Workflow (p. 35)
        Considerations and Limitations (p. 36)
        Connecting Athena to an Apache Hive Metastore (p. 37)
        Using the AWS Serverless Application Repository (p. 44)
        Connecting Athena to Hive Using an Existing Role (p. 46)
        Using a Default Catalog (p. 56)
        Using the AWS CLI with Hive Metastores (p. 59)
        Reference Implementation (p. 65)
    Using Amazon Athena Federated Query (p. 66)
        Considerations and Limitations (p. 67)
        Deploying a Connector and Connecting to a Data Source (p. 67)
        Using the AWS Serverless Application Repository (p. 69)
        Athena Data Source Connectors (p. 70)
        Writing Federated Queries (p. 73)
        Writing a Data Source Connector (p. 76)
    IAM Policies for Accessing Data Catalogs (p. 77)
        Data Catalog Example Policies (p. 78)
    Managing Data Sources (p. 81)
    Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 83)
        Using Athena with the JDBC Driver (p. 83)
        Connecting to Amazon Athena with ODBC (p. 85)
Creating Databases and Tables (p. 88)
    Creating Databases (p. 88)
What is Amazon Athena?
Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries
you run. Athena scales automatically—running queries in parallel—so results are fast, even with large
datasets and complex queries.
Topics
• When should I use Athena? (p. 1)
• Accessing Athena (p. 1)
• Understanding Tables, Databases, and the Data Catalog (p. 2)
• AWS Service Integrations with Athena (p. 3)
When should I use Athena?
Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate
reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an
ODBC driver. For more information, see What is Amazon QuickSight in the Amazon QuickSight User Guide
and Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 83).
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your
data in Amazon S3. This allows you to create tables and query data in Athena based on a central
metadata store available throughout your AWS account and integrated with the ETL and data discovery
features of AWS Glue. For more information, see Integration with AWS Glue (p. 16) and What is AWS
Glue in the AWS Glue Developer Guide.
For a list of AWS services that Athena leverages or integrates with, see the section called “AWS Service
Integrations with Athena” (p. 3).
Accessing Athena
You can access Athena using the AWS Management Console, a JDBC or ODBC connection, the Athena
API, the Athena CLI, the AWS SDK, or AWS Tools for Windows PowerShell.
• To get started with the console, see Getting Started (p. 8).
• To learn how to use JDBC or ODBC drivers, see Connecting to Amazon Athena with JDBC (p. 83) and
Connecting to Amazon Athena with ODBC (p. 85).
• To use the Athena API, see the Amazon Athena API Reference.
• To use the CLI, install the AWS CLI and then type aws athena help from the command line to see
available commands. For information about available commands, see the AWS Athena command line
reference.
• To use the AWS SDK for Java 2.x, see the Athena section of the AWS SDK for Java 2.x API Reference,
the Athena Java V2 Examples on GitHub.com, and the AWS SDK for Java 2.x Developer Guide.
• To use the AWS SDK for .NET, see the Amazon.Athena namespace in the AWS SDK for .NET Version 3
API Reference, the .NET Athena examples on GitHub.com, and the AWS SDK for .NET Developer Guide.
• To use AWS Tools for Windows PowerShell, see the AWS Tools for PowerShell - Amazon Athena cmdlet
reference, the AWS Tools for PowerShell portal page, and the AWS Tools for Windows PowerShell User
Guide.
• For information about Athena service endpoints that you can connect to programmatically, see
Amazon Athena endpoints and quotas in the Amazon Web Services General Reference.
Understanding Tables, Databases, and the Data Catalog
For each dataset that you'd like to query, Athena must have an underlying table that it uses for obtaining
and returning query results. Therefore, before querying data, a table must be registered in Athena. The
registration occurs when you create tables either automatically or manually.
Regardless of how the tables are created, the table creation process registers the dataset with Athena.
This registration occurs in the AWS Glue Data Catalog and enables Athena to run queries on the data.
• To create a table automatically, use an AWS Glue crawler from within Athena. For more information
about AWS Glue and crawlers, see Integration with AWS Glue (p. 16). When AWS Glue creates a
table, it registers it in its own AWS Glue Data Catalog. Athena uses the AWS Glue Data Catalog to store
and retrieve this metadata, using it when you run queries to analyze the underlying dataset.
After you create a table, you can use SQL SELECT (p. 437) statements to query it, including getting
specific file locations for your source data (p. 441). Your query results are stored in Amazon S3 in the
query result location that you specify (p. 127).
The AWS Glue Data Catalog is accessible throughout your AWS account. Other AWS services can share
the AWS Glue Data Catalog, so you can see databases and tables created throughout your organization
using Athena and vice versa. In addition, AWS Glue lets you automatically discover data schema and
extract, transform, and load (ETL) data.
When you create tables and databases manually, Athena uses HiveQL data definition language (DDL)
statements such as CREATE TABLE, CREATE DATABASE, and DROP TABLE under the hood to create
tables and databases in the AWS Glue Data Catalog.
Note
If you have tables in Athena created before August 14, 2017, they were created in an Athena-
managed internal data catalog that exists side-by-side with the AWS Glue Data Catalog until
you choose to update. For more information, see Upgrading to the AWS Glue Data Catalog Step-
by-Step (p. 29).
When you query an existing table, under the hood, Amazon Athena uses Presto, a distributed SQL
engine. We have examples with sample data within Athena to show you how to create a table and then
issue a query against it using Athena. Athena also has a tutorial in the console that helps you get started
creating a table based on data that is stored in Amazon S3.
• For a step-by-step tutorial on creating a table and writing queries in the Athena Query Editor, see
Getting Started (p. 8).
• Run the Athena tutorial in the console. This launches automatically if you log in to
https://console.aws.amazon.com/athena/ for the first time. You can also choose Tutorial in the console
to launch it.
AWS Service Integrations with Athena
AWS CloudFormation
Data Catalog
Specify an Athena data catalog, including a name, description, type, parameters, and tags. For
more information, see DataCatalog in the Amazon Athena API Reference.
Named Query
Specify named queries with AWS CloudFormation and run them in Athena. Named queries allow
you to map a query name to a query and then run it as a saved query from the Athena console.
For information, see CreateNamedQuery in the Amazon Athena API Reference.
Workgroup
Specify Athena workgroups using AWS CloudFormation. Use Athena workgroups to isolate
queries for you or your group from other queries in the same account. For more information,
see Using Workgroups to Control Query Access and Costs (p. 358) in the Amazon Athena User
Guide and CreateWorkGroup in the Amazon Athena API Reference.
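No template example appears in this excerpt. As a rough sketch, a CloudFormation template that defines
a named query might look like the following; the resource name, database, and query string are
hypothetical:

Resources:
  SampleNamedQuery:
    Type: AWS::Athena::NamedQuery
    Properties:
      Database: mydatabase                      # hypothetical database name
      Name: SampleQuery
      Description: Sample saved query
      QueryString: SELECT * FROM cloudfront_logs LIMIT 10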
Amazon CloudFront
Use Athena to query Amazon CloudFront. For more information about using CloudFront, see the
Amazon CloudFront Developer Guide.
AWS CloudTrail
Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity.
For example, you can use queries to identify trends and further isolate activity by attribute, such
as source IP address or user. You can create tables for querying logs directly from the CloudTrail
console, and use those tables to run queries in Athena. For more information, see Creating a Table
for CloudTrail Logs in the CloudTrail Console (p. 230).
Elastic Load Balancing
Querying Application Load Balancer logs allows you to see the source of traffic, latency, and bytes
transferred to and from Elastic Load Balancing instances and backend applications. For more
information, see Creating the Table for ALB Logs (p. 224).
Query Classic Load Balancer logs to analyze and understand traffic patterns to and from Elastic Load
Balancing instances and backend applications. You can see the source of traffic, latency, and bytes
transferred. For more information, see Creating the Table for ELB Logs (p. 226).
AWS Glue Data Catalog
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for
your data in Amazon S3. This allows you to create tables and query data in Athena based on a
central metadata store available throughout your AWS account and integrated with the ETL and
data discovery features of AWS Glue. For more information, see Integration with AWS Glue (p. 16)
and What is AWS Glue in the AWS Glue Developer Guide.
IAM
You can use Athena API actions in IAM permission policies. For more information, see Actions for
Amazon Athena and Identity and Access Management in Athena (p. 270).
Amazon QuickSight
Reference topic: Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 83)
Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to
generate reports or to explore data with business intelligence tools or SQL clients connected with
a JDBC or an ODBC driver. For more information about Amazon QuickSight, see What is Amazon
QuickSight in the Amazon QuickSight User Guide. For information about using JDBC and ODBC
drivers with Athena, see Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 83).
Amazon S3 Inventory
Reference topic: Querying inventory with Athena in the Amazon Simple Storage Service Developer
Guide
You can use Amazon Athena to query Amazon S3 inventory using standard SQL. You can use
Amazon S3 inventory to audit and report on the replication and encryption status of your objects for
business, compliance, and regulatory needs. For more information, see Amazon S3 inventory in the
Amazon Simple Storage Service Developer Guide.
AWS Step Functions
Reference topic: Call Athena with Step Functions in the AWS Step Functions Developer Guide
Call Athena with AWS Step Functions. AWS Step Functions can control certain AWS services directly
using the Amazon States Language. You can use Step Functions with Athena to start and stop query
execution, get query results, run ad-hoc or scheduled data queries, and retrieve results from data
lakes in Amazon S3. For more information, see the AWS Step Functions Developer Guide.
AWS Systems Manager Inventory
Reference topic: Querying inventory data from multiple Regions and accounts in the AWS Systems
Manager User Guide
AWS Systems Manager Inventory integrates with Amazon Athena to help you query inventory data
from multiple AWS Regions and accounts. For more information, see the AWS Systems Manager User
Guide.
Amazon Virtual Private Cloud
Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from
network interfaces in a VPC. Query the logs in Athena to investigate network traffic patterns and
identify threats and risks across your Amazon VPC network. For more information about Amazon
VPC, see the Amazon VPC User Guide.
Setting Up
If you've already signed up for Amazon Web Services (AWS), you can start using Amazon Athena
immediately. If you haven't signed up for AWS, or if you need assistance querying data using Athena, first
complete the tasks below:
Sign Up for AWS
If you have an AWS account already, skip to the next task. If you don't have an AWS account, use the
following procedure to create one.
Note your AWS account number, because you need it for the next task.
Create an IAM User
If you signed up for AWS but have not created an IAM user for yourself, you can create one using the IAM
console. If you aren't familiar with using the console, see Working with the AWS Management Console.
To create an IAM user for yourself, add the user to the administrators group, and create a password for the user
IAM users for your account sign in with an account-specific URL of the following form:
https://*your_account_alias*.signin.aws.amazon.com/console/
It is also possible that the sign-in link will use your account name instead of your account number. To
verify the sign-in link for IAM users for your account, open the IAM console and check under IAM users
sign-in link on the dashboard.
Getting Started
This tutorial walks you through using Amazon Athena to query data. You'll create a table based on
sample data stored in Amazon Simple Storage Service, query the table, and check the results of the
query.
The tutorial uses live resources, so you are charged for the queries that you run. You aren't charged
for the sample data in the location that this tutorial uses, but if you upload your own data files to
Amazon S3, charges do apply.
Prerequisites
• If you have not already done so, sign up for an account in Setting Up (p. 6).
• Using the same AWS Region (for example, US West (Oregon)) and account that you are using for
Athena, create a bucket in Amazon S3 to hold your query results from Athena.
Step 1: Create a Database
To create a database
4. In the Settings dialog box, enter the path to the bucket that you created in Amazon S3 for your
query results. Prefix the path with s3:// and add a forward slash to the end of the path.
5. Click Save.
6. In the Athena Query Editor, you see a query pane. You can type queries and statements here.
7. To create a database named mydatabase, enter the following CREATE DATABASE statement.
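The statement itself is not reproduced in this excerpt; the statement that creates the database is simply:

CREATE DATABASE mydatabase;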
Step 2: Create a Table
To create a table
3. In the query pane, enter the following CREATE TABLE statement. In the LOCATION statement at the
end of the query, replace myregion with the AWS Region that you are currently using (for example,
us-west-1).
Note
Replace myregion in s3://athena-examples-myregion/path/to/data/ with the region identifier
where you run Athena, for example, s3://athena-examples-us-west-1/path/to/data/.
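The CREATE TABLE statement is not reproduced in this excerpt. A representative statement follows; the
column list, row format, and data path are illustrative, and the published tutorial may use a different
SerDe and column set:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  request_ip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT,
  referrer STRING,
  os STRING,
  browser STRING,
  browser_version STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://athena-examples-myregion/path/to/data/';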
The table cloudfront_logs is created and appears under the list of Tables for the mydatabase
database.
Step 3: Query Data
To run a query
1. Open a new query tab and enter the following SQL statement in the query pane.
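The query itself is not reproduced in this excerpt. A simple query against the tutorial table (illustrative) is:

SELECT os, COUNT(*) AS hits
FROM cloudfront_logs
GROUP BY os
ORDER BY hits DESC
LIMIT 10;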
3. You can save the results of the query to a .csv file by choosing the download icon on the Results
pane.
5. Choose Download results to download the results of a previous query. Query history is retained for
45 days.
For more information, see Working with Query Results, Output Files, and Query History (p. 122).
Accessing Amazon Athena
Using the Console
In the right pane, the Query Editor displays an introductory screen that prompts you to create your first
table. You can view your tables under Tables in the left pane.
• Preview tables – View the query syntax in the Query Editor on the right.
• Show properties – Show a table's name, its location in Amazon S3, input and output formats, the
serialization (SerDe) library used, and whether the table has encrypted data.
• Delete table – Delete a table.
• Generate CREATE TABLE DDL – Generate the query behind a table and view it in the query editor.
For examples of using the AWS SDK for Java with Athena, see Code Samples (p. 485).
For more information about AWS SDK for Java documentation and downloads, see the SDKs section in
Tools for Amazon Web Services.
Connecting to Data Sources
The tables and databases that you work with in Athena to run queries are based on metadata. Metadata
is data about the underlying data in your dataset. How that metadata describes your dataset is called the
schema. For example, a table name, the column names in the table, and the data type of each column
are schema, saved as metadata, that describe an underlying dataset. In Athena, we call a system for
organizing metadata a data catalog or a metastore. The combination of a dataset and the data catalog
that describes it is called a data source.
The relationship of metadata to an underlying dataset depends on the type of data source that you work
with. Relational data sources like MySQL, PostgreSQL, and SQL Server tightly integrate the metadata
with the dataset. In these systems, the metadata is most often written when the data is written. Other
data sources, like those built using Hive, allow you to define metadata on-the-fly when you read the
dataset. The dataset can be in a variety of formats—for example, CSV, JSON, Parquet, or Avro.
Athena natively supports the AWS Glue Data Catalog. The AWS Glue Data Catalog is a data catalog
built on top of other datasets and data sources such as Amazon S3, Amazon Redshift, and Amazon
DynamoDB. You can also connect Athena to other data sources by using a variety of connectors.
Topics
• Integration with AWS Glue (p. 16)
• Using Athena Data Connector for External Hive Metastore (p. 34)
• Using Amazon Athena Federated Query (p. 66)
• IAM Policies for Accessing Data Catalogs (p. 77)
• Managing Data Sources (p. 81)
• Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 83)
Integration with AWS Glue
Athena natively supports querying datasets and data sources that are registered with the AWS Glue Data
Catalog. When you run Data Manipulation Language (DML) queries in Athena with the Data Catalog
as your source, you are using the Data Catalog schema to derive insight from the underlying dataset.
When you run Data Definition Language (DDL) queries, the schema that you define is created in the AWS
Glue Data Catalog. From within Athena, you can also run an AWS Glue crawler on a data source to create
schema in the AWS Glue Data Catalog.
In regions where AWS Glue is supported, Athena uses the AWS Glue Data Catalog as a central location to
store and retrieve table metadata throughout an AWS account. The Athena query engine requires table
metadata that instructs it where to read data, how to read it, and other information necessary to process
the data. The AWS Glue Data Catalog provides a unified metadata repository across a variety of data
sources and data formats, integrating not only with Athena, but with Amazon S3, Amazon RDS, Amazon
Redshift, Amazon Redshift Spectrum, Amazon EMR, and any application compatible with the Apache
Hive metastore.
For more information about the AWS Glue Data Catalog, see Populating the AWS Glue Data Catalog
in the AWS Glue Developer Guide. For a list of regions where AWS Glue is available, see Regions and
Endpoints in the AWS General Reference.
Separate charges apply to AWS Glue. For more information, see AWS Glue Pricing and Are there separate
charges for AWS Glue? (p. 33) For more information about the benefits of using AWS Glue with
Athena, see Why should I upgrade to the AWS Glue Data Catalog? (p. 32)
Topics
• Using AWS Glue to Connect to Data Sources in Amazon S3 (p. 17)
• Best Practices When Using Athena with AWS Glue (p. 20)
• Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 29)
• FAQ: Upgrading to the AWS Glue Data Catalog (p. 32)
Using AWS Glue to Connect to Data Sources in Amazon S3
To define schema information for AWS Glue to use, you can set up an AWS Glue crawler to retrieve the
information automatically, or you can manually add a table and enter the schema information.
Setting up a Crawler
You set up a crawler by starting in the Athena console and then using the AWS Glue console in an
integrated way. When you create a crawler, you can choose data stores to crawl or point the crawler to
existing catalog tables.
Note
The steps for setting up a crawler depend on the options available in the Athena console. If the
Connect data source link in Option A is not available, use the procedure in Option B.
Option A: To set up a crawler in AWS Glue using the Connect data source link
3. On the Connect data source page, choose AWS Glue Data Catalog.
4. Click Next.
5. On the Connection details page, choose Set up crawler in AWS Glue to retrieve schema
information automatically.
6. Click Connect to AWS Glue.
7. On the AWS Glue console Add crawler page, follow the steps to create a crawler.
For more information, see Populating the AWS Glue Data Catalog.
Option B
Use the following procedure to set up an AWS Glue crawler if the Connect data source link in Option A is
not available in the Athena console.
Option B: To set up a crawler in AWS Glue from the AWS Glue Data Catalog link
3. On the AWS Glue console Tables page, choose Add tables using a crawler.
4. On the AWS Glue console Add crawler page, follow the steps to create a crawler.
For more information, see Populating the AWS Glue Data Catalog.
Note
Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For
example, if you have an Amazon S3 bucket that contains both .csv and .json files and you
exclude the .json files from the crawler, Athena queries both groups of files. To avoid this,
place the files that you want to exclude in a different location.
• For the Apache Web Logs option, you must also enter a regex expression in the Regex box.
• For the Text File with Custom Delimiters option, specify a Field terminator (that is, a column
delimiter). Optionally, you can specify a Collection terminator for array types or a Map key
terminator.
12. For Columns, specify a column name and the column data type.
Best Practices When Using Athena with AWS Glue
Under the hood, Athena uses Presto to process DML statements and Hive to process the DDL statements
that create and modify schema. With these technologies, there are a couple of conventions to follow so
that Athena and AWS Glue work well together.
You can use the AWS Glue Catalog Manager to rename columns, but at this time table names and
database names cannot be changed using the AWS Glue console. To correct database names, you need to
create a new database and copy tables to it (in other words, copy the metadata to a new entity). You can
follow a similar process for tables. You can use the AWS Glue SDK or AWS CLI to do this.
Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in
Sync
AWS Glue crawlers can be set up to run on a schedule or on demand. For more information, see Time-
Based Schedules for Jobs and Crawlers in the AWS Glue Developer Guide.
If you have data that arrives for a partitioned table at a fixed time, you can set up an AWS Glue crawler
to run on schedule to detect and update table partitions. This can eliminate the need to run a potentially
long and expensive MSCK REPAIR command or manually run an ALTER TABLE ADD PARTITION
command. For more information, see Table Partitions in the AWS Glue Developer Guide.
For example, consider data that is stored in Amazon S3 using the following structure:
s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
s3://bucket01/folder1/table2/partition5/file.txt
If the schema for table1 and table2 are similar, and a single data source is set to s3://bucket01/
folder1/ in AWS Glue, the crawler may create a single table with two partition columns: one partition
column that contains table1 and table2, and a second partition column that contains partition1
through partition5.
To have the AWS Glue crawler create two separate tables, set the crawler to have two data sources,
s3://bucket01/folder1/table1/ and s3://bucket01/folder1/table2, as shown in the
following procedure.
3. Under Add information about your crawler, choose additional settings as appropriate, and then
choose Next.
4. Under Add a data store, change Include path to the table-level directory. For instance, given the
example above, you would change it from s3://bucket01/folder1 to s3://bucket01/
folder1/table1/. Choose Next.
6. For Include path, enter your other table-level directory (for example, s3://bucket01/folder1/
table2/) and choose Next.
a. Repeat steps 3-5 for any additional table-level directories, and finish the crawler configuration.
The new values for Include locations appear under data stores as follows:
When Athena runs a query, it validates the schema of the table and the schema of any partitions
necessary for the query. The validation compares the column data types in order and makes sure
that they match for the columns that overlap. This prevents unexpected operations such as adding
or removing columns from the middle of a table. If Athena detects that the schema of a partition
differs from the schema of the table, Athena may not be able to process the query and fails with
HIVE_PARTITION_SCHEMA_MISMATCH.
There are a few ways to fix this issue. First, if the data was accidentally added, you can remove the data
files that cause the difference in schema, drop the partition, and re-crawl the data. Second, you can drop
the individual partition and then run MSCK REPAIR within Athena to re-create the partition using the
table's schema. This second option works only if you are confident that the schema applied will continue
to read the data correctly.
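For example, assuming a table named mytable that is partitioned by a dt column (both names are
hypothetical), the second approach might look like this:

ALTER TABLE mytable DROP PARTITION (dt='2020-01-01');
MSCK REPAIR TABLE mytable;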
AWS Glue may mis-assign metadata when a CSV file has quotes around each data field, getting the
serializationLib property wrong. For more information, see CSV Data Enclosed in quotes (p. 25).
To run a query in Athena on a table created from a CSV file that has quoted values, you must modify
the table properties in AWS Glue to use the OpenCSVSerDe. For more information about the OpenCSV
SerDe, see OpenCSVSerDe for Processing CSV (p. 415).
For more information, see Viewing and Editing Table Details in the AWS Glue Developer Guide.
You can use the AWS Glue UpdateTable API operation or update-table CLI command to modify the
SerDeInfo block in the table definition, as in the following example JSON.
"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
"parameters": {
"separatorChar": ","
"quoteChar": "\""
"escapeChar": "\\"
}
},
...
STORED AS TEXTFILE
LOCATION 's3://my_bucket/csvdata_folder/'
TBLPROPERTIES ("skip.header.line.count"="1");
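Putting these pieces together, a complete statement for a quoted CSV file with a header row might look
like the following sketch; the table name, columns, and bucket path are illustrative:

CREATE EXTERNAL TABLE myopencsvtable (
  col1 string,
  col2 string,
  col3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://my_bucket/csvdata_folder/'
TBLPROPERTIES ("skip.header.line.count"="1");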
Alternatively, you can remove the CSV headers beforehand so that the header information is not
included in Athena query results. One way to achieve this is to use AWS Glue jobs, which perform
extract, transform, and load (ETL) work. You can write scripts in AWS Glue using a language that is an
extension of the PySpark Python dialect. For more information, see Authoring Jobs in Glue in the AWS
Glue Developer Guide.
The following example shows a function in an AWS Glue script that writes out a dynamic frame
using from_options, and sets the writeHeader format option to false, which removes the header
information:
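The script itself does not appear in this excerpt. A sketch of such a call is shown below; the frame, path,
and transformation context names are placeholders:

glueContext.write_dynamic_frame.from_options(
    frame = table_without_header,                      # DynamicFrame to write out
    connection_type = "s3",
    connection_options = {"path": "s3://MYBUCKET/MYTABLEDATA/"},
    format = "csv",
    format_options = {"writeHeader": False},           # omit the header row
    transformation_ctx = "datasink")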
If the table property was not added when the table was created, you can add it using the AWS Glue
console.
For more information, see Working with Tables in the AWS Glue Developer Guide.
We recommend using the Parquet and ORC data formats. AWS Glue supports writing to both of these
data formats, which can make it easier and faster for you to transform data to an optimal format for
Athena. For more information about these formats and other ways to improve performance, see Top
Performance Tuning Tips for Amazon Athena.
Converting SMALLINT and TINYINT Data Types to INT When Converting to ORC
To reduce the likelihood that Athena is unable to read the SMALLINT and TINYINT data types produced
by an AWS Glue ETL job, convert SMALLINT and TINYINT to INT when using the wizard or writing a
script for an ETL job.
Upgrading to the AWS Glue Data Catalog Step-by-Step
If you created databases and tables using Athena or Amazon Redshift Spectrum prior to a region's
support for AWS Glue, you can upgrade Athena to use the AWS Glue Data Catalog.
If you are using the older Athena-managed data catalog, you see the option to upgrade at the top of the
console. The metadata in the Athena-managed catalog isn't available in the AWS Glue Data Catalog or
vice versa. While the catalogs exist side-by-side, creating tables or databases with the same names fails
in either AWS Glue or Athena. This prevents name collisions when you do upgrade. For more information
about the benefits of using the AWS Glue Data Catalog, see FAQ: Upgrading to the AWS Glue Data
Catalog (p. 32).
A wizard in the Athena console can walk you through upgrading to the AWS Glue Data Catalog. The upgrade
takes just a few minutes, and you can pick up where you left off. For information about each upgrade
step, see the topics in this section.
For information about working with data and tables in the AWS Glue Data Catalog, see the guidelines in
Best Practices When Using Athena with AWS Glue (p. 20).
Step 1 - Allow a User to Perform the Upgrade
Before the upgrade can be performed, you need to attach a customer-managed IAM policy, with a policy
statement that allows the upgrade action, to the user who performs the migration.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:ImportCatalogToGlue"
],
"Resource": [ "*" ]
}
]
}
Step 2 - Update Customer-Managed/Inline Policies Associated with Athena Users
{
"Effect":"Allow",
"Action":[
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource":[
"*"
]
}
When you create a table using the console, you can create a table using an AWS Glue crawler. For more
information, see Using AWS Glue Crawlers (p. 21).
FAQ: Upgrading to the AWS Glue Data Catalog
Why should I upgrade to the AWS Glue Data Catalog?
AWS Glue offers the following features:
• An AWS Glue crawler can automatically scan your data sources, identify data formats, and infer
schema.
• A fully managed ETL service allows you to transform and move data to various destinations.
• The AWS Glue Data Catalog stores metadata information about databases and tables and points to a
data store in Amazon S3 or a JDBC-compliant data store.
Upgrading to the AWS Glue Data Catalog has the following benefits.
For more information, see Populating the AWS Glue Data Catalog.
Easy-to-build pipelines
The AWS Glue ETL engine generates Python code that is entirely customizable, reusable, and portable.
You can edit the code using your favorite IDE or notebook and share it with others using GitHub. After
your ETL job is ready, you can schedule it to run on the fully managed, scale-out Spark infrastructure of
AWS Glue. AWS Glue handles provisioning, configuration, and scaling of the resources required to run
your ETL jobs, allowing you to tightly integrate ETL with your workflow.
For more information, see Authoring AWS Glue Jobs in the AWS Glue Developer Guide.
migrating the catalog for the entire account. For more information, see Step 1 - Allow a User to Perform
the Upgrade (p. 30).
My users use a managed policy with Athena and Redshift Spectrum. What steps
do I need to take to upgrade?
The Athena managed policy has been automatically updated with new policy actions that allow Athena
users to access AWS Glue. However, you still must explicitly allow the upgrade action for the user who
performs the upgrade. To prevent accidental upgrade, the managed policy does not allow this action.
What happens if I don’t allow AWS Glue policies for Athena users?
If you upgrade to the AWS Glue Data Catalog and don't update a user's customer-managed or inline IAM
policies, Athena queries fail because the user won't be allowed to perform actions in AWS Glue. For the
specific actions to allow, see Step 2 - Update Customer-Managed/Inline Policies Associated with Athena
Users (p. 30).
Using Athena Data Connector for External Hive Metastore
Topics
• Overview of Features (p. 35)
• Workflow (p. 35)
• Considerations and Limitations (p. 36)
Overview of Features
With the Athena data connector for external Hive metastore, you can perform the following tasks:
• Use the Athena console to register custom catalogs and run queries using them.
• Define Lambda functions for different external Hive metastores and join them in Athena queries.
• Use the AWS Glue Data Catalog and your external Hive metastores in the same Athena query.
• Specify a catalog in the query execution context as the current default catalog. This removes the
requirement to prefix catalog names to database names in your queries. Instead of using the syntax
catalog.database.table, you can use database.table.
• Use a variety of tools to run queries that reference external Hive metastores. You can use the Athena
console, the AWS CLI, the AWS SDK, Athena APIs, and updated Athena JDBC and ODBC drivers. The
updated drivers have support for custom catalogs.
API Support
Athena Data Connector for External Hive Metastore includes support for catalog registration API
operations and metadata API operations.
• Catalog registration – Register custom catalogs for external Hive metastores and federated data
sources (p. 66).
• Metadata – Use metadata APIs to provide database and table information for AWS Glue and any
catalog that you register with Athena.
• Athena JAVA SDK client – Use catalog registration APIs, metadata APIs, and support for catalogs in
the StartQueryExecution operation in the updated Athena Java SDK client.
Reference Implementation
Athena provides a reference implementation for the Lambda function that connects to external Hive
metastores. The reference implementation is provided on GitHub as an open source project at Athena
Hive Metastore.
The reference implementation is available as the following two AWS SAM applications in the AWS
Serverless Application Repository (SAR). You can use either of these applications in the SAR to create
your own Lambda functions.
Workflow
The following diagram shows how Athena interacts with your external Hive metastore.
In this workflow, your database-connected Hive metastore is inside your VPC. You use Hive Server2 to
manage your Hive metastore using the Hive CLI.
The workflow for using external Hive metastores from Athena includes the following steps.
1. You create a Lambda function that connects Athena to the Hive metastore that is inside your VPC.
2. You register a unique catalog name for your Hive metastore and a corresponding function name in
your account.
3. When you run an Athena DML or DDL query that uses the catalog name, the Athena query engine calls
the Lambda function name that you associated with the catalog name.
4. Using AWS PrivateLink, the Lambda function communicates with the external Hive metastore in your
VPC and receives responses to metadata requests. Athena uses the metadata from your external Hive
metastore just like it uses the metadata from the default AWS Glue Data Catalog.
Considerations and Limitations
• DDL support for external Hive metastore is limited to the following statements.
• DESCRIBE TABLE
• SHOW COLUMNS
• SHOW TABLES
• SHOW SCHEMAS
• SHOW CREATE TABLE
• SHOW TBLPROPERTIES
• SHOW PARTITIONS
• The maximum number of registered catalogs that you can have is 1,000.
• You can use CTAS (p. 136) to create an AWS Glue table from a query on an external Hive metastore,
but not to create a table on an external Hive metastore.
• You can use INSERT INTO to insert data into an AWS Glue table from a query on an external Hive
metastore, but not to insert data into an external Hive metastore.
• Hive views are not compatible with Athena views and are not supported.
• Kerberos authentication for Hive metastore is not supported.
• To use the JDBC driver with an external Hive metastore or federated queries (p. 66), include
MetadataRetrievalMethod=ProxyAPI in your JDBC connection string. For information about the
JDBC driver, see Using Athena with the JDBC Driver (p. 83).
Permissions
Prebuilt and custom data connectors might require access to the following resources to function
correctly. Check the information for the connector that you use to make sure that you have configured
your VPC correctly. For information about required IAM permissions to run queries and create a
data source connector in Athena, see Allow Access to an Athena Data Connector for External Hive
Metastore (p. 287) and Allow Lambda Function Access to External Hive Metastores (p. 289).
• Amazon S3 – In addition to writing query results to the Athena query results location in Amazon
S3, data connectors also write to a spill bucket in Amazon S3. Connectivity and permissions to this
Amazon S3 location are required. For more information, see Spill Location in Amazon S3 (p. 37)
later in this topic.
• Athena – Access is required to check query status and prevent overscan.
• AWS Glue – Access is required if your connector uses AWS Glue for supplemental or primary metadata.
• AWS Key Management Service
• Policies – Hive metastore, Athena Query Federation, and UDFs require policies in addition to the
AmazonAthenaFullAccess Managed Policy (p. 271). For more information, see Identity and Access
Management in Athena (p. 270).
Connecting Athena to an Apache Hive Metastore
3. On the Connect data source page, for Choose a metadata catalog, choose Apache Hive metastore.
4. Choose Next.
5. On the Connection details page, for Lambda function, choose Configure new AWS Lambda
function.
6. Under Application settings, enter the parameters for your Lambda function.
When the deployment completes, your function appears in your list of Lambda applications. Now
that the Hive metastore function has been deployed to your account, you can configure Athena to
use it.
8. Return to the Connection details page of the Data Sources tab in the Athena console.
9. Choose the Refresh icon next to Choose Lambda function. Refreshing the list of available functions
causes your newly created function to appear in the list.
A new Lambda function ARN entry shows the ARN of your Lambda function.
11. For Catalog name, enter a unique name that you will use in your SQL queries to reference the data
source. The name can be up to 127 characters long and must be unique within your account. It
cannot be changed after creation. Valid characters are a-z, A-Z, 0-9, _ (underscore), @ (at sign),
and - (hyphen). The names awsdatacatalog, hive, jmx, and system are reserved by Athena and
cannot be used for custom catalog names.
12. (Optional) For Description, enter text that describes your data catalog.
13. Choose Connect. This connects Athena to your Hive metastore catalog.
The Data sources page shows a list of your connected catalogs, including the catalog that you just
connected. All registered catalogs are visible to all users in the same AWS account.
14. You can now use the Catalog name that you specified to reference the Hive metastore in your SQL
queries. In your SQL queries, use the following example syntax, replacing hms-catalog-1 with the
catalog name that you specified earlier.
15. To view, edit, or delete the data sources that you create, see Managing Data Sources (p. 81).
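The example syntax that step 14 refers to is not included in this excerpt. A sketch, using placeholder
database and table names, with the catalog name quoted because it contains hyphens:

SELECT * FROM "hms-catalog-1".hms_database.hms_table LIMIT 10;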
Using the AWS Serverless Application Repository
To use the AWS Serverless Application Repository to deploy a data source connector for Hive
to your account
1. Sign in to the AWS Management Console and open the Serverless App Repository.
2. In the navigation pane, choose Available applications.
3. Select the option Show apps that create custom IAM roles or resource policies.
4. In the search box, type the name of one of the following connectors. The two applications have
the same functionality and differ only in their implementation. You can use either one to create a
Lambda function that connects Athena to your Hive metastore.
6. Under Application settings, enter the parameters for your Lambda function.
At this point, you can configure Athena to use your Lambda function to connect to your Hive metastore.
For more information, see steps 8-15 of Connecting Athena to an Apache Hive Metastore (p. 37).
Connecting Athena to Hive Using an Existing Role
1. Clone and Build (p. 46) – Clone the Athena reference implementation and build the JAR file that
contains the Lambda function code.
2. AWS Lambda console (p. 47) – In the AWS Lambda console, create a Lambda function, assign it an
existing IAM execution role, and upload the function code that you generated.
3. Amazon Athena console (p. 52) – In the Amazon Athena console, create a data catalog name that
you can use to refer to your external Hive metastore in your Athena queries.
If you already have permissions to create a custom IAM role, you can use a simpler workflow that uses
the Athena console and the AWS Serverless Application Repository to create and configure a Lambda
function. For more information, see Connecting Athena to an Apache Hive Metastore (p. 37).
Prerequisites
• Git must be installed on your system.
• You must have Apache Maven installed.
• You have an IAM execution role that you can assign to the Lambda function. For more information, see
Allow Lambda Function Access to External Hive Metastores (p. 289).
2. Run the following command to build the .jar file for the Lambda function:
After the project builds successfully, the following .jar file is created in the target folder of your
project:
hms-lambda-func-1.0-SNAPSHOT-withdep.jar
In the next section, you use the AWS Lambda console to upload this file to your AWS account.
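The clone and build commands for steps 1 and 2 are not shown in this excerpt. For the reference
implementation at https://github.com/awslabs/aws-athena-hive-metastore, they would typically be:

git clone https://github.com/awslabs/aws-athena-hive-metastore.git
cd aws-athena-hive-metastore
mvn clean install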
1. Sign in to the AWS Management Console and open the AWS Lambda console at
https://console.aws.amazon.com/lambda/.
2. In the navigation pane, choose Functions.
3. Choose Create function.
4. Choose Author from scratch.
5. For Function name, enter the name of your Lambda function (for example, EHMSBasedLambda).
6. For Runtime, choose Java 8.
To upload your Lambda function code and configure its environment variables
1. In the Lambda console, navigate to the page for your function if necessary.
2. For Function code, choose Actions, and then choose Upload a .zip or .jar file.
5. On the Edit environment variables page, add the following environment variable keys and values:
• HMS_URIS – Use the following syntax to enter the URI of your Hive metastore host that uses the
Thrift protocol at port 9083.
thrift://<host_name>:9083.
6. Choose Save.
• Choose the Data sources tab, and then choose Connect data source.
3. On the Connect data source page, for Choose a metadata catalog, choose Apache Hive metastore.
4. Choose Next.
5. On the Connection details page, for Lambda function, use the Choose Lambda function option to
choose the Lambda function that you created.
A new Lambda function ARN entry shows the ARN of your Lambda function.
6. For Catalog name, enter a unique name that you will use in your SQL queries to reference your Hive
data source.
Note
The names awsdatacatalog, hive, jmx, and system are reserved by Athena and cannot
be used for custom catalog names.
7. Choose Connect. This connects Athena to your Hive metastore catalog.
8. You can now use the Catalog name that you specified to reference the Hive metastore in your SQL
queries. In your SQL queries, use the following example syntax, replacing ehms-catalog with the
catalog name that you specified earlier.
9. To view, edit, or delete the data sources that you create, see Managing Data Sources (p. 81).
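As in the earlier procedure, the example syntax that step 8 refers to is not shown here. A sketch with
placeholder database and table names:

SELECT * FROM "ehms-catalog".hms_database.hms_table LIMIT 10;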
Using a Default Catalog
DML Statements
To run queries with registered catalogs
1. You can put the catalog name before the database using the syntax
[[catalog_name].database_name].table_name, as in the following example.
2. When the catalog that you want to use is already selected as your data source, you can omit the
catalog name from the query, as in the following example.
3. For multiple catalogs, you can omit only the default catalog name. Specify the full name for any
non-default catalogs. For example, the FROM statement in the following query omits the catalog
name for the AWS Glue catalog, but it fully qualifies the first two catalog names.
...
FROM ehms01.hms_tpch.customer,
"hms-catalog-1".hms_tpch.orders,
hms_tpch.lineitem
...
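The examples for the first two cases are not included in this excerpt. Sketches that reuse the
hms-catalog-1 catalog and hms_tpch database from the fragment above:

-- Catalog name included in the query
SELECT * FROM "hms-catalog-1".hms_tpch.customer LIMIT 10;

-- Catalog already selected as the data source, so the prefix is omitted
SELECT * FROM hms_tpch.customer LIMIT 10;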
DDL Statements
The following Athena DDL statements support catalog name prefixes. Catalog name prefixes in other
DDL statements cause syntax errors.
As with DML statements, when you select the catalog and the database in the Data source panel, you
can omit the catalog prefix from the query.
In the following example, the data source and database are selected in the query editor. The show create
table customer statement succeeds when the hms-catalog-1 prefix and the hms_tpch database name
are omitted from the query.
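That is, with the catalog and database selected, a statement like the following (a sketch) succeeds
without any prefix:

show create table customer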
Using the AWS CLI with Hive Metastores
The following example registers the Hive metastore catalog named hms-catalog-1. The command has
been formatted for readability.
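The registration command itself is not reproduced in this excerpt. A sketch of the call, using the same
names and Lambda ARN that appear in the output below:

aws athena create-data-catalog \
    --name "hms-catalog-1" \
    --type "HIVE" \
    --description "Hive Catalog 1" \
    --parameters metadata-function=arn:aws:lambda:us-east-1:111122223333:function:external-hms-service-v3,sdk-version=1.0

The JSON that follows shows the catalog definition as it is returned when you retrieve it, for example
with aws athena get-data-catalog --name hms-catalog-1.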
{
"DataCatalog": {
"Name": "hms-catalog-1",
"Description": "Hive Catalog 1",
"Type": "HIVE",
"Parameters": {
"metadata-function": "arn:aws:lambda:us-east-1:111122223333:function:external-
hms-service-v3",
"sdk-version": "1.0"
}
}
}
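The next output lists the registered catalogs. A sketch of the command that produces it:

aws athena list-data-catalogs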
{
"DataCatalogs": [
{
"CatalogName": "AwsDataCatalog",
"Type": "GLUE"
},
{
"CatalogName": "hms-catalog-1",
"Type": "HIVE",
"Parameters": {
"metadata-function": "arn:aws:lambda:us-
east-1:111122223333:function:external-hms-service-v3",
"sdk-version": "1.0"
}
}
]
}
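The next output returns metadata for a single database. A sketch of the command; the catalog and
database names are placeholders:

aws athena get-database --catalog-name hms-catalog-1 --database-name mydb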
{
"Database": {
"Name": "mydb",
"Description": "My database",
"Parameters": {
"CreatedBy": "Athena",
"EXTERNAL": "TRUE"
}
}
}
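The output that follows lists the databases in a catalog. A sketch of the command:

aws athena list-databases --catalog-name hms-catalog-1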
{
"DatabaseList": [
{
"Name": "default"
},
{
"Name": "mycrawlerdatabase"
},
{
"Name": "mydatabase"
},
{
"Name": "sampledb",
"Description": "Sample database",
"Parameters": {
"CreatedBy": "Athena",
"EXTERNAL": "TRUE"
}
},
{
"Name": "tpch100"
}
]
}
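The next output shows metadata for a single table. A sketch of the command; the catalog and database
names are placeholders:

aws athena get-table-metadata --catalog-name hms-catalog-1 --database-name mydb --table-name cityuseragent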
{
"TableMetadata": {
"Name": "cityuseragent",
"CreateTime": 1586451276.0,
"LastAccessTime": 0.0,
"TableType": "EXTERNAL_TABLE",
"Columns": [
{
"Name": "city",
"Type": "string"
},
{
"Name": "useragent1",
"Type": "string"
}
],
"PartitionKeys": [],
"Parameters": {
"COLUMN_STATS_ACCURATE": "false",
"EXTERNAL": "TRUE",
"inputformat": "org.apache.hadoop.mapred.TextInputFormat",
"last_modified_by": "hadoop",
"last_modified_time": "1586454879",
"location": "s3://athena-data/",
"numFiles": "1",
"numRows": "-1",
"outputformat":
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"rawDataSize": "-1",
"serde.param.serialization.format": "1",
"serde.serialization.lib":
"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
"totalSize": "61"
}
}
}
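The final output lists metadata for the tables in a database. A sketch of the command; the catalog and
database names are placeholders:

aws athena list-table-metadata --catalog-name hms-catalog-1 --database-name mydb --max-items 2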
{
"TableMetadataList": [
{
"Name": "cityuseragent",
"CreateTime": 1586451276.0,
"LastAccessTime": 0.0,
"TableType": "EXTERNAL_TABLE",
"Columns": [
{
"Name": "city",
"Type": "string"
},
{
"Name": "useragent1",
"Type": "string"
}
],
"PartitionKeys": [],
"Parameters": {
"COLUMN_STATS_ACCURATE": "false",
"EXTERNAL": "TRUE",
"inputformat": "org.apache.hadoop.mapred.TextInputFormat",
"last_modified_by": "hadoop",
"last_modified_time": "1586454879",
"location": "s3://athena-data/",
"numFiles": "1",
"numRows": "-1",
"outputformat":
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"rawDataSize": "-1",
"serde.param.serialization.format": "1",
"serde.serialization.lib":
"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
"totalSize": "61"
}
},
{
"Name": "clearinghouse_data",
"CreateTime": 1589255544.0,
"LastAccessTime": 0.0,
"TableType": "EXTERNAL_TABLE",
"Columns": [
{
"Name": "location",
"Type": "string"
},
{
"Name": "stock_count",
"Type": "int"
},
{
"Name": "quantity_shipped",
"Type": "int"
}
],
"PartitionKeys": [],
"Parameters": {
"EXTERNAL": "TRUE",
"inputformat": "org.apache.hadoop.mapred.TextInputFormat",
"location": "s3://myjasondata/",
"outputformat":
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"serde.param.serialization.format": "1",
"serde.serialization.lib":
"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
"transient_lastDdlTime": "1589255544"
}
}
],
"NextToken":
"eyJsYXN0RXZhbHVhdGVkS2V5Ijp7IkhBU0hfS0VZIjp7InMiOiJ0Ljk0YWZjYjk1MjJjNTQ1YmU4Y2I5OWE5NTg0MjFjYTYzIn0sI
}
DDL Statements
The following example passes in the catalog name directly as part of the show create table DDL
statement. The command has been formatted for readability.
The following example DDL show create table statement uses the Catalog parameter of --query-
execution-context to pass the Hive metastore catalog name hms-catalog-1. The command has
been formatted for readability.
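The commands themselves are not reproduced here. A sketch of the two variants, assuming the start-query-execution CLI operation, the hms_tpch.customer table used elsewhere in this section, and a hypothetical output location:

aws athena start-query-execution \
    --query-string 'show create table `hms-catalog-1`.hms_tpch.customer' \
    --result-configuration "OutputLocation=s3://mybucket/results/"

aws athena start-query-execution \
    --query-string "show create table customer" \
    --query-execution-context "Catalog=hms-catalog-1,Database=hms_tpch" \
    --result-configuration "OutputLocation=s3://mybucket/results/"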
DML Statements
The following example DML select statement passes the catalog name into the query directly. The
command has been formatted for readability.
The following example DML select statement uses the Catalog parameter of --query-execution-
context to pass in the Hive metastore catalog name hms-catalog-1. The command has been
formatted for readability.
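A sketch of the two DML variants, under the same assumptions as the DDL sketch above:

aws athena start-query-execution \
    --query-string 'select * from "hms-catalog-1".hms_tpch.customer limit 10' \
    --result-configuration "OutputLocation=s3://mybucket/results/"

aws athena start-query-execution \
    --query-string "select * from customer limit 10" \
    --query-execution-context "Catalog=hms-catalog-1,Database=hms_tpch" \
    --result-configuration "OutputLocation=s3://mybucket/results/"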
Reference Implementation
Athena provides a reference implementation of its connector for external Hive metastore on GitHub.com
at https://github.com/awslabs/aws-athena-hive-metastore.
The reference implementation is an Apache Maven project that has the following modules:
• hms-service-api – Contains the API operations between the Lambda function and the Athena
service clients. These API operations are defined in the HiveMetaStoreService interface. Because
this is a service contract, you should not change anything in this module.
• hms-lambda-handler – A set of default Lambda handlers that process all Hive metastore API
calls. The class MetadataHandler is the dispatcher for all API calls. You do not need to change this
package.
• hms-lambda-layer – A Maven assembly project that puts hms-service-api, hms-lambda-
handler, and their dependencies into a .zip file. The .zip file is registered as a Lambda layer for use
by multiple Lambda functions.
• hms-lambda-func – An example Lambda function that has the following components.
• HiveMetaStoreLambdaFunc – An example Lambda function that extends MetadataHandler.
• ThriftHiveMetaStoreClient – A Thrift client that communicates with Hive metastore. This
client is written for Hive 2.3.0. If you use a different Hive version, you might need to update this
class to ensure that the response objects are compatible.
• ThriftHiveMetaStoreClientFactory – Controls the behavior of the Lambda
function. For example, you can provide your own set of handler providers by overriding the
getHandlerProvider() method.
• hms.properties – Configures the Lambda function. In most cases, you need to update only the following
two properties.
• hive.metastore.uris – The URI of the Hive metastore in the format
thrift://<host_name>:9083.
• hive.metastore.response.spill.location – The Amazon S3 location to store response
objects when their sizes exceed a given threshold (for example, 4MB). The threshold is defined in
the property hive.metastore.response.spill.threshold. Changing the default value is
not recommended.
Note
These two properties can be overridden by the Lambda environment variables HMS_URIS and
SPILL_LOCATION. Use these variables instead of recompiling the source code for the Lambda
function when you want to use the function with a different Hive metastore or spill location.
Before you build the artifacts, update the properties hive.metastore.uris and
hive.metastore.response.spill.location in the hms.properties file in the hms-lambda-
func module.
To build the artifacts, you must have Apache Maven installed and run the command mvn install. This
generates the layer .zip file in the output folder called target in the module hms-lambda-layer and
the Lambda function .jar file in the module hms-lambda-func.
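A minimal sketch of the build, assuming the repository layout described above:

git clone https://github.com/awslabs/aws-athena-hive-metastore.git
cd aws-athena-hive-metastore
# Set hive.metastore.uris and hive.metastore.response.spill.location
# in the hms.properties file of the hms-lambda-func module, then build.
mvn install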
Using Amazon Athena Federated Query
Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source
connector is a piece of code that can translate between your target data source and Athena. You can
think of a connector as an extension of Athena's query engine. Prebuilt Athena data source connectors
exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB, and
Amazon RDS, and for JDBC-compliant relational data sources such as MySQL and PostgreSQL under the
Apache 2.0 license. You can also use the Athena Query Federation SDK to write custom connectors. To
choose, configure, and deploy a data source connector to your account, you can use the Athena and
Lambda consoles or the AWS Serverless Application Repository. After you deploy a data source connector,
the connector is associated with a catalog that you can specify in SQL queries. You can combine SQL
statements from multiple catalogs and span multiple data sources with a single query.
When a query is submitted against a data source, Athena invokes the corresponding connector to
identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates.
Based on the user submitting the query, connectors can provide or restrict access to specific data
elements. Connectors use Apache Arrow as the format for returning data requested in a query, which
enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust. Because
connectors run in Lambda, they can be used to access data from any data source in the cloud
or on premises that is accessible from Lambda.
To write your own data source connector, you can use the Athena Query Federation SDK to customize
one of the prebuilt connectors that Amazon Athena provides and maintains. You can modify a copy of
the source code from the GitHub repository and then use the Connector Publish Tool to create your own
AWS Serverless Application Repository package.
Note
Third-party developers may have used the Athena Query Federation SDK to write data source
connectors. For support or licensing issues with these data source connectors, please work with
your connector provider. These connectors are not tested or supported by AWS.
For a list of data source connectors written and tested by Athena, see Using Athena Data Source
Connectors (p. 70).
For information about writing your own data source connector, see Example Athena Connector on GitHub.
Considerations and Limitations
JDBC driver – To use the JDBC driver with federated queries or an external Hive metastore (p. 34),
include MetadataRetrievalMethod=ProxyAPI in your JDBC connection string. For information
about the JDBC driver, see Using Athena with the JDBC Driver (p. 83).
Data source connectors might require access to the following resources to function correctly. If you use
a prebuilt connector, check the information for the connector to ensure that you have configured your
VPC correctly. Also, ensure that the IAM principals that run queries and create connectors have permission
to perform the required actions. For more information, see Example IAM Permissions Policies to Allow Athena Federated
Query (p. 293).
• Amazon S3 – In addition to writing query results to the Athena query results location in Amazon
S3, data connectors also write to a spill bucket in Amazon S3. Connectivity and permissions to this
Amazon S3 location are required.
• Athena – Data sources need connectivity to Athena and vice versa for checking query status and
preventing overscan.
• AWS Glue Data Catalog – Connectivity and permissions are required if your connector uses Data
Catalog for supplemental or primary metadata.
For the most up-to-date information about known issues and limitations, see Limitations and Issues in
the aws-athena-query-federation GitHub repository.
Deploying a Connector and Connecting to a Data Source
• Choose the Data sources tab, and then choose Connect data source.
• AthenaCatalogName – A name for the Lambda function that indicates the data source that it
targets, such as cloudwatchlogs.
• SpillBucket – An Amazon S3 bucket in your account to store data that exceeds Lambda function
response size limits.
8. Select I acknowledge that this app creates custom IAM roles. For more information, choose the
Info link.
9. Choose Deploy. The Resources section of the Lambda console shows the deployment status of the
connector and informs you when the deployment is complete.
To connect to a data source using a connector that you have deployed to your account
For information about writing queries with data connectors, see Writing Federated Queries (p. 73).
Using the AWS Serverless Application Repository
To use the AWS Serverless Application Repository to deploy a data source connector to your
account
1. Sign in to the AWS Management Console and open the Serverless App Repository.
2. In the navigation pane, choose Available applications.
3. Select the option Show apps that create custom IAM roles or resource policies.
4. In the search box, type the name of the connector, or search for applications published with the
author name Amazon Athena Federation. This author name is reserved for applications that
the Amazon Athena team has written, tested, and validated. For a list of prebuilt Athena data
connectors, see Using Athena Data Source Connectors (p. 70).
5. Choose the name of the connector. This opens the Lambda function's Application details page in
the AWS Lambda console.
6. On the right side of the details page, under Application settings, for SpillBucket, specify an Amazon
S3 bucket to receive data from large response payloads. For information about the remaining
configurable options, see the corresponding Available Connectors topic on GitHub.
7. At the bottom right of the Application details page, choose Deploy.
Using Athena Data Source Connectors
• For information about deploying an Athena data source connector, see Deploying a Connector and
Connecting to a Data Source (p. 67).
• For information about writing queries that use Athena data source connectors, see Writing Federated
Queries (p. 73).
• For complete information about the Athena data source connectors, see Available Connectors on
GitHub.
Topics
• Amazon Athena CloudWatch Connector (p. 70)
• Amazon Athena CloudWatch Metrics Connector (p. 71)
• Athena AWS CMDB Connector (p. 71)
• Amazon Athena DocumentDB Connector (p. 71)
• Amazon Athena DynamoDB Connector (p. 71)
• Amazon Athena Elasticsearch Connector (p. 71)
• Amazon Athena HBase Connector (p. 71)
• Amazon Athena Connector for JDBC-Compliant Data Sources (PostgreSQL, MySQL, and Amazon
Redshift) (p. 72)
• Amazon Athena Neptune Connector (p. 72)
• Amazon Athena Redis Connector (p. 72)
• Amazon Athena Timestream Connector (p. 72)
• Amazon Athena TPC Benchmark DS (TPC-DS) Connector (p. 72)
The connector maps your LogGroups as schemas and each LogStream as a table. The connector also
maps a special all_log_streams view that contains all LogStreams in the LogGroup. This view
enables you to query all the logs in a LogGroup at once instead of searching through each LogStream
individually.
For more information about configuration options, throttling control, table mapping schema,
permissions, deployment, performance considerations, and licensing, see Amazon Athena CloudWatch
Connector on GitHub.
For information about configuration options, table mapping, permissions, deployment, performance
considerations, and licensing, see Amazon Athena Cloudwatch Metrics Connector on GitHub.
For information about supported services, parameters, permissions, deployment, performance, and
licensing, see Amazon Athena AWS CMDB Connector on GitHub.
For information about how the connector generates schemas, configuration options, permissions,
deployment, and performance considerations, see Amazon Athena DocumentDB Connector on GitHub.
For information about configuration options, permissions, deployment, and performance considerations,
see Amazon Athena DynamoDB Connector on GitHub.
For information about configuration options, databases and tables, data types, deployment, and
performance considerations, see Amazon Athena Elasticsearch Connector on GitHub.
For information about configuration options, data types, permissions, deployment, performance, and
licensing, see Amazon Athena HBase Connector on GitHub.
For information about supported databases, configuration parameters, supported data types, JDBC
driver versions, limitations, and other information, see Amazon Athena Lambda JDBC Connector on
GitHub.
The Amazon Athena Neptune Connector enables Athena to communicate with your Neptune graph
database instance, making your Neptune graph data accessible by SQL queries.
For information about configuration options, permissions, deployment, performance, and limitations, see
Amazon Athena Neptune Connector on GitHub.
For information about configuration options, setting up databases and tables, data types, permissions,
deployment, performance, and licensing, see Amazon Athena Redis Connector on GitHub.
The Amazon Athena Timestream connector enables Amazon Athena to communicate with Amazon
Timestream timeseries data. You can optionally use AWS Glue Data Catalog as a source of supplemental
metadata.
For information about configuration options, setting up databases and tables, data types, permissions,
deployment, performance, and licensing, see Amazon Athena Timestream Connector on GitHub.
For information about configuration options, databases and tables, permissions, deployment,
performance, and licensing, see Amazon Athena TPC-DS Connector on GitHub.
Writing Federated Queries
To reference a table that a deployed connector exposes, use the catalog name that you registered for
the data source, in the form:
MyCloudwatchCatalog.database_name.table_name
Examples
The following example uses the Athena CloudWatch connector to connect to the all_log_streams
view in the /var/ecommerce-engine/order-processor CloudWatch Logs Log Group. The
all_log_streams view is a view of all the log streams in the log group. The example query limits the
number of rows returned to 100.
Example
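The query itself is not reproduced here; a sketch along these lines, assuming the connector is registered as the catalog MyCloudwatchCatalog:

SELECT *
FROM "MyCloudwatchCatalog"."/var/ecommerce-engine/order-processor".all_log_streams
LIMIT 100;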
The following example parses information from the same view as the previous example. The example
extracts the order ID and log level and filters out any message that has the level INFO.
Example
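A sketch of such a query, assuming the same catalog, log messages of the form level: text orderId=n, and the Presto regexp_extract function (the column aliases are illustrative):

SELECT
    regexp_extract(message, 'orderId=(\d+)', 1) AS order_id,
    regexp_extract(message, '(.*):.*', 1) AS log_level,
    message
FROM "MyCloudwatchCatalog"."/var/ecommerce-engine/order-processor".all_log_streams
WHERE regexp_extract(message, '(.*):.*', 1) != 'INFO';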
1. Payment processing in a secure VPC with transaction records stored in HBase on Amazon EMR
2. Redis to store active orders so that the processing engine can access them quickly
3. Amazon DocumentDB for customer account data such as email addresses and shipping addresses
4. A product catalog in Amazon Aurora for an ecommerce site that uses automatic scaling on Fargate
5. CloudWatch Logs to house the order processor's log events
6. A write-once-read-many data warehouse on Amazon RDS
7. DynamoDB to store shipment tracking data
Imagine that a data analyst for this ecommerce application discovers that the state of some orders is
being reported erroneously. Some orders show as pending even though they were delivered, while others
show as delivered but haven't shipped.
The analyst wants to know how many orders are being delayed and what the affected orders have
in common across the ecommerce infrastructure. Instead of investigating the sources of information
separately, the analyst federates the data sources and retrieves the necessary information in a single
query. Extracting the data into a single location is not necessary.
• CloudWatch Logs – Retrieves logs from the order processing service and uses regex matching and
extraction to filter for orders with WARN or ERROR events.
• Redis – Retrieves the active orders from the Redis instance.
• CMDB – Retrieves the ID and state of the Amazon EC2 instance that ran the order processing service
and logged the WARN or ERROR message.
• DocumentDB – Retrieves the customer email and address from Amazon DocumentDB for the affected
orders.
• DynamoDB – Retrieves the shipping status and tracking details from the shipping table to identify
possible discrepancies between reported and actual status.
• HBase – Retrieves the payment status for the affected orders from the payment processing service.
Example
Note
This example shows a query where the data source has been registered as a catalog with
Athena. You can also reference a data source connector Lambda function using the format
lambda:MyLambdaFunctionName.
"summary:status",
"summary:cc_id",
"details:network"
FROM "hbase".hbase_payments.transactions)
Writing a Data Source Connector
You can also customize Amazon Athena's prebuilt connectors for your own use. You can modify a copy
of the source code from GitHub and then use the Connector Publish Tool to create your own AWS
Serverless Application Repository package. After you deploy your connector in this way, you can use it in
your Athena queries.
Note
To use the Amazon Athena Federated Query feature, set your workgroup to Athena engine
version 2. For steps, see Changing Athena Engine Versions (p. 395).
For information about how to download the SDK and detailed instructions for writing your own
connector, see Example Athena Connector on GitHub.
IAM Policies for Accessing Data Catalogs
For IAM-specific information, see the links listed at the end of this section. For information about
example JSON data catalog policies, see Data Catalog Example Policies (p. 78).
To use the visual editor in the IAM console to create a data catalog policy
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane on the left, choose Policies, and then choose Create policy.
3. On the Visual editor tab, choose Choose a service. Then choose Athena to add to the policy.
4. Choose Select actions, and then choose the actions to add to the policy. The visual editor shows the
actions available in Athena. For more information, see Actions, Resources, and Condition Keys for
Amazon Athena in the Service Authorization Reference.
5. Choose add actions to type a specific action or use wildcards (*) to specify multiple actions.
By default, the policy that you are creating allows the actions that you choose. If you chose one or
more actions that support resource-level permissions to the datacatalog resource in Athena, then
the editor lists the datacatalog resource.
6. Choose Resources to specify the specific data catalogs for your policy. For example JSON data
catalog policies, see Data Catalog Example Policies (p. 78).
7. Specify the datacatalog resource as follows:
arn:aws:athena:<region>:<user-account>:datacatalog/<datacatalog-name>
8. Choose Review policy, and then type a Name and a Description (optional) for the policy that you
are creating. Review the policy summary to make sure that you granted the intended permissions.
9. Choose Create policy to save your new policy.
10. Attach this identity-based policy to a user, group, or role, and specify the datacatalog resources
that they can access.
For more information, see the following topics in the Service Authorization Reference and the IAM User
Guide:
For example JSON data catalog policies, see Data Catalog Example Policies (p. 78).
Data Catalog Example Policies
For a complete list of Amazon Athena actions, see the API action names in the Amazon Athena API
Reference.
A data catalog is an IAM resource managed by Athena. Therefore, if your data catalog policy uses actions
that take datacatalog as an input, you must specify the data catalog's ARN as follows:
"Resource": [arn:aws:athena:<region>:<user-account>:datacatalog/<datacatalog-name>]
The <datacatalog-name> is the name of your data catalog. For example, for a data catalog named
test_datacatalog, specify it as a resource as follows:
"Resource": ["arn:aws:athena:us-east-1:123456789012:datacatalog/test_datacatalog"]
For a complete list of Amazon Athena actions, see the API action names in the Amazon Athena API
Reference. For more information about IAM policies, see Creating Policies with the Visual Editor in the
IAM User Guide. For more information about creating IAM policies for data catalogs, see IAM Policies for
Accessing Data Catalogs (p. 77).
• Example Policy for Full Access to All Data Catalogs (p. 78)
• Example Policy for Full Access to a Specified Data Catalog (p. 78)
• Example Policy for Querying a Specified Data Catalog (p. 79)
• Example Policy for Management Operations on a Specified Data Catalog (p. 80)
• Example Policy for Listing Data Catalogs (p. 80)
• Example Policy for Metadata Operations on Data Catalogs (p. 80)
{
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"athena:*"
],
"Resource":[
"*"
]
}
]
}
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"athena:ListDataCatalogs",
"athena:ListWorkGroups",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetCatalogs",
"athena:GetNamespaces",
"athena:GetTables",
"athena:GetTable"
],
"Resource":"*"
},
{
"Effect":"Allow",
"Action":[
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution",
"athena:DeleteWorkGroup",
"athena:UpdateWorkGroup",
"athena:GetWorkGroup",
"athena:CreateWorkGroup"
],
"Resource":[
"arn:aws:athena:us-east-1:123456789012:workgroup/*"
]
},
{
"Effect":"Allow",
"Action":[
"athena:CreateDataCatalog",
"athena:DeleteDataCatalog",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDatabases",
"athena:ListTableMetadata",
"athena:UpdateDataCatalog"
],
"Resource":"arn:aws:athena:us-east-1:123456789012:datacatalog/datacatalogA"
}
]
}
In the following policy, a user is allowed to run queries on the specified datacatalogA. The user is not
allowed to perform management tasks for the data catalog itself, such as updating or deleting it.
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"athena:StartQueryExecution"
],
"Resource":[
"arn:aws:athena:us-east-1:123456789012:workgroup/*"
]
},
{
"Effect":"Allow",
"Action":[
"athena:GetDataCatalog"
],
"Resource":[
"arn:aws:athena:us-east-1:123456789012:datacatalog/datacatalogA"
]
}
]
}
In the following policy, a user is allowed to create, delete, obtain details, and update a data catalog
datacatalogA.
{
"Effect":"Allow",
"Action":[
"athena:CreateDataCatalog",
"athena:GetDataCatalog",
"athena:DeleteDataCatalog",
"athena:UpdateDataCatalog"
],
"Resource":[
"arn:aws:athena:us-east-1:123456789012:datacalog/datacatalogA"
]
}
The following policy allows all users to list all data catalogs:
{
"Effect":"Allow",
"Action":[
"athena:ListDataCatalogs"
],
"Resource":"*"
}
{
"Effect":"Allow",
"Action":[
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDatabases",
"athena:ListTableMetadata"
],
"Resource":"*"
}
Managing Data Sources
The details page includes options to Edit or Delete the data source.
• Select the button next to the catalog name, and then choose Edit.
• Choose the catalog name of the data source, and then choose Edit.
2. On the Edit page for the metastore, you can choose a different Lambda function for the data source
or change the description of the existing function. When you edit an AWS Glue catalog, the AWS
Glue console opens the corresponding catalog for editing.
3. Choose Save.
1. Select the button next to the data source or the name of the data source, and then choose Delete.
You are warned that when you delete a metastore data source, its corresponding Data Catalog,
tables, and views are removed from the query editor. Saved queries that used the metastore no
longer run in Athena.
2. Choose Delete.
Connecting to Amazon Athena with ODBC and JDBC Drivers
Topics
• Using Athena with the JDBC Driver (p. 83)
• Connecting to Amazon Athena with ODBC (p. 85)
Using Athena with the JDBC Driver
Download the driver that matches your version of the JDK and the JDBC data standards:
• The AthenaJDBC41.jar is compatible with JDBC 4.1 and requires JDK 7.0 or later.
• The AthenaJDBC42.jar is compatible with JDBC 4.2 and requires JDK 8.0 or later.
• Release Notes
• License Agreement
• Notices
• Third-Party Licenses
• JDBC Driver Installation and Configuration Guide. Use this guide to install and configure the driver.
• JDBC Driver Migration Guide. Use this guide to migrate from previous versions to the current version.
Important
To use the JDBC driver for multiple data catalogs with Athena (for example, when
using an external Hive metastore (p. 34) or federated queries (p. 66)), include
MetadataRetrievalMethod=ProxyAPI in your JDBC connection string.
For more information about the previous versions of the JDBC driver, see Using Earlier Version JDBC
Drivers (p. 493).
If you are migrating from a 1.x driver to a 2.x driver, you must migrate your existing configurations to the
new configuration. We highly recommend that you migrate to driver version 2.x. For information, see the
JDBC Driver Migration Guide.
Connecting to Amazon Athena with ODBC
ODBC 1.1.6 for Windows 32-bit – Windows 32 bit ODBC Driver 1.1.6
ODBC 1.1.6 for Windows 64-bit – Windows 64 bit ODBC Driver 1.1.6
Linux
ODBC 1.1.6 for Linux 32-bit – Linux 32 bit ODBC Driver 1.1.6
ODBC 1.1.6 for Linux 64-bit – Linux 64 bit ODBC Driver 1.1.6
OSX
Documentation
Documentation for ODBC 1.1.6 – ODBC Driver Installation and Configuration Guide version 1.1.6
Release Notes for ODBC 1.1.6 – ODBC Driver Release Notes version 1.1.6
• Keep port 444, which Athena uses to stream query results, open to outbound traffic. When
you use a PrivateLink endpoint to connect to Athena, ensure that the security group attached
to the PrivateLink endpoint is open to inbound traffic on port 444.
• Add the athena:GetQueryResultsStream policy action to the list of policies for Athena.
This policy action is not exposed directly with the API operation, and is used only with the
ODBC and JDBC drivers, as part of streaming results support. For an example policy, see
AWSQuicksightAthenaAccess Managed Policy (p. 273).
ODBC 1.0.5 for Windows 32-bit – Windows 32 bit ODBC Driver 1.0.5
ODBC 1.0.5 for Windows 64-bit – Windows 64 bit ODBC Driver 1.0.5
ODBC 1.0.5 for Linux 32-bit – Linux 32 bit ODBC Driver 1.0.5
ODBC 1.0.5 for Linux 64-bit – Linux 64 bit ODBC Driver 1.0.5
Documentation for ODBC 1.0.5 – ODBC Driver Installation and Configuration Guide version 1.0.5
ODBC 1.0.4 for Windows 32-bit – Windows 32 bit ODBC Driver 1.0.4
ODBC 1.0.4 for Windows 64-bit – Windows 64 bit ODBC Driver 1.0.4
ODBC 1.0.4 for Linux 32-bit – Linux 32 bit ODBC Driver 1.0.4
ODBC 1.0.4 for Linux 64-bit – Linux 64 bit ODBC Driver 1.0.4
Documentation for ODBC 1.0.4 – ODBC Driver Installation and Configuration Guide version 1.0.4
ODBC 1.0.3 for Windows 32-bit – Windows 32-bit ODBC Driver 1.0.3
ODBC 1.0.3 for Windows 64-bit – Windows 64-bit ODBC Driver 1.0.3
ODBC 1.0.3 for Linux 32-bit – Linux 32-bit ODBC Driver 1.0.3
ODBC 1.0.3 for Linux 64-bit – Linux 64-bit ODBC Driver 1.0.3
Documentation for ODBC 1.0.3 – ODBC Driver Installation and Configuration Guide version 1.0.3
ODBC 1.0.2 for Windows 32-bit – Windows 32-bit ODBC Driver 1.0.2
ODBC 1.0.2 for Windows 64-bit – Windows 64-bit ODBC Driver 1.0.2
ODBC 1.0.2 for Linux 32-bit – Linux 32-bit ODBC Driver 1.0.2
ODBC 1.0.2 for Linux 64-bit – Linux 64-bit ODBC Driver 1.0.2
Documentation for ODBC 1.0.2 – ODBC Driver Installation and Configuration Guide version 1.0.2
When you create a database and table in Athena, you describe the schema and the location of the data,
making the data in the table ready for real-time querying.
To improve query performance and reduce costs, we recommend that you partition your data and use
open source columnar formats for storage in Amazon S3, such as Apache Parquet or ORC.
Topics
• Creating Databases in Athena (p. 88)
• Creating Tables in Athena (p. 90)
• Names for Tables, Databases, and Columns (p. 96)
• Reserved Keywords (p. 97)
• Table Location in Amazon S3 (p. 98)
• Columnar Storage Formats (p. 100)
• Converting to Columnar Formats (p. 100)
• Partitioning Data (p. 104)
• Partition Projection with Amazon Athena (p. 109)
Creating Tables in Athena
When you create a new table schema in Athena, Athena stores the schema in a data catalog and uses it
when you run queries.
Athena uses an approach known as schema-on-read, which means a schema is projected on to your data
at the time you run a query. This eliminates the need for data loading or transformation.
Athena uses Apache Hive to define tables and create databases, which are essentially logical
namespaces of tables.
Considerations and Limitations
When you create a database and table in Athena, you are simply describing the schema and the location
where the table data are located in Amazon S3 for read-time querying. Database and table, therefore,
have a slightly different meaning than they do for traditional relational database systems because the
data isn't stored along with the schema definition for the database and table.
When you query, you query the table using standard SQL and the data is read at that time. You can find
guidance for how to create databases and tables using Apache Hive documentation, but the following
provides guidance specifically for Athena.
Hive supports multiple data formats through the use of serializer-deserializer (SerDe) libraries. You
can also define complex schemas using regular expressions. For a list of supported SerDe libraries, see
Supported SerDes and Data Formats (p. 409).
• Athena can only query the latest version of data on a versioned Amazon S3 bucket, and cannot query
previous versions of the data.
• You must have the appropriate permissions to work with data in the Amazon S3 location. For more
information, see Access to Amazon S3 (p. 274).
• Athena supports querying objects that are stored with multiple storage classes in the same bucket
specified by the LOCATION clause. For example, you can query data in objects that are stored in
different Storage classes (Standard, Standard-IA, and Intelligent-Tiering) in Amazon S3.
• Athena supports Requester Pays Buckets. For information about how to enable Requester Pays for buckets
with source data you intend to query in Athena, see Creating a Workgroup (p. 368).
• Athena does not support querying the data in the S3 Glacier or S3 Glacier Deep Archive storage
classes. Objects in the S3 Glacier storage class are ignored. Objects in the S3 Glacier Deep Archive
storage class that are queried result in the error message The operation is not valid for the object's
storage class. Data that is moved or transitioned to one of these classes is no longer readable or
queryable by Athena even after storage class objects are restored. To make the restored objects that
you want to query readable by Athena, copy the restored objects back into Amazon S3 to change their
storage class.
For information about storage classes, see Storage Classes, Changing the Storage Class of an Object in
Amazon S3, Transitioning to the GLACIER Storage Class (Object Archival) , and Requester Pays Buckets
in the Amazon Simple Storage Service Developer Guide.
• If you issue queries against Amazon S3 buckets with a large number of objects and the data is not
partitioned, such queries may affect the Get request rate limits in Amazon S3 and lead to Amazon
S3 exceptions. To prevent errors, partition your data. Additionally, consider tuning your Amazon S3
request rates. For more information, see Request Rate and Performance Considerations.
Functions Supported
The functions supported in Athena queries are those found within Presto. For more information, see the
documentation for Presto versions 0.172 and 0.217, which correspond to Athena engine versions 1 and
2.
Creating Tables Using AWS Glue or the Athena Console
For more information, see Using AWS Glue Crawlers (p. 21).
2. Under the database display in the Query Editor, choose Create table, and then choose from S3 bucket
data.
3. In the Add table wizard, follow the steps to create your table.
2. Enter a statement like the following, and then choose Run Query, or press Ctrl+ENTER.
Status INT,
Referrer STRING,
OS String,
Browser String,
BrowserVersion String
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s
+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://athena-examples-MyRegion/cloudfront/plaintext/';
After the table is created, you can run queries against your data.
Names for Tables, Databases, and Columns
Queries that use mixedCase column names, such as profileURI, or uppercase column names do not work.
Special characters
Special characters other than underscore (_) are not supported. For more information, see the Apache
Hive LanguageManual DDL documentation.
Important
Although you may succeed in creating table, view, database, or column names that contain
special characters other than underscore by enclosing them in backtick (`) characters,
subsequent DDL or DML queries that reference them can fail.
Reserved words
Certain reserved words in Athena must be escaped. To escape reserved keywords in DDL statements,
enclose them in backticks (`). To escape reserved keywords in SQL SELECT statements and in queries on
views (p. 131), enclose them in double quotes (").
Reserved Keywords
When you run queries in Athena that include reserved keywords, you must escape them by enclosing
them in special characters. Use the lists in this topic to check which keywords are reserved in Athena.
To escape reserved keywords in DDL statements, enclose them in backticks (`). To escape reserved
keywords in SQL SELECT statements and in queries on views (p. 131), enclose them in double quotes (").
You cannot use DDL reserved keywords as identifier names in DDL statements without enclosing them in
backticks (`).
ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH,
BY, CASE, CASHE, CAST, CHAR, COLUMN, CONF, CONSTRAINT, COMMIT, CREATE, CROSS, CUBE,
CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DAYOFWEEK, DECIMAL,
DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED,
EXTERNAL, EXTRACT, FALSE, FETCH, FLOAT, FLOOR, FOLLOWING, FOR, FOREIGN, FROM, FULL,
FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTEGER,
INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE,
NONE, NOT, NULL, NUMERIC, OF, ON, ONLY, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN,
PARTITION,
PERCENT, PRECEDING, PRECISION, PRESERVE, PRIMARY, PROCEDURE, RANGE, READS, REDUCE, REGEXP,
REFERENCES, REVOKE, RIGHT, RLIKE, ROLLBACK, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT,
START,TABLE,
TABLESAMPLE, THEN, TIME, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE,
UNBOUNDED,UNION,
UNIQUEJOIN, UPDATE, USER, USING, UTC_TIMESTAMP, VALUES, VARCHAR, VIEWS, WHEN, WHERE,
WINDOW, WITH
If you use these keywords as identifiers, you must enclose them in double quotes (") in your query
statements.
Examples of Queries with Reserved Words
The following example queries include a column name containing the DDL-related reserved keywords in
ALTER TABLE ADD PARTITION and ALTER TABLE DROP PARTITION statements. The DDL reserved
keywords are enclosed in backticks (`):
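The statements are not reproduced here; a sketch along these lines, assuming a table named logs with a partition column named date and a hypothetical bucket, shows the escaping:

ALTER TABLE logs ADD PARTITION (`date` = '2021-01-01') LOCATION 's3://mybucket/logs/2021/01/01/';
ALTER TABLE logs DROP PARTITION (`date` = '2021-01-01');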
The following example query includes a reserved keyword (end) as an identifier in a SELECT statement.
The keyword is escaped in double quotes:
SELECT *
FROM TestTable
WHERE "end" != nil;
The following example query includes a reserved keyword (first) in a SELECT statement. The keyword is
escaped in double quotes:
SELECT "itemId"."first"
FROM testTable
LIMIT 10;
Table Location in Amazon S3
To specify the path to your data in Amazon S3, use the LOCATION property, as shown in the following
example:
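A sketch, with hypothetical column names and bucket:

CREATE EXTERNAL TABLE test_table (
  id string,
  value int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/folder/';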
• For information about naming buckets, see Bucket Restrictions and Limitations in the Amazon Simple
Storage Service Developer Guide.
• For information about using folders in Amazon S3, see Using Folders in the Amazon Simple Storage
Service Console User Guide.
The LOCATION in Amazon S3 specifies all of the files representing your table.
Important
Athena reads all data stored in s3://bucketname/folder/. If you have data that you do
not want Athena to read, do not store that data in the same Amazon S3 folder as the data you
want Athena to read. If you are leveraging partitioning, to ensure Athena scans data within a
partition, your WHERE filter must include the partition. For more information, see Table Location
and Partitions (p. 99).
When you specify the LOCATION in the CREATE TABLE statement, use the following guidelines:
Use:
s3://bucketname/folder/
• Do not use any of the following items for specifying the LOCATION for your data.
• Do not use filenames, underscores, wildcards, or glob patterns for specifying file locations.
• Do not add the full HTTP notation, such as s3.amazon.com to the Amazon S3 bucket path.
• Do not specify an Amazon S3 access point in the LOCATION clause. The table location can only be
specified as a URI.
• Do not use empty folders like // in the path, as follows: S3://bucketname/folder//folder/.
While this is a valid Amazon S3 path, Athena does not allow it and changes it to
s3://bucketname/folder/folder/, removing the extra /.
Do not use:
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
s3://bucketname/prefix/filename.csv
s3://test-bucket.s3.amazon.com
S3://bucket/prefix//prefix/
arn:aws:s3:::bucketname/prefix
s3://arn:aws:s3:<region>:<account_id>:accesspoint/<accesspointname>
https://<accesspointname>-<number>.s3-accesspoint.<region>.amazonaws.com
Table Location and Partitions
When you create a table, you can choose to make it partitioned. When Athena runs a SQL query against
a non-partitioned table, it uses the LOCATION property from the table definition as the base path to list
and then scan all available files. However, before a partitioned table can be queried, you must update
the AWS Glue Data Catalog with partition information. This information represents the schema of files
within the particular partition and the LOCATION of files in Amazon S3 for the partition.
• To learn how the AWS Glue crawler adds partitions, see How Does a Crawler Determine When to Create
Partitions? in the AWS Glue Developer Guide.
• To learn how to configure the crawler so that it creates tables for data in existing partitions, see Using
Multiple Data Sources with Crawlers (p. 22).
• You can also create partitions in a table directly in Athena. For more information, see Partitioning
Data (p. 104).
When Athena runs a query on a partitioned table, it checks to see if any partitioned columns are used
in the WHERE clause of the query. If partitioned columns are used, Athena requests the AWS Glue Data
Catalog to return the partition specification matching the specified partition columns. The partition
specification includes the LOCATION property that tells Athena which Amazon S3 prefix to use when
reading data. In this case, only data stored in this prefix is scanned. If you do not use partitioned columns
in the WHERE clause, Athena scans all the files that belong to the table's partitions.
For examples of using partitioning with Athena to improve query performance and reduce query costs,
see Top Performance Tuning Tips for Amazon Athena.
Columnar Storage Formats
Columnar storage formats have the following characteristics that make them suitable for use with
Athena:
• Compression by column, with compression algorithm selected for the column data type to save storage
space in Amazon S3 and reduce disk space and I/O during query processing.
• Predicate pushdown in Parquet and ORC enables Athena queries to fetch only the blocks it needs,
improving query performance. When an Athena query obtains specific column values from your data,
it uses statistics from data block predicates, such as max/min values, to determine whether to read or
skip the block.
• Splitting of data in Parquet and ORC allows Athena to split the reading of data to multiple readers and
increase parallelism during its query processing.
To convert your existing raw data from other storage formats to Parquet or ORC, you can run CREATE
TABLE AS SELECT (CTAS) (p. 136) queries in Athena and specify a data storage format as Parquet or
ORC, or use the AWS Glue Crawler.
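For example, a CTAS query along the following lines writes a Parquet copy of a table (a sketch; the table and bucket names are hypothetical):

CREATE TABLE my_table_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://mybucket/parquet-output/'
) AS
SELECT * FROM my_raw_table;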
Converting to Columnar Formats
You can convert existing Amazon S3 data sources to Parquet or ORC by creating a cluster in Amazon EMR
and converting the data using Hive. The following example using the AWS CLI shows you how to do this
with a script and data stored in Amazon S3.
Overview
The process for converting to columnar formats using an EMR cluster is as follows:
s3://athena-examples-myregion/conversion/write-parquet-to-s3.q
Note
Replace MyRegion in the LOCATION clause with the region where you are running queries.
For example, if your console is in us-west-1, s3://us-west-1.elasticmapreduce/
samples/hive-ads/tables/.
This creates the table in Hive on the cluster which uses samples located in the Amazon EMR samples
bucket.
3. On Amazon EMR release 4.7.0, include the ADD JAR line to find the appropriate JsonSerDe. The
prettified sample data looks like the following:
{
"number": "977680",
"referrer": "fastcompany.com",
"processId": "1823",
"adId": "TRktxshQXAHWo261jAHubijAoNlAqA",
"browserCookie": "mvlrdwrmef",
"userCookie": "emFlrLGrm5fA2xLFT5npwbPuG7kf6X",
"requestEndTime": "1239714001000",
"impressionId": "1I5G20RmOuG2rt7fFGFgsaWk9Xpkfb",
"userAgent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR
2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; InfoPa",
"timers": {
"modelLookup": "0.3292",
"requestTime": "0.6398"
},
"threadId": "99",
"ip": "67.189.155.225",
"modelId": "bxxiuxduad",
"hostname": "ec2-0-51-75-39.amazon.com",
"sessionId": "J9NOccA3dDMFlixCuSOtl9QBbjs6aS",
"requestBeginTime": "1239714000000"
}
4. In Hive, load the data from the partitions, so the script runs the following:
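A sketch of that step, assuming the table is named impressions as in the sample data:

MSCK REPAIR TABLE impressions;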
The script then creates a table that stores your data in a Parquet-formatted file on Amazon S3:
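A sketch of such a statement, run in Hive on the cluster; the column list is drawn from the sample data above, and the table name parquet_hive and the bucket are illustrative:

CREATE EXTERNAL TABLE parquet_hive (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
) STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';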
The data are inserted from the impressions table into parquet_hive:
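A sketch of the insert, again run in Hive, selecting from the partition mentioned below:

INSERT OVERWRITE TABLE parquet_hive
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip
FROM impressions
WHERE dt = '2009-04-14-04-05';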
The script stores the above impressions table columns from the date, 2009-04-14-04-05, into s3://
myBucket/myParquet/ in a Parquet-formatted file.
5. After your EMR cluster is terminated, create your table in Athena, which uses the data in the format
produced by the cluster.
Example: Converting data to Parquet using an EMR cluster
3. Create an Amazon EMR cluster using the emr-4.7.0 release to convert the data using the following
AWS CLI emr create-cluster command:
export REGION=us-west-1
export SAMPLEURI=s3://${REGION}.elasticmapreduce/samples/hive-ads/tables/impressions/
export S3BUCKET=myBucketName
For more information, see Create and Use IAM Roles for Amazon EMR in the Amazon EMR
Management Guide.
Look for the script step status. If it is COMPLETED, then the conversion is done and you are ready to
query the data.
5. Create the same table that you created on the EMR cluster.
You can use the same statement as above. Log into Athena and enter the statement in the Query
Editor window:
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string
) STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';
Alternatively, you can select the view (eye) icon next to the table's name in Catalog:
Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving
performance and reducing cost. Athena leverages Apache Hive for partitioning data. You can partition
your data by any key. A common practice is to partition the data based on time, often leading to a multi-
level partitioning scheme. For example, a customer who has data coming in every hour might decide to
partition by year, month, date, and hour. Another customer, who has data coming from many different
sources but loaded one time per day, may partition by a data source identifier and date.
• If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data
only from that partition. For more information, see Table Location and Partitions (p. 99).
• If you issue queries against Amazon S3 buckets with a large number of objects and the data is not
partitioned, such queries may affect the GET request rate limits in Amazon S3 and lead to Amazon
S3 exceptions. To prevent errors, partition your data. Additionally, consider tuning your Amazon
S3 request rates. For more information, see Best Practices Design Patterns: Optimizing Amazon S3
Performance .
• Partition locations to be used with Athena must use the s3 protocol (for example,
s3://bucket/folder/). In Athena, locations that use other protocols (for example,
s3a://bucket/folder/) will result in query failures when MSCK REPAIR TABLE queries are run on
the containing tables.
• Because MSCK REPAIR TABLE scans both a folder and its subfolders to find a matching partition scheme,
be sure to keep data for separate tables in separate folder hierarchies. For example, suppose you have
data for table A in s3://table-a-data and data for table B in s3://table-a-data/table-b-
data. If both tables are partitioned by string, MSCK REPAIR TABLE will add the partitions for table B
to table A. To avoid this, use separate folder structures like s3://table-a-data and s3://table-
b-data instead. Note that this behavior is consistent with Amazon EMR and Apache Hive.
After you create the table, you load the data in the partitions for querying. For Hive-compatible data,
you run MSCK REPAIR TABLE (p. 463). For non-Hive compatible data, you use ALTER TABLE ADD
PARTITION (p. 449) to add the partitions manually.
1. Data is already partitioned, stored on Amazon S3, and you need to access the data on Athena.
2. Data is not partitioned.
aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
PRE dt=2009-04-12-14-10/
PRE dt=2009-04-12-14-15/
PRE dt=2009-04-12-14-20/
PRE dt=2009-04-12-15-00/
PRE dt=2009-04-12-15-05/
Here, logs are stored with the column name (dt) set equal to date, hour, and minute increments. When
you give a DDL with the location of the parent folder, the schema, and the name of the partitioned
column, Athena can query data in those subfolders.
Creating a Table
To make a table out of this data, create a partition along 'dt' as in the following Athena DDL statement:
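A sketch of such a statement, using a subset of the fields from the sample data and the Hive JSON SerDe (the column list in the published example may differ):

CREATE EXTERNAL TABLE impressions (
    requestBeginTime string,
    adId string,
    impressionId string,
    referrer string,
    userAgent string,
    userCookie string,
    ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';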
This table uses Hive's native JSON serializer-deserializer to read JSON data stored in Amazon S3. For
more information about the formats supported, see Supported SerDes and Data Formats (p. 409).
After you run the preceding statement in Athena, choose New Query and run the following command:
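The command in question is typically MSCK REPAIR TABLE (a sketch, using the table name from the preceding statement):

MSCK REPAIR TABLE impressions;

Athena then recognizes the partitions, and a query that selects dt and impressionid from the table for a given hour returns results similar to the following.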
2009-04-12-13-20 ap3HcVKAWfXtgIPu6WpuUfAfL0DQEc
2009-04-12-13-20 17uchtodoS9kdeQP1x0XThKl5IuRsV
2009-04-12-13-20 JOUf1SCtRwviGw8sVcghqE5h0nkgtp
2009-04-12-13-20 NQ2XP0J0dvVbCXJ0pb4XvqJ5A4QxxH
2009-04-12-13-20 fFAItiBMsgqro9kRdIwbeX60SROaxr
2009-04-12-13-20 V4og4R9W6G3QjHHwF7gI1cSqig5D1G
2009-04-12-13-20 hPEPtBwk45msmwWTxPVVo1kVu4v11b
2009-04-12-13-20 v0SkfxegheD90gp31UCr6FplnKpx6i
2009-04-12-13-20 1iD9odVgOIi4QWkwHMcOhmwTkWDKfj
2009-04-12-13-20 b31tJiIA25CK8eDHQrHnbcknfSndUk
In this case, you would have to use ALTER TABLE ADD PARTITION to add each partition manually.
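For example, a statement along these lines adds one partition that points at a specific Amazon S3 prefix (the table, partition column, and path are illustrative):

ALTER TABLE orders ADD PARTITION (dt = '2016-05-14') LOCATION 's3://mystorage/path/to/data/2016/05/14/';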
Additional Resources
• You can use CTAS and INSERT INTO to partition a dataset. For more information, see Using CTAS and
INSERT INTO for ETL and Data Analysis (p. 145).
• You can automate adding partitions by using the JDBC driver (p. 83).
Partition Projection with Amazon Athena
In partition projection, partition values and locations are calculated from configuration rather than read
from a repository like the AWS Glue Data Catalog. Because in-memory operations are often faster than
remote operations, partition projection can reduce the runtime of queries against highly partitioned
tables. Depending on the specific characteristics of the query and underlying data, partition projection
can significantly reduce query runtime for queries that are constrained on partition metadata retrieval.
Normally, when processing queries, Athena makes a GetPartitions call to the AWS Glue Data Catalog
before performing partition pruning. If a table has a large number of partitions, using GetPartitions
can affect performance negatively. To avoid this, you can use partition projection. Partition projection
allows Athena to avoid calling GetPartitions because the partition projection configuration gives
Athena all of the necessary information to build the partitions itself.
Use Cases
Scenarios in which partition projection is useful include the following:
• Queries against a highly partitioned table do not complete as quickly as you would like.
• You regularly add partitions to tables as new date or time partitions are created in your data. With
partition projection, you configure relative date ranges that can be used as new data arrives.
• You have highly partitioned data in Amazon S3. The data is impractical to model in your AWS Glue
Data Catalog or Hive metastore, and your queries read only small parts of it.
• Integers – Any continuous sequence of integers such as [1, 2, 3, 4, ..., 1000] or [0500,
0550, 0600, ..., 2500].
• Dates – Any continuous sequence of dates or datetimes such as [20200101, 20200102, ...,
20201231] or [1-1-2020 00:00:00, 1-1-2020 01:00:00, ..., 12-31-2020 23:00:00].
• Enumerated values – A finite set of enumerated values such as airport codes or AWS Regions.
• AWS service logs – AWS service logs typically have a known structure whose partition scheme you can
specify in AWS Glue and that Athena can therefore use for partition projection. For an example, see
Amazon Kinesis Data Firehose Example (p. 121).
• Partition projection eliminates the need to specify partitions manually in AWS Glue or an external Hive
metastore.
• When you enable partition projection on a table, Athena ignores any partition metadata in the AWS
Glue Data Catalog or external Hive metastore for that table.
• If a projected partition does not exist in Amazon S3, Athena will still project the partition. Athena
does not throw an error, but no data is returned. However, if too many of your partitions are empty,
performance can be slower compared to traditional AWS Glue partitions. If more than half of your
projected partitions are empty, it is recommended that you use traditional partitions.
• Partition projection is usable only when the table is queried through Athena. If the same table is read
through another service such as Amazon Redshift Spectrum or Amazon EMR, the standard partition
metadata is used.
• Because partition projection is a DML-only feature, SHOW PARTITIONS does not list partitions that are
projected by Athena but not registered in the AWS Glue catalog or external Hive metastore.
• Views in Athena do not use projection configuration properties.
Topics
• Setting up Partition Projection (p. 110)
• Supported Types for Partition Projection (p. 115)
• Dynamic ID Partitioning (p. 119)
• Amazon Kinesis Data Firehose Example (p. 121)
Setting up Partition Projection
1. Specify the data ranges and relevant patterns for each partition column, or use a custom template.
2. Enable partition projection for the table.
This section shows how to set these table properties for AWS Glue. To set them, you can use the AWS
Glue console, Athena CREATE TABLE (p. 454) queries, or AWS Glue API operations. The following
procedure shows how to set the properties in the AWS Glue console.
To configure and enable partition projection using the AWS Glue console
1. Sign in to the AWS Management Console and open the AWS Glue console at https://
console.aws.amazon.com/glue/.
2. Choose the Tables tab.
On the Tables tab, you can edit existing tables, or choose Add tables to create new ones. For
information about adding tables manually or with a crawler, see Working with Tables on the AWS
Glue Console in the AWS Glue Developer Guide.
3. In the list of tables, choose the link for the table that you want to edit.
5. In the Edit table details dialog box, in the Table properties section, for each partitioned column,
add the following key-value pair:
The following example table configuration configures the year column for partition projection,
restricting the values that can be returned to a range from 2000 through 2016.
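The year configuration described here corresponds to table properties along these lines (a sketch):

projection.year.type = integer
projection.year.range = 2000,2016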
7. Add a key-value pair to enable partition projection. For Key, enter projection.enabled, and for
its Value, enter true.
Note
You can disable partition projection on this table at any time by setting
projection.enabled to false.
8. When you are finished, choose Apply.
9. In the Athena Query Editor, test query the columns that you configured for the table.
The following example query uses SELECT DISTINCT to return the unique values from the year
column. The database contains data from 1987 to 2016, but the projection.year.range
property restricts the values returned to the years 2000 to 2016.
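A sketch of such a query, assuming a table named mytable:

SELECT DISTINCT year FROM mytable ORDER BY year;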
Note
If you set projection.enabled to true but fail to configure one or more partition
columns, you receive an error message like the following:
HIVE_METASTORE_ERROR: Table database_name.table_name is configured
for partition projection, but the following partition columns
are missing projection configuration: [column_name] (table
database_name.table_name).
Using a custom template is optional. However, if you use a custom template, the template must contain
a placeholder for each partition column.
1. Following the steps to configure and enable partition projection using the AWS Glue console, add an
additional key-value pair that specifies a custom template as follows:
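The key for a custom template is storage.location.template; a sketch of the key-value pair, using the first example template below:

storage.location.template = s3://bucket/table_root/a=${a}/${b}/some_static_subdirectory/${c}/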
The following example template values assume a table with partition columns a, b, and c.
s3://bucket/table_root/a=${a}/${b}/some_static_subdirectory/${c}/
s3://bucket/table_root/c=${c}/${b}/some_static_subdirectory/${a}/${b}/${c}/${c}/
For the same table, the following example template value is invalid because it contains no
placeholder for column c.
s3://bucket/table_root/a=${a}/${b}/some_static_subdirectory/
2. Choose Apply.
Supported Types for Partition Projection
Enum Type
Use the enum type for partition columns whose values are members of an enumerated set (for example,
airport codes or AWS Regions).
projection.columnName.values
    Example values: A,B,C,D,E,F,G,Unknown
    Required. A comma-separated list of enumerated partition values for column columnName. Any white space is considered part of an enum value.
Note
As a best practice, we recommend limiting the use of enum-based partition projections to a few dozen values or fewer. Although there is no specific limit for enum projections, the total size of your table’s metadata cannot exceed the AWS Glue limit of about 1 MB when gzip compressed. Note that this limit is shared across key parts of your table, such as column names, location, storage format, and others. If you find yourself using more than a few dozen unique IDs in your enum projection, consider an alternative approach such as bucketing into a smaller number of unique values in a surrogate field. By trading off cardinality, you can control the number of unique values in your enum field.
Integer Type
Use the integer type for partition columns whose possible values are interpretable as integers within a defined range. Projected integer columns are currently limited to the range of a Java signed long (-2^63 to 2^63 - 1 inclusive).
projection.columnName.interval
    Example values: 1, 5
    Optional. A positive integer that specifies the interval between successive partition values for the column columnName. For example, a range value of "1,3" with an interval value of "1" produces the values 1, 2, and 3. The same range value with an interval value of "2" produces the values 1 and 3.
projection.columnName.digits
    Example values: 1, 5
    Optional. A positive integer that specifies the number of digits to include in the partition value's final representation for column columnName. For example, a range value of "1,3" that has a digits value of "1" produces the values 1, 2, and 3. The same range value with a digits value of "2" produces the values 01, 02, and 03. Leading and trailing white space is allowed. The default is no static number of digits and no leading zeroes.
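As an illustration (not an example from the original guide), the properties for a hypothetical two-digit month column could be:
projection.month.type      integer
projection.month.range     1,12
projection.month.digits    2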
Date Type
Use the date type for partition columns whose values are interpretable as dates (with optional times)
within a defined range.
Important
Projected date columns are generated in Coordinated Universal Time (UTC) at query execution
time.
Values in the range property for a date column can also be expressed relative to the current time by using the NOW keyword and an optional offset, conforming to the following pattern:
\s*NOW\s*(([\+\-])\s*([0-9]+)\s*(YEARS?|MONTHS?|WEEKS?|DAYS?|HOURS?|MINUTES?|SECONDS?)\s*)?
projection.columnName.format
    Example values: yyyyMM, dd-MM-yyyy, dd-MM-yyyy-HH-mm-ss
    Required. A date format string based on the Java date format DateTimeFormatter. Can be any supported java.time.* type.
projection.columnName.interval
    Example values: 1, 5
    A positive integer that specifies the interval between successive partition values for column columnName. For example, a range value of 2017-01,2018-12 with an interval value of 1 and an interval.unit value of MONTHS produces the values 2017-01, 2017-02, 2017-03, and so on. The same range value with an interval value of 2 and an interval.unit value of MONTHS produces the values 2017-01, 2017-03, 2017-05, and so on. Leading and trailing white space is allowed.
projection.columnName.interval.unit
    Example values: YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS
    A time unit word that represents the serialized form of a ChronoUnit. Possible values are YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, SECONDS, or MILLISECONDS. These values are case insensitive.
    When the provided dates are at single-day or single-month precision, the interval.unit is optional and defaults to 1 day or 1 month, respectively. Otherwise, the interval.unit is required.
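As an illustration (not an example from the original guide), the properties for a hypothetical monthly dt partition column could be:
projection.dt.type             date
projection.dt.range            2017-01,NOW
projection.dt.format           yyyy-MM
projection.dt.interval         1
projection.dt.interval.unit    MONTHS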
Injected Type
Use the injected type for partition columns with possible values that cannot be procedurally generated
within some logical range but that are provided in a query's WHERE clause as a single value.
• Queries on injected columns fail if a filter expression is not provided for each injected column.
• Queries on an injected column fail if a filter expression on the column allows multiple values.
Dynamic ID Partitioning
You might have tables partitioned on a unique identifier column that has the following characteristics:
For such partitioning schemes, the enum projection type would be impractical for the following reasons:
• You would have to modify the table properties each time a value is added to the table.
• A single table property would have millions of characters or more.
• Projection requires that all partition columns be configured for projection. This requirement could not
be avoided for only one column.
Injection
If your query pattern on a dynamic ID dataset always specifies a single value for the high cardinality
partition column, you can use value injection. Injection avoids the need to project the full partition space.
Imagine that you want to partition an IoT dataset on a UUID field that has extremely high cardinality like
device_id. The field has the following characteristics:
However, if all of your queries include a WHERE clause that filters for only a single device_id, you can
use the following approach in your CREATE TABLE statement.
...
PARTITIONED BY
(
device_id STRING
)
LOCATION "s3://bucket/prefix/"
TBLPROPERTIES
(
"projection.enabled" = "true",
"projection.device_id.type" = "injected",
"storage.location.template" = "s3://bucket/prefix/${device_id}"
)
SELECT
col1,
col2,...,
device_id
FROM
table
WHERE
device_id = "b6319dc2-48c1-4cd5-a0a3-a1969f7b48f7"
AND (
col1 > 0
or col2 < 10
)
In the example, Athena projects only a single partition for any given query. This avoids the need to store
and act upon millions or billions of virtual partitions only to find one partition and read from it.
Bucketing
In the bucketing technique, you use a fixed set of bucket values rather than the entire set of identifiers for your partitioning. If you can map an identifier to a bucket, you can use this mapping in your queries, and you benefit much as you would by partitioning on the identifiers themselves. Bucketing also offers the following advantages:
• You can specify more than one value at a time for a field in the WHERE clause.
• You can continue to use your partitions with more traditional metastores.
Using the scenario in the previous example and assuming 1 million buckets, identified by an integer, the
CREATE TABLE statement becomes the following.
...
PARTITIONED BY
(
BUCKET_ID BIGINT
)
LOCATION "s3://bucket/prefix/"
TBLPROPERTIES
(
"projection.enabled" = "true",
"projection.bucket_id.type" = "integer",
"projection.bucket_id.range" = "1,1000000"
)
A corresponding SELECT query uses a mapping function in the WHERE clause, as in the following
example.
SELECT
col1,
col2,...,
identifier
FROM
table
WHERE
bucket_id = map_identifier_to_bucket('ID-IN-QUESTION')
AND identifier = 'ID-IN-QUESTION'
Replace the map_identifier_to_bucket function in the example with any scalar expression that
maps an identifier to an integer. For example, the expression could be a simple hash or modulus. The
function enforces a constant upper bound on the number of partitions that can ever be projected on the
specified dimension. When paired with a file format that supports predicate pushdown such as Apache
Parquet or ORC, the bucket technique provides good performance.
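As one possible mapping, assuming the Presto crc32 and to_utf8 functions that Athena provides, the query could compute the bucket with a hash and a modulus. The same expression must also be used when the data is written so that each identifier lands in the bucket that the query computes.
SELECT
  col1,
  col2,
  identifier
FROM
  table
WHERE
  -- hypothetical mapping expression; must match how bucket_id was assigned at write time
  bucket_id = crc32(to_utf8('ID-IN-QUESTION')) % 1000000 + 1
  AND identifier = 'ID-IN-QUESTION'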
For information on writing your own user-defined function like the scalar bucketing function in the
preceding example, see Querying with User Defined Functions (Preview) (p. 216).
Amazon Kinesis Data Firehose Example
When Kinesis Data Firehose delivers data to Amazon S3, it writes objects under an hourly prefix pattern like the following:
s3://bucket/folder/yyyy/MM/dd/HH/file.extension
Normally, to use Athena to query Kinesis Data Firehose data without using partition projection, you
create a table for Kinesis Data Firehose logs in Athena. Then you must add partitions to your table in the
AWS Glue Data Catalog every hour when Kinesis Data Firehose creates a partition.
By using partition projection, you can use a one-time configuration to inform Athena where the
partitions reside. The following CREATE TABLE example assumes a start date of 2018-01-01 at
midnight. Note the use of NOW for the upper boundary of the date range, which allows new data to
automatically become queryable at the appropriate UTC time.
Kinesis Data Firehose adds the partition prefix after table-name for you. In the Kinesis console, table-
name appears in the Custom Prefix field.
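A minimal sketch of such a table definition follows. The bucket name, table name, and data column are placeholders; the datehour partition column and its yyyy/MM/dd/HH format match the query shown next.
CREATE EXTERNAL TABLE my_table (
  message string          -- placeholder data column
)
PARTITIONED BY (datehour string)
LOCATION "s3://bucket/table-name/"
TBLPROPERTIES
(
  "projection.enabled" = "true",
  "projection.datehour.type" = "date",
  "projection.datehour.range" = "2018/01/01/00,NOW",
  "projection.datehour.format" = "yyyy/MM/dd/HH",
  "projection.datehour.interval" = "1",
  "projection.datehour.interval.unit" = "HOURS",
  "storage.location.template" = "s3://bucket/table-name/${datehour}"
)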
With this table you can run queries like the following, without having to manually add partitions:
SELECT *
FROM my_table
WHERE datehour >= '2018/02/03/00'
AND datehour < '2018/02/03/04'
This section provides guidance for running Athena queries on common data sources and data types
using a variety of SQL statements. General guidance is provided for working with common structures
and operators—for example, working with arrays, concatenating, filtering, flattening, and sorting. Other
examples include queries for data in tables with nested structures and maps, tables based on JSON-
encoded datasets, and datasets associated with AWS services such as AWS CloudTrail logs and Amazon
EMR logs. Comprehensive coverage of standard SQL usage is beyond the scope of this documentation.
Topics
• Working with Query Results, Output Files, and Query History (p. 122)
• Working with Views (p. 131)
• Creating a Table from Query Results (CTAS) (p. 136)
• Handling Schema Updates (p. 154)
• Querying Arrays (p. 162)
• Querying Geospatial Data (p. 179)
• Using Athena to Query Apache Hudi Datasets (p. 204)
• Querying JSON (p. 208)
• Using Machine Learning (ML) with Amazon Athena (Preview) (p. 214)
• Querying with User Defined Functions (Preview) (p. 216)
• Querying AWS Service Logs (p. 224)
• Querying AWS Glue Data Catalog (p. 249)
• Querying Web Server Logs Stored in Amazon S3 (p. 253)
For considerations and limitations, see Considerations and Limitations for SQL Queries in Amazon
Athena (p. 469).
Output files are saved automatically for every query that runs regardless of whether the query itself
was saved or not. To access and view query output files, IAM principals (users and roles) need permission
to the Amazon S3 GetObject action for the query result location, as well as permission for the Athena
GetQueryResults action. The query result location can be encrypted. If the location is encrypted, users
must have the appropriate key permissions to encrypt and decrypt the query result location.
Important
IAM principals with permission to the Amazon S3 GetObject action for the query result
location are able to retrieve query results from Amazon S3 even if permission to the Athena
GetQueryResults action is denied.
Getting a Query ID
Each query that runs is known as a query execution. The query execution has a unique identifier known as
the query ID or query execution ID. To work with query result files, and to quickly find query result files,
you need the query ID. We refer to the query ID in this topic as QueryID.
To use the Athena console to get the QueryID of a query that ran
1. Choose History.
2. From the list of queries, choose the query status under State (for example, Succeeded). The query ID shows in a pointer tip.
3. To copy the ID to the clipboard, choose the icon next to Query ID.
Query results files
    QueryID.csv, QueryID.txt
    DML query results files are saved in comma-separated values (CSV) format.
Query output files are stored in sub-folders in the following path pattern unless the query occurs in a
workgroup whose configuration overrides client-side settings. When workgroup configuration overrides
client-side settings, the query uses the results path specified by the workgroup.
QueryResultsLocationInS3/[QueryName|Unsaved/yyyy/mm/dd/]
Files associated with a CREATE TABLE AS SELECT query are stored in a tables sub-folder of the
above pattern.
To identify the query output location and query result files using the AWS CLI
• Use the aws athena get-query-execution command as shown in the following example.
Replace abc1234d-5efg-67hi-jklm-89n0op12qr34 with the query ID.
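For example, the command looks like the following; the query ID shown is the placeholder value from this step.
aws athena get-query-execution --query-execution-id abc1234d-5efg-67hi-jklm-89n0op12qr34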
The command returns output similar to the following. For descriptions of each output parameter,
see get-query-execution in the AWS CLI Command Reference.
{
"QueryExecution": {
"Status": {
"SubmissionDateTime": 1565649050.175,
"State": "SUCCEEDED",
"CompletionDateTime": 1565649056.6229999
},
"Statistics": {
"DataScannedInBytes": 5944497,
"DataManifestLocation": "s3://aws-athena-query-results-123456789012-us-
west-1/MyInsertQuery/2019/08/12/abc1234d-5efg-67hi-jklm-89n0op12qr34-manifest.csv",
"EngineExecutionTimeInMillis": 5209
},
"ResultConfiguration": {
"EncryptionConfiguration": {
"EncryptionOption": "SSE_S3"
},
"OutputLocation": "s3://aws-athena-query-results-123456789012-us-west-1/
MyInsertQuery/2019/08/12/abc1234d-5efg-67hi-jklm-89n0op12qr34"
},
"QueryExecutionId": "abc1234d-5efg-67hi-jklm-89n0op12qr34",
"QueryExecutionContext": {},
"Query": "INSERT INTO mydb.elb_log_backup SELECT * FROM mydb.elb_logs LIMIT
100",
"StatementType": "DML",
"WorkGroup": "primary"
}
}
1. Enter your query in the query editor and then choose Run query.
When the query finishes running, the Results pane shows the query results.
2. To download the query results file, choose the file icon in the query results pane. Depending on your
browser and browser configuration, you may need to confirm the download.
1. Choose History.
2. Page through the list of queries until you find the query, and then, under Action for the query, choose Download results.
• If you run the query using the Athena console, the Query result location entered under Settings in
the navigation bar determines the client-side setting.
• If you run the query using the Athena API, the OutputLocation parameter of the
StartQueryExecution action determines the client-side setting.
• If you use the ODBC or JDBC drivers to run queries, the S3OutputLocation property specified in the
connection URL determines the client-side setting.
Important
When you run a query using the API or using the ODBC or JDBC driver, the console setting does
not apply.
Each workgroup configuration has an Override client-side settings option that can be enabled. When
this option is enabled, the workgroup settings take precedence over the applicable client-side settings
when an IAM principal associated with that workgroup runs the query.
Previously, if you ran a query without specifying a value for Query result location, and the query result
location setting was not overridden by a workgroup, Athena created a default location for you. The
default location was aws-athena-query-results-MyAcctID-MyRegion, where MyAcctID was the
AWS account ID of the IAM principal that ran the query, and MyRegion was the region where the query
ran (for example, us-west-1).
Now, before you can run an Athena query in a region in which your account hasn't used Athena
previously, you must specify a query result location, or use a workgroup that overrides the query result
location setting. While Athena no longer creates a default query results location for you, previously
created default aws-athena-query-results-MyAcctID-MyRegion locations remain valid and you
can continue to use them.
To specify a client-side query result location using the Athena console
The Amazon S3 location that you enter is used for subsequent queries. You can change this location
later if you want.
If you are a member of a workgroup that specifies a query result location and overrides client-side
settings, the option to change the query result location is unavailable, as the following image shows:
When using the AWS CLI, specify the query result location using the OutputLocation parameter of the
--configuration option when you run the aws athena create-work-group or aws athena update-work-
group command.
To specify the query result location for a workgroup using the Athena console
• If editing an existing workgroup, select it from the list, choose View details, and then choose Edit
Workgroup.
• If creating a new workgroup, choose Create workgroup.
3. For Query result location, choose the Select folder.
4. From the list of S3 locations, choose the blue arrow successively until the bucket and folder you
want to use appears in the top line. Choose Select.
5. Under Settings, do one of the following:
• Select Override client-side settings to save query files in the location that you specified above for
all queries that members of this workgroup run.
• Clear Override client-side settings to save query files in the location that you specified above only when workgroup members run queries using the Athena API, ODBC driver, or JDBC driver without specifying an output location in Amazon S3.
6. If editing a workgroup, choose Save. If creating a workgroup, choose Create workgroup.
• To see a query statement in the Query Editor, choose the text of the query in the Query column.
Longer query statements are abbreviated.
• To see a query ID, choose its State (Succeeded, Failed, or Cancelled). The query ID shows in a
pointer tip.
• To download the results of a successful query into a .csv file, choose Download results.
• To see the details for a query that failed, choose Error details for the query.
If you want to keep the query history longer than 45 days, you can retrieve the query history and save
it to a data store such as Amazon S3. To automate this process, you can use Athena and Amazon S3 API
actions and CLI commands. The following procedure summarizes these steps; a minimal CLI sketch follows the steps.
1. Use the Athena ListQueryExecutions API action or the list-query-executions CLI command to retrieve the query IDs.
2. Use the Athena GetQueryExecution API action or the get-query-execution CLI command to retrieve
information about each query based on its ID.
3. Use the Amazon S3 PutObject API action or the put-object CLI command to save the information in
Amazon S3.
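A minimal AWS CLI sketch of these steps follows. The bucket, key, and file names are placeholders, and the query ID is the example ID used earlier in this topic.
# 1. Retrieve the query IDs.
aws athena list-query-executions --max-results 50 > query-ids.json
# 2. Retrieve the details of one query execution.
aws athena get-query-execution --query-execution-id abc1234d-5efg-67hi-jklm-89n0op12qr34 > query-details.json
# 3. Save the details to Amazon S3.
aws s3api put-object --bucket my-query-history-bucket --key history/abc1234d-5efg-67hi-jklm-89n0op12qr34.json --body query-details.json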
You can create a view from a SELECT query and then reference this view in future queries. For more
information, see CREATE VIEW (p. 460).
Topics
• When to Use Views? (p. 131)
• Supported Actions for Views in Athena (p. 132)
• Considerations for Views (p. 132)
• Limitations for Views (p. 133)
• Working with Views in the Console (p. 133)
• Creating Views (p. 134)
• Examples of Views (p. 135)
• Updating Views (p. 136)
• Deleting Views (p. 136)
When to Use Views?
You may want to create views to:
• Query a subset of data. For example, you can create a view with a subset of columns from the original
table to simplify querying data.
• Combine multiple tables in one query. When you have multiple tables and want to combine them with
UNION ALL, you can create a view with that expression to simplify queries against the combined
tables.
• Hide the complexity of existing base queries and simplify queries run by users. Base queries often include
joins between tables, expressions in the column list, and other SQL syntax that make it difficult to
understand and debug them. You might create a view that hides the complexity and simplifies queries.
• Experiment with optimization techniques and create optimized queries. For example, if you find a
combination of WHERE conditions, JOIN order, or other expressions that demonstrate the best
performance, you can create a view with these clauses and expressions. Applications can then make
relatively simple queries against this view. If you later find a better way to optimize the original query,
when you recreate the view, all the applications immediately take advantage of the optimized base
query.
• Hide the underlying table and column names, and minimize maintenance problems if those names
change. In that case, you recreate the view using the new names. All queries that use the view rather
than the underlying tables keep running with no changes.
Supported Actions for Views in Athena
CREATE VIEW (p. 460)
    Creates a new view from a specified SELECT query. For more information, see Creating Views (p. 134). The optional OR REPLACE clause lets you update the existing view by replacing it.
DESCRIBE VIEW (p. 461)
    Shows the list of columns for the named view. This allows you to examine the attributes of a complex view.
DROP VIEW (p. 463)
    Deletes an existing view. The optional IF EXISTS clause suppresses the error if the view does not exist. For more information, see Deleting Views (p. 136).
SHOW CREATE VIEW (p. 466)
    Shows the SQL statement that creates the specified view.
SHOW VIEWS (p. 469)
    Lists the views in the specified database, or in the current database if you omit the database name. Use the optional LIKE clause with a regular expression to restrict the list of view names. You can also see the list of views in the left pane in the console.
• In Athena, you can preview and work with views created in the Athena Console, in the AWS Glue Data
Catalog, if you have migrated to using it, or with Presto running on the Amazon EMR cluster connected
to the same catalog. You cannot preview or add to Athena views that were created in other ways.
• If you are creating views through the AWS Glue Data Catalog, you must include the PartitionKeys
parameter and set its value to an empty list, as follows: "PartitionKeys":[]. Otherwise, your view
query will fail in Athena. The following example shows a view created from the Data Catalog with
"PartitionKeys":[]:
• If you have created Athena views in the Data Catalog, then Data Catalog treats views as tables. You can
use table level fine-grained access control in Data Catalog to restrict access (p. 275) to these views.
• Athena prevents you from running recursive views and displays an error message in such cases. A
recursive view is a view query that references itself.
• Athena displays an error message when it detects stale views. A stale view is reported when one of the
following occurs:
• The view references tables or databases that do not exist.
• A schema or metadata change is made in a referenced table.
• A referenced table is dropped and recreated with a different schema or configuration.
• You can create and run nested views as long as the query behind the nested view is valid and the
tables and databases exist.
• Locate all views in the left pane, where tables are listed. Athena runs a SHOW VIEWS (p. 469)
operation to present this list to you.
• Filter views.
• Preview a view, show its properties, edit it, or delete it.
A view shows up in the console only if you have already created it.
1. In the Athena console, choose Views, choose a view, then expand it.
The view displays, with the columns it contains, as shown in the following example:
2. In the list of views, choose a view, and open the context (right-click) menu. The actions menu icon (⋮)
is highlighted for the view that you chose, and the list of actions opens, as shown in the following
example:
3. Choose an option. For example, Show properties shows the view name, the name of the database in
which the table for the view is created in Athena, and the time stamp when it was created:
Creating Views
You can create a view from any SELECT query.
Before you create a view, choose a database and then choose a table. Run a SELECT query on a table and
then create a view from it.
View names cannot contain special characters, other than underscore (_). See Names for Tables,
Databases, and Columns (p. 96). Avoid using Reserved Keywords (p. 97) for naming views.
3. Run the view query, debug it if needed, and save it.
Alternatively, create a query in the Query Editor, and then use Create view from query.
If you run a view that is not valid, Athena displays an error message.
If you delete a table from which the view was created, when you attempt to run the view, Athena
displays an error message.
You can create a nested view, which is a view on top of an existing view. Athena prevents you from
running a recursive view that references itself.
Examples of Views
To show the syntax of the view query, use SHOW CREATE VIEW (p. 466).
Example 1
Consider the following two tables: a table employees with two columns, id and name, and a table
salaries, with two columns, id and salary.
In this example, we create a view named name_salary as a SELECT query that obtains a list of IDs
mapped to salaries from the tables employees and salaries:
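A sketch of that view definition, joining the two tables on the id column:
CREATE VIEW name_salary AS
SELECT
  employees.id,
  salaries.salary
FROM employees
JOIN salaries ON employees.id = salaries.id;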
Example 2
In the following example, we create a view named view1 that enables you to hide more complex query
syntax.
This view runs on top of two tables, table1 and table2, where each table is a different SELECT query.
The view selects columns from table1 and joins the results with table2. The join is based on column a
that is present in both tables.
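A sketch of such a view definition follows; x, y, and the data columns are hypothetical names, not taken from the original guide.
CREATE VIEW view1 AS
WITH
  table1 AS (
    SELECT a, data1
    FROM x),
  table2 AS (
    SELECT a, data2
    FROM y)
SELECT table1.a, table1.data1, table2.data2
FROM table1
JOIN table2 ON table1.a = table2.a;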
Updating Views
After you create a view, it appears in the Views list in the left pane.
To edit the view, choose it, choose the context (right-click) menu, and then choose Show/edit query. You
can also edit the view in the Query Editor. For more information, see CREATE VIEW (p. 460).
Deleting Views
To delete a view, choose it, choose the context (right-click) menu, and then choose Delete view. For more
information, see DROP VIEW (p. 463).
Creating a Table from Query Results (CTAS)
A CREATE TABLE AS SELECT (CTAS) query creates a new table in Athena from the results of a SELECT statement. You can use CTAS queries to:
• Create tables from query results in one step, without repeatedly querying raw data sets. This makes it
easier to work with raw data sets.
• Transform query results into other storage formats, such as Parquet and ORC. This improves
query performance and reduces query costs in Athena. For information, see Columnar Storage
Formats (p. 100).
• Create copies of existing tables that contain only the data you need.
Topics
• Considerations and Limitations for CTAS Queries (p. 136)
• Running CTAS Queries in the Console (p. 138)
• Bucketing vs Partitioning (p. 141)
• Examples of CTAS Queries (p. 142)
• Using CTAS and INSERT INTO for ETL and Data Analysis (p. 145)
• Using CTAS and INSERT INTO to Create a Table with More Than 100 Partitions (p. 151)
Considerations and Limitations for CTAS Queries
CTAS query syntax
    The CTAS query syntax differs from the syntax of CREATE [EXTERNAL] TABLE used for creating tables. See CREATE TABLE AS (p. 458).
    Note
    Table, database, or column names for CTAS queries should not contain quotes or backticks. To ensure this, check that your table, database, or column names do not represent reserved words (p. 97), and do not contain special characters (which require enclosing them in quotes or backticks). For more information, see Names for Tables, Databases, and Columns (p. 96).
CTAS queries vs views
    CTAS queries write new data to a specified location in Amazon S3, whereas views do not write any data.
Location of CTAS query results
    If your workgroup overrides the client-side setting (p. 366) for query results location, Athena creates your table in the location s3://<workgroup-query-results-location>/tables/<query-id>/. To see the query results location specified for the workgroup, view the workgroup's details (p. 370).
    If your workgroup does not override the query results location, you can use the syntax WITH (external_location ='s3://location/') in your CTAS query to specify where your CTAS query results are stored.
    Note
    The external_location property must specify a location that is empty. A CTAS query checks that the path location (prefix) in the bucket is empty and never overwrites the data if the location already has data in it. To use the same location again, delete the data in the key prefix location in the bucket.
    If you omit the external_location syntax and are not using the workgroup setting, Athena uses your client-side setting (p. 127) for the query results location and creates your table in the location s3://<client-query-results-location>/<Unsaved-or-query-name>/<year>/<month>/<date>/tables/<query-id>/.
Locating Orphaned Files
    If a CTAS or INSERT INTO statement fails, it is possible that orphaned data are left in the data location. Because Athena does not delete any data (even partial data) from your bucket, you might be able to read this partial data in subsequent queries. To locate orphaned files for inspection or deletion, you can use the data manifest file that Athena provides to track the list of files to be written. For more information, see Identifying Query Output Files (p. 124) and DataManifestLocation.
Formats for storing query results
    The results of CTAS queries are stored in Parquet by default if you don't specify a data storage format. You can store CTAS results in PARQUET, ORC, AVRO, JSON, and TEXTFILE. Multi-character delimiters are not supported for the CTAS TEXTFILE format. CTAS queries do not require specifying a SerDe to interpret format transformations. See Example: Writing Query Results to a Different Format (p. 143).
Compression formats
    GZIP compression is used for CTAS query results by default. For Parquet and ORC, you can also specify SNAPPY. See Example: Specifying Data Storage and Compression Formats (p. 143).
Partition and Bucket Limits
    You can partition and bucket the results data of a CTAS query. For more information, see Bucketing vs Partitioning (p. 141). Athena supports writing to 100 unique partition and bucket combinations. For example, if no buckets are defined in the destination table, you can specify a maximum of 100 partitions. If you specify five buckets, 20 partitions (each with five buckets) are allowed. If you exceed this count, an error occurs.
    Include partitioning and bucketing predicates at the end of the WITH clause that specifies properties of the destination table. For more information, see Example: Creating Bucketed and Partitioned Tables (p. 145) and Bucketing vs Partitioning (p. 141).
    For information about working around the 100-partition limitation, see Using CTAS and INSERT INTO to Create a Table with More Than 100 Partitions (p. 151).
Encryption
    You can encrypt CTAS query results in Amazon S3, similar to the way you encrypt other query results in Athena. For more information, see Encrypting Query Results Stored in Amazon S3 (p. 266).
Data types
    Column data types for a CTAS query are the same as specified for the original query.
1. Run the query, choose Create, and then choose Create table from query.
2. In the Create a new table on the results of a query form, complete the fields as follows:
f. Choose Next to review your query and revise it as needed. For query syntax, see CREATE TABLE
AS (p. 458). The preview window opens, as shown in the following example:
g. Choose Create.
3. Choose Run query.
Use the CREATE TABLE AS SELECT template to create a CTAS query from scratch.
1. In the Athena console, choose Create table, and then choose CREATE TABLE AS SELECT.
2. In the Query Editor, edit the query as needed, For query syntax, see CREATE TABLE AS (p. 458).
3. Choose Run query.
4. Optionally, choose Save as to save the query.
Bucketing vs Partitioning
You can specify partitioning and bucketing for storing data from CTAS query results in Amazon S3. For
information about CTAS queries, see CREATE TABLE AS SELECT (CTAS) (p. 136).
This section discusses partitioning and bucketing as they apply to CTAS queries only. For general
guidelines about using partitioning in CREATE TABLE queries, see Top Performance Tuning Tips for
Amazon Athena.
Use the following tips to decide whether to partition and/or to configure bucketing, and to select
columns in your CTAS queries by which to do so:
• Partitioning CTAS query results works well when the number of partitions you plan to have is limited.
When you run a CTAS query, Athena writes the results to a specified location in Amazon S3. If you
specify partitions, it creates them and stores each partition in a separate partition folder in the same
location. The maximum number of partitions you can configure with CTAS query results in one query is
100. However, you can work around this limitation. For more information, see Using CTAS and INSERT
INTO to Create a Table with More Than 100 Partitions (p. 151).
Having partitions in Amazon S3 helps with Athena query performance, because this helps you run
targeted queries for only specific partitions. Athena then scans only those partitions, saving you query
costs and query time. For information about partitioning syntax, search for partitioned_by in
CREATE TABLE AS (p. 458).
Partition data by those columns that have similar characteristics, such as records from the same
department, and that can have a limited number of possible values, such as a limited number of
distinct departments in an organization. This characteristic is known as data cardinality. For example,
if you partition by the column department, and this column has a limited number of distinct values,
partitioning by department works well and decreases query latency.
• Bucketing CTAS query results works well when you bucket data by the column that has high cardinality
and evenly distributed values.
For example, columns storing timestamp data could potentially have a very large number of distinct
values, and their data is evenly distributed across the data set. This means that a column storing
timestamp type data will most likely have values and won't have nulls. This also means that data
from such a column can be put in many buckets, where each bucket will have roughly the same
amount of data stored in Amazon S3.
To choose the column by which to bucket the CTAS query results, use the column that has a high
number of values (high cardinality) and whose data can be split for storage into many buckets that
will have roughly the same amount of data. Columns that are sparsely populated with values are not
good candidates for bucketing. This is because you will end up with buckets that have less data and
other buckets that have a lot of data. By comparison, columns that you predict will almost always have
values, such as timestamp type values, are good candidates for bucketing. This is because their data
has high cardinality and can be stored in roughly equal chunks.
For more information about bucketing syntax, search for bucketed_by in CREATE TABLE
AS (p. 458).
To conclude, you can partition and use bucketing for storing results of the same CTAS query. These
techniques for writing data do not exclude each other. Typically, the columns you use for bucketing differ
from those you use for partitioning.
For example, if your dataset has columns department, sales_quarter, and ts (for storing
timestamp type data), you can partition your CTAS query results by department and sales_quarter.
These columns have relatively low cardinality of values: a limited number of departments and sales
quarters. Also, for partitions, it does not matter if some records in your dataset have null or no values
assigned for these columns. What matters is that data with the same characteristics, such as data from
the same department, will be in one partition that you can query in Athena.
At the same time, because all of your data has timestamp type values stored in a ts column, you can
configure bucketing for the same query results by the column ts. This column has high cardinality. You
can store its data in more than one bucket in Amazon S3. Consider an opposite scenario: if you don't
create buckets for timestamp type data and run a query for particular date or time values, then you
would have to scan a very large amount of data stored in a single location in Amazon S3. Instead, if you
configure buckets for storing your date- and time-related results, you can only scan and query buckets
that have your value and avoid long-running queries that scan a large amount of data.
Examples of CTAS Queries
The following example creates a table by copying all columns from a table:
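A sketch of such a query, using hypothetical table names:
CREATE TABLE new_table AS
SELECT *
FROM old_table;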
In the following variation of the same example, your SELECT statement also includes a WHERE clause. In
this case, the query selects only those rows from the table that satisfy the WHERE clause:
The following example creates a new query that runs on a set of columns from another table:
This variation of the same example creates a new table from specific columns from multiple tables:
The following example uses WITH NO DATA to create a new table that is empty and has the same
schema as the original table:
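A sketch of such a query, using hypothetical table names:
CREATE TABLE new_table
AS SELECT *
FROM old_table
WITH NO DATA;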
The following example uses a CTAS query to create a new table with Parquet data from a source table in
a different format. You can specify PARQUET, ORC, AVRO, JSON, and TEXTFILE in a similar way.
This example also specifies compression as SNAPPY. If omitted, GZIP is used. GZIP and SNAPPY are the
supported compression formats for CTAS query results stored in Parquet and ORC.
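A sketch of such a query, using hypothetical table names:
CREATE TABLE new_table
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY')
AS SELECT *
FROM old_table;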
The following example is similar, but it stores the CTAS query results in ORC and uses the
orc_compression parameter to specify the compression format. If you omit the compression format,
Athena uses GZIP by default.
The following CTAS query selects all records from old_table, which could be stored in CSV or another
format, and creates a new table with underlying data saved to Amazon S3 in ORC format:
The following examples create tables that are not partitioned. The table data is stored in different
formats. Some of these examples specify the external location.
The following example creates a CTAS query that stores the results as a text file:
In the following example, results are stored in Parquet, and the default results location is used:
In the following query, the table is stored in JSON, and specific columns are selected from the original
table's results:
The following examples show CREATE TABLE AS SELECT queries for partitioned tables in different
storage formats, using partitioned_by, and other properties in the WITH clause. For syntax, see CTAS
Table Properties (p. 458). For more information about choosing the columns for partitioning, see
Bucketing vs Partitioning (p. 141).
Note
List partition columns at the end of the list of columns in the SELECT statement. You
can partition by more than one column, and have up to 100 unique partition and bucket
combinations. For example, you can have 100 partitions if no buckets are specified.
CREATE TABLE ctas_csv_partitioned    -- table name and opening lines reconstructed; format assumed from the location
WITH (
     format = 'TEXTFILE',
     external_location = 's3://my_athena_results/ctas_csv_partitioned/',
     partitioned_by = ARRAY['key1'])
AS SELECT name1, address1, comment1, key1
FROM tables1;
The following example shows a CREATE TABLE AS SELECT query that uses both partitioning and
bucketing for storing query results in Amazon S3. The table results are partitioned and bucketed by
different columns. Athena supports a maximum of 100 unique bucket and partition combinations. For
example, if you create a table with five buckets, 20 partitions with five buckets each are supported. For
syntax, see CTAS Table Properties (p. 458).
For information about choosing the columns for bucketing, see Bucketing vs Partitioning (p. 141).
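A sketch of such a query follows, using the department, sales_quarter, and ts columns discussed in Bucketing vs Partitioning (p. 141); the table names and bucket count are placeholders.
CREATE TABLE ctas_partitioned_and_bucketed
WITH (
      format = 'PARQUET',
      external_location = 's3://my_athena_results/ctas_partitioned_and_bucketed/',
      partitioned_by = ARRAY['department', 'sales_quarter'],
      bucketed_by = ARRAY['ts'],
      bucket_count = 5)
AS SELECT ts, department, sales_quarter
FROM source_table;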
CTAS statements use standard SELECT (p. 437) queries to create new tables. You can use a CTAS
statement to create a subset of your data for analysis. In one CTAS statement, you can partition the data,
specify compression, and convert the data into a columnar format like Apache Parquet or Apache ORC.
When you run the CTAS query, the tables and partitions that it creates are automatically added to the
AWS Glue Data Catalog. This makes the new tables and partitions that it creates immediately available
for subsequent queries.
INSERT INTO statements insert new rows into a destination table based on a SELECT query statement
that runs on a source table. You can use INSERT INTO statements to transform and load source table
data in CSV format into destination table data using all transforms that CTAS supports.
Overview
In Athena, use a CTAS statement to perform an initial batch conversion of the data. Then use multiple
INSERT INTO statements to make incremental updates to the table created by the CTAS statement.
Steps
Location: s3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/
Total objects: 41727
Size of CSV dataset: 11.3 GB
Region: us-east-1
The original data is stored in Amazon S3 with no partitions. The data is in CSV format in files like the
following.
The file sizes in this sample are relatively small. By merging them into larger files, you can reduce
the total number of files, enabling better query performance. You can use CTAS and INSERT INTO
statements to enhance query performance.
1. In the Athena console, choose the US East (N. Virginia) AWS Region. Be sure to run all queries in this
tutorial in us-east-1.
2. In the Athena query editor, run the CREATE DATABASE (p. 453) command to create a database.
's3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/'
The table you created in Step 1 has a date field with the date formatted as YYYYMMDD (for example,
20100104). Because the new table will be partitioned on year, the sample statement in the following
procedure uses the Presto function substr("date",1,4) to extract the year value from the date
field.
To convert the data to Parquet format with Snappy compression, partitioning by year
• Run the following CTAS statement, replacing your-bucket with your Amazon S3 bucket location.
Note
In this example, the table that you create includes only the data from 2015 to 2019. In Step
3, you add new data to this table using the INSERT INTO command.
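A sketch of that CTAS statement follows. The source table name and the data columns other than id and date are assumptions, not taken from the original guide.
CREATE TABLE new_parquet
WITH (
      format = 'PARQUET',
      parquet_compression = 'SNAPPY',
      partitioned_by = ARRAY['year'],
      external_location = 's3://your-bucket/optimized-data/')
AS SELECT
      id,
      "date",
      element,        -- assumed data column
      datavalue,      -- assumed data column
      substr("date", 1, 4) AS year
FROM original_csv     -- assumed name of the table created earlier
WHERE cast(substr("date", 1, 4) AS bigint)
      BETWEEN 2015 AND 2019;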
When the query completes, use the following procedure to verify the output in the Amazon S3 location
that you specified in the CTAS statement.
To see the partitions and parquet files created by the CTAS statement
1. To show the partitions created, run the following AWS CLI command. Be sure to include the final
forward slash (/).
aws s3 ls s3://your-bucket/optimized-data/
PRE year=2015/
PRE year=2016/
PRE year=2017/
PRE year=2018/
PRE year=2019/
2. To see the Parquet files, run the following command. Note that the | head -5 option, which restricts
the output to the first five results, is not available on Windows.
To add data to the table using one or more INSERT INTO statements
1. Run the following INSERT INTO command, specifying the years before 2015 in the WHERE clause.
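A sketch of that statement, consistent with the CTAS sketch shown earlier (the source table and data column names remain assumptions):
INSERT INTO new_parquet
SELECT
      id,
      "date",
      element,
      datavalue,
      substr("date", 1, 4) AS year
FROM original_csv
WHERE cast(substr("date", 1, 4) AS bigint) < 2015;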
aws s3 ls s3://your-bucket/optimized-data/
PRE year=2010/
PRE year=2011/
PRE year=2012/
PRE year=2013/
PRE year=2014/
PRE year=2015/
PRE year=2016/
PRE year=2017/
PRE year=2018/
PRE year=2019/
3. To see the reduction in the size of the dataset obtained by using compression and columnar storage
in Parquet format, run the following command.
The following results show that the size of the dataset after conversion to Parquet with Snappy compression is 1.2 GB.
...
2020-01-22 18:12:02 2.8 MiB optimized-data/
year=2019/20200122_181132_00003_nja5r_f0182e6c-38f4-4245-afa2-9f5bfa8d6d8f
2020-01-22 18:11:59 3.7 MiB optimized-data/
year=2019/20200122_181132_00003_nja5r_fd9906b7-06cf-4055-a05b-f050e139946e
Total Objects: 300
Total Size: 1.2 GiB
4. If more CSV data is added to the original table, you can add that data to the Parquet table by using INSERT INTO statements. For example, if you had new data for the year 2020, you could run the following INSERT INTO statement. The statement adds the data and the relevant partition to the new_parquet table.
Note
The INSERT INTO statement supports writing a maximum of 100 partitions to the
destination table. However, to add more than 100 partitions, you can run multiple INSERT
INTO statements. For more information, see Using CTAS and INSERT INTO to Create a Table
with More Than 100 Partitions (p. 151).
1. Run the following query on the original table. The query finds the number of distinct IDs for every
value of the year.
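Assuming the same source table name as in the earlier sketches, the query might look like the following:
SELECT substr("date", 1, 4) AS year,
       COUNT(DISTINCT id)
FROM original_csv
GROUP BY 1 ORDER BY 1 DESC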
2. Note the time that the query ran and the amount of data scanned.
3. Run the same query on the new table, noting the query runtime and amount of data scanned.
SELECT year,
COUNT(DISTINCT id)
FROM new_parquet
GROUP BY 1 ORDER BY 1 DESC
4. Compare the results and calculate the performance and cost difference. The following sample
results show that the test query on the new table was faster and cheaper than the query on the old
table.
5. Run the following sample query on the original table. The query calculates the average maximum
temperature (Celsius), average minimum temperature (Celsius), and average rainfall (mm) for the
Earth in 2018.
6. Note the time that the query ran and the amount of data scanned.
7. Run the same query on the new table, noting the query runtime and amount of data scanned.
8. Compare the results and calculate the performance and cost difference. The following sample
results show that the test query on the new table was faster and cheaper than the query on the old
table.
Summary
This topic showed you how to perform ETL operations using CTAS and INSERT INTO statements in
Athena. You performed the first set of transformations using a CTAS statement that converted data to
the Parquet format with Snappy compression. The CTAS statement also converted the dataset from non-
partitioned to partitioned. This reduced its size and lowered the costs of running the queries. When new
data becomes available, you can use an INSERT INTO statement to transform and load the data into the
table that you created with the CTAS statement.
The example in this topic uses a database called tpch100 whose data resides in the Amazon S3 bucket
location s3://<my-tpch-bucket>/.
To use CTAS and INSERT INTO to create a table of more than 100 partitions
1. Use a CREATE EXTERNAL TABLE statement to create a table partitioned on the field that you want.
The following example statement partitions the data by the column l_shipdate. The table has
2525 partitions.
2. Run a SHOW PARTITIONS <table_name> command like the following to list the partitions.
/*
l_shipdate=1992-01-02
l_shipdate=1992-01-03
l_shipdate=1992-01-04
l_shipdate=1992-01-05
l_shipdate=1992-01-06
...
l_shipdate=1998-11-24
l_shipdate=1998-11-25
l_shipdate=1998-11-26
l_shipdate=1998-11-27
l_shipdate=1998-11-28
l_shipdate=1998-11-29
l_shipdate=1998-11-30
l_shipdate=1998-12-01
*/
3. The following example creates a table called my_lineitem_parq_partitioned and uses the WHERE clause to restrict the DATE to earlier than 1992-02-01. Because the sample dataset starts with January 1992, only partitions for January 1992 are created.
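A sketch of that CTAS statement follows; the source table name and the selected data columns are assumptions, not taken from the original guide.
CREATE TABLE my_lineitem_parq_partitioned
WITH (
      format = 'PARQUET',
      partitioned_by = ARRAY['l_shipdate'])
AS SELECT
      l_orderkey,
      l_quantity,
      l_extendedprice,
      l_shipdate              -- partition column listed last
FROM tpch100.lineitem         -- assumed source table
WHERE cast(l_shipdate AS date) < DATE '1992-02-01';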
4. Run the SHOW PARTITIONS command to verify that the table contains the partitions that you want.
/*
l_shipdate=1992-01-02
l_shipdate=1992-01-03
l_shipdate=1992-01-04
l_shipdate=1992-01-05
l_shipdate=1992-01-06
l_shipdate=1992-01-07
l_shipdate=1992-01-08
l_shipdate=1992-01-09
l_shipdate=1992-01-10
l_shipdate=1992-01-11
l_shipdate=1992-01-12
l_shipdate=1992-01-13
l_shipdate=1992-01-14
l_shipdate=1992-01-15
l_shipdate=1992-01-16
l_shipdate=1992-01-17
l_shipdate=1992-01-18
l_shipdate=1992-01-19
l_shipdate=1992-01-20
l_shipdate=1992-01-21
l_shipdate=1992-01-22
l_shipdate=1992-01-23
l_shipdate=1992-01-24
l_shipdate=1992-01-25
l_shipdate=1992-01-26
l_shipdate=1992-01-27
l_shipdate=1992-01-28
l_shipdate=1992-01-29
l_shipdate=1992-01-30
l_shipdate=1992-01-31
*/
5. The following example adds partitions for the dates from the month of February 1992.
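A sketch of that statement, reusing the assumed source table and columns from the sketch in step 3:
INSERT INTO my_lineitem_parq_partitioned
SELECT
      l_orderkey,
      l_quantity,
      l_extendedprice,
      l_shipdate
FROM tpch100.lineitem         -- assumed source table
WHERE cast(l_shipdate AS date) >= DATE '1992-02-01'
  AND cast(l_shipdate AS date) < DATE '1992-03-01';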
6. Run the SHOW PARTITIONS command again. The sample table now has partitions from both January and February 1992.
/*
l_shipdate=1992-01-02
l_shipdate=1992-01-03
l_shipdate=1992-01-04
l_shipdate=1992-01-05
l_shipdate=1992-01-06
...
l_shipdate=1992-02-20
l_shipdate=1992-02-21
l_shipdate=1992-02-22
l_shipdate=1992-02-23
l_shipdate=1992-02-24
l_shipdate=1992-02-25
l_shipdate=1992-02-26
l_shipdate=1992-02-27
l_shipdate=1992-02-28
l_shipdate=1992-02-29
*/
7. Continue using INSERT INTO statements that add no more than 100 partitions each. Continue until
you reach the number of partitions that you require.
Important
When setting the WHERE condition, be sure that the queries don't overlap. Otherwise, some
partitions might have duplicated data.
If you anticipate changes in table schemas, consider creating them in a data format that is suitable for
your needs. Your goals are to reuse existing Athena queries against evolving schemas, and avoid schema
mismatch errors when querying tables with partitions.
To achieve these goals, choose a table's data format based on the table in the following topic.
Topics
• Summary: Updates and Data Formats in Athena (p. 154)
• Index Access in ORC and Parquet (p. 155)
• Types of Updates (p. 157)
• Updates in Tables with Partitions (p. 161)
In this table, observe that Parquet and ORC are columnar formats with different default column access
methods. By default, Parquet will access columns by name and ORC by index (ordinal value). Therefore,
Athena provides a SerDe property defined when creating a table to toggle the default column access
method which enables greater flexibility with schema evolution.
For Parquet, the parquet.column.index.access property may be set to true, which sets the column
access method to use the column’s ordinal number. Setting this property to false will change the
column access method to use column name. Similarly, for ORC use the orc.column.index.access
property to control the column access method. For more information, see Index Access in ORC and
Parquet (p. 155).
CSV and TSV allow you to do all schema manipulations except reordering of columns, or adding columns
at the beginning of the table. For example, if your schema evolution requires only renaming columns but
not removing them, you can choose to create your tables in CSV or TSV. If you require removing columns,
do not use CSV or TSV, and instead use any of the other supported formats, preferably, a columnar
format, such as Parquet or ORC.
Since these are defaults, specifying these SerDe properties in your CREATE TABLE queries is optional; they are used implicitly. When used, they allow you to run some schema update operations while
preventing other such operations. To enable those operations, run another CREATE TABLE query and
change the SerDe settings.
Note
The SerDe properties are not automatically propagated to each partition. Use ALTER TABLE
ADD PARTITION statements to set the SerDe properties for each partition. To automate this
process, write a script that runs ALTER TABLE ADD PARTITION statements.
WITH SERDEPROPERTIES (
'orc.column.index.access'='true')
Reading by index allows you to rename columns. But then you lose the ability to remove columns or add
them in the middle of the table.
To make ORC read by name, which will allow you to add columns in the middle of the table or remove
columns in ORC, set the SerDe property orc.column.index.access to false in the CREATE TABLE
statement. In this configuration, you will lose the ability to rename columns.
Note
When orc.column.index.access is set to false, Athena becomes case sensitive. This can
prevent Athena from reading data if you are using Spark, which requires lower case, and have
column names that use uppercase. The workaround is to rename the columns to lower case.
The following example illustrates how to change the ORC to make it read by name:
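A minimal sketch of such a statement; the table name, columns, and location are hypothetical.
CREATE EXTERNAL TABLE orders_orc_read_by_name (
  o_orderkey int,
  o_custkey int,
  o_totalprice double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'orc.column.index.access'='false')
STORED AS ORC
LOCATION 's3://bucket/orders_orc/';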
WITH SERDEPROPERTIES (
'parquet.column.index.access'='false')
Reading by name allows you to add columns in the middle of the table and remove columns. But then
you lose the ability to rename columns.
To make Parquet read by index, which will allow you to rename columns, you must create a table with
parquet.column.index.access SerDe property set to true.
Types of Updates
Here are the types of updates that a table’s schema can have. We review each type of schema update
and specify which data formats allow you to have them in Athena.
Depending on how you expect your schemas to evolve, to continue using Athena queries, choose a
compatible data format.
Let’s consider an application that reads orders information from an orders table that exists in two
formats: CSV and Parquet.
In the following sections, we review how updates to these tables affect Athena queries.
To add columns at the beginning or in the middle of the table, and continue running queries against
existing tables, use AVRO, JSON, and Parquet and ORC if their SerDe property is set to read by name. For
information, see Index Access in ORC and Parquet (p. 155).
Do not add columns at the beginning or in the middle of the table in CSV and TSV, as these formats
depend on ordering. Adding a column in such cases will lead to schema mismatch errors when the
schema of partitions changes.
The following example shows adding a column to a JSON table in the middle of the table:
The following example adds a comment column at the end of the orders_parquet table before any
partition columns:
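A sketch of that statement:
ALTER TABLE orders_parquet ADD COLUMNS (`comment` string);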
Note
To see a new table column in the Athena Query Editor after you run ALTER TABLE ADD
COLUMNS, manually refresh the table list in the editor, and then expand the table again.
Removing Columns
You may need to remove columns from tables if they no longer contain data, or to restrict access to the
data in them.
• You can remove columns from tables in JSON, Avro, and in Parquet and ORC if they are read by name.
For information, see Index Access in ORC and Parquet (p. 155).
• We do not recommend removing columns from tables in CSV and TSV if you want to retain the tables
you have already created in Athena. Removing a column breaks the schema and requires that you
recreate the table without the removed column.
In this example, remove a column `totalprice` from a table in Parquet and run a query. In Athena,
Parquet is read by name by default, which is why we omit the SERDEPROPERTIES configuration that
specifies reading by name. Notice that the following query succeeds, even though you changed the
schema:
Renaming Columns
You may want to rename columns in your tables to correct spelling, make column names more
descriptive, or to reuse an existing column to avoid column reordering.
You can rename columns if you store your data in CSV and TSV, or in Parquet and ORC that are
configured to read by index. For information, see Index Access in ORC and Parquet (p. 155).
Athena reads data in CSV and TSV in the order of the columns in the schema and returns them in the
same order. It does not use column names for mapping data to a column, which is why you can rename
columns in CSV or TSV without breaking Athena queries.
One strategy for renaming columns is to create a new table based on the same underlying data,
but using new column names. The following example creates a new orders_parquet table called
orders_parquet_column_renamed. The example changes the column `o_totalprice` name to
`o_total_price` and then runs a query in Athena:
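A sketch of such a statement follows. The columns other than the renamed one, and the location, are assumptions; the table points at the same underlying data as orders_parquet.
CREATE EXTERNAL TABLE orders_parquet_column_renamed (
  o_orderkey int,
  o_custkey int,
  o_total_price double,     -- renamed; the underlying Parquet files still use o_totalprice
  o_orderdate string
)
STORED AS PARQUET
LOCATION 's3://bucket/orders_parquet/';   -- hypothetical location of the existing orders_parquet data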
In the Parquet table case, the following query runs, but the renamed column does not show data
because the column was being accessed by name (a default in Parquet) rather than by index:
SELECT *
FROM orders_parquet_column_renamed;
In the CSV table case, the following query runs and the data displays in all columns, including the one
that was renamed:
SELECT *
FROM orders_csv_column_renamed;
Reordering Columns
You can reorder columns only for tables with data in formats that read by name, such as JSON or
Parquet, which reads by name by default. You can also make ORC read by name, if needed. For
information, see Index Access in ORC and Parquet (p. 155).
• Only certain data types can be converted to other data types. See the table in this section for data
types that can change.
• For data in Parquet and ORC, you cannot change a column's data type if the table is not partitioned.
For partitioned tables in Parquet and ORC, a partition's column type can be different from another
partition's column type, and Athena will CAST to the desired type, if possible. For information, see
Avoiding Schema Mismatch Errors for Tables with Partitions (p. 162).
Important
We strongly suggest that you test and verify your queries before performing data type
translations. If Athena cannot convert the data type from the original data type to the target
data type, the CREATE TABLE query may fail.
The following table lists data types that you can change:
Original data type    Target data type
INT                   BIGINT
FLOAT                 DOUBLE
In the following example of the orders_json table, change the data type for the column
`o_shippriority` to BIGINT:
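A minimal sketch of recreating the table with the changed column type (the other columns and the Amazon S3 location are assumptions for illustration):

CREATE EXTERNAL TABLE orders_json (
   `o_orderkey` int,
   `o_custkey` int,
   `o_shippriority` bigint
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://DOC-EXAMPLE-BUCKET/orders_json/'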
The following query runs successfully, similar to the original SELECT query, before the data type change:
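A hedged sketch of such a query (the column list is illustrative):

SELECT o_orderkey, o_custkey, o_shippriority
FROM orders_json
LIMIT 10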
Updates in Tables with Partitions
The schema of a table and the schemas of its partitions can differ in the following cases:
• If your table's schema changes, the schemas for partitions are not updated to remain in sync with the
table's schema.
• The AWS Glue Crawler allows you to discover data in partitions with different schemas. This means
that if you create a table in Athena with AWS Glue, after the crawler finishes processing, the schemas
for the table and its partitions may be different.
• Partitions that you add directly using an AWS API can have schemas that differ from the table's schema.
Athena processes tables with partitions successfully if they meet the following constraints. If these
constraints are not met, Athena issues a HIVE_PARTITION_SCHEMA_MISMATCH error.
For example, for CSV and TSV formats, you can rename columns, add new columns at the end of the
table, and change a column's data type if the types are compatible, but you cannot remove columns.
For other formats, you can add or remove columns, or change a column's data type to another if the
types are compatible. For information, see Summary: Updates and Data Formats in Athena (p. 154).
• For Parquet and ORC data storage types, Athena relies on the column names and uses them for its
column name-based schema verification. This eliminates HIVE_PARTITION_SCHEMA_MISMATCH
errors for tables with partitions in Parquet and ORC. (This is true for ORC if the SerDe property is set to
access the index by name: orc.column.index.access=FALSE. Parquet reads the index by name by
default).
• For CSV, JSON, and Avro, Athena uses an index-based schema verification. This means that if you
encounter a schema mismatch error, you should drop the partition that is causing a schema mismatch
and recreate it, so that Athena can query it without failing.
Athena compares the table's schema to the partition schemas. If you create a table in CSV, JSON,
or AVRO in Athena with AWS Glue Crawler, after the Crawler finishes processing, the schemas for
the table and its partitions may be different. If there is a mismatch between the table's schema and
the partition schemas, your queries fail in Athena due to a schema verification error similar to the following:
"The column 'col68' in table 'crawler_test.click_avro' is declared as type 'string', but partition
'partition_0=2017-01-17' declared column 'col68' as type 'double'."
A typical workaround for such errors is to drop the partition that is causing the error and recreate
it. For more information, see ALTER TABLE DROP PARTITION (p. 450) and ALTER TABLE ADD
PARTITION (p. 449).
Querying Arrays
Amazon Athena lets you create arrays, concatenate them, convert them to different data types, and then
filter, flatten, and sort them.
Topics
• Creating Arrays (p. 163)
• Concatenating Strings and Arrays (p. 164)
• Converting Array Data Types (p. 165)
• Finding Lengths (p. 166)
• Accessing Array Elements (p. 166)
• Flattening Nested Arrays (p. 167)
• Creating Arrays from Subqueries (p. 169)
• Filtering Arrays (p. 170)
• Sorting Arrays (p. 171)
Creating Arrays
To build an array literal in Athena, use the ARRAY keyword, followed by brackets [ ], and include the
array elements separated by commas.
Examples
This query creates one array with four elements.
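A minimal query that produces the result shown:

SELECT ARRAY [1,2,3,4] AS items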
It returns:
+-----------+
| items |
+-----------+
| [1,2,3,4] |
+-----------+
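To create an array of arrays, you can nest the ARRAY keyword; a minimal query that produces the result shown below:

SELECT ARRAY[ ARRAY[1,2], ARRAY[3,4] ] AS items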
It returns:
+--------------------+
| items |
+--------------------+
| [[1, 2], [3, 4]] |
+--------------------+
To create an array from selected columns of compatible types, use a query, as in this example:
WITH
dataset AS (
SELECT 1 AS x, 2 AS y, 3 AS z
)
SELECT ARRAY [x,y,z] AS items FROM dataset
+-----------+
| items |
+-----------+
| [1,2,3] |
+-----------+
In the following example, two arrays are selected and returned as a welcome message.
WITH
dataset AS (
SELECT
ARRAY ['hello', 'amazon', 'athena'] AS words,
ARRAY ['hi', 'alexa'] AS alexa
)
SELECT ARRAY[words, alexa] AS welcome_msg
FROM dataset
+----------------------------------------+
| welcome_msg |
+----------------------------------------+
| [[hello, amazon, athena], [hi, alexa]] |
+----------------------------------------+
To create an array of key-value pairs, use the MAP operator that takes an array of keys followed by an
array of values, as in this example:
SELECT ARRAY[
MAP(ARRAY['first', 'last', 'age'],ARRAY['Bob', 'Smith', '40']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Jane', 'Doe', '30']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Billy', 'Smith', '8'])
] AS people
+------------------------------------------------------------------------------------------------------+
| people                                                                                                 |
+------------------------------------------------------------------------------------------------------+
| [{last=Smith, first=Bob, age=40}, {last=Doe, first=Jane, age=30}, {last=Smith, first=Billy, age=8}]    |
+------------------------------------------------------------------------------------------------------+
Concatenating Strings and Arrays
To concatenate two strings, use the double pipe || operator. For example, concatenating the strings
'This is', ' a ', and 'test.' returns:
Concatenated_String
This is a test.
You can use the concat() function to achieve the same result.
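A minimal sketch of the equivalent concat() query (the string literals are assumed from the result shown):

SELECT concat('This is', ' a ', 'test.') AS Concatenated_String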
Concatenated_String
This is a test.
Concatenating Arrays
You can use the same techniques to concatenate arrays.
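A minimal sketch that produces the result shown; the || operator prepends the left array as the first element of the right array of arrays:

SELECT ARRAY[4,5] || ARRAY[ ARRAY[1,2], ARRAY[3,4] ] AS items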
items
[[4, 5], [1, 2], [3, 4]]
To combine multiple arrays into a single array, use the double pipe operator or the concat() function.
WITH
dataset AS (
SELECT
ARRAY ['Hello', 'Amazon', 'Athena'] AS words,
ARRAY ['Hi', 'Alexa'] AS alexa
)
SELECT concat(words, alexa) AS welcome_msg
FROM dataset
welcome_msg
[Hello, Amazon, Athena, Hi, Alexa]
For more information about concat() and other string functions, see String Functions and Operators in the
Presto documentation.
Converting Array Data Types
To convert data in arrays to supported data types, use the CAST operator. The following example converts
integer elements to VARCHAR:
SELECT
ARRAY [CAST(4 AS VARCHAR), CAST(5 AS VARCHAR)]
AS items
+-------+
| items |
+-------+
| [4,5] |
+-------+
Create two arrays with key-value pair elements, convert them to JSON, and concatenate, as in this
example:
SELECT
ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items
+--------------------------------------------------+
| items |
+--------------------------------------------------+
| [{"a1":1,"a2":2,"a3":3}, {"b1":4,"b2":5,"b3":6}] |
+--------------------------------------------------+
Finding Lengths
The cardinality function returns the length of an array, as in this example:
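A minimal example that produces the result shown:

SELECT cardinality(ARRAY[1,2,3,4]) AS item_count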
+------------+
| item_count |
+------------+
| 4 |
+------------+
Accessing Array Elements
To access array elements, use the [] operator, with 1 specifying the first element. The following example
returns the first element of the concatenated array:
WITH dataset AS (
SELECT
ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items )
SELECT items[1] AS item FROM dataset
+------------------------+
| item |
+------------------------+
| {"a1":1,"a2":2,"a3":3} |
+------------------------+
To access the elements of an array at a given position (known as the index position), use the
element_at() function and specify the array name and the index position:
• If the index is greater than 0, element_at() returns the element that you specify, counting from the
beginning to the end of the array. It behaves like the [] operator.
• If the index is less than 0, element_at() returns the element counting from the end to the beginning
of the array.
The following query creates an array words, and selects the first element hello from it as the
first_word, the second element amazon (counting from the end of the array) as the middle_word,
and the third element athena, as the last_word.
WITH dataset AS (
SELECT ARRAY ['hello', 'amazon', 'athena'] AS words
)
SELECT
element_at(words, 1) AS first_word,
element_at(words, -2) AS middle_word,
element_at(words, cardinality(words)) AS last_word
FROM dataset
+----------------------------------------+
| first_word | middle_word | last_word |
+----------------------------------------+
| hello | amazon | athena |
+----------------------------------------+
Flattening Nested Arrays
To flatten a nested array's elements into a single array of values, use the flatten function. The resulting
array contains all of the elements of the nested arrays.
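A minimal example (the input array is assumed from the result shown):

SELECT flatten(ARRAY[ ARRAY[1,2], ARRAY[3,4] ]) AS items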
+-----------+
| items |
+-----------+
| [1,2,3,4] |
+-----------+
To flatten an array into multiple rows, use CROSS JOIN in conjunction with the UNNEST operator, as in
this example:
WITH dataset AS (
SELECT
'engineering' as department,
ARRAY['Sharon', 'John', 'Bob', 'Sally'] as users
)
SELECT department, names FROM dataset
CROSS JOIN UNNEST(users) as t(names)
+-------------+--------+
| department  | names  |
+-------------+--------+
| engineering | Sharon |
| engineering | John   |
| engineering | Bob    |
| engineering | Sally  |
+-------------+--------+
To flatten an array of key-value pairs, transpose selected keys into columns, as in this example:
WITH
dataset AS (
SELECT
'engineering' as department,
ARRAY[
MAP(ARRAY['first', 'last', 'age'],ARRAY['Bob', 'Smith', '40']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Jane', 'Doe', '30']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Billy', 'Smith', '8'])
] AS people
)
SELECT names['first'] AS
first_name,
names['last'] AS last_name,
department FROM dataset
CROSS JOIN UNNEST(people) AS t(names)
+--------------------------------------+
| first_name | last_name | department |
+--------------------------------------+
| Bob | Smith | engineering |
| Jane | Doe | engineering |
| Billy | Smith | engineering |
+--------------------------------------+
From a list of employees, select the employee with the highest combined scores. UNNEST can be used
in the FROM clause without a preceding CROSS JOIN as it is the default join operator and therefore
implied.
WITH
dataset AS (
SELECT ARRAY[
CAST(ROW('Sally', 'engineering', ARRAY[1,2,3,4]) AS ROW(name VARCHAR, department
VARCHAR, scores ARRAY(INTEGER))),
CAST(ROW('John', 'finance', ARRAY[7,8,9]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER))),
CAST(ROW('Amy', 'devops', ARRAY[12,13,14,15]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER)))
] AS users
),
users AS (
SELECT person, score
FROM
dataset,
UNNEST(dataset.users) AS t(person),
UNNEST(person.scores) AS t(score)
)
SELECT person.name, person.department, SUM(score) AS total_score FROM users
GROUP BY (person.name, person.department)
ORDER BY (total_score) DESC
LIMIT 1
+---------------------------------+
| name | department | total_score |
+---------------------------------+
| Amy | devops | 54 |
+---------------------------------+
From a list of employees, select the employee with the highest individual score.
WITH
dataset AS (
SELECT ARRAY[
CAST(ROW('Sally', 'engineering', ARRAY[1,2,3,4]) AS ROW(name VARCHAR, department
VARCHAR, scores ARRAY(INTEGER))),
CAST(ROW('John', 'finance', ARRAY[7,8,9]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER))),
CAST(ROW('Amy', 'devops', ARRAY[12,13,14,15]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER)))
] AS users
),
users AS (
SELECT person, score
FROM
dataset,
UNNEST(dataset.users) AS t(person),
UNNEST(person.scores) AS t(score)
)
SELECT person.name, score FROM users
ORDER BY (score) DESC
LIMIT 1
+--------------+
| name | score |
+--------------+
| Amy | 15 |
+--------------+
Creating Arrays from Subqueries
To create an array from a collection of rows, use the array_agg function, as in this example:
WITH
dataset AS (
SELECT ARRAY[1,2,3,4,5] AS items
)
SELECT array_agg(i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
+-----------------+
| array_items |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
To create an array of unique values from a set of rows, use the distinct keyword.
WITH
dataset AS (
SELECT ARRAY [1,2,2,3,3,4,5] AS items
)
SELECT array_agg(distinct i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
This query returns the following result. Note that ordering is not guaranteed.
+-----------------+
| array_items |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
Filtering Arrays
Create an array from a collection of rows if they match the filter criteria.
WITH
dataset AS (
SELECT ARRAY[1,2,3,4,5] AS items
)
SELECT array_agg(i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
WHERE i > 3
+-------------+
| array_items |
+-------------+
| [4, 5] |
+-------------+
Filter an array based on whether one of its elements contains a specific value, such as 2, as in this
example:
WITH
dataset AS (
SELECT ARRAY
[
ARRAY[1,2,3,4],
ARRAY[5,6,7,8],
ARRAY[9,0]
] AS items
)
SELECT i AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
WHERE contains(i, 2) -- keeps only the sub-arrays that include the value 2
+--------------+
| array_items |
+--------------+
| [1, 2, 3, 4] |
+--------------+
The filter function creates an array from the items in the list_of_values for which
boolean_function is true. The filter function can be useful in cases in which you cannot use the
UNNEST function.
The following example creates an array from the values greater than zero in the array [1,0,5,-1].
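A minimal sketch using the filter function with a lambda expression:

SELECT filter(ARRAY[1,0,5,-1], x -> x > 0)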
Results
[1,5]
The following example creates an array that consists of the non-null values from the array [-1, NULL,
10, NULL].
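A minimal sketch:

SELECT filter(ARRAY[-1, NULL, 10, NULL], q -> q IS NOT NULL)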
Results
[-1,10]
Sorting Arrays
Create a sorted array of unique values from a set of rows.
WITH
dataset AS (
SELECT ARRAY[3,1,2,5,2,3,6,3,4,5] AS items
)
SELECT array_sort(array_agg(distinct i)) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
+--------------------+
| array_items |
+--------------------+
| [1, 2, 3, 4, 5, 6] |
+--------------------+
Using Aggregation Functions with Arrays
Note
ORDER BY is not supported for aggregation functions; for example, you cannot use it within
array_agg(x).
The following example creates a row for each nested array with UNNEST and then uses sum() to add the
values within each array:
WITH
dataset AS (
SELECT ARRAY
[
ARRAY[1,2,3,4],
ARRAY[5,6,7,8],
ARRAY[9,0]
] AS items
),
item AS (
SELECT i AS array_items
FROM dataset, UNNEST(items) AS t(i)
)
SELECT array_items, sum(val) AS total
FROM item, UNNEST(array_items) AS t(val)
GROUP BY array_items;
In the last SELECT statement, instead of using sum() and UNNEST, you can use reduce() to decrease
processing time and data transfer, as in the following example.
WITH
dataset AS (
SELECT ARRAY
[
ARRAY[1,2,3,4],
ARRAY[5,6,7,8],
ARRAY[9,0]
] AS items
),
item AS (
SELECT i AS array_items
FROM dataset, UNNEST(items) AS t(i)
)
SELECT array_items, reduce(array_items, 0 , (s, x) -> s + x, s -> s) AS total
FROM item;
Either query returns the following results. The order of returned results is not guaranteed.
+----------------------+
| array_items | total |
+----------------------+
| [1, 2, 3, 4] | 10 |
| [5, 6, 7, 8] | 26 |
| [9, 0] | 9 |
+----------------------+
Converting Arrays to Strings
To convert an array into a single string, use the array_join function, as in the following example:
WITH
dataset AS (
SELECT ARRAY ['hello', 'amazon', 'athena'] AS words
)
SELECT array_join(words, ' ') AS welcome_msg
FROM dataset
+---------------------+
| welcome_msg |
+---------------------+
| hello amazon athena |
+---------------------+
Using Arrays to Create Maps
Maps are key-value pairs. To create maps, pass the MAP operator two arrays: the first contains the key
names, and the second contains the values.
Examples
This example selects a user from a dataset. It uses the MAP operator and passes it two arrays. The first
array includes values for column names, such as "first", "last", and "age". The second array consists of
values for each of these columns, such as "Bob", "Smith", "35".
WITH dataset AS (
SELECT MAP(
ARRAY['first', 'last', 'age'],
ARRAY['Bob', 'Smith', '35']
) AS user
)
SELECT user FROM dataset
+---------------------------------+
| user |
+---------------------------------+
| {last=Smith, first=Bob, age=35} |
+---------------------------------+
You can retrieve Map values by selecting the field name followed by [key_name], as in this example:
WITH dataset AS (
  SELECT MAP(
    ARRAY['first', 'last', 'age'],
    ARRAY['Bob', 'Smith', '35'] -- values reconstructed from the preceding example
  ) AS user
)
SELECT user['first'] AS first_name FROM dataset
+------------+
| first_name |
+------------+
| Bob |
+------------+
Creating a ROW
Note
The examples in this section use ROW as a means to create sample data to work with. When
you query tables within Athena, you do not need to create ROW data types, as they are already
created from your data source. When you use CREATE TABLE, Athena defines a STRUCT in it,
populates it with data, and creates the ROW data type for you, for each row in the dataset. The
underlying ROW data type consists of named fields of any supported SQL data types.
WITH dataset AS (
SELECT
ROW('Bob', 38) AS users
)
SELECT * FROM dataset
+-------------------------+
| users |
+-------------------------+
| {field0=Bob, field1=38} |
+-------------------------+
WITH dataset AS (
SELECT
CAST(
ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)
) AS users
)
SELECT * FROM dataset
+--------------------+
| users |
+--------------------+
| {NAME=Bob, AGE=38} |
+--------------------+
Note
In the example above, you declare name as a VARCHAR because this is its type in Presto. If you
declare this STRUCT inside a CREATE TABLE statement, use String type because Hive defines
this data type as String.
The following example uses the dot notation to select a nested field from CloudTrail logs and casts it to BIGINT:
SELECT
CAST(useridentity.accountid AS bigint) as newid
FROM cloudtrail_logs
LIMIT 2;
+--------------+
| newid |
+--------------+
| 112233445566 |
+--------------+
| 998877665544 |
+--------------+
WITH dataset AS (
SELECT ARRAY[
CAST(ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Alice', 35) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Jane', 27) AS ROW(name VARCHAR, age INTEGER))
] AS users
)
SELECT * FROM dataset
+-----------------------------------------------------------------+
| users |
+-----------------------------------------------------------------+
| [{NAME=Bob, AGE=38}, {NAME=Alice, AGE=35}, {NAME=Jane, AGE=27}] |
+-----------------------------------------------------------------+
To define a dataset for an array of values that includes a nested BOOLEAN value, issue this query:
WITH dataset AS (
SELECT
CAST(
ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
) AS sites
)
SELECT * FROM dataset
+----------------------------------------------------------+
| sites |
+----------------------------------------------------------+
| {HOSTNAME=aws.amazon.com, FLAGGEDACTIVITY={ISNEW=true}} |
+----------------------------------------------------------+
Next, to filter and access the BOOLEAN value of that element, continue to use the dot . notation.
WITH dataset AS (
SELECT
CAST(
ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
) AS sites
)
SELECT sites.hostname, sites.flaggedactivity.isnew
FROM dataset
This query selects the nested fields and returns this result:
+------------------------+
| hostname | isnew |
+------------------------+
| aws.amazon.com | true |
+------------------------+
WITH dataset AS (
  SELECT ARRAY[
    CAST(
      -- sketch: a single site is shown; additional sites can be added as more CAST(ROW(...)) elements
      ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew BOOLEAN))
    )
  ] AS items
)
SELECT sites.hostname, sites.flaggedactivity.isnew
FROM dataset, UNNEST(items) t(sites)
WHERE sites.flaggedactivity.isnew = true
It returns:
+------------------------+
| hostname | isnew |
+------------------------+
| aws.amazon.com | true |
+------------------------+
The regular expression pattern needs to be contained within the string, and does not have to match it.
To match the entire string, enclose the pattern with ^ at the beginning of it, and $ at the end, such as
'^pattern$'.
Consider an array of sites containing their hostname, and a flaggedActivity element. This element
includes an ARRAY, containing several MAP elements, each listing different popular keywords and their
popularity count. Assume you want to find a particular keyword inside a MAP in this array.
To search this dataset for sites with a specific keyword, we use regexp_like instead of the similar SQL
LIKE operator, because searching for a large number of keywords is more efficient with regexp_like.
The query in this example uses the regexp_like function to search for terms 'politics|bigdata',
found in values within arrays:
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('news.cnn.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('netflix.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
)
] AS items
),
sites AS (
SELECT sites.hostname, sites.flaggedactivity
FROM dataset, UNNEST(items) t(sites)
)
-- list the hostnames whose terms match the search pattern
SELECT hostname
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
+----------------+
| hostname |
+----------------+
| aws.amazon.com |
+----------------+
| news.cnn.com |
+----------------+
The query in the following example adds up the total popularity scores for the sites matching your
search terms with the regexp_like function, and then orders them from highest to lowest.
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('news.cnn.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('netflix.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
)
] AS items
),
sites AS (
SELECT sites.hostname, sites.flaggedactivity
FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname, array_agg(flags['term']) AS terms, SUM(CAST(flags['count'] AS INTEGER)) AS
total
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
ORDER BY total DESC
+------------------------------------+
| hostname | terms | total |
+----------------+-------------------+
| news.cnn.com | politics | 241 |
+----------------+-------------------+
| aws.amazon.com | big data | 10 |
+----------------+-------------------+
Querying Geospatial Data
Geospatial identifiers, such as latitude and longitude, allow you to convert any mailing address into a set
of geographic coordinates.
Topics
• What is a Geospatial Query? (p. 179)
• Input Data Formats and Geometry Data Types (p. 180)
• Supported Geospatial Functions (p. 180)
• Examples: Geospatial Queries (p. 202)
What is a Geospatial Query?
Geospatial queries differ from non-spatial SQL queries in the following ways:
• Using the following specialized geometry data types: point, line, multiline, polygon, and
multipolygon.
• Expressing relationships between geometry data types, such as distance, equals, crosses,
touches, overlaps, disjoint, and others.
Using geospatial queries in Athena, you can run these and other similar operations.
For example, to obtain a point geometry data type from values of type double for the geographic
coordinates of Mount Rainier in Athena, use the ST_Point (longitude, latitude) geospatial
function, as in the following example.
ST_Point(-121.7602, 46.8527)
Input Data Formats and Geometry Data Types
Athena supports the following specialized geometry data types:
• point
• line
• polygon
• multiline
• multipolygon
The geospatial functions that are available in Athena depend on the engine version that you use. For a
list of function name changes and new functions in Athena engine version 2, see Geospatial Function
Name Changes and New Functions in Athena engine version 2 (p. 193). For information about Athena
engine versioning, see Athena Engine Versioning (p. 395).
Topics
• Geospatial Functions in Athena engine version 2 (p. 181)
• Geospatial Functions in Athena engine version 1 (p. 195)
Geospatial Functions in Athena engine version 2
The geospatial functions in Athena engine version 2 differ from those in engine version 1 in the following ways:
• The input and output types for some functions have changed. Most notably, the VARBINARY
type is no longer directly supported for input. For more information, see Changes to Geospatial
Functions (p. 405).
• The names of some geospatial functions have changed since Athena engine version 1. For more
information, see Geospatial Function Name Changes in Athena engine version 2 (p. 193).
• New functions have been added. For more information, see New Geospatial Functions in Athena
engine version 2 (p. 194).
Constructor Functions
Use constructor functions to obtain binary representations of point, line, or polygon geometry data
types. You can also use these functions to convert binary data to text, and obtain binary values for
geometry data that is expressed as Well-Known Text (WKT).
ST_AsBinary(geometry)
Returns a varbinary data type that contains the WKB representation of the specified geometry. Example:
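A minimal example (the point is illustrative):

SELECT ST_AsBinary(ST_Point(-158.54, 61.56))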
ST_AsText(geometry)
Converts each of the specified geometry data types (p. 180) to text. Returns a value in a varchar data
type, which is a WKT representation of the geometry data type. Example:
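A minimal example (the point is illustrative):

SELECT ST_AsText(ST_Point(-158.54, 61.56))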
ST_GeomAsLegacyBinary(geometry)
Returns an Athena engine version 1 varbinary from the specified geometry. Example:
ST_GeometryFromText(varchar)
Converts text in WKT format into a geometry data type. Returns a value in a geometry data type.
Example:
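A minimal example (the WKT value is illustrative):

SELECT ST_GeometryFromText('POINT (-158.54 61.56)')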
ST_GeomFromBinary(varbinary)
Returns a geometry type object from the specified WKB representation. Example:
ST_GeomFromLegacyBinary(varbinary)
Returns a geometry type object from an Athena engine version 1 varbinary type. Example:
ST_LineFromText(varchar)
Returns a value in the geometry data type (p. 180) line. Example:
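A minimal example (the WKT value is illustrative):

SELECT ST_LineFromText('LINESTRING (1 1, 2 2, 3 3)')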
ST_LineString(array(point))
Returns a LineString geometry type formed from an array of point geometry types. If there are
fewer than two non-empty points in the specified array, an empty LineString is returned. Throws
an exception if any element in the array is null, empty, or the same as the previous one. The returned
geometry may not be simple. Depending on the input specified, the returned geometry can self-intersect
or contain duplicate vertexes. Example:
ST_MultiPoint(array(point))
Returns a MultiPoint geometry object formed from the specified points. Returns null if the specified
array is empty. Throws an exception if any element in the array is null or empty. The returned geometry
may not be simple and can contain duplicate points if the specified array has duplicates. Example:
ST_Point(double, double)
Returns a geometry type point object. For the input data values to this function, use geometric values,
such as values in the Universal Transverse Mercator (UTM) Cartesian coordinate system, or geographic
map units (longitude and latitude) in decimal degrees. The longitude and latitude values use the World
Geodetic System, also known as WGS 1984, or EPSG:4326. WGS 1984 is the coordinate system used by
the Global Positioning System (GPS).
For example, in the following notation, the map coordinates are specified in longitude and latitude, and
the value .072284, which is the buffer distance, is specified in angular units as decimal degrees:
Syntax: ST_Point(longitude, latitude)
The following example uses the ST_AsText function to obtain the WKT representation of the point geometry:
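A minimal example (the coordinates are illustrative):

SELECT ST_AsText(ST_Point(-158.54831669, 61.56584546))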
ST_Polygon(varchar)
Using the sequence of the ordinates provided clockwise, left to right, returns a geometry data
type (p. 180) polygon. In Athena engine version 2, only polygons are accepted as inputs. Example:
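A minimal example (the WKT value is illustrative):

SELECT ST_Polygon('POLYGON ((1 1, 1 4, 4 4, 4 1, 1 1))')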
to_geometry(sphericalGeography)
Returns a geometry object from the specified spherical geography object. Example:
to_spherical_geography(geometry)
Returns a spherical geography object from the specified geometry. Use this function to convert a
geometry object to a spherical geography object on the sphere of the Earth’s radius. This function can
be used only on POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, and MULTIPOLYGON
geometries defined in 2D space or a GEOMETRYCOLLECTION of such geometries. For each point of the
specified geometry, the function verifies that point.x is within [-180.0, 180.0] and point.y is
within [-90.0, 90.0]. The function uses these points as longitude and latitude degrees to construct
the shape of the sphericalGeography result.
Example:
Geospatial Relationship Functions
Use relationship functions to express relationships between two geometry data types. These functions
return results of type boolean.
ST_Contains(geometry, geometry)
Returns TRUE if and only if the left geometry contains the right geometry. Examples:
SELECT ST_Contains('POLYGON((0 2,1 1,0 -1,0 2))', 'POLYGON((-1 3,2 1,0 -3,-1 3))')
ST_Crosses(geometry, geometry)
Returns TRUE if and only if the left geometry crosses the right geometry. Example:
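A minimal example using two line geometries that cross (the coordinates are illustrative):

SELECT ST_Crosses(ST_GeometryFromText('LINESTRING (1 1, 2 2)'), ST_GeometryFromText('LINESTRING (1 2, 2 1)'))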
ST_Disjoint(geometry, geometry)
Returns TRUE if and only if the intersection of the left geometry and the right geometry is empty.
Example:
ST_Equals(geometry, geometry)
Returns TRUE if and only if the left geometry equals the right geometry. Example:
ST_Intersects(geometry, geometry)
Returns TRUE if and only if the left geometry intersects the right geometry. Example:
ST_Overlaps(geometry, geometry)
Returns TRUE if and only if the left geometry overlaps the right geometry. Example:
ST_Relate(geometry, geometry, varchar)
Returns TRUE if and only if the left geometry has the specified dimensionally extended nine-intersection
model (DE-9IM) relationship with the right geometry. The third (varchar) input takes the relationship.
Example:
ST_Touches(geometry, geometry)
Returns TRUE if and only if the left geometry touches the right geometry.
Example:
ST_Within(geometry, geometry)
Returns TRUE if and only if the left geometry is within the right geometry.
Example:
Operation Functions
Use operation functions to perform operations on geometry data type values. For example, you can
obtain the boundaries of a single geometry data type; intersections between two geometry data types;
difference between left and right geometries, where each is of the same geometry data type; or an
exterior buffer or ring around a particular geometry data type.
geometry_union(array(geometry))
Returns a geometry that represents the point set union of the specified geometries. Example:
ST_Boundary(geometry)
Takes as an input one of the geometry data types and returns the boundary geometry data type.
Examples:
ST_Buffer(geometry, double)
Takes as an input one of the geometry data types, such as point, line, polygon, multiline, or
multipolygon, and a distance as type double). Returns the geometry data type buffered by the specified
distance (or radius). Example:
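A minimal example that buffers a point by a distance of 2 (the values are illustrative):

SELECT ST_Buffer(ST_Point(1, 2), 2.0)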
In the following example, the map coordinates are specified in longitude and latitude, and the value
.072284, which is the buffer distance, is specified in angular units as decimal degrees:
ST_Difference(geometry, geometry)
Returns a geometry of the difference between the left geometry and right geometry. Example:
ST_Envelope(geometry)
Takes as an input line, polygon, multiline, and multipolygon geometry data types. Does not
support point geometry data type. Returns the envelope as a geometry, where an envelope is a
rectangle around the specified geometry data type. Examples:
ST_EnvelopeAsPts(geometry)
Returns an array of two points that represent the lower left and upper right corners of a geometry's
bounding rectangular polygon. Returns null if the specified geometry is empty. Example:
ST_ExteriorRing(geometry)
Returns the geometry of the exterior ring of the input type polygon. In Athena engine version 2,
polygons are the only geometries accepted as inputs. Examples:
ST_Intersection(geometry, geometry)
Returns the geometry of the intersection of the left geometry and right geometry. Examples:
ST_SymDifference(geometry, geometry)
Returns the geometry of the geometrically symmetric difference between the left geometry and the
right geometry. Example:
ST_Union(geometry, geometry)
Returns a geometry data type that represents the point set union of the specified geometries. Example:
Accessor Functions
Accessor functions are useful to obtain values in types varchar, bigint, or double from different
geometry data types, where geometry is any of the geometry data types supported in Athena: point,
line, polygon, multiline, and multipolygon. For example, you can obtain an area of a polygon
geometry data type, maximum and minimum X and Y values for a specified geometry data type, obtain
the length of a line, or receive the number of points in a specified geometry data type.
geometry_invalid_reason(geometry)
Returns, in a varchar data type, the reason why the specified geometry is not valid or not simple. If the
specified geometry is neither valid nor simple, returns the reason why it is not valid. If the specified
geometry is valid and simple, returns null. Example:
great_circle_distance(latitude1, longitude1, latitude2, longitude2)
Returns, as a double, the great-circle distance between two points on Earth’s surface in kilometers.
Example:
line_locate_point(lineString, point)
Returns a double between 0 and 1 that represents the location of the closest point on the specified line
string to the specified point as a fraction of total 2d line length.
Returns null if the specified line string or point is empty or null. Example:
simplify_geometry(geometry, double)
Uses the Ramer-Douglas-Peucker algorithm to return a geometry data type that is a simplified version
of the specified geometry. Avoids creating derived geometries (in particular, polygons) that are invalid.
Example:
ST_Area(geometry)
Takes as an input a geometry data type and returns an area in type double. Example:
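A minimal example (the polygon is illustrative); the query returns 9.0 for this square:

SELECT ST_Area(ST_Polygon('POLYGON ((1 1, 1 4, 4 4, 4 1, 1 1))'))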
ST_Centroid(geometry)
Takes as an input a geometry data type (p. 180) polygon, and returns a point geometry data type
that is the center of the polygon's envelope. Examples:
ST_ConvexHull(geometry)
Returns a geometry data type that is the smallest convex geometry that encloses all geometries in the
specified input. Example:
ST_CoordDim(geometry)
Takes as input one of the supported geometry data types (p. 180), and returns the count of coordinate
components in the type tinyint. Example:
SELECT ST_CoordDim(ST_Point(1.5,2.5))
ST_Dimension(geometry)
Takes as an input one of the supported geometry data types (p. 180), and returns the spatial dimension
of a geometry in type tinyint. Example:
ST_Distance(geometry, geometry)
Returns, based on spatial ref, a double containing the two-dimensional minimum Cartesian distance
between two geometries in projected units. In Athena engine version 2, returns null if one of the inputs
is an empty geometry. Example:
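A minimal example (the points are illustrative); the planar distance between these points is 5.0:

SELECT ST_Distance(ST_Point(0.0, 0.0), ST_Point(3.0, 4.0))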
ST_Distance(sphericalGeography, sphericalGeography)
Returns, as a double, the great-circle distance between two spherical geography points in meters.
Example:
SELECT ST_Distance(to_spherical_geography(ST_Point(61.56,
-86.67)),to_spherical_geography(ST_Point(61.56, -86.68)))
ST_EndPoint(geometry)
Returns the last point of a line geometry data type in a point geometry data type. Example:
ST_Geometries(geometry)
Returns an array of geometries in the specified collection. If the specified geometry is not a multi-
geometry, returns a one-element array. If the specified geometry is empty, returns null.
ST_GeometryN(geometry, index)
Returns, as a geometry data type, the geometry element at a specified integer index. Indices start at 1. If
the specified geometry is a collection of geometries (for example, a GEOMETRYCOLLECTION or MULTI*
object), returns the geometry at the specified index. If the specified index is less than 1 or greater than
the total number of elements in the collection, returns null. To find the total number of elements, use
ST_NumGeometries (p. 190). Singular geometries (for example, POINT, LINESTRING, or POLYGON),
are treated as collections of one element. Empty geometries are treated as empty collections. Example:
ST_GeometryType(geometry)
Returns, as a varchar, the type of the geometry. Example:
ST_InteriorRingN(geometry, index)
Returns the interior ring element at the specified index (indices start at 1). If the given index is less
than 1 or greater than the total number of interior rings in the specified geometry, returns null.
Throws an error if the specified geometry is not a polygon. To find the total number of elements, use
ST_NumInteriorRing (p. 190). Example:
ST_InteriorRings(geometry)
Returns a geometry array of all interior rings found in the specified geometry, or an empty array if the
polygon has no interior rings. If the specified geometry is empty, returns null. If the specified geometry is
not a polygon, throws an error. Example:
ST_IsClosed(geometry)
Takes as an input only line and multiline geometry data types (p. 180). Returns TRUE (type
boolean) if and only if the line is closed. Example:
ST_IsEmpty(geometry)
Takes as an input only line and multiline geometry data types (p. 180). Returns TRUE (type
boolean) if and only if the specified geometry is empty, in other words, when the line start and end
values coincide. Example:
ST_IsRing(geometry)
Returns TRUE (type boolean) if and only if the line type is closed and simple. Example:
ST_IsSimple(geometry)
Returns true if the specified geometry has no anomalous geometric points (for example,
self intersection or self tangency). To determine why the geometry is not simple, use
geometry_invalid_reason() (p. 187). Example:
ST_IsValid(geometry)
Returns true if and only if the specified geometry is well formed. To determine why the geometry is not
well formed, use geometry_invalid_reason() (p. 187). Example:
ST_Length(geometry)
Returns the length of line in type double. Example:
ST_NumGeometries(geometry)
Returns, as an integer, the number of geometries in the collection. If the geometry is a collection
of geometries (for example, a GEOMETRYCOLLECTION or MULTI* object), returns the number
of geometries. Single geometries return 1; empty geometries return 0. An empty geometry in a
GEOMETRYCOLLECTION object counts as one geometry. For example, the following example evaluates to
1:
ST_NumGeometries(ST_GeometryFromText('GEOMETRYCOLLECTION(MULTIPOINT EMPTY)'))
ST_NumInteriorRing(geometry)
Returns the number of interior rings in the polygon geometry in type bigint. Example:
ST_NumPoints(geometry)
Returns the number of points in the specified geometry in type bigint. Example:
ST_PointN(lineString, index)
Returns, as a point geometry data type, the vertex of the specified line string at the specified integer
index. Indices start at 1. If the given index is less than 1 or greater than the total number of elements
in the collection, returns null. To find the total number of elements, use ST_NumPoints (p. 191).
Example:
ST_Points(geometry)
Returns an array of points from the specified line string geometry object. Example:
ST_StartPoint(geometry)
Returns the first point of a line geometry data type in a point geometry data type. Example:
ST_X(point)
Returns the X coordinate of the specified point in type double. Example:
ST_XMax(geometry)
Returns the maximum X coordinate of the specified geometry in type double. Example:
ST_XMin(geometry)
Returns the minimum X coordinate of the specified geometry in type double. Example:
ST_Y(point)
Returns the Y coordinate of the specified point in type double. Example:
ST_YMax(geometry)
Returns the maximum Y coordinate of the specified geometry in type double. Example:
ST_YMin(geometry)
Returns the minimum Y coordinate of the specified geometry in type double. Example:
Aggregation Functions
convex_hull_agg(geometry)
Returns the minimum convex geometry that encloses all geometries passed as input.
geometry_union_agg(geometry)
Returns a geometry that represents the point set union of all geometries passed as input.
Bing Tile Functions
bing_tile(x, y, zoom_level)
Returns a Bing tile object from integer coordinates x and y and the specified zoom level. The zoom level
must be an integer from 1 through 23. Example:
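A minimal example (the tile coordinates are illustrative):

SELECT bing_tile(10, 20, 12)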
bing_tile(quadKey)
Returns a Bing tile object from the specified quadkey. Example:
bing_tile_at(latitude, longitude, zoom_level)
Returns a Bing tile object at the specified latitude, longitude, and zoom level. The latitude must
be between -85.05112878 and 85.05112878. The longitude must be between -180 and 180. The
latitude and longitude values must be double and zoom_level an integer. Example:
bing_tiles_around(latitude, longitude, zoom_level)
Returns an array of Bing tiles that surround the specified latitude and longitude point at the specified
zoom level. Example:
bing_tile_coordinates(tile)
Returns the x and y coordinates of the specified Bing tile. Example:
bing_tile_polygon(tile)
Returns the polygon representation of the specified Bing tile. Example:
bing_tile_quadkey(tile)
Returns the quadkey of the specified Bing tile. Example:
bing_tile_zoom_level(tile)
Returns the zoom level of the specified Bing tile as an integer. Example:
geometry_to_bing_tiles(geometry, zoom_level)
Returns the minimum set of Bing tiles that fully covers the specified geometry at the specified zoom
level. Zoom levels from 1 to 23 are supported. Example:
Geospatial Function Name Changes and New Functions in Athena engine version
2
This section lists changes in geospatial function names and geospatial functions that are new in Athena
engine version 2. Currently, Athena engine version 2 is supported in the Asia Pacific (Mumbai), Asia
Pacific (Tokyo), Europe (Ireland), US East (N. Virginia), US East (Ohio), US West (N. California), and US
West (Oregon) Regions.
For information about other changes in Athena engine version 2, see Athena engine version 2 (p. 399).
For information about Athena engine versioning, see Athena Engine Versioning (p. 395).
New geospatial functions were added in the constructor, operation, accessor, and aggregation function
categories. They are described in Geospatial Functions in Athena engine version 2 (p. 181).
Geospatial Functions in Athena engine version 1
Constructor Functions
Use constructor functions to obtain binary representations of point, line, or polygon geometry data
types. You can also use these functions to convert binary data to text, and obtain binary values for
geometry data that is expressed as Well-Known Text (WKT).
ST_GEOMETRY_FROM_TEXT (varchar)
Converts text into a geometry data type. Returns a value in a varbinary data type, which is a binary
representation of the geometry data type. Example:
ST_GEOMETRY_TO_TEXT (varbinary)
Converts each of the specified geometry data types (p. 180) to text. Returns a value in a varchar data
type, which is a WKT representation of the geometry data type. Example:
ST_LINE(varchar)
Returns a value in the varbinary data type, which is a binary representation of the geometry data
type (p. 180) line. Example:
ST_POINT(double, double)
Returns a value in the varbinary data type, which is a binary representation of a point geometry data
type.
To obtain the point geometry data type, use the ST_POINT function in Athena. For the input data
values to this function, use geometric values, such as values in the Universal Transverse Mercator (UTM)
Cartesian coordinate system, or geographic map units (longitude and latitude) in decimal degrees. The
longitude and latitude values use the World Geodetic System, also known as WGS 1984, or EPSG:4326.
WGS 1984 is the coordinate system used by the Global Positioning System (GPS).
For example, in the following notation, the map coordinates are specified in longitude and latitude, and
the value .072284, which is the buffer distance, is specified in angular units as decimal degrees:
Syntax:
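A minimal example (the coordinates are illustrative):

SELECT ST_POINT(-74.006801, 40.705220)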
Example. This example uses specific longitude and latitude coordinates from earthquakes.csv:
00 00 00 00 01 01 00 00 00 48 e1 7a 14 ae c7 4e 40 e1 7a 14 ae 47 d1 63 c0
00 00 00 00 01 01 00 00 00 20 25 76 6d 6f 80 52 c0 18 3e 22 a6 44 5a 44 40
In the following example, we use the ST_GEOMETRY_TO_TEXT function to obtain the binary values from
WKT:
ST_POLYGON(varchar)
Using the sequence of the ordinates provided clockwise, left to right, returns a value in the varbinary
data type, which is a binary representation of the geometry data type (p. 180) polygon. Example:
Geospatial Relationship Functions
ST_CONTAINS (geometry, geometry)
Returns TRUE if and only if the left geometry contains the right geometry. Examples:
SELECT ST_CONTAINS('POLYGON((0 2,1 1,0 -1,0 2))', 'POLYGON((-1 3,2 1,0 -3,-1 3))')
ST_CROSSES (geometry, geometry)
Returns TRUE if and only if the left geometry crosses the right geometry. Example:
Operation Functions
Use operation functions to perform operations on geometry data type values. For example, you can
obtain the boundaries of a single geometry data type; intersections between two geometry data types;
difference between left and right geometries, where each is of the same geometry data type; or an
exterior buffer or ring around a particular geometry data type.
In Athena engine version 1, all operation functions take one of the geometry data types as an input and
return a binary representation as a varbinary data type.
ST_BOUNDARY (geometry)
Takes as an input one of the geometry data types, and returns a binary representation of the boundary
geometry data type.
Examples:
In the following example, the map coordinates are specified in longitude and latitude, and the value
.072284, which is the buffer distance, is specified in angular units as decimal degrees:
ST_ENVELOPE (geometry)
Takes as an input line, polygon, multiline, and multipolygon geometry data types. Does not
support point geometry data type. Returns a binary representation of an envelope, where an envelope
is a rectangle around the specified geometry data type. Examples:
ST_EXTERIOR_RING (geometry)
Returns a binary representation of the exterior ring of the input type polygon. Examples:
Accessor Functions
Accessor functions are useful to obtain values in types varchar, bigint, or double from different
geometry data types, where geometry is any of the geometry data types supported in Athena: point,
line, polygon, multiline, and multipolygon. For example, you can obtain an area of a polygon
geometry data type, maximum and minimum X and Y values for a specified geometry data type, obtain
the length of a line, or receive the number of points in a specified geometry data type.
ST_AREA (geometry)
Takes as an input a geometry data type polygon and returns an area in type double. Example:
ST_CENTROID (geometry)
Takes as an input a geometry data type (p. 180) polygon, and returns a point that is the center of the
polygon's envelope in type varchar. Examples:
ST_COORDINATE_DIMENSION (geometry)
Takes as input one of the supported geometry data types (p. 180), and returns the count of coordinate
components in type bigint. Example:
SELECT ST_COORDINATE_DIMENSION(ST_POINT(1.5,2.5))
ST_DIMENSION (geometry)
Takes as an input one of the supported geometry data types (p. 180), and returns the spatial dimension
of a geometry in type bigint. Example:
ST_END_POINT (geometry)
Returns the last point of a line geometry data type in type point. Example:
ST_INTERIOR_RING_NUMBER (geometry)
Returns the number of interior rings in the polygon geometry in type bigint. Example:
ST_IS_CLOSED (geometry)
Takes as an input only line and multiline geometry data types (p. 180). Returns TRUE (type
boolean) if and only if the line is closed. Example:
ST_IS_EMPTY (geometry)
Takes as an input only line and multiline geometry data types (p. 180). Returns TRUE (type
boolean) if and only if the specified geometry is empty, in other words, when the line start and end
values coincide. Example:
ST_IS_RING (geometry)
Returns TRUE (type boolean) if and only if the line type is closed and simple. Example:
ST_LENGTH (geometry)
Returns the length of line in type double. Example:
ST_MAX_X (geometry)
Returns the maximum X coordinate of a geometry in type double. Example:
ST_MAX_Y (geometry)
Returns the maximum Y coordinate of a geometry in type double. Example:
ST_MIN_X (geometry)
Returns the minimum X coordinate of a geometry in type double. Example:
ST_MIN_Y (geometry)
Returns the minimum Y coordinate of a geometry in type double. Example:
ST_POINT_NUMBER (geometry)
Returns the number of points in the geometry in type bigint. Example:
ST_START_POINT (geometry)
Returns the first point of a line geometry data type in type point. Example:
ST_X (point)
Returns the X coordinate of a point in type double. Example:
ST_Y (point)
Returns the Y coordinate of a point in type double. Example:
Examples: Geospatial Queries
The examples in this section create two tables based on the following sample data and query them:
• earthquakes.csv – Lists earthquakes that occurred in California. The example earthquakes table
uses fields from this data.
• california-counties.json – Lists county data for the state of California in ESRI-compliant
GeoJSON format. The data includes many fields such as AREA, PERIMETER, STATE, COUNTY, and NAME,
but the example counties table uses only two: Name (string), and BoundaryShape (binary).
Note
Athena uses the com.esri.json.hadoop.EnclosedJsonInputFormat to convert the
JSON data to geospatial binary format.
The following code example uses the CROSS JOIN function for the two tables created earlier.
Additionally, for both tables, it uses ST_CONTAINS and asks for counties whose boundaries include a
geographical location of the earthquakes, specified with ST_POINT. It then groups such counties by
name, orders them by count, and returns them in descending order.
SELECT counties.name,
COUNT(*) cnt
FROM counties
CROSS JOIN earthquakes
WHERE ST_CONTAINS (counties.boundaryshape, ST_POINT(earthquakes.longitude,
earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt DESC
+------------------------+
| name | cnt |
+------------------------+
| Kern | 36 |
+------------------------+
| San Bernardino | 35 |
+------------------------+
| Imperial | 28 |
+------------------------+
| Inyo | 20 |
+------------------------+
| Los Angeles | 18 |
+------------------------+
| Riverside | 14 |
+------------------------+
| Monterey | 14 |
+------------------------+
| Santa Clara | 12 |
+------------------------+
| San Benito | 11 |
+------------------------+
| Fresno | 11 |
+------------------------+
| San Diego | 7 |
+------------------------+
| Santa Cruz | 5 |
+------------------------+
| Ventura | 3 |
+------------------------+
| San Luis Obispo | 3 |
+------------------------+
| Orange | 2 |
+------------------------+
| San Mateo | 1 |
+------------------------+
Additional Resources
For additional examples of geospatial queries, see the geospatial analysis posts on the AWS Big Data Blog.
Querying Hudi Datasets
Apache Hudi is an open-source data management framework that simplifies incremental data processing.
Hudi handles data insertion and update events without creating many small files that can cause
performance issues for analytics. Apache Hudi automatically tracks changes and merges files so that they
remain optimally sized. This avoids the need to build custom solutions that monitor and re-write many
small files into fewer large files.
Hudi is well suited for the following use cases:
• Complying with privacy regulations like General Data Protection Regulation (GDPR) and California
Consumer Privacy Act (CCPA) that enforce people's right to remove personal information or change
how their data is used.
• Working with streaming data from sensors and other Internet of Things (IoT) devices that require
specific data insertion and update events.
• Implementing a change data capture (CDC) system
Data sets managed by Hudi are stored in S3 using open storage formats. Currently, Athena can read
compacted Hudi datasets but not write Hudi data. Athena uses Apache Hudi version 0.5.2-incubating,
subject to change. For more information about this Hudi version, see apache/hudi release-0.5.2 on
GitHub.com.
Hudi Dataset Storage Types
A Hudi dataset can use one of the following storage types:
• Copy on Write (CoW) – Data is stored in a columnar format (Parquet), and each update creates a new
version of files during a write.
• Merge on Read (MoR) – Data is stored using a combination of columnar (Parquet) and row-based
(Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create
new versions of the columnar files.
With CoW datasets, each time there is an update to a record, the file that contains the record is rewritten
with the updated values. With a MoR dataset, each time there is an update, Hudi writes only the row for
the changed record. MoR is better suited for write- or change-heavy workloads with fewer reads. CoW is
better suited for read-heavy workloads on data that change less frequently.
Hudi provides the following views of the data:
• Read-optimized view – Provides the latest committed dataset from CoW tables and the latest
compacted dataset from MoR tables.
• Incremental view – Provides a change stream between two actions out of a CoW dataset to feed
downstream jobs and extract, transform, load (ETL) workflows.
• Real-time view – Provides the latest committed data from a MoR table by merging the columnar and
row-based files inline.
Currently, Athena supports only the first of these: the read-optimized view. Queries on a read-optimized
view return all compacted data, which provides good performance but does not include the latest delta
commits. For more information about the tradeoffs between storage types, see Storage Types & Views in
the Apache Hudi documentation.
For more information about writing Hudi data, see the following resources:
• Working With a Hudi Dataset in the Amazon EMR Release Guide.
• Writing Hudi Tables in the Apache Hudi documentation.
Consider the following limitations when using Athena to query Hudi datasets:
• Using MSCK REPAIR TABLE on Hudi tables in Athena is not supported. If you need to load a Hudi table
not created in AWS Glue, use ALTER TABLE ADD PARTITION (p. 449).
• Querying Hudi tables that have been registered with AWS Lake Formation is not supported.
Creating Hudi Tables
If you have Hudi tables already created in AWS Glue, you can query them directly in Athena. When you
create Hudi tables in Athena, you must run ALTER TABLE ADD PARTITION to load the Hudi data before
you can query it.
The following ALTER TABLE ADD PARTITION example adds two partitions to the example
partition_cow table.
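A hedged sketch of the statement (the partition column, values, and Amazon S3 locations are assumptions for illustration):

ALTER TABLE partition_cow ADD
  PARTITION (event_type = 'one') LOCATION 's3://DOC-EXAMPLE-BUCKET/myhudidataset/one/'
  PARTITION (event_type = 'two') LOCATION 's3://DOC-EXAMPLE-BUCKET/myhudidataset/two/'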
The following ALTER TABLE ADD PARTITION example adds two partitions to the example
partition_mor table.
Querying JSON
Amazon Athena lets you parse JSON-encoded values, extract data from JSON, search for values, and find
length and size of JSON arrays.
Topics
• Best Practices for Reading JSON Data (p. 208)
• Extracting Data from JSON (p. 209)
• Searching for Values in JSON Arrays (p. 211)
• Obtaining Length and Size of JSON Arrays (p. 213)
• Troubleshooting JSON Queries (p. 214)
In Amazon Athena, you can create tables from external data and include the JSON-encoded data in
them. For such types of source data, use Athena together with JSON SerDe Libraries (p. 420).
• Convert fields in source data that have an undetermined schema to JSON-encoded strings in Athena.
When Athena creates tables backed by JSON data, it parses the data based on the existing and
predefined schema. However, not all of your data may have a predefined schema. To simplify schema
management in such cases, it is often useful to convert fields in source data that have an undetermined
schema to JSON strings in Athena, and then use JSON SerDe Libraries (p. 420).
For example, consider an IoT application that publishes events with common fields from different
sensors. One of those fields must store a custom payload that is unique to the sensor sending the event.
In this case, since you don't know the schema, we recommend that you store the information as a JSON-
encoded string. To do this, convert data in your Athena table to JSON, as in the following example. You
can also convert JSON-encoded data to Athena data types.
WITH dataset AS (
SELECT
CAST('HELLO ATHENA' AS JSON) AS hello_msg,
CAST(12345 AS JSON) AS some_int,
CAST(MAP(ARRAY['a', 'b'], ARRAY[1,2]) AS JSON) AS some_map
)
SELECT * FROM dataset
+-------------------------------------------+
| hello_msg | some_int | some_map |
+-------------------------------------------+
| "HELLO ATHENA" | 12345 | {"a":1,"b":2} |
+-------------------------------------------+
WITH dataset AS (
SELECT
CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
CAST(JSON '12345' AS INTEGER) AS some_int,
CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
)
SELECT * FROM dataset
+-------------------------------------+
| hello_msg | some_int | some_map |
+-------------------------------------+
| HELLO ATHENA | 12345 | {a:1,b:2} |
+-------------------------------------+
Extracting Data from JSON
"org": "engineering",
"projects":
[
{"name":"project1", "completed":false},
{"name":"project2", "completed":true}
]
}
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset
The returned value is a JSON-encoded string, and not a native Athena data type.
+---------------+-------------------------------------------------------------------------------+
| name          | projects                                                                      |
+---------------+-------------------------------------------------------------------------------+
| "Susan Smith" | [{"name":"project1","completed":false},{"name":"project2","completed":true}]  |
+---------------+-------------------------------------------------------------------------------+
To extract the scalar value from the JSON string, use the json_extract_scalar function. It is similar
to json_extract, but returns only scalar values (Boolean, number, or string).
Note
Do not use the json_extract_scalar function on arrays, maps, or structs.
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT
json_extract_scalar(blob, '$.name') AS name,
json_extract_scalar(blob, '$.projects') AS projects
FROM dataset
+---------------------------+
| name | projects |
+---------------------------+
| Susan Smith | |
+---------------------------+
To obtain the first element of the projects property in the example array, use the json_array_get
function and specify the index position.
WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_array_get(json_extract(blob, '$.projects'), 0) AS item
FROM dataset
It returns the value at the specified index position in the JSON-encoded array.
+---------------------------------------+
| item |
+---------------------------------------+
| {"name":"project1","completed":false} |
+---------------------------------------+
To return an Athena string type, use the [] operator inside a JSONPath expression, then use
the json_extract_scalar function. For more information about [], see Accessing Array
Elements (p. 166).
WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_extract_scalar(blob, '$.projects[0].name') AS project_name
FROM dataset
+--------------+
| project_name |
+--------------+
| project1 |
+--------------+
Searching for Values in JSON Arrays
The following query lists the names of the users who are participating in "project2".
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
(JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1",
"project2", "project3"]}'),
(JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
) AS t (users)
)
SELECT json_extract_scalar(users, '$.name') AS user
FROM dataset
WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')
+-------------+
| user |
+-------------+
| Susan Smith |
+-------------+
| Jane Smith |
+-------------+
The following query example lists the names of users who have completed projects along with the total number of completed projects.
Note
When using CAST to MAP you can specify the key element as VARCHAR (native String in Presto),
but leave the value as JSON, because the values in the MAP are of different types: String for the
first key-value pair, and Boolean for the second.
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith",
"org": "legal",
"projects": [{"name":"project1", "completed":false}]}'),
(JSON '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project2", "completed":true},
{"name":"project3", "completed":true}]}'),
(JSON '{"name": "Jane Smith",
"org": "finance",
"projects": [{"name":"project2", "completed":true}]}')
) AS t (users)
),
employees AS (
SELECT users, CAST(json_extract(users, '$.projects') AS
ARRAY(MAP(VARCHAR, JSON))) AS projects_array
FROM dataset
),
names AS (
SELECT json_extract_scalar(users, '$.name') AS name, projects
FROM employees, UNNEST (projects_array) AS t(projects)
)
SELECT name, count(projects) AS completed_projects
FROM names
WHERE cast(element_at(projects, 'completed') AS BOOLEAN) = true
GROUP BY name
+----------------------------------+
| name | completed_projects |
+----------------------------------+
| Susan Smith | 2 |
+----------------------------------+
| Jane Smith | 1 |
+----------------------------------+
Obtaining Length and Size of JSON Arrays
Example: json_array_length
To obtain the length of a JSON-encoded array, use the json_array_length function, as in the following example.
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name":
"Bob Smith",
"org":
"legal",
"projects": [{"name":"project1", "completed":false}]}'),
(JSON '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project2", "completed":true},
{"name":"project3", "completed":true}]}'),
(JSON '{"name": "Jane Smith",
"org": "finance",
"projects": [{"name":"project2", "completed":true}]}')
) AS t (users)
)
SELECT
json_extract_scalar(users, '$.name') as name,
json_array_length(json_extract(users, '$.projects')) as count
FROM dataset
ORDER BY count DESC
+---------------------+
| name | count |
+---------------------+
| Susan Smith | 2 |
+---------------------+
| Bob Smith | 1 |
+---------------------+
| Jane Smith | 1 |
+---------------------+
Example: json_size
To obtain the size of a JSON-encoded array or object, use the json_size function, and specify the
column containing the JSON string and the JSONPath expression to the array or object.
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith", "org": "legal", "projects": [{"name":"project1",
"completed":false}]}'),
(JSON '{"name": "Susan Smith", "org": "engineering", "projects": [{"name":"project2",
"completed":true},{"name":"project3", "completed":true}]}'),
(JSON '{"name": "Jane Smith", "org": "finance", "projects": [{"name":"project2",
"completed":true}]}')
) AS t (users)
)
SELECT
json_extract_scalar(users, '$.name') as name,
json_size(users, '$.projects') as count
FROM dataset
ORDER BY count DESC
+---------------------+
| name | count |
+---------------------+
| Susan Smith | 2 |
+---------------------+
| Bob Smith | 1 |
+---------------------+
| Jane Smith | 1 |
+---------------------+
See also Considerations and Limitations for SQL Queries in Amazon Athena (p. 469).
To use ML with Athena (Preview), you define an ML with Athena (Preview) function with the USING
FUNCTION clause. The function points to the SageMaker model endpoint that you want to use and
specifies the variable names and data types to pass to the model. Subsequent clauses in the query
reference the function to pass values to the model. The model runs inference based on the values that
the query passes and then returns inference results. For more information about SageMaker and how
SageMaker endpoints work, see the Amazon SageMaker Developer Guide.
Synopsis
The following example illustrates a USING FUNCTION clause that specifies an ML with Athena (Preview)
function.
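The general form, assembled from the parameter descriptions that follow, looks like the following; my_sagemaker_endpoint is a placeholder for the name of your SageMaker endpoint.
USING FUNCTION ML_function_name(variable1 data_type[, variable2 data_type][,...])
    RETURNS data_type TYPE
    SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = 'my_sagemaker_endpoint')
SELECT [...] ML_function_name(expression) [...]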
Parameters
USING FUNCTION ML_function_name(variable1 data_type[, variable2 data_type][,...])
ML_function_name defines the function name, which can be used in subsequent query clauses.
Each variable data_type specifies a named variable with its corresponding data type, which
the SageMaker model can accept as input. Specify data_type as one of the supported Athena data
types that the SageMaker model can accept as input.
RETURNS data_type TYPE
data_type specifies the SQL data type that ML_function_name returns to the query as output
from the SageMaker model.
SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint= 'my_sagemaker_endpoint')
my_sagemaker_endpoint specifies the name of the SageMaker model endpoint that the function invokes.
SELECT [...] ML_function_name(expression) [...]
The SELECT query that passes values to function variables and the SageMaker model to return
a result. ML_function_name specifies the function defined earlier in the query, followed by an
expression that is evaluated to pass values. Values that are passed and returned must match the
corresponding data types specified for the function in the USING FUNCTION clause.
Examples
The following example demonstrates a query using ML with Athena (Preview).
Example
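The sketch below illustrates the pattern; the function name predict_customer_registration, the endpoint my_sagemaker_endpoint, and the table and column names are placeholders rather than a published sample.
USING FUNCTION predict_customer_registration(age INTEGER)
    RETURNS DOUBLE TYPE
    SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = 'my_sagemaker_endpoint')
SELECT customer_id,
       predict_customer_registration(age) AS probability_of_registering
FROM customer_dataset
WHERE predict_customer_registration(age) < 0.5;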
To use a UDF in Athena, you write a USING FUNCTION clause before a SELECT statement in a SQL query.
The SELECT statement references the UDF and defines the variables that are passed to the UDF when
the query runs. The SQL query invokes a Lambda function using the Java runtime when it calls the UDF.
UDFs are defined within the Lambda function as methods in a Java deployment package. Multiple UDFs
can be defined in the same Java deployment package for a Lambda function. You also specify the name
of the Lambda function in the USING FUNCTION clause.
You have two options for deploying a Lambda function for Athena UDFs. You can deploy the function
directly using Lambda, or you can use the AWS Serverless Application Repository. To find existing
Lambda functions for UDFs, you can search the public AWS Serverless Application Repository or your
private repository and then deploy to Lambda. You can also create or modify Java source code, package
it into a JAR file, and deploy it using Lambda or the AWS Serverless Application Repository. We provide
example Java source code and packages to get you started. For more information about Lambda, see
AWS Lambda Developer Guide. For more information about AWS Serverless Application Repository, see
the AWS Serverless Application Repository Developer Guide.
• IAM permissions – To run a query in Athena that contains a UDF query statement and to create UDF
statements, the IAM principal running the query must be allowed to perform actions in addition to
Athena functions. For more information, see Example IAM Permissions Policies to Allow Amazon
Athena User Defined Functions (UDF) (p. 297).
• Lambda quotas – Lambda quotas apply to UDFs. For more information, see AWS Lambda Quotas in
the AWS Lambda Developer Guide.
• Known issues – For the most up-to-date list of known issues, see Limitations and Issues in the Athena
Federated Query
Synopsis
USING FUNCTION UDF_name(variable1 data_type[, variable2 data_type][,...])
    RETURNS data_type TYPE
    LAMBDA_INVOKE WITH (lambda_name = 'my_lambda_function')[, FUNCTION][, ...]
SELECT [...] UDF_name(expression) [...]
Parameters
USING FUNCTION UDF_name(variable1 data_type[, variable2 data_type][,...])
UDF_name specifies the name of the UDF, which must correspond to a Java method within the
referenced Lambda function. Each variable data_type specifies a named variable with its
corresponding data type, which the UDF can accept as input. Specify data_type as one of
the supported Athena data types listed in the following table. The data type must map to the
corresponding Java data type.
Athena data type     Java data type
TINYINT              java.lang.Byte
SMALLINT             java.lang.Short
REAL                 java.lang.Float
DOUBLE               java.lang.Double
DECIMAL              java.math.BigDecimal
BIGINT               java.lang.Long
INTEGER              java.lang.Integer
VARCHAR              java.lang.String
VARBINARY            byte[]
BOOLEAN              java.lang.Boolean
ARRAY                java.util.List
data_type specifies the SQL data type that the UDF returns as output. Athena data types listed in
the table above are supported.
LAMBDA_INVOKE WITH (lambda_name = 'my_lambda_function')
my_lambda_function specifies the name of the Lambda function to be invoked when running the
UDF.
SELECT [...] UDF_name(expression) [...]
The SELECT query that passes values to the UDF and returns a result. UDF_name specifies the UDF
to use, followed by an expression that is evaluated to pass values. Values that are passed and
returned must match the corresponding data types specified for the UDF in the USING FUNCTION
clause.
Examples
The following examples demonstrate queries using UDFs. The Athena query examples are based on the
AthenaUDFHandler.java code in GitHub.
The following example demonstrates using the compress UDF defined in a Lambda function named
MyAthenaUDFLambda.
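A sketch of such a query follows; the parameter name col1 and the string literal are illustrative.
USING FUNCTION compress(col1 VARCHAR)
    RETURNS VARCHAR TYPE
    LAMBDA_INVOKE WITH (lambda_name = 'MyAthenaUDFLambda')
SELECT compress('StringToBeCompressed');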
The following example demonstrates using the decompress UDF defined in the same Lambda function.
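A sketch of such a query follows; the argument shown is a placeholder for a Base64-encoded value previously produced by the compress UDF.
USING FUNCTION decompress(col1 VARCHAR)
    RETURNS VARCHAR TYPE
    LAMBDA_INVOKE WITH (lambda_name = 'MyAthenaUDFLambda')
SELECT decompress('<base64-encoded compressed string>');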
Creating and Deploying a UDF Using Lambda
The steps in this section demonstrate writing and building a custom UDF JAR file using Apache Maven
from the command line and then deploying it.
• Enter the following at the command line to clone the SDK repository. This repository includes the
SDK, examples and a suite of data source connectors. For more information about data source
connectors, see Using Amazon Athena Federated Query (p. 66).
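Assuming the repository's public GitHub location, the clone command looks like the following:
git clone https://github.com/awslabs/aws-athena-query-federation.git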
If you are working on a development machine that already has Apache Maven, the AWS CLI, and the AWS
Serverless Application Model build tool installed, you can skip this step.
1. From the root of the aws-athena-query-federation directory that you created when you
cloned, run the prepare_dev_env.sh script that prepares your development environment.
2. Update your shell to source new variables created by the installation process or restart your terminal
session.
source ~/.profile
Important
If you skip this step, you will get errors later about the AWS CLI or AWS SAM build tool not
being able to publish your Lambda function.
mvn -B archetype:generate \
-DarchetypeGroupId=org.apache.maven.archetypes \
-DgroupId=groupId \
-DartifactId=my-athena-udfs
<properties>
<aws-athena-federation-sdk.version>2019.48.1</aws-athena-federation-sdk.version>
</properties>
<dependencies>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-athena-federation-sdk</artifactId>
<version>${aws-athena-federation-sdk.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version>
<configuration>
<createDependencyReducedPom>false</createDependencyReducedPom>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
In the following example, two Java methods for UDFs, compress() and decompress(), are created
inside the class MyUserDefinedFunctions.
package com.mycompany.athena.udfs;

// Imports needed by the compress/decompress examples below. The UserDefinedFunctionHandler
// base class is provided by the Athena Query Federation SDK (the aws-athena-federation-sdk
// dependency declared in the pom.xml above).
import com.amazonaws.athena.connector.lambda.handlers.UserDefinedFunctionHandler;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;
public class MyUserDefinedFunctions
        extends UserDefinedFunctionHandler
{
// SOURCE_TYPE identifies this set of UDFs; the value shown here is illustrative.
private static final String SOURCE_TYPE = "MyCompanyUdfs";

public MyUserDefinedFunctions()
{
super(SOURCE_TYPE);
}
/**
* Compresses a valid UTF-8 String using the zlib compression library.
* Encodes bytes with Base64 encoding scheme.
*
* @param input the String to be compressed
* @return the compressed String
*/
public String compress(String input)
{
byte[] inputBytes = input.getBytes(StandardCharsets.UTF_8);
// create compressor
Deflater compressor = new Deflater();
compressor.setInput(inputBytes);
compressor.finish();
// Compress the input bytes into an in-memory output stream.
byte[] buffer = new byte[4096];
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
while (!compressor.finished()) {
int bytes = compressor.deflate(buffer);
byteArrayOutputStream.write(buffer, 0, bytes);
}
try {
byteArrayOutputStream.close();
}
catch (IOException e) {
throw new RuntimeException("Failed to close ByteArrayOutputStream", e);
}
// Base64-encode the compressed bytes so the result is a printable String.
return Base64.getEncoder().encodeToString(byteArrayOutputStream.toByteArray());
}
/**
* Decompresses a valid String that has been compressed using the zlib compression
library.
* Decodes bytes with Base64 decoding scheme.
*
* @param input the String to be decompressed
* @return the decompressed String
*/
public String decompress(String input)
{
byte[] inputBytes = Base64.getDecoder().decode((input));
// create decompressor
Inflater decompressor = new Inflater();
decompressor.setInput(inputBytes, 0, inputBytes.length);
// Decompress the input bytes into an in-memory output stream.
byte[] buffer = new byte[4096];
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
try {
while (!decompressor.finished()) {
int bytes = decompressor.inflate(buffer);
if (bytes == 0 && decompressor.needsInput()) {
throw new RuntimeException("Unexpected end of compressed input");
}
byteArrayOutputStream.write(buffer, 0, bytes);
}
}
catch (DataFormatException e) {
throw new RuntimeException("Failed to decompress string", e);
}
try {
byteArrayOutputStream.close();
}
catch (IOException e) {
throw new RuntimeException("Failed to close ByteArrayOutputStream", e);
}
// Decode the decompressed bytes back into a UTF-8 String.
return new String(byteArrayOutputStream.toByteArray(), StandardCharsets.UTF_8);
}
}
For more information and requirements, see Publishing Applications in the AWS Serverless Application
Repository Developer Guide, AWS SAM Template Concepts in the AWS Serverless Application Model
Developer Guide, and Publishing Serverless Applications Using the AWS SAM CLI.
The following example demonstrates parameters in a YAML file. Add similar parameters to your YAML
file and save it in your project directory. See athena-udf.yaml in GitHub for a full example.
Transform: 'AWS::Serverless-2016-10-31'
Metadata:
'AWS::ServerlessRepo::Application':
Name: MyApplicationName
Description: 'The description I write for my application'
Author: 'Author Name'
Labels:
- athena-federation
SemanticVersion: 1.0.0
Parameters:
LambdaFunctionName:
Description: 'The name of the Lambda function that will contain your UDFs.'
Type: String
LambdaTimeout:
Description: 'Maximum Lambda invocation runtime in seconds. (min 1 - 900 max)'
Default: 900
Type: Number
LambdaMemory:
Description: 'Lambda memory in MB (min 128 - 3008 max).'
Default: 3008
Type: Number
Resources:
ConnectorConfig:
Type: 'AWS::Serverless::Function'
Properties:
FunctionName: !Ref LambdaFunctionName
Handler: "full.path.to.your.handler. For example,
com.amazonaws.athena.connectors.udfs.MyUDFHandler"
CodeUri: "Relative path to your JAR file. For example, ./target/athena-udfs-1.0.jar"
Description: "My description of the UDFs that this Lambda function enables."
Runtime: java8
Timeout: !Ref LambdaTimeout
MemorySize: !Ref LambdaMemory
Copy the publish.sh script to the project directory where you saved your YAML file, and run the
following command:
For example, if your bucket location is s3://mybucket/mysarapps/athenaudf and your YAML file
was saved as my-athena-udfs.yaml:
You can now use the method names defined in your Lambda function JAR file as UDFs in Athena.
Querying AWS Service Logs
The tasks in this section use the Athena console, but you can also use other tools that connect via JDBC.
For more information, see Using Athena with the JDBC Driver (p. 83), the AWS CLI, or the Amazon Athena
API Reference.
The topics in this section assume that you have set up both an IAM user with appropriate permissions to
access Athena and the Amazon S3 bucket where the data to query should reside. For more information,
see Setting Up (p. 6) and Getting Started (p. 8).
Topics
• Querying Application Load Balancer Logs (p. 224)
• Querying Classic Load Balancer Logs (p. 226)
• Querying Amazon CloudFront Logs (p. 227)
• Querying AWS CloudTrail Logs (p. 229)
• Querying Amazon EMR Logs (p. 236)
• Querying AWS Global Accelerator Flow Logs (p. 239)
• Querying Amazon GuardDuty Findings (p. 241)
• Querying Network Load Balancer Logs (p. 242)
• Querying Amazon VPC Flow Logs (p. 244)
• Querying AWS WAF Logs (p. 246)
Querying Application Load Balancer Logs
Topics
• Prerequisites (p. 224)
• Creating the Table for ALB Logs (p. 224)
• Example Queries for ALB Logs (p. 225)
Prerequisites
• Enable access logging so that Application Load Balancer logs can be saved to your Amazon S3 bucket.
Modify the LOCATION clause in the CREATE TABLE statement to specify your Application Load Balancer log
location. For information about each field, see Access Log Entries in the User Guide for Application
Load Balancers.
2. Run the query in the Athena console. After the query completes, Athena registers the alb_logs
table, making the data in it ready for you to issue queries.
SELECT COUNT(request_verb) AS
count,
request_verb,
client_ip
FROM alb_logs
GROUP BY request_verb, client_ip
LIMIT 100;
SELECT request_url
FROM alb_logs
WHERE user_agent LIKE '%Safari%'
LIMIT 10;
• For more information and examples, see the AWS Knowledge Center article How do I analyze my
Application Load Balancer access logs using Athena?.
• For more information about partitioning ALB logs with Athena, see athena-add-partition on GitHub.
Querying Classic Load Balancer Logs
Before you analyze the Elastic Load Balancing logs, configure them for saving in the destination Amazon
S3 bucket. For more information, see Enable Access Logs for Your Classic Load Balancer.
• Create the table for Elastic Load Balancing logs (p. 226)
• Elastic Load Balancing Example Queries (p. 227)
timestamp string,
elb_name string,
request_ip string,
request_port int,
backend_ip string,
backend_port int,
request_processing_time double,
backend_processing_time double,
client_response_time double,
elb_response_code string,
backend_response_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*)
([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*)
(- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/';
2. Modify the LOCATION Amazon S3 bucket to specify the destination of your Elastic Load Balancing
logs.
3. Run the query in the Athena console. After the query completes, Athena registers the elb_logs
table, making the data in it ready for queries. For more information, see Elastic Load Balancing
Example Queries (p. 227)
SELECT
timestamp,
elb_name,
backend_ip,
backend_response_code
FROM elb_logs
WHERE backend_response_code LIKE '4%' OR
backend_response_code LIKE '5%'
LIMIT 100;
Use a subsequent query to sum up the response time of all the transactions grouped by the backend IP
address and Elastic Load Balancing instance name.
SELECT sum(backend_processing_time) AS
total_ms,
elb_name,
backend_ip
FROM elb_logs WHERE backend_ip <> ''
GROUP BY backend_ip, elb_name
LIMIT 100;
Querying Amazon CloudFront Logs
Before you begin querying the logs, enable access logging on your CloudFront web distribution. For
information, see Access Logs in the Amazon CloudFront Developer Guide.
Note
This procedure works for the Web distribution access logs in CloudFront. It does not apply to
streaming logs from RTMP distributions.
This query uses the LazySimpleSerDe (p. 424) by default, so the SerDe specification is omitted from the statement.
The column date is escaped using backticks (`) because it is a reserved word in Athena. For
information, see Reserved Keywords (p. 97).
2. Run the query in Athena console. After the query completes, Athena registers the cloudfront_logs
table, making the data in it ready for you to issue queries.
To eliminate duplicate rows (for example, duplicate empty rows) from the query results, you can use the
SELECT DISTINCT statement, as in the following example.
SELECT DISTINCT *
FROM cloudfront_logs
LIMIT 10;
Additional Resources
For more information about using Athena to query CloudFront logs, see the following posts from the
AWS Big Data Blog.
Easily query AWS service logs using Amazon Athena (May 29, 2019).
Analyze your Amazon CloudFront access logs at scale (December 21, 2018).
Build a Serverless Architecture to Analyze Amazon CloudFront Access Logs Using AWS Lambda, Amazon
Athena, and Amazon Kinesis Analytics (May 26, 2017).
Querying AWS CloudTrail Logs
CloudTrail logs include details about any API calls made to your AWS services, including the console.
CloudTrail generates encrypted log files and stores them in Amazon S3. For more information, see the
AWS CloudTrail User Guide.
Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity. For
example, you can use queries to identify trends and further isolate activity by attributes, such as source
IP address or user.
A common application is to use CloudTrail logs to analyze operational activity for security and
compliance. For information about a detailed example, see the AWS Big Data Blog post, Analyze Security,
Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena.
You can use Athena to query these log files directly from Amazon S3, specifying the LOCATION of log
files. You can do this one of two ways:
• By creating tables for CloudTrail log files directly from the CloudTrail console.
• By manually creating tables for CloudTrail log files in the Athena console.
Topics
• Understanding CloudTrail Logs and Athena Tables (p. 230)
• Using the CloudTrail Console to Create an Athena Table for CloudTrail Logs (p. 230)
• Creating the Table for CloudTrail Logs in Athena Using Manual Partitioning (p. 231)
• Creating the Table for CloudTrail Logs in Athena Using Partition Projection (p. 233)
• Querying Nested Fields (p. 234)
• Example Query (p. 235)
• Tips for Querying CloudTrail Logs (p. 235)
CloudTrail saves logs as JSON text files in compressed gzip format (*.json.gzip). The location of the log
files depends on how you set up trails, the AWS Region or Regions in which you are logging, and other
factors.
For more information about where logs are stored, the JSON structure, and the record file contents, see
the following topics in the AWS CloudTrail User Guide:
To collect logs and save them to Amazon S3, enable CloudTrail from the AWS Management Console. For
more information, see Creating a Trail in the AWS CloudTrail User Guide.
Note the destination Amazon S3 bucket where you save the logs. Replace the LOCATION clause with
the path to the CloudTrail log location and the set of objects with which to work. The example uses a
LOCATION value of logs for a particular account, but you can use the degree of specificity that suits your
application.
For example:
• To analyze data from multiple accounts, you can roll back the LOCATION specifier to indicate all
AWSLogs by using LOCATION 's3://MyLogFiles/AWSLogs/'.
• To analyze data from a specific date, account, and Region, use LOCATION
's3://MyLogFiles/123456789012/CloudTrail/us-east-1/2016/03/14/'.
Using the highest level in the object hierarchy gives you the greatest flexibility when you query using
Athena.
• For information about setting up permissions for Athena, see Setting Up (p. 6).
• For information about creating a table with partitions, see Creating the Table for CloudTrail Logs in
Athena Using Manual Partitioning.
To create an Athena table for a CloudTrail trail using the CloudTrail console
• If you are using the newer CloudTrail console, choose Create Athena table.
• If you are using the older CloudTrail console, choose Run advanced queries in Amazon Athena.
4. For Storage location, use the down arrow to select the Amazon S3 bucket where log files are stored
for the trail to query.
Note
To find the name of the bucket that is associated with a trail, choose Trails in the CloudTrail
navigation pane and view the trail's S3 bucket column. To see the Amazon S3 location for
the bucket, choose the link for the bucket in the S3 bucket column. This opens the Amazon
S3 console to the CloudTrail bucket location.
5. Choose Create table. The table is created with a default name that includes the name of the
Amazon S3 bucket.
To create an Athena table for a CloudTrail trail using the Athena console
1. Copy and paste the following DDL statement into the Athena console. The statement is the same
as the one in the CloudTrail console Create a table in Amazon Athena dialog box, but adds a
PARTITIONED BY clause that makes the table partitioned.
2. Modify s3://CloudTrail_bucket_name/AWSLogs/Account_ID/CloudTrail/ to point to the
Amazon S3 bucket that contains your log data.
3. Verify that fields are listed correctly. For more information about the full list of fields in a CloudTrail
record, see CloudTrail Record Contents.
ALTER TABLE cloudtrail_logs ADD
PARTITION (region='us-east-1',
year='2019',
month='02',
day='01')
LOCATION 's3://CloudTrail_bucket_name/AWSLogs/Account_ID/CloudTrail/us-
east-1/2019/02/01/'
The following example CREATE TABLE statement automatically uses partition projection on
CloudTrail logs from a specified date until the present for a single AWS region. In the LOCATION
and storage.location.template clauses, replace the bucket, account-id, and aws-region
placeholders with correspondingly identical values. For projection.timestamp.range, replace
2020/01/01 with the starting date that you want to use. After you run the query successfully, you can
query the table. You do not have to run ALTER TABLE ADD PARTITION to load the partitions.
apiVersion STRING,
recipientAccountId STRING,
serviceEventDetails STRING,
sharedEventID STRING,
vpcEndpointId STRING
)
PARTITIONED BY (
`timestamp` string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/AWSLogs/account-id/CloudTrail/aws-region'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.timestamp.format'='yyyy/MM/dd',
'projection.timestamp.interval'='1',
'projection.timestamp.interval.unit'='DAYS',
'projection.timestamp.range'='2020/01/01,NOW',
'projection.timestamp.type'='date',
'storage.location.template'='s3://bucket/AWSLogs/account-id/CloudTrail/aws-region/
${timestamp}')
For more information about partition projection, see Partition Projection with Amazon Athena (p. 109).
The userIdentity object consists of nested STRUCT types. These can be queried using a dot to
separate the fields, as in the following example:
SELECT
eventsource,
eventname,
useridentity.sessioncontext.attributes.creationdate,
useridentity.sessioncontext.sessionissuer.arn
FROM cloudtrail_logs
WHERE useridentity.sessioncontext.sessionissuer.arn IS NOT NULL
ORDER BY eventsource, eventname
LIMIT 10
The resources field is an array of STRUCT objects. For these arrays, use CROSS JOIN UNNEST to
unnest the array so that you can query its objects.
The following example returns all rows where the resource ARN ends in example/datafile.txt. For
readability, the replace function removes the initial arn:aws:s3::: substring from the ARN.
SELECT
awsregion,
replace(unnested.resources_entry.ARN,'arn:aws:s3:::') as s3_resource,
eventname,
eventtime,
useragent
FROM cloudtrail_logs t
CROSS JOIN UNNEST(t.resources) unnested (resources_entry)
WHERE unnested.resources_entry.ARN LIKE '%example/datafile.txt'
ORDER BY eventtime
The following example queries for DeleteBucket events. The query extracts the name of the bucket
and the account ID to which the bucket belongs from the resources object.
SELECT
awsregion,
replace(unnested.resources_entry.ARN,'arn:aws:s3:::') as deleted_bucket,
eventtime AS time_deleted,
useridentity.username,
unnested.resources_entry.accountid as bucket_acct_id
FROM cloudtrail_logs t
CROSS JOIN UNNEST(t.resources) unnested (resources_entry)
WHERE eventname = 'DeleteBucket'
ORDER BY eventtime
For more information about unnesting, see Filtering Arrays (p. 170).
Example Query
The following example shows a portion of a query that returns all anonymous (unsigned)
requests from the table created for CloudTrail event logs. This query selects those requests where
useridentity.accountid is anonymous, and useridentity.arn is not specified:
SELECT *
FROM cloudtrail_logs
WHERE
eventsource = 's3.amazonaws.com' AND
eventname in ('GetObject') AND
useridentity.accountid LIKE '%ANONYMOUS%' AND
useridentity.arn IS NULL AND
requestparameters LIKE '%[your bucket name ]%';
For more information, see the AWS Big Data blog post Analyze Security, Compliance, and Operational
Activity Using AWS CloudTrail and Amazon Athena.
• Before querying the logs, verify that your logs table looks the same as the one in the section called
“Creating the Table for CloudTrail Logs in Athena Using Manual Partitioning” (p. 231). If it is not the
first table, delete the existing table using the following command: DROP TABLE cloudtrail_logs;.
• After you drop the existing table, re-create it. For more information, see Creating the Table for
CloudTrail Logs (p. 231).
Verify that fields in your Athena query are listed correctly. For information about the full list of fields
in a CloudTrail record, see CloudTrail Record Contents.
If your query includes fields in JSON formats, such as STRUCT, extract data from JSON. For more
information, see Extracting Data From JSON (p. 209).
Now you are ready to issue queries against your CloudTrail table.
• Start by looking at which IAM users called which API operations and from which source IP addresses.
• Use the following basic SQL query as your template. Paste the query to the Athena console and run it.
SELECT
useridentity.arn,
eventname,
sourceipaddress,
eventtime
FROM cloudtrail_logs
LIMIT 100;
Querying Amazon EMR Logs
For information about creating a partitioned table to potentially improve query performance and reduce
data transfer, see Creating and Querying a Partitioned Table Based on Amazon EMR Logs (p. 237).
The following example queries can be run on the myemrlogs table created by the previous example.
Example – Query Step Logs for Occurrences of ERROR, WARN, INFO, EXCEPTION, FATAL, or
DEBUG
SELECT data,
"$PATH"
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'s-86URH188Z6B1')
AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
Example – Query a Specific Instance Log, i-00b3c0a839ece0a9c, for ERROR, WARN, INFO,
EXCEPTION, FATAL, or DEBUG
SELECT "data",
"$PATH" AS filepath
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'i-00b3c0a839ece0a9c')
AND regexp_like("$PATH",'state')
AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
Example – Query Presto Application Logs for ERROR, WARN, INFO, EXCEPTION, FATAL, or
DEBUG
SELECT "data",
"$PATH" AS filepath
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'presto')
AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
Example – Query Namenode Application Logs for ERROR, WARN, INFO, EXCEPTION, FATAL,
or DEBUG
SELECT "data",
"$PATH" AS filepath
FROM "default"."myemrlogs"
WHERE regexp_like("$PATH",'namenode')
AND regexp_like(data, 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG') limit 100;
Example – Query All Logs by Date and Hour for ERROR, WARN, INFO, EXCEPTION, FATAL, or
DEBUG
The following query statements then create table partitions based on sub-directories for different log
types that Amazon EMR creates in Amazon S3:
ALTER TABLE mypartitionedemrlogs ADD
PARTITION (logtype='containers')
LOCATION 's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6/containers/'
After you create the partitions, you can run a SHOW PARTITIONS query on the table to confirm:
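Using the table name from the examples in this section, the statement looks like the following:
SHOW PARTITIONS mypartitionedemrlogs;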
The following examples demonstrate queries for specific log entries that use the table and partitions
created by the examples above.
SELECT data,
"$PATH"
FROM "default"."mypartitionedemrlogs"
WHERE logtype='containers'
AND regexp_like("$PATH",'application_1561661818238_0002')
AND regexp_like(data, 'ERROR|WARN') limit 100;
SELECT data,
"$PATH"
FROM "default"."mypartitionedemrlogs"
WHERE logtype='hadoop-mapreduce'
AND regexp_like(data,'job_1561661818238_0004|Failed Reduces') limit 100;
SELECT data,
"$PATH"
FROM "default"."mypartitionedemrlogs"
WHERE logtype='node'
AND regexp_like("$PATH",'hive')
AND regexp_like(data,'056e0609-33e1-4611-956c-7a31b42d2663') limit 100;
SELECT data,
"$PATH"
FROM "default"."mypartitionedemrlogs"
WHERE logtype='node'
AND regexp_like(data,'resourcemanager')
AND regexp_like(data,'1567660019320_0001_01_000001') limit 100
Querying AWS Global Accelerator Flow Logs
Global Accelerator flow logs enable you to capture information about the IP address traffic going to and
from network interfaces in your accelerators. Flow log data is published to Amazon S3, where you can
retrieve and view your data. For more information, see Flow Logs in AWS Global Accelerator.
You can use Athena to query your Global Accelerator flow logs by creating a table that specifies their
location in Amazon S3.
1. Copy and paste the following DDL statement into the Athena console. This query specifies ROW
FORMAT DELIMITED and omits specifying a SerDe (p. 408), which means that the query uses the
LazySimpleSerDe (p. 424). In this query, fields are terminated by a space.
2. Modify the LOCATION value to point to the Amazon S3 bucket that contains your log data.
's3://your_log_bucket/prefix/AWSLogs/account_id/globalaccelerator/region_code/'
3. Run the query in the Athena console. After the query completes, Athena registers the
aga_flow_logs table, making the data in it available for queries.
4. Create partitions to read the data, as in the following sample query. The query creates a single
partition for a specified date. Replace the placeholders for date and location.
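The partition column name depends on the CREATE TABLE statement that you used; the sketch below assumes a partition column named dt and placeholder bucket, account, Region, and date values.
ALTER TABLE aga_flow_logs
ADD PARTITION (dt = '2020-05-04')
LOCATION 's3://your_log_bucket/prefix/AWSLogs/account_id/globalaccelerator/region_code/2020/05/04/';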
The following example query lists requests that passed through the LHR edge location. Use the LIMIT
operator to limit the number of logs to query at one time.
SELECT
clientip,
agaregion,
protocol,
action
FROM
aga_flow_logs
WHERE
agaregion LIKE 'LHR%'
LIMIT
100;
Example – List the endpoint IP addresses that receive the most HTTPS requests
To see which endpoint IP addresses are receiving the highest number of HTTPS requests, use the
following query. This query counts the number of packets received on HTTPS port 443, groups them by
destination IP address, and returns the top 10 IP addresses.
SELECT
SUM(numpackets) AS packetcount,
endpointip
FROM
aga_flow_logs
WHERE
endpointport = 443
GROUP BY
endpointip
ORDER BY
packetcount DESC
LIMIT
10;
Querying Amazon GuardDuty Findings
For more information about Amazon GuardDuty, see the Amazon GuardDuty User Guide.
Prerequisites
• Enable the GuardDuty feature for exporting findings to Amazon S3. For steps, see Exporting Findings
in the Amazon GuardDuty User Guide.
3. Run the query in the Athena console to register the gd_logs table. When the query completes, the
findings are ready for you to query from Athena.
Example Queries
The following examples show how to query GuardDuty findings from Athena.
The following query returns information about Amazon EC2 instances that might be exfiltrating data
through DNS queries.
SELECT
title,
severity,
type,
id AS FindingID,
accountid,
region,
createdate,
updatedate,
json_extract_scalar(service, '$.count') AS Count,
json_extract_scalar(resource, '$.instancedetails.instanceid') AS InstanceID,
json_extract_scalar(service, '$.action.actiontype') AS DNS_ActionType,
json_extract_scalar(service, '$.action.dnsrequestaction.domain') AS DomainName,
json_extract_scalar(service, '$.action.dnsrequestaction.protocol') AS protocol,
json_extract_scalar(service, '$.action.dnsrequestaction.blocked') AS blocked
FROM gd_logs
WHERE type = 'Trojan:EC2/DNSDataExfiltration'
ORDER BY severity DESC
The following query returns all UnauthorizedAccess:IAMUser finding types for an IAM Principal
from all regions.
SELECT title,
severity,
type,
id,
accountid,
region,
createdate,
updatedate,
json_extract_scalar(service, '$.count') AS Count,
json_extract_scalar(resource, '$.accesskeydetails.username') AS IAMPrincipal,
json_extract_scalar(service,'$.action.awsapicallaction.api') AS APIActionCalled
FROM gd_logs
WHERE type LIKE '%UnauthorizedAccess:IAMUser%'
ORDER BY severity desc;
• To extract data from nested JSON fields, use the Presto json_extract or json_extract_scalar
functions. For more information, see Extracting Data from JSON (p. 209).
• Make sure that all characters in the JSON fields are in lower case.
• For information about downloading query results, see Downloading Query Results Files Using the
Athena Console (p. 126).
Querying Network Load Balancer Logs
Before you analyze the Network Load Balancer access logs, enable and configure them for saving in the
destination Amazon S3 bucket. For more information, see Access Logs for Your Network Load Balancer.
• Create the table for Network Load Balancer logs (p. 243)
• Network Load Balancer Example Queries (p. 243)
2. Modify the LOCATION Amazon S3 bucket to specify the destination of your Network Load Balancer
logs.
3. Run the query in the Athena console. After the query completes, Athena registers the
nlb_tls_logs table, making the data in it ready for queries.
SELECT count(*) AS
ct,
cert_arn
FROM "nlb_tls_logs"
GROUP BY cert_arn;
The following query shows how many users are using the older TLS version:
SELECT tls_protocol_version,
COUNT(tls_protocol_version) AS
num_connections,
client_ip
FROM "nlb_tls_logs"
WHERE tls_protocol_version < 'tlsv12'
GROUP BY tls_protocol_version, client_ip;
Use the following query to identify connections that take a long TLS handshake time:
SELECT *
FROM "nlb_tls_logs"
ORDER BY tls_handshake_time_ms DESC
LIMIT 10;
Querying Amazon VPC Flow Logs
Before you begin querying the logs in Athena, enable VPC flow logs, and configure them to be saved to
your Amazon S3 bucket. After you create the logs, let them run for a few minutes to collect some data.
The logs are created in a GZIP compression format that Athena lets you query directly.
When you create a VPC flow log, you can use the default format, or you can specify a custom format. A
custom format is where you specify which fields to return in the flow log, and the order in which they
should appear. For more information, see Flow Log Records in the Amazon VPC User Guide.
1. Copy and paste the following DDL statement into the Athena console Query Editor:
endtime int,
action string,
logstatus string
)
PARTITIONED BY (`date` date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://your_log_bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/'
TBLPROPERTIES ("skip.header.line.count"="1");
• The query specifies ROW FORMAT DELIMITED and omits specifying a SerDe. This means that the
query uses the LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files (p. 424). In this query,
fields are terminated by a space.
• The PARTITIONED BY clause uses the date type. This makes it possible to use mathematical
operators in queries to select what's older or newer than a certain date.
Note
Because date is a reserved keyword in DDL statements, it is escaped by backtick
characters. For more information, see Reserved Keywords (p. 97).
• For a VPC flow log with a custom format, modify the fields to match the fields that you specified
when you created the flow log.
2. Modify the LOCATION 's3://your_log_bucket/prefix/AWSLogs/{account_id}/
vpcflowlogs/{region_code}/' to point to the Amazon S3 bucket that contains your log data.
3. Run the query in Athena console. After the query completes, Athena registers the vpc_flow_logs
table, making the data in it ready for you to issue queries.
4. Create partitions to be able to read the data, as in the following sample query. This query creates a
single partition for a specified date. Replace the placeholders for date and location as needed.
Note
This query creates a single partition only, for a date that you specify. To automate the
process, use a script that runs this query and creates partitions this way for the year/
month/day.
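Based on the PARTITIONED BY (`date` date) clause and LOCATION shown above, the statement looks like the following; replace the date and bucket path with your own values.
ALTER TABLE vpc_flow_logs
ADD PARTITION (`date` = '2020-05-04')
LOCATION 's3://your_log_bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/{region_code}/2020/05/04/';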
SELECT *
FROM vpc_flow_logs
WHERE date = DATE('2020-05-04')
LIMIT 100;
The following query lists all of the rejected TCP connections and uses the newly created date partition
column, date, to extract from it the day of the week for which these events occurred.
SELECT day_of_week(date) AS
day,
date,
interfaceid,
sourceaddress,
action,
protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;
To see which one of your servers is receiving the highest number of HTTPS requests, use this query. It
counts the number of packets received on HTTPS port 443, groups them by destination IP address, and
returns the top 10 from the last week.
SELECT SUM(numpackets) AS
packetcount,
destinationaddress
FROM vpc_flow_logs
WHERE destinationport = 443 AND date > current_date - interval '7' day
GROUP BY destinationaddress
ORDER BY packetcount DESC
LIMIT 10;
For more information, see the AWS Big Data blog post Analyzing VPC Flow Logs with Amazon Kinesis
Firehose, Athena, and Amazon QuickSight.
Querying AWS WAF Logs
You can enable access logging for AWS WAF logs, save them to Amazon S3, and query the logs in
Athena. For more information about enabling AWS WAF logs and about the log record structure, see
Logging Web ACL Traffic Information in the AWS WAF Developer Guide.
Make a note of the Amazon S3 bucket to which you save these logs.
This query uses the OpenX JSON SerDe (p. 421). The table format and the SerDe are suggested by
the AWS Glue crawler when it analyzes AWS WAF logs.
Note
The SerDe expects each JSON record in the WAF logs in Amazon S3 to be on a single line of
text with no line termination characters separating the fields in the record. If the WAF log
JSON text is in pretty print format, you may receive the error message HIVE_CURSOR_ERROR:
Row is not a valid JSON Object when you attempt to query the table after you create it.
`formatversion` int,
`webaclid` string,
`terminatingruleid` string,
`terminatingruletype` string,
`action` string,
`terminatingrulematchdetails` array<
struct<
conditiontype:string,
location:string,
matcheddata:array<string>
>
>,
`httpsourcename` string,
`httpsourceid` string,
`rulegrouplist` array<string>,
`ratebasedrulelist` array<
struct<
ratebasedruleid:string,
limitkey:string,
maxrateallowed:int
>
>,
`nonterminatingmatchingrules` array<
struct<
ruleid:string,
action:string
>
>,
`httprequest` struct<
clientip:string,
country:string,
headers:array<
struct<
name:string,
value:string
>
>,
uri:string,
args:string,
httpversion:string,
httpmethod:string,
requestid:string
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='action,formatVersion,httpRequest,httpSourceId,httpSourceName,nonTerminatingMatchingRules,ra
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://athenawaflogs/WebACL/'
2. Run the query in the Athena console. After the query completes, Athena registers the waf_logs
table, making the data in it available for queries.
The following query counts the number of requests received from each client IP address.
SELECT COUNT(*) AS count,
       httpRequest.clientIp
FROM waf_logs
GROUP BY httpRequest.clientIp
ORDER BY count
LIMIT 100;
The following query counts the number of times the request has arrived from an IP address that belongs
to Ireland (IE) and has been blocked by the RATE_BASED terminating rule.
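A sketch of that query follows, using fields from the CREATE TABLE statement above; adjust the rule conditions to match your own web ACL configuration.
SELECT COUNT(*) AS count,
       httprequest.clientip
FROM waf_logs
WHERE terminatingruletype = 'RATE_BASED'
AND httprequest.country = 'IE'
GROUP BY httprequest.clientip
ORDER BY count DESC
LIMIT 100;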
The following query counts the number of times the request has been blocked, with results grouped by
WebACL, RuleId, ClientIP, and HTTP Request URI.
SELECT COUNT(*) AS
count,
webaclid,
terminatingruleid,
httprequest.clientip,
httprequest.uri
FROM waf_logs
WHERE action='BLOCK'
GROUP BY webaclid, terminatingruleid, httprequest.clientip, httprequest.uri
ORDER BY count DESC
LIMIT 100;
The following query counts the number of times a specific terminating rule ID has been matched (WHERE
terminatingruleid='e9dd190d-7a43-4c06-bcea-409613d9506e'). The query then groups the
results by WebACL, Action, ClientIP, and HTTP Request URI.
SELECT COUNT(*) AS
count,
webaclid,
action,
httprequest.clientip,
httprequest.uri
FROM waf_logs
WHERE terminatingruleid='e9dd190d-7a43-4c06-bcea-409613d9506e'
GROUP BY webaclid, action, httprequest.clientip, httprequest.uri
ORDER BY count DESC
LIMIT 100;
The following query uses the from_unixtime and to_iso8601 functions to return the timestamp
field in human-readable ISO 8601 format (for example, 2019-12-13T23:40:12.000Z instead of
1576280412771). The query also returns the HTTP source name, source ID, and request.
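A sketch of that query follows; it assumes the timestamp field stores epoch milliseconds, as the example values above suggest.
SELECT to_iso8601(from_unixtime("timestamp" / 1000)) AS time_readable,
       httpsourcename,
       httpsourceid,
       httprequest
FROM waf_logs
LIMIT 10;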
The following query uses a filter in the WHERE clause to return the same fields for records from the last
24 hours.
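A sketch of that filter, under the same epoch-milliseconds assumption, follows.
SELECT to_iso8601(from_unixtime("timestamp" / 1000)) AS time_readable,
       httpsourcename,
       httpsourceid,
       httprequest
FROM waf_logs
WHERE from_unixtime("timestamp" / 1000) > now() - interval '1' day
LIMIT 10;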
For more information about date and time functions, see Date and Time Functions and Operators in the
Presto documentation.
For information about querying Amazon S3 logs, see the following topics:
• How do I analyze my Amazon S3 server access logs using Athena? in the AWS Knowledge Center
• Querying Amazon S3 access logs for requests using Amazon Athena in the Amazon Simple Storage
Service Developer Guide
• Using AWS CloudTrail to identify Amazon S3 requests in the Amazon Simple Storage Service
Developer Guide
Querying AWS Glue Data Catalog
To obtain AWS Glue Catalog metadata, you query the information_schema database on the Athena
backend. The example queries in this topic show how to use Athena to query AWS Glue Catalog
metadata for common use cases.
Important
You cannot use CREATE VIEW to create a view on the information_schema database.
Topics
• Listing Databases and Searching a Specified Database (p. 249)
• Listing Tables in a Specified Database and Searching for a Table by Name (p. 250)
• Listing Partitions for a Specific Table (p. 251)
• Listing or Searching Columns for a Specified Table or View (p. 252)
Listing Databases and Searching a Specified Database
The following example query lists the databases from the information_schema.schemata table.
SELECT schema_name
FROM information_schema.schemata
LIMIT 10;
6 alb-databas1
7 alb_original_cust
8 alblogsdatabase
9 athena_db_test
10 athena_ddl_db
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name = 'rdspostgresql'
schema_name
1 rdspostgresql
Listing Tables in a Specified Database and Searching for a Table by Name
The following query lists the tables in the rdspostgresql database.
SELECT table_schema,
table_name,
table_type
FROM information_schema.tables
WHERE table_schema = 'rdspostgresql'
  table_schema   table_name                       table_type
1 rdspostgresql  rdspostgresqldb1_public_account  BASE TABLE
The following query obtains metadata information for the table athena1.
SELECT table_schema,
table_name,
table_type
FROM information_schema.tables
WHERE table_name = 'athena1'
Listing Partitions for a Specific Table
You can also use a metadata query to list the partition numbers and partition values for a specific table.
The syntax that you use depends on the Athena engine version.
The following example query lists the partitions for the table cloudtrail_logs_test2 using Athena
engine version 2.
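With engine version 2, partition metadata is exposed through the table's "$partitions" metadata table; a sketch of the query follows.
SELECT * FROM "default"."cloudtrail_logs_test2$partitions";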
The following example query lists the partitions for the table cloudtrail_logs_test2 using Athena
engine version 1.
SELECT *
FROM information_schema.__internal_partitions__
WHERE table_schema = 'default'
AND table_name = 'cloudtrail_logs_test2'
ORDER BY partition_number
Listing or Searching Columns for a Specified Table or View
The following example query lists all columns for the table rdspostgresqldb1_public_account.
SELECT *
FROM information_schema.columns
WHERE table_schema = 'rdspostgresql'
AND table_name = 'rdspostgresqldb1_public_account'
  table_catalog    table_schema   table_name                       column_name  ordinal_position  column_default  is_nullable  data_type  comment  extra_info
1 awsdatacatalog   rdspostgresql  rdspostgresqldb1_public_account  password     1                                 YES          varchar
2 awsdatacatalog   rdspostgresql  rdspostgresqldb1_public_account  user_id      2                                 YES          integer
3 awsdatacatalog   rdspostgresql  rdspostgresqldb1_public_account  created_on   3                                 YES          timestamp
4 awsdatacatalog   rdspostgresql  rdspostgresqldb1_public_account  last_login   4                                 YES          timestamp
5 awsdatacatalog   rdspostgresql  rdspostgresqldb1_public_account  email        5                                 YES          varchar
6 awsdatacatalog   rdspostgresql  rdspostgresqldb1_public_account  username     6                                 YES          varchar
The following example query lists all the columns in the default database for the view arrayview.
SELECT *
FROM information_schema.columns
WHERE table_schema = 'default'
AND table_name = 'arrayview'
  table_catalog    table_schema  table_name  column_name      ordinal_position  column_default  is_nullable  data_type        comment  extra_info
1 awsdatacatalog   default       arrayview   searchdate       1                                 YES          varchar
2 awsdatacatalog   default       arrayview   sid              2                                 YES          varchar
3 awsdatacatalog   default       arrayview   btid             3                                 YES          varchar
4 awsdatacatalog   default       arrayview   p                4                                 YES          varchar
5 awsdatacatalog   default       arrayview   infantprice      5                                 YES          varchar
6 awsdatacatalog   default       arrayview   sump             6                                 YES          varchar
7 awsdatacatalog   default       arrayview   journeymaparray  7                                 YES          array(varchar)
The following example query searches for metadata for the sid column in the arrayview view of the
default database.
SELECT *
FROM information_schema.columns
WHERE table_schema = 'default'
AND table_name = 'arrayview'
AND column_name='sid'
  table_catalog    table_schema  table_name  column_name  ordinal_position  column_default  is_nullable  data_type  comment  extra_info
1 awsdatacatalog   default       arrayview   sid          2                                 YES          varchar
Querying Web Server Logs
Topics
• Querying Apache Logs Stored in Amazon S3 (p. 253)
• Querying Internet Information Server (IIS) Logs Stored in Amazon S3 (p. 255)
Querying Apache Logs Stored in Amazon S3
Fields in the common log format include the client IP address, client ID, user ID, request received
timestamp, text of the client request, server status code, and size of the object returned to the client.
The following example data shows the Apache common log format.
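For example (illustrative values only):
198.51.100.7 - - [10/Oct/2019:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326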
3. Run the query in the Athena console to register the apache_logs table. When the query completes,
the logs are ready for you to query from Athena.
The following example query selects the request received time, text of the client request, and server
status code from the apache_logs table. The WHERE clause filters for HTTP status code 404 (page not
found).
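A sketch of that query follows; the column names assume the table was created with columns such as request_received_time, client_request, and server_status.
SELECT request_received_time, client_request, server_status
FROM apache_logs
WHERE server_status = '404'
LIMIT 10;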
The following image shows the results of the query in the Athena Query Editor.
Querying Internet Information Server (IIS) Logs Stored in Amazon S3
Because the W3C extended and IIS log file formats use single character delimiters (spaces and
commas, respectively) and do not have values enclosed in quotation marks, you can use the
LazySimpleSerDe (p. 424) to create Athena tables for them.
The following example log data has the fields date, time, c-ip, s-ip, cs-method, cs-uri-stem,
sc-status, sc-bytes, cs-bytes, time-taken, and cs-version.
2020-01-19 22:48:39 203.0.113.5 198.51.100.2 GET /default.html 200 540 524 157 HTTP/1.0
2020-01-19 22:49:40 203.0.113.10 198.51.100.12 GET /index.html 200 420 324 164 HTTP/1.0
2020-01-19 22:50:12 203.0.113.12 198.51.100.4 GET /image.gif 200 324 320 358 HTTP/1.0
2020-01-19 22:51:44 203.0.113.15 198.51.100.16 GET /faq.html 200 330 324 288 HTTP/1.0
a. Add or remove the columns in the example to correspond to the fields in the logs that you want
to query.
b. Column names in the W3C extended log file format contain hyphens (-). However, in accordance
with Athena naming conventions (p. 96), the example CREATE TABLE statement replaces them
with underscores (_).
c. To specify the space delimiter, use FIELDS TERMINATED BY ' '.
d. Modify the values in LOCATION 's3://bucket-name/w3c-log-folder/' to point to your
W3C extended logs in Amazon S3.
3. Run the query in the Athena console to register the iis_w3c_logs table. When the query
completes, the logs are ready for you to query from Athena.
The following image shows the results of the query in the Athena Query Editor.
After the table is created, you can query the new timestamp column directly, as in the following
example.
The following example shows sample data in the IIS log file format.
203.0.113.15, -, 2020-02-24, 22:48:38, W3SVC2, SERVER5, 198.51.100.4, 254, 501, 488, 200,
0, GET, /index.htm, -,
203.0.113.4, -, 2020-02-24, 22:48:39, W3SVC2, SERVER6, 198.51.100.6, 147, 411, 388, 200, 0,
GET, /about.html, -,
203.0.113.11, -, 2020-02-24, 22:48:40, W3SVC2, SERVER7, 198.51.100.18, 170, 531, 468, 200,
0, GET, /image.png, -,
203.0.113.8, -, 2020-02-24, 22:48:41, W3SVC2, SERVER8, 198.51.100.14, 125, 711, 868, 200,
0, GET, /intro.htm, -,
service_status_code string,
windows_status_code string,
request_type string,
target_of_operation string,
script_parameters string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/iis-log-file-folder/'
3. Run the query in the Athena console to register the iis_format_logs table. When the query
completes, the logs are ready for you to query from Athena.
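For example, a query like the following, which uses columns from the table definition above, returns successful requests.

SELECT request_type, target_of_operation, service_status_code
FROM iis_format_logs
WHERE service_status_code = '200'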
The following image shows the results of the query of the sample data.
The following example shows data in the NCSA common log format as documented for IIS.
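Entries in this format look similar to the following illustrative lines, in which the user name includes a Windows domain.

198.51.100.7 - AnyCompany\JohnDoe [24/Feb/2020:22:48:38 -0800] "GET /index.html HTTP/1.1" 200 3401
198.51.100.14 - AnyCompany\MarthaRivera [24/Feb/2020:22:48:39 -0800] "GET /missing.html HTTP/1.1" 404 521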
3. Run the query in the Athena console to register the iis_ncsa_logs table. When the query
completes, the logs are ready for you to query from Athena.
The following example query selects the request received time, text of the client request, and server
status code from the iis_ncsa_logs table. The WHERE clause filters for HTTP status code 404 (page
not found).
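A sketch of such a query follows; the column names request_received_time, client_request, and server_status are assumed to match the iis_ncsa_logs table definition.

SELECT request_received_time, client_request, server_status
FROM iis_ncsa_logs
WHERE server_status = '404'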
The following image shows the results of the query in the Athena Query Editor.
The following example query selects the user ID, request received time, text of the client request, and
server status code from the iis_ncsa_logs table. The WHERE clause filters for requests with HTTP
status code 200 (successful) from users in the AnyCompany domain.
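Assuming a user_id column that stores the domain-qualified user name, the query might look like the following sketch.

SELECT user_id, request_received_time, client_request, server_status
FROM iis_ncsa_logs
WHERE server_status = '200' AND user_id LIKE 'AnyCompany\%'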
The following image shows the results of the query in the Athena Query Editor.
Security
Security is a shared responsibility between AWS and you. The shared responsibility model describes this
as security of the cloud and security in the cloud:
• Security of the cloud – AWS is responsible for protecting the infrastructure that runs AWS services
in the AWS Cloud. AWS also provides you with services that you can use securely. The effectiveness
of our security is regularly tested and verified by third-party auditors as part of the AWS compliance
programs. To learn about the compliance programs that apply to Athena, see AWS Services in Scope by
Compliance Program.
• Security in the cloud – Your responsibility is determined by the AWS service that you use. You are also
responsible for other factors including the sensitivity of your data, your organization’s requirements,
and applicable laws and regulations.
This documentation will help you understand how to apply the shared responsibility model when using
Amazon Athena. The following topics show you how to configure Athena to meet your security and
compliance objectives. You'll also learn how to use other AWS services that can help you to monitor and
secure your Athena resources.
Topics
• Data Protection in Athena (p. 262)
• Identity and Access Management in Athena (p. 270)
• Logging and Monitoring in Athena (p. 304)
• Compliance Validation for Amazon Athena (p. 307)
• Resilience in Athena (p. 308)
• Infrastructure Security in Athena (p. 308)
• Configuration and Vulnerability Analysis in Athena (p. 310)
• Using Athena to Query Data Registered With AWS Lake Formation (p. 310)
Data Protection in Athena
For data protection purposes, we recommend that you protect AWS account credentials and set up
individual user accounts with AWS Identity and Access Management (IAM). That way each user is given
only the permissions necessary to fulfill their job duties. We also recommend that you secure your data
in the following ways:
• Use AWS encryption solutions, along with all default security controls within AWS services.
• Use advanced managed security services such as Amazon Macie, which assists in discovering and
securing personal data that is stored in Amazon S3.
• If you require FIPS 140-2 validated cryptographic modules when accessing AWS through a command
line interface or an API, use a FIPS endpoint. For more information about the available FIPS endpoints,
see Federal Information Processing Standard (FIPS) 140-2.
We strongly recommend that you never put sensitive identifying information, such as your customers'
account numbers, into free-form fields such as a Name field. This includes when you work with Athena
or other AWS services using the console, API, AWS CLI, or AWS SDKs. Any data that you enter into Athena
or other services might get picked up for inclusion in diagnostic logs. When you provide a URL to an
external server, don't include credentials information in the URL to validate your request to that server.
Protecting Multiple Types of Data
• Source data – You store the data for databases and tables in Amazon S3, and Athena does not modify
it. For more information, see Data Protection in Amazon S3 in the Amazon Simple Storage Service
Developer Guide. You control access to your source data and can encrypt it in Amazon S3. You can use
Athena to create tables based on encrypted datasets in Amazon S3 (p. 267).
• Database and table metadata (schema) – Athena uses schema-on-read technology, which means
that your table definitions are applied to your data in Amazon S3 when Athena runs queries. Any
schemas you define are automatically saved unless you explicitly delete them. In Athena, you can
modify the Data Catalog metadata using DDL statements. You can also delete table definitions and
schema without impacting the underlying data stored in Amazon S3.
Note
The metadata for databases and tables you use in Athena is stored in the AWS Glue Data
Catalog. We highly recommend that you upgrade (p. 29) to using the AWS Glue Data
Catalog with Athena. For more information about the benefits of using the AWS Glue Data
Catalog, see FAQ: Upgrading to the AWS Glue Data Catalog (p. 32).
You can define fine-grained access policies to databases and tables (p. 275) registered in the
AWS Glue Data Catalog using AWS Identity and Access Management (IAM). You can also encrypt
metadata in the AWS Glue Data Catalog. If you encrypt the metadata, users need the permissions
described in Permissions to Encrypted Metadata in the AWS Glue Data Catalog (p. 266).
• Query results and query history, including saved queries – Query results are stored in a location in
Amazon S3 that you can choose to specify globally, or for each workgroup. If not specified, Athena
uses the default location in each case. You control access to Amazon S3 buckets where you store query
results and saved queries. Additionally, you can choose to encrypt query results that you store in
Amazon S3. Your users must have the appropriate permissions to access the Amazon S3 locations and
decrypt files. For more information, see Encrypting Query Results Stored in Amazon S3 (p. 266) in
this document.
Athena retains query history for 45 days. You can view query history (p. 129) using Athena APIs, in the
console, and with AWS CLI. To store the queries for longer than 45 days, save them. To protect access
to saved queries, use workgroups (p. 358) in Athena, restricting access to saved queries only to users
who are authorized to view them.
Topics
• Encryption at Rest (p. 264)
Encryption at Rest
You can run queries in Amazon Athena on encrypted data in Amazon S3 in the same Region and across a
limited number of Regions. You can also encrypt the query results in Amazon S3 and the data in the AWS
Glue Data Catalog.
In Athena, you can encrypt the following assets:
• The results of all queries in Amazon S3, which Athena stores in a location known as the Amazon S3
results location. You can encrypt query results stored in Amazon S3 whether the underlying dataset
is encrypted in Amazon S3 or not. For information, see Encrypting Query Results Stored in Amazon
S3 (p. 266).
• The data in the AWS Glue Data Catalog. For information, see Permissions to Encrypted Metadata in the
AWS Glue Data Catalog (p. 266).
Note
The setup for querying an encrypted dataset in Amazon S3 and the options in Athena to encrypt
query results are independent. Each option is enabled and configured separately. You can use
different encryption methods or keys for each. This means that reading encrypted data in
Amazon S3 doesn't automatically encrypt Athena query results in Amazon S3. The opposite is
also true. Encrypting Athena query results in Amazon S3 doesn't encrypt the underlying dataset
in Amazon S3.
Topics
• Supported Amazon S3 Encryption Options (p. 264)
• Permissions to Encrypted Data in Amazon S3 (p. 265)
• Permissions to Encrypted Metadata in the AWS Glue Data Catalog (p. 266)
• Encrypting Query Results Stored in Amazon S3 (p. 266)
• Creating Tables Based on Encrypted Datasets in Amazon S3 (p. 267)
Supported Amazon S3 Encryption Options
• SSE-S3 – Server-side encryption (SSE) with an Amazon S3-managed key. Supported: Yes.
For more information about AWS KMS encryption with Amazon S3, see What is AWS Key Management
Service and How Amazon Simple Storage Service (Amazon S3) Uses AWS KMS in the AWS Key
Management Service Developer Guide. For more information about using SSE-KMS or CSE-KMS with
Athena, see Launch: Amazon Athena adds support for Querying Encrypted Data from the AWS Big Data
Blog.
Unsupported Options
The following encryption options are not supported:
• SSE with customer-provided keys (SSE-C)
• Client-side encryption using a client-side master key
To compare Amazon S3 encryption options, see Protecting Data Using Encryption in the Amazon Simple
Storage Service Developer Guide.
Permissions to Encrypted Data in Amazon S3
• SSE-S3 – If you use SSE-S3 for encryption, Athena users require no additional permissions in their
policies. It is sufficient to have the appropriate Amazon S3 permissions for the appropriate Amazon
S3 location and for Athena actions. For more information about policies that allow appropriate
Athena and Amazon S3 permissions, see IAM Policies for User Access (p. 271) and Amazon S3
Permissions (p. 274).
• AWS KMS – If you use AWS KMS for encryption, Athena users must be allowed to perform particular
AWS KMS actions in addition to Athena and Amazon S3 permissions. You allow these actions by
editing the key policy for the AWS KMS customer managed CMKs that are used to encrypt data in
Amazon S3. To add key users to the appropriate AWS KMS key policies, you can use the AWS KMS
console at https://console.aws.amazon.com/kms. For information about how to add a user to a AWS
KMS key policy, see Allows key users to use the CMK in the AWS Key Management Service Developer
Guide.
Note
Advanced key policy administrators can adjust key policies. kms:Decrypt is the minimum
allowed action for an Athena user to work with an encrypted dataset. To work with encrypted
query results, the minimum allowed actions are kms:GenerateDataKey and kms:Decrypt.
When using Athena to query datasets in Amazon S3 with a large number of objects that are encrypted
with AWS KMS, AWS KMS may throttle query results. This is more likely when there are a large number
of small objects. Athena backs off retry requests, but a throttling error might still occur. In this case,
you can increase your service quotas for AWS KMS. For more information, see Quotas in the AWS Key
Management Service Developer Guide.
Encrypting Query Results Stored in Amazon S3
If you connect using the JDBC or ODBC driver, you configure driver options to specify the type of
encryption to use and the Amazon S3 staging directory location. To configure the JDBC or ODBC
driver to encrypt your query results using any of the encryption protocols that Athena supports, see
Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 83).
You can configure the setting for encryption of query results in two ways:
• Client-side settings – When you use Settings in the console or the API operations to indicate that you
want to encrypt query results, this is known as using client-side settings. Client-side settings include
query results location and encryption. If you specify them, they are used, unless they are overridden by
the workgroup settings.
• Workgroup settings – When you create or edit a workgroup (p. 368) and select the Override client-
side settings field, then all queries that run in this workgroup use the workgroup settings. For more
information, see Workgroup Settings Override Client-Side Settings (p. 366). Workgroup settings
include query results location and encryption.
Important
If your workgroup has the Override client-side settings field selected, then the queries use
the workgroup settings. The encryption configuration and the query results location listed in
Settings, the API operations, and the drivers are not used. For more information, see Workgroup
Settings Override Client-Side Settings (p. 366).
2. For Query result location, enter a custom value or leave the default. This is the Amazon S3 staging
directory where query results are stored.
3. Choose Encrypt query results.
• If your account has access to an existing AWS KMS customer managed key (CMK), choose its alias
or choose Enter a KMS key ARN and then enter an ARN.
• If your account does not have access to an existing AWS KMS customer managed key (CMK),
choose Create KMS key, and then open the AWS KMS console. In the navigation pane, choose
AWS managed keys. For more information, see Creating Keys in the AWS Key Management Service
Developer Guide.
Note
Athena supports only symmetric keys for reading and writing data.
6. Return to the Athena console to specify the key by alias or ARN as described in the previous step.
7. Choose Save.
Creating Tables Based on Encrypted Datasets in Amazon S3
Users that run queries, including the user who creates the table, must have the appropriate permissions
as described earlier in this topic.
Important
If you use Amazon EMR along with EMRFS to upload encrypted Parquet files, you must
disable multipart uploads by setting fs.s3n.multipart.uploads.enabled to
false. If you don't do this, Athena is unable to determine the Parquet file length and a
HIVE_CANNOT_OPEN_SPLIT error occurs. For more information, see Configure Multipart
Upload for Amazon S3 in the Amazon EMR Management Guide.
Indicate that the dataset is encrypted in Amazon S3 in one of the following ways. This step is not
required if SSE-KMS is used.
• Use the CREATE TABLE (p. 454) statement with a TBLPROPERTIES clause that specifies
'has_encrypted_data'='true', as in the sketch that follows this list.
• Use the JDBC driver (p. 83) and set the same TBLPROPERTIES value when you run CREATE
TABLE (p. 454) using statement.executeQuery().
• Use the Add table wizard in the Athena console, and then choose Encrypted data set when you
specify a value for Location of input data set.
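A minimal sketch of the first option follows; the table name, columns, delimiter, and bucket path are placeholders.

CREATE EXTERNAL TABLE encrypted_logs(
request_time string,
request_path string,
status_code string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://bucket-name/encrypted-data-folder/'
TBLPROPERTIES ('has_encrypted_data'='true')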
Tables based on encrypted data in Amazon S3 appear in the Database list with an encryption icon.
Encryption in Transit
In addition to encrypting data at rest in Amazon S3, Amazon Athena uses Transport Layer Security (TLS)
encryption for data in-transit between Athena and Amazon S3, and between Athena and customer
applications accessing it.
You should allow only encrypted connections over HTTPS (TLS) using the aws:SecureTransport
condition on Amazon S3 bucket IAM policies.
Query results that stream to JDBC or ODBC clients are encrypted using TLS. For information about
the latest versions of the JDBC and ODBC drivers and their documentation, see Connect with the JDBC
Driver (p. 83) and Connect with the ODBC Driver (p. 85).
Key Management
Amazon Athena supports AWS Key Management Service (AWS KMS) to encrypt datasets in Amazon
S3 and Athena query results. AWS KMS uses customer master keys (CMKs) to encrypt your Amazon S3
objects and relies on envelope encryption.
In AWS KMS, you can do the following:
• Create keys
• Import your own key material for new CMKs
Note
Athena supports only symmetric keys for reading and writing data.
For more information, see What is AWS Key Management Service in the AWS Key Management Service
Developer Guide, and How Amazon Simple Storage Service Uses AWS KMS. To view the keys in your
account that AWS creates and manages for you, open the AWS KMS console and, in the navigation
pane, choose AWS managed keys.
If you are uploading or accessing objects encrypted by SSE-KMS, use AWS Signature Version 4 for added
security. For more information, see Specifying the Signature Version in Request Authentication in the
Amazon Simple Storage Service Developer Guide.
Internetwork Traffic Privacy
• For traffic between Athena and on-premises clients and applications, query results that stream to
JDBC or ODBC clients are encrypted using Transport Layer Security (TLS).
You can use one of the connectivity options between your private network and AWS:
• A Site-to-Site VPN AWS VPN connection. For more information, see What is Site-to-Site VPN AWS
VPN in the AWS Site-to-Site VPN User Guide.
• An AWS Direct Connect connection. For more information, see What is AWS Direct Connect in the
AWS Direct Connect User Guide.
• For traffic between Athena and Amazon S3 buckets, Transport Layer Security (TLS) encrypts objects
in transit between Athena and Amazon S3, and between Athena and customer applications accessing
it. You should allow only encrypted connections over HTTPS (TLS) by using the aws:SecureTransport
condition on Amazon S3 bucket IAM policies.
Identity and Access Management in Athena
To run queries in Athena, you must have the appropriate IAM permissions for the following resources:
• Amazon S3 locations where the underlying data to query is stored. For more information, see Identity
and access management in Amazon S3 in the Amazon Simple Storage Service Developer Guide.
• Metadata and resources that you store in the AWS Glue Data Catalog, such as databases and tables,
including additional actions for encrypted metadata. For more information, see Setting up IAM
Permissions for AWS Glue and Setting Up Encryption in AWS Glue in the AWS Glue Developer Guide.
• Athena API actions. For a list of API actions in Athena, see Actions in the Amazon Athena API Reference.
The following topics provide more information about permissions for specific areas of Athena.
Topics
• Managed Policies for User Access (p. 271)
• Access through JDBC and ODBC Connections (p. 274)
• Access to Amazon S3 (p. 274)
• Fine-Grained Access to Databases and Tables in the AWS Glue Data Catalog (p. 275)
• Access to Encrypted Metadata in the AWS Glue Data Catalog (p. 281)
• Cross-account Access in Athena to Amazon S3 Buckets (p. 282)
• Access to Workgroups and Tags (p. 285)
• Using Athena with CalledVia Context Keys (p. 285)
• Allow Access to an Athena Data Connector for External Hive Metastore (p. 287)
• Allow Lambda Function Access to External Hive Metastores (p. 289)
• Example IAM Permissions Policies to Allow Athena Federated Query (p. 293)
• Example IAM Permissions Policies to Allow Amazon Athena User Defined Functions (UDF) (p. 297)
• Allowing Access for ML with Athena (Preview) (p. 301)
Managed Policies for User Access
Each identity-based policy consists of statements that define the actions that are allowed or denied. For
more information and step-by-step instructions for attaching a policy to a user, see Attaching Managed
Policies in the AWS Identity and Access Management User Guide. For a list of actions, see the Amazon
Athena API Reference.
Managed policies are easy to use and are updated automatically with the required actions as the service
evolves.
• The AmazonAthenaFullAccess managed policy grants full access to Athena. Attach it to users
and other principals who need full access to Athena. See AmazonAthenaFullAccess Managed
Policy (p. 271).
• The AWSQuicksightAthenaAccess managed policy grants access to actions that Amazon
QuickSight needs to integrate with Athena. Attach this policy to principals who use Amazon QuickSight
in conjunction with Athena. See AWSQuicksightAthenaAccess Managed Policy (p. 273).
Customer-managed and inline identity-based policies allow you to specify more detailed Athena actions
within a policy to fine-tune access. We recommend that you use the AmazonAthenaFullAccess policy
as a starting point and then allow or deny specific actions listed in the Amazon Athena API Reference. For
more information about inline policies, see Managed Policies and Inline Policies in the AWS Identity and
Access Management User Guide.
If you also have principals that connect using JDBC, you must provide the JDBC driver credentials to your
application. For more information, see Service Actions for JDBC Connections (p. 274).
If you use AWS Glue with Athena, and have encrypted the AWS Glue Data Catalog, you must specify
additional actions in the identity-based IAM policies for Athena. For more information, see Access to
Encrypted Metadata in the AWS Glue Data Catalog (p. 281).
Important
If you create and use workgroups, make sure your policies include appropriate access to
workgroup actions. For detailed information, see the section called “ IAM Policies for Accessing
Workgroups” (p. 361) and the section called “Workgroup Example Policies” (p. 362).
AmazonAthenaFullAccess Managed Policy
Managed policy contents change, so the policy shown here may be out-of-date. Check the IAM console
for the most up-to-date policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:*"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::athena-examples*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListAllMyBuckets"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"sns:ListTopics",
"sns:GetTopicAttributes"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricAlarm",
"cloudwatch:DescribeAlarms",
"cloudwatch:DeleteAlarms"
],
"Resource": [
"*"
]
}
]
}
AWSQuicksightAthenaAccess Managed Policy
Managed policy contents change, so the policy shown here may be out-of-date. Check the IAM console
for the most up-to-date policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:CancelQueryExecution",
"athena:GetCatalogs",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetNamespaces",
"athena:GetQueryExecution",
"athena:GetQueryExecutions",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:GetTable",
"athena:GetTables",
"athena:ListQueryExecutions",
"athena:RunQuery",
"athena:StartQueryExecution",
"athena:StopQueryExecution"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
}
]
}
Access through JDBC and ODBC Connections
For information about the latest versions of the JDBC and ODBC drivers and their documentation, see
Using Athena with the JDBC Driver (p. 83) and Connecting to Amazon Athena with ODBC (p. 85).
Access to Amazon S3
You can grant access to Amazon S3 locations using identity-based policies, bucket resource policies, or
both.
For detailed information and examples about how to grant Amazon S3 access, see the following
resources:
• Example Walkthroughs: Managing Access in the Amazon Simple Storage Service Developer Guide.
• How can I provide cross-account access to objects that are in Amazon S3 buckets? in the AWS
Knowledge Center.
• Cross-account Access in Athena to Amazon S3 Buckets (p. 282).
Note
Athena does not support restricting or allowing access to Amazon S3 resources based on the
aws:SourceIp condition key.
Fine-Grained Access to Databases and Tables in the AWS Glue Data Catalog
• Create an IAM policy that defines fine-grained access to resources – See Creating IAM Policies in the
AWS Identity and Access Management User Guide.
• Learn about IAM identity-based policies used in AWS Glue – See Identity-Based Policies (IAM Policies)
in the AWS Glue Developer Guide.
Limitations
Consider the following limitations when using fine-grained access control with the AWS Glue Data
Catalog and Athena:
• You can limit access only to databases and tables. Fine-grained access controls apply at the table level
and you cannot limit access to individual partitions within a table. For more information, see Table
Partitions and Versions in AWS Glue (p. 277).
• Athena does not support cross-account access to the AWS Glue Data Catalog.
• The AWS Glue Data Catalog contains the following resources: CATALOG, DATABASE, TABLE, and
FUNCTION.
Note
From this list, resources that are common between Athena and the AWS Glue Data Catalog
are TABLE, DATABASE, and CATALOG for each account. Function is specific to AWS Glue. For
delete actions in Athena, you must include permissions to AWS Glue actions. See Fine-Grained
Policy Examples (p. 277).
The hierarchy is as follows: CATALOG is an ancestor of all DATABASES in each account, and each
DATABASE is an ancestor for all of its TABLES and FUNCTIONS. For example, for a table named
table_test that belongs to a database db in the catalog in your account, its ancestors are db and
the catalog in your account. For the db database, its ancestor is the catalog in your account, and
its descendants are tables and functions. For more information about the hierarchical structure of
resources, see List of ARNs in Data Catalog in the AWS Glue Developer Guide.
• For any non-delete Athena action on a resource, such as CREATE DATABASE, CREATE TABLE, SHOW
DATABASE, SHOW TABLE, or ALTER TABLE, you need permissions to call this action on the resource
(table or database) and all ancestors of the resource in the Data Catalog. For example, for a table, its
ancestors are the database to which it belongs, and the catalog for the account. For a database, its
ancestor is the catalog for the account. See Fine-Grained Policy Examples (p. 277).
• For a delete action in Athena, such as DROP DATABASE or DROP TABLE, you also need permissions
to call the delete action on all ancestors and descendants of the resource in the Data Catalog.
For example, to delete a database you need permissions on the database, the catalog, which is its
ancestor, and all the tables and user defined functions, which are its descendants. A table does not
have descendants. To run DROP TABLE, you need permissions to this action on the table, the database
to which it belongs, and the catalog. See Fine-Grained Policy Examples (p. 277).
• When limiting access to a specific database in the Data Catalog, you must also specify the access policy
to the default database and catalog for each AWS Region for GetDatabase and CreateDatabase
actions. If you use Athena in more than one Region, add a separate line to the policy for the resource
ARN for each default database and catalog in each Region.
For example, to allow GetDatabase access to example_db in the us-east-1 (N.Virginia) Region,
also include the default database and catalog in the policy for that Region for two actions:
GetDatabase and CreateDatabase:
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default",
"arn:aws:glue:us-east-1:123456789012:database/example_db"
]
}
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
}
For the purposes of fine-grained access control, the following access permissions apply:
• Fine-grained access controls apply at the table level. You can limit access only to databases and tables.
For example, if you allow access to a partitioned table, this access applies to all partitions in the table.
You cannot limit access to individual partitions within a table.
Important
Having access to all partitions within a table is not sufficient if you need to run actions in
AWS Glue on partitions. To run actions on partitions, you need permissions for those actions.
For example, to run GetPartitions on table myTable in the database myDB, you need
permissions for the action glue:GetPartitions in the Data Catalog, the myDB database,
and myTable.
• Fine-grained access controls do not apply to table versions. As with partitions, access to previous
versions of a table is granted through access to the table version APIs in AWS Glue on the table, and to
the table ancestors.
For information about permissions on AWS Glue actions, see AWS Glue API Permissions: Actions and
Resources Reference in the AWS Glue Developer Guide.
These examples include the access policy to the default database and catalog, for GetDatabase and
CreateDatabase actions. This policy is required for Athena and the AWS Glue Data Catalog to work
together. For multiple AWS Regions, include this policy for each of the default databases and their
catalogs, one line for each Region.
In addition, replace the example_db database and test table names with the names for your databases
and tables.
DDL Statement Example of an IAM access policy granting access to the resource
CREATE DATABASE Allows you to create the example_db database.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default",
"arn:aws:glue:us-east-1:123456789012:database/example_db"
]
}
ALTER DATABASE Allows you to modify the properties for the example_db database.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:UpdateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db"
]
}
DROP DATABASE Allows you to drop the example_db database, including all tables in it.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:DeleteDatabase",
"glue:GetTables",
"glue:GetTable",
"glue:DeleteTable"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:table/example_db/*",
"arn:aws:glue:us-east-1:123456789012:userDefinedFunction/
example_db/*"
]
}
SHOW DATABASES Allows you to list all databases in the AWS Glue Data Catalog.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabases"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/*"
]
}
CREATE TABLE Allows you to create a table named test in the example_db database.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetTable",
"glue:CreateTable"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:table/example_db/test"
]
}
SHOW TABLES Allows you to list all tables in the example_db database.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetTables"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:table/example_db/*"
]
}
DROP TABLE Allows you to drop a partitioned table named test in the example_db
database. If your table does not have partitions, do not include partition
actions.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetTable",
"glue:DeleteTable",
"glue:GetPartitions",
"glue:GetPartition",
"glue:DeletePartition"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:table/example_db/test"
]
}
Access to Encrypted Metadata in the AWS Glue Data Catalog
If the AWS Glue Data Catalog is encrypted, you must add the following actions to all policies that are
used to access Athena:
{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": [
"kms:GenerateDataKey",
"kms:Decrypt",
"kms:Encrypt"
],
"Resource": "(arn of key being used to encrypt the catalog)"
}
}
Cross-account Access in Athena to Amazon S3 Buckets
The following example bucket policy, created and applied to bucket s3://my-athena-data-bucket
by the bucket owner, grants access to all users in account 123456789123, which is a different account.
{
"Version": "2012-10-17",
"Id": "MyPolicyID",
"Statement": [
{
"Sid": "MyStatementSid",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789123:root"
},
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::my-athena-data-bucket",
"arn:aws:s3:::my-athena-data-bucket/*"
]
}
]
}
To grant access to a particular user in an account, replace the Principal key with
a key that specifies the user instead of root. For example, for user profile Dave, use
arn:aws:iam::123456789123:user/Dave.
Granting access to an AWS KMS-encrypted bucket in Account A to a user in Account B requires the
following permissions:
• From Account A, review the S3 bucket policy and confirm that there is a statement that allows
access from the account ID of Account B.
For example, the following bucket policy allows s3:GetObject access to the account ID
111122223333:
{
"Id": "ExamplePolicy1",
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ExampleStmt1",
"Action": [
"s3:GetObject"
],
"Effect": "Allow",
"Resource": "arn:aws:s3:::awsexamplebucket/*",
"Principal": {
"AWS": [
"111122223333"
]
}
}
]
}
To grant access to the user in Account B from the AWS KMS key policy in Account A
1. In the AWS KMS key policy for Account A, grant the user in Account B permissions to the following
actions:
• kms:Encrypt
• kms:Decrypt
• kms:ReEncrypt*
• kms:GenerateDataKey*
• kms:DescribeKey
The following example grants key access to only one IAM user or role.
{
"Sid": "Allow use of the key",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::111122223333:role/role_name",
]
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
"kms:DescribeKey"
],
"Resource": "*"
2. From Account A, review the key policy using the AWS Management Console policy view.
3. In the key policy, verify that the following statement lists Account B as a principal.
4. If the "Sid": "Allow use of the key" statement is not present, perform the following steps:
a. Switch to view the key policy using the console default view.
b. Add Account B's account ID as an external account with access to the key.
To grant access to the bucket and the key in Account A from the IAM User Policy in Account B
The following example statement grants the IAM user access to the s3:GetObject and
s3:PutObject operations on the bucket awsexamplebucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ExampleStmt2",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Effect": "Allow",
"Resource": "arn:aws:s3:::awsexamplebucket/*"
}
]
}
The following example statement grants the IAM user access to use the key
arn:aws:kms:example-region-1:123456789098:key/111aa2bb-333c-4d44-5555-a111bb2c33dd.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ExampleStmt3",
"Action": [
"kms:Decrypt",
"kms:DescribeKey",
"kms:Encrypt",
"kms:GenerateDataKey",
"kms:ReEncrypt*"
],
"Effect": "Allow",
"Resource": "arn:aws:kms:example-
region-1:123456789098:key/111aa2bb-333c-4d44-5555-a111bb2c33dd"
}
]
}
For instructions on how to add or correct the IAM user's permissions, see Changing Permissions for an
IAM User.
"Resource": [arn:aws:athena:region:AWSAcctID:workgroup/workgroup-name]
For example, for a workgroup named test_workgroup in the us-west-2 region for AWS account
123456789012, specify the workgroup as a resource using the following ARN:
"Resource":["arn:aws:athena:us-east-2:123456789012:workgroup/test_workgroup"]
• For a list of workgroup policies, see the section called “Workgroup Example Policies” (p. 362).
• For a list of tag-based policies for workgroups, see Tag-Based IAM Access Control Policies (p. 390).
• For more information about creating IAM policies for workgroups, see Workgroup IAM
Policies (p. 361).
• For a complete list of Amazon Athena actions, see the API action names in the Amazon Athena API
Reference.
• For more information about IAM policies, see Creating Policies with the Visual Editor in the IAM User
Guide.
Using Athena with CalledVia Context Keys
When a principal makes a request to an AWS service, that service might use the principal's credentials
to make subsequent requests to other services. The aws:CalledVia key contains an ordered list of each
service in the chain that made requests on the principal's behalf.
By specifying a service principal name for the aws:CalledVia context key, you can make the context
key AWS service-specific. For example, you can use the aws:CalledVia condition key to limit requests
to only those made from Athena. To use the aws:CalledVia condition key in a policy with Athena, you
specify the Athena service principal name athena.amazonaws.com, as in the following example.
...
"Condition": {
"ForAnyValue:StringEquals": {
"aws:CalledVia": "athena.amazonaws.com"
}
}
...
You can use the aws:CalledVia context key to ensure that callers only have access to a resource (like a
Lambda function) if they call the resource from Athena.
{
"Sid": "VisualEditor3",
"Effect": "Allow",
"Action": "lambda:InvokeFunction",
"Resource": "arn:aws:lambda:us-east-1:MyAWSAcctId:function:OneAthenaLambdaFunction",
"Condition": {
"ForAnyValue:StringEquals": {
"aws:CalledVia": "athena.amazonaws.com"
}
}
}
The following example shows the addition of the previous statement to a policy that allows a user to run
and read a federated query. Principals who are allowed to perform these actions can run queries that
specify Athena catalogs associated with a federated data source. However, the principal cannot access
the associated Lambda function unless the function is invoked through Athena.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"athena:GetWorkGroup",
"s3:PutObject",
"s3:GetObject",
"athena:StartQueryExecution",
"s3:AbortMultipartUpload",
"athena:CancelQueryExecution",
"athena:StopQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"s3:ListMultipartUploadParts"
],
"Resource": [
"arn:aws:athena:*:MyAWSAcctId:workgroup/WorkGroupName",
"arn:aws:s3:::MyQueryResultsBucket/*",
"arn:aws:s3:::MyLambdaSpillBucket/MyLambdaSpillPrefix*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": "athena:ListWorkGroups",
"Resource": "*"
},
{
"Sid": "VisualEditor2",
"Effect": "Allow",
"Action":
[
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": "arn:aws:s3:::MyLambdaSpillBucket"
},
{
"Sid": "VisualEditor3",
"Effect": "Allow",
"Action": "lambda:InvokeFunction",
"Resource": [
"arn:aws:lambda:*:MyAWSAcctId:function:OneAthenaLambdaFunction",
"arn:aws:lambda:*:MyAWSAcctId:function:AnotherAthenaLambdaFunction"
],
"Condition": {
"ForAnyValue:StringEquals": {
"aws:CalledVia": "athena.amazonaws.com"
}
}
}
]
}
For more information about CalledVia condition keys, see AWS global condition context keys in the
IAM User Guide.
Allow Access to an Athena Data Connector for External Hive Metastore
• Example Policy to Allow an IAM Principal to Query Data Using Athena Data Connector for External
Hive Metastore (p. 287)
• Example Policy to Allow an IAM Principal to Create an Athena Data Connector for External Hive
Metastore (p. 289)
Example – Allow an IAM Principal to Query Data Using Athena Data Connector for External
Hive Metastore
The following policy is attached to IAM principals in addition to the AmazonAthenaFullAccess Managed
Policy (p. 271), which grants full access to Athena actions.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"lambda:GetFunction",
"lambda:GetLayerVersion",
"lambda:InvokeFunction"
],
"Resource": [
"arn:aws:lambda:*:MyAWSAcctId:function:MyAthenaLambdaFunction",
"arn:aws:lambda:*:MyAWSAcctId:function:AnotherAthenaLambdaFunction",
"arn:aws:lambda:*:MyAWSAcctId:layer:MyAthenaLambdaLayer:*"
]
},
{
"Sid": "VisualEditor2",
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:PutObject",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload"
],
"Resource": "arn:aws:s3:::MyLambdaSpillBucket/MyLambdaSpillLocation"
}
]
}
Explanation of Permissions
Example – Allow an IAM Principal to Create an Athena Data Connector for External Hive
Metastore
The following policy is attached to IAM principals in addition to the AmazonAthenaFullAccess Managed
Policy (p. 271), which grants full access to Athena actions.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"lambda:GetFunction",
"lambda:ListFunctions",
"lambda:GetLayerVersion",
"lambda:InvokeFunction",
"lambda:CreateFunction",
"lambda:DeleteFunction",
"lambda:PublishLayerVersion",
"lambda:DeleteLayerVersion",
"lambda:UpdateFunctionConfiguration",
"lambda:PutFunctionConcurrency",
"lambda:DeleteFunctionConcurrency"
],
"Resource": "arn:aws:lambda:*:MyAWSAcctId:
function: MyAthenaLambdaFunctionsPrefix*"
}
]
}
Explanation of Permissions
Allows queries to invoke the AWS Lambda functions for the AWS
Lambda functions specified in the Resource block. For example,
arn:aws:lambda:*:MyAWSAcctId:function:MyAthenaLambdaFunction, where
MyAthenaLambdaFunction specifies the name of a Lambda function to be invoked. Multiple functions
can be specified as shown in the example.
For example, the following policy defines the permission for the spill location s3://mybucket/spill.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::mybucket/spill"
]
}
]
}
Because Athena uses the AWS Serverless Application Repository to create Lambda functions, the
superuser or administrator who creates Lambda functions should also have IAM policies to allow Athena
federated queries (p. 293).
{
"Effect": "Allow",
"Action": [
"athena:ListDataCatalogs",
"athena:GetDataCatalog",
"athena:CreateDataCatalog",
"athena:UpdateDataCatalog",
"athena:DeleteDataCatalog",
"athena:GetDatabase",
"athena:ListDatabases",
"athena:GetTableMetadata",
"athena:ListTableMetadata"
],
"Resource": [
"*"
]
}
Cross Region Lambda Invocation
For example, suppose you define the catalog ehms on the Europe (Frankfurt) Region eu-central-1 to
use the following Lambda function in the US East (N. Virginia) Region.
arn:aws:lambda:us-east-1:111122223333:function:external-hms-service-new
When you specify the full ARN in this way, Athena can call the external-hms-service-new Lambda
function on us-east-1 to fetch the Hive metastore data from eu-central-1.
Note
The catalog ehms should be registered in the same Region in which you run Athena queries.
Athena uses the AWS Lambda support for cross account access to enable cross account access for Hive
Metastores.
Note
Cross account access for Athena normally implies cross account access for both the metadata
and the data in Amazon S3.
To check the Lambda permission, use the get-policy command, as in the following example. The
command has been formatted for readability.
\"Action\":\"lambda:InvokeFunction\",
\"Resource\":\"arn:aws:lambda:us-
east-1:111122223333:function:external-hms-service-new\"}]}"
}
After adding the permission, you can use a full ARN of the Lambda function on us-east-1 like the
following when you define catalog ehms:
arn:aws:lambda:us-east-1:111122223333:function:external-hms-service-new
For information about cross region invocation, see Cross Region Lambda Invocation (p. 290) earlier in
this topic.
• Update the access control list policy of the Amazon S3 bucket with a canonical user ID.
• Add cross account access to the Amazon S3 bucket policy.
For example, add the following policy to the Amazon S3 bucket policy in the account 111122223333 to
allow account 444455556666 to read data from the Amazon S3 location specified.
{
"Sid": "Stmt1234567890123",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::444455556666:user/perf1-test"
},
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::athena-test/lambda/dataset/*"
}
Note
You might need to grant cross account access to Amazon S3 not only to your data, but also to
your Amazon S3 spill location. Your Lambda function spills extra data to the spill location when
the size of the response object exceeds a given threshold. See the beginning of this topic for a
sample policy.
In the current example, after cross account access is granted to 444455556666, 444455556666 can use
catalog ehms in its own account to query tables that are defined in account 111122223333.
In the following example, the SQL Workbench profile perf-test-1 is for account 444455556666.
The query uses catalog ehms to access the Hive metastore and the Amazon S3 data in account
111122223333.
Example IAM Permissions Policies to Allow Athena Federated Query
For information about attaching policies to IAM identities, see Adding and Removing IAM Identity
Permissions in the IAM User Guide.
• Example Policy to Allow an IAM Principal to Run and Return Results Using Athena Federated
Query (p. 293)
• Example Policy to Allow an IAM Principal to Create a Data Source Connector (p. 294)
Example – Allow an IAM Principal to Run and Return Results Using Athena Federated Query
The following identity-based permissions policy allows actions that a user or other IAM principal requires
to use Athena Federated Query. Principals who are allowed to perform these actions are able to run
queries that specify Athena catalogs associated with a federated data source.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"athena:GetWorkGroup",
"s3:PutObject",
"s3:GetObject",
"athena:StartQueryExecution",
"s3:AbortMultipartUpload",
"lambda:InvokeFunction",
"athena:CancelQueryExecution",
"athena:StopQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"s3:ListMultipartUploadParts"
],
"Resource": [
"arn:aws:athena:*:MyAWSAcctId:workgroup/WorkgroupName",
"arn:aws:s3:::MyQueryResultsBucket/*",
"arn:aws:s3:::MyLambdaSpillBucket/MyLambdaSpillPrefix*",
"arn:aws:lambda:*:MyAWSAcctId:function:OneAthenaLambdaFunction",
"arn:aws:lambda:*:MyAWSAcctId:function:AnotherAthenaLambdaFunction"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": "athena:ListWorkGroups",
"Resource": "*"
},
{
"Sid": "VisualEditor2",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": "arn:aws:s3:::MyLambdaSpillBucket"
}
]
}
Explanation of Permissions
Allowed Actions: "s3:PutObject", "s3:GetObject", "s3:AbortMultipartUpload"
Explanation: s3:PutObject and s3:AbortMultipartUpload allow writing query results to all sub-folders of
the query results bucket as specified by the arn:aws:s3:::MyQueryResultsBucket/* resource identifier,
where MyQueryResultsBucket is the Athena query results bucket. For more information, see Working
with Query Results, Output Files, and Query History (p. 122).
Example – Allow an IAM Principal to Create a Data Source Connector
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"lambda:CreateFunction",
"lambda:ListVersionsByFunction",
"iam:CreateRole",
"lambda:GetFunctionConfiguration",
"iam:AttachRolePolicy",
"iam:PutRolePolicy",
"lambda:PutFunctionConcurrency",
"iam:PassRole",
"iam:DetachRolePolicy",
"lambda:ListTags",
"iam:ListAttachedRolePolicies",
"iam:DeleteRolePolicy",
"lambda:DeleteFunction",
"lambda:GetAlias",
"iam:ListRolePolicies",
"iam:GetRole",
"iam:GetPolicy",
"lambda:InvokeFunction",
"lambda:GetFunction",
"lambda:ListAliases",
"lambda:UpdateFunctionConfiguration",
"iam:DeleteRole",
"lambda:UpdateFunctionCode",
"s3:GetObject",
"lambda:AddPermission",
"iam:UpdateRole",
"lambda:DeleteFunctionConcurrency",
"lambda:RemovePermission",
"iam:GetRolePolicy",
"lambda:GetPolicy"
],
"Resource": [
"arn:aws:lambda:*:MyAWSAcctId:function:MyAthenaLambdaFunctionsPrefix*",
"arn:aws:s3:::awsserverlessrepo-changesets-1iiv3xa62ln3m/*",
"arn:aws:iam::*:role/*",
"arn:aws:iam::MyAWSAcctId:policy/*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"cloudformation:CreateUploadBucket",
"cloudformation:DescribeStackDriftDetectionStatus",
"cloudformation:ListExports",
"cloudformation:ListStacks",
"cloudformation:ListImports",
"lambda:ListFunctions",
"iam:ListRoles",
"lambda:GetAccountSettings",
"ec2:DescribeSecurityGroups",
"cloudformation:EstimateTemplateCost",
"ec2:DescribeVpcs",
"lambda:ListEventSourceMappings",
"cloudformation:DescribeAccountLimits",
"ec2:DescribeSubnets",
"cloudformation:CreateStackSet",
"cloudformation:ValidateTemplate"
],
"Resource": "*"
},
{
"Sid": "VisualEditor2",
"Effect": "Allow",
"Action": "cloudformation:*",
"Resource": [
"arn:aws:cloudformation:*:MyAWSAcctId:stack/aws-serverless-
repository-MyCFStackPrefix*/*",
"arn:aws:cloudformation:*:MyAWSAcctId:stack/
serverlessrepo-MyCFStackPrefix*/*",
"arn:aws:cloudformation:*:*:transform/Serverless-*",
"arn:aws:cloudformation:*:MyAWSAcctId:stackset/aws-serverless-
repository-MyCFStackPrefix*:*",
"arn:aws:cloudformation:*:MyAWSAcctId:stackset/
serverlessrepo-MyCFStackPrefix*:*"
]
},
{
"Sid": "VisualEditor3",
"Effect": "Allow",
"Action": "serverlessrepo:*",
"Resource": "arn:aws:serverlessrepo:*:*:applications/*"
}
]
}
Explanation of Permissions
Example IAM Permissions Policies to Allow Amazon Athena User Defined Functions (UDF)
• Example Policy to Allow an IAM Principal to Run and Return Queries that Contain an Athena UDF
Statement (p. 297)
• Example Policy to Allow an IAM Principal to Create an Athena UDF (p. 298)
Example – Allow an IAM Principal to Run and Return Queries that Contain an Athena UDF
Statement
The following identity-based permissions policy allows actions that a user or other IAM principal requires
to run queries that use Athena UDF statements.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"lambda:InvokeFunction",
"athena:GetQueryResults",
"s3:ListMultipartUploadParts",
"athena:GetWorkGroup",
"s3:PutObject",
"s3:GetObject",
"s3:AbortMultipartUpload",
"athena:CancelQueryExecution",
"athena:StopQueryExecution",
"athena:GetQueryExecution",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:athena:*:MyAWSAcctId:workgroup/AmazonAthenaPreviewFunctionality",
"arn:aws:s3:::MyQueryResultsBucket/*",
"arn:aws:lambda:*:MyAWSAcctId:function:OneAthenaLambdaFunction",
"arn:aws:lambda:*:MyAWSAcctId:function:AnotherAthenaLambdaFunction"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": "athena:ListWorkGroups",
"Resource": "*"
}
]
}
Explanation of Permissions
Allowed Actions: "s3:PutObject", "s3:GetObject", "s3:AbortMultipartUpload"
Explanation: s3:PutObject and s3:AbortMultipartUpload allow writing query results to all sub-folders of
the query results bucket as specified by the arn:aws:s3:::MyQueryResultsBucket/* resource identifier,
where MyQueryResultsBucket is the Athena query results bucket. For more information, see Working
with Query Results, Output Files, and Query History (p. 122).
Example – Allow an IAM Principal to Create an Athena UDF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"lambda:CreateFunction",
"lambda:ListVersionsByFunction",
"iam:CreateRole",
"lambda:GetFunctionConfiguration",
"iam:AttachRolePolicy",
"iam:PutRolePolicy",
"lambda:PutFunctionConcurrency",
"iam:PassRole",
"iam:DetachRolePolicy",
"lambda:ListTags",
"iam:ListAttachedRolePolicies",
"iam:DeleteRolePolicy",
"lambda:DeleteFunction",
"lambda:GetAlias",
"iam:ListRolePolicies",
"iam:GetRole",
"iam:GetPolicy",
"lambda:InvokeFunction",
"lambda:GetFunction",
"lambda:ListAliases",
"lambda:UpdateFunctionConfiguration",
"iam:DeleteRole",
"lambda:UpdateFunctionCode",
"s3:GetObject",
"lambda:AddPermission",
"iam:UpdateRole",
"lambda:DeleteFunctionConcurrency",
"lambda:RemovePermission",
"iam:GetRolePolicy",
"lambda:GetPolicy"
],
"Resource": [
"arn:aws:lambda:*:MyAWSAcctId:function:MyAthenaLambdaFunctionsPrefix*",
"arn:aws:s3:::awsserverlessrepo-changesets-1iiv3xa62ln3m/*",
"arn:aws:iam::*:role/*",
"arn:aws:iam::MyAWSAcctId:policy/*"
]
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"cloudformation:CreateUploadBucket",
"cloudformation:DescribeStackDriftDetectionStatus",
"cloudformation:ListExports",
"cloudformation:ListStacks",
"cloudformation:ListImports",
"lambda:ListFunctions",
"iam:ListRoles",
"lambda:GetAccountSettings",
"ec2:DescribeSecurityGroups",
"cloudformation:EstimateTemplateCost",
"ec2:DescribeVpcs",
"lambda:ListEventSourceMappings",
"cloudformation:DescribeAccountLimits",
"ec2:DescribeSubnets",
"cloudformation:CreateStackSet",
"cloudformation:ValidateTemplate"
],
"Resource": "*"
},
{
"Sid": "VisualEditor2",
"Effect": "Allow",
"Action": "cloudformation:*",
"Resource": [
"arn:aws:cloudformation:*:MyAWSAcctId:stack/aws-serverless-
repository-MyCFStackPrefix*/*",
299
Amazon Athena User Guide
Allow Access to Athena UDF
"arn:aws:cloudformation:*:MyAWSAcctId:stack/
serverlessrepo-MyCFStackPrefix*/*",
"arn:aws:cloudformation:*:*:transform/Serverless-*",
"arn:aws:cloudformation:*:MyAWSAcctId:stackset/aws-serverless-
repository-MyCFStackPrefix*:*",
"arn:aws:cloudformation:*:MyAWSAcctId:stackset/
serverlessrepo-MyCFStackPrefix*:*"
]
},
{
"Sid": "VisualEditor3",
"Effect": "Allow",
"Action": "serverlessrepo:*",
"Resource": "arn:aws:serverlessrepo:*:*:applications/*"
}
]
}
Explanation of Permissions
Allowing Access for ML with Athena (Preview)
{
"Effect": "Allow",
"Action": [
"sagemaker:invokeEndpoint"
],
"Resource": "arn:aws:sagemaker:us-west-2:123456789012:workteam/public-crowd/
default"
}
To authenticate users in this scenario, use the JDBC or ODBC driver with SAML 2.0 support to access
Active Directory Federation Services (ADFS) 3.0 and enable a client application to call Athena API
operations.
For more information about SAML 2.0 support on AWS, see About SAML 2.0 Federation in the IAM User
Guide.
Note
Federated access to the Athena API is supported for a particular type of identity provider
(IdP), Active Directory Federation Services (ADFS 3.0), which is part of Windows Server.
Access is established through the versions of JDBC or ODBC drivers that support SAML 2.0. For
information, see Using Athena with the JDBC Driver (p. 83) and Connecting to Amazon Athena
with ODBC (p. 85).
Topics
• Before You Begin (p. 301)
• Architecture Diagram (p. 302)
• Procedure: SAML-based Federated Access to the Athena API (p. 302)
• Inside your organization, install and configure ADFS 3.0 as your IdP.
• Install and configure the latest available versions of JDBC or ODBC drivers on clients that are used to
access Athena. The driver must include support for federated access compatible with SAML 2.0. For
information, see Using Athena with the JDBC Driver (p. 83) and Connecting to Amazon Athena with
ODBC (p. 85).
Architecture Diagram
The following diagram illustrates this process.
1. A user in your organization uses a client application with the JDBC or ODBC driver to request
authentication from your organization's IdP. The IdP is ADFS 3.0.
2. The IdP authenticates the user against Active Directory, which is your organization's Identity Store.
3. The IdP constructs a SAML assertion with information about the user and sends the assertion to the
client application via the JDBC or ODBC driver.
4. The JDBC or ODBC driver calls the AWS Security Token Service AssumeRoleWithSAML API operation,
passing it the following parameters:
• The ARN of the SAML provider
• The ARN of the role to assume
• The SAML assertion from the IdP
For more information, see AssumeRoleWithSAML, in the AWS Security Token Service API Reference.
5. The API response to the client application via the JDBC or ODBC driver includes temporary security
credentials.
6. The client application uses the temporary security credentials to call Athena API operations, allowing
your users to access Athena API operations.
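In practice the JDBC or ODBC driver performs this exchange for you, but the credential flow in steps 4
through 6 can also be sketched directly with the AWS SDK for Python (Boto3). In the following sketch, the
provider ARN, role ARN, and SAML assertion are hypothetical placeholders; a real assertion is the base64-
encoded response produced by your IdP.

import boto3

# Hypothetical ARNs; the assertion is the base64-encoded SAML response from the IdP (step 3).
SAML_PROVIDER_ARN = "arn:aws:iam::111122223333:saml-provider/MyADFSProvider"
ROLE_ARN = "arn:aws:iam::111122223333:role/MyAthenaFederationRole"
saml_assertion = "<base64-encoded SAML assertion from the IdP>"

# Step 4: exchange the SAML assertion for temporary security credentials.
sts = boto3.client("sts")
response = sts.assume_role_with_saml(
    RoleArn=ROLE_ARN,
    PrincipalArn=SAML_PROVIDER_ARN,
    SAMLAssertion=saml_assertion,
)
creds = response["Credentials"]

# Steps 5 and 6: use the temporary credentials to call Athena API operations.
athena = boto3.client(
    "athena",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(athena.list_work_groups())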
1. In your organization, register AWS as a service provider (SP) in your IdP. This process is known as
creating a relying party trust. For more information, see Configuring your SAML 2.0 IdP with Relying
Party Trust in the IAM User Guide. As part of this task, perform these steps:
a. Obtain the sample SAML metadata document from this URL: https://signin.aws.amazon.com/
static/saml-metadata.xml.
b. In your organization's IdP (ADFS), generate an equivalent metadata XML file that describes your
IdP as an identity provider to AWS. Your metadata file must include the issuer name, creation
date, expiration date, and keys that AWS uses to validate authentication responses (assertions)
from your organization.
2. In the IAM console, create a SAML identity provider entity. For more information, see Creating SAML
Identity Providers in the IAM User Guide.
3. In the IAM console, create one or more IAM roles for SAML 2.0 federation. The role is what federated
users from your organization assume when they call Athena. As part of this step, do the following:
• In the role's permission policy, list actions that users from your organization are allowed to do in
AWS.
• In the role's trust policy, set the SAML provider entity that you created in Step 2 of this procedure
as the principal.
4. In your organization's IdP (ADFS), define the mapping between your users or groups and the IAM
roles. For information about configuring the mapping in ADFS, see the blog post: Enabling Federation to
AWS Using Windows Active Directory, ADFS, and SAML 2.0.
5. Install and configure the JDBC or ODBC driver with SAML 2.0 support. For information, see Using
Athena with the JDBC Driver (p. 83) and Connecting to Amazon Athena with ODBC (p. 85).
6. Specify the connection string from your application to the JDBC or ODBC driver. For information
about the connection string that your application should use, see the topic "Using the Active
Directory Federation Services (ADFS) Credentials Provider" in the JDBC Driver Installation and
Configuration Guide, or a similar topic in the ODBC Driver Installation and Configuration Guide
available as PDF downloads from the Using Athena with the JDBC Driver (p. 83) and Connecting to
Amazon Athena with ODBC (p. 85) topics.
7. In the preferred_role parameter of the connection, specify the IAM role (ARN) to assume for the
driver connection. Specifying the preferred_role is optional, and is useful if the role is not the first role
listed in the claim rule.
1. The JDBC or ODBC driver calls the AWS STS AssumeRoleWithSAML API, and passes it the
assertions, as shown in step 4 of the architecture diagram (p. 302).
2. AWS makes sure that the request to assume the role comes from the IdP referenced in the SAML
provider entity.
3. If the request is successful, the AWS STS AssumeRoleWithSAML API operation returns a set of
temporary security credentials, which your client application uses to make signed requests to
Athena.
Your application now has information about the current user and can access Athena
programmatically.
Logging and Monitoring
• Monitor Athena with AWS CloudTrail – AWS CloudTrail provides a record of actions taken by a
user, role, or an AWS service in Athena. It captures calls from the Athena console and code calls
to the Athena API operations as events. This allows you to determine the request that was made
to Athena, the IP address from which the request was made, who made the request, when it was
made, and additional details. For more information, see Logging Amazon Athena API Calls with AWS
CloudTrail (p. 305).
You can also use Athena to query the CloudTrail log files not only for Athena, but for other
AWS services. For more information, see Querying AWS CloudTrail Logs (p. 229) and CloudTrail
SerDe (p. 413).
• Monitor Athena usage with CloudTrail and Amazon QuickSight – Amazon QuickSight is a fully
managed, cloud-powered business intelligence service that lets you create interactive dashboards
your organization can access from any device. For an example of a solution that uses CloudTrail and
Amazon QuickSight to monitor Athena usage, see the AWS Big Data blog post How Realtor.com
Monitors Amazon Athena Usage with AWS CloudTrail and Amazon QuickSight.
• Use CloudWatch Events with Athena – CloudWatch Events delivers a near real-time stream of system
events that describe changes in AWS resources. CloudWatch Events becomes aware of operational
changes as they occur, responds to them, and takes corrective action as necessary, by sending
messages to respond to the environment, activating functions, making changes, and capturing state
information. To use CloudWatch Events with Athena, create a rule that triggers on an Athena API call
via CloudTrail, as in the sketch that follows this list. For more information, see Creating a CloudWatch
Events Rule That Triggers on an AWS API Call Using CloudTrail in the Amazon CloudWatch Events User Guide.
• Use workgroups to separate users, teams, applications, or workloads, and to set query limits and
control query costs – You can view query-related metrics in Amazon CloudWatch, control query costs
by configuring limits on the amount of data scanned, create thresholds, and trigger actions, such as
Amazon SNS alarms, when these thresholds are breached. For a high-level procedure, see Setting up
Workgroups (p. 360). Use resource-level IAM permissions to control access to a specific workgroup.
For more information, see Using Workgroups for Running Queries (p. 358) and Controlling Costs and
Monitoring Queries with CloudWatch Metrics and Events (p. 375).
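The following sketch creates such a rule with the AWS SDK for Python (Boto3). The rule name and the SNS
topic ARN used as a target are hypothetical placeholders; the event pattern matches Athena
StartQueryExecution calls recorded by CloudTrail.

import json
import boto3

events = boto3.client("events")

# Match Athena StartQueryExecution API calls recorded by CloudTrail.
pattern = {
    "source": ["aws.athena"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["athena.amazonaws.com"],
        "eventName": ["StartQueryExecution"],
    },
}

events.put_rule(
    Name="athena-start-query-rule",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Send matching events to a target, for example an SNS topic or a Lambda function.
events.put_targets(
    Rule="athena-start-query-rule",
    Targets=[{"Id": "1", "Arn": "arn:aws:sns:us-east-1:111122223333:athena-activity"}],
)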
Topics
• Logging Amazon Athena API Calls with AWS CloudTrail (p. 305)
Logging Amazon Athena API Calls with AWS CloudTrail
CloudTrail captures all API calls for Athena as events. The calls captured include calls from the Athena
console and code calls to the Athena API operations. If you create a trail, you can enable continuous
delivery of CloudTrail events to an Amazon S3 bucket, including events for Athena. If you don't configure
a trail, you can still view the most recent events in the CloudTrail console in Event history.
Using the information collected by CloudTrail, you can determine the request that was made to Athena,
the IP address from which the request was made, who made the request, when it was made, and
additional details.
To learn more about CloudTrail, see the AWS CloudTrail User Guide.
You can also use Athena to query CloudTrail log files for insight. For more information, see Querying
AWS CloudTrail Logs (p. 229) and CloudTrail SerDe (p. 413).
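For example, if you have already created a CloudTrail table as described in Querying AWS CloudTrail
Logs (p. 229), you can submit a query against it programmatically. The following sketch uses the AWS SDK
for Python (Boto3); the database, table name (cloudtrail_logs), and results bucket are hypothetical
placeholders that depend on how you created the table.

import boto3

athena = boto3.client("athena")

# Find recent Athena activity in a previously created CloudTrail logs table.
response = athena.start_query_execution(
    QueryString=(
        "SELECT eventtime, eventname, sourceipaddress "
        "FROM cloudtrail_logs "
        "WHERE eventsource = 'athena.amazonaws.com' "
        "LIMIT 10"
    ),
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/cloudtrail/"},
)
print(response["QueryExecutionId"])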
For an ongoing record of events in your AWS account, including events for Athena, create a trail. A trail
enables CloudTrail to deliver log files to an Amazon S3 bucket. By default, when you create a trail in the
console, the trail applies to all AWS Regions. The trail logs events from all Regions in the AWS partition
and delivers the log files to the Amazon S3 bucket that you specify. Additionally, you can configure
other AWS services to further analyze and act upon the event data collected in CloudTrail logs. For more
information, see the following:
All Athena actions are logged by CloudTrail and are documented in the Amazon Athena API Reference.
For example, calls to the StartQueryExecution and GetQueryResults actions generate entries in the
CloudTrail log files.
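As a quick check that these entries are being recorded, you can look up recent Athena events from the
CloudTrail Event history. The following sketch uses the AWS SDK for Python (Boto3) and assumes only
that CloudTrail Event history is available in the Region you call.

import boto3

cloudtrail = boto3.client("cloudtrail")

# List the most recent events recorded for the Athena service endpoint.
response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "athena.amazonaws.com"}
    ],
    MaxResults=10,
)

for event in response["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username"))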
Every event or log entry contains information about who generated the request. The identity
information helps you determine the following:
• Whether the request was made with root or AWS Identity and Access Management (IAM) user
credentials.
• Whether the request was made with temporary security credentials for a role or federated user.
• Whether the request was made by another AWS service.
StartQueryExecution (Successful)
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-04T00:23:55Z",
"eventSource":"athena.amazonaws.com",
"eventName":"StartQueryExecution",
"awsRegion":"us-east-1",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-internal/3",
"requestParameters":{
"clientRequestToken":"16bc6e70-f972-4260-b18a-db1b623cb35c",
"resultConfiguration":{
"outputLocation":"s3://athena-johndoe-test/test/"
},
"queryString":"Select 10"
},
"responseElements":{
"queryExecutionId":"b621c254-74e0-48e3-9630-78ed857782f9"
},
"requestID":"f5039b01-305f-11e7-b146-c3fc56a7dc7a",
"eventID":"c97cf8c8-6112-467a-8777-53bb38f83fd5",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
StartQueryExecution (Failed)
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-04T00:21:57Z",
"eventSource":"athena.amazonaws.com",
"eventName":"StartQueryExecution",
"awsRegion":"us-east-1",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-internal/3",
"errorCode":"InvalidRequestException",
"errorMessage":"Invalid result configuration. Should specify either output location or
result configuration",
"requestParameters":{
"clientRequestToken":"ca0e965f-d6d8-4277-8257-814a57f57446",
"queryString":"Select 10"
},
"responseElements":null,
"requestID":"aefbc057-305f-11e7-9f39-bbc56d5d161e",
"eventID":"6e1fc69b-d076-477e-8dec-024ee51488c4",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
CreateNamedQuery
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-16T22:00:58Z",
"eventSource":"athena.amazonaws.com",
"eventName":"CreateNamedQuery",
"awsRegion":"us-west-2",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-cli/1.11.85 Python/2.7.10 Darwin/16.6.0 botocore/1.5.48",
"requestParameters":{
"name":"johndoetest",
"queryString":"select 10",
"database":"default",
"clientRequestToken":"fc1ad880-69ee-4df0-bb0f-1770d9a539b1"
},
"responseElements":{
"namedQueryId":"cdd0fe29-4787-4263-9188-a9c8db29f2d6"
},
"requestID":"2487dd96-3a83-11e7-8f67-c9de5ac76512",
"eventID":"15e3d3b5-6c3b-4c7c-bc0b-36a8dd95227b",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
Compliance Validation
For a list of AWS services in scope of specific compliance programs, see AWS Services in Scope by
Compliance Program. For general information, see AWS Compliance Programs.
You can download third-party audit reports using AWS Artifact. For more information, see Downloading
Reports in AWS Artifact.
Your compliance responsibility when using Athena is determined by the sensitivity of your data, your
company's compliance objectives, and applicable laws and regulations. AWS provides the following
resources to help with compliance:
• Security and Compliance Quick Start Guides – These deployment guides discuss architectural
considerations and provide steps for deploying security- and compliance-focused baseline
environments on AWS.
• Architecting for HIPAA Security and Compliance Whitepaper – This whitepaper describes how
companies can use AWS to create HIPAA-compliant applications.
• AWS Compliance Resources – This collection of workbooks and guides might apply to your industry
and location.
• AWS Config – This AWS service assesses how well your resource configurations comply with internal
practices, industry guidelines, and regulations.
• AWS Security Hub – This AWS service provides a comprehensive view of your security state within AWS
that helps you check your compliance with security industry standards and best practices.
Resilience in Athena
The AWS global infrastructure is built around AWS Regions and Availability Zones. AWS Regions provide
multiple physically separated and isolated Availability Zones, which are connected with low-latency,
high-throughput, and highly redundant networking. With Availability Zones, you can design and operate
applications and databases that automatically fail over between Availability Zones without interruption.
Availability Zones are more highly available, fault tolerant, and scalable than traditional single or
multiple data center infrastructures.
For more information about AWS Regions and Availability Zones, see AWS Global Infrastructure.
In addition to the AWS global infrastructure, Athena offers several features to help support your data
resiliency and backup needs.
Athena is serverless, so there is no infrastructure to set up or manage. Athena is highly available and
runs queries using compute resources across multiple Availability Zones, automatically routing queries
appropriately if a particular Availability Zone is unreachable. Athena uses Amazon S3 as its underlying
data store, making your data highly available and durable. Amazon S3 provides durable infrastructure
to store important data and is designed for durability of 99.999999999% of objects. Your data is
redundantly stored across multiple facilities and multiple devices in each facility.
You use AWS published API calls to access Athena through the network. Clients must support TLS
(Transport Layer Security) 1.0. We recommend TLS 1.2 or later. Clients must also support cipher
suites with perfect forward secrecy (PFS) such as Ephemeral Diffie-Hellman (DHE) or Elliptic Curve
Ephemeral Diffie-Hellman (ECDHE). Most modern systems such as Java 7 and later support these modes.
Additionally, requests must be signed by using an access key ID and a secret access key that is associated
with an IAM principal. Or you can use the AWS Security Token Service (AWS STS) to generate temporary
security credentials to sign requests.
Use IAM policies to restrict access to Athena operations. Athena managed policies (p. 271) are easy to
use, and are automatically updated with the required actions as the service evolves. Customer-managed
and inline policies allow you to fine tune policies by specifying more granular Athena actions within the
policy. Grant appropriate access to the Amazon S3 location of the data. For detailed information and
scenarios about how to grant Amazon S3 access, see Example Walkthroughs: Managing Access in the
Amazon Simple Storage Service Developer Guide. For more information and an example of which Amazon
S3 actions to allow, see the example bucket policy in Cross-Account Access (p. 282).
Topics
• Connect to Amazon Athena Using an Interface VPC Endpoint (p. 309)
Connect to Amazon Athena Using an Interface VPC Endpoint
The interface VPC endpoint connects your VPC directly to Athena without an internet gateway, NAT
device, VPN connection, or AWS Direct Connect connection. The instances in your VPC don't need public
IP addresses to communicate with the Athena API.
To use Athena through your VPC, you must connect from an instance that is inside the VPC or connect
your private network to your VPC by using an Amazon Virtual Private Network (VPN) or AWS Direct
Connect. For information about Amazon VPN, see VPN Connections in the Amazon Virtual Private Cloud
User Guide. For information about AWS Direct Connect, see Creating a Connection in the AWS Direct
Connect User Guide.
Athena supports VPC endpoints in all AWS Regions where both Amazon VPC and Athena are available.
You can create an interface VPC endpoint to connect to Athena using the AWS console or AWS Command
Line Interface (AWS CLI) commands. For more information, see Creating an Interface Endpoint.
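You can also create the endpoint with an SDK. The following sketch uses the AWS SDK for Python (Boto3);
the Region, VPC, subnet, and security group identifiers are hypothetical placeholders that you would
replace with your own.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an interface VPC endpoint for Athena in the chosen VPC.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0abc1234def567890",                 # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.athena",  # Athena endpoint service for the Region
    SubnetIds=["subnet-0abc1234def567890"],        # hypothetical subnet ID
    SecurityGroupIds=["sg-0abc1234def567890"],     # hypothetical security group ID
    PrivateDnsEnabled=True,                        # resolve the default Athena endpoint to the VPC endpoint
)
print(response["VpcEndpoint"]["VpcEndpointId"])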
After you create an interface VPC endpoint, if you enable private DNS hostnames for the endpoint, the
default Athena endpoint (https://athena.Region.amazonaws.com) resolves to your VPC endpoint.
If you do not enable private DNS hostnames, Amazon VPC provides a DNS endpoint name that you can
use in the following format:
VPC_Endpoint_ID.athena.Region.vpce.amazonaws.com
For more information, see Interface VPC Endpoints (AWS PrivateLink) in the Amazon VPC User Guide.
Athena supports making calls to all of its API Actions inside your VPC.
For more information, see Controlling Access to Services with VPC Endpoints in the Amazon VPC User
Guide.
The endpoint to which this policy is attached grants all principals access to the listed Athena actions
on the workgroup named workgroupA.
{
"Statement": [{
"Principal": "*",
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:RunQuery",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:CancelQueryExecution",
"athena:ListWorkGroups",
"athena:GetWorkGroup",
"athena:TagResource"
],
"Resource": [
"arn:aws:athena:us-west-1:AWSAccountId:workgroup/workgroupA"
]
}]
}
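If you manage the endpoint programmatically, a policy like the preceding one can be applied to it after
creation. The following sketch uses the AWS SDK for Python (Boto3); the endpoint ID and account ID are
hypothetical placeholders.

import json
import boto3

ec2 = boto3.client("ec2")

# The endpoint policy shown above, restricted to workgroupA.
endpoint_policy = {
    "Statement": [{
        "Principal": "*",
        "Effect": "Allow",
        "Action": [
            "athena:StartQueryExecution",
            "athena:RunQuery",
            "athena:GetQueryExecution",
            "athena:GetQueryResults",
            "athena:CancelQueryExecution",
            "athena:ListWorkGroups",
            "athena:GetWorkGroup",
            "athena:TagResource",
        ],
        "Resource": ["arn:aws:athena:us-west-1:111122223333:workgroup/workgroupA"],
    }]
}

# Attach the policy to an existing interface VPC endpoint (hypothetical endpoint ID).
ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0abc1234def567890",
    PolicyDocument=json.dumps(endpoint_policy),
)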
You can use Athena to query both data that is registered with Lake Formation and data that is not
registered with Lake Formation.
Lake Formation permissions apply when using Athena to query source data from Amazon S3 locations
that are registered with Lake Formation. Lake Formation permissions also apply when you create
databases and tables that point to registered Amazon S3 data locations. To use Athena with data
registered using Lake Formation, Athena must be configured to use the AWS Glue Data Catalog.
Lake Formation permissions do not apply when writing objects to Amazon S3, nor do they apply when
querying data stored in Amazon S3 or metadata that are not registered with Lake Formation. For source
data in Amazon S3 and metadata that is not registered with Lake Formation, access is determined by IAM
permissions policies for Amazon S3 and AWS Glue actions. Athena query results locations in Amazon S3
cannot be registered with Lake Formation, and IAM permissions policies for Amazon S3 control access.
In addition, Lake Formation permissions do not apply to Athena query history. You can use Athena
workgroups to control access to query history.
For more information about Lake Formation, see Lake Formation FAQs and the AWS Lake Formation
Developer Guide.
Topics
• How Athena Accesses Data Registered With Lake Formation (p. 311)
• Considerations and Limitations When Using Athena to Query Data Registered With Lake
Formation (p. 312)
• Managing Lake Formation and Athena User Permissions (p. 315)
• Applying Lake Formation Permissions to Existing Databases and Tables (p. 317)
• Using Lake Formation and the Athena JDBC and ODBC Drivers for Federated Access to
Athena (p. 317)
How Athena Accesses Data Registered With Lake Formation
Each time an Athena principal (user, group, or role) runs a query on data registered using Lake
Formation, Lake Formation verifies that the principal has the appropriate Lake Formation permissions
to the database, table, and Amazon S3 location as appropriate for the query. If the principal has access,
Lake Formation vends temporary credentials to Athena, and the query runs.
The following diagram shows how credential vending works in Athena on a query-by-query basis for a
hypothetical SELECT query on a table with an Amazon S3 location registered in Lake Formation.
Considerations and Limitations When Using Athena to Query Data Registered With Lake Formation
• Create Table As Select (CTAS) Queries Require Amazon S3 Write Permissions (p. 314)
This occurs when column metadata is stored in table properties for tables using either the Avro storage
format or using a custom Serializer/Deserializers (SerDe) in which table schema is defined in table
properties along with the SerDe definition. When using Athena with Lake Formation, we recommend
that you review the contents of table properties that you register with Lake Formation and, where
possible, limit the information stored in table properties to prevent any sensitive metadata from being
visible to users.
• Use an Athena cross-account AWS Lambda function to federate queries to the Data Catalog of your
choice.
For more information, see the following resources in the AWS Lake Formation Developer Guide:
Cross-Account Access
For steps, see Cross-account AWS Glue Data Catalog access with Amazon Athena in the AWS Big Data
Blog.
Managing Lake Formation and Athena User Permissions
The following sections summarize the permissions required to use Athena to query data registered in
Lake Formation. For more information, see Security in AWS Lake Formation in the AWS Lake Formation
Developer Guide.
Permissions Summary
• Identity-Based Permissions For Lake Formation and Athena (p. 315)
• Amazon S3 Permissions For Athena Query Results Locations (p. 315)
• Athena Workgroup Memberships To Query History (p. 316)
• Lake Formation Permissions To Data (p. 316)
• IAM Permissions to Write to Amazon S3 Locations (p. 316)
• Permissions to Encrypted Data, Metadata, and Athena Query Results (p. 316)
• Resource-Based Permissions for Amazon S3 Buckets in External Accounts (Optional) (p. 316)
In Lake Formation, a data lake administrator has permissions to create metadata objects such as
databases and tables, grant Lake Formation permissions to other users, and register new Amazon
S3 locations. To register new locations, permissions to the service-linked role for Lake Formation
are required. For more information, see Create a Data Lake Administrator and Service-Linked Role
Permissions for Lake Formation in the AWS Lake Formation Developer Guide.
A Lake Formation user can use Athena to query databases, tables, table columns, and underlying
Amazon S3 data stores based on Lake Formation permissions granted to them by data lake
administrators. Users cannot create databases or tables, or register new Amazon S3 locations with Lake
Formation. For more information, see Create a Data Lake User in the AWS Lake Formation Developer
Guide.
In Athena, identity-based permissions policies, including those for Athena workgroups, still control
access to Athena actions for AWS account users. In addition, federated access might be provided
through the SAML-based authentication available with Athena drivers. For more information,
see Using Workgroups to Control Query Access and Costs (p. 358), IAM Policies for Accessing
Workgroups (p. 361), and Enabling Federated Access to the Athena API (p. 301).
For more information, see Granting Lake Formation Permissions in the AWS Lake Formation Developer
Guide.
Athena users who have Amazon S3 permissions to the query results location might be able to
access query result files and metadata when they do not have Lake Formation permissions for the data.
To avoid this, we recommend that you use workgroups to specify the location for query results and align
workgroup membership with Lake Formation permissions. You can then use IAM permissions policies to
limit access to query results locations. For more information about query results, see Working with Query
Results, Output Files, and Query History (p. 122).
• Encrypting source data – SSE-S3 and CSE-KMS encryption of source data in Amazon S3 data locations
is supported. SSE-KMS encryption is not supported. Athena users who query encrypted Amazon S3
locations that are registered with Lake Formation need permissions to encrypt and decrypt data. For
more information about requirements, see Permissions to Encrypted Data in Amazon S3 (p. 265).
• Encrypting metadata – Encrypting metadata in the Data Catalog is supported. For principals using
Athena, identity-based policies must allow the "kms:GenerateDataKey", "kms:Decrypt", and
"kms:Encrypt" actions for the key used to encrypt metadata. For more information, see Encrypting
Your Data Catalog in the AWS Glue Developer Guide and Access to Encrypted Metadata in the AWS Glue
Data Catalog (p. 281).
For information about accessing a Data Catalog in another account, see Cross-Account Data Catalog
Access (p. 313).
Applying Lake Formation Permissions to Existing Databases and Tables
Registering data with Lake Formation and updating IAM permissions policies is not a requirement. If data
is not registered with Lake Formation, Athena users who have appropriate permissions in Amazon S3—
and AWS Glue, if applicable—can continue to query data not registered with Lake Formation.
If you have existing Athena users who query data not registered with Lake Formation, you can update
IAM permissions for Amazon S3—and the AWS Glue Data Catalog, if applicable—so that you can use
Lake Formation permissions to manage user access centrally. For permission to read Amazon S3 data
locations, you can update resource-based and identity-based policies to modify Amazon S3 permissions.
For access to metadata, if you configured resource-level policies for fine-grained access control with AWS
Glue, you can use Lake Formation permissions to manage access instead.
For more information, see Fine-Grained Access to Databases and Tables in the AWS Glue Data
Catalog (p. 275) and Upgrading AWS Glue Data Permissions to the AWS Lake Formation Model in the
AWS Lake Formation Developer Guide.
To use Athena to access a data source controlled by Lake Formation, you need to enable SAML 2.0-based
federation by configuring your identity provider (IdP) and AWS Identity and Access Management (IAM)
roles. For detailed steps, see Tutorial: Configuring Federated Access for Okta Users to Athena Using Lake
Formation and JDBC (p. 318).
Prerequisites
To use Amazon Athena and Lake Formation for federated access, you must meet the following
requirements:
• You manage your corporate identities using an existing SAML-based identity provider, such as Okta or
Microsoft Active Directory Federation Services (AD FS).
• You use the AWS Glue Data Catalog as a metadata store.
• You define and manage permissions in Lake Formation to access databases, tables, and columns in
AWS Glue Data Catalog. For more information, see the AWS Lake Formation Developer Guide.
• You use version 2.0.14 or later of the Athena JDBC Driver or version 1.1.3 or later of the Athena ODBC
driver (p. 85).
• Currently, the Athena JDBC driver and ODBC drivers support the Okta and Microsoft Active Directory
Federation Services (AD FS) identity providers. Although the Athena JDBC driver has a generic SAML
class that can be extended to use other identity providers, support for custom extensions that enable
other identity providers (IdPs) for use with Athena may be limited.
• Currently, you cannot use the Athena console to configure support for IdP and SAML use with Athena.
To configure this support, you use the third-party identity provider, the Lake Formation and IAM
management consoles, and the JDBC or ODBC driver client.
• You should understand the SAML 2.0 specification and how it works with your identity provider before
you configure your identity provider and SAML for use with Lake Formation and Athena.
• SAML providers and the Athena JDBC and ODBC drivers are provided by third parties, so support
through AWS for issues related to their use may be limited.
Topics
• Tutorial: Configuring Federated Access for Okta Users to Athena Using Lake Formation and
JDBC (p. 318)
Prerequisites
• Created an AWS account. To create an account, visit the Amazon Web Services home page.
• Set up a query results location (p. 127) for Athena in Amazon S3.
• Registered an Amazon S3 data bucket location with Lake Formation.
• Defined a database and tables on the AWS Glue Data Catalog that point to your data in Amazon S3.
• If you have not yet defined a table, either run an AWS Glue crawler or use Athena to define a database
and one or more tables (p. 88) for the data that you want to access.
• This tutorial uses a table based on the NYC Taxi trips dataset available in the Registry of Open Data
on AWS. The tutorial uses the database name tripdb and the table name nyctaxi.
Tutorial Steps
• Step 1: Create an Okta Account (p. 319)
• Step 2: Add users and groups to Okta (p. 319)
• Step 3: Set up an Okta Application for SAML Authentication (p. 325)
• Step 4: Create an AWS SAML Identity Provider and Lake Formation Access IAM Role (p. 333)
• Step 5: Add the IAM Role and SAML Identity Provider to the Okta Application (p. 339)
• Step 6: Grant user and group permissions through AWS Lake Formation (p. 344)
• Step 7: Verify access through the Athena JDBC client (p. 348)
• Conclusion (p. 356)
• Related Resources (p. 356)
1. To use Okta, navigate to the Okta developer sign up page and create a free Okta trial account. The
Developer Edition Service is free of charge up to the limits specified by Okta at developer.okta.com/pricing.
2. When you receive the activation email, activate your account.
An Okta domain name will be assigned to you. Save the domain name for reference. Later, you use
the domain name (<okta-idp-domain>) in the JDBC string that connects to Athena.
1. After you activate your Okta account, log in as an administrative user to the assigned Okta domain.
2. If you are in the Developer Console, use the option on the top left of the page to choose the Classic
UI.
• Enter values for First name and Last name. This tutorial uses athena-okta-user.
• Enter a Username and Primary email. This tutorial uses [email protected].
• For Password, choose Set by admin, and then provide a password. This tutorial clears the option
for User must change password on first login; your security requirements may vary.
8. Choose Save.
In the following procedure, you provide access for two Okta groups through the Athena JDBC driver by
adding a "Business Analysts" group and a "Developer" group.
1. From the Okta classic UI, choose Directory, and then choose Groups.
2. On the Groups page, choose Add Group.
Now that you have two users and two groups, you are ready to add a user to each group.
1. On the Groups page, choose the lf-developer group that you just created. You will add one of the
Okta users that you created as a developer to this group.
The entry for the user moves from the Not Members list on the left to the Members list on the
right.
4. Choose Save.
5. Choose Back to Groups, or choose Directory, and then choose Groups.
6. Choose the lf-business-analyst group.
7. Choose Manage People.
8. Add the athena-ba-user to the Members list of the lf-business-analyst group, and then choose
Save.
9. Choose Back to Groups, or choose Directory, Groups.
The Groups page now shows that each group has one Okta user.
• Download the resulting identity provider metadata for later use with AWS.
1. From the menu, choose Applications so that you can configure an Okta application for SAML
authentication to Athena.
2. Click Add Application.
5. On the Amazon Web Services Redshift page, choose Add to create a SAML-based application for
Amazon Redshift.
Now that you have created an Okta application, you can assign it to the users and groups that you
created.
2. In the Assign Athena-LakeFormation-Okta to People dialog box, find the athena-okta-user user
that you created previously.
3. Choose Assign to assign the user to the application.
5. Choose Done.
6. On the Assignments tab for the Athena-LakeFormation-Okta application, choose Assign, Assign to
Groups.
Now you are ready to download the identity provider application metadata for use with AWS.
1. Choose the Okta application Sign On tab, and then right-click Identity Provider metadata.
2. Choose Save Link As to save the identity provider metadata, which is in XML format, to a file. Give it
a name that you recognize (for example, Athena-LakeFormation-idp-metadata.xml).
Step 4: Create an AWS SAML Identity Provider and Lake Formation Access IAM
Role
In this step, you use the AWS Identity and Access Management (IAM) console to perform the following
tasks:
1. Sign in to the AWS account console as AWS account administrator and navigate to the IAM console
(https://console.aws.amazon.com/iam/).
2. In the navigation pane, choose Identity providers, and then click Create Provider.
In the IAM console, the AthenaLakeFormationOkta provider that you created appears in the list of
identity providers.
Next, you create an IAM role for AWS Lake Formation access. You add two inline policies to the role. One
policy provides permissions to access Lake Formation and the AWS Glue APIs. The other policy provides
access to Athena and the Athena query results location in Amazon S3.
1. In the IAM console navigation pane, choose Roles, and then choose Create role.
3. On the Attach Permissions policies page, for Filter policies, enter Athena.
4. Select the AmazonAthenaFullAccess managed policy, and then choose Next: Tags.
Next, you add inline policies that allow access to Lake Formation, AWS Glue APIs, and Athena query
results in Amazon S3.
To add an inline policy to the role for Lake Formation and AWS Glue
1. From the list of roles in the IAM console, choose the newly created Athena-LakeFormation-
OktaRole.
2. On the Summary page for the role, on the Permissions tab, choose Add inline policy.
3. On the Create policy page, choose JSON.
4. Add an inline policy like the following that provides access to Lake Formation and the AWS Glue
APIs.
{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess",
"lakeformation:GetMetadataAccess",
"glue:GetUnfiltered*",
"glue:GetTable",
"glue:GetTables",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:CreateDatabase",
"glue:GetUserDefinedFunction",
"glue:GetUserDefinedFunctions"
],
"Resource": "*"
}
}
To add an inline policy to the role for the Athena query results location
1. On the Summary page for the Athena-LakeFormation-OktaRole role, on the Permissions tab,
choose Add inline policy.
2. On the Create policy page, choose JSON.
3. Add an inline policy like the following that allows the role access to the Athena query results
location. Replace the <athena-query-results-bucket> placeholders in the example with the
name of your Amazon S3 bucket.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AthenaQueryResultsPermissionsForS3",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:PutObject",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::<athena-query-results-bucket>",
"arn:aws:s3:::<athena-query-results-bucket>/*"
]
}
]
}
Next, you copy the ARN of the Lake Formation access role and the ARN of the SAML provider that you
created. These are required when you configure the Okta SAML application in the next section of the
tutorial.
1. In the IAM console, on the Summary page for the Athena-LakeFormation-OktaRole role,
choose the Copy to clipboard icon next to Role ARN. The ARN has the following format:
arn:aws:iam::<account-id>:role/Athena-LakeFormation-OktaRole
arn:aws:iam::<account-id>:saml-provider/AthenaLakeFormationOkta
Step 5: Add the IAM Role and SAML Identity Provider to the Okta Application
In this step, you return to the Okta developer console and perform the following tasks:
• Add user and group Lake Formation URL attributes to the Okta application.
• Add the ARN for the identity provider and the ARN for the IAM role to the Okta application.
• Copy the Okta application ID. The Okta application ID is required in the JDBC profile that connects to
Athena.
To add user and group Lake Formation URL attributes to the Okta application
3. Choose the Sign On tab for the application, and then choose Edit.
7. Scroll down to the Advanced Sign-On Settings section, where you will add the identity provider and
IAM Role ARNs to the Okta application.
To add the ARNs for the identity provider and IAM role to the Okta application
1. For Idp ARN and Role ARN, enter the AWS identity provider ARN and role ARN as comma-separated
values in the format <saml-arn>,<role-arn>. The combined string should look like the following:
arn:aws:iam::<account-id>:saml-provider/AthenaLakeFormationOkta,arn:aws:iam::<account-id>:role/Athena-LakeFormation-OktaRole
2. Choose Save.
Next, you copy the Okta application ID. You will require this later for the JDBC string that connects to
Athena.
Step 6: Grant user and group permissions through AWS Lake Formation
In this step, you use the Lake Formation console to grant permissions on a table to the SAML user and
group. You perform the following tasks:
• Specify the ARN of the Okta SAML user and associated user permissions on the table.
• Specify the ARN of the Okta SAML group and associated group permissions on the table.
• Verify the permissions that you granted.
a. Under SAML and Amazon QuickSight users and groups, enter the Okta SAML user ARN in the
following format:
arn:aws:iam::<account-id>:saml-provider/AthenaLakeFormationOkta:user/<athena-okta-user>@<anycompany.com>
b. For Columns and Choose filter type, optionally choose Include columns or Exclude columns.
c. Use the Choose one or more columns dropdown under the filter to specify the columns that
you want to include or exclude for or from the user.
d. For Table permissions, choose Select. This tutorial grants only the SELECT permission; your
requirements may vary.
6. Choose Grant.
1. On the Tables page of the Lake Formation console, make sure that the nyctaxi table is still selected.
2. From Actions, choose Grant.
3. In the Grant permissions dialog, enter the following information:
a. Under SAML and Amazon QuickSight users and groups, enter the Okta SAML group ARN in the
following format:
arn:aws:iam::<account-id>:saml-provider/AthenaLakeFormationOkta:group/lf-business-analyst
4. Choose Grant.
5. To verify the permissions that you granted, choose Actions, View permissions.
The Data permissions page for the nyctaxi table shows the permissions for athena-okta-user and
the lf-business-analyst group.
• Prepare the test client – Download the Athena JDBC driver, install SQL Workbench, and add the driver
to Workbench. This tutorial uses SQL Workbench to access Athena through Okta authentication and to
verify Lake Formation permissions.
• In SQL Workbench:
• Create a connection for the Athena Okta user.
• Run test queries as the Athena Okta user.
1. Download and extract the Lake Formation compatible Athena JDBC driver (2.0.14 or later version)
from Using Athena with the JDBC Driver (p. 83).
2. Download and install the free SQL Workbench/J SQL query tool, available under a modified Apache
2.0 license.
3. In SQL Workbench, choose File, and then choose Manage Drivers.
You are now ready to create and test a connection for the Athena Okta user.
2. In the Connection profile dialog box, create a connection by entering the following information:
jdbc:awsathena://AwsRegion=region-id;
S3OutputLocation=s3://athena-query-results-bucket/athena_results;
AwsCredentialsProviderClass=com.simba.athena.iamsupport.plugin.OktaCredentialsProvider;
[email protected];
password=password;
idp_host=okta-idp-domain;
App_ID=okta-app-id;
SSL_Insecure=true;
LakeFormationEnabled=true;
[athena_lf_dev]
plugin_name=com.simba.athena.iamsupport.plugin.OktaCredentialsProvider
idp_host=okta-idp-domain
app_id=okta-app-id
[email protected]
pwd=password
2. For URL, enter a single-line connection string like the following example. The example adds
line breaks for readability.
jdbc:awsathena://AwsRegion=region-id;
S3OutputLocation=s3://athena-query-results-bucket/athena_results;
profile=athena_lf_dev;
SSL_Insecure=true;
LakeFormationEnabled=true;
Note that these examples are basic representations of the URL needed to connect to Athena. For
the full list of parameters supported in the URL, refer to the Simba Athena JDBC driver installation
guide (p. 84). The JDBC installation guide also provides sample Java code for connecting to Athena
programmatically.
The following image shows a SQL Workbench connection profile that uses a connection URL.
Now that you have established a connection for the Okta user, you can test it by retrieving some data.
DESCRIBE "tripdb"."nyctaxi"
3. From the SQL Workbench Statement window, run the following SQL SELECT command. Verify that
all columns are displayed.
Next, you verify that the athena-ba-user, as a member of the lf-business-analyst group, has access to
only the first three columns of the table that you specified earlier in Lake Formation.
1. In SQL Workbench, in the Connection profile dialog box, create another connection profile.
jdbc:awsathena://AwsRegion=region-id;
S3OutputLocation=s3://athena-query-results-bucket/athena_results;
AwsCredentialsProviderClass=com.simba.athena.iamsupport.plugin.OktaCredentialsProvider;
[email protected];
password=password;
idp_host=okta-idp-domain;
App_ID=okta-application-id;
SSL_Insecure=true;
LakeFormationEnabled=true;
[athena_lf_ba]
plugin_name=com.simba.athena.iamsupport.plugin.OktaCredentialsProvider
idp_host=okta-idp-domain
app_id=okta-application-id
[email protected]
pwd=password
2. For URL, enter a single-line connection string like the following. The example adds line
breaks for readability.
jdbc:awsathena://AwsRegion=region-id;
S3OutputLocation=s3://athena-query-results-bucket/athena_results;
profile=athena_lf_ba;
SSL_Insecure=true;
LakeFormationEnabled=true;
Because athena-ba-user is a member of the lf-business-analyst group, only the first three columns
that you specified in the Lake Formation console are returned.
Next, you return to the Okta console to add the athena-ba-user to the lf-developer Okta group.
1. Sign in to the Okta console as an administrative user of the assigned Okta domain.
Now you return to the Lake Formation console to configure table permissions for the lf-developer group.
• For SAML and Amazon QuickSight users and groups, enter the Okta SAML lf-developer group
ARN in the following format:
arn:aws:iam::<account-id>:saml-provider/AthenaLakeFormationOkta:group/lf-developer
• For Columns and Choose filter type, choose Include columns.
• Choose the trip_type column.
• For Table permissions, choose SELECT.
6. Choose Grant.
Now you can use SQL Workbench to verify the change in permissions for the lf-developer group. The
change should be reflected in the data available to athena-ba-user, who is now a member of the lf-
developer group.
Because athena-ba-user is now a member of both the lf-developer and lf-business-analyst groups,
the combination of Lake Formation permissions for those groups determines the columns that are
returned.
Conclusion
In this tutorial, you configured Athena integration with AWS Lake Formation using Okta as the SAML
provider. You used Lake Formation and IAM to control the data lake resources in the AWS Glue Data
Catalog that are available to the SAML user.
Related Resources
For related information, see the following resources.
Using Workgroups to Control Query Access and Costs
Workgroups integrate with IAM, CloudWatch, and Amazon Simple Notification Service as follows:
• IAM identity-based policies with resource-level permissions control who can run queries in a
workgroup.
• Athena publishes the workgroup query metrics to CloudWatch, if you enable query metrics.
• In Amazon SNS, you can create Amazon SNS topics that issue alarms to specified workgroup users
when data usage controls for queries in a workgroup exceed your established thresholds.
Topics
• Using Workgroups for Running Queries (p. 358)
• Controlling Costs and Monitoring Queries with CloudWatch Metrics and Events (p. 375)
Using Workgroups for Running Queries
Topics
• Benefits of Using Workgroups (p. 358)
• How Workgroups Work (p. 359)
• Setting up Workgroups (p. 360)
• IAM Policies for Accessing Workgroups (p. 361)
• Workgroup Settings (p. 366)
• Managing Workgroups (p. 367)
• Athena Workgroup APIs (p. 373)
• Troubleshooting Workgroups (p. 373)
Benefits of Using Workgroups
• Isolate users, teams, applications, or workloads into groups. Each workgroup has its own distinct
query history and a list of saved queries. For more information, see How Workgroups Work (p. 359).
For all queries in the workgroup, you can choose to configure workgroup settings. They include an
Amazon S3 location for storing query results, and encryption configuration. You can also enforce
workgroup settings. For more information, see Workgroup Settings (p. 366).
• Enforce cost constraints. You can set two types of cost constraints for queries in a workgroup: a
per-query data usage limit and a workgroup-wide data usage limit. For detailed steps, see Setting
Data Usage Control Limits (p. 381).
• Track query-related metrics for all workgroup queries in CloudWatch. For each query that runs in a
workgroup, if you configure the workgroup to publish metrics, Athena publishes them to CloudWatch.
You can view query metrics (p. 376) for each of your workgroups within the Athena console. In
CloudWatch, you can create custom dashboards, and set thresholds and alarms on these metrics.
How Workgroups Work
• By default, each account has a primary workgroup and the default permissions allow all authenticated
users access to this workgroup. The primary workgroup cannot be deleted.
• Each workgroup that you create shows saved queries and query history only for queries that ran in it,
and not for all queries in the account. This separates your queries from other queries within an account
and makes it more efficient for you to locate your own saved queries and queries in history.
• Disabling a workgroup prevents queries from running in it; queries sent to a disabled workgroup fail
until you enable it again.
• If you have permissions, you can delete an empty workgroup, and a workgroup that contains saved
queries. In this case, before deleting a workgroup, Athena warns you that saved queries are deleted.
Before deleting a workgroup to which other users have access, make sure its users have access to other
workgroups in which they can continue to run queries.
• You can set up workgroup-wide settings and enforce their usage by all queries that run in a workgroup.
The settings include query results location in Amazon S3 and encryption configuration.
Important
When you enforce workgroup-wide settings, all queries that run in this workgroup use
workgroup settings. This happens even if their client-side settings may differ from workgroup
settings. For information, see Workgroup Settings Override Client-Side Settings (p. 366).
• You can open up to ten query tabs within each workgroup. When you switch between workgroups,
your query tabs remain open for up to three workgroups.
Setting up Workgroups
Setting up workgroups involves creating them and establishing permissions for their usage. First, decide
which workgroups your organization needs, and create them. Next, set up IAM workgroup policies that
control user access and actions on a workgroup resource. Users with access to these workgroups can
now run queries in them.
Note
Use these tasks for setting up workgroups when you begin to use them for the first time. If
your Athena account already uses workgroups, each account's user requires permissions to run
queries in one or more workgroups in the account. Before you run queries, check your IAM policy
to see which workgroups you can access, adjust your policy if needed, and switch (p. 371) to a
workgroup you intend to use.
By default, if you have not created any workgroups, all queries in your account run in the primary
workgroup:
Workgroups display in the Athena console in the Workgroup:<workgroup_name> tab. The console lists
the workgroup that you have switched to. When you run queries, they run in this workgroup. You can run
queries in the workgroup in the console, or by using the API operations, the command line interface, or a
client application through the JDBC or ODBC driver. When you have access to a workgroup, you can view
the workgroup's settings, metrics, and data usage control limits. Additionally, you can have permissions to
edit the settings and data usage control limits.
To Set Up Workgroups
1. Decide which workgroups to create. For example, you can decide the following:
• Who can run queries in each workgroup, and who owns workgroup configuration. This
determines IAM policies you create. For more information, see IAM Policies for Accessing
Workgroups (p. 361).
• Which locations in Amazon S3 to use for the query results for queries that run in each workgroup.
A location must exist in Amazon S3 before you can specify it for the workgroup query results.
All users who use a workgroup must have access to this location. For more information, see
Workgroup Settings (p. 366).
• Which encryption settings are required, and which workgroups have queries that must be
encrypted. We recommend that you create separate workgroups for encrypted and non-encrypted
queries. That way, you can enforce encryption for a workgroup that applies to all queries that run
in it. For more information, see Encrypting Query Results Stored in Amazon S3 (p. 266).
2. Create workgroups as needed, and add tags to them. Open the Athena console, choose the
Workgroup:<workgroup_name> tab, and then choose Create workgroup. For detailed steps, see
Create a Workgroup (p. 368).
3. Create IAM policies for your users, groups, or roles to enable their access to workgroups. The
policies establish the workgroup membership and access to actions on a workgroup resource. For
detailed steps, see IAM Policies for Accessing Workgroups (p. 361). For example JSON policies, see
Workgroup Example Policies (p. 285).
4. Set workgroup settings. Specify a location in Amazon S3 for query results and encryption
settings, if needed. You can enforce workgroup settings. For more information, see workgroup
settings (p. 366).
Important
If you override client-side settings (p. 366), Athena will use the workgroup's settings. This
affects queries that you run in the console, by using the drivers, the command line interface,
or the API operations.
Although queries continue to run, automation built on the availability of results in a certain
Amazon S3 bucket may break. We recommend that you inform your users before overriding.
After workgroup settings are set to override, you can omit specifying client-side settings in
the drivers or the API.
5. Notify users which workgroups to use for running queries. Send an email to inform your account's
users about workgroup names that they can use, the required IAM policies, and the workgroup
settings.
6. Configure cost control limits, also known as data usage control limits, for queries and workgroups.
To notify you when a threshold is breached, create an Amazon SNS topic and configure
subscriptions. For detailed steps, see Setting Data Usage Control Limits (p. 381) and Creating an
Amazon SNS Topic in the Amazon Simple Notification Service Getting Started Guide.
7. To run queries, switch to the appropriate workgroup. For detailed steps, see the section called
“Specify a Workgroup in Which to Run Queries” (p. 372).
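The same setup can also be scripted. The following sketch uses the AWS SDK for Python (Boto3) to create
a workgroup with an enforced query results location and then run a query in it; the workgroup name and
results bucket are hypothetical placeholders.

import boto3

athena = boto3.client("athena")

# Create a workgroup with enforced settings and CloudWatch metrics enabled (steps 2 and 4).
athena.create_work_group(
    Name="analyst-workgroup",                     # hypothetical workgroup name
    Description="Workgroup for the analyst team",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://my-athena-results-bucket/analyst/"  # hypothetical bucket
        },
        "EnforceWorkGroupConfiguration": True,    # workgroup settings override client-side settings
        "PublishCloudWatchMetricsEnabled": True,  # publish per-query metrics to CloudWatch
    },
    Tags=[{"Key": "team", "Value": "analytics"}],
)

# Run a query in the new workgroup by naming it in the request (step 7).
response = athena.start_query_execution(
    QueryString="SELECT 1",
    WorkGroup="analyst-workgroup",
)
print(response["QueryExecutionId"])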
IAM Policies for Accessing Workgroups
For IAM-specific information, see the links listed at the end of this section. For information about
example JSON workgroup policies, see Workgroup Example Policies (p. 362).
To use the visual editor in the IAM console to create a workgroup policy
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane on the left, choose Policies, and then choose Create policy.
3. On the Visual editor tab, choose Choose a service. Then choose Athena to add to the policy.
4. Choose Select actions, and then choose the actions to add to the policy. The visual editor shows the
actions available in Athena. For more information, see Actions, Resources, and Condition Keys for
Amazon Athena in the Service Authorization Reference.
5. Choose Add actions, and then type a specific action or use wildcards (*) to specify multiple actions.
By default, the policy that you are creating allows the actions that you choose. If you chose one or
more actions that support resource-level permissions to the workgroup resource in Athena, then
the editor lists the workgroup resource.
6. Choose Resources to specify the specific workgroups for your policy. For example JSON workgroup
policies, see Workgroup Example Policies (p. 362).
7. Specify the workgroup resource as follows:
arn:aws:athena:<region>:<user-account>:workgroup/<workgroup-name>
8. Choose Review policy, and then type a Name and a Description (optional) for the policy that you
are creating. Review the policy summary to make sure that you granted the intended permissions.
9. Choose Create policy to save your new policy.
10. Attach this identity-based policy to a user, a group, or a role, and specify the workgroup resources
they can access.
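After you create the policy, you can attach it with the AWS CLI. The following is a minimal sketch; the user name and policy ARN are placeholder values, not names from this guide.
aws iam attach-user-policy \
    --user-name athena-analyst \
    --policy-arn arn:aws:iam::123456789012:policy/AthenaWorkgroupAccess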
For example JSON workgroup policies, see Workgroup Example Policies (p. 362). For a complete list of
Amazon Athena actions, see the API action names in the Amazon Athena API Reference. For more
information about IAM policies, see the Service Authorization Reference and the IAM User Guide.
A workgroup is an IAM resource managed by Athena. Therefore, if your workgroup policy uses actions
that take workgroup as an input, you must specify the workgroup's ARN as follows:
"Resource": [arn:aws:athena:<region>:<user-account>:workgroup/<workgroup-name>]
Where <workgroup-name> is the name of your workgroup. For example, for workgroup named
test_workgroup, specify it as a resource as follows:
"Resource": ["arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"]
For a complete list of Amazon Athena actions, see the API action names in the Amazon Athena API
Reference. For more information about IAM policies, see Creating Policies with the Visual Editor in the
IAM User Guide. For more information about creating IAM policies for workgroups, see Workgroup IAM
Policies (p. 361).
The following policy allows full access to all workgroup resources that might exist in the account. We
recommend that you use this policy for users who must administer and manage workgroups for all
other users.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:*"
],
"Resource": [
"*"
]
}
]
}
The following policy allows full access to a single workgroup resource named workgroupA. You could
use this policy for users who have full control over a particular workgroup.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:ListEngineVersions",
"athena:ListWorkGroups",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetCatalogs",
"athena:GetNamespaces",
"athena:GetTables",
"athena:GetTable"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA"
]
},
{
"Effect": "Allow",
"Action": [
"athena:DeleteWorkGroup",
"athena:UpdateWorkGroup",
"athena:GetWorkGroup",
"athena:CreateWorkGroup"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA"
]
}
]
}
In the following policy, a user is allowed to run queries in the specified workgroup workgroupA and to
view them. The user is not allowed to perform management tasks for the workgroup itself, such as
updating or deleting it.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:ListWorkGroups",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetCatalogs",
"athena:GetNamespaces",
"athena:GetTables",
"athena:GetTable"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution",
"athena:GetWorkGroup"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA"
]
}
]
}
Example Example Policy for Running Queries in the Primary Workgroup
The following policy allows a particular user to run queries in the primary workgroup.
Note
We recommend that you add this policy to all users who are otherwise configured to run queries
in their designated workgroups. Adding this policy to their workgroup user policies is useful in
case their designated workgroup is deleted or disabled. In that case, they can continue running
queries in the primary workgroup.
To allow users in your account to run queries in the primary workgroup, add the following line to the
resource section of the Example Policy for Running Queries in a Specified Workgroup (p. 364).
"arn:aws:athena:us-east-1:123456789012:workgroup/primary"
Example Example Policy for Management Operations on a Specified Workgroup
In the following policy, a user is allowed to create, view, update, and delete the workgroup
test_workgroup, and to list Athena engine versions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:ListEngineVersions"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"athena:CreateWorkGroup",
"athena:GetWorkGroup",
"athena:DeleteWorkGroup",
"athena:UpdateWorkGroup"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"
]
}
]
}
Example Example Policy for Listing Workgroups
To allow a user to list all workgroups in the account, use the following policy block:
{
"Effect": "Allow",
"Action": [
"athena:ListWorkGroups"
],
"Resource": "*"
}
Example Example Policy for Running and Stopping Queries in a Specific Workgroup
In the following policy, a user is allowed to run and stop queries in the specified workgroup:
{
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:StopQueryExecution"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"
]
}
Example Example Policy for Working with Named Queries in a Specific Workgroup
In the following policy, a user has permissions to create, delete, and obtain information about named
queries in the specified workgroup:
{
"Effect": "Allow",
"Action": [
"athena:CreateNamedQuery",
"athena:GetNamedQuery",
"athena:DeleteNamedQuery"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"
]
}
Workgroup Settings
Each workgroup has the following settings:
• A unique name. It can contain from 1 to 128 characters, including alphanumeric characters, dashes,
and underscores. After you create a workgroup, you cannot change its name. You can, however, create
a new workgroup with the same settings and a different name.
• Settings that apply to all queries running in the workgroup. They include:
• A location in Amazon S3 for storing query results for all queries that run in this workgroup. This
location must exist before you specify it for the workgroup when you create it. For information on
creating an Amazon S3 bucket, see Create a Bucket.
• An encryption setting, if you use encryption for all workgroup queries. Encryption applies to all
queries in a workgroup; you cannot encrypt only some of them. It is best to create separate
workgroups for encrypted and unencrypted queries.
In addition, you can override client-side settings (p. 366). Before the release of workgroups, you
could specify a results location and encryption options as parameters in the JDBC or ODBC driver, or in
the Properties tab in the Athena console. These settings could also be specified directly with the API
operations. These settings are known as "client-side settings". With workgroups, you can configure these
settings at the workgroup level and enforce control over them. This spares your users from setting them
individually. If you select Override Client-Side Settings, queries use the workgroup settings and
ignore the client-side settings.
If Override Client-Side Settings is selected, the user is notified on the console that their settings have
changed. If workgroup settings are enforced this way, users can omit the corresponding client-side
settings. In this case, if you run queries in the console, the workgroup's settings are used even if
queries specify client-side settings. Also, if you run queries in this workgroup through the command
line interface, API operations, or the drivers, any settings that you specified are overwritten by the
workgroup's settings. This affects the query results location and encryption. To check which settings are
used for the workgroup, view the workgroup's details (p. 370).
You can also set query limits (p. 375) for queries in workgroups.
• If Override client-side settings is not selected, workgroup settings are not enforced. In this case, for
all queries that run in this workgroup, Athena uses the client-side settings for query results location
and encryption. Each user can specify client-side settings in the Settings menu on the console. If the
client-side settings are not used, the workgroup-wide settings apply, but are not enforced. Also, if you
run queries in this workgroup through the API operations, the command line interface, or the JDBC
and ODBC drivers, and specify your query results location and encryption there, your queries continue
using those settings.
• If Override client-side settings is selected, Athena uses the workgroup-wide settings for query results
location and encryption. It also overrides any other settings that you specified for the query in the
console, by using the API operations, or with the drivers. This affects you only if you run queries in this
workgroup. If you do, workgroup settings are used.
If you override client-side settings, then the next time that you or any workgroup user opens the Athena
console, a notification dialog box displays. It notifies you that queries in this workgroup use the
workgroup's settings, and prompts you to acknowledge this change.
Important
If you run queries through the API operations, the command line interface, or the JDBC and
ODBC drivers, and have not updated your settings to match those of the workgroup, your
queries run, but use the workgroup's settings. For consistency, we recommend that you omit
client-side settings in this case or update your query settings to match the workgroup's
settings for the results location and encryption. To check which settings are used for the
workgroup, view the workgroup's details (p. 370).
Managing Workgroups
In the Athena console (https://console.aws.amazon.com/athena/), you can perform the following tasks:
Task Description
Edit a Workgroup (p. 369) – Edit a workgroup and change its settings. You cannot change a workgroup's name, but you can create a new workgroup with the same settings and a different name.
View the Workgroup's Details (p. 370) – View the workgroup's details, such as its name, description, data usage limits, location of query results, and encryption. You can also verify whether the workgroup enforces its settings by checking whether Override client-side settings is selected.
Enable and Disable a Workgroup (p. 372) – Enable or disable a workgroup. When a workgroup is disabled, its users cannot run queries or create new named queries. If you have access to it, you can still view metrics, data usage limit controls, workgroup settings, query history, and saved queries.
Specify a Workgroup in Which to Run Queries (p. 372) – Before you can run queries, you must specify to Athena which workgroup to use. You must have permissions to the workgroup.
Create a Workgroup
Creating a workgroup requires permissions for the CreateWorkGroup API action. See Access to Athena
Workgroups (p. 285) and IAM Policies for Accessing Workgroups (p. 361). If you are adding tags, you
also need permissions for the TagResource action. See Tag Policy Examples for Workgroups (p. 390).
When you create a workgroup, specify the following fields:
Workgroup name – Required. Enter a unique name for your workgroup. Use 1 - 128 characters (A-Z, a-z, 0-9, _, -, .). This name cannot be changed.
Query result location – Optional. Enter a path to an Amazon S3 bucket or prefix. This bucket and prefix must exist before you specify them.
Note
If you run queries in the console, specifying the query results location is optional. If you don't specify it for the workgroup or in Settings, Athena uses the default query result location. If you run queries with the API or the drivers, you must specify the query results location in at least one of two places: for individual queries, with OutputLocation, or for the workgroup, with WorkGroupConfiguration.
Encrypt query results – Optional. Encrypt results stored in Amazon S3. If selected, all queries in the workgroup are encrypted. You can then select the Encryption type and the Encryption key, or enter the KMS Key ARN. If you don't have a key, open the AWS KMS console to create one. For more information, see Creating Keys in the AWS Key Management Service Developer Guide.
Update query engine – Choose how you want to update your workgroup when a new Athena engine version is released. You can let Athena decide when to update your workgroup or manually choose an engine version. For more information, see Athena Engine Versioning (p. 395).
Override client-side settings – This field is unselected by default. If you select it, workgroup settings apply to all queries in the workgroup and override client-side settings. For more information, see Workgroup Settings Override Client-Side Settings (p. 366).
Tags – Optional. Add one or more tags to the workgroup. A tag is a label that you assign to an Athena workgroup resource. It consists of a key and a value. Use AWS tagging best practices to create a consistent set of tags and categorize workgroups by purpose, owner, or environment. You can also use tags in IAM policies and to control billing costs. Do not use duplicate tag keys in the same workgroup. For more information, see Tagging Resources (p. 385).
4. Choose Create workgroup. The workgroup appears in the list in the Workgroups panel.
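You can also create a workgroup programmatically. The following AWS CLI sketch creates a workgroup with a query result location, enforced workgroup settings, metrics publishing, and a tag; the workgroup name, bucket, and tag are example values, and the shorthand syntax shown is one possible way to express WorkGroupConfiguration.
aws athena create-work-group \
    --name DataAnalystWorkgroup \
    --description "Workgroup for the data analyst team" \
    --configuration "ResultConfiguration={OutputLocation=s3://athena-results-example/analysts/},EnforceWorkGroupConfiguration=true,PublishCloudWatchMetricsEnabled=true" \
    --tags Key=Team,Value=Analysts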
Edit a Workgroup
Editing a workgroup requires permissions for the UpdateWorkGroup API operation. See Access to
Athena Workgroups (p. 285) and IAM Policies for Accessing Workgroups (p. 361). If you are adding
or editing tags, you also need permissions for the TagResource action. See Tag Policy Examples for
Workgroups (p. 390).
2. In the Workgroups panel, choose the workgroup that you want to edit. The View details panel for
the workgroup displays, with the Overview tab selected.
3. Choose Edit workgroup.
4. Change the fields as needed. For the list of fields, see Create workgroup (p. 368). You can change
all fields except for the workgroup's name. If you need to change the name, create another
workgroup with the new name and the same settings.
5. Choose Save. The updated workgroup appears in the list in the Workgroups panel.
View the Workgroup's Details
• In the Workgroups panel, choose the workgroup whose details you want to view. The View details
panel for the workgroup displays, with the Overview tab selected, and shows the workgroup's details.
Delete a Workgroup
You can delete a workgroup if you have permissions to do so. The primary workgroup cannot be deleted.
If you have permissions, you can delete an empty workgroup at any time. You can also delete a
workgroup that contains saved queries. In this case, before proceeding to delete a workgroup, Athena
warns you that saved queries are deleted.
If you delete a workgroup while you are in it, the console switches focus to the primary workgroup. If you
have access to it, you can run queries and view its settings.
If you delete a workgroup, its settings and per-query data limit controls are deleted. The workgroup-wide
data limit controls remain in CloudWatch, and you can delete them there if needed.
Important
Before deleting a workgroup, ensure that its users also belong to other workgroups where
they can continue to run queries. If the users' IAM policies allowed them to run queries only in
this workgroup, and you delete it, they no longer have permissions to run queries. For more
information, see Example Policy for Running Queries in the Primary Workgroup (p. 364).
To delete a workgroup with the API operation, use the DeleteWorkGroup action.
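The equivalent AWS CLI command might look like the following sketch; the workgroup name is a placeholder, and the --recursive-delete-option flag deletes the workgroup even if it still contains named queries or query history.
aws athena delete-work-group \
    --work-group TestWorkgroup \
    --recursive-delete-option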
Specify a Workgroup in Which to Run Queries
You can open up to ten query tabs within each workgroup. When you switch between workgroups, your
query tabs remain open for up to three workgroups.
3. Choose Switch. The console shows the Workgroup: <workgroup_name> tab with the name of the
workgroup that you switched to. You can now run queries in this workgroup.
1. Make sure your permissions allow you to run queries in a workgroup that you intend to use. For more
information, see the section called “ IAM Policies for Accessing Workgroups” (p. 361).
2. To specify the workgroup to Athena, use one of these options:
• If you are accessing Athena via the console, set the workgroup by switching workgroups (p. 371).
• If you are using the Athena API operations, specify the workgroup name in the API action. For
example, you can set the workgroup name in StartQueryExecution, as follows:
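A minimal AWS CLI sketch of this call follows; the query, database, workgroup name, and output location are placeholder values. If the workgroup enforces a query results location, you can omit --result-configuration.
aws athena start-query-execution \
    --query-string "SELECT * FROM cloudfront_logs LIMIT 10" \
    --query-execution-context Database=mydatabase \
    --work-group "test_workgroup" \
    --result-configuration OutputLocation=s3://athena-results-example/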
• If you are using the JDBC or ODBC driver, set the workgroup name in the connection string using
the Workgroup configuration parameter. The driver passes the workgroup name to Athena.
Specify the workgroup parameter in the connection string as in the following example:
jdbc:awsathena://AwsRegion=<AWSREGION>;UID=<ACCESSKEY>;
PWD=<SECRETKEY>;S3OutputLocation=s3://<athena-output>-<AWSREGION>/;
Workgroup=<WORKGROUPNAME>;
For more information, search for "Workgroup" in the driver documentation link included in JDBC
Driver Documentation (p. 84).
Athena Workgroup APIs
The following Athena API operations are used for workgroups:
• CreateWorkGroup
• DeleteWorkGroup
• GetWorkGroup
• ListWorkGroups
• UpdateWorkGroup
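Each of these operations also has an AWS CLI counterpart. For example, the following command (a minimal sketch) lists the workgroups in your account:
aws athena list-work-groups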
Troubleshooting Workgroups
Use the following tips to troubleshoot workgroups.
• Check permissions for individual users in your account. They must have access to the location for query
results, and to the workgroup in which they want to run queries. If they want to switch workgroups,
they too need permissions to both workgroups. For information, see IAM Policies for Accessing
Workgroups (p. 361).
• Pay attention to the context in the Athena console, to see in which workgroup you are going to run
queries. If you use the driver, make sure to set the workgroup to the one you need. For information,
see the section called “Specify a Workgroup in Which to Run Queries” (p. 372).
• If you use the API or the drivers to run queries, you must specify the query results location using one
of the following ways: for individual queries, use OutputLocation (client-side). In the workgroup, use
WorkGroupConfiguration. If the location is not specified in either way, Athena issues an error at query
runtime.
• If you override client-side settings with workgroup settings, you may encounter errors with query
result location. For example, a workgroup's user may not have permissions to the workgroup's location
in Amazon S3 for storing query results. In this case, add the necessary permissions.
• Workgroups introduce changes in the behavior of the API operations. Calls to the following existing
API operations require that users in your account have resource-based permissions in IAM to the
workgroups in which they make them. If no permissions to the workgroup and to workgroup
actions exist, the following API actions throw AccessDeniedException: CreateNamedQuery,
DeleteNamedQuery, GetNamedQuery, ListNamedQueries, StartQueryExecution,
StopQueryExecution, ListQueryExecutions, GetQueryExecution, GetQueryResults, and
GetQueryResultsStream (this API action is only available for use with the driver and is not exposed
otherwise for public use). For more information, see Actions, Resources, and Condition Keys for
Amazon Athena in the Service Authorization Reference.
You may see the following errors. This table provides a list of some of the errors related to workgroups
and suggests solutions.
Workgroup errors
• "query state CANCELED. Bytes scanned limit was exceeded." – The query hit a per-query data limit and
was canceled. Consider rewriting the query so that it reads less data, or contact your account
administrator.
• "INVALID_INPUT. WorkGroup <name> is not found." – A user ran a query in a workgroup that does not
exist. This can happen if the workgroup was deleted. Switch to another workgroup to run your query.
• "InvalidRequestException: when calling the StartQueryExecution operation: No output location
provided. An output location is required either through the Workgroup result configuration setting or
as an API input." – A user ran a query with the API without specifying a location for query results. Set
the output location for query results in one of two ways: for individual queries, using OutputLocation
(client-side), or in the workgroup, using WorkGroupConfiguration.
• "The Create Table As Select query failed because it was submitted with an 'external_location' property
to an Athena Workgroup that enforces a centralized output location for all queries." – The workgroup
in which the query runs is configured with an enforced query results location (p. 366), and the CTAS
query specifies an external_location. Remove the external_location property and rerun the query.
Controlling Costs and Monitoring Queries with CloudWatch Metrics and Events
• Configure Data usage controls per query and per workgroup, and establish actions that will be taken if
queries breach the thresholds.
• View and analyze query metrics, and publish them to CloudWatch. If you create a workgroup in the
console, the setting for publishing the metrics to CloudWatch is selected for you. If you use the API
operations, you must enable publishing the metrics (p. 375). When metrics are published, they are
displayed under the Metrics tab in the Workgroups panel. Metrics are disabled by default for the
primary workgroup.
Topics
• Enabling CloudWatch Query Metrics (p. 375)
• Monitoring Athena Queries with CloudWatch Metrics (p. 376)
• Monitoring Athena Queries with CloudWatch Events (p. 379)
• Setting Data Usage Control Limits (p. 381)
Enabling CloudWatch Query Metrics
If you use API operations, the command line interface, or the client application with the JDBC driver to
create workgroups, to enable publishing of query metrics, set PublishCloudWatchMetricsEnabled
to true in WorkGroupConfiguration. The following example shows only the metrics configuration and
omits other configuration:
"WorkGroupConfiguration": {
"PublishCloudWatchMetricsEnabled": "true"
....
}
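For example, the following AWS CLI sketch enables metrics publishing on an existing workgroup; the workgroup name is a placeholder:
aws athena update-work-group \
    --work-group DataAnalystWorkgroup \
    --configuration-updates PublishCloudWatchMetricsEnabled=true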
Monitoring Athena Queries with CloudWatch Metrics
When you enable query metrics for queries in workgroups, the metrics are displayed within the Metrics
tab in the Workgroups panel, for each workgroup in the Athena console.
• EngineExecutionTime – in milliseconds
• ProcessedBytes – the total amount of data scanned per DML query
• QueryPlanningTime – in milliseconds
• QueryQueueTime – in milliseconds
• ServiceProcessingTime – in milliseconds
• TotalExecutionTime – in milliseconds, for DDL and DML queries
For more information, see the List of CloudWatch Metrics and Dimensions for Athena (p. 378) later in
this topic.
To view a workgroup's metrics, you don't need to switch to it and can remain in another workgroup.
You do need to select the workgroup from the list. You also must have permissions to view its
metrics.
3. Select the workgroup from the list, and then choose View details. If you have permissions, the
workgroup's details display in the Overview tab.
4. Choose the Metrics tab.
Note
If you just recently enabled metrics for the workgroup and/or there has been no recent
query activity, the graphs on the dashboard may be empty. Query activity is retrieved from
CloudWatch depending on the interval that you specify in the next step.
5. Choose the metrics interval that Athena should use to fetch the query metrics from CloudWatch, or
specify a custom interval.
7. Click the down arrow next to the refresh icon to choose the Auto refresh option and a refresh
interval for the metrics display.
ProcessedBytes The amount of data in megabytes that Athena scanned per DML
query. For queries that were canceled (either by the users, or
automatically, if they reached the limit), this includes the amount
of data scanned before the cancellation time. This metric is not
reported for DDL or CTAS queries.
QueryPlanningTime The number of milliseconds that Athena took to plan the query
processing flow. This includes the time spent retrieving table
partitions from the data source. Note that because the query engine
performs the query planning, query planning time is a subset of
EngineExecutionTime.
QueryQueueTime The number of milliseconds that the query was in the query queue
waiting for resources. Note that if transient errors occur, the query
can be automatically added back to the queue.
Dimension Description
Valid statistics: QUEUED, RUNNING, SUCCEEDED, FAILED, or
CANCELLED.
Note
Athena automatically retries your queries in cases of certain
transient errors. As a result, you may see the query state
transition from RUNNING or FAILED to QUEUED.
Monitoring Athena Queries with CloudWatch Events
Before you create event rules for Athena, you should do the following:
• Familiarize yourself with events, rules, and targets in CloudWatch Events. For more information, see
What Is Amazon CloudWatch Events? For more information about how to set up rules, see Getting
Started with CloudWatch Events.
• Create the target or targets to use in your event rules.
Note
Athena currently offers one type of event, Athena Query State Change, but may add other event
types and details. If you are programmatically deserializing event JSON data, make sure that
your application is prepared to handle unknown properties if additional properties are added.
{
"source":[
"aws.athena"
],
"detail-type":[
"Athena Query State Change"
],
"detail":{
"currentState":[
"SUCCEEDED"
]
}
}
{
"version":"0",
"id":"abcdef00-1234-5678-9abc-def012345678",
"detail-type":"Athena Query State Change",
"source":"aws.athena",
"account":"123456789012",
"time":"2019-10-06T09:30:10Z",
"region":"us-east-1",
"resources":[
],
"detail":{
"versionId":"0",
"currentState":"SUCCEEDED",
"previousState":"RUNNING",
"statementType":"DDL",
"queryExecutionId":"01234567-0123-0123-0123-012345678901",
"workgroupName":"primary",
"sequenceNumber":"3"
}
}
Output Properties
The JSON output includes the following properties.
Property Description
currentState The state that the query transitioned to at the time of the event.
previousState The state that the query transitioned from at the time of the event.
Example
The following example publishes events to an Amazon SNS topic to which you have subscribed. When
Athena is queried, you receive an email. The example assumes that the Amazon SNS topic exists and that
you have subscribed to it.
1. Create the target for your Amazon SNS topic. Give the CloudWatch Events Service Principal
events.amazonaws.com permission to publish to your Amazon SNS topic, as in the following
example.
{
"Effect":"Allow",
"Principal":{
"Service":"events.amazonaws.com"
},
"Action":"sns:Publish",
"Resource":"arn:aws:sns:us-east-1:111111111111:your-sns-topic"
}
2. Use the AWS CLI events put-rule command to create a rule for Athena events, as in the example
that follows these steps.
3. Use the AWS CLI events put-targets command to attach the Amazon SNS topic target to the
rule, as in the example that follows these steps.
4. Query Athena and observe the target being invoked. You should receive corresponding emails from
the Amazon SNS topic.
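The following AWS CLI sketch shows what steps 2 and 3 might look like; the rule name is a placeholder, the event pattern is the query state change pattern shown earlier, and the topic ARN matches the example Amazon SNS topic above.
aws events put-rule \
    --name athena-query-succeeded \
    --event-pattern '{"source":["aws.athena"],"detail-type":["Athena Query State Change"],"detail":{"currentState":["SUCCEEDED"]}}'

aws events put-targets \
    --rule athena-query-succeeded \
    --targets Id=1,Arn=arn:aws:sns:us-east-1:111111111111:your-sns-topic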
Setting Data Usage Control Limits
• The per-query control limit specifies the total amount of data scanned per query. If any query that
runs in the workgroup exceeds the limit, it is canceled. You can create only one per-query control limit
in a workgroup and it applies to each query that runs in it. Edit the limit if you need to change it. For
detailed steps, see To create a per-query data usage control (p. 382).
• The workgroup-wide data usage control limit specifies the total amount of data scanned for all
queries that run in this workgroup during the specified time period. You can create multiple limits per
workgroup. The workgroup-wide query limit allows you to set multiple thresholds on hourly or daily
aggregates on data scanned by queries running in the workgroup.
If the aggregate amount of data scanned exceeds the threshold, you can choose to take one of the
following actions:
• Configure an Amazon SNS alarm and an action in the Athena console to notify an administrator
when the limit is breached. For detailed steps, see To create a per-workgroup data usage
control (p. 383). You can also create an alarm and an action on any metric that Athena publishes
from the CloudWatch console. For example, you can set an alert on a number of failed queries. This
alert can trigger an email to an administrator if the number crosses a certain threshold. If the limit is
exceeded, an action sends an Amazon SNS alarm notification to the specified users.
• Invoke a Lambda function. For more information, see Invoking Lambda functions using Amazon SNS
notifications in the Amazon Simple Notification Service Developer Guide.
• Disable the workgroup, stopping any further queries from running.
The per-query and per-workgroup limits are independent of each other. A specified action is taken
whenever either limit is exceeded. If two or more users run queries at the same time in the same
workgroup, it is possible that each query does not exceed any of the specified limits, but the total sum of
data scanned exceeds the data usage limit per workgroup. In this case, an Amazon SNS alarm is sent to
the user.
The per-query control limit specifies the total amount of data scanned per query. If any query that runs
in the workgroup exceeds the limit, it is canceled. Canceled queries are charged according to Amazon
Athena pricing.
Note
In the case of canceled or failed queries, Athena may have already written partial results to
Amazon S3. In such cases, Athena does not delete partial results from the Amazon S3 prefix
where results are stored. You must remove the Amazon S3 prefix with partial results. Athena
uses Amazon S3 multipart uploads to write data to Amazon S3. We recommend that you set
the bucket lifecycle policy to end multipart uploads in cases where queries fail. For more
information, see Aborting Incomplete Multipart Uploads Using a Bucket Lifecycle Policy in the
Amazon Simple Storage Service Developer Guide.
You can create only one per-query control limit in a workgroup and it applies to each query that runs in
it. Edit the limit if you need to change it.
To create a data usage control for a query in a particular workgroup, you don't need to switch to it
and can remain in another workgroup. You do need to select the workgroup from the list and have
permissions to edit the workgroup.
3. Select the workgroup from the list, and then choose View details. If you have permissions, the
workgroup's details display in the Overview tab.
4. Choose the Data usage controls tab.
5. In the Per query data usage control section, specify the field values, as follows:
• For Data limits, specify a value between 10000 KB (minimum) and 7000000 TB (maximum).
Note
These are limits imposed by the console for data usage controls within workgroups. They
do not represent any query limits in Athena.
• For units, select the unit value from the drop-down list (KB, MB, GB, or TB).
• The default Action is to cancel the query if it exceeds the limit. This setting cannot be changed.
6. Choose Create if you are creating a new limit, or Update if you are editing an existing limit. If you
are editing an existing limit, refresh the Overview tab to see the updated limit.
The workgroup-wide data usage control limit specifies the total amount of data scanned for all queries
that run in this workgroup during the specified time period. You can create multiple control limits per
workgroup. If the limit is exceeded, you can choose to take action, such as send an Amazon SNS alarm
notification to the specified users.
To create a data usage control for a particular workgroup, you don't need to switch to it and
can remain in another workgroup. You do need to select the workgroup from the list and have
permissions to edit the workgroup.
3. Select the workgroup from the list, and then choose View details. If you have edit permissions, the
workgroup's details display in the Overview tab.
4. Choose the Data usage controls tab, and scroll down. Then choose Workgroup data usage controls
to create a new limit or edit an existing limit. The Create workgroup data usage control dialog
displays.
• For Data limits, specify a value between 10 MB (minimum) and 7000000 TB (maximum).
Note
These are limits imposed by the console for data usage controls within workgroups. They
do not represent any query limits in Athena.
• For units, select the unit value from the drop-down list.
• For time period, choose a time period from the drop-down list.
• For Action, choose an Amazon SNS topic from the drop-down list, if you have one configured.
Or, choose Create an Amazon SNS topic to go directly to the Amazon SNS console, create the
Amazon SNS topic, and set up a subscription for it for one of the users in your Athena account. For
more information, see Creating an Amazon SNS Topic in the Amazon Simple Notification Service
Getting Started Guide.
6. Choose Create if you are creating a new limit, or Save if you are editing an existing limit. If you are
editing an existing limit, refresh the Overview tab for the workgroup to see the updated limit.
Tagging Resources
A tag consists of a key and a value, both of which you define. When you tag an Athena resource, you
assign custom metadata to it. You can use tags to categorize your AWS resources in different ways; for
example, by purpose, owner, or environment. In Athena, workgroups and data catalogs are taggable
resources. For example, you can create a set of tags for workgroups in your account that helps you track
workgroup owners, or identify workgroups by their purpose. We recommend that you use AWS
tagging best practices to create a consistent set of tags that meets your organization's requirements.
You can work with tags using the Athena console or the API operations.
Topics
• Tag Basics (p. 385)
• Tag Restrictions (p. 385)
• Working with Tags on Workgroups in the Console (p. 386)
• Using Tag Operations (p. 387)
• Tag-Based IAM Access Control Policies (p. 390)
Tag Basics
A tag is a label that you assign to an Athena resource. Each tag consists of a key and an optional value,
both of which you define.
Tags enable you to categorize your AWS resources in different ways. For example, you can define a set of
tags for your account's workgroups that helps you track each workgroup owner or purpose.
You can add tags when creating a new Athena workgroup or data catalog, or you can add, edit, or
remove tags from them. You can edit a tag in the console. To use API operations to edit a tag, remove the
old tag and add a new one. If you delete a resource, any tags for the resource are also deleted.
Athena does not automatically assign tags to your resources. You can edit tag keys and values, and you
can remove tags from a resource at any time. You can set the value of a tag to an empty string, but you
can't set the value of a tag to null. Do not add duplicate tag keys to the same resource. If you do, Athena
issues an error message. If you use the TagResource action to tag a resource using an existing tag key,
the new tag value overwrites the old value.
In IAM, you can control which users in your AWS account have permission to create, edit, remove, or list
tags. For more information, see Tag-Based IAM Access Control Policies (p. 390).
For a complete list of Amazon Athena tag actions, see the API action names in the Amazon Athena API
Reference.
You can use tags for billing. For more information, see Using Tags for Billing in the AWS Billing and Cost
Management User Guide.
Tag Restrictions
Tags have the following restrictions:
• In Athena, you can tag workgroups and data catalogs. You cannot tag queries.
• The maximum number of tags per resource is 50. To stay within the limit, review and delete unused
tags.
• For each resource, each tag key must be unique, and each tag key can have only one value. Do not add
duplicate tag keys at the same time to the same resource. If you do, Athena issues an error message.
If you tag a resource using an existing tag key in a separate TagResource action, the new tag value
overwrites the old value.
• Tag key length is 1-128 Unicode characters in UTF-8.
• Tag value length is 0-256 Unicode characters in UTF-8.
• Tagging operations, such as adding, editing, removing, or listing tags, require that you specify an ARN
for the workgroup resource.
• Athena allows you to use letters, numbers, spaces represented in UTF-8, and the following characters:
+ - = . _ : / @.
• Tag keys and values are case-sensitive.
• The "aws:" prefix in tag keys is reserved for AWS use. You can't edit or delete tag keys with this prefix.
Tags with this prefix do not count against your per-resource tags limit.
• The tags you assign are available only to your AWS account.
Working with Tags on Workgroups in the Console
Topics
• Displaying Tags for Individual Workgroups (p. 386)
• Adding and Deleting Tags on an Individual Workgroup (p. 386)
To view a list of tags for a workgroup, select the workgroup, choose View Details, and then choose
the Tags tab. The list of tags for the workgroup displays. You can also view tags on a workgroup if you
choose Edit Workgroup.
To search for tags, choose the Tags tab, and then enter a tag name into the search tool.
1. Open the Athena console, and then choose the Workgroups tab.
2. In the workgroup list, select the workgroup, and then choose View details.
3. Do one of the following:
TagResource (CLI: tag-resource) – Add or overwrite one or more tags on the resource that has the specified ARN.
UntagResource (CLI: untag-resource) – Delete one or more tags from the resource that has the specified ARN.
ListTagsForResource (CLI: list-tags-for-resource) – List one or more tags for the resource that has the specified ARN.
To add tags when you create a workgroup or data catalog, use the tags parameter with the
CreateWorkGroup or CreateDataCatalog API operations or with the AWS CLI create-work-group
or create-data-catalog commands.
Example TagResource
The following example adds two tags to the workgroup workgroupA:
client.tagResource(request);
The following example adds two tags to the data catalog datacatalogA:
client.tagResource(request);
Note
Do not add duplicate tag keys to the same resource. If you do, Athena issues an error message.
If you tag a resource using an existing tag key in a separate TagResource action, the new tag
value overwrites the old value.
Example UntagResource
The following example removes tagKey2 from the workgroup workgroupA:
client.untagResource(request);
The following example removes tagKey2 from the data catalog datacatalogA:
.withTagKeys(tagKeys);
client.untagResource(request);
Example ListTagsForResource
The following example lists tags for the workgroup workgroupA:
The following example lists tags for the data catalog datacatalogA:
Syntax
The --resource-arn parameter specifies the resource to which the tags are added. The --tags
parameter specifies a list of space-separated key-value pairs to add as tags to the resource.
Example
The following example adds tags to the mydatacatalog data catalog.
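A sketch of the command might look like the following; the account ID is a placeholder, and the tag keys and values match the tags listed later in this section.
aws athena tag-resource \
    --resource-arn arn:aws:athena:us-east-1:123456789012:datacatalog/mydatacatalog \
    --tags Key=Color,Value=Orange Key=Time,Value=Now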
For information on adding tags when using the create-data-catalog command, see Registering a
Catalog: create-data-catalog (p. 60).
Syntax
The --resource-arn parameter specifies the resource for which the tags are listed.
The following example lists the tags for the mydatacatalog data catalog.
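A command like the following sketch (the account ID is a placeholder) returns output similar to the JSON shown below:
aws athena list-tags-for-resource \
    --resource-arn arn:aws:athena:us-east-1:123456789012:datacatalog/mydatacatalog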
{
"Tags": [
{
"Key": "Time",
"Value": "Now"
},
{
"Key": "Color",
"Value": "Orange"
}
]
}
Syntax
The --resource-arn parameter specifies the resource from which the tags are removed. The --tag-
keys parameter takes a space-separated list of key names. For each key name specified, the untag-
resource command removes both the key and its value.
The following example removes the Color and Time keys and their values from the mydatacatalog
catalog resource.
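A sketch of the command might look like the following; the account ID is a placeholder.
aws athena untag-resource \
    --resource-arn arn:aws:athena:us-east-1:123456789012:datacatalog/mydatacatalog \
    --tag-keys Color Time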
Tag Policy Examples for Workgroups
The following IAM policy allows a user to run queries and interact with tags for the workgroup named
workgroupA:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:ListWorkGroups",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetCatalogs",
"athena:GetNamespaces",
"athena:GetTables",
"athena:GetTable"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:GetQueryExecutions",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution",
"athena:GetWorkGroup",
"athena:TagResource",
"athena:UntagResource",
"athena:ListTagsForResource"
],
"Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA"
}
]
}
Example 2: Policy Block that Denies Actions on a Workgroup Based on a Tag Key and Tag
Value Pair
Tags that are associated with a resource like a workgroup are referred to as resource tags. Resource tags
let you write policy blocks like the following that deny the listed actions on any workgroup tagged with a
key-value pair like stack, production.
{
"Effect": "Deny",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:UpdateWorkGroup",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:GetWorkGroup",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:GetQueryExecutions",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution",
"athena:TagResource",
"athena:UntagResource",
"athena:ListTagsForResource"
],
"Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/stack": "production"
}
}
}
Example 3. Policy Block that Restricts Tag-Changing Action Requests to Specified Tags
Tags that are passed in as parameters to operations that change tags (for example, TagResource,
UntagResource, or CreateWorkGroup with tags) are referred to as request tags. The following
example policy block allows the CreateWorkGroup operation only if one of the tags passed has the key
costcenter and the value 1, 2, or 3.
Note
If you want to allow IAM users to pass in tags as part of a CreateWorkGroup operation, make
sure that you give the users permissions to the TagResource and CreateWorkGroup actions.
{
"Effect": "Allow",
"Action": [
"athena:CreateWorkGroup",
"athena:TagResource"
],
"Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/*",
"Condition": {
"StringEquals": {
"aws:RequestTag/costcenter": [
"1",
"2",
"3"
]
}
}
}
Tag Policy Examples for Data Catalogs
The following IAM policy allows you to interact with tags for the data catalog named datacatalogA:
{
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"athena:ListWorkGroups",
"athena:ListDataCatalogs",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetNamespaces",
"athena:GetTables",
"athena:GetTable"
],
"Resource":"*"
},
{
"Effect":"Allow",
"Action":[
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:GetQueryExecutions",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution",
"athena:GetWorkGroup",
"athena:TagResource",
"athena:UntagResource",
"athena:ListTagsForResource"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/*"
]
},
{
"Effect":"Allow",
"Action":[
"athena:CreateDataCatalog",
"athena:DeleteDataCatalog",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDatabases",
"athena:ListTableMetadata",
"athena:UpdateDataCatalog",
"athena:TagResource",
"athena:UntagResource",
"athena:ListTagsForResource"
],
"Resource":"arn:aws:athena:us-east-1:123456789012:datacatalog/datacatalogA"
}
]
}
Example 2: Policy Block that Denies Actions on a Data Catalog Based on a Tag Key and Tag
Value Pair
You can use resource tags to write policy blocks that deny specific actions on data catalogs that are
tagged with specific tag key-value pairs. The following example policy denies actions on data catalogs
that have the tag key-value pair stack, production.
{
"Effect":"Deny",
"Action":[
"athena:CreateDataCatalog",
"athena:DeleteDataCatalog",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDatabases",
"athena:ListTableMetadata",
"athena:UpdateDataCatalog",
"athena:StartQueryExecution",
"athena:TagResource",
"athena:UntagResource",
"athena:ListTagsForResource"
],
"Resource":"arn:aws:athena:us-east-1:123456789012:datacatalog/*",
"Condition":{
"StringEquals":{
"aws:ResourceTag/stack":"production"
}
}
}
Example 3. Policy Block that Restricts Tag-Changing Action Requests to Specified Tags
Tags that are passed in as parameters to operations that change tags (for example, TagResource,
UntagResource, or CreateDataCatalog with tags) are referred to as request tags. The following
example policy block allows the CreateDataCatalog operation only if one of the tags passed has the
key costcenter and the value 1, 2, or 3.
Note
If you want to allow IAM users to pass in tags as part of a CreateDataCatalog operation,
make sure that you give the users permissions to the TagResource and CreateDataCatalog
actions.
{
"Effect":"Allow",
"Action":[
"athena:CreateDataCatalog",
"athena:TagResource"
],
"Resource":"arn:aws:athena:us-east-1:123456789012:datacatalog/*",
"Condition":{
"StringEquals":{
"aws:RequestTag/costcenter":[
"1",
"2",
"3"
]
}
}
}
Athena Engine Versioning
Engine versioning is configured per workgroup (p. 358). You can use workgroups to control which query
engine your queries use. The query engine that is in use is shown in the query editor, on the details page
for the workgroup, and by the Athena APIs.
You can choose to upgrade your workgroups as soon as a new engine is available or continue using
the older version until it is no longer supported. You can also let Athena decide when to upgrade your
workgroups. This is the default setting. If you take no action, Athena notifies you in advance of
upgrading your workgroups. If you let Athena decide, Athena upgrades your workgroups for you unless it
finds incompatibilities.
When you start using a new engine version, a small subset of queries may break due to incompatibilities.
You can use workgroups to test your queries in advance of the upgrade by creating a test workgroup
that uses the new engine or by test upgrading an existing workgroup. For more information, see Testing
Queries in Advance of an Engine Version Upgrade (p. 399).
Topics
• Changing Athena Engine Versions (p. 395)
• Athena Engine Version Reference (p. 399)
Changing Athena Engine Versions
Topics
• Finding the Query Engine Version for a Workgroup (p. 395)
• Changing the Engine Version (p. 396)
• Specifying the Engine Version When You Create a Workgroup (p. 398)
• Testing Queries in Advance of an Engine Version Upgrade (p. 399)
• Troubleshooting Queries That Fail (p. 399)
You can also use the Workgroups page to find the current engine version for any workgroup.
The engine version is shown in the Query engine version column for the workgroup.
2. In the list of workgroups, choose the workgroup that you want to configure.
3. Choose View details.
4. Choose Edit workgroup.
5. Under Query engine version, for Update query engine, choose Let Athena choose when to
upgrade your workgroup. This is the default setting.
6. Choose Save.
The workgroup's Query engine update status is set to Pending automatic upgrade. When the
update occurs, Athena will notify you in the Athena console and on your AWS Personal Health
Dashboard. The workgroup continues to use the current engine version until the update.
The Query engine update status for the workgroup shows Manually set.
• Choose Let Athena choose when to upgrade your workgroup. This is the default setting.
• Choose Manually choose an engine version now, and then choose an engine version.
4. Enter information for the other fields as necessary. For information about the other fields, see
Create a Workgroup (p. 368).
5. Choose Create workgroup.
Testing Queries in Advance of an Engine Version Upgrade
1. Verify the engine version of the workgroup that you are using. The engine version that you are using
is displayed in the Athena Query Editor and on the Workgroups page. For more information, see
Finding the Query Engine Version for a Workgroup (p. 395).
2. Create a test workgroup that uses the new engine version. For more information, see Specifying the
Engine Version When You Create a Workgroup (p. 398).
3. Use the new workgroup to run the queries that you want to test.
4. If a query fails, use the Athena Engine Version Reference (p. 399) to check for breaking changes
that might be affecting the query. Some changes may require you to update the syntax of your
queries.
5. If your queries still fail, contact AWS Support for assistance. In the AWS Management Console,
choose Support, Support Center, or visit the Amazon Athena Forum.
Troubleshooting Queries That Fail
If your queries fail, contact AWS Support for assistance. In the AWS Management Console, choose
Support, Support Center, or visit the Amazon Athena Forum.
Athena engine version 2
Datatype Enhancements
• INT for INTEGER – Added support for INT as an alias for the INTEGER data type.
• INTERVAL types – Added support for casting to INTERVAL types.
• IPADDRESS – Added a new IPADDRESS type to represent IP addresses. Added support for casting
between the VARBINARY type and IPADDRESS type.
• IS DISTINCT FROM – Added IS DISTINCT FROM support for the JSON and IPADDRESS types.
• Null equality checks – Equality checks for null values in ARRAY, MAP, and ROW data structures are now
supported. For example, the expression ARRAY ['1', '3', null] = ARRAY ['1', '2', null]
returns false. Previously, a null element returned the error message comparison not supported.
• Row type coercion – Coercion between row types regardless of field names is now allowed. Previously,
a row type was coercible to another only if the field name in the source type matched the target type,
or when the target type had an anonymous field name.
• Time subtraction – Implemented subtraction for all TIME and TIMESTAMP types.
• Unicode – Added support for escaped Unicode sequences in string literals.
• VARBINARY concatenation – Added support for concatenation of VARBINARY values.
The following functions now accept additional input types. For more information about each function,
visit the corresponding link to the Presto documentation.
• approx_distinct() – The approx_distinct() function now supports the following types: INTEGER,
SMALLINT, TINYINT, DECIMAL, REAL, DATE, TIMESTAMP, TIMESTAMP WITH TIME ZONE, TIME, TIME
WITH TIME ZONE, IPADDRESS, and CHAR.
• avg(), sum() – The avg() and sum() aggregate functions now support the INTERVAL data type.
• lpad(), rpad() – The lpad and rpad functions now work on VARBINARY inputs.
• min(), max() – The min() and max() aggregation functions now allow unknown input types at query
analysis time so that you can use the functions with NULL literals.
• regexp_replace() – Variant of the regexp_replace() function added that can execute a Lambda function
for each replacement.
• sequence() – Added DATE variants for the sequence() function, including variant with an implicit one-
day step increment.
• ST_Area() – The ST_Area() geospatial function now supports all geometry types.
• substr() – The substr function now works on VARBINARY inputs.
• zip_with() – Arrays of mismatched length can now be used with zip_with(). Missing positions are filled
with null. Previously, an error was raised when arrays of differing lengths were passed. This change
may make it difficult to distinguish between values that were originally null from values that were
added to pad the arrays to the same length.
Added Functions
The following list contains functions that are new in Athena engine version 2. The list does not include
geospatial functions. For a list of geospatial functions, see New Geospatial Functions in Athena engine
version 2 (p. 194).
For more information about each function, visit the corresponding link to the Presto documentation.
Aggregate Functions
reduce_agg()
array_sort() - Variant of this function added that takes a Lambda function as a comparator.
ngrams()
from_big_endian_32()
from_ieee754_32()
from_ieee754_64()
hmac_md5()
hmac_sha1()
hmac_sha256()
hmac_sha512()
spooky_hash_v2_32()
spooky_hash_v2_64()
to_big_endian_32()
to_ieee754_32()
to_ieee754_64()
millisecond()
parse_duration()
to_milliseconds()
multimap_from_entries()
inverse_normal_cdf()
wilson_interval_lower()
wilson_interval_upper()
quantile digest functions and the qdigest quantile digest type added.
hamming_distance()
split_to_multimap()
Performance Improvements
Performance of the following features has improved in Athena engine version 2.
Query Performance
• Bucketed tables – Improved performance for writing to bucketed tables when the data being written
is already partitioned appropriately (for example, when the output is from a bucketed join).
• DISTINCT – Improved performance for some queries that use DISTINCT.
• Filter and projection operations – Filter and projection operations are now always processed by
columns if possible. The engine automatically takes advantage of dictionary encodings where
effective.
• Planning performance – Improved planning performance for queries that join multiple tables with a
large number of columns.
• Predicate evaluations – Improved predicate evaluation performance during predicate pushdown in
planning.
• Predicate pushdown support for casting – Support predicate pushdown for the <column> IN
<values list> predicate where values in the values list require casting to match the type of column.
• Predicate inference and pushdown – Predicate inference and pushdown extended for queries that use
a <symbol> IN <subquery> predicate.
Join Performance
• Joins with map columns – Improved the performance of joins and aggregations that include map
columns.
• Joins with solely non-equality conditions – Improved the performance of joins with only non-
equality conditions by using a nested loop join instead of a hash join.
• Outer joins – The join distribution type is now automatically selected for queries involving outer joins.
• Range over a function joins – Improved performance of joins where the condition is a range over a
function (for example, a JOIN b ON b.x < f(a.x) AND b.x > g(a.x)).
Subquery Performance
• Outer query filter propagation – Improved performance of correlated subqueries when filters from
the outer query can be propagated to the subquery.
Function Performance
Geospatial Performance
JSON-Related Improvements
Map Functions
• Improved performance of map subscript from O(n) to O(1) in all cases. Previously, only maps
produced by certain functions and readers took advantage of this improvement.
• Added the map_from_entries() and map_entries() functions.
Casting
• is_json_scalar()
Breaking Changes
Breaking changes include bug fixes, changes to geospatial functions, replaced functions, and the
introduction of limits. Improvements in ANSI SQL compliance may break queries that depended on non-
standard behavior.
Bug Fixes
The following changes correct behavioral issues that caused queries to run successfully, but with
inaccurate results.
• json_parse() no longer ignores trailing characters – Previously, inputs such as [1,2]abc would
successfully parse as [1,2]. Using trailing characters now produces the error message Cannot convert
'[1, 2]abc' to JSON.
• round() decimal precision corrected – round(x, d) now correctly rounds x when x is a DECIMAL or
when x is a DECIMAL with scale 0 and d is a negative integer. Previously, no rounding occurred in these
cases.
• round(x, d) and truncate(x, d) – The parameter d in the signature of functions round(x, d) and
truncate(x, d) is now of type INTEGER. Previously, d could be of type BIGINT.
• map() with duplicate keys – map() now raises an error on duplicate keys rather than silently
producing a corrupted map. Queries that currently construct map values using duplicate keys now fail
with an error.
• map_from_entries() raises an error with null entries – map_from_entries() now raises an error
when the input array contains a null entry. Queries that construct a map by passing NULL as a value
now fail.
• Tables – Tables that have unsupported partition types can no longer be created.
• Improved numerical stability in statistical functions – The numerical stability for the statistical
functions corr(), covar_samp(), regr_intercept(), and regr_slope() has been improved.
• Time zone information – Time zone information is now calculated using the java.time package of the
Java 1.8 SDK.
• SUM of INTERVAL_DAY_TO_SECOND and INTERVAL_YEAR_TO_MONTH datatypes – You can no
longer use SUM(NULL) directly. In order to use SUM(NULL), cast NULL to a data type like BIGINT,
DECIMAL, REAL, DOUBLE, INTERVAL_DAY_TO_SECOND or INTERVAL_YEAR_TO_MONTH.
• Function name changes – Some function names have changed. For more information, see Geospatial
Function Name Changes in Athena engine version 2 (p. 193).
• VARBINARY input – The VARBINARY type is no longer directly supported for input to geospatial
functions. For example, to calculate the area of a geometry directly, the geometry must now be input
in either VARCHAR or GEOMETRY format. The workaround is to use transform functions, as in the
following examples.
• To use ST_area() to calculate the area for VARBINARY input in Well-Known Binary (WKB) format,
pass the input to ST_GeomFromBinary() first, for example:
ST_area(ST_GeomFromBinary(<wkb_varbinary_value>))
• To use ST_area() to calculate the area for VARBINARY input in legacy binary format, pass the
same input to the ST_GeomFromLegacyBinary() function first, for example:
ST_area(ST_GeomFromLegacyBinary(<legacy_varbinary_value>))
• ST_ExteriorRing() and ST_Polygon() – ST_ExteriorRing() (p. 186) and ST_Polygon() (p. 183)
now accept only polygons as inputs. Previously, these functions erroneously accepted other
geometries.
• ST_Distance() – As required by the SQL/MM specification, the ST_Distance() (p. 188) function now
returns NULL if one of the inputs is an empty geometry. Previously, NaN was returned.
• Cast() operations – Cast() operations from REAL or DOUBLE to DECIMAL now conform to the
SQL standard. For example, cast (double '100000000000000000000000000000000' as
decimal(38)) previously returned 100000000000000005366162204393472 but now returns
100000000000000000000000000000000.
• JOIN ... USING – JOIN ... USING now conforms to standard SQL semantics. Previously, JOIN ...
USING required qualifying the table name in columns, and the column from both tables would be
present in the output. Table qualifications are now invalid and the column is present only once in the
output.
• ROW type literals removed – The ROW type literal format ROW<int, int>(1, 2) is no longer
supported. Use the syntax ROW(1 int, 2 int) instead.
• log() function – Previously, in violation of the SQL standard, the order of the arguments in the log()
function was reversed. This caused log() to return incorrect results when queries were translated to
or from other SQL implementations. The equivalent to log(x, b) is now correctly ln(x) / ln(b).
• Grouped aggregation semantics – Grouped aggregations use IS NOT DISTINCT FROM semantics
rather than equality semantics. Grouped aggregations now return correct results and show improved
performance when grouping on NaN floating point values. Grouping on map, list, and row types that
contain nulls is supported.
• Types with quotation marks are no longer allowed – In accordance with the ANSI SQL standard, data
types can no longer be enclosed in quotation marks. For example, SELECT "date" '2020-02-02' is
no longer a valid query. Instead, use the syntax SELECT date '2020-02-02'.
• Anonymous row field access – Anonymous row fields can no longer be accessed by using the syntax
[.field0, .field1, ...].
Replaced Functions
The following functions are no longer supported and have been replaced by syntax that produces the
same results.
Limits
The following limits were introduced in Athena engine version 2 to ensure that queries do not fail due to
resource limitations. These limits are not configurable by users.
• Number of result elements – The number of result elements n is restricted to 10,000 or less for the
following functions: min(col, n), max(col, n), min_by(col1, col2, n), and max_by(col1,
col2, n).
• GROUPING SETS – The maximum number of slices in a grouping set is 2048.
• Maximum text file line length – The default maximum line length for text files is 100 MB.
• Sequence function maximum result size – The maximum result size of a sequence function
is 50000 entries. For example, SELECT sequence(0,45000,1) succeeds, but SELECT
sequence(0,55000,1) fails with the error message The result of the sequence function must not
have more than 50000 entries. This limit applies to all input types for sequence functions, including
timestamps.
SerDe Reference
Athena supports several SerDe libraries for parsing data from different data formats, such as CSV, JSON,
Parquet, and ORC. Athena does not support custom SerDes.
Topics
• Using a SerDe (p. 408)
• Supported SerDes and Data Formats (p. 409)
• Compression Formats (p. 434)
Using a SerDe
A SerDe (Serializer/Deserializer) is a way in which Athena interacts with data in various formats.
It is the SerDe you specify, and not the DDL, that defines the table schema. In other words, the SerDe can
override the DDL configuration that you specify in Athena when you create your table.
• Use DDL statements to describe how to read and write data to the table, and do not specify a ROW
FORMAT, as in this example. Omitting the SerDe type means that the native LazySimpleSerDe is used
by default.
In general, Athena uses the LazySimpleSerDe if you do not specify a ROW FORMAT, or if you specify
ROW FORMAT DELIMITED.
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
ESCAPED BY '\\'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
• Explicitly specify the type of SerDe Athena should use when it reads and writes data to the table. Also,
specify additional properties in SERDEPROPERTIES, as in this example.
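A statement along these lines names the SerDe explicitly and sets its properties; the table name, columns, and bucket path are placeholders:

CREATE EXTERNAL TABLE my_table (
  id INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://mybucket/mycsv/';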
Supported SerDes and Data Formats
To create tables and query data in these formats in Athena, specify a serializer-deserializer class (SerDe)
so that Athena knows which format is used and how to parse the data.
This table lists the data formats supported in Athena and their corresponding SerDe libraries.
A SerDe is a custom library that tells the data catalog used by Athena how to handle the data. You
specify a SerDe type by listing it explicitly in the ROW FORMAT part of your CREATE TABLE statement in
Athena. In some cases, you can omit the SerDe name because Athena uses some SerDe types by default
for certain types of data formats.
• CSV (Comma-Separated Values) – For data in CSV, each line represents a data record, and each record
consists of one or more fields, separated by commas. Use the LazySimpleSerDe for CSV, TSV, and
Custom-Delimited Files (p. 424) if your data does not include values enclosed in quotes. Use the
OpenCSVSerDe for Processing CSV (p. 415) when your data includes quotes in values, or different
separator or escape characters.
• TSV (Tab-Separated Values) – For data in TSV, each line represents a data record, and each record
consists of one or more fields, separated by tabs. Use the LazySimpleSerDe for CSV, TSV, and
Custom-Delimited Files (p. 424) and specify the separator character as FIELDS TERMINATED BY '\t'.
• Custom-Delimited – For data in this format, each line represents a data record, and records are
separated by a custom single-character delimiter. Use the LazySimpleSerDe for CSV, TSV, and
Custom-Delimited Files (p. 424) and specify a custom single-character delimiter.
• JSON (JavaScript Object Notation) – For JSON data, each line represents a data record, and each
record consists of attribute–value pairs and arrays, separated by commas. Use the Hive JSON
SerDe (p. 421) or the OpenX JSON SerDe (p. 421).
• Apache Avro – A format for storing data in Hadoop that uses JSON-based schemas for record values.
Use the Avro SerDe (p. 410).
• ORC (Optimized Row Columnar) – A format for optimized columnar storage of Hive data. Use the
ORC SerDe (p. 429) and ZLIB compression.
• Apache Parquet – A format for columnar storage of data in Hadoop. Use the Parquet SerDe (p. 432)
and SNAPPY compression.
• Logstash logs – A format for storing logs in Logstash. Use the Grok SerDe (p. 418).
• Apache WebServer logs – A format for storing logs in Apache WebServer. Use the Grok SerDe (p. 418)
or Regex SerDe (p. 412).
Topics
• Avro SerDe (p. 410)
• Regex SerDe (p. 412)
• CloudTrail SerDe (p. 413)
• OpenCSVSerDe for Processing CSV (p. 415)
• Grok SerDe (p. 418)
• JSON SerDe Libraries (p. 420)
• LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files (p. 424)
• ORC SerDe (p. 429)
• Parquet SerDe (p. 432)
Avro SerDe
SerDe Name
Avro SerDe
Library Name
org.apache.hadoop.hive.serde2.avro.AvroSerDe
Examples
Athena does not support using avro.schema.url to specify table schema for security reasons. Use
avro.schema.literal. To extract schema from data in the Avro format, use the Apache
avro-tools-<version>.jar with the getschema parameter. This returns a schema that you can use in your
WITH SERDEPROPERTIES statement. For example:
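A command of this kind might look like the following; the jar version and file name are placeholders:

java -jar avro-tools-1.8.2.jar getschema my_data.avro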
The avro-tools-<version>.jar file is located in the java subdirectory of your installed Avro
release. To download Avro, see Apache Avro Releases. To download Apache Avro Tools directly, see the
Apache Avro Tools Maven Repository.
After you obtain the schema, use a CREATE TABLE statement to create an Athena table based on
underlying Avro data stored in Amazon S3. In ROW FORMAT, you must specify the Avro SerDe as follows:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'. As demonstrated
in the following example, you must specify the schema using the WITH SERDEPROPERTIES clause in
addition to specifying the column names and corresponding data types for the table.
Note
Replace myregion in s3://athena-examples-myregion/path/to/data/ with the region
identifier where you run Athena, for example, s3://athena-examples-us-west-1/path/
to/data/.
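A trimmed sketch of such a statement might look like the following. The table name, columns, and Avro schema are illustrative; the flight-data example that this note refers to defines many more columns.

CREATE EXTERNAL TABLE my_avro_table (
  id INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
  "type": "record",
  "name": "my_avro_table",
  "namespace": "default",
  "fields": [
    {"name": "id", "type": ["null", "int"], "default": null},
    {"name": "name", "type": ["null", "string"], "default": null}
  ]
}')
STORED AS AVRO
LOCATION 's3://athena-examples-myregion/path/to/data/';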
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata.
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
Regex SerDe
The Regex SerDe uses a regular expression (regex) to deserialize data by extracting regex groups into
table columns.
If a row in the data does not match the regex, then all columns in the row are returned as NULL. If a row
matches the regex but has fewer groups than expected, the missing groups are NULL. If a row in the data
matches the regex but has more columns than groups in the regex, the additional columns are ignored.
For more information, see Class RegexSerDe in the Apache Hive documentation.
SerDe Name
RegexSerDe
Library Name
RegexSerDe
Examples
The following example creates a table from CloudFront logs using the RegExSerDe. Replace myregion in
s3://athena-examples-myregion/cloudfront/plaintext/ with the region identifier where you
run Athena (for example, s3://athena-examples-us-west-1/cloudfront/plaintext/).
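A shortened sketch of such a statement might look like the following. The column names are illustrative, the regular expression contains one capture group per column, and the full example defines additional columns for the remaining CloudFront log fields.

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  log_date STRING,
  log_time STRING,
  location STRING,
  bytes BIGINT,
  request_ip STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+).*$"
)
LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';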
CloudTrail SerDe
AWS CloudTrail is a service that records AWS API calls and events for AWS accounts. CloudTrail generates
encrypted logs and stores them in Amazon S3. You can use Athena to query these logs directly from
Amazon S3, specifying the LOCATION of logs.
To query CloudTrail logs in Athena, create a table from the logs and use the CloudTrail SerDe to deserialize
the log data.
In addition to using the CloudTrail SerDe, instances exist where you need to use a different SerDe or to
extract data from JSON. Certain fields in CloudTrail logs are STRING values that may have a variable
data format, which depends on the service. As a result, the CloudTrail SerDe is unable to predictably
deserialize them. To query the following fields, identify the data pattern and then use a different
SerDe, such as the OpenX JSON SerDe (p. 421). Alternatively, to get data out of these fields, use
JSON_EXTRACT functions. For more information, see Extracting Data From JSON (p. 209).
• requestParameters
• responseElements
• additionalEventData
• serviceEventDetails
SerDe Name
CloudTrail SerDe
Library Name
com.amazon.emr.hive.serde.CloudTrailSerde
Examples
The following example uses the CloudTrail SerDe on a fictional set of logs to create a table based on
them.
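A trimmed sketch of such a statement might look like the following. Only the columns that the query below uses are shown, and the bucket path is a placeholder for your CloudTrail log location.

CREATE EXTERNAL TABLE cloudtrail_logs (
  eventversion STRING,
  useridentity STRUCT<
    type:STRING,
    principalid:STRING,
    arn:STRING,
    accountid:STRING,
    username:STRING>,
  eventtime STRING,
  eventsource STRING,
  eventname STRING,
  sourceipaddress STRING,
  additionaleventdata STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://CloudTrail_bucket_name/AWSLogs/Account_ID/CloudTrail/';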
The following query returns the logins that occurred over a 24-hour period:
SELECT
useridentity.username,
sourceipaddress,
eventtime,
additionaleventdata
FROM default.cloudtrail_logs
WHERE eventname = 'ConsoleLogin'
AND eventtime >= '2017-02-17T00:00:00Z'
AND eventtime < '2017-02-18T00:00:00Z';
For more information, see Querying AWS CloudTrail Logs (p. 229).
OpenCSVSerDe for Processing CSV
When you create an Athena table for CSV data, use the following guidelines to determine which SerDe
to use:
• If data contains values enclosed in double quotes ("), you can use the OpenCSV SerDe to deserialize
the values in Athena. In the following sections, note the behavior of this SerDe with STRING data
types.
• If data does not contain values enclosed in double quotes ("), you can omit specifying any SerDe. In
this case, Athena uses the default LazySimpleSerDe. For information, see LazySimpleSerDe for CSV,
TSV, and Custom-Delimited Files (p. 424).
The OpenCSV SerDe behaves as follows:
• Cannot escape \t or \n directly. To escape them, use "escapeChar" = "\\". See the example in
this topic.
• Does not support embedded line breaks in CSV files.
• Does not support empty fields in columns defined as a numeric data type.
Note
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING.
Next, the parser in Athena parses the values from STRING into actual types based on what it
finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types
when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses
them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT.
DATE type values are also parsed as INT.
To further convert columns to the desired type in a table, you can create a view (p. 131) over the
table and use CAST to convert to the desired type.
For data types other than STRING, when the parser in Athena can recognize them, this SerDe behaves as
follows:
• Recognizes BOOLEAN, BIGINT, INT, and DOUBLE data types and parses them without changes. The
parser does not recognize empty or null values in columns defined as a numeric data type, leaving
them as the default data type of STRING. The workaround is to declare the column as STRING and
then CAST it in a SELECT query or view.
• Recognizes the TIMESTAMP type if it is specified in the UNIX numeric format, such as 1564610311.
• Does not support TIMESTAMP in the JDBC-compliant java.sql.Timestamp format, such as
"YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision). If you are processing CSV data from Hive,
use the UNIX numeric format.
• Recognizes the DATE type if it is specified in the UNIX numeric format, such as 1562112000.
• Does not support DATE in another format. If you are processing CSV data from Hive, use the UNIX
numeric format.
Note
For information about using the TIMESTAMP and DATE columns when they are not specified
in the UNIX numeric format, see the article When I query a table in Amazon Athena, the
TIMESTAMP result is empty in the AWS Knowledge Center.
Example: Using the TIMESTAMP type and DATE type specified in the UNIX numeric format
Consider the following test data:
The following statement creates a table in Athena from the specified Amazon S3 bucket location.
The query returns the following result, showing the date and time data:
The following statement creates a table in Athena, specifying that "escapeChar" = "\\".
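A sketch of that statement might look like the following; the table name and bucket path are placeholders, and the column names match the query result shown next:

CREATE EXTERNAL TABLE test_escaped_values (
  f1 STRING,
  s2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://mybucket/escaped-data/';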
f1 s2
\t\t\n 123 \t\t\n abc
456 xyz
SerDe Name
CSV SerDe
Library Name
To use this SerDe, specify its fully qualified class name in ROW FORMAT. Also specify the delimiters inside
SERDEPROPERTIES, as follows:
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "`",
"escapeChar" = "\\"
)
Example
This example presumes data in CSV saved in s3://mybucket/mycsv/ with the following contents:
"a1","a2","a3","a4"
"1","2","abc","def"
"a","a1","abc3","ab4"
Use a CREATE TABLE statement to create an Athena table based on the data, and reference the
OpenCSVSerDe class in ROW FORMAT, also specifying SerDe properties for character separator, quote
character, and escape character, as follows:
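A sketch of that statement might look like the following; the column names are placeholders, and the quote character matches the double quotes in the sample data:

CREATE EXTERNAL TABLE myopencsvtable (
  col1 STRING,
  col2 STRING,
  col3 STRING,
  col4 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/mycsv/';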
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
Grok SerDe
The Logstash Grok SerDe is a library with a set of specialized patterns for deserialization of unstructured
text data, usually logs. Each Grok pattern is a named regular expression. You can identify and re-use
these deserialization patterns as needed. This makes it easier to use Grok compared with using regular
expressions. Grok provides a set of pre-defined patterns. You can also create custom patterns.
To specify the Grok SerDe when creating a table in Athena, use the ROW FORMAT SERDE
'com.amazonaws.glue.serde.GrokSerDe' clause, followed by the WITH SERDEPROPERTIES clause
that specifies the patterns to match in your data, where:
• The input.format expression defines the patterns to match in the data. It is required.
• The input.grokCustomPatterns expression defines a named custom pattern, which you
can subsequently use within the input.format expression. It is optional. To include multiple
pattern entries in the input.grokCustomPatterns expression, use the newline escape
character (\n) to separate them, as follows: 'input.grokCustomPatterns'='INSIDE_QS
([^\"]*)\nINSIDE_BRACKETS ([^\\]]*)'.
• The STORED AS INPUTFORMAT and OUTPUTFORMAT clauses are required.
• The LOCATION clause specifies an Amazon S3 bucket, which can contain multiple data objects. All data
objects in the bucket are deserialized to create the table.
Examples
These examples rely on the list of predefined Grok patterns. See pre-defined patterns.
Example 1
This example uses source data from Postfix maillog entries saved in s3://mybucket/groksample/.
The following statement creates a table in Athena called mygroktable from the source data, using a
custom pattern and the predefined patterns that you specify:
CREATE EXTERNAL TABLE mygroktable (
  syslogbase string,
  queue_id string,
  syslog_message string
)
ROW FORMAT SERDE
  'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.grokCustomPatterns' = 'POSTFIX_QUEUEID [0-9A-F]{7,12}',
'input.format'='%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/groksample/';
Start with a simple pattern, such as %{NOTSPACE:column}, to get the columns mapped first and then
specialize the columns if needed.
Example 2
In the following example, you create a query for Log4j logs. The example logs have the entries in this
format:
• Add the Grok pattern to the input.format for each column. For example, for timestamp, add
%{TIMESTAMP_ISO8601:timestamp}. For loglevel, add %{LOGLEVEL:loglevel}.
• Make sure the pattern in input.format matches the format of the log exactly, by mapping the
dashes (-) and the commas that separate the entries in the log format.
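Putting these pieces together, a statement for such logs might look like the following. The table name, column names, and the exact input.format pattern are illustrative, and the statement ends with the LOCATION clause shown below.

CREATE EXTERNAL TABLE log4j_logs (
  `timestamp` STRING,
  loglevel STRING,
  message STRING
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format' = '%{TIMESTAMP_ISO8601:timestamp} - %{LOGLEVEL:loglevel} - %{GREEDYDATA:message}'
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'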
LOCATION 's3://mybucket/samples/';
Example 3
The following example of querying Amazon S3 logs shows the 'input.grokCustomPatterns'
expression that contains two pattern entries, separated by the newline escape character (\n), as
shown in this snippet from the example query: 'input.grokCustomPatterns'='INSIDE_QS
([^\"]*)\nINSIDE_BRACKETS ([^\\]]*)'.
JSON SerDe Libraries
SerDe Names
Hive-JsonSerDe
Openx-JsonSerDe
Library Names
Use one of the following:
org.apache.hive.hcatalog.data.JsonSerDe
org.openx.data.jsonserde.JsonSerDe
The following example DDL statement uses the Hive JSON SerDe to create a table based
on sample online advertising data. In the LOCATION clause, replace the myregion in
s3://myregion.elasticmapreduce/samples/hive-ads/tables/impressions with the region
identifier where you run Athena (for example, s3://us-west-2.elasticmapreduce/samples/
hive-ads/tables/impressions).
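A trimmed sketch of such a statement might look like the following; the full example defines additional columns, including the timers struct shown in the OpenX example later in this topic.

CREATE EXTERNAL TABLE impressions (
  requestbegintime STRING,
  adid STRING,
  impressionid STRING,
  referrer STRING,
  useragent STRING,
  usercookie STRING,
  ip STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://myregion.elasticmapreduce/samples/hive-ads/tables/impressions';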
After you create the table, run MSCK REPAIR TABLE (p. 463) to load the table and make it queryable
from Athena:
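For example, for the impressions table sketched above:

MSCK REPAIR TABLE impressions;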
ignore.malformed.json
Optional. When set to TRUE, lets you skip malformed JSON syntax. The default is FALSE.
dots.in.keys
Optional. The default is FALSE. When set to TRUE, allows the SerDe to replace the dots in key names
with underscores. For example, if the JSON dataset contains a key with the name "a.b", you can
use this property to define the column name to be "a_b" in Athena. By default (without this SerDe),
Athena does not allow dots in column names.
case.insensitive
Optional. The default is TRUE. When set to TRUE, the SerDe converts all uppercase columns to
lowercase.
If you have two keys like URL and Url that are the same when they are in lowercase, an error like the
following can occur:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "url"
To resolve this, set the case.insensitive property to FALSE and map the keys to different
names, as in the following example:
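A sketch of the relevant statement might look like the following; the table and column names are placeholders:

CREATE EXTERNAL TABLE case_sensitive_keys (
  url1 STRING,
  url2 STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'case.insensitive' = 'FALSE',
  'mapping.url1' = 'URL',
  'mapping.url2' = 'Url'
)
LOCATION 's3://mybucket/json-data/';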
mapping
Optional. Maps column names to JSON keys that aren't identical to the column names. The mapping
parameter is useful when the JSON data contains keys that are keywords (p. 97). For example, if you
have a JSON key named timestamp, use the following syntax to map the key to a column named
ts:
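For example, a SERDEPROPERTIES clause along these lines maps the JSON key timestamp to a column named ts:

WITH SERDEPROPERTIES ('mapping.ts' = 'timestamp')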
Like the Hive JSON SerDe, the OpenX JSON SerDe does not allow duplicate keys in map or struct key
names.
The following example DDL statement uses the OpenX JSON SerDe to create a table based on the same
sample online advertising data used in the example for the Hive JSON SerDe. In the LOCATION clause,
replace myregion with the region identifier where you run Athena.
timers struct<
modellookup:string,
requesttime:string>,
threadid string,
hostname string,
sessionid string
) PARTITIONED BY (dt string)
ROW FORMAT serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties ( 'paths'='requestbegintime, adid, impressionid, referrer, useragent,
usercookie, ip' )
LOCATION 's3://myregion.elasticmapreduce/samples/hive-ads/tables/impressions';
The following example creates an Athena table from JSON data that has nested structures. To parse
JSON-encoded data in Athena, make sure that each JSON document is on its own line, separated by a
new line.
This example presumes JSON-encoded data that has the following structure:
{
"DocId": "AWS",
"User": {
"Id": 1234,
"Username": "bob1234",
"Name": "Bob",
"ShippingAddress": {
"Address1": "123 Main St.",
"Address2": null,
"City": "Seattle",
"State": "WA"
},
"Orders": [
{
"ItemId": 6789,
"OrderDate": "11/11/2017"
},
{
"ItemId": 4352,
"OrderDate": "12/12/2017"
}
]
}
}
The following CREATE TABLE statement uses the Openx-JsonSerDe with the struct and array
collection data types to establish groups of objects. Each JSON document is listed on its own line,
separated by a new line. To avoid errors, the data being queried does not include duplicate keys in
struct or map key names.
city:string,
state:string
>,
orders:array<
struct<
itemid:INT,
orderdate:string
>
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/myjsondata/';
Additional Resources
For more information about working with JSON and nested JSON in Athena, see the following resources:
• Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe (AWS Big Data
Blog)
• I get errors when I try to read JSON data in Amazon Athena (AWS Knowledge Center article)
• hive-json-schema (GitHub) – Tool written in Java that generates CREATE TABLE statements from
example JSON documents. The CREATE TABLE statements that are generated use the OpenX JSON
Serde.
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files
For reference documentation about the LazySimpleSerDe, see the Hive SerDe section of the Apache Hive
Developer Guide.
Library Name
The Class library name for the LazySimpleSerDe is
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. For information about the
LazySimpleSerDe class, see LazySimpleSerDe.
Ignoring Headers
To ignore headers in your data when you define a table, you can use the skip.header.line.count
table property, as in the following example.
TBLPROPERTIES ("skip.header.line.count"="1")
For examples, see the CREATE TABLE statements in Querying Amazon VPC Flow Logs (p. 244) and
Querying Amazon CloudFront Logs (p. 227).
Examples
The following examples show how to create tables in Athena from CSV and TSV, using the
LazySimpleSerDe. To deserialize custom-delimited files using this SerDe, follow these examples, but use
the FIELDS TERMINATED BY clause to specify a different single-character delimiter.
Note
Replace myregion in s3://athena-examples-myregion/path/to/data/ with the region
identifier where you run Athena, for example, s3://athena-examples-us-west-1/path/
to/data/.
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
CSV Example
Use the CREATE TABLE statement to create an Athena table from the underlying data in CSV stored in
Amazon S3.
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/csv/';
Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table:
TSV Example
This example presumes source data in TSV saved in s3://mybucket/mytsv/.
Use a CREATE TABLE statement to create an Athena table from the TSV data stored in Amazon S3. Note
that this example does not reference a SerDe class in ROW FORMAT because it uses the LazySimpleSerDe,
which can be omitted. The example specifies SerDe properties for the field and line separators and an
escape character:
crsdeptime STRING,
deptime STRING,
depdelay INT,
depdelayminutes INT,
depdel15 INT,
departuredelaygroups INT,
deptimeblk STRING,
taxiout INT,
wheelsoff STRING,
wheelson STRING,
taxiin INT,
crsarrtime INT,
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/tsv/';
Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table:
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
ORC SerDe
SerDe Name
OrcSerDe
Library Name
This is the SerDe class for data in the ORC format. It passes the object from ORC to the reader and from
ORC to the writer: OrcSerDe
Examples
Note
Replace myregion in s3://athena-examples-myregion/path/to/data/ with the region
identifier where you run Athena, for example, s3://athena-examples-us-west-1/path/
to/data/.
The following example creates a table for the flight delays data in ORC. The table includes partitions:
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year String)
STORED AS ORC
LOCATION 's3://athena-examples-myregion/flight/orc/'
tblproperties ("orc.compress"="ZLIB");
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata:
Use this query to obtain the top 10 routes delayed by more than 1 hour:
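A query along these lines might look like the following; the table name and the origin, dest, and depdelayminutes columns are assumed from the flight-delay dataset:

SELECT origin, dest, count(*) AS delays
FROM flight_delays_orc
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY delays DESC
LIMIT 10;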
Parquet SerDe
SerDe Name
ParquetHiveSerDe is used for data stored in Parquet Format.
Note
To convert data into Parquet format, you can use CREATE TABLE AS SELECT (CTAS) (p. 458)
queries. For more information, see Creating a Table from Query Results (CTAS) (p. 136),
Examples of CTAS Queries (p. 142) and Using CTAS and INSERT INTO for ETL and Data
Analysis (p. 145).
Library Name
Athena uses this class when it needs to deserialize data stored in Parquet:
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.
Use the following CREATE TABLE statement to create an Athena table from the underlying flight data
stored in Parquet format in Amazon S3:
depdelayminutes INT,
depdel15 INT,
departuredelaygroups INT,
deptimeblk STRING,
taxiout INT,
wheelsoff STRING,
wheelson STRING,
taxiin INT,
crsarrtime INT,
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
STORED AS PARQUET
LOCATION 's3://athena-examples-myregion/flight/parquet/'
tblproperties ("parquet.compression"="SNAPPY");
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata:
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
Compression Formats
The compression formats listed in this section are used for CREATE TABLE (p. 454) queries. For CTAS
queries, Athena supports GZIP and SNAPPY (for data stored in Parquet and ORC). If you omit a format,
GZIP is used by default. For more information, see CREATE TABLE AS (p. 458).
• SNAPPY – The default compression format for files in the Parquet data storage format.
• ZLIB – The default compression format for files in the ORC data storage format.
• LZO
• GZIP
• BZIP2
Notes and Resources
• For querying Amazon Kinesis Data Firehose logs from Athena, supported formats include GZIP
compression or ORC files with SNAPPY compression.
• For more information on using compression, see section 3 ("Compress and split files") of the AWS Big
Data Blog post Top 10 Performance Tuning Tips for Amazon Athena.
SQL Reference for Amazon Athena
Topics
• Data Types in Amazon Athena (p. 436)
• DML Queries, Functions, and Operators (p. 437)
• DDL Statements (p. 446)
• Considerations and Limitations for SQL Queries in Amazon Athena (p. 469)
Data Types in Amazon Athena
To specify decimal values as literals, such as when selecting rows with a specific decimal value in a
query DDL expression, specify the DECIMAL type definition, and list the decimal value as a literal (in
single quotes) in your query, as in this example: decimal_value = DECIMAL '0.12'.
• CHAR – Fixed length character data, with a specified length between 1 and 255, such as char(10). For
more information, see CHAR Hive Data Type.
Note
To use the substr function to return a substring of specified length from a CHAR data type,
you must first cast the CHAR value as a VARCHAR, as in the following example.
substr(cast(col1 as varchar), 1, 4)
• VARCHAR – Variable length character data, with a specified length between 1 and 65535, such as
varchar(10). For more information, see VARCHAR Hive Data Type.
• STRING – A string literal enclosed in single or double quotes. For more information, see STRING Hive
Data Type.
Note
Non-string data types cannot be cast to STRING in Athena; cast them to VARCHAR instead.
• BINARY – Used for data in Parquet.
• DATE – A date in ISO format, such as YYYY-MM-DD. For example, DATE '2008-09-15'.
• TIMESTAMP – Date and time instant in a java.sql.Timestamp compatible format, such as
yyyy-MM-dd HH:mm:ss[.f...]. For example, TIMESTAMP '2008-09-15 03:04:05.324'. This
format uses the session time zone.
• ARRAY<data_type>
• MAP<primitive_type, data_type>
• STRUCT<col_name : data_type [COMMENT col_comment] , ...>
DML Queries, Functions, and Operators
For links to subsections of the Presto function documentation, see Presto Functions (p. 445).
Athena does not support all of Presto's features, and there are some significant differences. For
more information, see the topics for specific statements in this section and Considerations and
Limitations (p. 469).
Topics
• SELECT (p. 437)
• INSERT INTO (p. 442)
• Presto Functions in Amazon Athena (p. 445)
SELECT
Retrieves rows of data from zero or more tables.
Note
This topic provides summary information for reference. Comprehensive information about using
SELECT and the SQL language is beyond the scope of this documentation. For information
about using SQL that is specific to Athena, see Considerations and Limitations for SQL Queries
in Amazon Athena (p. 469) and Running SQL Queries Using Amazon Athena (p. 122). For help
getting started with querying data in Athena, see Getting Started (p. 8).
Synopsis
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] union_query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]
Note
Reserved words in SQL SELECT statements must be enclosed in double quotes. For more
information, see List of Reserved Keywords in SQL SELECT Statements (p. 97).
Parameters
[ WITH with_query [, ....] ]
The WITH clause precedes the SELECT list in a query and defines one or more subqueries for use
within the SELECT query.
Each subquery defines a temporary table, similar to a view definition, which you can reference in the
FROM clause. The tables are used only when the query runs.
Where:
• subquery_table_name is a unique name for a temporary table that defines the results of the
WITH clause subquery. Each subquery must have a table name that can be referenced in the
FROM clause.
• column_name [, ...] is an optional list of output column names. The number of column
names must be equal to or less than the number of columns defined by subquery.
• subquery is any query statement.
[ ALL | DISTINCT ] select_expr
ALL is the default. Using ALL is treated the same as if it were omitted; all rows for all columns are
selected and duplicates are kept.
Use DISTINCT to return only distinct values when a column contains duplicate values.
FROM from_item [, ...]
Indicates the input to the query, where from_item can be a view, a join construct, or a subquery as
described below.
Where table_name is the name of the target table from which to select rows, alias is the name
to give the output of the SELECT statement, and column_alias defines the columns for the
alias specified.
-OR-
• join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
Divides the output of the SELECT statement into rows with matching values.
ALL and DISTINCT determine whether duplicate grouping sets each produce distinct output rows. If
omitted, ALL is assumed.
The grouping_expressions element can be any function, such as SUM, AVG, or COUNT, performed
on input columns, or be an ordinal number that selects an output column by position, starting at
one.
GROUP BY expressions can group output by input column names that don't appear in the output of
the SELECT statement.
All output expressions must be either aggregate functions or columns present in the GROUP BY
clause.
You can use a single query to perform analysis that requires aggregating multiple column sets.
These complex grouping operations don't support expressions comprising input columns. Only
column names or ordinals are allowed.
You can often use UNION ALL to achieve the same results as these GROUP BY operations, but
queries that use GROUP BY have the advantage of reading the data one time, whereas UNION ALL
reads the underlying data three times and may produce inconsistent results when the data source is
subject to change.
GROUP BY CUBE generates all possible grouping sets for a given set of columns. GROUP BY
ROLLUP generates all possible subtotals for a given set of columns.
[ HAVING condition ]
Used with aggregate functions and the GROUP BY clause. Controls which groups are selected,
eliminating groups that don't satisfy condition. This filtering occurs after groups and aggregates
are computed.
[ UNION [ ALL | DISTINCT ] union_query ]
Combines the results of more than one SELECT statement into a single query. ALL or DISTINCT
control which rows are included in the final result set.
ALL causes all rows to be included, even if the rows are identical.
DISTINCT causes only unique rows to be included in the combined result set. DISTINCT is the
default.
To eliminate duplicates, UNION builds a hash table, which consumes memory. For better
performance, consider using UNION ALL if your query does not require the elimination of
duplicates.
Multiple UNION clauses are processed left to right unless you use parentheses to explicitly define the
order of processing.
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
When the clause contains multiple expressions, the result set is sorted according to the first
expression. Then the second expression is applied to rows that have matching values from the
first expression, and so on.
Each expression may specify output columns from SELECT or an ordinal number for an output
column by position, starting at one.
ORDER BY is evaluated as the last step after any GROUP BY or HAVING clause. ASC and DESC
determine whether results are sorted in ascending or descending order.
The default null ordering is NULLS LAST, regardless of ascending or descending sort order.
LIMIT [ count | ALL ]
Restricts the number of rows in the result set to count. LIMIT ALL is the same as omitting the
LIMIT clause. If the query has no ORDER BY clause, the results are arbitrary.
TABLESAMPLE BERNOULLI | SYSTEM (percentage)
BERNOULLI selects each row to be in the table sample with a probability of percentage. All
physical blocks of the table are scanned, and certain rows are skipped based on a comparison
between the sample percentage and a random value calculated at runtime.
With SYSTEM, the table is divided into logical segments of data, and the table is sampled at this
granularity.
Either all rows from a particular segment are selected, or the segment is skipped based on a
comparison between the sample percentage and a random value calculated at runtime. SYSTEM
sampling is dependent on the connector. This method does not guarantee independent sampling
probabilities.
[ UNNEST (array_or_map) [WITH ORDINALITY] ]
Expands an array or map into a relation. Arrays are expanded into a single column. Maps are
expanded into two columns (key, value).
You can use UNNEST with multiple arguments, which are expanded into multiple columns with as
many rows as the highest cardinality argument.
UNNEST is usually used with a JOIN and can reference columns from relations on the left side of the
JOIN.
You can use "$path" in a SELECT query to see the Amazon S3 source file for a row in a table. The value
returned is a path like the following:
s3://awsexamplebucket/datasets_mytable/year=2019/data_file1.json
To return a sorted, unique list of the S3 filename paths for the data in a table, you can use SELECT
DISTINCT and ORDER BY, as in the following example.
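A query of this kind might look like the following; the database and table names are placeholders:

SELECT DISTINCT "$path" AS data_source_file
FROM "my_database"."my_table"
ORDER BY data_source_file;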
To return only the filenames without the path, you can pass "$path" as a parameter to a
regexp_extract function, as in the following example.
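For example, the following sketch keeps only the part of the path after the last forward slash:

SELECT DISTINCT regexp_extract("$path", '[^/]+$') AS data_source_file
FROM "my_database"."my_table"
ORDER BY data_source_file;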
To return the data from a specific file, specify the file in the WHERE clause, as in the following example.
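For example, using the file path shown earlier as a placeholder:

SELECT *, "$path"
FROM "my_database"."my_table"
WHERE "$path" = 's3://awsexamplebucket/datasets_mytable/year=2019/data_file1.json';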
For more information and examples, see the Knowledge Center article How can I see the Amazon S3
source file for a row in an Athena table?.
To escape a single quote in a string literal, precede it with another single quote, as in the following
example.
SELECT 'O''Reilly'
Results
O'Reilly
Additional Resources
For more information about using SELECT statements in Athena, see the following resources.
• Inserting data from a SELECT query into another table – INSERT INTO (p. 442)
• Using built-in functions in SELECT statements – Presto Functions in Amazon Athena (p. 445)
• Using user defined functions in SELECT statements (Preview) – Querying with User Defined Functions
(Preview) (p. 216)
• Querying Data Catalog metadata – Querying AWS Glue Data Catalog (p. 249)
INSERT INTO
Inserts new rows into a destination table based on a SELECT query statement that runs on a source
table, or based on a set of VALUES provided as part of the statement. When the source table is based
on underlying data in one format, such as CSV or JSON, and the destination table is based on another
format, such as Parquet or ORC, you can use INSERT INTO queries to transform selected data into the
destination table's format.
• Avro – org.apache.hadoop.hive.serde2.avro.AvroSerDe
• JSON – org.apache.hive.hcatalog.data.JsonSerDe
• ORC – org.apache.hadoop.hive.ql.io.orc.OrcSerde
• Parquet – org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Partitioning
Consider the points in this section when using partitioning with INSERT INTO or CREATE TABLE AS
SELECT queries.
Limits
The INSERT INTO statement supports writing a maximum of 100 partitions to the destination table. If
you run the SELECT clause on a table with more than 100 partitions, the query fails unless the SELECT
query is limited to 100 partitions or fewer.
For information about working around this limitation, see Using CTAS and INSERT INTO to Create a Table
with More Than 100 Partitions (p. 151).
Column Ordering
INSERT INTO or CREATE TABLE AS SELECT statements expect the partitioned column to be the last
column in the list of projected columns in a SELECT statement.
If the source table is non-partitioned, or partitioned on different columns compared to the destination
table, queries like INSERT INTO destination_table SELECT * FROM source_table consider the
values in the last column of the source table to be values for a partition column in the destination table.
Keep this in mind when trying to create a partitioned table from a non-partitioned table.
Resources
For more information about using INSERT INTO with partitioning, see the following resources.
• For inserting partitioned data into a partitioned table, see Using CTAS and INSERT INTO to Create a
Table with More Than 100 Partitions (p. 151).
• For inserting unpartitioned data into a partitioned table, see Using CTAS and INSERT INTO for ETL and
Data Analysis (p. 145).
INSERT INTO...SELECT
Specifies the query to run on one table, source_table, which determines rows to insert into a second
table, destination_table. If the SELECT query specifies columns in the source_table, the columns
must precisely match those in the destination_table.
For more information about SELECT queries, see SELECT (p. 437).
Synopsis
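The general form is along these lines:

INSERT INTO destination_table
SELECT select_query
FROM source_table_or_view;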
Examples
Select all rows in the vancouver_pageviews table and insert them into the canada_pageviews table:
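A statement for this might look like the following:

INSERT INTO canada_pageviews
SELECT *
FROM vancouver_pageviews;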
Select only those rows in the vancouver_pageviews table where the date column has a value
between 2019-07-01 and 2019-07-31, and then insert them into canada_july_pageviews:
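Assuming the date column is of type DATE, a statement for this might look like the following:

INSERT INTO canada_july_pageviews
SELECT *
FROM vancouver_pageviews
WHERE date BETWEEN date '2019-07-01' AND date '2019-07-31';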
Select the values in the city and state columns in the cities_world table only from those rows
with a value of usa in the country column and insert them into the city and state columns in the
cities_usa table:
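A statement for this might look like the following:

INSERT INTO cities_usa (city, state)
SELECT city, state
FROM cities_world
WHERE country = 'usa';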
INSERT INTO...VALUES
Inserts rows into an existing table by specifying columns and values. Specified columns and associated
data types must precisely match the columns and data types in the destination table.
Important
We do not recommend inserting rows using VALUES because Athena generates files for each
INSERT operation. This can cause many small files to be created and degrade the table's query
performance. To identify files that an INSERT query creates, examine the data manifest file. For
more information, see Working with Query Results, Output Files, and Query History (p. 122).
Synopsis
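The general form is along these lines:

INSERT INTO destination_table [(col1, col2, ...)]
VALUES (col1value, col2value, ...)[, (col1value, col2value, ...), ...];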
Examples
In the following examples, the cities table has four columns: id, city, state, and state_motto. The id
column is type INT and all other columns are type VARCHAR.
Insert a single row into the cities table, with all column values specified:
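A statement for this might look like the following; the values are illustrative:

INSERT INTO cities
VALUES (1, 'Lansing', 'MI', 'Si quaeris peninsulam amoenam circumspice');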
Presto Functions in Amazon Athena
• Logical Operators
• Comparison Functions and Operators
• Conditional Expressions
• Conversion Functions
• Mathematical Functions and Operators
• Bitwise Functions
• Decimal Functions and Operators
• String Functions and Operators
• Binary Functions
• Date and Time Functions and Operators
• Regular Expression Functions
• JSON Functions and Operators
• URL Functions
• Aggregate Functions
• Window Functions
• Color Functions
• Array Functions and Operators
• Map Functions and Operators
• Lambda Expressions and Functions
• Teradata Functions
• Logical Operators
• Comparison Functions and Operators
• Conditional Expressions
• Conversion Functions
• Mathematical Functions and Operators
• Bitwise Functions
• Decimal Functions and Operators
• String Functions and Operators
• Binary Functions
• Date and Time Functions and Operators
• Regular Expression Functions
• JSON Functions and Operators
• URL Functions
• Aggregate Functions
• Window Functions
• Color Functions
• Array Functions and Operators
• Map Functions and Operators
• Lambda Expressions and Functions
• Teradata Functions
DDL Statements
Use the following DDL statements directly in Athena.
Athena does not support all DDL statements, and there are some differences between HiveQL DDL
and Athena DDL. For more information, see the reference topics in this section and Unsupported
DDL (p. 447).
Topics
• Unsupported DDL (p. 447)
• ALTER DATABASE SET DBPROPERTIES (p. 448)
• ALTER TABLE ADD COLUMNS (p. 449)
• ALTER TABLE ADD PARTITION (p. 449)
• ALTER TABLE DROP PARTITION (p. 450)
• ALTER TABLE RENAME PARTITION (p. 451)
• ALTER TABLE REPLACE COLUMNS (p. 451)
• ALTER TABLE SET LOCATION (p. 452)
• ALTER TABLE SET TBLPROPERTIES (p. 453)
• CREATE DATABASE (p. 453)
• CREATE TABLE (p. 454)
• CREATE TABLE AS (p. 458)
• CREATE VIEW (p. 460)
• DESCRIBE TABLE (p. 461)
Unsupported DDL
The following native Hive DDLs are not supported by Athena:
• ALTER INDEX
• ALTER TABLE table_name ARCHIVE PARTITION
• ALTER TABLE table_name CLUSTERED BY
• ALTER TABLE table_name EXCHANGE PARTITION
• ALTER TABLE table_name NOT CLUSTERED
• ALTER TABLE table_name NOT SKEWED
• ALTER TABLE table_name NOT SORTED
• ALTER TABLE table_name NOT STORED AS DIRECTORIES
• ALTER TABLE table_name partitionSpec CHANGE COLUMNS
• ALTER TABLE table_name partitionSpec COMPACT
• ALTER TABLE table_name partitionSpec CONCATENATE
• ALTER TABLE table_name partitionSpec SET FILEFORMAT
• ALTER TABLE table_name RENAME TO
• ALTER TABLE table_name SET SKEWED LOCATION
• ALTER TABLE table_name SKEWED BY
• ALTER TABLE table_name TOUCH
• ALTER TABLE table_name UNARCHIVE PARTITION
• COMMIT
• CREATE INDEX
• CREATE ROLE
• CREATE TABLE table_name LIKE existing_table_name
• CREATE TEMPORARY MACRO
• DELETE FROM
• DESCRIBE DATABASE
• DFS
• DROP INDEX
• DROP ROLE
ALTER DATABASE SET DBPROPERTIES
Synopsis
ALTER (DATABASE|SCHEMA) database_name
SET DBPROPERTIES ('property_name'='property_value' [, ...] )
Parameters
SET DBPROPERTIES ('property_name'='property_value' [, ...] )
Specifies a property or properties for the database named property_name and establishes the
value for each of the properties respectively as property_value. If property_name already exists,
the old value is overwritten with property_value.
Examples
ALTER DATABASE jd_datasets
SET DBPROPERTIES ('creator'='John Doe', 'department'='applied mathematics');
ALTER TABLE ADD COLUMNS
Synopsis
ALTER TABLE table_name
[PARTITION
(partition_col1_name = partition_col1_value
[,partition_col2_name = partition_col2_value][,...])]
ADD COLUMNS (col_name data_type)
Parameters
PARTITION (partition_col_name = partition_col_value [,...])
Creates a partition with the column name/value combinations that you specify. Enclose
partition_col_value in quotation marks only if the data type of the column is a string.
ADD COLUMNS (col_name data_type [,col_name data_type,…])
Adds columns after existing columns but before partition columns.
Examples
ALTER TABLE events ADD COLUMNS (eventowner string)
ALTER TABLE events PARTITION (awsregion='us-west-2') ADD COLUMNS (eventdescription string)
Notes
• To see a new table column in the Athena Query Editor navigation pane after you run ALTER TABLE
ADD COLUMNS, manually refresh the table list in the editor, and then expand the table again.
• ALTER TABLE ADD COLUMNS does not work for columns with the date datatype. To work around this
issue, use the timestamp datatype instead, as in the following sketch.
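For example, a sketch of the workaround that adds a time-related column as timestamp rather than date (the column name eventtime is illustrative):
ALTER TABLE events ADD COLUMNS (eventtime timestamp)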
In Athena, a table and its partitions must use the same data formats but their schemas may differ. For
more information, see Updates in Tables with Partitions (p. 161).
For information about the resource-level permissions required in IAM policies (including
glue:CreatePartition), see AWS Glue API Permissions: Actions and Resources Reference and Fine-
Grained Access to Databases and Tables in the AWS Glue Data Catalog (p. 275).
ALTER TABLE ADD PARTITION
Synopsis
ALTER TABLE table_name ADD [IF NOT EXISTS]
PARTITION
(partition_col1_name = partition_col1_value
[,partition_col2_name = partition_col2_value]
[,...])
[LOCATION 'location1']
[PARTITION
(partition_colA_name = partition_colA_value
[,partition_colB_name = partition_colB_value
[,...])]
[LOCATION 'location2']
[,...]
Parameters
When you add a partition, you specify one or more column name/value pairs for the partition and the
Amazon S3 path where the data files for that partition reside.
[IF NOT EXISTS]
Causes the error to be suppressed if a partition with the same definition already exists.
PARTITION (partition_col_name = partition_col_value [,...])
Creates a partition with the column name/value combinations that you specify. Enclose
partition_col_value in quotation marks only if the data type of the column is a string.
[LOCATION 'location']
Specifies the directory in which to store the partitions defined by the preceding statement.
Examples
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN');
ALTER TABLE DROP PARTITION
Synopsis
ALTER TABLE table_name DROP [IF EXISTS] PARTITION (partition_spec) [, PARTITION
(partition_spec)]
Parameters
[IF EXISTS]
Suppresses the error message if the partition specified does not exist.
PARTITION (partition_spec)
Each partition_spec specifies a column name/value combination in the form partition_col_name
= partition_col_value [,...].
Examples
ALTER TABLE orders
DROP PARTITION (dt = '2014-05-14', country = 'IN');
ALTER TABLE RENAME PARTITION
Synopsis
ALTER TABLE table_name PARTITION (partition_spec) RENAME TO PARTITION (new_partition_spec)
Parameters
PARTITION (partition_spec)
Each partition_spec specifies a column name/value combination in the form partition_col_name
= partition_col_value [,...].
Examples
ALTER TABLE orders
PARTITION (dt = '2014-05-14', country = 'IN') RENAME TO PARTITION (dt = '2014-05-15',
country = 'IN');
ALTER TABLE REPLACE COLUMNS
Synopsis
ALTER TABLE table_name
[PARTITION
(partition_col1_name = partition_col1_value
[,partition_col2_name = partition_col2_value][,...])]
REPLACE COLUMNS (col_name data_type [, col_name data_type, ...])
Parameters
PARTITION (partition_col_name = partition_col_value [,...])
Specifies a partition with the column name/value combinations that you specify. Enclose
partition_col_value in quotation marks only if the data type of the column is a string.
REPLACE COLUMNS (col_name data_type [,col_name data_type,…])
Replaces existing columns with the column names and datatypes specified.
Notes
• To see the change in table columns in the Athena Query Editor navigation pane after you run ALTER
TABLE REPLACE COLUMNS, manually refresh the table list in the editor, and then expand the table
again.
• ALTER TABLE REPLACE COLUMNS does not work for columns with the date datatype. To work
around this issue, use the timestamp datatype in the table instead.
ALTER TABLE SET LOCATION
Synopsis
ALTER TABLE table_name [ PARTITION (partition_spec) ] SET LOCATION 'new location'
Parameters
PARTITION (partition_spec)
Specifies the partition with parameters partition_spec whose location you want to change. The
partition_spec specifies a column name/value combination in the form partition_col_name
= partition_col_value.
SET LOCATION 'new location'
Specifies the new location, which must be an Amazon S3 location. For information about syntax, see
Table Location in Amazon S3 (p. 98).
Examples
ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION 's3://mystorage/
custdata/';
ALTER TABLE SET TBLPROPERTIES
Synopsis
ALTER TABLE table_name SET TBLPROPERTIES ('property_name' = 'property_value' [ , ... ])
Parameters
SET TBLPROPERTIES ('property_name' = 'property_value' [ , ... ])
Specifies the metadata properties to add as property_name and the value for each as property
value. If property_name already exists, its value is reset to property_value.
Examples
ALTER TABLE orders
SET TBLPROPERTIES ('notes'="Please don't drop this table.");
CREATE DATABASE
Creates a database. The use of DATABASE and SCHEMA is interchangeable. They mean the same thing.
Synopsis
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT 'database_comment']
[LOCATION 'S3_loc']
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...]]
Parameters
[IF NOT EXISTS]
Causes the error to be suppressed if a database named database_name already exists.
[COMMENT database_comment]
Establishes the metadata value for the built-in metadata property named comment and the value
you provide for database_comment. In AWS Glue, the COMMENT contents are written to the
Description field of the database properties.
[LOCATION S3_loc]
Specifies the location where database files and metastore will exist as S3_loc. The location must be
an Amazon S3 location.
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...] ]
Allows you to specify custom metadata properties for the database definition.
Examples
CREATE DATABASE clickstreams;
To view the properties of a database that you create, you can use the AWS CLI command aws glue
get-database, which returns output similar to the following:
{
"Database": {
"Name": "<your-database-name>",
"Description": "<your-database-comment>",
"LocationUri": "s3://<your-database-location>",
"Parameters": {
"<your-database-property-name>": "<your-database-property-value>"
},
"CreateTime": 1603383451.0,
"CreateTableDefaultPermissions": [
{
"Principal": {
"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
},
"Permissions": [
"ALL"
]
}
]
}
}
For more information about the AWS CLI, see the AWS Command Line Interface User Guide.
CREATE TABLE
Creates a table with the name and the parameters that you specify.
Synopsis
CREATE EXTERNAL TABLE [IF NOT EXISTS]
[db_name.]table_name [(col_name data_type [COMMENT col_comment] [, ...] )]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[WITH SERDEPROPERTIES (...)]
[LOCATION 's3://bucket_name/[folder]/']
[TBLPROPERTIES ( ['has_encrypted_data'='true | false',]
['classification'='aws_glue_classification',] property_name=property_value [, ...] ) ]
Parameters
EXTERNAL
Specifies that the table is based on an underlying data file that exists in Amazon S3, in the
LOCATION that you specify. All tables created in Athena, except for those created using
CTAS (p. 458), must be EXTERNAL. When you create an external table, the data referenced must
comply with the default format or the format that you specify with the ROW FORMAT, STORED AS,
and WITH SERDEPROPERTIES clauses.
[IF NOT EXISTS]
Causes the error message to be suppressed if a table named table_name already exists.
[db_name.]table_name
Specifies a name for the table to be created. The optional db_name parameter specifies the
database where the table exists. If omitted, the current database is assumed. If the table name
includes numbers, enclose table_name in quotation marks, for example "table123". If
table_name begins with an underscore, use backticks, for example, `_mytable`. Special
characters (other than underscore) are not supported.
Athena table names are case-insensitive; however, if you work with Apache Spark, Spark requires
lowercase table names.
[ ( col_name data_type [COMMENT col_comment] [, ...] ) ]
Specifies the name for each column to be created, along with the column's data type. Column names
do not allow special characters other than underscore (_). If col_name begins with an underscore,
enclose the column name in backticks, for example `_mycolumn`.
To specify decimal values as literals, such as when selecting rows with a specific decimal value in a
query DDL expression, specify the DECIMAL type definition, and list the decimal value as a literal
(in single quotes) in your query, as in this example: decimal_value = DECIMAL '0.12'.
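A minimal sketch of such a query, assuming a hypothetical table my_table with a DECIMAL column named decimal_value:
SELECT * FROM my_table
WHERE decimal_value = DECIMAL '0.12'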
• CHAR. Fixed length character data, with a specified length between 1 and 255, such as char(10).
For more information, see CHAR Hive Data Type.
• VARCHAR. Variable length character data, with a specified length between 1 and 65535, such as
varchar(10). For more information, see VARCHAR Hive Data Type.
• STRING. A string literal enclosed in single or double quotes.
Note
Non-string data types cannot be cast to STRING in Athena; cast them to VARCHAR
instead.
• BINARY (for data in Parquet)
• Date and time types
• DATE A date in ISO format, such as YYYY-MM-DD. For example, DATE '2008-09-15'.
• TIMESTAMP Date and time instant in a java.sql.Timestamp compatible format, such as
yyyy-MM-dd HH:mm:ss[.f...]. For example, TIMESTAMP '2008-09-15 03:04:05.324'.
This format uses the session time zone.
• ARRAY < data_type >
• MAP < primitive_type, data_type >
• STRUCT < col_name : data_type [COMMENT col_comment] [, ...] >
[COMMENT table_comment]
Creates the comment table property and populates it with the table_comment you specify.
[PARTITIONED BY (col_name data_type [ COMMENT col_comment ], ... ) ]
Creates a partitioned table with one or more partition columns that have the col_name,
data_type and col_comment specified. A table can have one or more partitions, which consist of a
distinct column name and value combination. A separate data directory is created for each specified
combination, which can improve query performance in some circumstances. Partitioned columns
don't exist within the table data itself. If you use a value for col_name that is the same as a table
column, you get an error. For more information, see Partitioning Data (p. 104).
Note
After you create a table with partitions, run a subsequent query that consists of the MSCK
REPAIR TABLE (p. 463) clause to refresh partition metadata, for example, MSCK REPAIR
TABLE cloudfront_logs;. For partitions that are not Hive compatible, use ALTER TABLE
ADD PARTITION (p. 449) to load the partitions so that you can query the data.
[CLUSTERED BY (col_name, col_name, ...) INTO num_buckets BUCKETS]
Divides, with or without partitioning, the data in the specified col_name columns into data subsets
called buckets. The num_buckets parameter specifies the number of buckets to create. Bucketing
can improve the performance of some queries on large data sets.
[ROW FORMAT row_format]
Specifies the row format of the table and its underlying source data if applicable. For row_format,
you can specify one or more delimiters with the DELIMITED clause or, alternatively, use the SERDE
clause as described below. If ROW FORMAT is omitted or ROW FORMAT DELIMITED is specified, a
native SerDe is used.
• [DELIMITED FIELDS TERMINATED BY char [ESCAPED BY char]]
• [DELIMITED COLLECTION ITEMS TERMINATED BY char]
• [MAP KEYS TERMINATED BY char]
• [LINES TERMINATED BY char]
Available only with Hive 0.13 and when the STORED AS file format is TEXTFILE.
--OR--
• SERDE 'serde_name' [WITH SERDEPROPERTIES ("property_name" = "property_value",
"property_name" = "property_value" [, ...] )]
The serde_name indicates the SerDe to use. The WITH SERDEPROPERTIES clause allows you to
provide one or more custom properties allowed by the SerDe.
[STORED AS file_format]
Specifies the file format for table data. If omitted, TEXTFILE is the default. Options for
file_format are:
• SEQUENCEFILE
• TEXTFILE
• RCFILE
• ORC
• PARQUET
• AVRO
• INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
[LOCATION 's3://bucket_name/[folder]/']
Specifies the location of the underlying data in Amazon S3 from which the table is created. The
location path must be a bucket name or a bucket name and one or more folders. If you are using
partitions, specify the root of the partitioned data. For more information about table location,
see Table Location in Amazon S3 (p. 98). For information about data format and permissions, see
Requirements for Tables in Athena and Data in Amazon S3 (p. 91).
Use a trailing slash for your folder or bucket. Do not use file names or glob characters.
Use:
s3://mybucket/
s3://mybucket/folder/
s3://mybucket/folder/anotherfolder/
Don't use:
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to-bucket/mydatafile.dat
[TBLPROPERTIES ( ['has_encrypted_data'='true | false',] ['classification'='aws_glue_classification',]
property_name=property_value [, ...] ) ]
Specifies custom metadata key-value pairs for the table definition in addition to predefined table
properties, such as "comment".
Athena has a built-in property, has_encrypted_data. Set this property to true to indicate that
the underlying dataset specified by LOCATION is encrypted. If omitted and if the workgroup's
settings do not override client-side settings, false is assumed. If omitted or set to false when
underlying data is encrypted, the query results in an error. For more information, see Configuring
Encryption Options (p. 264).
To run ETL jobs, AWS Glue requires that you create a table with the classification property
to indicate the data type for AWS Glue as csv, parquet, orc, avro, or json. For example,
'classification'='csv'. ETL jobs will fail if you do not specify this property. You can
subsequently specify it using the AWS Glue console, API, or CLI. For more information, see Using
AWS Glue Jobs for ETL with Athena (p. 27) and Authoring Jobs in Glue in the AWS Glue Developer
Guide.
For more information about creating tables, see Creating Tables in Athena (p. 90).
CREATE TABLE AS
Creates a new table populated with the results of a SELECT (p. 437) query. To create an empty table,
use CREATE TABLE (p. 454).
For additional information about CREATE TABLE AS beyond the scope of this reference topic, see
Creating a Table from Query Results (CTAS) (p. 136).
Topics
• Synopsis (p. 458)
• CTAS Table Properties (p. 458)
• Examples (p. 460)
Synopsis
CREATE TABLE table_name
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]
Where:
WITH ( property_name = expression [, ...] )
A list of optional CTAS table properties, some of which are specific to the data storage format. See
CTAS Table Properties (p. 458).
query
The SELECT (p. 437) query that Athena runs to populate the new table.
[ WITH [ NO ] DATA ]
If WITH NO DATA is used, a new empty table with the same schema as the original table is created.
external_location = [location]
Optional. The location where Athena saves your CTAS query results in Amazon S3, for example,
WITH (external_location = 's3://my-bucket/tables/parquet_table/').
Athena does not use the same path for query results twice. If you specify the location manually,
make sure that the Amazon S3 location that you specify has no data. Athena never attempts to
delete your data. If you want to use the same location again, manually delete the data, or your
CTAS query will fail.
If you run a CTAS query that specifies an external_location in a workgroup that enforces a
query results location (p. 366), the query fails with an error message. To see the query results
location specified for the workgroup, see the workgroup's details (p. 370).
If your workgroup overrides the client-side setting for query results location, Athena creates
your table in the following location:
s3://<workgroup-query-results-location>/tables/<query-id>/
If you do not use the external_location property to specify a location and your workgroup
does not override client-side settings, Athena uses your client-side setting (p. 127) for the query
results location to create your table in the following location:
s3://<query-results-location-setting>/<Unsaved-or-query-name>/<year>/<month/<date>/
tables/<query-id>/
format = [format]
The data format for the CTAS query results, such as ORC, PARQUET, AVRO, JSON, or TEXTFILE.
For example, WITH (format = 'PARQUET'). If omitted, PARQUET is used by default. The
name of this parameter, format, must be listed in lowercase, or your CTAS query will fail.
partitioned_by = ARRAY[ col_name[,…] ]
Optional. An array list of columns by which the CTAS table will be partitioned. Verify that the
names of partitioned columns are listed last in the list of columns in the SELECT statement.
bucketed_by = ARRAY[ bucket_name[,…] ]
An array list of buckets to bucket data. If omitted, Athena does not bucket your data in this
query.
bucket_count = [int]
The number of buckets for bucketing your data. If omitted, Athena does not bucket your data.
orc_compression = [format]
The compression type to use for ORC data. For example, WITH (orc_compression =
'ZLIB'). If omitted, GZIP compression is used by default for ORC and other data storage
formats supported by CTAS.
parquet_compression = [format]
The compression type to use for Parquet data. For example, WITH (parquet_compression =
'SNAPPY'). If omitted, GZIP compression is used by default for Parquet and other data storage
formats supported by CTAS.
field_delimiter = [delimiter]
Optional and specific to text-based data storage formats. The single-character field delimiter
for files in CSV, TSV, and text files. For example, WITH (field_delimiter = ','). Currently,
multicharacter field delimiters are not supported for CTAS queries. If you don't specify a field
delimiter, \001 is used by default.
Examples
For examples of CTAS queries, see Creating a Table from Query Results (CTAS) (p. 136) and Using CTAS
and INSERT INTO for ETL and Data Analysis (p. 145).
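The following sketch illustrates the general form of a CTAS query, assuming a hypothetical source table old_table and placeholder bucket, table, and column names:
CREATE TABLE new_parquet_table
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/tables/new_parquet_table/',
  partitioned_by = ARRAY['dt']
)
AS SELECT id, name, dt
FROM old_table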
CREATE VIEW
Creates a new view from a specified SELECT query. The view is a logical table that can be referenced by
future queries. Views do not contain any data and do not write data. Instead, the query specified by the
view runs each time you reference the view by another query.
Note
This topic provides summary information for reference. For more detailed information about
using views in Athena, see Working with Views (p. 131).
Synopsis
CREATE [ OR REPLACE ] VIEW view_name AS query
The optional OR REPLACE clause lets you update the existing view by replacing it. For more information,
see Creating Views (p. 134).
Examples
To create a view test from the table orders, use a query similar to the following:
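Assuming illustrative columns orderkey, orderstatus, and totalprice in the orders table:
CREATE VIEW test AS
SELECT orderkey, orderstatus, totalprice / 2 AS half
FROM orders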
To create a view orders_by_date from the table orders, use the following query:
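Assuming illustrative columns orderdate and totalprice:
CREATE VIEW orders_by_date AS
SELECT orderdate, sum(totalprice) AS price
FROM orders
GROUP BY orderdate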
See also SHOW COLUMNS (p. 465), SHOW CREATE VIEW (p. 466), DESCRIBE VIEW (p. 461), and
DROP VIEW (p. 463).
DESCRIBE TABLE
Shows the list of columns, including partition columns, for the named table. This allows you to
examine the attributes of a complex column.
Synopsis
DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name [PARTITION partition_spec] [col_name
( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )]
Parameters
[EXTENDED | FORMATTED]
Determines the format of the output. If you specify EXTENDED, all metadata for the table is
output in Thrift serialized form. This is useful primarily for debugging and not for general use. Use
FORMATTED or omit the clause to show the metadata in tabular format.
[PARTITION partition_spec]
If included, lists the metadata for the partition specified by partition_spec, where
partition_spec is in the format (partition_column = partition_col_value,
partition_column = partition_col_value, ...).
[col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ]
Specifies the column and attributes to examine. You can specify .field_name for an element of a
struct, '$elem$' for array element, '$key$' for a map key, and '$value$' for map value. You can
specify this recursively to further explore the complex column.
Examples
DESCRIBE orders;
DESCRIBE VIEW
Shows the list of columns for the named view. This allows you to examine the attributes of a complex
view.
Synopsis
DESCRIBE view_name
Example
DESCRIBE orders;
See also SHOW COLUMNS (p. 465), SHOW CREATE VIEW (p. 466), SHOW VIEWS (p. 469), and DROP
VIEW (p. 463).
DROP DATABASE
Removes the named database from the catalog. If the database contains tables, you must either drop the
tables before running DROP DATABASE or use the CASCADE clause. The use of DATABASE and SCHEMA
is interchangeable. They mean the same thing.
Synopsis
DROP {DATABASE | SCHEMA} [IF EXISTS] database_name [RESTRICT | CASCADE]
Parameters
[IF EXISTS]
Causes the error to be suppressed if database_name doesn't exist.
[RESTRICT|CASCADE]
Determines how tables within database_name are regarded during the DROP operation. If you
specify RESTRICT, the database is not dropped if it contains tables. This is the default behavior.
Specifying CASCADE causes the database and all its tables to be dropped.
Examples
DROP DATABASE clickstreams;
DROP TABLE
Removes the metadata table definition for the table named table_name. When you drop an external
table, the underlying data remains intact because all tables in Athena are EXTERNAL.
Synopsis
DROP TABLE [IF EXISTS] table_name
Parameters
[ IF EXISTS ]
Causes the error to be suppressed if table_name doesn't exist.
Examples
DROP TABLE fulfilled_orders
When using the Athena console query editor to drop a table that has special characters other than the
underscore (_), use backticks, as in the following example.
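For example, with a hypothetical table name that contains hyphens:
DROP TABLE `my-athena-database-01.my-athena-table`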
When using the JDBC connector to drop a table that has special characters, backtick characters are not
required.
DROP VIEW
Drops (deletes) an existing view. The optional IF EXISTS clause causes the error to be suppressed if the
view does not exist.
Synopsis
DROP VIEW [ IF EXISTS ] view_name
Examples
DROP VIEW orders_by_date
See also CREATE VIEW (p. 460), SHOW COLUMNS (p. 465), SHOW CREATE VIEW (p. 466), SHOW
VIEWS (p. 469), and DESCRIBE VIEW (p. 461).
MSCK REPAIR TABLE
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible
partitions that were added to the file system after the table was created. MSCK REPAIR TABLE
compares the partitions in the table metadata and the partitions in S3. If new partitions are present in
the S3 location that you specified when you created the table, it adds those partitions to the metadata
and to the Athena table.
When you add physical partitions, the metadata in the catalog becomes inconsistent with the layout of
the data in the file system, and information about the new partitions needs to be added to the catalog.
To update the metadata, run MSCK REPAIR TABLE so that you can query the data in the new partitions
from Athena.
Note
MSCK REPAIR TABLE only adds partitions to metadata; it does not remove them. To remove
partitions from metadata after the partitions have been manually deleted in Amazon S3, run
the command ALTER TABLE table-name DROP PARTITION. For more information see
ALTER TABLE DROP PARTITION (p. 450).
• Because the command traverses your file system running Amazon S3 HeadObject and GetObject
commands, the cost of bytes scanned can be significant if your file system is large or contains a large
amount of data.
• It can take some time to add all partitions. If this operation times out, the catalog is left in an
incomplete state where only some partitions are added. Run MSCK REPAIR TABLE on the same table
again until all partitions are added. For more information, see Partitioning
Data (p. 104).
• For partitions that are not compatible with Hive, use ALTER TABLE ADD PARTITION (p. 449) to load
the partitions so that you can query their data.
• Partition locations to be used with Athena must use the s3 protocol (for example,
s3://bucket/folder/). In Athena, locations that use other protocols (for example,
s3a://bucket/folder/) will result in query failures when MSCK REPAIR TABLE queries are run on
the containing tables.
• Because MSCK REPAIR TABLE scans both a folder and its subfolders to find a matching partition scheme,
be sure to keep data for separate tables in separate folder hierarchies. For example, suppose you have
data for table A in s3://table-a-data and data for table B in s3://table-a-data/table-b-
data. If both tables are partitioned by string, MSCK REPAIR TABLE will add the partitions for table B
to table A. To avoid this, use separate folder structures like s3://table-a-data and s3://table-
b-data instead. Note that this behavior is consistent with Amazon EMR and Apache Hive.
Synopsis
MSCK REPAIR TABLE table_name
Examples
MSCK REPAIR TABLE orders;
Troubleshooting
After you run MSCK REPAIR TABLE, if Athena does not add the partitions to the table in the AWS Glue
Data Catalog, check the following:
• Make sure that the AWS Identity and Access Management (IAM) user or role has a policy that allows
the glue:BatchCreatePartition action.
• Make sure that the IAM user or role has a policy with sufficient permissions to access Amazon S3,
including the s3:DescribeJob action. For an example of which Amazon S3 actions to allow, see the
example bucket policy in Cross-account Access in Athena to Amazon S3 Buckets (p. 282).
• Make sure that the Amazon S3 path is in lower case instead of camel case (for example, userid
instead of userId).
• Query timeouts – MSCK REPAIR TABLE is best used when creating a table for the first time or when
there is uncertainty about parity between data and partition metadata. If you use MSCK REPAIR
TABLE to add new partitions frequently (for example, on a daily basis) and are experiencing query
timeouts, consider using ALTER TABLE ADD PARTITION (p. 449).
• Partitions missing from filesystem – If you delete a partition manually in Amazon S3 and then
run MSCK REPAIR TABLE, you may receive the error message Partitions missing from filesystem.
This occurs because MSCK REPAIR TABLE doesn't remove stale partitions from table metadata. To
remove the deleted partitions from table metadata, run ALTER TABLE DROP PARTITION (p. 450)
instead. Note that SHOW PARTITIONS (p. 467) similarly lists only the partitions in metadata, not the
partitions in the file system.
Don't use (camel case):
s3://bucket/path/userId=1/
s3://bucket/path/userId=2/
s3://bucket/path/userId=3/
Use (lower case):
s3://bucket/path/userid=1/
s3://bucket/path/userid=2/
s3://bucket/path/userid=3/
SHOW COLUMNS
Lists the columns in the schema for a base table or a view. To use a SELECT statement to show columns,
see Listing or Searching Columns for a Specified Table or View (p. 252).
Synopsis
SHOW COLUMNS IN table_name|view_name
Examples
SHOW COLUMNS IN clicks;
SHOW CREATE TABLE
Shows the CREATE TABLE statement for the named table.
Synopsis
SHOW CREATE TABLE [db_name.]table_name
Parameters
TABLE [db_name.]table_name
The db_name parameter is optional. If omitted, the context defaults to the current database.
Note
The table name is required.
Examples
SHOW CREATE TABLE orderclickstoday;
SHOW CREATE VIEW
Shows the SQL statement that creates the named view.
Synopsis
SHOW CREATE VIEW view_name
Examples
SHOW CREATE VIEW orders_by_date
See also CREATE VIEW (p. 460) and DROP VIEW (p. 463).
SHOW DATABASES
Lists all databases defined in the metastore. You can use DATABASES or SCHEMAS. They mean the same
thing.
Synopsis
SHOW {DATABASES | SCHEMAS} [LIKE 'regular_expression']
Parameters
[LIKE 'regular_expression']
Filters the list of databases to those that match the regular_expression that you specify. For
wildcard character matching, you can use the combination .*, which matches any character zero to
unlimited times.
Examples
SHOW SCHEMAS;
SHOW PARTITIONS
Lists all the partitions in a table.
Synopsis
SHOW PARTITIONS table_name
• To show the partitions in a table and list them in a specific order, see the Listing Partitions for a
Specific Table (p. 251) section on the Querying AWS Glue Data Catalog (p. 249) page.
• To view the contents of a partition, see the Query the Data (p. 106) section on the Partitioning
Data (p. 104) page.
• SHOW PARTITIONS does not list partitions that are projected by Athena but not registered in the
AWS Glue catalog. For information about partition projection, see Partition Projection with Amazon
Athena (p. 109).
• SHOW PARTITIONS lists the partitions in metadata, not the partitions in the actual file system. To
update the metadata after you delete partitions manually in Amazon S3, run ALTER TABLE DROP
PARTITION (p. 450).
Examples
SHOW PARTITIONS clicks;
SHOW TABLES
Lists all the base tables and views in a database.
Synopsis
SHOW TABLES [IN database_name] ['regular_expression']
Parameters
[IN database_name]
Specifies the database_name from which tables will be listed. If omitted, the database from the
current context is assumed.
['regular_expression']
Filters the list of tables to those that match the regular_expression you specify. Only the
wildcard *, which indicates any character, or |, which indicates a choice between characters, can be
used.
Examples
Example – show all of the tables in the database sampledb
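A query like the following produces these results:
SHOW TABLES IN sampledb;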
Results
alb_logs
cloudfront_logs
elb_logs
flights_2016
flights_parquet
view_2016_flights_dfw
Example – show the names of all tables in sampledb that include the word "flights"
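A query like the following, using the wildcard syntax described above, produces these results:
SHOW TABLES IN sampledb '*flights*';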
Results
flights_2016
flights_parquet
view_2016_flights_dfw
Example – show the names of all tables in sampledb that end in the word "logs"
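A query like the following produces these results:
SHOW TABLES IN sampledb '*logs';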
Results
alb_logs
cloudfront_logs
elb_logs
SHOW TBLPROPERTIES
Lists table properties for the named table.
Synopsis
SHOW TBLPROPERTIES table_name [('property_name')]
Parameters
[('property_name')]
If included, only the value of the property named property_name is listed.
Examples
SHOW TBLPROPERTIES orders;
SHOW VIEWS
Lists the views in the specified database, or in the current database if you omit the database name. Use
the optional LIKE clause with a regular expression to restrict the list of view names.
Athena returns a list of STRING type values where each value is a view name.
Synopsis
SHOW VIEWS [IN database_name] LIKE ['regular_expression']
Parameters
[IN database_name]
Specifies the database_name from which views will be listed. If omitted, the database from the
current context is assumed.
[LIKE 'regular_expression']
Filters the list of views to those that match the regular_expression you specify. Only the
wildcard *, which indicates any character, or |, which indicates a choice between characters, can be
used.
Examples
SHOW VIEWS;
See also SHOW COLUMNS (p. 465), SHOW CREATE VIEW (p. 466), DESCRIBE VIEW (p. 461), and
DROP VIEW (p. 463).
Cross-Regional Queries
Athena supports queries across only the following Regions. Queries across other Regions may produce
the error message InvalidToken: The provided token is malformed or otherwise invalid.
Troubleshooting in Athena
The Athena team has gathered the following troubleshooting information from customer issues.
Although not comprehensive, it includes advice regarding some common performance, timeout, and out
of memory issues.
Topics
• CREATE TABLE AS SELECT (CTAS) (p. 472)
• Data File Issues (p. 472)
• Federated Queries (p. 474)
• JSON Related Errors (p. 474)
• MSCK REPAIR TABLE (p. 475)
• Output Issues (p. 475)
• Partitioning Issues (p. 476)
• Permissions (p. 477)
• Query Syntax Issues (p. 478)
• Throttling Issues (p. 478)
• Views (p. 479)
• Workgroups (p. 479)
• Additional Resources (p. 479)
• The data type defined in the table doesn't match the source data, or a single field contains
different types of data. For suggested resolutions, see My Amazon Athena query fails with the error
"HIVE_BAD_DATA: Error parsing field value for field X: For input string: "12312845691"" in the AWS
Knowledge Center.
• Null values are present in an integer field. One workaround is to create the column with the null values
as string and then use CAST to convert the field in a query, supplying a default value of 0 for nulls (see
the sketch after this list).
For more information, see When I query CSV data in Athena, I get the error "HIVE_BAD_DATA: Error
parsing field value '' for field X: For input string: """ in the AWS Knowledge Center.
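A minimal sketch of the CAST workaround mentioned above, assuming a hypothetical column named id that was created as string:
SELECT COALESCE(TRY_CAST(id AS INTEGER), 0) AS id
FROM my_table;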
HIVE_CURSOR_ERROR:
com.amazonaws.services.s3.model.AmazonS3Exception:
The specified key does not exist
This error usually occurs when a file is removed while a query is running. Either rerun the query, or check
your workflow to see whether another job or process is modifying the files while the query is running.
• The AWS Glue crawler wasn't able to classify the data format
For more information, see How do I resolve the error "unable to create input format" in Athena? in the
AWS Knowledge Center or watch the Knowledge Center video.
Federated Queries
For information on troubleshooting federated queries, see Common_Problems in the awslabs/aws-
athena-query-federation section of GitHub.
Output Issues
Unable to verify/create output bucket
This error can occur if the specified query result location doesn't exist or if the proper permissions are not
present. For more information, see How do I resolve the "Unable to verify/create output bucket" error in
Amazon Athena? in the AWS Knowledge Center.
Partitioning Issues
MSCK REPAIR TABLE does not remove stale partitions
If you delete a partition manually in Amazon S3 and then run MSCK REPAIR TABLE, you may receive
the error message Partitions missing from filesystem. This occurs because MSCK REPAIR TABLE doesn't
remove stale partitions from table metadata. Use ALTER TABLE DROP PARTITION (p. 450) to remove the
stale partitions manually. For more information, see the "Troubleshooting" section of the MSCK REPAIR
TABLE (p. 463) topic.
HIVE_PARTITION_SCHEMA_MISMATCH
If the schema of a partition differs from the schema of the table, a query can fail with the error message
HIVE_PARTITION_SCHEMA_MISMATCH. For more information, see Syncing Partition Schema to Avoid
"HIVE_PARTITION_SCHEMA_MISMATCH" (p. 24).
Permissions
Access Denied Error when querying Amazon S3
This can occur when you don't have permission to read the data in the bucket, permission to write to the
results bucket, or the Amazon S3 path contains a Region endpoint like us-east-1.amazonaws.com.
For more information, see When I run an Athena query, I get an "Access Denied" error in the AWS
Knowledge Center.
For more information, see How do I use my IAM role credentials or switch to another IAM role when
connecting to Athena using the JDBC driver? in the AWS Knowledge Center.
Query Syntax Issues
Throttling Issues
If your queries exceed the limits of dependent services such as Amazon S3, AWS KMS, AWS Glue, or
AWS Lambda, the following messages can be expected. To resolve these issues, reduce the number of
concurrent calls that originate from the same account.
AWS KMS – You have exceeded the rate at which you may call KMS. Reduce the frequency of your
calls.
Views
Views created in Apache Hive shell do not work in
Athena
Because of their fundamentally different implementations, views created in Apache Hive shell are not
compatible with Athena. To resolve this issue, re-create the views in Athena.
Workgroups
For information on troubleshooting workgroup issues, see Troubleshooting Workgroups (p. 373).
Additional Resources
The following pages provide additional information for troubleshooting issues with Amazon Athena.
Troubleshooting often requires iterative query and discovery by an expert or from a community of
helpers. If you continue to experience issues after trying the suggestions on this page, contact AWS
Support (in the AWS console, click Support, Support Center) or visit the Amazon Athena Forum.
Physical Limits
In general, Athena limits the runtime of each query to 30 minutes. Queries that run beyond this limit
are automatically cancelled without charge. If a query runs out of memory or a node crashes during
processing, errors like the following can occur:
INTERNAL_ERROR_QUERY_ENGINE
Encountered too many errors talking to a worker node. The node may have crashed or be under
too much load.
Due to physical hardware limitations, Athena currently limits per-node memory usage to 14GB and
overall query memory usage to 400GB. Athena also limits the per node data spilled to 90GB. These limits
are subject to change. Spill to disk does not happen on every query.
Data Size
Avoid single large files – Single files are loaded into a single node for processing. If your file size is
extremely large, try to break up the file into smaller files and use partitions to organize them.
Read a smaller amount of data at once – Scanning a large amount of data at one time can slow down
the query and increase cost. Use partitions or filters to limit the files to be scanned.
This exception is usually caused by having too many columns in the query. Reduce the number of
columns in the query, or create subqueries and use a JOIN that retrieves a smaller amount of data.
Avoid large query outputs – Because query results are written to Amazon S3 by a single Athena node,
a large amount of output data can slow performance. To work around this, try using CTAS (p. 458) to
create a new table with the result of the query or INSERT INTO (p. 442) to append new results into an
existing table.
Avoid CTAS queries with a large output – Because output data is written by a single node, CTAS queries
can also use a large amount of memory. If you are outputting a large amount of data, try separating the
task into smaller queries.
If possible, avoid having a large number of small files – Amazon S3 has a limit of 5500 requests per
second. Athena queries share the same limit. If you need to scan millions of small objects in a single
query, your query can be easily throttled by Amazon S3. To avoid excessive scanning, use AWS Glue
ETL to periodically compact your files or partition the table and add partition key filters. For more
information, see Reading Input Files in Larger Groups in the AWS Glue Developer Guide or How can I
configure an AWS Glue ETL job to output larger files? in the AWS Knowledge Center.
Avoid scanning an entire table – Use the following techniques to avoid scanning entire tables:
• Limit the use of "*". Try not to select all columns unless necessary.
• Avoid scanning the same table multiple times in the same query
• Use filters to reduce the amount of data to be scanned.
• Whenever possible, add a LIMIT clause.
Avoid referring to many views and tables in a single query – Because queries with many views and/or
tables must load a large amount of data to a single node, out of memory errors can occur. If possible,
avoid referring to an excessive number of views or tables in a single query.
Avoid large JSON strings – If data is stored in a single JSON string and the size of the JSON data is large,
out of memory errors can occur when the JSON data is processed.
File Formats
Use an efficient file format such as Parquet or ORC – To dramatically reduce query running time and
costs, use compressed Parquet or ORC files to store your data. To convert your existing dataset to those
formats in Athena, you can use CTAS. For more information, see Using CTAS and INSERT INTO for ETL
and Data Analysis (p. 145).
Switch between ORC and Parquet formats – Experience shows that the same set of data can have
significant differences in processing time depending on whether it is stored in ORC or Parquet format. If
you are experiencing performance issues, try a different format.
Hudi queries – Because Hudi queries (p. 204) bypass the native reader and split generator for files in
parquet format, they can be slow. Keep this in mind when querying Hudi datasets.
Consider using UNION ALL – To eliminate duplicates, UNION builds a hash table, which consumes
memory. If your query does not require the elimination of duplicates, consider using UNION ALL for
better performance.
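For example, if duplicate rows are acceptable (table and column names are illustrative):
SELECT id FROM table_a
UNION ALL
SELECT id FROM table_b;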
Use CTAS as an intermediary step to speed up JOIN operations – Instead of loading and processing
intermediary data with every query, use CTAS to persist the intermediary data into Amazon S3. This can
help speed up the performance of operations like JOIN.
Partitioning
Limit the number of partitions in a table – When a table has more than 100,000 partitions, queries can
be slow because of the large number of requests sent to AWS Glue to retrieve partition information. To
resolve this issue, try one of the following options:
• Use ALTER TABLE DROP PARTITION (p. 450) to remove stale partitions.
• If your partition pattern is predictable, use partition projection (p. 109).
Remove old partitions even if they are empty – Even if a partition is empty, the metadata of the
partition is still stored in AWS Glue. Loading these unneeded partitions can increase query runtimes. To
remove the unneeded partitions, use ALTER TABLE DROP PARTITION (p. 450).
Look up a single partition – When looking up a single partition, try to provide all partition values so
that Athena can locate the partition with a single call to AWS Glue. Otherwise, Athena must retrieve all
partitions and filter them. This can be costly and greatly increase the planning time for your query. If you
have a predictable partition pattern, you can use partition projection (p. 109) to avoid the partition look
up calls to AWS Glue.
Set reasonable partition projection properties – When using partition projection (p. 109), Athena tries
to create a partition object for every partition name. Because of this, make sure that the table properties
that you define do not create a near infinite amount of possible partitions.
To add new partitions frequently, use ALTER TABLE ADD PARTITION – If you use MSCK REPAIR
TABLE to add new partitions frequently (for example, on a daily basis) and are experiencing query
timeouts, consider using ALTER TABLE ADD PARTITION (p. 449). MSCK REPAIR TABLE is best used when
creating a table for the first time or when there is uncertainty about parity between data and partition
metadata.
Avoid using coalesce() in a WHERE clause with partitioned columns – Under some circumstances, using
coalesce() or other functions in a WHERE clause against partitioned columns might result in reduced
performance. If this occurs, try rewriting your query to provide the same functionality without using
coalesce().
Window Functions
Minimize the use of window functions – Window functions such as rank() are memory intensive.
In general, window functions require an entire dataset to be loaded into a single Athena node for
processing. With an extremely large dataset, this can risk crashing the node. To avoid this, try the
following options:
• Filter the data and run window functions on a subset of the data.
• Use the PARTITION BY clause with the window function whenever possible.
• Find an alternative way to construct the query.
Use regular expressions instead of LIKE on large strings – Queries that include clauses such as LIKE
'%string%' on large strings can be very costly. Consider using the regexp_like() function and a regular
expression instead.
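For example, instead of a filter like LIKE '%some_pattern%', a form like the following can be used (table and column names are illustrative):
SELECT * FROM my_table
WHERE regexp_like(my_column, 'some_pattern');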
Use max() instead of element_at(array_sort(), 1) – For increased speed, replace the nested functions
element_at(array_sort(), 1) with max().
Additional Resources
For additional information on performance tuning in Athena, consider the following resources:
• Read the AWS Big Data blog post Top 10 Performance Tuning Tips for Amazon Athena
• Read other Athena posts in the AWS Big Data Blog
• Visit the Amazon Athena Forum
• Consult the Athena topics in the AWS Knowledge Center
• Contact AWS Support (in the AWS console, click Support, Support Center)
This section provides Athena code samples, links to earlier versions of the JDBC driver, and information
about service quotas.
Topics
• Code Samples (p. 485)
• Using Earlier Version JDBC Drivers (p. 493)
• Service Quotas (p. 499)
Code Samples
Use the examples in this topic as a starting point for writing Athena applications using the SDK for Java
2.x. For more information about running the Java code examples, see the Amazon Athena Java Readme
on the AWS Code Examples Repository on GitHub.
Note
These samples use constants (for example, ATHENA_SAMPLE_QUERY) for strings, which are
defined in an ExampleConstants.java class declaration. Replace these constants with your
own strings or defined constants.
Constants
The ExampleConstants.java class demonstrates how to query a table created by the Getting
Started (p. 8) tutorial in Athena.
package aws.example.athena;
Create a Client to Access Athena
package aws.example.athena;
import software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.AthenaClientBuilder;
package aws.example.athena;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionResponse;
import software.amazon.awssdk.services.athena.model.AthenaException;
import software.amazon.awssdk.services.athena.model.GetQueryExecutionRequest;
import software.amazon.awssdk.services.athena.model.GetQueryExecutionResponse;
import software.amazon.awssdk.services.athena.model.QueryExecutionState;
import software.amazon.awssdk.services.athena.model.GetQueryResultsRequest;
import software.amazon.awssdk.services.athena.model.GetQueryResultsResponse;
import software.amazon.awssdk.services.athena.model.ColumnInfo;
import software.amazon.awssdk.services.athena.model.Row;
import software.amazon.awssdk.services.athena.model.Datum;
import software.amazon.awssdk.services.athena.paginators.GetQueryResultsIterable;
import java.util.List;
Start Query Execution
// Submits a sample query to Amazon Athena and returns the execution ID of the query
public static String submitAthenaQuery(AthenaClient athenaClient) {
try {
// The result configuration specifies where the results of the query should go
ResultConfiguration resultConfiguration = ResultConfiguration.builder()
.outputLocation(ExampleConstants.ATHENA_OUTPUT_BUCKET)
.build();
StartQueryExecutionRequest startQueryExecutionRequest =
StartQueryExecutionRequest.builder()
.queryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.queryExecutionContext(queryExecutionContext)
.resultConfiguration(resultConfiguration)
.build();
StartQueryExecutionResponse startQueryExecutionResponse =
athenaClient.startQueryExecution(startQueryExecutionRequest);
return startQueryExecutionResponse.queryExecutionId();
} catch (AthenaException e) {
e.printStackTrace();
System.exit(1);
}
return "";
}
GetQueryExecutionResponse getQueryExecutionResponse;
boolean isQueryStillRunning = true;
while (isQueryStillRunning) {
getQueryExecutionResponse =
athenaClient.getQueryExecution(getQueryExecutionRequest);
String queryState =
getQueryExecutionResponse.queryExecution().status().state().toString();
if (queryState.equals(QueryExecutionState.FAILED.toString())) {
throw new RuntimeException("The Amazon Athena query failed to run with
error message: " + getQueryExecutionResponse
.queryExecution().status().stateChangeReason());
} else if (queryState.equals(QueryExecutionState.CANCELLED.toString())) {
throw new RuntimeException("The Amazon Athena query was cancelled.");
} else if (queryState.equals(QueryExecutionState.SUCCEEDED.toString())) {
isQueryStillRunning = false;
} else {
// Sleep an amount of time before retrying again
Thread.sleep(ExampleConstants.SLEEP_AMOUNT_IN_MS);
}
System.out.println("The current status is: " + queryState);
}
}
try {
GetQueryResultsIterable getQueryResultsResults =
athenaClient.getQueryResultsPaginator(getQueryResultsRequest);
} catch (AthenaException e) {
e.printStackTrace();
System.exit(1);
}
}
Stop Query Execution
package aws.example.athena;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.StopQueryExecutionRequest;
import software.amazon.awssdk.services.athena.model.GetQueryExecutionRequest;
import software.amazon.awssdk.services.athena.model.GetQueryExecutionResponse;
import software.amazon.awssdk.services.athena.model.QueryExecutionState;
import software.amazon.awssdk.services.athena.model.AthenaException;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionResponse;
try {
StopQueryExecutionRequest stopQueryExecutionRequest =
StopQueryExecutionRequest.builder()
.queryExecutionId(sampleQueryExecutionId)
.build();
athenaClient.stopQueryExecution(stopQueryExecutionRequest);
GetQueryExecutionResponse getQueryExecutionResponse =
athenaClient.getQueryExecution(getQueryExecutionRequest);
if (getQueryExecutionResponse.queryExecution()
.status()
.state()
.equals(QueryExecutionState.CANCELLED)) {
} catch (AthenaException e) {
e.printStackTrace();
System.exit(1);
}
}
try {
QueryExecutionContext queryExecutionContext = QueryExecutionContext.builder()
.database(ExampleConstants.ATHENA_DEFAULT_DATABASE).build();
StartQueryExecutionRequest startQueryExecutionRequest =
StartQueryExecutionRequest.builder()
.queryExecutionContext(queryExecutionContext)
.queryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.resultConfiguration(resultConfiguration).build();
StartQueryExecutionResponse startQueryExecutionResponse =
athenaClient.startQueryExecution(startQueryExecutionRequest);
return startQueryExecutionResponse.queryExecutionId();
} catch (AthenaException e) {
e.printStackTrace();
System.exit(1);
}
return null;
}
}
List Query Executions
package aws.example.athena;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.AthenaException;
import software.amazon.awssdk.services.athena.model.ListQueryExecutionsRequest;
import software.amazon.awssdk.services.athena.model.ListQueryExecutionsResponse;
import software.amazon.awssdk.services.athena.paginators.ListQueryExecutionsIterable;
import java.util.List;
listQueryIds(athenaClient);
athenaClient.close();
}
try {
ListQueryExecutionsRequest listQueryExecutionsRequest =
ListQueryExecutionsRequest.builder().build();
ListQueryExecutionsIterable listQueryExecutionResponses =
athenaClient.listQueryExecutionsPaginator(listQueryExecutionsRequest);
Create a Named Query
package aws.example.athena;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.AthenaException;
import software.amazon.awssdk.services.athena.model.CreateNamedQueryRequest;
if (args.length != 1) {
System.out.println(USAGE);
System.exit(1);
}
createNamedQuery(athenaClient, name);
athenaClient.close();
}
try {
// Create the named query request.
CreateNamedQueryRequest createNamedQueryRequest =
CreateNamedQueryRequest.builder()
.database(ExampleConstants.ATHENA_DEFAULT_DATABASE)
.queryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.description("Sample Description")
.name(name)
.build();
athenaClient.createNamedQuery(createNamedQueryRequest);
System.out.println("Done");
} catch (AthenaException e) {
e.printStackTrace();
System.exit(1);
}
}
}
Delete a Named Query
package aws.example.athena;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.DeleteNamedQueryRequest;
import software.amazon.awssdk.services.athena.model.AthenaException;
import software.amazon.awssdk.services.athena.model.CreateNamedQueryRequest;
import software.amazon.awssdk.services.athena.model.CreateNamedQueryResponse;
if (args.length != 1) {
System.out.println(USAGE);
System.exit(1);
}
try {
DeleteNamedQueryRequest deleteNamedQueryRequest =
DeleteNamedQueryRequest.builder()
.namedQueryId(sampleNamedQueryId)
.build();
athenaClient.deleteNamedQuery(deleteNamedQueryRequest);
} catch (AthenaException e) {
e.printStackTrace();
System.exit(1);
}
}
CreateNamedQueryResponse createNamedQueryResponse =
athenaClient.createNamedQuery(createNamedQueryRequest);
return createNamedQueryResponse.namedQueryId();
} catch (AthenaException e) {
e.printStackTrace();
System.exit(1);
}
return null;
}
}
List Named Queries
package aws.example.athena;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.AthenaException;
import software.amazon.awssdk.services.athena.model.ListNamedQueriesRequest;
import software.amazon.awssdk.services.athena.model.ListNamedQueriesResponse;
import software.amazon.awssdk.services.athena.paginators.ListNamedQueriesIterable;
import java.util.List;
listNamedQueries(athenaClient);
athenaClient.close();
}
try{
ListNamedQueriesRequest listNamedQueriesRequest =
ListNamedQueriesRequest.builder()
.build();
ListNamedQueriesIterable listNamedQueriesResponses =
athenaClient.listNamedQueriesPaginator(listNamedQueriesRequest);
for (ListNamedQueriesResponse listNamedQueriesResponse :
listNamedQueriesResponses) {
List<String> namedQueryIds = listNamedQueriesResponse.namedQueryIds();
System.out.println(namedQueryIds);
}
} catch (AthenaException e) {
e.printStackTrace();
System.exit(1);
}
}
}
Using Earlier Version JDBC Drivers
Release 2.0.8
• JDBC 4.2 and JDK 8.0 compatible – AthenaJDBC42-2.0.8.jar
• JDBC 4.1 and JDK 7.0 compatible – AthenaJDBC41-2.0.8.jar
• Release Notes, Notices, License Agreement, Installation and Configuration Guide (PDF), Migration Guide (PDF)
Release 2.0.7
• JDBC 4.2 and JDK 8.0 compatible – AthenaJDBC42-2.0.7.jar
• JDBC 4.1 and JDK 7.0 compatible – AthenaJDBC41-2.0.7.jar
• Release Notes, Notices, License Agreement, Installation and Configuration Guide (PDF), Migration Guide (PDF)
Release 2.0.6
• JDBC 4.2 and JDK 8.0 compatible – AthenaJDBC42-2.0.6.jar
• JDBC 4.1 and JDK 7.0 compatible – AthenaJDBC41-2.0.6.jar
• Release Notes, Notices, License Agreement, Installation and Configuration Guide (PDF), Migration Guide (PDF)
Release 2.0.5
• JDBC 4.2 and JDK 8.0 compatible – AthenaJDBC42-2.0.5.jar
• JDBC 4.1 and JDK 7.0 compatible – AthenaJDBC41-2.0.5.jar
• Release Notes, Notices, License Agreement, Installation and Configuration Guide (PDF), Migration Guide (PDF)
Release 2.0.2
• JDBC 4.2 and JDK 8.0 compatible – AthenaJDBC42-2.0.2.jar
• JDBC 4.1 and JDK 7.0 compatible – AthenaJDBC41-2.0.2.jar
• Release Notes, Notices, License Agreement, Installation and Configuration Guide (PDF), Migration Guide (PDF)
Instructions for JDBC Driver version 1.1.0
The JDBC driver version 1.0.1 and earlier versions are deprecated.
JDBC driver version 1.1.0 is compatible with JDBC 4.1 and JDK 7.0. Use the following link to download
the driver: AthenaJDBC41-1.1.0.jar. Also, download the driver license, and the third-party licenses for
the driver. Use the AWS CLI with the following command: aws s3 cp s3://path_to_the_driver
[local_directory], and then use the remaining instructions in this section.
Note
The following instructions are specific to JDBC version 1.1.0 and earlier.
To specify the JDBC driver connection URL in your custom application, use a string in the following
format:
jdbc:awsathena://athena.{REGION}.amazonaws.com:443
where {REGION} is a region identifier, such as us-west-2. For information on Athena regions see
Regions.
JDBC Driver Version 1.1.0: Specify the JDBC Driver Class Name
To use the driver in custom applications, set up your Java class path to the location of the JAR file
that you downloaded from Amazon S3 https://s3.amazonaws.com/athena-downloads/drivers/JDBC/
AthenaJDBC_1.1.0/AthenaJDBC41-1.1.0.jar. This makes the classes within the JAR available for use. The
main JDBC driver class is com.amazonaws.athena.jdbc.AthenaDriver.
Another method to supply credentials to BI tools, such as SQL Workbench, is to supply the credentials
used for the JDBC as AWS access key and AWS secret key for the JDBC properties for user and password,
respectively.
Users who connect through the JDBC driver and have custom access policies attached to their profiles
need permissions for policy actions in addition to those in the Amazon Athena API Reference.
• athena:GetCatalogs
• athena:GetExecutionEngine
• athena:GetExecutionEngines
• athena:GetNamespace
• athena:GetNamespaces
• athena:GetTable
• athena:GetTables
• log_path – Local path of the Athena JDBC driver logs. If no log path is provided, then no log files are created. Default value: N/A. Required: No.
• log_level – Log level of the Athena JDBC driver logs. Valid values: INFO, DEBUG, WARN, ERROR, ALL, OFF, FATAL, TRACE. Default value: N/A. Required: No.
Examples: Using the 1.1.0 Version of the JDBC Driver with the
JDK
The following code examples demonstrate how to use the JDBC driver version 1.1.0 in a Java application.
These examples assume that the AWS JAVA SDK is included in your classpath, specifically the aws-java-
sdk-core module, which includes the authorization packages (com.amazonaws.auth.*) referenced in
the examples.
info.put("aws_credentials_provider_class","com.amazonaws.auth.DefaultAWSCredentialsProviderChain");
Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
The following examples demonstrate different ways to use a credentials provider that implements the
AWSCredentialsProvider interface with the previous version of the JDBC driver.
myProps.put("aws_credentials_provider_class","com.amazonaws.auth.PropertiesFileCredentialsProvider");
myProps.put("aws_credentials_provider_arguments","/Users/
myUser/.athenaCredentials");
accessKey = ACCESSKEY
secretKey = SECRETKEY
Replace the right part of the assignments with your account's AWS access and secret keys.
myProps.put("aws_credentials_provider_class","com.amazonaws.athena.jdbc.CustomSessionsCredentialsProvi
String providerArgs = "My_Access_Key," + "My_Secret_Key," + "My_Token";
myProps.put("aws_credentials_provider_arguments",providerArgs);
Note
If you use the InstanceProfileCredentialsProvider, you don't need to supply any
credential provider arguments because they are provided using the Amazon EC2 instance
profile for the instance on which you are running your application. You would still set the
aws_credentials_provider_class property to this class name, however.
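As an illustration only, the configuration for that case might look like the following; the class name shown is from the AWS SDK for Java and is an assumption you should verify against your SDK version.

import java.util.Properties;

Properties myProps = new Properties();
// No aws_credentials_provider_arguments property is needed here; on Amazon EC2,
// the credentials come from the instance profile attached to the instance.
myProps.put("aws_credentials_provider_class",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider");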
The deprecated actions and the Athena API actions that replace them are as follows:
• athena:RunQuery – replaced by athena:StartQueryExecution
• athena:CancelQueryExecution – replaced by athena:StopQueryExecution
• athena:GetQueryExecutions – replaced by athena:ListQueryExecutions
Service Quotas
Note
The Service Quotas console provides information about Amazon Athena quotas. Along with
viewing the default quotas, you can use the Service Quotas console to request quota increases
for the quotas that are adjustable.
Queries
Your account has the following default query-related quotas per AWS Region for Amazon Athena:
• DDL query quota – 20 DDL active queries. DDL queries include CREATE TABLE and CREATE TABLE
ADD PARTITION queries.
• DDL query timeout – The DDL query timeout is 600 minutes.
• DML query quota – 25 DML active queries. DML queries include SELECT and CREATE TABLE AS
(CTAS) queries.
• DML query timeout – The DML query timeout is 30 minutes.
These are soft quotas; you can use the Athena Service Quotas console to request a quota increase.
Athena processes queries by assigning resources based on the overall service load and the number of
incoming requests. Your queries may be temporarily queued before they run. Asynchronous processes
pick up the queries from queues and run them on physical resources as soon as the resources become
available and for as long as your account configuration permits.
A DML or DDL query quota includes both running and queued queries. For example, if you are using the
default DML quota and your total of running and queued queries reaches 25, submitting query 26 results in a
"too many queries" error.
Workgroups
When you work with Athena workgroups, remember the following points:
AWS Glue
• If you are using the AWS Glue Data Catalog with Athena, see AWS Glue Endpoints and Quotas for
service quotas on tables, databases, and partitions.
• If you are not using AWS Glue Data Catalog, the number of partitions per table is 20,000. You can
request a quota increase.
Note
If you have not yet migrated to AWS Glue Data Catalog, see Upgrading to the AWS Glue Data
Catalog Step-by-Step (p. 29) for migration instructions.
Amazon S3 Buckets
When you work with Amazon S3 buckets, remember the following points:
Athena has the following default per-account quotas for the number of API calls per second and the accumulated burst capacity:
• BatchGetNamedQuery, ListNamedQueries, ListQueryExecutions – 5 calls per second; burst capacity of up to 10 calls
• CreateNamedQuery, DeleteNamedQuery, GetNamedQuery – 5 calls per second; burst capacity of up to 20 calls
• BatchGetQueryExecution – 20 calls per second; burst capacity of up to 40 calls
• StartQueryExecution, StopQueryExecution – 20 calls per second; burst capacity of up to 80 calls
For example, for StartQueryExecution, you can make up to 20 calls per second. In addition, if this API
is not called for 4 seconds, your account accumulates a burst capacity of up to 80 calls. In this case, your
application can make up to 80 calls to this API in burst mode.
If you use any of these APIs and exceed the default quota for the number of calls per second, or the
burst capacity in your account, the Athena API issues an error similar to the following: "ClientError: An
error occurred (ThrottlingException) when calling the <API_name> operation: Rate exceeded." Reduce
the number of calls per second, or the burst capacity for the API, for this account. To request a quota
increase, contact AWS Support. Open the AWS Support Center page, sign in if necessary, and choose
Create case. Choose Service limit increase. Complete and submit the form.
Note
This quota cannot be changed in the Athena Service Quotas console.
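As an illustration only (not part of the guide's examples), the following sketch shows one way an application might back off and retry when a call to the Athena API is throttled. The query string, output location, retry count, and error codes checked are assumptions to adapt for your own application.

import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.AthenaException;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

public class StartQueryWithRetrySketch {

    public static String startWithRetry(AthenaClient athenaClient) throws InterruptedException {
        StartQueryExecutionRequest request = StartQueryExecutionRequest.builder()
                .queryString("SELECT 1") // assumed query
                .resultConfiguration(ResultConfiguration.builder()
                        .outputLocation("s3://your-query-results-bucket/") // assumed bucket
                        .build())
                .build();

        long waitMillis = 1000;
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                return athenaClient.startQueryExecution(request).queryExecutionId();
            } catch (AthenaException e) {
                // Retry with exponential backoff only when the call was throttled.
                String code = e.awsErrorDetails() != null ? e.awsErrorDetails().errorCode() : "";
                if (!"ThrottlingException".equals(code) && !"TooManyRequestsException".equals(code)) {
                    throw e;
                }
                Thread.sleep(waitMillis);
                waitMillis *= 2;
            }
        }
        throw new IllegalStateException("Query was throttled on every attempt.");
    }
}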
Release Notes
This section describes Amazon Athena features, improvements, and bug fixes by release date.
Release Dates
• December 16, 2020 (p. 502)
• November 24, 2020 (p. 502)
• November 11, 2020 (p. 503)
• October 22, 2020 (p. 504)
• July 29, 2020 (p. 504)
• July 9, 2020 (p. 504)
• June 1, 2020 (p. 505)
• May 21, 2020 (p. 505)
• April 1, 2020 (p. 505)
• March 11, 2020 (p. 506)
• March 6, 2020 (p. 506)
• November 26, 2019 (p. 506)
• November 12, 2019 (p. 509)
• November 8, 2019 (p. 509)
• October 8, 2019 (p. 509)
• September 19, 2019 (p. 509)
• September 12, 2019 (p. 510)
• August 16, 2019 (p. 510)
• August 9, 2019 (p. 510)
• June 26, 2019 (p. 510)
• May 24, 2019 (p. 511)
• March 05, 2019 (p. 511)
• February 22, 2019 (p. 511)
• February 18, 2019 (p. 512)
• November 20, 2018 (p. 513)
• October 15, 2018 (p. 513)
• October 10, 2018 (p. 514)
• September 6, 2018 (p. 514)
• August 23, 2018 (p. 515)
• August 16, 2018 (p. 515)
• August 7, 2018 (p. 516)
• June 5, 2018 (p. 516)
• May 17, 2018 (p. 517)
• April 19, 2018 (p. 517)
• April 6, 2018 (p. 518)
• March 15, 2018 (p. 518)
December 16, 2020
Amazon Athena announces availability of Athena engine version 2, Athena Federated Query, and AWS
PrivateLink in additional Regions.
For more information, see Athena engine version 2 (p. 399) and Using Amazon Athena Federated
Query (p. 66).
AWS PrivateLink
AWS PrivateLink for Athena is now supported in the Europe (Stockholm) Region. For information about
AWS PrivateLink for Athena, see Connect to Amazon Athena Using an Interface VPC Endpoint (p. 309).
November 24, 2020

Released JDBC driver 2.0.16 and ODBC driver 1.1.6 for Athena. These releases support, at the account level,
the Okta Verify multifactor authentication (MFA), SMS authentication, and Google Authenticator
authentication methods.
To download the new drivers, release notes, and documentation, see Using Athena with the JDBC
Driver (p. 83) and Connecting to Amazon Athena with ODBC (p. 85).
November 11, 2020
Amazon Athena announces general availability in the US East (N. Virginia), US East (Ohio), and US West
(Oregon) Regions for Athena engine version 2 and federated queries.
Athena engine version 2 includes performance enhancements and new feature capabilities such as
schema evolution support for Parquet format data, additional geospatial functions, support for reading
nested schema to reduce cost, and performance enhancements in JOIN and AGGREGATE operations.
• For information about improvements, breaking changes, and bug fixes, see Athena engine version
2 (p. 399).
• For information about how to upgrade, see Changing Athena Engine Versions (p. 395).
• For information about testing queries, see Testing Queries in Advance of an Engine Version
Upgrade (p. 399).
Use Federated SQL queries to run SQL queries across relational, non-relational, object, and custom data
sources. With federated querying, you can submit a single SQL query that scans data from multiple
sources running on premises or hosted in the cloud.
Running analytics on data spread across applications can be complex and time consuming for the
following reasons:
• Data required for analytics is often spread across relational, key-value, document, in-memory, search,
graph, object, time-series and ledger data stores.
• To analyze data across these sources, analysts build complex pipelines to extract, transform, and load
the data into a data warehouse so that it can be queried.
• Accessing data from various sources requires learning new programming languages and data access
constructs.
Federated SQL queries in Athena eliminate this complexity by allowing users to query the data in-place
from wherever it resides. Analysts can use familiar SQL constructs to JOIN data across multiple data
sources for quick analysis, and store results in Amazon S3 for subsequent use.
Next Steps
• To learn more about the federated query feature, see Using Amazon Athena Federated Query (p. 66).
• To get started with using an existing connector, see Deploying a Connector and Connecting to a Data
Source.
• To learn how to build your own data source connector using the Athena Query Federation SDK, see
Example Athena Connector on GitHub.
October 22, 2020

You can now call Athena with AWS Step Functions. AWS Step Functions can control certain AWS services
directly using the Amazon States Language. You can use Step Functions with Athena to start and stop
query execution, get query results, run ad-hoc or scheduled data queries, and retrieve results from data
lakes in Amazon S3.
For more information, see Call Athena with Step Functions in the AWS Step Functions Developer Guide.
July 29, 2020

Released JDBC driver version 2.0.13. This release supports using multiple data catalogs registered with
Athena (p. 67), Okta service for authentication, and connections to VPC endpoints.
To download and use the new version of the driver, see Using Athena with the JDBC Driver (p. 83).
July 9, 2020
Published on 2020-07-09
Querying Apache Hudi Datasets
Amazon Athena adds support for querying compacted Hudi datasets and adds the AWS CloudFormation
AWS::Athena::DataCatalog resource for creating, updating, or deleting data catalogs that you
register in Athena.
For more information, see Using Athena to Query Apache Hudi Datasets (p. 204).
For more information, see AWS::Athena::DataCatalog in the AWS CloudFormation User Guide.
June 1, 2020
Published on 2020-06-01
To connect to a self-hosted Hive metastore, you need an Athena Hive metastore connector. Athena
provides a reference implementation (p. 65) connector that you can use. The connector runs as an AWS
Lambda function in your account.
For more information, see Using Athena Data Connector for External Hive Metastore (p. 34).
Amazon Athena adds support for partition projection. Use partition projection to speed up query
processing of highly partitioned tables and automate partition management. For more information, see
Partition Projection with Amazon Athena (p. 109).
April 1, 2020
Published on 2020-04-01
In addition to the US East (N. Virginia) Region, the Amazon Athena federated query (p. 66), user defined
functions (UDFs) (p. 216), machine learning inference (p. 214), and external Hive metastore (p. 34)
features are now available in preview in the Asia Pacific (Mumbai), Europe (Ireland), and US West
(Oregon) Regions.
March 11, 2020

Amazon Athena now publishes Amazon CloudWatch Events for query state transitions. When a query
transitions between states -- for example, from Running to a terminal state such as Succeeded or
Cancelled -- Athena publishes a query state change event to CloudWatch Events. The event contains
information about the query state transition. For more information, see Monitoring Athena Queries with
CloudWatch Events (p. 379).
March 6, 2020
Published on 2020-03-06
You can now create and update Amazon Athena workgroups by using the AWS CloudFormation
AWS::Athena::WorkGroup resource. For more information, see AWS::Athena::WorkGroup in the AWS
CloudFormation User Guide.
November 26, 2019

Amazon Athena adds support for running SQL queries across relational, non-relational, object, and
custom data sources, invoking machine learning models in SQL queries, User Defined Functions (UDFs)
(Preview), using Apache Hive Metastore as a metadata catalog with Amazon Athena (Preview), and four
additional query-related metrics.
You can now use Athena’s federated query to scan data stored in relational, non-relational, object, and
custom data sources. With federated querying, you can submit a single SQL query that scans data from
multiple sources running on premises or hosted in the cloud.
Running analytics on data spread across applications can be complex and time consuming for the
following reasons:
• Data required for analytics is often spread across relational, key-value, document, in-memory, search,
graph, object, time-series and ledger data stores.
• To analyze data across these sources, analysts build complex pipelines to extract, transform, and load
the data into a data warehouse so that it can be queried.
• Accessing data from various sources requires learning new programming languages and data access
constructs.
Federated SQL queries in Athena eliminate this complexity by allowing users to query the data in-place
from wherever it resides. Analysts can use familiar SQL constructs to JOIN data across multiple data
sources for quick analysis, and store results in Amazon S3 for subsequent use.
Preview Availability
Athena federated query is available in preview in the US East (N. Virginia) Region.
Next Steps
• To begin your preview, follow the instructions in the Athena Preview Features FAQ.
• To learn more about the federated query feature, see Using Amazon Athena Federated Query
(Preview).
• To get started with using an existing connector, see Deploying a Connector and Connecting to a Data
Source.
• To learn how to build your own data source connector using the Athena Query Federation SDK, see
Example Athena Connector on GitHub.
Invoking Machine Learning Models in SQL Queries
You can use more than a dozen built-in machine learning algorithms provided by Amazon SageMaker,
train your own models, or find and subscribe to model packages from AWS Marketplace and deploy on
Amazon SageMaker Hosting Services. There is no additional setup required. You can invoke these ML
models in your SQL queries from the Athena console, Athena APIs, and through Athena’s preview JDBC
driver.
Preview Availability
Athena’s ML functionality is available today in preview in the US East (N. Virginia) Region.
Next Steps
• To begin your preview, follow the instructions in the Athena Preview Features FAQ.
• To learn more about the machine learning feature, see Using Machine Learning (ML) with Amazon
Athena (Preview).
User Defined Functions (UDFs) (Preview)

Preview Availability
Athena UDF functionality is available in Preview mode in the US East (N. Virginia) Region.
Next Steps
• To begin your preview, follow the instructions in the Athena Preview Features FAQ.
• To learn more, see Querying with User Defined Functions (Preview).
• For example UDF implementations, see Amazon Athena UDF Connector on GitHub.
• To learn how to write your own functions using the Athena Query Federation SDK, see Creating and
Deploying a UDF Using Lambda.
Metastore Connector
To connect to a self-hosted Hive Metastore, you need an Athena Hive Metastore connector. Athena
provides a reference implementation connector that you can use. The connector runs as an AWS Lambda
function in your account. For more information, see Using Athena Data Connector for External Hive
Metastore (Preview).
Preview Availability
The Hive Metastore feature is available in Preview mode in the US East (N. Virginia) Region.
Next Steps
• To begin your preview, follow the instructions in the Athena Preview Features FAQ.
• To learn more about this feature, see Using Athena Data Connector for External Hive Metastore (Preview).
November 12, 2019
• Query Planning Time – The time taken to plan the query. This includes the time spent retrieving table
partitions from the data source.
• Query Queuing Time – The time that the query was in a queue waiting for resources.
• Service Processing Time – The time taken to write results after the query engine finishes processing.
• Total Execution Time – The time Athena took to run the query.
To consume these new query metrics, you can create custom dashboards, set alarms and triggers on
metrics in CloudWatch, or use pre-populated dashboards directly from the Athena console.
Next Steps
For more information, see Monitoring Athena Queries with CloudWatch Metrics.
November 8, 2019
Published on 2019-12-17
Amazon Athena is now available in the US West (N. California) Region and the Europe (Paris) Region.
October 8, 2019
Published on 2019-12-17
You can now connect directly to Athena through an interface VPC endpoint in your Virtual Private Cloud
(VPC). With this feature, you can submit queries to Athena securely without requiring an internet
gateway in your VPC.
To create an interface VPC endpoint to connect to Athena, you can use the AWS console or AWS
Command Line Interface (AWS CLI). For information about creating an interface endpoint, see Creating
an Interface Endpoint.
When you use an interface VPC endpoint, communication between your VPC and Athena APIs is secure
and stays within the AWS network. There are no additional Athena costs to use this feature. Interface
VPC endpoint charges apply.
To learn more about this feature, see Connect to Amazon Athena Using an Interface VPC Endpoint.
September 19, 2019

Amazon Athena adds support for inserting new data to an existing table using the INSERT INTO
statement. You can insert new rows into a destination table based on a SELECT query statement that
runs on a source table, or based on a set of values that are provided as part of the query statement.
Supported data formats include Avro, JSON, ORC, Parquet, and text files.
INSERT INTO statements can also help you simplify your ETL process. For example, you can use
INSERT INTO in a single query to select data from a source table that is in JSON format and write to a
destination table in Parquet format.
INSERT INTO statements are charged based on the number of bytes that are scanned in the SELECT
phase, similar to how Athena charges for SELECT queries. For more information, see Amazon Athena
pricing.
For more information about using INSERT INTO, including supported formats, SerDes and examples,
see INSERT INTO in the Athena User Guide.
September 12, 2019

Amazon Athena is now available in the Asia Pacific (Hong Kong) Region.
Amazon Athena adds support for querying data in Amazon S3 Requester Pays buckets.
When an Amazon S3 bucket is configured as Requester Pays, the requester, not the bucket owner,
pays for the Amazon S3 request and data transfer costs. In Athena, workgroup administrators can now
configure workgroup settings to allow workgroup members to query S3 Requester Pays buckets.
For information about how to configure the Requester Pays setting for your workgroup, refer to Create a
Workgroup in the Amazon Athena User Guide. For more information about Requester Pays buckets, see
Requester Pays Buckets in the Amazon Simple Storage Service Developer Guide.
August 9, 2019
Published on 2019-12-17
Amazon Athena now supports enforcing AWS Lake Formation policies for fine-grained access control to
new or existing databases, tables, and columns defined in the AWS Glue Data Catalog for data stored in
Amazon S3.
You can use this feature in the following AWS regions: US East (Ohio), US East (N. Virginia), US West
(Oregon), Asia Pacific (Tokyo), and Europe (Ireland). There are no additional charges to use this feature.
For more information about using this feature, see Using Athena to Query Data Registered With AWS
Lake Formation (p. 310). For more information about AWS Lake Formation, see AWS Lake Formation.
May 24, 2019
Amazon Athena is now available in the AWS GovCloud (US-East) and AWS GovCloud (US-West) Regions.
For a list of supported Regions, see AWS Regions and Endpoints.
March 05, 2019

Amazon Athena is now available in the Canada (Central) Region. For a list of supported Regions, see
AWS Regions and Endpoints. Released the new version of the ODBC driver with support for Athena
workgroups. For more information, see the ODBC Driver Release Notes.
To download the ODBC driver version 1.0.5 and its documentation, see Connecting to Amazon Athena
with ODBC (p. 85). For information about this version, see the ODBC Driver Release Notes.
To use workgroups with the ODBC driver, set the new connection property, Workgroup, in the
connection string as shown in the following example:
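(The original connection string example is not reproduced here. The following line is an illustrative sketch only; apart from Workgroup, the property names and values shown, such as AwsRegion, S3OutputLocation, and AuthenticationType, are assumptions that may differ from your driver configuration.)

Driver=Simba Athena ODBC Driver;AwsRegion=us-west-2;S3OutputLocation=s3://your-athena-output-bucket/;AuthenticationType=IAM Credentials;Workgroup=myworkgroup;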
For more information, search for "workgroup" in the ODBC Driver Installation and Configuration
Guide version 1.0.5. There are no changes to the ODBC driver connection string when you use tags on
workgroups. To use tags, upgrade to the latest version of the ODBC driver, which is this current version.
This driver version lets you use Athena API workgroup actions (p. 373) to create and manage workgroups,
and Athena API tag actions (p. 387) to add, list, or remove tags on workgroups. Before you begin, make
sure that you have resource-level permissions in IAM for actions on workgroups and tags.
For more information, see the following topics:
• Using Workgroups for Running Queries (p. 358) and Workgroup Example Policies (p. 362).
• Tagging Resources (p. 385) and Tag-Based IAM Access Control Policies (p. 390).
If you use the JDBC driver or the AWS SDK, upgrade to the latest version of the driver and SDK, both
of which already include support for workgroups and tags in Athena. For more information, see Using
Athena with the JDBC Driver (p. 83).
February 22, 2019

Added tag support for workgroups in Amazon Athena. A tag consists of a key and a value, both of
which you define. When you tag a workgroup, you assign custom metadata to it. You can add tags to
workgroups to help categorize them, using AWS tagging best practices. You can use tags to restrict
access to workgroups, and to track costs. For example, create a workgroup for each cost center. Then,
by adding tags to these workgroups, you can track your Athena spending for each cost center. For more
information, see Using Tags for Billing in the AWS Billing and Cost Management User Guide.
You can work with tags by using the Athena console or the API operations. For more information, see
Tagging Workgroups (p. 385).
In the Athena console, you can add one or more tags to each of your workgroups, and search by tags.
Workgroups are an IAM-controlled resource in Athena. In IAM, you can restrict who can add, remove, or
list tags on workgroups that you create. You can also use the CreateWorkGroup API operation that has
the optional tag parameter for adding one or more tags to the workgroup. To add, remove, or list tags,
use TagResource, UntagResource, and ListTagsForResource. For more information, see Working
with Tags Using the API Actions (p. 385).
To allow users to add tags when creating workgroups, ensure that you give each user IAM permissions to
both the TagResource and CreateWorkGroup API actions. For more information and examples, see
Tag-Based IAM Access Control Policies (p. 390).
There are no changes to the JDBC driver when you use tags on workgroups. If you create new
workgroups and use the JDBC driver or the AWS SDK, upgrade to the latest version of the driver and
SDK. For information, see Using Athena with the JDBC Driver (p. 83).
February 18, 2019

Added ability to control query costs by running queries in workgroups. For information, see Using
Workgroups to Control Query Access and Costs (p. 358). Improved the JSON OpenX SerDe used in
Athena, fixed an issue where Athena did not ignore objects transitioned to the GLACIER storage class,
and added examples for querying Network Load Balancer logs.
• Added support for workgroups. Use workgroups to separate users, teams, applications, or workloads,
and to set limits on amount of data each query or the entire workgroup can process. Because
workgroups act as IAM resources, you can use resource-level permissions to control access to a
specific workgroup. You can also view query-related metrics in Amazon CloudWatch, control query
costs by configuring limits on the amount of data scanned, create thresholds, and trigger actions,
such as Amazon SNS alarms, when these thresholds are breached. For more information, see Using
Workgroups for Running Queries (p. 358) and Controlling Costs and Monitoring Queries with
CloudWatch Metrics and Events (p. 375).
Workgroups are an IAM resource. For a full list of workgroup-related actions, resources, and conditions
in IAM, see Actions, Resources, and Condition Keys for Amazon Athena in the Service Authorization
Reference. Before you create new workgroups, make sure that you use workgroup IAM policies (p. 361),
and the AmazonAthenaFullAccess Managed Policy (p. 271).
You can start using workgroups in the console, with the workgroup API operations (p. 373), or with the
JDBC driver. For a high-level procedure, see Setting up Workgroups (p. 360). To download the JDBC
driver with workgroup support, see Using Athena with the JDBC Driver (p. 83).
If you use workgroups with the JDBC driver, you must set the workgroup name in the connection string
using the Workgroup configuration parameter as in the following example:
jdbc:awsathena://AwsRegion=<AWSREGION>;UID=<ACCESSKEY>;PWD=<SECRETKEY>;S3OutputLocation=s3://<athena-output>-<AWSREGION>/;Workgroup=<WORKGROUPNAME>;
There are no changes in the way you run SQL statements or make JDBC API calls to the driver. The
driver passes the workgroup name to Athena.
For information about differences introduced with workgroups, see Athena Workgroup APIs (p. 373)
and Troubleshooting Workgroups (p. 373).
• Improved the JSON OpenX SerDe used in Athena. The improvements include, but are not limited to,
the following:
• Support for the ConvertDotsInJsonKeysToUnderscores property. When set to TRUE, it allows
the SerDe to replace the dots in key names with underscores. For example, if the JSON dataset
contains a key with the name "a.b", you can use this property to define the column name to be
"a_b" in Athena. The default is FALSE. By default, Athena does not allow dots in column names.
• Support for the case.insensitive property. By default, Athena requires that all keys in your
JSON dataset use lowercase. Using WITH SERDEPROPERTIES ("case.insensitive" = FALSE) allows you
to use case-sensitive key names in your data. The default is TRUE. When set to
TRUE, the SerDe converts all uppercase columns to lowercase.
For more information, see the section called “Requirements for Tables in Athena and Data in Amazon
S3” (p. 91) and Transitioning to the GLACIER Storage Class (Object Archival) in the Amazon Simple
Storage Service Developer Guide.
• Added examples for querying Network Load Balancer access logs that receive information about
the Transport Layer Security (TLS) requests. For more information, see the section called “Querying
Network Load Balancer Logs” (p. 242).
November 20, 2018

Released the new versions of the JDBC and ODBC drivers with support for federated access to the Athena API
with the AD FS and SAML 2.0 (Security Assertion Markup Language 2.0). For details, see the JDBC Driver
Release Notes and ODBC Driver Release Notes.
With this release, federated access to Athena is supported for the Active Directory Federation Service
(AD FS 3.0). Access is established through the versions of JDBC or ODBC drivers that support SAML 2.0.
For information about configuring federated access to the Athena API, see the section called “Enabling
Federated Access to the Athena API” (p. 301).
To download the JDBC driver version 2.0.6 and its documentation, see Using Athena with the JDBC
Driver (p. 83). For information about this version, see JDBC Driver Release Notes.
To download the ODBC driver version 1.0.4 and its documentation, see Connecting to Amazon Athena
with ODBC (p. 85). For information about this version, see the ODBC Driver Release Notes.
For more information about SAML 2.0 support in AWS, see About SAML 2.0 Federation in the IAM User
Guide.
October 15, 2018

If you have upgraded to the AWS Glue Data Catalog, there are two new features that provide support for:
• Encryption of the Data Catalog metadata. If you choose to encrypt metadata in the Data Catalog, you
must add specific policies to Athena. For more information, see Access to Encrypted Metadata in the
AWS Glue Data Catalog (p. 281).
• Fine-grained permissions to access resources in the AWS Glue Data Catalog. You can now define
identity-based (IAM) policies that restrict or allow access to specific databases and tables from the
Data Catalog used in Athena. For more information, see Fine-Grained Access to Databases and Tables
in the AWS Glue Data Catalog (p. 275).
Note
Data resides in the Amazon S3 buckets, and access to it is governed by the Amazon S3
Permissions (p. 274). To access data in databases and tables, continue to use access control
policies to Amazon S3 buckets that store the data.
October 10, 2018

Athena supports CREATE TABLE AS SELECT, which creates a table from the result of a SELECT query
statement. For details, see Creating a Table from Query Results (CTAS).
Before you create CTAS queries, it is important to learn about their behavior in the Athena
documentation. It contains information about the location for saving query results in Amazon S3, the
list of supported formats for storing CTAS query results, the number of partitions you can create, and
supported compression formats. For more information, see Considerations and Limitations for CTAS
Queries (p. 136).
September 6, 2018
Published on 2018-09-06
Released the new version of the ODBC driver (version 1.0.3). The new version of the ODBC driver streams
results by default, instead of paging through them, allowing business intelligence tools to retrieve large
data sets faster. This version also includes improvements, bug fixes, and an updated documentation for
"Using SSL with a Proxy Server". For details, see the Release Notes for the driver.
For downloading the ODBC driver version 1.0.3 and its documentation, see Connecting to Amazon
Athena with ODBC (p. 85).
The streaming results feature is available with this new version of the ODBC driver. It is also available
with the JDBC driver. For information about streaming results, see the ODBC Driver Installation and
Configuration Guide, and search for UseResultsetStreaming.
The ODBC driver version 1.0.3 is a drop-in replacement for the previous version of the driver. We
recommend that you migrate to the current driver.
Important
To use the ODBC driver version 1.0.3, follow these requirements:
August 23, 2018

Added support for these DDL-related features and fixed several bugs, as follows:
• Added support for BINARY and DATE data types for data in Parquet, and for DATE and TIMESTAMP
data types for data in Avro.
• Added support for INT and DOUBLE in DDL queries. INTEGER is an alias to INT, and DOUBLE
PRECISION is an alias to DOUBLE.
• Improved performance of DROP TABLE and DROP DATABASE queries.
• Removed the creation of _$folder$ object in Amazon S3 when a data bucket is empty.
• Fixed an issue where ALTER TABLE ADD PARTITION threw an error when no partition value was
provided.
• Fixed an issue where DROP TABLE ignored the database name when checking partitions after the
qualified name had been specified in the statement.
For more about the data types supported in Athena, see Data Types (p. 436).
For information about supported data type mappings between types in Athena, the JDBC driver, and
Java data types, see the "Data Types" section in the JDBC Driver Installation and Configuration Guide.
August 16, 2018

Released the JDBC driver version 2.0.5. The new version of the JDBC driver streams results by default,
instead of paging through them, allowing business intelligence tools to retrieve large data sets
faster. Compared to the previous version of the JDBC driver, there are the following performance
improvements:
The streaming results feature is available only with the JDBC driver. It is not available with the ODBC
driver. You cannot use it with the Athena API. For information about streaming results, see the JDBC
Driver Installation and Configuration Guide, and search for UseResultsetStreaming.
For downloading the JDBC driver version 2.0.5 and its documentation, see Using Athena with the JDBC
Driver (p. 83).
The JDBC driver version 2.0.5 is a drop-in replacement for the previous version of the driver (2.0.2). To
ensure that you can use the JDBC driver version 2.0.5, add the athena:GetQueryResultsStream
policy action to the list of policies for Athena. This policy action is not exposed directly with the API
and is only used with the JDBC driver, as part of streaming results support. For an example policy, see
AWSQuicksightAthenaAccess Managed Policy (p. 273). For more information about migrating from
version 2.0.2 to version 2.0.5 of the driver, see the JDBC Driver Migration Guide.
If you are migrating from a 1.x driver to a 2.x driver, you will need to migrate your existing configurations
to the new configuration. We highly recommend that you migrate to the current version of the driver.
For more information, see Using the Previous Version of the JDBC Driver (p. 493), and the JDBC Driver
Migration Guide.
August 7, 2018
Published on 2018-08-07
You can now store Amazon Virtual Private Cloud flow logs directly in Amazon S3 in a GZIP format, where
you can query them in Athena. For information, see Querying Amazon VPC Flow Logs (p. 244) and
Amazon VPC Flow Logs can now be delivered to S3.
June 5, 2018
Published on 2018-06-05
Topics
• Support for Views (p. 516)
• Improvements and Updates to Error Messages (p. 516)
• Bug Fixes (p. 517)
The new error message reads: "HIVE_BAD_DATA: Error parsing field value for field 0: java.lang.String
cannot be cast to org.openx.data.jsonserde.json.JSONObject".
• Improved error messages about insufficient permissions by adding more detail.
Bug Fixes
Fixed the following bugs:
• Fixed an issue that enables the internal translation of REAL to FLOAT data types. This improves
integration with the AWS Glue crawler that returns FLOAT data types.
• Fixed an issue where Athena was not converting AVRO DECIMAL (a logical type) to a DECIMAL type.
• Fixed an issue where Athena did not return results for queries on Parquet data with WHERE clauses that
referenced values in the TIMESTAMP data type.
May 17, 2018

Increased query concurrency quota in Athena from five to twenty. This means that you can submit and
run up to twenty DDL queries and twenty SELECT queries at a time. Note that the concurrency quotas
are separate for DDL and SELECT queries.
Concurrency quotas in Athena are defined as the number of queries that can be submitted to the service
concurrently. You can submit up to twenty queries of the same type (DDL or SELECT) at a time. If you
submit a query that exceeds the concurrent query quota, the Athena API displays an error message.
After you submit your queries to Athena, it processes the queries by assigning resources based on
the overall service load and the amount of incoming requests. We continuously monitor and make
adjustments to the service so that your queries process as fast as possible.
For information, see Service Quotas (p. 499). This is an adjustable quota. You can use the Service Quotas
console to request a quota increase for concurrent queries.
April 19, 2018

Released the new version of the JDBC driver (version 2.0.2) with support for returning the ResultSet
data as an Array data type, improvements, and bug fixes. For details, see the Release Notes for the driver.
For information about downloading the new JDBC driver version 2.0.2 and its documentation, see Using
Athena with the JDBC Driver (p. 83).
The latest version of the JDBC driver is 2.0.2. If you are migrating from a 1.x driver to a 2.x driver, you will
need to migrate your existing configurations to the new configuration. We highly recommend that you
migrate to the current driver.
For information about the changes introduced in the new version of the driver, the version differences,
and examples, see the JDBC Driver Migration Guide.
For information about the previous version of the JDBC driver, see Using Athena with the Previous
Version of the JDBC Driver (p. 493).
April 6, 2018
Published on 2018-04-06
Added an ability to automatically create Athena tables for CloudTrail log files directly from the
CloudTrail console. For information, see Using the CloudTrail Console to Create an Athena Table for
CloudTrail Logs (p. 230).
February 12, 2018
Published on 2018-02-12
Added an ability to securely offload intermediate data to disk for memory-intensive queries that use the
GROUP BY clause. This improves the reliability of such queries, preventing "Query resource exhausted"
errors.
With Athena, there are no versions to manage. We have transparently upgraded the underlying engine in
Athena to a version based on Presto version 0.172. No action is required on your end.
With the upgrade, you can now use Presto 0.172 Functions and Operators, including Presto 0.172
Lambda Expressions in Athena.
Major updates for this release, including the community-contributed fixes, include:
• Support for ignoring headers. You can use the skip.header.line.count property when
defining tables, to allow Athena to ignore headers. This is supported for queries that use the
LazySimpleSerDe (p. 424) and OpenCSV SerDe (p. 415), and not for Grok or Regex SerDes.
• Support for the CHAR(n) data type in STRING functions. The range for CHAR(n) is [1,255], while
the range for VARCHAR(n) is [1,65535].
• Support for correlated subqueries.
• Support for Presto Lambda expressions and functions.
• Improved performance of the DECIMAL type and operators.
• Support for filtered aggregations, such as SELECT sum(col_name) FILTER (WHERE id > 0).
• Push-down predicates for the DECIMAL, TINYINT, SMALLINT, and REAL data types.
• Support for quantified comparison predicates: ALL, ANY, and SOME.
• Added functions: arrays_overlap(), array_except(), levenshtein_distance(),
codepoint(), skewness(), kurtosis(), and typeof().
For a complete list of functions and operators, see SQL Queries, Functions, and Operators (p. 437) in this
guide, and Presto 0.172 Functions.
Athena does not support all of Presto's features. For more information, see Limitations (p. 469).
November 13, 2017

Added support for connecting to Athena with the ODBC driver. For information, see Connecting to Amazon
Athena with ODBC (p. 85).
November 1, 2017
Published on 2017-11-01
Added support for querying geospatial data, and for Asia Pacific (Seoul), Asia Pacific (Mumbai), and
EU (London) regions. For information, see Querying Geospatial Data (p. 179) and AWS Regions and
Endpoints.
Added support for EU (Frankfurt). For a list of supported regions, see AWS Regions and Endpoints.
October 3, 2017
Published on 2017-10-03
Create named Athena queries with CloudFormation. For more information, see
AWS::Athena::NamedQuery in the AWS CloudFormation User Guide.
August 14, 2017
Added support for Asia Pacific (Sydney). For a list of supported regions, see AWS Regions and Endpoints.
Added integration with the AWS Glue Data Catalog and a migration wizard for updating from the Athena
managed data catalog to the AWS Glue Data Catalog. For more information, see Integration with AWS
Glue (p. 16).
August 4, 2017
Published on 2017-08-04
Added support for Grok SerDe, which provides easier pattern matching for records in unstructured text
files such as logs. For more information, see Grok SerDe (p. 418). Added keyboard shortcuts to scroll
through query history using the console (CTRL + ⇧/⇩ using Windows, CMD + ⇧/⇩ using Mac).
Added support for Asia Pacific (Tokyo) and Asia Pacific (Singapore). For a list of supported regions, see
AWS Regions and Endpoints.
June 8, 2017
Published on 2017-06-08
Added support for Europe (Ireland). For more information, see AWS Regions and Endpoints.
Added an Amazon Athena API and AWS CLI support for Athena; updated JDBC driver to version 1.1.0;
fixed various issues.
• The Amazon Athena API enables application programming for Athena. For more information, see Amazon
Athena API Reference. The latest AWS SDKs include support for the Athena API. For links to
documentation and downloads, see the SDKs section in Tools for Amazon Web Services.
• The AWS CLI includes new commands for Athena. For more information, see the Amazon Athena API
Reference.
• A new JDBC driver 1.1.0 is available, which supports the new Athena API as well as the latest features
and bug fixes. Download the driver at https://s3.amazonaws.com/athena-downloads/drivers/
AthenaJDBC41-1.1.0.jar. We recommend upgrading to the latest Athena JDBC driver; however, you
may still use the earlier driver version. Earlier driver versions do not support the Athena API. For more
information, see Using Athena with the JDBC Driver (p. 83).
• Actions specific to policy statements in earlier versions of Athena have been deprecated. If you
upgrade to JDBC driver version 1.1.0 and have customer-managed or inline IAM policies attached to
JDBC users, you must update the IAM policies. In contrast, earlier versions of the JDBC driver do not
support the Athena API, so you can specify only deprecated actions in policies attached to earlier
version JDBC users. For this reason, you shouldn't need to update customer-managed or inline IAM
policies.
• These policy-specific actions were used in Athena before the release of the Athena API. Use these
deprecated actions in policies only with JDBC drivers earlier than version 1.1.0. If you are upgrading
the JDBC driver, replace policy statements that allow or deny deprecated actions with the appropriate
API actions as listed or errors will occur:
• athena:RunQuery – replace with athena:StartQueryExecution
• athena:CancelQueryExecution – replace with athena:StopQueryExecution
• athena:GetQueryExecutions – replace with athena:ListQueryExecutions
Improvements
• Increased the query string length limit to 256 KB.
Bug Fixes
• Fixed an issue that caused query results to look malformed when scrolling through results in the
console.
• Fixed an issue where a \u0000 character string in Amazon S3 data files would cause errors.
• Fixed an issue that caused requests to cancel a query made through the JDBC driver to fail.
• Fixed an issue that caused the AWS CloudTrail SerDe to fail with Amazon S3 data in US East (Ohio).
• Fixed an issue that caused DROP TABLE to fail on a partitioned table.
April 4, 2017
Published on 2017-04-04
Added support for Amazon S3 data encryption and released JDBC driver update (version 1.0.1) with
encryption support, improvements, and bug fixes.
Features
• Added the following encryption features:
• Support for querying encrypted data in Amazon S3.
• Support for encrypting Athena query results.
• A new version of the driver supports new encryption features, adds improvements, and fixes issues.
• Added the ability to add, replace, and change columns using ALTER TABLE. For more information, see
Alter Column in the Hive documentation.
• Added support for querying LZO-compressed data.
Improvements
• Better JDBC query performance with page-size improvements, returning 1,000 rows instead of 100.
• Added ability to cancel a query using the JDBC driver interface.
• Added ability to specify JDBC options in the JDBC connection URL. For more information, see Using
Athena with the Previous Version of the JDBC Driver (p. 493) for the previous version of the driver, and
Connect with the JDBC (p. 83), for the most current version.
• Added PROXY setting in the driver, which can now be set using ClientConfiguration in the AWS SDK for
Java.
Bug Fixes
Fixed the following bugs:
• Throttling errors would occur when multiple queries were issued using the JDBC driver interface.
• The JDBC driver would stop when projecting a decimal data type.
• The JDBC driver would return every data type as a string, regardless of how the data type
was defined in the table. For example, selecting a column defined as an INT data type using
resultSet.GetObject() would return a STRING data type instead of INT.
• The JDBC driver would verify credentials at the time a connection was made, rather than at the time a
query would run.
• Queries made through the JDBC driver would fail when a schema was specified along with the URL.
Added the AWS CloudTrail SerDe, improved performance, fixed partition issues.
Features
• Added the AWS CloudTrail SerDe. For more information, see CloudTrail SerDe (p. 413). For detailed
usage examples, see the AWS Big Data Blog post, Analyze Security, Compliance, and Operational
Activity Using AWS CloudTrail and Amazon Athena.
Improvements
• Improved performance when scanning a large number of partitions.
• Improved performance on MSCK Repair Table operation.
• Added ability to query Amazon S3 data stored in regions other than your primary Region. Standard
inter-region data transfer rates for Amazon S3 apply in addition to standard Athena charges.
Bug Fixes
• Fixed a bug where a "table not found error" might occur if no partitions are loaded.
• Fixed a bug to avoid throwing an exception with ALTER TABLE ADD PARTITION IF NOT EXISTS
queries.
• Fixed a bug in DROP PARTITIONS.
Added support for AvroSerDe and OpenCSVSerDe, US East (Ohio) Region, and bulk editing columns in
the console wizard. Improved performance on large Parquet tables.
Features
• Introduced support for new SerDes:
• Avro SerDe (p. 410)
• OpenCSVSerDe for Processing CSV (p. 415)
• US East (Ohio) Region (us-east-2) launch. You can now run queries in this region.
• You can now use the Add Table wizard to define table schema in bulk. Choose Catalog Manager, Add
table, and then choose Bulk add columns as you walk through the steps to define the table.
Type name value pairs in the text box and choose Add.
Improvements
• Improved performance on large Parquet tables.
Document History
Latest documentation update: December 30, 2020.
We update the documentation frequently to address your feedback. The following table describes
important additions to the Amazon Athena documentation. Not all updates are represented.
Updated For more information, see Using Amazon Athena Federated November 11,
federated query Query (p. 66) and Using Athena with CalledVia Context 2020
documentation for Keys (p. 285).
general availability
release.
Added For more information, see Using Lake Formation and September 25,
documentation the Athena JDBC and ODBC Drivers for Federated Access 2020
for using the JDBC to Athena (p. 317) and Tutorial: Configuring Federated
driver with Lake Access for Okta Users to Athena Using Lake Formation and
Formation for JDBC (p. 318).
federated access to
Athena.
Added For more information, see Amazon Athena Elasticsearch July 21, 2020
documentation for Connector (p. 71).
the Amazon Athena
Elasticsearch data
connector.
Added For more information, see Using Athena to Query Apache July 9, 2020
documentation Hudi Datasets (p. 204).
for querying Hudi
datasets.
Added For more information, see Querying Apache Logs Stored July 8, 2020
documentation on in Amazon S3 (p. 253) and Querying Internet Information
querying Apache Server (IIS) Logs Stored in Amazon S3 (p. 255).
web server logs and
IIS web server logs
stored in Amazon
S3.
The Amazon Athena The Kindle ebook is free of charge. For more information, June 18, 2020
User Guide is now see Amazon Athena: User Guide Kindle Edition, or choose
525
Amazon Athena User Guide
Added For more information, see Using Athena Data Connector for June 1, 2020
documentation for External Hive Metastore (p. 34).
the general release
of the Athena
Data Connector
for External Hive
Metastore.
Added For more information, see Tagging Resources (p. 385). June 1, 2020
documentation for
tagging data catalog
resources.
Added For more information, see Partition Projection with Amazon May 21, 2020
documentation on Athena (p. 109).
partition projection.
Updated the Java For more information, see Code Samples (p. 485). May 11, 2020
code examples for
Athena.
Added a topic on For more information, see Querying Amazon GuardDuty March 19, 2020
querying Amazon Findings (p. 241).
GuardDuty findings.
Added a topic on For more information, see Monitoring Athena Queries with March 11, 2020
using CloudWatch CloudWatch Events (p. 379).
Events to monitor
Athena query state
transitions.
Added a topic on For more information, see Querying AWS Global Accelerator February 6, 2020
querying AWS Flow Logs (p. 239).
Global Accelerator
flow logs with
Athena.
526
Amazon Athena User Guide
• Added Documentation updates include, but are not limited to, the February 4, 2020
documentation on following topics:
using CTAS with
INSERT INTO to • Using CTAS and INSERT INTO for ETL and Data
add data from Analysis (p. 145)
an unpartitioned • Connecting to Amazon Athena with ODBC (p. 85) (The
source to a 1.1.0 preview features are now included in the 1.1.2
partitioned ODBC driver.)
destination. • SHOW DATABASES (p. 466)
• Added download • CREATE TABLE AS (p. 458)
links for the 1.1.0
preview version of
the ODBC driver
for Athena.
• Description for
SHOW DATABASES
LIKE regex
corrected.
• Corrected
partitioned_by
syntax in CTA
topic.
• Other minor fixes.
Added For more information, see Using CTAS and INSERT INTO to January 22, 2020
documentation on Create a Table with More Than 100 Partitions (p. 151).
using CTAS with
INSERT INTO to
add data from a
partitioned source
to a partitioned
destination.
Query results Athena no longer creates a 'default' query results location. January 20, 2020
location information For more information, see Specifying a Query Result
updated. Location (p. 127).
Added topic on For more information, see the following topics: January 17, 2020
querying the
AWS Glue Data • Querying AWS Glue Data Catalog (p. 249)
Catalog. Updated • Service Quotas (p. 499)
information on
service quotas
(formerly "service
limits") in Athena.
Corrected topic on For more information, see OpenCSVSerDe for Processing January 15, 2020
OpenCSVSerDe CSV (p. 415).
to note that the
TIMESTAMP type
should be specified
in the UNIX numeric
format.
527
Amazon Athena User Guide
Updated security Athena supports only symmetric keys for reading and January 8, 2020
topic on encryption writing data.
to note that Athena For more information, see Supported Amazon S3
does not support Encryption Options (p. 264).
asymmetric keys.
Added information For more information, see Cross-account Access to a Bucket December 13,
on cross-account Encrypted with a Custom AWS KMS Key (p. 282). 2019
access to an Amazon
S3 buckets that are
encrypted with a
custom AWS KMS
key.
Added For more information, see the following topics: November 26,
documentation 2019
for federated • Using Amazon Athena Federated Query (p. 66)
queries, external • Using Athena Data Source Connectors (p. 70)
Hive metastores, • Using Athena Data Connector for External Hive
machine learning, Metastore (p. 34)
and user defined
functions. Added • Using Machine Learning (ML) with Amazon Athena
new CloudWatch (Preview) (p. 214)
metrics. • Querying with User Defined Functions (Preview) (p. 216)
• List of CloudWatch Metrics and Dimensions for
Athena (p. 378)
Added section for For more information, see INSERT INTO (p. 442) and September 18,
new INSERT INTO Working with Query Results, Output Files, and Query 2019
command and History (p. 122).
updated query result
location information
for supporting data
manifest files.
Added section For more information, see Connect to Amazon Athena September 11,
for interface Using an Interface VPC Endpoint (p. 309), Querying Amazon 2019
VPC endpoints VPC Flow Logs (p. 244), and Using Athena with the JDBC
(PrivateLink) Driver (p. 83).
support. Updated
JDBC drivers.
Updated
information on
enriched VPC flow
logs.
Added section For more information, see Using Athena to Query Data June 26, 2019
on integrating Registered With AWS Lake Formation (p. 310).
with AWS Lake
Formation.
Updated Security For more information, see Amazon Athena Security (p. 262). June 26, 2019
section for
consistency with
other AWS services.
528
Amazon Athena User Guide
Added section on For more information, see Querying AWS WAF May 31, 2019
querying AWS WAF Logs (p. 246).
logs.
Released the new To download the ODBC driver version 1.0.5 and its March 5, 2019
version of the documentation, see Connecting to Amazon Athena with
ODBC driver with ODBC (p. 85). There are no changes to the ODBC driver
support for Athena connection string when you use tags on workgroups. To
workgroups. use tags, upgrade to the latest version of the ODBC driver,
which is this current version.
Added tag support A tag consists of a key and a value, both of which you February 22,
for workgroups in define. When you tag a workgroup, you assign custom 2019
Amazon Athena. metadata to it. For example, create a workgroup for each
cost center. Then, by adding tags to these workgroups, you
can track your Athena spending for each cost center. For
more information, see Using Tags for Billing in the AWS
Billing and Cost Management User Guide.
Improved the JSON The improvements include, but are not limited to, the February 18,
OpenX SerDe used in following: 2019
Athena.
• Support for the
ConvertDotsInJsonKeysToUnderscores property.
When set to TRUE, it allows the SerDe to replace the dots
in key names with underscores. For example, if the JSON
dataset contains a key with the name "a.b", you can use
this property to define the column name to be "a_b" in
Athena. The default is FALSE. By default, Athena does
not allow dots in column names.
• Support for the case.insensitive property. By
default, Athena requires that all keys in your JSON
dataset use lowercase. Using WITH SERDE PROPERTIES
("case.insensitive"= FALSE;) allows you to
use case-sensitive key names in your data. The default
is TRUE. When set to TRUE, the SerDe converts all
uppercase columns to lowercase.
529
Amazon Athena User Guide
Added support for Use workgroups to separate users, teams, applications, or February 18,
workgroups. workloads, and to set limits on amount of data each query 2019
or the entire workgroup can process. Because workgroups
act as IAM resources, you can use resource-level permissions
to control access to a specific workgroup. You can also
view query-related metrics in Amazon CloudWatch, control
query costs by configuring limits on the amount of data
scanned, create thresholds, and trigger actions, such as
Amazon SNS alarms, when these thresholds are breached.
For more information, see Using Workgroups for Running
Queries (p. 358) and Controlling Costs and Monitoring
Queries with CloudWatch Metrics and Events (p. 375).
Added support Added example Athena queries for analyzing logs from January 24, 2019
for analyzing logs Network Load Balancer. These logs receive detailed
from Network Load information about the Transport Layer Security (TLS)
Balancer. requests sent to the Network Load Balancer. You can
use these access logs to analyze traffic patterns and
troubleshoot issues. For information, see the section called
“Querying Network Load Balancer Logs” (p. 242).
Released the new With this release of the drivers, federated access to Athena November 10,
versions of the JDBC is supported for the Active Directory Federation Service (AD 2018
and ODBC driver FS 3.0). Access is established through the versions of JDBC
with support for or ODBC drivers that support SAML 2.0. For information
federated access to about configuring federated access to the Athena API, see
Athena API with the the section called “Enabling Federated Access to the Athena
AD FS and SAML 2.0 API” (p. 301).
(Security Assertion
Markup Language
2.0).
Added support for fine-grained access control to databases and tables in Athena, and added policies that allow you to encrypt database and table metadata in the Data Catalog (October 15, 2018)
Added support for creating identity-based (IAM) policies that provide fine-grained access control to resources in the AWS Glue Data Catalog, such as databases and tables used in Athena. Additionally, you can encrypt database and table metadata in the Data Catalog by adding specific policies to Athena. For details, see Fine-Grained Access to Databases and Tables in the AWS Glue Data Catalog (p. 275).
Added support for CREATE TABLE AS SELECT statements (October 10, 2018)
Added support for CREATE TABLE AS SELECT statements. See Creating a Table from Query Results (p. 136), Considerations and Limitations (p. 136), and Examples (p. 142). Made other improvements in the documentation.
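As a brief, hypothetical illustration of the CTAS syntax (the source table, columns, and S3 output location below are placeholders, not part of this guide's examples):

CREATE TABLE orders_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://DOC-EXAMPLE-BUCKET/orders-parquet/'
) AS
SELECT orderid, orderdate, totalprice
FROM orders
WHERE orderdate >= DATE '2018-01-01';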
Released the ODBC driver version 1.0.3 with support for streaming results instead of fetching them in pages (September 6, 2018)
The ODBC driver version 1.0.3 supports streaming results and also includes improvements, bug fixes, and updated documentation for "Using SSL with a Proxy Server". To download the ODBC driver version 1.0.3 and its documentation, see Connecting to Amazon Athena with ODBC (p. 85). Made other improvements in the documentation.
Released the JDBC driver version 2.0.5 with default support for streaming results instead of fetching them in pages (August 16, 2018)
Released the JDBC driver version 2.0.5 with default support for streaming results instead of fetching them in pages. For information, see Using Athena with the JDBC Driver (p. 83). Made other improvements in the documentation.
Updated the documentation for querying Amazon VPC flow logs and updated examples for querying ALB logs (August 7, 2018)
Updated the documentation for querying Amazon Virtual Private Cloud flow logs, which can be stored directly in Amazon S3 in GZIP format. For information, see Querying Amazon VPC Flow Logs (p. 244). Updated examples for querying ALB logs. For information, see Querying Application Load Balancer Logs (p. 224).
Added support for views and added guidelines for handling schema updates for various data storage formats (June 5, 2018)
Added support for views. For information, see Working with Views (p. 131). Updated this guide with guidance on handling schema updates for various data storage formats. For information, see Handling Schema Updates (p. 154).
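For illustration, a minimal view definition, assuming a hypothetical orders table:

CREATE VIEW recent_orders AS
SELECT orderid, orderdate, totalprice
FROM orders
WHERE orderdate >= DATE '2018-01-01';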
Increased default query concurrency limits from five to twenty (May 17, 2018)
You can submit and run up to twenty DDL queries and twenty SELECT queries at a time. For information, see Service Quotas (p. 499).
Added query tabs and the ability to configure auto-complete in the Query Editor (May 8, 2018)
Added query tabs, and the ability to configure auto-complete in the Query Editor. For information, see Using the Console (p. 15).
Released the JDBC driver version 2.0.2 (April 19, 2018)
Released the new version of the JDBC driver (version 2.0.2). For information, see Using Athena with the JDBC Driver (p. 83).
Added auto-complete for typing queries in the Athena console (April 6, 2018)
Added auto-complete for typing queries in the Athena console.
Added the ability to create Athena tables for CloudTrail log files directly from the CloudTrail console (March 15, 2018)
Added the ability to automatically create Athena tables for CloudTrail log files directly from the CloudTrail console. For information, see Using the CloudTrail Console to Create an Athena Table for CloudTrail Logs (p. 230).
Added support for securely offloading intermediate data to disk for queries with GROUP BY (February 2, 2018)
Added the ability to securely offload intermediate data to disk for memory-intensive queries that use the GROUP BY clause. This improves the reliability of such queries, preventing "Query resource exhausted" errors. For more information, see the release note for February 2, 2018 (p. 518).
Added support for Presto version 0.172 (January 19, 2018)
Upgraded the underlying engine in Amazon Athena to a version based on Presto version 0.172. For more information, see the release note for January 19, 2018 (p. 518).
Added support for the ODBC driver (November 13, 2017)
Added support for connecting Athena to the ODBC driver. For information, see Connecting to Amazon Athena with ODBC.
Added support for querying geospatial data, and for the Asia Pacific (Seoul), Asia Pacific (Mumbai), and Europe (London) regions (November 1, 2017)
Added support for querying geospatial data, and for the Asia Pacific (Seoul), Asia Pacific (Mumbai), and Europe (London) regions. For information, see Querying Geospatial Data and AWS Regions and Endpoints.
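For illustration only, a minimal geospatial query that tests whether a point falls inside a polygon; the coordinates are arbitrary:

SELECT ST_CONTAINS(
  ST_POLYGON('polygon ((1 1, 1 4, 4 4, 4 1))'),  -- a simple square
  ST_POINT(2, 3)                                 -- a point expected to lie inside it
);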
Added support for Europe (Frankfurt) (October 19, 2017)
Added support for the Europe (Frankfurt) region. For a list of supported regions, see AWS Regions and Endpoints.
Added support for named Athena queries with AWS CloudFormation (October 3, 2017)
Added support for creating named Athena queries with AWS CloudFormation. For more information, see AWS::Athena::NamedQuery in the AWS CloudFormation User Guide.
Added support for Asia Pacific (Sydney) (September 25, 2017)
Added support for the Asia Pacific (Sydney) region. For a list of supported regions, see AWS Regions and Endpoints.
Added a section to this guide for querying AWS service logs and different types of data, including maps, arrays, nested data, and data containing JSON (September 5, 2017)
Added examples for Querying AWS Service Logs (p. 224) and for querying different types of data in Athena. For information, see Running SQL Queries Using Amazon Athena (p. 122).
Added support for the AWS Glue Data Catalog (August 14, 2017)
Added integration with the AWS Glue Data Catalog and a migration wizard for updating from the Athena managed data catalog to the AWS Glue Data Catalog. For more information, see Integration with AWS Glue and AWS Glue.
Added support for the Grok SerDe (August 4, 2017)
Added support for the Grok SerDe, which provides easier pattern matching for records in unstructured text files such as logs. For more information, see Grok SerDe. Added keyboard shortcuts to scroll through query history using the console (CTRL + up/down arrow keys on Windows, CMD + up/down arrow keys on Mac).
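For illustration, a minimal table definition that uses the Grok SerDe; the log layout, columns, and S3 location are hypothetical, and the input.format property holds the Grok pattern matched against each record:

CREATE EXTERNAL TABLE example_access_log (
  client_ip string,
  message string
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format' = '%{IP:client_ip} %{GREEDYDATA:message}'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://DOC-EXAMPLE-BUCKET/grok-logs/';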
Added support for Asia Pacific (Tokyo) and Asia Pacific (Singapore) (June 22, 2017)
Added support for the Asia Pacific (Tokyo) and Asia Pacific (Singapore) regions. For a list of supported regions, see AWS Regions and Endpoints.
Added support for Europe (Ireland) (June 8, 2017)
Added support for the Europe (Ireland) region. For more information, see AWS Regions and Endpoints.
Added the Amazon Athena API and AWS CLI support for Athena (May 19, 2017)
Added the Amazon Athena API and AWS CLI support for Athena. Updated the JDBC driver to version 1.1.0.
Added support for Amazon S3 data encryption (April 4, 2017)
Added support for Amazon S3 data encryption and released a JDBC driver update (version 1.0.1) with encryption support, improvements, and bug fixes. For more information, see Encryption at Rest (p. 264).
Added the AWS CloudTrail SerDe (March 24, 2017)
Added the AWS CloudTrail SerDe, improved performance, and fixed partition issues. For more information, see CloudTrail SerDe (p. 413).
Added support for US East (Ohio) (February 20, 2017)
Added support for the Avro SerDe (p. 410) and OpenCSVSerDe for Processing CSV (p. 415), the US East (Ohio) region, and bulk editing columns in the console wizard. Improved performance on large Parquet tables.
The initial release of the Amazon Athena User Guide (November 2016)
AWS glossary
For the latest AWS terminology, see the AWS glossary in the AWS General Reference.